[jira] [Commented] (SPARK-48105) Fix the data corruption issue when state store unload and snapshotting happens concurrently for HDFS state store
[ https://issues.apache.org/jira/browse/SPARK-48105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843123#comment-17843123 ] Anish Shrigondekar commented on SPARK-48105: Thanks [~huanli.wang]. Worth noting that we believe this also fixes the stream-stream join null pointer issue - https://issues.apache.org/jira/browse/SPARK-31754 - which was effectively happening due to the same state data loss/corruption scenarios explained above. cc - [~kabhwan] > Fix the data corruption issue when state store unload and snapshotting > happens concurrently for HDFS state store > - > > Key: SPARK-48105 > URL: https://issues.apache.org/jira/browse/SPARK-48105 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Huanli Wang >Priority: Major > Labels: pull-request-available > > There are two race conditions between state store snapshotting and state > store unloading which could result in query failure and potential data > corruption. 
> > Case 1:
> # The maintenance thread pool encounters some issue and calls [stopMaintenanceTask|https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L774]; this function further calls [threadPool.stop|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L587]. However, it doesn't wait for the stop operation to complete and moves on to the state store [unload and clear|https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L775-L778].
> # The provider unload will [close the state store|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L719-L721], which [clears the values of loadedMaps|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L353-L355] for the HDFS-backed state store.
> # If the not-yet-stopped maintenance thread is still running and tries to take the snapshot, the data in the underlying `HDFSBackedStateStoreMap` has already been removed. If this snapshot process completes successfully, we will write corrupted data and the following batches will consume this corrupted data.
> Case 2:
> # In executor_1, the maintenance thread is about to snapshot state_store_1: it retrieves the `HDFSBackedStateStoreMap` object from the loadedMaps and then [releases the lock of the loadedMaps|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L750-L751].
> # state_store_1 is loaded on another executor, e.g. executor_2.
> # Another state store, state_store_2, is loaded on executor_1 and calls [reportActiveStoreInstance|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L854-L871] to notify the driver.
> # executor_1 performs the [unload|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L713] for the no-longer-active state stores, which clears the data entries in the `HDFSBackedStateStoreMap`.
> # The snapshotting thread terminates and uploads an incomplete snapshot to the cloud because the [iterator doesn't have a next element|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L634] after the clear.
> # Future batches then consume the corrupted data.
> 
> Proposed fix:
> * When we close the HDFS state store, we should only remove the entry from `loadedMaps` rather than actively cleaning up the data; the JVM GC can reclaim those objects once they are unreferenced.
> * We should wait for the maintenance thread to stop before unloading the providers.
> 
> Thanks [~anishshri-db] for helping debug this issue! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
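Since the proposed fix is the core of this thread, the two changes can be sketched in plain `java.util.concurrent` terms. This is an illustrative sketch only: `StateStoreFixSketch`, its `loadedMaps`, `stopMaintenanceTask`, and `closeStore` are hypothetical stand-ins named after the report, not Spark's actual classes or signatures.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative stand-ins for the provider's version -> state map registry
// and the shared maintenance thread pool described in the report.
class StateStoreFixSketch {
    static final Map<Long, Map<String, String>> loadedMaps = new ConcurrentHashMap<>();
    static final ExecutorService maintenancePool = Executors.newFixedThreadPool(1);

    // Proposed fix (second bullet): block until in-flight snapshot tasks
    // drain, instead of returning right after shutdown() and racing with
    // the provider unload that follows.
    static void stopMaintenanceTask() throws InterruptedException {
        maintenancePool.shutdown();
        maintenancePool.awaitTermination(30, TimeUnit.SECONDS);
    }

    // Proposed fix (first bullet): on close, drop only the reference from
    // loadedMaps. A snapshot thread that already holds the map keeps a
    // consistent view, and the orphaned map is reclaimed by the GC,
    // instead of being clear()-ed out from under a live iterator.
    static void closeStore(long version) {
        loadedMaps.remove(version); // not: loadedMaps.get(version).clear()
    }
}
```

The behavioral point is that `remove` unlinks the entry atomically, so a maintenance thread that fetched the map earlier keeps reading intact data, whereas `clear()` empties the map under the snapshot iterator and produces the truncated snapshot described in Case 2.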
[jira] [Updated] (SPARK-48103) Promote `KubernetesDriverBuilder` to `DeveloperApi`
[ https://issues.apache.org/jira/browse/SPARK-48103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48103: --- Labels: pull-request-available (was: ) > Promote `KubernetesDriverBuilder` to `DeveloperApi` > > > Key: SPARK-48103 > URL: https://issues.apache.org/jira/browse/SPARK-48103 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-48105) Fix the data corruption issue when state store unload and snapshotting happens concurrently for HDFS state store
[ https://issues.apache.org/jira/browse/SPARK-48105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48105: --- Labels: pull-request-available (was: ) > Fix the data corruption issue when state store unload and snapshotting > happens concurrently for HDFS state store > - > > Key: SPARK-48105 > URL: https://issues.apache.org/jira/browse/SPARK-48105 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Huanli Wang >Priority: Major > Labels: pull-request-available >
[jira] [Resolved] (SPARK-48108) Skip `tpcds-1g` and `docker-integration-tests` tests from `RocksDB UI-Backend` job
[ https://issues.apache.org/jira/browse/SPARK-48108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48108. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46355 [https://github.com/apache/spark/pull/46355] > Skip `tpcds-1g` and `docker-integration-tests` tests from `RocksDB > UI-Backend` job > -- > > Key: SPARK-48108 > URL: https://issues.apache.org/jira/browse/SPARK-48108 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Assigned] (SPARK-48109) Enable `k8s-integration-tests` only for `kubernetes` module change
[ https://issues.apache.org/jira/browse/SPARK-48109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48109: - Assignee: Dongjoon Hyun > Enable `k8s-integration-tests` only for `kubernetes` module change > -- > > Key: SPARK-48109 > URL: https://issues.apache.org/jira/browse/SPARK-48109 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Although there is a chance of missing the related core module change, daily > CI test coverage will reveal that.
[jira] [Updated] (SPARK-48109) Enable `k8s-integration-tests` only for `kubernetes` module change
[ https://issues.apache.org/jira/browse/SPARK-48109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48109: --- Labels: pull-request-available (was: ) > Enable `k8s-integration-tests` only for `kubernetes` module change > -- > > Key: SPARK-48109 > URL: https://issues.apache.org/jira/browse/SPARK-48109 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-48109) Enable `k8s-integration-tests` only for `kubernetes` module change
Dongjoon Hyun created SPARK-48109: - Summary: Enable `k8s-integration-tests` only for `kubernetes` module change Key: SPARK-48109 URL: https://issues.apache.org/jira/browse/SPARK-48109 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun Although there is a chance of missing the related core module change, daily CI test coverage will reveal that.
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Description: h2. ASF INFRA POLICY - https://infra.apache.org/github-actions-policy.html h2. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] !Screenshot 2024-05-02 at 20.59.18.png|width=100%! h2. TARGET * All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. * All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. * The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). * The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). h2. DEADLINE bq. 17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. was: h2. ASF INFRA POLICY - https://infra.apache.org/github-actions-policy.html h2. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] !Screenshot 2024-05-02 at 20.59.18.png! h2. TARGET * All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. * All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. * The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). * The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). h2. 
DEADLINE bq. 17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Umbrella > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 20.59.18.png > > > h2. ASF INFRA POLICY > - https://infra.apache.org/github-actions-policy.html > h2. MONITORING > [https://infra-reports.apache.org/#ghactions=spark=168] > !Screenshot 2024-05-02 at 20.59.18.png|width=100%! > h2. TARGET > * All workflows MUST have a job concurrency level less than or equal to 20. > This means a workflow cannot have more than 20 jobs running at the same time > across all matrices. > * All workflows SHOULD have a job concurrency level less than or equal to 15. > Just because 20 is the max, doesn't mean you should strive for 20. > * The average number of minutes a project uses per calendar week MUST NOT > exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 > hours). > * The average number of minutes a project uses in any consecutive five-day > period MUST NOT exceed the equivalent of 30 full-time runners (216,000 > minutes, or 3,600 hours). > h2. DEADLINE > bq. 17th of May, 2024 > Since the deadline is 17th of May, 2024, I set this as the highest priority, > `Blocker`.
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Description: h2. ASF INFRA POLICY - https://infra.apache.org/github-actions-policy.html h2. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] !Screenshot 2024-05-02 at 20.59.18.png! h2. TARGET * All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. * All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. * The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). * The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). h2. DEADLINE bq. 17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. was: h2. ASF INFRA POLICY - https://infra.apache.org/github-actions-policy.html h2. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h2. TARGET * All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. * All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. * The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). * The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). h2. DEADLINE bq. 
17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Umbrella > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 20.59.18.png > >
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Attachment: (was: Screenshot 2024-05-02 at 13.18.42.png) > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Umbrella > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 20.59.18.png > >
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Description: h2. ASF INFRA POLICY - https://infra.apache.org/github-actions-policy.html h2. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h2. TARGET * All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. * All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. * The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). * The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). h2. DEADLINE bq. 17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. was: h2. ASF INFRA POLICY - https://infra.apache.org/github-actions-policy.html h2. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h2. TARGET * All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. * All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. * The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). * The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). h2. DEADLINE bq. 
17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100%! > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Umbrella > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 20.59.18.png > >
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Attachment: Screenshot 2024-05-02 at 20.59.18.png > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Umbrella > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 20.59.18.png > >
[jira] [Updated] (SPARK-48107) Exclude tests from Python distribution
[ https://issues.apache.org/jira/browse/SPARK-48107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48107: --- Labels: pull-request-available (was: ) > Exclude tests from Python distribution > -- > > Key: SPARK-48107 > URL: https://issues.apache.org/jira/browse/SPARK-48107 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available >
[jira] [Created] (SPARK-48107) Exclude tests from Python distribution
Nicholas Chammas created SPARK-48107: Summary: Exclude tests from Python distribution Key: SPARK-48107 URL: https://issues.apache.org/jira/browse/SPARK-48107 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Assigned] (SPARK-48106) Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`
[ https://issues.apache.org/jira/browse/SPARK-48106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48106: - Assignee: Dongjoon Hyun > Use `Python 3.11` in `pyspark` tests of `build_and_test.yml` > > > Key: SPARK-48106 > URL: https://issues.apache.org/jira/browse/SPARK-48106 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > - https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights > bq. Python 3.11 is between 10-60% faster than Python 3.10. On average, we > measured a 1.25x speedup on the standard benchmark suite.
[jira] [Updated] (SPARK-48106) Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`
[ https://issues.apache.org/jira/browse/SPARK-48106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48106: -- Description: - https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights bq. Python 3.11 is between 10-60% faster than Python 3.10. On average, we measured a 1.25x speedup on the standard benchmark suite. was: - https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights > Python 3.11 is between 10-60% faster than Python 3.10. On average, we > measured a 1.25x speedup on the standard benchmark suite. > Use `Python 3.11` in `pyspark` tests of `build_and_test.yml` > > > Key: SPARK-48106 > URL: https://issues.apache.org/jira/browse/SPARK-48106 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-48106) Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`
[ https://issues.apache.org/jira/browse/SPARK-48106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48106: -- Description: - https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights > Python 3.11 is between 10-60% faster than Python 3.10. On average, we > measured a 1.25x speedup on the standard benchmark suite. > Use `Python 3.11` in `pyspark` tests of `build_and_test.yml` > > > Key: SPARK-48106 > URL: https://issues.apache.org/jira/browse/SPARK-48106 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-48106) Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`
[ https://issues.apache.org/jira/browse/SPARK-48106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48106: --- Labels: pull-request-available (was: ) > Use `Python 3.11` in `pyspark` tests of `build_and_test.yml` > > > Key: SPARK-48106 > URL: https://issues.apache.org/jira/browse/SPARK-48106 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-48106) Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`
Dongjoon Hyun created SPARK-48106: - Summary: Use `Python 3.11` in `pyspark` tests of `build_and_test.yml` Key: SPARK-48106 URL: https://issues.apache.org/jira/browse/SPARK-48106 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Assigned] (SPARK-48104) Run `publish_snapshot.yml` once per day
[ https://issues.apache.org/jira/browse/SPARK-48104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48104: - Assignee: Dongjoon Hyun > Run `publish_snapshot.yml` once per day > --- > > Key: SPARK-48104 > URL: https://issues.apache.org/jira/browse/SPARK-48104 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Resolved] (SPARK-48104) Run `publish_snapshot.yml` once per day
[ https://issues.apache.org/jira/browse/SPARK-48104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48104. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46352 [https://github.com/apache/spark/pull/46352] > Run `publish_snapshot.yml` once per day > --- > > Key: SPARK-48104 > URL: https://issues.apache.org/jira/browse/SPARK-48104 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Created] (SPARK-48105) Fix the data corruption issue when state store unload and snapshotting happens concurrently for HDFS state store
Huanli Wang created SPARK-48105: --- Summary: Fix the data corruption issue when state store unload and snapshotting happens concurrently for HDFS state store Key: SPARK-48105 URL: https://issues.apache.org/jira/browse/SPARK-48105 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Huanli Wang There are two race conditions between state store snapshotting and state store unloading that can result in query failure and potential data corruption. Case 1: # the maintenance thread pool encounters an issue and calls [stopMaintenanceTask,|https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L774] which in turn calls [threadPool.stop.|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L587] However, this function does not wait for the stop operation to complete before moving on to the state store [unload and clear.|https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L775-L778] # the provider unload [closes the state store|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L719-L721], which [clears the values of loadedMaps|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L353-L355] for the HDFS-backed state store. # the not-yet-stopped maintenance thread may still be running and attempting the snapshot, even though the data in the underlying `HDFSBackedStateStoreMap` has already been removed. If this snapshot completes successfully, it writes corrupted data, and the following batches consume that corrupted data. 
Case 2: # In executor_1, the maintenance thread is about to snapshot state_store_1: it retrieves the `HDFSBackedStateStoreMap` object from loadedMaps and then [releases the lock of the loadedMaps|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L750-L751]. # state_store_1 is loaded on another executor, e.g. executor_2. # another state store, state_store_2, is loaded on executor_1 and [reportActiveStoreInstance|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L854-L871] to the driver. # executor_1 [unloads|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L713] the state stores that are no longer active, which clears the data entries in the `HDFSBackedStateStoreMap`. # the snapshotting thread terminates and uploads an incomplete snapshot to cloud storage because the [iterator has no next element|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L634] after the clear. # future batches consume the corrupted data. Proposed fix: * When closing the HDFS state store, only remove the entry from `loadedMaps` rather than actively cleaning up the data; JVM GC can collect those objects. * Wait for the maintenance thread to stop before unloading the providers. Thanks [~anishshri-db] for helping debug this issue!
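The second bullet of the proposed fix above (drain the maintenance pool before unloading) can be illustrated with a minimal Python sketch. This is not Spark's actual Scala code; the `loaded_maps` dict and `snapshot` function are simplified stand-ins for `loadedMaps` and the maintenance task, used only to show why `shutdown(wait=True)` before clearing avoids persisting an empty snapshot.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Simplified stand-ins for Spark's loadedMaps and the maintenance thread pool.
loaded_maps = {"state_store_1": {"k": "v"}}
lock = threading.Lock()

def snapshot(store_id):
    # Snapshotting reads the in-memory map; if unload cleared it first,
    # we would upload an empty (i.e. corrupted) snapshot.
    with lock:
        return dict(loaded_maps.get(store_id, {}))

pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(snapshot, "state_store_1")

# The fix: wait for in-flight maintenance work to finish
# *before* unloading/clearing the provider's state.
pool.shutdown(wait=True)

with lock:
    loaded_maps.pop("state_store_1", None)  # unload only after the pool drained

assert future.result() == {"k": "v"}  # the snapshot saw the data, not an empty map
```

Without the `shutdown(wait=True)` barrier, the pop can interleave with the read and the snapshot observes `{}`, which mirrors the incomplete-snapshot upload described in step 5 of Case 2.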
[jira] [Updated] (SPARK-48104) Run `publish_snapshot.yml` once per day
[ https://issues.apache.org/jira/browse/SPARK-48104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48104: --- Labels: pull-request-available (was: ) > Run `publish_snapshot.yml` once per day > --- > > Key: SPARK-48104 > URL: https://issues.apache.org/jira/browse/SPARK-48104 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-48104) Run `publish_snapshot.yml` once per day
Dongjoon Hyun created SPARK-48104: - Summary: Run `publish_snapshot.yml` once per day Key: SPARK-48104 URL: https://issues.apache.org/jira/browse/SPARK-48104 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-47671) Enable structured logging in log4j2.properties.template and update `configuration.md`
[ https://issues.apache.org/jira/browse/SPARK-47671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47671. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46349 [https://github.com/apache/spark/pull/46349] > Enable structured logging in log4j2.properties.template and update > `configuration.md` > - > > Key: SPARK-47671 > URL: https://issues.apache.org/jira/browse/SPARK-47671 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > # rename the current log4j2.properties.template as > log4j2.properties.pattern-layout-template > # Enable structured logging in log4j2.properties.template > # Update `configuration.md` on how to configure logging
[jira] [Resolved] (SPARK-48102) Track time to acquire source progress metrics for streaming triggers
[ https://issues.apache.org/jira/browse/SPARK-48102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-48102. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46350 [https://github.com/apache/spark/pull/46350] > Track time to acquire source progress metrics for streaming triggers > > > Key: SPARK-48102 > URL: https://issues.apache.org/jira/browse/SPARK-48102 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Track time to acquire source progress metrics for streaming triggers
[jira] [Assigned] (SPARK-48102) Track time to acquire source progress metrics for streaming triggers
[ https://issues.apache.org/jira/browse/SPARK-48102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-48102: Assignee: Anish Shrigondekar > Track time to acquire source progress metrics for streaming triggers > > > Key: SPARK-48102 > URL: https://issues.apache.org/jira/browse/SPARK-48102 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Labels: pull-request-available > > Track time to acquire source progress metrics for streaming triggers
[jira] [Resolved] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
[ https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48099. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46347 [https://github.com/apache/spark/pull/46347] > Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)` > > > Key: SPARK-48099 > URL: https://issues.apache.org/jira/browse/SPARK-48099 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: Screenshot 2024-05-02 at 14.59.14.png > > > `Java 21 on MacOS 14` is the fastest Maven test and covers both the Java 17 and > Apple Silicon use cases. > !Screenshot 2024-05-02 at 14.59.14.png|width=100%!
[jira] [Updated] (SPARK-47920) Add documentation for python streaming data source
[ https://issues.apache.org/jira/browse/SPARK-47920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-47920: - Affects Version/s: 4.0.0 (was: 3.5.1) > Add documentation for python streaming data source > -- > > Key: SPARK-47920 > URL: https://issues.apache.org/jira/browse/SPARK-47920 > Project: Spark > Issue Type: New Feature > Components: PySpark, SS >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Priority: Major > Labels: pull-request-available > > Add documentation (user guide) for Python data source API. > The doc should explain how to develop and use DataSourceStreamReader and > DataSourceStreamWriter
[jira] [Created] (SPARK-48103) Promote `KubernetesDriverBuilder` to `DeveloperApi`
Zhou JIANG created SPARK-48103: -- Summary: Promote `KubernetesDriverBuilder` to `DeveloperApi` Key: SPARK-48103 URL: https://issues.apache.org/jira/browse/SPARK-48103 Project: Spark Issue Type: Sub-task Components: k8s Affects Versions: kubernetes-operator-0.1.0 Reporter: Zhou JIANG
[jira] [Updated] (SPARK-48102) Track time to acquire source progress metrics for streaming triggers
[ https://issues.apache.org/jira/browse/SPARK-48102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48102: --- Labels: pull-request-available (was: ) > Track time to acquire source progress metrics for streaming triggers > > > Key: SPARK-48102 > URL: https://issues.apache.org/jira/browse/SPARK-48102 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Anish Shrigondekar >Priority: Major > Labels: pull-request-available > > Track time to acquire source progress metrics for streaming triggers
[jira] [Created] (SPARK-48102) Track time to acquire source progress metrics for streaming triggers
Anish Shrigondekar created SPARK-48102: -- Summary: Track time to acquire source progress metrics for streaming triggers Key: SPARK-48102 URL: https://issues.apache.org/jira/browse/SPARK-48102 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Anish Shrigondekar Track time to acquire source progress metrics for streaming triggers
[jira] [Resolved] (SPARK-48065) SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict
[ https://issues.apache.org/jira/browse/SPARK-48065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-48065. -- Resolution: Fixed Issue resolved by pull request 46325 [https://github.com/apache/spark/pull/46325] > SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict > - > > Key: SPARK-48065 > URL: https://issues.apache.org/jira/browse/SPARK-48065 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.3 >Reporter: Szehon Ho >Assignee: Szehon Ho >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > If spark.sql.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled is true, > then SPJ no longer triggers if there are more join keys than partition keys. > It is triggered only if the number of join keys is equal to or less than the > number of partition keys. > We can relax this constraint, as this case was supported when the flag is not > enabled.
[jira] [Assigned] (SPARK-48065) SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict
[ https://issues.apache.org/jira/browse/SPARK-48065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-48065: Assignee: Szehon Ho > SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict > - > > Key: SPARK-48065 > URL: https://issues.apache.org/jira/browse/SPARK-48065 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.3 >Reporter: Szehon Ho >Assignee: Szehon Ho >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > If spark.sql.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled is true, > then SPJ no longer triggers if there are more join keys than partition keys. > It is triggered only if the number of join keys is equal to or less than the > number of partition keys. > We can relax this constraint, as this case was supported when the flag is not > enabled.
[jira] [Updated] (SPARK-47671) Enable structured logging in log4j2.properties.template and update `configuration.md`
[ https://issues.apache.org/jira/browse/SPARK-47671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47671: --- Labels: pull-request-available (was: ) > Enable structured logging in log4j2.properties.template and update > `configuration.md` > - > > Key: SPARK-47671 > URL: https://issues.apache.org/jira/browse/SPARK-47671 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Priority: Major > Labels: pull-request-available > > # rename the current log4j2.properties.template as > log4j2.properties.pattern-layout-template > # Enable structured logging in log4j2.properties.template > # Update `configuration.md` on how to configure logging
[jira] [Updated] (SPARK-47671) Enable structured logging in log4j2.properties.template and update `configuration.md`
[ https://issues.apache.org/jira/browse/SPARK-47671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-47671: --- Description: # rename the current log4j2.properties.template as log4j2.properties.pattern-layout-template # Enable structured logging in log4j2.properties.template # Update `configuration.md` on how to configure logging was: # rename the current log4j2.properties.template as log4j2-pattern-layout.properties.template # Enable structured logging in log4j2.properties.template # Update `configuration.md` on how to configure logging > Enable structured logging in log4j2.properties.template and update > `configuration.md` > - > > Key: SPARK-47671 > URL: https://issues.apache.org/jira/browse/SPARK-47671 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Priority: Major > > # rename the current log4j2.properties.template as > log4j2.properties.pattern-layout-template > # Enable structured logging in log4j2.properties.template > # Update `configuration.md` on how to configure logging
[jira] [Updated] (SPARK-48101) When using INSERT OVERWRITE with Spark CTEs they may not be fully resolved
[ https://issues.apache.org/jira/browse/SPARK-48101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-48101: - Priority: Minor (was: Major) > When using INSERT OVERWRITE with Spark CTEs they may not be fully resolved > -- > > Key: SPARK-48101 > URL: https://issues.apache.org/jira/browse/SPARK-48101 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.1 >Reporter: Holden Karau >Priority: Minor > > Repro: > ```sql > DROP TABLE IF EXISTS local.cte1; > DROP TABLE IF EXISTS local.cte2; > DROP TABLE IF EXISTS local.cte3; > CREATE TABLE local.cte1 (id INT, fname STRING); > CREATE TABLE local.cte2 (id2 INT); > CREATE TABLE local.cte3 (id INT); > WITH test_fake AS (SELECT * FROM local.cte1 WHERE id = 1 AND id2 = 1), > test_fake2 AS (SELECT * FROM local.cte2 WHERE id2 = 1) INSERT OVERWRITE TABLE > local.cte3 SELECT id2 as id FROM test_fake2; > WITH test_fake AS (SELECT * FROM local.cte1 WHERE id = 1 AND id2 = 1), > test_fake2 AS (SELECT * FROM local.cte2 WHERE id2 = 1) SELECT id2 as id FROM > test_fake2; > ``` > > Here we would expect both of the last two SQL expressions to fail, but > instead only the first one does. > > There are more complicated cases, and in those cases, the invalid CTE is > treated as a null table, but this is the simplest repro I've been able to > come up with so far. > > This occurs with both the local Iceberg catalog and the SparkSession catalog.
[jira] [Assigned] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
[ https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48099: - Assignee: Dongjoon Hyun > Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)` > > > Key: SPARK-48099 > URL: https://issues.apache.org/jira/browse/SPARK-48099 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Attachments: Screenshot 2024-05-02 at 14.59.14.png > > > `Java 21 on MacOS 14` is the fastest Maven test and covers both the Java 17 and > Apple Silicon use cases. > !Screenshot 2024-05-02 at 14.59.14.png|width=100%!
[jira] [Created] (SPARK-48101) When using INSERT OVERWRITE with Spark CTEs they may not be fully resolved
Holden Karau created SPARK-48101: Summary: When using INSERT OVERWRITE with Spark CTEs they may not be fully resolved Key: SPARK-48101 URL: https://issues.apache.org/jira/browse/SPARK-48101 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1, 3.4.0, 3.3.0 Reporter: Holden Karau Repro: ```sql DROP TABLE IF EXISTS local.cte1; DROP TABLE IF EXISTS local.cte2; DROP TABLE IF EXISTS local.cte3; CREATE TABLE local.cte1 (id INT, fname STRING); CREATE TABLE local.cte2 (id2 INT); CREATE TABLE local.cte3 (id INT); WITH test_fake AS (SELECT * FROM local.cte1 WHERE id = 1 AND id2 = 1), test_fake2 AS (SELECT * FROM local.cte2 WHERE id2 = 1) INSERT OVERWRITE TABLE local.cte3 SELECT id2 as id FROM test_fake2; WITH test_fake AS (SELECT * FROM local.cte1 WHERE id = 1 AND id2 = 1), test_fake2 AS (SELECT * FROM local.cte2 WHERE id2 = 1) SELECT id2 as id FROM test_fake2; ``` Here we would expect both of the last two SQL expressions to fail, but instead only the first one does. There are more complicated cases, and in those cases, the invalid CTE is treated as a null table, but this is the simplest repro I've been able to come up with so far. This occurs with both the local Iceberg catalog and the SparkSession catalog.
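The behavior the report expects — a CTE referencing a nonexistent column should make the whole statement fail at analysis time — can be demonstrated outside Spark. The sketch below uses Python's `sqlite3` (not Spark SQL) purely as an illustration, with simplified single-part table names; it shows a referenced CTE whose predicate names a column missing from the source table being rejected before any row is read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cte1 (id INTEGER, fname TEXT)")
conn.execute("CREATE TABLE cte2 (id2 INTEGER)")

# cte1 has no column id2, mirroring the report's test_fake CTE.
bad_cte = """
WITH test_fake AS (SELECT * FROM cte1 WHERE id = 1 AND id2 = 1)
SELECT * FROM test_fake
"""
try:
    conn.execute(bad_cte)
    raised = False
except sqlite3.OperationalError:
    raised = True  # name resolution rejects the unknown column at prepare time

assert raised
```

The bug report is that Spark's `INSERT OVERWRITE` path skips this resolution step for CTEs that the final `SELECT` does not reference, while the plain `SELECT` form performs it.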
[jira] [Updated] (SPARK-48100) [SQL][XML] Fix issues in skipping nested structure fields not selected in schema
[ https://issues.apache.org/jira/browse/SPARK-48100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shujing Yang updated SPARK-48100: - Description: Previously, the XML parser could not effectively skip nested structure data fields when they were not selected in the schema. For instance, in the below example, `df.select("struct2").collect()` returns `Seq(null)` as `struct1` wasn't effectively skipped. This PR fixes this issue. {code:java} 1 2 {code} was: Previously, the XML parser can't skip nested structure data fields when they were not selected in the schema. For instance, in the below example, `df.select("struct2").collect()` returns `Seq(null)` as `struct1` wasn't effectively skipped. This PR fixes this issue. {code:java} 1 2 {code} > [SQL][XML] Fix issues in skipping nested structure fields not selected in > schema > > > Key: SPARK-48100 > URL: https://issues.apache.org/jira/browse/SPARK-48100 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Shujing Yang >Priority: Major > Labels: pull-request-available > > Previously, the XML parser could not effectively skip nested structure data > fields when they were not selected in the schema. For instance, in the > below example, `df.select("struct2").collect()` returns `Seq(null)` as > `struct1` wasn't effectively skipped. This PR fixes this issue. > {code:java} > > > 1 > > > 2 > > {code} >
[jira] [Updated] (SPARK-48100) [SQL][XML] Fix issues in skipping nested structure fields not selected in schema
[ https://issues.apache.org/jira/browse/SPARK-48100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48100: --- Labels: pull-request-available (was: ) > [SQL][XML] Fix issues in skipping nested structure fields not selected in > schema > > > Key: SPARK-48100 > URL: https://issues.apache.org/jira/browse/SPARK-48100 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Shujing Yang >Priority: Major > Labels: pull-request-available > > Previously, the XML parser can't skip nested structure data fields when they > were not selected in the schema. For instance, in the below example, > `df.select("struct2").collect()` returns `Seq(null)` as `struct1` wasn't > effectively skipped. This PR fixes this issue. > {code:java} > > > 1 > > > 2 > > {code} >
[jira] [Updated] (SPARK-48100) [SQL][XML] Fix issues in skipping nested structure fields not selected in schema
[ https://issues.apache.org/jira/browse/SPARK-48100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shujing Yang updated SPARK-48100: - Summary: [SQL][XML] Fix issues in skipping nested structure fields not selected in schema (was: [SQL][XML] Fix projection issue when there's a nested struct) > [SQL][XML] Fix issues in skipping nested structure fields not selected in > schema > > > Key: SPARK-48100 > URL: https://issues.apache.org/jira/browse/SPARK-48100 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Shujing Yang >Priority: Major > > Previously, the XML parser can't skip nested structure data fields when they > were not selected in the schema. For instance, in the below example, > `df.select("struct2").collect()` returns `Seq(null)` as `struct1` wasn't > effectively skipped. This PR fixes this issue. > {code:java} > > > 1 > > > 2 > > {code} >
[jira] [Created] (SPARK-48100) [SQL][XML] Fix projection issue when there's a nested struct
Shujing Yang created SPARK-48100: Summary: [SQL][XML] Fix projection issue when there's a nested struct Key: SPARK-48100 URL: https://issues.apache.org/jira/browse/SPARK-48100 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Shujing Yang Previously, the XML parser could not effectively skip nested structure data fields when they were not selected in the schema. For instance, in the below example, `df.select("struct2").collect()` returns `Seq(null)` as `struct1` wasn't effectively skipped. This PR fixes this issue. {code:java} 1 2 {code}
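The "skip unselected nested fields" behavior described above can be sketched in plain Python. The element names (`row`, `struct1`, `struct2`, `a`, `b`) are hypothetical — the original XML snippet's tags were lost in the mail archive — and `xml.etree.ElementTree` stands in for Spark's StAX-based XML parser; the point is only that an unselected subtree is stepped over entirely rather than half-consumed into a null value.

```python
import xml.etree.ElementTree as ET

# Hypothetical record mirroring the report's struct1/struct2 example.
record = "<row><struct1><a>1</a></struct1><struct2><b>2</b></struct2></row>"

def select_fields(xml_record, wanted):
    """Parse one record, keeping only the top-level fields named in `wanted`
    and skipping every other nested subtree whole."""
    root = ET.fromstring(xml_record)
    out = {}
    for child in root:
        if child.tag not in wanted:
            continue  # skip the entire unselected subtree, e.g. struct1
        out[child.tag] = {grandchild.tag: grandchild.text for grandchild in child}
    return out

assert select_fields(record, {"struct2"}) == {"struct2": {"b": "2"}}
```

The reported bug was the analogue of this loop mis-positioning the parser after a skipped `struct1`, so the following `struct2` field came back as `null` instead of its actual value.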
[jira] [Assigned] (SPARK-48097) Limit GHA job execution time to up to 3 hours in `build_and_test.yml`
[ https://issues.apache.org/jira/browse/SPARK-48097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48097: - Assignee: Dongjoon Hyun > Limit GHA job execution time to up to 3 hours in `build_and_test.yml` > - > > Key: SPARK-48097 > URL: https://issues.apache.org/jira/browse/SPARK-48097 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Resolved] (SPARK-48097) Limit GHA job execution time to up to 3 hours in `build_and_test.yml`
[ https://issues.apache.org/jira/browse/SPARK-48097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48097. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46344 [https://github.com/apache/spark/pull/46344] > Limit GHA job execution time to up to 3 hours in `build_and_test.yml` > - > > Key: SPARK-48097 > URL: https://issues.apache.org/jira/browse/SPARK-48097 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
[ https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48099: --- Labels: pull-request-available (was: ) > Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)` > > > Key: SPARK-48099 > URL: https://issues.apache.org/jira/browse/SPARK-48099 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Attachments: Screenshot 2024-05-02 at 14.59.14.png > > > `Java 21 on MacOS 14` is the fastest Maven test and covers both the Java 17 and > Apple Silicon use cases. > !Screenshot 2024-05-02 at 14.59.14.png|width=100%!
[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
[ https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48099: -- Description: `Java 21 on MacOS 14` is the fastest Maven test and covers both Java 17 and Apple Silicon use cases. !Screenshot 2024-05-02 at 14.59.14.png|width=100%! was: !Screenshot 2024-05-02 at 14.59.14.png|width=100%! > Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)` > > > Key: SPARK-48099 > URL: https://issues.apache.org/jira/browse/SPARK-48099 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: Screenshot 2024-05-02 at 14.59.14.png > > > `Java 21 on MacOS 14` is the fastest Maven test and covers both Java 17 and > Apple Silicon use cases. > !Screenshot 2024-05-02 at 14.59.14.png|width=100%!
[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
[ https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48099: -- Description: !Screenshot 2024-05-02 at 14.59.14.png!width=100%! > Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)` > > > Key: SPARK-48099 > URL: https://issues.apache.org/jira/browse/SPARK-48099 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: Screenshot 2024-05-02 at 14.59.14.png > > > !Screenshot 2024-05-02 at 14.59.14.png!width=100%!
[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
[ https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48099: -- Description: !Screenshot 2024-05-02 at 14.59.14.png! (was: !Screenshot 2024-05-02 at 14.59.14.png!width=100%! ) > Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)` > > > Key: SPARK-48099 > URL: https://issues.apache.org/jira/browse/SPARK-48099 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: Screenshot 2024-05-02 at 14.59.14.png > > > !Screenshot 2024-05-02 at 14.59.14.png!
[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
[ https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48099: -- Description: !Screenshot 2024-05-02 at 14.59.14.png|width=100%! (was: !Screenshot 2024-05-02 at 14.59.14.png! ) > Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)` > > > Key: SPARK-48099 > URL: https://issues.apache.org/jira/browse/SPARK-48099 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: Screenshot 2024-05-02 at 14.59.14.png > > > !Screenshot 2024-05-02 at 14.59.14.png|width=100%!
[jira] [Created] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
Dongjoon Hyun created SPARK-48099: - Summary: Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)` Key: SPARK-48099 URL: https://issues.apache.org/jira/browse/SPARK-48099 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun Attachments: Screenshot 2024-05-02 at 14.59.14.png
[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
[ https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48099: -- Attachment: Screenshot 2024-05-02 at 14.59.14.png > Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)` > > > Key: SPARK-48099 > URL: https://issues.apache.org/jira/browse/SPARK-48099 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Attachments: Screenshot 2024-05-02 at 14.59.14.png > >
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Issue Type: Umbrella (was: Task) > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Umbrella > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 13.18.42.png > > > h2. ASF INFRA POLICY > - https://infra.apache.org/github-actions-policy.html > h2. MONITORING > [https://infra-reports.apache.org/#ghactions=spark=168] > h2. TARGET > * All workflows MUST have a job concurrency level less than or equal to 20. > This means a workflow cannot have more than 20 jobs running at the same time > across all matrices. > * All workflows SHOULD have a job concurrency level less than or equal to 15. > Just because 20 is the max, doesn't mean you should strive for 20. > * The average number of minutes a project uses per calendar week MUST NOT > exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 > hours). > * The average number of minutes a project uses in any consecutive five-day > period MUST NOT exceed the equivalent of 30 full-time runners (216,000 > minutes, or 3,600 hours). > h2. DEADLINE > bq. 17th of May, 2024 > Since the deadline is 17th of May, 2024, I set this as the highest priority, > `Blocker`. > !Screenshot 2024-05-02 at 13.18.42.png|width=100%!
[jira] [Created] (SPARK-48098) Enable `NOLINT_ON_COMPILE` for all except `linter` job
Dongjoon Hyun created SPARK-48098: - Summary: Enable `NOLINT_ON_COMPILE` for all except `linter` job Key: SPARK-48098 URL: https://issues.apache.org/jira/browse/SPARK-48098 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Description: h2. ASF INFRA POLICY - https://infra.apache.org/github-actions-policy.html h2. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h2. TARGET * All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. * All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. * The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). * The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). h2. DEADLINE bq. 17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100%! was: h2. ASF INFRA POLICY - https://infra.apache.org/github-actions-policy.html h2. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h2. TARGET bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. h2. DEADLINE bq. 17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100%! > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 13.18.42.png > > > h2. ASF INFRA POLICY > - https://infra.apache.org/github-actions-policy.html > h2. 
MONITORING > [https://infra-reports.apache.org/#ghactions=spark=168] > h2. TARGET > * All workflows MUST have a job concurrency level less than or equal to 20. > This means a workflow cannot have more than 20 jobs running at the same time > across all matrices. > * All workflows SHOULD have a job concurrency level less than or equal to 15. > Just because 20 is the max, doesn't mean you should strive for 20. > * The average number of minutes a project uses per calendar week MUST NOT > exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 > hours). > * The average number of minutes a project uses in any consecutive five-day > period MUST NOT exceed the equivalent of 30 full-time runners (216,000 > minutes, or 3,600 hours). > h2. DEADLINE > bq. 17th of May, 2024 > Since the deadline is 17th of May, 2024, I set this as the highest priority, > `Blocker`. > !Screenshot 2024-05-02 at 13.18.42.png|width=100%!
[jira] [Updated] (SPARK-48097) Limit GHA job execution time to up to 3 hours in `build_and_test.yml`
[ https://issues.apache.org/jira/browse/SPARK-48097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48097: --- Labels: pull-request-available (was: ) > Limit GHA job execution time to up to 3 hours in `build_and_test.yml` > - > > Key: SPARK-48097 > URL: https://issues.apache.org/jira/browse/SPARK-48097 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-48097) Limit GHA job execution time to up to 3 hours in `build_and_test.yml`
Dongjoon Hyun created SPARK-48097: - Summary: Limit GHA job execution time to up to 3 hours in `build_and_test.yml` Key: SPARK-48097 URL: https://issues.apache.org/jira/browse/SPARK-48097 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Assigned] (SPARK-48095) Run `build_non_ansi.yml` once per day
[ https://issues.apache.org/jira/browse/SPARK-48095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48095: - Assignee: Dongjoon Hyun > Run `build_non_ansi.yml` once per day > - > > Key: SPARK-48095 > URL: https://issues.apache.org/jira/browse/SPARK-48095 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Resolved] (SPARK-48095) Run `build_non_ansi.yml` once per day
[ https://issues.apache.org/jira/browse/SPARK-48095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48095. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46342 [https://github.com/apache/spark/pull/46342] > Run `build_non_ansi.yml` once per day > - > > Key: SPARK-48095 > URL: https://issues.apache.org/jira/browse/SPARK-48095 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Description: h2. ASF INFRA POLICY - https://infra.apache.org/github-actions-policy.html h2. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h2. TARGET bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. h2. DEADLINE bq. 17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100%! was: h2. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h2. TARGET bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. h2. DEADLINE bq. 17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100%! > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 13.18.42.png > > > h2. ASF INFRA POLICY > - https://infra.apache.org/github-actions-policy.html > h2. MONITORING > [https://infra-reports.apache.org/#ghactions=spark=168] > h2. TARGET > bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. > h2. DEADLINE > bq. 17th of May, 2024 > Since the deadline is 17th of May, 2024, I set this as the highest priority, > `Blocker`. > !Screenshot 2024-05-02 at 13.18.42.png|width=100%!
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Description: h2. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h2. TARGET bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. h2. DEADLINE bq. 17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100%! was: h1. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h1. TARGET bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. h2. DEADLINE bq. 17th of May, 2024 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100%! > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 13.18.42.png > > > h2. MONITORING > [https://infra-reports.apache.org/#ghactions=spark=168] > h2. TARGET > bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. > h2. DEADLINE > bq. 17th of May, 2024 > Since the deadline is 17th of May, 2024, I set this as the highest priority, > `Blocker`. > !Screenshot 2024-05-02 at 13.18.42.png|width=100%!
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Description: h1. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h1. TARGET bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100%! was: h1. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h1. TARGET bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100! > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 13.18.42.png > > > h1. MONITORING > [https://infra-reports.apache.org/#ghactions=spark=168] > h1. TARGET > bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. > Since the deadline is 17th of May, 2024, I set this as the highest priority, > `Blocker`. > !Screenshot 2024-05-02 at 13.18.42.png|width=100%!
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Description: *{*}MONITORING{*}* [https://infra-reports.apache.org/#ghactions=spark=168] *{*}TARGET{*}* bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100! was: **MONITORING** https://infra-reports.apache.org/#ghactions=spark=168 Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100%! > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 13.18.42.png > > > *{*}MONITORING{*}* > [https://infra-reports.apache.org/#ghactions=spark=168] > *{*}TARGET{*}* > bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. > Since the deadline is 17th of May, 2024, I set this as the highest priority, > `Blocker`. > !Screenshot 2024-05-02 at 13.18.42.png|width=100!
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Description: h1. MONITORING [https://infra-reports.apache.org/#ghactions=spark=168] h1. TARGET bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100! was: *{*}MONITORING{*}* [https://infra-reports.apache.org/#ghactions=spark=168] *{*}TARGET{*}* bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. Since the deadline is 17th of May, 2024, I set this as the highest priority, `Blocker`. !Screenshot 2024-05-02 at 13.18.42.png|width=100! > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 13.18.42.png > > > h1. MONITORING > [https://infra-reports.apache.org/#ghactions=spark=168] > h1. TARGET > bq. 4,250 hours of build time. This policy went into effect on April 20th[2]. > Since the deadline is 17th of May, 2024, I set this as the highest priority, > `Blocker`. > !Screenshot 2024-05-02 at 13.18.42.png|width=100!
[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance
[ https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48094: -- Attachment: Screenshot 2024-05-02 at 13.18.42.png > Reduce GitHub Action usage according to ASF project allowance > - > > Key: SPARK-48094 > URL: https://issues.apache.org/jira/browse/SPARK-48094 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 13.18.42.png > > > **MONITORING** > https://infra-reports.apache.org/#ghactions=spark=168 > Since the deadline is 17th of May, 2024, I set this as the highest priority, > `Blocker`.
[jira] [Updated] (SPARK-48095) Run `build_non_ansi.yml` once per day
[ https://issues.apache.org/jira/browse/SPARK-48095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48095: --- Labels: pull-request-available (was: ) > Run `build_non_ansi.yml` once per day > - > > Key: SPARK-48095 > URL: https://issues.apache.org/jira/browse/SPARK-48095 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-48095) Run `build_non_ansi.yml` once per day
Dongjoon Hyun created SPARK-48095: - Summary: Run `build_non_ansi.yml` once per day Key: SPARK-48095 URL: https://issues.apache.org/jira/browse/SPARK-48095 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-48093) Add config to switch between client side listener and server side listener
[ https://issues.apache.org/jira/browse/SPARK-48093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48093: --- Labels: pull-request-available (was: ) > Add config to switch between client side listener and server side listener > --- > > Key: SPARK-48093 > URL: https://issues.apache.org/jira/browse/SPARK-48093 > Project: Spark > Issue Type: New Feature > Components: Connect, SS >Affects Versions: 3.5.0, 3.5.1, 3.5.2 >Reporter: Wei Liu >Priority: Major > Labels: pull-request-available > > We are moving the implementation of Streaming Query Listener from server to > client. For clients already running the client side listener, to prevent > regression, we should add a config to let them decide what type of listener > the user wants to use. > > This is only added to 3.5.x published versions. For 4.0 and upwards we only > use the client side listener.
[jira] [Updated] (SPARK-48081) Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
[ https://issues.apache.org/jira/browse/SPARK-48081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48081: -- Fix Version/s: 3.5.2 3.4.4 > Fix ClassCastException in NTile.checkInputDataTypes() when argument is > non-foldable or of wrong type > > > Key: SPARK-48081 > URL: https://issues.apache.org/jira/browse/SPARK-48081 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.4 > > > {code:java} > sql("select ntile(99.9) OVER (order by id) from range(10)"){code} > results in > {code} > java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal > cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal > is in unnamed module of loader 'app'; java.lang.Integer is in module > java.base of loader 'bootstrap') > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99) > at > org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877) > at > org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:267) > at > org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:267) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1(Expression.scala:279) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1$adapted(Expression.scala:279) > at scala.collection.IterableOnceOps.forall(IterableOnce.scala:633) > at scala.collection.IterableOnceOps.forall$(IterableOnce.scala:630) > at scala.collection.AbstractIterable.forall(Iterable.scala:935) > at > org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:279) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$157.applyOrElse(Analyzer.scala:2243) > > 
{code} > instead of the intended user-facing error message. This is a minor bug that > was introduced in a previous error class refactoring PR.
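The defensive pattern this fix implies can be sketched in a few lines: validate that the NTILE argument is a foldable, positive integer before unboxing it, so that a call like `ntile(99.9)` surfaces an analysis error instead of a ClassCastException. This is a simplified, hypothetical model; `Expr`, `IntLit`, `DecimalLit`, and `checkNTileArg` are illustrative names, not Spark's actual classes.

```scala
// Minimal stand-in for Catalyst's expression tree: an expression knows whether
// it is foldable (a constant) and can evaluate itself to some value.
sealed trait Expr { def foldable: Boolean; def eval(): Any }
case class IntLit(value: Int) extends Expr { val foldable = true; def eval(): Any = value }
case class DecimalLit(value: BigDecimal) extends Expr { val foldable = true; def eval(): Any = value }

// Check the type of the evaluated argument BEFORE unboxing it to Int.
// The original bug was the reverse order: an unconditional unbox-to-Int that
// threw ClassCastException when the argument evaluated to a Decimal.
def checkNTileArg(buckets: Expr): Either[String, Int] =
  if (!buckets.foldable) Left("NTILE argument must be a constant expression")
  else buckets.eval() match {
    case i: Int if i > 0 => Right(i) // valid bucket count
    case i: Int          => Left(s"NTILE argument must be positive, got $i")
    case other           => Left(s"NTILE argument must be an integer literal, got $other")
  }
```

With this ordering, `checkNTileArg(DecimalLit(BigDecimal("99.9")))` returns a descriptive `Left` rather than throwing, which is the user-facing behavior the ticket describes as intended.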
[jira] [Updated] (SPARK-48043) Kryo serialization issue with push-based shuffle
[jira] [Updated] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable
[ https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-36705: --- Labels: pull-request-available (was: ) > Disable push based shuffle when IO encryption is enabled or serializer is not > relocatable > - > > Key: SPARK-36705 > URL: https://issues.apache.org/jira/browse/SPARK-36705 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Mridul Muralidharan >Assignee: Minchu Yang >Priority: Blocker > Labels: pull-request-available > Fix For: 3.2.0 > > > Push based shuffle is not compatible with io encryption or non-relocatable > serialization. > This is similar to SPARK-34790 > We have to disable push based shuffle if either of these two are true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
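The compatibility rule above can be sketched as a small predicate. This is an illustrative mirror of the check, not Spark's actual implementation; the config keys are real Spark settings, but the function and its simplifications are assumptions:

```python
def push_based_shuffle_allowed(conf: dict) -> bool:
    """Sketch of the guard: push-based shuffle must stay off when IO
    encryption is enabled or the serializer is not relocatable."""
    requested = conf.get("spark.shuffle.push.enabled", "false") == "true"
    io_encryption = conf.get("spark.io.encryption.enabled", "false") == "true"
    # Simplification: treat KryoSerializer as relocatable. In Spark this is
    # probed at runtime via supportsRelocationOfSerializedObjects.
    serializer = conf.get(
        "spark.serializer", "org.apache.spark.serializer.JavaSerializer"
    )
    relocatable = serializer.endswith("KryoSerializer")
    return requested and not io_encryption and relocatable
```

Either disqualifying condition wins over the user's request, which is the same shape as the SPARK-34790 fix referenced above.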
[jira] [Updated] (SPARK-48043) Kryo serialization issue with push-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-48043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Romain Ardiet updated SPARK-48043: -- Description: I'm running a spark job on AWS EMR. I wanted to test the new push-based shuffle introduced in Spark 3.2 but it's failing with a kryo exception when I'm enabling it. The issue is happening when Executor starts, during KryoSerializerInstance.getAutoReset() check: {code:java} 24/04/24 15:36:22 ERROR YarnCoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to Failed to register classes with Kryo org.apache.spark.SparkException: Failed to register classes with Kryo at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:186) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?] at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:241) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:174) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:105) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48) ~[kryo-shaded-4.0.2.jar:?] 
at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:112) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:352) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.serializer.KryoSerializerInstance.getAutoReset(KryoSerializer.scala:452) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects$lzycompute(KryoSerializer.scala:259) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects(KryoSerializer.scala:255) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.util.Utils$.serializerIsSupported$lzycompute$1(Utils.scala:2721) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.util.Utils$.serializerIsSupported$1(Utils.scala:2716) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.util.Utils$.isPushBasedShuffleEnabled(Utils.scala:2730) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:554) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.executor.Executor.(Executor.scala:143) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:190) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_402] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_402] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402] Caused by: java.lang.ClassNotFoundException: com.analytics.AnalyticsEventWrapper at java.net.URLClassLoader.findClass(URLClassLoader.java:387) ~[?:1.8.0_402] at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_402] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_402] at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_402] at java.lang.Class.forName0(Native Method) ~[?:1.8.0_402] at java.lang.Class.forName(Class.java:348) ~[?:1.8.0_402] at org.apache.spark.util.Utils$.classForName(Utils.scala:228) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$6(KryoSerializer.scala:177) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1] at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) ~[scala-library-2.12.15.jar:?] at
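The trace above boils down to an ordering problem: the executor probes the Kryo serializer (to decide whether push-based shuffle can be enabled) before the application jar that defines the registered class is on the classpath. A toy sketch of that ordering, with hypothetical names rather than Spark's API:

```python
class KryoRegistrationError(RuntimeError):
    pass


def new_kryo(classes_to_register, loadable_classes):
    # Mirrors the behavior seen in the trace: KryoSerializer.newKryo tries
    # to load every class named in spark.kryo.classesToRegister immediately.
    for name in classes_to_register:
        if name not in loadable_classes:
            raise KryoRegistrationError(
                f"Failed to register classes with Kryo: {name}"
            )


def executor_startup_probe(classes_to_register, startup_classpath):
    # BlockManager.initialize -> isPushBasedShuffleEnabled ->
    # supportsRelocationOfSerializedObjects -> getAutoReset -> newKryo.
    # At this point only Spark's own classes are loadable; the application
    # jar defining the registered class has not been fetched yet.
    new_kryo(classes_to_register, startup_classpath)
```

Under this reading, the user class `com.analytics.AnalyticsEventWrapper` is perfectly valid; it is simply not visible at the moment the relocation probe runs.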
[jira] [Created] (SPARK-48093) Add config to switch between client side listener and server side listener
Wei Liu created SPARK-48093: --- Summary: Add config to switch between client side listener and server side listener Key: SPARK-48093 URL: https://issues.apache.org/jira/browse/SPARK-48093 Project: Spark Issue Type: New Feature Components: Connect, SS Affects Versions: 3.5.1, 3.5.0, 3.5.2 Reporter: Wei Liu We are moving the implementation of the Streaming Query Listener from the server to the client. For clients already running the client-side listener, to prevent regressions, we should add a config that lets users decide which type of listener to use. This is only added to published 3.5.x versions; for 4.0 and upwards, only the client-side listener is used.
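The proposed switch is a simple config-gated dispatch. A minimal sketch of the idea; the config key below is a placeholder for illustration only, not the setting name the eventual patch introduces:

```python
def choose_listener(conf: dict) -> str:
    """Pick the listener implementation from a config flag (sketch)."""
    # Hypothetical key; the real 3.5.x name is defined by the patch itself.
    if conf.get("example.streaming.queryListener.serverSide", "false") == "true":
        return "server-side"
    # Default preserves the existing client-side behavior for 3.5.x clients.
    return "client-side"
```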
[jira] [Resolved] (SPARK-48067) Fix Variant default columns for more complex default variants
[ https://issues.apache.org/jira/browse/SPARK-48067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-48067. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46312 [https://github.com/apache/spark/pull/46312] > Fix Variant default columns for more complex default variants > - > > Key: SPARK-48067 > URL: https://issues.apache.org/jira/browse/SPARK-48067 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Richard Chen >Assignee: Richard Chen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Default column values are stored in a StructField metadata (string -> string) > map, which means the literal values are stored as strings. > However, the string representation of a variant is its JSON string, > so we need to wrap the stored string with `parse_json` to use the > default values correctly
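The fix described above amounts to treating variant defaults specially when the stored metadata string is turned back into an expression. A minimal sketch with a hypothetical helper, not Spark's actual code:

```python
def default_expr_sql(stored_literal: str, data_type: str) -> str:
    """Rebuild a default-value expression from StructField metadata (sketch).

    Variant defaults are stored as their JSON string form, so they must be
    re-parsed with parse_json; other types can reuse the literal as-is.
    """
    if data_type == "variant":
        # Escape embedded single quotes for the SQL string literal.
        escaped = stored_literal.replace("'", "''")
        return f"parse_json('{escaped}')"
    return stored_literal
```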
[jira] [Assigned] (SPARK-48067) Fix Variant default columns for more complex default variants
[ https://issues.apache.org/jira/browse/SPARK-48067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-48067: -- Assignee: Richard Chen > Fix Variant default columns for more complex default variants > - > > Key: SPARK-48067 > URL: https://issues.apache.org/jira/browse/SPARK-48067 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Richard Chen >Assignee: Richard Chen >Priority: Major > Labels: pull-request-available > > Default column values are stored in a StructField metadata (string -> string) > map, which means the literal values are stored as strings. > However, the string representation of a variant is its JSON string, > so we need to wrap the stored string with `parse_json` to use the > default values correctly
[jira] [Updated] (SPARK-48089) Streaming query listener not working in 3.5 client <> 4.0 server
[ https://issues.apache.org/jira/browse/SPARK-48089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48089: --- Labels: pull-request-available (was: ) > Streaming query listener not working in 3.5 client <> 4.0 server > > > Key: SPARK-48089 > URL: https://issues.apache.org/jira/browse/SPARK-48089 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > {code} > == > ERROR [1.488s]: test_listener_events > (pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests.test_listener_events) > -- > Traceback (most recent call last): > File > "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/connect/streaming/test_parity_listener.py", > line 53, in test_listener_events > self.spark.streams.addListener(test_listener) > File > "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py", > line 244, in addListener > self._execute_streaming_query_manager_cmd(cmd) > File > "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py", > line 260, in _execute_streaming_query_manager_cmd > (_, properties) = self._session.client.execute_command(exec_cmd) > ^^ > File > "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", > line 982, in execute_command > data, _, _, _, properties = self._execute_and_fetch(req) > > File > "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", > line 1283, in _execute_and_fetch > for response in self._execute_and_fetch_as_iterator(req): > File > "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", > line 1264, in _execute_and_fetch_as_iterator > self._handle_error(error) > File > "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", > line 1503, in _handle_error > self._handle_rpc_error(error) > File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", > line 1539, in _handle_rpc_error > raise convert_exception(info, status.message) from None > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (java.io.EOFException) > -- > {code}
[jira] [Updated] (SPARK-45988) Fix `pyspark.pandas.tests.computation.test_apply_func` in Python 3.11
[ https://issues.apache.org/jira/browse/SPARK-45988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45988: -- Fix Version/s: 3.4.4 > Fix `pyspark.pandas.tests.computation.test_apply_func` in Python 3.11 > - > > Key: SPARK-45988 > URL: https://issues.apache.org/jira/browse/SPARK-45988 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.4 > > > https://github.com/apache/spark/actions/runs/6914662405/job/18812759697 > {code} > == > ERROR [0.686s]: test_apply_batch_with_type > (pyspark.pandas.tests.computation.test_apply_func.FrameApplyFunctionTests.test_apply_batch_with_type) > -- > Traceback (most recent call last): > File > "/__w/spark/spark/python/pyspark/pandas/tests/computation/test_apply_func.py", > line 248, in test_apply_batch_with_type > def identify3(x) -> ps.DataFrame[float, [int, List[int]]]: > ^ > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13540, in > __class_getitem__ > return create_tuple_for_frame_type(params) >^^^ > File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line > 721, in create_tuple_for_frame_type > return Tuple[_to_type_holders(params)] > > File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line > 766, in _to_type_holders > data_types = _new_type_holders(data_types, NameTypeHolder) > ^ > File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line > 832, in _new_type_holders > raise TypeError( > TypeError: Type hints should be specified as one of: > - DataFrame[type, type, ...] > - DataFrame[name: type, name: type, ...] 
> - DataFrame[dtypes instance] > - DataFrame[zip(names, types)] > - DataFrame[index_type, [type, ...]] > - DataFrame[(index_name, index_type), [(name, type), ...]] > - DataFrame[dtype instance, dtypes instance] > - DataFrame[(index_name, index_type), zip(names, types)] > - DataFrame[[index_type, ...], [type, ...]] > - DataFrame[[(index_name, index_type), ...], [(name, type), ...]] > - DataFrame[dtypes instance, dtypes instance] > - DataFrame[zip(index_names, index_types), zip(names, types)] > However, got (, typing.List[int]). > -- > Ran 10 tests in 34.327s > FAILED (errors=1) > {code}
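The error above comes from how Python delivers subscript parameters to `__class_getitem__`: in `ps.DataFrame[float, [int, List[int]]]` everything between the brackets arrives as one tuple, and the nested brackets arrive as a plain list that the library must walk and validate itself. A standalone illustration of the mechanism (a stand-in class, not pandas-on-Spark's real implementation):

```python
from typing import List


class Frame:
    # Stand-in for ps.DataFrame: Python hands the full subscript to
    # __class_getitem__, so `Frame[float, [int, List[int]]]` is received
    # as the tuple (float, [int, List[int]]).
    def __class_getitem__(cls, params):
        return params


params = Frame[float, [int, List[int]]]
```

Validating each element of that inner list against the supported forms is exactly where `_new_type_holders` raises the `TypeError` shown in the traceback.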
[jira] [Updated] (SPARK-45989) Fix `pyspark.pandas.tests.connect.computation.test_parity_apply_func` in Python 3.11
[ https://issues.apache.org/jira/browse/SPARK-45989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45989: -- Fix Version/s: 3.4.4 > Fix `pyspark.pandas.tests.connect.computation.test_parity_apply_func` in > Python 3.11 > > > Key: SPARK-45989 > URL: https://issues.apache.org/jira/browse/SPARK-45989 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 4.0.0, 3.5.2, 3.4.4 > > > https://github.com/apache/spark/actions/runs/6914662405/job/18816505612 > {code} > == > ERROR [1.237s]: test_apply_batch_with_type > (pyspark.pandas.tests.connect.computation.test_parity_apply_func.FrameParityApplyFunctionTests.test_apply_batch_with_type) > -- > Traceback (most recent call last): > File > "/__w/spark/spark/python/pyspark/pandas/tests/computation/test_apply_func.py", > line 248, in test_apply_batch_with_type > def identify3(x) -> ps.DataFrame[float, [int, List[int]]]: > ^ > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13540, in > __class_getitem__ > return create_tuple_for_frame_type(params) >^^^ > File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line > 721, in create_tuple_for_frame_type > return Tuple[_to_type_holders(params)] > > File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line > 766, in _to_type_holders > data_types = _new_type_holders(data_types, NameTypeHolder) > ^ > File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line > 832, in _new_type_holders > raise TypeError( > TypeError: Type hints should be specified as one of: > - DataFrame[type, type, ...] > - DataFrame[name: type, name: type, ...] 
> - DataFrame[dtypes instance] > - DataFrame[zip(names, types)] > - DataFrame[index_type, [type, ...]] > - DataFrame[(index_name, index_type), [(name, type), ...]] > - DataFrame[dtype instance, dtypes instance] > - DataFrame[(index_name, index_type), zip(names, types)] > - DataFrame[[index_type, ...], [type, ...]] > - DataFrame[[(index_name, index_type), ...], [(name, type), ...]] > - DataFrame[dtypes instance, dtypes instance] > - DataFrame[zip(index_names, index_types), zip(names, types)] > However, got (, typing.List[int]). > -- > Ran 10 tests in 78.247s > FAILED (errors=1) > {code}
[jira] [Updated] (SPARK-48081) Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
[ https://issues.apache.org/jira/browse/SPARK-48081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48081: -- Fix Version/s: (was: 3.5.2) (was: 3.4.4) > Fix ClassCastException in NTile.checkInputDataTypes() when argument is > non-foldable or of wrong type > > > Key: SPARK-48081 > URL: https://issues.apache.org/jira/browse/SPARK-48081 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code:java} > sql("select ntile(99.9) OVER (order by id) from range(10)"){code} > results in > {code} > java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal > cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal > is in unnamed module of loader 'app'; java.lang.Integer is in module > java.base of loader 'bootstrap') > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99) > at > org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877) > at > org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:267) > at > org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:267) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1(Expression.scala:279) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1$adapted(Expression.scala:279) > at scala.collection.IterableOnceOps.forall(IterableOnce.scala:633) > at scala.collection.IterableOnceOps.forall$(IterableOnce.scala:630) > at scala.collection.AbstractIterable.forall(Iterable.scala:935) > at > org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:279) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$157.applyOrElse(Analyzer.scala:2243) > > 
{code} > instead of the intended user-facing error message. This is a minor bug that > was introduced in a previous error class refactoring PR.
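The ClassCastException above is the check-after-cast anti-pattern: the argument is unboxed to an Integer before its data type is validated. A hedged Python sketch of the check-before-cast pattern the fix restores (illustrative only, not Spark's actual `checkInputDataTypes` code):

```python
def check_ntile_arg(value):
    """Validate an NTILE(n) argument before treating it as an integer.

    Returns an error string for invalid input instead of crashing with a
    cast error, mirroring the intended user-facing message. This is a
    sketch of the pattern, not the real Spark implementation.
    """
    if not isinstance(value, int) or isinstance(value, bool):
        # Type is checked *before* any integer operation -- the reported
        # bug was an unconditional Integer unboxing running first.
        return f"NTILE expects an integer literal, got {value!r}"
    if value <= 0:
        return f"NTILE expects a positive integer, got {value}"
    return None  # argument is valid

assert check_ntile_arg(99.9) is not None  # decimal argument -> error message
assert check_ntile_arg(4) is None
```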
[jira] [Updated] (SPARK-48092) Spark images are based on Python 3.8 which is soon EOL
[ https://issues.apache.org/jira/browse/SPARK-48092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayur Madnani updated SPARK-48092: -- Description: Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based on Python 3.8 as of now. I am proposing to use Python 3.10 as default. Let me know if I can pick this up to make the changes in [spark-docker|https://github.com/apache/spark-docker] was: Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based on Python 3.8 as of now. I am proposing to use Python 3.10 as default. Let me know if I can pick this up to make the changes in [spark-docker|https://github.com/apache/spark-docker] !Screenshot 2024-05-02 at 21.00.18.png! > Spark images are based on Python 3.8 which is soon EOL > -- > > Key: SPARK-48092 > URL: https://issues.apache.org/jira/browse/SPARK-48092 > Project: Spark > Issue Type: Bug > Components: Spark Docker >Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1 >Reporter: Mayur Madnani >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 21.00.18.png, Screenshot > 2024-05-02 at 21.00.48.png > > > Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based > on Python 3.8 as of now. > I am proposing to use Python 3.10 as default. Let me know if I can pick this > up to make the changes in > [spark-docker|https://github.com/apache/spark-docker]
[jira] [Updated] (SPARK-48092) Spark images are based on Python 3.8 which is soon EOL
[ https://issues.apache.org/jira/browse/SPARK-48092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayur Madnani updated SPARK-48092: -- Attachment: Screenshot 2024-05-02 at 21.00.18.png Screenshot 2024-05-02 at 21.00.48.png > Spark images are based on Python 3.8 which is soon EOL > -- > > Key: SPARK-48092 > URL: https://issues.apache.org/jira/browse/SPARK-48092 > Project: Spark > Issue Type: Bug > Components: Spark Docker >Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1 >Reporter: Mayur Madnani >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 21.00.18.png, Screenshot > 2024-05-02 at 21.00.48.png > > > Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based > on Python 3.8 as of now. > I am proposing to use Python 3.10 as default. Let me know if I can pick this > up to make the changes in > [spark-docker|https://github.com/apache/spark-docker] > > !image-2024-05-02-14-45-18-492.png! > !image-2024-05-02-14-44-43-423.png!
[jira] [Updated] (SPARK-48092) Spark images are based on Python 3.8 which is soon EOL
[ https://issues.apache.org/jira/browse/SPARK-48092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayur Madnani updated SPARK-48092: -- Description: Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based on Python 3.8 as of now. I am proposing to use Python 3.10 as default. Let me know if I can pick this up to make the changes in [spark-docker|https://github.com/apache/spark-docker] !Screenshot 2024-05-02 at 21.00.18.png! was: Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based on Python 3.8 as of now. I am proposing to use Python 3.10 as default. Let me know if I can pick this up to make the changes in [spark-docker|https://github.com/apache/spark-docker] !image-2024-05-02-14-45-18-492.png! !image-2024-05-02-14-44-43-423.png! > Spark images are based on Python 3.8 which is soon EOL > -- > > Key: SPARK-48092 > URL: https://issues.apache.org/jira/browse/SPARK-48092 > Project: Spark > Issue Type: Bug > Components: Spark Docker >Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1 >Reporter: Mayur Madnani >Priority: Blocker > Attachments: Screenshot 2024-05-02 at 21.00.18.png, Screenshot > 2024-05-02 at 21.00.48.png > > > Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based > on Python 3.8 as of now. > I am proposing to use Python 3.10 as default. Let me know if I can pick this > up to make the changes in > [spark-docker|https://github.com/apache/spark-docker] > > !Screenshot 2024-05-02 at 21.00.18.png!
[jira] [Created] (SPARK-48092) Spark images are based on Python 3.8 which is soon EOL
Mayur Madnani created SPARK-48092: - Summary: Spark images are based on Python 3.8 which is soon EOL Key: SPARK-48092 URL: https://issues.apache.org/jira/browse/SPARK-48092 Project: Spark Issue Type: Bug Components: Spark Docker Affects Versions: 3.5.1, 3.5.0, 3.4.1, 3.4.0, 3.4.2 Reporter: Mayur Madnani Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based on Python 3.8 as of now. I am proposing to use Python 3.10 as default. Let me know if I can pick this up to make the changes in [spark-docker|https://github.com/apache/spark-docker] !image-2024-05-02-14-45-18-492.png! !image-2024-05-02-14-44-43-423.png!
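The proposal above is to raise the image baseline from Python 3.8 (EOL October 2024) to 3.10. A minimal sketch of an interpreter gate such an image build could run (hypothetical helper, not part of spark-docker):

```python
import sys

# Minimum interpreter proposed for the images; 3.8 reaches end of life in
# October 2024, so anything older than 3.10 is rejected here.
MIN_PYTHON = (3, 10)

def interpreter_supported(version_info=None):
    """Return True if the running (or given) interpreter meets MIN_PYTHON."""
    v = version_info if version_info is not None else sys.version_info
    # Compare only (major, minor); the patch level does not matter here.
    return tuple(v[:2]) >= MIN_PYTHON

assert interpreter_supported((3, 10, 0))
assert not interpreter_supported((3, 8, 19))
```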
[jira] [Resolved] (SPARK-48079) Upgrade maven-install/deploy-plugin to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-48079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48079. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46330 [https://github.com/apache/spark/pull/46330] > Upgrade maven-install/deploy-plugin to 3.1.2 > > > Key: SPARK-48079 > URL: https://issues.apache.org/jira/browse/SPARK-48079 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Resolved] (SPARK-48081) Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type
[ https://issues.apache.org/jira/browse/SPARK-48081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48081. --- Fix Version/s: 3.4.4 3.5.2 4.0.0 Resolution: Fixed Issue resolved by pull request 46333 [https://github.com/apache/spark/pull/46333] > Fix ClassCastException in NTile.checkInputDataTypes() when argument is > non-foldable or of wrong type > > > Key: SPARK-48081 > URL: https://issues.apache.org/jira/browse/SPARK-48081 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Labels: pull-request-available > Fix For: 3.4.4, 3.5.2, 4.0.0 > > > {code:java} > sql("select ntile(99.9) OVER (order by id) from range(10)"){code} > results in > {code} > java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal > cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal > is in unnamed module of loader 'app'; java.lang.Integer is in module > java.base of loader 'bootstrap') > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99) > at > org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877) > at > org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:267) > at > org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:267) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1(Expression.scala:279) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1$adapted(Expression.scala:279) > at scala.collection.IterableOnceOps.forall(IterableOnce.scala:633) > at scala.collection.IterableOnceOps.forall$(IterableOnce.scala:630) > at scala.collection.AbstractIterable.forall(Iterable.scala:935) > at > org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:279) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$157.applyOrElse(Analyzer.scala:2243) > > {code} > instead of the intended user-facing error message. This is a minor bug that > was introduced in a previous error class refactoring PR.
[jira] [Updated] (SPARK-48072) Improve SQLQuerySuite test output - use `===` instead of `sameElements` for Arrays
[ https://issues.apache.org/jira/browse/SPARK-48072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48072: -- Summary: Improve SQLQuerySuite test output - use `===` instead of `sameElements` for Arrays (was: Test output is not descriptive for some Array comparisons in SQLQuerySuite) > Improve SQLQuerySuite test output - use `===` instead of `sameElements` for > Arrays > -- > > Key: SPARK-48072 > URL: https://issues.apache.org/jira/browse/SPARK-48072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Vladimir Golubev >Assignee: Vladimir Golubev >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Actual and expected queries are not printed in the output when using > `.sameElements`
[jira] [Assigned] (SPARK-48072) Test output is not descriptive for some Array comparisons in SQLQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-48072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48072: - Assignee: Vladimir Golubev > Test output is not descriptive for some Array comparisons in SQLQuerySuite > -- > > Key: SPARK-48072 > URL: https://issues.apache.org/jira/browse/SPARK-48072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Vladimir Golubev >Assignee: Vladimir Golubev >Priority: Minor > Labels: pull-request-available > > Actual and expected queries are not printed in the output when using > `.sameElements`
[jira] [Resolved] (SPARK-48072) Test output is not descriptive for some Array comparisons in SQLQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-48072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48072. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46318 [https://github.com/apache/spark/pull/46318] > Test output is not descriptive for some Array comparisons in SQLQuerySuite > -- > > Key: SPARK-48072 > URL: https://issues.apache.org/jira/browse/SPARK-48072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Vladimir Golubev >Assignee: Vladimir Golubev >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Actual and expected queries are not printed in the output when using > `.sameElements`
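The motivation for SPARK-48072 is that a bare boolean check hides the operands when a comparison fails. The same contrast can be shown with Python's unittest standing in for Scala's `Array.sameElements` versus ScalaTest's `===` (an analogy, not the actual Scala suite):

```python
import unittest

tc = unittest.TestCase()

# Opaque comparison: like assert(actual.sameElements(expected)) in Scala,
# the failure message carries no operand values at all.
try:
    tc.assertTrue([1, 2, 3] == [1, 2, 4])
except AssertionError as e:
    opaque_msg = str(e)

# Descriptive comparison: like ScalaTest's ===, both operands (and a diff)
# appear in the failure message.
try:
    tc.assertEqual([1, 2, 3], [1, 2, 4])
except AssertionError as e:
    descriptive_msg = str(e)

assert "[1, 2, 4]" not in opaque_msg       # no operand detail
assert "[1, 2, 4]" in descriptive_msg      # operands printed
```

This is exactly why the suite switched comparisons: the second style makes the actual and expected query results visible in CI logs.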
[jira] [Updated] (SPARK-48091) Using `explode` together with `transform` in the same select statement causes aliases in the transformed column to be ignored
[ https://issues.apache.org/jira/browse/SPARK-48091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Serruya updated SPARK-48091: Description: When using an `explode` function, and `transform` function in the same select statement, aliases used inside the transformed column are ignored. This behaviour only happens using the pyspark API, and not when using the SQL API {code:java} from pyspark.sql import functions as F # Create the df df = spark.createDataFrame([ {"id": 1, "array1": ['a', 'b'], 'array2': [2,3,4]} ]){code} Good case, where all aliases are used {code:java} df.select( F.transform( 'array2', lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias")) ).alias("new_array2") ).printSchema() root |-- new_array2: array (nullable = true) ||-- element: struct (containsNull = false) |||-- some_alias: long (nullable = true) |||-- second_alias: long (nullable = true){code} Bad case, when using explode, the aliases inside the transformed column are ignored, and `id` is kept instead of `second_alias`, and `x_17` is used instead of `some_alias` {code:java} df.select( F.explode("array1").alias("exploded"), F.transform( 'array2', lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias")) ).alias("new_array2") ).printSchema() root |-- exploded: string (nullable = true) |-- new_array2: array (nullable = true) ||-- element: struct (containsNull = false) |||-- x_17: long (nullable = true) |||-- id: long (nullable = true) {code} When using the SQL API instead, it works fine {code:java} spark.sql( """ select explode(array1) as exploded, transform(array2, x-> struct(x as some_alias, id as second_alias)) as array2 from {df} """, df=df ).printSchema() root |-- exploded: string (nullable = true) |-- array2: array (nullable = true) ||-- element: struct (containsNull = false) |||-- some_alias: long (nullable = true) |||-- second_alias: long (nullable = true) {code} Workaround: for now, using F.named_struct can be used as a
workaround was: When using an `explode` function, and `transform` function in the same select statement, aliases used inside the transformed column are ignored. This behaviour only happens using the pyspark API, and not when using the SQL API {code:java} from pyspark.sql import functions as F # Create the df df = spark.createDataFrame([ {"id": 1, "array1": ['a', 'b'], 'array2': [2,3,4]} ]){code} Good case, where all aliases are used {code:java} df.select( F.transform( 'array2', lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias")) ).alias("new_array2") ).printSchema() root |-- new_array2: array (nullable = true) ||-- element: struct (containsNull = false) |||-- some_alias: long (nullable = true) |||-- second_alias: long (nullable = true){code} Bad case, when using explode, the alises inside the transformed column is ignored, and `id` is kept instead of `second_alias`, and `x_17` is used instead of `some_alias` {code:java} df.select( F.explode("array1").alias("exploded"), F.transform( 'array2', lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias")) ).alias("new_array2") ).printSchema() root |-- exploded: string (nullable = true) |-- new_array2: array (nullable = true) ||-- element: struct (containsNull = false) |||-- x_17: long (nullable = true) |||-- id: long (nullable = true) {code} When using the SQL API instead, it works fine {code:java} spark.sql( """ select explode(array1) as exploded, transform(array2, x-> struct(x as some_alias, id as second_alias)) as array2 from {df} """, df=df ).printSchema() root |-- exploded: string (nullable = true) |-- array2: array (nullable = true) ||-- element: struct (containsNull = false) |||-- some_alias: long (nullable = true) |||-- second_alias: long (nullable = true) {code} > Using `explode` together with `transform` in the same select statement causes > aliases in the transformed column to be ignored > - > > Key: SPARK-48091 > URL: https://issues.apache.org/jira/browse/SPARK-48091 
> Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0, 3.5.0, 3.5.1 > Environment: Python 3.10, 3.12, OSX 14.4 and Databricks DBR 13.3, > 14.3, Pyspark 3.4.0, 3.5.0, 3.5.1 >Reporter: Ron Serruya >Priority: Minor > Labels: PySpark, alias > > When using an `explode` function, and `transform` function in the same select >
[jira] [Created] (SPARK-48091) Using `explode` together with `transform` in the same select statement causes aliases in the transformed column to be ignored
Ron Serruya created SPARK-48091: --- Summary: Using `explode` together with `transform` in the same select statement causes aliases in the transformed column to be ignored Key: SPARK-48091 URL: https://issues.apache.org/jira/browse/SPARK-48091 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.5.1, 3.5.0, 3.4.0 Environment: Python 3.10, 3.12, OSX 14.4 and Databricks DBR 13.3, 14.3, Pyspark 3.4.0, 3.5.0, 3.5.1 Reporter: Ron Serruya When using an `explode` function, and `transform` function in the same select statement, aliases used inside the transformed column are ignored. This behaviour only happens using the pyspark API, and not when using the SQL API {code:java} from pyspark.sql import functions as F # Create the df df = spark.createDataFrame([ {"id": 1, "array1": ['a', 'b'], 'array2': [2,3,4]} ]){code} Good case, where all aliases are used {code:java} df.select( F.transform( 'array2', lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias")) ).alias("new_array2") ).printSchema() root |-- new_array2: array (nullable = true) ||-- element: struct (containsNull = false) |||-- some_alias: long (nullable = true) |||-- second_alias: long (nullable = true){code} Bad case, when using explode, the aliases inside the transformed column are ignored, and `id` is kept instead of `second_alias`, and `x_17` is used instead of `some_alias` {code:java} df.select( F.explode("array1").alias("exploded"), F.transform( 'array2', lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias")) ).alias("new_array2") ).printSchema() root |-- exploded: string (nullable = true) |-- new_array2: array (nullable = true) ||-- element: struct (containsNull = false) |||-- x_17: long (nullable = true) |||-- id: long (nullable = true) {code} When using the SQL API instead, it works fine {code:java} spark.sql( """ select explode(array1) as exploded, transform(array2, x-> struct(x as some_alias, id as second_alias)) as array2 from {df} """, df=df
).printSchema() root |-- exploded: string (nullable = true) |-- array2: array (nullable = true) ||-- element: struct (containsNull = false) |||-- some_alias: long (nullable = true) |||-- second_alias: long (nullable = true) {code}
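The symptom above (`id` and `x_17` replacing the user-supplied aliases) is consistent with an analysis rewrite that reconstructs the inner expressions without copying their alias metadata. A toy model in plain Python dataclasses, not Spark internals, makes the failure mode and its fix concrete:

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Column:
    """Toy stand-in for a named expression with optional user alias."""
    expr: str
    alias: Optional[str] = None

def rebuild_buggy(col):
    # Rebuilds the expression but drops the alias metadata -- a toy
    # analogue of the schema showing `id`/`x_17` instead of the aliases.
    return Column(expr=col.expr)

def rebuild_fixed(col):
    # Carries every field over when only the expression is rewritten.
    return replace(col, expr=col.expr)

c = Column("id", alias="second_alias")
assert rebuild_buggy(c).alias is None            # alias silently lost
assert rebuild_fixed(c).alias == "second_alias"  # alias preserved
```

The reporter's workaround (`F.named_struct`, which takes the field names as explicit arguments) sidesteps the problem because the names are part of the expression itself rather than metadata that a rewrite can drop.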
[jira] [Resolved] (SPARK-48056) [CONNECT][PYTHON] Session not found error should automatically retry during reattach
[ https://issues.apache.org/jira/browse/SPARK-48056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48056. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46297 [https://github.com/apache/spark/pull/46297] > [CONNECT][PYTHON] Session not found error should automatically retry during > reattach > > > Key: SPARK-48056 > URL: https://issues.apache.org/jira/browse/SPARK-48056 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.3 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > When an OPERATION_NOT_FOUND error is raised and no prior responses were > received, the client retries the ExecutePlan RPC: > [https://github.com/apache/spark/blob/e6217c111fbdd73f202400494c42091e93d3041f/python/pyspark/sql/connect/client/reattach.py#L257] > > Another error SESSION_NOT_FOUND should follow the same logic. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
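The policy described in the issue is simple: retry the ExecutePlan RPC only when the server lost track of the operation or session *and* no partial results were consumed yet. A minimal sketch of that decision (illustrative names, not the actual `pyspark.sql.connect` client code):

```python
# Errors that mean the server no longer knows about our execution; safe to
# re-issue ExecutePlan only if we have consumed no responses yet.
RETRYABLE_WHEN_NO_RESPONSES = {"OPERATION_NOT_FOUND", "SESSION_NOT_FOUND"}

def should_retry_reattach(error_code, responses_received):
    """Decide whether a reattach failure warrants retrying ExecutePlan.

    Sketch of the policy from SPARK-48056: once any partial responses have
    been received, a blind retry could duplicate or reorder results, so
    the retry is only safe from a clean slate.
    """
    return responses_received == 0 and error_code in RETRYABLE_WHEN_NO_RESPONSES

assert should_retry_reattach("SESSION_NOT_FOUND", 0)
assert not should_retry_reattach("SESSION_NOT_FOUND", 3)
```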
[jira] [Updated] (SPARK-48088) Skip tests failing in client 3.5 <> server 4.0
[ https://issues.apache.org/jira/browse/SPARK-48088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48088: --- Labels: pull-request-available (was: ) > Skip tests failing in client 3.5 <> server 4.0 > --- > > Key: SPARK-48088 > URL: https://issues.apache.org/jira/browse/SPARK-48088 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.2 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > We should skip, and set the CI first.
[jira] [Created] (SPARK-48090) Streaming exception catch failure in 3.5 client <> 4.0 server
Hyukjin Kwon created SPARK-48090: Summary: Streaming exception catch failure in 3.5 client <> 4.0 server Key: SPARK-48090 URL: https://issues.apache.org/jira/browse/SPARK-48090 Project: Spark Issue Type: Sub-task Components: PySpark, Structured Streaming Affects Versions: 4.0.0 Reporter: Hyukjin Kwon {code} == FAIL [1.975s]: test_stream_exception (pyspark.sql.tests.connect.streaming.test_parity_streaming.StreamingParityTests.test_stream_exception) -- Traceback (most recent call last): File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 287, in test_stream_exception sq.processAllAvailable() File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py", line 129, in processAllAvailable self._execute_streaming_query_cmd(cmd) File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py", line 177, in _execute_streaming_query_cmd (_, properties) = self._session.client.execute_command(exec_cmd) ^^ File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 982, in execute_command data, _, _, _, properties = self._execute_and_fetch(req) File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1283, in _execute_and_fetch for response in self._execute_and_fetch_as_iterator(req): File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1264, in _execute_and_fetch_as_iterator self._handle_error(error) File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line [150](https://github.com/HyukjinKwon/spark/actions/runs/8907172876/job/24460568471#step:9:151)3, in _handle_error self._handle_rpc_error(error) File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1539, in _handle_rpc_error raise convert_exception(info, status.message) from None pyspark.errors.exceptions.connect.StreamingQueryException: [STREAM_FAILED] Query [id = 
38d0d145-1f57-4b92-b317-d9de727d9468, runId = 2b963119-d391-4c62-abea-970274859b80] terminated with exception: Job aborted due to stage failure: Task 0 in stage 79.0 failed 1 times, most recent failure: Lost task 0.0 in stage 79.0 (TID 116) (fv-az1144-341.tm43j05r3bqe3lauap1nzddazg.ex.internal.cloudapp.net executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1834, in main process() File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1826, in process serializer.dump_stream(out_iter, outfile) File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 224, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 145, in dump_stream for obj in iterator: File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 213, in _batched for item in iterator: File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1734, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for arg_offsets, f in udfs) ^ File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1734, in result = tuple(f(*[a[o] for o in arg_offsets]) for arg_offsets, f in udfs) ^^^ File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 112, in return args_kwargs_offsets, lambda *a: func(*a) File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 118, in wrapper return f(*args, **kwargs) ^^ File "/home/runner/work/spark/spark-3 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 291, in test_stream_exception 
self._assert_exception_tree_contains_msg(e, "ZeroDivisionError") File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py", line 300, in _assert_exception_tree_contains_msg
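The failing assertion above walks the chained exceptions looking for "ZeroDivisionError" in a nested cause. A minimal standalone helper in the spirit of `_assert_exception_tree_contains_msg` (illustrative, not pyspark's implementation) shows the traversal the test relies on:

```python
def exception_tree_contains(exc, msg):
    """Walk an exception's __cause__/__context__ chain looking for msg.

    Illustrative helper, not pyspark's: returns True if msg occurs in the
    string form of the exception or any exception it was raised from.
    """
    seen = set()  # guard against accidental cycles in the chain
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        if msg in str(exc):
            return True
        exc = exc.__cause__ or exc.__context__
    return False

# Simulate a worker error wrapped in a query-level failure.
try:
    try:
        1 / 0
    except ZeroDivisionError as inner:
        raise RuntimeError("query terminated") from inner
except RuntimeError as outer:
    found = exception_tree_contains(outer, "division by zero")

assert found  # the nested cause is reachable through __cause__
```

The parity failure reported here is that in the 3.5 client <> 4.0 server pairing the nested cause never makes it back across the wire in a form this traversal can find.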
[jira] [Created] (SPARK-48089) Streaming query listener not working in 3.5 client <> 4.0 server
Hyukjin Kwon created SPARK-48089: Summary: Streaming query listener not working in 3.5 client <> 4.0 server Key: SPARK-48089 URL: https://issues.apache.org/jira/browse/SPARK-48089 Project: Spark Issue Type: Sub-task Components: PySpark, Structured Streaming Affects Versions: 4.0.0 Reporter: Hyukjin Kwon {code} == ERROR [1.488s]: test_listener_events (pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests.test_listener_events) -- Traceback (most recent call last): File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/connect/streaming/test_parity_listener.py", line 53, in test_listener_events self.spark.streams.addListener(test_listener) File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py", line 244, in addListener self._execute_streaming_query_manager_cmd(cmd) File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py", line 260, in _execute_streaming_query_manager_cmd (_, properties) = self._session.client.execute_command(exec_cmd) ^^ File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 982, in execute_command data, _, _, _, properties = self._execute_and_fetch(req) File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1283, in _execute_and_fetch for response in self._execute_and_fetch_as_iterator(req): File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1264, in _execute_and_fetch_as_iterator self._handle_error(error) File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1503, in _handle_error self._handle_rpc_error(error) File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1539, in _handle_rpc_error raise convert_exception(info, status.message) from None pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.io.EOFException) -- {code} -- This message was sent by Atlassian 
Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48085) ANSI enabled by default brings different results in the tests in 3.5 client <> 4.0 server
[ https://issues.apache.org/jira/browse/SPARK-48085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48085.
--
Resolution: Invalid

We actually made all of this compatible in the master branch. I will avoid backporting the changes because they only affect tests.

> ANSI enabled by default brings different results in the tests in 3.5 client
> <> 4.0 server
> -
>
> Key: SPARK-48085
> URL: https://issues.apache.org/jira/browse/SPARK-48085
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark, SQL
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> {code}
> ==
> FAIL [0.169s]: test_checking_csv_header
> (pyspark.sql.tests.connect.test_parity_datasources.DataSourcesParityTests.test_checking_csv_header)
> --
> pyspark.errors.exceptions.connect.SparkConnectGrpcException:
> (org.apache.spark.SparkException) [FAILED_READ_FILE.NO_HINT] Encountered
> error while reading file
> file:///home/runner/work/spark/spark-3.5/python/target/38acabf5-710b-4c21-b359-f61619e2adc7/tmpm7qyq23g/part-0-d6c8793b-772d-44e7-bcca-6eeae9cc0ec7-c000.csv.
> SQLSTATE: KD001
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/test_datasources.py", line 167, in test_checking_csv_header
>     self.assertRaisesRegex(
> AssertionError: "CSV header does not conform to the schema" does not match
> "(org.apache.spark.SparkException) [FAILED_READ_FILE.NO_HINT] Encountered
> error while reading file
> file:///home/runner/work/spark/spark-3.5/python/target/38acabf5-710b-4c21-b359-f61619e2adc7/tmpm7qyq23g/part-0-d6c8793b-772d-44e7-bcca-6eeae9cc0ec7-c000.csv.
> SQLSTATE: KD001"
> {code}
> {code}
> ==
> ERROR [0.059s]: test_large_variable_types
> (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests.test_large_variable_types)
> --
> Traceback (most recent call last):
>   File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/pandas/test_pandas_map.py", line 115, in test_large_variable_types
>     actual = df.mapInPandas(func, "str string, bin binary").collect()
>   File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", line 1645, in collect
>     table, schema = self._session.client.to_table(query)
>   File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 858, in to_table
>     table, schema, _, _, _ = self._execute_and_fetch(req)
>   File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1283, in _execute_and_fetch
>     for response in self._execute_and_fetch_as_iterator(req):
>   File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1264, in _execute_and_fetch_as_iterator
>     self._handle_error(error)
>   File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1503, in _handle_error
>     self._handle_rpc_error(error)
>   File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1539, in _handle_rpc_error
>     raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.IllegalArgumentException:
> [INVALID_PARAMETER_VALUE.CHARSET] The value of parameter(s) `charset` in
> `encode` is invalid: expects one of the charsets 'US-ASCII', 'ISO-8859-1',
> 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', but got utf8.
> SQLSTATE: 22023
> {code}
> {code}
> ==
> ERROR [0.024s]: test_assert_approx_equal_decimaltype_custom_rtol_pass
> (pyspark.sql.tests.connect.test_utils.ConnectUtilsTests.test_assert_approx_equal_decimaltype_custom_rtol_pass)
> --
> Traceback (most recent call last):
>   File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/test_utils.py", line 279, in test_assert_approx_equal_decimaltype_custom_rtol_pass
>
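The first failure above is a message mismatch rather than a behavior change: the 3.5-era test asserts on the old CSV-header error text, while the 4.0 server now surfaces a `FAILED_READ_FILE` error. A minimal sketch of why `assertRaisesRegex` stops matching, in plain Python with no Spark required (the strings are abbreviated from the log above; the `file:///...` path is shortened, not the real one):

```python
import re

# Pattern the 3.5 test expects (from test_checking_csv_header):
expected_pattern = "CSV header does not conform to the schema"

# Message actually raised by the 4.0 server (abbreviated from the log above):
actual_message = (
    "(org.apache.spark.SparkException) [FAILED_READ_FILE.NO_HINT] "
    "Encountered error while reading file file:///...part-0-...-c000.csv. "
    "SQLSTATE: KD001"
)

# assertRaisesRegex performs a regex search of the pattern against the
# raised message; the old pattern no longer appears anywhere in the new
# message, so the search finds nothing and the assertion fails.
print(re.search(expected_pattern, actual_message))  # None
```

This is why the resolution is "make the tests compatible" rather than a server-side fix: the error is expected under the new defaults, and only the test's expectation is stale.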
[jira] [Updated] (SPARK-48085) ANSI enabled by default brings different results in the tests in 3.5 client <> 4.0 server
[ https://issues.apache.org/jira/browse/SPARK-48085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-48085:
--
Description:

{code}
==
FAIL [0.169s]: test_checking_csv_header
(pyspark.sql.tests.connect.test_parity_datasources.DataSourcesParityTests.test_checking_csv_header)
--
pyspark.errors.exceptions.connect.SparkConnectGrpcException:
(org.apache.spark.SparkException) [FAILED_READ_FILE.NO_HINT] Encountered error while reading file
file:///home/runner/work/spark/spark-3.5/python/target/38acabf5-710b-4c21-b359-f61619e2adc7/tmpm7qyq23g/part-0-d6c8793b-772d-44e7-bcca-6eeae9cc0ec7-c000.csv.
SQLSTATE: KD001

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/test_datasources.py", line 167, in test_checking_csv_header
    self.assertRaisesRegex(
AssertionError: "CSV header does not conform to the schema" does not match
"(org.apache.spark.SparkException) [FAILED_READ_FILE.NO_HINT] Encountered error while reading file
file:///home/runner/work/spark/spark-3.5/python/target/38acabf5-710b-4c21-b359-f61619e2adc7/tmpm7qyq23g/part-0-d6c8793b-772d-44e7-bcca-6eeae9cc0ec7-c000.csv.
SQLSTATE: KD001"
{code}

{code}
==
ERROR [0.059s]: test_large_variable_types
(pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests.test_large_variable_types)
--
Traceback (most recent call last):
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/pandas/test_pandas_map.py", line 115, in test_large_variable_types
    actual = df.mapInPandas(func, "str string, bin binary").collect()
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", line 1645, in collect
    table, schema = self._session.client.to_table(query)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 858, in to_table
    table, schema, _, _, _ = self._execute_and_fetch(req)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1283, in _execute_and_fetch
    for response in self._execute_and_fetch_as_iterator(req):
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1264, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1503, in _handle_error
    self._handle_rpc_error(error)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1539, in _handle_rpc_error
    raise convert_exception(info, status.message) from None
pyspark.errors.exceptions.connect.IllegalArgumentException:
[INVALID_PARAMETER_VALUE.CHARSET] The value of parameter(s) `charset` in `encode` is invalid:
expects one of the charsets 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', but got utf8.
SQLSTATE: 22023
{code}

{code}
==
ERROR [0.024s]: test_assert_approx_equal_decimaltype_custom_rtol_pass
(pyspark.sql.tests.connect.test_utils.ConnectUtilsTests.test_assert_approx_equal_decimaltype_custom_rtol_pass)
--
Traceback (most recent call last):
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/test_utils.py", line 279, in test_assert_approx_equal_decimaltype_custom_rtol_pass
    assertDataFrameEqual(df1, df2, rtol=1e-1)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/testing/utils.py", line 595, in assertDataFrameEqual
    actual_list = actual.collect()
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", line 1645, in collect
    table, schema = self._session.client.to_table(query)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 858, in to_table
    table, schema, _, _, _ = self._execute_and_fetch(req)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", line 1283, in _execute_and_fetch
    for
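The `INVALID_PARAMETER_VALUE.CHARSET` error above comes from `encode` being strict about charset names on the 4.0 server: only the six canonical names listed in the message are accepted, so the `utf8` alias that the 3.5-era test passed is now rejected. A rough sketch of that kind of strict validation, in plain Python (the function name and error text are illustrative, not the actual Spark implementation):

```python
# Canonical charset names accepted by `encode`, per the error message above.
VALID_CHARSETS = {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"}

def validate_charset(charset: str) -> str:
    """Hypothetical strict check: reject aliases such as 'utf8'."""
    if charset not in VALID_CHARSETS:
        raise ValueError(
            f"[INVALID_PARAMETER_VALUE.CHARSET] expects one of the charsets "
            f"{sorted(VALID_CHARSETS)}, but got {charset}."
        )
    return charset

validate_charset("UTF-8")  # canonical name: accepted
try:
    validate_charset("utf8")  # alias: rejected, mirroring the failure above
except ValueError as e:
    print(e)
```

Note that Python's own codecs treat `utf8` and `UTF-8` as the same encoding; the mismatch only appears because the server-side check compares against the canonical list literally.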
[jira] [Updated] (SPARK-48088) Skip tests being failed in client 3.5 <> server 4.0
[ https://issues.apache.org/jira/browse/SPARK-48088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-48088:
--
Affects Version/s: 3.5.2
(was: 4.0.0)

> Skip tests being failed in client 3.5 <> server 4.0
> ---
>
> Key: SPARK-48088
> URL: https://issues.apache.org/jira/browse/SPARK-48088
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.2
> Reporter: Hyukjin Kwon
> Priority: Major
>
> We should skip the failing tests first, and get the CI set up.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
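SPARK-48088's "skip first, set up CI, fix later" approach can be sketched with a plain `unittest` version guard. The version tuples and the way they are obtained here are illustrative assumptions; Spark's real cross-version tests derive the client and server versions from the installed packages and the connection:

```python
import unittest

# Hypothetical version constants for illustration only.
CLIENT_VERSION = (3, 5)
SERVER_VERSION = (4, 0)

class CsvHeaderParityTest(unittest.TestCase):
    # Skip whenever the client is older than the server, until the
    # behavior differences tracked by SPARK-48085 are resolved.
    @unittest.skipIf(
        CLIENT_VERSION < SERVER_VERSION,
        "Skipped in client 3.5 <> server 4.0 (SPARK-48088)",
    )
    def test_checking_csv_header(self):
        self.assertTrue(True)  # placeholder body

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(CsvHeaderParityTest)
)
print(len(result.skipped))  # 1
```

Skipping at the decorator level keeps the test visible in the report (counted as skipped, with the reason attached), which makes it easy to find and re-enable the tests once the compatibility work lands.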