[jira] [Commented] (SPARK-48105) Fix the data corruption issue when state store unload and snapshotting happens concurrently for HDFS state store

2024-05-02 Thread Anish Shrigondekar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843123#comment-17843123
 ] 

Anish Shrigondekar commented on SPARK-48105:


Thanks [~huanli.wang] 

Worth noting that we believe this also fixes the stream-stream join null 
pointer issue - https://issues.apache.org/jira/browse/SPARK-31754

That failure was effectively caused by the same state data loss/corruption 
scenarios explained above.

cc - [~kabhwan] 

>  Fix the data corruption issue when state store unload and snapshotting 
> happens concurrently for HDFS state store
> -
>
> Key: SPARK-48105
> URL: https://issues.apache.org/jira/browse/SPARK-48105
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Huanli Wang
>Priority: Major
>  Labels: pull-request-available
>
> There are two race conditions between state store snapshotting and state 
> store unloading which could result in query failure and potential data 
> corruption.
>  
> Case 1:
>  # The maintenance thread pool encounters some issue and calls 
> [stopMaintenanceTask|https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L774], 
> which in turn calls 
> [threadPool.stop|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L587]. 
> However, it doesn't wait for the stop operation to complete before moving on 
> to the state store [unload and 
> clear|https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L775-L778].
>  # The provider unload [closes the state 
> store|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L719-L721], 
> which [clears the values of 
> loadedMaps|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L353-L355] 
> for the HDFS-backed state store.
>  # The not-yet-stopped maintenance thread may still be running and trying to 
> take the snapshot, but the data in the underlying `HDFSBackedStateStoreMap` 
> has already been removed. If this snapshot completes successfully, we write 
> corrupted data and the following batches consume it.
> Case 2:
>  # In executor_1, the maintenance thread is about to snapshot state_store_1: 
> it retrieves the `HDFSBackedStateStoreMap` object from loadedMaps and then 
> [releases the lock of the 
> loadedMaps|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L750-L751].
>  # state_store_1 is loaded on another executor, e.g. executor_2.
>  # Another state store, state_store_2, is loaded on executor_1 and calls 
> [reportActiveStoreInstance|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L854-L871] 
> to the driver.
>  # executor_1 
> [unloads|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L713] 
> the state stores that are no longer active, which clears the data entries in 
> the `HDFSBackedStateStoreMap`.
>  # The snapshotting thread terminates and uploads an incomplete snapshot to 
> cloud storage because the [iterator has no next 
> element|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L634] 
> after the clear.
>  # Future batches consume the corrupted data.
>  
> Proposed fix:
>  * When we close the HDFS state store, we should only remove the entry from 
> `loadedMaps` rather than actively clearing the data; the JVM GC can then 
> reclaim those objects (a minimal sketch follows below).
>  * We should wait for the maintenance thread to stop before unloading the 
> providers.
>  
> Thanks [~anishshri-db] for helping debug this issue!
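
To make the first bullet of the proposed fix concrete, here is a minimal Scala 
sketch. The names (loadedMaps, closeBeforeFix, closeAfterFix) and the plain 
ConcurrentHashMap are simplified, hypothetical stand-ins, not the actual 
HDFSBackedStateStoreProvider internals:

```scala
import java.util.concurrent.ConcurrentHashMap

// Simplified, hypothetical stand-ins for the provider's in-memory state;
// not the actual HDFSBackedStateStoreProvider internals.
object StateStoreCloseSketch {
  type VersionedMap = java.util.HashMap[String, Array[Byte]]

  // version -> in-memory key/value map for that version
  val loadedMaps = new ConcurrentHashMap[Long, VersionedMap]()

  // Behavior being fixed: clearing the inner maps also empties the objects that a
  // concurrent maintenance/snapshot thread may still be iterating, so the snapshot
  // is written from an emptied map.
  def closeBeforeFix(): Unit = {
    loadedMaps.values().forEach(m => m.clear())
    loadedMaps.clear()
  }

  // Proposed behavior: only drop the entries from loadedMaps. A snapshot thread
  // that already holds a map reference keeps a fully populated object, and the
  // JVM GC reclaims it once nothing references it anymore.
  def closeAfterFix(): Unit = {
    loadedMaps.clear()
  }
}
```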



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48103) Promote ` KubernetesDriverBuilder` to `DeveloperApi`

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48103:
---
Labels: pull-request-available  (was: )

> Promote ` KubernetesDriverBuilder` to `DeveloperApi`
> 
>
> Key: SPARK-48103
> URL: https://issues.apache.org/jira/browse/SPARK-48103
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48105) Fix the data corruption issue when state store unload and snapshotting happens concurrently for HDFS state store

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48105:
---
Labels: pull-request-available  (was: )

>  Fix the data corruption issue when state store unload and snapshotting 
> happens concurrently for HDFS state store
> -
>
> Key: SPARK-48105
> URL: https://issues.apache.org/jira/browse/SPARK-48105
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Huanli Wang
>Priority: Major
>  Labels: pull-request-available
>
> There are two race conditions between state store snapshotting and state 
> store unloading which could result in query failure and potential data 
> corruption.
>  
> Case 1:
>  # The maintenance thread pool encounters some issue and calls 
> [stopMaintenanceTask|https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L774], 
> which in turn calls 
> [threadPool.stop|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L587]. 
> However, it doesn't wait for the stop operation to complete before moving on 
> to the state store [unload and 
> clear|https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L775-L778].
>  # The provider unload [closes the state 
> store|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L719-L721], 
> which [clears the values of 
> loadedMaps|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L353-L355] 
> for the HDFS-backed state store.
>  # The not-yet-stopped maintenance thread may still be running and trying to 
> take the snapshot, but the data in the underlying `HDFSBackedStateStoreMap` 
> has already been removed. If this snapshot completes successfully, we write 
> corrupted data and the following batches consume it.
> Case 2:
>  # In executor_1, the maintenance thread is about to snapshot state_store_1: 
> it retrieves the `HDFSBackedStateStoreMap` object from loadedMaps and then 
> [releases the lock of the 
> loadedMaps|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L750-L751].
>  # state_store_1 is loaded on another executor, e.g. executor_2.
>  # Another state store, state_store_2, is loaded on executor_1 and calls 
> [reportActiveStoreInstance|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L854-L871] 
> to the driver.
>  # executor_1 
> [unloads|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L713] 
> the state stores that are no longer active, which clears the data entries in 
> the `HDFSBackedStateStoreMap`.
>  # The snapshotting thread terminates and uploads an incomplete snapshot to 
> cloud storage because the [iterator has no next 
> element|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L634] 
> after the clear.
>  # Future batches consume the corrupted data (a toy sketch of this 
> interleaving follows below).
>  
> Proposed fix:
>  * When we close the HDFS state store, we should only remove the entry from 
> `loadedMaps` rather than actively clearing the data; the JVM GC can then 
> reclaim those objects.
>  * We should wait for the maintenance thread to stop before unloading the 
> providers.
>  
> Thanks [~anishshri-db] for helping debug this issue!
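
As a rough illustration of the Case 2 interleaving above, the toy Scala sketch 
below (not Spark code; all names are made up) shows how a snapshot thread that 
captured a map reference before the lock was released can end up iterating an 
already-cleared map and writing an empty snapshot:

```scala
import java.util.concurrent.ConcurrentHashMap

// Toy reproduction of the Case 2 interleaving; illustrative only.
object SnapshotUnloadRaceSketch {
  def main(args: Array[String]): Unit = {
    val versionMap = new ConcurrentHashMap[String, Array[Byte]]()
    versionMap.put("key-1", Array[Byte](1, 2, 3))

    // "Maintenance thread": grabs the map reference, then pauses before writing,
    // which is the window in which the unload path can clear the map contents.
    val snapshotThread = new Thread(() => {
      val mapRef = versionMap
      Thread.sleep(100)
      // 0 entries after the unload cleared the map: the "snapshot" is incomplete.
      println(s"snapshot wrote ${mapRef.size()} entries")
    })
    snapshotThread.start()

    // "Unload path": clears the data entries in place while the snapshot is pending.
    versionMap.clear()
    snapshotThread.join()
  }
}
```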



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48108) Skip `tpcds-1g` and `docker-integration-tests` tests from `RocksDB UI-Backend` job

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48108.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46355
[https://github.com/apache/spark/pull/46355]

> Skip `tpcds-1g` and `docker-integration-tests` tests from `RocksDB 
> UI-Backend` job
> --
>
> Key: SPARK-48108
> URL: https://issues.apache.org/jira/browse/SPARK-48108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48109) Enable `k8s-integration-tests` only for `kubernetes` module change

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48109:
-

Assignee: Dongjoon Hyun

> Enable `k8s-integration-tests` only for `kubernetes` module change
> --
>
> Key: SPARK-48109
> URL: https://issues.apache.org/jira/browse/SPARK-48109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> Although there is a chance of missing the related core module change, daily 
> CI test coverage will reveal that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48109) Enable `k8s-integration-tests` only for `kubernetes` module change

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48109:
---
Labels: pull-request-available  (was: )

> Enable `k8s-integration-tests` only for `kubernetes` module change
> --
>
> Key: SPARK-48109
> URL: https://issues.apache.org/jira/browse/SPARK-48109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> Although there is a chance of missing the related core module change, daily 
> CI test coverage will reveal that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48109) Enable `k8s-integration-tests` only for `kubernetes` module change

2024-05-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48109:
-

 Summary: Enable `k8s-integration-tests` only for `kubernetes` 
module change
 Key: SPARK-48109
 URL: https://issues.apache.org/jira/browse/SPARK-48109
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun


Although there is a chance of missing the related core module change, daily CI 
test coverage will reveal that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Description: 
h2. ASF INFRA POLICY
- https://infra.apache.org/github-actions-policy.html

h2. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

 !Screenshot 2024-05-02 at 20.59.18.png|width=100%! 

h2. TARGET
* All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
* All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
* The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
* The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, 
or 3,600 hours).

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.



  was:
h2. ASF INFRA POLICY
- https://infra.apache.org/github-actions-policy.html

h2. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

 !Screenshot 2024-05-02 at 20.59.18.png! 

h2. TARGET
* All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
* All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
* The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
* The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, 
or 3,600 hours).

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.




> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 20.59.18.png
>
>
> h2. ASF INFRA POLICY
> - https://infra.apache.org/github-actions-policy.html
> h2. MONITORING
> [https://infra-reports.apache.org/#ghactions=spark=168]
>  !Screenshot 2024-05-02 at 20.59.18.png|width=100%! 
> h2. TARGET
> * All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> * The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> h2. DEADLINE
> bq. 17th of May, 2024
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Description: 
h2. ASF INFRA POLICY
- https://infra.apache.org/github-actions-policy.html

h2. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

 !Screenshot 2024-05-02 at 20.59.18.png! 

h2. TARGET
* All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
* All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
* The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
* The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, 
or 3,600 hours).

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.



  was:
h2. ASF INFRA POLICY
- https://infra.apache.org/github-actions-policy.html

h2. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h2. TARGET
* All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
* All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
* The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
* The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, 
or 3,600 hours).

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.




> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 20.59.18.png
>
>
> h2. ASF INFRA POLICY
> - https://infra.apache.org/github-actions-policy.html
> h2. MONITORING
> [https://infra-reports.apache.org/#ghactions=spark=168]
>  !Screenshot 2024-05-02 at 20.59.18.png! 
> h2. TARGET
> * All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> * The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> h2. DEADLINE
> bq. 17th of May, 2024
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Attachment: (was: Screenshot 2024-05-02 at 13.18.42.png)

> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 20.59.18.png
>
>
> h2. ASF INFRA POLICY
> - https://infra.apache.org/github-actions-policy.html
> h2. MONITORING
> [https://infra-reports.apache.org/#ghactions=spark=168]
> h2. TARGET
> * All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> * The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> h2. DEADLINE
> bq. 17th of May, 2024
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Description: 
h2. ASF INFRA POLICY
- https://infra.apache.org/github-actions-policy.html

h2. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h2. TARGET
* All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
* All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
* The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
* The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, 
or 3,600 hours).

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.



  was:
h2. ASF INFRA POLICY
- https://infra.apache.org/github-actions-policy.html

h2. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h2. TARGET
* All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
* All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
* The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
* The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, 
or 3,600 hours).

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100%!


> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 20.59.18.png
>
>
> h2. ASF INFRA POLICY
> - https://infra.apache.org/github-actions-policy.html
> h2. MONITORING
> [https://infra-reports.apache.org/#ghactions=spark=168]
> h2. TARGET
> * All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> * The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> h2. DEADLINE
> bq. 17th of May, 2024
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Attachment: Screenshot 2024-05-02 at 20.59.18.png

> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 20.59.18.png
>
>
> h2. ASF INFRA POLICY
> - https://infra.apache.org/github-actions-policy.html
> h2. MONITORING
> [https://infra-reports.apache.org/#ghactions=spark=168]
> h2. TARGET
> * All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> * The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> h2. DEADLINE
> bq. 17th of May, 2024
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48107) Exclude tests from Python distribution

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48107:
---
Labels: pull-request-available  (was: )

> Exclude tests from Python distribution
> --
>
> Key: SPARK-48107
> URL: https://issues.apache.org/jira/browse/SPARK-48107
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48107) Exclude tests from Python distribution

2024-05-02 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-48107:


 Summary: Exclude tests from Python distribution
 Key: SPARK-48107
 URL: https://issues.apache.org/jira/browse/SPARK-48107
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48106) Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48106:
-

Assignee: Dongjoon Hyun

> Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`
> 
>
> Key: SPARK-48106
> URL: https://issues.apache.org/jira/browse/SPARK-48106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> - https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights
> bq. Python 3.11 is between 10-60% faster than Python 3.10. On average, we 
> measured a 1.25x speedup on the standard benchmark suite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48106) Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48106:
--
Description: 
- https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights

bq. Python 3.11 is between 10-60% faster than Python 3.10. On average, we 
measured a 1.25x speedup on the standard benchmark suite.

  was:
- https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights

> Python 3.11 is between 10-60% faster than Python 3.10. On average, we 
> measured a 1.25x speedup on the standard benchmark suite.


> Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`
> 
>
> Key: SPARK-48106
> URL: https://issues.apache.org/jira/browse/SPARK-48106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> - https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights
> bq. Python 3.11 is between 10-60% faster than Python 3.10. On average, we 
> measured a 1.25x speedup on the standard benchmark suite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48106) Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48106:
--
Description: 
- https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights

> Python 3.11 is between 10-60% faster than Python 3.10. On average, we 
> measured a 1.25x speedup on the standard benchmark suite.

> Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`
> 
>
> Key: SPARK-48106
> URL: https://issues.apache.org/jira/browse/SPARK-48106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> - https://docs.python.org/3/whatsnew/3.11.html#summary-release-highlights
> > Python 3.11 is between 10-60% faster than Python 3.10. On average, we 
> > measured a 1.25x speedup on the standard benchmark suite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48106) Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48106:
---
Labels: pull-request-available  (was: )

> Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`
> 
>
> Key: SPARK-48106
> URL: https://issues.apache.org/jira/browse/SPARK-48106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48106) Use `Python 3.11` in `pyspark` tests of `build_and_test.yml`

2024-05-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48106:
-

 Summary: Use `Python 3.11` in `pyspark` tests of 
`build_and_test.yml`
 Key: SPARK-48106
 URL: https://issues.apache.org/jira/browse/SPARK-48106
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48104) Run `publish_snapshot.yml` once per day

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48104:
-

Assignee: Dongjoon Hyun

> Run `publish_snapshot.yml` once per day
> ---
>
> Key: SPARK-48104
> URL: https://issues.apache.org/jira/browse/SPARK-48104
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48104) Run `publish_snapshot.yml` once per day

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48104.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46352
[https://github.com/apache/spark/pull/46352]

> Run `publish_snapshot.yml` once per day
> ---
>
> Key: SPARK-48104
> URL: https://issues.apache.org/jira/browse/SPARK-48104
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48105) Fix the data corruption issue when state store unload and snapshotting happens concurrently for HDFS state store

2024-05-02 Thread Huanli Wang (Jira)
Huanli Wang created SPARK-48105:
---

 Summary:  Fix the data corruption issue when state store unload 
and snapshotting happens concurrently for HDFS state store
 Key: SPARK-48105
 URL: https://issues.apache.org/jira/browse/SPARK-48105
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Huanli Wang


There are two race conditions between state store snapshotting and state store 
unloading which could result in query failure and potential data corruption.

 

Case 1:
 # The maintenance thread pool encounters some issue and calls 
[stopMaintenanceTask|https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L774], 
which in turn calls 
[threadPool.stop|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L587]. 
However, it doesn't wait for the stop operation to complete before moving on to 
the state store [unload and 
clear|https://github.com/apache/spark/blob/d9d79a54a3cd487380039c88ebe9fa708e0dcf23/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L775-L778].
 # The provider unload [closes the state 
store|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L719-L721], 
which [clears the values of 
loadedMaps|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L353-L355] 
for the HDFS-backed state store.
 # The not-yet-stopped maintenance thread may still be running and trying to take 
the snapshot, but the data in the underlying `HDFSBackedStateStoreMap` has 
already been removed. If this snapshot completes successfully, we write corrupted 
data and the following batches consume it.

Case 2:
 # In executor_1, the maintenance thread is about to snapshot state_store_1: it 
retrieves the `HDFSBackedStateStoreMap` object from loadedMaps and then 
[releases the lock of the 
loadedMaps|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L750-L751].
 # state_store_1 is loaded on another executor, e.g. executor_2.
 # Another state store, state_store_2, is loaded on executor_1 and calls 
[reportActiveStoreInstance|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L854-L871] 
to the driver.
 # executor_1 
[unloads|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L713] 
the state stores that are no longer active, which clears the data entries in the 
`HDFSBackedStateStoreMap`.
 # The snapshotting thread terminates and uploads an incomplete snapshot to cloud 
storage because the [iterator has no next 
element|https://github.com/apache/spark/blob/c6696cdcd611a682ebf5b7a183e2970ecea3b58c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L634] 
after the clear.
 # Future batches consume the corrupted data.

 

Proposed fix:
 * When we close the HDFS state store, we should only remove the entry from 
`loadedMaps` rather than actively clearing the data; the JVM GC can then reclaim 
those objects.
 * We should wait for the maintenance thread to stop before unloading the 
providers (see the sketch below).

 

Thanks [~anishshri-db] for helping debug this issue!
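
For the second bullet of the proposed fix, a minimal sketch of the intended 
ordering is below. The names are hypothetical and the real 
stopMaintenanceTask/unload code paths in StateStore.scala differ; the point is 
simply to shut the maintenance pool down and block until it has terminated 
before providers are unloaded and cleared:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Illustrative-only ordering for the proposed fix; names are hypothetical.
object MaintenanceShutdownSketch {
  private val maintenancePool = Executors.newFixedThreadPool(2)

  def stopMaintenanceThenUnload(unloadProviders: () => Unit): Unit = {
    maintenancePool.shutdown()
    // Wait for in-flight snapshot tasks to finish (or time out) so that no
    // snapshot can observe provider maps that the unload below is about to drop.
    if (!maintenancePool.awaitTermination(30, TimeUnit.SECONDS)) {
      maintenancePool.shutdownNow()
    }
    unloadProviders()
  }
}
```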



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48104) Run `publish_snapshot.yml` once per day

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48104:
---
Labels: pull-request-available  (was: )

> Run `publish_snapshot.yml` once per day
> ---
>
> Key: SPARK-48104
> URL: https://issues.apache.org/jira/browse/SPARK-48104
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48104) Run `publish_snapshot.yml` once per day

2024-05-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48104:
-

 Summary: Run `publish_snapshot.yml` once per day
 Key: SPARK-48104
 URL: https://issues.apache.org/jira/browse/SPARK-48104
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47671) Enable structured logging in log4j2.properties.template and update `configuration.md`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47671.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46349
[https://github.com/apache/spark/pull/46349]

> Enable structured logging in log4j2.properties.template and update 
> `configuration.md`
> -
>
> Key: SPARK-47671
> URL: https://issues.apache.org/jira/browse/SPARK-47671
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> # rename the current log4j2.properties.template as 
> log4j2.properties.pattern-layout-template
>  # Enable structured logging in log4j2.properties.template
>  # Update `configuration.md` on how to configure logging



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48102) Track time to acquire source progress metrics for streaming triggers

2024-05-02 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-48102.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46350
[https://github.com/apache/spark/pull/46350]

> Track time to acquire source progress metrics for streaming triggers
> 
>
> Key: SPARK-48102
> URL: https://issues.apache.org/jira/browse/SPARK-48102
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Track time to acquire source progress metrics for streaming triggers



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48102) Track time to acquire source progress metrics for streaming triggers

2024-05-02 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-48102:


Assignee: Anish Shrigondekar

> Track time to acquire source progress metrics for streaming triggers
> 
>
> Key: SPARK-48102
> URL: https://issues.apache.org/jira/browse/SPARK-48102
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
>
> Track time to acquire source progress metrics for streaming triggers



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48099.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46347
[https://github.com/apache/spark/pull/46347]

> Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
> 
>
> Key: SPARK-48099
> URL: https://issues.apache.org/jira/browse/SPARK-48099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: Screenshot 2024-05-02 at 14.59.14.png
>
>
> `Java 21 on MacOS 14` is the fastest Maven test and covers both the Java 17 
> and Apple Silicon use cases.
>  !Screenshot 2024-05-02 at 14.59.14.png|width=100%! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47920) Add documentation for python streaming data source

2024-05-02 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-47920:
-
Affects Version/s: 4.0.0
   (was: 3.5.1)

> Add documentation for python streaming data source
> --
>
> Key: SPARK-47920
> URL: https://issues.apache.org/jira/browse/SPARK-47920
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> Add documentation (user guide) for Python data source API.
> The DOC should explain how to develop and use DataSourceStreamReader and 
> DataSourceStreamWriter



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48103) Promote ` KubernetesDriverBuilder` to `DeveloperApi`

2024-05-02 Thread Zhou JIANG (Jira)
Zhou JIANG created SPARK-48103:
--

 Summary: Promote ` KubernetesDriverBuilder` to `DeveloperApi`
 Key: SPARK-48103
 URL: https://issues.apache.org/jira/browse/SPARK-48103
 Project: Spark
  Issue Type: Sub-task
  Components: k8s
Affects Versions: kubernetes-operator-0.1.0
Reporter: Zhou JIANG






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48102) Track time to acquire source progress metrics for streaming triggers

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48102:
---
Labels: pull-request-available  (was: )

> Track time to acquire source progress metrics for streaming triggers
> 
>
> Key: SPARK-48102
> URL: https://issues.apache.org/jira/browse/SPARK-48102
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
>
> Track time to acquire source progress metrics for streaming triggers



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48102) Track time to acquire source progress metrics for streaming triggers

2024-05-02 Thread Anish Shrigondekar (Jira)
Anish Shrigondekar created SPARK-48102:
--

 Summary: Track time to acquire source progress metrics for 
streaming triggers
 Key: SPARK-48102
 URL: https://issues.apache.org/jira/browse/SPARK-48102
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Anish Shrigondekar


Track time to acquire source progress metrics for streaming triggers



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48065) SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict

2024-05-02 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-48065.
--
Resolution: Fixed

Issue resolved by pull request 46325
[https://github.com/apache/spark/pull/46325]

> SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict
> -
>
> Key: SPARK-48065
> URL: https://issues.apache.org/jira/browse/SPARK-48065
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.3
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> If spark.sql.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled is true, 
> then SPJ no longer triggers if there are more join keys than partition keys. 
> It is triggered only if the number of join keys is equal to, or less than, the 
> number of partition keys.
>  
> We can relax this constraint, as this case was supported when the flag was not 
> enabled.
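
A hedged illustration of the relaxed behavior described above (the table names 
and partitioning layout are made up, and the conf key is copied verbatim from 
the description, so the fully-qualified key in Spark may differ):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: assumes both tables come from a V2 source partitioned by `id`.
object SpjSubsetKeysSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spj-subset-keys-sketch").getOrCreate()
    // Conf key as quoted in the ticket description above.
    spark.conf.set("spark.sql.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled", "true")

    // Join keys (id, name) are a superset of the partition key (id). Before this fix,
    // enabling the flag above made such a join ineligible for storage-partitioned join,
    // even though it was eligible with the flag off; the fix relaxes that constraint.
    val joined = spark.table("cat.db.orders")
      .join(spark.table("cat.db.customers"), Seq("id", "name"))
    joined.explain()
  }
}
```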



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48065) SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict

2024-05-02 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-48065:


Assignee: Szehon Ho

> SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict
> -
>
> Key: SPARK-48065
> URL: https://issues.apache.org/jira/browse/SPARK-48065
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.3
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> If spark.sql.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled is true, 
> then SPJ no longer triggers if there are more join keys than partition keys. 
> It is triggered only if the number of join keys is equal to, or less than, the 
> number of partition keys.
>  
> We can relax this constraint, as this case was supported when the flag was not 
> enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47671) Enable structured logging in log4j2.properties.template and update `configuration.md`

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47671:
---
Labels: pull-request-available  (was: )

> Enable structured logging in log4j2.properties.template and update 
> `configuration.md`
> -
>
> Key: SPARK-47671
> URL: https://issues.apache.org/jira/browse/SPARK-47671
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
>
> # rename the current log4j2.properties.template as 
> log4j2.properties.pattern-layout-template
>  # Enable structured logging in log4j2.properties.template
>  # Update `configuration.md` on how to configure logging



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47671) Enable structured logging in log4j2.properties.template and update `configuration.md`

2024-05-02 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-47671:
---
Description: 
# rename the current log4j2.properties.template as 
log4j2.properties.pattern-layout-template
 # Enable structured logging in log4j2.properties.template
 # Update `configuration.md` on how to configure logging

  was:
# rename the current log4j2.properties.template as 
log4j2-pattern-layout.properties.template 
 # Enable structured logging in log4j2.properties.template
 # Update `configuration.md` on how to configure logging


> Enable structured logging in log4j2.properties.template and update 
> `configuration.md`
> -
>
> Key: SPARK-47671
> URL: https://issues.apache.org/jira/browse/SPARK-47671
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> # rename the current log4j2.properties.template as 
> log4j2.properties.pattern-layout-template
>  # Enable structured logging in log4j2.properties.template
>  # Update `configuration.md` on how to configure logging



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48101) When using INSERT OVERWRITE with Spark CTEs they may not be fully resolved

2024-05-02 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-48101:
-
Priority: Minor  (was: Major)

> When using INSERT OVERWRITE with Spark CTEs they may not be fully resolved
> --
>
> Key: SPARK-48101
> URL: https://issues.apache.org/jira/browse/SPARK-48101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.1
>Reporter: Holden Karau
>Priority: Minor
>
> Repro:
> ```sql
> DROP TABLE IF EXISTS local.cte1;
> DROP TABLE IF EXISTS local.cte2;
> DROP TABLE IF EXISTS local.cte3;
> CREATE TABLE local.cte1 (id INT, fname STRING);
> CREATE TABLE local.cte2 (id2 INT);
> CREATE TABLE local.cte3 (id INT);
> WITH test_fake AS (SELECT * FROM local.cte1 WHERE id = 1 AND id2 = 1), 
> test_fake2 AS (SELECT * FROM local.cte2 WHERE id2 = 1) INSERT OVERWRITE TABLE 
> local.cte3 SELECT id2 as id FROM test_fake2;
> WITH test_fake AS (SELECT * FROM local.cte1 WHERE id = 1 AND id2 = 1), 
> test_fake2 AS (SELECT * FROM local.cte2 WHERE id2 = 1) SELECT id2 as id FROM 
> test_fake2;
> ```
>  
> Here we would expect both of the last two SQL expressions to fail, but 
> instead only the first one does.
>  
> There are more complicated cases, and in those cases, the invalid CTE is 
> treated as a null table, but this is the simplest repro I've been able to 
> come up with so far.
>  
> This occurs using both the local w/Iceberg catalog and the SparkSession catalog.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48099:
-

Assignee: Dongjoon Hyun

> Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
> 
>
> Key: SPARK-48099
> URL: https://issues.apache.org/jira/browse/SPARK-48099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screenshot 2024-05-02 at 14.59.14.png
>
>
> `Java 21 on MacOS 14` is the fastest Maven test and covers both the Java 17 and 
> Apple Silicon use cases.
>  !Screenshot 2024-05-02 at 14.59.14.png|width=100%! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48101) When using INSERT OVERWRITE with Spark CTEs they may not be fully resolved

2024-05-02 Thread Holden Karau (Jira)
Holden Karau created SPARK-48101:


 Summary: When using INSERT OVERWRITE with Spark CTEs they may not 
be fully resolved
 Key: SPARK-48101
 URL: https://issues.apache.org/jira/browse/SPARK-48101
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1, 3.4.0, 3.3.0
Reporter: Holden Karau


Repro:

```sql

DROP TABLE IF EXISTS local.cte1;
DROP TABLE IF EXISTS local.cte2;
DROP TABLE IF EXISTS local.cte3;
CREATE TABLE local.cte1 (id INT, fname STRING);
CREATE TABLE local.cte2 (id2 INT);
CREATE TABLE local.cte3 (id INT);
WITH test_fake AS (SELECT * FROM local.cte1 WHERE id = 1 AND id2 = 1), 
test_fake2 AS (SELECT * FROM local.cte2 WHERE id2 = 1) INSERT OVERWRITE TABLE 
local.cte3 SELECT id2 as id FROM test_fake2;
WITH test_fake AS (SELECT * FROM local.cte1 WHERE id = 1 AND id2 = 1), 
test_fake2 AS (SELECT * FROM local.cte2 WHERE id2 = 1) SELECT id2 as id FROM 
test_fake2;

```

 

Here we would expect both of the last two SQL expressions to fail, but instead 
only the first one does.

 

There are more complicated cases, and in those cases, the invalid CTE is 
treated as a null table, but this is the simplest repro I've been able to come 
up with so far.

 

This occurs using both the local w/Iceberg catalog and the SparkSession catalog.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48100) [SQL][XML] Fix issues in skipping nested structure fields not selected in schema

2024-05-02 Thread Shujing Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shujing Yang updated SPARK-48100:
-
Description: 
Previously, the XML parser couldn't skip nested structure data fields effectively 
when they were not selected in the schema. For instance, in the example below, 
`df.select("struct2").collect()` returns `Seq(null)` because `struct1` wasn't 
effectively skipped. This PR fixes this issue.
{code:java}

  
    1
  
  
    2
  
{code}
 

  was:
Previously, the XML parser couldn't skip nested structure data fields when they 
were not selected in the schema. For instance, in the example below, 
`df.select("struct2").collect()` returns `Seq(null)` because `struct1` wasn't 
effectively skipped. This PR fixes this issue.
{code:java}

  
    1
  
  
    2
  
{code}
 


> [SQL][XML] Fix issues in skipping nested structure fields not selected in 
> schema
> 
>
> Key: SPARK-48100
> URL: https://issues.apache.org/jira/browse/SPARK-48100
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Shujing Yang
>Priority: Major
>  Labels: pull-request-available
>
> Previously, the XML parser couldn't skip nested structure data fields 
> effectively when they were not selected in the schema. For instance, in the 
> example below, `df.select("struct2").collect()` returns `Seq(null)` because 
> `struct1` wasn't effectively skipped. This PR fixes this issue.
> {code:java}
> 
>   
>     1
>   
>   
>     2
>   
> {code}
>  
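
The XML sample above lost its element tags in this archive, so the exact row and field 
names are not recoverable here. As a rough sketch only (not taken from the ticket), the 
snippet below assumes a row tag of `record` and one inner field per struct; `struct1` and 
`struct2` come from the description, and it simply reproduces the access pattern the 
report talks about: selecting only `struct2` while the parser has to skip `struct1`.

{code:scala}
// Sketch only: "record" and "field" are assumed names; struct1/struct2 are from the ticket.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructType}

object XmlSkipSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("xml-skip").getOrCreate()

    // A single row whose struct1 is NOT part of the read schema below.
    val xml =
      """<record>
        |  <struct1><field>1</field></struct1>
        |  <struct2><field>2</field></struct2>
        |</record>""".stripMargin
    val dir = java.nio.file.Files.createTempDirectory("xml-skip")
    java.nio.file.Files.write(dir.resolve("data.xml"), xml.getBytes("UTF-8"))

    // Only struct2 is selected in the schema, so the parser must skip struct1.
    val schema = new StructType()
      .add("struct2", new StructType().add("field", LongType))

    val df = spark.read
      .format("xml")                  // built-in XML source in Spark 4.0
      .option("rowTag", "record")
      .schema(schema)
      .load(dir.toString)

    // Per the report this used to yield Seq(null); after the fix it yields the struct2 row.
    df.select("struct2").collect().foreach(println)
    spark.stop()
  }
}
{code}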



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48100) [SQL][XML] Fix issues in skipping nested structure fields not selected in schema

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48100:
---
Labels: pull-request-available  (was: )

> [SQL][XML] Fix issues in skipping nested structure fields not selected in 
> schema
> 
>
> Key: SPARK-48100
> URL: https://issues.apache.org/jira/browse/SPARK-48100
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Shujing Yang
>Priority: Major
>  Labels: pull-request-available
>
> Previously, the XML parser couldn't skip nested structure data fields when they 
> were not selected in the schema. For instance, in the example below, 
> `df.select("struct2").collect()` returns `Seq(null)` because `struct1` wasn't 
> effectively skipped. This PR fixes this issue.
> {code:java}
> 
>   
>     1
>   
>   
>     2
>   
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48100) [SQL][XML] Fix issues in skipping nested structure fields not selected in schema

2024-05-02 Thread Shujing Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shujing Yang updated SPARK-48100:
-
Summary: [SQL][XML] Fix issues in skipping nested structure fields not 
selected in schema  (was: [SQL][XML] Fix projection issue when there's a nested 
struct)

> [SQL][XML] Fix issues in skipping nested structure fields not selected in 
> schema
> 
>
> Key: SPARK-48100
> URL: https://issues.apache.org/jira/browse/SPARK-48100
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Shujing Yang
>Priority: Major
>
> Previously, the XML parser couldn't skip nested structure data fields when they 
> were not selected in the schema. For instance, in the example below, 
> `df.select("struct2").collect()` returns `Seq(null)` because `struct1` wasn't 
> effectively skipped. This PR fixes this issue.
> {code:java}
> 
>   
>     1
>   
>   
>     2
>   
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48100) [SQL][XML] Fix projection issue when there's a nested struct

2024-05-02 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-48100:


 Summary: [SQL][XML] Fix projection issue when there's a nested 
struct
 Key: SPARK-48100
 URL: https://issues.apache.org/jira/browse/SPARK-48100
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Shujing Yang


Previously, the XML parser couldn't skip nested structure data fields when they 
were not selected in the schema. For instance, in the example below, 
`df.select("struct2").collect()` returns `Seq(null)` because `struct1` wasn't 
effectively skipped. This PR fixes this issue.
{code:java}

  
    1
  
  
    2
  
{code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48097) Limit GHA job execution time to up to 3 hours in `build_and_test.yml`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48097:
-

Assignee: Dongjoon Hyun

> Limit GHA job execution time to up to 3 hours in `build_and_test.yml`
> -
>
> Key: SPARK-48097
> URL: https://issues.apache.org/jira/browse/SPARK-48097
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48097) Limit GHA job execution time to up to 3 hours in `build_and_test.yml`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48097.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46344
[https://github.com/apache/spark/pull/46344]

> Limit GHA job execution time to up to 3 hours in `build_and_test.yml`
> -
>
> Key: SPARK-48097
> URL: https://issues.apache.org/jira/browse/SPARK-48097
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48099:
---
Labels: pull-request-available  (was: )

> Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
> 
>
> Key: SPARK-48099
> URL: https://issues.apache.org/jira/browse/SPARK-48099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screenshot 2024-05-02 at 14.59.14.png
>
>
> `Java 21 on MacOS 14` is the fastest Maven test and covers both the Java 17 and 
> Apple Silicon use cases.
>  !Screenshot 2024-05-02 at 14.59.14.png|width=100%! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48099:
--
Description: 
`Java 21 on MacOS 14` is the fastest Maven test and covers both the Java 17 and 
Apple Silicon use cases.

 !Screenshot 2024-05-02 at 14.59.14.png|width=100%! 

  was: !Screenshot 2024-05-02 at 14.59.14.png|width=100%! 


> Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
> 
>
> Key: SPARK-48099
> URL: https://issues.apache.org/jira/browse/SPARK-48099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Screenshot 2024-05-02 at 14.59.14.png
>
>
> `Java 21 on MacOS 14` is the fastest Maven test and covers both the Java 17 and 
> Apple Silicon use cases.
>  !Screenshot 2024-05-02 at 14.59.14.png|width=100%! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48099:
--
Description:  !Screenshot 2024-05-02 at 14.59.14.png!width=100%! 

> Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
> 
>
> Key: SPARK-48099
> URL: https://issues.apache.org/jira/browse/SPARK-48099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Screenshot 2024-05-02 at 14.59.14.png
>
>
>  !Screenshot 2024-05-02 at 14.59.14.png!width=100%! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48099:
--
Description:  !Screenshot 2024-05-02 at 14.59.14.png!   (was:  !Screenshot 
2024-05-02 at 14.59.14.png!width=100%! )

> Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
> 
>
> Key: SPARK-48099
> URL: https://issues.apache.org/jira/browse/SPARK-48099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Screenshot 2024-05-02 at 14.59.14.png
>
>
>  !Screenshot 2024-05-02 at 14.59.14.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48099:
--
Description:  !Screenshot 2024-05-02 at 14.59.14.png|width=100%!   (was:  
!Screenshot 2024-05-02 at 14.59.14.png! )

> Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
> 
>
> Key: SPARK-48099
> URL: https://issues.apache.org/jira/browse/SPARK-48099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Screenshot 2024-05-02 at 14.59.14.png
>
>
>  !Screenshot 2024-05-02 at 14.59.14.png|width=100%! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`

2024-05-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48099:
-

 Summary: Run `maven-build` test only on `Java 21 on MacOS 14 
(Apple Silicon)`
 Key: SPARK-48099
 URL: https://issues.apache.org/jira/browse/SPARK-48099
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun
 Attachments: Screenshot 2024-05-02 at 14.59.14.png





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48099) Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48099:
--
Attachment: Screenshot 2024-05-02 at 14.59.14.png

> Run `maven-build` test only on `Java 21 on MacOS 14 (Apple Silicon)`
> 
>
> Key: SPARK-48099
> URL: https://issues.apache.org/jira/browse/SPARK-48099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Screenshot 2024-05-02 at 14.59.14.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Issue Type: Umbrella  (was: Task)

> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Umbrella
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 13.18.42.png
>
>
> h2. ASF INFRA POLICY
> - https://infra.apache.org/github-actions-policy.html
> h2. MONITORING
> [https://infra-reports.apache.org/#ghactions=spark=168]
> h2. TARGET
> * All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> * The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> h2. DEADLINE
> bq. 17th of May, 2024
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.
> !Screenshot 2024-05-02 at 13.18.42.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48098) Enable `NOLINT_ON_COMPILE` for all except `linter` job

2024-05-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48098:
-

 Summary: Enable `NOLINT_ON_COMPILE` for all except `linter` job
 Key: SPARK-48098
 URL: https://issues.apache.org/jira/browse/SPARK-48098
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Description: 
h2. ASF INFRA POLICY
- https://infra.apache.org/github-actions-policy.html

h2. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h2. TARGET
* All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
* All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
* The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours).
* The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, 
or 3,600 hours).

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100%!

  was:
h2. ASF INFRA POLICY
- https://infra.apache.org/github-actions-policy.html

h2. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h2. TARGET
bq. 4,250 hours of build time. This policy went into effect on April 20th[2].

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100%!


> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 13.18.42.png
>
>
> h2. ASF INFRA POLICY
> - https://infra.apache.org/github-actions-policy.html
> h2. MONITORING
> [https://infra-reports.apache.org/#ghactions=spark=168]
> h2. TARGET
> * All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> * The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> h2. DEADLINE
> bq. 17th of May, 2024
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.
> !Screenshot 2024-05-02 at 13.18.42.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48097) Limit GHA job execution time to up to 3 hours in `build_and_test.yml`

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48097:
---
Labels: pull-request-available  (was: )

> Limit GHA job execution time to up to 3 hours in `build_and_test.yml`
> -
>
> Key: SPARK-48097
> URL: https://issues.apache.org/jira/browse/SPARK-48097
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48097) Limit GHA job execution time to up to 3 hours in `build_and_test.yml`

2024-05-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48097:
-

 Summary: Limit GHA job execution time to up to 3 hours in 
`build_and_test.yml`
 Key: SPARK-48097
 URL: https://issues.apache.org/jira/browse/SPARK-48097
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48095) Run `build_non_ansi.yml` once per day

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48095:
-

Assignee: Dongjoon Hyun

> Run `build_non_ansi.yml` once per day
> -
>
> Key: SPARK-48095
> URL: https://issues.apache.org/jira/browse/SPARK-48095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48095) Run `build_non_ansi.yml` once per day

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48095.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46342
[https://github.com/apache/spark/pull/46342]

> Run `build_non_ansi.yml` once per day
> -
>
> Key: SPARK-48095
> URL: https://issues.apache.org/jira/browse/SPARK-48095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Description: 
h2. ASF INFRA POLICY
- https://infra.apache.org/github-actions-policy.html

h2. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h2. TARGET
bq. 4,250 hours of build time. This policy went into effect on April 20th[2].

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100%!

  was:
h2. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h2. TARGET
bq. 4,250 hours of build time. This policy went into effect on April 20th[2].

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100%!


> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 13.18.42.png
>
>
> h2. ASF INFRA POLICY
> - https://infra.apache.org/github-actions-policy.html
> h2. MONITORING
> [https://infra-reports.apache.org/#ghactions=spark=168]
> h2. TARGET
> bq. 4,250 hours of build time. This policy went into effect on April 20th[2].
> h2. DEADLINE
> bq. 17th of May, 2024
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.
> !Screenshot 2024-05-02 at 13.18.42.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Description: 
h2. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h2. TARGET
bq. 4,250 hours of build time. This policy went into effect on April 20th[2].

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100%!

  was:
h1. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h1. TARGET
bq. 4,250 hours of build time. This policy went into effect on April 20th[2].

h2. DEADLINE
bq. 17th of May, 2024

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100%!


> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 13.18.42.png
>
>
> h2. MONITORING
> [https://infra-reports.apache.org/#ghactions=spark=168]
> h2. TARGET
> bq. 4,250 hours of build time. This policy went into effect on April 20th[2].
> h2. DEADLINE
> bq. 17th of May, 2024
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.
> !Screenshot 2024-05-02 at 13.18.42.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Description: 
h1. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h1. TARGET
bq. 4,250 hours of build time. This policy went into effect on April 20th[2].

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100%!

  was:
h1. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h1. TARGET
bq. 4,250 hours of build time. This policy went into effect on April 20th[2].

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100!


> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 13.18.42.png
>
>
> h1. MONITORING
> [https://infra-reports.apache.org/#ghactions=spark=168]
> h1. TARGET
> bq. 4,250 hours of build time. This policy went into effect on April 20th[2].
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.
> !Screenshot 2024-05-02 at 13.18.42.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Description: 
*{*}MONITORING{*}*
[https://infra-reports.apache.org/#ghactions=spark=168]

*{*}TARGET{*}*
bq. 4,250 hours of build time. This policy went into effect on April 20th[2].

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100!

  was:
**MONITORING**
https://infra-reports.apache.org/#ghactions=spark=168

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

 !Screenshot 2024-05-02 at 13.18.42.png|width=100%! 


> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 13.18.42.png
>
>
> *{*}MONITORING{*}*
> [https://infra-reports.apache.org/#ghactions=spark=168]
> *{*}TARGET{*}*
> bq. 4,250 hours of build time. This policy went into effect on April 20th[2].
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.
> !Screenshot 2024-05-02 at 13.18.42.png|width=100!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Description: 
h1. MONITORING
[https://infra-reports.apache.org/#ghactions=spark=168]

h1. TARGET
bq. 4,250 hours of build time. This policy went into effect on April 20th[2].

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100!

  was:
*{*}MONITORING{*}*
[https://infra-reports.apache.org/#ghactions=spark=168]

*{*}TARGET{*}*
bq. 4,250 hours of build time. This policy went into effect on April 20th[2].

Since the deadline is 17th of May, 2024, I set this as the highest priority, 
`Blocker`.

!Screenshot 2024-05-02 at 13.18.42.png|width=100!


> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 13.18.42.png
>
>
> h1. MONITORING
> [https://infra-reports.apache.org/#ghactions=spark=168]
> h1. TARGET
> bq. 4,250 hours of build time. This policy went into effect on April 20th[2].
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.
> !Screenshot 2024-05-02 at 13.18.42.png|width=100!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48094) Reduce GitHub Action usage according to ASF project allowance

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48094:
--
Attachment: Screenshot 2024-05-02 at 13.18.42.png

> Reduce GitHub Action usage according to ASF project allowance
> -
>
> Key: SPARK-48094
> URL: https://issues.apache.org/jira/browse/SPARK-48094
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 13.18.42.png
>
>
> **MONITORING**
> https://infra-reports.apache.org/#ghactions=spark=168
> Since the deadline is 17th of May, 2024, I set this as the highest priority, 
> `Blocker`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48095) Run `build_non_ansi.yml` once per day

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48095:
---
Labels: pull-request-available  (was: )

> Run `build_non_ansi.yml` once per day
> -
>
> Key: SPARK-48095
> URL: https://issues.apache.org/jira/browse/SPARK-48095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48095) Run `build_non_ansi.yml` once per day

2024-05-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48095:
-

 Summary: Run `build_non_ansi.yml` once per day
 Key: SPARK-48095
 URL: https://issues.apache.org/jira/browse/SPARK-48095
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48093) Add config to switch between client side listener and server side listener

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48093:
---
Labels: pull-request-available  (was: )

> Add config to switch between client side listener and server side listener 
> ---
>
> Key: SPARK-48093
> URL: https://issues.apache.org/jira/browse/SPARK-48093
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SS
>Affects Versions: 3.5.0, 3.5.1, 3.5.2
>Reporter: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>
> We are moving the implementation of the Streaming Query Listener from the server 
> to the client. For clients already running the client side listener, to prevent 
> regression, we should add a config to let them decide which type of listener 
> they want to use.
>  
> This is only added to the published 3.5.x versions. For 4.0 and upwards we only 
> use the client side listener.
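
For reference (not taken from the ticket), the listener API itself looks the same no 
matter which side runs it; the proposed config only selects where the events are 
processed, and its exact key is defined in the linked PR rather than here. A minimal 
registration sketch, assuming `spark` is an existing SparkSession created through the 
Spark Connect client:

{code:scala}
// Sketch of a StreamingQueryListener registration over Spark Connect 3.5.x.
// Whether this listener runs on the client or on the server is what the
// proposed config would control; the config key itself is not named here.
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val listener = new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"progress: ${event.progress.batchId}")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"terminated: ${event.id}")
}

spark.streams.addListener(listener)   // `spark` is an existing SparkSession
{code}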



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48081) Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48081:
--
Fix Version/s: 3.5.2
   3.4.4

> Fix ClassCastException in NTile.checkInputDataTypes() when argument is 
> non-foldable or of wrong type
> 
>
> Key: SPARK-48081
> URL: https://issues.apache.org/jira/browse/SPARK-48081
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.4
>
>
> {code:java}
> sql("select ntile(99.9) OVER (order by id) from range(10)"){code}
> results in
> {code}
>  java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal 
> cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal 
> is in unnamed module of loader 'app'; java.lang.Integer is in module 
> java.base of loader 'bootstrap')
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)
>   at 
> org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:267)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:267)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1(Expression.scala:279)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1$adapted(Expression.scala:279)
>   at scala.collection.IterableOnceOps.forall(IterableOnce.scala:633)
>   at scala.collection.IterableOnceOps.forall$(IterableOnce.scala:630)
>   at scala.collection.AbstractIterable.forall(Iterable.scala:935)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:279)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$157.applyOrElse(Analyzer.scala:2243)
>  
> {code}
> instead of the intended user-facing error message. This is a minor bug that 
> was introduced in a previous error class refactoring PR.
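
For contrast, a quick sketch (not from the ticket) of the working form next to the 
failing one: `ntile` expects a foldable positive integer, so the decimal argument should 
surface a clear analysis error instead of the ClassCastException shown above.

{code:scala}
// Sketch: valid vs. invalid ntile() arguments, assuming an existing SparkSession `spark`.
spark.sql("SELECT ntile(4) OVER (ORDER BY id) FROM range(10)").show()    // fine: literal integer
spark.sql("SELECT ntile(99.9) OVER (ORDER BY id) FROM range(10)").show() // should fail with a user-facing error
{code}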



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48043) Kryo serialization issue with push-based shuffle

2024-05-02 Thread Romain Ardiet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romain Ardiet updated SPARK-48043:
--
Description: 
I'm running a Spark job on AWS EMR. I wanted to test the new push-based shuffle 
introduced in Spark 3.2, but it fails with a Kryo exception when I enable it.

The issue happens when the Executor starts, during the 
KryoSerializerInstance.getAutoReset() check:
{code:java}
24/04/24 15:36:22 ERROR YarnCoarseGrainedExecutorBackend: Executor self-exiting 
due to : Unable to create executor due to Failed to register classes with Kryo
org.apache.spark.SparkException: Failed to register classes with Kryo
at 
org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:186)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) 
~[scala-library-2.12.15.jar:?]
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:241) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:174) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:105)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
 ~[kryo-shaded-4.0.2.jar:?]
at 
org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:112)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:352)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializerInstance.getAutoReset(KryoSerializer.scala:452)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects$lzycompute(KryoSerializer.scala:259)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects(KryoSerializer.scala:255)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.util.Utils$.serializerIsSupported$lzycompute$1(Utils.scala:2721)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.util.Utils$.serializerIsSupported$1(Utils.scala:2716) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.util.Utils$.isPushBasedShuffleEnabled(Utils.scala:2730) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:554) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.executor.Executor.(Executor.scala:143) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:190)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_402]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_402]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402]
Caused by: java.lang.ClassNotFoundException: com.analytics.AnalyticsEventWrapper
at java.net.URLClassLoader.findClass(URLClassLoader.java:387) ~[?:1.8.0_402]
at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_402]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) 
~[?:1.8.0_402]
at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_402]
at java.lang.Class.forName0(Native Method) ~[?:1.8.0_402]
at java.lang.Class.forName(Class.java:348) ~[?:1.8.0_402]
at org.apache.spark.util.Utils$.classForName(Utils.scala:228) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$6(KryoSerializer.scala:177)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
~[scala-library-2.12.15.jar:?]
at 
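// Rough sketch, not taken from the report: "Failed to register classes with Kryo"
// is raised while loading the classes configured for Kryo registration. One common
// way such a class is configured is shown below; however it was configured in this
// job, com.analytics.AnalyticsEventWrapper (from the stack trace) has to be loadable
// on the executor when the serializer pool is first used, which here happens very
// early, inside BlockManager.initialize's push-based-shuffle check.
val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.shuffle.push.enabled", "true")                                  // push-based shuffle (Spark 3.2+)
  .set("spark.kryo.classesToRegister", "com.analytics.AnalyticsEventWrapper") // assumed registration style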

[jira] [Updated] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-36705:
---
Labels: pull-request-available  (was: )

> Disable push based shuffle when IO encryption is enabled or serializer is not 
> relocatable
> -
>
> Key: SPARK-36705
> URL: https://issues.apache.org/jira/browse/SPARK-36705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Assignee: Minchu Yang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.2.0
>
>
> Push based shuffle is not compatible with IO encryption or non-relocatable 
> serialization.
> This is similar to SPARK-34790.
> We have to disable push based shuffle if either of these two is true.
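
A minimal sketch (not from the ticket) of the combination being guarded against:

{code:scala}
// Sketch: if IO encryption is on (or the serializer cannot relocate serialized
// objects), push-based shuffle should be forced off even though it was requested.
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.push.enabled", "true")   // requested
  .set("spark.io.encryption.enabled", "true")  // incompatible, so push-based shuffle is disabled
{code}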



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48043) Kryo serialization issue with push-based shuffle

2024-05-02 Thread Romain Ardiet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romain Ardiet updated SPARK-48043:
--
Description: 
I'm running a Spark job on AWS EMR. I wanted to test the new push-based shuffle 
introduced in Spark 3.2, but it fails with a Kryo exception when I enable it.

The issue happens when the Executor starts, during the 
KryoSerializerInstance.getAutoReset() check:
{code:java}
24/04/24 15:36:22 ERROR YarnCoarseGrainedExecutorBackend: Executor self-exiting 
due to : Unable to create executor due to Failed to register classes with Kryo
org.apache.spark.SparkException: Failed to register classes with Kryo
at 
org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:186)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) 
~[scala-library-2.12.15.jar:?]
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:241) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:174) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:105)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
 ~[kryo-shaded-4.0.2.jar:?]
at 
org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:112)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:352)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializerInstance.getAutoReset(KryoSerializer.scala:452)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects$lzycompute(KryoSerializer.scala:259)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects(KryoSerializer.scala:255)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.util.Utils$.serializerIsSupported$lzycompute$1(Utils.scala:2721)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.util.Utils$.serializerIsSupported$1(Utils.scala:2716) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.util.Utils$.isPushBasedShuffleEnabled(Utils.scala:2730) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:554) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.executor.Executor.(Executor.scala:143) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:190)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_402]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_402]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402]
Caused by: java.lang.ClassNotFoundException: com.analytics.AnalyticsEventWrapper
at java.net.URLClassLoader.findClass(URLClassLoader.java:387) ~[?:1.8.0_402]
at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_402]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) 
~[?:1.8.0_402]
at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_402]
at java.lang.Class.forName0(Native Method) ~[?:1.8.0_402]
at java.lang.Class.forName(Class.java:348) ~[?:1.8.0_402]
at org.apache.spark.util.Utils$.classForName(Utils.scala:228) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$6(KryoSerializer.scala:177)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
~[scala-library-2.12.15.jar:?]
at 

[jira] [Updated] (SPARK-48043) Kryo serialization issue with push-based shuffle

2024-05-02 Thread Romain Ardiet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romain Ardiet updated SPARK-48043:
--
Description: 
I'm running a Spark job on AWS EMR. I wanted to test the new push-based shuffle 
introduced in Spark 3.2, but it fails with a Kryo exception when I enable it.

The issue happens when the Executor starts, during the 
KryoSerializerInstance.getAutoReset() check:
{code:java}
24/04/24 15:36:22 ERROR YarnCoarseGrainedExecutorBackend: Executor self-exiting 
due to : Unable to create executor due to Failed to register classes with Kryo
org.apache.spark.SparkException: Failed to register classes with Kryo
at 
org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:186)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) 
~[scala-library-2.12.15.jar:?]
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:241) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:174) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:105)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
 ~[kryo-shaded-4.0.2.jar:?]
at 
org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:112)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:352)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializerInstance.getAutoReset(KryoSerializer.scala:452)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects$lzycompute(KryoSerializer.scala:259)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects(KryoSerializer.scala:255)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.util.Utils$.serializerIsSupported$lzycompute$1(Utils.scala:2721)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.util.Utils$.serializerIsSupported$1(Utils.scala:2716) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.util.Utils$.isPushBasedShuffleEnabled(Utils.scala:2730) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:554) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.executor.Executor.(Executor.scala:143) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:190)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_402]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_402]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402]
Caused by: java.lang.ClassNotFoundException: com.analytics.AnalyticsEventWrapper
at java.net.URLClassLoader.findClass(URLClassLoader.java:387) ~[?:1.8.0_402]
at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_402]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) 
~[?:1.8.0_402]
at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_402]
at java.lang.Class.forName0(Native Method) ~[?:1.8.0_402]
at java.lang.Class.forName(Class.java:348) ~[?:1.8.0_402]
at org.apache.spark.util.Utils$.classForName(Utils.scala:228) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$6(KryoSerializer.scala:177)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
~[scala-library-2.12.15.jar:?]
at 

[jira] [Updated] (SPARK-48043) Kryo serialization issue with push-based shuffle

2024-05-02 Thread Romain Ardiet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romain Ardiet updated SPARK-48043:
--
Description: 
I'm running a Spark job on AWS EMR. I wanted to test the new push-based shuffle 
introduced in Spark 3.2, but it's failing with a Kryo exception when I enable it.

The issue seems to happen when the Executor starts, during the
KryoSerializerInstance.getAutoReset() check:
{code:java}
24/04/24 15:36:22 ERROR YarnCoarseGrainedExecutorBackend: Executor self-exiting 
due to : Unable to create executor due to Failed to register classes with Kryo
org.apache.spark.SparkException: Failed to register classes with Kryo
at 
org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:186)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) 
~[scala-library-2.12.15.jar:?]
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:241) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:174) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:105)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
 ~[kryo-shaded-4.0.2.jar:?]
at 
org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:112)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:352)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializerInstance.getAutoReset(KryoSerializer.scala:452)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects$lzycompute(KryoSerializer.scala:259)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects(KryoSerializer.scala:255)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.util.Utils$.serializerIsSupported$lzycompute$1(Utils.scala:2721)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.util.Utils$.serializerIsSupported$1(Utils.scala:2716) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.util.Utils$.isPushBasedShuffleEnabled(Utils.scala:2730) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:554) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.executor.Executor.(Executor.scala:143) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:190)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_402]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_402]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402]
Caused by: java.lang.ClassNotFoundException: com.analytics.AnalyticsEventWrapper
at java.net.URLClassLoader.findClass(URLClassLoader.java:387) ~[?:1.8.0_402]
at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_402]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) 
~[?:1.8.0_402]
at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_402]
at java.lang.Class.forName0(Native Method) ~[?:1.8.0_402]
at java.lang.Class.forName(Class.java:348) ~[?:1.8.0_402]
at org.apache.spark.util.Utils$.classForName(Utils.scala:228) 
~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at 
org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$6(KryoSerializer.scala:177)
 ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
~[scala-library-2.12.15.jar:?]
at 
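For context, a minimal sketch of a configuration that exercises this code path: Kryo with application classes registered up front, plus push-based shuffle, which is what triggers the getAutoReset() check at executor startup. The class name is taken from the ClassNotFoundException above; everything else is an illustration under those assumptions, not a reproduction of the reporter's exact job.

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("push-based-shuffle-with-kryo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # The registered class must be on the executor classpath (e.g. shipped via
    # spark.jars); the stack trace above shows what happens when it is not.
    .config("spark.kryo.classesToRegister", "com.analytics.AnalyticsEventWrapper")
    # Push-based shuffle is YARN-only and needs the external shuffle service.
    .config("spark.shuffle.push.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
{code}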


[jira] [Created] (SPARK-48093) Add config to switch between client side listener and server side listener

2024-05-02 Thread Wei Liu (Jira)
Wei Liu created SPARK-48093:
---

 Summary: Add config to switch between client side listener and 
server side listener 
 Key: SPARK-48093
 URL: https://issues.apache.org/jira/browse/SPARK-48093
 Project: Spark
  Issue Type: New Feature
  Components: Connect, SS
Affects Versions: 3.5.1, 3.5.0, 3.5.2
Reporter: Wei Liu


We are moving the implementation of the Streaming Query Listener from the server to 
the client. For clients already running the client-side listener, to prevent 
regressions, we should add a config that lets users decide which type of listener 
to use.

 

This is only added to the published 3.5.x versions. For 4.0 and upwards, only the 
client-side listener is used.
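
For reference, a minimal sketch of the listener registration that such a config would govern. It assumes an existing SparkSession named `spark` and uses only the public PySpark StreamingQueryListener API; the config key itself is not specified in this ticket, so it is not shown.

{code:python}
from pyspark.sql.streaming import StreamingQueryListener

class PrintingListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print("query started:", event.id)

    def onQueryProgress(self, event):
        print("batch finished:", event.progress.batchId)

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        print("query terminated:", event.id)

# Whether these callbacks are driven client side or server side is what the
# proposed config would select on 3.5.x.
spark.streams.addListener(PrintingListener())
{code}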



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48067) Fix Variant default columns for more complex default variants

2024-05-02 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-48067.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46312
[https://github.com/apache/spark/pull/46312]

> Fix Variant default columns for more complex default variants
> -
>
> Key: SPARK-48067
> URL: https://issues.apache.org/jira/browse/SPARK-48067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Default columns are stored as a StructField metadata (string -> string) map, 
> which means the literal values are stored as strings.
> However, the string representation of a variant is the JSON-ified string. 
> Thus, we need to wrap the string with `parse_json` to correctly use the 
> default values.
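
For illustration, a hedged sketch of the kind of variant default column this fix targets. The table and column names are made up, and it assumes an active SparkSession named `spark`, a built-in data source with column DEFAULT support, and the VARIANT type from Spark 4.0; `parse_json` is the wrapping described above.

{code:python}
spark.sql("""
  CREATE TABLE events (
    id INT,
    payload VARIANT DEFAULT parse_json('{"source": "unknown", "tags": []}')
  ) USING parquet
""")

# Omitting `payload` picks up the default, which has to round-trip through the
# string-valued StructField metadata described above.
spark.sql("INSERT INTO events (id) VALUES (1)")
spark.sql("SELECT id, payload FROM events").show(truncate=False)
{code}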



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48067) Fix Variant default columns for more complex default variants

2024-05-02 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-48067:
--

Assignee: Richard Chen

> Fix Variant default columns for more complex default variants
> -
>
> Key: SPARK-48067
> URL: https://issues.apache.org/jira/browse/SPARK-48067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
>
> Default columns are stored as a StructField metadata (string -> string) map, 
> which means the literal values are stored as strings.
> However, the string representation of a variant is the JSON-ified string. 
> Thus, we need to wrap the string with `parse_json` to correctly use the 
> default values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48089) Streaming query listener not working in 3.5 client <> 4.0 server

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48089:
---
Labels: pull-request-available  (was: )

> Streaming query listener not working in 3.5 client <> 4.0 server
> 
>
> Key: SPARK-48089
> URL: https://issues.apache.org/jira/browse/SPARK-48089
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> {code}
> ==
> ERROR [1.488s]: test_listener_events 
> (pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests.test_listener_events)
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/connect/streaming/test_parity_listener.py",
>  line 53, in test_listener_events
> self.spark.streams.addListener(test_listener)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py",
>  line 244, in addListener
> self._execute_streaming_query_manager_cmd(cmd)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py",
>  line 260, in _execute_streaming_query_manager_cmd
> (_, properties) = self._session.client.execute_command(exec_cmd)
>   ^^
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 982, in execute_command
> data, _, _, _, properties = self._execute_and_fetch(req)
> 
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1283, in _execute_and_fetch
> for response in self._execute_and_fetch_as_iterator(req):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1264, in _execute_and_fetch_as_iterator
> self._handle_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1503, in _handle_error
> self._handle_rpc_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1539, in _handle_rpc_error
> raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (java.io.EOFException) 
> --
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45988) Fix `pyspark.pandas.tests.computation.test_apply_func` in Python 3.11

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45988:
--
Fix Version/s: 3.4.4

> Fix `pyspark.pandas.tests.computation.test_apply_func` in Python 3.11
> -
>
> Key: SPARK-45988
> URL: https://issues.apache.org/jira/browse/SPARK-45988
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.4
>
>
> https://github.com/apache/spark/actions/runs/6914662405/job/18812759697
> {code}
> ==
> ERROR [0.686s]: test_apply_batch_with_type 
> (pyspark.pandas.tests.computation.test_apply_func.FrameApplyFunctionTests.test_apply_batch_with_type)
> --
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/pandas/tests/computation/test_apply_func.py",
>  line 248, in test_apply_batch_with_type
> def identify3(x) -> ps.DataFrame[float, [int, List[int]]]:
> ^
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13540, in 
> __class_getitem__
> return create_tuple_for_frame_type(params)
>^^^
>   File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 
> 721, in create_tuple_for_frame_type
> return Tuple[_to_type_holders(params)]
>  
>   File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 
> 766, in _to_type_holders
> data_types = _new_type_holders(data_types, NameTypeHolder)
>  ^
>   File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 
> 832, in _new_type_holders
> raise TypeError(
> TypeError: Type hints should be specified as one of:
>   - DataFrame[type, type, ...]
>   - DataFrame[name: type, name: type, ...]
>   - DataFrame[dtypes instance]
>   - DataFrame[zip(names, types)]
>   - DataFrame[index_type, [type, ...]]
>   - DataFrame[(index_name, index_type), [(name, type), ...]]
>   - DataFrame[dtype instance, dtypes instance]
>   - DataFrame[(index_name, index_type), zip(names, types)]
>   - DataFrame[[index_type, ...], [type, ...]]
>   - DataFrame[[(index_name, index_type), ...], [(name, type), ...]]
>   - DataFrame[dtypes instance, dtypes instance]
>   - DataFrame[zip(index_names, index_types), zip(names, types)]
> However, got (, typing.List[int]).
> --
> Ran 10 tests in 34.327s
> FAILED (errors=1)
> {code}
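
As a point of comparison, a small sketch of one of the accepted type-hint forms listed in the error above, using concrete element types instead of the typing.List[int] element hint that the failing test uses. This is illustrative only and assumes a working pandas-on-Spark environment.

{code:python}
import pyspark.pandas as ps

# DataFrame[index_type, [type, ...]] is one of the accepted forms listed above.
def identify(pdf) -> ps.DataFrame[int, [int, int]]:
    return pdf

psdf = ps.DataFrame({"a": [1, 2], "b": [3, 4]})
psdf.pandas_on_spark.apply_batch(identify).dtypes
{code}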



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45989) Fix `pyspark.pandas.tests.connect.computation.test_parity_apply_func` in Python 3.11

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45989:
--
Fix Version/s: 3.4.4

> Fix `pyspark.pandas.tests.connect.computation.test_parity_apply_func` in 
> Python 3.11
> 
>
> Key: SPARK-45989
> URL: https://issues.apache.org/jira/browse/SPARK-45989
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 4.0.0, 3.5.2, 3.4.4
>
>
> https://github.com/apache/spark/actions/runs/6914662405/job/18816505612
> {code}
> ==
> ERROR [1.237s]: test_apply_batch_with_type 
> (pyspark.pandas.tests.connect.computation.test_parity_apply_func.FrameParityApplyFunctionTests.test_apply_batch_with_type)
> --
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/pandas/tests/computation/test_apply_func.py",
>  line 248, in test_apply_batch_with_type
> def identify3(x) -> ps.DataFrame[float, [int, List[int]]]:
> ^
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13540, in 
> __class_getitem__
> return create_tuple_for_frame_type(params)
>^^^
>   File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 
> 721, in create_tuple_for_frame_type
> return Tuple[_to_type_holders(params)]
>  
>   File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 
> 766, in _to_type_holders
> data_types = _new_type_holders(data_types, NameTypeHolder)
>  ^
>   File "/__w/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 
> 832, in _new_type_holders
> raise TypeError(
> TypeError: Type hints should be specified as one of:
>   - DataFrame[type, type, ...]
>   - DataFrame[name: type, name: type, ...]
>   - DataFrame[dtypes instance]
>   - DataFrame[zip(names, types)]
>   - DataFrame[index_type, [type, ...]]
>   - DataFrame[(index_name, index_type), [(name, type), ...]]
>   - DataFrame[dtype instance, dtypes instance]
>   - DataFrame[(index_name, index_type), zip(names, types)]
>   - DataFrame[[index_type, ...], [type, ...]]
>   - DataFrame[[(index_name, index_type), ...], [(name, type), ...]]
>   - DataFrame[dtypes instance, dtypes instance]
>   - DataFrame[zip(index_names, index_types), zip(names, types)]
> However, got (, typing.List[int]).
> --
> Ran 10 tests in 78.247s
> FAILED (errors=1)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48081) Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48081:
--
Fix Version/s: (was: 3.5.2)
   (was: 3.4.4)

> Fix ClassCastException in NTile.checkInputDataTypes() when argument is 
> non-foldable or of wrong type
> 
>
> Key: SPARK-48081
> URL: https://issues.apache.org/jira/browse/SPARK-48081
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> sql("select ntile(99.9) OVER (order by id) from range(10)"){code}
> results in
> {code}
>  java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal 
> cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal 
> is in unnamed module of loader 'app'; java.lang.Integer is in module 
> java.base of loader 'bootstrap')
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)
>   at 
> org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:267)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:267)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1(Expression.scala:279)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1$adapted(Expression.scala:279)
>   at scala.collection.IterableOnceOps.forall(IterableOnce.scala:633)
>   at scala.collection.IterableOnceOps.forall$(IterableOnce.scala:630)
>   at scala.collection.AbstractIterable.forall(Iterable.scala:935)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:279)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$157.applyOrElse(Analyzer.scala:2243)
>  
> {code}
> instead of the intended user-facing error message. This is a minor bug that 
> was introduced in a previous error class refactoring PR.
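
For contrast, a quick sketch of the valid call, assuming an active SparkSession named `spark`: ntile expects a foldable positive integer, so this runs, while the decimal argument above should surface a user-facing analysis error once this fix is in.

{code:python}
spark.sql("select ntile(3) OVER (order by id) from range(10)").show()
{code}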



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48092) Spark images are based of Python 3.8 which is soon EOL

2024-05-02 Thread Mayur Madnani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayur Madnani updated SPARK-48092:
--
Description: 
Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based on 
Python 3.8 as of now.

I am proposing to use Python 3.10 as the default. Let me know if I can pick this up 
to make the changes in [spark-docker|https://github.com/apache/spark-docker].


  was:
Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based on 
Python 3.8 as of now.

I am proposing to use Python 3.10 as default. Let me know if I can pick this up 
to make the changes in [spark-docker|[https://github.com/apache/spark-docker]]

 

!Screenshot 2024-05-02 at 21.00.18.png!


> Spark images are based of Python 3.8 which is soon EOL
> --
>
> Key: SPARK-48092
> URL: https://issues.apache.org/jira/browse/SPARK-48092
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Docker
>Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1
>Reporter: Mayur Madnani
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 21.00.18.png, Screenshot 
> 2024-05-02 at 21.00.48.png
>
>
> Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based 
> on Python 3.8 as of now.
> I am proposing to use Python 3.10 as default. Let me know if I can pick this 
> up to make the changes in 
> [spark-docker|[https://github.com/apache/spark-docker]]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48092) Spark images are based of Python 3.8 which is soon EOL

2024-05-02 Thread Mayur Madnani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayur Madnani updated SPARK-48092:
--
Attachment: Screenshot 2024-05-02 at 21.00.18.png
Screenshot 2024-05-02 at 21.00.48.png

> Spark images are based of Python 3.8 which is soon EOL
> --
>
> Key: SPARK-48092
> URL: https://issues.apache.org/jira/browse/SPARK-48092
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Docker
>Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1
>Reporter: Mayur Madnani
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 21.00.18.png, Screenshot 
> 2024-05-02 at 21.00.48.png
>
>
> Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based 
> on Python 3.8 as of now.
> I am proposing to use Python 3.10 as default. Let me know if I can pick this 
> up to make the changes in 
> [spark-docker|[https://github.com/apache/spark-docker]]
>  
> !image-2024-05-02-14-45-18-492.png!
> !image-2024-05-02-14-44-43-423.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48092) Spark images are based of Python 3.8 which is soon EOL

2024-05-02 Thread Mayur Madnani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayur Madnani updated SPARK-48092:
--
Description: 
Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based on 
Python 3.8 as of now.

I am proposing to use Python 3.10 as the default. Let me know if I can pick this up 
to make the changes in [spark-docker|https://github.com/apache/spark-docker].

 

!Screenshot 2024-05-02 at 21.00.18.png!

  was:
Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based on 
Python 3.8 as of now.

I am proposing to use Python 3.10 as default. Let me know if I can pick this up 
to make the changes in [spark-docker|[https://github.com/apache/spark-docker]]

 

!image-2024-05-02-14-45-18-492.png!

!image-2024-05-02-14-44-43-423.png!

 


> Spark images are based of Python 3.8 which is soon EOL
> --
>
> Key: SPARK-48092
> URL: https://issues.apache.org/jira/browse/SPARK-48092
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Docker
>Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1
>Reporter: Mayur Madnani
>Priority: Blocker
> Attachments: Screenshot 2024-05-02 at 21.00.18.png, Screenshot 
> 2024-05-02 at 21.00.48.png
>
>
> Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based 
> on Python 3.8 as of now.
> I am proposing to use Python 3.10 as default. Let me know if I can pick this 
> up to make the changes in 
> [spark-docker|[https://github.com/apache/spark-docker]]
>  
> !Screenshot 2024-05-02 at 21.00.18.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48092) Spark images are based of Python 3.8 which is soon EOL

2024-05-02 Thread Mayur Madnani (Jira)
Mayur Madnani created SPARK-48092:
-

 Summary: Spark images are based of Python 3.8 which is soon EOL
 Key: SPARK-48092
 URL: https://issues.apache.org/jira/browse/SPARK-48092
 Project: Spark
  Issue Type: Bug
  Components: Spark Docker
Affects Versions: 3.5.1, 3.5.0, 3.4.1, 3.4.0, 3.4.2
Reporter: Mayur Madnani


Python 3.8 will be EOL in Oct 2024 and all the Spark docker images are based on 
Python 3.8 as of now.

I am proposing to use Python 3.10 as the default. Let me know if I can pick this up 
to make the changes in [spark-docker|https://github.com/apache/spark-docker].

 

!image-2024-05-02-14-45-18-492.png!

!image-2024-05-02-14-44-43-423.png!

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48079) Upgrade maven-install/deploy-plugin to 3.1.2

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48079.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46330
[https://github.com/apache/spark/pull/46330]

> Upgrade maven-install/deploy-plugin to 3.1.2
> 
>
> Key: SPARK-48079
> URL: https://issues.apache.org/jira/browse/SPARK-48079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48081) Fix ClassCastException in NTile.checkInputDataTypes() when argument is non-foldable or of wrong type

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48081.
---
Fix Version/s: 3.4.4
   3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 46333
[https://github.com/apache/spark/pull/46333]

> Fix ClassCastException in NTile.checkInputDataTypes() when argument is 
> non-foldable or of wrong type
> 
>
> Key: SPARK-48081
> URL: https://issues.apache.org/jira/browse/SPARK-48081
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.4, 3.5.2, 4.0.0
>
>
> {code:java}
> sql("select ntile(99.9) OVER (order by id) from range(10)"){code}
> results in
> {code}
>  java.lang.ClassCastException: class org.apache.spark.sql.types.Decimal 
> cannot be cast to class java.lang.Integer (org.apache.spark.sql.types.Decimal 
> is in unnamed module of loader 'app'; java.lang.Integer is in module 
> java.base of loader 'bootstrap')
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)
>   at 
> org.apache.spark.sql.catalyst.expressions.NTile.checkInputDataTypes(windowExpressions.scala:877)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:267)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:267)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1(Expression.scala:279)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$childrenResolved$1$adapted(Expression.scala:279)
>   at scala.collection.IterableOnceOps.forall(IterableOnce.scala:633)
>   at scala.collection.IterableOnceOps.forall$(IterableOnce.scala:630)
>   at scala.collection.AbstractIterable.forall(Iterable.scala:935)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:279)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$157.applyOrElse(Analyzer.scala:2243)
>  
> {code}
> instead of the intended user-facing error message. This is a minor bug that 
> was introduced in a previous error class refactoring PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48072) Improve SQLQuerySuite test output - use `===` instead of `sameElements` for Arrays

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48072:
--
Summary: Improve SQLQuerySuite test output - use `===` instead of 
`sameElements` for Arrays  (was: Test output is not descriptive for some Array 
comparisons in SQLQuerySuite)

> Improve SQLQuerySuite test output - use `===` instead of `sameElements` for 
> Arrays
> --
>
> Key: SPARK-48072
> URL: https://issues.apache.org/jira/browse/SPARK-48072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Actual and expected queries are not printed in the output when using 
> `.sameElements`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48072) Test output is not descriptive for some Array comparisons in SQLQuerySuite

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48072:
-

Assignee: Vladimir Golubev

> Test output is not descriptive for some Array comparisons in SQLQuerySuite
> --
>
> Key: SPARK-48072
> URL: https://issues.apache.org/jira/browse/SPARK-48072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Minor
>  Labels: pull-request-available
>
> Actual and expected queries are not printed in the output when using 
> `.sameElements`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48072) Test output is not descriptive for some Array comparisons in SQLQuerySuite

2024-05-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48072.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46318
[https://github.com/apache/spark/pull/46318]

> Test output is not descriptive for some Array comparisons in SQLQuerySuite
> --
>
> Key: SPARK-48072
> URL: https://issues.apache.org/jira/browse/SPARK-48072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Actual and expected queries are not printed in the output when using 
> `.sameElements`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48091) Using `explode` together with `transform` in the same select statement causes aliases in the transformed column to be ignored

2024-05-02 Thread Ron Serruya (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Serruya updated SPARK-48091:

Description: 
When using the `explode` function and the `transform` function in the same select 
statement, aliases used inside the transformed column are ignored.

This behaviour only happens when using the PySpark API, not when using the SQL 
API.

 
{code:java}
from pyspark.sql import functions as F

# Create the df
df = spark.createDataFrame([
{"id": 1, "array1": ['a', 'b'], 'array2': [2,3,4]}
]){code}
Good case, where all aliases are used

 
{code:java}
df.select(
F.transform(
'array2',
lambda x: F.struct(x.alias("some_alias"), 
F.col("id").alias("second_alias"))
).alias("new_array2")
).printSchema() 

root
 |-- new_array2: array (nullable = true)
 ||-- element: struct (containsNull = false)
 |||-- some_alias: long (nullable = true)
 |||-- second_alias: long (nullable = true){code}
Bad case: when using explode, the aliases inside the transformed column are 
ignored, `id` is kept instead of `second_alias`, and `x_17` is used instead of 
`some_alias`

 

 
{code:java}
df.select(
F.explode("array1").alias("exploded"),
F.transform(
'array2',
lambda x: F.struct(x.alias("some_alias"), 
F.col("id").alias("second_alias"))
).alias("new_array2")
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- new_array2: array (nullable = true)
 ||-- element: struct (containsNull = false)
 |||-- x_17: long (nullable = true)
 |||-- id: long (nullable = true) {code}
 

 

 

When using the SQL API instead, it works fine
{code:java}
spark.sql(
"""
select explode(array1) as exploded, transform(array2, x-> struct(x as 
some_alias, id as second_alias)) as array2 from {df}
""", df=df
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- array2: array (nullable = true)
 ||-- element: struct (containsNull = false)
 |||-- some_alias: long (nullable = true)
 |||-- second_alias: long (nullable = true) {code}
 

Workaround: for now, F.named_struct can be used instead; a minimal sketch follows below.
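
A minimal sketch of that workaround, written with the SQL named_struct through F.expr so it also applies on 3.4.x (pyspark.sql.functions.named_struct is only available from 3.5). The field names are spelled out as literals, so they no longer depend on the lambda alias resolution.

{code:python}
from pyspark.sql import functions as F

df.select(
    F.explode("array1").alias("exploded"),
    # named_struct takes alternating name/value pairs, so the field names
    # survive even with explode in the same select.
    F.expr(
        "transform(array2, x -> named_struct('some_alias', x, 'second_alias', id))"
    ).alias("new_array2"),
).printSchema()
{code}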

  was:
When using an `explode` function, and `transform` function in the same select 
statement, aliases used inside the transformed column are ignored.

This behaviour only happens using the pyspark API, and not when using the SQL 
API

 
{code:java}
from pyspark.sql import functions as F

# Create the df
df = spark.createDataFrame([
{"id": 1, "array1": ['a', 'b'], 'array2': [2,3,4]}
]){code}
Good case, where all aliases are used

 
{code:java}
df.select(
F.transform(
'array2',
lambda x: F.struct(x.alias("some_alias"), 
F.col("id").alias("second_alias"))
).alias("new_array2")
).printSchema() 

root
 |-- new_array2: array (nullable = true)
 ||-- element: struct (containsNull = false)
 |||-- some_alias: long (nullable = true)
 |||-- second_alias: long (nullable = true){code}
Bad case, when using explode, the alises inside the transformed column is 
ignored, and  `id` is kept instead of `second_alias`, and `x_17` is used 
instead of `some_alias`

 

 
{code:java}
df.select(
F.explode("array1").alias("exploded"),
F.transform(
'array2',
lambda x: F.struct(x.alias("some_alias"), 
F.col("id").alias("second_alias"))
).alias("new_array2")
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- new_array2: array (nullable = true)
 ||-- element: struct (containsNull = false)
 |||-- x_17: long (nullable = true)
 |||-- id: long (nullable = true) {code}
 

 

 

When using the SQL API instead, it works fine
{code:java}
spark.sql(
"""
select explode(array1) as exploded, transform(array2, x-> struct(x as 
some_alias, id as second_alias)) as array2 from {df}
""", df=df
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- array2: array (nullable = true)
 ||-- element: struct (containsNull = false)
 |||-- some_alias: long (nullable = true)
 |||-- second_alias: long (nullable = true) {code}
 


> Using `explode` together with `transform` in the same select statement causes 
> aliases in the transformed column to be ignored
> -
>
> Key: SPARK-48091
> URL: https://issues.apache.org/jira/browse/SPARK-48091
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0, 3.5.0, 3.5.1
> Environment: Python 3.10, 3.12, OSX 14.4 and Databricks DBR 13.3, 
> 14.3, Pyspark 3.4.0, 3.5.0, 3.5.1
>Reporter: Ron Serruya
>Priority: Minor
>  Labels: PySpark, alias
>
> When using an `explode` function, and `transform` function in the same select 
> 

[jira] [Created] (SPARK-48091) Using `explode` together with `transform` in the same select statement causes aliases in the transformed column to be ignored

2024-05-02 Thread Ron Serruya (Jira)
Ron Serruya created SPARK-48091:
---

 Summary: Using `explode` together with `transform` in the same 
select statement causes aliases in the transformed column to be ignored
 Key: SPARK-48091
 URL: https://issues.apache.org/jira/browse/SPARK-48091
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.5.1, 3.5.0, 3.4.0
 Environment: Python 3.10, 3.12, OSX 14.4 and Databricks DBR 13.3, 
14.3, Pyspark 3.4.0, 3.5.0, 3.5.1
Reporter: Ron Serruya


When using the `explode` function and the `transform` function in the same select 
statement, aliases used inside the transformed column are ignored.

This behaviour only happens when using the PySpark API, not when using the SQL 
API.

 
{code:java}
from pyspark.sql import functions as F

# Create the df
df = spark.createDataFrame([
{"id": 1, "array1": ['a', 'b'], 'array2': [2,3,4]}
]){code}
Good case, where all aliases are used

 
{code:java}
df.select(
F.transform(
'array2',
lambda x: F.struct(x.alias("some_alias"), 
F.col("id").alias("second_alias"))
).alias("new_array2")
).printSchema() 

root
 |-- new_array2: array (nullable = true)
 ||-- element: struct (containsNull = false)
 |||-- some_alias: long (nullable = true)
 |||-- second_alias: long (nullable = true){code}
Bad case: when using explode, the aliases inside the transformed column are 
ignored, `id` is kept instead of `second_alias`, and `x_17` is used instead of 
`some_alias`

 

 
{code:java}
df.select(
F.explode("array1").alias("exploded"),
F.transform(
'array2',
lambda x: F.struct(x.alias("some_alias"), 
F.col("id").alias("second_alias"))
).alias("new_array2")
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- new_array2: array (nullable = true)
 ||-- element: struct (containsNull = false)
 |||-- x_17: long (nullable = true)
 |||-- id: long (nullable = true) {code}
 

 

 

When using the SQL API instead, it works fine
{code:java}
spark.sql(
"""
select explode(array1) as exploded, transform(array2, x-> struct(x as 
some_alias, id as second_alias)) as array2 from {df}
""", df=df
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- array2: array (nullable = true)
 ||-- element: struct (containsNull = false)
 |||-- some_alias: long (nullable = true)
 |||-- second_alias: long (nullable = true) {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48056) [CONNECT][PYTHON] Session not found error should automatically retry during reattach

2024-05-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48056.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46297
[https://github.com/apache/spark/pull/46297]

> [CONNECT][PYTHON] Session not found error should automatically retry during 
> reattach
> 
>
> Key: SPARK-48056
> URL: https://issues.apache.org/jira/browse/SPARK-48056
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.4.3
>Reporter: Niranjan Jayakar
>Assignee: Niranjan Jayakar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When an OPERATION_NOT_FOUND error is raised and no prior responses were 
> received, the client retries the ExecutePlan RPC: 
> [https://github.com/apache/spark/blob/e6217c111fbdd73f202400494c42091e93d3041f/python/pyspark/sql/connect/client/reattach.py#L257]
>  
> Another error, SESSION_NOT_FOUND, should follow the same logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48088) Skip tests being failed in client 3.5 <> server 4.0

2024-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48088:
---
Labels: pull-request-available  (was: )

> Skip tests being failed in client 3.5 <> server 4.0
> ---
>
> Key: SPARK-48088
> URL: https://issues.apache.org/jira/browse/SPARK-48088
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.2
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> We should skip them and set up the CI first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48090) Streaming exception catch failure in 3.5 client <> 4.0 server

2024-05-02 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-48090:


 Summary: Streaming exception catch failure in 3.5 client <> 4.0 
server
 Key: SPARK-48090
 URL: https://issues.apache.org/jira/browse/SPARK-48090
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, Structured Streaming
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


{code}
==
FAIL [1.975s]: test_stream_exception 
(pyspark.sql.tests.connect.streaming.test_parity_streaming.StreamingParityTests.test_stream_exception)
--
Traceback (most recent call last):
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py",
 line 287, in test_stream_exception
sq.processAllAvailable()
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py",
 line 129, in processAllAvailable
self._execute_streaming_query_cmd(cmd)
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py",
 line 177, in _execute_streaming_query_cmd
(_, properties) = self._session.client.execute_command(exec_cmd)
  ^^
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 982, in execute_command
data, _, _, _, properties = self._execute_and_fetch(req)

  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1283, in _execute_and_fetch
for response in self._execute_and_fetch_as_iterator(req):
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1264, in _execute_and_fetch_as_iterator
self._handle_error(error)
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1503, in _handle_error
self._handle_rpc_error(error)
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1539, in _handle_rpc_error
raise convert_exception(info, status.message) from None
pyspark.errors.exceptions.connect.StreamingQueryException: [STREAM_FAILED] 
Query [id = 38d0d145-1f57-4b92-b317-d9de727d9468, runId = 
2b963119-d391-4c62-abea-970274859b80] terminated with exception: Job aborted 
due to stage failure: Task 0 in stage 79.0 failed 1 times, most recent failure: 
Lost task 0.0 in stage 79.0 (TID 116) 
(fv-az1144-341.tm43j05r3bqe3lauap1nzddazg.ex.internal.cloudapp.net executor 
driver): org.apache.spark.api.python.PythonException: Traceback (most recent 
call last):
  File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1834, in main
process()
  File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1826, in process
serializer.dump_stream(out_iter, outfile)
  File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
line 224, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
line 145, in dump_stream
for obj in iterator:
  File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
line 213, in _batched
for item in iterator:
  File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1734, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for arg_offsets, f in udfs)
 ^
  File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1734, in 
result = tuple(f(*[a[o] for o in arg_offsets]) for arg_offsets, f in udfs)
   ^^^
  File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
112, in 
return args_kwargs_offsets, lambda *a: func(*a)
   
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/util.py", 
line 118, in wrapper
return f(*args, **kwargs)
   ^^
  File "/home/runner/work/spark/spark-3
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py",
 line 291, in test_stream_exception
self._assert_exception_tree_contains_msg(e, "ZeroDivisionError")
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/streaming/test_streaming.py",
 line 300, in _assert_exception_tree_contains_msg

[jira] [Created] (SPARK-48089) Streaming query listener not working in 3.5 client <> 4.0 server

2024-05-02 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-48089:


 Summary: Streaming query listener not working in 3.5 client <> 4.0 
server
 Key: SPARK-48089
 URL: https://issues.apache.org/jira/browse/SPARK-48089
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, Structured Streaming
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


{code}
==
ERROR [1.488s]: test_listener_events 
(pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests.test_listener_events)
--
Traceback (most recent call last):
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/connect/streaming/test_parity_listener.py",
 line 53, in test_listener_events
self.spark.streams.addListener(test_listener)
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py",
 line 244, in addListener
self._execute_streaming_query_manager_cmd(cmd)
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/streaming/query.py",
 line 260, in _execute_streaming_query_manager_cmd
(_, properties) = self._session.client.execute_command(exec_cmd)
  ^^
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 982, in execute_command
data, _, _, _, properties = self._execute_and_fetch(req)

  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1283, in _execute_and_fetch
for response in self._execute_and_fetch_as_iterator(req):
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1264, in _execute_and_fetch_as_iterator
self._handle_error(error)
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1503, in _handle_error
self._handle_rpc_error(error)
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1539, in _handle_rpc_error
raise convert_exception(info, status.message) from None
pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
(java.io.EOFException) 
--
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48085) ANSI enabled by default brings different results in the tests in 3.5 client <> 4.0 server

2024-05-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48085.
--
Resolution: Invalid

We actually made all of this compatible in the master branch. I will avoid 
backporting the changes because they only affect tests.

> ANSI enabled by default brings different results in the tests in 3.5 client 
> <> 4.0 server
> -
>
> Key: SPARK-48085
> URL: https://issues.apache.org/jira/browse/SPARK-48085
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> ==
> FAIL [0.169s]: test_checking_csv_header 
> (pyspark.sql.tests.connect.test_parity_datasources.DataSourcesParityTests.test_checking_csv_header)
> --
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.SparkException) [FAILED_READ_FILE.NO_HINT] Encountered 
> error while reading file 
> file:///home/runner/work/spark/spark-3.5/python/target/38acabf5-710b-4c21-b359-f61619e2adc7/tmpm7qyq23g/part-0-d6c8793b-772d-44e7-bcca-6eeae9cc0ec7-c000.csv.
>   SQLSTATE: KD001
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/test_datasources.py",
>  line 167, in test_checking_csv_header
> self.assertRaisesRegex(
> AssertionError: "CSV header does not conform to the schema" does not match 
> "(org.apache.spark.SparkException) [FAILED_READ_FILE.NO_HINT] Encountered 
> error while reading file 
> file:///home/runner/work/spark/spark-3.5/python/target/38acabf5-710b-4c21-b359-f61619e2adc7/tmpm7qyq23g/part-0-d6c8793b-772d-44e7-bcca-6eeae9cc0ec7-c000.csv.
>   SQLSTATE: KD001"
> {code}
> {code}
> ==
> ERROR [0.059s]: test_large_variable_types 
> (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests.test_large_variable_types)
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/pandas/test_pandas_map.py",
>  line 115, in test_large_variable_types
> actual = df.mapInPandas(func, "str string, bin binary").collect()
>  
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", 
> line 1645, in collect
> table, schema = self._session.client.to_table(query)
> 
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 858, in to_table
> table, schema, _, _, _ = self._execute_and_fetch(req)
>  
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1283, in _execute_and_fetch
> for response in self._execute_and_fetch_as_iterator(req):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1264, in _execute_and_fetch_as_iterator
> self._handle_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1503, in _handle_error
> self._handle_rpc_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1539, in _handle_rpc_error
> raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.IllegalArgumentException: 
> [INVALID_PARAMETER_VALUE.CHARSET] The value of parameter(s) `charset` in 
> `encode` is invalid: expects one of the charsets 'US-ASCII', 'ISO-8859-1', 
> 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', but got utf8. SQLSTATE: 22023
> {code}
> {code}
> ==
> ERROR [0.024s]: test_assert_approx_equal_decimaltype_custom_rtol_pass 
> (pyspark.sql.tests.connect.test_utils.ConnectUtilsTests.test_assert_approx_equal_decimaltype_custom_rtol_pass)
> --
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/test_utils.py", 
> line 279, in test_assert_approx_equal_decimaltype_custom_rtol_pass
> 

[jira] [Updated] (SPARK-48085) ANSI enabled by default brings different results in the tests in 3.5 client <> 4.0 server

2024-05-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-48085:
-
Description: 
{code}
==
FAIL [0.169s]: test_checking_csv_header 
(pyspark.sql.tests.connect.test_parity_datasources.DataSourcesParityTests.test_checking_csv_header)
--
pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
(org.apache.spark.SparkException) [FAILED_READ_FILE.NO_HINT] Encountered error 
while reading file 
file:///home/runner/work/spark/spark-3.5/python/target/38acabf5-710b-4c21-b359-f61619e2adc7/tmpm7qyq23g/part-0-d6c8793b-772d-44e7-bcca-6eeae9cc0ec7-c000.csv.
  SQLSTATE: KD001
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/test_datasources.py",
 line 167, in test_checking_csv_header
self.assertRaisesRegex(
AssertionError: "CSV header does not conform to the schema" does not match 
"(org.apache.spark.SparkException) [FAILED_READ_FILE.NO_HINT] Encountered error 
while reading file 
file:///home/runner/work/spark/spark-3.5/python/target/38acabf5-710b-4c21-b359-f61619e2adc7/tmpm7qyq23g/part-0-d6c8793b-772d-44e7-bcca-6eeae9cc0ec7-c000.csv.
  SQLSTATE: KD001"
{code}
{code}
==
ERROR [0.059s]: test_large_variable_types 
(pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests.test_large_variable_types)
--
Traceback (most recent call last):
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/pandas/test_pandas_map.py",
 line 115, in test_large_variable_types
actual = df.mapInPandas(func, "str string, bin binary").collect()
 
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", 
line 1645, in collect
table, schema = self._session.client.to_table(query)

  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 858, in to_table
table, schema, _, _, _ = self._execute_and_fetch(req)
 
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1283, in _execute_and_fetch
for response in self._execute_and_fetch_as_iterator(req):
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1264, in _execute_and_fetch_as_iterator
self._handle_error(error)
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1503, in _handle_error
self._handle_rpc_error(error)
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1539, in _handle_rpc_error
raise convert_exception(info, status.message) from None
pyspark.errors.exceptions.connect.IllegalArgumentException: 
[INVALID_PARAMETER_VALUE.CHARSET] The value of parameter(s) `charset` in 
`encode` is invalid: expects one of the charsets 'US-ASCII', 'ISO-8859-1', 
'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', but got utf8. SQLSTATE: 22023
{code}
{code}
==
ERROR [0.024s]: test_assert_approx_equal_decimaltype_custom_rtol_pass 
(pyspark.sql.tests.connect.test_utils.ConnectUtilsTests.test_assert_approx_equal_decimaltype_custom_rtol_pass)
--
Traceback (most recent call last):
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/test_utils.py", 
line 279, in test_assert_approx_equal_decimaltype_custom_rtol_pass
assertDataFrameEqual(df1, df2, rtol=1e-1)
  File "/home/runner/work/spark/spark-3.5/python/pyspark/testing/utils.py", 
line 595, in assertDataFrameEqual
actual_list = actual.collect()
  
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", 
line 1645, in collect
table, schema = self._session.client.to_table(query)

  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 858, in to_table
table, schema, _, _, _ = self._execute_and_fetch(req)
 
  File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1283, in _execute_and_fetch
for 

[jira] [Updated] (SPARK-48088) Skip tests being failed in client 3.5 <> server 4.0

2024-05-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-48088:
-
Affects Version/s: 3.5.2
   (was: 4.0.0)

> Skip tests being failed in client 3.5 <> server 4.0
> ---
>
> Key: SPARK-48088
> URL: https://issues.apache.org/jira/browse/SPARK-48088
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.2
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should skip the failing tests first and set up the CI.
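
A minimal sketch of one way such tests could be skipped in the 3.5 client suite, 
assuming a hypothetical environment variable exported by the cross-version CI job 
(the variable name and the decorator placement are illustrative, not the actual 
change):

{code}
import os
import unittest

# Hypothetical flag the cross-version CI job would export before running the
# 3.5 client tests against a 4.0 server; the name is illustrative only.
IS_40_SERVER = os.environ.get("SPARK_CONNECT_SERVER_MAJOR_VERSION") == "4"


class DataSourcesParityTests(unittest.TestCase):
    @unittest.skipIf(
        IS_40_SERVER,
        "Error message differs between 3.5 client and 4.0 server (SPARK-48085)",
    )
    def test_checking_csv_header(self):
        ...
{code}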



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


