[jira] [Updated] (SPARK-47126) Re-enable Spark 3.4 test in HiveExternalCatalogVersionsSuite

2024-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47126:
--
Parent: (was: SPARK-47046)
Issue Type: Bug  (was: Sub-task)

> Re-enable Spark 3.4 test in HiveExternalCatalogVersionsSuite
> 
>
> Key: SPARK-47126
> URL: https://issues.apache.org/jira/browse/SPARK-47126
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> HiveExternalCatalogVersionsSuite requires SPARK-46400



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44319) Migrate jersey 2 to jersey 3

2024-02-23 Thread HiuFung Kwok (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820268#comment-17820268
 ] 

HiuFung Kwok commented on SPARK-44319:
--

[~dongjoon] FYI, I marked this as resolved as well. 

> Migrate jersey 2 to jersey 3
> 
>
> Key: SPARK-44319
> URL: https://issues.apache.org/jira/browse/SPARK-44319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44319) Migrate jersey 2 to jersey 3

2024-02-23 Thread HiuFung Kwok (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HiuFung Kwok resolved SPARK-44319.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

The work is done under the scope of SPARK-47118.

> Migrate jersey 2 to jersey 3
> 
>
> Key: SPARK-44319
> URL: https://issues.apache.org/jira/browse/SPARK-44319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47153) Guard serialize/deserialize in JavaSerializer with try-with-resource block

2024-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47153:
---
Labels: pull-request-available  (was: )

> Guard serialize/deserialize in JavaSerializer with try-with-resource block
> --
>
> Key: SPARK-47153
> URL: https://issues.apache.org/jira/browse/SPARK-47153
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yan-Lin (Jared) Wang
>Priority: Minor
>  Labels: pull-request-available
>
> It's a common practice to close resources as soon as we're done using them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47153) Guard serialize/deserialize in JavaSerializer with try-with-resource block

2024-02-23 Thread Yan-Lin (Jared) Wang (Jira)
Yan-Lin (Jared) Wang created SPARK-47153:


 Summary: Guard serialize/deserialize in JavaSerializer with 
try-with-resource block
 Key: SPARK-47153
 URL: https://issues.apache.org/jira/browse/SPARK-47153
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Yan-Lin (Jared) Wang


It's a common practice to close resources as soon as we're done using them.
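
As a minimal sketch of the pattern the ticket asks for (written here with Scala's scala.util.Using as the analogue of Java's try-with-resources; illustrative only, not the actual JavaSerializer change):
{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.Using

// Illustrative only: serialize a value and guarantee the stream is closed
// even if writeObject throws, mirroring Java's try-with-resources.
def serialize(value: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  Using.resource(new ObjectOutputStream(bos)) { oos =>
    oos.writeObject(value)
  }
  bos.toByteArray
}
{code}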



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47151) Update pandas to 2.2.1

2024-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47151:
-

Assignee: Bjørn Jørgensen

> Update pandas to 2.2.1
> --
>
> Key: SPARK-47151
> URL: https://issues.apache.org/jira/browse/SPARK-47151
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
>
> [Pandas 2.2.1|https://pandas.pydata.org/docs/whatsnew/v2.2.1.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47151) Update pandas to 2.2.1

2024-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47151.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45236
[https://github.com/apache/spark/pull/45236]

> Update pandas to 2.2.1
> --
>
> Key: SPARK-47151
> URL: https://issues.apache.org/jira/browse/SPARK-47151
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [Pandas 2.2.1|https://pandas.pydata.org/docs/whatsnew/v2.2.1.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47152) Provide `CodeHaus Jackson` dependencies via a new optional directory

2024-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47152:
-

Assignee: Dongjoon Hyun

> Provide `CodeHaus Jackson` dependencies via a new optional directory
> 
>
> Key: SPARK-47152
> URL: https://issues.apache.org/jira/browse/SPARK-47152
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47152) Provide `CodeHaus Jackson` dependencies via a new optional directory

2024-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47152:
--
Summary: Provide `CodeHaus Jackson` dependencies via a new optional 
directory  (was: Provide Apache Hive Jackson dependency via a new optional 
directory)

> Provide `CodeHaus Jackson` dependencies via a new optional directory
> 
>
> Key: SPARK-47152
> URL: https://issues.apache.org/jira/browse/SPARK-47152
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47152) Provide Apache Hive Jackson dependency via a new optional directory

2024-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47152:
---
Labels: pull-request-available  (was: )

> Provide Apache Hive Jackson dependency via a new optional directory
> ---
>
> Key: SPARK-47152
> URL: https://issues.apache.org/jira/browse/SPARK-47152
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47152) Provide Apache Hive Jackson dependency via a new optional directory

2024-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47152:
--
Component/s: Build

> Provide Apache Hive Jackson dependency via a new optional directory
> ---
>
> Key: SPARK-47152
> URL: https://issues.apache.org/jira/browse/SPARK-47152
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47152) Provide Apache Hive Jackson dependency via a new optional directory

2024-02-23 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47152:
-

 Summary: Provide Apache Hive Jackson dependency via a new optional 
directory
 Key: SPARK-47152
 URL: https://issues.apache.org/jira/browse/SPARK-47152
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47151) Update pandas to 2.2.1

2024-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47151:
---
Labels: pull-request-available  (was: )

> Update pandas to 2.2.1
> --
>
> Key: SPARK-47151
> URL: https://issues.apache.org/jira/browse/SPARK-47151
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
>
> [Pandas 2.2.1|https://pandas.pydata.org/docs/whatsnew/v2.2.1.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47151) Update pandas to 2.2.1

2024-02-23 Thread Bjørn Jørgensen (Jira)
Bjørn Jørgensen created SPARK-47151:
---

 Summary: Update pandas to 2.2.1
 Key: SPARK-47151
 URL: https://issues.apache.org/jira/browse/SPARK-47151
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Pandas API on Spark
Affects Versions: 4.0.0
Reporter: Bjørn Jørgensen


[Pandas 2.2.1|https://pandas.pydata.org/docs/whatsnew/v2.2.1.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47150) String length (...) exceeds the maximum length (20000000)

2024-02-23 Thread Sergii Mikhtoniuk (Jira)
Sergii Mikhtoniuk created SPARK-47150:
-

 Summary: String length (...) exceeds the maximum length (20000000)
 Key: SPARK-47150
 URL: https://issues.apache.org/jira/browse/SPARK-47150
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.5.0
Reporter: Sergii Mikhtoniuk


Upgrading to Spark 3.5.0 introduced a regression for us where our query gateway 
(Livy) fails with an error:
{code:java}
com.fasterxml.jackson.core.exc.StreamConstraintsException: String length 
(20054016) exceeds the maximum length (20000000)

(sorry, unable to provide full stack trace){code}
The root of this problem is the breaking change in {{jackson}} that (in the name of "safety") introduced JSON size limits, see: 
[https://github.com/FasterXML/jackson-core/issues/1014]

It looks like {{JSONOptions}} in Spark already [supports configuring this limit|https://github.com/apache/spark/blob/c2dbb6d04bc9c781fb4a7673e5acf2c67b99c203/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala#L55-L58], but there seems to be no way to set it globally or pass it down to [{{DataFrame::toJSON()}}|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toJSON.html], which our Apache Livy server uses when transmitting data.

Livy is an old project and transferring dataframes via JSON is very inefficient, and we really should move to something like Spark Connect, but I believe this issue can affect many people working with basic GeoJSON data.

Spark can handle very large strings, and this arbitrary limit just gets in the way of output serialization for no good reason.
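
For context, a minimal sketch of where this limit lives in Jackson 2.15+ and how it can be raised on a locally constructed parser (this is not a Spark or Livy setting; the classes and methods below are Jackson's public API):
{code:scala}
import com.fasterxml.jackson.core.{JsonFactoryBuilder, StreamReadConstraints}
import com.fasterxml.jackson.databind.ObjectMapper

// Raise Jackson's default 20,000,000-character string limit for one ObjectMapper.
// This only affects parsers built from this factory; it does not change the
// parsers that Spark or Livy construct internally.
val factory = new JsonFactoryBuilder()
  .streamReadConstraints(
    StreamReadConstraints.builder().maxStringLength(100000000).build())
  .build()
val mapper = new ObjectMapper(factory)
{code}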



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47035) Protocol for client side StreamingQueryListener

2024-02-23 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-47035.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45091
[https://github.com/apache/spark/pull/45091]

> Protocol for client side StreamingQueryListener
> ---
>
> Key: SPARK-47035
> URL: https://issues.apache.org/jira/browse/SPARK-47035
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47149) Add Use Pandas API on Spark section on Pandas Scaling to large datasets page

2024-02-23 Thread Bjørn Jørgensen (Jira)
Bjørn Jørgensen created SPARK-47149:
---

 Summary: Add Use Pandas API on Spark section on Pandas Scaling to 
large datasets page
 Key: SPARK-47149
 URL: https://issues.apache.org/jira/browse/SPARK-47149
 Project: Spark
  Issue Type: Documentation
  Components: Pandas API on Spark
Affects Versions: 4.0.0
Reporter: Bjørn Jørgensen


We should make a PR like [DOC: Add Use Modin section on Scaling to large datasets page|https://github.com/pandas-dev/pandas/issues/57585]

We can wait to see if pandas accepts it.

I hope it will be a good addition and that it can attract more users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47118) Upgrade Jetty to 11

2024-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47118.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45154
[https://github.com/apache/spark/pull/45154]

> Upgrade Jetty to 11
> ---
>
> Key: SPARK-47118
> URL: https://issues.apache.org/jira/browse/SPARK-47118
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: HiuFung Kwok
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46975) Support dedicated fallback methods

2024-02-23 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-46975.
--
Resolution: Done

Resolved by https://github.com/apache/spark/pull/45026

> Support dedicated fallback methods
> --
>
> Key: SPARK-46975
> URL: https://issues.apache.org/jira/browse/SPARK-46975
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45101) Spark UI: A stage is still active even when all of its tasks have succeeded

2024-02-23 Thread Bjørn Jørgensen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820182#comment-17820182
 ] 

Bjørn Jørgensen commented on SPARK-45101:
-

Did you use spark.stop()?
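
For reference, a minimal sketch of stopping the session explicitly at the end of an application (plain Scala SparkSession code, not taken from this ticket), so the scheduler and UI listeners can finish cleanly:
{code:scala}
import org.apache.spark.sql.SparkSession

object StopExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stop-example").getOrCreate()
    try {
      spark.range(10).count() // run some job
    } finally {
      spark.stop() // releases resources and lets the UI/listeners finalize stage state
    }
  }
}
{code}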

> Spark UI: A stage is still active even when all of its tasks have succeeded
> ---
>
> Key: SPARK-45101
> URL: https://issues.apache.org/jira/browse/SPARK-45101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1, 3.5.0, 4.0.0
>Reporter: RickyMa
>Priority: Critical
> Attachments: 1.png, 2.png, 3.png
>
>
> In the stage UI, we can see all the tasks' statuses are SUCCESS.
> But the stage is still marked as active.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46975) Support dedicated fallback methods

2024-02-23 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-46975:


Assignee: Ruifeng Zheng

> Support dedicated fallback methods
> --
>
> Key: SPARK-46975
> URL: https://issues.apache.org/jira/browse/SPARK-46975
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47148) Avoid to materialize AQE ShuffleQueryStage on the cancellation

2024-02-23 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-47148:
---
Description: 
AQE can materialize *ShuffleQueryStage* on cancellation. This causes unnecessary stage materialization by submitting a shuffle job. Under normal circumstances, if the stage is not yet materialized (i.e. ShuffleQueryStage.shuffleFuture is not initialized yet), it should just be skipped without materializing it.

Please find a sample use case:
*1- Stage Materialization Steps:*
When stage materialization fails:
{code:java}
1.1- ShuffleQueryStage1 - materialized successfully,
1.2- ShuffleQueryStage2 - materialization failed,
1.3- ShuffleQueryStage3 - not materialized yet, so ShuffleQueryStage3.shuffleFuture is not initialized yet{code}
*2- Stage Cancellation Steps:*
{code:java}
2.1- ShuffleQueryStage1 - canceled because it is already materialized,
2.2- ShuffleQueryStage2 - is the earlyFailedStage, so it is currently skipped by default by AQE because it could not be materialized,
2.3- ShuffleQueryStage3 - the problem is here: this stage is not materialized yet, but cancellation is currently also attempted on it, which requires materializing it first.{code}

  was:
AQE can materialize *ShuffleQueryStage* on cancellation. This causes unnecessary stage materialization by submitting a shuffle job. Under normal circumstances, if the stage is not yet materialized (i.e. ShuffleQueryStage.shuffleFuture is not initialized yet), it should just be skipped without materializing it.

Please find a sample use case:
*1- Stage Materialization Steps:*
When stage materialization fails:
{code:java}
1.1- ShuffleQueryStage1 - materialized successfully,
1.2- ShuffleQueryStage2 - materialization failed,
1.3- ShuffleQueryStage3 - not materialized yet, so ShuffleQueryStage3.shuffleFuture is not initialized yet{code}
*2- Stage Cancellation Steps:*
{code:java}
2.1- ShuffleQueryStage1 - canceled because it is already materialized,
2.2- ShuffleQueryStage2 - is the earlyFailedStage, so it is currently skipped by default because it could not be materialized,
2.3- ShuffleQueryStage3 - the problem is here: this stage is not materialized yet, but cancellation is currently also attempted on it, which requires materializing it first.{code}


> Avoid to materialize AQE ShuffleQueryStage on the cancellation
> --
>
> Key: SPARK-47148
> URL: https://issues.apache.org/jira/browse/SPARK-47148
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL
>Affects Versions: 4.0.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: pull-request-available
>
> AQE can materialize *ShuffleQueryStage* on cancellation. This causes unnecessary stage materialization by submitting a shuffle job. Under normal circumstances, if the stage is not yet materialized (i.e. ShuffleQueryStage.shuffleFuture is not initialized yet), it should just be skipped without materializing it.
> Please find a sample use case:
> *1- Stage Materialization Steps:*
> When stage materialization fails:
> {code:java}
> 1.1- ShuffleQueryStage1 - materialized successfully,
> 1.2- ShuffleQueryStage2 - materialization failed,
> 1.3- ShuffleQueryStage3 - not materialized yet, so ShuffleQueryStage3.shuffleFuture is not initialized yet{code}
> *2- Stage Cancellation Steps:*
> {code:java}
> 2.1- ShuffleQueryStage1 - canceled because it is already materialized,
> 2.2- ShuffleQueryStage2 - is the earlyFailedStage, so it is currently skipped by default by AQE because it could not be materialized,
> 2.3- ShuffleQueryStage3 - the problem is here: this stage is not materialized yet, but cancellation is currently also attempted on it, which requires materializing it first.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46639) Add WindowExec SQLMetrics

2024-02-23 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-46639:
---
Description: 
Currently, WindowExec Physical Operator has only spillSize SQLMetric. This jira 
aims to add following SQLMetrics to provide more information from WindowExec 
usage during query execution:
{code:java}
numOfOutputRows: Number of total output rows.
numOfPartitions: Number of processed input partitions.
numOfWindowPartitions: Number of generated window partitions.
spilledRows: Number of total spilled rows.
spillSizeOnDisk: Total spilled data size on disk.{code}
As an example use case, WindowExec spilling behavior depends on multiple factors and can sometimes cause {{SparkOutOfMemoryError}} instead of spilling to disk, so it is hard to track without SQL Metrics such as these:
*1-* WindowExec creates an ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) per task (i.e. per child RDD partition).
*2-* When the ExternalAppendOnlyUnsafeRowArray size exceeds spark.sql.windowExec.buffer.in.memory.threshold=4096, ExternalAppendOnlyUnsafeRowArray switches to UnsafeExternalSorter as the spillableArray by moving all its buffered rows into UnsafeExternalSorter, and the ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) is cleared. In this case, WindowExec starts to write to UnsafeExternalSorter's buffer (UnsafeInMemorySorter).
*3-* An UnsafeExternalSorter is created per window partition. When UnsafeExternalSorter's buffer size exceeds spark.sql.windowExec.buffer.spill.threshold=Integer.MAX_VALUE, it spills to disk and clears all buffered (UnsafeInMemorySorter) content. UnsafeExternalSorter then continues to buffer subsequent records until spark.sql.windowExec.buffer.spill.threshold is exceeded again.

*New WindowExec SQLMetrics Sample Screenshot:*
!WindowExec SQLMetrics.png|width=257,height=152!

  was:
Currently, WindowExec Physical Operator has only spillSize SQLMetric. This jira 
aims to add following SQLMetrics to provide more information from WindowExec 
usage during query execution:
{code:java}
numOfOutputRows: Number of total output rows.
numOfPartitions: Number of processed input partitions.
numOfWindowPartitions: Number of generated window partitions.
spilledRows: Number of total spilled rows.
spillSizeOnDisk: Total spilled data size on disk.{code}
As an example use case, WindowExec spilling behavior depends on multiple factors and can sometimes cause {{SparkOutOfMemoryError}} instead of spilling to disk, so it is hard to track without SQL Metrics such as these:
*1-* WindowExec creates an ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) per task (i.e. per child RDD partition).
*2-* When the ExternalAppendOnlyUnsafeRowArray size exceeds spark.sql.windowExec.buffer.in.memory.threshold=4096, ExternalAppendOnlyUnsafeRowArray switches to UnsafeExternalSorter as the spillableArray by moving all its buffered rows into UnsafeExternalSorter, and the ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) is cleared. In this case, WindowExec starts to write to UnsafeExternalSorter's buffer (UnsafeInMemorySorter).
*3-* An UnsafeExternalSorter is created per window partition. When UnsafeExternalSorter's buffer size exceeds spark.sql.windowExec.buffer.spill.threshold=Integer.MAX_VALUE, it spills to disk and clears all buffered (UnsafeInMemorySorter) content. UnsafeExternalSorter then continues to buffer subsequent records until spark.sql.windowExec.buffer.spill.threshold is exceeded again.

Sample UI Screenshot:
!WindowExec SQLMetrics.png|width=257,height=152!
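
As a rough illustration of what declaring such metrics looks like in a physical operator, here is a hedged sketch (the metric keys are taken from the list above; the helper itself is hypothetical and is not the actual WindowExec change):
{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Hypothetical helper: builds the kind of SQLMetrics listed in the description,
// given the operator's SparkContext. Not the actual WindowExec patch.
def windowExecMetrics(sc: SparkContext): Map[String, SQLMetric] = Map(
  "numOfOutputRows"       -> SQLMetrics.createMetric(sc, "number of output rows"),
  "numOfPartitions"       -> SQLMetrics.createMetric(sc, "number of input partitions"),
  "numOfWindowPartitions" -> SQLMetrics.createMetric(sc, "number of window partitions"),
  "spilledRows"           -> SQLMetrics.createMetric(sc, "number of spilled rows"),
  "spillSizeOnDisk"       -> SQLMetrics.createSizeMetric(sc, "spill size on disk"))
{code}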


> Add WindowExec SQLMetrics
> -
>
> Key: SPARK-46639
> URL: https://issues.apache.org/jira/browse/SPARK-46639
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: pull-request-available
> Attachments: WindowExec SQLMetrics.png
>
>
> Currently, WindowExec Physical Operator has only spillSize SQLMetric. This 
> jira aims to add following SQLMetrics to provide more information from 
> WindowExec usage during query execution:
> {code:java}
> numOfOutputRows: Number of total output rows.
> numOfPartitions: Number of processed input partitions.
> numOfWindowPartitions: Number of generated window partitions.
> spilledRows: Number of total spilled rows.
> spillSizeOnDisk: Total spilled data size on disk.{code}
> As an example use-case, WindowExec spilling behavior depends on multiple 
> factors and it can sometime cause {{SparkOutOfMemoryError}} instead of 
> spilling to disk so it is hard to track without SQL Metrics such as:
> *1-* WindowExec creates ExternalAppendOnlyUnsafeRowArray (internal 
> ArrayBuffer) per task (a.k.a child RDD partition)
> *2-* When ExternalAppendOnlyUnsafeRowArray size 

[jira] [Updated] (SPARK-46639) Add WindowExec SQLMetrics

2024-02-23 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-46639:
---
Description: 
Currently, WindowExec Physical Operator has only spillSize SQLMetric. This jira 
aims to add following SQLMetrics to provide more information from WindowExec 
usage during query execution:
{code:java}
numOfOutputRows: Number of total output rows.
numOfPartitions: Number of processed input partitions.
numOfWindowPartitions: Number of generated window partitions.
spilledRows: Number of total spilled rows.
spillSizeOnDisk: Total spilled data size on disk.{code}
As an example use case, WindowExec spilling behavior depends on multiple factors and can sometimes cause {{SparkOutOfMemoryError}} instead of spilling to disk, so it is hard to track without SQL Metrics such as these:
*1-* WindowExec creates an ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) per task (i.e. per child RDD partition).
*2-* When the ExternalAppendOnlyUnsafeRowArray size exceeds spark.sql.windowExec.buffer.in.memory.threshold=4096, ExternalAppendOnlyUnsafeRowArray switches to UnsafeExternalSorter as the spillableArray by moving all its buffered rows into UnsafeExternalSorter, and the ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) is cleared. In this case, WindowExec starts to write to UnsafeExternalSorter's buffer (UnsafeInMemorySorter).
*3-* An UnsafeExternalSorter is created per window partition. When UnsafeExternalSorter's buffer size exceeds spark.sql.windowExec.buffer.spill.threshold=Integer.MAX_VALUE, it spills to disk and clears all buffered (UnsafeInMemorySorter) content. UnsafeExternalSorter then continues to buffer subsequent records until spark.sql.windowExec.buffer.spill.threshold is exceeded again.

Sample UI Screenshot:
!WindowExec SQLMetrics.png|width=257,height=152!

  was:
Currently, WindowExec Physical Operator has only spillSize SQLMetric. This jira 
aims to add following SQLMetrics to provide more information from WindowExec 
usage during query execution:
{code:java}
numOfOutputRows: Number of total output rows.
numOfPartitions: Number of processed input partitions.
numOfWindowPartitions: Number of generated window partitions.
spilledRows: Number of total spilled rows.
spillSizeOnDisk: Total spilled data size on disk.{code}
As an example use case, WindowExec spilling behavior depends on multiple factors and can sometimes cause {{SparkOutOfMemoryError}} instead of spilling to disk, so it is hard to track without SQL Metrics such as these:
*1-* WindowExec creates an ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) per task (i.e. per child RDD partition).
*2-* When the ExternalAppendOnlyUnsafeRowArray size exceeds spark.sql.windowExec.buffer.in.memory.threshold=4096, ExternalAppendOnlyUnsafeRowArray switches to UnsafeExternalSorter as the spillableArray by moving all its buffered rows into UnsafeExternalSorter, and the ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) is cleared. In this case, WindowExec starts to write to UnsafeExternalSorter's buffer (UnsafeInMemorySorter).
*3-* An UnsafeExternalSorter is created per window partition. When UnsafeExternalSorter's buffer size exceeds spark.sql.windowExec.buffer.spill.threshold=Integer.MAX_VALUE, it spills to disk and clears all buffered (UnsafeInMemorySorter) content. UnsafeExternalSorter then continues to buffer subsequent records until spark.sql.windowExec.buffer.spill.threshold is exceeded again.


> Add WindowExec SQLMetrics
> -
>
> Key: SPARK-46639
> URL: https://issues.apache.org/jira/browse/SPARK-46639
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: pull-request-available
> Attachments: WindowExec SQLMetrics.png
>
>
> Currently, WindowExec Physical Operator has only spillSize SQLMetric. This 
> jira aims to add following SQLMetrics to provide more information from 
> WindowExec usage during query execution:
> {code:java}
> numOfOutputRows: Number of total output rows.
> numOfPartitions: Number of processed input partitions.
> numOfWindowPartitions: Number of generated window partitions.
> spilledRows: Number of total spilled rows.
> spillSizeOnDisk: Total spilled data size on disk.{code}
> As an example use-case, WindowExec spilling behavior depends on multiple 
> factors and it can sometime cause {{SparkOutOfMemoryError}} instead of 
> spilling to disk so it is hard to track without SQL Metrics such as:
> *1-* WindowExec creates ExternalAppendOnlyUnsafeRowArray (internal 
> ArrayBuffer) per task (a.k.a child RDD partition)
> *2-* When ExternalAppendOnlyUnsafeRowArray size exceeds 
> spark.sql.windowExec.buffer.in.memory.threshold=4096, 
> 

[jira] [Updated] (SPARK-46639) Add WindowExec SQLMetrics

2024-02-23 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-46639:
---
Attachment: WindowExec SQLMetrics.png

> Add WindowExec SQLMetrics
> -
>
> Key: SPARK-46639
> URL: https://issues.apache.org/jira/browse/SPARK-46639
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: pull-request-available
> Attachments: WindowExec SQLMetrics.png
>
>
> Currently, WindowExec Physical Operator has only spillSize SQLMetric. This 
> jira aims to add following SQLMetrics to provide more information from 
> WindowExec usage during query execution:
> {code:java}
> numOfOutputRows: Number of total output rows.
> numOfPartitions: Number of processed input partitions.
> numOfWindowPartitions: Number of generated window partitions.
> spilledRows: Number of total spilled rows.
> spillSizeOnDisk: Total spilled data size on disk.{code}
> As an example use case, WindowExec spilling behavior depends on multiple factors and can sometimes cause {{SparkOutOfMemoryError}} instead of spilling to disk, so it is hard to track without SQL Metrics such as these:
> *1-* WindowExec creates an ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) per task (i.e. per child RDD partition).
> *2-* When the ExternalAppendOnlyUnsafeRowArray size exceeds spark.sql.windowExec.buffer.in.memory.threshold=4096, ExternalAppendOnlyUnsafeRowArray switches to UnsafeExternalSorter as the spillableArray by moving all its buffered rows into UnsafeExternalSorter, and the ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) is cleared. In this case, WindowExec starts to write to UnsafeExternalSorter's buffer (UnsafeInMemorySorter).
> *3-* An UnsafeExternalSorter is created per window partition. When UnsafeExternalSorter's buffer size exceeds spark.sql.windowExec.buffer.spill.threshold=Integer.MAX_VALUE, it spills to disk and clears all buffered (UnsafeInMemorySorter) content. UnsafeExternalSorter then continues to buffer subsequent records until spark.sql.windowExec.buffer.spill.threshold is exceeded again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47148) Avoid to materialize AQE ShuffleQueryStage on the cancellation

2024-02-23 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-47148:
---
Summary: Avoid to materialize AQE ShuffleQueryStage on the cancellation  
(was: [AQE] Avoid to materialize ShuffleQueryStage on the cancellation)

> Avoid to materialize AQE ShuffleQueryStage on the cancellation
> --
>
> Key: SPARK-47148
> URL: https://issues.apache.org/jira/browse/SPARK-47148
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL
>Affects Versions: 4.0.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: pull-request-available
>
> AQE can materialize *ShuffleQueryStage* on cancellation. This causes unnecessary stage materialization by submitting a shuffle job. Under normal circumstances, if the stage is not yet materialized (i.e. ShuffleQueryStage.shuffleFuture is not initialized yet), it should just be skipped without materializing it.
> Please find a sample use case:
> *1- Stage Materialization Steps:*
> When stage materialization fails:
> {code:java}
> 1.1- ShuffleQueryStage1 - materialized successfully,
> 1.2- ShuffleQueryStage2 - materialization failed,
> 1.3- ShuffleQueryStage3 - not materialized yet, so ShuffleQueryStage3.shuffleFuture is not initialized yet{code}
> *2- Stage Cancellation Steps:*
> {code:java}
> 2.1- ShuffleQueryStage1 - canceled because it is already materialized,
> 2.2- ShuffleQueryStage2 - is the earlyFailedStage, so it is currently skipped by default because it could not be materialized,
> 2.3- ShuffleQueryStage3 - the problem is here: this stage is not materialized yet, but cancellation is currently also attempted on it, which requires materializing it first.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47148) [AQE] Avoid to materialize ShuffleQueryStage on the cancellation

2024-02-23 Thread Eren Avsarogullari (Jira)
Eren Avsarogullari created SPARK-47148:
--

 Summary: [AQE] Avoid to materialize ShuffleQueryStage on the 
cancellation
 Key: SPARK-47148
 URL: https://issues.apache.org/jira/browse/SPARK-47148
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, SQL
Affects Versions: 4.0.0
Reporter: Eren Avsarogullari


AQE can materialize *ShuffleQueryStage* on cancellation. This causes unnecessary stage materialization by submitting a shuffle job. Under normal circumstances, if the stage is not yet materialized (i.e. ShuffleQueryStage.shuffleFuture is not initialized yet), it should just be skipped without materializing it.

Please find a sample use case:
*1- Stage Materialization Steps:*
When stage materialization fails:
{code:java}
1.1- ShuffleQueryStage1 - materialized successfully,
1.2- ShuffleQueryStage2 - materialization failed,
1.3- ShuffleQueryStage3 - not materialized yet, so ShuffleQueryStage3.shuffleFuture is not initialized yet{code}
*2- Stage Cancellation Steps:*
{code:java}
2.1- ShuffleQueryStage1 - canceled because it is already materialized,
2.2- ShuffleQueryStage2 - is the earlyFailedStage, so it is currently skipped by default because it could not be materialized,
2.3- ShuffleQueryStage3 - the problem is here: this stage is not materialized yet, but cancellation is currently also attempted on it, which requires materializing it first.{code}
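
As a generic illustration of the guard being requested (plain Scala, not Spark's actual AQE classes; the names below are made up for the sketch):
{code:scala}
import scala.concurrent.Future

// Illustrative only: a lazily started stage whose cancellation never forces materialization.
final class LazyStage[T](start: () => Future[T]) {
  @volatile private var started: Option[Future[T]] = None

  def materialize(): Future[T] = synchronized {
    started.getOrElse {
      val f = start()
      started = Some(f)
      f
    }
  }

  // Only cancel work that was actually started; a never-materialized stage is simply skipped.
  def cancelIfMaterialized(doCancel: Future[T] => Unit): Unit = started.foreach(doCancel)
}
{code}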



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47129) Make ResolveRelations cache connect plan properly

2024-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47129.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45214
[https://github.com/apache/spark/pull/45214]

> Make ResolveRelations cache connect plan properly
> -
>
> Key: SPARK-47129
> URL: https://issues.apache.org/jira/browse/SPARK-47129
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47129) Make ResolveRelations cache connect plan properly

2024-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47129:
-

Assignee: Ruifeng Zheng

> Make ResolveRelations cache connect plan properly
> -
>
> Key: SPARK-47129
> URL: https://issues.apache.org/jira/browse/SPARK-47129
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44914) Upgrade Ivy to 2.5.2

2024-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44914.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45075
[https://github.com/apache/spark/pull/45075]

> Upgrade Ivy to 2.5.2
> 
>
> Key: SPARK-44914
> URL: https://issues.apache.org/jira/browse/SPARK-44914
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bjørn Jørgensen
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47146) Possible thread leak when doing sort merge join

2024-02-23 Thread JacobZheng (Jira)
JacobZheng created SPARK-47146:
--

 Summary: Possible thread leak when doing sort merge join
 Key: SPARK-47146
 URL: https://issues.apache.org/jira/browse/SPARK-47146
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0, 3.3.0, 3.2.0
Reporter: JacobZheng


I have a long-running Spark job. I stumbled upon an executor taking up a lot of threads, resulting in no threads available on the server. Querying thread details via jstack shows tons of threads named read-ahead. Checking the code confirms that these threads are created by ReadAheadInputStream. This class creates a single-threaded thread pool when it is initialized:
{code:java}
private final ExecutorService executorService =
ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
This thread pool is closed by ReadAheadInputStream#close(). 
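
To make the leak concrete, a small hedged sketch of the pattern (assumes Spark on the classpath; this is not code from the ticket): each ReadAheadInputStream owns a read-ahead daemon thread, so a reader that is abandoned without close() leaves that thread alive.
{code:scala}
import java.io.ByteArrayInputStream
import org.apache.spark.io.ReadAheadInputStream

// Each ReadAheadInputStream creates its own single-thread "read-ahead" executor.
// If the stream is abandoned without close(), that daemon thread stays alive.
val in = new ReadAheadInputStream(new ByteArrayInputStream(new Array[Byte](1024)), 64)
try {
  in.read() // consume a little, as UnsafeSorterSpillReader would
} finally {
  in.close() // shuts down the internal executor; skipping this is the leak described above
}
{code}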

The call stack for the normal close() path is:
{code:java}
ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 in 
stage 71.0 (TID 
258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
    @org.apache.spark.io.ReadAheadInputStream.close()
        at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
        at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
        at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
        at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
        at 
org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
        at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(Thread.java:829) {code}
As shown in UnsafeSorterSpillReader#loadNext below, the stream is only closed once all of the data in the stream has been read through.
{code:java}
@Override
public void loadNext() throws IOException {
  // Kill the task in case it has been marked as killed. This logic is from
  // InterruptibleIterator, but we inline it here instead of wrapping the iterator in order
  // to avoid performance overhead. This check is added here in `loadNext()` instead of in
  // `hasNext()` because it's technically possible for the caller to be relying on
  // `getNumRecords()` instead of `hasNext()` to know when to stop.
  if (taskContext != null) {
    taskContext.killTaskIfInterrupted();
  }
  recordLength = din.readInt();
  keyPrefix = din.readLong();
  if (recordLength > arr.length) {
    arr = new byte[recordLength];
    baseObject = arr;
  }
  ByteStreams.readFully(in, arr, 0, recordLength);
  numRecordsRemaining--;
  if (numRecordsRemaining == 0) {

[jira] [Updated] (SPARK-47144) Fix Spark Connect collation issue

2024-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47144:
---
Labels: pull-request-available  (was: )

> Fix Spark Connect collation issue
> -
>
> Key: SPARK-47144
> URL: https://issues.apache.org/jira/browse/SPARK-47144
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when connecting to the server using Spark Connect:
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support 
> convert string(UCS_BASIC_LCASE) to connect proto types.{code}
> When using the default collation "UCS_BASIC", the error does not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47023) Upgrade `aircompressor` to 0.26

2024-02-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-47023:
-
Description: 
`aircompressor` is a transitive dependency from Apache ORC and Parquet.

`aircompressor` v0.26 reported the following bug fixes recently.
 - [Fix out of bounds read/write in Snappy 
decompressor]([https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2])
 - [Fix ZstdOutputStream corruption on double 
close]([https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2])

  was:
`aircompressor` is a transitive dependency from Apache ORC and Parquet.

`aircompressor` v1.26 reported the following bug fixes recently.
 
- [Fix out of bounds read/write in Snappy 
decompressor](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2)
- [Fix ZstdOutputStream corruption on double 
close](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2)


> Upgrade `aircompressor` to 0.26
> ---
>
> Key: SPARK-47023
> URL: https://issues.apache.org/jira/browse/SPARK-47023
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> `aircompressor` is a transitive dependency from Apache ORC and Parquet.
> `aircompressor` v0.26 reported the following bug fixes recently.
>  - [Fix out of bounds read/write in Snappy 
> decompressor]([https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2])
>  - [Fix ZstdOutputStream corruption on double 
> close]([https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47023) Upgrade `aircompressor` to 0.26

2024-02-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-47023:
-
Summary: Upgrade `aircompressor` to 0.26  (was: Upgrade `aircompressor` to 
1.26)

> Upgrade `aircompressor` to 0.26
> ---
>
> Key: SPARK-47023
> URL: https://issues.apache.org/jira/browse/SPARK-47023
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> `aircompressor` is a transitive dependency from Apache ORC and Parquet.
> `aircompressor` v1.26 reported the following bug fixes recently.
>  
> - [Fix out of bounds read/write in Snappy 
> decompressor](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2)
> - [Fix ZstdOutputStream corruption on double 
> close](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47145) Provide table identifier to scan node when DS v2 strategy is applied

2024-02-23 Thread Uros Stankovic (Jira)
Uros Stankovic created SPARK-47145:
--

 Summary: Provide table identifier to scan node when DS v2 strategy 
is applied
 Key: SPARK-47145
 URL: https://issues.apache.org/jira/browse/SPARK-47145
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Uros Stankovic


Currently, the DataSourceScanExec node can accept a table identifier, and that information can be useful for later logging, debugging, etc., but DataSourceV2Strategy does not provide it to the scan node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47144) Fix Spark Connect collation issue

2024-02-23 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47144:
--
Epic Link: SPARK-46830

> Fix Spark Connect collation issue
> -
>
> Key: SPARK-47144
> URL: https://issues.apache.org/jira/browse/SPARK-47144
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
> Fix For: 4.0.0
>
>
> The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when connecting to the server using Spark Connect:
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support 
> convert string(UCS_BASIC_LCASE) to connect proto types.{code}
> When using the default collation "UCS_BASIC", the error does not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47144) Fix Spark Connect collation issue

2024-02-23 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47144:
--
Component/s: SQL

> Fix Spark Connect collation issue
> -
>
> Key: SPARK-47144
> URL: https://issues.apache.org/jira/browse/SPARK-47144
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
> Fix For: 4.0.0
>
>
> The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when connecting to the server using Spark Connect:
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support 
> convert string(UCS_BASIC_LCASE) to connect proto types.{code}
> When using the default collation "UCS_BASIC", the error does not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47144) Fix Spark Connect collation issue

2024-02-23 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-47144:
-

 Summary: Fix Spark Connect collation issue
 Key: SPARK-47144
 URL: https://issues.apache.org/jira/browse/SPARK-47144
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 4.0.0
Reporter: Nikola Mandic
 Fix For: 4.0.0


The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when connecting to the server using Spark Connect:
{code:java}
pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
(org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support convert 
string(UCS_BASIC_LCASE) to connect proto types.{code}
When using the default collation "UCS_BASIC", the error does not occur.
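
A minimal repro sketch with the Spark Connect Scala client (the endpoint below is just a placeholder; the statement is the one quoted above):
{code:scala}
import org.apache.spark.sql.SparkSession

// Assumes a running Spark Connect server; "sc://localhost:15002" is a placeholder endpoint.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
spark.sql("SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'").show() // fails with InvalidPlanInput as described above
{code}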



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag

2024-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47102:
--

Assignee: (was: Apache Spark)

> Add COLLATION_ENABLED config flag
> -
>
> Key: SPARK-47102
> URL: https://issues.apache.org/jira/browse/SPARK-47102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>
> *What changes were proposed in this pull request?*
> This PR adds a COLLATION_ENABLED config to `SQLConf` and introduces a new error class `COLLATION_SUPPORT_NOT_ENABLED` to appropriately report an error on usage of a feature under development.
> *Why are the changes needed?*
> We want to make collations configurable via this flag. These changes disable usage of the `collate` and `collation` functions, along with any `COLLATE` syntax, when the flag is set to false. By default, the flag is set to false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag

2024-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47102:
--

Assignee: Apache Spark

> Add COLLATION_ENABLED config flag
> -
>
> Key: SPARK-47102
> URL: https://issues.apache.org/jira/browse/SPARK-47102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> *What changes were proposed in this pull request?*
> This PR adds the COLLATION_ENABLED config to `SQLConf` and introduces a new 
> error class, `COLLATION_SUPPORT_NOT_ENABLED`, to appropriately report an error 
> when this feature under development is used.
> *Why are the changes needed?*
> We want to make collations configurable via this flag. These changes disable 
> usage of the `collate` and `collation` functions, along with any `COLLATE` 
> syntax, when the flag is set to false. By default, the flag is set to false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47135) Implement error classes for Kafka data loss exceptions

2024-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47135:
--

Assignee: (was: Apache Spark)

> Implement error classes for Kafka data loss exceptions 
> ---
>
> Key: SPARK-47135
> URL: https://issues.apache.org/jira/browse/SPARK-47135
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: B. Micheal Okutubo
>Priority: Major
>  Labels: pull-request-available
>
> In the Kafka connector code, there are several places that throw the Java 
> *IllegalStateException* to report data loss while reading from Kafka. We 
> want to properly classify those exceptions using the new error framework. 
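
As a rough sketch of the direction (the error class name and message parameters below are hypothetical illustrations, not the names chosen by the actual change), a data-loss site that currently throws IllegalStateException could instead throw a classified SparkException:
{code:scala}
// Hypothetical sketch: the error class string and parameters are illustrative
// only; the real ones would be registered in the error framework's JSON file.
import org.apache.spark.SparkException

def dataLossError(topicPartition: String, startOffset: Long, endOffset: Long): SparkException =
  new SparkException(
    errorClass = "KAFKA_DATA_LOSS.OFFSETS_OUT_OF_RANGE",  // hypothetical name
    messageParameters = Map(
      "topicPartition" -> topicPartition,
      "startOffset" -> startOffset.toString,
      "endOffset" -> endOffset.toString),
    cause = null)

// Call sites would then replace
//   throw new IllegalStateException(s"Offsets out of range for $tp ...")
// with
//   throw dataLossError(tp.toString, start, end)
{code}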



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47135) Implement error classes for Kafka data loss exceptions

2024-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47135:
--

Assignee: Apache Spark

> Implement error classes for Kafka data loss exceptions 
> ---
>
> Key: SPARK-47135
> URL: https://issues.apache.org/jira/browse/SPARK-47135
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: B. Micheal Okutubo
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> In the Kafka connector code, there are several places that throw the Java 
> *IllegalStateException* to report data loss while reading from Kafka. We 
> want to properly classify those exceptions using the new error framework. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag

2024-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47102:
--

Assignee: Apache Spark

> Add COLLATION_ENABLED config flag
> -
>
> Key: SPARK-47102
> URL: https://issues.apache.org/jira/browse/SPARK-47102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> *What changes were proposed in this pull request?*
> This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error 
> class `COLLATION_SUPPORT_DISABLED` to appropriately report error on usage of 
> feature under development. 
> *Why are the changes needed?*
> We want to make collations configurable on this flag. These changes 
> disable usage of `collate` and `collation` functions, along with any 
> `COLLATE` syntax when the flag is set to false. By default, the flag is set 
> to false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47102) Add COLLATION_ENABLED config flag

2024-02-23 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-47102:
--
Description: 
*What changes were proposed in this pull request?*
This PR adds the COLLATION_ENABLED config to `SQLConf` and introduces a new error 
class, `COLLATION_SUPPORT_NOT_ENABLED`, to appropriately report an error when this 
feature under development is used.

*Why are the changes needed?*
We want to make collations configurable via this flag. These changes disable usage 
of the `collate` and `collation` functions, along with any `COLLATE` syntax, when 
the flag is set to false. By default, the flag is set to false.

  was:
*What changes were proposed in this pull request?*
This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error 
class `COLLATION_SUPPORT_DISABLED` to appropriately report error on usage of 
feature under development. 

*Why are the changes needed?*
We want to make collations configurable on this some flag. These changes 
disable usage of `collate` and `collation` functions, along with any `COLLATE` 
syntax when the flag is set to false. By default, the flag is set to false.


> Add COLLATION_ENABLED config flag
> -
>
> Key: SPARK-47102
> URL: https://issues.apache.org/jira/browse/SPARK-47102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>
> *What changes were proposed in this pull request?*
> This PR adds the COLLATION_ENABLED config to `SQLConf` and introduces a new 
> error class, `COLLATION_SUPPORT_NOT_ENABLED`, to appropriately report an error 
> when this feature under development is used.
> *Why are the changes needed?*
> We want to make collations configurable via this flag. These changes disable 
> usage of the `collate` and `collation` functions, along with any `COLLATE` 
> syntax, when the flag is set to false. By default, the flag is set to false.
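
For illustration, a sketch of how the flag would be exercised from a Scala session; the SQLConf key name used below is a guess, since the ticket does not spell it out:
{code:scala}
// Sketch under assumptions: "spark.sql.collation.enabled" is a hypothetical
// key name for the COLLATION_ENABLED SQLConf entry described above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Default: flag off, so any COLLATE syntax or collate()/collation() call
// should fail with COLLATION_SUPPORT_NOT_ENABLED.
// spark.sql("SELECT collation('abc')").show()

// With the flag on, collation features become available.
spark.conf.set("spark.sql.collation.enabled", "true")  // hypothetical key
spark.sql("SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'").show()
{code}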



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag

2024-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47102:
--

Assignee: (was: Apache Spark)

> Add COLLATION_ENABLED config flag
> -
>
> Key: SPARK-47102
> URL: https://issues.apache.org/jira/browse/SPARK-47102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>
> *What changes were proposed in this pull request?*
> This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error 
> class `COLLATION_SUPPORT_DISABLED` to appropriately report error on usage of 
> feature under development. 
> *Why are the changes needed?*
> We want to make collations configurable on this flag. These changes 
> disable usage of `collate` and `collation` functions, along with any 
> `COLLATE` syntax when the flag is set to false. By default, the flag is set 
> to false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-23 Thread Chhavi Bansal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chhavi Bansal closed SPARK-47104.
-

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.2.1, 3.4.2, 3.5.0
>Reporter: Chhavi Bansal
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> I am trying to run a very simple SQL query involving a join and an order by 
> clause, and then using the UUID() function in the outermost select statement. 
> The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> The dataset is a normal csv file with the following columns:
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at 
> hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala) {code}
> Note:
>  # If I remove the order by clause, the query produces the correct output.
>  # This happens when I read the dataset from a csv file; it works fine if I 
> build the dataframe using Seq().toDF
>  # The query fails if I use spark.sql("query").show() but succeeds when I 
> simply write the result to a csv file
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Could someone please look into why this happens only when using `show()`, since 
> this is failing queries in production for me.
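
For reference, a condensed, untested Scala sketch of the contrast described in notes 2 and 3 above, reduced to a single column; the behavior annotations follow the report rather than a fresh verification:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Case reported to fail on show(): CSV-backed temp view.
spark.read.format("csv").option("header", "true")
  .load("src/main/resources/titanic.csv")
  .createOrReplaceTempView("titanic_csv")

// Case reported to work: in-memory view built with toDF.
Seq("Alice", "Bob").toDF("Name").createOrReplaceTempView("titanic_mem")

def q(view: String): String =
  "select name, uuid() as _iid from " +
  s"(select s.name from $view s join $view t on s.name = t.name order by name)"

spark.sql(q("titanic_mem")).show()  // works per the report
spark.sql(q("titanic_csv")).show()  // NullPointerException per the report
{code}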



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-47104) Spark SQL query fails with NullPointerException

2024-02-23 Thread Chhavi Bansal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819970#comment-17819970
 ] 

Chhavi Bansal commented on SPARK-47104:
---

Thanks Team for looking into the issue and rolling out a fix.

> Spark SQL query fails with NullPointerException
> ---
>
> Key: SPARK-47104
> URL: https://issues.apache.org/jira/browse/SPARK-47104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.2.1, 3.4.2, 3.5.0
>Reporter: Chhavi Bansal
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> I am trying to run a very simple SQL query involving a join and an order by 
> clause, and then using the UUID() function in the outermost select statement. 
> The query fails:
> {code:java}
> val df = spark.read.format("csv").option("header", 
> "true").load("src/main/resources/titanic.csv")
> df.createOrReplaceTempView("titanic")
> val query = spark.sql(" select name, uuid() as _iid from (select s.name from 
> titanic s join titanic t on s.name = t.name order by name) ;") 
> query.show() // FAILS{code}
> The dataset is a normal csv file with the following columns:
> {code:java}
> PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
>  {code}
> Below is the error
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at scala.collection.TraversableLike.map(TraversableLike.scala:237)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
> at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366)
> at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338)
> at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
> at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
> at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
> at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:785)
> at 
> hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14)
> at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6)
> at scala.Function0.apply$mcV$sp(Function0.scala:39)
> at scala.Function0.apply$mcV$sp$(Function0.scala:39)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
> at scala.App.$anonfun$main$1$adapted(App.scala:80)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala) {code}
> Note:
>  # If I remove the order by clause, the query produces the correct output.
>  # This happens when I read the dataset from a csv file; it works fine if I 
> build the dataframe using Seq().toDF
>  # The query fails if I use spark.sql("query").show() but succeeds when I 
> simply write the result to a csv file
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Could someone please look into why this happens only when using `show()`, since 
> this is failing queries in production for me.