[jira] [Updated] (SPARK-41873) Implement DataFrame `pandas_api`

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-41873:
---
Labels: pull-request-available  (was: )

> Implement DataFrame `pandas_api`
> 
>
> Key: SPARK-41873
> URL: https://issues.apache.org/jira/browse/SPARK-41873
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47182) Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro-*`

2024-02-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47182.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45278
[https://github.com/apache/spark/pull/45278]

> Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` 
> and `avro-*`
> -
>
> Key: SPARK-47182
> URL: https://issues.apache.org/jira/browse/SPARK-47182
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-47176) Have a ResolveAllExpressionsUpWithPruning helper function

2024-02-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47176.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45270
[https://github.com/apache/spark/pull/45270]

> Have a ResolveAllExpressionsUpWithPruning helper function
> -
>
> Key: SPARK-47176
> URL: https://issues.apache.org/jira/browse/SPARK-47176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47182) Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro-*`

2024-02-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47182:
-

Assignee: Dongjoon Hyun

> Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` 
> and `avro-*`
> -
>
> Key: SPARK-47182
> URL: https://issues.apache.org/jira/browse/SPARK-47182
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47182) Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro-*`

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47182:
---
Labels: pull-request-available  (was: )

> Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` 
> and `avro-*`
> -
>
> Key: SPARK-47182
> URL: https://issues.apache.org/jira/browse/SPARK-47182
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47182) Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro-*`

2024-02-26 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47182:
-

 Summary: Exclude `commons-(io|lang3)` transitive dependencies from 
`commons-compress` and `avro-*`
 Key: SPARK-47182
 URL: https://issues.apache.org/jira/browse/SPARK-47182
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-41811) Implement SparkSession.sql's string formatter

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-41811:
---
Labels: pull-request-available  (was: )

> Implement SparkSession.sql's string formatter
> -
>
> Key: SPARK-41811
> URL: https://issues.apache.org/jira/browse/SPARK-41811
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> {code}
> **********************************************************************
> File "/.../spark/python/pyspark/sql/connect/session.py", line 345, in pyspark.sql.connect.session.SparkSession.sql
> Failed example:
>     spark.sql(
>         "SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}",
>         bound1=7, bound2=9
>     ).show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 1336, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest ...>", line 1, in <module>
>         spark.sql(
>     TypeError: sql() got an unexpected keyword argument 'bound1'
> **********************************************************************
> File "/.../spark/python/pyspark/sql/connect/session.py", line 355, in pyspark.sql.connect.session.SparkSession.sql
> Failed example:
>     spark.sql(
>         "SELECT {col} FROM {mydf} WHERE id IN {x}",
>         col=mydf.id, mydf=mydf, x=tuple(range(4))).show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 1336, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest ...>", line 1, in <module>
>         spark.sql(
>     TypeError: sql() got an unexpected keyword argument 'col'
> {code}






[jira] [Assigned] (SPARK-47181) Fix `MasterSuite` to validate the number of registered workers

2024-02-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47181:
-

Assignee: Dongjoon Hyun

> Fix `MasterSuite` to validate the number of registered workers
> --
>
> Key: SPARK-47181
> URL: https://issues.apache.org/jira/browse/SPARK-47181
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47181) Fix `MasterSuite` to validate the number of registered workers

2024-02-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47181.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45274
[https://github.com/apache/spark/pull/45274]

> Fix `MasterSuite` to validate the number of registered workers
> --
>
> Key: SPARK-47181
> URL: https://issues.apache.org/jira/browse/SPARK-47181
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47178) Add a test case for createDataFrame with dataclasses

2024-02-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47178:


Assignee: Hyukjin Kwon

> Add a test case for createDataFrame with dataclasses
> 
>
> Key: SPARK-47178
> URL: https://issues.apache.org/jira/browse/SPARK-47178
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47178) Add a test case for createDataFrame with dataclasses

2024-02-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47178.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45271
[https://github.com/apache/spark/pull/45271]

> Add a test case for createDataFrame with dataclasses
> 
>
> Key: SPARK-47178
> URL: https://issues.apache.org/jira/browse/SPARK-47178
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47181) Fix `MasterSuite` to validate the number of registered workers

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47181:
---
Labels: pull-request-available  (was: )

> Fix `MasterSuite` to validate the number of registered workers
> --
>
> Key: SPARK-47181
> URL: https://issues.apache.org/jira/browse/SPARK-47181
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47181) Fix `MasterSuite` to validate the number of registered workers

2024-02-26 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47181:
-

 Summary: Fix `MasterSuite` to validate the number of registered 
workers
 Key: SPARK-47181
 URL: https://issues.apache.org/jira/browse/SPARK-47181
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Tests
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Created] (SPARK-47180) Migrate CSV parsing off of Univocity

2024-02-26 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47180:


 Summary: Migrate CSV parsing off of Univocity
 Key: SPARK-47180
 URL: https://issues.apache.org/jira/browse/SPARK-47180
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


Univocity appears to be unmaintained.

As of February 2024:
 * The last release was [more than 3 years 
ago|https://github.com/uniVocity/univocity-parsers/releases].
 * The last commit to {{master}} was [almost 3 years 
ago|https://github.com/uniVocity/univocity-parsers/commits/master/].
 * The website is 
[down|https://github.com/uniVocity/univocity-parsers/issues/506].
 * There are 
[multiple|https://github.com/uniVocity/univocity-parsers/issues/494] 
[open|https://github.com/uniVocity/univocity-parsers/issues/495] 
[bugs|https://github.com/uniVocity/univocity-parsers/issues/499] on the tracker 
with no indication that anyone cares.

It's not urgent, but we should consider migrating to an actively maintained CSV 
library in the JVM ecosystem.

There are a bunch of libraries [listed here on this Maven 
Repository|https://mvnrepository.com/open-source/csv-libraries].

[jackson-dataformats-text|https://github.com/FasterXML/jackson-dataformats-text]
 looks interesting. I know we already use FasterXML to parse JSON. Perhaps we 
should use them to parse CSV as well.
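
For a feel of that API, here is a minimal sketch of header-aware CSV parsing with jackson-dataformats-text (the file name is a placeholder, and this is not a proposed Spark integration):
{code:scala}
import com.fasterxml.jackson.dataformat.csv.{CsvMapper, CsvSchema}
import java.io.File

object JacksonCsvSketch extends App {
  val mapper = new CsvMapper()
  // Use the first CSV row as the header and read each record as a Map.
  val schema = CsvSchema.emptySchema().withHeader()
  val rows = mapper
    .readerFor(classOf[java.util.Map[String, String]])
    .`with`(schema)
    .readValues[java.util.Map[String, String]](new File("people.csv"))
  while (rows.hasNext) println(rows.next())
}
{code}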

I'm guessing we chose Univocity back in the day because it was the fastest CSV 
library on the JVM. However, the last performance benchmark comparing it to 
others was [from February 
2018|https://github.com/uniVocity/csv-parsers-comparison/blob/5548b52f2cc27eb19c11464e9a331491e8ad4ba6/README.md#statistics-updated-28th-of-february-2018],
 so this may no longer be true.






[jira] [Resolved] (SPARK-47166) Improve merge_spark_pr.py by emphasising input and error

2024-02-26 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47166.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45256
[https://github.com/apache/spark/pull/45256]

> Improve merge_spark_pr.py by emphasising input and error
> 
>
> Key: SPARK-47166
> URL: https://issues.apache.org/jira/browse/SPARK-47166
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47179) Improve error message from SparkThrowableSuite for better debuggability

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47179:
---
Labels: pull-request-available  (was: )

> Improve error message from SparkThrowableSuite for better debuggability
> ---
>
> Key: SPARK-47179
> URL: https://issues.apache.org/jira/browse/SPARK-47179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> The current error message is not very helpful when the error-class 
> documentation is not up to date, so we should improve it.






[jira] [Created] (SPARK-47179) Improve error message from SparkThrowableSuite for better debuggability

2024-02-26 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-47179:
---

 Summary: Improve error message from SparkThrowableSuite for better 
debuggability
 Key: SPARK-47179
 URL: https://issues.apache.org/jira/browse/SPARK-47179
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Haejoon Lee


The current error message is not very helpful when the error-class 
documentation is not up to date, so we should improve it.






[jira] [Updated] (SPARK-47178) Add a test case for createDataFrame with dataclasses

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47178:
---
Labels: pull-request-available  (was: )

> Add a test case for createDataFrame with dataclasses
> 
>
> Key: SPARK-47178
> URL: https://issues.apache.org/jira/browse/SPARK-47178
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47178) Add a test case for createDataFrame with dataclasses

2024-02-26 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-47178:


 Summary: Add a test case for createDataFrame with dataclasses
 Key: SPARK-47178
 URL: https://issues.apache.org/jira/browse/SPARK-47178
 Project: Spark
  Issue Type: Test
  Components: PySpark, Tests
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon









[jira] [Resolved] (SPARK-47175) Remove ZOOKEEPER-1844 comment from KafkaTestUtils

2024-02-26 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47175.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45265
[https://github.com/apache/spark/pull/45265]

> Remove ZOOKEEPER-1844 comment from KafkaTestUtils
> -
>
> Key: SPARK-47175
> URL: https://issues.apache.org/jira/browse/SPARK-47175
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47175) Remove ZOOKEEPER-1844 comment from KafkaTestUtils

2024-02-26 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47175:


Assignee: Dongjoon Hyun

> Remove ZOOKEEPER-1844 comment from KafkaTestUtils
> -
>
> Key: SPARK-47175
> URL: https://issues.apache.org/jira/browse/SPARK-47175
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-47165) Pull docker image only when it's absent

2024-02-26 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47165:


Assignee: Kent Yao

> Pull docker image only when it's absent
> ---
>
> Key: SPARK-47165
> URL: https://issues.apache.org/jira/browse/SPARK-47165
> Project: Spark
>  Issue Type: Test
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47165) Pull docker image only when it's absent

2024-02-26 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47165.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45255
[https://github.com/apache/spark/pull/45255]

> Pull docker image only when it's absent
> ---
>
> Key: SPARK-47165
> URL: https://issues.apache.org/jira/browse/SPARK-47165
> Project: Spark
>  Issue Type: Test
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-47164) Make Default Value From Wider Type Narrow Literal of v2 behave the same as v1

2024-02-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47164.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45254
[https://github.com/apache/spark/pull/45254]

> Make Default Value From Wider Type Narrow Literal of v2 behave the same as v1
> -
>
> Key: SPARK-47164
> URL: https://issues.apache.org/jira/browse/SPARK-47164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47176) Have a ResolveAllExpressionsUpWithPruning helper function

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47176:
---
Labels: pull-request-available  (was: )

> Have a ResolveAllExpressionsUpWithPruning helper function
> -
>
> Key: SPARK-47176
> URL: https://issues.apache.org/jira/browse/SPARK-47176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47177) Cached SQL plan does not display final AQE plan in explain string

2024-02-26 Thread Ziqi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziqi Liu updated SPARK-47177:
-
Description: 
AQE plans are expected to display the final plan after execution. This is not 
true for a cached SQL plan: it shows the initial plan instead. This behavior 
change was introduced in [https://github.com/apache/spark/pull/40812], which 
tried to fix a concurrency issue with cached plans. I don't have a clear fix 
yet; maybe we can check whether the AQE plan is finalized (after making the 
final flag atomic, of course): if it isn't, we can return the cloned one; 
otherwise it's thread-safe to return the final one, since it's immutable.
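
A minimal sketch of that idea, with hypothetical names (Spark's actual `AdaptiveSparkPlanExec` fields differ):
{code:scala}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical holder; `finalized` stands in for an atomic "final plan" flag.
class CachedPlanHolder[P](plan: P, finalized: AtomicBoolean) {
  // Once finalized the plan is immutable, so returning it is thread-safe;
  // before that, hand out a clone to avoid racing with AQE re-optimization.
  def planForExplain(cloned: () => P): P =
    if (finalized.get()) plan else cloned()
}
{code}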

 

A simple repro:
{code:java}
d1 = spark.range(1000).withColumn("key", expr("id % 100")).groupBy("key").agg({"key": "count"})
cached_d2 = d1.cache()
df = cached_d2.withColumn("key2", expr("key % 10")).groupBy("key2").agg({"key2": "count"})
df.collect() {code}
{code:java}
>>> df.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- AQEShuffleRead coalesced
      +- ShuffleQueryStage 1
         +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=83]
            +- *(1) HashAggregate(keys=[key2#36L], 
functions=[partial_count(key2#36L)])
               +- *(1) Project [(key#27L % 10) AS key2#36L]
                  +- TableCacheQueryStage 0
                     +- InMemoryTableScan [key#27L]
                           +- InMemoryRelation [key#27L, count(key)#33L], 
StorageLevel(disk, memory, deserialized, 1 replicas)
                                 +- AdaptiveSparkPlan isFinalPlan=false
                                    +- HashAggregate(keys=[key#4L], 
functions=[count(key#4L)])
                                       +- Exchange hashpartitioning(key#4L, 
200), ENSURE_REQUIREMENTS, [plan_id=33]
                                          +- HashAggregate(keys=[key#4L], 
functions=[partial_count(key#4L)])
                                             +- Project [(id#2L % 100) AS 
key#4L]
                                                +- Range (0, 1000, step=1, 
splits=10)
+- == Initial Plan ==
   HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=30]
      +- HashAggregate(keys=[key2#36L], functions=[partial_count(key2#36L)])
         +- Project [(key#27L % 10) AS key2#36L]
            +- InMemoryTableScan [key#27L]
                  +- InMemoryRelation [key#27L, count(key)#33L], 
StorageLevel(disk, memory, deserialized, 1 replicas)
                        +- AdaptiveSparkPlan isFinalPlan=false
                           +- HashAggregate(keys=[key#4L], 
functions=[count(key#4L)])
                              +- Exchange hashpartitioning(key#4L, 200), 
ENSURE_REQUIREMENTS, [plan_id=33]
                                 +- HashAggregate(keys=[key#4L], 
functions=[partial_count(key#4L)])
                                    +- Project [(id#2L % 100) AS key#4L]
                                       +- Range (0, 1000, step=1, splits=10) 
{code}

  was:
AQE plans are expected to display the final plan after execution. This is not 
true for a cached SQL plan: it shows the initial plan instead. This behavior 
change was introduced in [https://github.com/apache/spark/pull/40812], which 
tried to fix a concurrency issue with cached plans. I don't have a clear fix 
yet; maybe we can check whether the AQE plan is finalized (after making the 
final flag atomic, of course): if it isn't, we can return the cloned one; 
otherwise it's thread-safe to return the final one, since it's immutable.

 

A simple repro:
{code:java}
d1 = spark.range(1000).withColumn("key", expr("id % 100")).groupBy("key").agg({"key": "count"})
cached_d2 = d1.cache()
df = cached_d2.withColumn("key2", expr("key % 10")).groupBy("key2").agg({"key2": "count"})
df.collect() {code}
{code:java}
Row(key2=7, count(key2)=10), Row(key2=3, count(key2)=10), Row(key2=1, 
count(key2)=10), Row(key2=8, count(key2)=10)]
>>> df.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- AQEShuffleRead coalesced
      +- ShuffleQueryStage 1
         +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=83]
            +- *(1) HashAggregate(keys=[key2#36L], 
functions=[partial_count(key2#36L)])
               +- *(1) Project [(key#27L % 10) AS key2#36L]
                  +- TableCacheQueryStage 0
                     +- InMemoryTableScan [key#27L]
                           +- InMemoryRelation [key#27L, count(key)#33L], 
StorageLevel(disk, memory, deserialized, 1 replicas)
                                 +- AdaptiveSparkPlan isFinalPlan=false
                                    +- HashAggregate(keys=[key#4L], 

[jira] [Created] (SPARK-47177) Cached SQL plan does not display final AQE plan in explain string

2024-02-26 Thread Ziqi Liu (Jira)
Ziqi Liu created SPARK-47177:


 Summary: Cached SQL plan does not display final AQE plan in explain string
 Key: SPARK-47177
 URL: https://issues.apache.org/jira/browse/SPARK-47177
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.1, 3.5.0, 3.4.2, 4.0.0, 3.5.2
Reporter: Ziqi Liu


AQE plans are expected to display the final plan after execution. This is not 
true for a cached SQL plan: it shows the initial plan instead. This behavior 
change was introduced in [https://github.com/apache/spark/pull/40812], which 
tried to fix a concurrency issue with cached plans. I don't have a clear fix 
yet; maybe we can check whether the AQE plan is finalized (after making the 
final flag atomic, of course): if it isn't, we can return the cloned one; 
otherwise it's thread-safe to return the final one, since it's immutable.

 

A simple repro:
{code:java}
d1 = spark.range(1000).withColumn("key", expr("id % 100")).groupBy("key").agg({"key": "count"})
cached_d2 = d1.cache()
df = cached_d2.withColumn("key2", expr("key % 10")).groupBy("key2").agg({"key2": "count"})
df.collect() {code}
{code:java}
Row(key2=7, count(key2)=10), Row(key2=3, count(key2)=10), Row(key2=1, 
count(key2)=10), Row(key2=8, count(key2)=10)]
>>> df.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- AQEShuffleRead coalesced
      +- ShuffleQueryStage 1
         +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=83]
            +- *(1) HashAggregate(keys=[key2#36L], 
functions=[partial_count(key2#36L)])
               +- *(1) Project [(key#27L % 10) AS key2#36L]
                  +- TableCacheQueryStage 0
                     +- InMemoryTableScan [key#27L]
                           +- InMemoryRelation [key#27L, count(key)#33L], 
StorageLevel(disk, memory, deserialized, 1 replicas)
                                 +- AdaptiveSparkPlan isFinalPlan=false
                                    +- HashAggregate(keys=[key#4L], 
functions=[count(key#4L)])
                                       +- Exchange hashpartitioning(key#4L, 
200), ENSURE_REQUIREMENTS, [plan_id=33]
                                          +- HashAggregate(keys=[key#4L], 
functions=[partial_count(key#4L)])
                                             +- Project [(id#2L % 100) AS 
key#4L]
                                                +- Range (0, 1000, step=1, 
splits=10)
+- == Initial Plan ==
   HashAggregate(keys=[key2#36L], functions=[count(key2#36L)])
   +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, 
[plan_id=30]
      +- HashAggregate(keys=[key2#36L], functions=[partial_count(key2#36L)])
         +- Project [(key#27L % 10) AS key2#36L]
            +- InMemoryTableScan [key#27L]
                  +- InMemoryRelation [key#27L, count(key)#33L], 
StorageLevel(disk, memory, deserialized, 1 replicas)
                        +- AdaptiveSparkPlan isFinalPlan=false
                           +- HashAggregate(keys=[key#4L], 
functions=[count(key#4L)])
                              +- Exchange hashpartitioning(key#4L, 200), 
ENSURE_REQUIREMENTS, [plan_id=33]
                                 +- HashAggregate(keys=[key#4L], 
functions=[partial_count(key#4L)])
                                    +- Project [(id#2L % 100) AS key#4L]
                                       +- Range (0, 1000, step=1, splits=10) 
{code}






[jira] [Created] (SPARK-47176) Have a ResolveAllExpressionsUpWithPruning helper function

2024-02-26 Thread Rui Wang (Jira)
Rui Wang created SPARK-47176:


 Summary: Have a ResolveAllExpressionsUpWithPruning helper function
 Key: SPARK-47176
 URL: https://issues.apache.org/jira/browse/SPARK-47176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Rui Wang
Assignee: Rui Wang









[jira] [Updated] (SPARK-47094) SPJ : Dynamically rebalance number of buckets when they are not equal

2024-02-26 Thread Szehon Ho (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated SPARK-47094:
--
Parent: SPARK-37375
Issue Type: Sub-task  (was: New Feature)

> SPJ : Dynamically rebalance number of buckets when they are not equal
> -
>
> Key: SPARK-47094
> URL: https://issues.apache.org/jira/browse/SPARK-47094
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Himadri Pal
>Priority: Major
>  Labels: pull-request-available
>
> SPJ (Storage Partition Join) works with Iceberg tables when both tables have 
> the same number of buckets. As part of this feature request, we would like 
> Spark to gather the bucket counts from both tables and dynamically rebalance 
> the number of buckets by coalescing or repartitioning, so that SPJ still 
> applies. In this case we would still have to shuffle, but it would be better 
> than no SPJ at all.
> Use case:
> Often we do not control the input tables, so it is not possible to change 
> their partitioning scheme. As consumers, we would still like SPJ to apply 
> when such tables are joined with other tables and output tables that have a 
> different number of buckets.
> In that scenario, we would need to read those tables and rewrite them with a 
> matching number of buckets for SPJ to work; this extra step could outweigh 
> the benefit of the shuffle that SPJ avoids. Also, when multiple different 
> tables are joined, each table needs to be rewritten with a matching number 
> of buckets.
> If this feature is implemented, SPJ will be more powerful.
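
For reference, a hedged sketch of how SPJ is switched on today (spark-shell style; the table names are placeholders, and `spark.sql.sources.v2.bucketing.enabled` is the existing SPJ flag this request would build on):
{code:scala}
scala> spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")

scala> // SPJ currently requires both sides to report the same bucket count;
scala> // this ticket asks Spark to coalesce/repartition one side when they differ.
scala> val joined = spark.table("cat.db.orders").join(spark.table("cat.db.customers"), "id")
{code}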






[jira] [Updated] (SPARK-47094) SPJ : Dynamically rebalance number of buckets when they are not equal

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47094:
---
Labels: pull-request-available  (was: )

> SPJ : Dynamically rebalance number of buckets when they are not equal
> -
>
> Key: SPARK-47094
> URL: https://issues.apache.org/jira/browse/SPARK-47094
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Himadri Pal
>Priority: Major
>  Labels: pull-request-available
>
> SPJ (Storage Partition Join) works with Iceberg tables when both tables have 
> the same number of buckets. As part of this feature request, we would like 
> Spark to gather the bucket counts from both tables and dynamically rebalance 
> the number of buckets by coalescing or repartitioning, so that SPJ still 
> applies. In this case we would still have to shuffle, but it would be better 
> than no SPJ at all.
> Use case:
> Often we do not control the input tables, so it is not possible to change 
> their partitioning scheme. As consumers, we would still like SPJ to apply 
> when such tables are joined with other tables and output tables that have a 
> different number of buckets.
> In that scenario, we would need to read those tables and rewrite them with a 
> matching number of buckets for SPJ to work; this extra step could outweigh 
> the benefit of the shuffle that SPJ avoids. Also, when multiple different 
> tables are joined, each table needs to be rewritten with a matching number 
> of buckets.
> If this feature is implemented, SPJ will be more powerful.






[jira] [Comment Edited] (SPARK-24815) Structured Streaming should support dynamic allocation

2024-02-26 Thread Mich Talebzadeh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17820915#comment-17820915
 ] 

Mich Talebzadeh edited comment on SPARK-24815 at 2/26/24 11:58 PM:
---

Now that the ticket is reopened, let us review the submitted documents. It has 
six votes as of today. I volunteered to mentor it until a committer comes 
forward. I hope this helps speed up the process and time to delivery.


was (Author: mich.talebza...@gmail.com):
Now that the ticket is reopened let us review the submitted documents. This has 
got 6 votes for now. I volunteered to mentor it until a committer comes forward 
to it. Hope this helps to speed up the process and time to delivery.

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>  Labels: pull-request-available
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in (see the 
> configuration sketch after this list). It requests more executors if the task 
> backlog reaches a certain size, and removes executors if they sit idle for a 
> certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled
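
A minimal sketch of the configuration in question, with illustrative values (these are the standard batch dynamic-allocation settings, not a proposed streaming design):
{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  // Batch heuristics that also kick in for a structured streaming job:
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s") // grow on task backlog
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")    // shrink idle executors
{code}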






[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation

2024-02-26 Thread Mich Talebzadeh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17820915#comment-17820915
 ] 

Mich Talebzadeh commented on SPARK-24815:
-

Now that the ticket is reopened, let us review the submitted documents. It has 
six votes for now. I volunteered to mentor it until a committer comes forward. 
I hope this helps speed up the process and time to delivery.

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>  Labels: pull-request-available
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled






[jira] [Resolved] (SPARK-44400) Improve Scala StreamingQueryListener to provide users a way to access the Spark session for Spark Connect

2024-02-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44400.
--
Resolution: Duplicate

> Improve Scala StreamingQueryListener to provide users a way to access the 
> Spark session for Spark Connect
> -
>
> Key: SPARK-44400
> URL: https://issues.apache.org/jira/browse/SPARK-44400
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Bo Gao
>Priority: Major
>
> Improve the Listener to provide users a way to access the Spark session and 
> perform arbitrary actions inside the Listener. Right now users can use `val 
> spark = SparkSession.builder.getOrCreate()` to create a Spark session inside 
> the Listener, but this is a legacy session instead of a connect remote 
> session.
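
A short sketch of the distinction (the URL is a placeholder): with the Spark Connect Scala client, a remote session is built explicitly against a server endpoint, which is exactly the handle listener code lacks today:
{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical endpoint; builder-based lookup inside a listener yields a
// legacy session, while a Connect session is tied to a remote URL like this.
val spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
{code}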






[jira] [Resolved] (SPARK-44462) Fix the session passed to foreachBatch.

2024-02-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44462.
--
Resolution: Duplicate

> Fix the session passed to foreachBatch. 
> 
>
> Key: SPARK-44462
> URL: https://issues.apache.org/jira/browse/SPARK-44462
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.4.1
>Reporter: Raghu Angadi
>Priority: Major
>
> foreachBatch() in Connect uses the initial session used while starting the 
> streaming query. But the streaming query uses a cloned session, not the 
> original session. We should set up the mapping for the cloned session and 
> pass that in. Look for this ticket ID in the code for more context inline. 
>  
> Another issue with not creating a new session ID: the foreachBatch worker 
> keeps the session alive. The session mapping at the Connect server does not 
> expire, so the query keeps running even if the original client disappears.






[jira] [Updated] (SPARK-39771) If spark.default.parallelism is unset, RDD defaultPartitioner may pick a value that is too large to successfully run

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-39771:
---
Labels: pull-request-available  (was: )

> If spark.default.parallelism is unset, RDD defaultPartitioner may pick a 
> value that is too large to successfully run
> 
>
> Key: SPARK-39771
> URL: https://issues.apache.org/jira/browse/SPARK-39771
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Josh Rosen
>Priority: Major
>  Labels: pull-request-available
>
> [According to its 
> docs|https://github.com/apache/spark/blob/899f6c90eb2de5b46a36710a131d7417010ce4b3/core/src/main/scala/org/apache/spark/Partitioner.scala#L45-L65],
>  {{Partitioner.defaultPartitioner}} will use the maximum number of RDD 
> partitions as its partition count when {{spark.default.parallelism}} is not 
> set. If that number of upstream partitions is very large then this can result 
> in shuffles where {{{}numMappers * numReducers = numMappers^2{}}}, which can 
> cause various problems that prevent the job from successfully running.
> To help users identify when they have run into this problem, I think we 
> should add warning logs to Spark.
> As an example of the problem, let's say that I have an RDD with 100,000 
> partitions and then do a {{reduceByKey}} on it without specifying an explicit 
> partitioner or partition count. In this case, Spark will plan a reduce stage 
> with 100,000 partitions:
> {code:java}
> scala> sc.parallelize(1 to 100000, 100000).map(x => (x, x)).reduceByKey(_ + _).toDebugString
> res7: String =
> (100000) ShuffledRDD[21] at reduceByKey at <console>:25 []
>  +-(100000) MapPartitionsRDD[20] at map at <console>:25 []
>     |  ParallelCollectionRDD[19] at parallelize at <console>:25 []
> {code}
> This results in the creation of 10 billion shuffle blocks, so if this job 
> _does_ run it is likely to be extremely slow. However, it's more likely that 
> the driver will crash when serializing map output statuses: if we were able 
> to use one bit per mapper / reducer pair (which is probably overly optimistic 
> in terms of compressibility) then the map statuses would be ~1.25 gigabytes 
> (and the actual size is probably much larger)!
> I don't think that users are likely to intentionally wind up in this 
> scenario: it's more likely that either (a) their job depends on 
> {{spark.default.parallelism}} being set but it was run on an environment 
> lacking a value for that config, or (b) their input data significantly grew 
> in size. These scenarios may be rare, but they can be frustrating to debug 
> (especially if a failure occurs midway through a long-running job).
> I think we should do something to handle this scenario.
> A good starting point might be for {{Partitioner.defaultPartitioner}} to log 
> a warning when the default partition size exceeds some threshold.
> In addition, I think it might be a good idea to log a similar warning in 
> {{MapOutputTrackerMaster}} right before we start trying to serialize map 
> statuses: in a real-world situation where this problem cropped up, the map 
> stage ran successfully but the driver crashed when serializing map statuses. 
> Putting a warning about partition counts here makes it more likely that users 
> will spot that error in the logs and be able to identify the source of the 
> problem (compared to a warning that appears much earlier in the job and 
> therefore much farther from the likely site of a crash).
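
A minimal sketch of the kind of warning proposed, assuming a hypothetical threshold and helper name:
{code:scala}
import org.apache.spark.rdd.RDD

// Hypothetical helper: warn when the fallback partition count looks dangerous.
def warnOnLargeDefaultPartitioner(rdds: Seq[RDD[_]], threshold: Int = 10000): Unit = {
  val numPartitions = rdds.map(_.partitions.length).max
  if (numPartitions > threshold) {
    println(s"WARNING: spark.default.parallelism is unset; defaulting to " +
      s"$numPartitions partitions, which can create numMappers * numReducers " +
      s"shuffle blocks and oversized map output statuses.")
  }
}
{code}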






[jira] [Updated] (SPARK-47175) Remove ZOOKEEPER-1844 comment from KafkaTestUtils

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47175:
---
Labels: pull-request-available  (was: )

> Remove ZOOKEEPER-1844 comment from KafkaTestUtils
> -
>
> Key: SPARK-47175
> URL: https://issues.apache.org/jira/browse/SPARK-47175
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47175) Remove ZOOKEEPER-1844 comment from KafkaTestUtils

2024-02-26 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47175:
-

 Summary: Remove ZOOKEEPER-1844 comment from KafkaTestUtils
 Key: SPARK-47175
 URL: https://issues.apache.org/jira/browse/SPARK-47175
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Assigned] (SPARK-47079) Unable to create PySpark dataframe containing Variant columns

2024-02-26 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-47079:
-

Assignee: Desmond Cheong

> Unable to create PySpark dataframe containing Variant columns
> -
>
> Key: SPARK-47079
> URL: https://issues.apache.org/jira/browse/SPARK-47079
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Desmond Cheong
>Assignee: Desmond Cheong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Trying to create a dataframe containing a variant type results in:
> AssertionError: Undefined error message parameter for error class: 
> CANNOT_PARSE_DATATYPE. Parameters: {'error': "Undefined error message 
> parameter for error class: CANNOT_PARSE_DATATYPE. Parameters:
> {'error': 'variant'}
> "}






[jira] [Resolved] (SPARK-47079) Unable to create PySpark dataframe containing Variant columns

2024-02-26 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-47079.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45131
[https://github.com/apache/spark/pull/45131]

> Unable to create PySpark dataframe containing Variant columns
> -
>
> Key: SPARK-47079
> URL: https://issues.apache.org/jira/browse/SPARK-47079
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Desmond Cheong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Trying to create a dataframe containing a variant type results in:
> AssertionError: Undefined error message parameter for error class: 
> CANNOT_PARSE_DATATYPE. Parameters: {'error': "Undefined error message 
> parameter for error class: CANNOT_PARSE_DATATYPE. Parameters:
> {'error': 'variant'}
> "}






[jira] [Commented] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted

2024-02-26 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17820877#comment-17820877
 ] 

Robert Joseph Evans commented on SPARK-47063:
-

[~planga82] I was not planning on putting up a patch, but I would be willing 
to if no one else wants to put one up. I would just need to know whether we 
want to clamp the result or whether we are okay with the overflow.

> CAST long to timestamp has different behavior for codegen vs interpreted
> 
>
> Key: SPARK-47063
> URL: https://issues.apache.org/jira/browse/SPARK-47063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: Robert Joseph Evans
>Priority: Major
>
> It probably impacts a lot more versions of the code than this, but I verified 
> it on 3.4.2. This also appears to be related to 
> https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-------------------+---------------+
> |v                   |ts                 |unix_micros(ts)|
> +--------------------+-------------------+---------------+
> |9223372036854775807 |1969-12-31 23:59:59|-1000000       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|1990000000     |
> +--------------------+-------------------+---------------+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the 
> expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> {code}
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627]
> is the code used in codegen, but
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687]
> is what is used outside of code gen.
> Apparently `SECONDS.toMicros` truncates the value on an overflow, but the 
> codegen does not.
> {code:java}
> scala> 
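
The quoted snippet is cut off above; a standalone sketch of the discrepancy it describes (`TimeUnit` saturates on overflow, while a raw 64-bit multiply, as in the generated code, wraps around):
{code:scala}
import java.util.concurrent.TimeUnit

// Interpreted path: saturates at Long.MaxValue instead of overflowing.
println(TimeUnit.SECONDS.toMicros(Long.MaxValue)) // 9223372036854775807
// Codegen-style path: the multiply wraps around.
println(Long.MaxValue * 1000000L)                 // -1000000
{code}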

[jira] [Updated] (SPARK-33356) DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion

2024-02-26 Thread Chris Wells (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Wells updated SPARK-33356:

Reporter: Lucas Brutschy  (was: Lucas Brutschy)

> DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion
> -
>
> Key: SPARK-33356
> URL: https://issues.apache.org/jira/browse/SPARK-33356
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.2, 3.0.1
> Environment: Reproducible locally with 3.0.1, 2.4.2, and latest 
> master.
>Reporter: Lucas Brutschy
>Priority: Minor
>
> The current implementation of the {{DAGScheduler}} exhibits exponential 
> runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be 
> a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} 
> and {{DAGScheduler.getPreferredLocs}}.
> A minimal example reproducing the issue:
> {code:scala}
> import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
> 
> object Example extends App {
>   val partitioner = new HashPartitioner(2)
>   val sc = new SparkContext(new SparkConf().setAppName("").setMaster("local[*]"))
>   val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner)
>   val rdd2 = (1 to 30).map(_ => rdd1)
>   val rdd3 = rdd2.reduce(_ union _)
>   rdd3.collect()
> }
> {code}
> The whole app should take around one second to complete, as no actual work is 
> done. However, it takes more time to submit the job than I am willing to wait.
> The underlying cause appears to be mutual recursion between 
> {{PartitionerAwareUnion.getPreferredLocations}} and 
> {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each 
> {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is 
> visited {{O(n!)}} (exponentially many) times.
> Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of 
> {{rdd2.reduce(_ union _)}} to eliminate the problem (a sketch follows below). I 
> use this just to demonstrate the issue in a sufficiently small example. Given a 
> large DAG and many PartitionerAwareUnions, especially ones constructed by 
> iterative algorithms, the problem can become relevant even without "abuse" of 
> the union operation.
> The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, 
> but in the special case of PartitionerAwareUnion, it is still possible. This 
> may actually be an underlying cause of SPARK-29181.
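> A workaround sketch (reusing the names from the example above): a single 
> n-ary union keeps the lineage shallow, so preferred-location lookup visits 
> each node once instead of restarting at every nested union:
> {code:scala}
> // One PartitionerAwareUnionRDD over all 30 parents, instead of a chain of
> // 30 nested binary unions.
> val rdd3 = sc.union(rdd2)
> rdd3.collect()
> {code}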



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47156) SparkSession returns a null context during a dataset creation

2024-02-26 Thread Marc Le Bihan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Le Bihan resolved SPARK-47156.
---
Resolution: Not A Bug

I was lacking Spark knowledge, and learned that executors don't have a context 
to give to anyone at runtime.
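
A minimal sketch of the lesson (hypothetical names, not the reporter's code): a 
closure passed to {{filter}} runs on executors, and a captured SparkSession or 
SparkContext has no context to give there.

{code:scala}
import org.apache.spark.sql.SparkSession

object ExecutorContextSketch extends App {
  val session = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
  import session.implicits._

  val rows = Seq("a", "b", null).toDF("id")  // driver side: 'session' is usable here
  val filtered = rows.filter { row =>
    // This body runs on an executor; never call into a captured
    // SparkSession/SparkContext from here.
    row.getAs[String]("id") != null
  }
  println(filtered.count())  // 2
}
{code}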

> SparkSession returns a null context during a dataset creation
> -
>
> Key: SPARK-47156
> URL: https://issues.apache.org/jira/browse/SPARK-47156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2
> Environment: Debian 12
> Java 17
>Reporter: Marc Le Bihan
>Priority: Major
>
> First, I need to know whether I'm facing a bug or not.
> If it's a bug, I'll manage to create a test to help you reproduce the 
> case; if it isn't, maybe the Spark documentation could explain when 
> {{sparkSession.getContext()}} can return {{null}}.
>  
> I want to simplify my development by separating:
>  * parquet file management (checking existence, then loading the files as a 
> cache, or saving data to them),
>  * from dataset creation, when the dataset doesn't exist yet and must be 
> built from scratch.
>  
> The method I'm using is this one:
> {code:java}
> protected Dataset<Row> constitutionStandard(OptionsCreationLecture 
> optionsCreationLecture,
>Supplier<Dataset<Row>> worker, CacheParqueteur 
> cacheParqueteur) {
>OptionsCreationLecture options = optionsCreationLecture != null ? 
> optionsCreationLecture : optionsCreationLecture();
>Dataset<Row> dataset = cacheParqueteur.call(options.useCache());
>return dataset == null ? 
> cacheParqueteur.save(cacheParqueteur.appliquer(worker.get())) : dataset;
> }
> {code}
> In case the dataset doesn't exist in the parquet files (= cache) yet, it starts 
> its creation by calling {{worker.get()}}, where {{worker}} is a {{Supplier}} of 
> {{Dataset<Row>}}.
>  
> A concrete usage is this one:
> {code:java}
> public Dataset<Row> rowEtablissements(OptionsCreationLecture 
> optionsCreationLecture, HistoriqueExecution historiqueExecution, int 
> anneeCOG, int anneeSIRENE, boolean actifsSeulement, boolean communesValides, 
> boolean nomenclaturesNAF2Valides) {
>OptionsCreationLecture options = optionsCreationLecture != null ? 
> optionsCreationLecture : optionsCreationLecture();
>Supplier<Dataset<Row>> worker = () -> {
>   super.setStageDescription(this.messageSource, 
> "row.etablissements.libelle.long", "row.etablissements.libelle.court", 
> anneeSIRENE, anneeCOG, actifsSeulement, communesValides, 
> nomenclaturesNAF2Valides);
>   
>   Map indexs = new HashMap<>();
>   Dataset<Row> etablissements = 
> etablissementsNonFiltres(optionsCreationLecture, anneeSIRENE);
>   etablissements = etablissements.filter(
>  (FilterFunction<Row>) etablissement -> 
> this.validator.validationEtablissement(this.session, historiqueExecution, 
> etablissement, actifsSeulement, nomenclaturesNAF2Valides, indexs));
>   // If filtering by valid communes was requested, apply it.
>   if (communesValides) {
>  etablissements = rowRestreindreAuxCommunesValides(etablissements, 
> anneeCOG, anneeSIRENE, indexs);
>   }
>   else {
>  etablissements = etablissements.withColumn("codeDepartement", 
> substring(CODE_COMMUNE.col(), 1, 2));
>   }
>   // Attach the labels of the APE/NAF codes.
>   Dataset<Row> nomenclatureNAF = 
> this.nafDataset.rowNomenclatureNAF(anneeSIRENE);
>   etablissements = etablissements.join(nomenclatureNAF, 
> etablissements.col("activitePrincipale").equalTo(nomenclatureNAF.col("codeNAF"))
>  , "left_outer")
>  .drop("codeNAF", "niveauNAF");
>   // The dataset is now considered valid, and its fields can be cast to 
> their final types.
>   return this.validator.cast(etablissements);
>};
>return constitutionStandard(options, () -> worker.get()
>   .withColumn("partitionSiren", SIREN_ENTREPRISE.col().substr(1,2)),
>   new CacheParqueteur<>(options, this.session,
>  "etablissements", 
> "annee_{0,number,#0}-actifs_{1}-communes_verifiees_{2}-nafs_verifies_{3}", 
> DEPARTEMENT_SIREN_SIRET,
>  anneeSIRENE, anneeCOG, actifsSeulement, communesValides));
> } {code}
>  
> In the worker, a filter calls {{validationEtablissement(SparkSession, 
> HistoriqueExecution, Row, ...)}} on each row to perform complete checking 
> (eight rules to check an establishment's validity).
> When a check fails, along with a warning log, I'm also counting in the 
> {{historiqueExecution}} object the number of problems of that kind I've 
> encountered.
> That function increases a {{LongAccumulator}} value, creating that 
> accumulator first, if needed, and storing it in a {{Map}} of 
> accumulators.
> {code:java}
> public void 

[jira] [Commented] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted

2024-02-26 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820853#comment-17820853
 ] 

Pablo Langa Blanco commented on SPARK-47063:


[~revans2] are you working on the fix?

> CAST long to timestamp has different behavior for codegen vs interpreted
> 
>
> Key: SPARK-47063
> URL: https://issues.apache.org/jira/browse/SPARK-47063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: Robert Joseph Evans
>Priority: Major
>
> It probably impacts a lot more versions of the code than this, but I verified 
> it on 3.4.2. This also appears to be related to 
> https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-------------------+---------------+
> |v                   |ts                 |unix_micros(ts)|
> +--------------------+-------------------+---------------+
> |9223372036854775807 |1969-12-31 23:59:59|-1000000       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|1990000000     |
> +--------------------+-------------------+---------------+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the 
> expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> {code}
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627]
> is the code used in codegen, but
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687]
> is what is used outside of codegen.
> Apparently `SECONDS.toMicros` saturates the value on overflow (clamping to 
> `Long.MaxValue`), but the codegen does not.
> {code:java}
> scala> Long.MaxValue
> res11: Long = 9223372036854775807
> scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue)
> res12: Long = 9223372036854775807
> scala> Long.MaxValue 
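> // Assumed continuation (the original message is truncated here): direct
> // multiplication, as in the generated code, wraps on overflow instead of
> // saturating.
> scala> Long.MaxValue * 1000000L
> res13: Long = -1000000
> {code}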

[jira] [Created] (SPARK-47174) Client Side Listener - Server side implementation

2024-02-26 Thread Wei Liu (Jira)
Wei Liu created SPARK-47174:
---

 Summary: Client Side Listener - Server side implementation
 Key: SPARK-47174
 URL: https://issues.apache.org/jira/browse/SPARK-47174
 Project: Spark
  Issue Type: Improvement
  Components: Connect, SS
Affects Versions: 4.0.0
Reporter: Wei Liu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47174) Client Side Listener - Server side implementation

2024-02-26 Thread Wei Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820828#comment-17820828
 ] 

Wei Liu commented on SPARK-47174:
-

I'm working on this.

> Client Side Listener - Server side implementation
> -
>
> Key: SPARK-47174
> URL: https://issues.apache.org/jira/browse/SPARK-47174
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47173) fix typo in new streaming query listener explanation

2024-02-26 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820817#comment-17820817
 ] 

Max Gekk commented on SPARK-47173:
--

Resolved by https://github.com/apache/spark/pull/45263

> fix typo in new streaming query listener explanation
> 
>
> Key: SPARK-47173
> URL: https://issues.apache.org/jira/browse/SPARK-47173
> Project: Spark
>  Issue Type: Improvement
>  Components: SS, UI
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Misspelled 
> {{flatMapGroupsWithState}} as {{flatMapGroupWithState}} (missing an "s" after "Group")



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47173) fix typo in new streaming query listener explanation

2024-02-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-47173:
-
Fix Version/s: 4.0.0

> fix typo in new streaming query listener explanation
> 
>
> Key: SPARK-47173
> URL: https://issues.apache.org/jira/browse/SPARK-47173
> Project: Spark
>  Issue Type: Improvement
>  Components: SS, UI
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Misspelled 
> {{flatMapGroupsWithState}} as {{flatMapGroupWithState}} (missing an "s" after "Group")



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47173) fix typo in new streaming query listener explanation

2024-02-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-47173:


Assignee: Wei Liu

> fix typo in new streaming query listener explanation
> 
>
> Key: SPARK-47173
> URL: https://issues.apache.org/jira/browse/SPARK-47173
> Project: Spark
>  Issue Type: Improvement
>  Components: SS, UI
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Trivial
>  Labels: pull-request-available
>
> Misspelled 
> {{flatMapGroupsWithState}} as {{flatMapGroupWithState}} (missing an "s" after "Group")



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46067) Upgrade commons-compress to 1.25.0

2024-02-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46067:
--
Parent: SPARK-47046
Issue Type: Sub-task  (was: Improvement)

> Upgrade commons-compress to 1.25.0
> --
>
> Key: SPARK-46067
> URL: https://issues.apache.org/jira/browse/SPARK-46067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> https://commons.apache.org/proper/commons-compress/changes-report.html#a1.25.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47173) fix typo in new streaming query listener explanation

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47173:
---
Labels: pull-request-available  (was: )

> fix typo in new streaming query listener explanation
> 
>
> Key: SPARK-47173
> URL: https://issues.apache.org/jira/browse/SPARK-47173
> Project: Spark
>  Issue Type: Improvement
>  Components: SS, UI
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Priority: Trivial
>  Labels: pull-request-available
>
> Misspelled 
> {{flatMapGroupsWithState}} as {{flatMapGroupWithState}} (missing an "s" after "Group")



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47173) fix typo in new streaming query listener explanation

2024-02-26 Thread Wei Liu (Jira)
Wei Liu created SPARK-47173:
---

 Summary: fix typo in new streaming query listener explanation
 Key: SPARK-47173
 URL: https://issues.apache.org/jira/browse/SPARK-47173
 Project: Spark
  Issue Type: Improvement
  Components: SS, UI
Affects Versions: 4.0.0
Reporter: Wei Liu


Misspelled 
flatMapGroupsWithState as flatMapGroupWithState (missing an "s" after "Group")



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47172) Upgrade Transport block cipher mode to GCM

2024-02-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-47172:
-
Shepherd:   (was: Sean R. Owen)

> Upgrade Transport block cipher mode to GCM
> --
>
> Key: SPARK-47172
> URL: https://issues.apache.org/jira/browse/SPARK-47172
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Steve Weis
>Priority: Minor
>
> The cipher transformation currently used for encrypting RPC calls is an 
> unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an 
> authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being 
> modified in transit.
> The relevant line is here: 
> [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220]
> GCM is relatively more computationally expensive than CTR and adds a 16-byte 
> block of authentication tag data to each payload. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47172) Upgrade Transport block cipher mode to GCM

2024-02-26 Thread Steve Weis (Jira)
Steve Weis created SPARK-47172:
--

 Summary: Upgrade Transport block cipher mode to GCM
 Key: SPARK-47172
 URL: https://issues.apache.org/jira/browse/SPARK-47172
 Project: Spark
  Issue Type: Improvement
  Components: Security
Affects Versions: 3.5.0, 3.4.2
Reporter: Steve Weis


The cipher transformation currently used for encrypting RPC calls is an 
unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an 
authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being 
modified in transit.

The relevant line is here: 
[https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220]

GCM is relatively more computationally expensive than CTR and adds a 16-byte 
block of authentication tag data to each payload. 
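
An illustrative sketch of the proposed mode using the standard JCE API (this is 
not Spark's RPC code; all names here are mine):

{code:scala}
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

// AES/GCM/NoPadding authenticates the ciphertext, so tampering in transit is
// detected at decryption time; AES/CTR/NoPadding offers no such check.
val random = new SecureRandom()
val key = new Array[Byte](16)
val iv = new Array[Byte](12)  // 96-bit IV, the conventional choice for GCM
random.nextBytes(key)
random.nextBytes(iv)

val cipher = Cipher.getInstance("AES/GCM/NoPadding")
// A 128-bit tag: the 16-byte per-payload overhead mentioned above.
cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv))
val ciphertext = cipher.doFinal("payload".getBytes("UTF-8"))
{code}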



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47171) Improve handling of new `exists` attributes within an aggregation

2024-02-26 Thread Anton Lykov (Jira)
Anton Lykov created SPARK-47171:
---

 Summary: Improve handling of new `exists` attributes within an 
aggregation
 Key: SPARK-47171
 URL: https://issues.apache.org/jira/browse/SPARK-47171
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.5.0
Reporter: Anton Lykov


See PR comment for context: 
https://github.com/apache/spark/pull/45133#issuecomment-1949522246



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47170) Remove redundant scope identifier for `jakarta.servlet-api` and `javax.servlet-api`

2024-02-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47170.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45258
[https://github.com/apache/spark/pull/45258]

> Remove redundant scope identifier for `jakarta.servlet-api` and 
> `javax.servlet-api`
> ---
>
> Key: SPARK-47170
> URL: https://issues.apache.org/jira/browse/SPARK-47170
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: HiuFung Kwok
>Assignee: HiuFung Kwok
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This is a follow-up ticket for SPARK-47046 to remove the redundant `scope` 
> XML element (`compile`). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47170) Remove redundant scope identifier for `jakarta.servlet-api` and `javax.servlet-api`

2024-02-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47170:
-

Assignee: HiuFung Kwok

> Remove redundant scope identifier for `jakarta.servlet-api` and 
> `javax.servlet-api`
> ---
>
> Key: SPARK-47170
> URL: https://issues.apache.org/jira/browse/SPARK-47170
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: HiuFung Kwok
>Assignee: HiuFung Kwok
>Priority: Major
>  Labels: pull-request-available
>
> This is a follow-up ticket for SPARK-47046 to remove the redundant `scope` 
> XML element (`compile`). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46077) Error in postgresql when pushing down filter by timestamp_ntz field

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46077:
---
Labels: pull-request-available  (was: )

> Error in postgresql when pushing down filter by timestamp_ntz field
> ---
>
> Key: SPARK-46077
> URL: https://issues.apache.org/jira/browse/SPARK-46077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Marina Krasilnikova
>Priority: Minor
>  Labels: pull-request-available
>
> code to reproduce:
> SparkSession sparkSession = SparkSession
> .builder()
> .appName("test-app")
> .master("local[*]")
> .config("spark.sql.timestampType", "TIMESTAMP_NTZ")
> .getOrCreate();
> String url = "...";
> String catalogPropPrefix = "spark.sql.catalog.myc";
> sparkSession.conf().set(catalogPropPrefix, JDBCTableCatalog.class.getName());
> sparkSession.conf().set(catalogPropPrefix + ".url", url);
> Map<String, String> options = new HashMap<>();
> options.put("driver", "org.postgresql.Driver");
> // options.put("pushDownPredicate", "false");  it works fine if  this line is 
> uncommented
> Dataset<Row> dataset = sparkSession.read()
> .options(options)
> .table("myc.demo.`My table`");
> dataset.createOrReplaceTempView("view1");
> String sql = "select * from view1 where `my date` = '2021-04-01 00:00:00'";
> Dataset<Row> result = sparkSession.sql(sql);
> result.show();
> result.printSchema();
> Field `my date` is of type timestamp. This code results in an 
> org.postgresql.util.PSQLException (syntax error).
>  
>  
> String sql = "select * from view1 where `my date` = to_timestamp('2021-04-01 
> 00:00:00', '-MM-dd HH:mm:ss')";  // this query also doesn't work
> String sql = "select * from view1 where `my date` = date_trunc('DAY', 
> to_timestamp('2021-04-01 00:00:00', '-MM-dd HH:mm:ss'))";  // but this is 
> OK
>  
> Is it a bug, or did I get something wrong?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47169) Disable bucketing on collated collumns

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47169:
---
Labels: pull-request-available  (was: )

> Disable bucketing on collated collumns
> --
>
> Key: SPARK-47169
> URL: https://issues.apache.org/jira/browse/SPARK-47169
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.

2024-02-26 Thread Denis Tarima (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820705#comment-17820705
 ] 

Denis Tarima edited comment on SPARK-46992 at 2/26/24 1:44 PM:
---

I think the root problem is that {{{}cache{}}}/{{{}persist{}}} changes the 
result. It might be a necessary performance trade-off, but if it's possible to 
keep the same result then the problem will disappear.

{{CacheManager}} is shared between sessions so 
{{{}persist{}}}/{{{}unpersist{}}} affects all {{Dataset}} instances immediately 
creating a possibility of inconsistent results. For example, thread 1 calls 
{{{}df.count(){}}}, thread 2 calls {{{}df.cache().count(){}}}, and finally 
thread 1 calls {{df.count()}} again - thread 1 may get different counts.

If fixing the root problem is infeasible then the secondary problem needs to be 
addressed: {{queryExecution.executedPlan}} is cached ({{{}lazy val{}}}) in 
{{Dataset}} instance, but it's not used by all queries in the same way causing 
inconsistency.
 - {{df}} and {{dfCached = df.cache()}} could have different logical plans so 
{{df}} wouldn't use cached data, but this change would create a backward 
incompatibility
 - {{Dataset}} could verify if it's cached in {{CacheManager}} on each access 
to {{queryExecution}} and use/keep another {{queryExecution}} instance when 
it's in a "cached" state.


was (Author: dtarima):
I think the root problem is that {{{}cache{}}}/{{{}persist{}}} changes the 
result. It might be a necessary performance trade-off, but if it's possible to 
keep the same result then the problem will disappear.

{{CacheManager}} is shared between sessions so 
{{{}persist{}}}/{{{}unpersist{}}} affects all {{Dataset}} instances immediately 
creating a possibility of inconsistent results. For example, thread 1 calls 
{{{}df.count(){}}}, thread 2 calls {{{}df.cache().count(){}}}, and finally 
thread 1 calls {{df.count()}} again - thread 1 may get different counts.

If fixing the root problem is infeasible then the secondary problem needs to be 
addressed: {{queryExecution.executedPlan}} is cached ({{{}lazy val{}}}) in 
{{Dataset}} instance, but it's not used by all queries causing inconsistency.
 - {{df}} and {{dfCached = df.cache()}} could have different logical plans so 
{{df}} wouldn't use cached data, but this change would create a backward 
incompatibility
 - {{Dataset}} could verify if it's cached in {{CacheManager}} on each access 
to {{queryExecution}} and use/keep another {{queryExecution}} instance when 
it's in a "cached" state.

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Denis Tarima
>Priority: Critical
>  Labels: correctness, pull-request-available
>
>  
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code:java}
> NON CACHED:
>   count: 2
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>   count: 3
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}
> BTW, disabling AQE 
> [{color:#4c9aff}spark.conf.set("spark.databricks.optimizer.adaptive.enabled", 
> "false"){color}] helps on Databricks clusters, but locally it has no effect, 
> at least on Spark 3.3.2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47009) Create table with collation

2024-02-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47009.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45105
[https://github.com/apache/spark/pull/45105]

> Create table with collation
> ---
>
> Key: SPARK-47009
> URL: https://issues.apache.org/jira/browse/SPARK-47009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add support for creating table with columns containing non-default collated 
> data



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47009) Create table with collation

2024-02-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47009:
---

Assignee: Stefan Kandic

> Create table with collation
> ---
>
> Key: SPARK-47009
> URL: https://issues.apache.org/jira/browse/SPARK-47009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>
> Add support for creating table with columns containing non-default collated 
> data



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.

2024-02-26 Thread Denis Tarima (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820705#comment-17820705
 ] 

Denis Tarima commented on SPARK-46992:
--

I think the root problem is that {{{}cache{}}}/{{{}persist{}}} changes the 
result. It might be a necessary performance trade-off, but if it's possible to 
keep the same result then the problem will disappear.

{{CacheManager}} is shared between sessions so 
{{{}persist{}}}/{{{}unpersist{}}} affects all {{Dataset}} instances immediately 
creating a possibility of inconsistent results. For example, thread 1 calls 
{{{}df.count(){}}}, thread 2 calls {{{}df.cache().count(){}}}, and finally 
thread 1 calls {{df.count()}} again - thread 1 may get different counts.

If fixing the root problem is infeasible then the secondary problem needs to be 
addressed: {{queryExecution.executedPlan}} is cached ({{{}lazy val{}}}) in 
{{Dataset}} instance, but it's not used by all queries causing inconsistency.
 - {{df}} and {{dfCached = df.cache()}} could have different logical plans so 
{{df}} wouldn't use cached data, but this change would create a backward 
incompatibility
 - {{Dataset}} could verify if it's cached in {{CacheManager}} on each access 
to {{queryExecution}} and use/keep another {{queryExecution}} instance when 
it's in a "cached" state.

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Denis Tarima
>Priority: Critical
>  Labels: correctness, pull-request-available
>
>  
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code:java}
> NON CACHED:
>   count: 2
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>   count: 3
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}
> BTW, disabling AQE 
> [{color:#4c9aff}spark.conf.set("spark.databricks.optimizer.adaptive.enabled", 
> "false"){color}] helps on Databricks clusters, but locally it has no effect, 
> at least on Spark 3.3.2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47170) Remove redundant scope identifier for `jakarta.servlet-api` and `javax.servlet-api`

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47170:
---
Labels: pull-request-available  (was: )

> Remove redundant scope identifier for `jakarta.servlet-api` and 
> `javax.servlet-api`
> ---
>
> Key: SPARK-47170
> URL: https://issues.apache.org/jira/browse/SPARK-47170
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: HiuFung Kwok
>Priority: Major
>  Labels: pull-request-available
>
> This is a follow-up ticket for SPARK-47046 to remove the redundant `scope` 
> XML element (`compile`). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47170) Remove redundant scope identifier for `jakarta.servlet-api` and `javax.servlet-api`

2024-02-26 Thread Nikita Awasthi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820691#comment-17820691
 ] 

Nikita Awasthi commented on SPARK-47170:


User 'HiuKwok' has created a pull request for this issue:
https://github.com/apache/spark/pull/45258

> Remove redundant scope identifier for `jakarta.servlet-api` and 
> `javax.servlet-api`
> ---
>
> Key: SPARK-47170
> URL: https://issues.apache.org/jira/browse/SPARK-47170
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: HiuFung Kwok
>Priority: Major
>
> This is a follow-up ticket for SPARK-47046 to remove the redundant `scope` 
> XML element (`compile`). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47145) Provide table identifier to scan node when DS v2 strategy is applied

2024-02-26 Thread Uros Stankovic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820690#comment-17820690
 ] 

Uros Stankovic commented on SPARK-47145:


PR for this change https://github.com/apache/spark/pull/45200

> Provide table identifier to scan node when DS v2 strategy is applied
> 
>
> Key: SPARK-47145
> URL: https://issues.apache.org/jira/browse/SPARK-47145
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Uros Stankovic
>Priority: Minor
>
> Currently, DataSourceScanExec node can accept table identifier, and that 
> information can be useful for later logging, debugging, etc, but 
> DataSourceV2Strategy does not provide that information to scan node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47170) Remove redundant scope identifier for `jakarta.servlet-api` and `javax.servlet-api`

2024-02-26 Thread HiuFung Kwok (Jira)
HiuFung Kwok created SPARK-47170:


 Summary: Remove redundant scope identifier for 
`jakarta.servlet-api` and `javax.servlet-api`
 Key: SPARK-47170
 URL: https://issues.apache.org/jira/browse/SPARK-47170
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: HiuFung Kwok


This is a follow-up ticket for SPARK-47046 to remove the redundant `scope` XML 
element (`compile`). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47169) Disable bucketint on collated collumns

2024-02-26 Thread Mihailo Milosevic (Jira)
Mihailo Milosevic created SPARK-47169:
-

 Summary: Disable bucketint on collated collumns
 Key: SPARK-47169
 URL: https://issues.apache.org/jira/browse/SPARK-47169
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Mihailo Milosevic






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47169) Disable bucketing on collated collumns

2024-02-26 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-47169:
--
Summary: Disable bucketing on collated collumns  (was: Disable bucketint on 
collated collumns)

> Disable bucketing on collated collumns
> --
>
> Key: SPARK-47169
> URL: https://issues.apache.org/jira/browse/SPARK-47169
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47147) Fix Pyspark collated string conversion error

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47147:
---
Labels: pull-request-available  (was: )

> Fix Pyspark collated string conversion error
> 
>
> Key: SPARK-47147
> URL: https://issues.apache.org/jira/browse/SPARK-47147
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When running the PySpark shell in non-Spark Connect mode, the query "SELECT 'abc' 
> COLLATE 'UCS_BASIC_LCASE'" produces the following error:
> {code:java}
> AssertionError: Undefined error message parameter for error class: 
> CANNOT_PARSE_DATATYPE. Parameters: {'error': "Undefined error message 
> parameter for error class: CANNOT_PARSE_DATATYPE. Parameters: {'error': 
> 'string(UCS_BASIC_LCASE)'}"}
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47168) Disable parquet filter pushdown for non default collated strings

2024-02-26 Thread Stefan Kandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Kandic updated SPARK-47168:
--
Summary: Disable parquet filter pushdown for non default collated strings  
(was: Disable filter pushdown for non default collated strings)

> Disable parquet filter pushdown for non default collated strings
> 
>
> Key: SPARK-47168
> URL: https://issues.apache.org/jira/browse/SPARK-47168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47168) Disable filter pushdown for non default collated strings

2024-02-26 Thread Stefan Kandic (Jira)
Stefan Kandic created SPARK-47168:
-

 Summary: Disable filter pushdown for non default collated strings
 Key: SPARK-47168
 URL: https://issues.apache.org/jira/browse/SPARK-47168
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Stefan Kandic






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47033) EXECUTE IMMEDIATE USING does not recognize session variable names

2024-02-26 Thread A G (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820680#comment-17820680
 ] 

A G commented on SPARK-47033:
-

I want to work on this!

> EXECUTE IMMEDIATE USING does not recognize session variable names
> -
>
> Key: SPARK-47033
> URL: https://issues.apache.org/jira/browse/SPARK-47033
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Priority: Major
>
> {noformat}
> DECLARE parm = 'Hello';
> EXECUTE IMMEDIATE 'SELECT :parm' USING parm;
> [ALL_PARAMETERS_MUST_BE_NAMED] Using name parameterized queries requires all 
> parameters to be named. Parameters missing names: "parm". SQLSTATE: 07001
> EXECUTE IMMEDIATE 'SELECT :parm' USING parm AS parm;
> Hello
> {noformat}
> Variables are like column references: they act as their own aliases, and thus 
> should not be required to be named to associate with a named parameter of 
> the same name.
> Note that, unlike for PySpark, this should be case-insensitive (not yet 
> verified).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47167) Add descriptive relation class

2024-02-26 Thread Uros Stankovic (Jira)
Uros Stankovic created SPARK-47167:
--

 Summary: Add descriptive relation class
 Key: SPARK-47167
 URL: https://issues.apache.org/jira/browse/SPARK-47167
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.5.1
Reporter: Uros Stankovic


The BaseRelation class does not provide any descriptive information, such as a 
name or a description. It would be great to add such a class so that debugging 
and logging are easier.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47165) Pull docker image only when it's absent

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47165:
---
Labels: pull-request-available  (was: )

> Pull docker image only when it's absent
> ---
>
> Key: SPARK-47165
> URL: https://issues.apache.org/jira/browse/SPARK-47165
> Project: Spark
>  Issue Type: Test
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47103) Make the default storage level of intermediate datasets for MLlib configurable

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47103:
--

Assignee: (was: Apache Spark)

> Make the default storage level of intermediate datasets for MLlib configurable
> --
>
> Key: SPARK-47103
> URL: https://issues.apache.org/jira/browse/SPARK-47103
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47103) Make the default storage level of intermediate datasets for MLlib configurable

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47103:
--

Assignee: Apache Spark

> Make the default storage level of intermediate datasets for MLlib configurable
> --
>
> Key: SPARK-47103
> URL: https://issues.apache.org/jira/browse/SPARK-47103
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47165) Pull docker image only when it's absent

2024-02-26 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-47165:
-
Issue Type: Test  (was: Improvement)

> Pull docker image only when it's absent
> ---
>
> Key: SPARK-47165
> URL: https://issues.apache.org/jira/browse/SPARK-47165
> Project: Spark
>  Issue Type: Test
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47165) Pull docker image only when it's absent

2024-02-26 Thread Kent Yao (Jira)
Kent Yao created SPARK-47165:


 Summary: Pull docker image only when it's absent
 Key: SPARK-47165
 URL: https://issues.apache.org/jira/browse/SPARK-47165
 Project: Spark
  Issue Type: Improvement
  Components: Spark Docker
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47158) Assign proper name and sqlState to _LEGACY_ERROR_TEMP_2134 & 2231

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47158:
--

Assignee: (was: Apache Spark)

> Assign proper name and sqlState to _LEGACY_ERROR_TEMP_2134 & 2231
> -
>
> Key: SPARK-47158
> URL: https://issues.apache.org/jira/browse/SPARK-47158
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Assign proper name and sqlState to _LEGACY_ERROR_TEMP_2134 & 2231



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47158) Assign proper name and sqlState to _LEGACY_ERROR_TEMP_2134 & 2231

2024-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47158:
--

Assignee: Apache Spark

> Assign proper name and sqlState to _LEGACY_ERROR_TEMP_2134 & 2231
> -
>
> Key: SPARK-47158
> URL: https://issues.apache.org/jira/browse/SPARK-47158
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Assign proper name and sqlState to _LEGACY_ERROR_TEMP_2134 & 2231



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-02-26 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-45599:
-
Fix Version/s: 3.5.2
   (was: 3.5.1)

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Assignee: Nicholas Chammas
>Priority: Critical
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
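> A quick illustration (not from the ticket) of why mixing the two zeros can 
> corrupt a hash-based structure:
> {code:scala}
> // -0.0 and 0.0 compare equal, but their bit patterns differ, so a structure
> // that hashes or compares on bits sees two distinct "zero" keys.
> println(-0.0 == 0.0)                              // true
> println(java.lang.Double.doubleToLongBits(-0.0) ==
>         java.lang.Double.doubleToLongBits(0.0))   // false
> {code}
>  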
> I am really surprised that we caught this bug, because everything has to hit 
> just wrong to make it happen. In python/pyspark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> 

[jira] [Assigned] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-02-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45599:
---

Assignee: Nicholas Chammas

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Assignee: Nicholas Chammas
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> I think this actually impacts all versions that have ever supported 
> percentile, and it may impact other things, because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug, because everything has to go 
> wrong in just the right way to make it happen. In python/pyspark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> 
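
For intuition about the failure mode described above: IEEE-754 gives -0.0 and
0.0 distinct bit patterns even though they compare equal, so a hash table that
buckets doubles by their raw bits but probes with == can treat the "same" key
as two different keys depending on insertion order. The toy Python sketch
below illustrates the mismatch; it is a generic illustration of the hazard,
not Spark's actual OpenHashMap code, and the helper name bit_hash is made up
for this example.

{code:python}
# Toy illustration of the -0.0 vs 0.0 hazard (not Spark's OpenHashMap).
import struct

def bit_hash(x: float) -> int:
    """Hash a double by its raw IEEE-754 bits, JVM Double.hashCode style."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    return bits ^ (bits >> 32)

print(-0.0 == 0.0)                       # True:  equality sees one key
print(bit_hash(-0.0) == bit_hash(0.0))   # False: bit-hashing sees two buckets
{code}

When hashing and equality disagree like this, occurrence counts for what ==
considers a single value can be split across buckets or merged inconsistently,
which is exactly the kind of bookkeeping error that skews an exact percentile.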

[jira] [Resolved] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-02-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45599.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 45036
[https://github.com/apache/spark/pull/45036]

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Assignee: Nicholas Chammas
>Priority: Critical
>  Labels: correctness, pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> I think this actually impacts all versions that have ever supported 
> percentile, and it may impact other things, because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug, because everything has to go 
> wrong in just the right way to make it happen. In python/pyspark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [ ... ]  # identical to the value list quoted in the message above
>
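
The quoted reproduction is truncated in the archive. As a stand-in, here is a
minimal hypothetical check in the same spirit, using only the public PySpark
API; the tiny inline dataset and the column name value are illustrative, not
the reporter's original script:

{code:python}
# Minimal sketch for SPARK-45599: mix -0.0 and 0.0, then take an exact
# percentile. On affected versions the internal hash map can count the two
# zeros inconsistently; on fixed versions (3.5.1, 4.0.0) the median of this
# data is numerically zero.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StructField, StructType

spark = SparkSession.builder.master("local[*]").getOrCreate()
schema = StructType([StructField("value", DoubleType(), True)])
rows = [(-0.0,), (0.0,), (-0.0,), (0.0,), (1.0,)]
df = spark.createDataFrame(rows, schema)
df.select(F.expr("percentile(value, 0.5)").alias("median")).show()
{code}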