[jira] [Updated] (SPARK-46962) Implement Python worker to run Python streaming data source

2024-02-12 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-46962:
-
Component/s: Structured Streaming
 (was: SS)

> Implement Python worker to run Python streaming data source
> ---
>
> Key: SPARK-46962
> URL: https://issues.apache.org/jira/browse/SPARK-46962
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> Implement a Python worker to run the Python streaming data source and 
> communicate with the JVM through a socket. Create a PythonMicrobatchStream 
> to invoke RPC function calls.
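
A rough sketch of the shape of this design: a long-lived Python process that
owns the user's streaming reader and answers RPC-style calls (offsets,
commits) over a socket. Everything below is invented for illustration; the
real worker inherits its socket from the JVM and speaks Spark's internal
serialization protocol, not the JSON-lines framing shown here.
{code:python}
import json
import socketserver


class CounterReader:
    """Toy stand-in for a user-defined Python streaming reader."""

    def __init__(self):
        self.offset = 0

    def latestOffset(self):
        self.offset += 10  # pretend ten new records arrived upstream
        return {"offset": self.offset}

    def commit(self, end):
        return {"committed": end}


class RpcHandler(socketserver.StreamRequestHandler):
    """Dispatches one JSON-encoded call per line to the attached reader."""

    def handle(self):
        for line in self.rfile:
            request = json.loads(line)
            method = getattr(self.server.reader, request["method"])
            result = method(*request.get("args", []))
            self.wfile.write((json.dumps(result) + "\n").encode())


if __name__ == "__main__":
    server = socketserver.TCPServer(("127.0.0.1", 0), RpcHandler)
    server.reader = CounterReader()  # the instance this worker serves
    print("listening on", server.server_address)
    server.serve_forever()
{code}
On the JVM side, PythonMicrobatchStream would be the caller issuing these
requests.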






[jira] [Updated] (SPARK-46866) Streaming Python data source API

2024-02-12 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-46866:
-
Affects Version/s: 4.0.0
   (was: 3.5.0)

> Streaming Python data source API
> 
>
> Key: SPARK-46866
> URL: https://issues.apache.org/jira/browse/SPARK-46866
> Project: Spark
>  Issue Type: Epic
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Priority: Major
>
> This is a follow-up of https://issues.apache.org/jira/browse/SPARK-44076. The 
> idea is to enable Python developers to write streaming data sources in 
> Python. The goal is to make a Python-based API that is simple and easy to 
> use, thus making Spark more accessible to the wider Python developer 
> community.
>  
> Design doc: 
> https://docs.google.com/document/d/1cJ-w1hGPOBFp-5DLmf68sTLsAOwb55oW6SAuuAUFEM4/edit
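
To make the goal concrete, here is one way such a source could look to a
Python developer. This is purely illustrative: the class shape, method names,
and offset format are placeholders, not the proposed API, which is specified
in the design doc above.
{code:python}
class ClickstreamSource:
    """A hypothetical user-defined streaming source, written in pure Python."""

    def schema(self):
        return "user STRING, url STRING, ts LONG"

    def initialOffset(self):
        return {"position": 0}

    def latestOffset(self):
        # e.g. poll an HTTP endpoint or a message queue for the newest position
        return {"position": 100}

    def read(self, start, end):
        # Yield rows covering [start, end); the engine would call this once
        # per microbatch with the offset range it planned.
        for pos in range(start["position"], end["position"]):
            yield (f"user-{pos}", f"https://example.com/page/{pos}", pos)
{code}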






[jira] [Updated] (SPARK-46866) Streaming Python data source API

2024-02-12 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-46866:
-
Component/s: Structured Streaming
 (was: SS)

> Streaming Python data source API
> 
>
> Key: SPARK-46866
> URL: https://issues.apache.org/jira/browse/SPARK-46866
> Project: Spark
>  Issue Type: Epic
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Chaoqin Li
>Priority: Major
>
> This is a follow-up of https://issues.apache.org/jira/browse/SPARK-44076. The 
> idea is to enable Python developers to write streaming data sources in 
> Python. The goal is to make a Python-based API that is simple and easy to 
> use, thus making Spark more accessible to the wider Python developer 
> community.
>  
> Design doc: 
> https://docs.google.com/document/d/1cJ-w1hGPOBFp-5DLmf68sTLsAOwb55oW6SAuuAUFEM4/edit






[jira] [Created] (SPARK-47031) Union with non-deterministic expression should be non-deterministic

2024-02-12 Thread Holden Karau (Jira)
Holden Karau created SPARK-47031:


 Summary: Union with non-deterministic expression should be 
non-deterministic
 Key: SPARK-47031
 URL: https://issues.apache.org/jira/browse/SPARK-47031
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Holden Karau


We already have special-case handling for nullability, where any expression 
unioned with a nullable field becomes nullable; we should do the same for 
determinism.

 

I found this while poking around with pushdowns.

 

I believe the code to be updated is {{output}} in the {{Union}} case class.
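
A minimal PySpark sketch of the situation described, where one leg of the
union is deterministic and the other is not. This is not the fix; it just
constructs the case where the union's combined output should be treated as
non-deterministic, mirroring how nullability is already widened:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, rand

spark = SparkSession.builder.getOrCreate()

fixed = spark.range(5).select(col("id"), lit(0.5).alias("score"))
noisy = spark.range(5).select(col("id"), rand().alias("score"))

# "score" mixes a deterministic expression with rand(); rules that check
# determinism (e.g. before pushing a filter through the union) should see
# the combined column as non-deterministic.
unioned = fixed.union(noisy)
unioned.filter(col("score") > 0.5).explain(True)
{code}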






[jira] [Resolved] (SPARK-47030) Add `WebBrowserTest`

2024-02-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47030.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45089
[https://github.com/apache/spark/pull/45089]

> Add `WebBrowserTest`
> 
>
> Key: SPARK-47030
> URL: https://issues.apache.org/jira/browse/SPARK-47030
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Structured Streaming, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47025) Switch `Guava 19.0` dependency scope from `provided` to `test`

2024-02-12 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-47025:


Assignee: Dongjoon Hyun

> Switch `Guava 19.0` dependency scope from `provided` to `test`
> --
>
> Key: SPARK-47025
> URL: https://issues.apache.org/jira/browse/SPARK-47025
> Project: Spark
>  Issue Type: Test
>  Components: Build, SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47025) Switch `Guava 19.0` dependency scope from `provided` to `test`

2024-02-12 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-47025.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45088
[https://github.com/apache/spark/pull/45088]

> Switch `Guava 19.0` dependency scope from `provided` to `test`
> --
>
> Key: SPARK-47025
> URL: https://issues.apache.org/jira/browse/SPARK-47025
> Project: Spark
>  Issue Type: Test
>  Components: Build, SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47004) Increase Scala client test coverage

2024-02-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47004:


Assignee: Bo Gao

> Increase Scala client test coverage
> ---
>
> Key: SPARK-47004
> URL: https://issues.apache.org/jira/browse/SPARK-47004
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Bo Gao
>Assignee: Bo Gao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47004) Increase Scala client test coverage

2024-02-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47004.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45063
[https://github.com/apache/spark/pull/45063]

> Increase Scala client test coverage
> ---
>
> Key: SPARK-47004
> URL: https://issues.apache.org/jira/browse/SPARK-47004
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Bo Gao
>Assignee: Bo Gao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-47026) Include coverage of JSON data sources in array/struct/map default value tests

2024-02-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47026.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45086
[https://github.com/apache/spark/pull/45086]

> Include coverage of JSON data sources in array/struct/map default value tests
> -
>
> Key: SPARK-47026
> URL: https://issues.apache.org/jira/browse/SPARK-47026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Mark Jarvin
>Assignee: Mark Jarvin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47026) Include coverage of JSON data sources in array/struct/map default value tests

2024-02-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47026:


Assignee: Mark Jarvin

> Include coverage of JSON data sources in array/struct/map default value tests
> -
>
> Key: SPARK-47026
> URL: https://issues.apache.org/jira/browse/SPARK-47026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Mark Jarvin
>Assignee: Mark Jarvin
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-46944) Follow up to SPARK-46792: Fix minor typing oversight

2024-02-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-46944.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44983
[https://github.com/apache/spark/pull/44983]

> Follow up to SPARK-46792: Fix minor typing oversight
> 
>
> Key: SPARK-46944
> URL: https://issues.apache.org/jira/browse/SPARK-46944
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Alice Sayutina
>Assignee: Alice Sayutina
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47030) Add `WebBrowserTest`

2024-02-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47030:
--
Component/s: Spark Core
 SQL
 Structured Streaming

> Add `WebBrowserTest`
> 
>
> Key: SPARK-47030
> URL: https://issues.apache.org/jira/browse/SPARK-47030
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Structured Streaming, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-47030) Add `WebBrowserTest`

2024-02-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47030:
-

Assignee: Dongjoon Hyun

> Add `WebBrowserTest`
> 
>
> Key: SPARK-47030
> URL: https://issues.apache.org/jira/browse/SPARK-47030
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47030) Add `WebBrowserTest`

2024-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47030:
---
Labels: pull-request-available  (was: )

> Add `WebBrowserTest`
> 
>
> Key: SPARK-47030
> URL: https://issues.apache.org/jira/browse/SPARK-47030
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47030) Add `WebBrowserTest`

2024-02-12 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47030:
-

 Summary: Add `WebBrowserTest`
 Key: SPARK-47030
 URL: https://issues.apache.org/jira/browse/SPARK-47030
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-47014) Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession

2024-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47014:
---
Labels: pull-request-available  (was: )

> Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession
> -
>
> Key: SPARK-47014
> URL: https://issues.apache.org/jira/browse/SPARK-47014
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Xinrong Meng
>Priority: Major
>  Labels: pull-request-available
>
> Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession
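
A hypothetical usage sketch: the two method names come from this ticket, while
the profiler config key and the UDF setup are assumptions borrowed from the
session-based profiler work, so treat everything here as illustrative rather
than the final API.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = (
    SparkSession.builder
    .config("spark.sql.pyspark.udf.profiler", "perf")  # assumed config key
    .getOrCreate()
)


@udf("long")
def plus_one(x):
    return x + 1


# Run something that exercises the UDF so there are profiles to dump.
spark.range(10).select(plus_one("id")).collect()

spark.dumpPerfProfiles("/tmp/perf_profiles")
spark.dumpMemoryProfiles("/tmp/memory_profiles")  # needs the memory profiler enabled
{code}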






[jira] [Resolved] (SPARK-47027) Use temporary directories for profiler test outputs

2024-02-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47027.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45087
[https://github.com/apache/spark/pull/45087]

> Use temporary directories for profiler test outputs
> ---
>
> Key: SPARK-47027
> URL: https://issues.apache.org/jira/browse/SPARK-47027
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-47024) Sum of floats/doubles may be incorrect depending on partitioning

2024-02-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-47024.
--
Resolution: Not A Problem

Resolving this as "Not A Problem".

I mean, it _is_ a problem, but it's a basic problem with floats, and I don't 
think there is anything practical that can be done about it in Spark.

> Sum of floats/doubles may be incorrect depending on partitioning
> 
>
> Key: SPARK-47024
> URL: https://issues.apache.org/jira/browse/SPARK-47024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 3.3.4
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
>
> I found this problem using 
> [Hypothesis|https://hypothesis.readthedocs.io/en/latest/].
> Here's a reproduction that fails on {{master}}, 3.5.0, 3.4.2, and 3.3.4 
> (and probably all prior versions as well):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col, sum
>
> SUM_EXAMPLE = [
>     (1.0,),
>     (0.0,),
>     (1.0,),
>     (9007199254740992.0,),
> ]
>
> spark = (
>     SparkSession.builder
>     .config("spark.log.level", "ERROR")
>     .getOrCreate()
> )
>
> def compare_sums(data, num_partitions):
>     df = spark.createDataFrame(data, "val double").coalesce(1)
>     result1 = df.agg(sum(col("val"))).collect()[0][0]
>     df = spark.createDataFrame(data, "val double").repartition(num_partitions)
>     result2 = df.agg(sum(col("val"))).collect()[0][0]
>     assert result1 == result2, f"{result1}, {result2}"
>
> if __name__ == "__main__":
>     print(compare_sums(SUM_EXAMPLE, 2))
> {code}
> This fails as follows:
> {code:python}
> AssertionError: 9007199254740994.0, 9007199254740992.0
> {code}
> I suspected some kind of problem related to code generation, so I tried 
> setting all of these to {{false}}:
>  * {{spark.sql.codegen.wholeStage}}
>  * {{spark.sql.codegen.aggregate.map.twolevel.enabled}}
>  * {{spark.sql.codegen.aggregate.splitAggregateFunc.enabled}}
> But this did not change the behavior.
> Somehow, the partitioning of the data affects the computed sum.






[jira] [Updated] (SPARK-47024) Sum of floats/doubles may be incorrect depending on partitioning

2024-02-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-47024:
-
Description: 
I found this problem using 
[Hypothesis|https://hypothesis.readthedocs.io/en/latest/].

Here's a reproduction that fails on {{master}}, 3.5.0, 3.4.2, and 3.3.4 
(and probably all prior versions as well):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

SUM_EXAMPLE = [
    (1.0,),
    (0.0,),
    (1.0,),
    (9007199254740992.0,),
]

spark = (
    SparkSession.builder
    .config("spark.log.level", "ERROR")
    .getOrCreate()
)


def compare_sums(data, num_partitions):
    df = spark.createDataFrame(data, "val double").coalesce(1)
    result1 = df.agg(sum(col("val"))).collect()[0][0]
    df = spark.createDataFrame(data, "val double").repartition(num_partitions)
    result2 = df.agg(sum(col("val"))).collect()[0][0]
    assert result1 == result2, f"{result1}, {result2}"


if __name__ == "__main__":
    print(compare_sums(SUM_EXAMPLE, 2))
{code}
This fails as follows:
{code:python}
AssertionError: 9007199254740994.0, 9007199254740992.0
{code}
I suspected some kind of problem related to code generation, so I tried 
setting all of these to {{false}}:
 * {{spark.sql.codegen.wholeStage}}
 * {{spark.sql.codegen.aggregate.map.twolevel.enabled}}
 * {{spark.sql.codegen.aggregate.splitAggregateFunc.enabled}}

But this did not change the behavior.

Somehow, the partitioning of the data affects the computed sum.

  was:Will fill in the details shortly.

Summary: Sum of floats/doubles may be incorrect depending on 
partitioning  (was: Sum is incorrect (exact cause currently unknown))

Sadly, I think this is a case where we may not be able to do anything. The 
problem appears to be a classic case of floating point arithmetic going wrong.
{code:scala}
scala> 9007199254740992.0 + 1.0
val res0: Double = 9.007199254740992E15

scala> 9007199254740992.0 + 2.0
val res1: Double = 9.007199254740994E15
{code}
Notice how adding {{1.0}} did not change the large value, whereas adding 
{{2.0}} did.

So what I believe is happening is that, depending on the order in which the 
rows happen to be added, we either hit or do not hit this corner case.

In other words, if the aggregation goes like this:
{code:java}
(1.0 + 1.0) + (0.0 + 9007199254740992.0)
2.0 + 9007199254740992.0
9007199254740994.0
{code}
Then there is no problem.

However, if we are unlucky and it goes like this:
{code:java}
(1.0 + 0.0) + (1.0 + 9007199254740992.0)
1.0 + 9007199254740992.0
9007199254740992.0
{code}
Then we get the incorrect result shown in the description above.

This violates what I believe should be an invariant in Spark: that declarative 
aggregates like {{sum}} do not compute different results depending on accidents 
of row order or partitioning.

However, given that this is a basic problem of floating point arithmetic, I 
doubt we can really do anything here.

Note that there are many such "special" numbers that have this problem, not 
just 9007199254740992.0:
{code:scala}
scala> 1.7168917017330176e+16 + 1.0
val res2: Double = 1.7168917017330176E16

scala> 1.7168917017330176e+16 + 2.0
val res3: Double = 1.7168917017330178E16
{code}
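
The same order-dependence is easy to poke at from plain Python, and
{{math.ulp}} gives a quick way to find more such boundary values. A small
standalone sketch, independent of Spark:
{code:python}
import math

big = 9007199254740992.0  # 2**53: beyond this, not every integer is representable

# The four SUM_EXAMPLE values, summed in two different orders:
print((1.0 + 1.0) + (0.0 + big))  # 9007199254740994.0
print((1.0 + 0.0) + (1.0 + big))  # 9007199254740992.0 (the 1.0 is absorbed)

# Any double whose unit-in-the-last-place exceeds 1.0 can absorb a 1.0 this way:
print(math.ulp(big))                     # 2.0
print(math.ulp(1.7168917017330176e+16))  # 2.0
{code}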

> Sum of floats/doubles may be incorrect depending on partitioning
> 
>
> Key: SPARK-47024
> URL: https://issues.apache.org/jira/browse/SPARK-47024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 3.3.4
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
>
> I found this problem using 
> [Hypothesis|https://hypothesis.readthedocs.io/en/latest/].
> Here's a reproduction that fails on {{master}}, 3.5.0, 3.4.2, and 3.3.4 
> (and probably all prior versions as well):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col, sum
>
> SUM_EXAMPLE = [
>     (1.0,),
>     (0.0,),
>     (1.0,),
>     (9007199254740992.0,),
> ]
>
> spark = (
>     SparkSession.builder
>     .config("spark.log.level", "ERROR")
>     .getOrCreate()
> )
>
> def compare_sums(data, num_partitions):
>     df = spark.createDataFrame(data, "val double").coalesce(1)
>     result1 = df.agg(sum(col("val"))).collect()[0][0]
>     df = spark.createDataFrame(data, "val double").repartition(num_partitions)
>     result2 = df.agg(sum(col("val"))).collect()[0][0]
>     assert result1 == result2, f"{result1}, {result2}"
>
> if __name__ == "__main__":
>     print(compare_sums(SUM_EXAMPLE, 2))
> {code}
> This fails as follows:
> {code:python}
> AssertionError: 9007199254740994.0, 9007199254740992.0
> {code}
> I suspected some kind of problem related to code generation, so I tried 
> setting all of these to {{false}}:
>  * {{spark.sql.codegen.wholeStage}}
>  * {{spark.sql.codegen.aggregate.map.twolevel.enabled}}
>  * {{spark.sql.codegen.aggregate.splitAggregateFunc.enabled}}
> But this did not change the behavior.
> Somehow, the partitioning of the data affects the computed sum.

[jira] [Updated] (SPARK-47027) Use temporary directories for profiler test outputs

2024-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47027:
---
Labels: pull-request-available  (was: )

> Use temporary directories for profiler test outputs
> ---
>
> Key: SPARK-47027
> URL: https://issues.apache.org/jira/browse/SPARK-47027
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47026) Include coverage of JSON data sources in array/struct/map default value tests

2024-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47026:
---
Labels: pull-request-available  (was: )

> Include coverage of JSON data sources in array/struct/map default value tests
> -
>
> Key: SPARK-47026
> URL: https://issues.apache.org/jira/browse/SPARK-47026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Mark Jarvin
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47027) Use temporary directories for profiler test outputs

2024-02-12 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-47027:
--
Summary: Use temporary directories for profiler test outputs  (was: Move 
TestUtils to the generic testing utils.)

> Use temporary directories for profiler test outputs
> ---
>
> Key: SPARK-47027
> URL: https://issues.apache.org/jira/browse/SPARK-47027
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>







[jira] [Created] (SPARK-47029) ALTER COLUMN DROP DEFAULT test fails with JSON data sources

2024-02-12 Thread Mark Jarvin (Jira)
Mark Jarvin created SPARK-47029:
---

 Summary: ALTER COLUMN DROP DEFAULT test fails with JSON data 
sources
 Key: SPARK-47029
 URL: https://issues.apache.org/jira/browse/SPARK-47029
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0, 4.0.0
Reporter: Mark Jarvin


Enabling the JSON data source causes a test case to fail:
{code:java}
[info] - SPARK-39557 INSERT INTO statements with tables with map defaults *** 
FAILED *** (1 second, 498 milliseconds)
[info]   Results do not match for query:
[info]   Timezone: 
sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
[info]   Timezone Env:
[info]
[info]   == Parsed Logical Plan ==
[info]   'UnresolvedRelation [t], [], false
[info]
[info]   == Analyzed Logical Plan ==
[info]   i: int, s: struct<x:array<struct<a:int,b:int>>,y:array<map<string,boolean>>>, t: array<map<string,boolean>>
[info]   SubqueryAlias spark_catalog.default.t
[info]   +- Relation spark_catalog.default.t[i#13929,s#13930,t#13931] json
[info]
[info]   == Optimized Logical Plan ==
[info]   Relation spark_catalog.default.t[i#13929,s#13930,t#13931] json
[info]
[info]   == Physical Plan ==
[info]   FileScan json spark_catalog.default.t[i#13929,s#13930,t#13931] 
Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex(1 
paths)[file:/home/mark.jarvin/photon/spark/sql/core/spark-warehouse/org.apach...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct<i:int,s:struct<x:array<struct<a:int,b:int>>,y:array<map<string,boolean>>>,t:array<map<string,boolean>>>
[info]   ![1,[List([1,2]),List(Map(def -> false, jkl -> true))],List(Map(xyz -> 
true))]   [1,[ArraySeq([1,2]),ArraySeq(Map(def -> false, jkl -> 
true))],ArraySeq(Map(xyz -> true))]
[info]   ![2,null,List(Map(xyz -> true))]                                       
          [2,[ArraySeq([1,2]),ArraySeq(Map(def -> false, jkl -> 
true))],ArraySeq(Map(xyz -> true))]
[info]   ![3,[List([3,4]),List(Map(mno -> false, pqr -> true))],List(Map(xyz -> 
true))]   [3,[ArraySeq([3,4]),ArraySeq(Map(mno -> false, pqr -> 
true))],ArraySeq(Map(xyz -> true))]
[info]   ![4,[List([3,4]),List(Map(mno -> false, pqr -> true))],List(Map(xyz -> 
true))]   [4,[ArraySeq([3,4]),ArraySeq(Map(mno -> false, pqr -> 
true))],ArraySeq(Map(xyz -> true))] (QueryTest.scala:267){code}






[jira] [Created] (SPARK-47028) Check SparkUnsupportedOperationException instead of UnsupportedOperationException

2024-02-12 Thread Max Gekk (Jira)
Max Gekk created SPARK-47028:


 Summary: Check SparkUnsupportedOperationException instead of 
UnsupportedOperationException
 Key: SPARK-47028
 URL: https://issues.apache.org/jira/browse/SPARK-47028
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Max Gekk
Assignee: Max Gekk


Use checkError() to test the SparkUnsupportedOperationException exception 
instead of UnsupportedOperationException in the SQL project.






[jira] [Created] (SPARK-47027) Move TestUtils to the generic testing utils.

2024-02-12 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-47027:
-

 Summary: Move TestUtils to the generic testing utils.
 Key: SPARK-47027
 URL: https://issues.apache.org/jira/browse/SPARK-47027
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 4.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-47026) Include coverage of JSON data sources in array/struct/map default value tests

2024-02-12 Thread Mark Jarvin (Jira)
Mark Jarvin created SPARK-47026:
---

 Summary: Include coverage of JSON data sources in array/struct/map 
default value tests
 Key: SPARK-47026
 URL: https://issues.apache.org/jira/browse/SPARK-47026
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0, 4.0.0
Reporter: Mark Jarvin









[jira] [Resolved] (SPARK-47023) Upgrade `aircompressor` to 1.26

2024-02-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47023.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45084
[https://github.com/apache/spark/pull/45084]

> Upgrade `aircompressor` to 1.26
> ---
>
> Key: SPARK-47023
> URL: https://issues.apache.org/jira/browse/SPARK-47023
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47025) Switch `Guava 19.0` dependency scope from `provided` to `test`

2024-02-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47025:
---
Labels: pull-request-available  (was: )

> Switch `Guava 19.0` dependency scope from `provided` to `test`
> --
>
> Key: SPARK-47025
> URL: https://issues.apache.org/jira/browse/SPARK-47025
> Project: Spark
>  Issue Type: Test
>  Components: Build, SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47025) Switch `Guava 19.0` dependency scope from `provided` to `test`

2024-02-12 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47025:
-

 Summary: Switch `Guava 19.0` dependency scope from `provided` to 
`test`
 Key: SPARK-47025
 URL: https://issues.apache.org/jira/browse/SPARK-47025
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-47025) Switch `Guava 19.0` dependency scope from `provided` to `test`

2024-02-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47025:
--
Component/s: Build

> Switch `Guava 19.0` dependency scope from `provided` to `test`
> --
>
> Key: SPARK-47025
> URL: https://issues.apache.org/jira/browse/SPARK-47025
> Project: Spark
>  Issue Type: Test
>  Components: Build, SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Created] (SPARK-47024) Sum is incorrect (exact cause currently unknown)

2024-02-12 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47024:


 Summary: Sum is incorrect (exact cause currently unknown)
 Key: SPARK-47024
 URL: https://issues.apache.org/jira/browse/SPARK-47024
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.4, 3.5.0, 3.4.2
Reporter: Nicholas Chammas


Will fill in the details shortly.






[jira] [Resolved] (SPARK-44445) Upgrade to `htmlunit` 3.10.0 and `htmlunit3-driver` 4.17.0

2024-02-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44445.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45079
[https://github.com/apache/spark/pull/45079]

> Upgrade to `htmlunit` 3.10.0 and `htmlunit3-driver` 4.17.0
> --
>
> Key: SPARK-44445
> URL: https://issues.apache.org/jira/browse/SPARK-44445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [CVE-2023-26119|https://nvd.nist.gov/vuln/detail/CVE-2023-26119]






[jira] [Created] (SPARK-47023) Upgrade `aircompressor` to 1.26

2024-02-12 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47023:
-

 Summary: Upgrade `aircompressor` to 1.26
 Key: SPARK-47023
 URL: https://issues.apache.org/jira/browse/SPARK-47023
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun





