[ 
https://issues.apache.org/jira/browse/SPARK-47024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-47024.
--------------------------------------
    Resolution: Not A Problem

Resolving this as "Not A Problem".

I mean, it _is_ a problem, but it's a fundamental property of floating-point 
arithmetic: addition is not associative, so the order and grouping in which 
partial sums are combined can change the result. I don't think there is 
anything practical that can be done about it in Spark.
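For the record, the same effect is reproducible in plain Python with no Spark involved; this sketch mirrors the reported values and assumes a two-partition split of [1.0, 2**53] and [0.0, 1.0] for illustration:

```python
# IEEE 754 double addition is not associative, so the grouping imposed
# by partitioning changes the computed sum.
vals = [1.0, 0.0, 1.0, 9007199254740992.0]  # last value is 2**53

# One partition: values accumulated left to right.
one_partition = sum(vals)  # 2.0 + 2**53 = 9007199254740994.0 (representable)

# Two partitions (hypothetical split): each partition is summed separately,
# then the partial sums are combined. 2**53 + 1.0 rounds back to 2**53
# because odd integers above 2**53 are not representable as doubles.
two_partitions = (1.0 + 9007199254740992.0) + (0.0 + 1.0)

assert one_partition != two_partitions  # 9007199254740994.0 vs 9007199254740992.0
```

Any aggregation that combines per-partition partial results is exposed to this, which is why no codegen flag changes the behavior.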

> Sum of floats/doubles may be incorrect depending on partitioning
> ----------------------------------------------------------------
>
>                 Key: SPARK-47024
>                 URL: https://issues.apache.org/jira/browse/SPARK-47024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.2, 3.5.0, 3.3.4
>            Reporter: Nicholas Chammas
>            Priority: Major
>              Labels: correctness
>
> I found this problem using 
> [Hypothesis|https://hypothesis.readthedocs.io/en/latest/].
> Here's a reproduction that fails on {{master}}, 3.5.0, 3.4.2, and 3.3.4 
> (and probably all prior versions as well):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col, sum
> SUM_EXAMPLE = [
>     (1.0,),
>     (0.0,),
>     (1.0,),
>     (9007199254740992.0,),
> ]
> spark = (
>     SparkSession.builder
>     .config("spark.log.level", "ERROR")
>     .getOrCreate()
> )
> def compare_sums(data, num_partitions):
>     df = spark.createDataFrame(data, "val double").coalesce(1)
>     result1 = df.agg(sum(col("val"))).collect()[0][0]
>     df = spark.createDataFrame(data, "val double").repartition(num_partitions)
>     result2 = df.agg(sum(col("val"))).collect()[0][0]
>     assert result1 == result2, f"{result1}, {result2}"
> if __name__ == "__main__":
>     compare_sums(SUM_EXAMPLE, 2)
> {code}
> This fails as follows:
> {code:python}
> AssertionError: 9007199254740994.0, 9007199254740992.0
> {code}
> I suspected some kind of problem related to code generation, so I tried 
> setting all of these to {{false}}:
>  * {{spark.sql.codegen.wholeStage}}
>  * {{spark.sql.codegen.aggregate.map.twolevel.enabled}}
>  * {{spark.sql.codegen.aggregate.splitAggregateFunc.enabled}}
> But this did not change the behavior.
> Somehow, the partitioning of the data affects the computed sum.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
