[ https://issues.apache.org/jira/browse/SPARK-47024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Chammas resolved SPARK-47024.
--------------------------------------
    Resolution: Not A Problem

Resolving this as "Not A Problem".

I mean, it _is_ a problem, but it's a basic problem with floats, and I don't think there is anything practical that can be done about it in Spark.

> Sum of floats/doubles may be incorrect depending on partitioning
> ----------------------------------------------------------------
>
>                 Key: SPARK-47024
>                 URL: https://issues.apache.org/jira/browse/SPARK-47024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.2, 3.5.0, 3.3.4
>            Reporter: Nicholas Chammas
>            Priority: Major
>              Labels: correctness
>
> I found this problem using
> [Hypothesis|https://hypothesis.readthedocs.io/en/latest/].
> Here's a reproduction that fails on {{master}}, 3.5.0, 3.4.2, and 3.3.4
> (and probably all prior versions as well):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col, sum
>
> SUM_EXAMPLE = [
>     (1.0,),
>     (0.0,),
>     (1.0,),
>     (9007199254740992.0,),
> ]
>
> spark = (
>     SparkSession.builder
>     .config("spark.log.level", "ERROR")
>     .getOrCreate()
> )
>
> def compare_sums(data, num_partitions):
>     df = spark.createDataFrame(data, "val double").coalesce(1)
>     result1 = df.agg(sum(col("val"))).collect()[0][0]
>     df = spark.createDataFrame(data, "val double").repartition(num_partitions)
>     result2 = df.agg(sum(col("val"))).collect()[0][0]
>     assert result1 == result2, f"{result1}, {result2}"
>
> if __name__ == "__main__":
>     print(compare_sums(SUM_EXAMPLE, 2))
> {code}
> This fails as follows:
> {code:python}
> AssertionError: 9007199254740994.0, 9007199254740992.0
> {code}
> I suspected some kind of problem related to code generation, so tried setting
> all of these to {{false}}:
> * {{spark.sql.codegen.wholeStage}}
> * {{spark.sql.codegen.aggregate.map.twolevel.enabled}}
> * {{spark.sql.codegen.aggregate.splitAggregateFunc.enabled}}
> But this did not change the behavior.
> Somehow, the partitioning of the data affects the computed sum.
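
For anyone wondering why partitioning can matter at all: floating-point addition is not associative, so the order in which partial sums are combined can change the rounded result. Below is a minimal pure-Python sketch of that effect using the same four values. It is not Spark code. The split of rows into two partitions is an assumption chosen for illustration (the report does not say how Spark actually distributed the rows), and the helper names left_to_right_sum and partitioned_sum are hypothetical.

```python
import math

# Same values as SUM_EXAMPLE in the report; the last one is 2**53.
values = [1.0, 0.0, 1.0, 9007199254740992.0]


def left_to_right_sum(xs):
    """Add doubles one at a time, rounding after every addition."""
    total = 0.0
    for x in xs:
        total += x
    return total


def partitioned_sum(partitions):
    """Sum each partition separately, then add the partial sums."""
    partials = [left_to_right_sum(p) for p in partitions]
    return left_to_right_sum(partials)


# One partition: the small values are added first (1.0 + 0.0 + 1.0 = 2.0),
# and 2.0 + 2**53 is exactly representable.
single = left_to_right_sum(values)

# ASSUMPTION: a hypothetical two-partition layout, not taken from Spark.
# 1.0 + 2**53 is not representable and rounds back to 2**53, twice.
split = partitioned_sum([[1.0, 9007199254740992.0], [0.0, 1.0]])

# math.fsum tracks the lost low-order bits and returns the correctly
# rounded sum regardless of input order.
exact = math.fsum(values)

print(single)  # 9007199254740994.0
print(split)   # 9007199254740992.0
print(exact)   # 9007199254740994.0
```

Above 2**53, adjacent doubles are 2 apart, so adding a lone 1.0 to 2**53 rounds back to 2**53 and that 1.0 is lost. When both small values are added together first, their sum of 2.0 survives. The two results match the values in the AssertionError above.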