[ https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-26021:
-------------------------------
    Labels: correctness  (was: )

> -0.0 and 0.0 not treated consistently, doesn't match Hive
> ---------------------------------------------------------
>
>                 Key: SPARK-26021
>                 URL: https://issues.apache.org/jira/browse/SPARK-26021
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Sean Owen
>            Assignee: Alon Doron
>            Priority: Critical
>              Labels: correctness
>             Fix For: 3.0.0
>
>
> Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new
> issue:
> The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are
> numerically equal but are not the same double value:
> In Hive, 0.0 and -0.0 have been treated as equal since
> https://issues.apache.org/jira/browse/HIVE-11174.
> That's not the case in Spark SQL, where "group by" (non-codegen) treats them
> as different values. Because their hashes differ, they are put in different
> buckets of UnsafeFixedWidthAggregationMap.
> In addition, there's an inconsistency when using codegen, for example in the
> following unit test:
> {code:java}
> println(Seq(0.0d, 0.0d, -0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,3]
> {code:java}
> println(Seq(0.0d, -0.0d, 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,1], [-0.0,2]
> {code:java}
> spark.conf.set("spark.sql.codegen.wholeStage", "false")
> println(Seq(0.0d, -0.0d, 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,2], [-0.0,1]
> Note that the only difference between the first 2 lines is the order of the
> elements in the Seq.
> This inconsistency results from different partitioning of the Seq and the
> use of the generated fast hash map in the first (partial) aggregation.
> It looks like we need to add a specific check for -0.0 before hashing (both
> in codegen and non-codegen modes) if we want to fix this.
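
To illustrate the "check for -0.0 before hashing" idea outside of Spark, here is a minimal plain-Java sketch. It is not the change made for this issue: the class `ZeroNormalization` and its methods `normalize` and `groupCounts` are hypothetical names invented for the example, and a `HashMap` keyed on the raw bit pattern only stands in for an aggregation map that hashes the binary value of the key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Spark.
public class ZeroNormalization {

    // Collapse -0.0 to 0.0. The comparison (d == 0.0d) is true for both
    // zeros, so both return the positive-zero literal. Every other value,
    // including NaN, is returned unchanged.
    public static double normalize(double d) {
        return d == 0.0d ? 0.0d : d;
    }

    // Count occurrences per group, keying each value on its raw bit pattern.
    // This mimics a map that hashes the binary representation of the key.
    public static Map<Long, Integer> groupCounts(double[] values, boolean normalizeFirst) {
        Map<Long, Integer> counts = new HashMap<>();
        for (double v : values) {
            double key = normalizeFirst ? normalize(v) : v;
            counts.merge(Double.doubleToLongBits(key), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] values = {0.0d, -0.0d, 0.0d};

        // The two zeros compare equal, but their bit patterns differ.
        System.out.println("0.0 == -0.0: " + (0.0d == -0.0d));
        System.out.println("same bits:   "
            + (Double.doubleToLongBits(0.0d) == Double.doubleToLongBits(-0.0d)));

        // Without normalization the zeros land in separate groups;
        // with normalization they share one group.
        System.out.println("groups (raw):        " + groupCounts(values, false).size());
        System.out.println("groups (normalized): " + groupCounts(values, true).size());
    }
}
```

The sketch applies the normalization at the point where the key is computed, so every path that builds a key sees the same value. That is the property the description asks for across codegen and non-codegen modes.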