[ https://issues.apache.org/jira/browse/SPARK-24834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684146#comment-16684146 ]

Alon Doron commented on SPARK-24834:
------------------------------------

[~srowen] [~mcheah] - In Hive, 0.0 and -0.0 have been equal since 
https://issues.apache.org/jira/browse/HIVE-11174.
That's not the case in Spark SQL, where "group by" (in non-codegen mode) treats 
them as different values: since their hashes differ, they are placed in 
different buckets of UnsafeFixedWidthAggregationMap.
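A quick way to see why (a minimal check, assuming the hash is derived from the double's raw long bits):
{code:java}
// 0.0 and -0.0 compare equal under IEEE 754, but their bit patterns differ,
// so any hash derived from the raw bits separates them:
println(0.0d == -0.0d)                            // true
println(java.lang.Double.doubleToLongBits(0.0d))  // 0
println(java.lang.Double.doubleToLongBits(-0.0d)) // -9223372036854775808 (sign bit set)
{code}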

In addition, there's an inconsistency when codegen is used, as the following 
snippets show:
{code:java}
println(Seq(0.0d, 0.0d, -0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
{code}
[0.0,3]
{code:java}
println(Seq(0.0d, -0.0d, 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
{code}
[0.0,1], [-0.0,2]
{code:java}
spark.conf.set("spark.sql.codegen.wholeStage", "false")
println(Seq(0.0d, -0.0d, 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
{code}
[0.0,2], [-0.0,1]

Note that the only difference between the first two snippets is the order of 
the elements in the Seq.
The inconsistency results from the different partitioning of the Seq and from 
the use of the generated fast hash map in the first, partial aggregation.

It looks like we need to add a specific check for -0.0 before hashing (in both 
codegen and non-codegen modes) if we want to fix this, along the lines of the 
sketch below.
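A minimal sketch of such a normalization (normalizeZero is an illustrative name, not Spark's actual code):
{code:java}
// Sketch only, not Spark's implementation. Since -0.0 == 0.0 is true under
// IEEE 754, this maps -0.0 to +0.0 and leaves every other value untouched
// (NaN passes through because NaN == 0.0d is false). Hashing the normalized
// bits would put both zeros in the same aggregation bucket.
def normalizeZero(d: Double): Double = if (d == 0.0d) 0.0d else d

assert(java.lang.Double.doubleToLongBits(normalizeZero(-0.0d)) ==
  java.lang.Double.doubleToLongBits(normalizeZero(0.0d)))
{code}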
Thoughts?

> Utils#nanSafeCompare{Double,Float} functions do not differ from normal java 
> double/float comparison
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24834
>                 URL: https://issues.apache.org/jira/browse/SPARK-24834
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.2
>            Reporter: Benjamin Duffield
>            Priority: Minor
>
> Utils.scala contains two functions, `nanSafeCompareDoubles` and 
> `nanSafeCompareFloats`, which purport to have special handling of NaN values 
> in comparisons.
> The handling in these functions does not appear to differ from 
> java.lang.Double.compare and java.lang.Float.compare - they seem to produce 
> output identical to the built-in Java comparison functions.
> I think it's clearer not to have these special Utils functions, and instead 
> just use the standard Java comparison functions.
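For reference on the quoted claim, a quick illustrative check of how java.lang.Double.compare already handles the special values:
{code:java}
// Double.compare orders NaN above every other value (including +Infinity),
// treats NaN as equal to itself, and orders -0.0 before 0.0:
println(java.lang.Double.compare(Double.NaN, Double.PositiveInfinity)) // 1
println(java.lang.Double.compare(Double.NaN, Double.NaN))              // 0
println(java.lang.Double.compare(-0.0d, 0.0d))                         // -1
{code}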


