This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new fcc5dbc9b6c8 [SPARK-45599][CORE] Use object equality in OpenHashSet
fcc5dbc9b6c8 is described below

commit fcc5dbc9b6c8a8e16dc2e0854f3eebc8758a5826
Author: Nicholas Chammas <nicholas.cham...@gmail.com>
AuthorDate: Mon Feb 26 16:03:30 2024 +0800

    [SPARK-45599][CORE] Use object equality in OpenHashSet

### What changes were proposed in this pull request?

Change `OpenHashSet` to use object equality instead of cooperative equality when looking up keys.

### Why are the changes needed?

This brings the behavior of `OpenHashSet` more in line with the semantics of `java.util.HashSet`, and fixes its behavior when comparing values for which `equals` and `==` return different results, like 0.0/-0.0 and NaN/NaN.

For example, in certain cases where both 0.0 and -0.0 are provided as keys to the set, lookups of one or the other key may return the [wrong position][wrong] in the set. This leads to the bug described in SPARK-45599 and summarized in [this comment][1].

[wrong]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283
[1]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954

### Does this PR introduce _any_ user-facing change?

Yes, it resolves the bug described in SPARK-45599. `OpenHashSet` is used widely under the hood, including in:

- `OpenHashMap`, which itself backs:
  - `TypedImperativeAggregate`
  - aggregate functions like `percentile` and `mode`
  - many algorithms in ML and MLlib
- `SQLOpenHashSet`, which backs array functions like `array_union` and `array_distinct`

However, the user-facing changes should be limited to the kind of edge case described in SPARK-45599.

### How was this patch tested?

New and existing unit tests.
Of the new tests added in this PR, some simply validate that we have not changed existing SQL semantics, while others confirm that we have fixed the specific bug reported in SPARK-45599 along with any related incorrect behavior.

New tests failing on `master` but passing on this branch:
- [Handling 0.0 and -0.0 in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R273)
- [Handling NaN in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R302)
- [Handling 0.0 and -0.0 in `OpenHashMap`](https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R253)
- [Handling 0.0 and -0.0 when computing percentile](https://github.com/apache/spark/pull/45036/files#diff-bd3d5c79ede5675f4bf10d2efb313db893d57443d6d6d67b1f8766e6ce741271R1092)

New tests passing both on `master` and this branch:
- [Handling 0.0, -0.0, and NaN in `array_union`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R793)
- [Handling 0.0, -0.0, and NaN in `array_distinct`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R801)
- [Handling 0.0, -0.0, and NaN in `GROUP BY`](https://github.com/apache/spark/pull/45036/files#diff-496edb8b03201f078c3772ca81f7c7f80002acc11dff00b1d06d288b87855264R1107)
- [Normalizing -0 and -0.0](https://github.com/apache/spark/pull/45036/files#diff-4bdd04d06a2d88049dd5c8a67715c5566dd68a1c4ebffc689dc74b6b2e0b3b04R782)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45036 from nchammas/SPARK-45599-plus-and-minus-zero.
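The equality semantics the commit message describes can be sketched in plain Java: Scala's cooperative `==` on doubles is IEEE 754 numeric comparison, while `equals` on boxed doubles goes through `Double.doubleToLongBits`, and the two disagree on exactly the values this patch targets. The class name below is illustrative only and is not part of the patch:

```java
// Illustrates why key lookup must use equals() rather than numeric ==:
// the two comparisons disagree on 0.0 vs -0.0 and on NaN vs NaN.
public class KeyEqualityDemo {
    public static void main(String[] args) {
        // Numeric (cooperative-style) comparison treats 0.0 and -0.0 as equal...
        System.out.println(0.0 == -0.0);                                      // true
        // ...but equals() compares via doubleToLongBits, under which their
        // bit patterns differ, so they are distinct hash-set keys.
        System.out.println(Double.valueOf(0.0).equals(Double.valueOf(-0.0))); // false

        // Numeric comparison says NaN is never equal to anything, even itself...
        System.out.println(Double.NaN == Double.NaN);                         // false
        // ...but equals() canonicalizes NaN, so a set can find it again.
        System.out.println(
            Double.valueOf(Double.NaN).equals(Double.valueOf(Double.NaN)));   // true
    }
}
```

This is the same reason `java.util.HashSet<Double>` already keeps 0.0 and -0.0 as separate entries and deduplicates NaN, which is the behavior the patch brings to `OpenHashSet`.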
Authored-by: Nicholas Chammas <nicholas.cham...@gmail.com>
Signed-off-by: Wenchen Fan <wenc...@databricks.com>
---
 .../apache/spark/util/collection/OpenHashSet.scala | 16 +++++++--
 .../spark/util/collection/OpenHashMapSuite.scala   | 30 +++++++++++++++++
 .../spark/util/collection/OpenHashSetSuite.scala   | 39 ++++++++++++++++++++++
 .../sql-tests/analyzer-results/ansi/array.sql.out  | 14 ++++++++
 .../analyzer-results/ansi/literals.sql.out         |  7 ++++
 .../sql-tests/analyzer-results/array.sql.out       | 14 ++++++++
 .../sql-tests/analyzer-results/group-by.sql.out    | 19 +++++++++++
 .../sql-tests/analyzer-results/literals.sql.out    |  7 ++++
 .../src/test/resources/sql-tests/inputs/array.sql  |  4 +++
 .../test/resources/sql-tests/inputs/group-by.sql   | 15 +++++++++
 .../test/resources/sql-tests/inputs/literals.sql   |  3 ++
 .../resources/sql-tests/results/ansi/array.sql.out | 16 +++++++++
 .../sql-tests/results/ansi/literals.sql.out        |  8 +++++
 .../test/resources/sql-tests/results/array.sql.out | 16 +++++++++
 .../resources/sql-tests/results/group-by.sql.out   | 22 ++++++++++++
 .../resources/sql-tests/results/literals.sql.out   |  8 +++++
 .../apache/spark/sql/DataFrameAggregateSuite.scala | 33 ++++++++++++++++++
 17 files changed, 268 insertions(+), 3 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala b/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
index faee9ce56a0a..a42fa9ba6bc8 100644
--- a/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
+++ b/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
@@ -110,6 +110,17 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     this
   }
 
+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.
+   *
+   * See: https://issues.apache.org/jira/browse/SPARK-45599
+   */
+  @annotation.nowarn("cat=other-non-cooperative-equals")
+  private def keyExistsAtPos(k: T, pos: Int) =
+    _data(pos) equals k
+
   /**
    * Add an element to the set. This one differs from add in that it doesn't trigger rehashing.
    * The caller is responsible for calling rehashIfNeeded.
@@ -130,8 +141,7 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
         _bitset.set(pos)
         _size += 1
         return pos | NONEXISTENCE_MASK
-      } else if (_data(pos) == k) {
-        // Found an existing key.
+      } else if (keyExistsAtPos(k, pos)) {
         return pos
       } else {
         // quadratic probing with values increase by 1, 2, 3, ...
@@ -165,7 +175,7 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     while (true) {
       if (!_bitset.get(pos)) {
         return INVALID_POS
-      } else if (k == _data(pos)) {
+      } else if (keyExistsAtPos(k, pos)) {
         return pos
       } else {
         // quadratic probing with values increase by 1, 2, 3, ...
diff --git a/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala b/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
index 155c855c8723..ae9fb54bddfb 100644
--- a/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
@@ -249,4 +249,34 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly; NaNs should count together") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+
+    val map1 = new OpenHashMap[Double, Int]()
+    spark45599Repro.foreach(map1.changeValue(_, 1, {_ + 1}))
+    assert(map1(0.0) == 1)
+    assert(map1(-0.0) == 1)
+    assert(map1(Double.NaN) == 3)
+
+    val map2 = new OpenHashMap[Double, Int]()
+    // Simply changing the order in which the elements are added to the map should not change the
+    // counts for 0.0 and -0.0.
+    spark45599Repro.reverse.foreach(map2.changeValue(_, 1, {_ + 1}))
+    assert(map2(0.0) == 1)
+    assert(map2(-0.0) == 1)
+    assert(map2(Double.NaN) == 3)
+  }
 }
diff --git a/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala b/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
index 89a308556d5d..0bc8aa067f57 100644
--- a/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
@@ -269,4 +269,43 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
+    // Therefore, 0.0 and -0.0 should get separate entries in the hash set.
+    //
+    // Exactly these elements provided in roughly this order will trigger the following scenario:
+    // When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0.
+    // In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because
+    // -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position
+    // of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return
+    // the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to
+    // return the wrong value for a key based on whether or not this bitset lookup collision
+    // happens.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+    val set = new OpenHashSet[Double]()
+    spark45599Repro.foreach(set.add)
+    assert(set.size == 6)
+    val zeroPos = set.getPos(0.0)
+    val negZeroPos = set.getPos(-0.0)
+    assert(zeroPos != negZeroPos)
+  }
+
+  test("SPARK-45599: NaN and NaN are the same but not equal") {
+    // Any mathematical comparison to NaN will return false, but when we place it in
+    // a hash set we want the lookup to work like a "normal" value.
+    val set = new OpenHashSet[Double]()
+    set.add(Double.NaN)
+    set.add(Double.NaN)
+    assert(set.contains(Double.NaN))
+    assert(set.size == 1)
+  }
 }
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out
index 1ab4b03dcb26..2998803698c3 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out
@@ -747,3 +747,17 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 -- !query analysis
 Project [array_prepend(array(cast(null as string)), cast(null as string)) AS array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING))#x]
 +- OneRowRelation
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query analysis
+Project [array_union(array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double)), array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double))) AS array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN))#x]
++- OneRowRelation
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query analysis
+Project [array_distinct(array(cast(0.0 as double), cast(0.0 as double), cast(0.0 as double), cast(NaN as double), cast(NaN as double))) AS array_distinct(array(0.0, 0.0, 0.0, NaN, NaN))#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out
index d0e26f698fba..570cfb73444e 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out
@@ -699,3 +699,10 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query analysis
+Project [0 AS 0#x, 0.0 AS 0.0#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out
index 67c9bb8d992c..459c5613e919 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out
@@ -747,3 +747,17 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 -- !query analysis
 Project [array_prepend(array(cast(null as string)), cast(null as string)) AS array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING))#x]
 +- OneRowRelation
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query analysis
+Project [array_union(array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double)), array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double))) AS array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN))#x]
++- OneRowRelation
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query analysis
+Project [array_distinct(array(cast(0.0 as double), cast(0.0 as double), cast(0.0 as double), cast(NaN as double), cast(NaN as double))) AS array_distinct(array(0.0, 0.0, 0.0, NaN, NaN))#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out
index 229d0cb6faf2..324a0a366d85 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out
@@ -1171,3 +1171,22 @@ Aggregate [c#x], [(c#x * 2) AS d#x]
 +- Project [if ((a#x < 0)) 0 else a#x AS b#x]
    +- SubqueryAlias t1
       +- LocalRelation [a#x]
+
+
+-- !query
+SELECT col1, count(*) AS cnt
+FROM VALUES
+  (0.0),
+  (-0.0),
+  (double('NaN')),
+  (double('NaN')),
+  (double('Infinity')),
+  (double('Infinity')),
+  (-double('Infinity')),
+  (-double('Infinity'))
+GROUP BY col1
+ORDER BY col1
+-- !query analysis
+Sort [col1#x ASC NULLS FIRST], true
++- Aggregate [col1#x], [col1#x, count(1) AS cnt#xL]
+   +- LocalRelation [col1#x]
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out
index d0e26f698fba..570cfb73444e 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out
@@ -699,3 +699,10 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query analysis
+Project [0 AS 0#x, 0.0 AS 0.0#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/inputs/array.sql b/sql/core/src/test/resources/sql-tests/inputs/array.sql
index 52a0906ea739..865dc8bac4ea 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/array.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/array.sql
@@ -177,3 +177,7 @@ select array_prepend(CAST(null AS ARRAY<String>), CAST(null as String));
 select array_prepend(array(), 1);
 select array_prepend(CAST(array() AS ARRAY<String>), CAST(NULL AS String));
 select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String));
+
+-- SPARK-45599: Confirm 0.0, -0.0, and NaN are handled appropriately.
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")));
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")));
diff --git a/sql/core/src/test/resources/sql-tests/inputs/group-by.sql b/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
index ea1e2f323151..7d6116ac1e3f 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
@@ -259,3 +259,18 @@ FROM (
   GROUP BY b
 ) t3
 GROUP BY c;
+
+-- SPARK-45599: Check that "weird" doubles group and sort as desired.
+SELECT col1, count(*) AS cnt
+FROM VALUES
+  (0.0),
+  (-0.0),
+  (double('NaN')),
+  (double('NaN')),
+  (double('Infinity')),
+  (double('Infinity')),
+  (-double('Infinity')),
+  (-double('Infinity'))
+GROUP BY col1
+ORDER BY col1
+;
diff --git a/sql/core/src/test/resources/sql-tests/inputs/literals.sql b/sql/core/src/test/resources/sql-tests/inputs/literals.sql
index 9f0eefc16a8c..e1e4a370bffd 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/literals.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/literals.sql
@@ -118,3 +118,6 @@ select +X'1';
 select -date '1999-01-01';
 select -timestamp '1999-01-01';
 select -x'2379ACFe';
+
+-- normalize -0 and -0.0
+select -0, -0.0;
diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out
index c0f78f45db41..d17d87900fc7 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out
@@ -907,3 +907,19 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 struct<array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING)):array<string>>
 -- !query output
 [null,null]
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query schema
+struct<array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query schema
+struct<array_distinct(array(0.0, 0.0, 0.0, NaN, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out
index b79a329b62ab..4e4c70cc333b 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out
@@ -777,3 +777,11 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query schema
+struct<0:int,0.0:decimal(1,1)>
+-- !query output
+0	0.0
diff --git a/sql/core/src/test/resources/sql-tests/results/array.sql.out b/sql/core/src/test/resources/sql-tests/results/array.sql.out
index 12635385bbb7..92da0a490ff8 100644
--- a/sql/core/src/test/resources/sql-tests/results/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/array.sql.out
@@ -788,3 +788,19 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 struct<array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING)):array<string>>
 -- !query output
 [null,null]
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query schema
+struct<array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query schema
+struct<array_distinct(array(0.0, 0.0, 0.0, NaN, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
diff --git a/sql/core/src/test/resources/sql-tests/results/group-by.sql.out b/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
index 3e94c62b7c78..0b7bdfd85223 100644
--- a/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
@@ -1102,3 +1102,25 @@ struct<d:int>
 -- !query output
 0
 2
+
+
+-- !query
+SELECT col1, count(*) AS cnt
+FROM VALUES
+  (0.0),
+  (-0.0),
+  (double('NaN')),
+  (double('NaN')),
+  (double('Infinity')),
+  (double('Infinity')),
+  (-double('Infinity')),
+  (-double('Infinity'))
+GROUP BY col1
+ORDER BY col1
+-- !query schema
+struct<col1:double,cnt:bigint>
+-- !query output
+-Infinity	2
+0.0	2
+Infinity	2
+NaN	2
diff --git a/sql/core/src/test/resources/sql-tests/results/literals.sql.out b/sql/core/src/test/resources/sql-tests/results/literals.sql.out
index b79a329b62ab..4e4c70cc333b 100644
--- a/sql/core/src/test/resources/sql-tests/results/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/literals.sql.out
@@ -777,3 +777,11 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query schema
+struct<0:int,0.0:decimal(1,1)>
+-- !query output
+0	0.0
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
index 15dffb3696dd..ec589fa77241 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
@@ -1089,6 +1089,39 @@ class DataFrameAggregateSuite extends QueryTest
     )
   }
 
+  test("SPARK-45599: Neither 0.0 nor -0.0 should be dropped when computing percentile") {
+    // To reproduce the bug described in SPARK-45599, we need exactly these rows in roughly
+    // this order in a DataFrame with exactly 1 partition.
+    // scalastyle:off line.size.limit
+    // See: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17806954
+    // scalastyle:on line.size.limit
+    val spark45599Repro: DataFrame = Seq(
+      0.0,
+      2.0,
+      153.0,
+      168.0,
+      3252411229536261.0,
+      7.205759403792794e+16,
+      1.7976931348623157e+308,
+      0.25,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      -128.0,
+      Double.NaN,
+      Double.NaN
+    ).toDF("val").coalesce(1)
+
+    checkAnswer(
+      spark45599Repro.agg(
+        percentile(col("val"), lit(0.1))
+      ),
+      // With the buggy implementation of OpenHashSet, this returns `0.050000000000000044`
+      // instead of `-0.0`.
+      List(Row(-0.0))
+    )
+  }
+
   test("any_value") {
     checkAnswer(
       courseSales.groupBy("course").agg(

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org