This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 588a55d010fe  [SPARK-45599][CORE] Use object equality in OpenHashSet

588a55d010fe is described below

commit 588a55d010fefda7a63cde3b616ac38728fe4cfe
Author: Nicholas Chammas <nicholas.cham...@gmail.com>
AuthorDate: Mon Feb 26 16:03:30 2024 +0800

    [SPARK-45599][CORE] Use object equality in OpenHashSet

    ### What changes were proposed in this pull request?

    Change `OpenHashSet` to use object equality instead of cooperative equality when looking up keys.

    ### Why are the changes needed?

    This brings the behavior of `OpenHashSet` more in line with the semantics of `java.util.HashSet`, and fixes its behavior when comparing values for which `equals` and `==` return different results, like 0.0/-0.0 and NaN/NaN.

    For example, in certain cases where both 0.0 and -0.0 are provided as keys to the set, lookups of one or the other key may return the [wrong position][wrong] in the set. This leads to the bug described in SPARK-45599 and summarized in [this comment][1].

    [wrong]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283
    [1]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954

    ### Does this PR introduce _any_ user-facing change?

    Yes, it resolves the bug described in SPARK-45599. `OpenHashSet` is used widely under the hood, including in:

    - `OpenHashMap`, which itself backs:
      - `TypedImperativeAggregate`
      - aggregate functions like `percentile` and `mode`
      - many algorithms in ML and MLlib
    - `SQLOpenHashSet`, which backs array functions like `array_union` and `array_distinct`

    However, the user-facing changes should be limited to the kind of edge case described in SPARK-45599.

    ### How was this patch tested?

    New and existing unit tests.
    Of the new tests added in this PR, some simply validate that we have not changed existing SQL semantics, while others confirm that we have fixed the specific bug reported in SPARK-45599, along with any related incorrect behavior.

    New tests failing on `master` but passing on this branch:

    - [Handling 0.0 and -0.0 in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R273)
    - [Handling NaN in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R302)
    - [Handling 0.0 and -0.0 in `OpenHashMap`](https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R253)
    - [Handling 0.0 and -0.0 when computing percentile](https://github.com/apache/spark/pull/45036/files#diff-bd3d5c79ede5675f4bf10d2efb313db893d57443d6d6d67b1f8766e6ce741271R1092)

    New tests passing on both `master` and this branch:

    - [Handling 0.0, -0.0, and NaN in `array_union`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R793)
    - [Handling 0.0, -0.0, and NaN in `array_distinct`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R801)
    - [Handling 0.0, -0.0, and NaN in `GROUP BY`](https://github.com/apache/spark/pull/45036/files#diff-496edb8b03201f078c3772ca81f7c7f80002acc11dff00b1d06d288b87855264R1107)
    - [Normalizing -0 and -0.0](https://github.com/apache/spark/pull/45036/files#diff-4bdd04d06a2d88049dd5c8a67715c5566dd68a1c4ebffc689dc74b6b2e0b3b04R782)

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #45036 from nchammas/SPARK-45599-plus-and-minus-zero.
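As background for the change (an illustrative sketch, not part of the patch): Scala's `==` on doubles follows IEEE 754 comparison, while `java.lang.Double.equals` compares bit patterns, which is the behavior a hash table needs. The explicit boxing to `java.lang.Double` below is only there to reach `equals` from Scala.

```scala
// Illustrative only: how cooperative equality (==) and object equality
// (equals) disagree on the exact values SPARK-45599 is about.
object EqualitySemanticsDemo extends App {
  // IEEE 754 comparison: 0.0 equals -0.0, and NaN never equals itself.
  assert(0.0 == -0.0)
  assert(Double.NaN != Double.NaN)

  // Object equality via java.lang.Double: 0.0 and -0.0 have different
  // bit patterns and are NOT equal, while NaN IS equal to itself.
  assert(!(0.0: java.lang.Double).equals(-0.0))
  assert((Double.NaN: java.lang.Double).equals(Double.NaN))
}
```

This is why the patch's `keyExistsAtPos` compares keys with `equals` and suppresses the `other-non-cooperative-equals` compiler warning.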
    Authored-by: Nicholas Chammas <nicholas.cham...@gmail.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit fcc5dbc9b6c8a8e16dc2e0854f3eebc8758a5826)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
---
 .../apache/spark/util/collection/OpenHashSet.scala | 16 +++++++--
 .../spark/util/collection/OpenHashMapSuite.scala   | 30 +++++++++++++++++
 .../spark/util/collection/OpenHashSetSuite.scala   | 39 ++++++++++++++++++++++
 .../sql-tests/analyzer-results/ansi/array.sql.out  | 14 ++++++++
 .../analyzer-results/ansi/literals.sql.out         |  7 ++++
 .../sql-tests/analyzer-results/array.sql.out       | 14 ++++++++
 .../sql-tests/analyzer-results/group-by.sql.out    | 19 +++++++++++
 .../sql-tests/analyzer-results/literals.sql.out    |  7 ++++
 .../src/test/resources/sql-tests/inputs/array.sql  |  4 +++
 .../test/resources/sql-tests/inputs/group-by.sql   | 15 +++++++++
 .../test/resources/sql-tests/inputs/literals.sql   |  3 ++
 .../resources/sql-tests/results/ansi/array.sql.out | 16 +++++++++
 .../sql-tests/results/ansi/literals.sql.out        |  8 +++++
 .../test/resources/sql-tests/results/array.sql.out | 16 +++++++++
 .../resources/sql-tests/results/group-by.sql.out   | 22 ++++++++++++
 .../resources/sql-tests/results/literals.sql.out   |  8 +++++
 .../apache/spark/sql/DataFrameAggregateSuite.scala | 33 ++++++++++++++++++
 17 files changed, 268 insertions(+), 3 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala b/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
index 6815e47a198d..435cf1a03cbc 100644
--- a/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
+++ b/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
@@ -126,6 +126,17 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     this
   }
 
+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.
+   *
+   * See: https://issues.apache.org/jira/browse/SPARK-45599
+   */
+  @annotation.nowarn("cat=other-non-cooperative-equals")
+  private def keyExistsAtPos(k: T, pos: Int) =
+    _data(pos) equals k
+
   /**
    * Add an element to the set. This one differs from add in that it doesn't trigger rehashing.
    * The caller is responsible for calling rehashIfNeeded.
@@ -146,8 +157,7 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
         _bitset.set(pos)
         _size += 1
         return pos | NONEXISTENCE_MASK
-      } else if (_data(pos) == k) {
-        // Found an existing key.
+      } else if (keyExistsAtPos(k, pos)) {
         return pos
       } else {
         // quadratic probing with values increase by 1, 2, 3, ...
@@ -181,7 +191,7 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     while (true) {
       if (!_bitset.get(pos)) {
         return INVALID_POS
-      } else if (k == _data(pos)) {
+      } else if (keyExistsAtPos(k, pos)) {
         return pos
       } else {
         // quadratic probing with values increase by 1, 2, 3, ...
diff --git a/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala b/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
index 1af99e9017c9..f7b026ab565f 100644
--- a/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
@@ -249,4 +249,34 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly; NaNs should count together") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+
+    val map1 = new OpenHashMap[Double, Int]()
+    spark45599Repro.foreach(map1.changeValue(_, 1, {_ + 1}))
+    assert(map1(0.0) == 1)
+    assert(map1(-0.0) == 1)
+    assert(map1(Double.NaN) == 3)
+
+    val map2 = new OpenHashMap[Double, Int]()
+    // Simply changing the order in which the elements are added to the map should not change the
+    // counts for 0.0 and -0.0.
+    spark45599Repro.reverse.foreach(map2.changeValue(_, 1, {_ + 1}))
+    assert(map2(0.0) == 1)
+    assert(map2(-0.0) == 1)
+    assert(map2(Double.NaN) == 3)
+  }
 }
diff --git a/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala b/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
index 89a308556d5d..0bc8aa067f57 100644
--- a/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
@@ -269,4 +269,43 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
+    // Therefore, 0.0 and -0.0 should get separate entries in the hash set.
+    //
+    // Exactly these elements provided in roughly this order will trigger the following scenario:
+    // When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0.
+    // In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because
+    // -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position
+    // of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return
+    // the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to
+    // return the wrong value for a key based on whether or not this bitset lookup collision
+    // happens.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+    val set = new OpenHashSet[Double]()
+    spark45599Repro.foreach(set.add)
+    assert(set.size == 6)
+    val zeroPos = set.getPos(0.0)
+    val negZeroPos = set.getPos(-0.0)
+    assert(zeroPos != negZeroPos)
+  }
+
+  test("SPARK-45599: NaN and NaN are the same but not equal") {
+    // Any mathematical comparison to NaN will return false, but when we place it in
+    // a hash set we want the lookup to work like a "normal" value.
+    val set = new OpenHashSet[Double]()
+    set.add(Double.NaN)
+    set.add(Double.NaN)
+    assert(set.contains(Double.NaN))
+    assert(set.size == 1)
+  }
 }
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out
index 6fc308157933..3b196ea93e40 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out
@@ -747,3 +747,17 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 -- !query analysis
 Project [array_prepend(array(cast(null as string)), cast(null as string)) AS array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING))#x]
 +- OneRowRelation
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query analysis
+Project [array_union(array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double)), array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double))) AS array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN))#x]
++- OneRowRelation
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query analysis
+Project [array_distinct(array(cast(0.0 as double), cast(0.0 as double), cast(0.0 as double), cast(NaN as double), cast(NaN as double))) AS array_distinct(array(0.0, 0.0, 0.0, NaN, NaN))#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out
index 53c7327c5871..001dd4d64487 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out
@@ -692,3 +692,10 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query analysis
+Project [0 AS 0#x, 0.0 AS 0.0#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out
index e0585b77cb6b..ca6c89bfadc3 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out
@@ -747,3 +747,17 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 -- !query analysis
 Project [array_prepend(array(cast(null as string)), cast(null as string)) AS array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING))#x]
 +- OneRowRelation
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query analysis
+Project [array_union(array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double)), array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double))) AS array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN))#x]
++- OneRowRelation
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query analysis
+Project [array_distinct(array(cast(0.0 as double), cast(0.0 as double), cast(0.0 as double), cast(NaN as double), cast(NaN as double))) AS array_distinct(array(0.0, 0.0, 0.0, NaN, NaN))#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out
index 202ceee18046..93c463575dc1 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out
@@ -1196,3 +1196,22 @@ Aggregate [c#x], [(c#x * 2) AS d#x]
    +- Project [if ((a#x < 0)) 0 else a#x AS b#x]
       +- SubqueryAlias t1
          +- LocalRelation [a#x]
+
+
+-- !query
+SELECT col1, count(*) AS cnt
+FROM VALUES
+  (0.0),
+  (-0.0),
+  (double('NaN')),
+  (double('NaN')),
+  (double('Infinity')),
+  (double('Infinity')),
+  (-double('Infinity')),
+  (-double('Infinity'))
+GROUP BY col1
+ORDER BY col1
+-- !query analysis
+Sort [col1#x ASC NULLS FIRST], true
++- Aggregate [col1#x], [col1#x, count(1) AS cnt#xL]
+   +- LocalRelation [col1#x]
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out
index 53c7327c5871..001dd4d64487 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out
@@ -692,3 +692,10 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query analysis
+Project [0 AS 0#x, 0.0 AS 0.0#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/inputs/array.sql b/sql/core/src/test/resources/sql-tests/inputs/array.sql
index 52a0906ea739..865dc8bac4ea 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/array.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/array.sql
@@ -177,3 +177,7 @@ select array_prepend(CAST(null AS ARRAY<String>), CAST(null as String));
 select array_prepend(array(), 1);
 select array_prepend(CAST(array() AS ARRAY<String>), CAST(NULL AS String));
 select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String));
+
+-- SPARK-45599: Confirm 0.0, -0.0, and NaN are handled appropriately.
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")));
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")));
diff --git a/sql/core/src/test/resources/sql-tests/inputs/group-by.sql b/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
index c35cdb0de271..ce1b422de319 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
@@ -264,3 +264,18 @@ FROM (
   GROUP BY b
 ) t3
 GROUP BY c;
+
+-- SPARK-45599: Check that "weird" doubles group and sort as desired.
+SELECT col1, count(*) AS cnt
+FROM VALUES
+  (0.0),
+  (-0.0),
+  (double('NaN')),
+  (double('NaN')),
+  (double('Infinity')),
+  (double('Infinity')),
+  (-double('Infinity')),
+  (-double('Infinity'))
+GROUP BY col1
+ORDER BY col1
+;
diff --git a/sql/core/src/test/resources/sql-tests/inputs/literals.sql b/sql/core/src/test/resources/sql-tests/inputs/literals.sql
index 9f0eefc16a8c..e1e4a370bffd 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/literals.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/literals.sql
@@ -118,3 +118,6 @@ select +X'1';
 select -date '1999-01-01';
 select -timestamp '1999-01-01';
 select -x'2379ACFe';
+
+-- normalize -0 and -0.0
+select -0, -0.0;
diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out
index 49e18411ffa3..6a07d659e39b 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out
@@ -907,3 +907,19 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 struct<array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING)):array<string>>
 -- !query output
 [null,null]
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query schema
+struct<array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query schema
+struct<array_distinct(array(0.0, 0.0, 0.0, NaN, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out
index 85bcc2713ff5..452580e4f3c3 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out
@@ -770,3 +770,11 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query schema
+struct<0:int,0.0:decimal(1,1)>
+-- !query output
+0	0.0
diff --git a/sql/core/src/test/resources/sql-tests/results/array.sql.out b/sql/core/src/test/resources/sql-tests/results/array.sql.out
index e568f5fa7796..d33fc62f0d9a 100644
--- a/sql/core/src/test/resources/sql-tests/results/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/array.sql.out
@@ -788,3 +788,19 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 struct<array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING)):array<string>>
 -- !query output
 [null,null]
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query schema
+struct<array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query schema
+struct<array_distinct(array(0.0, 0.0, 0.0, NaN, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
diff --git a/sql/core/src/test/resources/sql-tests/results/group-by.sql.out b/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
index db79646fe435..548917ef79b2 100644
--- a/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
@@ -1121,3 +1121,25 @@ struct<d:int>
 -- !query output
 0
 2
+
+
+-- !query
+SELECT col1, count(*) AS cnt
+FROM VALUES
+  (0.0),
+  (-0.0),
+  (double('NaN')),
+  (double('NaN')),
+  (double('Infinity')),
+  (double('Infinity')),
+  (-double('Infinity')),
+  (-double('Infinity'))
+GROUP BY col1
+ORDER BY col1
+-- !query schema
+struct<col1:double,cnt:bigint>
+-- !query output
+-Infinity	2
+0.0	2
+Infinity	2
+NaN	2
diff --git a/sql/core/src/test/resources/sql-tests/results/literals.sql.out b/sql/core/src/test/resources/sql-tests/results/literals.sql.out
index 85bcc2713ff5..452580e4f3c3 100644
--- a/sql/core/src/test/resources/sql-tests/results/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/literals.sql.out
@@ -770,3 +770,11 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query schema
+struct<0:int,0.0:decimal(1,1)>
+-- !query output
+0	0.0
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
index 631fcd8c0d87..1ba3f6c84d0a 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
@@ -1068,6 +1068,39 @@ class DataFrameAggregateSuite extends QueryTest
     )
   }
 
+  test("SPARK-45599: Neither 0.0 nor -0.0 should be dropped when computing percentile") {
+    // To reproduce the bug described in SPARK-45599, we need exactly these rows in roughly
+    // this order in a DataFrame with exactly 1 partition.
+    // scalastyle:off line.size.limit
+    // See: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17806954
+    // scalastyle:on line.size.limit
+    val spark45599Repro: DataFrame = Seq(
+      0.0,
+      2.0,
+      153.0,
+      168.0,
+      3252411229536261.0,
+      7.205759403792794e+16,
+      1.7976931348623157e+308,
+      0.25,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      -128.0,
+      Double.NaN,
+      Double.NaN
+    ).toDF("val").coalesce(1)
+
+    checkAnswer(
+      spark45599Repro.agg(
+        percentile(col("val"), lit(0.1))
+      ),
+      // With the buggy implementation of OpenHashSet, this returns `0.050000000000000044`
+      // instead of `-0.0`.
+      List(Row(-0.0))
+    )
+  }
+
   test("any_value") {
     checkAnswer(
       courseSales.groupBy("course").agg(

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org