This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new fcc5dbc9b6c8 [SPARK-45599][CORE] Use object equality in OpenHashSet
fcc5dbc9b6c8 is described below

commit fcc5dbc9b6c8a8e16dc2e0854f3eebc8758a5826
Author: Nicholas Chammas <nicholas.cham...@gmail.com>
AuthorDate: Mon Feb 26 16:03:30 2024 +0800

    [SPARK-45599][CORE] Use object equality in OpenHashSet

### What changes were proposed in this pull request?

Change `OpenHashSet` to use object equality instead of cooperative equality when looking up keys.

### Why are the changes needed?

This brings the behavior of `OpenHashSet` more in line with the semantics of `java.util.HashSet`, and fixes its behavior when comparing values for which `equals` and `==` return different results, like 0.0/-0.0 and NaN/NaN.

For example, in certain cases where both 0.0 and -0.0 are provided as keys to the set, lookups of one or the other key may return the [wrong position][wrong] in the set. This leads to the bug described in SPARK-45599 and summarized in [this comment][1].

[wrong]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283
[1]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954

### Does this PR introduce _any_ user-facing change?

Yes, it resolves the bug described in SPARK-45599. `OpenHashSet` is used widely under the hood, including in:

- `OpenHashMap`, which itself backs:
  - `TypedImperativeAggregate`
  - aggregate functions like `percentile` and `mode`
  - many algorithms in ML and MLlib
- `SQLOpenHashSet`, which backs array functions like `array_union` and `array_distinct`

However, the user-facing changes should be limited to the kind of edge case described in SPARK-45599.

### How was this patch tested?

New and existing unit tests.
Of the new tests added in this PR, some simply validate that we have not changed existing SQL semantics, while others confirm that we have fixed the specific bug reported in SPARK-45599 along with any related incorrect behavior.

New tests failing on `master` but passing on this branch:
- [Handling 0.0 and -0.0 in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R273)
- [Handling NaN in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R302)
- [Handling 0.0 and -0.0 in `OpenHashMap`](https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R253)
- [Handling 0.0 and -0.0 when computing percentile](https://github.com/apache/spark/pull/45036/files#diff-bd3d5c79ede5675f4bf10d2efb313db893d57443d6d6d67b1f8766e6ce741271R1092)

New tests passing both on `master` and this branch:
- [Handling 0.0, -0.0, and NaN in `array_union`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R793)
- [Handling 0.0, -0.0, and NaN in `array_distinct`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R801)
- [Handling 0.0, -0.0, and NaN in `GROUP BY`](https://github.com/apache/spark/pull/45036/files#diff-496edb8b03201f078c3772ca81f7c7f80002acc11dff00b1d06d288b87855264R1107)
- [Normalizing -0 and -0.0](https://github.com/apache/spark/pull/45036/files#diff-4bdd04d06a2d88049dd5c8a67715c5566dd68a1c4ebffc689dc74b6b2e0b3b04R782)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45036 from nchammas/SPARK-45599-plus-and-minus-zero.
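The equality semantics the commit message describes can be sketched in plain Java: Scala's cooperative `==` on doubles is IEEE 754 numeric comparison, while `equals` on boxed doubles goes through `Double.doubleToLongBits`, and the two disagree on exactly the values this patch targets. The class name below is illustrative only and is not part of the patch:

```java
// Illustrates why key lookup must use equals() rather than numeric ==:
// the two comparisons disagree on 0.0 vs -0.0 and on NaN vs NaN.
public class KeyEqualityDemo {
    public static void main(String[] args) {
        // Numeric (cooperative-style) comparison treats 0.0 and -0.0 as equal...
        System.out.println(0.0 == -0.0);                                      // true
        // ...but equals() compares via doubleToLongBits, under which their
        // bit patterns differ, so they are distinct hash-set keys.
        System.out.println(Double.valueOf(0.0).equals(Double.valueOf(-0.0))); // false

        // Numeric comparison says NaN is never equal to anything, even itself...
        System.out.println(Double.NaN == Double.NaN);                         // false
        // ...but equals() canonicalizes NaN, so a set can find it again.
        System.out.println(
            Double.valueOf(Double.NaN).equals(Double.valueOf(Double.NaN)));   // true
    }
}
```

This is the same reason `java.util.HashSet<Double>` already keeps 0.0 and -0.0 as separate entries and deduplicates NaN, which is the behavior the patch brings to `OpenHashSet`.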
Authored-by: Nicholas Chammas <nicholas.cham...@gmail.com>
Signed-off-by: Wenchen Fan <wenc...@databricks.com>
---
 .../apache/spark/util/collection/OpenHashSet.scala | 16 +++++++--
 .../spark/util/collection/OpenHashMapSuite.scala   | 30 +++++++++++++++++
 .../spark/util/collection/OpenHashSetSuite.scala   | 39 ++++++++++++++++++++++
 .../sql-tests/analyzer-results/ansi/array.sql.out  | 14 ++++++++
 .../analyzer-results/ansi/literals.sql.out         |  7 ++++
 .../sql-tests/analyzer-results/array.sql.out       | 14 ++++++++
 .../sql-tests/analyzer-results/group-by.sql.out    | 19 +++++++++++
 .../sql-tests/analyzer-results/literals.sql.out    |  7 ++++
 .../src/test/resources/sql-tests/inputs/array.sql  |  4 +++
 .../test/resources/sql-tests/inputs/group-by.sql   | 15 +++++++++
 .../test/resources/sql-tests/inputs/literals.sql   |  3 ++
 .../resources/sql-tests/results/ansi/array.sql.out | 16 +++++++++
 .../sql-tests/results/ansi/literals.sql.out        |  8 +++++
 .../test/resources/sql-tests/results/array.sql.out | 16 +++++++++
 .../resources/sql-tests/results/group-by.sql.out   | 22 ++++++++++++
 .../resources/sql-tests/results/literals.sql.out   |  8 +++++
 .../apache/spark/sql/DataFrameAggregateSuite.scala | 33 ++++++++++++++++++
 17 files changed, 268 insertions(+), 3 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala b/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
index faee9ce56a0a..a42fa9ba6bc8 100644
--- a/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
+++ b/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
@@ -110,6 +110,17 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     this
   }
 
+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.
+   *
+   * See: https://issues.apache.org/jira/browse/SPARK-45599
+   */
+  @annotation.nowarn("cat=other-non-cooperative-equals")
+  private def keyExistsAtPos(k: T, pos: Int) =
+    _data(pos) equals k
+
   /**
    * Add an element to the set. This one differs from add in that it doesn't trigger rehashing.
    * The caller is responsible for calling rehashIfNeeded.
@@ -130,8 +141,7 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
         _bitset.set(pos)
         _size += 1
         return pos | NONEXISTENCE_MASK
-      } else if (_data(pos) == k) {
-        // Found an existing key.
+      } else if (keyExistsAtPos(k, pos)) {
         return pos
       } else {
         // quadratic probing with values increase by 1, 2, 3, ...
@@ -165,7 +175,7 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     while (true) {
       if (!_bitset.get(pos)) {
         return INVALID_POS
-      } else if (k == _data(pos)) {
+      } else if (keyExistsAtPos(k, pos)) {
         return pos
       } else {
         // quadratic probing with values increase by 1, 2, 3, ...
diff --git a/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala b/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
index 155c855c8723..ae9fb54bddfb 100644
--- a/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
@@ -249,4 +249,34 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly; NaNs should count together") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+
+    val map1 = new OpenHashMap[Double, Int]()
+    spark45599Repro.foreach(map1.changeValue(_, 1, {_ + 1}))
+    assert(map1(0.0) == 1)
+    assert(map1(-0.0) == 1)
+    assert(map1(Double.NaN) == 3)
+
+    val map2 = new OpenHashMap[Double, Int]()
+    // Simply changing the order in which the elements are added to the map should not change the
+    // counts for 0.0 and -0.0.
+    spark45599Repro.reverse.foreach(map2.changeValue(_, 1, {_ + 1}))
+    assert(map2(0.0) == 1)
+    assert(map2(-0.0) == 1)
+    assert(map2(Double.NaN) == 3)
+  }
 }
diff --git a/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala b/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
index 89a308556d5d..0bc8aa067f57 100644
--- a/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
@@ -269,4 +269,43 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
+    // Therefore, 0.0 and -0.0 should get separate entries in the hash set.
+    //
+    // Exactly these elements provided in roughly this order will trigger the following scenario:
+    // When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0.
+    // In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because
+    // -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position
+    // of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return
+    // the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to
+    // return the wrong value for a key based on whether or not this bitset lookup collision
+    // happens.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+    val set = new OpenHashSet[Double]()
+    spark45599Repro.foreach(set.add)
+    assert(set.size == 6)
+    val zeroPos = set.getPos(0.0)
+    val negZeroPos = set.getPos(-0.0)
+    assert(zeroPos != negZeroPos)
+  }
+
+  test("SPARK-45599: NaN and NaN are the same but not equal") {
+    // Any mathematical comparison to NaN will return false, but when we place it in
+    // a hash set we want the lookup to work like a "normal" value.
+    val set = new OpenHashSet[Double]()
+    set.add(Double.NaN)
+    set.add(Double.NaN)
+    assert(set.contains(Double.NaN))
+    assert(set.size == 1)
+  }
 }
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out
index 1ab4b03dcb26..2998803698c3 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/array.sql.out
@@ -747,3 +747,17 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 -- !query analysis
 Project [array_prepend(array(cast(null as string)), cast(null as string)) AS array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING))#x]
 +- OneRowRelation
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query analysis
+Project [array_union(array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double)), array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double))) AS array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN))#x]
++- OneRowRelation
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query analysis
+Project [array_distinct(array(cast(0.0 as double), cast(0.0 as double), cast(0.0 as double), cast(NaN as double), cast(NaN as double))) AS array_distinct(array(0.0, 0.0, 0.0, NaN, NaN))#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out
index d0e26f698fba..570cfb73444e 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/literals.sql.out
@@ -699,3 +699,10 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query analysis
+Project [0 AS 0#x, 0.0 AS 0.0#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out
index 67c9bb8d992c..459c5613e919 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/array.sql.out
@@ -747,3 +747,17 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 -- !query analysis
 Project [array_prepend(array(cast(null as string)), cast(null as string)) AS array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING))#x]
 +- OneRowRelation
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query analysis
+Project [array_union(array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double)), array(cast(0.0 as double), cast(0.0 as double), cast(NaN as double))) AS array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN))#x]
++- OneRowRelation
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query analysis
+Project [array_distinct(array(cast(0.0 as double), cast(0.0 as double), cast(0.0 as double), cast(NaN as double), cast(NaN as double))) AS array_distinct(array(0.0, 0.0, 0.0, NaN, NaN))#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out
index 229d0cb6faf2..324a0a366d85 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/group-by.sql.out
@@ -1171,3 +1171,22 @@ Aggregate [c#x], [(c#x * 2) AS d#x]
 +- Project [if ((a#x < 0)) 0 else a#x AS b#x]
    +- SubqueryAlias t1
       +- LocalRelation [a#x]
+
+
+-- !query
+SELECT col1, count(*) AS cnt
+FROM VALUES
+  (0.0),
+  (-0.0),
+  (double('NaN')),
+  (double('NaN')),
+  (double('Infinity')),
+  (double('Infinity')),
+  (-double('Infinity')),
+  (-double('Infinity'))
+GROUP BY col1
+ORDER BY col1
+-- !query analysis
+Sort [col1#x ASC NULLS FIRST], true
++- Aggregate [col1#x], [col1#x, count(1) AS cnt#xL]
+   +- LocalRelation [col1#x]
diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out
index d0e26f698fba..570cfb73444e 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/literals.sql.out
@@ -699,3 +699,10 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query analysis
+Project [0 AS 0#x, 0.0 AS 0.0#x]
++- OneRowRelation
diff --git a/sql/core/src/test/resources/sql-tests/inputs/array.sql b/sql/core/src/test/resources/sql-tests/inputs/array.sql
index 52a0906ea739..865dc8bac4ea 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/array.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/array.sql
@@ -177,3 +177,7 @@ select array_prepend(CAST(null AS ARRAY<String>), CAST(null as String));
 select array_prepend(array(), 1);
 select array_prepend(CAST(array() AS ARRAY<String>), CAST(NULL AS String));
 select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String));
+
+-- SPARK-45599: Confirm 0.0, -0.0, and NaN are handled appropriately.
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")));
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")));
diff --git a/sql/core/src/test/resources/sql-tests/inputs/group-by.sql b/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
index ea1e2f323151..7d6116ac1e3f 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
@@ -259,3 +259,18 @@ FROM (
   GROUP BY b
 ) t3
 GROUP BY c;
+
+-- SPARK-45599: Check that "weird" doubles group and sort as desired.
+SELECT col1, count(*) AS cnt
+FROM VALUES
+  (0.0),
+  (-0.0),
+  (double('NaN')),
+  (double('NaN')),
+  (double('Infinity')),
+  (double('Infinity')),
+  (-double('Infinity')),
+  (-double('Infinity'))
+GROUP BY col1
+ORDER BY col1
+;
diff --git a/sql/core/src/test/resources/sql-tests/inputs/literals.sql b/sql/core/src/test/resources/sql-tests/inputs/literals.sql
index 9f0eefc16a8c..e1e4a370bffd 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/literals.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/literals.sql
@@ -118,3 +118,6 @@ select +X'1';
 select -date '1999-01-01';
 select -timestamp '1999-01-01';
 select -x'2379ACFe';
+
+-- normalize -0 and -0.0
+select -0, -0.0;
diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out
index c0f78f45db41..d17d87900fc7 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out
@@ -907,3 +907,19 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 struct<array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING)):array<string>>
 -- !query output
 [null,null]
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query schema
+struct<array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query schema
+struct<array_distinct(array(0.0, 0.0, 0.0, NaN, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out
index b79a329b62ab..4e4c70cc333b 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out
@@ -777,3 +777,11 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query schema
+struct<0:int,0.0:decimal(1,1)>
+-- !query output
+0	0.0
diff --git a/sql/core/src/test/resources/sql-tests/results/array.sql.out b/sql/core/src/test/resources/sql-tests/results/array.sql.out
index 12635385bbb7..92da0a490ff8 100644
--- a/sql/core/src/test/resources/sql-tests/results/array.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/array.sql.out
@@ -788,3 +788,19 @@ select array_prepend(array(CAST(NULL AS String)), CAST(NULL AS String))
 struct<array_prepend(array(CAST(NULL AS STRING)), CAST(NULL AS STRING)):array<string>>
 -- !query output
 [null,null]
+
+
+-- !query
+select array_union(array(0.0, -0.0, DOUBLE("NaN")), array(0.0, -0.0, DOUBLE("NaN")))
+-- !query schema
+struct<array_union(array(0.0, 0.0, NaN), array(0.0, 0.0, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
+
+
+-- !query
+select array_distinct(array(0.0, -0.0, -0.0, DOUBLE("NaN"), DOUBLE("NaN")))
+-- !query schema
+struct<array_distinct(array(0.0, 0.0, 0.0, NaN, NaN)):array<double>>
+-- !query output
+[0.0,NaN]
diff --git a/sql/core/src/test/resources/sql-tests/results/group-by.sql.out b/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
index 3e94c62b7c78..0b7bdfd85223 100644
--- a/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
@@ -1102,3 +1102,25 @@ struct<d:int>
 -- !query output
 0
 2
+
+
+-- !query
+SELECT col1, count(*) AS cnt
+FROM VALUES
+  (0.0),
+  (-0.0),
+  (double('NaN')),
+  (double('NaN')),
+  (double('Infinity')),
+  (double('Infinity')),
+  (-double('Infinity')),
+  (-double('Infinity'))
+GROUP BY col1
+ORDER BY col1
+-- !query schema
+struct<col1:double,cnt:bigint>
+-- !query output
+-Infinity	2
+0.0	2
+Infinity	2
+NaN	2
diff --git a/sql/core/src/test/resources/sql-tests/results/literals.sql.out b/sql/core/src/test/resources/sql-tests/results/literals.sql.out
index b79a329b62ab..4e4c70cc333b 100644
--- a/sql/core/src/test/resources/sql-tests/results/literals.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/literals.sql.out
@@ -777,3 +777,11 @@ org.apache.spark.sql.catalyst.ExtendedAnalysisException
     "fragment" : "-x'2379ACFe'"
   } ]
 }
+
+
+-- !query
+select -0, -0.0
+-- !query schema
+struct<0:int,0.0:decimal(1,1)>
+-- !query output
+0	0.0
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
index 15dffb3696dd..ec589fa77241 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
@@ -1089,6 +1089,39 @@ class DataFrameAggregateSuite extends QueryTest
     )
   }
 
+  test("SPARK-45599: Neither 0.0 nor -0.0 should be dropped when computing percentile") {
+    // To reproduce the bug described in SPARK-45599, we need exactly these rows in roughly
+    // this order in a DataFrame with exactly 1 partition.
+    // scalastyle:off line.size.limit
+    // See: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17806954
+    // scalastyle:on line.size.limit
+    val spark45599Repro: DataFrame = Seq(
+      0.0,
+      2.0,
+      153.0,
+      168.0,
+      3252411229536261.0,
+      7.205759403792794e+16,
+      1.7976931348623157e+308,
+      0.25,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      -128.0,
+      Double.NaN,
+      Double.NaN
+    ).toDF("val").coalesce(1)
+
+    checkAnswer(
+      spark45599Repro.agg(
+        percentile(col("val"), lit(0.1))
+      ),
+      // With the buggy implementation of OpenHashSet, this returns `0.050000000000000044`
+      // instead of `-0.0`.
+      List(Row(-0.0))
+    )
+  }
+
   test("any_value") {
     checkAnswer(
       courseSales.groupBy("course").agg(

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org