nchammas commented on code in PR #45036: URL: https://github.com/apache/spark/pull/45036#discussion_r1486895251
########## core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala: ########## @@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers { assert(pos1 == pos2) } } + + test("SPARK-45599: 0.0 and -0.0 are equal but not the same") { Review Comment: > Spark's `OpenHashSet` does not have to match `java.util.HashSet`. What matters is the SQL semantic. Whether or not `OpenHashSet` matches `java.util.HashSet`, I want to emphasize for the record that `OpenHashSet` mishandles 0.0/-0.0 and NaN. Its behavior is simply _incorrect_. [These][test1] [tests][test2] fail on `master` in ways that can only be described as bugs, regardless of whatever SQL semantics we want to preserve. [This comment][why] explains the root cause. Basically, it is a mistake to combine hash code-based lookups with cooperative equality, at least in the way we are doing it in `OpenHashSet`. [test1]: https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R274-R278 [test2]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R308-R309 [why]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283 But I understand what you are saying. Fixing bugs in `OpenHashSet` doesn't help us if it also breaks users' SQL. > Can you highlight which functions/operators are using this `OpenHashSet` I've updated the PR description with a summary of what uses `OpenHashSet`. As a side note, I believe that if we accept the change proposed here, we should be able to eliminate `SQLOpenHashSet`. `SQLOpenHashSet` was created specifically to work around the bugs in `OpenHashSet` that we are addressing in this PR. See #33955 and #33993. > and what is the impact of this change to the SQL semantic? I've updated the PR description with a diff of what tests pass or fail on `master` vs. this branch. Please take a look and let me know if you think we need any more tests. I know we are touching a sensitive code path and I appreciate the need for caution. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org