Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

via GitHub Mon, 12 Feb 2024 14:57:59 -0800


nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1486895251



##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers 
{
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {

Review Comment:
   > Spark's `OpenHashSet` does not have to match `java.util.HashSet`. What 
matters is the SQL semantic.
   
   Whether or not `OpenHashSet` matches `java.util.HashSet`, I want to 
emphasize for the record that `OpenHashSet` mishandles 0.0/-0.0 and NaN. Its 
behavior is simply _incorrect_. [These][test1] [tests][test2] fail on `master` 
in ways that can only be described as bugs, regardless of whatever SQL 
semantics we want to preserve.
   
   [This comment][why] explains the root cause. Basically, it is a mistake to 
combine hash code-based lookups with cooperative equality, at least in the way 
we are doing it in `OpenHashSet`.
   
   [test1]: 
https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R274-R278
   [test2]: 
https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R308-R309
   [why]: 
https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283
   
   But I understand what you are saying. Fixing bugs in `OpenHashSet` doesn't 
help us if it also breaks users' SQL.
   
   > Can you highlight which functions/operators are using this `OpenHashSet` 
   
   I've updated the PR description with a summary of what uses `OpenHashSet`.
   
   As a side note, I believe that if we accept the change proposed here, we 
should be able to eliminate `SQLOpenHashSet`. `SQLOpenHashSet` was created 
specifically to work around the bugs in `OpenHashSet` that we are addressing in 
this PR. See #33955 and #33993.
   
   > and what is the impact of this change to the SQL semantic?
   
   I've updated the PR description with a diff of what tests pass or fail on 
`master` vs. this branch. Please take a look and let me know if you think we 
need any more tests. I know we are touching a sensitive code path and I 
appreciate the need for caution.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

Reply via email to