asugranyes commented on PR #53695:
URL: https://github.com/apache/spark/pull/53695#issuecomment-4510417671

   > This is an interesting proposal and is directly related to the work I did
   > in https://github.com/apache/spark/pull/45036. Please take a look.
   >
   > There are special cases where values can be equal and distinct (-0.0/0.0),
   > or unequal but not distinct (NaN/NaN).
   >
   > I agree that Spark should handle -0.0 consistently. However, I would
   > consider it a safer change to ensure that -0.0 is always preserved, rather
   > than always normalizing it to 0.0. This does not contradict IEEE 754.
   
   @nchammas Thanks for the detailed context and references here.
   
   After looking through #45036, my understanding is that the two changes 
address different layers.
   
   The earlier OpenHashSet PR was focused on making the generic container 
semantics consistent with equals/hashCode, while this PR is intentionally 
scoped to SQL array set-like operations (array_distinct, array_union, 
array_intersect, array_except).
   
   As @cloud-fan noted in the discussion of #45036, what matters for these 
operators is SQL semantics rather than alignment with java.util.HashSet. In 
Spark SQL, 0.0 and -0.0 are already treated as equal for grouping and are 
normalized to 0.0. This PR makes the hash-based array operations consistent 
with those existing SQL semantics.
   
   Preserving -0.0 here would keep these array set-like operations inconsistent 
with existing Spark SQL semantics, since equivalent scalar operations 
(DISTINCT, GROUP BY) already normalize it to 0.0.
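
To make the difference concrete, here is a rough standalone Java sketch. It is not the code in this PR: the class `ArrayDistinctSketch` and its methods are hypothetical names, and a `LinkedHashSet<Double>` stands in for the real hash set. It contrasts deduplication under `java.util` set semantics with deduplication after normalizing -0.0 to 0.0.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ArrayDistinctSketch {
    // Hypothetical helper: maps -0.0 to 0.0 and leaves every other value unchanged.
    public static double normalizeZero(double d) {
        return d == 0.0d ? 0.0d : d;
    }

    // java.util set semantics: Double.equals distinguishes 0.0 from -0.0,
    // so both values survive deduplication.
    public static List<Double> distinctBoxed(double[] values) {
        Set<Double> seen = new LinkedHashSet<>();
        for (double v : values) {
            seen.add(v);
        }
        return new ArrayList<>(seen);
    }

    // Normalizing before insertion: 0.0 and -0.0 collapse into a single 0.0.
    public static List<Double> distinctNormalized(double[] values) {
        Set<Double> seen = new LinkedHashSet<>();
        for (double v : values) {
            seen.add(normalizeZero(v));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        double[] input = {0.0d, -0.0d, 1.0d};
        System.out.println(distinctBoxed(input));      // [0.0, -0.0, 1.0]
        System.out.println(distinctNormalized(input)); // [0.0, 1.0]
    }
}
```

The second result is the behavior that matches the existing grouping semantics described above; the first is what preserving -0.0 would look like.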
   
   That is also why the normalization happens before hashing: the issue is not 
only the equality comparison itself, but also that 0.0 and -0.0 produce 
different hash codes and therefore follow different probing paths inside the 
hash set.
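
The hashing point can be checked in isolation. The sketch below is only an illustration under stated assumptions: Java's `Double.hashCode` stands in for whatever hash function the set actually uses, and `SignedZeroHashing` and `normalizeZero` are hypothetical names, not identifiers from this PR.

```java
public class SignedZeroHashing {
    // Hypothetical helper: maps -0.0 to 0.0 and leaves every other value
    // (including NaN) unchanged.
    public static double normalizeZero(double d) {
        return d == 0.0d ? 0.0d : d;
    }

    public static void main(String[] args) {
        double pos = 0.0d;
        double neg = -0.0d;

        // Primitive comparison follows IEEE 754: the two zeros are equal.
        System.out.println(pos == neg);               // true

        // The bit patterns differ in the sign bit, so the hash codes differ.
        System.out.println(Double.hashCode(pos));     // 0
        System.out.println(Double.hashCode(neg));     // -2147483648

        // After normalization both values hash identically, so a lookup
        // starts from the same slot.
        System.out.println(
            Double.hashCode(normalizeZero(neg)) == Double.hashCode(pos)); // true
    }
}
```

Fixing only the equality check would not be enough in this model: a lookup for -0.0 would start probing from a different slot and might never reach the stored 0.0.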
   
   The scope here is narrow and modular: only the SQL array set-like operations 
are affected, without changing the generic OpenHashSet semantics.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

