asugranyes commented on PR #53695: URL: https://github.com/apache/spark/pull/53695#issuecomment-4510417671
> This is an interesting proposal and is directly related to the work I did in https://github.com/apache/spark/pull/45036. Please take a look.
>
> There are special cases where values can be equal and distinct (-0.0/0.0), or unequal but not distinct (NaN/NaN).
>
> I agree that Spark should handle -0.0 consistently. However, I would consider it a safer change to ensure that -0.0 is always preserved, rather than always normalizing it to 0.0. This does not contradict IEEE 754.

@nchammas Thanks for the detailed context and references here.

After looking through #45036, my understanding is that the two changes address different layers. The earlier OpenHashSet PR focused on making the generic container semantics consistent with equals/hashCode, while this PR is intentionally scoped to the SQL array set-like operations (array_distinct, array_union, array_intersect, array_except).

As @cloud-fan noted in the discussion of #45036, what matters for these operators is the SQL semantics rather than alignment with java.util.HashSet. In Spark SQL, 0.0 and -0.0 are already considered equal in grouping semantics and are normalized to 0.0. This PR makes the hash-based array operations consistent with those existing SQL semantics.

Preserving -0.0 here would keep these array set-like operations inconsistent with existing Spark SQL semantics, since the equivalent scalar operations (DISTINCT, GROUP BY) already normalize it to 0.0.

That is also why the normalization happens before hashing: the issue is not only the equality comparison itself, but also that 0.0 and -0.0 produce different hash codes and therefore follow different probing paths inside the hash set.

The scope here is narrow and modular: only the SQL array set-like operations are affected, without changing the generic OpenHashSet semantics.
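To illustrate the point about normalizing before hashing, here is a minimal Python sketch. It is a model of the behavior, not Spark's code: the names `bits`, `normalize`, and `distinct_by_bits` are hypothetical, and the set keyed on raw IEEE 754 bit patterns is only an assumed stand-in for a hash set whose hash is derived from those bits.

```python
import struct


def bits(x: float) -> int:
    """Return the raw IEEE 754 bit pattern of a double as an unsigned 64-bit int."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def normalize(x: float) -> float:
    """Map -0.0 to 0.0 and leave every other value unchanged (hypothetical helper)."""
    # The comparison is true for both 0.0 and -0.0, so both collapse to 0.0.
    return 0.0 if x == 0.0 else x


def distinct_by_bits(values, normalize_first):
    """Deduplicate using a set keyed on bit patterns.

    This models a hash set that hashes the bit pattern: two values that
    compare equal but have different bits land in different slots.
    """
    seen = set()
    out = []
    for v in values:
        if normalize_first:
            v = normalize(v)
        key = bits(v)
        if key not in seen:
            seen.add(key)
            out.append(v)
    return out


# 0.0 and -0.0 compare equal, yet their bit patterns differ (sign bit).
without = distinct_by_bits([0.0, -0.0, 1.5], normalize_first=False)
with_norm = distinct_by_bits([0.0, -0.0, 1.5], normalize_first=True)
```

Under these assumptions, `without` keeps both zeros because they are keyed separately, while `with_norm` keeps a single 0.0. Fixing only the equality check would not be enough in this model, because the two zeros would never be compared against each other in the first place.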
