[https://issues.apache.org/jira/browse/SPARK-51475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956143#comment-17956143]

Robert Joseph Evans commented on SPARK-51475:
---------------------------------------------

[~doki] you did reproduce the bug. You didn't reproduce the good case.  {{0.0}} 
and {{-0.0}} are supposed to be equal. 

[https://en.wikipedia.org/wiki/IEEE_754#Comparison_predicates]
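To illustrate the IEEE 754 rule the link describes, here is a small plain-Python sketch (not Spark code): the two zeros compare equal under the standard comparison predicates even though their underlying bit patterns differ.

```python
import struct

# IEEE 754 comparison predicates define +0.0 and -0.0 as equal:
print(0.0 == -0.0)  # True

# ...even though the raw 64-bit patterns of the two doubles differ:
print(struct.pack('>d', 0.0).hex())   # 0000000000000000
print(struct.pack('>d', -0.0).hex())  # 8000000000000000 (sign bit set)
```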
{code:java}
>>> pyspark.__version__
'3.4.2'
>>> spark.createDataFrame([([0.0, 6.0, -0.0],)], ['values']).createOrReplaceTempView("tab")
>>> spark.sql("select array_distinct(values) from tab").show()
+----------------------+                                                        
|array_distinct(values)|
+----------------------+
|      [0.0, 6.0, -0.0]|
+----------------------+

>>> spark.createDataFrame([([[0.0], [6.0], [-0.0]],)], ['values']).createOrReplaceTempView("tab")
>>> spark.sql("select array_distinct(values) from tab").show()
+----------------------+
|array_distinct(values)|
+----------------------+
|        [[0.0], [6.0]]|
+----------------------+{code}
So why is it that {{0.0}} and {{-0.0}} are not equal for {{array_distinct}} at the
top level, but {{[0.0]}} and {{[-0.0]}} are when nested?

Especially when the equals operator defines them as equal:
{code:java}
>>> spark.createDataFrame([(0.0, -0.0)], ['a','b']).createOrReplaceTempView("tab")
>>> spark.sql("select a, b, a = b from tab").show()
+---+----+-------+
|  a|   b|(a = b)|
+---+----+-------+
|0.0|-0.0|   true|
+---+----+-------+ {code}
As does a regular distinct operation:
{code:java}
>>> spark.createDataFrame([(0.0,), (-0.0,),], ['a']).createOrReplaceTempView("tab")
>>> spark.sql("select COUNT(a), COUNT(DISTINCT a) from tab").show()
+--------+-----------------+
|count(a)|count(DISTINCT a)|
+--------+-----------------+
|       2|                1|
+--------+-----------------+ {code}
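One plausible explanation for the discrepancy, sketched below in plain Python (this is a hypothesis about the mechanism, not Spark's actual OpenHashSet code, which is Scala): if a dedup path hashes and compares doubles by their raw IEEE 754 bit pattern instead of by value, {{0.0}} and {{-0.0}} stay distinct, while a value-based path collapses them.

```python
import struct

def bit_key(x: float) -> int:
    # Hypothetical helper: raw IEEE 754 bit pattern of a double as a 64-bit int
    return struct.unpack('>q', struct.pack('>d', x))[0]

values = [0.0, -0.0, 6.0]

# Value-based dedup (what = and DISTINCT do): 0.0 and -0.0 collapse
print(len(set(values)))                   # 2

# Bit-pattern-based dedup: 0.0 and -0.0 remain distinct
print(len({bit_key(v) for v in values}))  # 3
```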

> ArrayDistinct Producing Inconsistent Behavior For -0.0 and +0.0
> ---------------------------------------------------------------
>
>                 Key: SPARK-51475
>                 URL: https://issues.apache.org/jira/browse/SPARK-51475
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0, 3.4.4, 3.5.5
>            Reporter: Warrick He
>            Priority: Major
>              Labels: correctness
>
> This impacts array_distinct. This was tested on Spark versions 3.5.5, 3.5.0, 
> and 3.4.4, but it likely affects all versions.
> Problem: inconsistent behavior for 0.0 and -0.0. See below (ran on 3.5.5).
> I'm not sure what the desired behavior is: does Spark want to follow the IEEE
> standard and treat them as equal, keeping only one of -0.0 or 0.0, or should it
> consider them distinct?
> {quote}>>> spark.createDataFrame([([0.0, 6.0 -0.0],)], ['values']).createOrReplaceTempView("tab")
> >>> spark.sql("select array_distinct(values) from tab").show()
> +----------------------+
> |array_distinct(values)|
> +----------------------+
> |            [0.0, 6.0]|
> +----------------------+
>  
> >>> spark.createDataFrame([([0.0, -0.0, 6.0],)], ['values']).createOrReplaceTempView("tab")
> >>> spark.sql("select array_distinct(values) from tab").show()
> +----------------------+
> |array_distinct(values)|
> +----------------------+
> |      [0.0, -0.0, 6.0]|
> +----------------------+
> {quote}
> This issue could be related to the implementation of OpenHashSet.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)