[ https://issues.apache.org/jira/browse/SPARK-51475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956143#comment-17956143 ]
Robert Joseph Evans commented on SPARK-51475:
---------------------------------------------

[~doki] you did reproduce the bug. You didn't reproduce the good case. {{0.0}} and {{-0.0}} are supposed to be equal: [https://en.wikipedia.org/wiki/IEEE_754#Comparison_predicates]

{code:java}
>>> pyspark.__version__
'3.4.2'
>>> spark.createDataFrame([([0.0, 6.0, -0.0],)], ['values']).createOrReplaceTempView("tab")
>>> spark.sql("select array_distinct(values) from tab").show()
+----------------------+
|array_distinct(values)|
+----------------------+
|      [0.0, 6.0, -0.0]|
+----------------------+

>>> spark.createDataFrame([([[0.0], [6.0], [-0.0]],)], ['values']).createOrReplaceTempView("tab")
>>> spark.sql("select array_distinct(values) from tab").show()
+----------------------+
|array_distinct(values)|
+----------------------+
|        [[0.0], [6.0]]|
+----------------------+
{code}

So why are {{0.0}} and {{-0.0}} not equal for array_distinct, while {{[0.0]}} and {{[-0.0]}} are? Especially when the equals operator defines them as equal:

{code:java}
>>> spark.createDataFrame([(0.0, -0.0)], ['a', 'b']).createOrReplaceTempView("tab")
>>> spark.sql("select a, b, a = b from tab").show()
+---+----+-------+
|  a|   b|(a = b)|
+---+----+-------+
|0.0|-0.0|   true|
+---+----+-------+
{code}

As does a regular DISTINCT:

{code:java}
>>> spark.createDataFrame([(0.0,), (-0.0,)], ['a']).createOrReplaceTempView("tab")
>>> spark.sql("select COUNT(a), COUNT(DISTINCT a) from tab").show()
+--------+-----------------+
|count(a)|count(DISTINCT a)|
+--------+-----------------+
|       2|                1|
+--------+-----------------+
{code}

> ArrayDistinct Producing Inconsistent Behavior For -0.0 and +0.0
> ---------------------------------------------------------------
>
>                 Key: SPARK-51475
>                 URL: https://issues.apache.org/jira/browse/SPARK-51475
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0, 3.4.4, 3.5.5
>            Reporter: Warrick He
>            Priority: Major
>              Labels: correctness
>
> This impacts array_distinct. This was tested on Spark versions 3.5.5, 3.5.0, and 3.4.4, but it likely affects all versions.
> Problem: inconsistent behavior for 0.0 and -0.0. See below (run on 3.5.5).
> I'm not sure what the desired behavior is: does Spark want to follow the IEEE 754 standard and treat them as equal, keeping only one of -0.0 or 0.0, or should it consider them distinct?
> {quote}>>> spark.createDataFrame([([0.0, 6.0 -0.0],)], ['values']).createOrReplaceTempView("tab")
> >>> spark.sql("select array_distinct(values) from tab").show()
> +----------------------+
> |array_distinct(values)|
> +----------------------+
> |            [0.0, 6.0]|
> +----------------------+
>
> >>> spark.createDataFrame([([0.0, -0.0, 6.0],)], ['values']).createOrReplaceTempView("tab")
> >>> spark.sql("select array_distinct(values) from tab").show()
> +----------------------+
> |array_distinct(values)|
> +----------------------+
> |      [0.0, -0.0, 6.0]|
> +----------------------+
> {quote}
> This issue could be related to the implementation of OpenHashSet.
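A plausible mechanism for the inconsistency (speculation on my part): {{0.0}} and {{-0.0}} compare equal under IEEE 754, but their bit patterns differ in the sign bit, so a hash set that keys on the raw bits of a double, as a primitive-specialized structure like OpenHashSet may effectively do, would put the two zeros in separate buckets. Below is a minimal Python sketch of that idea; {{raw_bits}} is an illustrative helper, not Spark code:

{code:python}
import struct

def raw_bits(d: float) -> int:
    """Return the IEEE 754 bit pattern of a double as an unsigned integer."""
    return struct.unpack('>Q', struct.pack('>d', d))[0]

print(0.0 == -0.0)          # True: IEEE 754 comparison treats the zeros as equal
print(hex(raw_bits(0.0)))   # 0x0
print(hex(raw_bits(-0.0)))  # 0x8000000000000000 (only the sign bit differs)

# Keying a set on raw bits keeps both zeros, mirroring the buggy output:
print(len({raw_bits(x) for x in [0.0, 6.0, -0.0]}))  # 3

# Normalizing -0.0 to 0.0 before hashing restores the expected result:
print(len({raw_bits(0.0 if x == 0.0 else x) for x in [0.0, 6.0, -0.0]}))  # 2
{code}

If that is what's happening, normalizing {{-0.0}} to {{0.0}} before inserting elements into the set, as Spark already does for grouping and join keys via the NormalizeFloatingNumbers rule, would presumably make array_distinct consistent with {{=}} and {{DISTINCT}}.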