[
https://issues.apache.org/jira/browse/SPARK-51475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050171#comment-18050171
]
Albert Sugranyes commented on SPARK-51475:
------------------------------------------
I've opened SPARK-54918 with a fix that covers this issue and its equivalents
for other array operations.
PR: https://github.com/apache/spark/pull/53695
> ArrayDistinct Producing Inconsistent Behavior For -0.0 and +0.0
> ---------------------------------------------------------------
>
> Key: SPARK-51475
> URL: https://issues.apache.org/jira/browse/SPARK-51475
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0, 3.4.4, 3.5.5
> Reporter: Warrick He
> Priority: Major
> Labels: correctness
>
> This impacts array_distinct. This was tested on Spark versions 3.5.5, 3.5.0,
> and 3.4.4, but it likely affects all versions.
> Problem: inconsistent behavior for 0.0 and -0.0. See below (run on 3.5.5).
> I'm not sure what the desired behavior is: should Spark follow the IEEE 754
> standard and treat them as equal, returning only -0.0 or 0.0, or should it
> consider them distinct?
> {quote}>>> spark.createDataFrame([([0.0, 6.0, -0.0],)],
> ['values']).createOrReplaceTempView("tab")
> >>> spark.sql("select array_distinct(values) from tab").show()
> +----------------------+
> |array_distinct(values)|
> +----------------------+
> | [0.0, 6.0]|
> +----------------------+
>
> >>> spark.createDataFrame([([0.0, -0.0, 6.0],)],
> ['values']).createOrReplaceTempView("tab")
> >>> spark.sql("select array_distinct(values) from tab").show()
> +----------------------+
> |array_distinct(values)|
> +----------------------+
> | [0.0, -0.0, 6.0]|
> +----------------------+
> {quote}
> This issue could be related to the implementation of OpenHashSet.
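The OpenHashSet suspicion is plausible: IEEE 754 defines 0.0 == -0.0, yet the two zeros have different raw bit patterns, so a hash set keyed on raw bits (as Java's Double.doubleToRawLongBits exposes them) can treat them as distinct values, making the dedup result depend on insertion order. Below is a minimal Python sketch of that hazard and of a normalize-before-hash approach; the helper names are hypothetical illustrations, not Spark's actual OpenHashSet or array_distinct code.

```python
import struct

def bits(x: float) -> int:
    """Raw IEEE 754 bit pattern of a double (analogous to Java's
    Double.doubleToRawLongBits)."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

# IEEE 754 equality treats the two zeros as equal...
assert 0.0 == -0.0
# ...but their bit patterns differ: -0.0 has the sign bit set. A hash
# set keyed on raw bits can therefore see them as two distinct values.
assert bits(0.0) != bits(-0.0)
assert bits(-0.0) >> 63 == 1  # sign bit set for -0.0

def distinct_normalized(values):
    """Hypothetical dedup that folds -0.0 into +0.0 before hashing, so
    the result no longer depends on element order. Illustration only;
    NaN handling is omitted."""
    seen, out = set(), []
    for v in values:
        key = 0.0 if v == 0.0 else v  # normalize -0.0 -> +0.0
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out

# Both orderings now agree:
print(distinct_normalized([0.0, 6.0, -0.0]))  # [0.0, 6.0]
print(distinct_normalized([0.0, -0.0, 6.0]))  # [0.0, 6.0]
```

Normalizing floating-point zeros before they reach a bit-level hash structure is the same general idea behind Spark's floating-point normalization rules; the sketch above only demonstrates the mismatch, not the production fix.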
--
This message was sent by Atlassian Jira
(v8.20.10#820010)