Albert Sugranyes created SPARK-54918:
----------------------------------------
Summary: Array operations do not normalize -0.0 to 0.0
Key: SPARK-54918
URL: https://issues.apache.org/jira/browse/SPARK-54918
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.2.0
Reporter: Albert Sugranyes
IEEE 754 defines -0.0 and 0.0 as equal, but the two values have different binary
representations and, consequently, different hash codes. This causes hash-based
array operations to behave incorrectly when an array contains both -0.0 and 0.0.
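For reference, a minimal JVM-level sketch of the mismatch (the exact hash function used by the array expressions is an internal detail, but the same sign-bit difference is what makes the hashes diverge):
{code:scala}
// 0.0 and -0.0 compare equal, yet their bit patterns and hash codes differ.
0.0 == -0.0                                  // true
java.lang.Double.doubleToRawLongBits(0.0)    // 0
java.lang.Double.doubleToRawLongBits(-0.0)   // -9223372036854775808 (sign bit set)
java.lang.Double.hashCode(0.0)               // 0
java.lang.Double.hashCode(-0.0)              // -2147483648
{code}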
Spark normalizes -0.0 to 0.0 in join keys, window partition keys, and aggregate
grouping keys via NormalizeFloatingNumbers. However, hash-based array
operations do not apply this normalization, causing inconsistent behavior.
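For comparison, a minimal sketch of the grouping-key behavior (expected output shown as a comment; assumes a spark-shell session with implicits in scope):
{code:scala}
// Grouping keys are normalized, so 0.0 and -0.0 fall into the same group.
Seq(0.0, -0.0).toDF("v").groupBy("v").count().show()
// Expected: a single row (0.0, 2)
{code}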
Affected operations:
- array_distinct
- array_union
- array_intersect
- array_except
Examples:
{code:scala}
// Works correctly with SQL literals (optimized at compile time)
spark.sql("SELECT array_distinct(array(0.0, -0.0, 1.0))").show()
// Returns [0.0, 1.0]

// Fails with DataFrame data (processed at runtime)
Seq(Array(0.0, -0.0, 1.0)).toDF("values").selectExpr("array_distinct(values)").show()
// Returns [0.0, -0.0, 1.0] instead of [0.0, 1.0]

Seq((Array(0.0), Array(-0.0))).toDF("a", "b").selectExpr("array_union(a, b)").show()
// Returns [0.0, -0.0] instead of [0.0]

Seq((Array(0.0, 1.0), Array(-0.0, 2.0))).toDF("a", "b").selectExpr("array_intersect(a, b)").show()
// Returns [] instead of [0.0]

Seq((Array(0.0, 1.0), Array(-0.0))).toDF("a", "b").selectExpr("array_except(a, b)").show()
// Returns [0.0, 1.0] instead of [1.0]
{code}
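A possible user-level workaround until the expressions normalize internally is to map -0.0 to 0.0 before the operation; adding 0.0 does this under IEEE 754 rounding (sketch only, using the transform higher-order function):
{code:scala}
// Workaround sketch: x + 0.0D turns -0.0 into 0.0 before the hash-based operation.
Seq(Array(0.0, -0.0, 1.0)).toDF("values")
  .selectExpr("array_distinct(transform(values, x -> x + 0.0D))")
  .show()
// Returns [0.0, 1.0]
{code}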
Subsumes SPARK-51475