[ https://issues.apache.org/jira/browse/SPARK-54918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Albert Sugranyes updated SPARK-54918:
-------------------------------------
Description:
IEEE 754 defines -0.0 == 0.0, but the two values have different binary
representations and, consequently, different hash codes. This causes array
operations to behave incorrectly when arrays contain both -0.0 and 0.0.
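The mismatch is visible with plain JVM calls (no Spark involved):
{code:scala}
// The values compare equal, but their bit patterns, and therefore
// their hash codes, differ.
0.0 == -0.0                                   // true (IEEE 754 equality)
java.lang.Double.doubleToRawLongBits(0.0)     // 0
java.lang.Double.doubleToRawLongBits(-0.0)    // -9223372036854775808 (sign bit set)
java.lang.Double.hashCode(0.0)                // 0
java.lang.Double.hashCode(-0.0)               // -2147483648
{code}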
Spark normalizes -0.0 to 0.0 in join keys, window partition keys, and aggregate
grouping keys via NormalizeFloatingNumbers. However, hash-based array
operations do not apply this normalization, causing inconsistent behavior.
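For contrast, a minimal spark-shell check of the normalized path (the
single-group output is the expected result, not captured from a live run):
{code:scala}
// Grouping keys are rewritten by NormalizeFloatingNumbers, so 0.0 and -0.0
// land in the same group: one row with count 2 is expected here.
Seq(0.0, -0.0).toDF("v").groupBy("v").count().show()
{code}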
Affected operations:
- array_distinct
- array_union
- array_intersect
- array_except
Examples:
{code:scala}
// Works correctly with SQL literals (optimized at compile time)
spark.sql("SELECT array_distinct(array(0.0, -0.0, 1.0))").show()
// Returns [0.0, 1.0]

// Fails with DataFrame data (processed at runtime)
Seq(Array(0.0, -0.0, 1.0)).toDF("values").selectExpr("array_distinct(values)").show()
// Returns [0.0, -0.0, 1.0] instead of [0.0, 1.0]

Seq((Array(0.0), Array(-0.0))).toDF("a", "b").selectExpr("array_union(a, b)").show()
// Returns [0.0, -0.0] instead of [0.0]

Seq((Array(0.0, 1.0), Array(-0.0, 2.0))).toDF("a", "b").selectExpr("array_intersect(a, b)").show()
// Returns [] instead of [0.0]

Seq((Array(0.0, 1.0), Array(-0.0))).toDF("a", "b").selectExpr("array_except(a, b)").show()
// Returns [0.0, 1.0] instead of [1.0]
{code}
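Until the array operations normalize internally, a user-side workaround
sketch (my assumption, not part of this report): adding 0.0 maps -0.0 to
+0.0 under IEEE 754 round-to-nearest and leaves other values unchanged, so
the input can be normalized with a higher-order function first:
{code:scala}
// Hypothetical workaround: normalize -0.0 to +0.0 before the set operation.
// x + 0.0 evaluates to +0.0 when x is -0.0 and to x otherwise.
Seq(Array(0.0, -0.0, 1.0)).toDF("values")
  .selectExpr("array_distinct(transform(values, x -> x + 0.0))").show()
// Expected: [0.0, 1.0]
{code}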
Subsumes SPARK-51475
> Array operations do not normalize -0.0 to 0.0
> ---------------------------------------------
>
> Key: SPARK-54918
> URL: https://issues.apache.org/jira/browse/SPARK-54918
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Albert Sugranyes
> Priority: Major
> Labels: correctness, pull-request-available
>