[
https://issues.apache.org/jira/browse/SPARK-49554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Diego Horcajuelo updated SPARK-49554:
--------------------------------------------
Description:
### Summary:
When performing a self inner join with null safety in PySpark using the
`eqNullSafe` method, the resulting output is inconsistent with the equivalent
SQL query using the null-safe equality operator `<=>`. Specifically, the
results of the join differ, as the DataFrame API produces more rows containing
`NULL` values compared to the SQL-based approach.
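For context, null-safe equality (the semantics that both `eqNullSafe` and SQL's `<=>` implement) treats two `NULL`s as equal and a `NULL` compared with a non-`NULL` as unequal. A minimal plain-Python sketch of that rule (the helper name is hypothetical, not part of the Spark API; `None` stands in for SQL `NULL`):

```python
def null_safe_eq(a, b):
    # SQL <=> semantics: NULL <=> NULL is true, NULL <=> non-NULL is false,
    # otherwise ordinary equality. None stands in for SQL NULL here.
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b
```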
### Steps to Reproduce:
The following code reproduces the issue:
```python
from datetime import date, datetime

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

x = spark.createDataFrame(
    [
        (1, "A", date(2021, 1, 1), datetime(2021, 1, 1, 10), 20.5, True),
        (None, None, None, None, None, None),
        (2, "B", date(2021, 1, 2), datetime(2021, 1, 2, 12), 10.8, False),
    ],
    schema="id int, name string, init date, init_t timestamp, amount double, is_active boolean",
)

# Self inner join using the PySpark DataFrame API with null-safe equality
x.alias("x1").join(
    x.alias("x2"), how="inner", on=f.col("x1.id").eqNullSafe(f.col("x2.id"))
).show()

# Self inner join using SQL null-safe equality
x.createOrReplaceTempView("x")
spark.sql(
    """
    select *
    from x x1
    inner join x x2 on x1.id <=> x2.id
    """
).show()
```
### Expected Behavior:
The results from both the PySpark DataFrame API (`eqNullSafe`) and the
SQL-based (`<=>`) null-safe equality join should match. Specifically, both
joins should handle `NULL` values in the same way, returning identical rows.
For the dataset provided above, both joins are expected to output:
```
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
|  id|name|      init|             init_t|amount|is_active|  id|name|      init|             init_t|amount|is_active|
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
|null|null|      null|               null|  null|     null|null|null|      null|               null|  null|     null|
|   1|   A|2021-01-01|2021-01-01 10:00:00|  20.5|     true|   1|   A|2021-01-01|2021-01-01 10:00:00|  20.5|     true|
|   2|   B|2021-01-02|2021-01-02 12:00:00|  10.8|    false|   2|   B|2021-01-02|2021-01-02 12:00:00|  10.8|    false|
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
```
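The expected row set follows from applying null-safe equality pairwise in an inner join: each row matches exactly itself, including the all-`NULL` row. A small plain-Python simulation of that semantics (hypothetical helper names, not Spark API; `None` stands in for SQL `NULL`) makes the expectation explicit:

```python
def null_safe_eq(a, b):
    # SQL <=> semantics, with None standing in for SQL NULL.
    return (a is None and b is None) or (
        a is not None and b is not None and a == b
    )

def null_safe_inner_join(left, right, key):
    # Naive nested-loop inner join on a null-safe key comparison.
    return [(l, r) for l in left for r in right if null_safe_eq(l[key], r[key])]

rows = [{"id": 1}, {"id": None}, {"id": 2}]
pairs = null_safe_inner_join(rows, rows, "id")
# Three pairs: (1,1), (None,None), (2,2) -- the NULL row joins only with itself.
```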
### Actual Behavior:
The PySpark DataFrame API's `eqNullSafe` join instead produces rows whose left side is entirely `NULL`, pairing the `NULL` row with every right-side row; these rows are absent from the SQL-based join:
**PySpark DataFrame API (`eqNullSafe`) result:**
```
+----+----+----+------+------+---------+----+----+----------+-------------------+------+---------+
|  id|name|init|init_t|amount|is_active|  id|name|      init|             init_t|amount|is_active|
+----+----+----+------+------+---------+----+----+----------+-------------------+------+---------+
|null|null|null|  null|  null|     null|   1|   A|2021-01-01|2021-01-01 10:00:00|  20.5|     true|
|null|null|null|  null|  null|     null|null|null|      null|               null|  null|     null|
|null|null|null|  null|  null|     null|   2|   B|2021-01-02|2021-01-02 12:00:00|  10.8|    false|
+----+----+----+------+------+---------+----+----+----------+-------------------+------+---------+
```
**SQL-based (`<=>`) result:**
```
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
|  id|name|      init|             init_t|amount|is_active|  id|name|      init|             init_t|amount|is_active|
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
|null|null|      null|               null|  null|     null|null|null|      null|               null|  null|     null|
|   1|   A|2021-01-01|2021-01-01 10:00:00|  20.5|     true|   1|   A|2021-01-01|2021-01-01 10:00:00|  20.5|     true|
|   2|   B|2021-01-02|2021-01-02 12:00:00|  10.8|    false|   2|   B|2021-01-02|2021-01-02 12:00:00|  10.8|    false|
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
```
### Impact:
This inconsistency causes confusion and could lead to incorrect results when
performing joins involving `NULL` values in PySpark. Users might expect the
same behavior between the PySpark DataFrame API and SQL queries for null-safe
equality joins.
### Suggested Fix:
Ensure that the behavior of `eqNullSafe` in the PySpark DataFrame API matches
the behavior of the SQL `<=>` operator, so that both produce consistent results
when joining on nullable columns.
---
Summary: Inconsistent Behavior Between PySpark DataFrame API's `eqNullSafe`
Join and SQL Null-Safe Equality `<=>`  (was: **Inconsistent Behavior Between
PySpark DataFrame API's `eqNullSafe` Join and SQL Null-Safe Equality `<=>`**)
> Inconsistent Behavior Between PySpark DataFrame API's `eqNullSafe` Join and
> SQL Null-Safe Equality `<=>`
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-49554
> URL: https://issues.apache.org/jira/browse/SPARK-49554
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: - PySpark version: all versions seem to be affected
> - Python version: all Python versions
> - Operating system: [Specify OS and version]
> Reporter: Daniel Diego Horcajuelo
> Priority: Trivial
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]