[
https://issues.apache.org/jira/browse/SPARK-49554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Diego Horcajuelo updated SPARK-49554:
--------------------------------------------
Description:
### Summary:
When performing a self inner join with null safety in PySpark using the
`eqNullSafe` method, the resulting output is inconsistent with the equivalent
SQL query using the null-safe equality operator `<=>`. Specifically, the
results of the join differ, as the DataFrame API produces more rows containing
`NULL` values compared to the SQL-based approach.
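For context, null-safe equality (the semantics that both `eqNullSafe` and SQL's `<=>` implement) treats two `NULL`s as equal and a `NULL` compared with a non-`NULL` as unequal. A minimal plain-Python sketch of that rule (the helper name is hypothetical, not part of the Spark API; `None` stands in for SQL `NULL`):

```python
def null_safe_eq(a, b):
    # SQL <=> semantics: NULL <=> NULL is true, NULL <=> non-NULL is false,
    # otherwise ordinary equality. None stands in for SQL NULL here.
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b
```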
### Steps to Reproduce:
The following code reproduces the issue:
```python
from datetime import date, datetime

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

x = spark.createDataFrame(
    [
        (1, "A", date(2021, 1, 1), datetime(2021, 1, 1, 10), 20.5, True),
        (None, None, None, None, None, None),
        (2, "B", date(2021, 1, 2), datetime(2021, 1, 2, 12), 10.8, False),
    ],
    schema="id int, name string, init date, init_t timestamp, amount double, is_active boolean",
)

# Self inner join using the PySpark DataFrame API with null-safe equality
x.alias("x1").join(
    x.alias("x2"), how="inner", on=f.col("x1.id").eqNullSafe(f.col("x2.id"))
).show()

# Self inner join using SQL null-safe equality
x.createOrReplaceTempView("x")
spark.sql(
    """
    select *
    from x x1
    inner join x x2 on x1.id <=> x2.id
    """
).show()
```
### Expected Behavior:
The results from both the PySpark DataFrame API (`eqNullSafe`) and the
SQL-based (`<=>`) null-safe equality join should match. Specifically, both
joins should handle `NULL` values in the same way, returning identical rows.
For the dataset provided above, both joins are expected to output:
```
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
|  id|name|      init|             init_t|amount|is_active|  id|name|      init|             init_t|amount|is_active|
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
|null|null|      null|               null|  null|     null|null|null|      null|               null|  null|     null|
|   1|   A|2021-01-01|2021-01-01 10:00:00|  20.5|     true|   1|   A|2021-01-01|2021-01-01 10:00:00|  20.5|     true|
|   2|   B|2021-01-02|2021-01-02 12:00:00|  10.8|    false|   2|   B|2021-01-02|2021-01-02 12:00:00|  10.8|    false|
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
```
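The expected row set follows from applying null-safe equality pairwise in an inner join: each row matches exactly itself, including the all-`NULL` row. A small plain-Python simulation of that semantics (hypothetical helper names, not Spark API; `None` stands in for SQL `NULL`) makes the expectation explicit:

```python
def null_safe_eq(a, b):
    # SQL <=> semantics, with None standing in for SQL NULL.
    return (a is None and b is None) or (
        a is not None and b is not None and a == b
    )

def null_safe_inner_join(left, right, key):
    # Naive nested-loop inner join on a null-safe key comparison.
    return [(l, r) for l in left for r in right if null_safe_eq(l[key], r[key])]

rows = [{"id": 1}, {"id": None}, {"id": 2}]
pairs = null_safe_inner_join(rows, rows, "id")
# Three pairs: (1,1), (None,None), (2,2) -- the NULL row joins only with itself.
```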
### Actual Behavior:
The PySpark DataFrame API's `eqNullSafe` join instead produces rows whose left side is entirely `NULL`, pairing the `NULL` row with every right-side row; these rows are absent from the SQL-based join:
**PySpark DataFrame API (`eqNullSafe`) result:**
```
+----+----+----+------+------+---------+----+----+----------+-------------------+------+---------+
|  id|name|init|init_t|amount|is_active|  id|name|      init|             init_t|amount|is_active|
+----+----+----+------+------+---------+----+----+----------+-------------------+------+---------+
|null|null|null|  null|  null|     null|   1|   A|2021-01-01|2021-01-01 10:00:00|  20.5|     true|
|null|null|null|  null|  null|     null|null|null|      null|               null|  null|     null|
|null|null|null|  null|  null|     null|   2|   B|2021-01-02|2021-01-02 12:00:00|  10.8|    false|
+----+----+----+------+------+---------+----+----+----------+-------------------+------+---------+
```
**SQL-based (`<=>`) result:**
```
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
|  id|name|      init|             init_t|amount|is_active|  id|name|      init|             init_t|amount|is_active|
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
|null|null|      null|               null|  null|     null|null|null|      null|               null|  null|     null|
|   1|   A|2021-01-01|2021-01-01 10:00:00|  20.5|     true|   1|   A|2021-01-01|2021-01-01 10:00:00|  20.5|     true|
|   2|   B|2021-01-02|2021-01-02 12:00:00|  10.8|    false|   2|   B|2021-01-02|2021-01-02 12:00:00|  10.8|    false|
+----+----+----------+-------------------+------+---------+----+----+----------+-------------------+------+---------+
```
### Impact:
This inconsistency causes confusion and could lead to incorrect results when
performing joins involving `NULL` values in PySpark. Users might expect the
same behavior between the PySpark DataFrame API and SQL queries for null-safe
equality joins.
### Suggested Fix:
Ensure that the behavior of `eqNullSafe` in the PySpark DataFrame API matches
the behavior of the SQL `<=>` operator, so that both produce consistent results
when joining on nullable columns.
---
Summary: Inconsistent Behavior Between PySpark DataFrame API's `eqNullSafe`
Join and SQL Null-Safe Equality `<=>`  (was: **Inconsistent Behavior Between
PySpark DataFrame API's `eqNullSafe` Join and SQL Null-Safe Equality `<=>`**)
> Inconsistent Behavior Between PySpark DataFrame API's `eqNullSafe` Join and
> SQL Null-Safe Equality `<=>`
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-49554
> URL: https://issues.apache.org/jira/browse/SPARK-49554
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: - PySpark version: all versions seem to be affected
> - Python version: all Python versions
> - Operating system: [Specify OS and version]
> Reporter: Daniel Diego Horcajuelo
> Priority: Trivial
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]