[
https://issues.apache.org/jira/browse/SPARK-44947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias Roels updated SPARK-44947:
-----------------------------------
Description:
Adding two columns row-wise handles NULL values differently from the SUM
aggregate over a single column: the addition returns NULL when either operand
is NULL, while SUM ignores NULLs. This is odd and confusing for users.
Reproducible example:
{code:python}
$ from pyspark.sql import SparkSession
$ from pyspark.sql import functions as f
$ spark = SparkSession.builder.getOrCreate()
$ df = spark.createDataFrame([(1, 2), (2, None)], ["foo", "bar"])
$ df.show()
>
+---+----+
|foo| bar|
+---+----+
|  1|   2|
|  2|null|
+---+----+
$ df.select(f.sum("foo"), f.sum("bar")).show()
>
+--------+--------+
|sum(foo)|sum(bar)|
+--------+--------+
|       3|       2|
+--------+--------+
$ df.select((f.col("foo") + f.col("bar")).alias("sum(foobar)")).show()
>
+-----------+
|sum(foobar)|
+-----------+
|          3|
|       null|
+-----------+
# I was surprised to see the result above; I expected to get:
+-----------+
|sum(foobar)|
+-----------+
|          3|
|          2|
+-----------+
{code}
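To make the difference concrete, here is a minimal plain-Python sketch of the two behaviours shown above, with None standing in for NULL. It is a model only, not Spark's implementation, and the helper names (add_nullable, sum_aggregate, add_coalesced) are hypothetical, introduced purely for illustration.

```python
def add_nullable(a, b):
    """Model of row-wise `a + b`: the result is NULL (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a + b


def sum_aggregate(values):
    """Model of the SUM aggregate: NULLs are skipped; all-NULL input yields NULL."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


def add_coalesced(a, b, default=0):
    """Row-wise addition that replaces NULL operands with `default` first (hypothetical helper)."""
    return (default if a is None else a) + (default if b is None else b)


rows = [(1, 2), (2, None)]  # same data as the reproducible example

print([add_nullable(foo, bar) for foo, bar in rows])   # [3, None]
print(sum_aggregate([foo for foo, _ in rows]),
      sum_aggregate([bar for _, bar in rows]))         # 3 2
print([add_coalesced(foo, bar) for foo, bar in rows])  # [3, 2]
```

In PySpark, the closest analogue to add_coalesced is probably wrapping each operand in f.coalesce(..., f.lit(0)) before adding, which should give the result I expected.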
> Taking sum of two columns behaves differently from sum aggregation function
> ---------------------------------------------------------------------------
>
> Key: SPARK-44947
> URL: https://issues.apache.org/jira/browse/SPARK-44947
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.4.1
> Environment: * Docker container: python:3.10-slim-bullseye
> * Java: openjdk-17-jre-headless
> * Spark 3.4.1
> Reporter: Matthias Roels
> Priority: Major
>