[ 
https://issues.apache.org/jira/browse/SPARK-44947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Roels updated SPARK-44947:
-----------------------------------
    Description: 
Adding two columns with + handles NULL values differently from aggregating a 
single column with SUM: the aggregate skips NULLs, while the row-wise addition 
returns NULL if either operand is NULL. This is odd and confusing for users.

Reproducible example: 
{code:java}
$ from pyspark.sql import SparkSession
$ from pyspark.sql import functions as f
$ spark = SparkSession.builder.getOrCreate()

$ df = spark.createDataFrame([(1, 2), (2, None)], ["foo", "bar"])
$ df.show()
> 
+---+----+
|foo| bar|
+---+----+
|  1|   2|
|  2|null|
+---+----+

$ df.select(f.sum("foo"), f.sum("bar")).show()
>
+--------+--------+
|sum(foo)|sum(bar)|
+--------+--------+
|       3|       2|
+--------+--------+

$ df.select((f.col("foo") + f.col("bar")).alias("sum(foobar)")).show()
> 
+-----------+
|sum(foobar)|
+-----------+
|          3|
|       null|
+-----------+

// I expected to get the result below, but was surprised to see the result above
+-----------+
|sum(foobar)|
+-----------+
|          3|
|          2|
+-----------+
{code}
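
For reference, the difference reported above can be modeled in plain Python. This is only a sketch of the standard SQL NULL rules, not Spark's implementation: the helper names sql_add, sql_sum and coalesce are hypothetical and written just for this illustration, with None standing in for NULL.

```python
def sql_add(a, b):
    """Row-wise `+` under SQL rules: NULL if either operand is NULL."""
    if a is None or b is None:
        return None
    return a + b


def sql_sum(values):
    """Aggregate SUM under SQL rules: NULLs are skipped.

    Returns None only when there is no non-NULL input at all.
    """
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


def coalesce(*values):
    """Return the first non-NULL argument, or None if all are NULL."""
    for v in values:
        if v is not None:
            return v
    return None


# Same data as the DataFrame in the report: columns foo and bar.
rows = [(1, 2), (2, None)]

# Per-column aggregation skips the NULL in bar.
column_sums = (sql_sum(r[0] for r in rows), sql_sum(r[1] for r in rows))

# Row-wise addition propagates the NULL in the second row.
row_sums = [sql_add(foo, bar) for foo, bar in rows]

# Replacing NULL with 0 before adding gives the result the reporter expected.
row_sums_coalesced = [
    sql_add(coalesce(foo, 0), coalesce(bar, 0)) for foo, bar in rows
]

print(column_sums)         # (3, 2)
print(row_sums)            # [3, None]
print(row_sums_coalesced)  # [3, 2]
```

In PySpark itself, the analogous workaround would presumably be to wrap each operand, for example f.coalesce(f.col("bar"), f.lit(0)), before adding the columns.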
 

  was:
Adding two columns with + handles NULL values differently from aggregating a 
single column with SUM: the aggregate skips NULLs, while the row-wise addition 
returns NULL if either operand is NULL. This is odd and confusing for users.

Reproducible example: 
{code:java}
$ from pyspark.sql import SparkSession
$ from pyspark.sql import functions as f
$ spark = SparkSession.builder.getOrCreate()

$ df = spark.createDataFrame([(1, 2), (2, None)], ["foo", "bar"])
$ df.show()
> 
+---+----+
|foo| bar|
+---+----+
|  1|   2|
|  2|null|
+---+----+

$ df.select(f.sum("foo"), f.sum("bar")).show()
>
+--------+--------+
|sum(foo)|sum(bar)|
+--------+--------+
|       3|       2|
+--------+--------+

$ df.select((f.col("foo") + f.col("bar")).alias("sum(foobar)")).show()
> 
+-----------+
|sum(foobar)|
+-----------+
|          3|
|       null|
+-----------+
{code}


> Taking sum of two columns behaves differently from sum aggregation function
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-44947
>                 URL: https://issues.apache.org/jira/browse/SPARK-44947
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.1
>         Environment: * Docker container: python:3.10-slim-bullseye
>  * Java: openjdk-17-jre-headless
>  * Spark 3.4.1
>            Reporter: Matthias Roels
>            Priority: Major
>
> Adding two columns with + handles NULL values differently from aggregating a 
> single column with SUM: the aggregate skips NULLs, while the row-wise addition 
> returns NULL if either operand is NULL. This is odd and confusing for users.
> Reproducible example: 
> {code:java}
> $ from pyspark.sql import SparkSession
> $ from pyspark.sql import functions as f
> $ spark = SparkSession.builder.getOrCreate()
> $ df = spark.createDataFrame([(1, 2), (2, None)], ["foo", "bar"])
> $ df.show()
> > 
> +---+----+
> |foo| bar|
> +---+----+
> |  1|   2|
> |  2|null|
> +---+----+
> $ df.select(f.sum("foo"), f.sum("bar")).show()
> >
> +--------+--------+
> |sum(foo)|sum(bar)|
> +--------+--------+
> |       3|       2|
> +--------+--------+
> $ df.select((f.col("foo") + f.col("bar")).alias("sum(foobar)")).show()
> > 
> +-----------+
> |sum(foobar)|
> +-----------+
> |          3|
> |       null|
> +-----------+
> // I expected to get the result below, but was surprised to see the result above
> +-----------+
> |sum(foobar)|
> +-----------+
> |          3|
> |          2|
> +-----------+
> {code}
>  



