[jira] [Commented] (SPARK-44947) Taking sum of two columns behaves differently from sum aggregation function

Hyukjin Kwon (Jira) Thu, 24 Aug 2023 20:32:05 -0700


    [ https://issues.apache.org/jira/browse/SPARK-44947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758807#comment-17758807 ]


Hyukjin Kwon commented on SPARK-44947:
--------------------------------------

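Simply put, the `sum` aggregate is NULL-tolerant (it skips NULL inputs), but 
the arithmetic operators aren't: `foo + bar` is NULL whenever either operand 
is NULL. As far as I know, all DBMSes work like that.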
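
As a rough check of the claim above against one other engine, here is a 
minimal sketch using SQLite through Python's standard `sqlite3` module. The 
table name `t` is made up for illustration; the rows mirror the reporter's 
DataFrame. It only shows how SQLite behaves and is not a statement about 
every DBMS.

```python
import sqlite3

# In-memory database holding the same two rows as the report.
# The table name "t" is a placeholder chosen for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (foo INTEGER, bar INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 2), (2, None)])

# The SUM aggregate skips NULL inputs.
agg = conn.execute("SELECT SUM(foo), SUM(bar) FROM t").fetchone()

# Row-wise + yields NULL when either operand is NULL.
row_wise = conn.execute("SELECT foo + bar FROM t ORDER BY foo").fetchall()

# COALESCE substitutes 0 for NULL before the addition.
coalesced = conn.execute(
    "SELECT foo + COALESCE(bar, 0) FROM t ORDER BY foo"
).fetchall()

print(agg)        # (3, 2)
print(row_wise)   # [(3,), (None,)]
print(coalesced)  # [(3,), (2,)]
```

If the NULL-as-zero result is what is wanted in PySpark, the same idea can 
be written as `f.col("foo") + f.coalesce(f.col("bar"), f.lit(0))`.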

> Taking sum of two columns behaves differently from sum aggregation function
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-44947
>                 URL: https://issues.apache.org/jira/browse/SPARK-44947
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.1
>         Environment: * Docker container: python:3.10-slim-bullseye
>  * Java: openjdk-17-jre-headless
>  * Spark 3.4.1
>            Reporter: Matthias Roels
>            Priority: Minor
>
> Adding two columns with + behaves differently when there are NULL values 
> than taking the SUM of a column. This is odd and confusing for users.
> Reproducible example: 
> {code:python}
> $ from pyspark.sql import SparkSession
> $ import pyspark.sql.functions as f
> $ spark = SparkSession.builder.getOrCreate()
> $ df = spark.createDataFrame([(1, 2), (2, None)], ["foo", "bar"])
> $ df.show()
> > 
> +---+----+
> |foo| bar|
> +---+----+
> |  1|   2|
> |  2|null|
> +---+----+
> $ df.select(f.sum("foo"), f.sum("bar")).show()
> >
> +--------+--------+
> |sum(foo)|sum(bar)|
> +--------+--------+
> |       3|       2|
> +--------+--------+
> $ df.select((f.col("foo") + f.col("bar")).alias("sum(foobar)")).show()
> > 
> +-----------+
> |sum(foobar)|
> +-----------+
> |          3|
> |       null|
> +-----------+
> # I expected the result below, but was surprised to get the one above:
> +-----------+
> |sum(foobar)|
> +-----------+
> |          3|
> |          2|
> +-----------+
> {code}
>  


