[ https://issues.apache.org/jira/browse/SPARK-44947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758807#comment-17758807 ]

Hyukjin Kwon commented on SPARK-44947:
--------------------------------------

Simply `sum` is null tolerant but the arithmetic operators aren't. I believe all DBMSes work like that?

> Taking sum of two columns behaves differently from sum aggregation function
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-44947
>                 URL: https://issues.apache.org/jira/browse/SPARK-44947
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.1
>         Environment: * Docker container: python:3.10-slim-bullseye
>                      * Java: openjdk-17-jre-headless
>                      * Spark 3.4.1
>            Reporter: Matthias Roels
>            Priority: Minor
>
> Taking the sum of two columns behaves differently when there are NULL values
> than taking the SUM of a column. This is odd and confusing for users
> Reproducible example:
> {code:java}
> $ from pyspark.sql import SparkSession
> $ import pyspark.sql.functions as f
> $ spark = SparkSession.builder.getOrCreate()
> $ df = spark.createDataFrame([(1, 2), (2, None)], ["foo", "bar"])
> $ df.show()
>
> +---+----+
> |foo| bar|
> +---+----+
> |  1|   2|
> |  2|null|
> +---+----+
> $ df.select(f.sum("foo"), f.sum("bar")).show()
>
> +--------+--------+
> |sum(foo)|sum(bar)|
> +--------+--------+
> |       3|       2|
> +--------+--------+
> $ df.select((f.col("foo") + f.col("bar")).alias("sum(foobar)")).show()
>
> +-----------+
> |sum(foobar)|
> +-----------+
> |          3|
> |       null|
> +-----------+
> // I expected to get, but I was surprised to see the result above
> +-----------+
> |sum(foobar)|
> +-----------+
> |          3|
> |          2|
> +-----------+
> {code}
>
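
To illustrate the comment's point that other DBMSes treat NULL the same way, here is a minimal sketch that replays the reporter's two rows in SQLite through Python's standard `sqlite3` module. SQLite is only a stand-in chosen because it runs without a Spark installation; the table name `df` and the helper `null_semantics_demo` are hypothetical names introduced for this example and are not part of the issue.

```python
import sqlite3


def null_semantics_demo():
    """Replay the issue's data in SQLite and return three results:
    (aggregate sums, row-wise sums, row-wise sums with COALESCE)."""
    conn = sqlite3.connect(":memory:")
    try:
        # Same two rows as the reporter's DataFrame: (1, 2) and (2, NULL).
        conn.execute("CREATE TABLE df (foo INTEGER, bar INTEGER)")
        conn.executemany("INSERT INTO df VALUES (?, ?)", [(1, 2), (2, None)])

        # The SUM aggregate skips NULL inputs.
        aggregate = conn.execute("SELECT SUM(foo), SUM(bar) FROM df").fetchone()

        # The + operator propagates NULL: any NULL operand gives a NULL result.
        row_wise = [
            row[0]
            for row in conn.execute("SELECT foo + bar FROM df ORDER BY foo")
        ]

        # Replacing NULL with 0 first gives the result the reporter expected.
        coalesced = [
            row[0]
            for row in conn.execute(
                "SELECT COALESCE(foo, 0) + COALESCE(bar, 0) FROM df ORDER BY foo"
            )
        ]
    finally:
        conn.close()
    return aggregate, row_wise, coalesced


if __name__ == "__main__":
    aggregate, row_wise, coalesced = null_semantics_demo()
    print("SUM(foo), SUM(bar):", aggregate)
    print("foo + bar per row:", row_wise)
    print("with COALESCE:", coalesced)
```

The same workaround should carry over to the PySpark example from the issue by wrapping each operand, for instance `f.coalesce(f.col("bar"), f.lit(0))`, before adding the columns. `coalesce` and `lit` are existing functions in `pyspark.sql.functions`. Whether 0 is the right substitute for a missing value depends on the data.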