[jira] [Comment Edited] (SPARK-54372) PySpark: incorrect `avg()` query result

Vindhya G (Jira) Mon, 08 Dec 2025 01:11:06 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-54372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043483#comment-18043483
 ]


Vindhya G edited comment on SPARK-54372 at 12/8/25 9:10 AM:
------------------------------------------------------------

With [~teslawho] comment I take back my earlier comment about not needing any 
change.  If there are so many use cases and the doc says it returns timestamp 
then atleast the behavior and the doc should match 


was (Author: JIRAUSER299405):
With [~teslawho] comment I take back my earlier comment about not needing any 
change.  If there are so many use cases and the doc says it returns timestamp 
then atleast the behavior and the doc should match. 

> PySpark: incorrect `avg(<timestamp>)` query result
> --------------------------------------------------
>
>                 Key: SPARK-54372
>                 URL: https://issues.apache.org/jira/browse/SPARK-54372
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 4.0.1
>         Environment: Platform:            Ubuntu 24.04 
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python:              3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 
> 22:29:10) [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode, 
> sharing)
> pyspark                  4.0.1
> duckdb                   1.4.2
> pandas                   2.3.3
> pyarrow                  22.0.0
>            Reporter: asddfl
>            Priority: Critical
>              Labels: pull-request-available
>
> The `avg(<timestamp>)` query result of PySpark is incorrect.
> I consider the query results from PySpark should be the same as those from 
> DuckDBSpark, returning a timestamp value.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from duckdb.experimental.spark.sql import SparkSession as DuckdbSparkSession
> pd_df = pd.DataFrame({
>     'c0': ['1969-12-21'],
> })
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> print("PySpark result:")
> pyspark_result = 
> spark.table("t0").groupBy("c0").agg(F.avg(F.col("c0").cast("timestamp")))
> pyspark_result.show()
> duckdb_spark = DuckdbSparkSession.builder.getOrCreate()
> df = duckdb_spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> from duckdb.experimental.spark.sql import functions as F
> print("Duckdb Spark result: ")
> duckdb_spark_result = 
> duckdb_spark.table("t0").groupBy("c0").agg(F.avg(F.col("c0").cast("timestamp")))
> duckdb_spark_result.show()
> {code}
> {code:bash}
> PySpark result:
> +----------+--------------------------+                                       
>   
> |        c0|avg(CAST(c0 AS TIMESTAMP))|
> +----------+--------------------------+
> |1969-12-21|                 -979200.0|
> +----------+--------------------------+
> Duckdb Spark result: 
> ┌──────┬──────────────┐
> │     c0     │ avg(CAST(c0 AS TIMESTAMP)) │
> │  varchar   │         timestamp          │
> ├──────┼──────────────┤
> │ 1969-12-21 │ 1969-12-21 00:00:00        │
> └──────┴──────────────┘
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (SPARK-54372) PySpark: incorrect `avg()` query result

Reply via email to