[ 
https://issues.apache.org/jira/browse/SPARK-54372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043904#comment-18043904
 ] 

Ashrith Bandla commented on SPARK-54372:
----------------------------------------

Thank you both [~teslawho] [~vindhyag] for the feedback. I went through some of 
the docs and did not see where Spark's own SQL documentation contradicts this 
behavior for timestamps; rather, the behavior is simply not specified or 
explained. Could either of you point me to the docs that claim 
{{avg(timestamp)}} returns a timestamp? To me, the gap in the documentation is 
the silent cast that happens.
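For reference, the reported value is consistent with a silent cast to seconds since the Unix epoch. A minimal sketch in plain Python of that arithmetic (the UTC+8 session timezone is an assumption on my part; the ticket does not state the reporter's timezone):

{code:python}
from datetime import datetime, timezone, timedelta

# Hypothesis: avg(<timestamp>) operates on epoch seconds as a DOUBLE.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# 1969-12-21 00:00:00 interpreted in an assumed UTC+8 session timezone
ts = datetime(1969, 12, 21, tzinfo=timezone(timedelta(hours=8)))

seconds = (ts - epoch).total_seconds()
print(seconds)  # -979200.0, matching the PySpark output in the report
{code}

That is 11 days plus the 8-hour offset before the epoch (11 * 86400 + 8 * 3600 = 979200), which would explain the {{-979200.0}} in the reproduction below.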

> PySpark: incorrect `avg(<timestamp>)` query result
> --------------------------------------------------
>
>                 Key: SPARK-54372
>                 URL: https://issues.apache.org/jira/browse/SPARK-54372
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 4.0.1
>         Environment: Platform:            Ubuntu 24.04 
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python:              3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 
> 22:29:10) [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode, 
> sharing)
> pyspark                  4.0.1
> duckdb                   1.4.2
> pandas                   2.3.3
> pyarrow                  22.0.0
>            Reporter: asddfl
>            Priority: Critical
>              Labels: pull-request-available
>
> The `avg(<timestamp>)` query result of PySpark is incorrect.
> I believe the query results from PySpark should match those from DuckDB's 
> Spark API, returning a timestamp value.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from duckdb.experimental.spark.sql import SparkSession as DuckdbSparkSession
> pd_df = pd.DataFrame({
>     'c0': ['1969-12-21'],
> })
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> print("PySpark result:")
> pyspark_result = spark.table("t0").groupBy("c0").agg(F.avg(F.col("c0").cast("timestamp")))
> pyspark_result.show()
> duckdb_spark = DuckdbSparkSession.builder.getOrCreate()
> df = duckdb_spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> from duckdb.experimental.spark.sql import functions as F
> print("Duckdb Spark result: ")
> duckdb_spark_result = duckdb_spark.table("t0").groupBy("c0").agg(F.avg(F.col("c0").cast("timestamp")))
> duckdb_spark_result.show()
> {code}
> {code:bash}
> PySpark result:
> +----------+--------------------------+
> |        c0|avg(CAST(c0 AS TIMESTAMP))|
> +----------+--------------------------+
> |1969-12-21|                 -979200.0|
> +----------+--------------------------+
> Duckdb Spark result: 
> ┌────────────┬────────────────────────────┐
> │     c0     │ avg(CAST(c0 AS TIMESTAMP)) │
> │  varchar   │         timestamp          │
> ├────────────┼────────────────────────────┤
> │ 1969-12-21 │ 1969-12-21 00:00:00        │
> └────────────┴────────────────────────────┘
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
