[ 
https://issues.apache.org/jira/browse/SPARK-54372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

asddfl updated SPARK-54372:
---------------------------
    Priority: Critical  (was: Major)

> PySpark: incorrect `avg(<timestamp>)` query result
> --------------------------------------------------
>
>                 Key: SPARK-54372
>                 URL: https://issues.apache.org/jira/browse/SPARK-54372
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 4.0.1
>         Environment: Platform:            Ubuntu 24.04 
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python:              3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10) [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode, sharing)
> pyspark                  4.0.1
> duckdb                   1.4.2
> pandas                   2.3.3
> pyarrow                  22.0.0
>            Reporter: asddfl
>            Priority: Critical
>
> PySpark returns an incorrect result for `avg(<timestamp>)`: a double (-979200.0) instead of a timestamp.
> I expect the PySpark result to match the result from the DuckDB Spark API, which is a timestamp value.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from duckdb.experimental.spark.sql import SparkSession as DuckdbSparkSession
> sql_text = "SELECT AVG(CAST(t0.c0 AS TIMESTAMP)) FROM t0"
> pd_df = pd.DataFrame({
>     'c0': ['1969-12-21'],
> })
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> print("PySpark SQL result:")
> pyspark_result = spark.sql(sql_text)
> pyspark_result.show()
> print("PySpark API result:")
> pyspark_result = spark.table("t0").select(F.avg(F.col("c0").cast("timestamp")))
> pyspark_result.show()
> duckdb_spark = DuckdbSparkSession.builder.getOrCreate()
> df = duckdb_spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> print("Duckdb Spark SQL result: ")
> duckdb_spark_result = duckdb_spark.sql(sql_text)
> duckdb_spark_result.show()
> {code}
> {code:bash}
> PySpark SQL result:
> +--------------------------+
> |avg(CAST(c0 AS TIMESTAMP))|
> +--------------------------+
> |                 -979200.0|
> +--------------------------+
> PySpark API result:
> +--------------------------+
> |avg(CAST(c0 AS TIMESTAMP))|
> +--------------------------+
> |                 -979200.0|
> +--------------------------+
> Duckdb Spark SQL result: 
> ┌────────────────────────────────┐
> │ avg(CAST(t0.c0 AS TIMESTAMP))  │
> │           timestamp            │
> ├────────────────────────────────┤
> │ 1969-12-21 00:00:00            │
> └────────────────────────────────┘
> {code}
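
For reference, the value -979200.0 in the PySpark output appears consistent with the input timestamp expressed as seconds since the Unix epoch. The sketch below checks that arithmetic with the Python standard library only. The UTC+8 offset is an assumption inferred from the number itself; the report does not state the session time zone, and the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Value printed by PySpark in the report above.
avg_value = -979200.0

# Interpret the value as seconds since 1970-01-01 00:00:00 UTC.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
utc_instant = epoch + timedelta(seconds=avg_value)

# Assumption: the Spark session time zone was UTC+8 (not stated in the report).
assumed_zone = timezone(timedelta(hours=8))
local_instant = utc_instant.astimezone(assumed_zone)

print(utc_instant.isoformat())
print(local_instant.isoformat())
```

Under that assumption the double decodes to midnight on 1969-12-21 local time, which matches the timestamp DuckDB prints. If so, the numeric value is right and the disagreement concerns only the result type of `avg`.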



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
