[
https://issues.apache.org/jira/browse/SPARK-54372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043464#comment-18043464
]
asddfl edited comment on SPARK-54372 at 12/9/25 6:27 PM:
---------------------------------------------------------
[~ashrithb] I agree the numeric value can be explained once we understand the
Hive rule, but the issue is that the behavior is neither intuitive nor
documented. Most users do not expect an implicit cast to double or negative
epoch values.
Also, avg(timestamp) does have real production use cases — IoT event centroids,
log analytics, time-series windowing, financial tick data, etc. Many modern
engines (DuckDB, Pandas/Polars) support averaging timestamps directly.
Even if we keep the legacy Hive behavior for compatibility, I think at minimum
we should:
(1) document it clearly, and
(2) emit a warning,
or better yet (3) change the return type to timestamp,
so users don't unknowingly lose timestamp semantics.
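For reference, a plain-Python sketch (my own illustration, not Spark code) of the timestamp-preserving semantics proposed in (3): average the epoch seconds, then convert back, so the result stays a timestamp rather than a bare double.

{code:python}
from datetime import datetime, timezone

# Average timestamps via their epoch seconds, then convert back to a
# timestamp, so the result keeps timestamp semantics.
ts = [datetime(1969, 12, 21, tzinfo=timezone.utc),
      datetime(1969, 12, 23, tzinfo=timezone.utc)]
mean_epoch = sum(t.timestamp() for t in ts) / len(ts)
avg_ts = datetime.fromtimestamp(mean_epoch, tz=timezone.utc)
# avg_ts -> 1969-12-22 00:00:00+00:00
{code}

In Spark terms the equivalent workaround would be something like `F.avg(col.cast("long")).cast("timestamp")` (an untested sketch, not the proposed fix itself).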
> PySpark: incorrect `avg(<timestamp>)` query result
> --------------------------------------------------
>
> Key: SPARK-54372
> URL: https://issues.apache.org/jira/browse/SPARK-54372
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.1
> Environment: Platform: Ubuntu 24.04
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025,
> 22:29:10) [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode,
> sharing)
> pyspark 4.0.1
> duckdb 1.4.2
> pandas 2.3.3
> pyarrow 22.0.0
> Reporter: asddfl
> Priority: Critical
> Labels: pull-request-available
>
> The `avg(<timestamp>)` query result of PySpark is incorrect.
> I believe the PySpark query result should match the one from DuckDB's Spark
> API, returning a timestamp value.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from duckdb.experimental.spark.sql import SparkSession as DuckdbSparkSession
>
> pd_df = pd.DataFrame({
>     'c0': ['1969-12-21'],
> })
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
>
> print("PySpark result:")
> pyspark_result = spark.table("t0").groupBy("c0").agg(
>     F.avg(F.col("c0").cast("timestamp"))
> )
> pyspark_result.show()
>
> duckdb_spark = DuckdbSparkSession.builder.getOrCreate()
> df = duckdb_spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
>
> # Shadow the PySpark functions with DuckDB's Spark-compatible functions
> from duckdb.experimental.spark.sql import functions as F
>
> print("DuckDB Spark result:")
> duckdb_spark_result = duckdb_spark.table("t0").groupBy("c0").agg(
>     F.avg(F.col("c0").cast("timestamp"))
> )
> duckdb_spark_result.show()
> {code}
> {code:bash}
> PySpark result:
> +----------+--------------------------+
> |        c0|avg(CAST(c0 AS TIMESTAMP))|
> +----------+--------------------------+
> |1969-12-21|                 -979200.0|
> +----------+--------------------------+
>
> DuckDB Spark result:
> ┌────────────┬────────────────────────────┐
> │     c0     │ avg(CAST(c0 AS TIMESTAMP)) │
> │  varchar   │         timestamp          │
> ├────────────┼────────────────────────────┤
> │ 1969-12-21 │    1969-12-21 00:00:00     │
> └────────────┴────────────────────────────┘
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]