[
https://issues.apache.org/jira/browse/SPARK-54372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043464#comment-18043464
]
asddfl edited comment on SPARK-54372 at 12/9/25 6:27 PM:
---------------------------------------------------------
[~ashrithb] I agree the numeric value can be explained once we understand the
Hive rule, but the issue is that the behavior is neither intuitive nor
documented. Most users do not expect an implicit cast to double or negative
epoch values.
Also, avg(timestamp) does have real production use cases — IoT event centroids,
log analytics, time-series windowing, financial tick data, etc. Many modern
engines (DuckDB, Pandas/Polars) support averaging timestamps directly.
Even if we keep the legacy Hive behavior for compatibility, I think at minimum
we should:
(1) document it clearly, and
(2) emit a warning,
or better yet (3) change the return type to timestamp,
so users don't unknowingly lose timestamp semantics.
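For reference, a plain-Python sketch (my own illustration, not Spark code) of the timestamp-preserving semantics proposed in (3): average the epoch seconds, then convert back, so the result stays a timestamp rather than a bare double.

{code:python}
from datetime import datetime, timezone

# Average timestamps via their epoch seconds, then convert back to a
# timestamp, so the result keeps timestamp semantics.
ts = [datetime(1969, 12, 21, tzinfo=timezone.utc),
      datetime(1969, 12, 23, tzinfo=timezone.utc)]
mean_epoch = sum(t.timestamp() for t in ts) / len(ts)
avg_ts = datetime.fromtimestamp(mean_epoch, tz=timezone.utc)
# avg_ts -> 1969-12-22 00:00:00+00:00
{code}

In Spark terms the equivalent workaround would be something like `F.avg(col.cast("long")).cast("timestamp")` (an untested sketch, not the proposed fix itself).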
> PySpark: incorrect `avg(<timestamp>)` query result
> --------------------------------------------------
>
> Key: SPARK-54372
> URL: https://issues.apache.org/jira/browse/SPARK-54372
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.1
> Environment: Platform: Ubuntu 24.04
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025,
> 22:29:10) [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode,
> sharing)
> pyspark 4.0.1
> duckdb 1.4.2
> pandas 2.3.3
> pyarrow 22.0.0
> Reporter: asddfl
> Priority: Critical
> Labels: pull-request-available
>
> The `avg(<timestamp>)` query result of PySpark is incorrect.
> I believe the PySpark query result should match the one from DuckDB's Spark
> API, returning a timestamp value.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from duckdb.experimental.spark.sql import SparkSession as DuckdbSparkSession
>
> pd_df = pd.DataFrame({
>     'c0': ['1969-12-21'],
> })
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
>
> print("PySpark result:")
> pyspark_result = spark.table("t0").groupBy("c0").agg(
>     F.avg(F.col("c0").cast("timestamp"))
> )
> pyspark_result.show()
>
> duckdb_spark = DuckdbSparkSession.builder.getOrCreate()
> df = duckdb_spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
>
> # Shadow the PySpark functions with DuckDB's Spark-compatible functions
> from duckdb.experimental.spark.sql import functions as F
>
> print("DuckDB Spark result:")
> duckdb_spark_result = duckdb_spark.table("t0").groupBy("c0").agg(
>     F.avg(F.col("c0").cast("timestamp"))
> )
> duckdb_spark_result.show()
> {code}
> {code:bash}
> PySpark result:
> +----------+--------------------------+
> |        c0|avg(CAST(c0 AS TIMESTAMP))|
> +----------+--------------------------+
> |1969-12-21|                 -979200.0|
> +----------+--------------------------+
>
> DuckDB Spark result:
> ┌────────────┬────────────────────────────┐
> │     c0     │ avg(CAST(c0 AS TIMESTAMP)) │
> │  varchar   │         timestamp          │
> ├────────────┼────────────────────────────┤
> │ 1969-12-21 │    1969-12-21 00:00:00     │
> └────────────┴────────────────────────────┘
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]