[
https://issues.apache.org/jira/browse/SPARK-54372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042979#comment-18042979
]
Vindhya G commented on SPARK-54372:
-----------------------------------
I see that Scala Spark behaves the same way as PySpark. Per the code, PySpark
internally casts the timestamp to a Java double (epoch seconds), and Scala
does the same. DuckDB Spark, on the other hand, uses Python's native datetime
object, which causes the difference in behaviour:
[https://github.com/duckdb/duckdb/blob/v1.3-ossivalis/tools/pythonpkg/duckdb/experimental/spark/sql/type_utils.py#L104|https://github.com/duckdb/duckdb/blob/v1.3-ossivalis/tools/pythonpkg/duckdb/experimental/spark/sql/type_utils.py#L104]
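For reference, the double value Spark produces is plain epoch arithmetic, which the sketch below reproduces in pure Python (no Spark needed). The UTC+08:00 session time zone is an inference from the reported output, not something stated in the ticket.

```python
from datetime import datetime, timezone

# Sketch of what Spark's cast(timestamp as double) yields: seconds since
# the Unix epoch. '1969-12-21' predates 1970-01-01, so the value is negative.
ts = datetime(1969, 12, 21, tzinfo=timezone.utc)
epoch_seconds = ts.timestamp()
print(epoch_seconds)  # -950400.0 (11 days before the epoch, in UTC)

# The ticket shows -979200.0; the extra -28800 s (8 h) suggests the
# reporter's Spark session time zone was UTC+08:00 (an inference, not
# stated in the ticket).
print(epoch_seconds - 8 * 3600)  # -979200.0
```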
from_unixtime can be used in PySpark to convert the epoch value back to a timestamp:
{code:python}
.agg(F.from_unixtime(F.avg(F.col("c0").cast("timestamp"))))
{code}
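A plain-Python analogue of that from_unixtime round trip, as a Spark-free sketch (UTC is assumed here purely for illustration; Spark would apply the session time zone):

```python
from datetime import datetime, timezone

# Plain-Python analogue of from_unixtime: turn the averaged epoch-seconds
# value back into a timestamp string.
avg_epoch = -950400.0  # avg(cast('1969-12-21' as timestamp)) in a UTC session
restored = datetime.fromtimestamp(avg_epoch, tz=timezone.utc)
print(restored.strftime("%Y-%m-%d %H:%M:%S"))  # 1969-12-21 00:00:00
```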
> PySpark: incorrect `avg(<timestamp>)` query result
> --------------------------------------------------
>
> Key: SPARK-54372
> URL: https://issues.apache.org/jira/browse/SPARK-54372
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.1
> Environment: Platform: Ubuntu 24.04
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025,
> 22:29:10) [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode,
> sharing)
> pyspark 4.0.1
> duckdb 1.4.2
> pandas 2.3.3
> pyarrow 22.0.0
> Reporter: asddfl
> Priority: Critical
>
> The `avg(<timestamp>)` query result of PySpark is incorrect.
> I consider the query results from PySpark should be the same as those from
> DuckDBSpark, returning a timestamp value.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from duckdb.experimental.spark.sql import SparkSession as DuckdbSparkSession
> pd_df = pd.DataFrame({
> 'c0': ['1969-12-21'],
> })
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> print("PySpark result:")
> pyspark_result = spark.table("t0").groupBy("c0").agg(F.avg(F.col("c0").cast("timestamp")))
> pyspark_result.show()
> duckdb_spark = DuckdbSparkSession.builder.getOrCreate()
> df = duckdb_spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> from duckdb.experimental.spark.sql import functions as F
> print("Duckdb Spark result: ")
> duckdb_spark_result = duckdb_spark.table("t0").groupBy("c0").agg(F.avg(F.col("c0").cast("timestamp")))
> duckdb_spark_result.show()
> {code}
> {code:bash}
> PySpark result:
> +----------+--------------------------+
> |        c0|avg(CAST(c0 AS TIMESTAMP))|
> +----------+--------------------------+
> |1969-12-21|                 -979200.0|
> +----------+--------------------------+
> Duckdb Spark result:
> ┌────────────┬────────────────────────────┐
> │     c0     │ avg(CAST(c0 AS TIMESTAMP)) │
> │  varchar   │         timestamp          │
> ├────────────┼────────────────────────────┤
> │ 1969-12-21 │    1969-12-21 00:00:00     │
> └────────────┴────────────────────────────┘
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)