[ 
https://issues.apache.org/jira/browse/SPARK-49858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937683#comment-17937683
 ] 

Lam Tran commented on SPARK-49858:
----------------------------------

[~dhimmel]  I think Spark 3.0 and onward performs strict timestamp parsing from 
strings; you can find the supported datetime patterns in this link: 
[https://spark.apache.org/docs/3.5.3/sql-ref-datetime-pattern.html]
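
For context, the {{ValueError: year 23455 is out of range}} in the report ultimately comes from Python's {{datetime}} range limit (years 1 through 9999): once Spark infers a far-future year from a digit string, the value cannot be materialized on the Python side. A minimal illustration of that limit in plain Python (not Spark's exact code path):
{code:python}
import datetime

# Python's datetime supports years 1 through 9999 only.
print(datetime.MAXYEAR)  # 9999

# So a value inferred as the year 23456 fails when PySpark converts it
# back to a Python datetime (illustration only):
try:
    datetime.datetime(23456, 1, 1)
except ValueError as e:
    print(e)  # year 23456 is out of range
{code}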

I think we should not modify the existing behavior of the current 
algorithm, since Spark also needs to support other Spark SQL JSON functions in 
cases where the user passes only the year, or the year and month, and expects a 
Timestamp in return.

Hence, the solution for your use case is indeed to explicitly set the 
*timestampFormat* option so that the *JSON Reader* infers strings correctly.
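
A sketch of that workaround (assuming Spark 3.5.x): with an explicit *timestampFormat*, a plain digit string such as {{"23456"}} no longer matches the inference pattern and stays a string.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timestamp_test").getOrCreate()
rdd = spark.sparkContext.parallelize(['{"field" : "23456"}'])
df = (
    spark.read.option("inferTimestamp", True)
    # explicit format: only ISO-like strings are inferred as timestamps
    .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")
    .json(rdd)
)
# field is inferred as string, and collect() succeeds
{code}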

> Pyspark JSON reader incorrectly considers a string of digits a timestamp and 
> fails
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-49858
>                 URL: https://issues.apache.org/jira/browse/SPARK-49858
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Daniel Himmelstein
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2025-03-22-17-20-24-495.png, 
> image-2025-03-22-17-23-17-473.png
>
>
> With pyspark 3.5.0, reading the following JSON fails:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName("timestamp_test").getOrCreate()
> data = spark.sparkContext.parallelize(['{"field" : "23456"}'])
> df = (
>     spark.read.option("inferTimestamp", True)
>     # .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")
>     .json(path=data)
> )
> df.printSchema()
> df.collect()
> {code}
> The printSchema command shows that the field is parsed as a timestamp, 
> causing the following error:
> {code:java}
> File 
> ~/miniforge3/envs/facets/lib/python3.11/site-packages/pyspark/sql/types.py:282,
>  in TimestampType.fromInternal(self, ts)
>     279 def fromInternal(self, ts: int) -> datetime.datetime:
>     280     if ts is not None:
>     281         # using int to avoid precision loss in float
> --> 282         return datetime.datetime.fromtimestamp(ts // 
> 1000000).replace(microsecond=ts % 1000000)
> ValueError: year 23455 is out of range
> {code}
> If we uncomment the timestampFormat option, the command succeeds.
> I believe there are two issues:
>  # that a string of digits with length > 4 is inferred to be a timestamp
>  # that setting timestampFormat to the default given in [the 
> documentation|https://spark.apache.org/docs/3.5.0/sql-data-sources-json.html] 
> fixes the problem, which suggests the documented default is not the actual 
> default.
> This might be related to SPARK-45424.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
