Is anyone familiar with [Datetime patterns](
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) and
`TimestampType` parsing in PySpark?
When reading CSV or JSON files, timestamp columns are parsed according to the
datasource option `timestampFormat`.
[According to the documentation](
https://spark.apache.org/docs/3.3.1/sql-data-sources-json.html#data-source-option:~:text=read/write-,timestampFormat,-%2DMM%2Ddd%27T%27HH)
the default value is `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`.
However, I noticed some weird behavior:
```python
from pyspark.sql import types as T

json_lines = [
    "{'label': 'no tz'                      , 'value': '2023-12-24T20:00:00'         }",
    "{'label': 'UTC'                        , 'value': '2023-12-24T20:00:00Z'        }",
    "{'label': 'tz offset hour'             , 'value': '2023-12-24T20:00:00+01'      }",
    "{'label': 'tz offset minute no colon'  , 'value': '2023-12-24T20:00:00+0100'    }",
    "{'label': 'tz offset minute with colon', 'value': '2023-12-24T20:00:00+01:00'   }",
    "{'label': 'tz offset second no colon'  , 'value': '2023-12-24T20:00:00+01'      }",
    "{'label': 'tz offset second with colon', 'value': '2023-12-24T20:00:00+01:00:00'}",
]
schema = T.StructType([
    T.StructField("label", T.StringType()),
    T.StructField("value", T.TimestampType()),
    T.StructField("t_corrupt_record", T.StringType()),
])
df = (spark.read
    .schema(schema)
    .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")  # <-- using the "default" from the docs
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "t_corrupt_record")
    .json(sc.parallelize(json_lines))
)
df.show(truncate=False)
+---------------------------+-------------------+-------------------------------------------------------------------------------------+
|label                      |value              |t_corrupt_record                                                                     |
+---------------------------+-------------------+-------------------------------------------------------------------------------------+
|no tz                      |2023-12-24 20:00:00|null                                                                                 |
|UTC                        |2023-12-24 20:00:00|null                                                                                 |
|tz offset hour             |null               |{'label': 'tz offset hour'             , 'value': '2023-12-24T20:00:00+01'      }    |
|tz offset minute no colon  |null               |{'label': 'tz offset minute no colon'  , 'value': '2023-12-24T20:00:00+0100'    }    |
|tz offset minute with colon|2023-12-24 19:00:00|null                                                                                 |
|tz offset second no colon  |null               |{'label': 'tz offset second no colon'  , 'value': '2023-12-24T20:00:00+01'      }    |
|tz offset second with colon|null               |{'label': 'tz offset second with colon', 'value': '2023-12-24T20:00:00+01:00:00'}    |
+---------------------------+-------------------+-------------------------------------------------------------------------------------+
```
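For what it's worth, the pattern of failures lines up with my reading of the `XXX` pattern letter in `java.time.format.DateTimeFormatter` (which Spark uses under the hood): three `X`s accept only `Z` or an extended `+HH:mm`-style offset, not a bare hour, a basic `+HHmm` form, or seconds. A rough sketch of that rule in plain Python (the regex is my assumption from the Java docs, not taken from Spark's source):

```python
import re

# My reading of the Java DateTimeFormatter 'XXX' pattern letter:
# only 'Z' or an extended +HH:mm offset is accepted.
# (Assumption based on the java.time docs, not Spark source.)
XXX = re.compile(r"^(?:Z|[+-]\d{2}:\d{2})$")

for offset in ["", "Z", "+01", "+0100", "+01:00", "+01:00:00"]:
    # the surrounding [XXX] makes an absent offset optional
    ok = offset == "" or bool(XXX.match(offset))
    print(f"{offset!r:13} accepted: {ok}")
```

Only the rows where this prints `True` parsed successfully above, which is what makes me suspect the explicit format string behaves strictly while the built-in default does not.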
However, when omitting `timestampFormat`, the values are parsed just fine:
```python
df = (spark.read
.schema(schema)
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "t_corrupt_record")
.json(sc.parallelize(json_lines))
)
df.show(truncate=False)
+---------------------------+-------------------+----------------+
|label                      |value              |t_corrupt_record|
+---------------------------+-------------------+----------------+
|no tz                      |2023-12-24 20:00:00|null            |
|UTC                        |2023-12-24 20:00:00|null            |
|tz offset hour             |2023-12-24 19:00:00|null            |
|tz offset minute no colon  |2023-12-24 19:00:00|null            |
|tz offset minute with colon|2023-12-24 19:00:00|null            |
|tz offset second no colon  |2023-12-24 19:00:00|null            |
|tz offset second with colon|2023-12-24 19:00:00|null            |
+---------------------------+-------------------+----------------+
```
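As a sanity check on the values that do parse: converting one of the offset timestamps to UTC in plain Python gives the same `19:00:00` shown above (assuming `spark.sql.session.timeZone` is UTC):

```python
from datetime import datetime, timezone

# '2023-12-24T20:00:00+01:00' is 19:00 in UTC, which matches the
# displayed value (assuming the Spark session timezone is UTC).
ts = datetime.fromisoformat("2023-12-24T20:00:00+01:00")
print(ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"))
# 2023-12-24 19:00:00
```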
This makes no sense to me: using the default value explicitly should lead to
the same results as omitting it.
Thanks and regards
Martin