Is anyone familiar with [Datetime patterns](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) and `TimestampType` parsing in PySpark? When reading CSV or JSON files, timestamp columns are parsed via the datasource option `timestampFormat`. [According to the documentation](https://spark.apache.org/docs/3.3.1/sql-data-sources-json.html#data-source-option:~:text=read/write-,timestampFormat,-yyyy%2DMM%2Ddd%27T%27HH), the default value is `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`.
However, I noticed some weird behavior:

```python
from pyspark.sql import types as T

json_lines = [
    "{'label': 'no tz'                      , 'value': '2023-12-24T20:00:00' }",
    "{'label': 'UTC'                        , 'value': '2023-12-24T20:00:00Z' }",
    "{'label': 'tz offset hour'             , 'value': '2023-12-24T20:00:00+01' }",
    "{'label': 'tz offset minute no colon'  , 'value': '2023-12-24T20:00:00+0100' }",
    "{'label': 'tz offset minute with colon', 'value': '2023-12-24T20:00:00+01:00' }",
    "{'label': 'tz offset second no colon'  , 'value': '2023-12-24T20:00:00+010000' }",
    "{'label': 'tz offset second with colon', 'value': '2023-12-24T20:00:00+01:00:00' }",
]

schema = T.StructType([
    T.StructField("label", T.StringType()),
    T.StructField("value", T.TimestampType()),
    T.StructField("t_corrupt_record", T.StringType()),
])

df = (spark.read
    .schema(schema)
    .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")  # <-- using the "default" from the docs
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "t_corrupt_record")
    .json(sc.parallelize(json_lines))
)
df.show(truncate=False)

+---------------------------+-------------------+----------------------------------------------------------------------------------------+
|label                      |value              |t_corrupt_record                                                                        |
+---------------------------+-------------------+----------------------------------------------------------------------------------------+
|no tz                      |2023-12-24 20:00:00|null                                                                                    |
|UTC                        |2023-12-24 20:00:00|null                                                                                    |
|tz offset hour             |null               |{'label': 'tz offset hour'             , 'value': '2023-12-24T20:00:00+01' }            |
|tz offset minute no colon  |null               |{'label': 'tz offset minute no colon'  , 'value': '2023-12-24T20:00:00+0100' }          |
|tz offset minute with colon|2023-12-24 19:00:00|null                                                                                    |
|tz offset second no colon  |null               |{'label': 'tz offset second no colon'  , 'value': '2023-12-24T20:00:00+010000' }        |
|tz offset second with colon|null               |{'label': 'tz offset second with colon', 'value': '2023-12-24T20:00:00+01:00:00' }      |
+---------------------------+-------------------+----------------------------------------------------------------------------------------+
```

However, when omitting `timestampFormat`, the values are parsed just fine:

```python
df = (spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "t_corrupt_record")
    .json(sc.parallelize(json_lines))
)
df.show(truncate=False)

+---------------------------+-------------------+----------------+
|label                      |value              |t_corrupt_record|
+---------------------------+-------------------+----------------+
|no tz                      |2023-12-24 20:00:00|null            |
|UTC                        |2023-12-24 20:00:00|null            |
|tz offset hour             |2023-12-24 19:00:00|null            |
|tz offset minute no colon  |2023-12-24 19:00:00|null            |
|tz offset minute with colon|2023-12-24 19:00:00|null            |
|tz offset second no colon  |2023-12-24 19:00:00|null            |
|tz offset second with colon|2023-12-24 19:00:00|null            |
+---------------------------+-------------------+----------------+
```

This does not seem plausible to me: setting the default value explicitly should lead to the same results as omitting it.

Thanks and regards

Martin
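Edit: to make my hypothesis concrete. As far as I understand (this is my assumption, not verified against Spark's source), Spark's datetime patterns follow Java's `DateTimeFormatter` semantics, where the pattern letters `XXX` match only `Z` or a colon-separated `+HH:MM`/`-HH:MM` offset, and the surrounding `[...]` merely makes the section optional. A quick shape check against my sample offsets:

```python
import re

# Offsets taken from the sample rows above; "" stands for the "no tz" row.
offsets = ["", "Z", "+01", "+0100", "+01:00", "+010000", "+01:00:00"]

# Shape the `XXX` pattern letters accept per my reading of the Java
# DateTimeFormatter docs: `Z` or a +HH:MM / -HH:MM offset.
OFFSET_XXX = re.compile(r"^(Z|[+-]\d{2}:\d{2})$")

# The `[XXX]` section is optional, so an absent offset is also accepted.
accepted = [o == "" or bool(OFFSET_XXX.match(o)) for o in offsets]
print(accepted)
# -> [True, True, False, False, True, False, False]
```

The `True` entries line up with exactly the three rows that parsed when `timestampFormat` was set explicitly, which would explain the first result, but not why the omitted-format case is more lenient.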