Amruth Ashok created SPARK-54908:
------------------------------------
Summary: dateFormat option is ignored during schema inference for
JSON files
Key: SPARK-54908
URL: https://issues.apache.org/jira/browse/SPARK-54908
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.0, 3.5.2, 3.5.0, 3.4.1, 3.3.2
Environment: Tested in Databricks on DBR 16.4 LTS. Other DBR versions
behave similarly.
Appears to be a core Spark issue in
[JsonInferSchema.scala|https://github.com/apache/spark/blob/8fe006b20877671c75e4650a27d268b496294299/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala#L42]
Reporter: Amruth Ashok
When using COPY INTO ... FILEFORMAT = JSON with schema inference, the
dateFormat option is ignored during inference, while timestampFormat is honored.
As a result, date-only strings are inferred as StringType instead of DateType.
Example:
test.json
{
"created_at": "02JUL14",
"updated_at": "02JUL14 12:17:43.39 UTC"
}
Code (COPY INTO with FILEFORMAT = JSON):
COPY INTO my_table
FROM '/path/to/test.json'
FILEFORMAT = JSON
OPTIONS (
inferSchema = true,
inferTimestamp = true,
timestampFormat = "ddMMMyy HH:mm:ss.SSS 'UTC'",
dateFormat = "ddMMMyy"
)
Observed behavior:
* created_at: string (should be date)
* updated_at: timestamp (correct)
Expected behavior:
* created_at: date (correct)
* updated_at: timestamp (correct)
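
The expected outcome can be sketched outside Spark. The following is only an illustration under stated assumptions, not Spark's inference code: the strptime patterns are hand-translated approximations of the dateFormat and timestampFormat values above, and infer_type is a hypothetical helper that tries the date pattern first, then the timestamp pattern, and falls back to string.

```python
from datetime import datetime

# Assumed strptime equivalents of the options in the report (hand-translated,
# not produced by Spark):
#   dateFormat      "ddMMMyy"                    -> "%d%b%y"
#   timestampFormat "ddMMMyy HH:mm:ss.SSS 'UTC'" -> "%d%b%y %H:%M:%S.%f UTC"
# %b relies on English month abbreviations (the default C locale).
DATE_FORMAT = "%d%b%y"
TIMESTAMP_FORMAT = "%d%b%y %H:%M:%S.%f UTC"


def matches(value: str, fmt: str) -> bool:
    """Return True if the whole string parses under the given pattern."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_type(value: str) -> str:
    """Hypothetical inference order: date, then timestamp, then string."""
    if matches(value, DATE_FORMAT):
        return "date"
    if matches(value, TIMESTAMP_FORMAT):
        return "timestamp"
    return "string"


# The record from test.json in the report.
record = {"created_at": "02JUL14", "updated_at": "02JUL14 12:17:43.39 UTC"}
inferred = {field: infer_type(value) for field, value in record.items()}
print(inferred)  # {'created_at': 'date', 'updated_at': 'timestamp'}
```

Both sample values parse under their respective patterns here, which suggests the patterns themselves are not the problem and that the date pattern is simply not consulted during inference.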
--
This message was sent by Atlassian Jira
(v8.20.10#820010)