Naresh P R created SPARK-54697:
----------------------------------
Summary: Read/Write proleptic dates older than 1582-10-04 via
Hive/Spark for interoperability
Key: SPARK-54697
URL: https://issues.apache.org/jira/browse/SPARK-54697
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.5.7
Reporter: Naresh P R
e.g.,
{code:sql}
CREATE EXTERNAL TABLE test.test_calendar (writerType string, inputDate date) STORED AS PARQUET;

INSERT INTO test.test_calendar VALUES
  ('spark-corrected', CAST('0685-04-12' AS DATE)),
  ('spark-corrected', CAST('1582-10-04' AS DATE));
{code}
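For reference, a minimal sketch of the same write driven from the Spark side with the rebase mode made explicit (assuming a Hive-enabled SparkSession bound to {{spark}}; the table name matches the example above):
{code:scala}
// Sketch: write the same rows from Spark with the rebase mode stated explicitly.
// spark.sql.parquet.datetimeRebaseModeInWrite accepts EXCEPTION / LEGACY / CORRECTED.
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")
spark.sql(
  """INSERT INTO test.test_calendar VALUES
    |  ('spark-corrected', CAST('0685-04-12' AS DATE)),
    |  ('spark-corrected', CAST('1582-10-04' AS DATE))""".stripMargin)
{code}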
Hive writes a flag into the Parquet file metadata ({*}writer.date.proleptic{*}) that tells Hive Parquet readers whether the dates are in the hybrid or the proleptic Gregorian calendar. *hive.parquet.date.proleptic.gregorian* is used on the writer path and adds *writer.date.proleptic* = true/false to the Parquet file metadata.
Setting *hive.parquet.date.proleptic.gregorian=true/false* while reading has no effect; the Hive Parquet reader relies on the *writer.date.proleptic* metadata stored in each individual file.
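To check what a given file actually carries, the footer key/value metadata can be dumped with the parquet-hadoop API. The snippet below is a hedged sketch, not an existing Hive or Spark utility; the file path is only an example, and the key is the *writer.date.proleptic* entry described above:
{code:scala}
// Sketch: print the writer.date.proleptic entry from a Parquet footer, if present.
// Assumes parquet-hadoop and hadoop-common on the classpath; the path is illustrative.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

object InspectParquetMeta {
  def main(args: Array[String]): Unit = {
    val file = new Path("hdfs:///warehouse/test.db/test_calendar/000000_0")
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))
    try {
      // Key/value metadata stamped by the writer (java.util.Map[String, String]).
      val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
      println(s"writer.date.proleptic = ${kv.getOrDefault("writer.date.proleptic", "<absent>")}")
    } finally {
      reader.close()
    }
  }
}
{code}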
It would be better if Spark could honor Hive's *writer.date.proleptic* metadata key (i.e., the Spark writer should add writer.date.proleptic=true/false to the Parquet file metadata, and the Spark reader should consult the same key instead of relying on spark.sql.parquet.datetimeRebaseModeInRead / spark.sql.parquet.datetimeRebaseModeInWrite set to LEGACY/CORRECTED), or the two engines should establish some common ground so that every reader knows whether the dates were written against the Julian (hybrid) or the proleptic Gregorian calendar. See the sketch below for one possible shape of the reader-side change.
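As an illustration only, the hypothetical helper below maps the file-level flag onto the rebase modes Spark already understands; nothing like it exists in Spark today, it merely expresses the intent of the request:
{code:scala}
// Hypothetical sketch of the proposed mapping from writer.date.proleptic to a rebase mode.
object DateRebaseHint {
  import java.util.{Map => JMap}

  def rebaseModeFor(keyValueMeta: JMap[String, String]): String =
    keyValueMeta.get("writer.date.proleptic") match {
      case "true"  => "CORRECTED" // proleptic writer: dates can be read as-is
      case "false" => "LEGACY"    // hybrid writer: rebase from the Julian calendar
      case _       => "EXCEPTION" // flag absent or unrecognized: keep today's behavior
    }
}
{code}
Until something along those lines is available, the per-query workaround is to set spark.sql.parquet.datetimeRebaseModeInRead explicitly when the writer's calendar is known out of band.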
Without this common ground, Hive-written files will show wrong results in Spark and Spark-written files will show wrong results in Hive.