[
https://issues.apache.org/jira/browse/SPARK-54697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Naresh P R updated SPARK-54697:
-------------------------------
Description:
For example:
{code:java}
create external table test_calendar (writerType string, inputDate date)
stored as parquet;
INSERT INTO test.test_calendar values
  ('spark-corrected', CAST('0685-04-12' AS DATE)),
  ('spark-corrected', CAST('1582-10-04' AS DATE));
{code}
Hive writes a flag in the Parquet file metadata ({*}writer.date.proleptic{*})
that tells Hive Parquet readers whether the dates in a file use the hybrid or
the proleptic Gregorian calendar. The writer sets this flag from
*hive.parquet.date.proleptic.gregorian*, adding *writer.date.proleptic* =
true/false to the Parquet file metadata.
Setting *hive.parquet.date.proleptic.gregorian=true/false* while reading the
files has no effect: the Hive Parquet reader relies on the
*writer.date.proleptic* metadata stored in each individual file.
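The per-file read decision described above can be sketched as follows. This is only an illustration of the idea, not Hive's or Spark's actual reader code: the function name, the plain dictionary standing in for the Parquet footer key/value metadata, and the fallback used when the flag is absent are all assumptions made for the example.

```python
from typing import Mapping

# Metadata key named in this issue.
FLAG_KEY = "writer.date.proleptic"


def file_dates_are_proleptic(kv_metadata: Mapping[str, str],
                             default_when_missing: bool = False) -> bool:
    """Decide the calendar of one file from its own metadata.

    No session-level setting is consulted; only the flag stored in the file.
    """
    raw = kv_metadata.get(FLAG_KEY)
    if raw is None:
        # Assumption: what to do for files without the flag is left to the caller.
        return default_when_missing
    return raw.strip().lower() == "true"


if __name__ == "__main__":
    print(file_dates_are_proleptic({FLAG_KEY: "true"}))
    print(file_dates_are_proleptic({FLAG_KEY: "false"}))
    print(file_dates_are_proleptic({}))
```

Because the decision is made per file, a table directory that mixes files from different writers can still be read consistently, provided every writer records the flag.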
It would be better if Spark complied with Hive's *writer.date.proleptic*
metadata flag, i.e., the Spark writer should add
writer.date.proleptic=true/false to the Parquet file metadata, and the Spark
reader should honor that flag instead of relying on
spark.sql.parquet.datetimeRebaseModeInRead /
spark.sql.parquet.datetimeRebaseModeInWrite being set to LEGACY/CORRECTED.
Alternatively, agree on some other common ground so that all readers know
whether the dates are Hybrid or Gregorian.
Without this common ground, Hive-written files show wrong results in Spark,
and Spark-written files show wrong results in Hive.
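To make the mismatch concrete, here is a small sketch of why the same date label maps to different stored values under the two calendars. Parquet stores a DATE as a count of days since 1970-01-01, so a reader that assumes the wrong calendar turns that count back into a shifted date. The function names are hypothetical, and the Julian conversion uses the standard Julian Day Number formula rather than any Hive or Spark code.

```python
from datetime import date

# Python's datetime.date is proleptic Gregorian.
_EPOCH_ORDINAL = date(1970, 1, 1).toordinal()
# Julian Day Number of 1970-01-01 (Gregorian).
_EPOCH_JDN = 2440588


def epoch_days_proleptic(year: int, month: int, day: int) -> int:
    """Days since 1970-01-01, reading the label in the proleptic Gregorian calendar."""
    return date(year, month, day).toordinal() - _EPOCH_ORDINAL


def epoch_days_julian(year: int, month: int, day: int) -> int:
    """Days since 1970-01-01, reading the label in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    return jdn - _EPOCH_JDN


def epoch_days_hybrid(year: int, month: int, day: int) -> int:
    """Hybrid calendar: Julian before 1582-10-15, Gregorian from then on."""
    if (year, month, day) >= (1582, 10, 15):
        return epoch_days_proleptic(year, month, day)
    return epoch_days_julian(year, month, day)


if __name__ == "__main__":
    for ymd in [(685, 4, 12), (1582, 10, 4)]:
        shift = epoch_days_hybrid(*ymd) - epoch_days_proleptic(*ymd)
        print(ymd, "hybrid minus proleptic, in days:", shift)
```

For the two dates in the example table, the encodings differ by 3 and 10 days respectively, which is the size of the error a reader sees when it assumes a different calendar than the writer used.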
was:
For example:
{code:java}
create external table test_calendar (writerType string, inputDate date)
stored as parquet;
INSERT INTO test.test_calendar values
  ('spark-corrected', CAST('0685-04-12' AS DATE)),
  ('spark-corrected', CAST('1582-10-04' AS DATE));
{code}
Hive writes a flag in the Parquet file metadata ({*}writer.date.proleptic{*})
that tells Hive Parquet readers whether the dates in a file use the hybrid or
the proleptic Gregorian calendar. The writer sets this flag from
*hive.parquet.date.proleptic.gregorian*, adding *writer.date.proleptic* =
true/false to the Parquet file metadata.
Setting *hive.parquet.date.proleptic.gregorian=true/false* while reading the
files has no effect: the Hive Parquet reader relies on the
*writer.date.proleptic* metadata stored in each individual file.
It would be better if Spark complied with Hive's *writer.date.proleptic*
metadata flag, i.e., the Spark writer should add
writer.date.proleptic=true/false to the Parquet file metadata, and the Spark
reader should honor that flag instead of relying on
spark.sql.parquet.datetimeRebaseModeInRead /
spark.sql.parquet.datetimeRebaseModeInWrite being set to LEGACY/CORRECTED.
Alternatively, agree on some other common ground so that all readers know
whether the dates are Julian or Gregorian.
Without this common ground, Hive-written files show wrong results in Spark,
and Spark-written files show wrong results in Hive.
> Read/Write proleptic dates older than 1582-10-04 via Hive/Spark for
> interoperability
> ------------------------------------------------------------------------------------
>
> Key: SPARK-54697
> URL: https://issues.apache.org/jira/browse/SPARK-54697
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.7
> Reporter: Naresh P R
> Priority: Major
>
> For example:
> {code:java}
> create external table test_calendar (writerType string, inputDate date)
> stored as parquet;
> INSERT INTO test.test_calendar values
>   ('spark-corrected', CAST('0685-04-12' AS DATE)),
>   ('spark-corrected', CAST('1582-10-04' AS DATE));
> {code}
> Hive writes a flag in the Parquet file metadata ({*}writer.date.proleptic{*})
> that tells Hive Parquet readers whether the dates in a file use the hybrid or
> the proleptic Gregorian calendar. The writer sets this flag from
> *hive.parquet.date.proleptic.gregorian*, adding *writer.date.proleptic* =
> true/false to the Parquet file metadata.
>
> Setting *hive.parquet.date.proleptic.gregorian=true/false* while reading the
> files has no effect: the Hive Parquet reader relies on the
> *writer.date.proleptic* metadata stored in each individual file.
>
> It would be better if Spark complied with Hive's *writer.date.proleptic*
> metadata flag, i.e., the Spark writer should add
> writer.date.proleptic=true/false to the Parquet file metadata, and the Spark
> reader should honor that flag instead of relying on
> spark.sql.parquet.datetimeRebaseModeInRead /
> spark.sql.parquet.datetimeRebaseModeInWrite being set to LEGACY/CORRECTED.
> Alternatively, agree on some other common ground so that all readers know
> whether the dates are Hybrid or Gregorian.
>
> Without this common ground, Hive-written files show wrong results in Spark,
> and Spark-written files show wrong results in Hive.