MaxGekk opened a new pull request #31529:
URL: https://github.com/apache/spark/pull/31529


   ### What changes were proposed in this pull request?
   In the PR, I propose a new option, `datetimeRebaseMode`, for the Avro 
datasource. The option controls how ancient date and timestamp column values 
are loaded from Avro files. 
   
   The option supports the same values as the SQL config 
`spark.sql.legacy.avro.datetimeRebaseModeInRead`, namely:
   - `"LEGACY"`: Spark rebases dates/timestamps from the legacy hybrid 
calendar (Julian + Gregorian) to the Proleptic Gregorian calendar.
   - `"CORRECTED"`: dates/timestamps are read as is from Avro files.
   - `"EXCEPTION"`: Spark fails the read if it sees ancient 
dates/timestamps that are ambiguous between the two calendars.
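   
   To illustrate why the mode matters, here is a minimal sketch that uses 
only JDK classes; it is not Spark's rebasing code. The names `RebaseSketch`, 
`prolepticDays`, and `hybridDays` are hypothetical. It computes the 
days-since-epoch value of the same year-month-day label in each calendar:
   ```scala
   import java.time.LocalDate
   import java.util.{GregorianCalendar, TimeZone}

   // Hypothetical helper for illustration only, not part of Spark.
   object RebaseSketch {
     private val MillisPerDay = 86400000L

     // Days since 1970-01-01 with y-m-d read in the Proleptic Gregorian
     // calendar (java.time applies Gregorian rules to all dates).
     def prolepticDays(y: Int, m: Int, d: Int): Long =
       LocalDate.of(y, m, d).toEpochDay

     // Days since 1970-01-01 with y-m-d read in the hybrid calendar of
     // java.util.GregorianCalendar (Julian rules before 1582-10-15).
     def hybridDays(y: Int, m: Int, d: Int): Long = {
       val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
       cal.clear()          // reset all fields, so the time is 00:00:00.000
       cal.set(y, m - 1, d) // Calendar months are 0-based
       Math.floorDiv(cal.getTimeInMillis, MillisPerDay)
     }
   }
   ```
   For a modern date such as 2020-01-01 the two functions agree, while for 
1000-01-01 they return different day counts. Values of the latter kind are 
what the rebase mode affects.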
   
   ### Why are the changes needed?
   1. The new option allows loading Avro files from at least two sources in 
different rebasing modes in the same query. For instance:
   ```scala
   val df1 = spark.read.option("datetimeRebaseMode", "legacy").format("avro").load(folder1)
   val df2 = spark.read.option("datetimeRebaseMode", "corrected").format("avro").load(folder2)
   df1.join(df2, ...)
   ```
   Before the changes, this was impossible because the SQL config 
`spark.sql.legacy.avro.datetimeRebaseModeInRead` applies to both reads.
   
   2. Mixing the Dataset/DataFrame and RDD APIs should become possible. Since 
SQL configs are not propagated through RDDs, the following code, which relies 
on the SQL config, fails on ancient timestamps:
   ```scala
   spark.conf.set("spark.sql.legacy.avro.datetimeRebaseModeInRead", "legacy")
   spark.read.format("avro").load(folder).distinct.rdd.collect()
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   By running the modified test suites:
   ```
   $ build/sbt "test:testOnly *AvroV1Suite"
   $ build/sbt "test:testOnly *AvroV2Suite"
   ```




