linliu-code opened a new pull request, #18794:
URL: https://github.com/apache/hudi/pull/18794

   ### Change Logs
   
   Fixes apache/hudi#18752: the Spark write path used to silently ignore both 
`spark.sql.parquet.outputTimestampType` (the standard Spark setting) and 
`hoodie.parquet.outputtimestamptype` (the documented Hudi override), always 
emitting `TIMESTAMP(MICROS)` for `TimestampType` columns. Spark's own writer 
under the same SparkSession honors both. The bug spans 0.15.0 → 1.1.1 → master 
HEAD.
   
   This silently breaks interop with downstream readers that expect 
`TIMESTAMP(MILLIS)` (smaller files) or `INT96` (legacy Hive/Impala): there is 
no error and no warning, and the data simply lands in the wrong logical type.
   
   ### Root causes (two layers)
   
   **1. `HoodieRowParquetWriteSupport`** (the Row-based bulk_insert writer)
   - Constructor unconditionally set the hadoopConf to 
`config.getStringOrDefault(...)` for the Hudi key, which always returned its 
default (`TIMESTAMP_MICROS`) and silently overrode any value the user had 
configured on the SparkSession.
   - The `TimestampType` writer derived the encoding from the avro writer 
schema's precision (always MICROS for Spark TimestampType) and lacked an INT96 
path entirely.
   - The custom `MessageType` converter (called by parquet's `init()`) 
hard-coded `TIMESTAMP(MICROS)` for Spark TimestampType regardless of the chosen 
output type.
   
   **2. `HoodieSparkSchemaConverters`** (the Spark→Avro conversion used by the 
upsert path)
   - `TimestampType` → `HoodieSchema.createTimestampMicros()` was hard-coded, 
so the avro→parquet pipeline (`HoodieAvroWriteSupport`) could only emit MICROS, 
regardless of the user's setting.
   
   ### Fix
   
   Introduces `HoodieRowParquetWriteSupport.resolveOutputTimestampType` with 
documented priority:
   1. `hoodie.parquet.outputtimestamptype` when explicitly set (compared 
against default value to distinguish from default-population).
   2. `spark.sql.parquet.outputTimestampType` from the SparkSession's `SQLConf` 
when user-set (`SQLConf.contains` distinguishes user-set from Spark's own 
default).
   3. Manually-propagated `spark.sql.parquet.outputTimestampType` in the Hadoop 
conf.
   4. The Hudi default (`TIMESTAMP_MICROS`).
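
The priority chain above can be sketched in plain Java. This is a minimal illustration, not the code from this PR: the class `OutputTimestampTypeSketch`, the method `resolve`, and the use of three `Map`s in place of the Hudi write config, the SparkSession's `SQLConf`, and the Hadoop conf are all assumptions made for the example. Only the two config keys and the `TIMESTAMP_MICROS` default come from the description.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for the resolution order; not Hudi's actual implementation. */
final class OutputTimestampTypeSketch {
    static final String HUDI_KEY = "hoodie.parquet.outputtimestamptype";
    static final String SPARK_KEY = "spark.sql.parquet.outputTimestampType";
    static final String HUDI_DEFAULT = "TIMESTAMP_MICROS";

    /**
     * @param hudiConf       Hudi write config (may hold a default-populated value)
     * @param userSetSqlConf only the SQLConf entries the user set explicitly
     * @param hadoopConf     entries of the Hadoop configuration
     */
    static String resolve(Map<String, String> hudiConf,
                          Map<String, String> userSetSqlConf,
                          Map<String, String> hadoopConf) {
        // 1. The Hudi key wins when it differs from the default. A value equal
        //    to the default is treated as "not explicitly set" in this sketch.
        String hudi = normalize(hudiConf.get(HUDI_KEY));
        if (hudi != null && !hudi.equals(HUDI_DEFAULT)) {
            return hudi;
        }
        // 2. The Spark key, when the user set it on the session.
        String spark = normalize(userSetSqlConf.get(SPARK_KEY));
        if (spark != null) {
            return spark;
        }
        // 3. The Spark key, when it was propagated manually into the Hadoop conf.
        String hadoop = normalize(hadoopConf.get(SPARK_KEY));
        if (hadoop != null) {
            return hadoop;
        }
        // 4. Fall back to the Hudi default.
        return HUDI_DEFAULT;
    }

    private static String normalize(String value) {
        return value == null ? null : value.trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(resolve(
            Map.of(), Map.of(SPARK_KEY, "TIMESTAMP_MILLIS"), Map.of()));
    }
}
```

One consequence of comparing against the default, at least in this sketch: a Hudi value equal to `TIMESTAMP_MICROS` cannot be told apart from default population, so a user-set Spark value takes precedence over it.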
   
   In `HoodieRowParquetWriteSupport`:
   - `makeWriter(TimestampType)` dispatches on the resolved output type: emit 
INT96 binary (Julian-day/nanos-of-day per the [parquet-format 
spec](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp))
 for `INT96`, INT64+MILLIS for `TIMESTAMP_MILLIS`, INT64+MICROS otherwise.
   - `convertField(TimestampType)` dispatches the same way to produce a 
matching parquet schema (so writer and schema agree).
   - Adds `microsToInt96Binary` helper implementing the standard encoding.
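
For readers unfamiliar with the INT96 layout, the following is a minimal sketch of the conventional encoding: 8 bytes of nanoseconds-of-day followed by 4 bytes of Julian day number, both little-endian. It is not the `microsToInt96Binary` helper from this PR. The class and method names are hypothetical, and the sketch deliberately ignores time-zone handling and calendar rebasing, which a real writer may need.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical illustration of the INT96 and MILLIS encodings; not Hudi's code. */
final class Int96TimestampSketch {
    static final long MICROS_PER_DAY = 86_400_000_000L;
    static final long JULIAN_DAY_OF_EPOCH = 2_440_588L; // Julian day number of 1970-01-01

    /** Encodes microseconds since the Unix epoch as a 12-byte INT96 value. */
    static byte[] microsToInt96(long microsSinceEpoch) {
        // floorDiv/floorMod keep pre-1970 timestamps on the correct day.
        long daysSinceEpoch = Math.floorDiv(microsSinceEpoch, MICROS_PER_DAY);
        long microsOfDay = Math.floorMod(microsSinceEpoch, MICROS_PER_DAY);

        ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(microsOfDay * 1_000L);                      // nanoseconds within the day
        buf.putInt((int) (daysSinceEpoch + JULIAN_DAY_OF_EPOCH)); // Julian day number
        return buf.array();
    }

    /** INT64 + MILLIS path: truncate microseconds toward negative infinity. */
    static long microsToMillis(long microsSinceEpoch) {
        return Math.floorDiv(microsSinceEpoch, 1_000L);
    }

    public static void main(String[] args) {
        ByteBuffer decoded = ByteBuffer.wrap(microsToInt96(0L)).order(ByteOrder.LITTLE_ENDIAN);
        System.out.println("nanosOfDay=" + decoded.getLong() + " julianDay=" + decoded.getInt());
    }
}
```

The floor-based arithmetic matters for timestamps before 1970: plain `/` and `%` would round toward zero and place a value such as `-1` microsecond on the wrong day.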
   
   In `HoodieSparkSchemaConverters`:
   - `TimestampType` consults `SQLConf` and produces 
`HoodieSchema.createTimestampMillis()` when the user requested 
`TIMESTAMP_MILLIS`, else `createTimestampMicros()` as before.
   
   ### Known limitation (documented in test class)
   
   INT96 is bulk_insert-only. The upsert path goes through Avro, and Avro 
does not model INT96, so INT96 requests fall through to MICROS at the avro 
layer. The fix delivers the full matrix for the bulk_insert path and 
MILLIS/MICROS for the upsert path. That covers the realistic use cases: 
downstream readers expecting MILLIS for smaller files, or INT96 for legacy 
Hive/Impala interop, where users typically already use bulk_insert.
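
The supported matrix described above can be written down as a small lookup. This is only a restatement of the behavior in this section, assuming the operation is identified by the strings `bulk_insert` and `upsert`; the class `EffectiveTimestampType` and its method are hypothetical names for illustration.

```java
/** Hypothetical summary of the write-path matrix; not part of Hudi. */
final class EffectiveTimestampType {
    /**
     * @param operation "bulk_insert" (Row-based writer) or "upsert" (Avro-based path)
     * @param requested resolved output type: TIMESTAMP_MICROS, TIMESTAMP_MILLIS or INT96
     * @return the type that actually ends up in the parquet file
     */
    static String effectiveType(String operation, String requested) {
        boolean rowWriter = "bulk_insert".equals(operation);
        switch (requested) {
            case "TIMESTAMP_MILLIS":
                return "TIMESTAMP_MILLIS";                      // both paths honor MILLIS
            case "INT96":
                // Avro has no INT96, so the upsert path falls through to MICROS.
                return rowWriter ? "INT96" : "TIMESTAMP_MICROS";
            default:
                return "TIMESTAMP_MICROS";
        }
    }

    public static void main(String[] args) {
        System.out.println(effectiveType("upsert", "INT96"));
    }
}
```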
   
   ### Impact
   
   No public API change. Internal write path only.
   
   ### Risk level
   
   Medium. The Hudi-config-vs-Spark-config priority change is intentional: 
jobs that set `spark.sql.parquet.outputTimestampType` but still got MICROS 
because of the silent default override will now see Hudi honor that explicit 
Spark setting. Users who explicitly set 
`hoodie.parquet.outputtimestamptype` continue to win (priority 1).
   
   The avro-path change (`HoodieSparkSchemaConverters`) means downstream 
avro-aware code now sees `timestamp-millis` instead of `timestamp-micros` when 
the user requested MILLIS. This is the same behavior as Spark's native parquet 
writer.
   
   ### Documentation Update
   
   No documentation changes required. The behavior now matches what 
`hoodie.parquet.outputtimestamptype` was already documented to do.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable (TestOutputTimestampType 
functional tests cover bulk_insert × {MICROS,MILLIS,INT96}, upsert × 
{MICROS,MILLIS}, and the Hudi-vs-Spark priority chain)
   - [ ] CI passed

