voonhous opened a new pull request, #18938:
URL: https://github.com/apache/hudi/pull/18938

   ### Describe the issue this Pull Request addresses
   
   Closes #18931.
   
   Builds on #18065, which added variant shredding on the AVRO write path. That 
PR left a fail-fast guard: when compaction or clustering read an 
already-shredded base file through the AVRO record path, records arrived 
shredded and the writer threw, because nothing reconstructed the unshredded 
variant on read. This PR adds that read-side reconstruction and removes the 
guard.
   
   ### Summary and Changelog
   
   Reading a shredded variant base file via the AVRO record path now rebuilds 
the unshredded `{metadata, value}` variant before records reach the 
merger/writer, so compaction and clustering over shredded base files work. The 
SPARK/InternalRow path is unchanged (Spark reconstructs variants natively).
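
   As a rough illustration of what "rebuilding the unshredded `{metadata, value}` variant" means, here is a minimal sketch for a single int64-shredded value. The class and method names are hypothetical and are not the code in this PR. The sketch assumes the Variant binary encoding of an int64 is a header byte (primitive type id 6 shifted left by 2) followed by 8 little-endian bytes, and it ignores objects and arrays, which the real implementation delegates to Spark's `ShreddingUtils.rebuild`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class VariantRebuildSketch {
  // Hypothetical shredded group: typedValue is non-null when the value was shredded as int64.
  static final class Shredded {
    final byte[] metadata;
    final byte[] value;
    final Long typedValue;

    Shredded(byte[] metadata, byte[] value, Long typedValue) {
      this.metadata = metadata;
      this.value = value;
      this.typedValue = typedValue;
    }
  }

  // Hypothetical unshredded form: only {metadata, value}.
  static final class Unshredded {
    final byte[] metadata;
    final byte[] value;

    Unshredded(byte[] metadata, byte[] value) {
      this.metadata = metadata;
      this.value = value;
    }
  }

  // Assumed Variant encoding of int64: header byte (6 << 2), then 8 little-endian bytes.
  static byte[] encodeInt64(long v) {
    return ByteBuffer.allocate(9)
        .order(ByteOrder.LITTLE_ENDIAN)
        .put((byte) (6 << 2))
        .putLong(v)
        .array();
  }

  static Unshredded rebuild(Shredded s) {
    if (s.typedValue != null) {
      // For a primitive, value and typed_value must not both be set.
      if (s.value != null) {
        throw new IllegalStateException("value and typed_value both set for a primitive");
      }
      // Re-encode the typed column back into the variant binary.
      return new Unshredded(s.metadata, encodeInt64(s.typedValue));
    }
    // Nothing was shredded for this record: the value binary passes through unchanged.
    return new Unshredded(s.metadata, s.value);
  }
}
```

   The point of the sketch is only that the merger/writer downstream sees the same two-field shape regardless of how the base file stored the column.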
   
   - Add `VariantShreddingProvider.rebuildVariantRecord` (inverse of 
`shredVariantRecord`). `Spark4VariantShreddingProvider` implements it using 
Spark's `ShreddingUtils.rebuild` over an Avro-backed `ShreddedRow`, mirroring 
the existing write-side `AvroShreddedResult`.
   - `HoodieAvroParquetReader` detects shredded variant columns, reads them 
using the file's shredded schema so that `typed_value` is materialized, and 
reconstructs each one to its unshredded form per record (new 
`VariantReconstruction`). The provider is resolved from 
`hoodie.parquet.variant.shredding.provider.class` or auto-detected on the 
classpath; reconstruction is gated on 
`hoodie.parquet.variant.allow.reading.shredded`.
   - Extract `stripVariantShredding` into a shared `VariantSchemaUtils` used by 
both reader and writer.
   - Remove the read-then-reshred guard (`assertInputNotAlreadyShredded`) from 
`HoodieAvroWriteSupport` and its unit test.
   - Extend the MOR compaction test in `TestVariantDataType` to write shredded 
data, compact, and read back, covering both AVRO reconstruction and the SPARK 
native path via `withRecordType`.
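
   The provider lookup described in the list above (explicit class from config, otherwise classpath auto-detection) could be sketched roughly as follows. This is an assumption-laden illustration, not the PR's code: `ProviderResolutionSketch`, `resolve`, and the candidate list are hypothetical, and the PR does not say whether a misconfigured class should fail or fall back, so this sketch simply returns empty.

```java
import java.util.List;
import java.util.Optional;

class ProviderResolutionSketch {
  // configuredClass stands in for the value of
  // hoodie.parquet.variant.shredding.provider.class; candidates is a
  // hypothetical list of known provider class names to probe on the classpath.
  static Optional<Object> resolve(String configuredClass, List<String> candidates) {
    if (configuredClass != null && !configuredClass.isEmpty()) {
      // An explicit setting wins; no auto-detection is attempted.
      return instantiate(configuredClass);
    }
    for (String candidate : candidates) {
      Optional<Object> provider = instantiate(candidate);
      if (provider.isPresent()) {
        return provider;
      }
    }
    return Optional.empty();
  }

  // Load a class by name and call its no-arg constructor; empty if unavailable.
  static Optional<Object> instantiate(String className) {
    try {
      return Optional.of(Class.forName(className).getDeclaredConstructor().newInstance());
    } catch (ReflectiveOperationException | LinkageError e) {
      return Optional.empty();
    }
  }
}
```

   Returning an empty result rather than throwing matches the stated behavior that reads proceed unchanged when no provider is available.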
   
   No code copied.
   
   ### Impact
   
   AVRO record-type reads of shredded variant base files now return correct 
unshredded variants instead of failing. No new configs: reuses 
`hoodie.parquet.variant.allow.reading.shredded` (default true) and 
`hoodie.parquet.variant.shredding.provider.class`. No change for non-Spark 
engines or the SPARK read path.
   
   ### Risk Level
   
   Medium. Touches the AVRO base-file read path. Mitigations: reconstruction 
only activates when the file actually has shredded variant columns and a 
provider is available, otherwise reads proceed unchanged; it is gated by 
`hoodie.parquet.variant.allow.reading.shredded`; the SPARK path is untouched. 
Covered by the extended MOR compaction test (write shredded, compact, read 
back) under both AVRO and SPARK record types.
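
   The activation conditions listed above can be summarized in a small sketch. The config key and its default of `true` come from this description; everything else (`ReconstructionGateSketch`, the map of group name to child field names, and detecting shredding by the presence of a `typed_value` child) is a simplifying assumption for illustration.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;

class ReconstructionGateSketch {
  static final String ALLOW_KEY = "hoodie.parquet.variant.allow.reading.shredded";

  // Assumption: a variant group counts as shredded when it carries a
  // typed_value child next to metadata. groupFields maps each group column
  // name to the names of its direct children in the file schema.
  static boolean hasShreddedVariant(Map<String, Set<String>> groupFields) {
    for (Set<String> children : groupFields.values()) {
      if (children.contains("metadata") && children.contains("typed_value")) {
        return true;
      }
    }
    return false;
  }

  // Reconstruction runs only if the flag allows it, a provider was found,
  // and the file actually contains a shredded variant column.
  static boolean shouldReconstruct(
      Properties props, Map<String, Set<String>> groupFields, boolean providerAvailable) {
    boolean allowed = Boolean.parseBoolean(props.getProperty(ALLOW_KEY, "true"));
    return allowed && providerAvailable && hasShreddedVariant(groupFields);
  }
}
```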
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

