voonhous opened a new pull request, #18961:
URL: https://github.com/apache/hudi/pull/18961

   ### Describe the issue this Pull Request addresses
   
   Closes #18937
   
   Hudi can write and read shredded variants, but `typed_value` only ever comes 
from an explicit table schema or the test-only force-shredding DDL, so 
production tables never shred. Spark 4.1 infers a per-file shredding schema 
from the data 
([SPARK-53659](https://issues.apache.org/jira/browse/SPARK-53659), on by 
default), but that inference lives in Spark's own writer stack, which Hudi bypasses.
   
   Stacked on #18938 (read-side reconstruction), which stacks on #18065. Do not 
merge until both land.
   
   ### Summary and Changelog
   
   When `hoodie.parquet.variant.shredding.schema.inference.enabled` is set 
(default `false`), Hudi infers a shredding schema per base file from a sample 
of the records written to it, for both record types (SPARK, AVRO) and the 
bulk-insert row writer. Inference requires Spark 4.1+ on the writer classpath; 
Spark 4.0, Flink and Java writers silently keep writing unshredded files.
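
A minimal sketch of the gating described above, assuming inference only runs when the config flag is set and an inferrer class is present on the classpath. The config key comes from this PR; `ShreddingInferenceGate`, `isOnClasspath` and `shouldInfer` are hypothetical names for illustration and are not Hudi APIs. The inferrer class name is passed in as a parameter rather than hard-coded.

```java
import java.util.Properties;

/** Hypothetical helper for illustration; not part of Hudi. */
final class ShreddingInferenceGate {
    // Config key as named in this PR; it defaults to false.
    static final String INFERENCE_ENABLED_KEY =
        "hoodie.parquet.variant.shredding.schema.inference.enabled";

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ShreddingInferenceGate.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Inference is active only if the flag is set AND the inferrer class is available. */
    static boolean shouldInfer(Properties props, String inferrerClassName) {
        boolean enabled =
            Boolean.parseBoolean(props.getProperty(INFERENCE_ENABLED_KEY, "false"));
        return enabled && isOnClasspath(inferrerClassName);
    }
}
```

With the flag set but no inferrer class available, `shouldInfer` returns `false`, which corresponds to the silent unshredded fallback rather than an error.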
   
   - New config `hoodie.parquet.variant.shredding.schema.inference.enabled` in 
`HoodieStorageConfig`.
   - New `VariantShreddingSchemaInferrer` SPI in hudi-common, loaded by 
classpath detection (`VariantShreddingRuntime`, which also consolidates the 
duplicated provider-candidate arrays).
   - `Spark41VariantShreddingSchemaInferrer` (hudi-spark4.1.x) delegates to 
Spark's `InferVariantShreddingSchema`, so Hudi inherits Spark's heuristics 
verbatim (no code copied). One call per file covers all variant columns 
(preserves the global width budget); object keys that are not valid Avro names 
are dropped and legally fall back to the residual `value` column.
   - `VariantShreddingInferenceFileWriter` (+ a row-writer sibling): buffers up 
to 4096 records / 64MB (mirrors Spark's 
`ParquetOutputWriterWithVariantShredding`), infers once, creates the real 
writer with the inferred `typed_value` spliced in, then replays in order. 
Inference failures fall back to unshredded (a throwing inference must not fail 
compaction); writer-creation or replay failures latch and rethrow through 
`close()` so buffered records cannot be dropped silently.
   - Wiring: the AVRO factory splices the schema argument; the SPARK and 
row-writer factories splice a copied config, since 
`HoodieRowParquetWriteSupport` resolves its schema from `hoodie.write.schema` / 
`hoodie.avro.schema` rather than the factory argument.
   - Fixes latent issues this feature trips: `Variant.getPlainTypedValueSchema` 
is now recursive (nested objects, arrays, value-only wrappers), the Avro "Field 
already used" error in `stripVariantShredding` / `VariantReconstruction` is 
resolved, and the table-schema footer fallback now strips `typed_value` by 
shape so per-file layouts never leak into the resolved table schema.
   - Tests: unit coverage for the decorator, schema utils and `HoodieSchema` 
recursion; functional tests in `TestVariantDataType` for COW (multi-column with 
declines, update over a shredded base), MOR inline compaction, and the 
bulk-insert row writer, all gated on Spark 4.1+.
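
The buffer, infer once, replay flow of the decorator bullet above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `SketchWriter`, `SketchWriterFactory` and `BufferingInferenceWriter` are hypothetical types, the buffer is bounded by record count only (the 64MB byte limit is omitted), and a `null` schema stands in for "write unshredded".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical minimal writer interface; not Hudi's file-writer API. */
interface SketchWriter<R> {
    void write(R record) throws Exception;
    void close() throws Exception;
}

/** Hypothetical factory; a null schema means "create an unshredded writer". */
interface SketchWriterFactory<R, S> {
    SketchWriter<R> create(S schemaOrNull) throws Exception;
}

final class BufferingInferenceWriter<R, S> implements SketchWriter<R> {
    private final int maxBufferedRecords;
    private final Function<List<R>, S> inferrer;
    private final SketchWriterFactory<R, S> factory;
    private final List<R> buffer = new ArrayList<>();
    private SketchWriter<R> delegate;   // created lazily, after inference
    private Exception latchedFailure;   // creation/replay failure, rethrown later

    BufferingInferenceWriter(int maxBufferedRecords,
                             Function<List<R>, S> inferrer,
                             SketchWriterFactory<R, S> factory) {
        this.maxBufferedRecords = maxBufferedRecords;
        this.inferrer = inferrer;
        this.factory = factory;
    }

    @Override
    public void write(R record) throws Exception {
        if (latchedFailure != null) throw latchedFailure;
        if (delegate != null) { delegate.write(record); return; }
        buffer.add(record);
        if (buffer.size() >= maxBufferedRecords) inferAndReplay();
    }

    @Override
    public void close() throws Exception {
        // Surface an earlier failure so buffered records are never dropped silently.
        if (latchedFailure != null) throw latchedFailure;
        if (delegate == null) inferAndReplay(); // small file: sample is the whole file
        delegate.close();
    }

    private void inferAndReplay() throws Exception {
        S schema;
        try {
            schema = inferrer.apply(buffer);
        } catch (RuntimeException e) {
            schema = null; // inference failure: fall back to unshredded
        }
        try {
            delegate = factory.create(schema);
            for (R record : buffer) delegate.write(record); // replay in arrival order
            buffer.clear();
        } catch (Exception e) {
            latchedFailure = e; // latch so close() rethrows it as well
            throw e;
        }
    }
}
```

The two `try` blocks are kept separate on purpose: an inference failure is recoverable because the data can still be written unshredded, whereas a writer-creation or replay failure means records may be lost and therefore has to propagate.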
   
   ### Impact
   
   No behavior change unless the new config is enabled. When enabled on Spark 
4.1+, base files carry a per-file inferred `typed_value`; readers already 
handle shredded files (#18938 on the AVRO path, Spark native otherwise). MOR 
log files always stay unshredded; shredding materializes at compaction. 
Flipping the default on is a follow-up tracked in #18937.
   
   ### Risk Level
   
   Low. The feature is off by default, and engines without the Spark 4.1 
inferrer on the classpath are unaffected even when the config is enabled. Verified with new 
unit and functional tests across both record types and all three write paths, 
plus compile checks under the spark3.5, spark4.0 and spark4.1 profiles.
   
   ### Documentation Update
   
   New config documented via its `withDocumentation` text (picked up by the 
generated config reference). Website updates deferred to the default-flip 
follow-up.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   

