voonhous opened a new pull request, #18961: URL: https://github.com/apache/hudi/pull/18961
### Describe the issue this Pull Request addresses

Closes #18937

Hudi can write and read shredded variants, but `typed_value` only ever comes from an explicit table schema or the test-only force-shredding DDL, so production tables never shred. Spark 4.1 infers a per-file shredding schema from the data ([SPARK-53659](https://issues.apache.org/jira/browse/SPARK-53659), on by default), but that lives in Spark's own writer stack which Hudi bypasses.

Stacked on #18938 (read-side reconstruction), which stacks on #18065. Do not merge until both land.

### Summary and Changelog

When `hoodie.parquet.variant.shredding.schema.inference.enabled` is set (default `false`), Hudi infers a shredding schema per base file from a sample of the records written to it, for both record types (SPARK, AVRO) and the bulk-insert row writer. Requires Spark 4.1+ on the writer classpath; Spark 4.0/Flink/Java silently keep writing unshredded.

- New config `hoodie.parquet.variant.shredding.schema.inference.enabled` in `HoodieStorageConfig`.
- New `VariantShreddingSchemaInferrer` SPI in hudi-common, loaded by classpath detection (`VariantShreddingRuntime`, which also consolidates the duplicated provider-candidate arrays).
- `Spark41VariantShreddingSchemaInferrer` (hudi-spark4.1.x) delegates to Spark's `InferVariantShreddingSchema`, so Hudi inherits Spark's heuristics verbatim (no code copied). One call per file covers all variant columns (preserves the global width budget); object keys that are not valid Avro names are dropped and legally fall back to the residual `value` column.
- `VariantShreddingInferenceFileWriter` (+ a row-writer sibling): buffers up to 4096 records / 64MB (mirrors Spark's `ParquetOutputWriterWithVariantShredding`), infers once, creates the real writer with the inferred `typed_value` spliced in, then replays in order. Inference failures fall back to unshredded (a throwing inference must not fail compaction); writer-creation or replay failures latch and rethrow through `close()` so buffered records cannot be dropped silently.
- Wiring: the AVRO factory splices the schema argument; the SPARK and row-writer factories splice a copied config, since `HoodieRowParquetWriteSupport` resolves its schema from `hoodie.write.schema` / `hoodie.avro.schema` rather than the factory argument.
- Fixes latent issues this feature trips: `Variant.getPlainTypedValueSchema` is now recursive (nested objects, arrays, value-only wrappers), Avro "Field already used" in `stripVariantShredding` / `VariantReconstruction`, and the table-schema footer fallback now strips `typed_value` by shape so per-file layouts never leak into the resolved table schema.
- Tests: unit coverage for the decorator, schema utils and `HoodieSchema` recursion; functional tests in `TestVariantDataType` for COW (multi-column with declines, update over a shredded base), MOR inline compaction, and the bulk-insert row writer, all gated on Spark 4.1+.

### Impact

No behavior change unless the new config is enabled. When enabled on Spark 4.1+, base files carry a per-file inferred `typed_value`; readers already handle shredded files (#18938 on the AVRO path, Spark native otherwise). MOR log files always stay unshredded; shredding materializes at compaction. Flipping the default on is a follow-up tracked in #18937.

### Risk Level

Low. The feature is off by default and engines without the Spark 4.1 inferrer on the classpath are unaffected even when it is on. Verified with new unit and functional tests across both record types and all three write paths, plus compile checks under the spark3.5, spark4.0 and spark4.1 profiles.

### Documentation Update

New config documented via its `withDocumentation` text (picked up by the generated config reference). Website updates deferred to the default-flip follow-up.
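
For reviewers, the buffer / infer once / replay flow described in the changelog can be illustrated with a minimal, self-contained sketch. This is not the code in this PR: every name below (`BufferedInferenceWriterSketch`, `RecordSink`, `ListSink`, `BufferingWriter`) is hypothetical, records and schemas are plain strings, and only a record-count bound is modeled (the real decorator is described as also bounding by bytes). Unlike the description above, the sketch also rethrows a latched failure from later `write()` calls, not only from `close()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch only: none of these types are Hudi or Spark APIs.
final class BufferedInferenceWriterSketch {

    /** Stand-in for a base-file writer. */
    interface RecordSink {
        void write(String record);
        void close();
    }

    /** In-memory sink that remembers the schema it was created with. */
    static final class ListSink implements RecordSink {
        final String schema;
        final List<String> written = new ArrayList<>();
        boolean closed;
        ListSink(String schema) { this.schema = schema; }
        @Override public void write(String record) { written.add(record); }
        @Override public void close() { closed = true; }
    }

    /** Buffers records, infers a schema once, then replays into the real sink. */
    static final class BufferingWriter implements RecordSink {
        private final int maxBuffered;
        private final Function<List<String>, String> inferrer;
        private final Function<String, RecordSink> sinkFactory;
        private final String fallbackSchema;
        private final List<String> buffer = new ArrayList<>();
        private RecordSink delegate;        // created after inference
        private RuntimeException latched;   // writer-creation or replay failure

        BufferingWriter(int maxBuffered,
                        Function<List<String>, String> inferrer,
                        Function<String, RecordSink> sinkFactory,
                        String fallbackSchema) {
            this.maxBuffered = maxBuffered;
            this.inferrer = inferrer;
            this.sinkFactory = sinkFactory;
            this.fallbackSchema = fallbackSchema;
        }

        @Override public void write(String record) {
            if (latched != null) throw latched;
            if (delegate != null) { delegate.write(record); return; }
            buffer.add(record);
            if (buffer.size() >= maxBuffered) {
                inferAndReplay();
                if (latched != null) throw latched;
            }
        }

        @Override public void close() {
            // Small files never hit the bound, so inference happens at close.
            if (latched == null && delegate == null) inferAndReplay();
            // Rethrow so buffered records are never dropped silently.
            if (latched != null) throw latched;
            delegate.close();
        }

        private void inferAndReplay() {
            String schema;
            try {
                schema = inferrer.apply(new ArrayList<>(buffer));
            } catch (RuntimeException e) {
                schema = fallbackSchema;    // inference failure: write unshredded
            }
            try {
                RecordSink sink = sinkFactory.apply(schema);
                for (String r : buffer) sink.write(r);  // replay in arrival order
                delegate = sink;
                buffer.clear();
            } catch (RuntimeException e) {
                latched = e;                // keep the failure for later calls
            }
        }
    }

    public static void main(String[] args) {
        ListSink[] created = new ListSink[1];
        BufferingWriter w = new BufferingWriter(
            2, sample -> "shredded(" + sample.size() + ")",
            schema -> created[0] = new ListSink(schema), "unshredded");
        w.write("a"); w.write("b"); w.write("c");
        w.close();
        System.out.println(created[0].schema + " " + created[0].written);
    }
}
```

The point of the shape is that the real writer is created only after a sample exists, so the inferred layout can be fixed before any bytes are written, while the two `try` blocks keep the two failure policies (fall back vs. latch) separate.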
### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
