voonhous opened a new issue, #18937:
URL: https://github.com/apache/hudi/issues/18937

   ### Background
   
   PR #18065 adds the write mechanism for shredded variants (AVRO via 
`HoodieAvroWriteSupport`; the SPARK/Row path via `HoodieRowParquetWriteSupport` 
already existed). Both reuse Spark's `castShredded` for the cast, but neither 
decides WHAT to shred automatically: the shredding schema (`typed_value`) comes 
only from a `typed_value` already in the table schema or the test-only config 
`hoodie.parquet.variant.force.shredding.schema.for.test`. So in prod today Hudi 
never shreds variants unless a schema explicitly declares `typed_value`.
   
   Spark 4.1 added per-file shredding-schema inference (SPARK-53659) and turned 
shredding + inference on by default (SPARK-54454). That inference lives in 
Spark's own `ParquetOutputWriterWithVariantShredding` orchestration, which Hudi 
never invokes (Hudi writes base files through its own write supports). So Hudi 
does not get auto-shredding for free.
   
   ### Reuse
   
   - 
`org.apache.spark.sql.execution.datasources.parquet.InferVariantShreddingSchema`
 is public, with a public `inferSchema(rows: Seq[InternalRow]): StructType`. 
Spark 4.1+ only. Directly usable for the SPARK/Row path (records are 
`InternalRow`).
   - Heuristics: buffer ~4096 rows / 64MB per file, drop fields appearing in 
<10% of rows, keep consistent-typed fields, width cap 300 / depth cap 50.
   - `castShredded` is already reused by both Hudi paths.
   - `ParquetOutputWriterWithVariantShredding` is NOT reusable (tied to Spark's 
`OutputWriter`); we mirror its buffering, not call it.
   
   ### Architectural snag
   
   Hudi creates the base-file writer eagerly in `HoodieCreateHandle` with the 
schema fixed up front; Spark's inference buffers N rows before opening the 
writer. Per-file inference means deferring writer creation.
   
   ### Strategies (decide during design)
   
   1. Per-file inference (faithful to Spark): buffer first N rows in the 
handle, infer per file, open writer lazily. Most adaptive; touches 
handle/writer lifecycle for both paths.
   2. Per-job / sample inference (simpler v1): sample input once, infer a 
schema, inject `typed_value` into the write schema, let existing write supports 
shred. Reuses current plumbing; not per-file adaptive.
   
   ### Record-type specifics
   
   - SPARK/Row: `InternalRow` -> feed `InferVariantShreddingSchema` directly. 
Easier.
   - AVRO: `GenericRecord` variant is `{metadata, value}` binary; needs an avro 
-> `InternalRow`/`VariantVal` shim or a Hudi-side inferencer. Heavier lift.
   
   ### Tasks
   
   - [ ] New config to enable inference (mirror Spark; default off initially).
   - [ ] SPARK/Row: buffer + `InferVariantShreddingSchema`, thread schema into 
`HoodieRowParquetWriteSupport`.
   - [ ] AVRO: avro-variant inference (shim or custom), thread into 
`HoodieAvroWriteSupport`.
   - [ ] Decide buffering location (lazy writer in `HoodieCreateHandle` vs 
sample-before-handle).
   - [ ] Gate by Spark version (4.0 has no inferencer).
   - [ ] Flip shredding default-on only after this AND #18931 land.
   
   ### Dependencies
   
   - Depends on #18065 (write support).
   - Depends on #18931 (read-then-reshred): auto-producing shredded base files 
makes compaction/clustering/updates read them; #18931 must land before enabling 
by default, else those paths hit the fail-fast guard.
   - Independent of #18935.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to