voonhous opened a new issue, #18937:
URL: https://github.com/apache/hudi/issues/18937
### Background
PR #18065 adds the write mechanism for shredded variants (AVRO via
`HoodieAvroWriteSupport`; the SPARK/Row path via `HoodieRowParquetWriteSupport`
already existed). Both reuse Spark's `castShredded` for the cast, but neither
decides WHAT to shred automatically: the shredding schema (`typed_value`) comes
only from a `typed_value` already in the table schema or the test-only config
`hoodie.parquet.variant.force.shredding.schema.for.test`. So in prod today Hudi
never shreds variants unless a schema explicitly declares `typed_value`.
Spark 4.1 added per-file shredding-schema inference (SPARK-53659) and turned
shredding + inference on by default (SPARK-54454). That inference lives in
Spark's own `ParquetOutputWriterWithVariantShredding` orchestration, which Hudi
never invokes (Hudi writes base files through its own write supports). So Hudi
does not get auto-shredding for free.
### Reuse
-
`org.apache.spark.sql.execution.datasources.parquet.InferVariantShreddingSchema`
is public, with a public `inferSchema(rows: Seq[InternalRow]): StructType`.
Spark 4.1+ only. Directly usable for the SPARK/Row path (records are
`InternalRow`).
- Heuristics: buffer ~4096 rows / 64MB per file, drop fields appearing in
<10% of rows, keep consistent-typed fields, width cap 300 / depth cap 50.
- `castShredded` is already reused by both Hudi paths.
- `ParquetOutputWriterWithVariantShredding` is NOT reusable (tied to Spark's
`OutputWriter`); we mirror its buffering, not call it.
### Architectural snag
Hudi creates the base-file writer eagerly in `HoodieCreateHandle` with the
schema fixed up front; Spark's inference buffers N rows before opening the
writer. Per-file inference means deferring writer creation.
### Strategies (decide during design)
1. Per-file inference (faithful to Spark): buffer first N rows in the
handle, infer per file, open writer lazily. Most adaptive; touches
handle/writer lifecycle for both paths.
2. Per-job / sample inference (simpler v1): sample input once, infer a
schema, inject `typed_value` into the write schema, let existing write supports
shred. Reuses current plumbing; not per-file adaptive.
### Record-type specifics
- SPARK/Row: `InternalRow` -> feed `InferVariantShreddingSchema` directly.
Easier.
- AVRO: `GenericRecord` variant is `{metadata, value}` binary; needs an avro
-> `InternalRow`/`VariantVal` shim or a Hudi-side inferencer. Heavier lift.
### Tasks
- [ ] New config to enable inference (mirror Spark; default off initially).
- [ ] SPARK/Row: buffer + `InferVariantShreddingSchema`, thread schema into
`HoodieRowParquetWriteSupport`.
- [ ] AVRO: avro-variant inference (shim or custom), thread into
`HoodieAvroWriteSupport`.
- [ ] Decide buffering location (lazy writer in `HoodieCreateHandle` vs
sample-before-handle).
- [ ] Gate by Spark version (4.0 has no inferencer).
- [ ] Flip shredding default-on only after this AND #18931 land.
### Dependencies
- Depends on #18065 (write support).
- Depends on #18931 (read-then-reshred): auto-producing shredded base files
makes compaction/clustering/updates read them; #18931 must land before enabling
by default, else those paths hit the fail-fast guard.
- Independent of #18935.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]