[PR] perf(common): Avoid per-record HoodieSchema rebuilds on Avro read/merge paths [hudi]

via GitHub Tue, 16 Jun 2026 00:10:12 -0700


voonhous opened a new pull request, #18967:
URL: https://github.com/apache/hudi/pull/18967


   ### Describe the issue this Pull Request addresses
   
   Closes #18966
   
   `HoodieSchema.fromAvroSchema(record.getSchema())` is called on the hottest 
Avro read/merge paths: once per record and, for field access, once per accessed 
field. Each call allocates a fresh `HoodieSchema` and, on first field access, 
rebuilds the full field list and field map (one `HoodieSchemaField` per column 
plus a HashMap collect, i.e. O(schema width) allocations), with nothing cached 
between calls. `AvroRecordContext#getFieldValueFromIndexedRecord` (the field 
accessor behind `RecordContext#getValue` for the Avro engine) is the worst 
case, but the same per-record rebuild shows up on several other read/merge 
paths.
   
   ### Summary and Changelog
   
   Intern the Avro-schema -> `HoodieSchema` conversion so the canonical 
wrapper's lazily built field list and field map are reused across calls; the 
per-record cost drops to a cache hit. This keeps `HoodieSchema` as the 
type-system facade rather than bypassing it with raw Avro traversal.
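
The two-level lookup described here can be sketched with JDK collections only. This is a minimal illustration, not the code in this PR: `SourceSchema`, `WrappedSchema`, and `TwoLevelSchemaCache` are hypothetical stand-ins for the Avro `Schema`, `HoodieSchema`, and the new cache, and the identity level uses a strongly referenced `IdentityHashMap` where the PR describes `weakKeys`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an Avro Schema: value equality, many distinct instances.
final class SourceSchema {
  final String name;
  final List<String> fieldNames;

  SourceSchema(String name, List<String> fieldNames) {
    this.name = name;
    this.fieldNames = Collections.unmodifiableList(new ArrayList<>(fieldNames));
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof SourceSchema)) return false;
    SourceSchema other = (SourceSchema) o;
    return name.equals(other.name) && fieldNames.equals(other.fieldNames);
  }

  @Override
  public int hashCode() {
    return 31 * name.hashCode() + fieldNames.hashCode();
  }
}

// Hypothetical stand-in for the wrapper type; equal when the wrapped schemas are equal.
final class WrappedSchema {
  final SourceSchema source;

  WrappedSchema(SourceSchema source) {
    this.source = source;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof WrappedSchema && source.equals(((WrappedSchema) o).source);
  }

  @Override
  public int hashCode() {
    return source.hashCode();
  }
}

final class TwoLevelSchemaCache {
  // Level 1: identity lookup, cheap when records share one schema instance.
  // Assumption: strong references for brevity; a real cache would use weak keys.
  private final Map<SourceSchema, WrappedSchema> byIdentity =
      Collections.synchronizedMap(new IdentityHashMap<>());
  // Level 2: value interning, so equal-but-distinct inputs share one wrapper.
  private final ConcurrentHashMap<WrappedSchema, WrappedSchema> interned =
      new ConcurrentHashMap<>();
  private final AtomicInteger conversions = new AtomicInteger();

  WrappedSchema get(SourceSchema schema) {
    WrappedSchema cached = byIdentity.get(schema);
    if (cached != null) {
      return cached; // hot path: no conversion, no allocation
    }
    conversions.incrementAndGet();
    WrappedSchema converted = new WrappedSchema(schema);
    // Returns the existing canonical wrapper if an equal one was interned earlier.
    WrappedSchema canonical = interned.computeIfAbsent(converted, k -> k);
    byIdentity.put(schema, canonical);
    return canonical;
  }

  int conversions() {
    return conversions.get();
  }
}
```

In this sketch a repeated lookup with the same instance never converts again, and a second, equal instance pays for one conversion before resolving to the wrapper that is already interned.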
   
   - New `AvroToHoodieSchemaCache` (in `org.apache.hudi.common.schema`): an 
Avro-`Schema`-keyed cache (`weakKeys`, identity lookups - records of one file 
share the same `Schema` instance) that on a miss converts and value-interns 
through `HoodieSchemaCache`, so equal-but-distinct Avro schema instances still 
converge on one canonical `HoodieSchema`. Kept separate from 
`HoodieSchemaCache` (which interns `HoodieSchema`) and distinct from the 
existing `org.apache.hudi.avro.AvroSchemaCache` (Avro -> Avro).
   - `HoodieSchema#getFields`/`#getFieldMap`: the lazily built field list/map 
are cached in `volatile` fields and published with a benign racy single-check. 
Previously the field map was a plain `HashMap` on a non-volatile field, so a 
racing reader could observe a non-null map with invisible entries and silently 
miss an existing field; `volatile` plus the immutable 
`Collections.unmodifiable*` wrappers fix that, while the harmless 
duplicate-build race under contention remains by design.
   - Interned the genuinely per-record `fromAvroSchema(...)` call sites: 
`AvroRecordContext#getFieldValueFromIndexedRecord`, 
`SparkFileFormatInternalRecordContext#convertAvroRecord`, 
`FlinkRecordContext#convertAvroRecord`, 
`RealtimeCompactedRecordReader#mergeRecord`, 
`HoodieAvroUtils#getRecordColumnValues`, `HoodieJsonPayload#getInsertValue`, 
and the `ExpressionPayload` MERGE-INTO evaluator / deserializer / serializer / 
joinRecords paths.
   - `HoodieAvroDataBlock#getBytes`: the `fromAvroSchema(schema)` was 
loop-invariant, so it is hoisted out of the per-record write loop.
   - Cold / one-time sites (static-final schemas, schema providers / 
post-processors, CLI, archival and LSM timeline readers) and the per-block 
`FileGroupRecordBuffer#composeEvolvedSchemaTransformer` are left unchanged.
   - Extended `TestAvroRecordContext`: top-level and nested access, nullable 
record unions, missing fields, non-unwrappable unions, the empty-name guard, 
and intern canonicalization across equal-but-distinct schema instances.
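
The `volatile` single-check publication mentioned for `getFields`/`getFieldMap` can be illustrated as follows. This is a sketch under assumptions, not the `HoodieSchema` source: `LazyFieldIndex` and its name-to-position map are hypothetical, chosen only to show the pattern.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class showing a lazily built, safely published lookup map.
final class LazyFieldIndex {
  private final List<String> fieldNames;
  // volatile: a reader that sees a non-null reference also sees every entry
  // that was put into the map before the reference was written.
  private volatile Map<String, Integer> fieldMap;

  LazyFieldIndex(List<String> fieldNames) {
    this.fieldNames = Collections.unmodifiableList(new ArrayList<>(fieldNames));
  }

  Map<String, Integer> getFieldMap() {
    Map<String, Integer> local = fieldMap; // one volatile read
    if (local == null) {
      // Single check, no lock: two threads may both build the map.
      // Both results are equal and immutable, so the duplicate work is harmless.
      Map<String, Integer> built = new LinkedHashMap<>();
      for (int i = 0; i < fieldNames.size(); i++) {
        built.put(fieldNames.get(i), i);
      }
      local = Collections.unmodifiableMap(built);
      fieldMap = local; // volatile write publishes the fully built map
    }
    return local;
  }
}
```

Reading the field once into a local keeps the method to a single volatile read on the hot path, and the map is only assigned after it is fully populated and wrapped, so no reader can observe a partially built map.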
   
   ### Impact
   
   Performance: removes O(schema width) allocations per record from the hottest 
Avro read/merge paths; for a 200-column table this was hundreds of allocations 
per record (per accessed field in the `AvroRecordContext` case). Interning also 
makes the resulting `HoodieSchema` canonical, improving hit rates for the 
downstream schema-keyed caches (deserializer / serializer / evaluator maps). 
Results and exceptions are unchanged. No public API change.
   
   ### Risk Level
   
   Low. The lookup paths are the same `HoodieSchema` code as before, with 
caching layered on via interning; interning returns an equal canonical 
instance, and the safe-publication change is covered by the existing 
`TestHoodieSchema` suite and the extended `TestAvroRecordContext`.
   
   ### Documentation Update
   
   None.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   

