voonhous opened a new pull request, #18967: URL: https://github.com/apache/hudi/pull/18967
### Describe the issue this Pull Request addresses

Closes #18966

`HoodieSchema.fromAvroSchema(record.getSchema())` is called on the hottest Avro read/merge paths - once per record (and, for field access, per accessed field). Each call allocates a fresh `HoodieSchema` and, on first field access, rebuilds the full field list and field map (one `HoodieSchemaField` per column plus a HashMap collect, i.e. O(schema width) allocations), with nothing cached between calls. `AvroRecordContext#getFieldValueFromIndexedRecord` (the field accessor behind `RecordContext#getValue` for the Avro engine) is the worst case, but the same per-record rebuild shows up on several other read/merge paths.

### Summary and Changelog

Intern the Avro-schema -> `HoodieSchema` conversion so the canonical wrapper's lazily built field list and field map are reused across calls; the per-record cost drops to a cache hit. This keeps `HoodieSchema` as the type-system facade rather than bypassing it with raw Avro traversal.

- New `AvroToHoodieSchemaCache` (in `org.apache.hudi.common.schema`): an Avro-`Schema`-keyed cache (`weakKeys`, identity lookups - records of one file share the same `Schema` instance) that on a miss converts and value-interns through `HoodieSchemaCache`, so equal-but-distinct Avro schema instances still converge on one canonical `HoodieSchema`. Kept separate from `HoodieSchemaCache` (which interns `HoodieSchema`) and distinct from the existing `org.apache.hudi.avro.AvroSchemaCache` (Avro -> Avro).
- `HoodieSchema#getFields`/`#getFieldMap`: the lazily built field list/map are cached in `volatile` fields and published with a benign racy single-check. Previously the field map was a plain `HashMap` on a non-volatile field, so a racing reader could observe a non-null map with invisible entries and silently miss an existing field; `volatile` plus the immutable `Collections.unmodifiable*` wrappers fix that, while the harmless duplicate-build race under contention remains by design.
- Interned the genuinely per-record `fromAvroSchema(...)` call sites: `AvroRecordContext#getFieldValueFromIndexedRecord`, `SparkFileFormatInternalRecordContext#convertAvroRecord`, `FlinkRecordContext#convertAvroRecord`, `RealtimeCompactedRecordReader#mergeRecord`, `HoodieAvroUtils#getRecordColumnValues`, `HoodieJsonPayload#getInsertValue`, and the `ExpressionPayload` MERGE-INTO evaluator / deserializer / serializer / joinRecords paths.
- `HoodieAvroDataBlock#getBytes`: the `fromAvroSchema(schema)` was loop-invariant, so it is hoisted out of the per-record write loop.
- Cold / one-time sites (static-final schemas, schema providers / post-processors, CLI, archival and LSM timeline readers) and the per-block `FileGroupRecordBuffer#composeEvolvedSchemaTransformer` are left unchanged.
- Extended `TestAvroRecordContext`: top-level and nested access, nullable record unions, missing fields, non-unwrappable unions, the empty-name guard, and intern canonicalization across equal-but-distinct schema instances.

### Impact

Performance: removes O(schema width) allocations per record from the hottest Avro read/merge paths; for a 200-column table this was hundreds of allocations per record (per accessed field in the `AvroRecordContext` case). Interning also makes the resulting `HoodieSchema` canonical, improving hit rates for the downstream schema-keyed caches (deserializer / serializer / evaluator maps). Results and exceptions are unchanged. No public API change.

### Risk Level

Low. The lookup paths are the same `HoodieSchema` code as before, with caching layered on via interning; interning returns an equal canonical instance, and the safe-publication change is covered by the existing `TestHoodieSchema` suite and the extended `TestAvroRecordContext`.

### Documentation Update

None.
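
### Illustrative sketch (not the PR's code)

For readers who want a concrete picture of the two mechanisms in the changelog, here is a minimal, self-contained sketch under stated assumptions. `RawSchema`, `WrappedSchema`, and `InternSketch` are hypothetical stand-ins for Avro `Schema`, `HoodieSchema`, and `AvroToHoodieSchemaCache`; none of these names or signatures come from Hudi. The sketch also simplifies two things: the identity tier holds strong references (the JDK has no weak identity map, whereas the PR describes `weakKeys`), and the value tier is keyed by the raw schema rather than going through `HoodieSchemaCache`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for an Avro Schema: value equality over its field names. */
final class RawSchema {
  final List<String> fieldNames;

  RawSchema(List<String> fieldNames) {
    this.fieldNames = Collections.unmodifiableList(new ArrayList<>(fieldNames));
  }

  @Override public boolean equals(Object o) {
    return o instanceof RawSchema && fieldNames.equals(((RawSchema) o).fieldNames);
  }

  @Override public int hashCode() {
    return fieldNames.hashCode();
  }
}

/** Hypothetical stand-in for HoodieSchema: lazily builds a name -> position index. */
final class WrappedSchema {
  private final RawSchema raw;
  // volatile: a reader that sees a non-null map also sees all of its entries.
  private volatile Map<String, Integer> fieldIndex;

  WrappedSchema(RawSchema raw) {
    this.raw = raw;
  }

  Map<String, Integer> getFieldIndex() {
    Map<String, Integer> local = fieldIndex; // single volatile read
    if (local == null) {
      // Racy single-check: two threads may both build, but the results are
      // equal and immutable, so whichever write lands last is harmless.
      Map<String, Integer> built = new LinkedHashMap<>();
      for (int i = 0; i < raw.fieldNames.size(); i++) {
        built.put(raw.fieldNames.get(i), i);
      }
      local = Collections.unmodifiableMap(built);
      fieldIndex = local; // volatile write publishes the fully built map
    }
    return local;
  }
}

/** Two-tier interning: identity lookup first, value interning on a miss. */
final class InternSketch {
  // Assumption: strong references here; a production cache would use weak keys.
  private static final Map<RawSchema, WrappedSchema> BY_IDENTITY =
      Collections.synchronizedMap(new IdentityHashMap<>());
  private static final Map<RawSchema, WrappedSchema> BY_VALUE = new ConcurrentHashMap<>();

  static WrappedSchema intern(RawSchema raw) {
    WrappedSchema hit = BY_IDENTITY.get(raw); // cheap path: same instance seen before
    if (hit != null) {
      return hit;
    }
    // Miss: equal-but-distinct instances converge on one canonical wrapper.
    WrappedSchema canonical = BY_VALUE.computeIfAbsent(raw, WrappedSchema::new);
    BY_IDENTITY.put(raw, canonical);
    return canonical;
  }
}
```

In this sketch, repeated `InternSketch.intern(schema)` calls with the same instance skip hashing the schema contents entirely, and two equal schemas loaded separately share one `WrappedSchema`, so its field index is built at most once per thread that races on the first access.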

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
