[I] perf: AvroRecordContext rebuilds a HoodieSchema wrapper and field map on every field access [hudi]

via GitHub Wed, 10 Jun 2026 06:28:48 -0700


voonhous opened a new issue, #18966:
URL: https://github.com/apache/hudi/issues/18966


   ### Describe the problem
   
   `AvroRecordContext#getFieldValueFromIndexedRecord` is the implementation 
behind `RecordContext#getValue` for the Avro engine and runs once per record 
per accessed field in the file group reader flow: MOR snapshot reads, 
compaction, upsert log merging, and metadata table reads (ordering values, 
delete-flag checks, column value access).
   
   Per invocation it:
   
   - calls `HoodieSchema.fromAvroSchema(record.getSchema())`, allocating a 
fresh wrapper and re-deriving the schema type
   - splits the field path with regex-based `String.split`
   - calls `HoodieSchema#getField` on the fresh wrapper, which lazily rebuilds 
the entire field list and field map: one new `HoodieSchemaField` per column 
plus a HashMap collect, i.e. O(schema width) allocations per call
   - wraps every union branch into another new `HoodieSchema` via 
`getNonNullType`
   
   None of it is cached because the wrapper is thrown away after each call. For 
a 200-column table this is hundreds of allocations per record per accessed 
field.
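
To make the cost concrete, here is a rough model of the pattern described above, not the Hudi code itself. `SchemaWrapper` and `FieldWrapper` are hypothetical stand-ins for the wrapper types, and the counter tracks only the objects the model creates explicitly, not HashMap internals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WrapperCostSketch {
  // Counts only the objects this model creates explicitly.
  static long allocations = 0;

  /** Hypothetical per-column wrapper (stand-in, not a Hudi class). */
  static final class FieldWrapper {
    final String name;
    final int pos;

    FieldWrapper(String name, int pos) {
      this.name = name;
      this.pos = pos;
      allocations++;
    }
  }

  /** Hypothetical schema wrapper that builds its field map lazily on first lookup. */
  static final class SchemaWrapper {
    private final List<String> columns;
    private Map<String, FieldWrapper> fieldMap;

    SchemaWrapper(List<String> columns) {
      this.columns = columns;
      allocations++;
    }

    FieldWrapper getField(String name) {
      if (fieldMap == null) {
        fieldMap = new HashMap<>();
        allocations++;
        for (int i = 0; i < columns.size(); i++) {
          fieldMap.put(columns.get(i), new FieldWrapper(columns.get(i), i));
        }
      }
      return fieldMap.get(name);
    }
  }

  /** A fresh wrapper per access, so the lazy field map is rebuilt every time. */
  static int positionWithFreshWrapper(List<String> columns, String name) {
    return new SchemaWrapper(columns).getField(name).pos;
  }

  /** Builds column names c0, c1, ... for a schema of the given width. */
  static List<String> columns(int width) {
    List<String> names = new ArrayList<>(width);
    for (int i = 0; i < width; i++) {
      names.add("c" + i);
    }
    return names;
  }
}
```

In this model a single lookup on a 200-column schema creates 202 counted objects (one schema wrapper, one map, 200 field wrappers), and a second lookup creates another 202 because nothing survives the call.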
   
   ### Proposed fix
   
   Rewrite `getFieldValueFromIndexedRecord` to traverse the raw Avro schema 
directly:
   
   - unwrap only two-branch `[null, X]` unions inline, matching the current 
effective behavior
   - use Avro's own `Schema#getField`, which is an allocation-free O(1) lookup
   - read values by `field.pos()`, as today
   - add a fast path that skips splitting when the field name has no dot
   
   Results are identical for all valid inputs. The schema must keep coming from 
the record itself, since log block schemas can differ from the reader schema.
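
The traversal proposed here could look roughly like the sketch below. It is only an illustration under stated assumptions, not the planned patch: `Node` and `Rec` are hypothetical stand-ins for an Avro schema and an indexed record, chosen so the example runs without the Avro library. Returning `null` for a missing field and accepting the null branch in either union position are also assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RawSchemaLookupSketch {
  enum Kind { RECORD, UNION, NULL, PRIMITIVE }

  /** Hypothetical stand-in for a schema node; not the Avro Schema class. */
  static final class Node {
    final Kind kind;
    // RECORD: field schemas by position; UNION: branches; otherwise empty.
    final List<Node> children;
    // RECORD only: field name -> position, built once with the schema.
    final Map<String, Integer> positions = new HashMap<>();

    Node(Kind kind, List<Node> children, String... fieldNames) {
      this.kind = kind;
      this.children = children;
      for (int i = 0; i < fieldNames.length; i++) {
        positions.put(fieldNames[i], i);
      }
    }
  }

  /** Hypothetical stand-in for an indexed record: its own schema plus values by position. */
  static final class Rec {
    final Node schema;
    final Object[] values;

    Rec(Node schema, Object... values) {
      this.schema = schema;
      this.values = values;
    }
  }

  /** Unwraps a two-branch union with one null branch; anything else is returned unchanged. */
  static Node unwrapNullable(Node s) {
    if (s.kind != Kind.UNION || s.children.size() != 2) {
      return s;
    }
    Node first = s.children.get(0);
    Node second = s.children.get(1);
    if (first.kind == Kind.NULL) {
      return second;
    }
    if (second.kind == Kind.NULL) {
      return first;
    }
    return s;
  }

  static Object getValue(Rec record, String path) {
    Node schema = record.schema; // start from the record's own schema, not a reader schema
    Rec current = record;
    int start = 0;
    while (true) {
      int dot = path.indexOf('.', start);
      // Fast path: a name without any dot is used as-is, with no split or substring.
      String name = (start == 0 && dot < 0)
          ? path
          : path.substring(start, dot < 0 ? path.length() : dot);
      Integer pos = schema.positions.get(name);
      if (pos == null) {
        return null; // assumption: an unknown field yields null
      }
      Object value = current.values[pos];
      if (dot < 0) {
        return value; // last path segment
      }
      if (!(value instanceof Rec)) {
        return null; // parent is null or not a record
      }
      schema = unwrapNullable(schema.children.get(pos));
      if (schema.kind != Kind.RECORD) {
        return null;
      }
      current = (Rec) value;
      start = dot + 1;
    }
  }
}
```

In real Avro code, the `positions` lookup would correspond to `Schema#getField` followed by `field.pos()`, and `current.values[pos]` to reading the value from the record by position. The point of the sketch is that the only per-call allocations left are the substrings taken for nested paths.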
   
   Will raise a PR for this.


