voonhous opened a new issue, #18966: URL: https://github.com/apache/hudi/issues/18966
### Describe the problem

`AvroRecordContext#getFieldValueFromIndexedRecord` is the implementation behind `RecordContext#getValue` for the Avro engine and runs once per record per accessed field in the file group reader flow: MOR snapshot reads, compaction, upsert log merging, and metadata table reads (ordering values, delete-flag checks, column value access).

Per invocation it:

- calls `HoodieSchema.fromAvroSchema(record.getSchema())`, allocating a fresh wrapper and re-deriving the schema type
- splits the field path with regex-based `String.split`
- calls `HoodieSchema#getField` on the fresh wrapper, which lazily rebuilds the entire field list and field map: one new `HoodieSchemaField` per column plus a HashMap collect, i.e. O(schema width) allocations per call
- wraps every union branch into another new `HoodieSchema` via `getNonNullType`

None of it is cached because the wrapper is thrown away after each call. For a 200-column table this is hundreds of allocations per record per accessed field.

### Proposed fix

Rewrite `getFieldValueFromIndexedRecord` to traverse the raw Avro schema directly:

- unwrap only two-branch `[null, X]` unions inline (matching the current effective behavior)
- use Avro's own `Schema#getField`, which is an allocation-free O(1) lookup
- read values by `field.pos()` as today
- add a fast path that skips splitting when the field name has no dot

Results are identical for all valid inputs; the schema must keep coming from the record itself since log block schemas can differ from the reader schema.

Will raise a PR for this.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
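
For readers following the proposed fix, the traversal could look roughly like the sketch below. It is not the Hudi implementation and does not use Avro: `FieldPathSketch`, its nested `Schema`, `Field`, and `Record` types, and the helper `unwrapNullable` are hypothetical stand-ins that only imitate the shape of Avro's `Schema`, `Schema.Field`, and `IndexedRecord`, so that the example runs with the JDK alone. Returning `null` for a missing field or a non-record intermediate value is also an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in model only: these types imitate the shape of Avro's Schema,
// Schema.Field and IndexedRecord; they are not the Avro or Hudi classes.
class FieldPathSketch {
  enum Type { RECORD, UNION, NULL, STRING, INT }

  static final class Field {
    final String name;
    final Schema schema;
    int pos; // assigned once, when the enclosing record schema is built
    Field(String name, Schema schema) { this.name = name; this.schema = schema; }
  }

  static final class Schema {
    final Type type;
    final List<Schema> branches;           // UNION only
    final Map<String, Field> fieldsByName; // RECORD only, built once per schema

    private Schema(Type type, List<Schema> branches, Map<String, Field> fieldsByName) {
      this.type = type;
      this.branches = branches;
      this.fieldsByName = fieldsByName;
    }
    static Schema primitive(Type type) { return new Schema(type, null, null); }
    static Schema union(Schema... branches) {
      return new Schema(Type.UNION, Arrays.asList(branches), null);
    }
    static Schema record(Field... fields) {
      Map<String, Field> byName = new HashMap<>();
      for (int i = 0; i < fields.length; i++) {
        fields[i].pos = i;
        byName.put(fields[i].name, fields[i]);
      }
      return new Schema(Type.RECORD, null, byName);
    }
    // Plain map lookup: nothing is rebuilt or wrapped per call.
    Field getField(String name) { return fieldsByName == null ? null : fieldsByName.get(name); }
  }

  static final class Record {
    final Schema schema;
    final Object[] values; // read by field position
    Record(Schema schema, Object... values) { this.schema = schema; this.values = values; }
  }

  // Unwrap only two-branch [null, X] or [X, null] unions; leave anything else as-is.
  static Schema unwrapNullable(Schema s) {
    if (s.type != Type.UNION || s.branches.size() != 2) return s;
    Schema first = s.branches.get(0);
    Schema second = s.branches.get(1);
    if (first.type == Type.NULL) return second;
    if (second.type == Type.NULL) return first;
    return s;
  }

  static Object getValue(Record record, String path) {
    Schema schema = record.schema; // always start from the record's own schema
    if (path.indexOf('.') < 0) {   // fast path: no splitting for a flat field name
      Field f = schema.getField(path);
      return f == null ? null : record.values[f.pos];
    }
    Object current = record;
    int start = 0;
    while (true) {
      if (!(current instanceof Record) || schema.type != Type.RECORD) return null;
      int dot = path.indexOf('.', start);
      String name = dot < 0 ? path.substring(start) : path.substring(start, dot);
      Field f = schema.getField(name);
      if (f == null) return null;
      current = ((Record) current).values[f.pos];
      if (dot < 0) return current;       // last path segment reached
      schema = unwrapNullable(f.schema); // descend into the nested schema
      start = dot + 1;
    }
  }

  public static void main(String[] args) {
    Schema nullType = Schema.primitive(Type.NULL);
    Schema address = Schema.record(
        new Field("city", Schema.union(nullType, Schema.primitive(Type.STRING))));
    Schema user = Schema.record(
        new Field("id", Schema.primitive(Type.INT)),
        new Field("address", Schema.union(nullType, address)));
    Record rec = new Record(user, 7, new Record(address, "Oslo"));
    System.out.println(getValue(rec, "id"));           // prints 7
    System.out.println(getValue(rec, "address.city")); // prints Oslo
  }
}
```

In this model the only per-call allocations are the substrings for dotted paths; the field map is built once with the schema, which is the property the proposal relies on when it switches to Avro's own `Schema#getField`.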
