voonhous commented on code in PR #18967:
URL: https://github.com/apache/hudi/pull/18967#discussion_r3407823743
##########
hudi-common/src/main/java/org/apache/hudi/avro/AvroRecordContext.java:
##########
@@ -70,7 +71,10 @@ public AvroRecordContext() {
public static Object getFieldValueFromIndexedRecord(
IndexedRecord record,
String fieldName) {
- HoodieSchema currentSchema = HoodieSchema.fromAvroSchema(record.getSchema());
+ // Interning returns the canonical wrapper for this schema, whose lazily built field list and
Review Comment:
Good call - I went through all the `HoodieSchema.fromAvroSchema(...)` call
sites. The large majority are one-time/cold paths (static-final schemas, schema
providers/post-processors, CLI, archival + LSM timeline readers, per-block
schema-merge setup), so they do not need interning. The call sites that
genuinely run per record now use the intern cache (or were hoisted):
- `AvroRecordContext.getFieldValueFromIndexedRecord` (this PR's original
change)
- `SparkFileFormatInternalRecordContext.convertAvroRecord`
- `FlinkRecordContext.convertAvroRecord`
- `RealtimeCompactedRecordReader.mergeRecord` (two calls)
- `HoodieAvroUtils.getRecordColumnValues`
- `HoodieJsonPayload.getInsertValue`
- `ExpressionPayload` MERGE-INTO eval paths (update/insert/delete
evaluators, the row deserializer, the result serializer, and joinRecords)
- `HoodieAvroDataBlock#getBytes`: the `fromAvroSchema(schema)` was
loop-invariant, so I hoisted it out of the per-record write loop instead
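
To illustrate the hoisting in the last bullet, here is a minimal sketch. It is not the actual `HoodieAvroDataBlock` code: `HoistSketch`, `wrapSchema`, and the string-based "schema" are hypothetical stand-ins, and the counter exists only to make the number of conversions visible.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class HoistSketch {
  // Counts how often the (hypothetical) schema conversion runs.
  static final AtomicInteger CONVERSIONS = new AtomicInteger();

  // Hypothetical stand-in for a schema wrapper conversion.
  static String wrapSchema(String rawSchema) {
    CONVERSIONS.incrementAndGet();
    return "wrapped:" + rawSchema;
  }

  // Before: the loop-invariant conversion is repeated for every record.
  static int writePerRecord(String rawSchema, List<String> records) {
    int bytes = 0;
    for (String record : records) {
      String wrapped = wrapSchema(rawSchema);
      bytes += wrapped.length() + record.length();
    }
    return bytes;
  }

  // After: the conversion is hoisted out of the loop and runs once.
  static int writeHoisted(String rawSchema, List<String> records) {
    String wrapped = wrapSchema(rawSchema);
    int bytes = 0;
    for (String record : records) {
      bytes += wrapped.length() + record.length();
    }
    return bytes;
  }
}
```

Both methods return the same result; only the number of conversions differs (one per record versus one per call).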

Interning is semantically transparent (returns an equal, canonical
`HoodieSchema`), and where the result feeds a schema-keyed cache
(deserializer/serializer/evaluator maps) it actually improves hit rates.

`FileGroupRecordBuffer.composeEvolvedSchemaTransformer` is per-block rather
than per-record, so I left it as-is.
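
For readers unfamiliar with the pattern, the following is a minimal sketch of interning in general, assuming a plain `ConcurrentHashMap` keyed by schema equality. It does not reproduce the intern cache in this PR: `InternSketch`, `Wrapper`, `intern`, and the comma-separated string "schema" are all hypothetical. It shows why a canonical wrapper helps: lazily built state (here a field-position index) is built once and shared by every caller that holds an equal schema.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class InternSketch {
  // Hypothetical wrapper that lazily builds a field-name index on first use.
  static final class Wrapper {
    private final String rawSchema;
    private Map<String, Integer> fieldPositions; // built lazily
    private int indexBuilds;

    Wrapper(String rawSchema) {
      this.rawSchema = rawSchema;
    }

    // Returns the position of the field, or -1 if it is absent.
    synchronized int positionOf(String field) {
      if (fieldPositions == null) {
        Map<String, Integer> positions = new HashMap<>();
        String[] names = rawSchema.split(",");
        for (int i = 0; i < names.length; i++) {
          positions.put(names[i], i);
        }
        fieldPositions = positions;
        indexBuilds++;
      }
      Integer pos = fieldPositions.get(field);
      return pos == null ? -1 : pos;
    }

    synchronized int indexBuilds() {
      return indexBuilds;
    }
  }

  // Equal raw schemas map to one canonical wrapper instance.
  private static final ConcurrentHashMap<String, Wrapper> CACHE =
      new ConcurrentHashMap<>();

  static Wrapper intern(String rawSchema) {
    return CACHE.computeIfAbsent(rawSchema, Wrapper::new);
  }
}
```

This sketch never evicts entries; a real cache would likely need a size bound or weak references, which is outside what the sketch covers.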