voonhous commented on code in PR #18967:
URL: https://github.com/apache/hudi/pull/18967#discussion_r3407823743


##########
hudi-common/src/main/java/org/apache/hudi/avro/AvroRecordContext.java:
##########
@@ -70,7 +71,10 @@ public AvroRecordContext() {
   public static Object getFieldValueFromIndexedRecord(
       IndexedRecord record,
       String fieldName) {
-    HoodieSchema currentSchema = HoodieSchema.fromAvroSchema(record.getSchema());
+    // Interning returns the canonical wrapper for this schema, whose lazily built field list and

Review Comment:
   Good call - I went through all the `HoodieSchema.fromAvroSchema(...)` call sites. The large majority are one-time/cold paths (static-final schemas, schema providers/post-processors, CLI, archival + LSM timeline readers, per-block schema-merge setup), so they do not need interning. The genuinely per-record ones I switched to the intern cache (or hoisted):
   
   - `AvroRecordContext.getFieldValueFromIndexedRecord` (this PR's original change)
   - `SparkFileFormatInternalRecordContext.convertAvroRecord`
   - `FlinkRecordContext.convertAvroRecord`
   - `RealtimeCompactedRecordReader.mergeRecord` (two calls)
   - `HoodieAvroUtils.getRecordColumnValues`
   - `HoodieJsonPayload.getInsertValue`
   - `ExpressionPayload` MERGE-INTO eval paths (update/insert/delete evaluators, the row deserializer, the result serializer, and joinRecords)
   - `HoodieAvroDataBlock#getBytes`: the `fromAvroSchema(schema)` was loop-invariant, so I hoisted it out of the per-record write loop instead
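
For reference, the hoist in the last bullet has roughly the shape below. This is a minimal sketch, not the actual `HoodieAvroDataBlock` code: `HoistSketch`, `wrapSchema`, `WRAP_CALLS`, and the string-based records are hypothetical stand-ins, and the counter exists only to make the number of conversions visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class HoistSketch {
  // Counts conversions so the two variants can be compared (illustration only).
  static final AtomicInteger WRAP_CALLS = new AtomicInteger();

  // Hypothetical stand-in for a schema-wrapping call that allocates on every invocation.
  static String wrapSchema(String rawSchema) {
    WRAP_CALLS.incrementAndGet();
    return "wrapped(" + rawSchema + ")";
  }

  // Before: the loop-invariant conversion is repeated once per record.
  static List<String> writePerRecord(String rawSchema, List<String> records) {
    List<String> out = new ArrayList<>();
    for (String record : records) {
      String wrapped = wrapSchema(rawSchema);
      out.add(wrapped + ":" + record);
    }
    return out;
  }

  // After: the conversion is hoisted and runs once per call, regardless of record count.
  static List<String> writeHoisted(String rawSchema, List<String> records) {
    String wrapped = wrapSchema(rawSchema);
    List<String> out = new ArrayList<>();
    for (String record : records) {
      out.add(wrapped + ":" + record);
    }
    return out;
  }
}
```

Both variants produce the same output; only the number of wrapper conversions differs, which is why hoisting is enough here and no cache lookup is needed.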
   
   Interning is semantically transparent (returns an equal, canonical `HoodieSchema`), and where the result feeds a schema-keyed cache (deserializer/serializer/evaluator maps) it actually improves hit rates. `FileGroupRecordBuffer.composeEvolvedSchemaTransformer` is per-block rather than per-record, so I left it as-is.
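
To illustrate the "equal, canonical" point, here is a minimal sketch of the interning pattern. It does not use the real Hudi API: `InternSketch`, `Wrapper`, `wrap`, and `intern` are hypothetical names, and a comma-separated string stands in for an Avro schema. The idea is that repeated lookups for the same schema return the same instance, so any lazily built state on that instance is computed once and then shared.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InternSketch {
  // Hypothetical wrapper with a lazily built, moderately expensive view of the schema.
  static final class Wrapper {
    final String schemaText;
    private volatile List<String> fields; // built on first access, then reused

    Wrapper(String schemaText) {
      this.schemaText = schemaText;
    }

    List<String> fields() {
      List<String> cached = fields;
      if (cached == null) {
        cached = Arrays.asList(schemaText.split(","));
        fields = cached;
      }
      return cached;
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof Wrapper && ((Wrapper) other).schemaText.equals(schemaText);
    }

    @Override
    public int hashCode() {
      return schemaText.hashCode();
    }
  }

  // Canonical instances keyed by schema; assumes the key has stable equals/hashCode.
  private static final Map<String, Wrapper> INTERNED = new ConcurrentHashMap<>();

  // Non-interned path: a fresh wrapper (and fresh lazy state) on every call.
  static Wrapper wrap(String schemaText) {
    return new Wrapper(schemaText);
  }

  // Interned path: the same canonical wrapper for equal schemas.
  static Wrapper intern(String schemaText) {
    return INTERNED.computeIfAbsent(schemaText, Wrapper::new);
  }

  public static void main(String[] args) {
    Wrapper a = intern("id,name");
    Wrapper b = intern("id,name");
    System.out.println(a == b);                        // same canonical instance
    System.out.println(a.equals(wrap("id,name")));     // still equal to a fresh wrapper
    System.out.println(a.fields() == b.fields());      // lazy state is shared
  }
}
```

Because the interned wrapper is equal to a freshly built one, callers see no behavioural difference; the gain is that per-record code stops rebuilding the lazy state on every call. A production interner would also need to consider bounding or weakly referencing entries, which this sketch leaves out.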


