voonhous commented on code in PR #18967:
URL: https://github.com/apache/hudi/pull/18967#discussion_r3426421491


##########
hudi-common/src/main/java/org/apache/hudi/avro/AvroRecordContext.java:
##########
@@ -70,7 +71,10 @@ public AvroRecordContext() {
   public static Object getFieldValueFromIndexedRecord(
       IndexedRecord record,
       String fieldName) {
-    HoodieSchema currentSchema = 
HoodieSchema.fromAvroSchema(record.getSchema());
+    // Interning returns the canonical wrapper for this schema, whose lazily 
built field list and
+    // field map survive across calls, so the per-record cost is a cache hit 
instead of an
+    // O(schema width) wrapper rebuild.
+    HoodieSchema currentSchema = 
AvroToHoodieSchemaCache.intern(record.getSchema());

Review Comment:
   I tried folding the cache into `HoodieSchema.fromAvroSchema`, but it breaks 
`TestHoodieSchemaUtils#testIllegalPromotionsBetweenPrimitives`, so I'm 
reverting it and keeping `AvroToHoodieSchemaCache` as the explicit intern at 
the per-record call sites.
   
   ```
   org.opentest4j.AssertionFailedError:
   Expected :true
   Actual   :false
       at 
org.apache.hudi.TestHoodieSchemaUtils.testIllegalPromotionsBetweenPrimitives(TestHoodieSchemaUtils.java:167)
   ```
   
   That test deduces the writer schema for a reader record carrying `bytes` at 
four field locations (`rec.simpleField`, `rec.arrayField.element`, 
`rec.mapField.value`, `rec.nestedField.nested`) against a writer carrying `int` 
at the same four, and asserts the resulting 
`SchemaBackwardsCompatibilityException` names all four paths.
   
   `HoodieSchemaCompatibilityChecker` memoizes results in `mMemoizeMap` keyed 
by `ReaderWriter`, which compares the `HoodieSchema` instances by reference:
   
   ```java
   public int hashCode() { return System.identityHashCode(mReader) ^ 
System.identityHashCode(mWriter); }
   public boolean equals(Object obj) { ... return (this.mReader == 
that.mReader) && (this.mWriter == that.mWriter); }
   ```
   
   With `fromAvroSchema` interning, all four `bytes` resolve to one canonical 
instance and all four `int` to one, so the four locations become the same 
`(bytes, int)` identity pair. The first one (`rec.simpleField`) is computed and 
memoized; the other three hit the memo and reuse that result without recording 
their own locations, so the message only lists `rec.simpleField` and the 
assertion fails for the other three. (The debug log shows four "reader bytes 
with writer int" lines because the log sits above the memo check, but 
`calculateCompatibility` only runs once.)
   
   The checker's recursion memo relies on a distinct schema instance per 
occurrence - that is how it tells genuine recursion from a sibling subschema of 
the same shape. The interning is only needed for the per-record read path, not 
this one-shot compatibility check, so keeping it at the explicit call sites 
avoids the problem. We could instead make the checker track locations 
independently of instance identity, but that is a riskier change to a core path 
(the memo also guards against exponential recomputation on shared-subschema 
DAGs) - happy to go that route if you would prefer it.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to