voonhous commented on code in PR #18967:
URL: https://github.com/apache/hudi/pull/18967#discussion_r3426421491
##########
hudi-common/src/main/java/org/apache/hudi/avro/AvroRecordContext.java:
##########
@@ -70,7 +71,10 @@ public AvroRecordContext() {
public static Object getFieldValueFromIndexedRecord(
IndexedRecord record,
String fieldName) {
- HoodieSchema currentSchema =
HoodieSchema.fromAvroSchema(record.getSchema());
+ // Interning returns the canonical wrapper for this schema, whose lazily
built field list and
+ // field map survive across calls, so the per-record cost is a cache hit
instead of an
+ // O(schema width) wrapper rebuild.
+ HoodieSchema currentSchema =
AvroToHoodieSchemaCache.intern(record.getSchema());
Review Comment:
I tried folding the cache into `HoodieSchema.fromAvroSchema`, but it breaks
`TestHoodieSchemaUtils#testIllegalPromotionsBetweenPrimitives`, so I'm
reverting it and keeping `AvroToHoodieSchemaCache` as the explicit intern at
the per-record call sites.
```
org.opentest4j.AssertionFailedError:
Expected :true
Actual :false
at
org.apache.hudi.TestHoodieSchemaUtils.testIllegalPromotionsBetweenPrimitives(TestHoodieSchemaUtils.java:167)
```
That test deduces the writer schema for a reader record carrying `bytes` at
four field locations (`rec.simpleField`, `rec.arrayField.element`,
`rec.mapField.value`, `rec.nestedField.nested`) against a writer carrying `int`
at the same four, and asserts the resulting
`SchemaBackwardsCompatibilityException` names all four paths.
`HoodieSchemaCompatibilityChecker` memoizes results in `mMemoizeMap` keyed
by `ReaderWriter`, which compares the `HoodieSchema` instances by reference:
```java
public int hashCode() { return System.identityHashCode(mReader) ^
System.identityHashCode(mWriter); }
public boolean equals(Object obj) { ... return (this.mReader ==
that.mReader) && (this.mWriter == that.mWriter); }
```
With `fromAvroSchema` interning, all four `bytes` resolve to one canonical
instance and all four `int` to one, so the four locations become the same
`(bytes, int)` identity pair. The first one (`rec.simpleField`) is computed and
memoized; the other three hit the memo and reuse that result without recording
their own locations, so the message only lists `rec.simpleField` and the
assertion fails for the other three. (The debug log shows four "reader bytes
with writer int" lines because the log sits above the memo check, but
`calculateCompatibility` only runs once.)
The checker's recursion memo relies on a distinct schema instance per
occurrence - that is how it tells genuine recursion from a sibling subschema of
the same shape. The interning is only needed for the per-record read path, not
this one-shot compatibility check, so keeping it at the explicit call sites
avoids the problem. We could instead make the checker track locations
independently of instance identity, but that is a riskier change to a core path
(the memo also guards against exponential recomputation on shared-subschema
DAGs) - happy to go that route if you would prefer it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]