Michael Skells created AVRO-4253:
------------------------------------

             Summary: unbounded memory leak in FastReaderBuilder
                 Key: AVRO-4253
                 URL: https://issues.apache.org/jira/browse/AVRO-4253
             Project: Apache Avro
          Issue Type: Bug
          Components: java
    Affects Versions: 1.12.1
            Reporter: Michael Skells
         Attachments: FastReaderBuilderCacheRetentionReproducer.java

# FastReaderBuilder WeakIdentityHashMap Cache Retention Bug
 
 ## Summary
 
 FastReaderBuilder's internal readerCache uses WeakIdentityHashMap to allow 
cached RecordReader entries to be garbage-collected when their Schema keys 
become unreachable. However, each RecordReader value holds a {*}strong 
reference to a Schema{*}, so if the reader and writer scheam are the same this, 
prevents the weak reference from ever being cleared. 

 The intended eviction behavior of WeakIdentityHashMap never occurs. Entries 
accumulate for the lifetime of the FastReaderBuilder (and its owning 
GenericData instance, which by default is a global static value).
 
 ## Affected Version
 
 Apache Avro 1.12.1 (and likely all versions since FastReaderBuilder was 
introduced).
 
 ## Root Cause
 
 ### Cache structure
 
 java
 // FastReaderBuilder.java
 private final Map<Schema, Map<Schema, RecordReader>> readerCache = Collections
     .synchronizedMap(new WeakIdentityHashMap<>());
 
 
 The outer map keys are *reader schemas* (weak references). The inner map keys 
are *writer schemas* (also weak references). Values are RecordReader instances.
 
 ### RecordReader retains the key
 
 java
 public static class RecordReader implements FieldReader {
     private ExecutionStep[] readSteps;  // strong references to child 
RecordReaders
     private Schema schema;              // ← STRONG reference to the reader 
Schema (the outer key)
     private InstanceSupplier supplier;  // also references the Schema
     // ...
 
     public void finishInitialization(ExecutionStep[] readSteps, Schema schema, 
InstanceSupplier supp) {
         this.readSteps = readSteps;
         this.schema = schema;          // ← stores the same Schema used as the 
weak key
         this.supplier = supp;
         this.stage = Stage.INITIALIZED;
     }
 }
 
 
 The same RecordReader object is both returned to the caller AND stored in the 
cache (via computeIfAbsent). When the caller discards the DatumReader after 
reading, the cache still holds a strong reference to the RecordReader. And 
RecordReader.schema holds a strong reference back to the Schema used as the 
weak key.
 
 This is a classic *weak-key cache antipattern: value references key*. The 
WeakIdentityHashMap can never clear the weak key because the cache's own value 
keeps the key strongly reachable:
 
 
 FastReaderBuilder (application lifetime)
   └─ readerCache (WeakIdentityHashMap)
        └─ entry: weak(Schema_A) → inner map
             └─ entry: weak(Schema_A) → RecordReader  ← cache holds strong ref 
to value
                  └─ schema → Schema_A                 ← value holds strong ref 
back to key
 
 
 No external reference to Schema_A is needed to keep it alive — the cache is 
self-sustaining.
 
 ### Transitive retention for nested schemas
 
 For a record schema with nested sub-records, initializeRecordReader() 
recursively calls createRecordReader() for each child RECORD field. Each child 
RecordReader is captured inside an ExecutionStep lambda (via 
createFieldSetter()), and each child holds its own schema strong reference.
 
 For a schema with N nesting levels:
 
 
 readerCache:
   weak→ Level_0_Schema → Map:
     weak→ Level_0_Writer → RecordReader_0
       ├─ schema → Level_0_Schema      ← strong ref pinning the outer weak key
       └─ readSteps[child_field] → ExecutionStep (lambda)
            └─ captures RecordReader_1
                 ├─ schema → Level_1_Schema  ← pinning Level_1's weak key
                 └─ readSteps[child_field] → ExecutionStep (lambda)
                      └─ captures RecordReader_2
                           └─ ... (N levels deep)
 
 
 The entire N-level chain is pinned. WeakIdentityHashMap.reap() will never 
remove any of these entries.
 
 ### Impact on WeakIdentityHashMap.reap()
 
 reap() polls the ReferenceQueue for cleared weak references:
 
 java
 private synchronized void reap() {
     Object zombie = queue.poll();
     while (zombie != null) {
         IdentityWeakReference victim = (IdentityWeakReference) zombie;
         backingStore.remove(victim);
         zombie = queue.poll();
     }
 }
 
 
 Since RecordReader.schema keeps the key Schema strongly reachable, the 
IdentityWeakReference is never enqueued → reap() never removes entries → cache 
grows monotonically.
 
 ## Reproduction
 
 See the attached standalone Java file 
FastReaderBuilderCacheRetentionReproducer.java which:
 
 1. Creates a 5-level nested schema (100 fields total)
 2. Calls FastReaderBuilder.createDatumReader(schema) 10 times with 
freshly-parsed Schema objects (different identity, same structure — simulating 
independent file header parsing)
 3. Releases all Schema references and forces GC
 4. Observes that both outer and inner cache entries are retained (schemas are 
NOT collected)
 
 *Key detail:* The single-arg createDatumReader(schema) uses the same Schema 
identity for both writer and reader cache keys. RecordReader.schema then holds 
a strong reference to that same object, creating a cycle that prevents either 
weak key from being cleared.
 
 When writer and reader are different Schema identities (as happens when 
DataFileStream parses its own header schema), the writer schema can be GC'd 
independently, breaking the chain. This reproducer uses the same-identity path 
to demonstrate the retention.
 
 Compile and run:
 
 bash
 javac -cp avro-1.12.1.jar FastReaderBuilderCacheRetentionReproducer.java
 java -cp .:avro-1.12.1.jar FastReaderBuilderCacheRetentionReproducer
 
 
 Expected output:
 
 
 After schema 1: outer=5 inner=5
 After schema 2: outer=10 inner=10
 ...
 After schema 10: outer=50 inner=50
 
 --- Forcing GC (5 cycles with reap) ---
 
 After GC:
   Outer cache entries (reader schemas): 50
   Inner cache entries (writer schemas):    50
   Schema WeakRefs cleared by GC:           0 / 10
 
   Expected if cache eviction works:
     outer=0, inner=0, schemas collected=10
   Expected with retention bug:
     outer=10, inner=50, schemas collected=0
 
 BUG CONFIRMED: Cache retained entries after GC.
   RecordReader.schema holds strong refs to Schema keys,
   preventing WeakIdentityHashMap from clearing entries.
   Schema objects are NOT being collected (pinned by cache values).
 
 ## Suggested Fix

Maybe rework to remove the svhema reference or make it weak (I dont know if 
that would work though)

 ## Workarounds
 
 Applications can mitigate this by:
 
 1. *Schema interning* — ensure all Schema objects passed to FastReaderBuilder 
are canonical (same identity for structurally equal schemas). This makes the 
cache bounded to the number of distinct schemas, which is typically small. The 
entries are never evicted but the memory is bounded.
 
 2. *Per-task GenericData instances* — create a new GenericData (and thus new 
FastReaderBuilder) for each batch of work, and discard it when done. The entire 
cache is collected when the GenericData becomes unreachable.
 
 3. *Disable FastReader* — set GenericData.setFastReaderEnabled(false) and rely 
on the standard ResolvingDecoder path, which uses a separate 
ThreadLocal<WeakIdentityHashMap> cache (same retention issue but scoped to 
thread lifetime).

(1) is accomplisted by https://issues.apache.org/jira/browse/AVRO-4249 / 
https://github.com/apache/avro/pull/3746

What I did to fully mitigate was the above + (2)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to