[ 
https://issues.apache.org/jira/browse/AVRO-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Skells updated AVRO-4253:
---------------------------------
    Description: 
# FastReaderBuilder WeakIdentityHashMap Cache Retention Bug
 
 ## Summary
 
 FastReaderBuilder's internal readerCache uses WeakIdentityHashMap to allow 
cached RecordReader entries to be garbage-collected when their Schema keys 
become unreachable. However, each RecordReader value holds a {*}strong 
reference to a Schema{*}, so if the reader and writer scheam are the same this, 
prevents the weak reference from ever being cleared. 

 The intended eviction behavior of WeakIdentityHashMap never occurs. Entries 
accumulate for the lifetime of the FastReaderBuilder (and its owning 
GenericData instance, which by default is a global static value).
 
 ## Affected Version
 
 Apache Avro 1.12.1 (and likely all versions since FastReaderBuilder was 
introduced).
 
 ## Root Cause
 
 ### Cache structure
 
 java
 // FastReaderBuilder.java
 private final Map<Schema, Map<Schema, RecordReader>> readerCache = Collections
     .synchronizedMap(new WeakIdentityHashMap<>());
 
 
 The outer map keys are *reader schemas* (weak references). The inner map keys 
are *writer schemas* (also weak references). Values are RecordReader instances.
 
 ### RecordReader retains the key
 
 java
 public static class RecordReader implements FieldReader {
     private ExecutionStep[] readSteps;  // strong references to child 
RecordReaders
     private Schema schema;              // ← STRONG reference to the reader 
Schema (the outer key)
     private InstanceSupplier supplier;  // also references the Schema
     // ...
 
     public void finishInitialization(ExecutionStep[] readSteps, Schema schema, 
InstanceSupplier supp) {
         this.readSteps = readSteps;
         this.schema = schema;          // ← stores the same Schema used as the 
weak key
         this.supplier = supp;
         this.stage = Stage.INITIALIZED;
     }
 }
 
 
 The same RecordReader object is both returned to the caller AND stored in the 
cache (via computeIfAbsent). When the caller discards the DatumReader after 
reading, the cache still holds a strong reference to the RecordReader. And 
RecordReader.schema holds a strong reference back to the Schema used as the 
weak key.
 
 This is a classic *weak-key cache antipattern: value references key*. The 
WeakIdentityHashMap can never clear the weak key because the cache's own value 
keeps the key strongly reachable:
 
 
 FastReaderBuilder (application lifetime)
   └─ readerCache (WeakIdentityHashMap)
        └─ entry: weak(Schema_A) → inner map
             └─ entry: weak(Schema_A) → RecordReader  ← cache holds strong ref 
to value
                  └─ schema → Schema_A                 ← value holds strong ref 
back to key
 
 
 No external reference to Schema_A is needed to keep it alive — the cache is 
self-sustaining.
 
 ### Transitive retention for nested schemas
 
 For a record schema with nested sub-records, initializeRecordReader() 
recursively calls createRecordReader() for each child RECORD field. Each child 
RecordReader is captured inside an ExecutionStep lambda (via 
createFieldSetter()), and each child holds its own schema strong reference.
 
 For a schema with N nesting levels:
 
 
 readerCache:
   weak→ Level_0_Schema → Map:
     weak→ Level_0_Writer → RecordReader_0
       ├─ schema → Level_0_Schema      ← strong ref pinning the outer weak key
       └─ readSteps[child_field] → ExecutionStep (lambda)
            └─ captures RecordReader_1
                 ├─ schema → Level_1_Schema  ← pinning Level_1's weak key
                 └─ readSteps[child_field] → ExecutionStep (lambda)
                      └─ captures RecordReader_2
                           └─ ... (N levels deep)
 
 
 The entire N-level chain is pinned. WeakIdentityHashMap.reap() will never 
remove any of these entries.
 
 ### Impact on WeakIdentityHashMap.reap()
 
 reap() polls the ReferenceQueue for cleared weak references:
 
 java
 private synchronized void reap() {
     Object zombie = queue.poll();
     while (zombie != null) {
         IdentityWeakReference victim = (IdentityWeakReference) zombie;
         backingStore.remove(victim);
         zombie = queue.poll();
     }
 }
 
 
 Since RecordReader.schema keeps the key Schema strongly reachable, the 
IdentityWeakReference is never enqueued → reap() never removes entries → cache 
grows monotonically.
 
 ## Reproduction
 
 See the attached standalone Java file 
FastReaderBuilderCacheRetentionReproducer.java which:
 
 1. Creates a 5-level nested schema (100 fields total)
 2. Calls FastReaderBuilder.createDatumReader(schema) 10 times with 
freshly-parsed Schema objects (different identity, same structure — simulating 
independent file header parsing)
 3. Releases all Schema references and forces GC
 4. Observes that both outer and inner cache entries are retained (schemas are 
NOT collected)
 
 *Key detail:* The single-arg createDatumReader(schema) uses the same Schema 
identity for both writer and reader cache keys. RecordReader.schema then holds 
a strong reference to that same object, creating a cycle that prevents either 
weak key from being cleared.
 
 When writer and reader are different Schema identities (as happens when 
DataFileStream parses its own header schema), the writer schema can be GC'd 
independently, breaking the chain. This reproducer uses the same-identity path 
to demonstrate the retention.
 
 Compile and run:
 
 bash
 javac -cp avro-1.12.1.jar FastReaderBuilderCacheRetentionReproducer.java
 java -cp .:avro-1.12.1.jar FastReaderBuilderCacheRetentionReproducer
 
 
 Expected output:
 
 
 After schema 1: outer=5 inner=5
 After schema 2: outer=10 inner=10
 ...
 After schema 10: outer=50 inner=50
 
 --- Forcing GC (5 cycles with reap) ---
 
 After GC:
   Outer cache entries (reader schemas): 50
   Inner cache entries (writer schemas):    50
   Schema WeakRefs cleared by GC:           0 / 10
 
   Expected if cache eviction works:
     outer=0, inner=0, schemas collected=10
   Expected with retention bug:
     outer=10, inner=50, schemas collected=0
 
 BUG CONFIRMED: Cache retained entries after GC.
   RecordReader.schema holds strong refs to Schema keys,
   preventing WeakIdentityHashMap from clearing entries.
   Schema objects are NOT being collected (pinned by cache values).
 
 ## Suggested Fix

If the 2 schemas are the same object then deep clone the reader schema. 
It cant be a shallow clone as this would only free one level of the cache, 
unless we make the cache removal remove the whole tree (currently the tree gets 
removed by the weak reference).
A Shalow clone would mean the you need n cycles of GC, finalisation, map access 
to clear the cache

Maybe rework to remove the schema reference or make it weak (I dont know if 
that would work though. Probable too hard to consider)

 ## Workarounds
 
 Applications can mitigate this by:
 
 1. *Schema interning* — ensure all Schema objects passed to FastReaderBuilder 
are canonical (same identity for structurally equal schemas). This makes the 
cache bounded to the number of distinct schemas, which is typically small. The 
entries are never evicted but the memory is bounded.
 
 2. *Per-task GenericData instances* — create a new GenericData (and thus new 
FastReaderBuilder) for each batch of work, and discard it when done. The entire 
cache is collected when the GenericData becomes unreachable.
 
 3. *Disable FastReader* — set GenericData.setFastReaderEnabled(false) and rely 
on the standard ResolvingDecoder path, which uses a separate 
ThreadLocal<WeakIdentityHashMap> cache (same retention issue but scoped to 
thread lifetime).

(1) is accomplisted by https://issues.apache.org/jira/browse/AVRO-4249 / 
https://github.com/apache/avro/pull/3746
What I did to fully mitigate was the above, together with (2) - periodic 
recycling of the GenericData (and therefore its FastDataBuilder)

  was:
# FastReaderBuilder WeakIdentityHashMap Cache Retention Bug
 
 ## Summary
 
 FastReaderBuilder's internal readerCache uses WeakIdentityHashMap to allow 
cached RecordReader entries to be garbage-collected when their Schema keys 
become unreachable. However, each RecordReader value holds a {*}strong 
reference to a Schema{*}, so if the reader and writer scheam are the same this, 
prevents the weak reference from ever being cleared. 

 The intended eviction behavior of WeakIdentityHashMap never occurs. Entries 
accumulate for the lifetime of the FastReaderBuilder (and its owning 
GenericData instance, which by default is a global static value).
 
 ## Affected Version
 
 Apache Avro 1.12.1 (and likely all versions since FastReaderBuilder was 
introduced).
 
 ## Root Cause
 
 ### Cache structure
 
 java
 // FastReaderBuilder.java
 private final Map<Schema, Map<Schema, RecordReader>> readerCache = Collections
     .synchronizedMap(new WeakIdentityHashMap<>());
 
 
 The outer map keys are *reader schemas* (weak references). The inner map keys 
are *writer schemas* (also weak references). Values are RecordReader instances.
 
 ### RecordReader retains the key
 
 java
 public static class RecordReader implements FieldReader {
     private ExecutionStep[] readSteps;  // strong references to child 
RecordReaders
     private Schema schema;              // ← STRONG reference to the reader 
Schema (the outer key)
     private InstanceSupplier supplier;  // also references the Schema
     // ...
 
     public void finishInitialization(ExecutionStep[] readSteps, Schema schema, 
InstanceSupplier supp) {
         this.readSteps = readSteps;
         this.schema = schema;          // ← stores the same Schema used as the 
weak key
         this.supplier = supp;
         this.stage = Stage.INITIALIZED;
     }
 }
 
 
 The same RecordReader object is both returned to the caller AND stored in the 
cache (via computeIfAbsent). When the caller discards the DatumReader after 
reading, the cache still holds a strong reference to the RecordReader. And 
RecordReader.schema holds a strong reference back to the Schema used as the 
weak key.
 
 This is a classic *weak-key cache antipattern: value references key*. The 
WeakIdentityHashMap can never clear the weak key because the cache's own value 
keeps the key strongly reachable:
 
 
 FastReaderBuilder (application lifetime)
   └─ readerCache (WeakIdentityHashMap)
        └─ entry: weak(Schema_A) → inner map
             └─ entry: weak(Schema_A) → RecordReader  ← cache holds strong ref 
to value
                  └─ schema → Schema_A                 ← value holds strong ref 
back to key
 
 
 No external reference to Schema_A is needed to keep it alive — the cache is 
self-sustaining.
 
 ### Transitive retention for nested schemas
 
 For a record schema with nested sub-records, initializeRecordReader() 
recursively calls createRecordReader() for each child RECORD field. Each child 
RecordReader is captured inside an ExecutionStep lambda (via 
createFieldSetter()), and each child holds its own schema strong reference.
 
 For a schema with N nesting levels:
 
 
 readerCache:
   weak→ Level_0_Schema → Map:
     weak→ Level_0_Writer → RecordReader_0
       ├─ schema → Level_0_Schema      ← strong ref pinning the outer weak key
       └─ readSteps[child_field] → ExecutionStep (lambda)
            └─ captures RecordReader_1
                 ├─ schema → Level_1_Schema  ← pinning Level_1's weak key
                 └─ readSteps[child_field] → ExecutionStep (lambda)
                      └─ captures RecordReader_2
                           └─ ... (N levels deep)
 
 
 The entire N-level chain is pinned. WeakIdentityHashMap.reap() will never 
remove any of these entries.
 
 ### Impact on WeakIdentityHashMap.reap()
 
 reap() polls the ReferenceQueue for cleared weak references:
 
 java
 private synchronized void reap() {
     Object zombie = queue.poll();
     while (zombie != null) {
         IdentityWeakReference victim = (IdentityWeakReference) zombie;
         backingStore.remove(victim);
         zombie = queue.poll();
     }
 }
 
 
 Since RecordReader.schema keeps the key Schema strongly reachable, the 
IdentityWeakReference is never enqueued → reap() never removes entries → cache 
grows monotonically.
 
 ## Reproduction
 
 See the attached standalone Java file 
FastReaderBuilderCacheRetentionReproducer.java which:
 
 1. Creates a 5-level nested schema (100 fields total)
 2. Calls FastReaderBuilder.createDatumReader(schema) 10 times with 
freshly-parsed Schema objects (different identity, same structure — simulating 
independent file header parsing)
 3. Releases all Schema references and forces GC
 4. Observes that both outer and inner cache entries are retained (schemas are 
NOT collected)
 
 *Key detail:* The single-arg createDatumReader(schema) uses the same Schema 
identity for both writer and reader cache keys. RecordReader.schema then holds 
a strong reference to that same object, creating a cycle that prevents either 
weak key from being cleared.
 
 When writer and reader are different Schema identities (as happens when 
DataFileStream parses its own header schema), the writer schema can be GC'd 
independently, breaking the chain. This reproducer uses the same-identity path 
to demonstrate the retention.
 
 Compile and run:
 
 bash
 javac -cp avro-1.12.1.jar FastReaderBuilderCacheRetentionReproducer.java
 java -cp .:avro-1.12.1.jar FastReaderBuilderCacheRetentionReproducer
 
 
 Expected output:
 
 
 After schema 1: outer=5 inner=5
 After schema 2: outer=10 inner=10
 ...
 After schema 10: outer=50 inner=50
 
 --- Forcing GC (5 cycles with reap) ---
 
 After GC:
   Outer cache entries (reader schemas): 50
   Inner cache entries (writer schemas):    50
   Schema WeakRefs cleared by GC:           0 / 10
 
   Expected if cache eviction works:
     outer=0, inner=0, schemas collected=10
   Expected with retention bug:
     outer=10, inner=50, schemas collected=0
 
 BUG CONFIRMED: Cache retained entries after GC.
   RecordReader.schema holds strong refs to Schema keys,
   preventing WeakIdentityHashMap from clearing entries.
   Schema objects are NOT being collected (pinned by cache values).
 
 ## Suggested Fix

Maybe rework to remove the svhema reference or make it weak (I dont know if 
that would work though)

 ## Workarounds
 
 Applications can mitigate this by:
 
 1. *Schema interning* — ensure all Schema objects passed to FastReaderBuilder 
are canonical (same identity for structurally equal schemas). This makes the 
cache bounded to the number of distinct schemas, which is typically small. The 
entries are never evicted but the memory is bounded.
 
 2. *Per-task GenericData instances* — create a new GenericData (and thus new 
FastReaderBuilder) for each batch of work, and discard it when done. The entire 
cache is collected when the GenericData becomes unreachable.
 
 3. *Disable FastReader* — set GenericData.setFastReaderEnabled(false) and rely 
on the standard ResolvingDecoder path, which uses a separate 
ThreadLocal<WeakIdentityHashMap> cache (same retention issue but scoped to 
thread lifetime).

(1) is accomplisted by https://issues.apache.org/jira/browse/AVRO-4249 / 
https://github.com/apache/avro/pull/3746
What I did to fully mitigate was the above, together with (2) - periodic 
recycling of the GenericData (and therefore its FastDataBuilder)


> unbounded memory leak in FastReaderBuilder
> ------------------------------------------
>
>                 Key: AVRO-4253
>                 URL: https://issues.apache.org/jira/browse/AVRO-4253
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.12.1
>            Reporter: Michael Skells
>            Priority: Critical
>              Labels: performance
>         Attachments: FastReaderBuilderCacheRetentionReproducer.java
>
>
> # FastReaderBuilder WeakIdentityHashMap Cache Retention Bug
>  
>  ## Summary
>  
>  FastReaderBuilder's internal readerCache uses WeakIdentityHashMap to allow 
> cached RecordReader entries to be garbage-collected when their Schema keys 
> become unreachable. However, each RecordReader value holds a {*}strong 
> reference to a Schema{*}, so if the reader and writer scheam are the same 
> this, prevents the weak reference from ever being cleared. 
>  The intended eviction behavior of WeakIdentityHashMap never occurs. Entries 
> accumulate for the lifetime of the FastReaderBuilder (and its owning 
> GenericData instance, which by default is a global static value).
>  
>  ## Affected Version
>  
>  Apache Avro 1.12.1 (and likely all versions since FastReaderBuilder was 
> introduced).
>  
>  ## Root Cause
>  
>  ### Cache structure
>  
>  java
>  // FastReaderBuilder.java
>  private final Map<Schema, Map<Schema, RecordReader>> readerCache = 
> Collections
>      .synchronizedMap(new WeakIdentityHashMap<>());
>  
>  
>  The outer map keys are *reader schemas* (weak references). The inner map 
> keys are *writer schemas* (also weak references). Values are RecordReader 
> instances.
>  
>  ### RecordReader retains the key
>  
>  java
>  public static class RecordReader implements FieldReader {
>      private ExecutionStep[] readSteps;  // strong references to child 
> RecordReaders
>      private Schema schema;              // ← STRONG reference to the reader 
> Schema (the outer key)
>      private InstanceSupplier supplier;  // also references the Schema
>      // ...
>  
>      public void finishInitialization(ExecutionStep[] readSteps, Schema 
> schema, InstanceSupplier supp) {
>          this.readSteps = readSteps;
>          this.schema = schema;          // ← stores the same Schema used as 
> the weak key
>          this.supplier = supp;
>          this.stage = Stage.INITIALIZED;
>      }
>  }
>  
>  
>  The same RecordReader object is both returned to the caller AND stored in 
> the cache (via computeIfAbsent). When the caller discards the DatumReader 
> after reading, the cache still holds a strong reference to the RecordReader. 
> And RecordReader.schema holds a strong reference back to the Schema used as 
> the weak key.
>  
>  This is a classic *weak-key cache antipattern: value references key*. The 
> WeakIdentityHashMap can never clear the weak key because the cache's own 
> value keeps the key strongly reachable:
>  
>  
>  FastReaderBuilder (application lifetime)
>    └─ readerCache (WeakIdentityHashMap)
>         └─ entry: weak(Schema_A) → inner map
>              └─ entry: weak(Schema_A) → RecordReader  ← cache holds strong 
> ref to value
>                   └─ schema → Schema_A                 ← value holds strong 
> ref back to key
>  
>  
>  No external reference to Schema_A is needed to keep it alive — the cache is 
> self-sustaining.
>  
>  ### Transitive retention for nested schemas
>  
>  For a record schema with nested sub-records, initializeRecordReader() 
> recursively calls createRecordReader() for each child RECORD field. Each 
> child RecordReader is captured inside an ExecutionStep lambda (via 
> createFieldSetter()), and each child holds its own schema strong reference.
>  
>  For a schema with N nesting levels:
>  
>  
>  readerCache:
>    weak→ Level_0_Schema → Map:
>      weak→ Level_0_Writer → RecordReader_0
>        ├─ schema → Level_0_Schema      ← strong ref pinning the outer weak key
>        └─ readSteps[child_field] → ExecutionStep (lambda)
>             └─ captures RecordReader_1
>                  ├─ schema → Level_1_Schema  ← pinning Level_1's weak key
>                  └─ readSteps[child_field] → ExecutionStep (lambda)
>                       └─ captures RecordReader_2
>                            └─ ... (N levels deep)
>  
>  
>  The entire N-level chain is pinned. WeakIdentityHashMap.reap() will never 
> remove any of these entries.
>  
>  ### Impact on WeakIdentityHashMap.reap()
>  
>  reap() polls the ReferenceQueue for cleared weak references:
>  
>  java
>  private synchronized void reap() {
>      Object zombie = queue.poll();
>      while (zombie != null) {
>          IdentityWeakReference victim = (IdentityWeakReference) zombie;
>          backingStore.remove(victim);
>          zombie = queue.poll();
>      }
>  }
>  
>  
>  Since RecordReader.schema keeps the key Schema strongly reachable, the 
> IdentityWeakReference is never enqueued → reap() never removes entries → 
> cache grows monotonically.
>  
>  ## Reproduction
>  
>  See the attached standalone Java file 
> FastReaderBuilderCacheRetentionReproducer.java which:
>  
>  1. Creates a 5-level nested schema (100 fields total)
>  2. Calls FastReaderBuilder.createDatumReader(schema) 10 times with 
> freshly-parsed Schema objects (different identity, same structure — 
> simulating independent file header parsing)
>  3. Releases all Schema references and forces GC
>  4. Observes that both outer and inner cache entries are retained (schemas 
> are NOT collected)
>  
>  *Key detail:* The single-arg createDatumReader(schema) uses the same Schema 
> identity for both writer and reader cache keys. RecordReader.schema then 
> holds a strong reference to that same object, creating a cycle that prevents 
> either weak key from being cleared.
>  
>  When writer and reader are different Schema identities (as happens when 
> DataFileStream parses its own header schema), the writer schema can be GC'd 
> independently, breaking the chain. This reproducer uses the same-identity 
> path to demonstrate the retention.
>  
>  Compile and run:
>  
>  bash
>  javac -cp avro-1.12.1.jar FastReaderBuilderCacheRetentionReproducer.java
>  java -cp .:avro-1.12.1.jar FastReaderBuilderCacheRetentionReproducer
>  
>  
>  Expected output:
>  
>  
>  After schema 1: outer=5 inner=5
>  After schema 2: outer=10 inner=10
>  ...
>  After schema 10: outer=50 inner=50
>  
>  --- Forcing GC (5 cycles with reap) ---
>  
>  After GC:
>    Outer cache entries (reader schemas): 50
>    Inner cache entries (writer schemas):    50
>    Schema WeakRefs cleared by GC:           0 / 10
>  
>    Expected if cache eviction works:
>      outer=0, inner=0, schemas collected=10
>    Expected with retention bug:
>      outer=10, inner=50, schemas collected=0
>  
>  BUG CONFIRMED: Cache retained entries after GC.
>    RecordReader.schema holds strong refs to Schema keys,
>    preventing WeakIdentityHashMap from clearing entries.
>    Schema objects are NOT being collected (pinned by cache values).
>  
>  ## Suggested Fix
> If the 2 schemas are the same object then deep clone the reader schema. 
> It cant be a shallow clone as this would only free one level of the cache, 
> unless we make the cache removal remove the whole tree (currently the tree 
> gets removed by the weak reference).
> A Shalow clone would mean the you need n cycles of GC, finalisation, map 
> access to clear the cache
> Maybe rework to remove the schema reference or make it weak (I dont know if 
> that would work though. Probable too hard to consider)
>  ## Workarounds
>  
>  Applications can mitigate this by:
>  
>  1. *Schema interning* — ensure all Schema objects passed to 
> FastReaderBuilder are canonical (same identity for structurally equal 
> schemas). This makes the cache bounded to the number of distinct schemas, 
> which is typically small. The entries are never evicted but the memory is 
> bounded.
>  
>  2. *Per-task GenericData instances* — create a new GenericData (and thus new 
> FastReaderBuilder) for each batch of work, and discard it when done. The 
> entire cache is collected when the GenericData becomes unreachable.
>  
>  3. *Disable FastReader* — set GenericData.setFastReaderEnabled(false) and 
> rely on the standard ResolvingDecoder path, which uses a separate 
> ThreadLocal<WeakIdentityHashMap> cache (same retention issue but scoped to 
> thread lifetime).
> (1) is accomplisted by https://issues.apache.org/jira/browse/AVRO-4249 / 
> https://github.com/apache/avro/pull/3746
> What I did to fully mitigate was the above, together with (2) - periodic 
> recycling of the GenericData (and therefore its FastDataBuilder)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to