Re: [I] `EntityCache` leaks memory & prone to inconsistencies & race conditions [polaris]

via GitHub Thu, 27 Mar 2025 04:45:09 -0700


dennishuo commented on issue #761:
URL: https://github.com/apache/polaris/issues/761#issuecomment-2757079500


   Sorry, I should've taken a look at this issue earlier before anyone putting 
too much effort into those unittests.
   
   As @ben-manes alludes to, the EntityCache here is actually *not* expected to 
be the component responsible for strict consistency. Perhaps IndexedCache could 
help reduce some of the redundant queries, but I think some of the original 
assumptions filed in this issue don't quite match what the EntityCache is doing.
   
   In particular, the EntityCache doesn't document any expectations about 
coherence between the `byId` and `byName` lookup. On the contrary, some 
comments such as 
   
   
https://github.com/apache/polaris/blob/34dbfcea164d85df95784f2202c56b3ff6c77231/polaris-core/src/main/java/org/apache/polaris/core/persistence/cache/EntityCache.java#L128C1-L133C6
   
       // only update the name key if this entity was not dropped
       if (!cacheEntry.getEntity().isDropped()) {
         // here we don't really care about concurrent update to the key. 
Basically if we are
         // pointing to the wrong entry, we will detect this and fix the issue
         this.byName.put(nameKey, cacheEntry);
       }
   
   and
   
   
https://github.com/apache/polaris/blob/99add28dce82481adcddd862af9f7c3c4c5deeb3/polaris-core/src/main/java/org/apache/polaris/core/persistence/cache/EntityCache.java#L265
   
       // the caller's fetched entity may have come from a stale lookup byName; 
we should consider
       // the existingCacheEntry to be the older of the two for purposes of 
invalidation to make
       // sure when we replaceCacheEntry we're also removing the old name if 
it's no longer valid
       EntityCacheByNameKey nameKey = new 
EntityCacheByNameKey(entityToValidate);
       ResolvedPolarisEntity existingCacheEntryByName = 
this.getEntityByName(nameKey);
   
   seem to indicate that it is deliberate to allow decoherence, and rather than 
having any assumptions about non-multithreaded access, explicitly states that 
we allow decoherence if there was indeed a concurrent access. Since the 
accessor methods for by-name lookup themselves don't do any kind of 
"detection", this is likely alluding to the tight coupling of `Resolver`.
   
   In general, *every* cache hit is always subject to `bulkValidate` in 
`Resolver.java`: 
https://github.com/apache/polaris/blob/34dbfcea164d85df95784f2202c56b3ff6c77231/polaris-core/src/main/java/org/apache/polaris/core/persistence/resolver/Resolver.java#L519
   
   The call to `loadEntitiesChangeTracking` is required to provide a consistent 
snapshot-read of current entity versions:
   
         // now get the current backend versions of all these entities
         ChangeTrackingResult changeTrackingResult =
             this.polarisMetaStoreManager.loadEntitiesChangeTracking(
                 this.polarisCallContext, entityIds);
   
         // refresh any entity which is not fresh. If an entity is missing, 
reload it
         Iterator<ResolvedPolarisEntity> entityIterator = toValidate.iterator();
         Iterator<PolarisChangeTrackingVersions> versionIterator =
             changeTrackingResult.getChangeTrackingVersions().iterator();
   
   It's true that there's probably a memory leak window, if a `byName.remove` 
"incorrectly" races with a `byName.add` that semantically occurred before the 
removal in the main Caffeine cache but ended up executing the secondary-map 
update out of order. This would presumably be a nice benefit of using the 
`IndexedCache` instead.
   
   Overall thoughts:
   
   1. It might be worthwhile to try to run the concurrency tests against the 
`Resolver` layer or at least simulating the control flow of the resolver by 
calling `loadEntitiesChangeTracking` for anything retrieved from the cache 
(though this might end up being less insightful with the FakeMetaStoreManager 
since the point of that call is that the "strictly linear ordering" of the 
versions returned by the MetaStoreManager is the final authority for discarding 
the cache hit). It's certainly possible there are undiscovered bugs in Resolver 
that don't play well with cache race conditions, but the out-of-order or stale 
cache entries here shouldn't be a problem
   2. If we can make IndexedCache work with the by-name key appropriately, it's 
still probably a good way to clean up the code and get better coherence and get 
rid of the rare but sticky memory leak caused by the separate `byName` map.
   3. So far based on the scenarios listed in the test results here, neither 
the `byId`/`byName` decoherence nor out-of-order versions should manifest as 
any kind of correctness issue
   4. Redundant database reads are expected during race conditions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] `EntityCache` leaks memory & prone to inconsistencies & race conditions [polaris]

Reply via email to