divijvaidya commented on PR #13850: URL: https://github.com/apache/kafka/pull/13850#issuecomment-1594369804
Thank you for looking into this @satishd

> But we observed an impact when there is a degradation in our remote storage (HDFS) clusters.

Yes, you are right: the impact of this lock contention is felt when fetching from the RSM is slow. We observed something similar.

> Another way is to change the current LinkedHashMap to a synchronized map and use a slot based lock instead of a global lock. This workaround is less performant than Caffeine but it does not introduce a new dependency.

I understand what you are saying, and I did consider that approach. So that we are on the same page, it looks like this:

```
// Thread-safe map that holds one lock per entry.
// Keys are removed from this map when the corresponding entry is evicted from the cache.
val entryLockMap = ConcurrentHashMap[Key, Lock]()

// Internal LRU cache based on LinkedHashMap. LinkedHashMap is made thread safe by
// synchronising all of its operations using SynchronizedCache (from org.apache.kafka.common.cache).
val internalCache = SynchronizedCache(new LinkedHashMap())

def getEntry(key) = {
  // Return immediately if the key is already in the cache.
  if (internalCache.contains(key)) {
    return internalCache.get(key)
  }

  // We have to load the entry: acquire a per-entry lock, fetch it from remote
  // storage, then update the cache.
  entryLockMap.computeIfAbsent(key, _ => new ReentrantLock())

  // Acquire exclusive lock on the entry.
  entryLockMap.get(key).lock()
  try {
    // After acquiring the lock, check again for the key: another thread may have
    // populated the cache while this thread was waiting for the lock.
    if (internalCache.contains(key)) {
      return internalCache.get(key)
    }
    val entry = fetchFromRSM(key)
    // No extra lock needed here since get/put on SynchronizedCache are synchronized.
    internalCache.put(key, entry)
    entry
  } finally {
    entryLockMap.get(key).unlock()
  }
}
```

The advantages of this approach are:
1. No external dependency.
2. It follows the same cache pattern used elsewhere in the Kafka code (see `org.apache.kafka.common.cache.SynchronizedCache`).
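To make the pattern concrete, here is a minimal runnable version using only the JDK. This is a sketch, not the PR's code: the `String` key/value types and `fetchFromRemote` are hypothetical stand-ins for the real entry types and the RSM fetch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class PerKeyLockCache {
    private static final int MAX_ENTRIES = 128;

    // LRU cache: an access-order LinkedHashMap, guarded by synchronized blocks
    // (the role SynchronizedCache plays in the pseudocode above).
    private final Map<String, String> cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
            return size() > MAX_ENTRIES;
        }
    };

    // One lock per key, so that concurrent misses for the same key trigger a
    // single remote fetch instead of several.
    private final ConcurrentHashMap<String, ReentrantLock> entryLocks = new ConcurrentHashMap<>();

    public String getEntry(String key) {
        synchronized (cache) {                     // fast path: already cached
            String v = cache.get(key);
            if (v != null) return v;
        }
        ReentrantLock lock = entryLocks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            synchronized (cache) {                 // re-check: another thread may have
                String v = cache.get(key);         // loaded the entry while we waited
                if (v != null) return v;
            }
            // The slow fetch happens while holding only the per-key lock,
            // never the global cache lock.
            String fetched = fetchFromRemote(key);
            synchronized (cache) {
                cache.put(key, fetched);
            }
            return fetched;
        } finally {
            lock.unlock();
        }
    }

    // Hypothetical stand-in for the slow RSM fetch.
    String fetchFromRemote(String key) {
        return "value-for-" + key;
    }
}
```

Note that the global lock is still taken for every `get`/`put`, which is exactly the contention point discussed below.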
But `SynchronizedCache` was written in 2015, when better alternatives perhaps didn't exist. The disadvantages of this approach are:
1. Cache metrics would have to be computed manually.
2. Cache eviction is done synchronously, and hence holds the global cache lock.
3. In the absence of a lock that ensures fairness, synchronising the global cache means a fetch thread may get starved, leading to high tail latencies. Such a scenario occurs under a heavy `get()` load, where the lock keeps prioritising other `get()` threads ahead of the `put()` thread.

That is why I am not a big fan of using data structures with coarse-grained global locks. Caffeine, on the other hand, has none of the disadvantages above, is used in high-throughput, low-latency systems such as Cassandra, and has a steady release cadence with great support. Hence, I believe that using Caffeine is the right choice here. (I know you already agreed to this, but I am adding context for other readers of the PR.)
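For comparison, a Caffeine-based cache collapses all of the above into a few lines. This is only a sketch, assuming the `com.github.benmanes.caffeine:caffeine` dependency on the classpath and a hypothetical `fetchFromRSM` loader; it is not the code from this PR.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class CaffeineSketch {
    // Hypothetical stand-in for the remote storage fetch.
    static String fetchFromRSM(String key) {
        return "value-for-" + key;
    }

    // Caffeine handles per-key loading internally: concurrent get() calls for
    // a missing key block only each other, not the whole cache, and eviction
    // bookkeeping is amortised instead of holding a global lock.
    static final LoadingCache<String, String> cache = Caffeine.newBuilder()
            .maximumSize(1024)
            .recordStats()          // hit/miss metrics come for free
            .build(CaffeineSketch::fetchFromRSM);
}
```

A caller simply invokes `cache.get(key)`; the loader runs at most once per missing key.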