samarsajnani opened a new issue, #18740:
URL: https://github.com/apache/druid/issues/18740

   We fetch 42 lookups totaling about 2.2G. The polling period is 5 minutes, 
and each lookup takes about 6 seconds to load with 6 lookupThreads, so the 
lookups should complete well within the polling period. Our heap is 50GB and 
the historicals still go OOM. The system sometimes becomes unstable and is very 
sensitive to database performance. We concluded that the lookups were causing 
the OOM because the system recovered when we split the lookups across multiple 
database replicas. A heap dump also showed the lookups taking most of the 
memory; we analyzed a dump from a smaller sample heap of around 20G rather than 
the 50G heap. We also noticed that once a historical entered the high-memory 
bad state and went OOM, it would keep going OOM over and over, and lookup 
connections would increase significantly.
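
   As a rough sanity check of the timing claim above, here is a minimal 
sketch of the arithmetic. It assumes every lookup takes the same 6 seconds and 
that the 6 threads are always busy; the function name `refresh_wall_time_s` is 
hypothetical and is not part of Druid.

```python
import math


def refresh_wall_time_s(num_lookups: int, seconds_per_lookup: float, threads: int) -> float:
    """Estimate wall-clock time for one refresh cycle.

    Assumption: all lookups take the same time and run in full "waves"
    of `threads` lookups at once. Real load times depend on the database.
    """
    waves = math.ceil(num_lookups / threads)
    return waves * seconds_per_lookup


# Figures taken from the report: 42 lookups, ~6 s each, 6 threads, 5 minute poll.
cycle_s = refresh_wall_time_s(42, 6.0, 6)
poll_period_s = 5 * 60
utilization = cycle_s / poll_period_s

print(f"one refresh cycle: ~{cycle_s:.0f} s of a {poll_period_s} s polling period")
print(f"fraction of the polling period spent loading: {utilization:.0%}")
```

   Under these assumptions a full refresh needs about 42 seconds, roughly 14% 
of the polling period, so the healthy case leaves a large margin. The estimate 
says nothing about what happens when the database slows down.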
   
   ### Affected Version
   
   32.0.0
   
   ### Description
   
   - 28 historicals, 3 broker/routers, 2 coordinators
   - Configs attached
   
[druid.tar.gz](https://github.com/user-attachments/files/23519578/druid.tar.gz)
   - Set up 42 lookups totaling about 2.2G
   - No error messages; the process just keeps restarting and trying to 
reconnect to ZooKeeper, and latencies jump significantly into the tens of 
seconds to minutes
   - Tried multiple changes: increasing and decreasing the Druid lookup 
threads, changing the number of processing threads for historicals, and 
various other configs. Also tried offHeap, but we got the errors listed under 
"Offheap errors", queries failed, and we had to revert
   
   
   Offheap errors:
   ```
   2025-04-10T05:02:43,051 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
   2025-04-10T05:02:43,284 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
   2025-04-10T05:08:09,698 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
   2025-04-10T05:08:54,997 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
   2025-04-10T05:11:28,411 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
   2025-04-10T05:13:24,720 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

