samarsajnani opened a new issue, #18740: URL: https://github.com/apache/druid/issues/18740
We are fetching 42 lookups totaling about 2.2G. The polling period is 5 minutes, and each lookup takes about 6 seconds to load with 6 lookupThreads, so the lookups should complete well within the polling period. Our heap is 50GB and the historicals still go OOM. The system sometimes becomes unstable and is very sensitive to database performance.

We concluded that the lookups were causing the OOM because the historicals recovered when we split the lookups across multiple database replicas. A heap dump also showed the lookups taking most of the memory. We took that dump from a smaller sample heap of around 20G instead of analyzing a 50G heap.

We also noticed a pattern: once a historical entered the high-memory bad state and went OOM, it kept going OOM over and over, and the number of lookup connections increased significantly.

### Affected Version

32.0.0

### Description

- 28 historicals, 3 broker/routers, 2 coordinators
- Configs attached: [druid.tar.gz](https://github.com/user-attachments/files/23519578/druid.tar.gz)
- Lookups set up with about 2.2G total across 42 lookups
- No error messages. We just see the process keep restarting and trying to reconnect to ZooKeeper, and latencies jump significantly, into the tens of seconds to minutes
- Tried multiple changes: increasing and decreasing the Druid lookup threads, changing the number of processing threads for historicals, and various other configs. We also tried offHeap, but got the errors below, queries failed, and we had to revert

Offheap errors:

```
2025-04-10T05:02:43,051 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
2025-04-10T05:02:43,284 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
2025-04-10T05:08:09,698 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
2025-04-10T05:08:54,997 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
2025-04-10T05:11:28,411 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
2025-04-10T05:13:24,720 ERROR [Cleaner-0] org.apache.druid.server.lookup.namespace.cache.OffHeapNamespaceExtractionCacheManager - OffHeapNamespaceExtractionCacheManager.disposeCache() was not called, disposed resources by the JVM
```
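
The claim that the lookups should finish well within the polling period can be sanity-checked with a rough calculation. This is only a sketch under simplifying assumptions that are not confirmed by the report: every lookup takes the same 6 seconds, each thread loads one lookup at a time, and the work is spread evenly across the 6 threads. The function name and constant are illustrative and are not part of Druid.

```python
import math

POLL_PERIOD_SECONDS = 5 * 60  # 5 minute polling period from the report


def refresh_wall_time_seconds(num_lookups: int,
                              seconds_per_lookup: float,
                              threads: int) -> float:
    """Estimate wall-clock time for one full refresh of all lookups.

    Assumes uniform load time per lookup and one lookup per thread at a
    time, so the lookups are processed in ceil(num_lookups / threads) rounds.
    """
    rounds = math.ceil(num_lookups / threads)
    return rounds * seconds_per_lookup


# Numbers taken from the report: 42 lookups, ~6 s each, 6 threads.
estimate = refresh_wall_time_seconds(num_lookups=42,
                                     seconds_per_lookup=6.0,
                                     threads=6)
headroom = POLL_PERIOD_SECONDS / estimate

print(f"estimated refresh time: {estimate:.0f} s")
print(f"polling period is {headroom:.1f}x the estimated refresh time")
```

Under these assumptions one refresh takes about 42 seconds, a small fraction of the 300 second polling period. If the database slows down enough that per-lookup load time grows several times over, that margin shrinks accordingly, which would be consistent with the sensitivity to database performance described above.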
