[ https://issues.apache.org/jira/browse/SOLR-17060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790775#comment-17790775 ]
Alex Deparvu edited comment on SOLR-17060 at 11/28/23 10:02 PM: ---------------------------------------------------------------- I took a stab at a proposal for a fix [https://github.com/apache/solr/pull/2101] which I am not entirely sure will work correctly. The approach is to identify the problematic window (core still in init phase) and avoid enabling the metrics. based on this idea I extended the existing isClosed() (refCount.get() <= 0) check to verify that the core is opened instead (refCount.get() <= 1). The test provided is a best-effort, I could not capture the deadlock, but I did get some exception which may indicate some error in opening the searcher but I am not familiar enough with this code to tell for sure. Open to feedback on both fix and test - even if to say that this approach is not good in identifying and fixing the deadlock. Exception trace caused by the test {noformat} org.apache.solr.core.SolrCoreTest > testCoreInitDeadlockMetrics FAILED org.apache.solr.common.SolrException: Error CREATEing SolrCore 'tmpCore': Unable to create core [tmpCore] Caused by: no segments* file found in LockValidatingDirectoryWrapper(MockDirectoryWrapper(ByteBuffersDirectory@140da8ef lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@632b9c9e)): files: [] at __randomizedtesting.SeedInfo.seed([3110F9390179E105:3BE771D8AB76AF96]:0) at app//org.apache.solr.core.CoreContainer.create(CoreContainer.java:1613) at app//org.apache.solr.core.CoreContainer.create(CoreContainer.java:1514) at app//org.apache.solr.core.SolrCoreTest.testCoreInitDeadlockMetrics(SolrCoreTest.java:375) ... Caused by: org.apache.solr.common.SolrException: Unable to create core [tmpCore] at app//org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1738) at app//org.apache.solr.core.CoreContainer.create(CoreContainer.java:1571) ... 48 more Caused by: org.apache.solr.common.SolrException: Error opening new searcher at app//org.apache.solr.core.SolrCore.<init>(SolrCore.java:1234) at app//org.apache.solr.core.SolrCore.<init>(SolrCore.java:1057) at app//org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1705) ... 49 more Caused by: org.apache.solr.common.SolrException: Error opening new searcher at app//org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2464) at app//org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2597) at app//org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1293) at app//org.apache.solr.core.SolrCore.<init>(SolrCore.java:1178) ... 51 more Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(MockDirectoryWrapper(ByteBuffersDirectory@140da8ef lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@632b9c9e)): files: [] at app//org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1075) at app//org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:145) at app//org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:110) at app//org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:258) at app//org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:125) at app//org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2415) ... 54 more {noformat} was (Author: alex.parvulescu): I took a stab at a proposal for a fix https://github.com/apache/solr/pull/2101 which I am not entirely sure will work correctly. The approach is to identify the problematic window (core still in init phase) and avoid enabling the metrics. based on this idea I extended the existing isClosed() (refCount.get() <= 0) check to verify that the core is opened instead (refCount.get() <= 1). The test provided is a best-effort, I could not capture the deadlock, but I did get some exception which may indicate some error in opening the searcher but I am not familiar enough with this code to tell for sure. Open to feedback on both fix and test - even if to say that this approach is not good in identifying and fixing the deadlock. > CoreContainer#create may deadlock with concurrent requests for metrics > ---------------------------------------------------------------------- > > Key: SOLR-17060 > URL: https://issues.apache.org/jira/browse/SOLR-17060 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: multicore > Affects Versions: 9.2.1, 9.4 > Reporter: Andreas Hubold > Priority: Blocker > Labels: deadlock, metrics, monitoring > Time Spent: 10m > Remaining Estimate: 0h > > CoreContainer#create registers metrics for the created core quite early, and > metrics can be requested for a core that is still being created. If a metrics > request calls SolrCore#getSearcher with unlucky timing (race condition), then > CoreContainer#create can deadlock. > This problem was described on the users lists: > [https://lists.apache.org/thread/mvpp1ogkxfdgfx87mdt6ylhqsttoq2dw] > We ran into such a deadlock with Solr 9.2.1, a CoreAdmin CREATE request, and > a periodic thread that requests all available JMX metrics. > A CoreAdmin CREATE request was received, but its thread waits forever because > onDeckSearchers is 1 and _searcher is null (variables checked in heap dump): > {code:none} > "qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s > tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait() [0x00007fe6c68ca000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(java.base@17.0.8.1/Native Method) > - waiting on <no object reference available> > at java.lang.Object.wait(java.base@17.0.8.1/Unknown Source) > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528) > - locked <0x00000000e25dd2c8> (a java.lang.Object) > at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283) > at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168) > at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051) > at > org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666) > at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532) > at > org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111) > at > org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown > Source) > at > org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398) > at > org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354) > at > org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219) > {code} > > There's no way for clients to retry the CREATE request, because CoreContainer > rejects additional requests for good reason ("Already creating a core with > name ...", see SOLR-14969). > The only solution is to restart Solr. > > A thread for metrics was also waiting in the same way: > {code:none} > prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms > elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait() > [0x00007fe6c680b000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(java.base@17.0.8.1/Native Method) > - waiting on <no object reference available> > at java.lang.Object.wait(java.base@17.0.8.1/Unknown Source) > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528) > - locked <0x00000000e25dd2c8> (a java.lang.Object) > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271) > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106) > at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124) > at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534) > at > org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360) > at > org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown > Source) > at > org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779) > at > org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207 > {code} > This thread gathers multiple metrics one after the other. We can assume, that > it had called SolrCore#getSearcher successfully before, and that the previous > call has incremented onDeckSearchers to 1. > A previous call to #getSearcher should eventually decrement onDeckSearchers > again, and also set the _searcher field, but that's not happening. Normally, > this should happen in SolrCore#registerSearcher, but that method is executed > on the searcherExecutor, see > [https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681] > The problem is, that searcherExecutor will never execute #registerSearcher > because the executor is still blocked by another job. This is because the > core is still being created, see > [https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160] > {code:none} > "searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0 > cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on > condition [0x00007fe6c637a000] > java.lang.Thread.State: WAITING (parking) > at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native Method) > - parking to wait for <0x00000000e738c1a8> (a > java.util.concurrent.CountDownLatch$Sync) > at > java.util.concurrent.locks.LockSupport.park(java.base@17.0.8.1/Unknown > Source) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.8.1/Unknown > Source) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.8.1/Unknown > Source) > at > java.util.concurrent.CountDownLatch.await(java.base@17.0.8.1/Unknown Source) > at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162) > at > org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown > Source) > at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown > Source) > {code} > In summary, we have > - thread "qtp1800649922-28" in SolrCore.<init> trying to create a core, > which is waiting for > - #registerSearcher job in the searcherExecutor work queue, which is waiting > to be executed by > - thread "searcherExecutor-14-thread-1-processing-studio" which waits for > the first thread "qtp1800649922-28" to unblock it -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org