[ 
https://issues.apache.org/jira/browse/SOLR-17060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787085#comment-17787085
 ] 

Andreas Hubold edited comment on SOLR-17060 at 11/17/23 9:06 AM:
-----------------------------------------------------------------

[~duerrm] In the mail thread, there's are a call to "SolrCore.getSegmentCount" 
for metrics on one thread, and a call to "SolrCore.<init>" on another. It looks 
like exactly the same issue to me.

BTW, the answer in the mail thread "This thread is waiting for the lock on 
'java.lang.Object@6665e80c'" was not entirely correct. The mentioned thread was 
waiting for a notification on that object / on that lock. That does not mean 
that another thread was currently holding the lock and also explains why you 
haven't found another thread holding that lock in the thread dump.


was (Author: ahubold):
[~duerrm] In the mail thread, there's are a call to "SolrCore.getSegmentCount" 
for metrics on one thread, and a call to "SolrCore.<init>" on another. It looks 
like exactly the same issue to me.

BTW, the answer in the mail thread "This thread is waiting for the lock on 
'java.lang.Object@6665e80c'" was not entirely correct. The mentioned thread was 
waiting for a notification on that object / on that lock. That does not mean 
that another thread was currently holding the lock. It explains why you haven't 
found another thread holding that lock in the thread dump.

> CoreContainer#create may deadlock with concurrent requests for metrics
> ----------------------------------------------------------------------
>
>                 Key: SOLR-17060
>                 URL: https://issues.apache.org/jira/browse/SOLR-17060
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: multicore
>    Affects Versions: 9.2.1, 9.4
>            Reporter: Andreas Hubold
>            Priority: Major
>              Labels: deadlock, metrics, monitoring
>
> CoreContainer#create registers metrics for the created core quite early, and 
> metrics can be requested for a core that is still being created. If a metrics 
> request calls SolrCore#getSearcher with unlucky timing (race condition), then 
> CoreContainer#create can deadlock.
> This problem was described on the users lists: 
> [https://lists.apache.org/thread/mvpp1ogkxfdgfx87mdt6ylhqsttoq2dw]
> We ran into such a deadlock with Solr 9.2.1, a CoreAdmin CREATE request, and 
> a periodic thread that requests all available JMX metrics.
> A CoreAdmin CREATE request was received, but its thread waits forever because 
> onDeckSearchers is 1 and _searcher is null (variables checked in heap dump):
> {code:none}
> "qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s 
> tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait()  [0x00007fe6c68ca000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
>     - waiting on <no object reference available>
>     at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
>     - locked <0x00000000e25dd2c8> (a java.lang.Object)
>     at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051)
>     at 
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666)
>     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown
>  Source)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398)
>     at 
> org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354)
>     at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219)
> {code}
>  
> There's no way for clients to retry the CREATE request, because CoreContainer 
> rejects additional requests for good reason ("Already creating a core with 
> name ...", see SOLR-14969).
> The only solution is to restart Solr.
>  
> A thread for metrics was also waiting in the same way:
> {code:none}
> prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms 
> elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait()  
> [0x00007fe6c680b000]
>    java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
>        - waiting on <no object reference available>
>        at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
>        - locked <0x00000000e25dd2c8> (a java.lang.Object)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106)
>        at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124)
>        at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534)
>        at 
> org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360)
>        at 
> org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown 
> Source)
>        at 
> org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779)
>        at 
> org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207
> {code}
> This thread gathers multiple metrics one after the other. We can assume, that 
> it had called SolrCore#getSearcher successfully before, and that the previous 
> call has incremented onDeckSearchers to 1.
> A previous call to #getSearcher should eventually decrement onDeckSearchers 
> again, and also set the _searcher field, but that's not happening. Normally, 
> this should happen in SolrCore#registerSearcher, but that method is executed 
> on the searcherExecutor, see
> [https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681]
> The problem is, that searcherExecutor will never execute #registerSearcher 
> because the executor is still blocked by another job. This is because the 
> core is still being created, see
> [https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160]
> {code:none}
> "searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0 
> cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on 
> condition  [0x00007fe6c637a000]
>    java.lang.Thread.State: WAITING (parking)
>        at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native  Method)
>        - parking to wait for  <0x00000000e738c1a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>        at 
> java.util.concurrent.locks.LockSupport.park(java.base@17.0.8.1/Unknown  
> Source)
>        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.8.1/Unknown
>   Source)
>        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.8.1/Unknown
>   Source)
>        at 
> java.util.concurrent.CountDownLatch.await(java.base@17.0.8.1/Unknown  Source)
>        at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162)
>        at 
> org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown 
> Source)
>        at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown  
> Source) 
> {code}
> In summary, we have
>  - thread "qtp1800649922-28" in SolrCore.<init> trying to create a core, 
> which is waiting for
>  - #registerSearcher job in the searcherExecutor work queue, which is waiting 
> to be executed by
>  - thread "searcherExecutor-14-thread-1-processing-studio" which waits for 
> the first thread "qtp1800649922-28" to unblock it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to