[ 
https://issues.apache.org/jira/browse/SOLR-17060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786811#comment-17786811
 ] 

Andreas Hubold edited comment on SOLR-17060 at 1/5/24 8:39 AM:
---------------------------------------------------------------

I was able to work around the issue by excluding the INDEX.segments metric in 
solr.xml. Well, I hope so - at least I wasn't able to reproduce the issue 
anymore. The workaround is just valid for gathering metrics via JMX. It will 
not prevent deadlocks if metrics are retrieved from the /admin/metrics HTTP 
endpoint.

For reference, this is how I excluded INDEX.segments but kept all the other 
metrics in solr.xml. It's a bit clumsy, but I did not find a better way to 
exclude a metric - so I included all metrics except the problematic one:

{code:xml}
  <metrics enabled="${metricsEnabled:true}">
    <reporter name="jmx_metrics_core" group="core" 
class="org.apache.solr.metrics.reporters.SolrJmxReporter">
      <!--
        Workaround for SOLR-17060: Exclude 'INDEX.segments' MBean attribute, 
because read access may cause a deadlock.
        Sadly, it's not possible to exclude a single metric by configuration. 
We can only list included name prefixes.
        The following prefixes should cover all metrics except 'INDEX.segments'.
        To that end, we are listing all categories from 
org.apache.solr.core.SolrInfoBean.Category except INDEX,
        and known attributes from INDEX except INDEX.segments.
      -->
      <str name="filter">CONTAINER</str>
      <str name="filter">ADMIN</str>
      <str name="filter">CORE</str>
      <str name="filter">QUERY</str>
      <str name="filter">UPDATE</str>
      <str name="filter">CACHE</str>
      <str name="filter">HIGHLIGHTER</str>
      <str name="filter">QUERYPARSER</str>
      <str name="filter">SPELLCHECKER</str>
      <str name="filter">SEARCHER</str>
      <str name="filter">REPLICATION</str>
      <str name="filter">TLOG</str>
      <str name="filter">DIRECTORY</str>
      <str name="filter">HTTP</str>
      <str name="filter">SECURITY</str>
      <str name="filter">OTHER</str>
      <str name="filter">INDEX.flush</str>
      <str name="filter">INDEX.merge</str>
      <str name="filter">INDEX.size</str>
      <str name="filter">INDEX.sizeInBytes</str>
    </reporter>
    <!--
      Add other available metrics, for all groups except group "core" that is 
defined above.
      The list of possible groups is taken from 
org.apache.solr.core.SolrInfoBean.Group.
    -->
    <reporter name="jmx_metrics_other"
              group="jetty, jvm, node, collection, shard, cluster, overseer"
              class="org.apache.solr.metrics.reporters.SolrJmxReporter"/>
  </metrics>
{code}

 

 


was (Author: ahubold):
I was able to work around the issue by excluding the INDEX.segments metric in 
solr.xml. Well, I hope so - at least I wasn't able to reproduce the issue 
anymore.

For reference, this is how I excluded INDEX.segments but kept all the other 
metrics in solr.xml. It's a bit clumsy, but I did not find a better way to 
exclude a metric - so I included all metrics except the problematic one:

{code:xml}
  <metrics enabled="${metricsEnabled:true}">
    <reporter name="jmx_metrics_core" group="core" 
class="org.apache.solr.metrics.reporters.SolrJmxReporter">
      <!--
        Workaround for SOLR-17060: Exclude 'INDEX.segments' MBean attribute, 
because read access may cause a deadlock.
        Sadly, it's not possible to exclude a single metric by configuration. 
We can only list included name prefixes.
        The following prefixes should cover all metrics except 'INDEX.segments'.
        To that end, we are listing all categories from 
org.apache.solr.core.SolrInfoBean.Category except INDEX,
        and known attributes from INDEX except INDEX.segments.
      -->
      <str name="filter">CONTAINER</str>
      <str name="filter">ADMIN</str>
      <str name="filter">CORE</str>
      <str name="filter">QUERY</str>
      <str name="filter">UPDATE</str>
      <str name="filter">CACHE</str>
      <str name="filter">HIGHLIGHTER</str>
      <str name="filter">QUERYPARSER</str>
      <str name="filter">SPELLCHECKER</str>
      <str name="filter">SEARCHER</str>
      <str name="filter">REPLICATION</str>
      <str name="filter">TLOG</str>
      <str name="filter">DIRECTORY</str>
      <str name="filter">HTTP</str>
      <str name="filter">SECURITY</str>
      <str name="filter">OTHER</str>
      <str name="filter">INDEX.flush</str>
      <str name="filter">INDEX.merge</str>
      <str name="filter">INDEX.size</str>
      <str name="filter">INDEX.sizeInBytes</str>
    </reporter>
    <!--
      Add other available metrics, for all groups except group "core" that is 
defined above.
      The list of possible groups is taken from 
org.apache.solr.core.SolrInfoBean.Group.
    -->
    <reporter name="jmx_metrics_other"
              group="jetty, jvm, node, collection, shard, cluster, overseer"
              class="org.apache.solr.metrics.reporters.SolrJmxReporter"/>
  </metrics>
{code}

 

 

> CoreContainer#create may deadlock with concurrent requests for metrics
> ----------------------------------------------------------------------
>
>                 Key: SOLR-17060
>                 URL: https://issues.apache.org/jira/browse/SOLR-17060
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: multicore
>    Affects Versions: 9.2.1, 9.4
>            Reporter: Andreas Hubold
>            Assignee: Alex Deparvu
>            Priority: Blocker
>              Labels: deadlock, metrics, monitoring
>             Fix For: main (10.0), 9.5, 9.4.1
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> CoreContainer#create registers metrics for the created core quite early, and 
> metrics can be requested for a core that is still being created. If a metrics 
> request calls SolrCore#getSearcher with unlucky timing (race condition), then 
> CoreContainer#create can deadlock.
> This problem was described on the users lists: 
> [https://lists.apache.org/thread/mvpp1ogkxfdgfx87mdt6ylhqsttoq2dw]
> We ran into such a deadlock with Solr 9.2.1, a CoreAdmin CREATE request, and 
> a periodic thread that requests all available JMX metrics.
> A CoreAdmin CREATE request was received, but its thread waits forever because 
> onDeckSearchers is 1 and _searcher is null (variables checked in heap dump):
> {code:none}
> "qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s 
> tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait()  [0x00007fe6c68ca000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
>     - waiting on <no object reference available>
>     at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
>     - locked <0x00000000e25dd2c8> (a java.lang.Object)
>     at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051)
>     at 
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666)
>     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown
>  Source)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398)
>     at 
> org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354)
>     at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219)
> {code}
>  
> There's no way for clients to retry the CREATE request, because CoreContainer 
> rejects additional requests for good reason ("Already creating a core with 
> name ...", see SOLR-14969).
> The only solution is to restart Solr.
>  
> A thread for metrics was also waiting in the same way:
> {code:none}
> prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms 
> elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait()  
> [0x00007fe6c680b000]
>    java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
>        - waiting on <no object reference available>
>        at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
>        - locked <0x00000000e25dd2c8> (a java.lang.Object)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106)
>        at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124)
>        at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534)
>        at 
> org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360)
>        at 
> org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown 
> Source)
>        at 
> org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779)
>        at 
> org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207
> {code}
> This thread gathers multiple metrics one after the other. We can assume, that 
> it had called SolrCore#getSearcher successfully before, and that the previous 
> call has incremented onDeckSearchers to 1.
> A previous call to #getSearcher should eventually decrement onDeckSearchers 
> again, and also set the _searcher field, but that's not happening. Normally, 
> this should happen in SolrCore#registerSearcher, but that method is executed 
> on the searcherExecutor, see
> [https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681]
> The problem is, that searcherExecutor will never execute #registerSearcher 
> because the executor is still blocked by another job. This is because the 
> core is still being created, see
> [https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160]
> {code:none}
> "searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0 
> cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on 
> condition  [0x00007fe6c637a000]
>    java.lang.Thread.State: WAITING (parking)
>        at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native  Method)
>        - parking to wait for  <0x00000000e738c1a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>        at 
> java.util.concurrent.locks.LockSupport.park(java.base@17.0.8.1/Unknown  
> Source)
>        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.8.1/Unknown
>   Source)
>        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.8.1/Unknown
>   Source)
>        at 
> java.util.concurrent.CountDownLatch.await(java.base@17.0.8.1/Unknown  Source)
>        at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162)
>        at 
> org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown 
> Source)
>        at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown  
> Source) 
> {code}
> In summary, we have
>  - thread "qtp1800649922-28" in SolrCore.<init> trying to create a core, 
> which is waiting for
>  - #registerSearcher job in the searcherExecutor work queue, which is waiting 
> to be executed by
>  - thread "searcherExecutor-14-thread-1-processing-studio" which waits for 
> the first thread "qtp1800649922-28" to unblock it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to