Andreas Hubold created SOLR-14969:
-------------------------------------

             Summary: Race condition when creating cores leads to NPE in CoreAdmin STATUS
                 Key: SOLR-14969
                 URL: https://issues.apache.org/jira/browse/SOLR-14969
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: multicore
    Affects Versions: 8.6.3, 8.6
            Reporter: Andreas Hubold


CoreContainer#create does not correctly handle concurrent requests to create 
the same core. There is a race condition (see also the existing TODO comment in 
the code), so CoreContainer#createFromDescriptor may be called more than once 
for the same core name.
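
One common way to close this kind of check-then-act race is to claim the core name atomically before doing any work. The following is a minimal sketch under that assumption, not Solr's code: CreateGuardSketch, claimed, createCalls and createConcurrently are hypothetical names, and the counter merely stands in for the call to createFromDescriptor.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of a core registry; not Solr's CoreContainer.
class CreateGuardSketch {
    // Names that have been claimed for creation (assumed helper state).
    private final Set<String> claimed = ConcurrentHashMap.newKeySet();
    // Counts how often the expensive creation step actually runs.
    final AtomicInteger createCalls = new AtomicInteger();

    // Returns true only for the caller that won the claim for this name.
    boolean create(String coreName) {
        // add() on a concurrent key set is atomic, so exactly one caller
        // per name gets 'true' and proceeds to the creation step.
        if (!claimed.add(coreName)) {
            return false;
        }
        createCalls.incrementAndGet(); // stands in for createFromDescriptor
        return true;
    }

    // Starts several threads that all try to create the same core name and
    // returns how many times the creation step ran.
    static int createConcurrently(int threads) {
        CreateGuardSketch registry = new CreateGuardSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> registry.create("demo_core"));
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return registry.createCalls.get();
    }
}
```

A real implementation would also have to release the claim when creation fails, so that a later retry for the same name is possible.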

The _second call_ then fails to create an IndexWriter, and its exception 
handling leaves the CoreContainer in an inconsistent state:

{noformat}
2020-10-27 00:29:25.350 ERROR (qtp2029754983-24) [   ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error CREATEing SolrCore 'blueprint_acgqqafsogyc_comments': Unable to create core [blueprint_acgqqafsogyc_comments] Caused by: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock

         at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1312)
         at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:95)
         at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
...
Caused by: org.apache.solr.common.SolrException: Unable to create core [blueprint_acgqqafsogyc_comments]
         at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1408)
         at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1273)
         ... 47 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1071)
         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:906)
         at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1387)
         ... 48 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2184)
         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2308)
         at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1130)
         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1012)
         ... 50 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
         at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
         at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
         at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
         at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105)
         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:785)
         at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:126)
         at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
         at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:261)
         at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:135)
         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2145)
{noformat}
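
The "Lock held by this virtual machine" message indicates that the write lock for this index path was already taken inside the same JVM, which fits a first core that is still open. The sketch below is a simplified model of such a per-JVM lock registry, assuming a shared set of held paths. JvmLockSketch, HELD, obtain and release are hypothetical names, and the code does not reproduce Lucene's NativeFSLockFactory.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Simplified, hypothetical model of a per-JVM lock registry.
class JvmLockSketch {
    // Paths whose lock is currently held somewhere in this JVM.
    private static final Set<String> HELD =
            Collections.synchronizedSet(new HashSet<>());

    // Claims the lock for a path, or fails if this JVM already holds it.
    static void obtain(String path) {
        if (!HELD.add(path)) {
            throw new IllegalStateException(
                    "Lock held by this virtual machine: " + path);
        }
    }

    // Releases the lock so that a later obtain() for the path can succeed.
    static void release(String path) {
        HELD.remove(path);
    }
}
```

In this model a second obtain() for the same path fails until the first holder calls release(), which mirrors why the second core creation cannot open its own IndexWriter.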

CoreContainer#createFromDescriptor removes the CoreDescriptor when handling 
this exception. The SolrCore created for the first successful call is still 
registered in SolrCores.cores, but now there's no corresponding CoreDescriptor 
for that name anymore.
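
The resulting state can be illustrated with a small model that uses two plain maps, one for cores and one for descriptors. This is only a sketch of the sequence described above: InconsistentStateSketch and its methods are hypothetical, and the thrown exception merely stands in for the failed IndexWriter creation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-map model of the registry state; not Solr's classes.
class InconsistentStateSketch {
    final Map<String, Object> cores = new HashMap<>();       // models SolrCores.cores
    final Map<String, String> descriptors = new HashMap<>(); // models registered CoreDescriptors

    // First call: descriptor and core are both registered.
    void firstCreateSucceeds(String name) {
        descriptors.put(name, "descriptor:" + name);
        cores.put(name, new Object());
    }

    // Second call: creation fails, and the cleanup removes the descriptor
    // even though the core from the first call is still registered.
    void secondCreateFails(String name) {
        descriptors.put(name, "descriptor:" + name);
        try {
            throw new IllegalStateException("Lock held by this virtual machine");
        } catch (IllegalStateException e) {
            descriptors.remove(name);
        }
    }

    // Consistent means: a name is either in both maps or in neither.
    boolean isConsistent(String name) {
        return cores.containsKey(name) == descriptors.containsKey(name);
    }
}
```

After firstCreateSucceeds followed by secondCreateFails, the name is present in cores but absent from descriptors, so isConsistent returns false.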

This inconsistency leads to subsequent NullPointerExceptions, for example when 
using CoreAdmin STATUS with the core name: CoreAdminOperation#getCoreStatus 
first gets the non-null SolrCore (cores.getCore(cname)), but 
core.getInstancePath() then throws an NPE because the CoreDescriptor is no 
longer registered:

{noformat}
2020-10-27 00:29:25.353 INFO  (qtp2029754983-19) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={core=blueprint_acgqqafsogyc_comments&action=STATUS&indexInfo=false&wt=javabin&version=2} status=500 QTime=0
2020-10-27 00:29:25.353 ERROR (qtp2029754983-19) [   ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error handling 'STATUS' action
         at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:372)
         at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397)
         at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181)
...
Caused by: java.lang.NullPointerException
         at org.apache.solr.handler.admin.SolrCore.getInstancePath(SolrCore.java:333)
         at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:329)
         at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:54)
         at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
{noformat}

STATUS keeps failing until Solr is restarted.
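
Independent of fixing the race itself, a status lookup could report a missing descriptor instead of failing with an NPE. The sketch below shows both variants in a hypothetical model. StatusSketch, Descriptor, instancePathUnchecked and status are illustrative names and do not correspond to Solr's CoreAdminOperation code.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a status lookup; not Solr's CoreAdminOperation.
class StatusSketch {
    static final class Descriptor {
        final String instanceDir;
        Descriptor(String instanceDir) { this.instanceDir = instanceDir; }
    }

    final Map<String, Descriptor> descriptors = new HashMap<>();

    // Models the failing lookup: dereferences the descriptor without a
    // null check and throws an NPE when none is registered.
    String instancePathUnchecked(String name) {
        return descriptors.get(name).instanceDir;
    }

    // Defensive variant: reports the inconsistency instead of throwing.
    Map<String, Object> status(String name) {
        Map<String, Object> info = new LinkedHashMap<>();
        info.put("name", name);
        Descriptor descriptor = descriptors.get(name);
        if (descriptor == null) {
            info.put("error", "no CoreDescriptor registered for core");
        } else {
            info.put("instanceDir", descriptor.instanceDir);
        }
        return info;
    }
}
```

Such a check would only make the symptom visible in the response. The underlying inconsistency would remain until the descriptor handling is corrected.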

The NPE for CoreAdmin STATUS is a regression in 8.6. It seems to be caused by 
https://github.com/apache/lucene-solr/commit/17ae79b0905b2bf8635c1b260b30807cae2f5463#diff-9652fe8353b7eff59cd6f128bb2699d88361e670b840ee5ca1018b1bc45584d1R324



--
This message was sent by Atlassian Jira
(v8.3.4#803005)