Andreas Hubold created SOLR-14969:
-------------------------------------

             Summary: Race condition when creating cores leads to NPE in CoreAdmin STATUS
                 Key: SOLR-14969
                 URL: https://issues.apache.org/jira/browse/SOLR-14969
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: multicore
    Affects Versions: 8.6.3, 8.6
            Reporter: Andreas Hubold

CoreContainer#create does not correctly handle concurrent requests to create the same core. There's a race condition (see also existing TODO comment in the code), and CoreContainer#createFromDescriptor may be called subsequently for the same core name. The _second call_ then fails to create an IndexWriter, and exception handling causes an inconsistent CoreContainer state.

{noformat}
2020-10-27 00:29:25.350 ERROR (qtp2029754983-24) [ ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error CREATEing SolrCore 'blueprint_acgqqafsogyc_comments': Unable to create core [blueprint_acgqqafsogyc_comments] Caused by: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1312)
        at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:95)
        at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
        ...
Caused by: org.apache.solr.common.SolrException: Unable to create core [blueprint_acgqqafsogyc_comments]
        at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1408)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1273)
        ... 47 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1071)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:906)
        at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1387)
        ... 48 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2184)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2308)
        at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1130)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1012)
        ... 50 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
        at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
        at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
        at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
        at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:785)
        at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:126)
        at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
        at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:261)
        at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:135)
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2145)
{noformat}

CoreContainer#createFromDescriptor removes the CoreDescriptor when handling this exception. The SolrCore created for the first successful call is still registered in SolrCores.cores, but now there's no corresponding CoreDescriptor for that name anymore.

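To make the sequence above easier to follow, here is a minimal, self-contained sketch of how the two registries can drift apart. It is a simplified model and not Solr's code: the class CoreRegistryModel, its three maps, and the use of IllegalStateException in place of LockObtainFailedException are all assumptions made for illustration. The two create calls run one after the other instead of concurrently so that the outcome is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of the descriptor and core registries; not Solr's actual classes. */
class CoreRegistryModel {
    // Stand-in for the registered CoreDescriptors, keyed by core name.
    final Map<String, String> descriptors = new ConcurrentHashMap<>();
    // Stand-in for SolrCores.cores, keyed by core name.
    final Map<String, Object> cores = new ConcurrentHashMap<>();
    // Stand-in for the index write.lock: at most one holder per core name.
    final Map<String, Boolean> writeLocks = new ConcurrentHashMap<>();

    /** Registers a descriptor, tries to open the core, and cleans up on failure. */
    void create(String name) {
        descriptors.put(name, "descriptor-for-" + name);
        try {
            if (writeLocks.putIfAbsent(name, Boolean.TRUE) != null) {
                // Models the IndexWriter failing to obtain write.lock.
                throw new IllegalStateException("Lock held by this virtual machine: " + name);
            }
            cores.put(name, new Object());
        } catch (IllegalStateException e) {
            // Problematic cleanup: drops the descriptor although an earlier
            // call already registered a live core under the same name.
            descriptors.remove(name);
            throw e;
        }
    }

    public static void main(String[] args) {
        CoreRegistryModel model = new CoreRegistryModel();
        model.create("demo"); // first call succeeds
        try {
            model.create("demo"); // second call fails on the lock
        } catch (IllegalStateException expected) {
            System.out.println("second create failed: " + expected.getMessage());
        }
        System.out.println("core registered: " + model.cores.containsKey("demo"));
        System.out.println("descriptor registered: " + model.descriptors.containsKey("demo"));
    }
}
```

After the failed second call the model still holds the core but no longer holds a descriptor for the same name, which is the inconsistent state described in this report.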
This inconsistency leads to subsequent NullPointerExceptions, for example when using CoreAdmin STATUS with the core name: CoreAdminOperation#getCoreStatus first gets the non-null SolrCore (cores.getCore(cname)) but core.getInstancePath() throws an NPE, because the CoreDescriptor is not registered anymore:

{noformat}
2020-10-27 00:29:25.353 INFO (qtp2029754983-19) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={core=blueprint_acgqqafsogyc_comments&action=STATUS&indexInfo=false&wt=javabin&version=2} status=500 QTime=0
2020-10-27 00:29:25.353 ERROR (qtp2029754983-19) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error handling 'STATUS' action
        at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:372)
        at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397)
        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181)
        ...
Caused by: java.lang.NullPointerException
        at org.apache.solr.core.SolrCore.getInstancePath(SolrCore.java:333)
        at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:329)
        at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:54)
        at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
{noformat}

STATUS keeps failing until Solr is restarted.

The NPE for CoreAdmin STATUS is a regression in 8.6. It seems to be caused by https://github.com/apache/lucene-solr/commit/17ae79b0905b2bf8635c1b260b30807cae2f5463#diff-9652fe8353b7eff59cd6f128bb2699d88361e670b840ee5ca1018b1bc45584d1R324
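
The STATUS failure can be pictured with a second small sketch. Again this is only a model under stated assumptions: StatusLookupModel, its nested Descriptor class, and the instancePath method are hypothetical names, and the lookup is assumed to resolve the descriptor by core name at call time, which is how the report describes the behaviour since 8.6.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the STATUS lookup path; names are illustrative, not Solr's API. */
class StatusLookupModel {
    /** Minimal stand-in for a CoreDescriptor that only carries an instance path. */
    static final class Descriptor {
        final String instancePath;

        Descriptor(String instancePath) {
            this.instancePath = instancePath;
        }
    }

    final Map<String, Descriptor> descriptors = new HashMap<>();
    final Map<String, String> cores = new HashMap<>(); // core name -> placeholder core

    /** Finds the core first, then dereferences its descriptor without a null check. */
    String instancePath(String name) {
        if (!cores.containsKey(name)) {
            return null; // unknown core: nothing to report
        }
        Descriptor d = descriptors.get(name); // null once the descriptor was removed
        return d.instancePath;                // throws NullPointerException when d is null
    }

    public static void main(String[] args) {
        StatusLookupModel model = new StatusLookupModel();
        model.cores.put("demo", "core"); // core registered, descriptor missing
        try {
            model.instancePath("demo");
        } catch (NullPointerException e) {
            System.out.println("lookup failed with an NPE");
        }
    }
}
```

Because nothing in the model re-registers the descriptor, every later lookup fails the same way, matching the observation that STATUS keeps failing until a restart.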