[ 
https://issues.apache.org/jira/browse/SOLR-14969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223623#comment-17223623
 ] 

Andreas Hubold commented on SOLR-14969:
---------------------------------------

[~erickerickson] It seems you've misunderstood my point. The core name must of 
course be removed in the finally block, but if and only if it was added by the 
very same invocation. It *must not* be removed by an invocation that didn't 
add the core name itself. Currently, with your fix, the following scenario is 
possible:
 * first call is made, and adds coreName to inFlightCreations
 * second simultaneous call detects that the core is already being created, and 
correctly throws an exception, but it also removes the core from 
inFlightCreations.
 * third call is made, does not see coreName in inFlightCreations, and proceeds 
even though the first call is still not finished

I can reproduce such problems with concurrent create requests.

I hope this makes it clearer. In my attached workaround, this situation is 
handled by setting the variable "createCore" to null if a previous create call 
is still in progress, and by an additional "if (createCore!=null)" condition 
in the finally block. This is of course not directly applicable to your fix, 
but the pattern could be similar.
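
To illustrate the pattern, here is a minimal sketch. It is not the actual 
CoreContainer code and not the attached workaround: the class 
InFlightCreations, the method names, and the flag addedByThisCall are 
hypothetical, and a plain Runnable stands in for the real core creation. The 
only point is that the finally block removes the name only when this 
invocation added it.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the in-flight bookkeeping discussed above;
// this is NOT the real CoreContainer implementation.
class InFlightCreations {
  private final Set<String> inFlightCreations = ConcurrentHashMap.newKeySet();

  /** Runs createAction unless another creation of coreName is still in flight. */
  void create(String coreName, Runnable createAction) {
    // Set.add returns true only for the invocation that actually inserted the name.
    boolean addedByThisCall = inFlightCreations.add(coreName);
    try {
      if (!addedByThisCall) {
        throw new IllegalStateException(
            "Core '" + coreName + "' is already being created");
      }
      createAction.run();
    } finally {
      // Clean up only what this invocation registered. A rejected call
      // must leave the entry of the still-running call untouched.
      if (addedByThisCall) {
        inFlightCreations.remove(coreName);
      }
    }
  }

  boolean isInFlight(String coreName) {
    return inFlightCreations.contains(coreName);
  }
}
```

With this guard, the second call in the scenario above is still rejected, but 
the name stays registered until the first call finishes, so a third call is 
rejected as well instead of proceeding.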

> Prevent creating multiple cores with the same name which leads to 
> instabilities (race condition)
> ------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-14969
>                 URL: https://issues.apache.org/jira/browse/SOLR-14969
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: multicore
>    Affects Versions: 8.6, 8.6.3
>            Reporter: Andreas Hubold
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: CmCoreAdminHandler.java
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> CoreContainer#create does not correctly handle concurrent requests to create 
> the same core. There's a race condition (see also existing TODO comment in 
> the code), and CoreContainer#createFromDescriptor may be called subsequently 
> for the same core name.
> The _second call_ then fails to create an IndexWriter, and exception handling 
> causes an inconsistent CoreContainer state.
> {noformat}
> 2020-10-27 00:29:25.350 ERROR (qtp2029754983-24) [   ] 
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error 
> CREATEing SolrCore 'blueprint_acgqqafsogyc_comments': Unable to create core 
> [blueprint_acgqqafsogyc_comments] Caused by: Lock held by this virtual 
> machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
>          at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1312)
>          at 
> org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:95)
>          at 
> org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
> ...
> Caused by: org.apache.solr.common.SolrException: Unable to create core 
> [blueprint_acgqqafsogyc_comments]
>          at 
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1408)
>          at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1273)
>          ... 47 more
> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>          at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1071)
>          at org.apache.solr.core.SolrCore.<init>(SolrCore.java:906)
>          at 
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1387)
>          ... 48 more
> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>          at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2184)
>          at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2308)
>          at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1130)
>          at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1012)
>          ... 50 more
> Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by 
> this virtual machine: 
> /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
>          at 
> org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
>          at 
> org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
>          at 
> org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
>          at 
> org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105)
>          at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:785)
>          at 
> org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:126)
>          at 
> org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
>          at 
> org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:261)
>          at 
> org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:135)
>          at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2145) 
> {noformat}
> CoreContainer#createFromDescriptor removes the CoreDescriptor when handling 
> this exception. The SolrCore created for the first successful call is still 
> registered in SolrCores.cores, but now there's no corresponding 
> CoreDescriptor for that name anymore.
> This inconsistency leads to subsequent NullPointerExceptions, for example 
> when using CoreAdmin STATUS with the core name: 
> CoreAdminOperation#getCoreStatus first gets the non-null SolrCore 
> (cores.getCore(cname)) but core.getInstancePath() throws an NPE, because the 
> CoreDescriptor is not registered anymore:
> {noformat}
> 2020-10-27 00:29:25.353 INFO  (qtp2029754983-19) [   ] o.a.s.s.HttpSolrCall 
> [admin] webapp=null path=/admin/cores 
> params={core=blueprint_acgqqafsogyc_comments&action=STATUS&indexInfo=false&wt=javabin&version=2}
>  status=500 QTime=0
> 2020-10-27 00:29:25.353 ERROR (qtp2029754983-19) [   ] o.a.s.s.HttpSolrCall 
> null:org.apache.solr.common.SolrException: Error handling 'STATUS' action
>          at 
> org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:372)
>          at 
> org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397)
>          at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181)
> ...
> Caused by: java.lang.NullPointerException
>          at org.apache.solr.core.SolrCore.getInstancePath(SolrCore.java:333)
>          at 
> org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:329)
>          at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:54)
>          at 
> org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
> {noformat}
> STATUS keeps failing until Solr is restarted.
> The NPE for CoreAdmin STATUS is a regression in 8.6. It seems to be caused by 
> https://github.com/apache/lucene-solr/commit/17ae79b0905b2bf8635c1b260b30807cae2f5463#diff-9652fe8353b7eff59cd6f128bb2699d88361e670b840ee5ca1018b1bc45584d1R324



