[
https://issues.apache.org/jira/browse/SOLR-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15945444#comment-15945444
]
Ishan Chattopadhyaya commented on SOLR-10365:
---------------------------------------------
Thanks for your review, Noble.
I think what is happening is the following:
How does a failed collection get cleaned up?
# At CoreContainer's create(CoreDescriptor,boolean,boolean) method, there's a
preRegister step. This publishes the core as DOWN before even attempting to
initialize the core.
# When there's a failure to initialize the core, the CoreContainer's
coreInitFailures map gets populated with the exception.
# At OCMH, when there's a failure with the CreateCollection command, an attempt
to clean up is performed. This actually calls DELETE, which in turn calls
UNLOAD core admin command from DeleteCollectionCmd.java.
# This UNLOAD command is invoked from OCMH's collectionCmd() method, which
calls UNLOAD core on every replica registered in step 1.
# At CoreContainer of the replica, when unload() method is invoked, the
coreInitFailures map gets cleared.
This is all fine when it works. However, the publish step in preRegister seems
to fail intermittently. I can see that the state operation is offered to the
distributed queue properly, but that message doesn't seem to get processed.
Hence, at step 4, no UNLOAD command is sent to the replica, so the
coreInitFailures map on that replica is never cleared. The latest SOLR-6736
patch's TestConfigSetsAPI#testUploadWithScriptUpdateProcessor() demonstrates
this.
While this may be a larger issue with the way OCMH works, I can see that the
patch I added here does the job in those circumstances, and that the code path
followed after the core is registered successfully removes the previous
exception from the coreInitFailures map. Unless someone has any objections, I
am inclined to commit this patch, then commit SOLR-6736, and then continue
investigating the above scenario.
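To make the failure mode concrete, here is a toy model of the sequence above. It is only a sketch, not Solr code: the class InitFailureSketch and its methods are hypothetical stand-ins for CoreContainer's coreInitFailures map, unload() and getCore(), and it assumes (as the stack trace in the issue description suggests) that getCore() rethrows a recorded init failure.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model only; every name here is hypothetical and none of this is Solr's code. */
public class InitFailureSketch {
    // Stands in for CoreContainer's coreInitFailures map (core name -> failure).
    private final Map<String, Exception> initFailures = new ConcurrentHashMap<>();

    /** Step 2: a failed core initialization records its exception. */
    void recordInitFailure(String coreName, Exception cause) {
        initFailures.put(coreName, cause);
    }

    /** Step 5: unloading the core clears the recorded failure. */
    void unload(String coreName) {
        initFailures.remove(coreName);
    }

    /** Assumed getCore() behaviour: a recorded failure is rethrown instead of returning null. */
    Object getCore(String coreName) {
        Exception failure = initFailures.get(coreName);
        if (failure != null) {
            throw new IllegalStateException(
                "SolrCore '" + coreName + "' is not available due to init failure: "
                    + failure.getMessage(), failure);
        }
        return null; // no core is ever loaded in this toy model
    }

    /** Returns true if a re-creation attempt gets past the getCore() lookup. */
    static boolean recreateSucceeds(boolean unloadDelivered) {
        InitFailureSketch container = new InitFailureSketch();
        String core = "newcollection2_shard1_replica1";
        container.recordInitFailure(core, new Exception("bad configset"));
        if (unloadDelivered) {
            container.unload(core); // cleanup reached the replica (steps 4 and 5)
        }
        try {
            container.getCore(core); // what publish() does during preRegister
            return true;
        } catch (IllegalStateException staleFailure) {
            return false; // the previous failure blocks the new attempt
        }
    }

    public static void main(String[] args) {
        System.out.println("UNLOAD delivered: " + recreateSucceeds(true));
        System.out.println("UNLOAD missing:   " + recreateSucceeds(false));
    }
}
```

In this model, re-creation only gets through when the UNLOAD reaches the replica; when the DOWN state is never published and the UNLOAD is skipped, the stale exception is what the second create attempt runs into.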
> Collection re-creation fails if previous collection creation had failed
> -----------------------------------------------------------------------
>
> Key: SOLR-10365
> URL: https://issues.apache.org/jira/browse/SOLR-10365
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Ishan Chattopadhyaya
> Attachments: SOLR-10365.patch, SOLR-10365.patch, SOLR-10365.patch,
> SOLR-10365.patch
>
>
> Steps to reproduce:
> # Create collection using a bad configset that has some errors, due to which
> collection creation fails.
> # Now, create a collection using the same name, but a good configset. This
> fails sometimes (about 25-30% of the time, according to my rough estimate).
> Here's what happens during the second step (can be seen from stacktrace
> below):
> # In CoreContainer's create(CoreDescriptor, boolean, boolean), there's a line
> {{ zkSys.getZkController().preRegister(dcore);}}.
> # This calls ZkController's publish(), which in turn calls CoreContainer's
> getCore() method. This call *should* return null (since previous attempt of
> core creation didn't succeed). But, it throws the exception associated with
> the previous failure.
> Here's the stack trace for the same.
> {code}
> Caused by: org.apache.solr.common.SolrException: SolrCore 'newcollection2_shard1_replica1' is not available due to init failure: blahblah
> at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1312)
> at org.apache.solr.cloud.ZkController.publish(ZkController.java:1225)
> at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1399)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:945)
> {code}
> While working on SOLR-6736, I ran into this (nasty?) issue. I'll try to
> isolate this into a standalone test that demonstrates this issue. Otherwise,
> as of now, this can be seen in SOLR-6736's
> testUploadWithScriptUpdateProcessor() test (which tries to re-create the
> collection, but sometimes fails).