[ 
https://issues.apache.org/jira/browse/SOLR-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15945444#comment-15945444
 ] 

Ishan Chattopadhyaya commented on SOLR-10365:
---------------------------------------------

Thanks for your review, Noble.

I think what is happening is the following:

How does a failed collection get cleaned up?
# In CoreContainer's create(CoreDescriptor,boolean,boolean) method, there's a 
preRegister step. This publishes the core as DOWN before even attempting to 
initialize the core.
# When the core fails to initialize, the CoreContainer's coreInitFailures map 
gets populated with the exception.
# At OCMH, when the CreateCollection command fails, a cleanup is attempted. 
This actually calls DELETE, which in turn issues the UNLOAD core admin command 
from DeleteCollectionCmd.java.
# This UNLOAD command is invoked from OCMH's collectionCmd() method, which 
calls UNLOAD core on every replica registered in step 1.
# In the replica's CoreContainer, when the unload() method is invoked, the 
coreInitFailures map entry gets cleared.
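
The five steps above can be sketched as a toy model. This is only an 
illustration of the bookkeeping, not Solr's code: the class ToyCleanupFlow, its 
fields, and its method signatures are hypothetical and merely borrow the names 
used in the list (create, unload, coreInitFailures).

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: hypothetical class that mirrors the names used in the
// steps above. It is not Solr's actual implementation.
class ToyCleanupFlow {
    // Stand-in for the cluster state: replicas published as DOWN in step 1.
    final Set<String> publishedReplicas = new LinkedHashSet<>();
    // Stand-in for CoreContainer's coreInitFailures map.
    final Map<String, Exception> coreInitFailures = new ConcurrentHashMap<>();

    // Steps 1 and 2: publish as DOWN first, then try to initialize the core.
    boolean create(String coreName, boolean configIsGood) {
        publishedReplicas.add(coreName); // step 1: preRegister publishes DOWN
        if (!configIsGood) {
            // step 2: remember why initialization failed
            coreInitFailures.put(coreName, new IllegalStateException("bad configset"));
            return false;
        }
        coreInitFailures.remove(coreName);
        return true;
    }

    // Step 5: unloading a core clears its recorded init failure.
    void unload(String coreName) {
        coreInitFailures.remove(coreName);
    }

    // Steps 3 and 4: cleanup sends UNLOAD to every replica registered in step 1.
    int cleanupFailedCollection() {
        int unloadsSent = 0;
        for (String replica : publishedReplicas) {
            unload(replica);
            unloadsSent++;
        }
        publishedReplicas.clear();
        return unloadsSent;
    }

    public static void main(String[] args) {
        ToyCleanupFlow node = new ToyCleanupFlow();
        boolean created = node.create("newcollection2_shard1_replica1", false);
        System.out.println("created=" + created
                + ", failures=" + node.coreInitFailures.keySet());
        int sent = node.cleanupFailedCollection();
        System.out.println("unloadsSent=" + sent
                + ", failures=" + node.coreInitFailures.keySet());
    }
}
```

In this model the cleanup can only reach replicas that were recorded in step 1, 
which is why everything downstream depends on that publish going through.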

This is all fine when it works. However, the publish step in preRegister seems 
to fail intermittently. I can see that the state operation is offered to the 
distributed queue properly, but that message doesn't seem to get processed. 
Hence, at step 4, no UNLOAD command is sent to the replica, and the stale 
exception stays in its coreInitFailures map. The latest SOLR-6736 patch's 
TestConfigSetsAPI#testUploadWithScriptUpdateProcessor() demonstrates this.

While this may be a larger issue with the way OCMH works, I can see that the 
patch I added here does the job in those circumstances, and the code path 
followed after the core is registered successfully removes the previous 
exception from the coreInitFailures map. Unless someone has objections, I am 
inclined to commit this patch, then commit SOLR-6736, and then continue 
investigating the above scenario.
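
To make the failure mode concrete, here is a minimal sketch of how a stale 
coreInitFailures entry breaks re-creation, and of one way a publish step could 
tolerate it. Everything in it is an assumption for illustration: the class 
StaleInitFailureDemo, the publishTolerant method, and the exception types are 
hypothetical, and this is not necessarily what the attached patch does.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: hypothetical class, not Solr's code and not the patch.
class StaleInitFailureDemo {
    final Map<String, Exception> coreInitFailures = new ConcurrentHashMap<>();
    final Map<String, Object> cores = new ConcurrentHashMap<>();

    // Mimics the reported behaviour: a recorded init failure makes the
    // lookup throw instead of returning null.
    Object getCore(String name) {
        Exception failure = coreInitFailures.get(name);
        if (failure != null) {
            throw new IllegalStateException("SolrCore '" + name
                    + "' is not available due to init failure: "
                    + failure.getMessage(), failure);
        }
        return cores.get(name);
    }

    // Assumed mitigation: treat a failure left over from an earlier attempt
    // as "no core yet" while publishing state.
    void publishTolerant(String name) {
        try {
            getCore(name);
        } catch (IllegalStateException staleFailure) {
            // The failure belongs to a previous request; nothing to do here.
        }
    }

    boolean create(String name, boolean configIsGood, boolean tolerant) {
        // preRegister -> publish happens before the core is initialized.
        if (tolerant) {
            publishTolerant(name);
        } else {
            getCore(name);
        }
        if (!configIsGood) {
            coreInitFailures.put(name, new RuntimeException("bad configset"));
            return false;
        }
        cores.put(name, new Object());
        // Successful registration removes the previous exception.
        coreInitFailures.remove(name);
        return true;
    }
}
```

With this model, a create with a bad configset followed by a non-tolerant 
create under the same name throws from getCore(), while the tolerant variant 
succeeds and leaves coreInitFailures empty.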

> Collection re-creation fails if previous collection creation had failed
> -----------------------------------------------------------------------
>
>                 Key: SOLR-10365
>                 URL: https://issues.apache.org/jira/browse/SOLR-10365
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>         Attachments: SOLR-10365.patch, SOLR-10365.patch, SOLR-10365.patch, 
> SOLR-10365.patch
>
>
> Steps to reproduce:
> # Create collection using a bad configset that has some errors, due to which 
> collection creation fails.
> # Now, create a collection using the same name, but a good configset. This 
> fails sometimes (about 25-30% of the time, according to my rough estimate).
> Here's what happens during the second step (can be seen from stacktrace 
> below):
> # In CoreContainer's create(CoreDescriptor, boolean, boolean), there's a line 
> {{        zkSys.getZkController().preRegister(dcore);}}.
> # This calls ZkController's publish(), which in turn calls CoreContainer's 
> getCore() method. This call *should* return null (since previous attempt of 
> core creation didn't succeed). But, it throws the exception associated with 
> the previous failure.
> Here's the stack trace for the same.
> {code}
> Caused by: org.apache.solr.common.SolrException: SolrCore 'newcollection2_shard1_replica1' is not available due to init failure: blahblah
>       at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1312)
>       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1225)
>       at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1399)
>       at org.apache.solr.core.CoreContainer.create(CoreContainer.java:945)
> {code}
> While working on SOLR-6736, I ran into this (nasty?) issue. I'll try to 
> isolate this into a standalone test that demonstrates the issue. Until then, 
> it can be seen in SOLR-6736's testUploadWithScriptUpdateProcessor() test 
> (which tries to re-create the collection, but sometimes fails).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
