[ 
https://issues.apache.org/jira/browse/SOLR-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822748#comment-16822748
 ] 

Shawn Heisey commented on SOLR-13396:
-------------------------------------

bq. Next step: solr starts, and from all the possible zookeepers it could 
connect to, it connected to the faulty one. And that caused the deletion.

ZK clients (including Solr) connect to *ALL* of the zookeepers that have been 
configured.  They don't connect to just one server unless they have only been 
configured with one server.  ZK should never be placed behind a load balancer.  
If Solr has been configured with multiple servers and it can only connect to 
one, that seems like something we should detect (if we can) and probably refuse 
to proceed with startup.
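
To make the detection idea above concrete, here is a rough sketch in Java.  Everything in it is hypothetical: ZkEnsembleCheck, servers() and shouldRefuseStartup() are illustration names, not existing Solr or ZooKeeper APIs, and the reachability test is left to the caller as a predicate.  It assumes the usual "host:port,host:port/chroot" form of the zkHost string and takes "can only connect to one" to mean one or fewer reachable servers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of Solr or ZooKeeper.
final class ZkEnsembleCheck {

    // Splits "zk1:2181,zk2:2181/solr" into its host:port entries,
    // dropping the optional chroot suffix that follows the first '/'.
    static List<String> servers(String zkHost) {
        int slash = zkHost.indexOf('/');
        String hostPart = slash >= 0 ? zkHost.substring(0, slash) : zkHost;
        List<String> result = new ArrayList<>();
        for (String entry : hostPart.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Returns true when more than one server is configured but at most one
    // of them passes the caller-supplied reachability check.  A single
    // configured server is never refused by this rule.
    static boolean shouldRefuseStartup(List<String> configured,
                                       Predicate<String> reachable) {
        if (configured.size() <= 1) {
            return false;
        }
        long reachableCount = configured.stream().filter(reachable).count();
        return reachableCount <= 1;
    }
}
```

How "reachable" should actually be determined is the open question; the predicate keeps that decision out of the sketch.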

bq. I'd go even further and says: make it an option, default disabled, to shut 
down the solr in case this happens.

That's an interesting idea.  If we combine it with what I initially proposed, 
then there's a hybrid solution that I will describe here:

We create a new option to prevent Solr startup when there are cores that aren't 
referenced in ZK.  Initially, this option will default to disabled, but at some 
point (probably 9.0) we flip the default to enabled.

If the new option is enabled, then Solr will not complete startup when it 
finds cores that aren't referenced in ZK.  The log will indicate why this has 
happened.  If the "bootstrap" option is present, it will take priority over 
the new option.

If the new option is disabled, then here's what will happen:

Cores that do not exist in ZK will not start.  Solr will also check the solr 
home for a marker file with a name like allow_auto_core_delete.  If that file 
exists, Solr will delete the file and then delete the unreferenced cores, as 
if a second new option (automatic deletion of unreferenced cores) were 
enabled.  That second option will default to false, and its default will not 
change in a later release.
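
The decision logic proposed above could be sketched roughly like this in Java.  All names are hypothetical (UnreferencedCorePolicy, Action, decide and the boolean parameters are not existing Solr code), and only the marker file name comes from the proposal.  Two points are my own assumptions: the "bootstrap" option is only consulted when the prevent-startup option is enabled, because the proposal doesn't say what it should do otherwise, and the marker file is only consumed when unreferenced cores are actually found.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the proposed startup behavior, not Solr code.
final class UnreferencedCorePolicy {

    enum Action {
        START_NORMALLY,       // every local core is referenced in ZK
        DEFER_TO_BOOTSTRAP,   // "bootstrap" takes priority over the new option
        ABORT_STARTUP,        // refuse to complete startup and log the reason
        SKIP_UNREFERENCED,    // leave unreferenced cores on disk, don't start them
        DELETE_UNREFERENCED   // delete the unreferenced cores
    }

    static final String MARKER_FILE = "allow_auto_core_delete";

    // One-shot permission: returns true if the marker file existed.
    // The file is removed so it cannot authorize a later startup.
    static boolean consumeMarker(Path solrHome) throws IOException {
        return Files.deleteIfExists(solrHome.resolve(MARKER_FILE));
    }

    static Action decide(boolean hasUnreferencedCores,
                         boolean preventStartup,
                         boolean bootstrap,
                         boolean autoDelete,
                         Path solrHome) throws IOException {
        if (!hasUnreferencedCores) {
            return Action.START_NORMALLY;
        }
        if (preventStartup) {
            return bootstrap ? Action.DEFER_TO_BOOTSTRAP : Action.ABORT_STARTUP;
        }
        // Prevent-startup option disabled: unreferenced cores never start.
        // They are deleted only with explicit permission, either through
        // the auto-delete option or through the one-shot marker file.
        boolean markerPresent = consumeMarker(solrHome);
        return (autoDelete || markerPresent)
                ? Action.DELETE_UNREFERENCED
                : Action.SKIP_UNREFERENCED;
    }
}
```

With both options at their proposed initial defaults and no marker file, decide() returns SKIP_UNREFERENCED, so nothing is deleted unless the admin asks for it.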


> SolrCloud will delete the core data for any core that is not referenced in 
> the clusterstate
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13396
>                 URL: https://issues.apache.org/jira/browse/SOLR-13396
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 7.3.1, 8.0
>            Reporter: Shawn Heisey
>            Priority: Major
>
> SOLR-12066 is an improvement designed to delete core data for replicas that 
> were deleted while the node was down -- better cleanup.
> In practice, that change causes SolrCloud to delete all core data for cores 
> that are not referenced in the ZK clusterstate.  If all the ZK data gets 
> deleted or the Solr instance is pointed at a ZK ensemble with no data, it 
> will proceed to delete all of the cores in the solr home, with no possibility 
> of recovery.
> I do not think that Solr should ever delete core data unless an explicit 
> DELETE action has been made and the node is operational at the time of the 
> request.  If a core exists during startup that cannot be found in the ZK 
> clusterstate, it should be ignored (not started) and a helpful message should 
> be logged.  I think that message should probably be at WARN so that it shows 
> up in the admin UI logging tab with default settings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
