[ https://issues.apache.org/jira/browse/SOLR-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822665#comment-16822665 ]

Koen De Groote commented on SOLR-13396:
---------------------------------------

Just gonna chime in here.

As already stated on the mailing list: what led me to this issue was either an 
iptables misconfiguration or a network issue.

Here's what I assume happened: a new ZooKeeper node was added, it could not 
connect to the other ZooKeeper nodes, and therefore never synced up. But the 
auto-deploy only checks whether the ZooKeeper container is running, not whether 
it has synced with the others, so the deploy script continues.
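
To illustrate the kind of check I mean, here is a rough sketch in Java of a 
deploy-time probe that asks the new ZooKeeper node whether it has actually 
joined the ensemble, using the "srvr" four-letter command. The host name is a 
placeholder, and on newer ZooKeeper versions the command may need to be 
whitelisted via 4lw.commands.whitelist; this is only to show the idea, not how 
our deploy actually looks:

{code:java}
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Deploy-time readiness probe: asks a ZooKeeper node whether it has joined the
// ensemble, instead of only checking that its container/process is running.
public class ZkQuorumProbe {

  public static boolean hasJoinedEnsemble(String host, int port) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), 3000);
      socket.setSoTimeout(3000);
      OutputStream out = socket.getOutputStream();
      out.write("srvr\n".getBytes(StandardCharsets.US_ASCII));
      out.flush();

      // ZooKeeper closes the connection after answering a four-letter command,
      // so we can simply read until EOF.
      InputStream in = socket.getInputStream();
      StringBuilder reply = new StringBuilder();
      int c;
      while ((c = in.read()) != -1) {
        reply.append((char) c);
      }
      // Only a node that is part of the quorum reports itself as leader or follower.
      return reply.indexOf("Mode: leader") >= 0 || reply.indexOf("Mode: follower") >= 0;
    } catch (Exception e) {
      return false; // unreachable or not answering: treat as not ready
    }
  }

  public static void main(String[] args) {
    // "zk-new.example.com" is a placeholder for the freshly added node.
    if (!hasJoinedEnsemble("zk-new.example.com", 2181)) {
      System.err.println("ZooKeeper node is up but has not joined the ensemble, aborting deploy.");
      System.exit(1);
    }
    System.out.println("ZooKeeper node has joined the ensemble, continuing deploy.");
  }
}
{code}

A node that is running but has not joined the quorum answers with something 
like "This ZooKeeper instance is not currently serving requests" instead of a 
Mode line, so a check like this would stop the deploy right there.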

Next step: Solr starts, and of all the ZooKeeper nodes it could have connected 
to, it connected to the faulty one. That is what caused the deletion.

That is a real possibility: a network issue at exactly the moment you're 
deploying. I'd even go as far as to say it's not a rare occurrence. In that 
event, the data should remain safe, so the deploy can be retried once the 
network issue is fixed.

The delay before deletion sounds good. You probably want some form of logging 
attached to that, at WARN and/or ERROR level.
I'd go even further and say: make it an option, disabled by default, to shut 
down the Solr node if this happens. Or should that be something left for the 
user to detect? My whole train of thought for that option is exactly this: what 
if there's nobody around to notice? Ideally that should never be the case, 
certainly not in a professional environment.
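
To make that concrete, here is a minimal sketch of what such a guard could look 
like. The system property names (solr.orphanCore.graceMinutes, 
solr.orphanCore.shutdownOnMismatch) are made up for illustration, they are not 
existing Solr settings, and the shutdown call is only a placeholder:

{code:java}
import java.lang.invoke.MethodHandles;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of a guard around "core exists on disk but is not in the clusterstate".
// The system property names below are invented for this example.
public class OrphanCoreGuard {

  private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());

  // How long to wait before doing anything at all (default 10 minutes, hypothetical knob).
  private final long graceMillis =
      TimeUnit.MINUTES.toMillis(Long.getLong("solr.orphanCore.graceMinutes", 10));

  // Opt-in, off by default: halt the node instead of ever touching the data.
  private final boolean shutdownInstead =
      Boolean.getBoolean("solr.orphanCore.shutdownOnMismatch");

  public void onUnreferencedCore(String coreName) throws InterruptedException {
    log.warn("Core {} exists locally but is not referenced in the ZK clusterstate; "
        + "waiting {} ms before taking any action.", coreName, graceMillis);
    Thread.sleep(graceMillis); // grace period: maybe ZK was simply unreachable or mid-recovery

    if (shutdownInstead) {
      log.error("Core {} is still unreferenced after the grace period; shutting this node down "
          + "instead of deleting anything.", coreName);
      System.exit(1); // placeholder; a real integration would go through CoreContainer shutdown
    } else {
      log.error("Core {} is still unreferenced after the grace period; leaving its data on disk "
          + "for manual cleanup.", coreName);
    }
  }
}
{code}

The point is simply that the default path never deletes anything on its own; 
the strongest automatic action, shutting the node down, is opt-in.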

Still, I feel like automatic deletes should never occur when it comes to data 
storage. If a data set is retired, for whatever reason, it should be up to the 
team maintaining it to decide and then manually do the cleanup.

> SolrCloud will delete the core data for any core that is not referenced in 
> the clusterstate
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13396
>                 URL: https://issues.apache.org/jira/browse/SOLR-13396
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 7.3.1, 8.0
>            Reporter: Shawn Heisey
>            Priority: Major
>
> SOLR-12066 is an improvement designed to delete core data for replicas that 
> were deleted while the node was down -- better cleanup.
> In practice, that change causes SolrCloud to delete all core data for cores 
> that are not referenced in the ZK clusterstate.  If all the ZK data gets 
> deleted or the Solr instance is pointed at a ZK ensemble with no data, it 
> will proceed to delete all of the cores in the solr home, with no possibility 
> of recovery.
> I do not think that Solr should ever delete core data unless an explicit 
> DELETE action has been made and the node is operational at the time of the 
> request.  If a core exists during startup that cannot be found in the ZK 
> clusterstate, it should be ignored (not started) and a helpful message should 
> be logged.  I think that message should probably be at WARN so that it shows 
> up in the admin UI logging tab with default settings.


