[ https://issues.apache.org/jira/browse/SOLR-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281442#comment-15281442 ]
Varun Thacker commented on SOLR-9092:
-------------------------------------

bq. otherwise how else would you remove a replica which has been decommissioned.

Fair point. One caveat though - for people who aren't using legacyCloud=true (the default), if the node ever comes back up the collection will be created again, which will be puzzling to the user :) But maybe we should stop optimizing for this, since we want to move to "zk as truth" sooner rather than later.

> Add safety checks to delete replica/shard/collection commands
> --------------------------------------------------------------
>
>                 Key: SOLR-9092
>                 URL: https://issues.apache.org/jira/browse/SOLR-9092
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker
>            Priority: Minor
>
> We should verify the delete commands against live_nodes to make sure the API can at least be executed correctly.
> Take a two-node cluster with a collection that has 1 shard and 2 replicas. Call the delete replica command for the replica whose node is currently down. You get an exception:
> {code}
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">5173</int>
>   </lst>
>   <lst name="failure">
>     <str name="192.168.1.101:7574_solr">org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.1.101:7574/solr</str>
>   </lst>
> </response>
> {code}
> At this point the entry for the replica is gone from state.json. The client application retries since an error was thrown, but the delete command can never succeed now, and an error like this is returned:
> {code}
> <response>
>   <lst name="responseHeader">
>     <int name="status">400</int>
>     <int name="QTime">137</int>
>   </lst>
>   <str name="Operation deletereplica caused exception:">org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
>   <lst name="exception">
>     <str name="msg">Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
>     <int name="rspCode">400</int>
>   </lst>
>   <lst name="error">
>     <lst name="metadata">
>       <str name="error-class">org.apache.solr.common.SolrException</str>
>       <str name="root-error-class">org.apache.solr.common.SolrException</str>
>     </lst>
>     <str name="msg">Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
>     <int name="code">400</int>
>   </lst>
> </response>
> {code}
> For create collection/add-replica we check the "createNodeSet" and "node" params respectively against live_nodes, to make sure the request has a chance of succeeding. We should add a check against live_nodes for the delete commands as well.
> Another situation where I saw this become a problem: a second Solr cluster was cloned from the first, but the cloning script didn't correctly change the hostnames in the state.json file. When a delete command was issued against the second cluster, Solr deleted the replica from the first cluster.
> In that case the script was obviously buggy, but if we verified against live_nodes then Solr wouldn't have gone ahead and deleted replicas that don't belong to its own cluster.
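For illustration, here is a rough sketch of the kind of live_nodes pre-check described above, written against the SolrJ cloud classes (ZkStateReader, ClusterState, Slice, Replica). The helper name validateReplicaIsLive and its placement are assumptions made for this sketch, not code from a patch on this issue.

{code:java}
// Illustrative only - validateReplicaIsLive is a hypothetical helper, not from a patch on this issue.
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrException.ErrorCode;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class DeleteReplicaPreCheck {

  /**
   * Rejects a DELETEREPLICA request up front when the replica's node is not in
   * live_nodes, so the replica entry is never dropped from state.json for a
   * node that cannot service the core-level delete.
   */
  static void validateReplicaIsLive(ZkStateReader zkStateReader, String collection,
                                    String shard, String replicaName) {
    ClusterState clusterState = zkStateReader.getClusterState();
    DocCollection docCollection = clusterState.getCollection(collection);

    Slice slice = docCollection.getSlice(shard);
    if (slice == null) {
      throw new SolrException(ErrorCode.BAD_REQUEST,
          "Invalid shard : " + shard + " in collection : " + collection);
    }

    Replica replica = slice.getReplica(replicaName);
    if (replica == null) {
      throw new SolrException(ErrorCode.BAD_REQUEST,
          "Invalid replica : " + replicaName + " in shard/collection : "
              + shard + "/" + collection);
    }

    // The actual safety check: fail fast instead of partially deleting cluster state.
    if (!clusterState.getLiveNodes().contains(replica.getNodeName())) {
      throw new SolrException(ErrorCode.BAD_REQUEST,
          "Cannot delete replica : " + replicaName + " because its node "
              + replica.getNodeName() + " is not in live_nodes");
    }
  }
}
{code}

This mirrors the existing createNodeSet/node validation done for create collection and add-replica, and a check like it would also catch the cloned-cluster case above, since the stale hostnames copied into state.json would not appear in the second cluster's live_nodes.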