[ https://issues.apache.org/jira/browse/SOLR-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484629#comment-13484629 ]
Mark Miller commented on SOLR-3561:
-----------------------------------

It's very likely this could have been SOLR-3939.

> Error during deletion of shard/core
> -----------------------------------
>
>                 Key: SOLR-3561
>                 URL: https://issues.apache.org/jira/browse/SOLR-3561
>             Project: Solr
>          Issue Type: Bug
>          Components: multicore, replication (java), SolrCloud
>    Affects Versions: 4.0-ALPHA
>         Environment: Solr trunk (4.0-SNAPSHOT) from 29/2-2012
>            Reporter: Per Steffensen
>            Assignee: Mark Miller
>             Fix For: 4.1, 5.0
>
>
> Running several Solr servers in a Cloud cluster (zkHost set on the Solr
> servers).
> Several collections with several slices and one replica for each slice (each
> slice therefore has two shards).
> Basically we want to let our system delete an entire collection. We do this by
> deleting each and every shard under the collection. Each shard is deleted one
> by one, by sending a CoreAdmin UNLOAD request to the relevant Solr server:
> {code}
> CoreAdminRequest request = new CoreAdminRequest();
> request.setAction(CoreAdminAction.UNLOAD);
> request.setCoreName(shardName);
> CoreAdminResponse resp = request.process(new CommonsHttpSolrServer(solrUrl));
> {code}
> The delete/unload succeeds, but in roughly 10% of the cases we get errors on
> the involved Solr servers, right around the time the shards/cores are deleted,
> and we end up in a situation where ZK still claims (forever) that the deleted
> shard is present and active.
> From here the issue is more easily explained with a concrete example:
> - 7 Solr servers involved
> - Several collections, among others one called "collection_2012_04",
>   consisting of 28 slices and 56 shards (remember: 1 replica for each slice),
>   named "collection_2012_04_sliceX_shardY" for all pairs in {X:1..28}x{Y:1,2}
> - Each Solr server running 8 shards, e.g. Solr server #1 is running shard
>   "collection_2012_04_slice1_shard1" and Solr server #7 is running shard
>   "collection_2012_04_slice1_shard2", both belonging to the same slice
>   "slice1".
> When we decide to delete the collection "collection_2012_04" we go through all
> 56 shards and delete/unload them one by one - including
> "collection_2012_04_slice1_shard1" and "collection_2012_04_slice1_shard2" - as
> sketched below.
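>
> As a concrete sketch (not our exact production code), the deletion loop looks
> roughly like this; "baseSolrUrlFor" is a hypothetical helper that resolves the
> base URL of the Solr server hosting a given shard, and error handling is
> omitted:
> {code}
> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
> import org.apache.solr.client.solrj.request.CoreAdminRequest;
> import org.apache.solr.client.solrj.response.CoreAdminResponse;
> import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;
>
> // Unload every shard of the collection, one by one.
> for (int x = 1; x <= 28; x++) {
>   for (int y = 1; y <= 2; y++) {
>     String shardName = "collection_2012_04_slice" + x + "_shard" + y;
>     CoreAdminRequest request = new CoreAdminRequest();
>     request.setAction(CoreAdminAction.UNLOAD);
>     request.setCoreName(shardName);
>     // baseSolrUrlFor() is a hypothetical helper returning e.g.
>     // "http://solr_server_7:8983/solr" for the server hosting shardName.
>     CoreAdminResponse resp = request.process(
>         new CommonsHttpSolrServer(baseSolrUrlFor(shardName)));
>   }
> }
> {code}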
> At some point during or shortly after all this deletion we see the following
> exceptions in solr.log on Solr server #7:
> {code}
> Aug 1, 2012 12:02:50 AM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover:org.apache.solr.common.SolrException: core not found:collection_2012_04_slice1_shard1
> request: http://solr_server_1:8983/solr/admin/cores?action=PREPRECOVERY&core=collection_2012_04_slice1_shard1&nodeName=solr_server_7%3A8983_solr&coreNodeName=solr_server_7%3A8983_solr_collection_2012_04_slice1_shard2&state=recovering&checkLive=true&pauseFor=6000&wt=javabin&version=2
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>   at org.apache.solr.common.SolrExceptionPropagationHelper.decodeFromMsg(SolrExceptionPropagationHelper.java:29)
>   at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:445)
>   at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:264)
>   at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:188)
>   at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:285)
>   at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:206)
> Aug 1, 2012 12:02:50 AM org.apache.solr.common.SolrException log
> SEVERE: Recovery failed - trying again...
> Aug 1, 2012 12:02:51 AM org.apache.solr.cloud.LeaderElector$1 process
> WARNING:
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>   at java.util.ArrayList.get(ArrayList.java:322)
>   at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:96)
>   at org.apache.solr.cloud.LeaderElector.access$000(LeaderElector.java:57)
>   at org.apache.solr.cloud.LeaderElector$1.process(LeaderElector.java:121)
>   at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> Aug 1, 2012 12:02:51 AM org.apache.solr.cloud.LeaderElector$1 process
> {code}
> I'm not sure exactly how to interpret this, but it seems to me that some
> recovery job tries to recover collection_2012_04_slice1_shard2 on Solr server
> #7 from collection_2012_04_slice1_shard1 on Solr server #1, but fails because
> Solr server #1 answers back that it doesn't run
> collection_2012_04_slice1_shard1 (anymore).
> This problem occurs for several (in this concrete test, 4) of the 28 slice
> pairs. For those 4 shards the end result is that
> /node_states/solr_server_X:8983_solr in ZK still contains information about
> the shard being running and active. E.g. /node_states/solr_server_7:8983_solr
> still contains
> {code}
> {
>   "shard":"slice1",
>   "state":"active",
>   "core":"collection_2012_04_slice1_shard2",
>   "collection":"collection_2012_04",
>   "node_name":"solr_server_7:8983_solr",
>   "base_url":"http://solr_server_7:8983/solr"
> }
> {code}
> and CloudState therefore still reports that those shards are running and
> active - but they are not.
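>
> For what it's worth, the stale entry can be inspected directly with the plain
> ZooKeeper Java client; a minimal sketch, where "zkHost:2181" stands in for the
> actual ZK connect string:
> {code}
> import org.apache.zookeeper.ZooKeeper;
>
> // Read the (stale) node state straight out of ZooKeeper.
> ZooKeeper zk = new ZooKeeper("zkHost:2181", 10000, null);
> byte[] data = zk.getData("/node_states/solr_server_7:8983_solr", false, null);
> // Still shows the "active" JSON above, even though the core is long gone.
> System.out.println(new String(data, "UTF-8"));
> zk.close();
> {code}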
> Among other things, I have noticed that "collection_2012_04_slice1_shard2" HAS
> been removed from solr.xml on Solr server #7 (we are running with
> persistent="true").
> Any chance that this bug is fixed in a later revision (than the one from
> 29/2-2012) of 4.0-SNAPSHOT?
> If not, we need to get it fixed, I believe.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org