[ https://issues.apache.org/jira/browse/SOLR-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484629#comment-13484629 ]

Mark Miller commented on SOLR-3561:
-----------------------------------

It's very likely this could have been SOLR-3939.
                
> Error during deletion of shard/core
> -----------------------------------
>
>                 Key: SOLR-3561
>                 URL: https://issues.apache.org/jira/browse/SOLR-3561
>             Project: Solr
>          Issue Type: Bug
>          Components: multicore, replication (java), SolrCloud
>    Affects Versions: 4.0-ALPHA
>         Environment: Solr trunk (4.0-SNAPSHOT) from 29/2-2012
>            Reporter: Per Steffensen
>            Assignee: Mark Miller
>             Fix For: 4.1, 5.0
>
>
> We are running several Solr servers in a SolrCloud cluster (zkHost set on the Solr 
> servers), with several collections, each having several slices and one replica per 
> slice (so each slice has two shards).
> Basically we want to let our system delete an entire collection. We do this by 
> deleting each and every shard under the collection. Each shard is deleted one by 
> one, by issuing a CoreAdmin UNLOAD request against the relevant Solr server:
> {code}
> CoreAdminRequest request = new CoreAdminRequest();
> request.setAction(CoreAdminAction.UNLOAD);
> request.setCoreName(shardName);
> CoreAdminResponse resp = request.process(new CommonsHttpSolrServer(solrUrl));
> {code}
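> For completeness, here is a minimal sketch of how that deletion loop might look for the 
> concrete collection described below. solrUrlForShard() is a hypothetical helper that looks 
> up the base URL of the Solr server currently hosting a given core (e.g. via CloudState); 
> it is hard-coded here just for the sketch.
> {code}
> import java.io.IOException;
> import java.net.MalformedURLException;
>
> import org.apache.solr.client.solrj.SolrServerException;
> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
> import org.apache.solr.client.solrj.request.CoreAdminRequest;
> import org.apache.solr.client.solrj.response.CoreAdminResponse;
> import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;
>
> public class CollectionDeleter {
>
>   /** Unload every shard of the collection, one CoreAdmin UNLOAD request per core. */
>   public void deleteCollection(String collection, int numSlices)
>       throws MalformedURLException, SolrServerException, IOException {
>     for (int slice = 1; slice <= numSlices; slice++) {
>       for (int replica = 1; replica <= 2; replica++) {
>         String shardName = collection + "_slice" + slice + "_shard" + replica;
>         CoreAdminRequest request = new CoreAdminRequest();
>         request.setAction(CoreAdminAction.UNLOAD);
>         request.setCoreName(shardName);
>         // Hypothetical helper resolving which Solr server hosts this core.
>         CoreAdminResponse resp =
>             request.process(new CommonsHttpSolrServer(solrUrlForShard(shardName)));
>       }
>     }
>   }
>
>   private String solrUrlForShard(String shardName) {
>     // Hypothetical lookup (e.g. from CloudState/ZooKeeper); hard-coded for the sketch.
>     return "http://solr_server_1:8983/solr";
>   }
> }
> {code}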
> The delete/unload succeeds, but in about 10% of the cases we get errors on the 
> involved Solr servers right around the time the shards/cores are deleted, and we 
> end up in a situation where ZK still claims (forever) that the deleted shard is 
> present and active.
> From here the issue is more easily explained with a concrete example:
> - 7 Solr servers involved
> - Several collections, among them one called "collection_2012_04", consisting of 28 
> slices and 56 shards (remember: 1 replica for each slice), named 
> "collection_2012_04_sliceX_shardY" for all pairs in {X:1..28}x{Y:1,2}
> - Each Solr server running 8 shards, e.g. Solr server #1 is running shard 
> "collection_2012_04_slice1_shard1" and Solr server #7 is running shard 
> "collection_2012_04_slice1_shard2", both belonging to the same slice "slice1".
> When we decide to delete the collection "collection_2012_04" we go through all 56 
> shards and delete/unload them one by one, including 
> "collection_2012_04_slice1_shard1" and "collection_2012_04_slice1_shard2". At some 
> point during, or shortly after, this deletion we see the following exceptions in 
> solr.log on Solr server #7:
> {code}
> Aug 1, 2012 12:02:50 AM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover:org.apache.solr.common.SolrException: core not found:collection_2012_04_slice1_shard1
> request: http://solr_server_1:8983/solr/admin/cores?action=PREPRECOVERY&core=collection_2012_04_slice1_shard1&nodeName=solr_server_7%3A8983_solr&coreNodeName=solr_server_7%3A8983_solr_collection_2012_04_slice1_shard2&state=recovering&checkLive=true&pauseFor=6000&wt=javabin&version=2
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at org.apache.solr.common.SolrExceptionPropagationHelper.decodeFromMsg(SolrExceptionPropagationHelper.java:29)
> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:445)
> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:264)
> at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:188)
> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:285)
> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:206)
> Aug 1, 2012 12:02:50 AM org.apache.solr.common.SolrException log
> SEVERE: Recovery failed - trying again...
> Aug 1, 2012 12:02:51 AM org.apache.solr.cloud.LeaderElector$1 process
> WARNING:
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> at java.util.ArrayList.get(ArrayList.java:322)
> at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:96)
> at org.apache.solr.cloud.LeaderElector.access$000(LeaderElector.java:57)
> at org.apache.solr.cloud.LeaderElector$1.process(LeaderElector.java:121)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> Aug 1, 2012 12:02:51 AM org.apache.solr.cloud.LeaderElector$1 process
> {code}
> I'm not sure exactly how to interpret this, but it seems to me that a recovery job 
> tries to recover collection_2012_04_slice1_shard2 on Solr server #7 from 
> collection_2012_04_slice1_shard1 on Solr server #1, but fails because Solr server #1 
> answers that it no longer runs collection_2012_04_slice1_shard1.
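> The failing PREPRECOVERY call can also be replayed by hand with a plain HTTP GET against 
> the URL from the log above (a minimal sketch; hosts/ports are the ones from our setup, 
> wt=javabin is dropped so the response is readable, and once the core is unloaded Solr 
> server #1 answers with the same "core not found" error):
> {code}
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.net.HttpURLConnection;
> import java.net.URL;
>
> public class ReplayPrepRecovery {
>   public static void main(String[] args) throws Exception {
>     // Same request as in the log above, minus wt=javabin.
>     URL url = new URL("http://solr_server_1:8983/solr/admin/cores"
>         + "?action=PREPRECOVERY&core=collection_2012_04_slice1_shard1"
>         + "&nodeName=solr_server_7%3A8983_solr"
>         + "&coreNodeName=solr_server_7%3A8983_solr_collection_2012_04_slice1_shard2"
>         + "&state=recovering&checkLive=true&pauseFor=6000");
>     HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>     System.out.println("HTTP " + conn.getResponseCode());
>     // On failure the body is on the error stream and reports "core not found".
>     BufferedReader in = new BufferedReader(new InputStreamReader(
>         conn.getErrorStream() != null ? conn.getErrorStream() : conn.getInputStream(), "UTF-8"));
>     String line;
>     while ((line = in.readLine()) != null) {
>       System.out.println(line);
>     }
>     in.close();
>   }
> }
> {code}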
> This problem occurs for several (in this concrete test, 4) of the 28 slice pairs. For 
> those 4 shards the end result is that /node_states/solr_server_X:8983_solr in ZK 
> still contains an entry claiming the shard is running and active. E.g. 
> /node_states/solr_server_7:8983_solr still contains
> {code}
> { 
>  "shard":"slice1",
>  "state":"active",
>  "core":"collection_2012_04_slice1_shard2",
>  "collection":"collection_2012_04",
>  "node_name":"solr_server_7:8983_solr",
>  "base_url":"http://solr_server_7:8983/solr";
> } 
> {code}
> and that CloudState therefore still reports that those shards are running and 
> active, but they are not. Among other things, I have noticed that 
> "collection_2012_04_slice1_shard2" HAS been removed from solr.xml on Solr 
> server #7 (we are running with persistent="true").
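> One way to confirm the stale entry is to read the znode directly with the plain 
> ZooKeeper client (a minimal sketch; "zkhost1:2181" is an assumed address, use the 
> cluster's real zkHost):
> {code}
> import java.util.concurrent.CountDownLatch;
>
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> public class NodeStateDump {
>   public static void main(String[] args) throws Exception {
>     final CountDownLatch connected = new CountDownLatch(1);
>     // Connect to the same ensemble the Solr servers use (assumed address here).
>     ZooKeeper zk = new ZooKeeper("zkhost1:2181", 10000, new Watcher() {
>       public void process(WatchedEvent event) {
>         if (event.getState() == Event.KeeperState.SyncConnected) {
>           connected.countDown();
>         }
>       }
>     });
>     connected.await();
>     String path = "/node_states/solr_server_7:8983_solr";
>     byte[] data = zk.getData(path, false, new Stat());
>     // After the failed deletion this still prints the "active" entry shown above.
>     System.out.println(new String(data, "UTF-8"));
>     zk.close();
>   }
> }
> {code}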
> Any chance that this bug is fixed in a later revision of 4.0-SNAPSHOT (than the one 
> from 29/2-2012)?
> If not, we need to get it fixed, I believe.
