[ https://issues.apache.org/jira/browse/SOLR-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167649#comment-15167649 ]

Frank J Kelly commented on SOLR-4900:
-------------------------------------

Hit this also. Now we just stop the service first and wait for a leader 
election to complete before restarting.

Surely there is some way to resolve this "easily"?
If Solr says it is not the leader, shouldn't ZK elect a new leader?
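
For reference, a minimal sketch of that "wait for a leader election" gate, done 
against ZooKeeper directly with the plain client; the connect string and the 
/collections/browse/leaders/shard2 znode path are assumptions based on the 4.x 
layout, so adjust them for your setup:

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Blocks until ZooKeeper shows a registered leader for the shard, then exits.
// Intended to run between stopping the old leader and starting it again.
public class WaitForShardLeader {
    public static void main(String[] args) throws Exception {
        String zkConnect = "zk1:2181,zk2:2181,zk3:2181";           // assumed ZK ensemble
        String leaderPath = "/collections/browse/leaders/shard2";  // assumed Solr 4.x leader znode

        ZooKeeper zk = new ZooKeeper(zkConnect, 15000, new Watcher() {
            public void process(WatchedEvent event) { /* default watcher: nothing to do */ }
        });
        try {
            while (true) {
                try {
                    byte[] data = zk.getData(leaderPath, false, null);
                    System.out.println("Leader registered: " + new String(data, "UTF-8"));
                    break;
                } catch (KeeperException.NoNodeException e) {
                    System.out.println("No registered leader at " + leaderPath + " yet, waiting...");
                    Thread.sleep(2000);
                }
            }
        } finally {
            zk.close();
        }
    }
}
{code}

We run it as a blocking step in the restart script; it returns only once some 
node has registered as leader for the shard.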

> Leader election deadlock after restarting leader in 4.2.1
> ---------------------------------------------------------
>
>                 Key: SOLR-4900
>                 URL: https://issues.apache.org/jira/browse/SOLR-4900
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.2
>         Environment: Linux 64 bit, Tomcat 6.0.35, Java 6u27 64 bit
>            Reporter: John Guerrero
>            Assignee: Mark Miller
>
> Copying post from 
> http://lucene.472066.n3.nabble.com/Leader-election-deadlock-after-restarting-leader-in-4-2-1-td4067988.html
> SOLR 4.2.1, tomcat 6.0.35, CentOS 6.2 (2.6.32-220.4.1.el6.x86_64 #1 SMP), 
> java 6u27 64 bit 
> 6 nodes, 2 shards, 3 replicas each.  Names changed to r1s2 (replica1 - shard 
> 2), r2s2, and r3s2 for each replica in shard 2. 
> *What we see*:
> * Under production load, we restart a leader (r1s2), and observe in the cloud 
> admin 
> that the old leader is in state "Down" and no new leader is ever elected. 
> * The system will stay like this until we stop the old leader (or cause a ZK 
> timeout...see below). 
> *Please note*: the leader is killed, then kill -9'd 5 seconds later, before 
> restarting.  We have since changed this. 
> *Digging into the logs on the old leader (r1s2 = replica1-shard 2)*:
> * The old leader restarted at 5:23:29 PM, but appears to be stuck in 
> SolrDispatchFilter.init() -- (See recovery at bottom). 
> * It doesn't want to become leader, possibly due to the unclean shutdown. 
> May 28, 2013 5:24:42 PM org.apache.solr.update.PeerSync handleVersions 
> INFO: PeerSync: core=browse url=http://r1s2:8080/solr  Our versions are too 
> old. ourHighThreshold=1436325665147191297 
> otherLowThreshold=1436325775374548992 
> * It then tries to recover, but cannot, because there is no leader. 
> May 28, 2013 5:24:43 PM org.apache.solr.common.SolrException log 
> SEVERE: Error while trying to recover. 
> core=browse:org.apache.solr.common.SolrException: No registered leader was 
> found, collection:browse slice:shard2 
> * Meanwhile, it appears that blocking in init() prevents the http-8080 
> handler from starting (See recovery at bottom). 
> *Digging into the other replicas (r2s2)*:
> * For some reason, the old leader (r1s2) remains in the list of replicas that 
> r2s2 attempts to sync to. 
> May 28, 2013 5:23:42 PM org.apache.solr.update.PeerSync sync 
> INFO: PeerSync: core=browse url=http://r2s2:8080/solr START 
> replicas=[http://r1s2:8080/solr/browse/, http://r3s2:8080/solr/browse/] 
> nUpdates=100 
> * This apparently fails (30 second timeout), possibly because the http-8080 
> handler has not been started on r1s2. 
> May 28, 2013 5:24:12 PM org.apache.solr.update.PeerSync handleResponse 
> WARNING: PeerSync: core=browse url=http://r2s2:8080/solr  exception talking 
> to http://r1s2:8080/solr/browse/, failed 
> org.apache.solr.client.solrj.SolrServerException: Timeout occured while 
> waiting response from server at: http://r1s2:8080/solr/browse
> *At this point, the cluster will remain without a leader indefinitely if 
> nothing else changes.*
> But in this particular instance, we took some stack and heap dumps from r1s2, 
> which paused java 
> long enough to cause a *zookeeper timeout on the old leader (r1s2)*: 
> May 28, 2013 5:33:26 PM org.apache.zookeeper.ClientCnxn$SendThread run 
> INFO: Client session timed out, have not heard from server in 38226ms for 
> sessionid 0x23d28e0f584005d, closing socket connection and attempting 
> reconnect 
> Then, one of the replicas (r3s2) finally stopped trying to sync to r1s2 and 
> succeeded in becoming leader: 
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync 
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr START 
> replicas=[http://r2s2:8080/solr/browse/] nUpdates=100 
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions 
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Received 100 versions 
> from r2s2:8080/solr/browse/ 
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions 
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Our versions are 
> newer. ourLowThreshold=1436325775374548992 otherHigh=1436325775805513730 
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync 
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr DONE. sync succeeded 
> Now that we have a leader, r1s2 can succeed in recovery and finish 
> SolrDispatchFilter.init(), apparently allowing the http-8080 handler on r1s2 
> to start. 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy replay 
> INFO: No replay needed. core=browse 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery 
> INFO: Replication Recovery was successful - registering as Active. 
> core=browse 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
> INFO: publishing core=browse state=active 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
> INFO: numShards not found on descriptor - reading it from system property 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery 
> INFO: Finished recovery process. core=browse 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy run 
> INFO: Starting recovery process.  core=browse recoveringAfterStartup=false 
> May 28, 2013 5:34:49 PM org.apache.solr.common.cloud.ZkStateReader 
> updateClusterState 
> INFO: Updating cloud state from ZooKeeper... 
> May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init 
> INFO: user.dir=/ 
> May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init 
> *INFO: SolrDispatchFilter.init() done*
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
> INFO: publishing core=browse state=recovering 
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
> INFO: numShards not found on descriptor - reading it from system property 
> May 28, 2013 5:34:49 PM org.apache.solr.client.solrj.impl.HttpClientUtil 
> createClient 
> INFO: Creating new http client, 
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false 
> May 28, 2013 5:34:49 PM org.apache.catalina.startup.HostConfig 
> deployDirectory 
> INFO: Deploying web application directory ROOT 
> May 28, 2013 5:34:49 PM org.apache.coyote.http11.Http11AprProtocol start 
> *INFO: Starting Coyote HTTP/1.1 on http-8080  #<---------------------- 
> http-8080 started!*
> May 28, 2013 5:34:49 PM org.apache.coyote.ajp.AjpAprProtocol start 
> INFO: Starting Coyote AJP/1.3 on ajp-8009 
> May 28, 2013 5:34:49 PM org.apache.catalina.startup.Catalina start 
> *INFO: Server startup in 668324 ms #<------------------------------ 668 sec = 
> 11 minutes to start Catalina.*
> *Our Workaround*:
> * We changed our script to allow 15 seconds before kill -9. 
> * Also, we no longer do a restart.  We just stop the leader and wait for a 
> new leader.  This 
> still results in a few "No registered leader was found" exceptions, but at 
> least the duration is short. 
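
Regarding the workaround quoted above (stop the leader, wait for a new leader, 
then restart): a sketch of that wait as a small pre-restart gate, under the same 
assumptions as the earlier snippet in this comment; the znode path, ZK connect 
string, and node_name value are illustrative guesses, not values confirmed by 
these logs.

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// After stopping the old leader, poll ZooKeeper until the registered leader
// is some other node, then let the restart proceed. All names are assumed.
public class WaitForDifferentLeader {
    public static void main(String[] args) throws Exception {
        String zkConnect = "zk1:2181,zk2:2181,zk3:2181";           // assumed ZK ensemble
        String leaderPath = "/collections/browse/leaders/shard2";  // assumed Solr 4.x leader znode
        String oldLeaderNode = "r1s2:8080_solr";                   // assumed node_name of the stopped leader

        ZooKeeper zk = new ZooKeeper(zkConnect, 15000, new Watcher() {
            public void process(WatchedEvent event) { /* default watcher: nothing to do */ }
        });
        try {
            while (true) {
                String leaderJson = null;
                try {
                    leaderJson = new String(zk.getData(leaderPath, false, null), "UTF-8");
                } catch (KeeperException.NoNodeException e) {
                    // no leader registered at the moment
                }
                // The leader znode data is JSON containing node_name/base_url; a
                // substring check is crude but enough for a restart gate like this.
                if (leaderJson != null && !leaderJson.contains(oldLeaderNode)) {
                    System.out.println("New leader elected: " + leaderJson);
                    break;
                }
                System.out.println("Old leader still registered (or no leader yet), waiting...");
                Thread.sleep(2000);
            }
        } finally {
            zk.close();
        }
    }
}
{code}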


