SOLR 4.2.1, tomcat 6.0.35, CentOS 6.2 (2.6.32-220.4.1.el6.x86_64 #1 SMP),
java 6u27 64 bit
6 nodes, 2 shards, 3 replicas each.  Names changed to r1s2 (replica1 - shard
2), r2s2, and r3s2 for each replica in shard 2.

What we see:

* Under production load, we restart a leader (r1s2), and observe in the
cloud admin
that the old leader is in state "Down" and no new leader is ever elected.
* The system will stay like this until we stop the old leader (or cause a ZK
timeout...see below).

*Please note:* the leader is killed, then kill -9'd 5 seconds later, before
restarting.  We have since changed this.

Digging into the logs on the old leader (r1s2 = replica1-shard 2):

* The old leader restarted at 5:23:29 PM, but appears to be stuck in
SolrDispatchFilter.init() -- (See recovery at bottom).
* It doesn't want to become leader, possibly due to the unclean shutdown.
May 28, 2013 5:24:42 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=browse url=http://r1s2:8080/solr  Our versions are too
old. ourHighThreshold=1436325665147191297
otherLowThreshold=1436325775374548992
* It then tries to recover, but cannot, because there is no leader.
May 28, 2013 5:24:43 PM org.apache.solr.common.SolrException log
SEVERE: Error while trying to recover.
core=browse:org.apache.solr.common.SolrException: No registered leader was
found, collection:browse slice:shard2
* Meanwhile, it appears that blocking in init() prevents the http-8080
handler from starting (See recovery at bottom).

Digging into the other replicas (r2s2):

* For some reason, the old leader (r1s2) remains in the list of replicas
that r2s2 attempts to sync with.
May 28, 2013 5:23:42 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=browse url=http://r2s2:8080/solr START
replicas=[http://r1s2:8080/solr/browse/, http://r3s2:8080/solr/browse/]
nUpdates=100
* This apparently fails with a 30-second timeout, possibly because the
http-8080 handler is not yet started on r1s2.
May 28, 2013 5:24:12 PM org.apache.solr.update.PeerSync handleResponse
WARNING: PeerSync: core=browse url=http://r2s2:8080/solr  exception talking
to http://r1s2:8080/solr/browse/, failed
org.apache.solr.client.solrj.SolrServerException: Timeout occured while
waiting response from server at: http://r1s2:8080/solr/browse

*At this point, the cluster will remain without a leader indefinitely if
nothing else changes.*
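For monitoring, this leaderless state is detectable from clusterstate.json: in Solr 4.x the leader replica of each shard carries a "leader":"true" property, so a shard whose replicas all lack that flag has no leader. A minimal sketch (the sample state below is hypothetical, loosely mirroring our cluster; fetching clusterstate.json from ZooKeeper or the admin UI is left out):

```python
import json


def leaderless_shards(clusterstate, collection):
    """Return shard names in `collection` with no replica marked as leader.

    `clusterstate` is the parsed clusterstate.json (Solr 4.x layout).
    """
    shards = clusterstate[collection]["shards"]
    return [name for name, shard in shards.items()
            if not any(replica.get("leader") == "true"
                       for replica in shard["replicas"].values())]


# Hypothetical sample mirroring our situation: r1s2 is "down" and no
# replica in shard2 is flagged as leader.
sample = {
    "browse": {"shards": {
        "shard2": {"replicas": {
            "r1s2_browse": {"state": "down"},
            "r2s2_browse": {"state": "active"},
            "r3s2_browse": {"state": "active"},
        }},
    }},
}
print(leaderless_shards(sample, "browse"))  # -> ['shard2']
```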

But in this particular instance, we took some stack and heap dumps from
r1s2, which paused the JVM long enough to cause a *ZooKeeper timeout on the
old leader (r1s2)*:
May 28, 2013 5:33:26 PM org.apache.zookeeper.ClientCnxn$SendThread run
INFO: Client session timed out, have not heard from server in 38226ms for
sessionid 0x23d28e0f584005d, closing socket connection and attempting
reconnect

Then, one of the replicas (r3s2) finally stopped trying to sync to r1s2 and
succeeded in becoming leader:
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=browse url=http://r3s2:8080/solr START
replicas=[http://r2s2:8080/solr/browse/] nUpdates=100
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Received 100 versions
from r2s2:8080/solr/browse/
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Our versions are
newer. ourLowThreshold=1436325775374548992 otherHigh=1436325775805513730
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=browse url=http://r3s2:8080/solr DONE. sync succeeded

Now that we have a leader, r1s2 can complete recovery and finish
SolrDispatchFilter.init(), which apparently allows the http-8080 handler to
start on r1s2.
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy replay
INFO: No replay needed. core=browse
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Replication Recovery was successful - registering as Active.
core=browse
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
INFO: publishing core=browse state=active
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
INFO: numShards not found on descriptor - reading it from system property
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Finished recovery process. core=browse
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy run
INFO: Starting recovery process.  core=browse recoveringAfterStartup=false
May 28, 2013 5:34:49 PM org.apache.solr.common.cloud.ZkStateReader
updateClusterState
INFO: Updating cloud state from ZooKeeper...
May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
INFO: user.dir=/
May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
*INFO: SolrDispatchFilter.init() done*
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
INFO: publishing core=browse state=recovering
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
INFO: numShards not found on descriptor - reading it from system property
May 28, 2013 5:34:49 PM org.apache.solr.client.solrj.impl.HttpClientUtil
createClient
INFO: Creating new http client,
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
May 28, 2013 5:34:49 PM org.apache.catalina.startup.HostConfig
deployDirectory
INFO: Deploying web application directory ROOT
May 28, 2013 5:34:49 PM org.apache.coyote.http11.Http11AprProtocol start
*INFO: Starting Coyote HTTP/1.1 on http-8080  #<----------------------
http-8080 started!*
May 28, 2013 5:34:49 PM org.apache.coyote.ajp.AjpAprProtocol start
INFO: Starting Coyote AJP/1.3 on ajp-8009
May 28, 2013 5:34:49 PM org.apache.catalina.startup.Catalina start
*INFO: Server startup in 668324 ms #<------------------------------ 668 sec
= 11 minutes to start Catalina.*

Our Workaround:

* We changed our script to allow 15 seconds before resorting to kill -9.
* We also no longer restart the leader in place. We simply stop it and wait
for a new leader to be elected. This still results in a few "No registered
leader was found" exceptions, but at least the duration is short.
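The timing change amounts to a longer grace period before the hard kill. A minimal sketch of that stop logic, assuming the Solr/Tomcat pid is known (e.g. from a pidfile) -- an illustration, not our actual init script:

```python
import os
import signal
import time


def stop_gracefully(pid, grace_seconds=15):
    """Send SIGTERM, wait up to grace_seconds for exit, then SIGKILL.

    We previously allowed only 5 seconds before kill -9, which led to the
    unclean shutdown described above.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return  # already stopped
    deadline = time.time() + grace_seconds
    while time.time() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 only probes; raises if pid is gone
        except ProcessLookupError:
            return  # clean shutdown within the grace period
        time.sleep(0.5)
    os.kill(pid, signal.SIGKILL)  # last resort, as before
```

(If the target is a child of the calling process, the probe will see a zombie until it is reaped; a shell init script killing a separate daemon does not have that issue.)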

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Leader-election-deadlock-after-restarting-leader-in-4-2-1-tp4067988.html
Sent from the Solr - User mailing list archive at Nabble.com.
