[
https://issues.apache.org/jira/browse/SOLR-5615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864389#comment-13864389
]
Ramkumar Aiyengar commented on SOLR-5615:
-----------------------------------------
Here's a log trace from an actual occurrence, which might help in
understanding the scenario described in the issue.
{code}
2014-01-06 06:22:03,867 INFO [main-EventThread] o.a.s.c.c.ConnectionManager
[ConnectionManager.java:88] Our previous ZooKeeper session was expired.
Attempting to reconnect to recover relationship with ZooKeeper...
// ..
2014-01-06 06:22:12,529 INFO [main-EventThread] o.a.s.c.c.ConnectionManager
[ConnectionManager.java:103] Connection with ZooKeeper reestablished.
// ..
2014-01-06 06:22:36,573 INFO [main-EventThread] o.a.s.c.ZkController
[ZkController.java:989] publishing core=collection_20131120_shard205_replica2
state=down
// ..
2014-01-06 06:28:01,479 INFO [main-EventThread] o.a.s.c.c.ZkStateReader
[ZkStateReader.java:199] Updating cluster state from ZooKeeper...
2014-01-06 06:28:01,487 INFO [main-EventThread] o.a.s.c.ZkController
[ZkController.java:651] Register node as live in
ZooKeeper:/live_nodes/host5:10750_solr
// See the trace above: it got the cluster state directly from ZK and
successfully found the leader, so there is actually a leader at this point,
contrary to what it finds below
2014-01-06 06:28:01,567 INFO [main-EventThread] o.a.s.c.c.SolrZkClient
[SolrZkClient.java:378] makePath: /live_nodes/host5:10750_solr
2014-01-06 06:28:01,669 INFO [main-EventThread] o.a.s.c.ZkController
[ZkController.java:757] Register replica -
core:collection_20131120_shard241_replica2 address:http://host5:10750/solr
collection:collection_20131120 shard:shard241
2014-01-06 06:28:01,669 INFO [main-EventThread] o.a.s.c.s.i.HttpClientUtil
[HttpClientUtil.java:103] Creating new http client,
config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry=false
// nothing much after this on main-EventThread for 20 mins..
2014-01-06 06:54:01,786 ERROR [main-EventThread] o.a.s.c.ZkController
[ZkController.java:869] Error getting leader from zk
org.apache.solr.common.SolrException: No registered leader was found,
collection:collection_20131120 slice:shard241
// Then goes on to the next replica ..
2014-01-06 06:54:01,786 INFO [main-EventThread] o.a.s.c.ZkController
[ZkController.java:757] Register replica -
core:collection_20131120_shard209_replica2 address:http://host5:10750/solr
collection:collection_20131120 shard:shard209
2014-01-06 06:54:01,786 INFO [main-EventThread] o.a.s.c.s.i.HttpClientUtil
[HttpClientUtil.java:103] Creating new http client,
config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry=false
// Waits another twenty mins (by which time I had ordered a shutdown, so things
started erroring out sooner after that)
2014-01-06 07:19:21,656 ERROR [main-EventThread] o.a.s.c.ZkController
[ZkController.java:869] Error getting leader from zk
org.apache.solr.common.SolrException: No registered leader was found,
collection:collection_20131120 slice:shard209
// It then tried to register all the other replicas; these failed fast because
we had already ordered a shutdown.
2014-01-06 07:19:21,693 INFO [main-EventThread]
o.a.s.c.c.DefaultConnectionStrategy [DefaultConnectionStrategy.java:48]
Reconnected to ZooKeeper
2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.ConnectionManager
[ConnectionManager.java:130] Connected:true
// And immediately, *now* it fires all the events it was waiting for!
2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.ConnectionManager
[ConnectionManager.java:72] Watcher
org.apache.solr.common.cloud.ConnectionManager@2467da0a
name:ZooKeeperConnection Watcher:host1:11600,host2:11600,host3:11600 got event
WatchedEvent state:Disconnected type:None path:null path:null type:None
2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.z.ClientCnxn
[ClientCnxn.java:509] EventThread shut down
{code}
> Deadlock while trying to recover after a ZK session expiry
> ----------------------------------------------------------
>
> Key: SOLR-5615
> URL: https://issues.apache.org/jira/browse/SOLR-5615
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.4, 4.5, 4.6
> Reporter: Ramkumar Aiyengar
> Attachments: SOLR-5615.patch
>
>
> The sequence of events which might trigger this is as follows:
> - Leader of a shard, say OL, has a ZK expiry
> - The new leader, NL, starts the election process
> - NL, through Overseer, clears the current leader (OL) for the shard from
> the cluster state
> - OL reconnects to ZK, calls onReconnect from event thread (main-EventThread)
> - OL marks itself down
> - OL sets up watches for cluster state, and then retrieves it (with no
> leader for this shard)
> - NL, through Overseer, updates cluster state to mark itself leader for the
> shard
> - OL tries to register itself as a replica, and waits from the event thread
> till the cluster state is updated with the new leader
> - ZK sends a watch update to OL, but it cannot be delivered because the event
> thread is blocked waiting for that very update.
> Oops. The wait finally breaks out when the attempt to register as a replica
> times out after 20 mins.
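The blocking pattern in the sequence above can be reproduced outside Solr with a single-threaded executor standing in for main-EventThread. This is only a minimal sketch under stated assumptions: the class and method names are hypothetical, a CountDownLatch stands in for "cluster state now has a leader", and a 500 ms wait stands in for the 20 minute timeout. It uses no Solr or ZooKeeper APIs, and the hand-off to a worker thread is just one possible way around the problem, not necessarily what the attached patch does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class EventThreadDeadlockSketch {

    /**
     * Runs a hypothetical reconnect handler and reports whether it saw the
     * leader update before its wait timed out.
     *
     * @param onEventThread true to run the handler on the event thread itself
     *                      (the pattern described in the issue), false to hand
     *                      it off to a separate worker thread
     */
    public static boolean handlerSeesLeader(boolean onEventThread) {
        // Stand-in for main-EventThread: one thread delivers all events in order.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        // Stand-in for "cluster state has been updated with the new leader".
        CountDownLatch leaderPublished = new CountDownLatch(1);
        try {
            ExecutorService handlerThread = onEventThread ? eventThread : worker;
            Future<Boolean> handler = handlerThread.submit(() -> {
                // The watch notification is queued on the event thread, behind
                // whatever that thread is currently running.
                eventThread.execute(leaderPublished::countDown);
                // Wait for the leader; 500 ms stands in for the 20 minute timeout.
                return leaderPublished.await(500, TimeUnit.MILLISECONDS);
            });
            return handler.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            eventThread.shutdownNow();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("handler on event thread sees leader: "
                + handlerSeesLeader(true));
        System.out.println("handler on worker thread sees leader: "
                + handlerSeesLeader(false));
    }
}
```

When the handler runs on the event thread, the notification sits in the queue behind the handler that is waiting for it, so the wait can only end by timing out. When the handler runs elsewhere, the event thread stays free to deliver the notification.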