[ https://issues.apache.org/jira/browse/SOLR-5615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864389#comment-13864389 ]
Ramkumar Aiyengar commented on SOLR-5615:
-----------------------------------------

Here's a log trace from an actual occurrence, which might help in understanding the scenario described in the issue.

{code}
2014-01-06 06:22:03,867 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:88] Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
// ..
2014-01-06 06:22:12,529 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:103] Connection with ZooKeeper reestablished.
// ..
2014-01-06 06:22:36,573 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:989] publishing core=collection_20131120_shard205_replica2 state=down
// ..
2014-01-06 06:28:01,479 INFO [main-EventThread] o.a.s.c.c.ZkStateReader [ZkStateReader.java:199] Updating cluster state from ZooKeeper...
2014-01-06 06:28:01,487 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:651] Register node as live in ZooKeeper:/live_nodes/host5:10750_solr
// See trace above, it directly got cluster state from ZK and successfully found the leader, so there is actually a leader at this point contrary to what it finds below
2014-01-06 06:28:01,567 INFO [main-EventThread] o.a.s.c.c.SolrZkClient [SolrZkClient.java:378] makePath: /live_nodes/host5:10750_solr
2014-01-06 06:28:01,669 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:757] Register replica - core:collection_20131120_shard241_replica2 address:http://host5:10750/solr collection:collection_20131120 shard:shard241
2014-01-06 06:28:01,669 INFO [main-EventThread] o.a.s.c.s.i.HttpClientUtil [HttpClientUtil.java:103] Creating new http client, config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry=false
// nothing much after this on main-EventThread for 20 mins..
2014-01-06 06:54:01,786 ERROR [main-EventThread] o.a.s.c.ZkController [ZkController.java:869] Error getting leader from zk
org.apache.solr.common.SolrException: No registered leader was found, collection:collection_20131120 slice:shard241
// Then goes on to the next replica ..
2014-01-06 06:54:01,786 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:757] Register replica - core:collection_20131120_shard209_replica2 address:http://host5:10750/solr collection:collection_20131120 shard:shard209
2014-01-06 06:54:01,786 INFO [main-EventThread] o.a.s.c.s.i.HttpClientUtil [HttpClientUtil.java:103] Creating new http client, config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry=false
// waits another twenty mins (by which time I ordered a shutdown, so things started erroring out sooner after that)
2014-01-06 07:19:21,656 ERROR [main-EventThread] o.a.s.c.ZkController [ZkController.java:869] Error getting leader from zk
org.apache.solr.common.SolrException: No registered leader was found, collection:collection_20131120 slice:shard209
// After trying to register all other replicas, these failed fast because we had ordered a shutdown already..
2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.DefaultConnectionStrategy [DefaultConnectionStrategy.java:48] Reconnected to ZooKeeper
2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:130] Connected:true
// And immediately, *now* it fires all the events it was waiting for!
2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:72] Watcher org.apache.solr.common.cloud.ConnectionManager@2467da0a name:ZooKeeperConnection Watcher:host1:11600,host2:11600,host3:11600 got event WatchedEvent state:Disconnected type:None path:null path:null type:None
2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.z.ClientCnxn [ClientCnxn.java:509] EventThread shut down
{code}

> Deadlock while trying to recover after a ZK session expiry
> ----------------------------------------------------------
>
>                 Key: SOLR-5615
>                 URL: https://issues.apache.org/jira/browse/SOLR-5615
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.4, 4.5, 4.6
>            Reporter: Ramkumar Aiyengar
>         Attachments: SOLR-5615.patch
>
>
> The sequence of events which might trigger this is as follows:
>  - Leader of a shard, say OL, has a ZK expiry
>  - The new leader, NL, starts the election process
>  - NL, through Overseer, clears the current leader (OL) for the shard from the cluster state
>  - OL reconnects to ZK, calls onReconnect from event thread (main-EventThread)
>  - OL marks itself down
>  - OL sets up watches for cluster state, and then retrieves it (with no leader for this shard)
>  - NL, through Overseer, updates cluster state to mark itself leader for the shard
>  - OL tries to register itself as a replica and, still on the event thread, waits until the cluster state is updated with the new leader
>  - ZK sends a watch update to OL, but it cannot be delivered because the event thread is blocked waiting for that very update.
> Oops. This only breaks out once the attempt to register as a replica times out after 20 mins.
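
The blocking pattern in the sequence above can be reproduced outside Solr with plain java.util.concurrent. What follows is only a minimal sketch, not Solr or ZooKeeper code, and it does not claim to reflect what SOLR-5615.patch does. A single-threaded executor stands in for main-EventThread, a CountDownLatch stands in for "cluster state now shows a leader", and every name (EventThreadDeadlockSketch, registerOnEventThread, registerOffEventThread, leaderSeen) is hypothetical. The timeouts are milliseconds here rather than 20 minutes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the scenario; none of these names exist in Solr.
final class EventThreadDeadlockSketch {

    /** Waits for the "leader" on the event thread itself; returns true if it was seen in time. */
    public static boolean registerOnEventThread(long timeoutMs) {
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        CountDownLatch leaderSeen = new CountDownLatch(1);
        Callable<Boolean> waitForLeader =
                () -> leaderSeen.await(timeoutMs, TimeUnit.MILLISECONDS);
        try {
            // The reconnect callback runs on the event thread and blocks there.
            Future<Boolean> registration = eventThread.submit(waitForLeader);
            // The watch update is queued behind it on the same single thread,
            // so it cannot run until the wait above has given up.
            eventThread.execute(leaderSeen::countDown);
            return registration.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            eventThread.shutdownNow();
        }
    }

    /** Hands the blocking wait to a second thread so the event thread stays free. */
    public static boolean registerOffEventThread(long timeoutMs) {
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch leaderSeen = new CountDownLatch(1);
        Callable<Boolean> waitForLeader =
                () -> leaderSeen.await(timeoutMs, TimeUnit.MILLISECONDS);
        // The callback only schedules the wait elsewhere and returns at once.
        Callable<Future<Boolean>> onReconnect = () -> worker.submit(waitForLeader);
        try {
            Future<Future<Boolean>> handoff = eventThread.submit(onReconnect);
            // The event thread is free again, so the watch update gets delivered.
            eventThread.execute(leaderSeen::countDown);
            return handoff.get().get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            eventThread.shutdownNow();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("leader seen when waiting on event thread:  "
                + registerOnEventThread(200));
        System.out.println("leader seen when waiting off event thread: "
                + registerOffEventThread(2000));
    }
}
```

In this sketch the first variant can only end by timing out, which mirrors the 20 minute gaps in the trace, while the second sees the update almost immediately. Moving the wait off the event thread is just one way to break the cycle and is shown here as an assumption for illustration.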