Varun Thacker created SOLR-11590:
------------------------------------

             Summary: Synchronize ZK connect/disconnect handling
                 Key: SOLR-11590
                 URL: https://issues.apache.org/jira/browse/SOLR-11590
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Varun Thacker
            Priority: Major


Here are two disconnect/reconnect sequences taken from the logs:

{code}
1. 2017-10-31T08:34:23.106-0700 Watcher 
org.apache.solr.common.cloud.ConnectionManager@1579ca20 
name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
state:Disconnected type:None path:null path:null type:None
2. 2017-10-31T08:34:23.106-0700 zkClient has disconnected
3. 2017-10-31T08:34:23.107-0700 Watcher 
org.apache.solr.common.cloud.ConnectionManager@1579ca20 
name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
state:SyncConnected type:None path:null path:null type:None
{code}

{code}
1. 2017-10-31T08:36:46.541-0700 Watcher 
org.apache.solr.common.cloud.ConnectionManager@1579ca20 
name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
state:Disconnected type:None path:null path:null type:None
2. 2017-10-31T08:36:46.549-0700 Watcher 
org.apache.solr.common.cloud.ConnectionManager@1579ca20 
name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
state:SyncConnected type:None path:null path:null type:None
3. 2017-10-31T08:36:46.563-0700 zkClient has disconnected
{code}

In the first disconnect the sequence is: receive the disconnect event, execute 
the disconnect code, execute the connect code.
In the second disconnect the sequence is: receive the disconnect event, execute 
the connect code, execute the disconnect code.

In the second sequence of events, if the JVM hosts leader replicas then all 
updates start failing with "Cannot talk to ZooKeeper - Updates are disabled.". 
This starts exactly 27 seconds later (the ZK client timeout is 30s, and 90% of 
30s is 27s, the point at which the code considers the session likely expired). 
There are no leadership changes, since there was no session expiry. Unless you 
restart the node, all updates to the system continue to fail.
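
To make the 27 second arithmetic concrete, here is a minimal single-threaded sketch of a "likely expired" check. The class and method names ({{LikelyExpiredSketch}}, {{onConnected}}, {{onDisconnected}}, {{isLikelyExpired}}) are hypothetical and are not Solr's actual code; the only values taken from this report are the 30s client timeout and the 90% threshold.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the connection bookkeeping; NOT Solr's actual class.
// Deliberately single-threaded: it only illustrates the state and the arithmetic.
final class LikelyExpiredSketch {
    private final long clientTimeoutMs;
    private boolean connected = true;
    private long lastDisconnectNanos;

    LikelyExpiredSketch(long clientTimeoutMs) {
        this.clientTimeoutMs = clientTimeoutMs;
    }

    // Grace period: 90% of the client timeout (27000 ms for a 30000 ms timeout).
    long graceMs() {
        return clientTimeoutMs * 9 / 10;
    }

    void onConnected() {
        connected = true;
    }

    void onDisconnected(long nowNanos) {
        connected = false;
        lastDisconnectNanos = nowNanos;
    }

    // True once we have looked disconnected for at least the grace period.
    boolean isLikelyExpired(long nowNanos) {
        if (connected) {
            return false;
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(nowNanos - lastDisconnectNanos);
        return elapsedMs >= graceMs();
    }

    public static void main(String[] args) {
        LikelyExpiredSketch state = new LikelyExpiredSketch(30_000);
        long t0 = 0L;
        // Replay the second sequence: connect code runs first, disconnect code last.
        state.onConnected();
        state.onDisconnected(t0);
        System.out.println(state.graceMs());                                           // 27000
        System.out.println(state.isLikelyExpired(t0 + TimeUnit.SECONDS.toNanos(26)));  // false
        System.out.println(state.isLikelyExpired(t0 + TimeUnit.SECONDS.toNanos(27)));  // true
    }
}
```

Because the disconnect handler runs last, nothing ever flips the state back to connected, so in this sketch the check stays true from 27 seconds onward even though the real session is healthy.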

These log lines are from Solr 5.3, where the WatchedEvent was still being 
logged at INFO.

Based on the log ordering, we process the connect code first and then the 
disconnect code, i.e. out of order. The connection is active but the flag 
marking it as connected is not set, and hence after 27 seconds {{zkCheck}} 
starts complaining that the connection is likely expired.

A related Jira is SOLR-5721

ZK gives us ordered watch events 
(https://zookeeper.apache.org/doc/r3.4.8/zookeeperProgrammers.html#sc_WatchGuarantees) 
but from what I understand Solr can still process them out of order. We 
could take a lock and synchronize {{ConnectionManager#connected}} and 
{{ConnectionManager#disconnected}}.
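
A rough sketch of what that locking could look like, with both handlers taking the same lock. {{SynchronizedConnectionStateSketch}}, {{stateLock}} and {{isConnected}} are hypothetical names for illustration; this is not the actual {{ConnectionManager}} code.

```java
// Hypothetical sketch; NOT the actual ConnectionManager implementation.
final class SynchronizedConnectionStateSketch {
    private final Object stateLock = new Object();
    private boolean connected;
    private long lastDisconnectNanos;

    // Connect handling: runs entirely under the shared lock.
    void connected() {
        synchronized (stateLock) {
            connected = true;
        }
    }

    // Disconnect handling: same lock, so it cannot interleave with connected().
    void disconnected(long nowNanos) {
        synchronized (stateLock) {
            connected = false;
            lastDisconnectNanos = nowNanos;
        }
    }

    boolean isConnected() {
        synchronized (stateLock) {
            return connected;
        }
    }

    long lastDisconnectNanos() {
        synchronized (stateLock) {
            return lastDisconnectNanos;
        }
    }
}
```

Note that a shared lock only guarantees that the two handlers never run at the same time. Whether it also guarantees they run in the order the events arrived depends on how the events are dispatched to the handlers, which this sketch does not address.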

Would that be the right approach to take?


