Dan Benediktson created ZOOKEEPER-2471:
------------------------------------------

             Summary: Java Zookeeper Client incorrectly considers time spent 
sleeping as time spent connecting, potentially resulting in infinite reconnect 
loop
                 Key: ZOOKEEPER-2471
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2471
             Project: ZooKeeper
          Issue Type: Bug
          Components: java client
    Affects Versions: 3.5.1
         Environment: all
            Reporter: Dan Benediktson


ClientCnxnSocket uses a member variable "now" to track the current time, and 
lastSend / lastHeard variables to track socket liveness. Implementations, and 
even ClientCnxn itself, are expected to call both updateNow() to reset "now" to 
System.currentTimeMillis, and then call updateLastSend()/updateLastHeard() on 
IO completions.

This is a fragile contract, so it's not surprising that there's a bug resulting 
from it: ClientCnxn.SendThread.run() calls updateLastSendAndHeard() as soon as 
startConnect() returns, but it does not call updateNow() first. I expect when 
this was written, either the expectation was that startConnect() was an 
asynchronous operation and that updateNow() would have been called very 
recently, or simply the requirement to call updateNow() was forgotten at this 
point. As far as I can see, this bug has been present since the "updateNow" 
method was first introduced in the distant past. As it turns out, since 
startConnect() calls HostProvider.next(), which can sleep, quite a lot of time 
can pass, leaving a big gap between "now" and now.

If you are using very short session timeouts (one of our ZK ensembles has many 
clients using a 1-second timeout), this is potentially disastrous, because the 
sleep time may exceed the connection timeout itself, which can potentially 
result in the Java client being stuck in a perpetual reconnect loop. The exact 
code path it goes through in this case is complicated, because there has to be 
a previously-closed socket still waiting in the selector (otherwise, the first 
timeout evaluation will not fail because "now" still hasn't been updated, and 
then the actual connect timeout will be applied in 
ClientCnxnSocket.doTransport()) so that select() will harvest the IO from the 
previous socket and updateNow(), resulting in the next loop through 
ClientCnxnSocket.SendThread.run() observing the spurious timeout and failing. 
In practice it does happen to us fairly frequently; we only got to the bottom 
of the bug yesterday. Worse, when it does happen, the Zookeeper client object 
is rendered unusable: it's stuck in a perpetual reconnect loop where it keeps 
sleeping, opening a socket, and immediately closing it.

I have a patch. Rather than calling updateNow() right after startConnect(), my 
fix is to remove the "now" member variable and the updateNow() method entirely, 
and to instead just call System.currentTimeMillis() whenever time needs to be 
evaluated. I realize there is a benefit (aside from a trivial 
micro-optimization not worth worrying about) to having the time be "fixed", 
particularly for truth in the logging: if time is fixed by an updateNow() call, 
then the log for a timeout will still show exactly the same value the code 
reasoned about. However, this benefit is in my opinion not enough to merit the 
fragility of the contract which led to this (for us) highly impactful and 
difficult-to-find bug in the first place.

I'm currently running ant tests locally against my patch on trunk, and then 
I'll upload it here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to