[ https://issues.apache.org/jira/browse/ZOOKEEPER-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15598040#comment-15598040 ]
Flavio Junqueira commented on ZOOKEEPER-2619: --------------------------------------------- Thanks for starting this discussion, [~ongardie]. Some of the issues you're describing, we only had the opportunity to discuss in more detail in the book. It is really not self-promotion, and it is where we had a chance to talk some more about some of the subtleties we have. Let me first acknowledge that it is possible that we submit asynchronously operations OP1 and OP2, such that OP2 is executed and OP1 isn't. This can happen in the case that the connection is lost while the client is processing OP1, so the callback is set to deliver the connection loss event, and OP2 is submitted concurrently. ZooKeeper in this scenario makes two important guarantees: 1) it notifies the application that there has been an error for OP1; 2) it doesn't reorder the execution of OP1 and OP2. If they both execute, it will follow the order of submission. If you're building an application, then you have two choices with the current API. You can be optimistic and submit asynchronous operations in a tight loop. In the case you fall in the situation I describe above, then you might need to execute some recovery procedure. If as an application designer you don't want or can't execute such a procedure, then you'll have to execute one at a time. In your first example, you're mixing the synchronous and the asynchronous API. We don't actually enforce that an application uses one or the other, but we do not recommend to mix synchronous and asynchronous calls precisely because of the reasons you're raising. About your proposal, it is fairly extreme, so I'm -1 as is. However, there is one thing we have discussed at some point, but never really got to do is to fail all operations upon a connection loss and only re-enable upon an explicit indication from the application, like a {{reenableOps}} call. It achieves the same effect of dropping a suffix of operations and does not require all the changes you're proposing. It should be a fairly simple change to make and it is compatible with the current API. > Client library reconnecting breaks FIFO client order > ---------------------------------------------------- > > Key: ZOOKEEPER-2619 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2619 > Project: ZooKeeper > Issue Type: Bug > Reporter: Diego Ongaro > > According to the USENIX ATC 2010 > [paper|https://www.usenix.org/conference/usenix-atc-10/zookeeper-wait-free-coordination-internet-scale-systems], > ZooKeeper provides "FIFO client order: all requests from a given client are > executed in the order that they were sent by the client." I believe > applications written using the Java client library are unable to rely on this > guarantee, and any current application that does so is broken. Other client > libraries are also likely to be affected. > Consider this application, which is simplified from the algorithm described > on Page 4 (right column) of the paper: > {code} > zk = new ZooKeeper(...) > zk.createAsync("/data-23857", "...", callback) > zk.createSync("/pointer", "/data-23857") > {code} > Assume an empty ZooKeeper database to begin with and no other writers. > Applying the above definition, if the ZooKeeper database contains /pointer, > it must also contain /data-23857. > Now consider this series of unfortunate events: > {code} > zk = new ZooKeeper(...) > // The library establishes a TCP connection. > zk.createAsync("/data-23857", "...", callback) > // The library/kernel closes the TCP connection because it times out, and > // the create of /data-23857 is doomed to fail with ConnectionLoss. Suppose > // that it never reaches the server. > // The library establishes a new TCP connection. > zk.createSync("/pointer", "/data-23857") > // The create of /pointer succeeds. > {code} > That's the problem: subsequent operations get assigned to the new connection > and succeed, while earlier operations fail. > In general, I believe it's impossible to have a system with the following > three properties: > # FIFO client order for asynchronous operations, > # Failing operations when connections are lost, AND > # Transparently reconnecting when connections are lost. > To argue this, consider an application that issues a series of pipelined > operations, then upon noticing a connection loss, issues a series of recovery > operations, repeating the recovery procedure as necessary. If a pipelined > operation fails, all subsequent operations in the pipeline must also fail. > Yet the client must also carry on eventually: the recovery operations cannot > be trivially failed forever. Unfortunately, the client library does not know > where the pipelined operations end and the recovery operations begin. At the > time of a connection loss, subsequent pipelined operations may or may not be > queued in the library; others might be upcoming in the application thread. If > the library re-establishes a connection too early, it will send pipelined > operations out of FIFO client order. > I considered a possible workaround of having the client diligently check its > callbacks and watchers for connection loss events, and do its best to stop > the subsequent pipelined operations at the first sign of a connection loss. > In addition to being a large burden for the application, this does not solve > the problem all the time. In particular, if the callback thread is delayed > significantly (as can happen due to excessive computation or scheduling > hiccups), the application may not learn about the connection loss event until > after the connection has been re-established and after dependent pipelined > operations have already been transmitted over the new connection. > I suggest the following API changes to fix the problem: > - Add a method ZooKeeper.getConnection() returning a ZKConnection object. > ZKConnection would wrap a TCP connection. It would include all synchronous > and asynchronous operations currently defined on the ZooKeeper class. Upon a > connection loss on a ZKConnection, all subsequent operations on the same > ZKConnection would return a Connection Loss error. Upon noticing, the client > would need to call ZooKeeper.getConnection() again to get a working > ZKConnection object, and it would execute its recovery procedure on this new > connection. > - Deprecate all asynchronous methods on the ZooKeeper object. These are > unsafe to use if the caller assumes they're getting FIFO client order. > - No changes to the protocols or servers are required. > I recognize this could cause a lot of code churn for both ZooKeeper and > projects that use it. On the other hand, the existing asynchronous calls in > applications should now be audited anyhow. > The code affected by this issue may be difficult to contain: > - It likely affects all ZooKeeper client libraries that provide both > asynchronous operations and transparent reconnection. That's probably all > versions of the official Java client library, as well as most other client > libraries. > - It affects all applications using those libraries that depend on the FIFO > client order of asynchronous operations. I don't know how common that is, but > the paper implies that FIFO client order is important. > - Fortunately, the issue can only manifest itself when connections are lost > and transparently reestablished. In practice, it may also require a long > pipeline or a significant delay in the application thread while the library > establishes a new connection. > - In case you're wondering, this issue occurred to me while working on a new > client library for Go. I haven't seen this issue in the wild, but I was able > to produce it locally by placing sleep statements in a Java program and > closing its TCP connections. > I'm new to this community, so I'm looking forward to the discussion. Let me > know if I can clarify any of the above. -- This message was sent by Atlassian JIRA (v6.3.4#6332)