In reality, if you cannot connect to ZK (and ConnectionLoss is a client-side error), it means either network issues on the client node itself or issues with the ZK quorum. In those situations, unless you eventually receive "Session Expiration" or "Connection reestablished", you don't know what is going on. It would probably be prudent to time out if, after ConnectionLoss, you hear nothing back from the ZK server for longer than the ZK client timeout (30 sec. by default, I think).

And again, it will need to depend on the client: in your example it is a good idea to fail; in other cases it may be a good idea to wait (e.g. if you deal with non-idempotent operations).

From: Hsuan Yi Chu <hyi...@maprtech.com>
To: dev@drill.apache.org
Sent: Sunday, November 8, 2015 9:36 AM
Subject: Re: Zookeeper down before query starts/after query finishes

I just submitted a pull request to address DRILL-3751, which focuses on the scenario where the query has already finished and Zookeeper dies, so Foreman cannot delete the profiles of running queries in Zookeeper.
I think in this case, after a few retries, Foreman can assume Zookeeper is down. The query is then assumed to have failed, since the client might not be able to receive the result (see the behavior in DRILL-3751 <https://issues.apache.org/jira/browse/DRILL-3751>).

Does this make sense?

On Fri, Nov 6, 2015 at 10:43 AM, Hsuan Yi Chu <hyi...@maprtech.com> wrote:

> My understanding is:
> Before query starts/After query finishes, Foreman will put/delete running
> query profiles in zookeeper.
>
> However, if zookeeper is down before the put/delete is successful, Drill
> would be blocked at the put/delete operation.
>
> See https://issues.apache.org/jira/browse/DRILL-3751
>
> I think it is not quite right to let Drill just wait for Zookeeper to
> respond. Does it make sense to use "time-out" here?
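
The "time-out after a few retries" idea discussed in this thread could look roughly like the sketch below. It is only an illustration in plain Java, not the code from the DRILL-3751 pull request: the class `ZkTimeoutSketch`, the method `tryWithTimeout`, and the `Outcome` enum are hypothetical names, and the blocking put/delete against Zookeeper is stood in for by an arbitrary `Callable`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of Drill, ZooKeeper, or Curator.
class ZkTimeoutSketch {

  enum Outcome { SUCCEEDED, GAVE_UP }

  // Runs a possibly blocking store operation, waiting at most timeoutMillis
  // per attempt and giving up after maxAttempts attempts.
  static Outcome tryWithTimeout(Callable<Void> op, int maxAttempts, long timeoutMillis) {
    // Daemon thread, so a call that never returns cannot keep the JVM alive.
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "zk-op-sketch");
      t.setDaemon(true);
      return t;
    });
    try {
      for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        Future<Void> pending = executor.submit(op);
        try {
          pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
          return Outcome.SUCCEEDED;
        } catch (TimeoutException e) {
          // No answer in time: interrupt the blocked call, then retry.
          pending.cancel(true);
        } catch (ExecutionException e) {
          // The operation itself failed (e.g. a connection error): retry.
        } catch (InterruptedException e) {
          // The caller was interrupted: stop waiting and preserve the flag.
          pending.cancel(true);
          Thread.currentThread().interrupt();
          return Outcome.GAVE_UP;
        }
      }
      return Outcome.GAVE_UP;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // A store that answers immediately.
    Outcome healthy = tryWithTimeout(() -> null, 3, 200);
    // A store that never answers within the time-out (simulated with sleep).
    Outcome down = tryWithTimeout(() -> { Thread.sleep(60_000); return null; }, 2, 100);
    System.out.println("healthy store: " + healthy);
    System.out.println("unreachable store: " + down);
  }
}
```

Note that `GAVE_UP` only means no answer arrived in time. It does not say whether the operation was applied on the server, which is why waiting may be the better choice for non-idempotent operations, as pointed out earlier in the thread.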