Thank you for the clarification. What you said seems consistent with what I
saw in the code. However, I am still confused by your conclusion. My reading
of the code is that it is entirely possible for a client whose session has
expired to reconnect to another server and still see the ephemeral node. The
reason lies in the code I pasted below. I am not sure we are on the same page,
as you seem to suggest that this is not possible. Let me elaborate a bit on
how this can happen.

1.  Client A connects to the leader with session S.
2.  Session S expires on the leader, which, according to the code below, marks
the session as closing.
3.  The leader then sends a closeSession request to its own first processor,
which goes through the usual pipeline as you mentioned.
4.  Client A sends a request, which goes through the checkSession code I pasted
below, and it should get a SessionExpiredException.
5.  The client now knows its session has expired, but for some reason it
connects to another server B and issues a read.

Now there is a race between the following two chains of operations:

1. The closeSession request needs to go through the quorum and get a majority,
and server B needs to receive the commit (or inform) request and actually kill
the session in the final request processor.

2. The read operation goes through the learner ZK processor chain in memory
(as far as I can see, there is no session check on reads at all, so it won’t
know until it reads the tree).

The first chain can get stuck in various places, for example when some quorum
nodes are stuck on other proposals, so it looks to me that the second chain is
very likely to win, as it does not need to go through any quorum operation. The
key requirement is that the client reconnects fast (or in parallel, as the
original post seems to indicate). I think I can definitely simulate this with a
test, but it would be tricky to make it pass/fail deterministically, so I
didn’t try.
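
The race described above can be sketched with a toy in-memory model. This is
not ZooKeeper code: the class names, the method names, and the path
"/demo/ephemeral" are all made up for illustration, and the "pending commit"
queue is only a stand-in for a closeSession transaction that has not yet been
applied on server B.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Toy model (hypothetical, not ZooKeeper code): a follower whose data tree lags behind. */
class LaggingFollower {
    private final Set<String> ephemeralNodes = new HashSet<>();
    // Transactions the quorum has agreed on but this server has not applied yet.
    private final Queue<Runnable> pendingCommits = new ArrayDeque<>();

    void createEphemeral(String path) {
        ephemeralNodes.add(path);
    }

    /** Queues the effect of a closeSession commit instead of applying it at once. */
    void enqueueCloseSessionCommit(String path) {
        pendingCommits.add(() -> ephemeralNodes.remove(path));
    }

    /** Applies everything queued so far, as the final processor eventually would. */
    void applyPendingCommits() {
        while (!pendingCommits.isEmpty()) {
            pendingCommits.poll().run();
        }
    }

    /** Local read served from memory, with no quorum round trip. */
    boolean exists(String path) {
        return ephemeralNodes.contains(path);
    }
}

class StaleReadRace {
    /** Returns {seen before the commit is applied, seen after it is applied}. */
    static boolean[] run() {
        LaggingFollower serverB = new LaggingFollower();
        serverB.createEphemeral("/demo/ephemeral");
        // Session S has expired on the leader; B has not applied the close yet.
        serverB.enqueueCloseSessionCommit("/demo/ephemeral");
        boolean seenBeforeApply = serverB.exists("/demo/ephemeral"); // chain 2 wins
        serverB.applyPendingCommits();                               // chain 1 finishes
        boolean seenAfterApply = serverB.exists("/demo/ephemeral");
        return new boolean[] {seenBeforeApply, seenAfterApply};
    }

    public static void main(String[] args) {
        boolean[] seen = run();
        System.out.println("before apply: " + seen[0] + ", after apply: " + seen[1]);
    }
}
```

The model makes the ordering deterministic on purpose; in a real ensemble the
window depends on timing, which is why a real test would be flaky.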

Am I missing something?


Hi Ryan,

I am not sure what confused you about the session cleanup code. Here is my
understanding; I hope it helps.

* Session cleanup starts by marking the state of the session as closing, as
you noticed. This is because each session cleanup takes a while, so we need to
make sure that during the cleanup the server does not continue processing
requests from the client that belong to this session.

* Once the session is marked as closing, we send a request so that the closing
of the session applies not only to the leader but also to the quorum servers.
This request goes through the normal request processing pipeline, just like
all other requests.

* A valid session is a prerequisite for any of the client operations (including 
read operations), so the liveness of the session is validated before processing 
a read operation.

I am a bit confused by the code

Does ZK guarantee that ephemeral nodes from a client are removed on the
server by the time the client receives a session expiration event?

"The server" is ambiguous, as a ZooKeeper ensemble is composed of multiple
servers :).

Therefore, it seems to be possible for a client to connect to another
server to see the node there.

This seems to be the only case I can think of that leads to an inconsistent
view from the client side. I'll elaborate as follows, starting with the
ZooKeeper guarantees that are relevant to this case:

* The ZooKeeper quorum should have already committed the transaction that
closes the session by the time a client receives the session expiration event.

Here is the code that throws KeeperException.SessionExpiredException:

public synchronized void checkSession(long sessionId, Object owner)
        throws KeeperException.SessionExpiredException,
        KeeperException.UnknownSessionException {
    if (session.isClosing()) {
        throw new KeeperException.SessionExpiredException();

Here is the code that marks the session as closing directly:

synchronized public void setSessionClosing(long sessionId) {
    if (LOG.isTraceEnabled()) {
        LOG.trace("Session closing: 0x" + Long.toHexString(sessionId));
    }
    SessionImpl s = sessionsById.get(sessionId);
    if (s == null) {
        // Unknown session: nothing to mark.
        return;
    }
    s.isClosing = true;
}

and here is the code that calls the above:
public void runImpl() throws InterruptedException {

    while (running) {

        for (SessionImpl s : sessionExpiryQueue.poll()) {

The expire function looks like this:

public void expire(Session session) {
    long sessionId = session.getSessionId();


and the close function is here:

private void close(long sessionId) {
    Request si = new Request(null, sessionId, 0, OpCode.closeSession, null, 

So it looks to me that the session is marked as closing first and then the
closeSession request is sent. This happens only on the leader, though, because
checkSession is only called on the leader, but there it is called even for
read operations.

Am I missing something?
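
To make the ordering concrete, here is a minimal sketch of the "mark closing
first, submit closeSession second" sequence. It is a simplified stand-in, not
the actual tracker: the class, its exception types, and the session id 42 are
hypothetical, and the quorum pipeline is reduced to a boolean flag.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for a leader-side session tracker; all names are hypothetical. */
class MiniSessionTracker {
    static class SessionExpired extends Exception {}
    static class UnknownSession extends Exception {}

    // sessionId -> "is closing" flag
    private final Map<Long, Boolean> closingById = new HashMap<>();
    private boolean closeSessionSubmitted = false;

    synchronized void addSession(long sessionId) {
        closingById.put(sessionId, false);
    }

    /** Step 1: flip the in-memory flag. No quorum round trip is involved. */
    synchronized void setSessionClosing(long sessionId) {
        if (closingById.containsKey(sessionId)) {
            closingById.put(sessionId, true);
        }
    }

    /** Step 2: hand a closeSession request to the pipeline (only recorded here). */
    synchronized void submitCloseSession(long sessionId) {
        closeSessionSubmitted = true;
    }

    synchronized boolean isCloseSessionSubmitted() {
        return closeSessionSubmitted;
    }

    /** Rejects requests for sessions that are unknown or already marked closing. */
    synchronized void checkSession(long sessionId) throws SessionExpired, UnknownSession {
        Boolean closing = closingById.get(sessionId);
        if (closing == null) {
            throw new UnknownSession();
        }
        if (closing) {
            throw new SessionExpired();
        }
    }

    /** True if the client is told "expired" before closeSession was even submitted. */
    static boolean expiredBeforeCloseSubmitted() {
        MiniSessionTracker tracker = new MiniSessionTracker();
        tracker.addSession(42L);
        tracker.setSessionClosing(42L);
        try {
            tracker.checkSession(42L);
            return false;
        } catch (SessionExpired e) {
            return !tracker.isCloseSessionSubmitted();
        } catch (UnknownSession e) {
            return false;
        }
    }
}
```

In this model the client can learn about the expiration strictly before any
closeSession transaction exists, which is the window the question is about.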

* Cleanup of the ephemeral nodes associated with the session is part of the
session-closing transaction, so on the quorum of servers that have already
committed the transaction, the ephemeral nodes should already be gone.

* ZooKeeper quorum would not have processed the new session establishment
request for the same client, until after the closing session request has
been processed because transactions are ordered across quorum.

Given these guarantees, if a client establishes a new session by connecting
to a server that was part of the quorum of servers that committed the
session-closing transaction, then the client should not see the old ephemeral
node once the new session is established.
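
The ordering argument above can be illustrated with a toy replicated log. This
is a sketch under simplifying assumptions, not ZooKeeper's implementation: the
Txn class, the operation names, the session ids, and the path are all made up,
and each server is assumed to apply a prefix of one totally ordered log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical transaction record for the toy log. */
class Txn {
    final String op;      // "createSession", "createEphemeral" or "closeSession"
    final long sessionId;
    final String path;    // only used by createEphemeral
    Txn(String op, long sessionId, String path) {
        this.op = op;
        this.sessionId = sessionId;
        this.path = path;
    }
}

/** Toy server that applies a prefix of the shared log, strictly in order. */
class OrderedLogServer {
    private final Map<Long, Set<String>> ephemeralsBySession = new HashMap<>();
    private int applied = 0;

    void applyUpTo(List<Txn> log, int count) {
        while (applied < count && applied < log.size()) {
            Txn t = log.get(applied++);
            if (t.op.equals("createSession")) {
                ephemeralsBySession.put(t.sessionId, new HashSet<>());
            } else if (t.op.equals("createEphemeral")) {
                ephemeralsBySession.get(t.sessionId).add(t.path);
            } else if (t.op.equals("closeSession")) {
                // Closing the session drops its ephemeral nodes in the same step.
                ephemeralsBySession.remove(t.sessionId);
            }
        }
    }

    boolean hasSession(long id) {
        return ephemeralsBySession.containsKey(id);
    }

    boolean exists(String path) {
        for (Set<String> paths : ephemeralsBySession.values()) {
            if (paths.contains(path)) {
                return true;
            }
        }
        return false;
    }
}

class OrderingDemo {
    /** True if no log prefix yields a server that knows session 2 yet still holds session 1's node. */
    static boolean newSessionNeverSeesOldNode() {
        List<Txn> log = new ArrayList<>();
        log.add(new Txn("createSession", 1L, null));
        log.add(new Txn("createEphemeral", 1L, "/demo/ephemeral"));
        log.add(new Txn("closeSession", 1L, null));
        log.add(new Txn("createSession", 2L, null));
        for (int prefix = 0; prefix <= log.size(); prefix++) {
            OrderedLogServer server = new OrderedLogServer();
            server.applyUpTo(log, prefix);
            if (server.hasSession(2L) && server.exists("/demo/ephemeral")) {
                return false;
            }
        }
        return true;
    }
}
```

The check only covers servers that have applied the new session's creation; a
server that has applied a shorter prefix can still hold the old node.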

ZooKeeper does not guarantee that a write transaction occurs synchronously
across all of the servers, since a write request only requires a quorum of
servers to acknowledge it. As a result, it is valid for some servers to lag
behind the state of the quorum. I suspect this case is possible:

* The client receives the session expiration event and closes its connection
to server A.

* The client reconnects to server B, which lags behind the quorum and does not
yet contain the changes to the data tree regarding the ephemeral nodes.

* The client sees the ephemeral node, so it does nothing. Later the node is
cleaned up when server B syncs with the quorum.

The client can ensure it always sees the quorum's state of truth by issuing
a sync() request before issuing a read request. A sync request forces the
server the client is connected to into sync with the quorum. If Kafka does
this, will the bug go away? Of course, retrying the creation of the ephemeral
node can also solve the problem (there are possibly other solutions as well,
such as having the client do some bookkeeping to differentiate versions of
ephemeral nodes).
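
The sync-before-read idea can be sketched as follows. This is a self-contained
toy, not client code: EnsembleView and FakeLaggingServer are hypothetical
types, and sync() is modeled as a blocking call, whereas the real Java
client's sync is asynchronous and takes a path, a callback, and a context
object.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, minimal view of a server as seen by one client. */
interface EnsembleView {
    void sync();                 // catch up with everything committed so far
    boolean exists(String path); // local read, may be stale
}

/** In-memory fake of a server that applies committed deletes only when it catches up. */
class FakeLaggingServer implements EnsembleView {
    private final Set<String> localTree = new HashSet<>();
    private final List<String> committedDeletes = new ArrayList<>();

    void create(String path) {
        localTree.add(path);
    }

    /** The quorum has committed the delete, but this server has not applied it yet. */
    void commitDeleteOnQuorum(String path) {
        committedDeletes.add(path);
    }

    @Override
    public void sync() {
        localTree.removeAll(committedDeletes);
        committedDeletes.clear();
    }

    @Override
    public boolean exists(String path) {
        return localTree.contains(path);
    }
}

class SyncThenRead {
    /** Reads only after the server has caught up with the committed state. */
    static boolean existsAfterSync(EnsembleView view, String path) {
        view.sync();
        return view.exists(path);
    }
}
```

A plain exists() on the lagging fake still reports the node, while
existsAfterSync() does not. With the real client, the read would have to be
issued after the sync on the same connection for this to hold.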

Good question. AFAIK, it’s not the case.

The server will throw a SessionExpiredException during the checkSession call
as soon as the session is marked as isClosing. However, session expiration
actually requires a transaction (of type OpCode.closeSession), which is sent
to the leader and goes through the quorum. The session and ephemeral node are
only removed after the transaction is committed and processed in the final
processor on the other nodes. Therefore, it seems to be possible for a client
to connect to another server and see the node there. I am not entirely sure
whether it can use the same session id, though; it seems possible, as the
session close is based only on the session expiration time and there can be
delays in session pings.

Does ZK guarantee that ephemeral nodes from a client are removed on the
server by the time the client receives a session expiration event? I am
getting conflicting info on this (
https://issues.apache.org/jira/browse/KAFKA-4277). Could someone clarify?





