[ https://issues.apache.org/jira/browse/ZOOKEEPER-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979814#action_12979814 ]

Camille Fournier commented on ZOOKEEPER-922:
--------------------------------------------

Ok, here's a recap of what the problem is, what the boundaries of the problem 
are from my point of view, and what the current solution proposed above lacks. 
If the boundaries of the problem and solution are unacceptable from the POV of 
the rest of the community, then I guess we're at an impasse. So please take 
some time to read:

Problem description: When a client crashes, its ephemeral nodes are not 
removed by the leader until the negotiated session timeout has passed.

We would like for clients that have crashed to have their ephemeral nodes 
removed quickly. The duration of visibility of "stale" ephemeral nodes (that 
is, those that were created by now-dead clients) is directly correlated to the 
window of time in which the system is in an incorrect state, for the purposes 
of our use case (dynamic discovery).

Without changing the code at all, we could simulate this by lowering the 
session timeout for all clients. However, that would open the door to a 
different kind of inconsistent system state. For one, clients that do long 
stop-the-world garbage collection would have their sessions time out despite 
the fact that they are actually still alive (a very real likelihood in our 
working environment). For another, if one of the ensemble members dies and 
clients have to fail over, a very short session timeout could result in 
prematurely killed sessions for otherwise live clients. We would like to be 
able to detect likely cases of client crash and clean up those sessions 
quickly, while keeping a longer session timeout for clients we believe to be 
connected. 

We are willing to tolerate both a small number of false positives (believing 
clients crashed when they are alive) and a small number of false negatives 
(believing clients alive, and waiting for the full session timeout before 
removing them, when they have crashed). Given the nature of systems and 
networks, it is impossible to tell 100% of the time whether a client is truly 
alive or dead (a switch could crash, the client could GC, etc.), and the 
occasional missed guess is acceptable so long as the system otherwise retains 
its general coherence and correctness guarantees.

Any solution to this problem must retain the ability for client sessions to 
migrate between ensemble members in the case where the client sees a 
disconnection from the ZK cluster due to an ensemble member crashing. 

Current system fundamentals:

The only way that a server can "see" a client crash is through an error that 
causes the socket to close and throw an exception (NIOServerCnxn:doIO). If a 
client crashes without this socket closing (say, by having the network cable to 
that server pulled), the server will not see a socket close and the session 
will have to time out normally. This is an acceptable edge condition from our 
point of view.
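
To make that concrete, here is a minimal sketch (not ZooKeeper code; the 
class and method names are invented for illustration) of the only crash 
signal a server actually gets, namely a read that throws or reports 
end-of-stream:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    // Illustrative only: the server's sole "crash" signal is a failed read.
    public class CrashSignalSketch {
        static boolean peerLooksDead(SocketChannel channel, ByteBuffer buf) {
            try {
                // read() returning -1 means the peer closed the connection;
                // an IOException typically means an abortive close (reset).
                // A pulled network cable produces neither, so the server
                // learns nothing until the session times out normally.
                return channel.read(buf) < 0;
            } catch (IOException e) {
                return true;
            }
        }
    }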

Additionally, it is possible that a server will "see" a client crash when in 
fact the socket was closed unexpectedly on both ends, due to a scenario like a 
network switch failure. This would result in a false positive crash detection 
by the zk server, and possibly result in the client's session being timed out 
before the client has a chance to fail over to a different server. This is also 
an acceptable edge condition from our point of view.

The session timeouts are controlled by the SessionTracker, which is maintained 
by the current leader. That tracker table is updated every time the leader 
receives a record of pings from its followers. Sessions are associated with an 
"owner", the ensemble member currently thought to be maintaining the session; 
however, the owner is not checked in the case of a ping. 
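
For context, the follower-side half of this looks roughly like the sketch 
below, assuming my reading of the learner-side tracker is right; the names 
here are simplified, not the real ones. Touches are batched locally and 
shipped to the leader with the next ping:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Simplified model of a follower's session touch table.
    class FollowerTouchTableSketch {
        private volatile Map<Long, Integer> touchTable =
                new ConcurrentHashMap<Long, Integer>();

        // Record that a session was heard from, with the timeout to report.
        void touchSession(long sessionId, int timeoutMs) {
            touchTable.put(sessionId, timeoutMs);
        }

        // On each ping, hand the accumulated touches to the leader and
        // start a fresh table for the next interval.
        Map<Long, Integer> snapshotForPing() {
            Map<Long, Integer> current = touchTable;
            touchTable = new ConcurrentHashMap<Long, Integer>();
            return current;
        }
    }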

Proposal: 

When we see an exception on the socket resulting in a socket close, we lower 
the timeout for the session associated with that connection. If the client does 
not reconnect within this shortened window, the session is timed out and 
ephemeral state is removed.

The simplest version of the change can be seen in the first patch submitted to 
ZOOKEEPER-922. This change does the following:

In NIOServerCnxn:doIO, when an exception is caught that is not from the client 
explicitly calling close, instead of just closing the connection, we "touch" 
the SessionTracker with a timeout set by the user (minSessionTimeout), then 
close the connection.
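
Roughly, the error path ends up shaped like the sketch below. This is the 
idea rather than the literal patch; the SessionTracker interface is pared 
down, and the closedByClient flag is an invented stand-in for however the 
connection distinguishes an explicit client close:

    // Sketch only, not the ZOOKEEPER-922 patch as submitted.
    interface SessionTracker {
        void touchSession(long sessionId, int timeoutMs);
    }

    class ConnectionSketch {
        private final SessionTracker tracker;
        private final long sessionId;
        private final int minSessionTimeout; // configured fast-fail window
        private boolean closedByClient;      // true if the client sent close

        ConnectionSketch(SessionTracker tracker, long sessionId,
                         int minSessionTimeout) {
            this.tracker = tracker;
            this.sessionId = sessionId;
            this.minSessionTimeout = minSessionTimeout;
        }

        // Called from the doIO error path: shorten the session's expiry
        // before closing, so a genuinely dead client is reaped within
        // minSessionTimeout instead of the full negotiated timeout.
        void touchAndClose() {
            if (!closedByClient) {
                tracker.touchSession(sessionId, minSessionTimeout);
            }
            close();
        }

        void close() { /* tear down the socket */ }
    }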

This results in one of two workflows. Followers will insert the sessionId and 
sessionTimeout into their touchTable, which is sent to the Leader on the next 
ping; the Leader will then call SessionTrackerImpl.touchSession. When the 
leader is itself the one doing the touchAndClose, it calls 
SessionTrackerImpl.touchSession directly. touchSession has been modified to 
allow a session's expiration time to be set lower as well as higher.
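
The touchSession change boils down to dropping the "never move the expiry 
earlier" guard. A simplified sketch, with SessionTrackerImpl's real 
tick-interval bucketing of expiry times elided:

    import java.util.HashMap;
    import java.util.Map;

    // Simplified model: one expiry timestamp per session, no buckets.
    class SessionTrackerSketch {
        private final Map<Long, Long> expiryBySession =
                new HashMap<Long, Long>();

        synchronized boolean touchSession(long sessionId, int timeoutMs) {
            Long current = expiryBySession.get(sessionId);
            if (current == null) {
                return false; // unknown session
            }
            long proposed = System.currentTimeMillis() + timeoutMs;
            // Original behavior: if (proposed <= current) { return true; }
            // i.e. a touch could only ever push the expiry later. Allowing
            // the assignment below to lower the expiry as well is what lets
            // a touchAndClose shorten the window for a crashed client.
            expiryBySession.put(sessionId, proposed);
            return true;
        }
    }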

These changes have been verified to produce a functional (not necessarily 
bug-free) implementation of the desired spec.

Possible issues:

1. If a client and a server each see a socket disconnection due to a network 
switch failure, the client will have a shorter time window in which to fail 
over to a different server before its session is timed out. This is fine with 
me; since the shorter timeout is configurable, users for whom this risk is not 
worth the benefit can set minSessionTimeout equal to the negotiated session 
timeout and avoid the problem entirely. Therefore I'm not going to attempt to 
fix this.

2. If a client and a server both see disconnections, but the client manages to 
fail over and migrate its session before the original server sends its session 
tracker update with the reduced session timeout, the client could have its 
session timed out if it does not heartbeat within the reduced timeout window, 
despite having failed over. This reduced window would only last until the new 
ensemble member re-pinged for that client, but it is still a window of 
vulnerability. This could be fixed before making this change.

Are we all on the same page so far with this? All I want to do is enable fast 
failing for those who want it, if they are willing to accept the possibility 
that certain network failures could cause over-aggressive session timeout for 
clients that are not actually dead. 

> enable faster timeout of sessions in case of unexpected socket disconnect
> -------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-922
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-922
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: Camille Fournier
>            Assignee: Camille Fournier
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-922.patch
>
>
> In the case when a client connection is closed due to socket error instead of 
> the client calling close explicitly, it would be nice to enable the session 
> associated with that client to time out faster than the negotiated session 
> timeout. This would enable a zookeeper ensemble that is acting as a dynamic 
> discovery provider to remove ephemeral nodes for crashed clients quickly, 
> while allowing for a longer heartbeat-based timeout for java clients that 
> need to do long stop-the-world GC. 
> I propose doing this by setting the timeout associated with the crashed 
> session to "minSessionTimeout".
