Haoze Wu created ZOOKEEPER-4203:
-----------------------------------
Summary: Leader swallows the ZooKeeperServer.State.ERROR from
Leader.LearnerCnxAcceptor in some concurrency condition.
Key: ZOOKEEPER-4203
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4203
Project: ZooKeeper
Issue Type: Bug
Components: server
Affects Versions: 3.6.2
Reporter: Haoze Wu
We found a bug similar to https://issues.apache.org/jira/browse/ZOOKEEPER-2029.
Leader.LearnerCnxAcceptor is a ZooKeeperCriticalThread and its exception is
handled by ZooKeeperCriticalThread#handleException, which is supposed to
shut down the process or rejoin the quorum. However, under certain concurrency
conditions this handling does not take place, so the leader, left without a
LearnerCnxAcceptor, will not accept any follower, which is the same symptom as
https://issues.apache.org/jira/browse/ZOOKEEPER-2029. After reproducing it, we
confirmed that this bug exists in both the 3.6.2 release and the master
branch.
This concurrency condition can be constructed as follows: start a ZooKeeper
cluster of 3 nodes, and make the ServerSocket#accept invocation throw an
exception in the LearnerCnxAcceptorHandler for the second follower that tries
to join the quorum. If these 3 nodes are started almost simultaneously, e.g.,
within 2 seconds, then the aforementioned symptom may occur. In the log, we can
observe that one of the followers keeps trying to join the quorum and keeps
failing, while the other 2 servers always receive the request from this
follower but never let it join the quorum.
We prepared the reproduction scripts in a gist link, and the exception thrown
from ServerSocket#accept is injected by the Byteman script
`serverSocketAccept-exception.btm`.
The root cause of this bug is that when the Leader.LearnerCnxAcceptor thread
handles an exception in ZooKeeperCriticalThread#handleException, the exception
is not really "handled": the handler does nothing but set the ZooKeeperServer
state to ERROR through the following call chain:
ZooKeeperCriticalThread#handleException ->
ZooKeeperServerListenerImpl#notifyStopping -> LeaderZooKeeperServer#setState.
LeaderZooKeeperServer#setState is implemented in
QuorumZooKeeperServer#setState, which in this scenario simply sets the state to
ZooKeeperServer.State.ERROR.
Under normal circumstances, this ERROR state is detected by another thread
through the following call chain: QuorumPeer#run -> Leader#lead ->
Leader#isRunning -> ZooKeeperServer#isRunning. In the code, this check sits in
the infinite while loop at the end of Leader#lead
([https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L758]).
If the ERROR state from Leader.LearnerCnxAcceptor or any other
ZooKeeperCriticalThread could be detected in this way, then the leader would be
aware of the ERROR state and turn to
[https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L779],
which is correct.
However, under certain concurrency conditions, the ERROR state can't be
detected in this way, because in Leader#lead,
[https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L685]
can overwrite this ERROR state with a RUNNING state through the following call
chain: Leader#lead -> Leader#startZkServer -> LeaderZooKeeperServer#startup ->
ZooKeeperServer#startup -> ZooKeeperServer#setState. Therefore, if the ERROR
state is set before the invocation of Leader#startZkServer in Leader#lead, then
this ERROR state is lost, because ZooKeeperServer#setState does not record or
check the old state.
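The ordering problem described above can be modeled in a few lines. This is
only a simplified sketch, not ZooKeeper code: the StateOverwriteSketch and
Server classes and their methods are hypothetical stand-ins, and the only
behavior taken from the report is that setting a state unconditionally replaces
the previous one.

```java
// Simplified model of the lost ERROR state. All names here are hypothetical
// stand-ins for illustration; this is not the ZooKeeper implementation.
class StateOverwriteSketch {
    enum State { INITIAL, RUNNING, SHUTDOWN, ERROR }

    static class Server {
        private volatile State state = State.INITIAL;

        // As described in the report: the old state is neither recorded nor checked.
        void setState(State next) {
            state = next;
        }

        boolean isRunning() {
            return state == State.RUNNING;
        }

        State getState() {
            return state;
        }
    }

    // Problematic ordering: the acceptor fails before the server startup runs.
    static State errorBeforeStartup() {
        Server zk = new Server();
        zk.setState(State.ERROR);   // critical thread reports its failure
        zk.setState(State.RUNNING); // startup later overwrites it
        return zk.getState();
    }

    // Benign ordering: the acceptor fails after the server startup has run.
    static State errorAfterStartup() {
        Server zk = new Server();
        zk.setState(State.RUNNING); // startup
        zk.setState(State.ERROR);   // failure stays visible to the isRunning check
        return zk.getState();
    }
}
```

In this model, errorBeforeStartup() ends in RUNNING, so a loop that polls
isRunning() never notices the failure, while errorAfterStartup() ends in ERROR.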
The reproduction script we provided can construct this concurrency condition,
because, usually, after the server startup, it takes some time for the
QuorumPeer thread to reach
[https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L685].
Thus an exception in LearnerCnxAcceptorHandler, if any, usually occurs earlier
than that, and the aforementioned symptom follows.
As for the fix, we will send a pull request. Basically, we just add a sanity
check to prevent the RUNNING/INITIAL state set during ZooKeeper startup from
overwriting a possible ERROR state.
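One possible shape of such a sanity check is sketched below. It is an
assumption made for illustration, not the content of the pull request: the
GuardedStateSketch and Server classes, the boolean return value, and the use of
synchronized are all hypothetical choices.

```java
// Hypothetical sketch of a guarded state transition; the actual fix may differ.
class GuardedStateSketch {
    enum State { INITIAL, RUNNING, SHUTDOWN, ERROR }

    static class Server {
        private State state = State.INITIAL;

        // Refuses to replace ERROR with INITIAL or RUNNING, so a failure reported
        // before startup is still visible afterwards. Returns false when the
        // requested transition was rejected.
        synchronized boolean setState(State next) {
            if (state == State.ERROR
                    && (next == State.INITIAL || next == State.RUNNING)) {
                return false;
            }
            state = next;
            return true;
        }

        synchronized State getState() {
            return state;
        }
    }

    // Same ordering as in the bug: the error arrives first, startup second.
    static State errorBeforeStartup() {
        Server zk = new Server();
        zk.setState(State.ERROR);
        zk.setState(State.RUNNING); // rejected, the state stays ERROR
        return zk.getState();
    }
}
```

With this guard, the caller performing the startup can inspect the return value
and react to the earlier failure instead of silently continuing.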
--
This message was sent by Atlassian Jira
(v8.3.4#803005)