Haoze Wu created ZOOKEEPER-4074:
-----------------------------------
Summary: Network issue while Learner is executing writePacket can
cause the follower to hang
Key: ZOOKEEPER-4074
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4074
Project: ZooKeeper
Issue Type: Bug
Components: server
Affects Versions: 3.6.2
Reporter: Haoze Wu
We were doing some systematic fault injection testing on the latest ZooKeeper
stable release 3.6.2 and found that an untimely network issue can cause ZooKeeper
followers to hang: clients connected to such a follower get stuck in their
requests, and the follower does not rejoin the quorum for a long time.
Our overall experience from the fault injection testing is that ZooKeeper is
robust in tolerating network issues (delay or packet loss) in various places. If
a thread performs a socket write and hangs in that operation because of the
network fault, the thread gets stuck, but in general ZooKeeper still handles the
situation correctly. For example, in a leader, if a `LearnerHandler` thread hangs
in this way, the `QuorumPeer` (which is running the Leader#lead method) is able
to confirm the stale PING state and abandon the problematic `LearnerHandler`.
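For illustration, below is a minimal sketch of that watchdog pattern (this is not ZooKeeper's actual Leader/LearnerHandler code; the `PeerWatchdog` class and its fields are made-up names): the main loop records when each peer last acknowledged a PING and treats as stale any peer whose handler thread has silently hung, so the connection can be torn down and re-established.
```
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: the general "stale PING" watchdog pattern described above,
// not ZooKeeper's actual implementation.
class PeerWatchdog {
    private final Map<String, Long> lastAckMillis = new ConcurrentHashMap<>();
    private final long syncLimitMillis;

    PeerWatchdog(long syncLimitMillis) {
        this.syncLimitMillis = syncLimitMillis;
    }

    // Called by a handler thread whenever a PING reply arrives from its peer.
    void recordAck(String peerId) {
        lastAckMillis.put(peerId, System.currentTimeMillis());
    }

    // Called periodically from the main loop; a stale peer's handler is closed
    // so that the peer can reconnect and resync.
    boolean isStale(String peerId) {
        Long last = lastAckMillis.get(peerId);
        return last == null || System.currentTimeMillis() - last > syncLimitMillis;
    }
}
```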
h2. Symptom
However, in the latest ZooKeeper stable release 3.6.2, we found that if a
network issue happens to occur while the `writePacket` method of the `Learner`
class is executing, the entire follower can get stuck. In particular, the whole
`QuorumPeer` thread is blocked because it is the one calling the `writePacket`
method. Unlike the other situations in which the fault is tolerated by
ZooKeeper, neither the QuorumPeer itself nor any other thread detects or handles
this issue. The follower therefore hangs in the sense that it can no longer
communicate with the leader; the leader abandons this follower once it fails to
reply to `PING` packets in time, yet the follower still believes it is a
follower.
Steps to reproduce are as follows:
# Start a cluster with 1 leader and 2 followers.
# Manually create some znodes, and do some reads and writes.
# Inject a network fault, either with a tool like `tcconfig` or with the attached
Byteman scripts.
# Once the follower is stuck, you may observe that new requests to it also get
stuck.
The scripts for reproduction are provided in
[https://gist.github.com/functioner/ad44b5e457c8cb22eac5fc861f56d0d4].
We confirmed that the issue also occurs on the latest `master` branch.
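Besides `tcconfig` and Byteman, the stall can also be reproduced in a test environment with a small TCP proxy that sits between the follower and the leader's quorum port and simply stops forwarding bytes after a while, so the follower's send buffer fills up and `writePacket` blocks. The sketch below is only a rough illustration under that assumption (the port numbers, the fixed forwarding window, and the `StallingProxy` name are all made up; it is not part of the attached scripts).
```
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical reproduction aid: a proxy that forwards traffic for a while and
// then stalls, so TCP backpressure eventually blocks the follower's write().
public class StallingProxy {
    public static void main(String[] args) throws Exception {
        int listenPort = 12888;          // point the follower at this port instead of the leader
        String leaderHost = "127.0.0.1";
        int leaderPort = 2888;           // the real leader quorum port
        long forwardMillis = 10_000;     // forward normally for 10s, then stall

        try (ServerSocket server = new ServerSocket(listenPort)) {
            Socket follower = server.accept();
            Socket leader = new Socket(leaderHost, leaderPort);
            long deadline = System.currentTimeMillis() + forwardMillis;

            Thread up = pump(follower.getInputStream(), leader.getOutputStream(), deadline);
            Thread down = pump(leader.getInputStream(), follower.getOutputStream(), deadline);
            up.join();
            down.join();
        }
    }

    // Copies bytes until the deadline, then stops reading entirely so the
    // sender's socket buffers fill up and its write() call blocks.
    private static Thread pump(InputStream in, OutputStream out, long deadline) {
        Thread t = new Thread(() -> {
            byte[] buf = new byte[4096];
            try {
                int n;
                while (System.currentTimeMillis() < deadline && (n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    out.flush();
                }
                Thread.sleep(Long.MAX_VALUE); // stall: stop draining this direction
            } catch (Exception ignored) {
                // best-effort demo; errors just end the pump
            }
        });
        t.start();
        return t;
    }
}
```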
h2. Root Cause
The `writePacket` method can be invoked by 3 threads: FollowerRequestProcessor,
SyncRequestProcessor, and QuorumPeer. The possible stack traces are listed in
the attached `stacktrace.md` in
[https://gist.github.com/functioner/ad44b5e457c8cb22eac5fc861f56d0d4].
There are two key issues. First, the network I/O is performed inside a
synchronized block. Second, unlike the socket connect and read operations, which
are protected by timeouts in ZooKeeper, the socket write (through the
OutputArchive wrapping the socket's OutputStream) never throws a timeout
exception while it is stuck; it only completes once the network issue is
resolved. Although `readPacket` is protected by a socket read timeout, the
follower (QuorumPeer) does not initiate a rejoin, because the QuorumPeer thread
is blocked in `writePacket` and never reaches the receiving stage
(`readPacket`), so no timeout exception is thrown to trigger the error handling.
```
    void writePacket(QuorumPacket pp, boolean flush) throws IOException {
        synchronized (leaderOs) {
            if (pp != null) {
                messageTracker.trackSent(pp.getType());
                leaderOs.writeRecord(pp, "packet");
            }
            if (flush) {
                bufferedOutput.flush();
            }
        }
    }
```
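To make the asymmetry concrete, the following standalone sketch (not ZooKeeper code) shows that `Socket#setSoTimeout` only bounds reads: once the peer stops draining and the TCP buffers fill up, `OutputStream#write` blocks indefinitely and never throws a `SocketTimeoutException`, which is exactly the state `writePacket` ends up in under the injected fault.
```
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Demonstrates that a blocking socket write has no timeout, even when
// SO_TIMEOUT is set: the accepting side never reads, so once the send and
// receive buffers are full, write() blocks forever.
public class BlockedWriteDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort());
             Socket accepted = server.accept()) {

            client.setSoTimeout(2000); // bounds read() only, not write()

            OutputStream out = client.getOutputStream();
            byte[] chunk = new byte[64 * 1024];
            long sent = 0;
            while (true) {
                out.write(chunk); // eventually blocks here, with no SocketTimeoutException
                sent += chunk.length;
                System.out.println("wrote " + sent + " bytes so far");
            }
        }
    }
}
```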
h2. Fix
While preparing this bug report, we found that ZOOKEEPER-3575 has already
proposed a fix that moves the packet sending in the learner to a separate
thread. The reason we were still able to expose the symptom on the `master`
branch is that the fix is disabled by default via an undocumented configuration
parameter, `learner.asyncSending`. We repeated the fault injection testing on
the `master` branch with this parameter set to true
(`-Dlearner.asyncSending=true`) and found that the symptom was gone.
Specifically, with the network issue injected, the packet writes to the leader
are still stuck, but the `QuorumPeer` thread is no longer blocked and can detect
the issue while receiving packets from the leader, thanks to the timeout on the
socket read. As a result, the follower quickly goes back to the `LOOKING` and
then `FOLLOWING` state, while the problematic `LearnerSender` is abandoned and
recreated.
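For clarity, here is a simplified sketch of the async-sending idea (it is not the actual `LearnerSender` code from ZOOKEEPER-3575; `AsyncPacketSender`, `PacketWriter`, and the placeholder `QuorumPacket` are stand-ins): the QuorumPeer thread and the request processors only enqueue packets, and a dedicated sender thread performs the blocking socket write, so only the sender thread can get stuck while the QuorumPeer keeps reading from the leader and hits the read timeout as usual.
```
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified sketch of moving the blocking write off the QuorumPeer thread.
class AsyncPacketSender implements Runnable {

    // Stand-in for the jute-generated org.apache.zookeeper.server.quorum.QuorumPacket.
    static final class QuorumPacket { }

    // Stand-in for the code that writes to leaderOs and flushes bufferedOutput.
    interface PacketWriter {
        void write(QuorumPacket packet, boolean flush) throws IOException;
    }

    private final BlockingQueue<QuorumPacket> queue = new LinkedBlockingQueue<>();
    private final PacketWriter writer;

    AsyncPacketSender(PacketWriter writer) {
        this.writer = writer;
    }

    // Called by QuorumPeer / request processors: never blocks on the network.
    void enqueue(QuorumPacket packet) {
        queue.offer(packet);
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                QuorumPacket packet = queue.take();
                writer.write(packet, true); // only this thread can hang on a stuck write
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            // A failed/stuck sender is abandoned; the QuorumPeer notices the
            // problem separately via its socket read timeout.
        }
    }
}
```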
h2. Proposed Improvements
It seems that the fix in ZOOKEEPER-3575 was not enabled by default, perhaps
because it was not clear whether the issue could really occur.
We would like to first confirm that this is an issue, and we hope the attached
reproduction scripts are helpful. Our testing also shows that the issue occurs
not only during the shutdown phase, as pointed out in ZOOKEEPER-3575, but also
during regular request handling, which can be serious.
In addition, we would like to propose making the parameter
`learner.asyncSending` default to true so the fix is enabled by default. A
related improvement is to describe this parameter in the documentation.
Otherwise, administrators have to read the source code and be lucky enough to
stumble upon the beginning of Learner.java to realize there is a parameter that
fixes the behavior.
Lastly, the async packet sending only exists on the `master` branch (and 3.7.x);
there is no such fix or parameter in the latest stable release (3.6.2). We are
wondering whether this fix should be backported to the 3.6.x branch.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)