[
https://issues.apache.org/jira/browse/ZOOKEEPER-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
mutu updated ZOOKEEPER-4844:
----------------------------
Description:
{*}Symptom:{*} If a thread blocks on a file write inside writeLongToFile, that
thread hangs. ZooKeeper should detect this kind of blocking via PING. However,
if the QuorumPeer thread itself executes writeLongToFile and encounters a
fail-slow disk, the entire follower can get stuck. The leader will abandon this
follower, but the follower still believes that it is a follower.
The call stack is as follows:
{code:java}
at org.apache.zookeeper.common.AtomicFileOutputStream.write(AtomicFileOutputStream.java:72)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:72)
at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:54)
at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:2233)
at org.apache.zookeeper.server.quorum.QuorumPeer.setAcceptedEpoch(QuorumPeer.java:2262)
at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:510)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:91)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1556)
{code}
*Root cause:* The QuorumPeer thread is blocked in writeLongToFile and cannot
execute readPacket, so no timeout exception is raised to trigger the error
handler. Moreover, this problem cannot be handled by adding
"-Dlearner.asyncSending=true"@4070
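
The root cause above comes down to this: a thread blocked inside a write cannot
notice its own stall, so only a second thread can enforce a deadline. The
following is a minimal, self-contained sketch of that idea and not ZooKeeper
code or a proposed patch. The class FailSlowWriteSketch and the methods
slowWrite and writeWithDeadline are hypothetical names, and Thread.sleep stands
in for the fail-slow disk.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FailSlowWriteSketch {

    /** Hypothetical stand-in for a write that stalls on a fail-slow disk. */
    static void slowWrite(long stallMillis) throws InterruptedException {
        Thread.sleep(stallMillis);
    }

    /**
     * Runs the write on a helper thread and waits at most timeoutMillis for it.
     * Returns true if the write finished in time, false if the caller gave up.
     */
    static boolean writeWithDeadline(long stallMillis, long timeoutMillis) throws Exception {
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            // The lambda returns a value, so it is submitted as a Callable
            // and may throw InterruptedException.
            Future<?> pending = io.submit(() -> {
                slowWrite(stallMillis);
                return null;
            });
            try {
                pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                // The caller regains control here and could run its error
                // handling. Note: interrupting works for the sleep used in
                // this sketch; a real write stuck in the kernel may not
                // respond to interruption.
                pending.cancel(true);
                return false;
            }
        } finally {
            io.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("healthy disk finished in time: " + writeWithDeadline(10, 1000));
        System.out.println("stalled disk finished in time: " + writeWithDeadline(5000, 100));
    }
}
```

In this sketch the calling thread stays free to react after the deadline, which
is the property the blocked QuorumPeer thread lacks in the call stack above.
Whether a helper thread is an acceptable design for writeLongToFile is left
open here.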
> Fail-slow disk while executing writeLongToFile can cause the follower to hang
> -----------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4844
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4844
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.10.0
> Reporter: mutu
> Priority: Major
> Attachments: system1.log, system2.log, system3.log
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)