[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mutu updated ZOOKEEPER-4844:
----------------------------
          Component/s: server
    Affects Version/s: 3.10.0
          Description: 
*Symptom:* If a thread is doing a file write and gets stuck in 
writeLongToFile, that thread will hang. This blocking should be handled by 
ZooKeeper via PING. However, if the QuorumPeer thread itself executes 
writeLongToFile and encounters a fail-slow disk, the entire follower can get 
stuck. The leader will abandon this follower, but the follower still believes 
that it is a follower.

The call stack is as follows:
{code:java}
    at org.apache.zookeeper.common.AtomicFileOutputStream.write(AtomicFileOutputStream.java:72)
    at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
    at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
    at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
    at java.io.BufferedWriter.flush(BufferedWriter.java:254)
    at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:72)
    at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:54)
    at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:2233)
    at org.apache.zookeeper.server.quorum.QuorumPeer.setAcceptedEpoch(QuorumPeer.java:2262)
    at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:510)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:91)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1556)
{code}
*Root cause:* The QuorumPeer thread is blocked in writeLongToFile and cannot 
execute readPacket, so no timeout exception is raised to trigger the error 
handler.
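
The root cause can be illustrated with a small, generic Java sketch: if the 
potentially blocking write runs on a helper thread and the caller waits for it 
with a deadline, the caller regains control and can reach its own error 
handling. This is not ZooKeeper code and not a proposed patch. The names 
BoundedWriteSketch and runWithTimeout are hypothetical, and the "slow disk" is 
simulated with Thread.sleep. A write that is truly stuck in the kernel may not 
respond to interruption, so the helper thread could remain blocked even after 
the caller has moved on.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWriteSketch {

    /**
     * Runs a possibly blocking task on a helper thread and waits at most
     * timeoutMs for it. Returns true if the task finished in time.
     * (Hypothetical helper, for illustration only.)
     */
    static boolean runWithTimeout(Runnable blockingWrite, long timeoutMs)
            throws InterruptedException {
        // Daemon thread, so a write that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-write");
            t.setDaemon(true);
            return t;
        });
        Future<?> future = executor.submit(blockingWrite);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The caller is unblocked here and can run its error handler.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A write that completes immediately.
        boolean fast = runWithTimeout(() -> { }, 500);
        // A simulated fail-slow disk: the "write" takes far longer than the deadline.
        boolean slow = runWithTimeout(() -> {
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException ignored) {
                // Interrupted by cancel(true); give up on the simulated write.
            }
        }, 100);
        System.out.println("fast=" + fast + " slow=" + slow);
    }
}
```

In the stack above the write happens directly on the thread that would 
otherwise call readPacket, which is why no deadline ever fires there.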
              Summary: Fail-slow disk while executing writeLongToFile can cause 
the follower to hang  (was: Fail-slow disk while Learner is executing 
writeLongToFile can cause the follower to hang)

> Fail-slow disk while executing writeLongToFile can cause the follower to hang
> -----------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4844
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4844
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.10.0
>            Reporter: mutu
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
