[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-4734:
--------------------------------------
    Labels: pull-request-available  (was: )

> FuzzySnapshotRelatedTest becomes flaky when transient disk failure appears
> --------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4734
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4734
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: tests
>    Affects Versions: 3.6.0
>            Reporter: Haoze Wu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In testPZxidUpdatedWhenLoadingSnapshot(), a quorum server is stopped and 
> restarted to test snapshot loading. However, while the quorum server 
> restarts, it calls ZKDatabase#loadDataBase(), which can throw an 
> IOException because of a transient disk failure. 
> {code:java}
> public long loadDataBase() throws IOException {
>     long zxid = snapLog.restore(dataTree, sessionsWithTimeouts,
>             commitProposalPlaybackListener); // line 240: IOException thrown here
>     initialized = true;
>     return zxid;
> } {code}
> In FileTxnSnapLog#restore():
> {code:java}
> public long restore(DataTree dt, Map<Long, Integer> sessions,
>                     PlayBackListener listener) throws IOException {
>     long deserializeResult = snapLog.deserialize(dt, sessions); // IOException here!
> ...
> }{code}
> Here is the stacktrace: 
> {code:java}
>         at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java)
>         at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:862)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:848)
>         at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:201)
>         at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:124)
>         at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:330)
>         at java.lang.Thread.run(Thread.java:748) {code}
> Because of this IOException, the restart fails, and so does the test. 
> As for the fix, we could either retry the test, as proposed in 
> ZOOKEEPER-3157, or add a configurable retry mechanism to 
> ZKDatabase#loadDataBase() to tolerate transient disk failures. 
>  
>  
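The second option above (a configurable retry around the load) could look
roughly like the sketch below. This is only an illustration under stated
assumptions: RetryingLoader, IoCall, withRetries, maxAttempts and backoffMs
are hypothetical names, not existing ZooKeeper APIs, and nothing here is taken
from the actual pull request.

```java
import java.io.IOException;

/** Hypothetical helper, not part of ZooKeeper: retries an I/O operation. */
public class RetryingLoader {

    /** Stand-in for an operation such as ZKDatabase#loadDataBase(). */
    @FunctionalInterface
    public interface IoCall<T> {
        T call() throws IOException;
    }

    /**
     * Runs the call up to maxAttempts times, sleeping backoffMs between
     * attempts. Rethrows the last IOException if every attempt fails.
     */
    public static <T> T withRetries(IoCall<T> call, int maxAttempts, long backoffMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e; // assumed transient; remember it and try again
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                }
            }
        }
        throw last; // all attempts failed
    }
}
```

Assuming a ZKDatabase instance named zkDb, a caller could then write something
like RetryingLoader.withRetries(zkDb::loadDataBase, 3, 100L), with the attempt
count and backoff read from configuration. Whether a retry is safe after a
partially applied restore is a separate question that this sketch does not
address.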



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
