[
https://issues.apache.org/jira/browse/ZOOKEEPER-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ZOOKEEPER-4734:
--------------------------------------
Labels: pull-request-available (was: )
> FuzzySnapshotRelatedTest becomes flaky when transient disk failure appears
> --------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4734
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4734
> Project: ZooKeeper
> Issue Type: Bug
> Components: tests
> Affects Versions: 3.6.0
> Reporter: Haoze Wu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In testPZxidUpdatedWhenLoadingSnapshot(), a quorum server is stopped and
> restarted to test snapshot loading. However, while the quorum server
> restarts, it calls ZKDatabase#loadDataBase(), which can throw an
> IOException because of a transient disk failure.
> {code:java}
> public long loadDataBase() throws IOException {
>     long zxid = snapLog.restore(dataTree, sessionsWithTimeouts,
>             commitProposalPlaybackListener); // line 240: IOException thrown here
>     initialized = true;
>     return zxid;
> } {code}
> In FileTxnSnapLog#restore
> {code:java}
> public long restore(DataTree dt, Map<Long, Integer> sessions,
>         PlayBackListener listener) throws IOException {
>     long deserializeResult = snapLog.deserialize(dt, sessions); // IOException here!
>     ...
> }{code}
> Here is the stacktrace:
> {code:java}
> at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java)
> at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
> at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:862)
> at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:848)
> at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:201)
> at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:124)
> at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:330)
> at java.lang.Thread.run(Thread.java:748) {code}
> Because of this IOException, the restart fails, and so does the test.
> As for the fix, we could either retry the test, as proposed in
> ZOOKEEPER-3157, or add a configurable retry mechanism to
> ZKDatabase#loadDataBase() to tolerate transient disk failures.
>
>
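To make the second option concrete, here is a minimal sketch of what a configurable retry wrapper could look like. It is not taken from ZooKeeper or from any proposed patch: the class RetryingLoader, the interface IoSupplier, and the parameters maxAttempts and backoffMs are all hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

// Hypothetical helper, not part of ZooKeeper: retries an I/O operation
// a bounded number of times before giving up.
final class RetryingLoader {

    // Hypothetical functional interface for an operation that may throw IOException.
    @FunctionalInterface
    interface IoSupplier<T> {
        T get() throws IOException;
    }

    private RetryingLoader() {
    }

    // Runs op up to maxAttempts times, sleeping backoffMs between failed attempts.
    // Rethrows the last IOException if every attempt fails.
    static <T> T withRetries(IoSupplier<T> op, int maxAttempts, long backoffMs)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        InterruptedIOException iioe =
                                new InterruptedIOException("interrupted while retrying");
                        iioe.initCause(ie);
                        throw iioe;
                    }
                }
            }
        }
        throw last;
    }
}
```

Under these assumptions, a caller could wrap the load as RetryingLoader.withRetries(() -> zkDb.loadDataBase(), 3, 100L), where zkDb stands for a ZKDatabase instance. The sketch retries on every IOException and does not try to tell transient failures from permanent ones such as a corrupted snapshot, so a real fix would likely need a stricter retry condition.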