[ https://issues.apache.org/jira/browse/ZOOKEEPER-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ZOOKEEPER-4734:
--------------------------------------
    Labels: pull-request-available  (was: )

> FuzzySnapshotRelatedTest becomes flaky when transient disk failure appears
> --------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4734
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4734
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: tests
>    Affects Versions: 3.6.0
>            Reporter: Haoze Wu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In testPZxidUpdatedWhenLoadingSnapshot(), a quorum server is stopped and
> restarted to test snapshot loading. During the restart, however, the server
> calls ZKDatabase#loadDataBase(), which can throw an IOException if a
> transient disk failure occurs.
> {code:java}
> public long loadDataBase() throws IOException {
>     long zxid = snapLog.restore(dataTree, sessionsWithTimeouts,
>         commitProposalPlaybackListener); // line 240; the IOException is thrown here
>     initialized = true;
>     return zxid;
> } {code}
> In FileTxnSnapLog#restore:
> {code:java}
> public long restore(DataTree dt, Map<Long, Integer> sessions,
>         PlayBackListener listener) throws IOException {
>     long deserializeResult = snapLog.deserialize(dt, sessions); // IOException here!
>     ...
> }{code}
> Here is the stack trace:
> {code:java}
> at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java)
> at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
> at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:862)
> at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:848)
> at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:201)
> at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:124)
> at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:330)
> at java.lang.Thread.run(Thread.java:748) {code}
> Because of this IOException, the restart fails and so does the test.
> As for a fix, we could either retry the test, as proposed in
> ZOOKEEPER-3157, or add a configurable retry mechanism to
> ZKDatabase#loadDataBase() to tolerate transient disk failures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
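
To make the second option in the report concrete, here is a minimal sketch of what a configurable retry around an IOException-throwing load could look like. It is not taken from ZooKeeper or from the linked pull request: the RetryingLoader class, its Loader interface, and the maxAttempts and backoffMs parameters are all hypothetical names introduced for illustration, and the loader stands in for a call such as loadDataBase().

```java
import java.io.IOException;

/** Hypothetical helper (not part of ZooKeeper): retries an I/O-bound load a bounded number of times. */
final class RetryingLoader {

    /** Stand-in for a call that returns a zxid and may fail with an IOException. */
    @FunctionalInterface
    interface Loader {
        long load() throws IOException;
    }

    private RetryingLoader() {
    }

    /**
     * Calls loader.load() up to maxAttempts times, sleeping backoffMs between
     * failed attempts. If every attempt fails, the last IOException is rethrown.
     */
    static long retry(Loader loader, int maxAttempts, long backoffMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return loader.load();
            } catch (IOException e) {
                last = e; // remember the failure; it may be transient
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs); // wait before the next attempt
                }
            }
        }
        throw last; // all attempts failed
    }
}
```

Under these assumptions, a caller would wrap the load as RetryingLoader.retry(() -> db.loadDataBase(), attempts, backoff), with attempts and backoff read from configuration. Whether retrying is safe depends on the load leaving no partial state behind after a failed attempt, which this sketch does not address.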