Haoze Wu created ZOOKEEPER-4734:
-----------------------------------
Summary: FuzzySnapshotRelatedTest becomes flaky when transient
disk failure appears
Key: ZOOKEEPER-4734
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4734
Project: ZooKeeper
Issue Type: Bug
Components: tests
Affects Versions: 3.6.0
Reporter: Haoze Wu
In testPZxidUpdatedWhenLoadingSnapshot(), a quorum server is stopped and
restarted to test snapshot loading. However, while the quorum server
restarts, it calls ZKDatabase#loadDataBase(), which can throw an
IOException because of a transient disk failure.
{code:java}
public long loadDataBase() throws IOException {
    long zxid = snapLog.restore(dataTree, sessionsWithTimeouts,
            commitProposalPlaybackListener); // line 240; IOException thrown here
    initialized = true;
    return zxid;
}
{code}
In FileTxnSnapLog#restore:
{code:java}
public long restore(DataTree dt, Map<Long, Integer> sessions,
        PlayBackListener listener) throws IOException {
    long deserializeResult = snapLog.deserialize(dt, sessions); // IOException
    ...
}
{code}
Here is the stacktrace:
{code:java}
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:862)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:848)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:201)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:124)
at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:330)
at java.lang.Thread.run(Thread.java:748)
{code}
Because of this IOException, the restart fails, and the test fails with it.
As for the fix, we could either retry the test, as proposed in
ZOOKEEPER-3157, or add a configurable retry mechanism to
ZKDatabase#loadDataBase() so that it tolerates transient disk failures.
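To make the second option concrete, here is a minimal sketch of a bounded retry wrapper. It is not existing ZooKeeper code: the names RetryingLoadSketch, IoCall, retryOnIOException, maxAttempts and backoffMs are all hypothetical, and the sketch does not address whether a partially restored DataTree can safely be reused on the next attempt.

```java
import java.io.IOException;

// Hypothetical sketch, not part of ZooKeeper: retries an IOException-throwing
// call a bounded number of times with a fixed pause between attempts.
final class RetryingLoadSketch {

    /** Stand-in for an operation such as ZKDatabase#loadDataBase(). */
    @FunctionalInterface
    interface IoCall<T> {
        T call() throws IOException;
    }

    /**
     * Runs op until it succeeds or maxAttempts is reached.
     * maxAttempts and backoffMs are assumed to come from configuration.
     */
    static <T> T retryOnIOException(IoCall<T> op, int maxAttempts, long backoffMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e; // remember the failure in case every attempt fails
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs); // give a transient fault time to clear
                }
            }
        }
        throw last; // all attempts failed: surface the most recent IOException
    }
}
```

A caller could wrap the load as retryOnIOException(() -> zkDb.loadDataBase(), 3, 100L), where zkDb is an assumed ZKDatabase reference. With maxAttempts set to 1 the wrapper behaves like a plain call, so the default behavior would not have to change.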