[jira] [Updated] (ZOOKEEPER-4734) FuzzySnapshotRelatedTest becomes flaky when transient disk failure appears

Haoze Wu (Jira) Fri, 11 Aug 2023 13:46:04 -0700


     [ https://issues.apache.org/jira/browse/ZOOKEEPER-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Haoze Wu updated ZOOKEEPER-4734:
--------------------------------
    Description: 
In testPZxidUpdatedWhenLoadingSnapshot(), a quorum server is stopped and 
restarted to test snapshot loading. During the restart, the server calls 
ZKDatabase#loadDataBase(), which can throw an IOException if a transient 
disk failure occurs. 
{code:java}
public long loadDataBase() throws IOException {
    // line 240: the IOException is thrown here
    long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener);
    initialized = true;
    return zxid;
} {code}
The exception originates in FileTxnSnapLog#restore():
{code:java}
public long restore(DataTree dt, Map<Long, Integer> sessions,
                    PlayBackListener listener) throws IOException {
    long deserializeResult = snapLog.deserialize(dt, sessions); // IOException thrown here
    ...
}{code}
Here is the stacktrace: 
{code:java}
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java)
        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:862)
        at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:848)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:201)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:124)
        at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:330)
        at java.lang.Thread.run(Thread.java:748) {code}
Because of this IOException, the restart fails, and so does the test. 

As for a fix, we could either retry the test, as proposed in 
ZOOKEEPER-3157, or add a configurable retry mechanism to 
ZKDatabase#loadDataBase() to tolerate transient disk failures. 
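
To make the second option concrete, here is a minimal sketch of what a configurable retry wrapper might look like. It is not taken from the ZooKeeper code base: the class RetryingLoader, the interface IOCall, and the parameters maxAttempts and backoffMs are all hypothetical names chosen for illustration.

```java
import java.io.IOException;

/** Hypothetical helper, not part of ZooKeeper: retries an I/O call a bounded number of times. */
public class RetryingLoader {

    /** Hypothetical stand-in for a call such as loadDataBase() that may throw IOException. */
    @FunctionalInterface
    public interface IOCall<T> {
        T call() throws IOException;
    }

    /**
     * Runs the call up to maxAttempts times, sleeping backoffMs between failed attempts.
     * If every attempt fails, the IOException from the last attempt is rethrown.
     */
    public static <T> T withRetries(IOCall<T> call, int maxAttempts, long backoffMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e; // remember the failure in case this was the final attempt
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs); // fixed delay before the next attempt
                }
            }
        }
        throw last;
    }
}
```

Under these assumptions, the call at line 240 could be wrapped roughly as withRetries(() -> snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener), maxAttempts, backoffMs), with both values read from configuration. A real patch would also need to check whether a failed restore leaves dataTree partially populated before trying again.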

 

 


> FuzzySnapshotRelatedTest becomes flaky when transient disk failure appears
> --------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4734
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4734
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: tests
>    Affects Versions: 3.6.0
>            Reporter: Haoze Wu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
