[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614183#comment-16614183 ] Fangmin Lv commented on ZOOKEEPER-2845: --- [~revans2] Sorry for getting back to this so late; I was on parental leave and completely missed this thread (my daughter was born on Jan 25, so I was busy dealing with the new challenges there :) ). I'm revisiting my open PR today and came across this one. I checked your fix, and it looks nice and simple! There was one thing I thought might be a problem, but it actually isn't anymore thanks to ZOOKEEPER-2678, which you made last time. My concern was that [ZooKeeperServer.processTxn(TxnHeader, Record)](https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java#L1213) does not add the txn to the commit log in ZKDatabase, which would leave a hole in the commit logs if we applied txns directly to the DataTree during DIFF sync, and that in turn could cause data inconsistency if the server became leader. But with ZOOKEEPER-2678 we no longer do this, so it's fine. Our internal patch is a bit heavier and more complex; we may switch to this simpler solution as well. Thanks again for moving this forward!

> Data inconsistency issue due to retain database in leader election
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2845
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>             Fix For: 3.5.4, 3.6.0, 3.4.12
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time during leader election. In a ZooKeeper ensemble, it's possible that the snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or that the txn file is ahead of the snapshot because no commit message has been received yet.
> If the snapshot is ahead of the txn file, this is not an issue: since the SyncRequestProcessor queue is drained during shutdown, the snapshot and txn file will be consistent before leader election happens.
> But if the txn file is ahead of the snapshot, the ensemble can end up with inconsistent data. Here is a simplified scenario that shows the issue. Say we have 3 servers in the ensemble; A and B are followers, C is the leader, and all snapshots and txns are up to T0:
> 1. A new request reaches leader C to create node N, and it is converted to txn T1.
> 2. T1 is synced to disk on C, but A and B restart just before the proposal reaches the followers, so T1 never exists on A and B.
> 3. A and B form a new quorum after the restart; say B is the new leader.
> 4. C moves to the LOOKING state because it no longer has enough followers, and syncs with leader B with last zxid T0, which results in an empty DIFF sync.
> 5. C restarts before taking a snapshot and replays the txns on disk, which include T1. Now C has node N, but A and B do not.
> I also included a test case that reproduces this issue consistently.
> We have a totally different RetainDB version that avoids this issue by reconciling the snapshot and txn files before leader election; I will submit it for review.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
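The five-step scenario in the description can be sketched as a toy model. This is hypothetical illustration code, not ZooKeeper's actual classes: it only shows how a txn fsynced on the old leader, but never proposed to a quorum, survives an empty DIFF sync and reappears when the txn log is replayed.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a quorum member: an on-disk txn log plus an in-memory tree.
class ToyServer {
    final String name;
    final List<String> txnLog = new ArrayList<>();   // models the on-disk txn file
    final List<String> dataTree = new ArrayList<>(); // models the in-memory database

    ToyServer(String name) { this.name = name; }

    // Steps 1-2: the leader fsyncs the txn before the proposal reaches a quorum.
    void logTxn(String txn) {
        txnLog.add(txn);
    }

    // Step 4: an empty DIFF sync sends nothing, so the database is left as-is.
    void emptyDiffSyncWith(ToyServer leader) { /* intentionally a no-op */ }

    // Step 5: a restart replays every txn in the on-disk log into the tree.
    void restartAndReplay() {
        dataTree.clear();
        dataTree.addAll(txnLog);
    }
}

public class RetainDbScenario {
    public static void main(String[] args) {
        ToyServer b = new ToyServer("B"); // new leader after A and B restart
        ToyServer c = new ToyServer("C"); // old leader

        c.logTxn("create /N");   // T1 hits C's disk, never reaches A or B
        c.emptyDiffSyncWith(b);  // C rejoins with last zxid T0: empty DIFF
        c.restartAndReplay();    // C replays its log, resurrecting T1

        System.out.println("C has /N: " + c.dataTree.contains("create /N")); // true
        System.out.println("B has /N: " + b.dataTree.contains("create /N")); // false
    }
}
```

The point of the sketch is that nothing in steps 3-5 ever reconciles C's txn log with the quorum's agreed history, which is exactly the gap the fix closes by applying the commit log when restarting the server.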
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378693#comment-16378693 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue: https://github.com/apache/zookeeper/pull/453 Thanks @afine I closed them.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378692#comment-16378692 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 closed the pull request at: https://github.com/apache/zookeeper/pull/455
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378691#comment-16378691 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 closed the pull request at: https://github.com/apache/zookeeper/pull/454
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375208#comment-16375208 ] Hudson commented on ZOOKEEPER-2845: --- SUCCESS: Integrated in Jenkins build ZooKeeper-trunk #3740 (See [https://builds.apache.org/job/ZooKeeper-trunk/3740/]) ZOOKEEPER-2845: Apply commit log when restarting server. (afine: rev 722ba9409a44a35d287aac803813f508cff2420a)
* (edit) src/java/main/org/apache/zookeeper/server/ZKDatabase.java
* (edit) src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java
* (edit) src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
* (edit) src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375107#comment-16375107 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user afine commented on the issue: https://github.com/apache/zookeeper/pull/453 Thanks @revans2. I merged this and the PRs for 3.4 and 3.5.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375064#comment-16375064 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user asfgit closed the pull request at: https://github.com/apache/zookeeper/pull/453
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371612#comment-16371612 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue: https://github.com/apache/zookeeper/pull/453 @afine all of the changes in this branch are now in the pull requests to the 3.4 and 3.5 branches.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371611#comment-16371611 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue: https://github.com/apache/zookeeper/pull/455 I just rebased this and pulled in all of the changes made to the main test.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371535#comment-16371535 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue: https://github.com/apache/zookeeper/pull/454 I just rebased this and pulled in all of the changes made to the main test.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371517#comment-16371517 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue: https://github.com/apache/zookeeper/pull/453 @afine I have addressed your most recent comments. If you want me to squash commits please let me know. I have a pull request for the 3.5 branch (#454) and for the 3.4 branch (#455). I will be spending some time porting the test to them, and will let you know when it is ready.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371509#comment-16371509 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r169662234

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +923,103 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
+
+    @Test
+    public void testFailedTxnAsPartOfQuorumLoss() throws Exception {
+        // 1. start up server and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        servers = LaunchServers(SERVER_COUNT);
+
+        waitForAll(servers, States.CONNECTED);
+
+        // we need to shutdown and start back up to make sure that the create
+        // session isn't the first transaction since that is rather innocuous.
+        servers.shutDownAllServers();
+        waitForAll(servers, States.CONNECTING);
+        servers.restartAllServersAndClients(this);
+        waitForAll(servers, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = servers.findLeader();
+        Map<Long, Proposal> outstanding = servers.mt[leader].main.quorumPeer.leader.outstandingProposals;
+        // increase the tick time to delay the leader going to looking
+        servers.mt[leader].main.quorumPeer.tickTime = 1;
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to become the new leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                servers.restartClient(i, this);
+                waitForOne(servers.zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the old leader and make sure it's synced
+        //    to disk, which means it was acked from itself
+        try {
+            servers.zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the
+        // leader
+        Assert.assertEquals(1, outstanding.size());
+        Proposal p = outstanding.values().iterator().next();
+        Assert.assertEquals(OpCode.create, p.request.getHdr().getType());
+
+        // make sure it has a chance to write it to disk
+        int sleepTime = 0;
+        Long longLeader = new Long(leader);
+        while (!p.qvAcksetPairs.get(0).getAckset().contains(longLeader)) {
+            if (sleepTime > 2000) {
+                Assert.fail("Transaction not synced to disk within 1 second " + p.qvAcksetPairs.get(0).getAckset()
+                        + " expected " + leader);
+            }
+            Thread.sleep(100);
+            sleepTime += 100;
+        }
+
+        // 6. wait for the leader to quit due to not enough followers and come back up as a part of the new quorum
+        sleepTime = 0;
+        Follower f = servers.mt[leader].main.quorumPeer.follower;
+        while (f == null || !f.isRunning()) {
+            if (sleepTime > 10_000) {
+                Assert.fail("Took too long for old leader to time out " + servers.mt[leader].main.quorumPeer.getPeerState());
+            }
+            Thread.sleep(100);
+            sleepTime += 100;
+            f = servers.mt[leader].main.quorumPeer.follower;
+        }
+        servers.mt[leader].shutdown();
--- End diff --

It is a lot of very specific steps that make the data inconsistency show up. This is needed to force the transaction log to be replayed, which has an edit in it that wasn't considered as part of leader election.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367963#comment-16367963 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168884569

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -465,6 +470,37 @@ private void waitForAll(ZooKeeper[] zks, States state) throws InterruptedException
     private static class Servers {
         MainThread mt[];
         ZooKeeper zk[];
+        int[] clientPorts;
+
+        public void shutDownAllServers() throws InterruptedException {
+            for (MainThread t : mt) {
+                t.shutdown();
+            }
+        }
+
+        public void restartAllServersAndClients(Watcher watcher) throws IOException {
+            for (MainThread t : mt) {
+                if (!t.isAlive()) {
+                    t.start();
+                }
+            }
+            for (int i = 0; i < zk.length; i++) {
+                restartClient(i, watcher);
+            }
+        }
+
+        public void restartClient(int i, Watcher watcher) throws IOException {
--- End diff --

annoying nitpick: let's use a better argument name than `i`

> Data inconsistency issue due to retain database in leader election
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2845
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time during leader election. In a ZooKeeper ensemble, it's possible that the snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or that the txn file is ahead of the snapshot because no commit message has been received yet.
> If the snapshot is ahead of the txn file, the snapshot and txn file will stay consistent before leader election happens, since the SyncRequestProcessor queue is drained during shutdown, so this is not an issue.
> But if the txn file is ahead of the snapshot, it's possible that the ensemble will have a data inconsistency issue. Here is a simplified scenario to show the issue:
> Let's say we have 3 servers in the ensemble, servers A and B are followers, C is the leader, and all the snapshots and txns are up to T0:
> 1. A new request reaches leader C to create node N, and it's converted to txn T1
> 2. Txn T1 was synced to disk on C, but A and B restarted just before the proposal reached the followers, so T1 didn't exist on A and B
> 3. A and B formed a new quorum after the restart; let's say B is the leader
> 4. C changed to looking state due to not enough followers; it will sync with leader B with last zxid T0, which results in an empty diff sync
> 5. C restarted before taking a snapshot and replayed the txns on disk, which include T1; now it has node N, but A and B don't
> Also, I included a test case to reproduce this issue consistently.
> We have a totally different RetainDB version which avoids this issue by doing consensus between the snapshot and txn files before leader election; we will submit it for review.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
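The five-step scenario quoted above can be condensed into a toy model. This is NOT ZooKeeper code: zxids are plain ints, a server's database is just the list of txns it has applied, and the class and method names (`RetainDbScenario`, `replay`, `run`) are invented for illustration. It only shows why replaying a txn log that is ahead of the quorum reintroduces the uncommitted txn T1 on C.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the ZOOKEEPER-2845 scenario (steps 1-5 above).
public class RetainDbScenario {

    // On restart a server replays its entire on-disk txn log into its database.
    static List<Integer> replay(List<Integer> txnLog) {
        return new ArrayList<>(txnLog);
    }

    // Returns { A and C agree after the empty diff sync,
    //           A and C agree after C restarts and replays its log }.
    static boolean[] run() {
        final int T0 = 0, T1 = 1;

        // Every server has all txns up to T0 on disk.
        List<Integer> logA = new ArrayList<>(List.of(T0));
        List<Integer> logC = new ArrayList<>(List.of(T0));

        // Steps 1-2: leader C syncs T1 to its own txn log, but A and B
        // restart before the proposal reaches them.
        logC.add(T1);

        // Steps 3-4: A and B form a new quorum at T0; C goes to looking and
        // syncs with the new leader at last zxid T0 -- an empty diff, so C's
        // in-memory database still matches A's.
        List<Integer> dbA = replay(logA);
        List<Integer> dbC = new ArrayList<>(dbA);

        // Step 5: C restarts before taking a snapshot and replays its txn
        // log, which still contains T1 -- the databases now diverge.
        List<Integer> dbCAfterRestart = replay(logC);

        return new boolean[] { dbA.equals(dbC), dbA.equals(dbCAfterRestart) };
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("consistent after diff sync: " + r[0]);
        System.out.println("consistent after replay:    " + r[1]);
    }
}
```

The model makes the fix discussed in the thread easy to state: either the replayed log must be truncated to what the quorum agreed on, or (as in the final patch) txns applied during a DIFF sync must be kept consistent with the commit log so a later replay cannot resurrect them.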
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367961#comment-16367961 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168884819

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -465,6 +470,37 @@ private void waitForAll(ZooKeeper[] zks, States state) throws InterruptedException
     private static class Servers {
         MainThread mt[];
         ZooKeeper zk[];
+        int[] clientPorts;
+
+        public void shutDownAllServers() throws InterruptedException {
+            for (MainThread t : mt) {
+                t.shutdown();
+            }
+        }
+
+        public void restartAllServersAndClients(Watcher watcher) throws IOException {
+            for (MainThread t : mt) {
+                if (!t.isAlive()) {
+                    t.start();
+                }
+            }
+            for (int i = 0; i < zk.length; i++) {
+                restartClient(i, watcher);
+            }
+        }
+
+        public void restartClient(int i, Watcher watcher) throws IOException {
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, watcher);
+        }
+
+        public int findLeader() {
--- End diff --

there are other places in this test class that would benefit from this refactoring. Would you mind cleaning that up?
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367962#comment-16367962 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168886064

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +923,103 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
+    @Test
+    public void testFailedTxnAsPartOfQuorumLoss() throws Exception {
+        // 1. start up servers and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        servers = LaunchServers(SERVER_COUNT);
+
+        waitForAll(servers, States.CONNECTED);
+
+        // we need to shut down and start back up to make sure that the create session isn't the
+        // first transaction, since that is rather innocuous.
+        servers.shutDownAllServers();
+        waitForAll(servers, States.CONNECTING);
+        servers.restartAllServersAndClients(this);
+        waitForAll(servers, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = servers.findLeader();
+        Map<Long, Proposal> outstanding = servers.mt[leader].main.quorumPeer.leader.outstandingProposals;
+        // increase the tick time to delay the leader going to looking
+        servers.mt[leader].main.quorumPeer.tickTime = 1;
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to become the new leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                servers.restartClient(i, this);
+                waitForOne(servers.zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the old leader and make sure it's synced to disk,
+        //    which means it was acked from itself
+        try {
+            servers.zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertEquals(1, outstanding.size());
+        Proposal p = outstanding.values().iterator().next();
+        Assert.assertEquals(OpCode.create, p.request.getHdr().getType());
+
+        // make sure it has a chance to write it to disk
+        int sleepTime = 0;
+        Long longLeader = new Long(leader);
+        while (!p.qvAcksetPairs.get(0).getAckset().contains(longLeader)) {
+            if (sleepTime > 2000) {
+                Assert.fail("Transaction not synced to disk within 2 seconds " + p.qvAcksetPairs.get(0).getAckset()
+                        + " expected " + leader);
+            }
+            Thread.sleep(100);
+            sleepTime += 100;
+        }
+
+        // 6. wait for the leader to quit due to not enough followers and come back up as part of the new quorum
+        sleepTime = 0;
+        Follower f = servers.mt[leader].main.quorumPeer.follower;
+        while (f == null || !f.isRunning()) {
+            if (sleepTime > 10_000) {
--- End diff --

nitpick: can we reuse the tick time here to make the relationship more obvious?
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367960#comment-16367960 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168887935

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +923,103 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
+    @Test
+    public void testFailedTxnAsPartOfQuorumLoss() throws Exception {
+        // 1. start up servers and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        servers = LaunchServers(SERVER_COUNT);
+
+        waitForAll(servers, States.CONNECTED);
+
+        // we need to shut down and start back up to make sure that the create session isn't the
+        // first transaction, since that is rather innocuous.
+        servers.shutDownAllServers();
+        waitForAll(servers, States.CONNECTING);
+        servers.restartAllServersAndClients(this);
+        waitForAll(servers, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = servers.findLeader();
+        Map<Long, Proposal> outstanding = servers.mt[leader].main.quorumPeer.leader.outstandingProposals;
+        // increase the tick time to delay the leader going to looking
+        servers.mt[leader].main.quorumPeer.tickTime = 1;
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to become the new leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                servers.restartClient(i, this);
+                waitForOne(servers.zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the old leader and make sure it's synced to disk,
+        //    which means it was acked from itself
+        try {
+            servers.zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertEquals(1, outstanding.size());
+        Proposal p = outstanding.values().iterator().next();
+        Assert.assertEquals(OpCode.create, p.request.getHdr().getType());
+
+        // make sure it has a chance to write it to disk
+        int sleepTime = 0;
+        Long longLeader = new Long(leader);
+        while (!p.qvAcksetPairs.get(0).getAckset().contains(longLeader)) {
+            if (sleepTime > 2000) {
+                Assert.fail("Transaction not synced to disk within 2 seconds " + p.qvAcksetPairs.get(0).getAckset()
+                        + " expected " + leader);
+            }
+            Thread.sleep(100);
+            sleepTime += 100;
+        }
+
+        // 6. wait for the leader to quit due to not enough followers and come back up as part of the new quorum
+        sleepTime = 0;
+        Follower f = servers.mt[leader].main.quorumPeer.follower;
+        while (f == null || !f.isRunning()) {
+            if (sleepTime > 10_000) {
+                Assert.fail("Took too long for old leader to time out " + servers.mt[leader].main.quorumPeer.getPeerState());
+            }
+            Thread.sleep(100);
+            sleepTime += 100;
+            f = servers.mt[leader].main.quorumPeer.follower;
+        }
+        servers.mt[leader].shutdown();
--- End diff --

why do we need this?
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367823#comment-16367823 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/453

@afine and @anmolnar I think I have addressed all of your review comments, except for the one about the change to `waitForOne`, and I am happy to adjust however you want there.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367814#comment-16367814 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168857757

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
+    @Test
+    public void testTxnAheadSnapInRetainDB() throws Exception {
+        // 1. start up servers and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        final int clientPorts[] = new int[SERVER_COUNT];
+        StringBuilder sb = new StringBuilder();
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            clientPorts[i] = PortAssignment.unique();
+            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
+        }
+        String quorumCfgSection = sb.toString();
+
+        MainThread mt[] = new MainThread[SERVER_COUNT];
+        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
+            mt[i].start();
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // we need to shut down and start back up to make sure that the create session isn't the
+        // first transaction, since that is rather innocuous.
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].shutdown();
+        }
+
+        waitForAll(zk, States.CONNECTING);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].start();
+            // Recreate a client session since the previous session was not persisted.
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = -1;
+        Map<Long, Proposal> outstanding = null;
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (mt[i].main.quorumPeer.leader != null) {
+                leader = i;
+                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
+                // increase the tick time to delay the leader going to looking
+                mt[leader].main.quorumPeer.tickTime = 1;
+            }
+        }
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to become the leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+                waitForOne(zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the leader and make sure it's synced to disk,
+        //    which means it was acked from itself
+        try {
+            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertTrue(outstanding.size() == 1);
+        Proposal p = (Proposal) outstanding.values().iterator().next();
+        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
+
+        // make sure it has a chance to write it to disk
+        Thread.sleep(1000);
--- End diff --

@revans2 take a look at `testElectionFraud`, specifically:
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367809#comment-16367809 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168857052

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -435,7 +435,7 @@ private void waitForOne(ZooKeeper zk, States state) throws InterruptedException
         int iterations = ClientBase.CONNECTION_TIMEOUT / 500;
         while (zk.getState() != state) {
             if (iterations-- == 0) {
-                throw new RuntimeException("Waiting too long");
+                throw new RuntimeException("Waiting too long " + zk.getState() + " != " + state);
--- End diff --

Since @anmolnar thinks it is valuable, I think it is fine for it to be left in.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367559#comment-16367559 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168807853

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
+    @Test
+    public void testTxnAheadSnapInRetainDB() throws Exception {
+        // 1. start up servers and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        final int clientPorts[] = new int[SERVER_COUNT];
+        StringBuilder sb = new StringBuilder();
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            clientPorts[i] = PortAssignment.unique();
+            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
+        }
+        String quorumCfgSection = sb.toString();
+
+        MainThread mt[] = new MainThread[SERVER_COUNT];
+        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
+            mt[i].start();
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // we need to shut down and start back up to make sure that the create session isn't the
+        // first transaction, since that is rather innocuous.
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].shutdown();
+        }
+
+        waitForAll(zk, States.CONNECTING);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].start();
+            // Recreate a client session since the previous session was not persisted.
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = -1;
+        Map<Long, Proposal> outstanding = null;
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (mt[i].main.quorumPeer.leader != null) {
+                leader = i;
+                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
+                // increase the tick time to delay the leader going to looking
+                mt[leader].main.quorumPeer.tickTime = 1;
+            }
+        }
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to become the leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+                waitForOne(zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the leader and make sure it's synced to disk,
+        //    which means it was acked from itself
+        try {
+            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertTrue(outstanding.size() == 1);
+        Proposal p = (Proposal) outstanding.values().iterator().next();
+        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
+
+        // make sure it has a chance to write it to disk
+        Thread.sleep(1000);
--- End diff --

I was able to do what you said and drop the 1 second sleep, but for the sleep at step 6 I am going to need something else, because the leader is only in the
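The sleep-removal being discussed here ends up, in the later revision of the test, as a bounded polling loop (sleep a short interval, re-check the condition, fail after a deadline). A generic sketch of that pattern is below; the class and method names (`PollingWait`, `waitFor`) and the signature are illustrative assumptions, not code from the PR.

```java
import java.util.function.BooleanSupplier;

// Sketch of the bounded-polling wait used instead of fixed Thread.sleep() calls.
public class PollingWait {

    // Polls `condition` every pollMs until it holds; gives up after roughly
    // timeoutMs and returns false so the caller can Assert.fail with a
    // descriptive message (as the final test does).
    public static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs)
            throws InterruptedException {
        long waited = 0;
        while (!condition.getAsBoolean()) {
            if (waited > timeoutMs) {
                return false;
            }
            Thread.sleep(pollMs);
            waited += pollMs;
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // A condition that already holds returns immediately.
        System.out.println(waitFor(() -> true, 1000, 10));
        // A condition that never holds times out.
        System.out.println(waitFor(() -> false, 50, 10));
    }
}
```

Compared with a fixed `Thread.sleep(1000)`, this returns as soon as the condition holds and only fails after an explicit deadline, which makes the test both faster and less flaky.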
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367562#comment-16367562 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168807943

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
+    @Test
+    public void testTxnAheadSnapInRetainDB() throws Exception {
+        // 1. start up servers and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        final int clientPorts[] = new int[SERVER_COUNT];
+        StringBuilder sb = new StringBuilder();
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            clientPorts[i] = PortAssignment.unique();
+            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
+        }
+        String quorumCfgSection = sb.toString();
+
+        MainThread mt[] = new MainThread[SERVER_COUNT];
+        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
+            mt[i].start();
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // we need to shut down and start back up to make sure that the create session isn't the
+        // first transaction, since that is rather innocuous.
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].shutdown();
+        }
+
+        waitForAll(zk, States.CONNECTING);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].start();
+            // Recreate a client session since the previous session was not persisted.
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = -1;
+        Map<Long, Proposal> outstanding = null;
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (mt[i].main.quorumPeer.leader != null) {
+                leader = i;
+                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
+                // increase the tick time to delay the leader going to looking
+                mt[leader].main.quorumPeer.tickTime = 1;
+            }
+        }
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to become the leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+                waitForOne(zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the leader and make sure it's synced to disk,
+        //    which means it was acked from itself
+        try {
+            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertTrue(outstanding.size() == 1);
+        Proposal p = (Proposal) outstanding.values().iterator().next();
+        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
+
+        // make sure it has a chance to write it to disk
+        Thread.sleep(1000);
+        p.qvAcksetPairs.get(0).getAckset().contains(leader);
+
+        // 6. wait for the leader to quit due to not enough followers
+        Thread.sleep(4000);
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367563#comment-16367563 ] ASF GitHub Bot commented on ZOOKEEPER-2845:
---
Github user revans2 commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/453#discussion_r168807976

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
         maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
+    @Test
+    public void testTxnAheadSnapInRetainDB() throws Exception {
--- End diff --

done

> Data inconsistency issue due to retain database in leader election
> ------------------------------------------------------------------
>
> Key: ZOOKEEPER-2845
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.4.10, 3.5.3, 3.6.0
> Reporter: Fangmin Lv
> Assignee: Robert Joseph Evans
> Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time
> during leader election. In a ZooKeeper ensemble, it's possible that the
> snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or
> that the txn file is ahead of the snapshot because no commit message has been
> received yet.
> If the snapshot is ahead of the txn file, the SyncRequestProcessor queue is
> drained during shutdown, so the snapshot and txn file stay consistent before
> leader election happens; this is not an issue.
> But if the txn file is ahead of the snapshot, the ensemble can end up with
> inconsistent data. Here is a simplified scenario showing the issue:
> Say we have 3 servers in the ensemble; servers A and B are followers, C is the
> leader, and all snapshots and txns are up to T0:
> 1. A new request reaches leader C to create node N, and it's converted to txn T1
> 2. Txn T1 is synced to disk on C, but just before the proposal reaches the
> followers, A and B restart, so T1 never existed on A and B
> 3. A and B form a new quorum after the restart; say B is the leader
> 4. C changes to LOOKING state because it no longer has enough followers; it
> syncs with leader B at last zxid T0, which results in an empty DIFF sync
> 5. C restarts before taking a snapshot and replays the txns on disk, which
> include T1; now it has node N, but A and B don't
> Also, I included a test case that reproduces this issue consistently.
> We have a totally different RetainDB version which avoids this issue by doing
> consensus between the snapshot and txn files before leader election; will
> submit it for review.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
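The divergence in the scenario above comes from one invariant: a server's state after restart is its snapshot plus every txn replayed from its log, whether or not that txn was ever committed by a quorum. A toy model makes this concrete; `ToyServer`, `restore()`, and the `/N` node name are illustrative stand-ins, not ZooKeeper's actual classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of the scenario: restart state = snapshot + replayed txn log.
public class RetainDbScenario {
    static class ToyServer {
        Set<String> snapshot = new LinkedHashSet<>(); // nodes as of T0
        List<String> txnLog = new ArrayList<>();      // txns fsynced after the snapshot

        // Restart replays every logged txn on top of the snapshot,
        // including txns that never reached a quorum.
        Set<String> restore() {
            Set<String> data = new LinkedHashSet<>(snapshot);
            data.addAll(txnLog);
            return data;
        }
    }

    public static void main(String[] args) {
        ToyServer a = new ToyServer(), b = new ToyServer(), c = new ToyServer();

        // 1-2. Leader C fsyncs txn T1 ("create /N"); A and B restart before
        //      the proposal reaches them, so only C's log contains T1.
        c.txnLog.add("/N");

        // 3-4. A and B form a new quorum at T0; C gets an empty DIFF sync,
        //      which changes its in-memory tree but not its on-disk log.

        // 5. C restarts before taking a snapshot and replays its log.
        System.out.println("C has /N: " + c.restore().contains("/N")); // true
        System.out.println("A has /N: " + a.restore().contains("/N")); // false -> inconsistent
    }
}
```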
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367561#comment-16367561 ] ASF GitHub Bot commented on ZOOKEEPER-2845:
---
Github user revans2 commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/453#discussion_r168807914

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
         maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
+    @Test
+    public void testTxnAheadSnapInRetainDB() throws Exception {
+        // 1. start up servers and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        final int clientPorts[] = new int[SERVER_COUNT];
+        StringBuilder sb = new StringBuilder();
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            clientPorts[i] = PortAssignment.unique();
+            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
+        }
+        String quorumCfgSection = sb.toString();
+
+        MainThread mt[] = new MainThread[SERVER_COUNT];
+        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
+            mt[i].start();
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // we need to shutdown and start back up to make sure that the create
+        // session isn't the first transaction, since that is rather innocuous.
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].shutdown();
+        }
+
+        waitForAll(zk, States.CONNECTING);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].start();
+            // Recreate a client session since the previous session was not persisted.
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = -1;
+        Map<Long, Proposal> outstanding = null;
+        ...
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertTrue(outstanding.size() == 1);
+        Proposal p = (Proposal) outstanding.values().iterator().next();
--- End diff --

removed the cast
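The cast flagged in the review disappears once the map is declared with its type parameters; the compiler then knows the element type of `values()`. A minimal self-contained illustration, using `String` as a stand-in for ZooKeeper's `Proposal` type:

```java
import java.util.HashMap;
import java.util.Map;

public class GenericsCast {
    // With a parameterized Map, values() is typed and no cast is needed.
    static String firstValue(Map<Long, String> proposals) {
        return proposals.values().iterator().next();
    }

    public static void main(String[] args) {
        // Raw type: values come back as Object, forcing a cast at the use site.
        Map raw = new HashMap();
        raw.put(1L, "create /zk0");
        String fromRaw = (String) raw.values().iterator().next();

        // Parameterized type: the compiler tracks the value type; the cast goes away.
        Map<Long, String> typed = new HashMap<>();
        typed.put(1L, "create /zk0");
        String fromTyped = firstValue(typed);

        System.out.println(fromRaw.equals(fromTyped)); // true
    }
}
```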
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367504#comment-16367504 ] ASF GitHub Bot commented on ZOOKEEPER-2845:
---
Github user revans2 commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/453#discussion_r168795646

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -435,7 +435,7 @@ private void waitForOne(ZooKeeper zk, States state) throws InterruptedException
         int iterations = ClientBase.CONNECTION_TIMEOUT / 500;
         while (zk.getState() != state) {
             if (iterations-- == 0) {
-                throw new RuntimeException("Waiting too long");
+                throw new RuntimeException("Waiting too long " + zk.getState() + " != " + state);
--- End diff --

@anmolnar and @afine I put this in for my own debugging and forgot to remove it. If you want, I am happy to remove it, file a separate JIRA and put it up as a separate pull request, or just leave it. Either way is fine with me.
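The message change under discussion is an instance of a general pattern: when a poll loop times out, report the last observed state next to the expected one so the failure is diagnosable from the log alone. A generic sketch of that pattern (the `waitForState` name and signature are illustrative, not ZooKeeper's test API):

```java
import java.util.function.Supplier;

public class WaitUtil {
    // Poll until the condition holds or the timeout elapses; on timeout,
    // include the last observed state in the message, as the diff above does.
    public static <T> void waitForState(Supplier<T> current, T expected, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        T observed = current.get();
        while (!expected.equals(observed)) {
            if (System.currentTimeMillis() > deadline) {
                throw new RuntimeException("Waiting too long " + observed + " != " + expected);
            }
            Thread.sleep(50);
            observed = current.get();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Succeeds immediately.
        waitForState(() -> "CONNECTED", "CONNECTED", 1000);
        // Fails, and the message names both states.
        try {
            waitForState(() -> "CONNECTING", "CONNECTED", 200);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // Waiting too long CONNECTING != CONNECTED
        }
    }
}
```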
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367503#comment-16367503 ] ASF GitHub Bot commented on ZOOKEEPER-2845:
---
Github user anmolnar commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/453#discussion_r168795633

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@
+        ...
+        mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
--- End diff --

Use the `LaunchServers(numServers, tickTime)` method in this class. It gives you a collection of `MainThread` and `ZooKeeper` objects properly initialized, and the test `tearDown()` will take care of destroying them.
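`LaunchServers` is an existing helper in `QuorumPeerMainTest`; the general idea behind the suggestion is a fixture that owns both setup and teardown, so the test body can't leak servers. A self-contained sketch of that pattern using `AutoCloseable`; `ToyCluster` and `ToyServer` are illustrative stand-ins, not the real test classes:

```java
import java.util.ArrayList;
import java.util.List;

public class HarnessSketch {
    static class ToyServer {
        boolean running;
        void start()    { running = true; }
        void shutdown() { running = false; }
    }

    // The fixture owns every server it created; try-with-resources (or a
    // JUnit tearDown()) guarantees shutdown even when the test body throws.
    static class ToyCluster implements AutoCloseable {
        final List<ToyServer> servers = new ArrayList<>();

        ToyCluster(int numServers) {
            for (int i = 0; i < numServers; i++) {
                ToyServer s = new ToyServer();
                s.start();
                servers.add(s);
            }
        }

        @Override
        public void close() {
            for (ToyServer s : servers) {
                s.shutdown();
            }
        }
    }

    public static void main(String[] args) {
        ToyCluster saved;
        try (ToyCluster cluster = new ToyCluster(3)) {
            saved = cluster;
            System.out.println("running: " + cluster.servers.get(0).running); // running: true
        }
        // close() ran automatically at the end of the try block.
        System.out.println("running: " + saved.servers.get(0).running); // running: false
    }
}
```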
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367497#comment-16367497 ] ASF GitHub Bot commented on ZOOKEEPER-2845:
---
Github user revans2 commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/453#discussion_r168794042

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@
+        ...
+        // make sure it has a chance to write it to disk
+        Thread.sleep(1000);
--- End diff --

I will see if I can make it work. I agree; I would love to kill as many of the sleeps as possible.
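The usual replacement for a fixed `Thread.sleep(1000)` is to poll the condition the sleep was waiting for, so the test returns as soon as the condition holds and the timeout only bounds the worst case. A self-contained sketch; `pollUntil` and the simulated `outstanding` map are illustrative, not part of the ZooKeeper test infrastructure:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.BooleanSupplier;

public class PollSketch {
    // Poll the condition instead of sleeping a fixed amount; return true as
    // soon as it holds, false only if the deadline passes first.
    static boolean pollUntil(BooleanSupplier condition, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                return false;
            }
            Thread.sleep(20);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentMap<Long, String> outstanding = new ConcurrentHashMap<>();

        // Simulate the proposal arriving from another thread after ~100 ms.
        new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            outstanding.put(1L, "create /zk0");
        }).start();

        // Finishes shortly after the condition holds, not after a fixed delay.
        boolean ok = pollUntil(() -> outstanding.size() == 1, 5000);
        System.out.println("proposal observed: " + ok); // proposal observed: true
    }
}
```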
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367495#comment-16367495 ] ASF GitHub Bot commented on ZOOKEEPER-2845:
---
Github user revans2 commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/453#discussion_r168793764

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@
+        ...
+        mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
--- End diff --

I am not super familiar with the test infrastructure. If you have a suggestion I would love it; otherwise I will look around and see what I can come up with.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367492#comment-16367492 ] ASF GitHub Bot commented on ZOOKEEPER-2845:
---
Github user anmolnar commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/453#discussion_r168793569

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@
+        ...
+        mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
--- End diff --

+1 As mentioned, testElectionFraud() is a good example of that.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367491#comment-16367491 ] ASF GitHub Bot commented on ZOOKEEPER-2845:
---
Github user anmolnar commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/453#discussion_r168793211

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -435,7 +435,7 @@ private void waitForOne(ZooKeeper zk, States state) throws InterruptedException
-                throw new RuntimeException("Waiting too long");
+                throw new RuntimeException("Waiting too long " + zk.getState() + " != " + state);
--- End diff --

Although I agree with you in general, I think this one is a good addition to the test output.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366484#comment-16366484 ] ASF GitHub Bot commented on ZOOKEEPER-2845:
---
Github user afine commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/453#discussion_r168649080

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -435,7 +435,7 @@ private void waitForOne(ZooKeeper zk, States state) throws InterruptedException
-                throw new RuntimeException("Waiting too long");
+                throw new RuntimeException("Waiting too long " + zk.getState() + " != " + state);
--- End diff --

nit: let's minimize unrelated test changes and whitespace changes
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366485#comment-16366485 ] ASF GitHub Bot commented on ZOOKEEPER-2845:
---
Github user afine commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/453#discussion_r168649906

--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@
+        ...
+        mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
--- End diff --

is there any reason we can't use the existing test infra to clean this up a little?
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366486#comment-16366486 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user afine commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/453#discussion_r168649723 --- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java --- @@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception { maxSessionTimeOut, quorumPeer.getMaxSessionTimeout()); } +@Test +public void testTxnAheadSnapInRetainDB() throws Exception { +// 1. start up server and wait for leader election to finish +ClientBase.setupTestEnv(); +final int SERVER_COUNT = 3; +final int clientPorts[] = new int[SERVER_COUNT]; +StringBuilder sb = new StringBuilder(); +for (int i = 0; i < SERVER_COUNT; i++) { +clientPorts[i] = PortAssignment.unique(); +sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n"); +} +String quorumCfgSection = sb.toString(); + +MainThread mt[] = new MainThread[SERVER_COUNT]; +ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT]; +for (int i = 0; i < SERVER_COUNT; i++) { +mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection); +mt[i].start(); +zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this); +} + +waitForAll(zk, States.CONNECTED); + +// we need to shutdown and start back up to make sure that the create session isn't the first transaction since +// that is rather innocuous. +for (int i = 0; i < SERVER_COUNT; i++) { +mt[i].shutdown(); +} + +waitForAll(zk, States.CONNECTING); + +for (int i = 0; i < SERVER_COUNT; i++) { +mt[i].start(); +// Recreate a client session since the previous session was not persisted. +zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this); +} + +waitForAll(zk, States.CONNECTED); + +// 2. 
kill all followers +int leader = -1; +Map<Long, Proposal> outstanding = null; +for (int i = 0; i < SERVER_COUNT; i++) { +if (mt[i].main.quorumPeer.leader != null) { +leader = i; +outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals; +// increase the tick time to delay the leader going to looking +mt[leader].main.quorumPeer.tickTime = 1; +} +} +LOG.warn("LEADER {}", leader); + +for (int i = 0; i < SERVER_COUNT; i++) { +if (i != leader) { +mt[i].shutdown(); +} +} + +// 3. start up the followers to form a new quorum +for (int i = 0; i < SERVER_COUNT; i++) { +if (i != leader) { +mt[i].start(); +} +} + +// 4. wait one of the follower to be the leader +for (int i = 0; i < SERVER_COUNT; i++) { +if (i != leader) { +// Recreate a client session since the previous session was not persisted. +zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this); +waitForOne(zk[i], States.CONNECTED); +} +} + +// 5. send a create request to leader and make sure it's synced to disk, +//which means it acked from itself +try { +zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE, +CreateMode.PERSISTENT); +Assert.fail("create /zk" + leader + " should have failed"); +} catch (KeeperException e) { +} + +// just make sure that we actually did get it in process at the +// leader +Assert.assertTrue(outstanding.size() == 1); +Proposal p = (Proposal) outstanding.values().iterator().next(); +Assert.assertTrue(p.request.getHdr().getType() == OpCode.create); + +// make sure it has a chance to write it to disk +Thread.sleep(1000); +p.qvAcksetPairs.get(0).getAckset().contains(leader); + +// 6. wait the leader to quit due to no enough followers +Thread.sleep(4000); +
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366481#comment-16366481 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user afine commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/453#discussion_r168653437 --- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java --- @@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception { maxSessionTimeOut, quorumPeer.getMaxSessionTimeout()); } +@Test +public void testTxnAheadSnapInRetainDB() throws Exception { +// 1. start up server and wait for leader election to finish +ClientBase.setupTestEnv(); +final int SERVER_COUNT = 3; +final int clientPorts[] = new int[SERVER_COUNT]; +StringBuilder sb = new StringBuilder(); +for (int i = 0; i < SERVER_COUNT; i++) { +clientPorts[i] = PortAssignment.unique(); +sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n"); +} +String quorumCfgSection = sb.toString(); + +MainThread mt[] = new MainThread[SERVER_COUNT]; +ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT]; +for (int i = 0; i < SERVER_COUNT; i++) { +mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection); +mt[i].start(); +zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this); +} + +waitForAll(zk, States.CONNECTED); + +// we need to shutdown and start back up to make sure that the create session isn't the first transaction since +// that is rather innocuous. +for (int i = 0; i < SERVER_COUNT; i++) { +mt[i].shutdown(); +} + +waitForAll(zk, States.CONNECTING); + +for (int i = 0; i < SERVER_COUNT; i++) { +mt[i].start(); +// Recreate a client session since the previous session was not persisted. +zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this); +} + +waitForAll(zk, States.CONNECTED); + +// 2. 
kill all followers +int leader = -1; +Map<Long, Proposal> outstanding = null; +for (int i = 0; i < SERVER_COUNT; i++) { +if (mt[i].main.quorumPeer.leader != null) { +leader = i; +outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals; +// increase the tick time to delay the leader going to looking +mt[leader].main.quorumPeer.tickTime = 1; +} +} +LOG.warn("LEADER {}", leader); + +for (int i = 0; i < SERVER_COUNT; i++) { +if (i != leader) { +mt[i].shutdown(); +} +} + +// 3. start up the followers to form a new quorum +for (int i = 0; i < SERVER_COUNT; i++) { +if (i != leader) { +mt[i].start(); +} +} + +// 4. wait one of the follower to be the leader +for (int i = 0; i < SERVER_COUNT; i++) { +if (i != leader) { +// Recreate a client session since the previous session was not persisted. +zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this); +waitForOne(zk[i], States.CONNECTED); +} +} + +// 5. send a create request to leader and make sure it's synced to disk, +//which means it acked from itself +try { +zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE, +CreateMode.PERSISTENT); +Assert.fail("create /zk" + leader + " should have failed"); +} catch (KeeperException e) { +} + +// just make sure that we actually did get it in process at the +// leader +Assert.assertTrue(outstanding.size() == 1); +Proposal p = (Proposal) outstanding.values().iterator().next(); +Assert.assertTrue(p.request.getHdr().getType() == OpCode.create); + +// make sure it has a chance to write it to disk +Thread.sleep(1000); --- End diff -- There is a lot of `Thread.sleep()` going on and I would like to find a way to minimize that. Apache infra can occasionally be quite slow (it can starve
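A common way to cut down on fixed sleeps like these is a small poll-until-true helper with a deadline: the test then waits only as long as the condition actually needs, while still bounding the worst case. A sketch follows; `TestWait.waitFor` and its parameters are hypothetical, not existing ZooKeeper test infrastructure.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: poll a condition until it holds or a deadline passes,
// instead of sleeping for a fixed worst-case interval.
final class TestWait {
    // Returns true as soon as the condition holds; false if timeoutMs elapses first.
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // timed out; the caller decides whether to fail the test
            }
            Thread.sleep(pollMs);
        }
        return true;
    }
}
```

For example, `TestWait.waitFor(() -> outstanding.size() == 1, 5000, 50)` could replace the fixed `Thread.sleep(1000)` while tolerating slow CI machines.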
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366483#comment-16366483 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user afine commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/453#discussion_r168651275 --- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java --- @@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception { maxSessionTimeOut, quorumPeer.getMaxSessionTimeout()); } +@Test +public void testTxnAheadSnapInRetainDB() throws Exception { +// 1. start up server and wait for leader election to finish +ClientBase.setupTestEnv(); +final int SERVER_COUNT = 3; +final int clientPorts[] = new int[SERVER_COUNT]; +StringBuilder sb = new StringBuilder(); +for (int i = 0; i < SERVER_COUNT; i++) { +clientPorts[i] = PortAssignment.unique(); +sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n"); +} +String quorumCfgSection = sb.toString(); + +MainThread mt[] = new MainThread[SERVER_COUNT]; +ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT]; +for (int i = 0; i < SERVER_COUNT; i++) { +mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection); +mt[i].start(); +zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this); +} + +waitForAll(zk, States.CONNECTED); + +// we need to shutdown and start back up to make sure that the create session isn't the first transaction since +// that is rather innocuous. +for (int i = 0; i < SERVER_COUNT; i++) { +mt[i].shutdown(); +} + +waitForAll(zk, States.CONNECTING); + +for (int i = 0; i < SERVER_COUNT; i++) { +mt[i].start(); +// Recreate a client session since the previous session was not persisted. +zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this); +} + +waitForAll(zk, States.CONNECTED); + +// 2. 
kill all followers +int leader = -1; +Map<Long, Proposal> outstanding = null; +for (int i = 0; i < SERVER_COUNT; i++) { +if (mt[i].main.quorumPeer.leader != null) { +leader = i; +outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals; +// increase the tick time to delay the leader going to looking +mt[leader].main.quorumPeer.tickTime = 1; +} +} +LOG.warn("LEADER {}", leader); + +for (int i = 0; i < SERVER_COUNT; i++) { +if (i != leader) { +mt[i].shutdown(); +} +} + +// 3. start up the followers to form a new quorum +for (int i = 0; i < SERVER_COUNT; i++) { +if (i != leader) { +mt[i].start(); +} +} + +// 4. wait one of the follower to be the leader +for (int i = 0; i < SERVER_COUNT; i++) { +if (i != leader) { +// Recreate a client session since the previous session was not persisted. +zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this); +waitForOne(zk[i], States.CONNECTED); +} +} + +// 5. send a create request to leader and make sure it's synced to disk, +//which means it acked from itself +try { +zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE, +CreateMode.PERSISTENT); +Assert.fail("create /zk" + leader + " should have failed"); +} catch (KeeperException e) { +} + +// just make sure that we actually did get it in process at the +// leader +Assert.assertTrue(outstanding.size() == 1); +Proposal p = (Proposal) outstanding.values().iterator().next(); --- End diff -- Do we need this cast?
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366482#comment-16366482 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user afine commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/453#discussion_r168649459 --- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java --- @@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception { maxSessionTimeOut, quorumPeer.getMaxSessionTimeout()); } +@Test +public void testTxnAheadSnapInRetainDB() throws Exception { --- End diff -- nit: I don't think we use the terminology "RetainDB" anywhere else. Perhaps we can get rid of "retain"?
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362939#comment-16362939 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue: https://github.com/apache/zookeeper/pull/453 @anmolnar I added in an updated version of the test in #310. The issue turned out to be a race condition where the original leader would time out clients and then would join the new quorum too quickly for the test to be able to detect it. I changed it so there is a hard coded sleep instead and then just shut down the leader. I would love to get rid of the hard coded sleep, but I wasn't really sure how to do it without making some major changes in the leader code to put in a synchronization point. If you really want me to do it I can, but it felt rather intrusive. I verified that when I comment out my code that does the fast forward the test fails and when I put it back the test passes. If this looks OK I'll try to port the test to the other release branches too. I also addressed your request to make some of the code common.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362576#comment-16362576 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/453 @revans2 Take a look at `testElectionFraud()` in the same file. Maybe I'm wrong, but it seems to me to be trying to achieve something similar.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362505#comment-16362505 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue: https://github.com/apache/zookeeper/pull/453 @anmolnar I will add some kind of a test. I ran into a lot of issues with `testTxnAheadSnapInRetainDB`. I could not get it to run correctly against master as it would always end up electing the original leader again and the test would fail, but not because it had reproduced the issue. I finally just did development work based off of the [original patch](https://github.com/apache/zookeeper/compare/master...revans2:ZOOKEEPER-2845-updated-fix?expand=1) and verified that `testTxnAheadSnapInRetainDB` passed, or that if it failed it did so because of leader election.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362425#comment-16362425 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/453#discussion_r167885280 --- Diff: src/java/main/org/apache/zookeeper/server/ZKDatabase.java --- @@ -233,14 +233,32 @@ public long getDataTreeLastProcessedZxid() { * @throws IOException */ public long loadDataBase() throws IOException { -PlayBackListener listener=new PlayBackListener(){ +PlayBackListener listener = new PlayBackListener(){ public void onTxnLoaded(TxnHeader hdr,Record txn){ Request r = new Request(0, hdr.getCxid(),hdr.getType(), hdr, txn, hdr.getZxid()); addCommittedProposal(r); } }; -long zxid = snapLog.restore(dataTree,sessionsWithTimeouts,listener); +long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, listener); +initialized = true; +return zxid; +} + +/** + * Fast forward the database adding transactions from the committed log into memory. + * @return the last valid zxid. + * @throws IOException + */ +public long fastForwardDataBase() throws IOException { +PlayBackListener listener = new PlayBackListener(){ --- End diff -- Will do
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362423#comment-16362423 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user anmolnar commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/453#discussion_r167884587 --- Diff: src/java/main/org/apache/zookeeper/server/ZKDatabase.java --- @@ -233,14 +233,32 @@ public long getDataTreeLastProcessedZxid() { * @throws IOException */ public long loadDataBase() throws IOException { -PlayBackListener listener=new PlayBackListener(){ +PlayBackListener listener = new PlayBackListener(){ public void onTxnLoaded(TxnHeader hdr,Record txn){ Request r = new Request(0, hdr.getCxid(),hdr.getType(), hdr, txn, hdr.getZxid()); addCommittedProposal(r); } }; -long zxid = snapLog.restore(dataTree,sessionsWithTimeouts,listener); +long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, listener); +initialized = true; +return zxid; +} + +/** + * Fast forward the database adding transactions from the committed log into memory. + * @return the last valid zxid. + * @throws IOException + */ +public long fastForwardDataBase() throws IOException { +PlayBackListener listener = new PlayBackListener(){ --- End diff -- I think it'd be nice to extract the common logic of these two methods into a separate one.
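For reference, one shape the suggested extraction could take: both methods in the diff build the same anonymous PlayBackListener, so a single factory method can own that logic. The self-contained sketch below models the pattern with stand-in types; aside from the names loadDataBase/fastForwardDataBase taken from the diff, everything here (ToyZkDatabase, replay(), the listener signature) is illustrative, not the real ZKDatabase/FileTxnSnapLog API.

```java
import java.util.ArrayList;
import java.util.List;

// Models extracting the duplicated anonymous PlayBackListener into one
// factory method shared by loadDataBase() and fastForwardDataBase().
class ToyZkDatabase {
    interface PlayBackListener { void onTxnLoaded(long zxid); }

    final List<Long> committedProposals = new ArrayList<>();
    final List<Long> txnLog = new ArrayList<>();
    boolean initialized = false;

    // The shared logic both public methods previously duplicated inline.
    private PlayBackListener commitProposalListener() {
        return zxid -> committedProposals.add(zxid);
    }

    long loadDataBase() {
        long zxid = replay(commitProposalListener());
        initialized = true; // only the full load flips this flag
        return zxid;
    }

    long fastForwardDataBase() {
        return replay(commitProposalListener());
    }

    // Stands in for snapLog.restore(...)/snapLog.fastForwardFromEdits(...).
    private long replay(PlayBackListener listener) {
        long last = 0;
        for (long zxid : txnLog) {
            listener.onTxnLoaded(zxid);
            last = zxid;
        }
        return last;
    }
}
```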
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362412#comment-16362412 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/453 @revans2 Your latest change looks good to me and is a bit safer than the previous one. Would you please consider adding some unit tests to validate the functionality? What do you think of porting the testTxnAheadSnapInRetainDB() test from your codebase? Maybe I can help make it less flaky, if you think it correctly verifies the original issue.
Txn T1 was synced to disk in C, but just before the proposal reaching out > to the followers, A and B restarted, so the T1 didn't exist in A and B > 3. A and B formed a new quorum after restart, let's say B is the leader > 4. C changed to looking state due to no enough followers, it will sync with > leader B with last Zxid T0, which will have an empty diff sync > 5. Before C take snapshot it restarted, it replayed the txns on disk which > includes T1, now it will have Node N, but A and B doesn't have it. > Also I included the a test case to reproduce this issue consistently. > We have a totally different RetainDB version which will avoid this issue by > doing consensus between snapshot and txn files before leader election, will > submit for review. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362234#comment-16362234 ] Robert Joseph Evans commented on ZOOKEEPER-2845: [~lvfangmin], Thanks for pushing on this. I had missed an error case in the follower. I have updated the patch to hopefully fix all of the issues, but please have a look at it.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362233#comment-16362233 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue: https://github.com/apache/zookeeper/pull/453 Thank you to everyone who reviewed the patch, but with the help of Fangmin Lv I found one case that the original patch didn't cover. I have reworked the patch to cover that case, but to do so I had to take a completely different approach. I think this is a better approach because it reuses a lot of the code that was originally run to load the database from disk. So now, instead of reloading the entire database from disk, we apply all of the uncommitted transactions in the log to the in-memory database. This should put it in exactly the same state as if we had cleared the data and reloaded it from disk, but with much less overhead.
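A rough sketch of the fast-forward idea described in the comment above (hypothetical names such as fastForward and applyTxn, not the actual patch): rather than clearing the database and re-reading snapshot + log, apply only the log entries past the last zxid the in-memory database has already processed.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.BiConsumer;

public class FastForwardSketch {
    // zxid -> opaque txn payload, standing in for the on-disk txn log.
    // Apply only entries strictly after what the in-memory DB has seen;
    // the end state matches a full reload at a fraction of the cost.
    static long fastForward(NavigableMap<Long, String> txnLog,
                            long lastProcessedZxid,
                            BiConsumer<Long, String> applyTxn) {
        long newLast = lastProcessedZxid;
        for (Map.Entry<Long, String> e
                 : txnLog.tailMap(lastProcessedZxid, false).entrySet()) {
            applyTxn.accept(e.getKey(), e.getValue());
            newLast = e.getKey();
        }
        return newLast;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> log = new TreeMap<>();
        log.put(1L, "create /a");
        log.put(2L, "create /b");
        log.put(3L, "create /c");
        StringBuilder applied = new StringBuilder();
        // In-memory DB has processed zxid 1; replay only zxids 2 and 3.
        long last = fastForward(log, 1L, (zxid, txn) -> applied.append(txn).append(';'));
        System.out.println(last);
        System.out.println(applied);
    }
}
```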
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362223#comment-16362223 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user anmolnar commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/453#discussion_r167838309 --- Diff: src/java/test/org/apache/zookeeper/server/quorum/LearnerHandlerTest.java --- @@ -462,6 +469,8 @@ public void testNewEpochZxid() throws Exception { // Peer has zxid of epoch 1 peerZxid = getZxid(1, 0); +//We are on a different epoch so we don't know 1, 0 is in our log or not. +// So we need to do a full SNAP --- End diff -- I think this comment has been added by mistake. You added (1,0) to the log above, hence syncFollower() returns false, which means we don't need to do a full SNAP.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1636#comment-1636 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user anmolnar commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/453#discussion_r167838605 --- Diff: src/java/test/org/apache/zookeeper/server/quorum/LearnerHandlerTest.java --- @@ -498,31 +507,20 @@ public void testNewEpochZxidWithTxnlogOnly() throws Exception { // Peer has zxid of epoch 3 peerZxid = getZxid(3, 0); -assertFalse(learnerHandler.syncFollower(peerZxid, db, leader)); -// We send DIFF to (6,0) and forward any packet starting at (4,1) -assertOpType(Leader.DIFF, getZxid(6, 0), getZxid(4, 1)); -// DIFF + 1 proposals + 1 commit -assertEquals(3, learnerHandler.getQueuedPackets().size()); -queuedPacketMatches(new long[] { getZxid(4, 1)}); +//There is no 3, 0 proposal in the committed log so sync +assertTrue(learnerHandler.syncFollower(peerZxid, db, leader)); --- End diff -- It seems to me that this test is checking the same thing 3 times in a row. Do you think it's necessary to do so?
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362164#comment-16362164 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user mfenes commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/453#discussion_r167835407 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java --- @@ -758,6 +760,11 @@ public boolean syncFollower(long peerLastZxid, ZKDatabase db, Leader leader) { currentZxid = maxCommittedLog; needOpPacket = false; needSnap = false; +} else if (peerLastEpoch != lastProcessedEpoch && !db.isInCommittedLog(peerLastZxid)) { --- End diff -- Could you please add a description to the comments above (to "Here are the cases that we want to handle") of what this else-if case is doing?
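The else-if branch under discussion can be illustrated with a simplified decision function (hypothetical names; the epoch is taken as the high 32 bits of the zxid, as in ZooKeeper; this is a sketch, not the exact LearnerHandler code): when the peer is on a different epoch and its last zxid is not in the committed log, only a full SNAP is known to be safe.

```java
import java.util.Set;

public class SyncDecisionSketch {
    // In ZooKeeper a zxid packs the epoch into the high 32 bits.
    static long epochOf(long zxid) { return zxid >>> 32; }

    // Same epoch, or a zxid we can locate in the committed log, lets the
    // leader compute a precise DIFF or TRUNC; otherwise fall back to SNAP
    // because we cannot tell what happened in the intervening epochs.
    static boolean needSnap(long peerLastZxid, long leaderLastZxid,
                            Set<Long> committedLog) {
        boolean sameEpoch = epochOf(peerLastZxid) == epochOf(leaderLastZxid);
        return !sameEpoch && !committedLog.contains(peerLastZxid);
    }

    public static void main(String[] args) {
        Set<Long> committed = Set.of(0x100000001L, 0x100000002L);
        // Peer on an older epoch with a zxid unknown to us -> full SNAP
        System.out.println(needSnap(0x000000005L, 0x100000002L, committed));
        // Peer zxid present in our committed log -> no SNAP needed
        System.out.println(needSnap(0x100000001L, 0x100000002L, committed));
    }
}
```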
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361234#comment-16361234 ] Robert Joseph Evans commented on ZOOKEEPER-2845: [~lvfangmin], You are right, I did miss the ID changing on the reload as part of my tests. I will spend some more time debugging. My patch does fix the test case that was uploaded, but I want to be sure I understand the issue well enough to see which situations might not be fixed by it.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360880#comment-16360880 ] Robert Joseph Evans commented on ZOOKEEPER-2845: [~lvfangmin], I will spend some more time debugging it because I could have made a mistake, but that is not what I saw from the unit test you provided. When I logged the zxid used for leader election both before and after clearing the DB, it didn't change. But like I said, I could have missed something, and I am not a regular contributor, so I will go back and try it again.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360557#comment-16360557 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user mfenes commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/453#discussion_r167513290 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java --- @@ -758,6 +760,11 @@ public boolean syncFollower(long peerLastZxid, ZKDatabase db, Leader leader) { currentZxid = maxCommittedLog; needOpPacket = false; needSnap = false; +} else if (peerLastEpoch != lastProcessedEpoch && !db.isInCommittedLog(peerLastZxid)) { +//Be sure we do a snap, because if the epochs are not the same we don't know what +// could have happened in between and it may take a TRUNC + UPDATES to get them in SYNC +LOG.debug("Will send SNAP to peer sid: {} epochs are too our of sync local 0x{} remote 0x{}", --- End diff -- I think there is a typo here: "our of sync"
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359421#comment-16359421 ] Fangmin Lv commented on ZOOKEEPER-2845: --- [~revans2] Cleaning and reloading the DB makes the server use the correct zxid to vote or to sync with the new leader. If it is elected as the new leader, the whole ensemble will end up with this extra txn; otherwise, the new leader will send a TRUNC or SNAP to this server, which means the txn will be discarded. With RetainDB, the server ignores the fact that it actually has the txn flushed to disk, and there is a race condition: if the DB is later reloaded from disk, it may include this txn.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358899#comment-16358899 ] Robert Joseph Evans commented on ZOOKEEPER-2845: [~lvfangmin], So how does clearing the DB prevent it from re-applying the transactions in the transaction log?
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358804#comment-16358804 ] Fangmin Lv commented on ZOOKEEPER-2845: --- [~revans2] A txn is only applied to the DB once it is quorum committed. The problem here is not a lost txn but an extra txn that was never quorum committed, which is what the Jira description shows.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358627#comment-16358627 ] Robert Joseph Evans commented on ZOOKEEPER-2845: [~lvfangmin], Perhaps I don't understand the issue well enough, which is entirely possible: I am not a frequent contributor, and the path through all of the request processors is fairly complex. My understanding is that the SyncRequestProcessor handles writing out edits to the edit log and snapshots (there are a few other places where this happens at startup, though). The SyncRequestProcessor writes out edits as they arrive, flushes them to disk periodically in batches, and also takes snapshots periodically. The in-memory portion appears to be updated by the FinalRequestProcessor prior to a quorum of acks being received. So yes, there is the possibility that something is written to the transaction log that is not applied to memory. This means that when ZKDatabase.clear() is called it should actually fast-forward the in-memory state to match the edit log + snapshot. So you are saying that 1) proposals come in and are written to the transaction log, but the in-memory database is not updated yet; 2) the server does a soft restart for some reason and some transactions appear to be lost (because the in-memory DB was not fast-forwarded); 3) more transactions come in (possibly conflicting with the first set); 4) before a snapshot can happen, the leader or follower restarts and has to reconstruct the in-memory DB from edits + snapshot, which reapplies the edits that originally appeared to be lost. This does look like it might happen, so I will look into that as well. But the test in [https://github.com/apache/zookeeper/pull/310] didn't appear to trigger this. I could be wrong, because I concentrated most of my debugging on the original leader and what was happening with it, rather than on the followers.
I also didn't understand how clearing the leader's in-memory database caused an edit to be lost, if the edits are written out to disk before the in-memory DB is updated. What I saw was: 1) a bunch of edits and leader/follower restarts that didn't really do much of anything; 2) the original leader lost its connection to the followers; 3a) a transaction was written to the leader's in-memory DB but didn't get a quorum of acks; 3b) the followers restarted and formed a new quorum; 4) the original leader timed out and joined the new quorum; 5) as part of the sync when the old leader joined the new quorum, it got a DIFF (not a SNAP), but it had an edit that was not part of the new leader's history, so it diverged from the others. I could see this second part happening even without my change, so I don't really understand how clearing the database would prevent it. My thinking was that it was a race condition where the edits in the edit log were not yet flushed, so when we cleared the DB they were lost. But I didn't confirm this.
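The four-step sequence above can be sketched as a toy model (invented names, not ZooKeeper's actual request processors): an edit that reaches the log but not memory looks lost after a restart that skips fast-forwarding, then reappears when a full replay rebuilds the DB from disk.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the four-step race above (invented names): an edit that
// reaches the log but not memory appears lost after a soft restart, then
// is reapplied when the DB is rebuilt from snapshot + edit log.
public class SoftRestartRace {
    final List<String> editLog = new ArrayList<>();
    final List<String> memoryDb = new ArrayList<>();

    void logOnly(String txn) { editLog.add(txn); }   // 1) logged, not yet applied
    void softRestart() { memoryDb.clear(); }         // 2) no fast-forward: txn "lost"
    void commit(String txn) {                        // 3) more txns arrive
        editLog.add(txn);
        memoryDb.add(txn);
    }
    void rebuildFromDisk() {                         // 4) full restart
        memoryDb.clear();
        memoryDb.addAll(editLog);                    // reapplies the "lost" edit
    }

    public static void main(String[] args) {
        SoftRestartRace s = new SoftRestartRace();
        s.logOnly("T1");
        s.softRestart();
        s.commit("T2");
        System.out.println(s.memoryDb); // [T2]: T1 appears lost
        s.rebuildFromDisk();
        System.out.println(s.memoryDb); // [T1, T2]: T1 is back
    }
}
```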
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358060#comment-16358060 ] Fangmin Lv commented on ZOOKEEPER-2845: --- [~revans2] Thanks for jumping in and working on this issue. The consistency issue mentioned here is not caused by the syncing protocol, but by the fact that when the database is retained there may be uncommitted txns in the txn file that are not in the ZKDatabase. If I understand your proposal and diff correctly, you're trying to solve the issue by checking the epoch while syncing with the leader, but that doesn't address the uncommitted txn left in the txn file: when the txns are replayed, that txn could be loaded and cause inconsistency.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347901#comment-16347901 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- GitHub user revans2 opened a pull request: https://github.com/apache/zookeeper/pull/455 ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified. This is the version of #453 for the 3.4 branch. You can merge this pull request into a Git repository by running: $ git pull https://github.com/revans2/zookeeper ZOOKEEPER-2845-3.4 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/zookeeper/pull/455.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #455 commit b035df19616424036afb1f31f345dedf26e3b2ae Author: Robert Evans Date: 2018-02-01T02:09:53Z ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347678#comment-16347678 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- GitHub user revans2 opened a pull request: https://github.com/apache/zookeeper/pull/454 ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified. (3.5) This is the version of #453 for the 3.5 branch. You can merge this pull request into a Git repository by running: $ git pull https://github.com/revans2/zookeeper ZOOKEEPER-2845-3.5 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/zookeeper/pull/454.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #454 commit 70436249c830af0b129caf3d1bed2f55a2498b6b Author: Robert Evans Date: 2018-01-29T20:27:10Z ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347557#comment-16347557 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- GitHub user revans2 opened a pull request: https://github.com/apache/zookeeper/pull/453 ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified. I will be creating a patch/pull request for 3.4 and 3.5 too, but I wanted to get a pull request up for others to look at ASAP. I have a version of this based off of #310 at https://github.com/revans2/zookeeper/tree/ZOOKEEPER-2845-orig-test-patch but the test itself is flaky. Frequently leader election does not go as planned in the test, and it ends up failing, but not because it ended up in an inconsistent state. I am happy to answer any questions anyone has about the patch. You can merge this pull request into a Git repository by running: $ git pull https://github.com/revans2/zookeeper ZOOKEEPER-2845-master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/zookeeper/pull/453.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #453 commit 0219b2c9e44527067cd5fed4b642729171721886 Author: Robert Evans Date: 2018-01-29T20:27:10Z ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347026#comment-16347026 ] Robert Joseph Evans commented on ZOOKEEPER-2845: I have a fix that I will be posting shortly. I need to clean up the patch and make sure that I get pull requests ready for all of the branches that ZOOKEEPER-2926 went into. The following table describes the situation that allows a node to get into an inconsistent state (zxids shown as epoch and counter):
|| ||N1||N2||N3||
|Start with cluster in sync, N1 is leader|0x0 0x5|0x0 0x5|0x0 0x5|
|N2 and N3 go down|0x0 0x5| | |
|Proposal to N1 (fails with no quorum)|0x0 0x6| | |
|N2 and N3 return, but N1 is restarting. N2 elected leader| |0x1 0x0|0x1 0x0|
|A proposal is accepted| |0x1 0x1|0x1 0x1|
|N1 returns and tries to sync with the new leader N2|0x0 0x6|0x1 0x1|0x1 0x1|
At this point the code in {{LearnerHandler.syncFollower}} takes over to bring N1 into sync with the new leader N2. That code checks the following, in order:
# Is there a {{forceSync}}? Not in this case.
# Are the two zxids already in sync? No, {{0x0 0x6 != 0x1 0x1}}.
# Is the peer zxid > the local zxid (and the peer didn't just rotate to a new epoch)? No, {{0x0 0x6 < 0x1 0x1}}.
# Is the peer zxid between the max committed log and the min committed log? In this case yes, but it shouldn't be. The max committed log is {{0x1 0x1}}. The min committed log is {{0x0 0x5}}, or likely something below it, because it is based on distance in the edit log.
The issue is that once the epoch changes, {{0x0}} to {{0x1}}, the leader has no idea whether the peer's edits are in its own edit log without explicitly checking for them. The reason ZOOKEEPER-2926 exposed this is that previously, when a leader was elected, the in-memory DB was dropped and everything was reread from disk. When this happened the {{0x0 0x6}} proposal was lost. But it is not guaranteed to be lost in all cases.
In theory a snapshot could be taken, triggered by that proposal, either on the leader or on a follower that also applied the proposal but does not join the new quorum in time. As such, ZOOKEEPER-2926 really just extended the window of an already existing race, but it extended it almost indefinitely, so the race is much more likely to happen. My fix is to update {{LearnerHandler.syncFollower}} to only send a {{DIFF}} if the epochs are the same. If they are not the same, we don't know whether the peer has transactions that we don't know about.
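The decision order above can be sketched roughly as follows. This is a simplified model, not the actual {{LearnerHandler}} code; the method and class names are invented. It relies on the fact that a zxid packs the epoch in its high 32 bits and a per-epoch counter in the low 32 bits, and it encodes the proposed fix: never serve a DIFF across an epoch boundary.

```java
// Simplified sketch of the sync decision described above; not the real
// LearnerHandler.syncFollower code. A zxid packs the epoch in the high
// 32 bits and a per-epoch counter in the low 32 bits.
public class SyncDecision {
    static long zxid(long epoch, long counter) { return (epoch << 32) | counter; }
    static long epochOf(long zxid) { return zxid >>> 32; }

    enum Sync { DIFF, TRUNC, SNAP }

    // minCommittedLog/maxCommittedLog bound the leader's in-memory commit log.
    static Sync choose(long peerZxid, long leaderZxid,
                       long minCommittedLog, long maxCommittedLog) {
        if (peerZxid == leaderZxid) {
            return Sync.DIFF;                 // already in sync: empty diff
        }
        // The fix described above: across an epoch boundary the commit log
        // cannot prove the peer's old-epoch txns were quorum committed, so
        // fall back to a full snapshot.
        if (epochOf(peerZxid) != epochOf(maxCommittedLog)) {
            return Sync.SNAP;
        }
        if (peerZxid > maxCommittedLog) {
            return Sync.TRUNC;                // peer is ahead: truncate it
        }
        if (peerZxid >= minCommittedLog) {
            return Sync.DIFF;                 // replay committed proposals
        }
        return Sync.SNAP;                     // too far behind: full snapshot
    }

    public static void main(String[] args) {
        // N1 from the table: last zxid 0x0 0x6; new leader N2 is at 0x1 0x1.
        Sync s = choose(zxid(0, 6), zxid(1, 1), zxid(0, 5), zxid(1, 1));
        System.out.println(s); // SNAP: the epoch changed, so a DIFF is unsafe
    }
}
```

Without the epoch check, N1's zxid {{0x0 0x6}} falls inside the {{[0x0 0x5, 0x1 0x1]}} window and the leader would wrongly send a DIFF, which is exactly the inconsistency in the table.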
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344024#comment-16344024 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue: https://github.com/apache/zookeeper/pull/310 Apparently, for some reason I don't understand, if I don't run all of the tests in QuorumPeerMainTest the old leader is elected again each time.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343979#comment-16343979 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue: https://github.com/apache/zookeeper/pull/310 @lvfangmin I am trying to reproduce the issue you have seen here, and I have not been able to do so. The test either fails for me with the same leader being elected each time, or, on newer versions, with the leader's client staying connected instead of the test waiting for it to quit until a timeout, and I am not sure that ever happens. How frequently does this test pass for you?
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1628#comment-1628 ] Fangmin Lv commented on ZOOKEEPER-2845: --- [~davelatham] I meant the broken "retainDB" commit in ZOOKEEPER-2678; we should revert it until we have a sound solution.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284144#comment-16284144 ] Dave Latham commented on ZOOKEEPER-2845: Thanks, [~lvfangmin]. The broken "retainDB" commit is ZOOKEEPER-2845, right? You're suggesting that it be reverted?
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284068#comment-16284068 ] Fangmin Lv commented on ZOOKEEPER-2845: --- Can someone help add my teammate jtuple as a contributor, so I can assign the task to him?
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284066#comment-16284066 ] Fangmin Lv commented on ZOOKEEPER-2845: --- [~davelatham] our internal patch is based on the 3.6 branch, and we found it amplified the issue reported in ZOOKEEPER-2926; in production we needed to disable the local session feature to mitigate it. Also, we haven't ported and tested the diff on 3.4 yet, so we're not confident enough to get it out. Instead, I would suggest reverting the existing broken retainDB commit to unblock the next release. I have made a patch for ZOOKEEPER-2926 and will update it there, and I'll assign this Jira to my teammate Joseph to follow up; he is the owner of our internal retainDB feature.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283999#comment-16283999 ] Dave Latham commented on ZOOKEEPER-2845: Any updates here? We were considering upgrading our zookeeper, but don't want to go to a release with a known data inconsistency problem.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154246#comment-16154246 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user lvfangmin commented on the issue: https://github.com/apache/zookeeper/pull/310 @revans2 my teammate has been working on the fix, and he was planning to run it in prod for a while before sending out the diff. I'll sync with him today about the status.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153628#comment-16153628 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user revans2 commented on the issue: https://github.com/apache/zookeeper/pull/310 @lvfangmin any update on getting a pull request for the actual fix?
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136242#comment-16136242 ] Michael Han commented on ZOOKEEPER-2845: Thanks for the update, [~lvfangmin]. Good to know the patch is tested in a prod environment!
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16135559#comment-16135559 ] Fangmin Lv commented on ZOOKEEPER-2845: --- The internal patch has stabilized and has been tested for a long time; we rolled it out to one of our production environments last week. Joseph from our team will attach the patch here for review this week.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125267#comment-16125267 ] Fangmin Lv commented on ZOOKEEPER-2845: --- [~hanm] we've just finished RetainDB and started testing it in our internal ensemble; we might submit the code for review next week.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124813#comment-16124813 ] Michael Han commented on ZOOKEEPER-2845: [~lvfangmin] Any plan to submit your retain db implementation? This is an important bug to fix.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088464#comment-16088464 ] Michael Han commented on ZOOKEEPER-2845: Makes sense to me. I think previously we didn't have this issue because the {{zkDb}} was cleared across leader elections: if we restarted C, it would recover from both the snapshot and the txn log, so it would find that its {{lastProcessedZxid}} is T1 rather than T0, which would yield a TRUNC instead of a DIFF from leader B.
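The decision Michael describes can be sketched as a tiny function. This is a hedged simplification, not the real LearnerHandler logic (which also considers the committed-log window and SNAP syncs); the class and method names here are invented.

```java
// Hedged sketch of the leader's sync-type choice -- simplified; the real
// LearnerHandler also considers the committed-log window and SNAP syncs.
public class SyncDecision {
    enum SyncType { DIFF, TRUNC }

    static SyncType decide(long leaderLastZxid, long peerLastZxid) {
        if (peerLastZxid > leaderLastZxid) {
            return SyncType.TRUNC;   // peer is ahead: tell it to drop the extra txns
        }
        return SyncType.DIFF;        // equal -> empty DIFF; behind -> send missing txns
    }

    public static void main(String[] args) {
        long T0 = 0, T1 = 1;
        // With the retained database, C reports its applied zxid T0: empty DIFF.
        System.out.println("C reports T0: " + decide(T0, T0));
        // Had C restarted and replayed its log first, it would report T1: TRUNC.
        System.out.println("C reports T1: " + decide(T0, T1));
    }
}
```

The whole bug hinges on which zxid C reports: T0 yields the harmless-looking empty DIFF, while T1 would have triggered the TRUNC that discards the orphaned transaction.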
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088306#comment-16088306 ] Fangmin Lv commented on ZOOKEEPER-2845: --- [~hanm] T1 only exists in the txn file and hasn't been applied to the data tree yet, so the lastProcessedZxid on follower C is T0 and there is no TRUNC message when syncing with the leader.
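The window Fangmin refers to comes from the two-stage write path: a transaction is fsynced to the log before it is applied to the in-memory tree. The sketch below is illustrative only (the class name and method bodies are invented); it mimics how a stop between the two stages leaves the log tail ahead of lastProcessedZxid.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative two-stage pipeline (invented names): a txn is fsynced by a
// sync stage before a final stage applies it to the in-memory tree, so the
// log tail can be ahead of lastProcessedZxid if the process stops in between.
public class PipelineWindow {
    final Deque<Long> txnLog = new ArrayDeque<>();
    long lastProcessedZxid = 0;

    void logTxn(long zxid)   { txnLog.add(zxid); }          // like SyncRequestProcessor: fsync to disk
    void applyTxn(long zxid) { lastProcessedZxid = zxid; }  // like FinalRequestProcessor: data tree

    String run() {
        logTxn(1);   // T1 reaches disk...
        // ...crash here: applyTxn(1) never runs, so the tree still says T0
        return "log tail=" + txnLog.peekLast() + " lastProcessedZxid=" + lastProcessedZxid;
    }

    public static void main(String[] args) {
        System.out.println(new PipelineWindow().run());
    }
}
```

Because C reports lastProcessedZxid (T0) during sync, the leader sees a fully caught-up follower and never learns about the orphaned T1 sitting in the log.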
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088299#comment-16088299 ] Michael Han commented on ZOOKEEPER-2845: Thanks for reporting this issue [~lvfangmin]. bq. C changed to looking state due to no enough followers, it will sync with leader B with last Zxid T0, which will have an empty diff sync Are you saying leader B is sending a DIFF to follower C in this case? Since B does not have T1, I think it should send a TRUNC and C should drop T1 from its txn log.
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088296#comment-16088296 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- Github user lvfangmin commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/310#discussion_r127567852
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -784,4 +784,126 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
+
+    @Test
+    public void testTxnAheadSnapInRetainDB() throws Exception {
+        // 1. start up servers and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        final int clientPorts[] = new int[SERVER_COUNT];
+        StringBuilder sb = new StringBuilder();
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            clientPorts[i] = PortAssignment.unique();
+            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique()
+                    + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
+        }
+        String quorumCfgSection = sb.toString();
+
+        MainThread mt[] = new MainThread[SERVER_COUNT];
+        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
+            mt[i].start();
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // we need to shut down and start back up to make sure that the create
+        // session isn't the first transaction, since that is rather innocuous
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].shutdown();
+        }
+
+        waitForAll(zk, States.CONNECTING);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].start();
+            // Recreate a client session since the previous session was not persisted.
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = -1;
+        Map<Long, Proposal> outstanding = null;
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (mt[i].main.quorumPeer.leader != null) {
+                leader = i;
+                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
+                // increase the tick time to delay the leader going to looking
+                mt[leader].main.quorumPeer.tickTime = 1;
+            }
+        }
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to become the leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+                waitForOne(zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the leader and make sure it's synced to
+        //    disk, which means it was acked by itself
+        try {
+            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {}
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertTrue(outstanding.size() == 1);
+        Proposal p = (Proposal) outstanding.values().iterator().next();
+        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
+
+        // make sure it has a chance to write it to disk
+        Thread.sleep(1000);
+        p.qvAcksetPairs.get(0).getAckset().contains(leader);
+
+        // 6. wait for the leader to quit due to not enough followers
+        waitForOne(zk[leader], States.CONNECTING);
+
+        int newLeader = -1;
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088293#comment-16088293 ] Hadoop QA commented on ZOOKEEPER-2845: --

-1 overall. GitHub Pull Request Build

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//console

This message is automatically generated.

> Data inconsistency issue due to retain database in leader election
> --
>
> Key: ZOOKEEPER-2845
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.4.10, 3.5.3
> Reporter: Fangmin Lv
> Assignee: Fangmin Lv
> Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time
> during leader election. In a ZooKeeper ensemble, it's possible that the
> snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or the
> txn file is ahead of the snapshot because no commit message has been received yet.
> If the snapshot is ahead of the txn file, since the SyncRequestProcessor queue will
> be drained during shutdown, the snapshot and txn file will stay consistent
> before leader election happens, so this is not an issue.
> But if the txn file is ahead of the snapshot, it's possible that the ensemble will have
> a data inconsistency issue; here is a simplified scenario showing the issue:
> Let's say we have 3 servers in the ensemble, servers A and B are followers,
> and C is the leader, and all the snapshots and txns are up to T0:
> 1. A new request reaches leader C to create Node N, and it's converted to
> txn T1
> 2. Txn T1 was synced to disk on C, but just before the proposal reached
> the followers, A and B restarted, so T1 didn't exist on A and B
> 3. A and B formed a new quorum after restart; let's say B is the leader
> 4. C changed to looking state due to not enough followers; it will sync with
> leader B with last Zxid T0, which will be an empty diff sync
> 5. Before C takes a snapshot it restarts; it replays the txns on disk, which
> include T1, so now it has Node N, but A and B don't have it.
> I also included a test case to reproduce this issue consistently.
> We have a totally different RetainDB version which avoids this issue by
> doing consensus between the snapshot and txn files before leader election; will
> submit it for review.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
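The five-step scenario quoted above can be condensed into a toy simulation. This is plain Java, not ZooKeeper code; the `Replica` class, its fields, and `replayLogOnRestart` are hypothetical stand-ins for a server's in-memory DataTree and on-disk txn log. It shows why a server that fsyncs a txn, then receives an empty DIFF sync, and then replays its retained log on restart ends up with data the new quorum never committed.

```java
import java.util.*;

// Toy simulation (NOT ZooKeeper code) of the divergence scenario above.
public class RetainDbDivergence {

    static class Replica {
        long lastZxid = -1;                        // highest zxid applied in memory
        List<Long> dataTree = new ArrayList<>();   // committed txns visible to clients
        List<Long> txnLog = new ArrayList<>();     // txns fsynced to the local log

        // Startup path: replay every on-disk txn beyond the in-memory state.
        void replayLogOnRestart() {
            for (long zxid : txnLog) {
                if (zxid > lastZxid) {
                    dataTree.add(zxid);
                    lastZxid = zxid;
                }
            }
        }
    }

    static boolean diverged() {
        Replica a = new Replica(), b = new Replica(), c = new Replica();
        // All three servers agree up to T0 (modeled as zxid 0).
        for (Replica r : Arrays.asList(a, b, c)) {
            r.txnLog.add(0L);
            r.replayLogOnRestart();
        }

        // Steps 1-2: leader C fsyncs T1 (zxid 1), but A and B restart
        // before the proposal ever reaches them.
        c.txnLog.add(1L);

        // Steps 3-4: A and B form a new quorum at T0. C rejoins and gets an
        // *empty* DIFF sync, so its in-memory tree still ends at T0 -- yet T1
        // is still sitting in its on-disk log.
        // Step 5: C restarts before taking a snapshot and replays its log,
        // resurrecting the never-committed T1.
        c.replayLogOnRestart();

        // C now has T1 while the rest of the quorum does not.
        return c.dataTree.contains(1L) && !a.dataTree.contains(1L);
    }

    public static void main(String[] args) {
        System.out.println("diverged: " + diverged());  // prints "diverged: true"
    }
}
```

The test in the pull request forces exactly this state on a live ensemble; the sketch only isolates the core hazard, namely replaying the retained txn log past the point established by an empty diff sync.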
[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088277#comment-16088277 ] ASF GitHub Bot commented on ZOOKEEPER-2845: --- GitHub user lvfangmin opened a pull request:

    https://github.com/apache/zookeeper/pull/310

[ZOOKEEPER-2845][Test] Test used to reproduce the data inconsistency issue due to retain database in leader election

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lvfangmin/zookeeper ZOOKEEPER-2845-TEST

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/310.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #310

commit ff0bc49de51635da1d5bff0e4f260a61acc87db0
Author: Fangmin Lyu
Date: 2017-07-14T23:02:20Z

    reproduce the data inconsistency issue
--
This message was sent by Atlassian JIRA (v6.4.14#64029)