[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-09-13 Thread Fangmin Lv (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614183#comment-16614183
 ] 

Fangmin Lv commented on ZOOKEEPER-2845:
---

[~revans2] sorry for getting back to this so late; I was on parental leave and 
totally missed this thread (my girl was born on Jan 25, so I was busy dealing 
with the new challenges there :) )

I'm revisiting my open PR today and came across this one.

I checked your fix; it looks nice and simple!

There was one thing I thought might be a problem, but it's no longer an issue 
after the ZOOKEEPER-2678 change you made last time. My concern was that 
[ZooKeeperServer.processTxn(TxnHeader, 
Record)](https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java#L1213)
 doesn't add the txn to the commit log in ZKDatabase, which would leave a hole 
in the commit logs if we applied txns directly to the DataTree during DIFF 
sync, and that in turn could cause data inconsistency if this server later 
became leader. But with ZOOKEEPER-2678 we're not doing that anymore, so it's fine.

Our internal patch is a bit heavier and more complex; we may switch to this 
simpler solution as well. Thanks again for moving this forward! 
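To make the concern concrete, here is a minimal toy model (hypothetical class and field names, not ZooKeeper's real ZKDatabase API): if a txn is applied to the in-memory tree without also being appended to the committed log, the log ends up with a gap, and a DIFF sync served from that log would silently skip the missing txn.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy model of the "hole in the commit log" concern; names are invented.
public class CommitLogDemo {
    // in-memory data tree: path -> data
    static final Map<String, String> dataTree = new HashMap<>();
    // committed zxids, conceptually what a leader uses to build a DIFF sync
    static final Deque<Integer> committedLog = new ArrayDeque<>();

    // Applying a txn should do BOTH: mutate the tree and record the zxid.
    static void applyTxn(int zxid, String path, String data, boolean addToCommitLog) {
        dataTree.put(path, data);
        if (addToCommitLog) {
            committedLog.addLast(zxid);
        }
    }

    public static void main(String[] args) {
        applyTxn(1, "/a", "a-data", true);
        applyTxn(2, "/b", "b-data", false); // the "hole": tree changed, log did not
        applyTxn(3, "/c", "c-data", true);
        // A DIFF built from committedLog would skip zxid 2 entirely.
        System.out.println("committedLog = " + committedLog); // committedLog = [1, 3]
    }
}
```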

> Data inconsistency issue due to retain database in leader election
> --
>
> Key: ZOOKEEPER-2845
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Assignee: Robert Joseph Evans
>Priority: Critical
> Fix For: 3.5.4, 3.6.0, 3.4.12
>
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time 
> during leader election. In a ZooKeeper ensemble, it's possible that the 
> snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or 
> that the txn file is ahead of the snapshot because no commit message has been 
> received yet. If the snapshot is ahead of the txn file, then since the 
> SyncRequestProcessor queue is drained during shutdown, the snapshot and txn 
> file will stay consistent before leader election happens, so this is not an issue.
> But if the txn file is ahead of the snapshot, the ensemble can end up with a 
> data inconsistency issue. Here is a simplified scenario that shows it:
> Let's say we have 3 servers in the ensemble; servers A and B are followers, C 
> is the leader, and all snapshots and txns are up to T0:
> 1. A new request reaches leader C to create node N, and it is converted to 
> txn T1.
> 2. Txn T1 is synced to disk on C, but A and B restart just before the 
> proposal reaches the followers, so T1 does not exist on A and B.
> 3. A and B form a new quorum after the restart; let's say B is the leader.
> 4. C changes to LOOKING state because it no longer has enough followers; it 
> syncs with leader B with last zxid T0, which results in an empty DIFF sync.
> 5. C restarts before taking a snapshot and replays the txns on disk, which 
> include T1; now it has node N, but A and B don't.
> Also, I included a test case that reproduces this issue consistently. 
> We have a totally different RetainDB version that avoids this issue by 
> reconciling the snapshot and txn files before leader election; we will 
> submit it for review.
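The five steps above can be sketched as a toy simulation (hypothetical classes, not ZooKeeper's real code): C fsyncs T1 but never replicates it; the empty DIFF sync leaves everyone's in-memory state agreeing, but C's later restart replays the on-disk log and resurrects the un-replicated txn.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of the 5-step scenario; not ZooKeeper's real classes.
public class DivergenceDemo {
    static class Server {
        final List<String> diskTxnLog = new ArrayList<>(); // txns fsynced to disk
        final Set<String> dataTree = new HashSet<>();      // in-memory znodes

        // On restart the data tree is rebuilt by replaying every txn on disk.
        void restartAndReplay() {
            dataTree.clear();
            dataTree.addAll(diskTxnLog);
        }
    }

    public static void main(String[] args) {
        Server a = new Server(), b = new Server(), c = new Server();

        // Steps 1-2: leader C fsyncs txn T1 (create node N), but A and B
        // restart before the proposal reaches them.
        c.diskTxnLog.add("/N");

        // Steps 3-4: A and B form a new quorum without T1; C rejoins via an
        // empty DIFF sync because its last committed zxid is still T0, so
        // nothing changes in anyone's in-memory data tree here.

        // Step 5: C restarts before snapshotting and replays its disk log,
        // resurrecting the txn the rest of the quorum never saw.
        c.restartAndReplay();

        System.out.println("C has /N: " + c.dataTree.contains("/N")); // true
        System.out.println("B has /N: " + b.dataTree.contains("/N")); // false
    }
}
```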



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378693#comment-16378693
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/453
  
Thanks @afine I closed them.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378692#comment-16378692
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 closed the pull request at:

https://github.com/apache/zookeeper/pull/455




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378691#comment-16378691
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 closed the pull request at:

https://github.com/apache/zookeeper/pull/454




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375208#comment-16375208
 ] 

Hudson commented on ZOOKEEPER-2845:
---

SUCCESS: Integrated in Jenkins build ZooKeeper-trunk #3740 (See 
[https://builds.apache.org/job/ZooKeeper-trunk/3740/])
ZOOKEEPER-2845: Apply commit log when restarting server. (afine: rev 
722ba9409a44a35d287aac803813f508cff2420a)
* (edit) src/java/main/org/apache/zookeeper/server/ZKDatabase.java
* (edit) src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java
* (edit) src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
* (edit) src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375107#comment-16375107
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on the issue:

https://github.com/apache/zookeeper/pull/453
  
Thanks @revans2. I merged this and the PRs for 3.4 and 3.5.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375064#comment-16375064
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user asfgit closed the pull request at:

https://github.com/apache/zookeeper/pull/453




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371612#comment-16371612
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/453
  
@afine all of the changes in this branch are now in the pull requests for 
the 3.4 and 3.5 branches.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371611#comment-16371611
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/455
  
I just rebased this and pulled in all of the changes made to the main test.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371535#comment-16371535
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/454
  
I just rebased this and pulled in all of the changes made to the main test.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371517#comment-16371517
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/453
  
@afine 

I have addressed your most recent comments. If you want me to squash the 
commits, please let me know.

I have a pull request for the 3.5 branch (#454) and for the 3.4 branch (#455). 
I will spend some time porting the test to them and let you know when they 
are ready.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371509#comment-16371509
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r169662234
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +923,103 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
 
+    @Test
+    public void testFailedTxnAsPartOfQuorumLoss() throws Exception {
+        // 1. start up server and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        servers = LaunchServers(SERVER_COUNT);
+
+        waitForAll(servers, States.CONNECTED);
+
+        // we need to shutdown and start back up to make sure that the create
+        // session isn't the first transaction since that is rather innocuous.
+        servers.shutDownAllServers();
+        waitForAll(servers, States.CONNECTING);
+        servers.restartAllServersAndClients(this);
+        waitForAll(servers, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = servers.findLeader();
+        Map<Long, Proposal> outstanding = servers.mt[leader].main.quorumPeer.leader.outstandingProposals;
+        // increase the tick time to delay the leader going to looking
+        servers.mt[leader].main.quorumPeer.tickTime = 1;
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to be the new leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                servers.restartClient(i, this);
+                waitForOne(servers.zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the old leader and make sure it's synced
+        //    to disk, which means it acked from itself
+        try {
+            servers.zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertEquals(1, outstanding.size());
+        Proposal p = outstanding.values().iterator().next();
+        Assert.assertEquals(OpCode.create, p.request.getHdr().getType());
+
+        // make sure it has a chance to write it to disk
+        int sleepTime = 0;
+        Long longLeader = new Long(leader);
+        while (!p.qvAcksetPairs.get(0).getAckset().contains(longLeader)) {
+            if (sleepTime > 2000) {
+                Assert.fail("Transaction not synced to disk within 1 second " + p.qvAcksetPairs.get(0).getAckset()
+                        + " expected " + leader);
+            }
+            Thread.sleep(100);
+            sleepTime += 100;
+        }
+
+        // 6. wait for the leader to quit due to not enough followers and come
+        //    back up as a part of the new quorum
+        sleepTime = 0;
+        Follower f = servers.mt[leader].main.quorumPeer.follower;
+        while (f == null || !f.isRunning()) {
+            if (sleepTime > 10_000) {
+                Assert.fail("Took too long for old leader to time out " + servers.mt[leader].main.quorumPeer.getPeerState());
+            }
+            Thread.sleep(100);
+            sleepTime += 100;
+            f = servers.mt[leader].main.quorumPeer.follower;
+        }
+        servers.mt[leader].shutdown();
--- End diff --

There are a lot of very specific steps needed to make the data inconsistency show 
up. This one is needed to force a replay of the transaction log, which contains an 
edit that wasn't considered as part of leader election.


[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367963#comment-16367963
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168884569
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -465,6 +470,37 @@ private void waitForAll(ZooKeeper[] zks, States state) throws InterruptedException
     private static class Servers {
         MainThread mt[];
         ZooKeeper zk[];
+        int[] clientPorts;
+
+        public void shutDownAllServers() throws InterruptedException {
+            for (MainThread t : mt) {
+                t.shutdown();
+            }
+        }
+
+        public void restartAllServersAndClients(Watcher watcher) throws IOException {
+            for (MainThread t : mt) {
+                if (!t.isAlive()) {
+                    t.start();
+                }
+            }
+            for (int i = 0; i < zk.length; i++) {
+                restartClient(i, watcher);
+            }
+        }
+
+        public void restartClient(int i, Watcher watcher) throws IOException {
--- End diff --

annoying nitpick: let's use a better argument name than `i`


> Data inconsistency issue due to retain database in leader election
> --
>
> Key: ZOOKEEPER-2845
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Assignee: Robert Joseph Evans
>Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time 
> during leader election. In a ZooKeeper ensemble, it's possible that the 
> snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or that 
> the txn file is ahead of the snapshot because no commit message has been received yet. 
> If the snapshot is ahead of the txn file, since the SyncRequestProcessor queue will 
> be drained during shutdown, the snapshot and txn file will stay consistent 
> before leader election happens, so this is not an issue.
> But if the txn file is ahead of the snapshot, it's possible that the ensemble will 
> have a data inconsistency issue. Here is a simplified scenario to show the issue:
> Let's say we have 3 servers in the ensemble; servers A and B are followers, 
> and C is the leader, and all the snapshots and txns are up to T0:
> 1. A new request reaches leader C to create Node N, and it's converted to 
> txn T1.
> 2. Txn T1 is synced to disk on C, but just before the proposal reaches 
> the followers, A and B restart, so T1 doesn't exist on A and B.
> 3. A and B form a new quorum after the restart; let's say B is the leader.
> 4. C changes to looking state due to not enough followers; it will sync with 
> leader B with last Zxid T0, which will result in an empty diff sync.
> 5. Before C takes a snapshot it restarts, and it replays the txns on disk, 
> which include T1; now it will have Node N, but A and B don't.
> Also, I included a test case to reproduce this issue consistently. 
> We have a totally different RetainDB version which avoids this issue by 
> doing consensus between snapshot and txn files before leader election; we will 
> submit it for review.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
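The five-step scenario in the quoted description can be reduced to a toy simulation that shows where the divergence comes from. This is a deliberately simplified model for illustration only, not ZooKeeper code; all names (`RetainDbScenario`, `Server`, `restartAndReplay`) are invented:

```java
// Hypothetical, simplified model of the 5-step scenario above.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class RetainDbScenario {
    // A server's state: last snapshotted zxid, txns on disk, applied data.
    static class Server {
        long snapshotZxid = 0;               // all servers snapshotted at T0
        List<Long> txnLog = new ArrayList<>();
        Set<Long> data = new TreeSet<>();

        // Replay on restart: apply every on-disk txn past the snapshot (step 5).
        void restartAndReplay() {
            for (long zxid : txnLog) {
                if (zxid > snapshotZxid) {
                    data.add(zxid);
                }
            }
        }
    }

    public static void main(String[] args) {
        Server a = new Server(), b = new Server(), c = new Server();
        // Steps 1-2: leader C syncs T1 to its own txn log only; the proposal
        // never reaches A and B before they restart.
        c.txnLog.add(1L);
        // Steps 3-4: A and B form a new quorum at T0; C gets an empty DIFF
        // sync from B, so its in-memory view matches, but T1 stays on disk.
        // Step 5: C restarts before snapshotting and replays its txn log.
        c.restartAndReplay();
        a.restartAndReplay();
        b.restartAndReplay();
        // C now holds Node N (zxid 1) while A and B do not: inconsistency.
        System.out.println(c.data.equals(a.data)); // prints "false"
    }
}
```

The fix discussed in this thread avoids the divergence by reconciling the txn log with the snapshot before leader election instead of replaying blindly on restart.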


[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367961#comment-16367961
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168884819
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -465,6 +470,37 @@ private void waitForAll(ZooKeeper[] zks, States state) throws InterruptedException
     private static class Servers {
         MainThread mt[];
         ZooKeeper zk[];
+        int[] clientPorts;
+
+        public void shutDownAllServers() throws InterruptedException {
+            for (MainThread t : mt) {
+                t.shutdown();
+            }
+        }
+
+        public void restartAllServersAndClients(Watcher watcher) throws IOException {
+            for (MainThread t : mt) {
+                if (!t.isAlive()) {
+                    t.start();
+                }
+            }
+            for (int i = 0; i < zk.length; i++) {
+                restartClient(i, watcher);
+            }
+        }
+
+        public void restartClient(int i, Watcher watcher) throws IOException {
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, watcher);
+        }
+
+        public int findLeader() {
--- End diff --

There are other places in this test class that would benefit from this 
refactoring. Would you mind cleaning those up?




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367962#comment-16367962
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168886064
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +923,103 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
 
+    @Test
+    public void testFailedTxnAsPartOfQuorumLoss() throws Exception {
+        // 1. start up server and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        servers = LaunchServers(SERVER_COUNT);
+
+        waitForAll(servers, States.CONNECTED);
+
+        // we need to shutdown and start back up to make sure that the create
+        // session isn't the first transaction since that is rather innocuous.
+        servers.shutDownAllServers();
+        waitForAll(servers, States.CONNECTING);
+        servers.restartAllServersAndClients(this);
+        waitForAll(servers, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = servers.findLeader();
+        Map<Long, Proposal> outstanding = servers.mt[leader].main.quorumPeer.leader.outstandingProposals;
+        // increase the tick time to delay the leader going to looking
+        servers.mt[leader].main.quorumPeer.tickTime = 1;
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to be the new leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                servers.restartClient(i, this);
+                waitForOne(servers.zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the old leader and make sure it's synced
+        // to disk, which means it was acked by itself
+        try {
+            servers.zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertEquals(1, outstanding.size());
+        Proposal p = outstanding.values().iterator().next();
+        Assert.assertEquals(OpCode.create, p.request.getHdr().getType());
+
+        // make sure it has a chance to write it to disk
+        int sleepTime = 0;
+        Long longLeader = new Long(leader);
+        while (!p.qvAcksetPairs.get(0).getAckset().contains(longLeader)) {
+            if (sleepTime > 2000) {
+                Assert.fail("Transaction not synced to disk within 2 seconds " + p.qvAcksetPairs.get(0).getAckset()
+                        + " expected " + leader);
+            }
+            Thread.sleep(100);
+            sleepTime += 100;
+        }
+
+        // 6. wait for the leader to quit due to not enough followers and come
+        // back up as a part of the new quorum
+        sleepTime = 0;
+        Follower f = servers.mt[leader].main.quorumPeer.follower;
+        while (f == null || !f.isRunning()) {
+            if (sleepTime > 10_000) {
--- End diff --

nitpick: can we reuse the ticktime here to make the relationship more 
obvious?
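One way to make the relationship the nitpick asks for explicit is to derive the bound from the timing parameters instead of writing a bare literal. A sketch only; `waitBoundMs`, the padding factor, and the sample parameter values are all invented for the illustration, not what the PR settled on:

```java
// Sketch: compute the step-6 wait bound from the quorum timing parameters so
// the 10_000 ms constant's relationship to them is visible. The 4x padding
// factor and the sample values below are hypothetical.
public class WaitBound {
    static int waitBoundMs(int tickTime, int initLimit, int syncLimit) {
        // Give the old leader several full election/sync windows, i.e.
        // (initLimit + syncLimit) ticks, padded 4x for slow CI machines.
        return 4 * tickTime * (initLimit + syncLimit);
    }

    public static void main(String[] args) {
        // One combination of illustrative values that reproduces the
        // 10_000 ms constant used in the test.
        System.out.println(waitBoundMs(500, 3, 2)); // prints 10000
    }
}
```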



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367960#comment-16367960
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168887935
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +923,103 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
 
+    @Test
+    public void testFailedTxnAsPartOfQuorumLoss() throws Exception {
+        // 1. start up server and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        servers = LaunchServers(SERVER_COUNT);
+
+        waitForAll(servers, States.CONNECTED);
+
+        // we need to shutdown and start back up to make sure that the create
+        // session isn't the first transaction since that is rather innocuous.
+        servers.shutDownAllServers();
+        waitForAll(servers, States.CONNECTING);
+        servers.restartAllServersAndClients(this);
+        waitForAll(servers, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = servers.findLeader();
+        Map<Long, Proposal> outstanding = servers.mt[leader].main.quorumPeer.leader.outstandingProposals;
+        // increase the tick time to delay the leader going to looking
+        servers.mt[leader].main.quorumPeer.tickTime = 1;
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                servers.mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to be the new leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                servers.restartClient(i, this);
+                waitForOne(servers.zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the old leader and make sure it's synced
+        // to disk, which means it was acked by itself
+        try {
+            servers.zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertEquals(1, outstanding.size());
+        Proposal p = outstanding.values().iterator().next();
+        Assert.assertEquals(OpCode.create, p.request.getHdr().getType());
+
+        // make sure it has a chance to write it to disk
+        int sleepTime = 0;
+        Long longLeader = new Long(leader);
+        while (!p.qvAcksetPairs.get(0).getAckset().contains(longLeader)) {
+            if (sleepTime > 2000) {
+                Assert.fail("Transaction not synced to disk within 2 seconds " + p.qvAcksetPairs.get(0).getAckset()
+                        + " expected " + leader);
+            }
+            Thread.sleep(100);
+            sleepTime += 100;
+        }
+
+        // 6. wait for the leader to quit due to not enough followers and come
+        // back up as a part of the new quorum
+        sleepTime = 0;
+        Follower f = servers.mt[leader].main.quorumPeer.follower;
+        while (f == null || !f.isRunning()) {
+            if (sleepTime > 10_000) {
+                Assert.fail("Took too long for old leader to time out " + servers.mt[leader].main.quorumPeer.getPeerState());
+            }
+            Thread.sleep(100);
+            sleepTime += 100;
+            f = servers.mt[leader].main.quorumPeer.follower;
+        }
+        servers.mt[leader].shutdown();
--- End diff --

why do we need this?



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367823#comment-16367823
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/453
  
@afine and @anmolnar I think I have addressed all of your review comments, 
except for the one about the change to `waitForOne` and I am happy to adjust 
however you want there.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367814#comment-16367814
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168857757
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
 
+    @Test
+    public void testTxnAheadSnapInRetainDB() throws Exception {
+        // 1. start up server and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        final int clientPorts[] = new int[SERVER_COUNT];
+        StringBuilder sb = new StringBuilder();
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            clientPorts[i] = PortAssignment.unique();
+            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
+        }
+        String quorumCfgSection = sb.toString();
+
+        MainThread mt[] = new MainThread[SERVER_COUNT];
+        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
+            mt[i].start();
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // we need to shutdown and start back up to make sure that the create
+        // session isn't the first transaction since that is rather innocuous.
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].shutdown();
+        }
+
+        waitForAll(zk, States.CONNECTING);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].start();
+            // Recreate a client session since the previous session was not persisted.
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = -1;
+        Map outstanding = null;
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (mt[i].main.quorumPeer.leader != null) {
+                leader = i;
+                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
+                // increase the tick time to delay the leader going to looking
+                mt[leader].main.quorumPeer.tickTime = 1;
+            }
+        }
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to be the leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+                waitForOne(zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the leader and make sure it's synced to
+        // disk, which means it was acked by itself
+        try {
+            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertTrue(outstanding.size() == 1);
+        Proposal p = (Proposal) outstanding.values().iterator().next();
+        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
+
+        // make sure it has a chance to write it to disk
+        Thread.sleep(1000);
--- End diff --

@revans2 take a look at `testElectionFraud`, specifically: 

[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367809#comment-16367809
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168857052
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -435,7 +435,7 @@ private void waitForOne(ZooKeeper zk, States state) throws InterruptedException
         int iterations = ClientBase.CONNECTION_TIMEOUT / 500;
         while (zk.getState() != state) {
             if (iterations-- == 0) {
-                throw new RuntimeException("Waiting too long");
+                throw new RuntimeException("Waiting too long " + zk.getState() + " != " + state);
--- End diff --

Since @anmolnar thinks it is valuable, I think it is fine for it to be left 
in. 




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367559#comment-16367559
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168807853
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
 
+    @Test
+    public void testTxnAheadSnapInRetainDB() throws Exception {
+        // 1. start up server and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        final int clientPorts[] = new int[SERVER_COUNT];
+        StringBuilder sb = new StringBuilder();
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            clientPorts[i] = PortAssignment.unique();
+            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
+        }
+        String quorumCfgSection = sb.toString();
+
+        MainThread mt[] = new MainThread[SERVER_COUNT];
+        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
+            mt[i].start();
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // we need to shutdown and start back up to make sure that the create
+        // session isn't the first transaction since that is rather innocuous.
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].shutdown();
+        }
+
+        waitForAll(zk, States.CONNECTING);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].start();
+            // Recreate a client session since the previous session was not persisted.
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = -1;
+        Map outstanding = null;
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (mt[i].main.quorumPeer.leader != null) {
+                leader = i;
+                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
+                // increase the tick time to delay the leader going to looking
+                mt[leader].main.quorumPeer.tickTime = 1;
+            }
+        }
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to be the leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+                waitForOne(zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the leader and make sure it's synced to
+        // disk, which means it was acked by itself
+        try {
+            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertTrue(outstanding.size() == 1);
+        Proposal p = (Proposal) outstanding.values().iterator().next();
+        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
+
+        // make sure it has a chance to write it to disk
+        Thread.sleep(1000);
--- End diff --

I was able to do what you said and drop the 1 second sleep, but the sleep 
at step 6 I am going to need something else because the leader is only in the 
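The change being discussed here, replacing fixed `Thread.sleep` calls with a bounded poll (as the final version of `testFailedTxnAsPartOfQuorumLoss` earlier in this thread does inline), can be captured in a small generic helper. This is an illustrative sketch only; the `Poll.await` name and signature are invented, not code from the PR:

```java
import java.util.function.BooleanSupplier;

// Sketch of a bounded-poll helper: retry a condition every pollMs until it
// holds or deadlineMs elapses, instead of sleeping a fixed amount and hoping.
public class Poll {
    static boolean await(BooleanSupplier cond, long deadlineMs, long pollMs)
            throws InterruptedException {
        long waited = 0;
        while (!cond.getAsBoolean()) {
            if (waited > deadlineMs) {
                return false; // caller fails the test with a descriptive message
            }
            Thread.sleep(pollMs);
            waited += pollMs;
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Condition becomes true after roughly 300 ms; the poll returns early
        // instead of always burning the full deadline.
        boolean ok = await(() -> System.currentTimeMillis() - start > 300, 2000, 100);
        System.out.println(ok); // prints "true"
    }
}
```

The upside over a fixed sleep is twofold: the test finishes as soon as the condition holds, and a too-slow machine produces a clear timeout failure rather than a flaky assertion later on.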

[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367562#comment-16367562
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168807943
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
 
+    @Test
+    public void testTxnAheadSnapInRetainDB() throws Exception {
+        // 1. start up server and wait for leader election to finish
+        ClientBase.setupTestEnv();
+        final int SERVER_COUNT = 3;
+        final int clientPorts[] = new int[SERVER_COUNT];
+        StringBuilder sb = new StringBuilder();
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            clientPorts[i] = PortAssignment.unique();
+            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
+        }
+        String quorumCfgSection = sb.toString();
+
+        MainThread mt[] = new MainThread[SERVER_COUNT];
+        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
+            mt[i].start();
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // we need to shutdown and start back up to make sure that the create
+        // session isn't the first transaction since that is rather innocuous.
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].shutdown();
+        }
+
+        waitForAll(zk, States.CONNECTING);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i].start();
+            // Recreate a client session since the previous session was not persisted.
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+        }
+
+        waitForAll(zk, States.CONNECTED);
+
+        // 2. kill all followers
+        int leader = -1;
+        Map outstanding = null;
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (mt[i].main.quorumPeer.leader != null) {
+                leader = i;
+                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
+                // increase the tick time to delay the leader going to looking
+                mt[leader].main.quorumPeer.tickTime = 1;
+            }
+        }
+        LOG.warn("LEADER {}", leader);
+
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].shutdown();
+            }
+        }
+
+        // 3. start up the followers to form a new quorum
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                mt[i].start();
+            }
+        }
+
+        // 4. wait for one of the followers to be the leader
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            if (i != leader) {
+                // Recreate a client session since the previous session was not persisted.
+                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
+                waitForOne(zk[i], States.CONNECTED);
+            }
+        }
+
+        // 5. send a create request to the leader and make sure it's synced to
+        // disk, which means it was acked by itself
+        try {
+            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+            Assert.fail("create /zk" + leader + " should have failed");
+        } catch (KeeperException e) {
+        }
+
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertTrue(outstanding.size() == 1);
+        Proposal p = (Proposal) outstanding.values().iterator().next();
+        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
+
+        // make sure it has a chance to write it to disk
+        Thread.sleep(1000);
+        p.qvAcksetPairs.get(0).getAckset().contains(leader);
+
+        // 6. wait for the leader to quit due to not enough followers
+        Thread.sleep(4000);
+

[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367563#comment-16367563
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168807976
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
     }
 
+    @Test
+    public void testTxnAheadSnapInRetainDB() throws Exception {
--- End diff --

done


> Data inconsistency issue due to retain database in leader election
> --
>
> Key: ZOOKEEPER-2845
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Assignee: Robert Joseph Evans
>Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time 
> during leader election. In a ZooKeeper ensemble, it's possible that the 
> snapshot is ahead of the txn file (due to a slow disk on the server, etc.), 
> or that the txn file is ahead of the snapshot because no commit message has 
> been received yet. If the snapshot is ahead of the txn file, then since the 
> SyncRequestProcessor queue is drained during shutdown, the snapshot and txn 
> file will stay consistent before leader election happens, so this is not an 
> issue.
> But if the txn file is ahead of the snapshot, the ensemble can end up with a 
> data inconsistency issue. Here is a simplified scenario showing the problem.
> Let's say we have 3 servers in the ensemble; servers A and B are followers, 
> C is the leader, and all the snapshots and txns are up to T0:
> 1. A new request reaches leader C to create Node N, and it's converted to 
> txn T1.
> 2. Txn T1 is synced to disk on C, but just before the proposal reaches the 
> followers, A and B restart, so T1 never existed on A and B.
> 3. A and B form a new quorum after the restart; let's say B is the leader.
> 4. C changes to the looking state because it doesn't have enough followers; 
> it will sync with leader B with last Zxid T0, which results in an empty diff 
> sync.
> 5. Before C takes a snapshot it restarts, and it replays the txns on disk, 
> which include T1. Now C has Node N, but A and B don't.
> I also included a test case to reproduce this issue consistently. 
> We have a totally different RetainDB version which avoids this issue by 
> doing consensus between the snapshot and txn files before leader election; 
> will submit it for review.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
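The failure sequence in the quoted description can be condensed into a small, self-contained simulation (plain Java with made-up names; the `Server` class and its methods are illustrative only, not ZooKeeper code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the scenario: a "server" is its on-disk txn log plus its
// in-memory data tree. Leader C fsyncs T1 to its log before any follower
// sees the proposal; the quorum re-forms at T0; C diff-syncs with an empty
// diff and later restarts, replaying its log -- which resurrects T1.
public class RetainDbDivergenceSketch {
    static class Server {
        final List<Long> txnLog = new ArrayList<>();   // zxids synced to disk
        final List<Long> dataTree = new ArrayList<>(); // committed state

        void commit(long zxid) { txnLog.add(zxid); dataTree.add(zxid); }

        // Log a txn to disk without committing it (proposal never acked).
        void logOnly(long zxid) { txnLog.add(zxid); }

        // Restart before taking a snapshot: replay the whole txn log.
        void restartAndReplay() {
            dataTree.clear();
            dataTree.addAll(txnLog);
        }
    }

    public static boolean leaderDiverges() {
        Server a = new Server(), b = new Server(), c = new Server();
        long t0 = 0L, t1 = 1L;
        a.commit(t0); b.commit(t0); c.commit(t0); // everyone agrees up to T0

        c.logOnly(t1);        // steps 1-2: C syncs T1, proposal never sent
        // step 3: A and B restart and form a new quorum at T0 (B leads)
        // step 4: C reports last zxid T0, so it gets an empty DIFF sync
        c.restartAndReplay(); // step 5: replaying the log brings back T1

        return !c.dataTree.equals(b.dataTree); // true => inconsistency
    }

    public static void main(String[] args) {
        System.out.println("C diverges from quorum: " + leaderDiverges());
    }
}
```

Running this prints `C diverges from quorum: true`, which is the inconsistency the issue describes and the test case tries to reproduce against real servers.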


[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367561#comment-16367561
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168807914
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
+        Assert.assertTrue(outstanding.size() == 1);
+        Proposal p = (Proposal) outstanding.values().iterator().next();
--- End diff --

removed the cast



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367504#comment-16367504
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168795646
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -435,7 +435,7 @@ private void waitForOne(ZooKeeper zk, States state) throws InterruptedException
         int iterations = ClientBase.CONNECTION_TIMEOUT / 500;
         while (zk.getState() != state) {
             if (iterations-- == 0) {
-                throw new RuntimeException("Waiting too long");
+                throw new RuntimeException("Waiting too long " + zk.getState() + " != " + state);
--- End diff --

@anmolnar  and @afine I put this in for my own debugging and I forgot to 
remove it.  If you want me to I am happy to either remove it or file a separate 
JIRA and put it up as a separate pull request, or just leave it.  Either way is 
fine with me.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367503#comment-16367503
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user anmolnar commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168795633
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
+        MainThread mt[] = new MainThread[SERVER_COUNT];
+        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
--- End diff --

Use the `LaunchServers(numServers, tickTime)` method in this class.
It gives you a collection of `MainThread` and `ZooKeeper` objects properly 
initialized.
The test's `tearDown()` will take care of destroying it. 
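The pattern being recommended here — one launch call that owns every per-server handle, and a single teardown that destroys them all — can be sketched in miniature. All names below are hypothetical stand-ins; only `LaunchServers` itself is the real helper in `QuorumPeerMainTest`:

```java
import java.util.ArrayList;
import java.util.List;

// Miniature of the fixture pattern: launch() returns a handle owning the
// servers, and one close() (what a test's tearDown() would call) stops them
// all, so test bodies stop hand-rolling MainThread/ZooKeeper arrays.
public class ClusterFixtureSketch {
    static class FakeServer {
        boolean running;
        void start() { running = true; }
        void shutdown() { running = false; }
    }

    // The handle a launch helper would return; AutoCloseable so cleanup is
    // one call, or automatic with try-with-resources.
    static class Servers implements AutoCloseable {
        final List<FakeServer> servers = new ArrayList<>();

        static Servers launch(int numServers) {
            Servers s = new Servers();
            for (int i = 0; i < numServers; i++) {
                FakeServer fs = new FakeServer();
                fs.start();
                s.servers.add(fs);
            }
            return s;
        }

        @Override
        public void close() {
            for (FakeServer fs : servers) {
                fs.shutdown();
            }
        }
    }

    public static void main(String[] args) {
        try (Servers cluster = Servers.launch(3)) {
            System.out.println("servers running: " + cluster.servers.size());
        }
    }
}
```

The payoff is that a test body only touches the returned handle, and cleanup cannot be forgotten even when an assertion throws mid-test.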




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367497#comment-16367497
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168794042
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
+        // just make sure that we actually did get it in process at the leader
+        Assert.assertTrue(outstanding.size() == 1);
+        Proposal p = (Proposal) outstanding.values().iterator().next();
+        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
+
+        // make sure it has a chance to write it to disk
+        Thread.sleep(1000);
--- End diff --

I will see if I can make it work.  I agree I would love to kill as many of 
the sleeps as possible.
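Replacing the fixed sleeps usually means polling for the condition with a bound. A generic sketch of such a helper (the class and method names are made up, not existing ZooKeeper test infrastructure):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// A bounded-poll helper: one possible replacement for fixed Thread.sleep()
// calls in tests. Returns as soon as the condition holds instead of always
// waiting the full interval; fails loudly if the timeout elapses first.
public final class Await {
    private Await() {}

    // Poll cond every 50 ms until it holds or timeoutMs elapses.
    public static void until(BooleanSupplier cond, long timeoutMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!cond.getAsBoolean()) {
            if (System.nanoTime() > deadline) {
                throw new AssertionError("condition not met within " + timeoutMs + " ms");
            }
            Thread.sleep(50);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // e.g. instead of sleeping a fixed 1000 ms before checking the ack
        // set, poll until the (here simulated) condition becomes true.
        long start = System.currentTimeMillis();
        until(() -> System.currentTimeMillis() - start >= 100, 5_000);
        System.out.println("condition met after ~" + (System.currentTimeMillis() - start) + " ms");
    }
}
```

Tests built this way stay fast on the happy path and only pay the full timeout when something is actually broken.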



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367495#comment-16367495
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168793764
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
+        MainThread mt[] = new MainThread[SERVER_COUNT];
+        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
--- End diff --

I am not super familiar with the test infrastructure.  If you have a 
suggestion I would love it, otherwise I will look around and see what I can 
come up with.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367492#comment-16367492
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user anmolnar commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168793569
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
+        MainThread mt[] = new MainThread[SERVER_COUNT];
+        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+        for (int i = 0; i < SERVER_COUNT; i++) {
+            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
--- End diff --

+1
As mentioned, `testElectionFraud()` is a good example of that.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367491#comment-16367491
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user anmolnar commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168793211
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -435,7 +435,7 @@ private void waitForOne(ZooKeeper zk, States state) throws InterruptedException
         int iterations = ClientBase.CONNECTION_TIMEOUT / 500;
         while (zk.getState() != state) {
             if (iterations-- == 0) {
-                throw new RuntimeException("Waiting too long");
+                throw new RuntimeException("Waiting too long " + zk.getState() + " != " + state);
--- End diff --

Although I agree with you in general, I think this one is a good addition to 
the test output.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366484#comment-16366484
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168649080
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -435,7 +435,7 @@ private void waitForOne(ZooKeeper zk, States state) throws InterruptedException
         int iterations = ClientBase.CONNECTION_TIMEOUT / 500;
         while (zk.getState() != state) {
             if (iterations-- == 0) {
-                throw new RuntimeException("Waiting too long");
+                throw new RuntimeException("Waiting too long " + zk.getState() + " != " + state);
--- End diff --

nit: let's minimize unrelated test changes and whitespace changes


> Data inconsistency issue due to retain database in leader election
> --
>
> Key: ZOOKEEPER-2845
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Assignee: Robert Joseph Evans
>Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time 
> during leader election. In a ZooKeeper ensemble, it's possible that the 
> snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or 
> that the txn file is ahead of the snapshot because no commit message has been 
> received yet. 
> If the snapshot is ahead of the txn file, this is not an issue: the 
> SyncRequestProcessor queue is drained during shutdown, so the snapshot and txn 
> file stay consistent before leader election happens.
> But if the txn file is ahead of the snapshot, the ensemble can end up with 
> inconsistent data. Here is a simplified scenario that shows the issue.
> Say we have 3 servers in the ensemble, servers A and B are followers, C is the 
> leader, and all the snapshots and txns are up to T0:
> 1. A new request reaches leader C to create node N, and it's converted to txn T1.
> 2. Txn T1 is synced to disk on C, but A and B restart just before the proposal 
> reaches them, so T1 never existed on A and B.
> 3. A and B form a new quorum after the restart; say B is the leader.
> 4. C changes to the LOOKING state because it no longer has enough followers; it 
> syncs with leader B with last zxid T0, which yields an empty DIFF sync.
> 5. C restarts before taking a snapshot and replays the txns on disk, which 
> include T1. Now C has node N, but A and B don't.
> I also included a test case that reproduces this issue consistently.
> We have a totally different RetainDB version that avoids this issue by reaching 
> consensus between the snapshot and txn files before leader election; we will 
> submit it for review.
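The five steps above can be sketched as a toy simulation: each server keeps a snapshot zxid plus a txn log that may run ahead of it, and replaying the log on restart resurrects the uncommitted txn on C only. All names below are illustrative stand-ins, not ZooKeeper's actual storage code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of steps 1-5: persistent state is a snapshot zxid plus a txn
// log that can run ahead of it; replaying the log on restart applies txns
// that were never committed by a quorum.
public class TxnAheadOfSnapshotDemo {
    static class Server {
        long snapshotZxid = 0;                  // all servers snapshotted at T0 = 0
        List<Long> txnLog = new ArrayList<>();  // zxids fsynced to the txn file

        // On restart the server replays every logged txn past the snapshot.
        long replay() {
            long last = snapshotZxid;
            for (long zxid : txnLog) {
                if (zxid > last) {
                    last = zxid;
                }
            }
            return last;
        }
    }

    public static void main(String[] args) {
        Server a = new Server(), c = new Server();
        // Steps 1-2: leader C logs T1 = 1, but the proposal never reaches A or B.
        c.txnLog.add(1L);
        // Steps 3-4: A and B re-form a quorum at T0; C would get an empty DIFF
        // sync, but its on-disk log still holds T1.
        // Step 5: C restarts before snapshotting and replays its log.
        System.out.println("C sees zxid " + c.replay() + ", A sees zxid " + a.replay());
        // C now holds the uncommitted txn while A and B do not: inconsistency.
    }
}
```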



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366485#comment-16366485
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168649906
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws 
Exception {
 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
 }
 
+@Test
+public void testTxnAheadSnapInRetainDB() throws Exception {
+// 1. start up server and wait for leader election to finish
+ClientBase.setupTestEnv();
+final int SERVER_COUNT = 3;
+final int clientPorts[] = new int[SERVER_COUNT];
+StringBuilder sb = new StringBuilder();
+for (int i = 0; i < SERVER_COUNT; i++) {
+clientPorts[i] = PortAssignment.unique();
+sb.append("server." + i + "=127.0.0.1:" + 
PortAssignment.unique() + ":" + PortAssignment.unique() + ";" + clientPorts[i] 
+ "\n");
+}
+String quorumCfgSection = sb.toString();
+
+MainThread mt[] = new MainThread[SERVER_COUNT];
+ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
+for (int i = 0; i < SERVER_COUNT; i++) {
+mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
--- End diff --

is there any reason we can't use the existing test infra to clean this up a 
little




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366486#comment-16366486
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168649723
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws 
Exception {
 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
 }
 

[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366481#comment-16366481
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168653437
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws 
Exception {
 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
 }
 
+@Test
+public void testTxnAheadSnapInRetainDB() throws Exception {
+
+// make sure it has a chance to write it to disk
+Thread.sleep(1000);
--- End diff --

There is a lot of `Thread.sleep()` going on and I would like to find a way 
to minimize that. Apache infra can occasionally be quite slow (it can starve 

[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366483#comment-16366483
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168651275
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws 
Exception {
 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
 }
 
+@Test
+public void testTxnAheadSnapInRetainDB() throws Exception {
+// just make sure that we actually did get it in process at the
+// leader
+Assert.assertTrue(outstanding.size() == 1);
+Proposal p = (Proposal) outstanding.values().iterator().next();
--- End diff --

Do we need this cast?



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366482#comment-16366482
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user afine commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r168649459
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws 
Exception {
 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
 }
 
+@Test
+public void testTxnAheadSnapInRetainDB() throws Exception {
--- End diff --

nit: I don't think we use the terminology "RetainDB" anywhere else. Perhaps 
we can get rid of "retain"?




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362939#comment-16362939
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/453
  
@anmolnar I added in an updated version of the test in #310. The issue 
turned out to be a race condition where the original leader would time out 
clients and then would join the new quorum too quickly for the test to be able 
to detect it.  I changed it so there is a hard coded sleep instead and then 
just shut down the leader.  I would love to get rid of the hard coded sleep, 
but I wasn't really sure how to do it without making some major changes in the 
leader code to put in a synchronization point.  If you really want me to do it 
I can, but it felt rather intrusive.

I verified that when I comment out my code that does the fast forward the 
test fails and when I put it back the test passes.  If this looks OK I'll try 
to port the test to the other release branches too.

I also addressed your request to make some of the code common.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362576#comment-16362576
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user anmolnar commented on the issue:

https://github.com/apache/zookeeper/pull/453
  
@revans2 Take a look at `testElectionFraud()` in the same file. Maybe I'm 
wrong, but it seems to me it's trying to achieve something similar.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362505#comment-16362505
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/453
  
@anmolnar I will add some kind of a test.  I ran into a lot of issues with 
`testTxnAheadSnapInRetainDB`.  I could not get it to run correctly against 
master as it would always end up electing the original leader again and the 
test would fail, but not because it had reproduced the issue.  I finally just 
did development work based off of the [original 
patch](https://github.com/apache/zookeeper/compare/master...revans2:ZOOKEEPER-2845-updated-fix?expand=1)
 and verified that `testTxnAheadSnapInRetainDB` passed, or that if it failed it 
did so because of leader election.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362425#comment-16362425
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r167885280
  
--- Diff: src/java/main/org/apache/zookeeper/server/ZKDatabase.java ---
@@ -233,14 +233,32 @@ public long getDataTreeLastProcessedZxid() {
  * @throws IOException
  */
 public long loadDataBase() throws IOException {
-PlayBackListener listener=new PlayBackListener(){
+PlayBackListener listener = new PlayBackListener(){
 public void onTxnLoaded(TxnHeader hdr,Record txn){
 Request r = new Request(0, hdr.getCxid(),hdr.getType(), 
hdr, txn, hdr.getZxid());
 addCommittedProposal(r);
 }
 };
 
-long zxid = 
snapLog.restore(dataTree,sessionsWithTimeouts,listener);
+long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, 
listener);
+initialized = true;
+return zxid;
+}
+
+/**
+ * Fast forward the database adding transactions from the committed 
log into memory.
+ * @return the last valid zxid.
+ * @throws IOException
+ */
+public long fastForwardDataBase() throws IOException {
+PlayBackListener listener = new PlayBackListener(){
--- End diff --

Will do




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362423#comment-16362423
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user anmolnar commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r167884587
  
--- Diff: src/java/main/org/apache/zookeeper/server/ZKDatabase.java ---
@@ -233,14 +233,32 @@ public long getDataTreeLastProcessedZxid() {
  * @throws IOException
  */
 public long loadDataBase() throws IOException {
-PlayBackListener listener=new PlayBackListener(){
+PlayBackListener listener = new PlayBackListener(){
 public void onTxnLoaded(TxnHeader hdr,Record txn){
 Request r = new Request(0, hdr.getCxid(),hdr.getType(), hdr, txn, hdr.getZxid());
 addCommittedProposal(r);
 }
 };
 
-long zxid = snapLog.restore(dataTree,sessionsWithTimeouts,listener);
+long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, listener);
+initialized = true;
+return zxid;
+}
+
+/**
+ * Fast forward the database adding transactions from the committed log into memory.
+ * @return the last valid zxid.
+ * @throws IOException
+ */
+public long fastForwardDataBase() throws IOException {
+PlayBackListener listener = new PlayBackListener(){
--- End diff --

I think it'd be nice to extract the common logic of these two methods into
a separate one.
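As a rough sketch of that refactor (all types below are simplified stand-ins invented for illustration, not the real org.apache.zookeeper classes), both methods could delegate to one private helper that builds the shared PlayBackListener:

```java
import java.util.*;

// Sketch: loadDataBase() and fastForwardDataBase() build the same
// PlayBackListener, so it can live in one private helper. The types here
// are simplified stand-ins, not the real ZooKeeper classes.
class ZkDatabaseSketch {
    interface PlayBackListener { void onTxnLoaded(long zxid); }

    private final List<Long> committedLog = new ArrayList<>();
    private boolean initialized = false;

    // Single definition of the "record every replayed txn" listener.
    private PlayBackListener commitProposalListener() {
        return zxid -> committedLog.add(zxid);
    }

    public long loadDataBase(List<Long> snapshotAndLog) {
        long zxid = restore(snapshotAndLog, commitProposalListener());
        initialized = true;
        return zxid;
    }

    public long fastForwardDataBase(List<Long> logTail) {
        long zxid = restore(logTail, commitProposalListener());
        initialized = true;
        return zxid;
    }

    // Stand-in for the snap-log replay: notify the listener for each txn
    // and return the last zxid seen.
    private long restore(List<Long> txns, PlayBackListener listener) {
        long last = 0;
        for (long zxid : txns) { listener.onTxnLoaded(zxid); last = zxid; }
        return last;
    }

    public List<Long> getCommittedLog() { return committedLog; }
}
```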




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362412#comment-16362412
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user anmolnar commented on the issue:

https://github.com/apache/zookeeper/pull/453
  
@revans2 Your latest change looks good to me and a bit safer than the
previous one. Would you please consider adding some unit tests to validate the
functionality?
What do you think of porting the testTxnAheadSnapInRetainDB() test from your
codebase?
Maybe I can help make it non-flaky, if you think it correctly verifies
the original issue.





[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-13 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362234#comment-16362234
 ] 

Robert Joseph Evans commented on ZOOKEEPER-2845:


[~lvfangmin],

Thanks for pushing on this.  I had missed an error case in the follower.  I 
have updated the patch to hopefully fix all of the issues, but please have a 
look at it.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362233#comment-16362233
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/453
  
Thank you to everyone who reviewed the patch. With the help of Fangmin
Lv I found one case that the original patch didn't cover. I have reworked the
patch to cover that case, but to do so I had to take a completely different
approach.

I think this is a better approach because it reuses a lot of the code that
is originally run to load the database from disk. Now, instead of reloading
the entire database from disk, we apply all of the uncommitted transactions in
the log to the in-memory database. This should put it in exactly the same
state as if we had cleared the data and reloaded it from disk, but with much
less overhead.
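The fast-forward idea could be sketched like this (a toy model where a NavigableMap stands in for the txn log and a plain Map for the in-memory DataTree; the names are hypothetical, not the real ZooKeeper API):

```java
import java.util.*;

// Toy model: fast-forward applies only the log entries strictly newer than
// the in-memory database's last processed zxid, instead of wiping the tree
// and re-reading snapshot + full log from disk.
class FastForwardSketch {
    static long fastForward(Map<Long, String> dataTree, long lastProcessedZxid,
                            NavigableMap<Long, String> txnLog) {
        // tailMap(key, false): only entries with zxid > lastProcessedZxid.
        for (Map.Entry<Long, String> e :
                 txnLog.tailMap(lastProcessedZxid, false).entrySet()) {
            dataTree.put(e.getKey(), e.getValue());
            lastProcessedZxid = e.getKey();
        }
        return lastProcessedZxid;
    }
}
```

The end state matches what a full clear-and-reload would produce, but only the tail of the log is touched.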




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362223#comment-16362223
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user anmolnar commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r167838309
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/LearnerHandlerTest.java ---
@@ -462,6 +469,8 @@ public void testNewEpochZxid() throws Exception {
 
 // Peer has zxid of epoch 1
 peerZxid = getZxid(1, 0);
+//We are on a different epoch so we don't know 1, 0 is in our log or not.
+// So we need to do a full SNAP
--- End diff --

I think this comment has been added by mistake. You added (1,0) to the log
above, hence syncFollower() returns false, which means we don't need to do a
full SNAP.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1636#comment-1636
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user anmolnar commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r167838605
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/LearnerHandlerTest.java ---
@@ -498,31 +507,20 @@ public void testNewEpochZxidWithTxnlogOnly() throws Exception {
 
 // Peer has zxid of epoch 3
 peerZxid = getZxid(3, 0);
-assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
-// We send DIFF to (6,0) and forward any packet starting at (4,1)
-assertOpType(Leader.DIFF, getZxid(6, 0), getZxid(4, 1));
-// DIFF + 1 proposals + 1 commit
-assertEquals(3, learnerHandler.getQueuedPackets().size());
-queuedPacketMatches(new long[] { getZxid(4, 1)});
+//There is no 3, 0 proposal in the committed log so sync
+assertTrue(learnerHandler.syncFollower(peerZxid, db, leader));
--- End diff --

It seems to me that this test checks the same thing 3 times in a row.
Do you think it's necessary to do so?




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362164#comment-16362164
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user mfenes commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r167835407
  
--- Diff: src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java ---
@@ -758,6 +760,11 @@ public boolean syncFollower(long peerLastZxid, ZKDatabase db, Leader leader) {
 currentZxid = maxCommittedLog;
 needOpPacket = false;
 needSnap = false;
+} else if (peerLastEpoch != lastProcessedEpoch && !db.isInCommittedLog(peerLastZxid)) {
--- End diff --

Could you please add a description of what this else-if case is doing to the
comments above (under "Here are the cases that we want to handle")?
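For illustration only, the decision this else-if branch encodes could be summarized as a standalone predicate (hypothetical names; epochs and the committed log are modeled as plain values rather than the real LearnerHandler/ZKDatabase types):

```java
// Sketch of the added branch: if the peer's epoch differs from ours and its
// last zxid is not in our committed log, we cannot prove that a DIFF or
// TRUNC would converge, so the safe fallback is a full SNAP.
class SyncDecisionSketch {
    static boolean needSnap(long peerLastEpoch, long lastProcessedEpoch,
                            java.util.Set<Long> committedLog, long peerLastZxid) {
        return peerLastEpoch != lastProcessedEpoch
            && !committedLog.contains(peerLastZxid);
    }
}
```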




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-12 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361234#comment-16361234
 ] 

Robert Joseph Evans commented on ZOOKEEPER-2845:


[~lvfangmin],

You are right, I did miss the ID changing on the reload in my tests. I
will spend some more time debugging. My patch does fix the test case that was
uploaded, but I want to be sure I understand the issue well enough to see which
situations might not be fixed by it.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-12 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360880#comment-16360880
 ] 

Robert Joseph Evans commented on ZOOKEEPER-2845:


[~lvfangmin],

I will spend some more time debugging it because I could have made a mistake,
but that is not what I saw in the unit test you provided. When I logged the
zxid used for leader election both before and after clearing the DB, it didn't
change. But like I said, I could have missed something, and I am not a regular
contributor, so I will go back and try it again.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360557#comment-16360557
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user mfenes commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/453#discussion_r167513290
  
--- Diff: src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java ---
@@ -758,6 +760,11 @@ public boolean syncFollower(long peerLastZxid, ZKDatabase db, Leader leader) {
 currentZxid = maxCommittedLog;
 needOpPacket = false;
 needSnap = false;
+} else if (peerLastEpoch != lastProcessedEpoch && !db.isInCommittedLog(peerLastZxid)) {
+//Be sure we do a snap, because if the epochs are not the same we don't know what
+// could have happened in between and it may take a TRUNC + UPDATES to get them in SYNC
+LOG.debug("Will send SNAP to peer sid: {} epochs are too our of sync local 0x{} remote 0x{}",
--- End diff --

I think there is a typo here: "our of sync"




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-10 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359421#comment-16359421
 ] 

Fangmin Lv commented on ZOOKEEPER-2845:
---

[~revans2] cleaning and reloading the DB makes the server use the correct zxid
to vote or to sync with the new leader. If this server is elected as the new
leader, the whole ensemble will end up with the extra txn; otherwise, the new
leader will send a TRUNC or SNAP to this server, which means the txn will be
discarded.

With RetainDB, the server ignores the fact that it actually has the txn flushed
to disk, and there is a race condition: if the DB is reloaded from disk, it may
include this txn.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-09 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358899#comment-16358899
 ] 

Robert Joseph Evans commented on ZOOKEEPER-2845:


[~lvfangmin],

So how does clearing the DB prevent it from re-applying the transactions in the 
transaction log?



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-09 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358804#comment-16358804
 ] 

Fangmin Lv commented on ZOOKEEPER-2845:
---

[~revans2] the txn is only applied to the DB once it's quorum committed. The 
problem here is not a lost txn but an extra txn that was never quorum 
committed, which is what the Jira description shows.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-09 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358627#comment-16358627
 ] 

Robert Joseph Evans commented on ZOOKEEPER-2845:


[~lvfangmin],

Perhaps I don't understand the issue well enough, which is totally possible 
because I am not a frequent contributor and the path through all of the request 
processors is kind of complex.

My understanding is that the SyncRequestProcessor handles writing out edits to 
the edit log and snapshots; there are a few other places where this happens at 
startup, though. The SyncRequestProcessor writes out edits as they arrive, 
flushes them to disk periodically in batches, and also takes snapshots 
periodically.

The in-memory portion appears to be updated by the FinalRequestProcessor prior 
to a quorum of acks being received.

So yes, there is the possibility that something is written to the transaction 
log but not applied to memory. This means that when ZKDatabase.clear() is 
called, it should actually fast-forward the in-memory state to match the 
edit log + snapshot.

So you are saying that:
 1) proposals come in and are written to the transaction log, but the in-memory 
database is not updated yet.
 2) the server does a soft restart for some reason and some transactions appear 
to be lost (because the in-memory DB was not fast-forwarded).
 3) more transactions come in (possibly conflicting with the first set of 
transactions).
 4) before a snapshot can happen, the leader or follower restarts and has to 
reconstruct the in-memory DB from edits + snapshot. This then reapplies the 
edits that originally appeared to be lost.

This does look like it might happen, so I will look into that as well.

But the test in [https://github.com/apache/zookeeper/pull/310] didn't appear to 
trigger this. I could be wrong, because I concentrated most of my debugging on 
the original leader and what was happening with it, rather than on the 
followers. I also didn't understand how clearing the leader's in-memory 
database could cause an edit to be lost, if the edits are written out to disk 
before the in-memory DB is updated. What I saw was:

 1) a bunch of edits and leader/follower restarts that didn't really do much of 
anything.
 2) the original leader lost its connection to the followers.
 3a) a transaction was written to the leader's in-memory DB but didn't get a 
quorum of acks.
 3b) the followers restarted and formed a new quorum.
 4) the original leader timed out and joined the new quorum.
 5) as part of the sync when the old leader joined the new quorum, it got a 
DIFF (not a SNAP), but it had an edit that was not part of the new leader, so 
it was off from the others.

I could see this second part happening even without my change, so I don't 
really understand how clearing the database would prevent it. My thinking was 
that it was a race condition where the edits in the edit log were not flushed 
yet, and as such, when we cleared the DB they were lost. But I didn't confirm 
this.


[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-02-09 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358060#comment-16358060
 ] 

Fangmin Lv commented on ZOOKEEPER-2845:
---

[~revans2] Thanks for jumping in and working on this issue. The consistency 
issue mentioned here is not caused by the syncing protocol, but by the fact 
that there may be uncommitted txns in the txn file that are not in the 
ZKDatabase when the database is retained. If I understand your proposal and 
diff correctly, you're trying to solve the issue by checking the epoch while 
syncing with the leader, but that doesn't address the uncommitted txn sitting 
in the txn file: during txn replay the server could load that txn and become 
inconsistent.
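A minimal toy model of that hazard (the class, method, and txn names below are made up for illustration; they are not ZooKeeper's actual classes): the txn file can hold an entry that was fsynced but never quorum-committed, so a replay from disk resurrects it even though the retained in-memory DB never applied it.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplaySketch {
    // Replaying from disk applies every txn found in the log,
    // with no way to tell committed entries from uncommitted ones.
    static List<String> replay(List<String> txnLog) {
        return new ArrayList<>(txnLog);
    }

    public static void main(String[] args) {
        List<String> txnLog = new ArrayList<>();    // what was flushed to disk
        List<String> committed = new ArrayList<>(); // what reached a quorum

        txnLog.add("T0:createA");
        committed.add("T0:createA");
        txnLog.add("T1:createN"); // fsynced on C, but the quorum never acked it

        List<String> replayed = replay(txnLog);
        assert committed.size() == 1;
        assert replayed.contains("T1:createN"); // the extra, uncommitted txn
        System.out.println("replay resurrected " + (replayed.size() - committed.size()) + " uncommitted txn(s)");
    }
}
```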



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-01-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347901#comment-16347901
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

GitHub user revans2 opened a pull request:

https://github.com/apache/zookeeper/pull/455

ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified.

This is the version of #453 for the 3.4 branch

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/revans2/zookeeper ZOOKEEPER-2845-3.4

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/455.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #455


commit b035df19616424036afb1f31f345dedf26e3b2ae
Author: Robert Evans 
Date:   2018-02-01T02:09:53Z

ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified.






[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-01-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347678#comment-16347678
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

GitHub user revans2 opened a pull request:

https://github.com/apache/zookeeper/pull/454

ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified. (3.5)

This is the version of #453 for the 3.5 branch

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/revans2/zookeeper ZOOKEEPER-2845-3.5

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/454.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #454


commit 70436249c830af0b129caf3d1bed2f55a2498b6b
Author: Robert Evans 
Date:   2018-01-29T20:27:10Z

ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified.






[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-01-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347557#comment-16347557
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

GitHub user revans2 opened a pull request:

https://github.com/apache/zookeeper/pull/453

ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified.

I will be creating a patch/pull request for 3.4 and 3.5 too, but I wanted 
to get a pull request up for others to look at ASAP.

I have a version of this based off of #310 at 
https://github.com/revans2/zookeeper/tree/ZOOKEEPER-2845-orig-test-patch but 
the test itself is flaky.  Frequently leader election does not go as planned on 
the test and it ends up failing but not because it ended up in an inconsistent 
state.

I am happy to answer any questions anyone has about the patch.  

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/revans2/zookeeper ZOOKEEPER-2845-master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/453.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #453


commit 0219b2c9e44527067cd5fed4b642729171721886
Author: Robert Evans 
Date:   2018-01-29T20:27:10Z

ZOOKEEPER-2845: Send a SNAP if transactions cannot be verified.






[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-01-31 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347026#comment-16347026
 ] 

Robert Joseph Evans commented on ZOOKEEPER-2845:


I have a fix that I will be posting shortly.  I need to clean up the patch and 
make sure that I get pull requests ready for all of the branches that 
ZOOKEEPER-2926 went into.

 

The following table describes the situation that allows a node to get into an 
inconsistent state.

 
|| ||N1||N2||N3||
|Start with cluster in sync, N1 is leader|0x0 0x5|0x0 0x5|0x0 0x5|
|N2 and N3 go down|0x0 0x5| | |
|Proposal to N1 (fails with no quorum)|0x0 0x6| | |
|N2 and N3 return, but N1 is restarting. N2 elected leader| |0x1 0x0|0x1 0x0|
|A proposal is accepted| |0x1 0x1|0x1 0x1|
|N1 returns and is trying to sync with the new leader N2|0x0 0x6|0x1 0x1|0x1 0x1|

 

At this point the code in {{LearnerHandler.syncFollower}} takes over to bring 
N1 into sync with the new leader N2.

That code checks the following, in order:
 # Is there a {{forceSync}}? Not in this case.
 # Are the two zxids already in sync? No, {{0x0 0x6 != 0x1 0x1}}.
 # Is the peer zxid > the local zxid (and the peer didn't just rotate to a new 
epoch)? No, {{0x0 0x6 < 0x1 0x1}}.
 # Is the peer zxid between the max committed log and the min committed log? 
In this case yes, but it shouldn't be. The max committed log is {{0x1 0x1}}. 
The min committed log is {{0x0 0x5}}, or likely something below it, because it 
is based on distance in the edit log. The issue is that once the epoch changes 
from {{0x0}} to {{0x1}}, the leader has no idea whether the peer's edits are in 
its own edit log without explicitly checking for them.

 

The reason that ZOOKEEPER-2926 exposed this is that previously, when a leader 
was elected, the in-memory DB was dropped and everything was reread from disk. 
When that happened, the {{0x0 0x6}} proposal was lost. But it is not guaranteed 
to be lost in all cases. In theory, a snapshot could be taken that is triggered 
by that proposal, either on the leader, or on a follower that also applied the 
proposal but does not join the new quorum in time. As such, ZOOKEEPER-2926 
really just extended the window of an already existing race, but it extended it 
almost indefinitely, so the race is much more likely to happen.

 

My fix is to update {{LearnerHandler.syncFollower}} to only send a {{DIFF}} if 
the epochs are the same. If they are not the same, we don't know whether the 
peer has something in its log that we don't know about.
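The shape of that decision rule can be sketched as follows. This is a simplified, hypothetical stand-in for {{LearnerHandler.syncFollower}} (the real method handles more cases, such as TRUNC and forceSync), showing only the epoch guard the fix adds before trusting the committed-log window:

```java
public class SyncDecisionSketch {
    enum Sync { DIFF, SNAP }

    static long epochOf(long zxid) { return zxid >>> 32; }

    static Sync decide(long peerLastZxid, long maxCommittedLog, long minCommittedLog) {
        if (peerLastZxid == maxCommittedLog) return Sync.DIFF; // already in sync
        if (epochOf(peerLastZxid) != epochOf(maxCommittedLog)) {
            // Across an epoch change the leader cannot verify the peer's
            // txns from its committed-log window, so fall back to a SNAP.
            return Sync.SNAP;
        }
        if (peerLastZxid >= minCommittedLog && peerLastZxid <= maxCommittedLog) {
            return Sync.DIFF; // safe to replay the committed-log tail
        }
        return Sync.SNAP;
    }

    public static void main(String[] args) {
        long peer = (0L << 32) | 6; // old leader: epoch 0x0, counter 0x6
        long max  = (1L << 32) | 1; // new leader max committed: 0x1 0x1
        long min  = (0L << 32) | 5; // committed-log floor: 0x0 0x5

        // The buggy window check alone would choose DIFF here;
        // the epoch guard forces a SNAP instead.
        assert decide(peer, max, min) == Sync.SNAP;
        System.out.println("cross-epoch peer gets a SNAP");
    }
}
```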

 



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344024#comment-16344024
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/310
  
Apparently, for some reason I don't understand, if I don't run all of the 
tests in QuorumPeerMainTest, the old leader is elected again each time.




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2018-01-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343979#comment-16343979
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/310
  
@lvfangmin 

I am trying to reproduce the issue you have seen here, and I have not been 
able to do so. The test either fails for me with the same leader being 
elected each time, or, on newer versions, ends up with the leader's client 
connected instead of waiting for it to quit with a timeout, which I am not 
sure ever happens.

How frequently does this test pass for you?




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-12-08 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1628#comment-1628
 ] 

Fangmin Lv commented on ZOOKEEPER-2845:
---

[~davelatham] I meant the broken "retainDB" commit in ZOOKEEPER-2678; we 
should revert it until we have a sound solution.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-12-08 Thread Dave Latham (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284144#comment-16284144
 ] 

Dave Latham commented on ZOOKEEPER-2845:


Thanks, [~lvfangmin]. The broken "retainDB" commit is ZOOKEEPER-2845, right? 
You're suggesting that it be reverted?



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-12-08 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284068#comment-16284068
 ] 

Fangmin Lv commented on ZOOKEEPER-2845:
---

Can someone help add my teammate jtuple as a contributor, so I can assign 
the task to him?



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-12-08 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284066#comment-16284066
 ] 

Fangmin Lv commented on ZOOKEEPER-2845:
---

[~davelatham] our internal patch is based on the 3.6 branch, and we found it 
amplified the issue reported in ZOOKEEPER-2926; in production we need to 
disable the local session feature to mitigate that. Also, we haven't ported 
and tested the diff on 3.4 yet, so we're not confident enough to get it out 
yet. Instead, I would suggest reverting the existing broken retainDB commit 
to unblock the next release. 

I have made a patch for ZOOKEEPER-2926 and will update it there. I'll assign 
this Jira to my teammate Joseph to follow up; he is the owner of our internal 
retainDB feature.





[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-12-08 Thread Dave Latham (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283999#comment-16283999
 ] 

Dave Latham commented on ZOOKEEPER-2845:


Any updates here? We were considering upgrading our ZooKeeper, but don't 
want to go to a release with a known data inconsistency problem.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154246#comment-16154246
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user lvfangmin commented on the issue:

https://github.com/apache/zookeeper/pull/310
  
@revans2 my teammate was working on the fix, and he was planning to run it 
in prod for a while before sending out the diff. I'll sync with him today 
about the status. 




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153628#comment-16153628
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user revans2 commented on the issue:

https://github.com/apache/zookeeper/pull/310
  
@lvfangmin any update on getting a pull request for the actual fix?




[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-08-21 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136242#comment-16136242
 ] 

Michael Han commented on ZOOKEEPER-2845:


Thanks for the update, [~lvfangmin]. Good to know the patch has been tested 
in a production environment!



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-08-21 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16135559#comment-16135559
 ] 

Fangmin Lv commented on ZOOKEEPER-2845:
---

The internal patch has stabilized and has been tested for a long time; we 
rolled it out to one of our production environments last week. Joseph from 
our team will attach the patch here for review this week.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-08-13 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125267#comment-16125267
 ] 

Fangmin Lv commented on ZOOKEEPER-2845:
---

[~hanm] we've just finished the RetainDB work and started testing it in our 
internal ensemble; we might submit the code for review next week. 



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-08-12 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124813#comment-16124813
 ] 

Michael Han commented on ZOOKEEPER-2845:


[~lvfangmin] Any plan to submit your retain db implementation? This is an 
important bug to fix.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-07-14 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088464#comment-16088464
 ] 

Michael Han commented on ZOOKEEPER-2845:


Makes sense to me. I think we didn't have this issue previously because the 
{{zkDb}} was cleared across leader election, and if we restart C it will 
recover from both the snap and the txn log, so it will find that its 
{{lastProcessedZxid}} is T1, rather than T0, which will yield a TRUNC instead 
of a DIFF from leader B. 
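
The DIFF-versus-TRUNC distinction above can be sketched as a simplified decision function. This is an illustration only, not the real API: the actual logic lives in LearnerHandler and also considers the leader's committed-log window and snapshot transfer; the names below are invented for the sketch.

```java
// Hedged sketch of the leader-side sync decision (assumption: simplified;
// not ZooKeeper's real LearnerHandler logic).
public class SyncDecision {
    static String syncMode(long followerLastZxid, long leaderLastZxid,
                           long minCommittedLog) {
        if (followerLastZxid == leaderLastZxid) return "DIFF(empty)";
        if (followerLastZxid > leaderLastZxid)  return "TRUNC";
        if (followerLastZxid >= minCommittedLog) return "DIFF";
        return "SNAP";
    }

    public static void main(String[] args) {
        final long T0 = 1L, T1 = 2L;
        // Old behavior: C reloads from disk, reports T1 > T0 -> TRUNC drops T1.
        System.out.println(syncMode(T1, T0, 0L));
        // Retained DB: C reports T0 == T0 -> empty DIFF; the stale T1 stays
        // in C's on-disk log and is resurrected on the next restart.
        System.out.println(syncMode(T0, T0, 0L));
    }
}
```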



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-07-14 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088306#comment-16088306
 ] 

Fangmin Lv commented on ZOOKEEPER-2845:
---

[~hanm] T1 only exists in the txn file and hasn't been applied to the data tree 
yet, so the lastProcessedZxid on follower C is T0 and no TRUNC message is sent 
when syncing with the leader.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-07-14 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088299#comment-16088299
 ] 

Michael Han commented on ZOOKEEPER-2845:


Thanks for reporting this issue [~lvfangmin].

bq. C changed to looking state due to no enough followers, it will sync with 
leader B with last Zxid T0, which will have an empty diff sync

Are you saying leader B is sending a DIFF to follower C in this case? Since B 
does not have T1, I think it should send a TRUNC and C should drop T1 in its 
txn log.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-07-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088296#comment-16088296
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

Github user lvfangmin commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/310#discussion_r127567852
  
--- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
@@ -784,4 +784,126 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
 maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
 }
 
+@Test
+public void testTxnAheadSnapInRetainDB() throws Exception {
+    // 1. start up the servers and wait for leader election to finish
+    ClientBase.setupTestEnv();
+    final int SERVER_COUNT = 3;
+    final int[] clientPorts = new int[SERVER_COUNT];
+    StringBuilder sb = new StringBuilder();
+    for (int i = 0; i < SERVER_COUNT; i++) {
+        clientPorts[i] = PortAssignment.unique();
+        sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique()
+                + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
+    }
+    String quorumCfgSection = sb.toString();
+
+    MainThread[] mt = new MainThread[SERVER_COUNT];
+    ZooKeeper[] zk = new ZooKeeper[SERVER_COUNT];
+    for (int i = 0; i < SERVER_COUNT; i++) {
+        mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
+        mt[i].start();
+        zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i],
+                ClientBase.CONNECTION_TIMEOUT, this);
+    }
+
+    waitForAll(zk, States.CONNECTED);
+
+    // we need to shut down and start back up to make sure that the create
+    // session isn't the first transaction, since that is rather innocuous
+    for (int i = 0; i < SERVER_COUNT; i++) {
+        mt[i].shutdown();
+    }
+
+    waitForAll(zk, States.CONNECTING);
+
+    for (int i = 0; i < SERVER_COUNT; i++) {
+        mt[i].start();
+        // recreate a client session since the previous session was not persisted
+        zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i],
+                ClientBase.CONNECTION_TIMEOUT, this);
+    }
+
+    waitForAll(zk, States.CONNECTED);
+
+    // 2. kill all followers
+    int leader = -1;
+    Map outstanding = null;
+    for (int i = 0; i < SERVER_COUNT; i++) {
+        if (mt[i].main.quorumPeer.leader != null) {
+            leader = i;
+            outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
+            // increase the tick time to delay the leader going to looking
+            mt[leader].main.quorumPeer.tickTime = 1;
+        }
+    }
+
+    for (int i = 0; i < SERVER_COUNT; i++) {
+        if (i != leader) {
+            mt[i].shutdown();
+        }
+    }
+
+    // 3. start up the followers to form a new quorum
+    for (int i = 0; i < SERVER_COUNT; i++) {
+        if (i != leader) {
+            mt[i].start();
+        }
+    }
+
+    // 4. wait for one of the followers to become the leader
+    for (int i = 0; i < SERVER_COUNT; i++) {
+        if (i != leader) {
+            // recreate a client session since the previous session was not persisted
+            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i],
+                    ClientBase.CONNECTION_TIMEOUT, this);
+            waitForOne(zk[i], States.CONNECTED);
+        }
+    }
+
+    // 5. send a create request to the old leader and make sure it's synced
+    //    to disk, which means it acked from itself
+    try {
+        zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                CreateMode.PERSISTENT);
+        Assert.fail("create /zk" + leader + " should have failed");
+    } catch (KeeperException e) {}
+
+    // just make sure that we actually did get it in process at the leader
+    Assert.assertTrue(outstanding.size() == 1);
+    Proposal p = (Proposal) outstanding.values().iterator().next();
+    Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
+
+    // make sure it has a chance to write it to disk
+    Thread.sleep(1000);
+    p.qvAcksetPairs.get(0).getAckset().contains(leader);
+
+    // 6. wait for the leader to quit due to not enough followers
+    waitForOne(zk[leader], States.CONNECTING);
+
+    int newLeader = -1;

[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088293#comment-16088293
 ] 

Hadoop QA commented on ZOOKEEPER-2845:
--

-1 overall.  GitHub Pull Request Build

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

2017-07-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088277#comment-16088277
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2845:
---

GitHub user lvfangmin opened a pull request:

https://github.com/apache/zookeeper/pull/310

[ZOOKEEPER-2845][Test] Test used to reproduce the data inconsistency issue 
due to retain database in leader election



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lvfangmin/zookeeper ZOOKEEPER-2845-TEST

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/310.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #310


commit ff0bc49de51635da1d5bff0e4f260a61acc87db0
Author: Fangmin Lyu 
Date:   2017-07-14T23:02:20Z

reproduce the data inconsistency issue



