[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695420#comment-16695420
 ] 

Michael K. Edwards commented on ZOOKEEPER-2846:
---

Does this need to be addressed (or release noted) for 3.5.5?

> Leader follower sync with on disk txns can possibly leads to data 
> inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Priority: Critical
>
> On disk txn sync could cause data inconsistency if the current leader just 
> had a snap sync before it became leader, and then having diff sync with its 
> followers may synced the txns gap on disk. Here is scenario: 
> Let's say S0 - S3 are followers, and S4 is leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum to let S3 have a snap sync 
> with S4 when it started up
> 3. Stop S4 and S3 became the new leader
> 4. Start S2 and had a diff sync with S3, now there are gaps in S2
> Attached the test case to verify the issue. Currently, there is no efficient 
> way to check the gap in txn files is a real gap or due to Epoch change. We 
> need to add that support, but before that, it would be safer to disable the 
> on disk txn leader-follower sync.
> Another two scenarios which could cause the same issue:
> (Scenario 1) Server A, B, C, A is leader, the others are followers:
>   1). A synced to disk, but the other 2 restarted before receiving the 
> proposal
>   2). B and C formed quorum, B is leader, and committed some requests
>   3). A looking again, and sync with B, B won't able to trunc A but send snap 
> instead, and leaves the extra txn in A's txn file
>   4). A became new leader, and someone else has a diff sync with A it will 
> have the extra txn 
> (Scenario 2) Diff sync with committed txn, will only apply to data tree but 
> not on disk txn file, which will also leave hole in it and lead to data 
> inconsistency issue when syncing with learners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency

2018-07-09 Thread Fangmin Lv (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537668#comment-16537668
 ] 

Fangmin Lv commented on ZOOKEEPER-2846:
---

There is no efficient way to detect the gap, internally, we worked around the 
issue by adding a gap file to indicate the gap during snap sync.

In the long term, we have finished adding a real time consistency check during 
replay txns and syncing, it's being baked on some of our prod environment, will 
open a Jira for discussing the details.

> Leader follower sync with on disk txns can possibly leads to data 
> inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Priority: Critical
>
> On disk txn sync could cause data inconsistency if the current leader just 
> had a snap sync before it became leader, and then having diff sync with its 
> followers may synced the txns gap on disk. Here is scenario: 
> Let's say S0 - S3 are followers, and S4 is leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum to let S3 have a snap sync 
> with S4 when it started up
> 3. Stop S4 and S3 became the new leader
> 4. Start S2 and had a diff sync with S3, now there are gaps in S2
> Attached the test case to verify the issue. Currently, there is no efficient 
> way to check the gap in txn files is a real gap or due to Epoch change. We 
> need to add that support, but before that, it would be safer to disable the 
> on disk txn leader-follower sync.
> Another two scenarios which could cause the same issue:
> (Scenario 1) Server A, B, C, A is leader, the others are followers:
>   1). A synced to disk, but the other 2 restarted before receiving the 
> proposal
>   2). B and C formed quorum, B is leader, and committed some requests
>   3). A looking again, and sync with B, B won't able to trunc A but send snap 
> instead, and leaves the extra txn in A's txn file
>   4). A became new leader, and someone else has a diff sync with A it will 
> have the extra txn 
> (Scenario 2) Diff sync with committed txn, will only apply to data tree but 
> not on disk txn file, which will also leave hole in it and lead to data 
> inconsistency issue when syncing with learners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency

2017-08-14 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125281#comment-16125281
 ] 

Fangmin Lv commented on ZOOKEEPER-2846:
---

[~hanm] The challenge here is that we don't know there is txn missing or it's 
due to the Epoch change. We need a way to verify the zxid continuous, we're 
having an intern project to verify the txns integrity, but that won't be 
available in the near time, my suggestion is turning off the on disk txn sync 
for now. 

> Leader follower sync with on disk txns can possibly leads to data 
> inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Priority: Critical
>
> On disk txn sync could cause data inconsistency if the current leader just 
> had a snap sync before it became leader, and then having diff sync with its 
> followers may synced the txns gap on disk. Here is scenario: 
> Let's say S0 - S3 are followers, and S4 is leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum to let S3 have a snap sync 
> with S4 when it started up
> 3. Stop S4 and S3 became the new leader
> 4. Start S2 and had a diff sync with S3, now there are gaps in S2
> Attached the test case to verify the issue. Currently, there is no efficient 
> way to check the gap in txn files is a real gap or due to Epoch change. We 
> need to add that support, but before that, it would be safer to disable the 
> on disk txn leader-follower sync.
> Another two scenarios which could cause the same issue:
> (Scenario 1) Server A, B, C, A is leader, the others are followers:
>   1). A synced to disk, but the other 2 restarted before receiving the 
> proposal
>   2). B and C formed quorum, B is leader, and committed some requests
>   3). A looking again, and sync with B, B won't able to trunc A but send snap 
> instead, and leaves the extra txn in A's txn file
>   4). A became new leader, and someone else has a diff sync with A it will 
> have the extra txn 
> (Scenario 2) Diff sync with committed txn, will only apply to data tree but 
> not on disk txn file, which will also leave hole in it and lead to data 
> inconsistency issue when syncing with learners.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency

2017-08-12 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124812#comment-16124812
 ] 

Michael Han commented on ZOOKEEPER-2846:


[~lvfangmin] Do you have a plan to upload a patch that fixes the issue? The 
existing pull request  contains a test only.

> Leader follower sync with on disk txns can possibly leads to data 
> inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Priority: Critical
>
> On disk txn sync could cause data inconsistency if the current leader just 
> had a snap sync before it became leader, and then having diff sync with its 
> followers may synced the txns gap on disk. Here is scenario: 
> Let's say S0 - S3 are followers, and S4 is leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum to let S3 have a snap sync 
> with S4 when it started up
> 3. Stop S4 and S3 became the new leader
> 4. Start S2 and had a diff sync with S3, now there are gaps in S2
> Attached the test case to verify the issue. Currently, there is no efficient 
> way to check the gap in txn files is a real gap or due to Epoch change. We 
> need to add that support, but before that, it would be safer to disable the 
> on disk txn leader-follower sync.
> Another two scenarios which could cause the same issue:
> (Scenario 1) Server A, B, C, A is leader, the others are followers:
>   1). A synced to disk, but the other 2 restarted before receiving the 
> proposal
>   2). B and C formed quorum, B is leader, and committed some requests
>   3). A looking again, and sync with B, B won't able to trunc A but send snap 
> instead, and leaves the extra txn in A's txn file
>   4). A became new leader, and someone else has a diff sync with A it will 
> have the extra txn 
> (Scenario 2) Diff sync with committed txn, will only apply to data tree but 
> not on disk txn file, which will also leave hole in it and lead to data 
> inconsistency issue when syncing with learners.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency

2017-08-02 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112201#comment-16112201
 ] 

Fangmin Lv commented on ZOOKEEPER-2846:
---

[~hanm] ZOOKEEPER-2099 is a subset of the issue I reported in this Jira, the 
bug of this Jira is introduced by ZOOKEEPER-1413, which is using on disk txn 
sync to reduce sync time.  

The issue I reported in ZOOKEEPER-2845 is a totally different one, it's 
introduced in ZOOKEEPER-2678, which is used to reduce unavailable time during 
leader election by retain the ZKDatabase, but it could lead to data 
inconsistency.

> Leader follower sync with on disk txns can possibly leads to data 
> inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Priority: Critical
>
> On disk txn sync could cause data inconsistency if the current leader just 
> had a snap sync before it became leader, and then having diff sync with its 
> followers may synced the txns gap on disk. Here is scenario: 
> Let's say S0 - S3 are followers, and S4 is leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum to let S3 have a snap sync 
> with S4 when it started up
> 3. Stop S4 and S3 became the new leader
> 4. Start S2 and had a diff sync with S3, now there are gaps in S2
> Attached the test case to verify the issue. Currently, there is no efficient 
> way to check the gap in txn files is a real gap or due to Epoch change. We 
> need to add that support, but before that, it would be safer to disable the 
> on disk txn leader-follower sync.
> Another two scenarios which could cause the same issue:
> (Scenario 1) Server A, B, C, A is leader, the others are followers:
>   1). A synced to disk, but the other 2 restarted before receiving the 
> proposal
>   2). B and C formed quorum, B is leader, and committed some requests
>   3). A looking again, and sync with B, B won't able to trunc A but send snap 
> instead, and leaves the extra txn in A's txn file
>   4). A became new leader, and someone else has a diff sync with A it will 
> have the extra txn 
> (Scenario 2) Diff sync with committed txn, will only apply to data tree but 
> not on disk txn file, which will also leave hole in it and lead to data 
> inconsistency issue when syncing with learners.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency

2017-08-02 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112177#comment-16112177
 ] 

Michael Han commented on ZOOKEEPER-2846:


ZOOKEEPER-2099 sounds similar to this?

Also, I am wondering what caused the bug. Is this bug existing since the 
feature's introduction by ZOOKEEPER-1413, or is it a regression caused by, for 
example ZOOKEEPER-2678 (which was mentioned in ZOOKEEPER-2845)?

> Leader follower sync with on disk txns can possibly leads to data 
> inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Priority: Critical
>
> On disk txn sync could cause data inconsistency if the current leader just 
> had a snap sync before it became leader, and then having diff sync with its 
> followers may synced the txns gap on disk. Here is scenario: 
> Let's say S0 - S3 are followers, and S4 is leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum to let S3 have a snap sync 
> with S4 when it started up
> 3. Stop S4 and S3 became the new leader
> 4. Start S2 and had a diff sync with S3, now there are gaps in S2
> Attached the test case to verify the issue. Currently, there is no efficient 
> way to check the gap in txn files is a real gap or due to Epoch change. We 
> need to add that support, but before that, it would be safer to disable the 
> on disk txn leader-follower sync.
> Another two scenarios which could cause the same issue:
> (Scenario 1) Server A, B, C, A is leader, the others are followers:
>   1). A synced to disk, but the other 2 restarted before receiving the 
> proposal
>   2). B and C formed quorum, B is leader, and committed some requests
>   3). A looking again, and sync with B, B won't able to trunc A but send snap 
> instead, and leaves the extra txn in A's txn file
>   4). A became new leader, and someone else has a diff sync with A it will 
> have the extra txn 
> (Scenario 2) Diff sync with committed txn, will only apply to data tree but 
> not on disk txn file, which will also leave hole in it and lead to data 
> inconsistency issue when syncing with learners.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency

2017-07-31 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107796#comment-16107796
 ] 

Fangmin Lv commented on ZOOKEEPER-2846:
---

There could be txn gap when using diff sync (which won't happen in our internal 
branch), since we don't log txns to txn file before taking snapshot, so this 
issue could happen more frequently in the open source version, any suggestion 
except disable the on disk txn sync for now?

> Leader follower sync with on disk txns can possibly leads to data 
> inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Priority: Critical
>
> On disk txn sync could cause data inconsistency if the current leader just 
> had a snap sync before it became leader, and then having diff sync with its 
> followers may synced the txns gap on disk. Here is scenario: 
> Let's say S0 - S3 are followers, and S4 is leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum to let S3 have a snap sync 
> with S4 when it started up
> 3. Stop S4 and S3 became the new leader
> 4. Start S2 and had a diff sync with S3, now there are gaps in S2
> Attached the test case to verify the issue. Currently, there is no efficient 
> way to check the gap in txn files is a real gap or due to Epoch change. We 
> need to add that support, but before that, it would be safer to disable the 
> on disk txn leader-follower sync.
> Another two scenarios which could cause the same issue:
> (Scenario 1) Server A, B, C, A is leader, the others are followers:
>   1). A synced to disk, but the other 2 restarted before receiving the 
> proposal
>   2). B and C formed quorum, B is leader, and committed some requests
>   3). A looking again, and sync with B, B won't able to trunc A but send snap 
> instead, and leaves the extra txn in A's txn file
>   4). A became new leader, and someone else has a diff sync with A it will 
> have the extra txn 
> (Scenario 2) Diff sync with committed txn, will only apply to data tree but 
> not on disk txn file, which will also leave hole in it and lead to data 
> inconsistency issue when syncing with learners.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency

2017-07-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092075#comment-16092075
 ] 

Hadoop QA commented on ZOOKEEPER-2846:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/887//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/887//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/887//console

This message is automatically generated.

> Leader follower sync with on disk txns can possibly leads to data 
> inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Priority: Critical
>
> On disk txn sync could cause data inconsistency if the current leader just 
> had a snap sync before it became leader, and then having diff sync with its 
> followers may synced the txns gap on disk. Here is scenario: 
> Let's say S0 - S3 are followers, and S4 is leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum to let S3 have a snap sync 
> with S4 when it started up
> 3. Stop S4 and S3 became the new leader
> 4. Start S2 and had a diff sync with S3, now there are gaps in S2
> Attached the test case to verify the issue. Currently, there is no efficient 
> way to check the gap in txn files is a real gap or due to Epoch change. We 
> need to add that support, but before that, it would be safer to disable the 
> on disk txn leader-follower sync.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency

2017-07-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092059#comment-16092059
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2846:
---

GitHub user lvfangmin opened a pull request:

https://github.com/apache/zookeeper/pull/314

[ZOOKEEPER-2846][Test] Leader follower sync with on disk txns can possibly 
leads to data inconsistency

This is only the test case used to reproduce the issue.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lvfangmin/zookeeper ZOOKEEPER-2846-TEST

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/314.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #314


commit c2a1ec8f989b9f799f5880a92730b75ef86164b9
Author: Fangmin Lyu 
Date:   2017-07-18T19:21:02Z

add test case to check data inconsistency issue when using on-disk txn sync




> Leader follower sync with on disk txns can possibly leads to data 
> inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Priority: Critical
>
> On disk txn sync could cause data inconsistency if the current leader just 
> had a snap sync before it became leader, and then having diff sync with its 
> followers may synced the txns gap on disk. Here is scenario: 
> Let's say S0 - S3 are followers, and S4 is leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum to let S3 have a snap sync 
> with S4 when it started up
> 3. Stop S4 and S3 became the new leader
> 4. Start S2 and had a diff sync with S3, now there are gaps in S2
> Attached the test case to verify the issue. Currently, there is no efficient 
> way to check the gap in txn files is a real gap or due to Epoch change. We 
> need to add that support, but before that, it would be safer to disable the 
> on disk txn leader-follower sync.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)