[
https://issues.apache.org/jira/browse/ZOOKEEPER-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790060#comment-13790060
]
Germán Blanco commented on ZOOKEEPER-1777:
------------------------------------------
I believe the truncation is not working because the divergence is never
detected. The servers can have histories that are compatible in terms of zxid
numbers and yet contain different transactions. For example:
1 - A, B and C form an ensemble and reach up to zxid 3.
2 - C is stopped. C was the leader. There is a new leader election and a new
epoch.
3 - A and B continue until transaction 6, epoch 2.
4 - A is stopped. B is stopped and loses all data.
5 - B and C are restarted and form an ensemble starting with zxid 3, epoch 1.
They build a different history up to zxid 9, epoch 2.
6 - A is restarted, joins the new ensemble, receives a DIFF (in epoch 2 from
zxid 7 to 9) and continues working.
7 - Transactions in A for epoch 2 from 4 to 6 are different from those in B and
C, but they have the same zxid.
Could that be it?
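
To make the scenario above concrete, here is a minimal sketch in Python. It is
not ZooKeeper code: Txn, sync_follower and the payload strings are hypothetical
names made up for illustration, and it keeps the simplified numbering used in
the steps above (a running zxid counter plus an epoch). It assumes a leader
that decides how to sync a follower only by checking whether the follower's
last (epoch, zxid) pair exists in its own log.

```python
from typing import List, NamedTuple, Tuple


class Txn(NamedTuple):
    """Hypothetical log entry: epoch, running zxid counter, and payload."""
    epoch: int
    counter: int
    payload: str


def zxids(log: List[Txn]) -> List[Tuple[int, int]]:
    """Return only the (epoch, counter) pairs, ignoring the payloads."""
    return [(t.epoch, t.counter) for t in log]


def sync_follower(leader_log: List[Txn],
                  follower_log: List[Txn]) -> Tuple[str, List[Txn]]:
    """Assumed naive sync: trust the follower if its last zxid is known."""
    last = zxids(follower_log)[-1]
    known = zxids(leader_log)
    if last in known:
        # The leader believes the histories match and sends only the tail.
        tail = leader_log[known.index(last) + 1:]
        return "DIFF", follower_log + tail
    # Otherwise fall back to replacing the follower's state entirely.
    return "SNAP", list(leader_log)


# Steps 1-3: shared history up to zxid 3, then A and B reach zxid 6 in epoch 2.
shared = [Txn(1, n, f"shared-{n}") for n in range(1, 4)]
a_log = shared + [Txn(2, n, f"ab-{n}") for n in range(4, 7)]

# Steps 4-5: B loses its data; B and C build a new history up to zxid 9.
bc_log = shared + [Txn(2, n, f"bc-{n}") for n in range(4, 10)]

# Step 6: A rejoins and is synced by the new leader.
mode, a_after = sync_follower(bc_log, a_log)

# Step 7: same zxids everywhere, but different transactions for 4 to 6.
divergent = [a.counter for a, b in zip(a_after, bc_log)
             if a.payload != b.payload]
print(mode, divergent)
```

Under these assumptions A receives a DIFF covering zxid 7 to 9 and ends up with
the same zxid sequence as B and C, while the entries for zxid 4 to 6 still
differ. Comparing zxids alone cannot reveal that, which is why no truncation
would be triggered.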
In my humble opinion, the fact that the algorithm doesn't guarantee correctness
in those scenarios doesn't mean that the working software shouldn't cover the
case and behave acceptably. For an academic demonstration of an algorithm that
would be perfectly ok; for a component running in a production environment it
is not.
> Missing ephemeral nodes in one of the members of the ensemble
> -------------------------------------------------------------
>
> Key: ZOOKEEPER-1777
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1777
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.4.5
> Environment: Linux, Java 1.7
> Reporter: Germán Blanco
> Assignee: Germán Blanco
> Priority: Critical
> Fix For: 3.4.6, 3.5.0
>
> Attachments: logs_trunk.tar.gz, snaps.tar, ZOOKEEPER-1777-3.4.patch,
> ZOOKEEPER-1777.patch, ZOOKEEPER-1777.patch, ZOOKEEPER-1777.tar.gz
>
>
> In a 3-server ensemble, one of the followers doesn't see some of the
> ephemeral nodes that are present in the leader and the other follower.
> The 8 missing nodes in "the follower that is not ok" were created at the end
> of epoch 1; the ensemble is running in epoch 2.