I can confirm this is a bug now that I have a unit test that reproduces the issue. I will submit a pull request soon. Let's move discussion of this topic to JIRA and GitHub.
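
A minimal toy model of the ordering constraint discussed in the quoted thread may help readers who are new to it. This is only a sketch under stated assumptions, not the ZooKeeper code and not the pending patch: `ToyFollower` and all of its methods are hypothetical names, and an in-memory list stands in for the on-disk transaction log. It shows why a follower that ACKs NEWLEADER before persisting its DIFF proposals can lose them on a crash.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: hypothetical names, not ZooKeeper's Learner/Follower classes. */
final class ToyFollower {
    // Stands in for the on-disk transaction log; survives a simulated crash.
    private final List<Long> durableLog = new ArrayList<>();
    // Proposals received during DIFF sync that are so far only in memory.
    private final List<Long> pendingDiff = new ArrayList<>();
    // true models the proposed ordering: persist first, then ACK NEWLEADER.
    private final boolean persistBeforeAck;

    ToyFollower(boolean persistBeforeAck) {
        this.persistBeforeAck = persistBeforeAck;
    }

    /** The leader sends one proposal as part of a DIFF sync. */
    void onDiffProposal(long zxid) {
        pendingDiff.add(zxid);
    }

    /** The leader sends NEWLEADER; when this returns, the ACK counts as sent. */
    void onNewLeaderAndAck() {
        if (persistBeforeAck) {
            durableLog.addAll(pendingDiff); // flush to the log before ACKing
            pendingDiff.clear();
        }
    }

    /** Simulated crash: everything that was only in memory is lost. */
    void crash() {
        pendingDiff.clear();
    }

    /** Highest zxid recoverable after a restart, or -1 if the log is empty. */
    long lastDurableZxid() {
        return durableLog.isEmpty() ? -1L : durableLog.get(durableLog.size() - 1);
    }

    public static void main(String[] args) {
        for (boolean persist : new boolean[] {false, true}) {
            ToyFollower f = new ToyFollower(persist);
            f.onDiffProposal(1L);
            f.onDiffProposal(2L);
            f.onNewLeaderAndAck();
            f.crash();
            System.out.println("persistBeforeAck=" + persist
                    + " lastDurableZxid=" + f.lastDurableZxid());
        }
    }
}
```

With `persistBeforeAck=false` the restarted follower recovers nothing from the sync, although the leader already counted its ACK toward the quorum. With `persistBeforeAck=true` it recovers zxid 2.
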
On Fri, Aug 28, 2020 at 10:02 PM li xun <274952...@qq.com> wrote:

> Hi hanm
>
> Thanks
>
> This is the issue in JIRA:
> https://issues.apache.org/jira/browse/ZOOKEEPER-3911
>
> -------------------------------------------------------
>
> Below are my thoughts.
>
> Before a server becomes the real leader, the followers need to synchronize data with it. With a large data set this is very slow, and the service is temporarily unavailable. Could the leader communicate with the followers before synchronization starts, calculate the maximum zxid_n [reference 1] among its proposals that have reached a quorum, and then immediately serve external clients, exposing only data with zxid <= zxid_n? (A webapp, for example, could then access the leader, which shortens the time that zk is inaccessible.) For the followers there may be two options:
> 1) Since the follower has not synchronized its data, external webapp access to it is temporarily not allowed, so even if the follower has a lot of data to synchronize, the external service provided by zk is not affected. Disadvantage: the access load is concentrated on the leader; during this time the cluster is not really distributed and is prone to a single point of failure.
> 2) The follower immediately serves external clients. But since it has not synchronized with the leader, a follower that has just restarted cannot confirm whether the largest zxid_x it currently holds has reached a quorum, and may need one additional inquiry to confirm whether zxid_x has reached a quorum.
> (Or keep a separate flag per zxid to indicate whether that zxid has reached a quorum.) The follower then serves external clients, exposing only data with zxid <= zxid_x.
> Disadvantages: complex implementation and increased communication volume.
>
> Reference 1: Leslie Lamport, "Paxos Made Simple", 01 Nov 2001
> "
> 2.3 Learning a Chosen Value
> To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors. The obvious algorithm is to have each acceptor, whenever it accepts a proposal, respond to all learners, sending them the proposal. This allows learners to find out about a chosen value as soon as possible, but it requires each acceptor to respond to each learner, a number of responses equal to the product of the number of acceptors and the number of learners.
> The assumption of non-Byzantine failures makes it easy for one learner to find out from another learner that a value has been accepted. We can have the acceptors respond with their acceptances to a distinguished learner, which in turn informs the other learners when a value has been chosen. This approach requires an extra round for all the learners to discover the chosen value. It is also less reliable, since the distinguished learner could fail. But it requires a number of responses equal only to the sum of the number of acceptors and the number of learners.
> More generally, the acceptors could respond with their acceptances to some set of distinguished learners, each of which can then inform all the learners when a value has been chosen. Using a larger set of distinguished learners provides greater reliability at the cost of greater communication complexity.
> Because of message loss, a value could be chosen with no learner ever finding out.
> The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.
> "
>
> Best,
> li xun
>
> On Aug 29, 2020, at 10:59, Michael Han <h...@apache.org> wrote:
>
> > Hi Xun,
> >
> > I think this is a bug; your test case looks sound to me. Do you mind creating a JIRA for this issue?
> >
> > Followers should not ACK NEWLEADER without ACKing every transaction from the DIFF sync. To ACK every transaction, a follower either persists the transaction in the log or takes a snapshot before sending the ACK of the NEWLEADER (which we did before ZOOKEEPER-2678, where the snapshot optimization was introduced).
> >
> > A potential fix I have in mind is to make sure to persist all DIFF sync proposals from the LEADER (similar to what we are already doing for proposals coming between NEWLEADER and UPTODATE). By doing so, when the leader receives the NEWLEADER ACK from a quorum, it is guaranteed that every transaction the leader DIFF synced to the follower is quorum committed. Thus there will not be inconsistent views moving forward. Alternatively, we can take a snapshot before ACKing NEWLEADER, but that will be a big performance hit for big data trees.
> >
> > I am also interested to hear what others think about this.
> >
> > On Fri, Aug 28, 2020 at 12:20 AM li xun <274952...@qq.com> wrote:
> >
> > > There is an example in the link; does it make clear what I mean?
> > > https://drive.google.com/file/d/1jy3kkVQTDYGb4iV1RaPMBbEWLZZltTQG/view?usp=sharing
> > >
> > > Since version 3.4, when synchronization completes, the quorum of followers and the leader do not immediately persist the synchronized data to files. At that point the zk server can already serve external clients, such as a webapp. If a failure occurs at this time, phantom reads may occur.
> > >
> > > On Aug 28, 2020, at 14:51, Justin Ling Mao <maoling199210...@sina.com> wrote:
> > >
> > > > @李珣 The situation you describe may rest on a misunderstanding of how the consensus protocol works:
> > > >
> > > > ---> Since the data of the follower when the follower uses the DIFF method to synchronize with the leader is still in the memory, it has not had time to persist
> > > > 1. The write path is: write to the transaction log (WAL) first, then, after reaching consensus, apply to memory, not the other way around.
> > > >
> > > > ---> but at this time, the latest zxid_n of the leader has not been supported by the quorum of the follower. At this time, if a client connects to the leader and sees zxid_n,
> > > > 2. If a write has not been supported by a quorum, it is not safe to apply it to the state machine, and the client is not able to see this write.
> > > >
> > > > I guess that your question may be: how does the system handle the uncommitted logs when the leader changes?
> > > >
> > > > ----- Original Message -----
> > > > From: Ted Dunning <ted.dunn...@gmail.com>
> > > > To: dev@zookeeper.apache.org
> > > > Subject: Re: May violate the ZAB agreement -- version 3.6.1
> > > > Date: 2020-08-28 01:25
> > > >
> > > > How is it that participant A would have a later zxid than the leader?
> > > > In particular, it seems to me that it should be impossible to have these two facts be true:
> > > > 1) a transaction has been committed with zxid = z_0. This implies that a quorum of the cluster has accepted this transaction and it has been committed.
> > > > 2) a new leader election nominates a leader with latest zxid < z_0.
> > > > My reasoning is that any new leader election has to involve a quorum, and at least a sufficient number of that quorum must have accepted zxid >= z_0 and therefore would refuse to be part of the quorum (this is a contradiction).
> > > > Thus, no leader could be elected with zxid < z_0 if fact (1) is true.
> > > > What you are describing seems to require both of these facts.
> > > > Perhaps I am missing something about your suggested scenario. Could you describe what you are thinking in more detail?
> > > >
> > > > On Thu, Aug 27, 2020 at 2:08 AM 李珣 <274952...@qq.com> wrote:
> > > >
> > > > > version 3.6.1
> > > > > org.apache.zookeeper.server.quorum.Learner.java line:605
> > > > > Suppose there is the following situation.
> > > > > zxid_n is the largest zxid of Participant A (the leader has just resumed from downtime). zxid_n has not been recognized by the quorum. Assume Participant A is elected as the leader and a follower uses DIFF to synchronize data with it. After sending UPTODATE, the leader can already serve external clients, but at this time the latest zxid_n of the leader has not been supported by a quorum of followers. Suppose a client now connects to the leader and sees zxid_n, and then both the leader and the follower go down. For some reason the leader cannot be restarted, while the follower starts normally, so a new leader can only be elected from the followers. Since the data the follower received through the DIFF sync was still in memory and had not yet been persisted, the newly elected leader does not have the data of zxid_n. But zxid_n has already been seen by the client on the old leader, so there will be inconsistencies in the data view.
> > > > > Is the above situation possible?
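
Ted's argument above relies on the fact that any two majority quorums share at least one server. The sketch below checks that property by brute force for small ensembles. It is illustrative only: `QuorumIntersection` and its methods are hypothetical names and not part of ZooKeeper. Note that the argument also assumes a server that has accepted a transaction still has it after a restart, which is the assumption the DIFF sync scenario in this thread calls into question.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative check only; hypothetical names, not part of ZooKeeper. */
final class QuorumIntersection {
    /** True if the bitmask selects a strict majority of n servers. */
    static boolean isMajority(int mask, int n) {
        return 2 * Integer.bitCount(mask) > n;
    }

    /** True if every pair of majority quorums over n servers shares a member. */
    static boolean allMajoritiesIntersect(int n) {
        List<Integer> quorums = new ArrayList<>();
        // Enumerate every subset of the n servers as a bitmask.
        for (int mask = 0; mask < (1 << n); mask++) {
            if (isMajority(mask, n)) {
                quorums.add(mask);
            }
        }
        for (int a : quorums) {
            for (int b : quorums) {
                if ((a & b) == 0) {
                    return false; // two disjoint majorities would break the argument
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 7; n++) {
            System.out.println("n=" + n + " allMajoritiesIntersect="
                    + allMajoritiesIntersect(n));
        }
    }
}
```

The intersection guarantees that some voter in the new election has accepted z_0, but that only helps if the voter persisted z_0 before acknowledging it.
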