Let's wait until the next release to include this fix. 

On Mon, Apr 9, 2018, at 15:14, Alexander Shraer wrote:
> Hi,
> 
> Please take a look at the new PR for ZK-2959:
> https://github.com/apache/zookeeper/pull/500
> If there are no further comments, I can commit it.
> 
> Thanks,
> Alex
> 
> On Fri, Apr 6, 2018 at 11:33 AM, Alexander Shraer <[email protected]> wrote:
> 
> > Hi,
> >
> > The bug described in ZOOKEEPER-2959
> > <https://issues.apache.org/jira/browse/ZOOKEEPER-2959> is that
> > getEpochToPropose and waitForEpochAck do not distinguish between followers
> > and observers.
> > This can cause a candidate leader's acceptedEpoch to be updated with
> > support from observers only. The same holds for waitForEpochAck: passing
> > this method allows the candidate leader to update the currentEpoch. The
> > latter helps this server keep winning FLE elections, and the former
> > (acceptedEpoch) causes anyone trying to connect to the server to think that
> > it has more up-to-date data and truncate their logs to match.
> >
> >
> > Alex
> >
> > On Fri, Apr 6, 2018 at 10:04 AM, Fangmin Lv <[email protected]> wrote:
> >
> >> Hi Alex,
> >>
> >> Can you give more details about the data loss scenario in Jira
> >> ZOOKEEPER-2959 <https://issues.apache.org/jira/browse/ZOOKEEPER-2959>?
> >> As far as I know, the leader ignores the observers' ACKs in
> >> waitForNewLeaderAck, so it will not start serving traffic until it has
> >> received an ACK from an actual quorum. If it doesn't have enough follower
> >> support before the timeout, it will quit leading and its learners will
> >> re-sync with the new leader.
> >>
> >> Thanks,
> >> Fangmin
> >>
> >> On Thu, Apr 5, 2018 at 12:57 PM, Alexander Shraer <[email protected]>
> >> wrote:
> >>
> >>> Btw we actually observed the described issue (data loss), thankfully in a
> >>> test environment. So I thought this is important to share with the
> >>> community.
> >>>
> >>> Unfortunately I don’t have time to run a new ZK release for this, so I’m
> >>> not going to -1 your candidate, but we are actively working on a fix (i.e.
> >>> a test at this point) and I can commit that as soon as we have it.
> >>>
> >>> It may be worthwhile to delay the release by a few more days, but it’s
> >>> totally up to you since you’re running it.
> >>>
> >>> Cheers
> >>> Alex
> >>> On Thu, Apr 5, 2018 at 12:47 PM Andor Molnar <[email protected]> wrote:
> >>>
> >>> > Got that. I still believe it's a completely valid issue which has to be
> >>> > addressed, but it's not a showstopper. I'm afraid we're not going to
> >>> > convince each other, so it's probably Abe's call if he wants to create
> >>> > another release candidate for the fix.
> >>> >
> >>> > I reviewed the code on github and I think it just needs to be covered with
> >>> > a unit test to be complete.
> >>> >
> >>> > Regards,
> >>> > Andor
> >>> >
> >>> >
> >>> >
> >>> > On Thu, Apr 5, 2018 at 9:05 PM, Alexander Shraer <[email protected]>
> >>> > wrote:
> >>> >
> >>> > > Yes, sort of: FLE is finished, then enough observers' messages reach the
> >>> > > leader before the participants' messages do.
> >>> > > Whether it's rare depends on the number of observers and participants. For
> >>> > > example, with very few participants and many observers,
> >>> > > your chances of hitting this are quite high.
> >>> > >
> >>> > > Alex
> >>> > >
> >>> > > On Thu, Apr 5, 2018 at 11:44 AM, Andor Molnar <[email protected]>
> >>> > > wrote:
> >>> > >
> >>> > > > Maybe I'm missing something here, but this looks like a rare edge case to
> >>> > > > me. Participants must finish the leader election successfully, and right
> >>> > > > after that enough followers must fail to send their epoch to the leader so
> >>> > > > that observers can take over.
> >>> > > >
> >>> > > > Is that description accurate?
> >>> > > >
> >>> > > > Andor
> >>> > > >
> >>> > > >
> >>> > > > On Thu, Apr 5, 2018 at 7:35 PM, Alexander Shraer <[email protected]>
> >>> > > > wrote:
> >>> > > >
> >>> > > > > To clarify - in a deployment with observers this bug can potentially cause
> >>> > > > > data loss. A server could be elected leader based just on the support of
> >>> > > > > observers, even if this server's data is stale with respect to other
> >>> > > > > followers.
> >>> > > > >
> >>> > > > > It is certainly a blocker, just not sure if for 3.4.11 or 3.4.12.
> >>> > > > >
> >>> > > > >
> >>> > > > > Alex
> >>> > > > > On Thu, Apr 5, 2018 at 10:29 AM Andor Molnar <[email protected]>
> >>> > > > > wrote:
> >>> > > > >
> >>> > > > > > I don't think it's a blocker.
> >>> > > > > > The jira and PR have been open since last December and 3.4.11 was released
> >>> > > > > > without it.
> >>> > > > > >
> >>> > > > > > Although this bug is also important to fix, I believe it's more important
> >>> > > > > > to release a fix for the regression we've found in 3.4.11 asap.
> >>> > > > > >
> >>> > > > > > Abe, any thoughts?
> >>> > > > > >
> >>> > > > > > Regards,
> >>> > > > > > Andor
> >>> > > > > >
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > On Thu, Apr 5, 2018 at 7:00 PM, Alexander Shraer <[email protected]>
> >>> > > > > > wrote:
> >>> > > > > >
> >>> > > > > > > Sorry for coming in at the last moment. I'm not sure when the next 3.4
> >>> > > > > > > release is scheduled, so just wanted to mention this bug,
> >>> > > > > > > which I believe is a blocker for either this or next release:
> >>> > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-2959
> >>> > > > > > >
> >>> > > > > > > Best,
> >>> > > > > > > Alex
> >>> > > > > > >
> >>> > > > > > > On Thu, Apr 5, 2018 at 9:09 AM, Ted Yu <[email protected]>
> >>> > > > > > > wrote:
> >>> > > > > > >
> >>> > > > > > > > Can the vote be closed ?
> >>> > > > > > > >
> >>> > > > > > > > It seems we have enough +1's
> >>> > > > > > > >
> >>> > > > > > > > Thanks
> >>> > > > > > > >
> >>> > > > > > >
> >>> > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
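
For readers following the quoted discussion of ZOOKEEPER-2959, here is a
small Python model of the quorum check at issue. It is only a sketch and is
not ZooKeeper's Java code: the function name has_epoch_quorum, its
parameters, and the server ids are all made up for illustration. The
assumption it encodes is the one stated in the thread: the epoch check is
unsafe if observers are counted as supporters, and safe if only voting
participants are counted.

```python
def has_epoch_quorum(connected, voters, leader_id, count_observers):
    """Toy model of the epoch quorum check (hypothetical, not ZooKeeper's API).

    connected        -- ids of servers that have sent their epoch to the candidate
    voters           -- ids of voting participants (observers are not in this set)
    leader_id        -- id of the candidate leader itself
    count_observers  -- True models the bug from the thread, False the intended check
    """
    connected = set(connected)
    voters = set(voters)
    if count_observers:
        # Buggy variant: every connected learner counts as support.
        supporters = connected
    else:
        # Intended variant: only voting participants count as support.
        supporters = connected & voters
    # The candidate must be included and have a strict majority of the voters.
    return leader_id in supporters and len(supporters) > len(voters) // 2


voters = {1, 2, 3}          # participants
observers = {4, 5}          # non-voting learners
# The candidate (1) has heard only from itself and the two observers.
connected = {1} | observers

buggy = has_epoch_quorum(connected, voters, leader_id=1, count_observers=True)
fixed = has_epoch_quorum(connected, voters, leader_id=1, count_observers=False)
print(buggy, fixed)
```

In this model the candidate passes the check with observer support alone
when observers are counted, and fails it when they are not, which matches
the scenario Alex describes with few participants and many observers.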
