We discussed with Pat offline and agreed to go without this patch, especially since we need to patch 3 branches: 3.4, 3.5 and master. We'll prepare 3.5 and master and then commit all 3 together in time for the next release. So Abe, please go ahead with your release.
Alex

On Fri, Apr 13, 2018 at 2:26 PM, Patrick Hunt <ph...@apache.org> wrote:

> Hey folks. I've been on vacation. My 0.02 - given the release candidate is well underway, has sufficient votes/time to finalize, this is not a regression in 3.4.12 and it's not yet committed, I would think we finalize/push 3.4.12 then quickly follow up with a 3.4.13 that addresses this. Alex could be the RM given his interest/advocacy.
>
> Regards,
>
> Patrick
>
> On Fri, Apr 13, 2018 at 11:55 AM, Abraham Fine <af...@apache.org> wrote:
>
> > Given that the primary driver of this release is to fix an issue with the misuse of dataDir and dataLogDir, I would rather see this release make it out the door with minimal additional changes to core functionality so people can more confidently upgrade.
> >
> > What do you think, Pat?
> >
> > Abe
> >
> > On Fri, Apr 13, 2018, at 11:37, Alexander Shraer wrote:
> > > Now that we have the fix, why delay it to the next release?
> > >
> > > On Fri, Apr 13, 2018 at 11:09 AM Abraham Fine <af...@apache.org> wrote:
> > >
> > > > Let's wait until the next release to include this fix.
> > > >
> > > > On Mon, Apr 9, 2018, at 15:14, Alexander Shraer wrote:
> > > > > Hi,
> > > > >
> > > > > Please take a look at the new PR for ZK-2959:
> > > > > https://github.com/apache/zookeeper/pull/500
> > > > > If there are no further comments, I can commit it.
> > > > >
> > > > > Thanks,
> > > > > Alex
> > > > >
> > > > > On Fri, Apr 6, 2018 at 11:33 AM, Alexander Shraer <shra...@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > The bug described in ZOOKEEPER-2959 <https://issues.apache.org/jira/browse/ZOOKEEPER-2959> is that getEpochToPropose and waitForEpochAck do not distinguish between followers and observers. This can cause a candidate leader's acceptedEpoch to be updated with only support from observers.
> > > > > > Same for waitForEpochAck - passing this method allows the candidate leader to update the currentEpoch. The latter helps this server to win FLE elections continuously, and the former (acceptedEpoch) causes anyone trying to connect to the server to think that it has more up-to-date data and truncate their logs to match.
> > > > > >
> > > > > > Alex
> > > > > >
> > > > > > On Fri, Apr 6, 2018 at 10:04 AM, Fangmin Lv <lvfang...@gmail.com> wrote:
> > > > > >
> > > > > >> Hi Alex,
> > > > > >>
> > > > > >> Can you give more details about the data loss scenario in Jira ZOOKEEPER-2959 <https://issues.apache.org/jira/browse/ZOOKEEPER-2959>?
> > > > > >> As far as I know, the leader will ignore the observers' ACK in waitForNewLeaderAck, so it will not start serving traffic until it has received the actual quorum ACK. If it doesn't have enough follower support before the timeout, it will quit leading and its learners will re-sync with the new leader.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Fangmin
> > > > > >>
> > > > > >> On Thu, Apr 5, 2018 at 12:57 PM, Alexander Shraer <shra...@gmail.com> wrote:
> > > > > >>
> > > > > >>> Btw we actually observed the described issue (data loss), thankfully in a test environment. So I thought this is important to share with the community.
> > > > > >>>
> > > > > >>> Unfortunately I don’t have time to run a new ZK release for this, so I’m not going to -1 your candidate, but we are actively working on a fix (i.e. a test at this point) and I can commit that as soon as we have that.
> > > > > >>>
> > > > > >>> It may be worthwhile to delay the release by a few more days, but it’s totally up to you since you’re running it.
> > > > > >>>
> > > > > >>> Cheers
> > > > > >>> Alex
> > > > > >>>
> > > > > >>> On Thu, Apr 5, 2018 at 12:47 PM Andor Molnar <an...@cloudera.com> wrote:
> > > > > >>>
> > > > > >>> > Got that. I still believe it's a completely valid issue which has to be addressed, but it's not a showstopper. I'm afraid we're not going to convince each other, so it's probably Abe's call if he wants to create another release candidate for the fix.
> > > > > >>> >
> > > > > >>> > I reviewed the code on github and I think it just needs to be covered with a unit test to be complete.
> > > > > >>> >
> > > > > >>> > Regards,
> > > > > >>> > Andor
> > > > > >>> >
> > > > > >>> > On Thu, Apr 5, 2018 at 9:05 PM, Alexander Shraer <shra...@gmail.com> wrote:
> > > > > >>> >
> > > > > >>> > > Yes, sort of: FLE is finished, then enough observers' messages reach the leader before participants' messages do.
> > > > > >>> > > Whether it's rare depends on the number of observers and participants. For example, with very few participants and many observers your chances of hitting this are quite high.
> > > > > >>> > >
> > > > > >>> > > Alex
> > > > > >>> > >
> > > > > >>> > > On Thu, Apr 5, 2018 at 11:44 AM, Andor Molnar <an...@cloudera.com> wrote:
> > > > > >>> > >
> > > > > >>> > > > Maybe I'm missing something here, but this looks like a rare edge case to me.
> > > > > >>> > > > Participants must finish the leader election successfully, and right after that enough followers must fail to send their epoch to the leader so that observers can take over.
> > > > > >>> > > >
> > > > > >>> > > > Is that description accurate?
> > > > > >>> > > >
> > > > > >>> > > > Andor
> > > > > >>> > > >
> > > > > >>> > > > On Thu, Apr 5, 2018 at 7:35 PM, Alexander Shraer <shra...@gmail.com> wrote:
> > > > > >>> > > >
> > > > > >>> > > > > To clarify - in a deployment with observers this bug can potentially cause data loss. A server could be elected leader based just on the support of observers, even if this server's data is stale with respect to other followers.
> > > > > >>> > > > >
> > > > > >>> > > > > It is certainly a blocker, just not sure if for 3.4.11 or 3.4.12.
> > > > > >>> > > > >
> > > > > >>> > > > > Alex
> > > > > >>> > > > >
> > > > > >>> > > > > On Thu, Apr 5, 2018 at 10:29 AM Andor Molnar <an...@cloudera.com> wrote:
> > > > > >>> > > > >
> > > > > >>> > > > > > I don't think it's a blocker.
> > > > > >>> > > > > > The jira and PR have been open since last December and 3.4.11 was released without it.
> > > > > >>> > > > > >
> > > > > >>> > > > > > Although this bug is also important to fix, I believe it's more important to release a fix for the regression we've found in 3.4.11 asap.
> > > > > >>> > > > > >
> > > > > >>> > > > > > Abe, any thoughts?
> > > > > >>> > > > > >
> > > > > >>> > > > > > Regards,
> > > > > >>> > > > > > Andor
> > > > > >>> > > > > >
> > > > > >>> > > > > > On Thu, Apr 5, 2018 at 7:00 PM, Alexander Shraer <shra...@gmail.com> wrote:
> > > > > >>> > > > > >
> > > > > >>> > > > > > > Sorry for coming in at the last moment. I'm not sure when the next 3.4 release is scheduled, so just wanted to mention this bug, which I believe is a blocker for either this or the next release:
> > > > > >>> > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-2959
> > > > > >>> > > > > > >
> > > > > >>> > > > > > > Best,
> > > > > >>> > > > > > > Alex
> > > > > >>> > > > > > >
> > > > > >>> > > > > > > On Thu, Apr 5, 2018 at 9:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > >>> > > > > > >
> > > > > >>> > > > > > > > Can the vote be closed?
> > > > > >>> > > > > > > >
> > > > > >>> > > > > > > > It seems we have enough +1's.
> > > > > >>> > > > > > > >
> > > > > >>> > > > > > > > Thanks
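
For readers following the thread, the failure mode described for ZOOKEEPER-2959 (a candidate leader treating observers as if they were followers when it waits for support on a new epoch) can be illustrated with a toy model. This is only a sketch under stated assumptions, not ZooKeeper's Java code and not the change in PR 500: the class `EpochQuorumTracker`, its methods, and the `count_observers` flag are hypothetical names invented for illustration, and the "buggy" branch assumes a majority check that looks only at how many servers have connected.

```python
class EpochQuorumTracker:
    """Toy model (hypothetical, not ZooKeeper code) of a candidate leader
    collecting support from connecting servers before it adopts a new epoch."""

    def __init__(self, voters, count_observers=False):
        # voters: ids of the voting participants (the leader candidate included)
        self.voters = frozenset(voters)
        # count_observers=True models the behaviour described in the thread,
        # where followers and observers are not distinguished
        self.count_observers = count_observers
        self.connected = set()

    def register(self, server_id):
        """Record that a server (voter or observer) has connected."""
        if self.count_observers or server_id in self.voters:
            self.connected.add(server_id)

    def has_quorum(self):
        """Return True once enough support has been collected."""
        majority = len(self.voters) // 2
        if self.count_observers:
            # Assumed buggy check: only the number of connected servers matters
            return len(self.connected) > majority
        # Assumed corrected check: only voting members count toward the majority
        return len(self.connected & self.voters) > majority
```

With voters `{1, 2, 3}` and observers `4` and `5`, registering servers `1`, `4` and `5` satisfies the size-only check even though server `1` is the only voter present, which matches the scenario Alex describes where few participants and many observers make the problem likely. The voter-only check keeps waiting until a second voter such as `2` connects.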