Could we change the default and drop the old code at the same time? I don't see any benefit of letting that hang around.
I have not tested this code yet, but I hope to do it soon.

On Wed, Aug 24, 2016 at 5:19 AM, Erb, Stephan <stephan....@blue-yonder.com> wrote:

> The curator backend has been working well for us so far. I believe it is safe to make it the default for the next release, and to drop the old code in the release after that.
>
> *From: *John Sirois <jsir...@apache.org>
> *Reply-To: *"u...@aurora.apache.org" <u...@aurora.apache.org>, "jsir...@apache.org" <jsir...@apache.org>
> *Date: *Thursday 7 July 2016 at 01:13
> *To: *Martin Hrabovčin <martin.hrabov...@gmail.com>
> *Cc: *"dev@aurora.apache.org" <dev@aurora.apache.org>, Jake Farrell <jfarr...@apache.org>, "u...@aurora.apache.org" <u...@aurora.apache.org>
> *Subject: *Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)
>
> Now that 0.15.0 has been released, I thought I'd check in on any progress folks have made with testing/deploying 0.14.0+ with the Aurora Scheduler `-zk_use_curator` flag in place.
>
> There has been one fix that will go out in the 0.16.0 release to reduce logger noise on shutdown [1][2], but I have heard no negative (or positive) feedback otherwise.
>
> [1] https://issues.apache.org/jira/browse/AURORA-1729
> [2] https://reviews.apache.org/r/49578/
>
> On Thu, Jun 16, 2016 at 1:18 PM, John Sirois <jsir...@apache.org> wrote:
>
> > On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin <martin.hrabov...@gmail.com> wrote:
> >
> > > How should this flag be rolled out to an existing running cluster? Can it be done as a rolling update, instance by instance, or do we need to stop the whole cluster and then bring all nodes back up with the new flag?
> >
> > I recommend taking the whole cluster down, upgrading with the new flag, and bringing it back up.
> >
> > A rolling update should work, but will likely be rocky. My analysis:
> >
> > Aurora leader election consists of 2 components: the actual leader election and the resulting advertisement by the leader of itself as the Aurora service endpoint. These 2 components each use ZooKeeper, and of the 2 I only ensured that the advertisement was compatible with old releases (old clients). The leader election portion is completely internal to the Aurora scheduler instances vying for leadership and, under Curator, uses a different (enhanced) ZooKeeper node scheme. As a result, this is what could happen in a slow roll:
> >
> > before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
> > upgrade 0: new-lead, 1: old-lead, 2: old-follow
> >
> > Here, node 0 will see itself as leader while nodes 1 and 2 will see node 1 as leader. The result is that both node 0 and node 1 attempt to read the Mesos distributed log. The log uses its own leader election, and the reader must be its leader as things stand, so the Aurora-level leadership "tie" will be broken by one of the 2 Aurora-level leaders failing to become the Mesos distributed log leader, and that node will restart its lifecycle, i.e. flap. This will continue to be the case through the second node's upgrade and will not stabilize until the 3rd node is upgraded.
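For readers unfamiliar with the Curator side of the election described above, its leader-latch recipe looks roughly like the sketch below. This is a minimal, hypothetical example: the ensemble address, latch path, and participant id are placeholders, not Aurora's actual configuration or node scheme.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder ensemble; point this at a test ZooKeeper cluster.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",
        new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Placeholder latch path and participant id. Curator manages its own
    // ephemeral, sequential children under this path and treats the lowest
    // sequence number as leader.
    LeaderLatch latch = new LeaderLatch(client, "/aurora/scheduler/leader", "scheduler-0");
    latch.start();
    latch.await();  // blocks until this participant wins the election

    System.out.println("has leadership: " + latch.hasLeadership());

    latch.close();  // relinquish leadership so another participant can win
    client.close();
  }
}
```

Because the latch's node layout is not what the old commons-based schedulers watch, a mixed-version cluster can end up with the split view described in the quoted analysis above.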
> > > 2016-06-16 5:03 GMT+02:00 Jake Farrell <jfarr...@apache.org>:
> > >
> > > > +1, will enable on our test clusters to help verify
> > > >
> > > > -Jake
> > > >
> > > > On Tue, Jun 14, 2016 at 7:43 PM, John Sirois <jsir...@apache.org> wrote:
> > > >
> > > > > I'd like to move forward with https://issues.apache.org/jira/browse/AURORA-1669 asap, i.e. removing the legacy (Twitter) commons ZooKeeper libraries used for Aurora leader election in favor of the Apache Curator libraries. The change submitted in https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0, and Apache Curator based service discovery can be enabled with the Aurora scheduler flag `-zk_use_curator`. I'd like feedback from users who enable this option. If you have a test cluster where you can enable `-zk_use_curator` and exercise leader failure and failover, I'd be grateful. If you have moved to using this option in production with demonstrable improvements, or even maintenance of the status quo, I'd also be grateful for that news. If you've found regressions or new bugs, I'd love to know about those as well.
> > > > >
> > > > > Thanks in advance to all those who find time to test this out on real systems!
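For anyone exercising leader failure and failover with `-zk_use_curator`, a small read-only probe of the scheduler's advertisement path can show which instance is advertised before and after killing the leader. A hedged sketch, assuming a placeholder ensemble and a placeholder `/aurora/scheduler` path (use whatever serverset path your cluster is actually configured with); it simply dumps node contents rather than parsing them:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderAdvertisementCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder ensemble; point this at the test cluster's ZooKeeper.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",
        new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Placeholder advertisement root; substitute your cluster's configured path.
    String path = "/aurora/scheduler";
    List<String> members = client.getChildren().forPath(path);
    for (String member : members) {
      byte[] data = client.getData().forPath(path + "/" + member);
      System.out.println(member + " -> " + new String(data, StandardCharsets.UTF_8));
    }

    client.close();
  }
}
```

Running it once with the original leader up and again after killing that scheduler should show the advertised endpoint move to the newly elected instance.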