Could we change the default and drop the old code at the same time? I don't
see any benefit in letting it hang around.

I have not tested this code yet, but I hope to do so soon.

On Wed, Aug 24, 2016 at 5:19 AM, Erb, Stephan <stephan....@blue-yonder.com>
wrote:

> The curator backend has been working well for us so far. I believe it is
> safe to make it the default for the next release, and to drop the old code
> in the release after that.
>
>
>
> *From: *John Sirois <jsir...@apache.org>
> *Reply-To: *"u...@aurora.apache.org" <u...@aurora.apache.org>, "
> jsir...@apache.org" <jsir...@apache.org>
> *Date: *Thursday 7 July 2016 at 01:13
> *To: *Martin Hrabovčin <martin.hrabov...@gmail.com>
> *Cc: *"dev@aurora.apache.org" <dev@aurora.apache.org>, Jake Farrell <
> jfarr...@apache.org>, "u...@aurora.apache.org" <u...@aurora.apache.org>
> *Subject: *Re: [FEEDBACK] Transitioning Aurora leader election to Apache
> Curator (`-zk_use_curator`)
>
>
>
> Now that 0.15.0 has been released, I thought I'd check in on any progress
> folks have made with testing/deploying 0.14.0+ with the Aurora scheduler
> `-zk_use_curator` flag in place.
>
> There has been one fix that will go out in the 0.16.0 release to reduce
> logger noise on shutdown [1][2], but I have heard no negative (or positive)
> feedback otherwise.
>
>
>
> [1] https://issues.apache.org/jira/browse/AURORA-1729
>
> [2] https://reviews.apache.org/r/49578/
>
>
>
> On Thu, Jun 16, 2016 at 1:18 PM, John Sirois <jsir...@apache.org> wrote:
>
>
>
>
>
> On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin <
> martin.hrabov...@gmail.com> wrote:
>
> How should this flag be rolled out to an existing running cluster? Can it be
> done using a rolling update, instance by instance, or do we need to stop the
> whole cluster and then bring all nodes back up with the new flag?
>
>
>
> I recommend taking the whole cluster down, upgrading and adding the new
> flag, then bringing it back up.
>
>
>
> A rolling update should work, but will likely be rocky.  My analysis:
>
>
>
> The Aurora leader election consists of 2 components: the actual leader
> election and the resulting advertisement by the leader of itself as the
> Aurora service endpoint.  Each of these components uses ZooKeeper, and of the
> 2 I only ensured that the advertisement was compatible with old releases
> (old clients). The leader election portion is completely internal to the
> Aurora scheduler instances vying for leadership and, under Curator, uses a
> different (enhanced) ZooKeeper node scheme.  As a result, this is what
> could happen in a slow roll:
>
>
>
> before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
>
> upgrade 0: new-lead, 1: old-lead, 2: old-follow
>
>
>
> Here, node 0 will see itself as leader and nodes 1 and 2 will see node 1
> as leader. The result will be both node 0 and node 1 attempting to read the
> mesos distributed log.  The log uses its own leader election, and as things
> stand the reader must be the leader, so the Aurora-level leadership "tie"
> will be broken when one of the 2 Aurora-level leaders fails to become the
> mesos distributed log leader; that node will then restart its lifecycle,
> i.e. flap.  This will continue to be the case after the second node is
> upgraded and will not stabilize until the 3rd node is upgraded.
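
The slow roll described above can be sketched as a toy simulation. This is only an illustration of the reasoning, not Aurora code: it does not talk to ZooKeeper, Curator or the mesos log, the names `leaders` and `slow_roll` are made up for this example, and it assumes that within each election scheme the lowest-numbered node wins.

```python
def leaders(schemes):
    """Return {scheme: node index} for the node each election scheme elects."""
    elected = {}
    for node, scheme in enumerate(schemes):
        # Assumption: the first node seen in a scheme wins that scheme's election.
        elected.setdefault(scheme, node)
    return elected


def slow_roll(cluster_size=3):
    """Upgrade one node at a time; record the elected leaders after each step."""
    schemes = ["old"] * cluster_size
    history = [leaders(schemes)]
    for node in range(cluster_size):
        schemes[node] = "new"  # node restarts using the Curator-based scheme
        history.append(leaders(schemes))
    return history


if __name__ == "__main__":
    for step, elected in enumerate(slow_roll()):
        status = "stable" if len(elected) == 1 else "two leaders, expect flapping"
        print(f"after {step} upgrade(s): {elected} -> {status}")
```

Under these assumptions the cluster has two Aurora-level leaders after the first and second upgrades and returns to a single leader only once all three nodes run the new scheme, which is why a full stop and restart is the smoother path.
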
>
>
>
>
>
> 2016-06-16 5:03 GMT+02:00 Jake Farrell <jfarr...@apache.org>:
>
> +1, will enable on our test clusters to help verify
>
> -Jake
>
>
> On Tue, Jun 14, 2016 at 7:43 PM, John Sirois <jsir...@apache.org> wrote:
>
> > I'd like to move forward with
> > https://issues.apache.org/jira/browse/AURORA-1669 asap; ie: removing legacy
> > (Twitter) commons zookeeper libraries used for Aurora leader election in
> > favor of Apache Curator libraries. The change submitted in
> > https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0 and Apache
> > Curator based service discovery can be enabled with the Aurora scheduler
> > flag `-zk_use_curator`.  I'd like feedback from users who enable this
> > option.  If you have a test cluster where you can enable `-zk_use_curator`
> > and exercise leader failure and failover, I'd be grateful. If you have
> > moved to using this option in production with demonstrable improvements or
> > even maintenance of status quo, I'd also be grateful for this news. If
> > you've found regressions or new bugs, I'd love to know about those as well.
> >
> > Thanks in advance to all those who find time to test this out on real
> > systems!
> >
>
>
>
>
>
>
>
>
