Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)

Erb, Stephan Wed, 24 Aug 2016 05:20:23 -0700

The curator backend has been working well for us so far. I believe it is safe 
to make it the default for the next release, and to drop the old code in the 
release after that.

From: John Sirois <[email protected]>
Reply-To: "[email protected]" <[email protected]>, 
"[email protected]" <[email protected]>
Date: Thursday 7 July 2016 at 01:13
To: Martin Hrabovčin <[email protected]>
Cc: "[email protected]" <[email protected]>, Jake Farrell 
<[email protected]>, "[email protected]" <[email protected]>
Subject: Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator 
(`-zk_use_curator`)

Now that 0.15.0 has been released, I thought I'd check in on any progress folks 
have made with testing/deploying 0.14.0+ with the Aurora scheduler 
`-zk_use_curator` flag in place.
There has been one fix, which will go out in the 0.16.0 release, to reduce logger 
noise on shutdown [1][2], but I have heard no negative (or positive) feedback 
otherwise.

[1] https://issues.apache.org/jira/browse/AURORA-1729
[2] https://reviews.apache.org/r/49578/

On Thu, Jun 16, 2016 at 1:18 PM, John Sirois 
<[email protected]<mailto:[email protected]>> wrote:

On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin 
<[email protected]<mailto:[email protected]>> wrote:
How should this flag be rolled out to an existing running cluster? Can it be 
done as a rolling update, instance by instance, or do we need to stop the whole 
cluster and then bring all nodes back up with the new flag?

I recommend taking the whole cluster down, upgrading and adding the new flag, 
then bringing it back up.

A rolling update should work, but will likely be rocky.  My analysis:

The Aurora leader election consists of 2 components: the actual leader election 
and the resulting advertisement by the leader of itself as the Aurora service 
endpoint.  These 2 components each use ZooKeeper, and of the 2 I only ensured 
that the advertisement was compatible with old releases (old clients). The 
leader election portion is completely internal to the Aurora scheduler 
instances vying for leadership and, under Curator, uses a different (enhanced) 
ZooKeeper node scheme.  As a result, this is what could happen in a slow roll:

before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
upgrade 0: new-lead, 1: old-lead, 2: old-follow

Here, node 0 will see itself as leader and nodes 1 and 2 will see node 1 as 
leader. The result will be both node 0 and node 1 attempting to read the Mesos 
distributed log.  Now the log uses its own leader election and the reader must 
be the leader as things stand, so the Aurora-level leadership "tie" will be 
broken by one of the 2 Aurora-level leaders failing to become the Mesos 
distributed log leader, and that node will restart its lifecycle, i.e. flap.  
This will continue to be the case after the second node is upgraded and will 
not stabilize until the 3rd node is upgraded.
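
The slow roll above can be sketched as a toy model. This is only an 
illustration, not Aurora or Curator code: it assumes each scheme elects its 
earliest joiner, that the old and new schemes cannot see each other, and that 
nodes are upgraded in order 0, 1, 2. The names `leaders` and `rolling_upgrade` 
are hypothetical.

```python
def leaders(groups):
    """Return the leader of each non-empty election group.

    Assumption: the earliest joiner in a group wins, as in a typical
    ZooKeeper sequential-node election.
    """
    return {scheme: members[0] for scheme, members in groups.items() if members}


def rolling_upgrade(node_count=3):
    """Upgrade nodes one at a time and record who believes it is leader."""
    # Each scheme has its own election queue; the two never interact.
    groups = {"old": list(range(node_count)), "new": []}
    history = [("before upgrade", leaders(groups))]
    for node in range(node_count):
        groups["old"].remove(node)  # node stops and leaves the old election
        groups["new"].append(node)  # node restarts with the flag and joins the new one
        history.append((f"upgrade {node}", leaders(groups)))
    return history


if __name__ == "__main__":
    for step, elected in rolling_upgrade():
        state = "stable" if len(elected) == 1 else "two leaders contend for the log"
        print(f"{step}: {elected} ({state})")
```

In this model there are two self-declared leaders after the first and second 
upgrades, and a single leader again only once every node has joined the new 
scheme.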

2016-06-16 5:03 GMT+02:00 Jake Farrell 
<[email protected]<mailto:[email protected]>>:
+1, will enable on our test clusters to help verify

-Jake

On Tue, Jun 14, 2016 at 7:43 PM, John Sirois 
<[email protected]<mailto:[email protected]>> wrote:

> I'd like to move forward with
> https://issues.apache.org/jira/browse/AURORA-1669 asap; ie: removing legacy
> (Twitter) commons zookeeper libraries used for Aurora leader election in
> favor of Apache Curator libraries. The change submitted in
> https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0 and Apache
> Curator based service discovery can be enabled with the Aurora scheduler
> flag `-zk_use_curator`.  I'd like feedback from users who enable this
> option.  If you have a test cluster where you can enable `-zk_use_curator`
> and exercise leader failure and failover, I'd be grateful. If you have
> moved to using this option in production with demonstrable improvements or
> even maintenance of status quo, I'd also be grateful for this news. If
> you've found regressions or new bugs, I'd love to know about those as well.
>
> Thanks in advance to all those who find time to test this out on real
> systems!
>
