The curator backend has been working well for us so far. I believe it is safe to make it the default for the next release, and to drop the old code in the release after that.
From: John Sirois <jsir...@apache.org>
Reply-To: "u...@aurora.apache.org" <u...@aurora.apache.org>, "jsir...@apache.org" <jsir...@apache.org>
Date: Thursday 7 July 2016 at 01:13
To: Martin Hrabovčin <martin.hrabov...@gmail.com>
Cc: "dev@aurora.apache.org" <dev@aurora.apache.org>, Jake Farrell <jfarr...@apache.org>, "u...@aurora.apache.org" <u...@aurora.apache.org>
Subject: Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)

Now that 0.15.0 has been released, I thought I'd check in on any progress folks have made with testing/deploying the 0.14.0+ with the Aurora Scheduler `-zk_use_curator` flag in-place. There has been 1 fix that will go out in the 0.16.0 release to reduce logger noise on shutdown [1][2] but I have heard no negative (or positive) feedback otherwise.

[1] https://issues.apache.org/jira/browse/AURORA-1729
[2] https://reviews.apache.org/r/49578/

On Thu, Jun 16, 2016 at 1:18 PM, John Sirois <jsir...@apache.org> wrote:

On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin <martin.hrabov...@gmail.com> wrote:

How should be this flag rolled to existing running cluster? Can it be done using rolling update instance by instance or we need to stop the whole cluster and then bring all nodes with new flag?

I recommend a whole cluster down, upgrade + new flag, up. A rolling update should work, but will likely be rocky.

My analysis:

The Aurora leader election consists of 2 components, the actual leader election and the resulting advertisement by the leader of itself as the Aurora service endpoint. These 2 components each use zookeeper and of the 2 I only ensured that the advertisement was compatible with old releases (old clients). The leader election portion is completely internal to the Aurora scheduler instances vying for leadership and, under Curator, uses a different (enhanced), zookeeper node scheme.

As a result, this is what could happen in a slow roll:

before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
upgrade 0: new-lead, 1: old-lead, 2: old-follow

Here, node 0 will see itself as leader and nodes 1 and 2 will see node 1 as leader. The result will be both node 0 and node 1 attempting to read the mesos distributed log. Now the log uses its own leader election and the reader must be the leader as things stand, so the Aurora-level leadership "tie" will be broken by one of the 2 Aurora-level leaders failing to become the mesos distributed log leader, and that node will restart its lifecycle - ie flap. This will continue to be the case with second node upgrade and will not stabilize until the 3rd node is upgraded.

2016-06-16 5:03 GMT+02:00 Jake Farrell <jfarr...@apache.org>:

+1, will enable on our test clusters to help verify

-Jake

On Tue, Jun 14, 2016 at 7:43 PM, John Sirois <jsir...@apache.org> wrote:

> I'd like to move forward with
> https://issues.apache.org/jira/browse/AURORA-1669 asap; ie: removing legacy
> (Twitter) commons zookeeper libraries used for Aurora leader election in
> favor of Apache Curator libraries. The change submitted in
> https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0 and Apache
> Curator based service discovery can be enabled with the Aurora scheduler
> flag `-zk_use_curator`. I'd like feedback from users who enable this
> option. If you have a test cluster where you can enable `-zk_use_curator`
> and exercise leader failure and failover, I'd be grateful. If you have
> moved to using this option in production with demonstrable improvements or
> even maintenance of status quo, I'd also be grateful for this news. If
> you've found regressions or new bugs, I'd love to know about those as well.
>
> Thanks in advance to all those who find time to test this out on real
> systems!
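
For anyone weighing a rolling update against a full stop/start, the slow-roll scenario from the quoted analysis can be played out with a small model. This is only a rough sketch and not Aurora code: it assumes schedulers on the old and new ZooKeeper node schemes cannot see each other's election, and it arbitrarily lets the lowest-numbered node in each group win. The function names are made up for illustration.

```python
def aurora_level_leaders(versions):
    """Return {scheme: leader node index} for each election scheme in use.

    Assumption: a node only competes with peers on the same ZooKeeper node
    scheme ("old" = legacy commons, "new" = Curator), and the lowest index
    in each group wins. The real winner depends on ZooKeeper ordering.
    """
    leaders = {}
    for scheme in ("old", "new"):
        members = [i for i, v in enumerate(versions) if v == scheme]
        if members:
            leaders[scheme] = min(members)
    return leaders


def slow_roll(node_count=3):
    """Upgrade nodes one at a time and record the leaders after each step."""
    versions = ["old"] * node_count
    history = [(list(versions), aurora_level_leaders(versions))]
    for i in range(node_count):
        versions[i] = "new"  # restart node i with -zk_use_curator
        history.append((list(versions), aurora_level_leaders(versions)))
    return history


for versions, leaders in slow_roll():
    # More than one Aurora-level leader means both will try to read the
    # mesos distributed log, and the loser restarts (flaps).
    status = "stable" if len(leaders) == 1 else "two leaders, expect flapping"
    print(versions, leaders, status)
```

In this model the cluster has two Aurora-level leaders after the first and second node upgrades and returns to a single leader only once the third node is upgraded, which is the behavior the analysis predicts for a 3-node slow roll.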