Re: [VOTE] Release Apache Aurora 0.18.0 RC0

2017-06-12 Thread Zameer Manji
+1

Tests pass and this has been successfully deployed in a production cluster.

On Fri, Jun 9, 2017 at 6:19 PM, Santhosh Kumar Shanmugham <
sshanmug...@twitter.com.invalid> wrote:

> Kicking off the voting.
>
> +1
>
> On Fri, Jun 9, 2017 at 5:13 PM, Santhosh Kumar Shanmugham <
> sshanmug...@twitter.com> wrote:
>
> > All,
> >
> > I propose that we accept the following release candidate as the official
> > Apache Aurora 0.18.0 release.
> >
> > Aurora 0.18.0-rc0 includes the following:
> > ---
> > The RELEASE NOTES for the release are available at:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git=
> > RELEASE-NOTES.md=rel/0.18.0-rc0
> >
> > The CHANGELOG for the release is available at:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git=
> > CHANGELOG=rel/0.18.0-rc0
> >
> > The tag used to create the release candidate is:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=
> > shortlog;h=refs/tags/rel/0.18.0-rc0
> >
> > The release candidate is available at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.18.0-rc0/
> > apache-aurora-0.18.0-rc0.tar.gz
> >
> > The MD5 checksum of the release candidate can be found at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.18.0-rc0/
> > apache-aurora-0.18.0-rc0.tar.gz.md5
> >
> > The signature of the release candidate can be found at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.18.0-rc0/
> > apache-aurora-0.18.0-rc0.tar.gz.asc
> >
> > The GPG key used to sign the release is available at:
> > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> >
> > Please download, verify, and test.
> >
> > The vote will close on Mon Jun 12 17:12:10 PDT 2017
> >
> > [ ] +1 Release this as Apache Aurora 0.18.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Aurora 0.18.0 because...
> >
> > Thanks,
> > -Santhosh
> >
>
> --
> Zameer Manji
>


Re: schedule task instances spreading them based on a host attribute.

2017-03-30 Thread Zameer Manji
What kind of isolation features are you using?

I would like to probe a little deeper here, because this is not an ideal
rationale for changing the placement algorithm. Ideally, Mesos and Linux
provide the right isolation technology to make this a non-problem.

I understand the push for job anti-affinity (i.e. don't put too many Kafka
workers on one host in general), but I would imagine it would be for
reliability reasons, not performance reasons.

On Thu, Mar 30, 2017 at 12:16 PM, Rick Mangi <r...@chartbeat.com> wrote:

> Performance and utilization mostly. The kafka consumers are CPU bound (and
> sometimes network) and the rest of our jobs are mostly memory bound. We’ve
> found that if too many consumers wind up on the same EC2 instance they
> don’t perform as well. It’s hard to prove this, but the gut feeling is
> pretty strong.
>
>
> > On Mar 30, 2017, at 2:35 PM, Zameer Manji <zma...@apache.org> wrote:
> >
> > Rick,
> >
> > Can you share why it would be nice to spread out these different jobs on
> > different hosts? Is it for reliability, performance, utilization, etc?
> >
> > On Thu, Mar 30, 2017 at 11:31 AM, Rick Mangi <r...@chartbeat.com> wrote:
> >
> >> Yeah, we have a dozen or so kafka consumer jobs running in our cluster,
> >> each having about 40 or so instances.
> >>
> >>
> >>> On Mar 30, 2017, at 2:06 PM, David McLaughlin <da...@dmclaughlin.com>
> >> wrote:
> >>>
> >>> There is absolutely a need for custom hook points in the scheduler
> >> (injecting default constraints to running tasks for example). I don't
> think
> >> users should be asked to write custom scheduling algorithms to solve the
> >> problems in this thread though. There are also huge downsides to
> exposing
> >> the internals of scheduling as a part of a plugin API.
> >>>
> >>> Out of curiosity do your Kafka consumers span multiple jobs? Otherwise
> >> host constraints solve that problem right?
> >>>
> >>>> On Mar 30, 2017, at 10:34 AM, Rick Mangi <r...@chartbeat.com> wrote:
> >>>>
> >>>> I think the complexity is a great rationale for having a pluggable
> >> scheduling layer. Aurora is very flexible and people use it in many
> >> different ways. Giving users more flexibility in how jobs are scheduled
> >> seems like it would be a good direction for the project.
> >>>>
> >>>>
> >>>>> On Mar 30, 2017, at 12:16 PM, David McLaughlin <
> dmclaugh...@apache.org>
> >> wrote:
> >>>>>
> >>>>> I think this is more complicated than multiple scheduling algorithms.
> >> The
> >>>>> problem you'll end up having if you try to solve this in the
> Scheduling
> >>>>> loop is when resources are unavailable because there are preemptible
> >> tasks
> >>>>> running in them, rather than hosts being down. Right now the fact
> that
> >> the
> >>>>> task cannot be scheduled is important because it triggers preemption
> >> and
> >>>>> will make room. An alternative algorithm that tries at all costs to
> >>>>> schedule the task in the TaskAssigner could decide to place the task
> >> in a
> >>>>> non-ideal slot and leave a preemptible task running instead.
> >>>>>
> >>>>> It's also important to think of the knock-on effects here when we
> move
> >> to
> >>>>> offer affinity (i.e. the current Dynamic Reservation proposal). If
> >> you've
> >>>>> made this non-ideal compromise to get things scheduled - that
> decision
> >> will
> >>>>> basically be permanent until the host you're on goes down. At least
> >> with
> >>>>> how things work now, with each scheduling attempt the job has a fresh
> >>>>> chance of being put in an ideal slot.
> >>>>>
> >>>>>> On Thu, Mar 30, 2017 at 8:12 AM, Rick Mangi <r...@chartbeat.com>
> >> wrote:
> >>>>>>
> >>>>>> Sorry for the late reply, but I wanted to chime in here as wanting
> to
> >> see
> >>>>>> this feature. We run a medium size cluster (around 1000 cores) in
> EC2
> >> and I
> >>>>>> think we could get better usage of the cluster with more control
> over
> >> the
> >>>>>> distribution of job instances. For example it would be nice to limit
> >>>>>> the number of kafka consumers running on the same physical box.

Re: schedule task instances spreading them based on a host attribute.

2017-03-30 Thread Zameer Manji
Rick,

Can you share why it would be nice to spread out these different jobs on
different hosts? Is it for reliability, performance, utilization, etc?

On Thu, Mar 30, 2017 at 11:31 AM, Rick Mangi <r...@chartbeat.com> wrote:

> Yeah, we have a dozen or so kafka consumer jobs running in our cluster,
> each having about 40 or so instances.
>
>
> > On Mar 30, 2017, at 2:06 PM, David McLaughlin <da...@dmclaughlin.com>
> wrote:
> >
> > There is absolutely a need for custom hook points in the scheduler
> (injecting default constraints to running tasks for example). I don't think
> users should be asked to write custom scheduling algorithms to solve the
> problems in this thread though. There are also huge downsides to exposing
> the internals of scheduling as a part of a plugin API.
> >
> > Out of curiosity do your Kafka consumers span multiple jobs? Otherwise
> host constraints solve that problem right?
> >
> >> On Mar 30, 2017, at 10:34 AM, Rick Mangi <r...@chartbeat.com> wrote:
> >>
> >> I think the complexity is a great rationale for having a pluggable
> scheduling layer. Aurora is very flexible and people use it in many
> different ways. Giving users more flexibility in how jobs are scheduled
> seems like it would be a good direction for the project.
> >>
> >>
> >>> On Mar 30, 2017, at 12:16 PM, David McLaughlin <dmclaugh...@apache.org>
> wrote:
> >>>
> >>> I think this is more complicated than multiple scheduling algorithms.
> The
> >>> problem you'll end up having if you try to solve this in the Scheduling
> >>> loop is when resources are unavailable because there are preemptible
> tasks
> >>> running in them, rather than hosts being down. Right now the fact that
> the
> >>> task cannot be scheduled is important because it triggers preemption
> and
> >>> will make room. An alternative algorithm that tries at all costs to
> >>> schedule the task in the TaskAssigner could decide to place the task
> in a
> >>> non-ideal slot and leave a preemptible task running instead.
> >>>
> >>> It's also important to think of the knock-on effects here when we move
> to
> >>> offer affinity (i.e. the current Dynamic Reservation proposal). If
> you've
> >>> made this non-ideal compromise to get things scheduled - that decision
> will
> >>> basically be permanent until the host you're on goes down. At least
> with
> >>> how things work now, with each scheduling attempt the job has a fresh
> >>> chance of being put in an ideal slot.
> >>>
> >>>> On Thu, Mar 30, 2017 at 8:12 AM, Rick Mangi <r...@chartbeat.com>
> wrote:
> >>>>
> >>>> Sorry for the late reply, but I wanted to chime in here as wanting to
> see
> >>>> this feature. We run a medium size cluster (around 1000 cores) in EC2
> and I
> >>>> think we could get better usage of the cluster with more control over
> the
> >>>> distribution of job instances. For example it would be nice to limit
> the
> >>>> number of kafka consumers running on the same physical box.
> >>>>
> >>>> Best,
> >>>>
> >>>> Rick
> >>>>
> >>>>
> >>>>> On 2017-03-06 14:44 (-0400), Mauricio Garavaglia <m...@gmail.com>
> wrote:
> >>>>> Hello!
> >>>>>
> >>>>> I have a job that has multiple instances (>100) that I'd like to spread
> >>>>> across the hosts in a cluster. Using a constraint such as "limit=host:1"
> >>>>> doesn't work quite well, as I have more instances than nodes.
> >>>>>
> >>>>> As a workaround I increased the limit value to something like
> >>>>> ceil(instances/nodes). But now the problem happens if a bunch of nodes go
> >>>>> down (think a whole rack dies) because the instances will not run until
> >>>>> they are back, even though we may have spare capacity on the rest of the
> >>>>> hosts that we'd like to use. In that scenario, the job availability may
> >>>>> be affected because it's running with fewer instances than expected. On a
> >>>>> smaller scale, the former approach would also apply if you want to spread
> >>>>> tasks in racks or availability zones. I'd like to have one instance of a
> >>>>> job per rack (failure domain) but in the case of it going down, the
> >>>>> instance can be spawned on a different rack.
> >>>>>
> >>>>> I thought we could have a scheduling constraint to "spread" instances
> >>>>> across a particular host attribute; instead of vetoing an offer right away
> >>>>> we check where the other instances of a task are running, looking for a
> >>>>> particular attribute of the host. We try to maximize the different values
> >>>>> of a particular attribute (rack, hostname, etc) on the task instances
> >>>>> assignment.
> >>>>>
> >>>>> What do you think? Did something like this come up in the past? Is it
> >>>>> feasible?
> >>>>>
> >>>>>
> >>>>> Mauricio
> >>>>>
> >>>>
> >>
>
> --
> Zameer Manji
>
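For illustration, a minimal sketch in Aurora's Python-based job DSL of the limit-constraint workaround discussed in this thread; the role, resource values, and the exact limit are assumptions, and the constraint syntax should be checked against the Aurora configuration reference.

consumer_proc = Process(name='consumer', cmdline='./kafka_consumer')
consumer_task = Task(
    name='consumer',
    processes=[consumer_proc],
    resources=Resources(cpu=2.0, ram=2*GB, disk=1*GB))

jobs = [Service(
    cluster='example',
    role='www-data',
    environment='prod',
    name='kafka_consumer',
    instances=40,
    task=consumer_task,
    # Limit constraint: schedule at most 2 instances of this job per host,
    # i.e. roughly ceil(instances/nodes) as described above.
    constraints={'host': 'limit:2'})]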


Re: Future of storage in Aurora

2017-03-30 Thread Zameer Manji
I don't object to changes to storage so long as we have a migration plan
and a design doc. I'm also not opposed to radical revisits of storage,
including overhauling what we store and where we store it. For example,
could we store Mesos `TaskInfo` objects instead of our `TaskConfig` objects?
Could we store data outside of the scheduler, for example in Cassandra?
Should we have a high-level 'Job' store to make querying for job-level data
easier?

On Thu, Mar 30, 2017 at 10:16 AM, David McLaughlin <dmclaugh...@apache.org>
wrote:

> Hi all,
>
> I'd like to start a discussion around storage in Aurora.
>
> I think one of the biggest mistakes we made in migrating our storage to H2
> was deleting the memory stores as we moved. We made a pretty big bet that
> we could eventually make H2/relational databases work. I don't think that
> bet has paid off and that we need to revisit the direction we're taking.
>
> My belief is that the current H2/MyBatis approach is untenable for large
> production clusters, at least without changing our current single-master
> architecture. At Twitter we are already having to fight to keep GC
> manageable even without DbTaskStore enabled, so I don't see a path forward
> where we could eventually enable that. So far experiments with H2 off-heap
> storage have provided marginal (if any) gains.
>
> Would anyone object to restoring the in-memory stores and creating new
> implementations for the missing ones (UpdateStore)? I'd even go further and
> propose that we consider in-memory H2 and MyBatis a failed experiment and
> we drop that storage layer completely.
>
> Cheers,
> David
>
> --
> Zameer Manji
>


Re: schedule task instances spreading them based on a host attribute.

2017-03-23 Thread Zameer Manji
Hey,

Sorry for the late reply.

It is possible to make this configurable. For example we could just
implement multiple algorithms and switch between them using different
flags. If the flag value is just a class on the classpath that implements
an interface, it can be 100% pluggable.

The primary part of the scheduling code is `TaskScheduler` and
`TaskAssigner`.

`TaskScheduler` receives requests to schedule tasks and does some
validation and preparation. `TaskAssigner` implements the first fit
algorithm.

However, I feel the best move for the project would be to move away from
first fit, to support soft constraints. I think it is a very valid feature
request and I believe it can be done without degrading performance.
Ideally, we should just use an existing Java library that implements a well
known algorithm. For example, Netflix's Fenzo
<https://github.com/Netflix/Fenzo> could be used here.
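As a conceptual illustration only (Aurora's scheduler is Java, so this is not Aurora code), the "flag value names a class implementing an interface" idea boils down to something like the following; the interface and class names are hypothetical.

import importlib

class TaskAssigner(object):
    # Hypothetical interface: pick an offer for a task, or return None.
    def assign(self, task, offers):
        raise NotImplementedError

class FirstFitAssigner(TaskAssigner):
    # The behaviour described above: take the first offer that fits.
    def assign(self, task, offers):
        for offer in offers:
            if offer['cpus'] >= task['cpus'] and offer['ram'] >= task['ram']:
                return offer
        return None

def load_assigner(flag_value):
    # e.g. --task_assigner_class=mypackage.assigners.SpreadAssigner
    module_name, class_name = flag_value.rsplit('.', 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()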

On Wed, Mar 15, 2017 at 11:10 AM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> Hi,
>
> Rather than changing the scheduling algorithm, I think we should be open to
> supporting multiple algorithms. First-fit is certainly a great solution for
> humongous clusters with homogeneous workloads; but for smaller clusters we
> can have more optimized scheduling without sacrificing scheduling
> performance.
>
> How difficult do you think it would be to start exploring that option?
> Haven't looked into the scheduling side of the code :)
>
> On Mon, Mar 6, 2017 at 2:57 PM, Zameer Manji <zma...@apache.org> wrote:
>
> > Something similar was proposed on a Dynamic Reservations review and there
> > is a ticket for it here <https://issues.apache.org/
> jira/browse/AURORA-173
> > >.
> >
> > I think it is feasible, but it is important to note that this is a large
> > change because we are going to move Aurora from first fit to some other
> > algorithm.
> >
> > If we do this we need to ensure it scales to very large clusters and
> > ensures reasonably low latency in assigning tasks to offers.
> >
> > I support the idea of "spread", but it would need to be after a change to
> > the scheduling algorithm.
> >
> > On Mon, Mar 6, 2017 at 11:44 AM, Mauricio Garavaglia <
> > mauriciogaravag...@gmail.com> wrote:
> >
> > > Hello!
> > >
> > > I have a job that has multiple instances (>100) that I'd like to
> spread
> > > across the hosts in a cluster. Using a constraint such as
> "limit=host:1"
> > > doesn't work quite well, as I have more instances than nodes.
> > >
> > > As a workaround I increased the limit value to something like
> > > ceil(instances/nodes). But now the problem happens if a bunch of nodes
> go
> > > down (think a whole rack dies) because the instances will not run until
> > > they are back, even though we may have spare capacity on the rest of
> the
> > > hosts that we'd like to use. In that scenario, the job availability may
> > be
> > > affected because it's running with fewer instances than expected. On a
> > > smaller scale, the former approach would also apply if you want to
> spread
> > > tasks in racks or availability zones. I'd like to have one instance of
> a
> > > job per rack (failure domain) but in the case of it going down, the
> > > instance can be spawned on a different rack.
> > >
> > > I thought we could have a scheduling constraint to "spread" instances
> > > across a particular host attribute; instead of vetoing an offer right
> > away
> > > we check where the other instances of a task are running, looking for a
> > > particular attribute of the host. We try to maximize the different
> values
> > > of a particular attribute (rack, hostname, etc) on the task instances
> > > assignment.
> > >
> > > What do you think? Did something like this come up in the past? Is it
> > > feasible?
> > >
> > >
> > > Mauricio
> > >
> > > --
> > > Zameer Manji
> > >
> >
>
> --
> Zameer Manji
>


Re: Design Doc for Mesos Maintenance in Aurora

2017-03-13 Thread Zameer Manji
Thanks for the feedback Stephan.

I am going to cautiously assume that future feedback here will be along the
same lines. Therefore I have created a ticket [1] for the work proposed in
the doc.

[1]: https://issues.apache.org/jira/browse/AURORA-1904

On Mon, Mar 13, 2017 at 2:25 PM, Erb, Stephan <stephan@blue-yonder.com>
wrote:

> Looks good to me!
>
> On 08/03/2017, 02:45, "Zameer Manji" <zma...@apache.org> wrote:
>
> Hey,
>
> I have a brief design doc
> <https://docs.google.com/document/d/1Z7dFAm6I1nrBE9S5WHw0D0LApBumk
> IbHrk0-ceoD2YI/edit#heading=h.ol75ogadgfyr>
> describing
> the changes required to support Mesos Maintenance in Aurora.
>
> If we have consensus, I will cut a ticket and put up a patch.
>
> --
> Zameer Manji
>
> --
> Zameer Manji
>


Re: Dynamic Reservations

2017-03-08 Thread Zameer Manji
> > > b)   The implementation proposal and patches include an
> > > OfferReconciler, so this implies we don’t want to offer any control for
> > the
> > > user. The only control mechanism will be the cluster-wide offer wait
> time
> > > limiting the number of seconds unused reserved resources can linger
> > before
> > > they are un-reserved.
> > >
> > > c)   Will we allow adhoc/cron jobs to reserve resources? Does it
> even
> > > matter if we don’t give control to users and just rely on the
> > > OfferReconciler?
> > >
> > >
> > > I have a couple of questions on the MVP and some implementation
> details.
> > I
> > > will follow up with those in a separate mail.
> > >
> > > Thanks and best regards,
> > > Stephan
> > >
> >
>
> --
> Zameer Manji
>


Re: schedule task instances spreading them based on a host attribute.

2017-03-06 Thread Zameer Manji
Something similar was proposed on a Dynamic Reservations review and there
is a ticket for it here <https://issues.apache.org/jira/browse/AURORA-173>.

I think it is feasible, but it is important to note that this is a large
change because we are going to move Aurora from first fit to some other
algorithm.

If we do this we need to ensure it scales to very large clusters and
ensures reasonably low latency in assigning tasks to offers.

I support the idea of "spread", but it would need to be after a change to
the scheduling algorithm.
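To make the "spread" idea concrete, here is a rough algorithm sketch (illustrative Python, not Aurora scheduler code): prefer the offer whose attribute value currently runs the fewest instances of the job.

from collections import Counter

def pick_offer_for_spread(offers, running_tasks, attribute='rack'):
    # running_tasks: e.g. [{'instance': 0, 'rack': 'r1'}, {'instance': 1, 'rack': 'r2'}]
    # offers: e.g. [{'host': 'h3', 'attributes': {'rack': 'r1'}}, ...]
    if not offers:
        return None
    load = Counter(t[attribute] for t in running_tasks)
    # Choose the offer whose attribute value is least loaded by this job.
    return min(offers, key=lambda o: load[o['attributes'][attribute]])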

On Mon, Mar 6, 2017 at 11:44 AM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> Hello!
>
> I have a job that has multiple instances (>100) that I'd like to spread
> across the hosts in a cluster. Using a constraint such as "limit=host:1"
> doesn't work quite well, as I have more instances than nodes.
>
> As a workaround I increased the limit value to something like
> ceil(instances/nodes). But now the problem happens if a bunch of nodes go
> down (think a whole rack dies) because the instances will not run until
> they are back, even though we may have spare capacity on the rest of the
> hosts that we'd like to use. In that scenario, the job availability may be
> affected because it's running with fewer instances than expected. On a
> smaller scale, the former approach would also apply if you want to spread
> tasks in racks or availability zones. I'd like to have one instance of a
> job per rack (failure domain) but in the case of it going down, the
> instance can be spawned on a different rack.
>
> I thought we could have a scheduling constraint to "spread" instances
> across a particular host attribute; instead of vetoing an offer right away
> we check where the other instances of a task are running, looking for a
> particular attribute of the host. We try to maximize the different values
> of a particular attribute (rack, hostname, etc) on the task instances
> assignment.
>
> What do you think? Did something like this come up in the past? Is it
> feasible?
>
>
> Mauricio
>
> --
> Zameer Manji
>


Re: Idea: rolling restarts in Aurora

2017-03-03 Thread Zameer Manji
+1

If I recall correctly, the rolling update mechanism was added to Aurora
because having the client coordinate batching was pretty tricky. I think
the same applies here to a rolling restart.

Considering the job controller technically supports this, adding a new RPC
to expose this behaviour would be beneficial.
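For context, the client-side batching that the existing restartShards workaround requires looks roughly like this (the client object and its method name are hypothetical; a real implementation would also need to watch task health between batches rather than just sleeping):

import time

def rolling_restart(client, job_key, instance_ids, batch_size=5, wait_secs=60):
    # Naive sketch: restart instances in fixed-size batches.
    for i in range(0, len(instance_ids), batch_size):
        batch = instance_ids[i:i + batch_size]
        client.restart_shards(job_key, batch)  # hypothetical wrapper around restartShards
        time.sleep(wait_secs)  # crude stand-in for waiting on task health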

On Thu, Mar 2, 2017 at 7:40 PM, Cody G <codyhg...@gmail.com> wrote:

> Hi all,
>
> I'd like to implement some new functionality in Aurora allowing for rolling
> job restarts. There are many reasons why we might need to restart a job,
> e.g. freeing instances of a job from deadlock or refreshing some sort of
> external configuration.
>
> Currently, there are two options to execute a rolling restart, but both
> are undesirable: either use the restartShards endpoint and implement
> batching client-side, or use startJobUpdate with slightly modified task
> config so that a non-empty job diff forces an update. I propose adding a
> new thrift RPC for launching a rolling restart, which is an interface
> around the existing upgrade logic. Instead of requiring a TaskConfig and
> instanceCount, this restart endpoint will only accept JobUpdateSettings and
> will simply launch an update with the currently used task configuration.
> All of the existing job update RPCs will still be able to access updates
> which were launched from this restart endpoint. This ensures restarts are
> available in the UI and no additional storage changes are required.
>
> If this proposal seems reasonable, I’ll file a ticket and draft up a more
> detailed RFC for further review.
>
> Cody
>
> --
> Zameer Manji
>


Re: Design Doc: Mesos V1 API

2017-02-02 Thread Zameer Manji
As some of the comments on the doc indicate, there is a performance
regression using the HTTP API. However, with the V0Mesos/V1Mesos
abstraction, we can easily swap between implementations.

On Wed, Feb 1, 2017 at 12:35 PM, Joshua Cohen <jco...@apache.org> wrote:

> Overall proposal sounds reasonable to me, thanks Zameer! My main question
> is whether we're setting ourselves up for possible performance regressions
> by switching to the HTTP API? We'll obviously need to switch eventually,
> but would be good to understand the performance impact (if any) of the
> switch.
>
> On Wed, Feb 1, 2017 at 2:23 PM, Zameer Manji <zma...@apache.org> wrote:
>
> > Hey,
> >
> > I have written a design doc
> > <https://docs.google.com/document/d/1bWK8ldaQSsRXvdKwTh8tyR_
> > 0qMxAlnMW70eOKoU3myo/edit#>
> > that
> > outlines the work required to adopt the Mesos HTTP V1 API in Aurora.
> Please
> > take a look and comment.
> >
> > --
> > Zameer Manji
> >
>
> --
> Zameer Manji
>


Design Doc: Mesos V1 API

2017-02-01 Thread Zameer Manji
Hey,

I have written a design doc
<https://docs.google.com/document/d/1bWK8ldaQSsRXvdKwTh8tyR_0qMxAlnMW70eOKoU3myo/edit#>
that
outlines the work required to adopt the Mesos HTTP V1 API in Aurora. Please
take a look and comment.

-- 
Zameer Manji


Re: [VOTE] Release Apache Aurora 0.17.0 RC0

2017-02-01 Thread Zameer Manji
+1

The release verification script passes for me.

I have also been running a8afa59fb (which is a few commits before this
release) in a production environment.

Both the executor and scheduler seem to work fine.

On Wed, Feb 1, 2017 at 10:12 AM, David McLaughlin <dmclaugh...@apache.org>
wrote:

> Is anyone running this in production yet? For me there is no value in a
> release if it hasn't been vetted in production.
>
> On Wed, Feb 1, 2017 at 2:22 AM, Stephan Erb <s...@apache.org> wrote:
>
> > All,
> >
> > I propose that we accept the following release candidate as the
> > official
> > Apache Aurora 0.17.0 release.
> >
> > Aurora 0.17.0-rc0 includes the following:
> > ---
> > The RELEASE NOTES for the release are available at:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git=RELEASE-NOTES.md
> > =rel/0.17.0-rc0
> >
> > The CHANGELOG for the release is available at:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git=CHANGELOG=rel
> > /0.17.0-rc0
> >
> > The tag used to create the release candidate is:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/
> > tags/rel/0.17.0-rc0
> >
> > The release candidate is available at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.17.0-rc0/apache-aurora-
> > 0.17.0-rc0.tar.gz
> >
> > The MD5 checksum of the release candidate can be found at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.17.0-rc0/apache-aurora-
> > 0.17.0-rc0.tar.gz.md5
> >
> > The signature of the release candidate can be found at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.17.0-rc0/apache-aurora-
> > 0.17.0-rc0.tar.gz.asc
> >
> > The GPG key used to sign the release is available at:
> > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> >
> > Please download, verify, and test.
> >
> > The vote will close on Sa 4. Feb 09:35:45 CET 2017
> >
> > [ ] +1 Release this as Apache Aurora 0.17.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Aurora 0.17.0 because...
>
> --
> Zameer Manji
>


Re: Adding support for Mesos' Kill Policy when running docker executor-less tasks.

2017-01-17 Thread Zameer Manji
The proposal looks good to me.

I think Stephan's idea of re-submitting John's change would be a good first
step and then we can layer in Nicolas' proposal.

On Tue, Jan 17, 2017 at 10:32 AM, Nicolas Donatucci <ndonatu...@medallia.com
> wrote:

> Hi.
>
> Have there been any news on this issue?
>
> On Thu, Jan 5, 2017 at 7:10 PM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
>
> > Hi,
> >
> > Some progress in 'Executor-less docker containers' would be great; in its
> > current form it is kind of useless, as you can't specify the CMD
> > <https://docs.docker.com/engine/reference/builder/#/cmd> to pass to the
> > entrypoint. I played with that a bit a while back, but didn't continue
> > (sorry) making the CLI work properly with a missing Process, which I think
> > is something that John's patch addresses. See
> > https://github.com/medallia/aurora/commit/bd5938590fea3a9a7b2db5d2ff8c6c
> > d981b0e0c1
> >
> > The cmd to run was included as part of the Docker container struct like:
> > Container(docker = Docker(image = "docker/whalesay", parameters=p,
> > command="hello world")))
> >
> >
> >
> > On Thu, Jan 5, 2017 at 4:18 PM, Renan DelValle <rdelv...@binghamton.edu>
> > wrote:
> >
> > > I think adding the kill policy to the Thrift API is fine. For the first
> > > pass, I don't think it's a big deal to just keep it as a feature in the
> > > Thrift API.
> > >
> > > However, we should also have a discussion on how we should integrate
> the
> > > increasing number of Thrift APIs missing from the main Aurora client.
> > (Mea
> > > culpa: I'm probably one of the guiltiest parties of neglecting this
> > > aspect.)
> > >
> > > Now that AURORA-1288 has shipped, we should consider reviving
> discussion
> > on
> > > John's patch and even extending it.
> > >
> > > Additional planning is definitely needed IF we plan to integrate (off
> the
> > > top of my head):
> > > * Custom Executors (At least rudimentary support, i.e.: Name + Data
> blob,
> > > and also include the command-executor)
> > > * Executor-less docker containers
> > > * URI Fetcher
> > > * Kill Policy
> > >
> > >
> > > On Thu, Jan 5, 2017 at 1:09 PM, Erb, Stephan <
> > stephan@blue-yonder.com>
> > > wrote:
> > >
> > > > I will try to summarize an off-list discussion so that more people
> can
> > > > participate:
> > > >
> > > > Aurora has an unofficial way to launch Docker containers without
> > Thermos.
> > > > Rather than using the Thermos executor, Mesos will directly call the
> > > > container entrypoint. This support was contributed by Bill (
> > > > https://reviews.apache.org/r/44685/ ). An additional patch by John (
> > > > https://reviews.apache.org/r/44745/ ) to expose this functionality
> > > within
> > > > the client job configuration was discarded due to missing consensus
> at
> > > the
> > > > time. This means, the entrypoint mode is only available for REST API
> > > users,
> > > > and for users with patched clients.
> > > >
> > > > The goal of Nicolás is now to provide a graceful shutdown for
> > containers
> > > > running without Thermos. He has prepared a minimal patch that
> sketches
> > > the
> > > > idea https://github.com/apache/aurora/compare/master...
> > > > medallia:KillPolicyGracePeriod.
> > > >
> > > > How do we want to proceed here? Do we plan to improve our Docker
> > > > entrypoint story? If yes, can we just re-open Johns RB and merge an
> > > > extended version of Nicolás change, or do we need some additional
> > > planning?
> > > >
> > > > I am happy to hear what you think.
> > > >
> > > >
> > > > On 29/12/2016, 16:48, "Nicolas Donatucci" <ndonatu...@medallia.com>
> > > wrote:
> > > >
> > > > Hello everybody.
> > > >
> > > > I was thinking on adding support for the current Mesos' Grace
> > Period
> > > > Kill
> > > > Policy when running Docker containers without Thermos. It is
> > > currently
> > > > the
> > > > only Kill Policy implemented by Mesos. (More information can be
> > found
> > > > here
> > > > https://github.com/apache/mesos/blob/master/CHANGELOG#L576-L585
> > and
> > > > JIRA
> > > > issue here https://issues.apache.org/jira/browse/MESOS-4909)
> > > >
> > > > My idea is to add a Kill Policy to TaskConfig in order to pass it
> > on
> > > to
> > > > Mesos. The "finalization_wait" field of the task schema can be
> used
> > > to
> > > > create the corresponding Kill Policy.
> > > >
> > > > What do you think?
> > > >
> > > >
> > > >
> > >
> >
>
> --
> Zameer Manji
>
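To make the proposal concrete, a hedged sketch of how a graceful-shutdown budget is expressed in the job DSL today: the Thermos finalization_wait field (in seconds) is what Nicolás proposes mapping to a Mesos KillPolicy grace period. The image name and values are made up, and the entrypoint "command" field from Mauricio's fork is omitted since it is not in upstream Aurora.

web_proc = Process(name='nginx', cmdline="nginx -g 'daemon off;'")
web_task = Task(
    name='nginx',
    processes=[web_proc],
    resources=Resources(cpu=1.0, ram=256*MB, disk=128*MB),
    # Time budget for finalization; the proposal would surface this to Mesos
    # as a KillPolicy grace period for executor-less containers.
    finalization_wait=60)

jobs = [Service(
    cluster='example',
    role='www-data',
    environment='prod',
    name='nginx',
    instances=1,
    container=Container(docker=Docker(image='nginx:1.11')),
    task=web_task)]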


Re: A sketch for supporting mesos maintenance

2016-11-09 Thread Zameer Manji
Mesos 1.1.0 is shipping
<https://github.com/apache/mesos/blob/8822a29bce4b4c1f79ed25823c8fccbb47b1660c/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp>
an implementation of the SchedulerDriver interface that uses the HTTP API
under the hood. Adopting this implementation seems straightforward,
although we would not be able to accept Mesos maintenance requests just yet.

Once the community has proved out the HTTP API works in practice, I was
thinking about adopting the JNI implementation
<https://github.com/apache/mesos/blob/8822a29bce4b4c1f79ed25823c8fccbb47b1660c/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp>
of the HTTP API, which would allow us to accept the maintenance requests.
This might be a lot of work, because the shape of the HTTP API is very
different from the SchedulerDriver API.

Maintenance state is surfaced in the offer in the `Unavailability` field
<https://github.com/apache/mesos/blob/8822a29bce4b4c1f79ed25823c8fccbb47b1660c/include/mesos/mesos.proto#L1278>
.
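As an illustration only (Aurora's scheduler is Java; this is not Aurora code), a framework consuming the Unavailability field might reason about it roughly as follows, with the offer represented as a plain dict:

def agent_usable_for(offer, task_runtime_secs, now_ns):
    # Returns True if the task should finish before the agent's maintenance
    # window opens, or if no window is scheduled at all.
    unavailability = offer.get('unavailability')
    if unavailability is None:
        return True
    start_ns = unavailability['start']['nanoseconds']
    duration = unavailability.get('duration')
    # A missing duration means the agent is unavailable indefinitely after start.
    end_ns = start_ns + duration['nanoseconds'] if duration else float('inf')
    task_end_ns = now_ns + int(task_runtime_secs * 1e9)
    return task_end_ns < start_ns or now_ns > end_ns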


On Wed, Nov 9, 2016 at 7:13 PM, Bill Farner <wfar...@apache.org> wrote:

> (1) sounds like an inevitability, do you have a sense of what stands in the
> way, or what it will take?
>
> (2) is a win for ending behavior redundancy. This is probably in the doc,
> but I'm lazy - are maintenance statuses surfaced in offers? IIRC the
> original incarnation of maintenance modes in mesos didn't surface that
> info, which eliminated important state for scheduling.
>
> On Wed, Nov 9, 2016 at 3:09 PM Zameer Manji <zma...@apache.org> wrote:
>
> > Hey,
> >
> > This is not a design doc for supporting Mesos Maintenance, but more of a
> > high level overview on how we *could* support it going forward. I just
> > wanted to get this idea out there now to see where we all stand.
> >
> > As Ankit mentioned in AURORA-1800 Mesos has had Maintenance primitives
> > since 0.25. You can read about them here
> > <http://mesos.apache.org/documentation/latest/maintenance/>. The
> > primitives
> > map pretty well to our existing concept of maintenance, but they allow
> > operators to do work across multiple frameworks.
> >
> > Since the Mesos community is growing and new frameworks are emerging all
> > the time, I think Aurora should support these primitives and drop our
> > custom primitives to be a better player in the ecosystem.
> >
> > We cannot adopt these just yet, however, because they are only accessible
> > via the Mesos HTTP API, which Aurora does not use today. Further,
> > `aurora_admin` has some SLA aware maintenance processes which are
> computed
> > and coordinated from the client. I think for us to successfully adopt
> Mesos
> > Maintenance, we need to do at least two things:
> >
> > 1. Adopt the Mesos HTTP API.
> > 2. Move the SLA aware maintenance logic from the admin tool into the
> > scheduler itself, so the scheduler can coordinate with the Mesos Master
> in
> > an SLA aware fashion.
> >
> > What do folks think?
> >
> > --
> > Zameer Manji
> >
>
> --
> Zameer Manji
>


A sketch for supporting mesos maintenance

2016-11-09 Thread Zameer Manji
Hey,

This is not a design doc for supporting Mesos Maintenance, but more of a
high level overview on how we *could* support it going forward. I just
wanted to get this idea out there now to see where we all stand.

As Ankit mentioned in AURORA-1800 Mesos has had Maintenance primitives
since 0.25. You can read about them here
<http://mesos.apache.org/documentation/latest/maintenance/>. The primitives
map pretty well to our existing concept of maintenance, but they allow
operators to do work across multiple frameworks.

Since the Mesos community is growing and new frameworks are emerging all
the time, I think Aurora should support these primitives and drop our
custom primitives to be a better player in the ecosystem.

We cannot adopt these just yet, however, because they are only accessible
via the Mesos HTTP API, which Aurora does not use today. Further,
`aurora_admin` has some SLA aware maintenance processes which are computed
and coordinated from the client. I think for us to successfully adopt Mesos
Maintenance, we need to do at least two things:

1. Adopt the Mesos HTTP API.
2. Move the SLA aware maintenance logic from the admin tool into the
scheduler itself, so the scheduler can coordinate with the Mesos Master in
an SLA aware fashion.

What do folks think?

-- 
Zameer Manji


Re: Build failed in Jenkins: Aurora #1665

2016-11-04 Thread Zameer Manji
Fixing the regression in https://reviews.apache.org/r/53508/

On Fri, Nov 4, 2016 at 4:14 PM, Zameer Manji <zma...@apache.org> wrote:

> Looking
>
> On Fri, Nov 4, 2016 at 4:02 PM, Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
>
>> See <https://builds.apache.org/job/Aurora/1665/changes>
>>
>> Changes:
>>
>> [zmanji] Send SIGTERM to daemonized processes on shutdown.
>>
>> --
>> [...truncated 2972 lines...]
>>  [pytest collection progress output elided; ANSI colour codes and the repeated "collecting N items" lines (up to 331+ items) removed]

Re: Build failed in Jenkins: Aurora #1665

2016-11-04 Thread Zameer Manji
> src/test/python/apache/thermos/core/test_runner_integration.py::TestRunnerBasic::test_runner_processes_have_monotonically_increasing_timestamps FAILED
> src/test/python/apache/thermos/core/test_runner_integration.py <- .pants.d/python-setup/chroots/aa8c19ee98132b1d807c9921997c09adcdd43a98/apache/thermos/testing/runner.py::TestConcurrencyBasic::test_runner_state_reconstruction PASSED
> src/test/python/apache/thermos/core/test_runner_integration.py::TestConcurrencyBasic::test_runner_state_success FAILED
> src/test/python/apache/thermos/core/test_runner_integration.py::TestConcurrencyBasic::test_runner_processes_separated_temporally_due_to_concurrency_limit FAILED
> src/test/python/apache/thermos/core/test_runner_integration.py <- .pants.d/python-setup/chroots/aa8c19ee98132b1d807c9921997c09adcdd43a98/apache/thermos/testing/runner.py::TestRunnerEnvironment::test_runner_state_reconstruction PASSED
> src/test/python/apache/thermos/core/test_runner_integration.py::TestRunnerEnvironment::test_runner_state_success FAILED
> src/test/python/apache/thermos/core/test_runner_integration.py::TestRunnerEnvironment::test_runner_processes_have_expected_output FAILED
> src/test/python/apache/thermos/core/test_runner_log_config.py <- .pants.d/python-setup/chroots/aa8c19ee98132b1d807c9921997c09adcdd43a98/apache/thermos/testing/runner.py::TestStandardStdout::test_runner_state_reconstruction PASSED
> src/test/python/apache/thermos/core/test_runner_log_config.py::TestStandardStdout::test_log_config FAILED
> src/test/python/apache/thermos/core/test_runner_log_config.py <- .pants.d/python-setup/chroots/aa8c19ee98132b1d807c9921997c09adcdd43a98/apache/thermos/testing/runner.py::TestStandardStderr::test_runner_state_reconstruction PASSED
> src/test/python/apache/thermos/core/test_runner_log_config.py::TestStandardStderr::test_log_config FAILED
> src/test/python/apache/thermos/core/test_runner_log_config.py <- .pants.d/python-setup/chroots/aa8c19ee98132b1d807c9921997c09adcdd43a98/apache/thermos/testing/runner.py::TestRotateUnderStdout::test_runner_state_reconstruction PASSED
> src/test/python/apache/thermos/core/test_runner_log_config.py::TestRotateUnderStdout::test_log_config FAILED
> src/test/python/apache/thermos/core/test_runner_log_config.py <- .pants.d/python-setup/chroots/aa8c19ee98132b1d807c9921997c09adcdd43a98/apache/thermos/testing/runner.py::TestRotateUnderStderr::test_runner_state_reconstruction PASSED
> src/test/python/apache/thermos/core/test_runner_log_config.py::TestRotateUnderStderr::test_log_config FAILED
> src/test/python/apache/thermos/core/test_runner_log_config.py <- .pants.d/python-setup/chroots/aa8c19ee98132b1d807c9921997c09adcdd43a98/apache/thermos/testing/runner.py::TestRotateOverStdout::test_runner_state_reconstruction PASSED
> src/test/python/apache/thermos/core/test_runner_log_config.py::TestRotateOverStdout::test_log_config FAILED
> src/test/python/apache/thermos/core/test_runner_log_config.py <- .pants.d/python-setup/chroots/aa8c19ee98132b1d807c9921997c09adcdd43a98/apache/thermos/testing/runner.py::TestRotateOverStderr::test_runner_state_reconstruction PASSED
> src/test/python/apache/thermos/core/test_runner_log_config.py::TestRotateOverStderr::test_log_config FAILED
> src/test/python/apache/thermos/core/test_runner_log_config.py <- .pants.d/python-setup/chroots/aa8c19ee98132b1d807c9921997c09adcdd43a98/apache/thermos/testing/runner.py::TestRotateDefaulted::test_runner_state_reconstruction PASSED
> src/test/python/apache/thermos/core/test_runner_log_config.py::TestRotateDefaulted::test_log_config FAILED
> src/test/python/apache/thermos/core/test_staged_kill.py::TestRunnerKill::test_process_kill Build timed out (after 120 minutes). Marking the build as failed.
> Build was aborted
> Recording test results
> ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
>
> --
> Zameer Manji
>


Re: Aurora, Thermos, PID 1, and You

2016-11-02 Thread Zameer Manji
Filed a task https://issues.apache.org/jira/browse/AURORA-1808 to track
this work since there are no objections.

On Mon, Oct 31, 2016 at 6:42 PM, Zameer Manji <zma...@apache.org> wrote:

> Re sending this from my @apache.org email in case my previous email got
> caught in spam.
>
> On Mon, Oct 31, 2016 at 6:42 PM, Zameer Manji <zma...@uber.com> wrote:
>
>> Hey,
>>
>> Recently I have experienced a number of issues in a production
>> environment with the DockerContainerizer, Aurora and Thermos. Although my
>> experience is specific to Docker, I believe this applies to anyone using
>> the Mesos Containerizer with pid isolation. The root cause of these issues
>> originates in the interactions between how we launch the executor and the
>> role of PID 1.
>>
>> The CommandInfo for the ExecutorInfo uses the default `shell` value which
>> is `true`[1]. This means that in any PID isolated container the `sh`
>> process that launches the executor will become PID 1. Here is an example
>> `ps` output from vagrant showing this:
>> 
>> root@aurora:/# ps auxf
>> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
>> root   250  0.0  0.0  21928  2124 ?Ss   01:19   0:00 /bin/bash
>> root   469  0.0  0.0  19176  1240 ?R+   01:28   0:00  \_ ps
>> auxf
>> root 1  0.0  0.0   4328   636 ?Ss   01:10   0:00 /bin/sh
>> -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble
>> localhost:2181 --announcer-zookeeper-auth-config
>> /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json
>> --mesos-containerizer
>> root 5  0.7  1.4 1201128 45604 ?   Sl   01:10   0:08
>> python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble
>> localhost:2181 --announcer-zookeeper-auth-config
>> /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json
>> --mesos-containerizer-
>> root23  0.1  0.6 115668 20764 ?S01:10   0:01  \_
>> /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
>> --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
>> --log_to_disk=DEBUG --hostname=192.168.33.7 --thermos_js
>> root29  0.0  0.5 113476 17936 ?Ss   01:10   0:00  \_
>> /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
>> --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
>> --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
>> root34  0.0  0.0  20040  1476 ?S01:10   0:00  |
>> \_ /bin/bash -c  while true; do   echo hello world   sleep 10
>>   done
>> root   468  0.0  0.0   4228   348 ?S01:28   0:00  |
>> \_ sleep 10
>> root31  0.0  0.5 113476 17936 ?Ss   01:10   0:00  \_
>> /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
>> --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
>> --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
>> root32  0.0  0.0  20040  1476 ?S01:10   0:00
>>  \_ /bin/bash -c  while true; do   echo hello world   sleep 10
>> done
>> root   467  0.0  0.0   4228   352 ?S01:28   0:00
>>  \_ sleep 10
>> root47  0.0  0.0  24116  3052 ?S01:10   0:00 python
>> ./daemon.py
>> 
>>
>> This means processes that double fork/daemonize will be reparented to
>> `sh` and not our executor. You can see that the `python daemon.py` process
>> has been reparented to `sh` and not the executor and is outside of the
>> scope of the runners. This has a number of undesirable implications,
>> perhaps most concerning is that processes that end up reparenting to PID 1
>> will not receive SIGTERM or SIGKILL from thermos but instead will be killed
>> by the kernel when thermos decides to exit. If anyone here decides to
>> run published images that use popular software that double forks (like
>> nginx), you will never be able to ensure the processes die cleanly.
>>
>> I've been thinking about this problem for a while and upon advice from
>> others and my own research I believe the best solution is as follows:
>> 1. We have good reasons for setting `shell=True` when launching the
>> executor. I'm not comfortable changing this because I'm not sure of all of
>> the implications if we choose another method.
>> 2. The thermos runners end up forking off the target processes. I think
>> the runners should be responsible for all of the processes that are created
>> by the children.

Re: Aurora, Thermos, PID 1, and You

2016-10-31 Thread Zameer Manji
Re sending this from my @apache.org email in case my previous email got
caught in spam.

On Mon, Oct 31, 2016 at 6:42 PM, Zameer Manji <zma...@uber.com> wrote:

> Hey,
>
> Recently I have experienced a number of issues in a production environment
> with the DockerContainerizer, Aurora and Thermos. Although my experience is
> specific to Docker, I believe this applies to anyone using the Mesos
> Containerizer with pid isolation. The root cause of these issues originates
> in the interactions between how we launch the executor and the role of PID
> 1.
>
> The CommandInfo for the ExecutorInfo uses the default `shell` value which
> is `true`[1]. This means that in any PID isolated container the `sh`
> process that launches the executor will become PID 1. Here is an example
> `ps` output from vagrant showing this:
> 
> root@aurora:/# ps auxf
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> root   250  0.0  0.0  21928  2124 ?Ss   01:19   0:00 /bin/bash
> root   469  0.0  0.0  19176  1240 ?R+   01:28   0:00  \_ ps
> auxf
> root 1  0.0  0.0   4328   636 ?Ss   01:10   0:00 /bin/sh
> -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble
> localhost:2181 --announcer-zookeeper-auth-config
> /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json
> --mesos-containerizer
> root 5  0.7  1.4 1201128 45604 ?   Sl   01:10   0:08 python2.7
> /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble
> localhost:2181 --announcer-zookeeper-auth-config
> /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json
> --mesos-containerizer-
> root23  0.1  0.6 115668 20764 ?S01:10   0:01  \_
> /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
> --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
> --log_to_disk=DEBUG --hostname=192.168.33.7 --thermos_js
> root29  0.0  0.5 113476 17936 ?Ss   01:10   0:00  \_
> /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
> --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
> --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
> root34  0.0  0.0  20040  1476 ?S01:10   0:00  |
> \_ /bin/bash -c  while true; do   echo hello world   sleep 10
>   done
> root   468  0.0  0.0   4228   348 ?S01:28   0:00  |
> \_ sleep 10
> root31  0.0  0.5 113476 17936 ?Ss   01:10   0:00  \_
> /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
> --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
> --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
> root32  0.0  0.0  20040  1476 ?S01:10   0:00
>  \_ /bin/bash -c  while true; do   echo hello world   sleep 10
> done
> root   467  0.0  0.0   4228   352 ?S01:28   0:00
>\_ sleep 10
> root47  0.0  0.0  24116  3052 ?S01:10   0:00 python
> ./daemon.py
> 
>
> This means processes that double fork/daemonize will be reparented to
> `sh` and not our executor. You can see that the `python daemon.py` process
> has been reparented to `sh` and not the executor and is outside of the
> scope of the runners. This has a number of undesirable implications,
> perhaps most concerning is that processes that end up reparenting to PID 1
> will not receive SIGTERM or SIGKILL from thermos but instead will be killed
> by the kernel when thermos decides to exit. If anyone here decides to
> run published images that use popular software that double forks (like
> nginx), you will never be able to ensure the processes die cleanly.
>
> I've been thinking about this problem for a while and upon advice from
> others and my own research I believe the best solution is as follows:
> 1. We have good reasons for setting `shell=True` when launching the
> executor. I'm not comfortable changing this because I'm not sure of all of
> the implications if we choose another method.
> 2. The thermos runners end up forking off the target processes. I think
> the runners should be responsible for all of the processes that are created
> by the children.
> 3. We can make the runners responsible for their grand children by using
> `prctl(2)`[2] and setting the `PR_SET_CHILD_SUBREAPER` bit for each runner.
> This means double forked processes will be reparented to the runner and not
> PID 1
> 4. On task tear down, we make the runners send SIGTERM and SIGKILL to the
> PIDs they recorded and any other children they have.
> 5. Each runner would need to have a SIGCHLD handler to handle zombie
> processes that are reparented to it.

Aurora, Thermos, PID 1, and You

2016-10-31 Thread Zameer Manji
Hey,

Recently I have experienced a number of issues in a production environment
with the DockerContainerizer, Aurora and Thermos. Although my experience is
specific to Docker, I believe this applies to anyone using the Mesos
Containerizer with pid isolation. The root cause of these issues originates
in the interactions between how we launch the executor and the role of PID
1.

The CommandInfo for the ExecutorInfo uses the default `shell` value which
is `true`[1]. This means that in any PID isolated container the `sh`
process that launches the executor will become PID 1. Here is an example
`ps` output from vagrant showing this:

root@aurora:/# ps auxf
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root   250  0.0  0.0  21928  2124 ?Ss   01:19   0:00 /bin/bash
root   469  0.0  0.0  19176  1240 ?R+   01:28   0:00  \_ ps auxf
root 1  0.0  0.0   4328   636 ?Ss   01:10   0:00 /bin/sh -c
${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181
--announcer-zookeeper-auth-config /home/vagrant/aurora/examples/
vagrant/config/announcer-auth.json --mesos-containerizer
root 5  0.7  1.4 1201128 45604 ?   Sl   01:10   0:08 python2.7
/mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181
--announcer-zookeeper-auth-config /home/vagrant/aurora/examples/
vagrant/config/announcer-auth.json --mesos-containerizer-
root23  0.1  0.6 115668 20764 ?S01:10   0:01  \_
/usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
--log_to_disk=DEBUG --hostname=192.168.33.7 --thermos_js
root29  0.0  0.5 113476 17936 ?Ss   01:10   0:00  \_
/usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
--log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
root34  0.0  0.0  20040  1476 ?S01:10   0:00  |
\_ /bin/bash -c  while true; do   echo hello world   sleep 10
  done
root   468  0.0  0.0   4228   348 ?S01:28   0:00  |
  \_ sleep 10
root31  0.0  0.5 113476 17936 ?Ss   01:10   0:00  \_
/usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
--log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
root32  0.0  0.0  20040  1476 ?S01:10   0:00
 \_ /bin/bash -c  while true; do   echo hello world   sleep 10
done
root   467  0.0  0.0   4228   352 ?S01:28   0:00
   \_ sleep 10
root47  0.0  0.0  24116  3052 ?S01:10   0:00 python
./daemon.py


This means processes that double fork/daemonize will be reparented to `sh`
and not our executor. You can see that the `python daemon.py` process has
been reparented to `sh` and not the executor and is outside of the scope of
the runners. This has a number of undesirable implications, perhaps most
concerning is that processes that end up reparenting to PID 1 will not
receive SIGTERM or SIGKILL from thermos but instead will be killed by the
kernel when thermos decides to exit. If anyone here decides to run
published images that use popular software that double forks (like nginx),
you will never be able to ensure the processes die cleanly.

I've been thinking about this problem for a while and upon advice from
others and my own research I believe the best solution is as follows:
1. We have good reasons for setting `shell=True` when launching the
executor. I'm not comfortable changing this because I'm not sure of all of
the implications if we choose another method.
2. The thermos runners end up forking off the target processes. I think the
runners should be responsible for all of the processes that are created by
the children.
3. We can make the runners responsible for their grandchildren by using
`prctl(2)`[2] and setting the `PR_SET_CHILD_SUBREAPER` bit for each runner.
This means double-forked processes will be reparented to the runner and not
PID 1.
4. On task tear down, we make the runners send SIGTERM and SIGKILL to the
PIDs they recorded and any other children they have.
5. Each runner would need to have a SIGCHLD handler to handle zombie
processes that are reparented to it.

[1]: https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3d
f3de5b34af/src/main/java/org/apache/aurora/scheduler/configuration/executor/
ExecutorModule.java#L109-L135
[2]: http://man7.org/linux/man-pages/man2/prctl.2.html
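A minimal sketch of the subreaper idea from point 3 above (illustrative Python, not the actual Thermos runner code): the runner marks itself as a child subreaper so that double-forked processes reparent to it rather than PID 1, and reaps them on SIGCHLD.

import ctypes
import os
import signal

PR_SET_CHILD_SUBREAPER = 36  # from <linux/prctl.h>, available since Linux 3.4

libc = ctypes.CDLL('libc.so.6', use_errno=True)

def become_subreaper():
    if libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), 'prctl(PR_SET_CHILD_SUBREAPER) failed')

def reap_children(signum, frame):
    # Reap any zombie processes that were reparented to us.
    while True:
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except OSError:
            return  # no children left
        if pid == 0:
            return  # children exist but none have exited

become_subreaper()
signal.signal(signal.SIGCHLD, reap_children)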

-- 
Zameer Manji


Re: Need inputs on scheduling

2016-10-14 Thread Zameer Manji
Hey,

I am not an expert on MPI jobs, but it seems possible to run them on
Aurora. Aurora is a pretty flexible scheduler that lets you run arbitrary
binaries or container images. Aurora is designed for long-running services,
so assuming that you want to launch workers that are long-running, it
could solve your problem.
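For illustration, a minimal, hypothetical sketch of a long-running worker under Aurora's job DSL; MPI rank coordination itself is not something Aurora provides, so the worker binary or a separate launcher would have to handle it.

worker_proc = Process(name='mpi_worker', cmdline='./mpi_worker')
worker_task = Task(
    name='mpi_worker',
    processes=[worker_proc],
    resources=Resources(cpu=4.0, ram=8*GB, disk=2*GB))

jobs = [Service(
    cluster='example',
    role='research',
    environment='devel',
    name='mpi_worker',
    instances=16,
    task=worker_task)]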

On Thu, Oct 13, 2016 at 11:12 PM, Mangirish Wagle <vaglomangir...@gmail.com>
wrote:

> Hello Aurora Devs,
>
> I am contributing to Apache Airavata <http://airavata.apache.org/> and
> currently working on extending the support for the science gateways to run
> MPI jobs on cloud based Mesos clusters.
>
> Is there a way I can achieve this using Apache Aurora? I would really
> appreciate if you could share info on any work already being done to
> achieve scheduling MPI jobs on Mesos.
>
> Thank you.
>
> Best Regards,
> Mangirish Wagle
> Graduate Student, Indiana University Bloomington
>
> --
> Zameer Manji
>


A mini postmortem on snapshot failures

2016-09-30 Thread Zameer Manji
 it will likely be
evicted. In this case, running one of the `SCRIPT` queries was taking more
than 20s and there were many pending queries, which caused MyBatis to evict
the connection for the `SCRIPT` query and the snapshot creation to fail.

To fix this issue, operators used the `-db_max_active_connection_count` flag
to increase the maximum number of active MyBatis connections to 100. Once
the scheduler was able to serve requests, operators used `aurora_admin
scheduler_snapshot` to force-create a snapshot. A scheduler failover was
then induced, and recovery time dropped to about 40 seconds.

Today this cluster continues to run with this flag and value so that it can
keep serving a high read load.

I would like to raise three questions:
* Should we add a flag to tune the maximum connection time for MyBatis?
* Should a Snapshot creation failure be fatal?
* Should we change the default maximum connection time and maximum number
of active connections?

[1]:
https://github.com/apache/aurora/blob/rel/0.16.0/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L107-L127

--
Zameer Manji


Re: [Discussion] Implementing a is_health_check_enabled function OR by checking the presence of an instance of health checker

2016-09-26 Thread Zameer Manji
I have no strong preference either way.

On Mon, Sep 26, 2016 at 11:25 AM, Huang Kai 
wrote:

> Hi folks,
>
> I'm currently blocked on the review https://reviews.apache.org/r/51876.
> I was wondering if you guys can provide some insights into the two proposed
> approaches on RB and help me proceed.
>
> The problem is that the aurora executor needs to determine if it should
> send a TASK_RUNNING message based on whether health check is enabled for an
> assigned task.
>
> Initially, I created an is_health_check_enabled(assigned_task) function
> in task_info.py and used it in the aurora executor. See:
> https://reviews.apache.org/r/51876/diff/3/. However, Maxim raised a valid
> point that is_health_check_enabled duplicates some of the health checker
> setup done in a later step. Therefore we should reuse the logic of
> is_health_check_enabled as much as possible in health_checker.
>
> One solution is to create a dedicated function called
> is_health_check_enabled for an assigned_task, and reuse it when we set up a
> health checker. The benefit is better abstraction and ease of testing.
>
> The challenge of implementing it is that this function seems a little
> heavyweight: we have to parse an assigned_task, compute the port map,
> and get the health_checker and health_check_config from it as well. One
> solution I can come up with is to store all of the computed results
> (port_map, health_checker, health_check_config) in a utility class so that
> they can be reused later. But a downside here is that
> is_health_check_enabled now serves multiple purposes, and the meaning of
> the function is no longer clear. It should only answer one question: is
> health check enabled on this task?
>
> A second solution is to check whether health check is enabled for an
> assigned_task by checking for the presence of a health checker instance. A
> benefit of doing this is that we can set up the necessary health checkers
> and check whether health check is enabled in one pass. In this way, we
> reuse the logic as much as possible and eliminate the duplication. See
> https://reviews.apache.org/r/51876/diff/5/
>
> Could you guys let me know your thoughts on the two approaches? If no one
> objects to the second solution, I will modify the executor as in
> https://reviews.apache.org/r/51876/diff/5/
>
> Best,
>
> Kai
>
>
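
For reference, a rough sketch of the first, dedicated-predicate approach
could look like the following. The thrift fields `task.executorConfig.data`
and `assignedPorts` come from the Aurora API; the JSON keys and the overall
structure are assumptions of this sketch rather than the code under review:

import json

def is_health_check_enabled(assigned_task):
    # Answer exactly one question: does this task declare a health check?
    # Port-map computation and health checker construction stay elsewhere.
    try:
        config = json.loads(assigned_task.task.executorConfig.data)
    except (AttributeError, TypeError, ValueError):
        return False
    health_check_config = config.get('health_check_config')
    if not health_check_config:
        return False
    checker = health_check_config.get('health_checker', {})
    if 'http' in checker:
        # HTTP health checks additionally need a 'health' port to hit.
        return 'health' in (assigned_task.assignedPorts or {})
    return True

The second solution would instead build the health checkers unconditionally
and treat a non-None checker instance as the signal that health checking is
enabled.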


Re: [VOTE] Release Apache Aurora 0.16.0 RC2

2016-09-23 Thread Zameer Manji
+1 (binding)

On Fri, Sep 23, 2016 at 11:22 AM, John Sirois  wrote:

> +1 (binding)
>
> Verified via ./build-support/release/verify-release-candidate 0.16.0-rc2
>
> On Fri, Sep 23, 2016 at 11:01 AM, karthik padmanabhan <
> treadston...@gmail.com> wrote:
>
> > +1
> >
> > On Fri, Sep 23, 2016 at 9:14 AM, Maxim Khutornenko 
> > wrote:
> >
> > > +1
> > >
> > > On Thu, Sep 22, 2016 at 9:20 PM, Martin Hrabovčin
> > >  wrote:
> > > > +1 (non-binding)
> > > >
> > > > Verified using ./build-support/release/verify-release-candidate
> > > 0.16.0-rc2
> > > >
> > > > 2016-09-23 2:29 GMT+02:00 Jake Farrell :
> > > >
> > > >> not a blocker for the release candidate, can update the CHANGELOG in
> > > trunk
> > > >> and fix version on the ticket
> > > >>
> > > >> -Jake
> > > >>
> > > >> On Thu, Sep 22, 2016 at 3:18 PM, Joshua Cohen 
> > > wrote:
> > > >>
> > > >> > Note: I forgot to mark AURORA-1779 as fixed in 0.16.0 before
> cutting
> > > this
> > > >> > RC, so that isn't reflected in the changelog. I don't consider
> that
> > a
> > > >> > blocker to release, but if others disagree I can cut rc3.
> > > >> >
> > > >> > On Thu, Sep 22, 2016 at 2:12 PM, Joshua Cohen 
> > > wrote:
> > > >> >
> > > >> > > All,
> > > >> > >
> > > >> > > I propose that we accept the following release candidate as the
> > > >> official
> > > >> > > Apache Aurora 0.16.0 release.
> > > >> > >
> > > >> > > Aurora 0.16.0-rc2 includes the following:
> > > >> > > ---
> > > >> > > The RELEASE NOTES for the release are available at:
> > > >> > > https://git-wip-us.apache.org/repos/asf?p=aurora.git=
> > > >> > > RELEASE-NOTES.md=rel/0.16.0-rc2
> > > >> > >
> > > >> > > The CHANGELOG for the release is available at:
> > > >> > > https://git-wip-us.apache.org/repos/asf?p=aurora.git=
> > > >> > > CHANGELOG=rel/0.16.0-rc2
> > > >> > >
> > > >> > > The tag used to create the release candidate is:
> > > >> > > https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=
> > > >> > > shortlog;h=refs/tags/rel/0.16.0-rc2
> > > >> > >
> > > >> > > The release candidate is available at:
> > > >> > > https://dist.apache.org/repos/dist/dev/aurora/0.16.0-rc2/
> > > >> > > apache-aurora-0.16.0-rc2.tar.gz
> > > >> > >
> > > >> > > The MD5 checksum of the release candidate can be found at:
> > > >> > > https://dist.apache.org/repos/dist/dev/aurora/0.16.0-rc2/
> > > >> > > apache-aurora-0.16.0-rc2.tar.gz.md5
> > > >> > >
> > > >> > > The signature of the release candidate can be found at:
> > > >> > > https://dist.apache.org/repos/dist/dev/aurora/0.16.0-rc2/
> > > >> > > apache-aurora-0.16.0-rc2.tar.gz.asc
> > > >> > >
> > > >> > > The GPG key used to sign the release are available at:
> > > >> > > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> > > >> > >
> > > >> > > Please download, verify, and test.
> > > >> > >
> > > >> > > The vote will close on Sun Sep 25 14:11:09 CDT 2016
> > > >> > >
> > > >> > > [ ] +1 Release this as Apache Aurora 0.16.0
> > > >> > > [ ] +0
> > > >> > > [ ] -1 Do not release this as Apache Aurora 0.16.0 because...
> > > >> > >
> > > >> > > 
> > > >> > > 
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>


Re: [VOTE] Release Apache Aurora 0.16.0 RC0

2016-09-19 Thread Zameer Manji
-1

I discovered https://issues.apache.org/jira/browse/AURORA-1777 in this
release.

On Mon, Sep 19, 2016 at 12:29 PM, Joshua Cohen  wrote:

> All,
>
> I propose that we accept the following release candidate as the official
> Apache Aurora 0.16.0 release.
>
> Aurora 0.16.0-rc0 includes the following:
> ---
> The RELEASE NOTES for the release are available at:
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=
> RELEASE-NOTES.md=rel/0.16.0-rc0
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=
> CHANGELOG=rel/0.16.0-rc0
>
> The tag used to create the release candidate is:
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=
> shortlog;h=refs/tags/rel/0.16.0-rc0
>
> The release candidate is available at:
> https://dist.apache.org/repos/dist/dev/aurora/0.16.0-rc0/
> apache-aurora-0.16.0-rc0.tar.gz
>
> The MD5 checksum of the release candidate can be found at:
> https://dist.apache.org/repos/dist/dev/aurora/0.16.0-rc0/
> apache-aurora-0.16.0-rc0.tar.gz.md5
>
> The signature of the release candidate can be found at:
> https://dist.apache.org/repos/dist/dev/aurora/0.16.0-rc0/
> apache-aurora-0.16.0-rc0.tar.gz.asc
>
> The GPG key used to sign the release are available at:
> https://dist.apache.org/repos/dist/dev/aurora/KEYS
>
> Please download, verify, and test.
>
> The vote will close on Thu Sep 22 14:28:18 CDT 2016
>
> [ ] +1 Release this as Apache Aurora 0.16.0
> [ ] +0
> [ ] -1 Do not release this as Apache Aurora 0.16.0 because...
>
> 
> 
>


Re: Aurora 0.16.0 release

2016-09-12 Thread Zameer Manji
Not having the webhook fix block the release SGTM.

I would like to point out that I would like to add a deprecation in this
release and there is a review out already.
https://reviews.apache.org/r/51712/

Otherwise it seems we are good to go for a release from my perspective.

On Mon, Sep 12, 2016 at 11:38 AM, Joshua Cohen <jco...@apache.org> wrote:

> I don't consider it a blocker for 0.16.0, so... if it's ready before I cut
> the release, great. If not, we'll get it into 0.17.0?
>
> On Fri, Sep 9, 2016 at 8:45 PM, Zameer Manji <zma...@apache.org> wrote:
>
> > Resending this message so it's not stuck in people's spam.
> >
> > This issue is serious but can be fixed quickly I think.
> >
> > On Fri, Sep 9, 2016 at 6:19 PM, Dmitriy Shirchenko <shirc...@uber.com>
> > wrote:
> >
> > > We discovered an issue with webhook today, so we may want to consider
> not
> > > releasing 0.16.0 until we fix this issue next week:
> > > https://issues.apache.org/jira/browse/AURORA-1769
> > >
> > > On Tue, Sep 6, 2016 at 1:34 PM Maxim Khutornenko <ma...@apache.org>
> > wrote:
> > >
> > > > I'd give it one more release as it may break any internal consumers
> of
> > > > that data. Same about AURORA-1708.
> > > >
> > > > As for AURORA-1680, it should be very safe to fix now.
> > > >
> > > > On Tue, Sep 6, 2016 at 1:18 PM, Zameer Manji <zma...@apache.org>
> > wrote:
> > > > > Should we fix AURORA-1707 in this release or bump it to the next
> > > > release? I
> > > > > noticed that it has been unloved for some time.
> > > > >
> > > > > On Tue, Sep 6, 2016 at 10:54 AM, Zameer Manji <zma...@apache.org>
> > > wrote:
> > > > >
> > > > >> + 1
> > > > >>
> > > > >> I think we are due for a release so folks can get Mesos 1.0 and
> GPU
> > > > >> support.
> > > > >>
> > > > >> On Tue, Sep 6, 2016 at 8:36 AM, Joshua Cohen <jco...@apache.org>
> > > wrote:
> > > > >>
> > > > >>> Hi Aurorans,
> > > > >>>
> > > > >>> I plan to kick off the 0.16.0 release some time later this week.
> > > Please
> > > > >>> let
> > > > >>> me know if there are any outstanding patches you'd like to ship
> > > before
> > > > >>> this
> > > > >>> release.
> > > > >>>
> > > > >>> Thanks!
> > > > >>>
> > > > >>> Joshua
> > > > >>>
> > > > >>
> > > > >>
> > > >
> > >
> >
>


Re: Aurora 0.16.0 release

2016-09-09 Thread Zameer Manji
Resending this message so it's not stuck in people's spam.

This issue is serious but can be fixed quickly I think.

On Fri, Sep 9, 2016 at 6:19 PM, Dmitriy Shirchenko <shirc...@uber.com>
wrote:

> We discovered an issue with webhook today, so we may want to consider not
> releasing 0.16.0 until we fix this issue next week:
> https://issues.apache.org/jira/browse/AURORA-1769
>
> On Tue, Sep 6, 2016 at 1:34 PM Maxim Khutornenko <ma...@apache.org> wrote:
>
> > I'd give it one more release as it may break any internal consumers of
> > that data. Same about AURORA-1708.
> >
> > As for AURORA-1680, it should be very safe to fix now.
> >
> > On Tue, Sep 6, 2016 at 1:18 PM, Zameer Manji <zma...@apache.org> wrote:
> > > Should we fix AURORA-1707 in this release or bump it to the next
> > release? I
> > > noticed that it has been unloved for some time.
> > >
> > > On Tue, Sep 6, 2016 at 10:54 AM, Zameer Manji <zma...@apache.org>
> wrote:
> > >
> > >> + 1
> > >>
> > >> I think we are due for a release so folks can get Mesos 1.0 and GPU
> > >> support.
> > >>
> > >> On Tue, Sep 6, 2016 at 8:36 AM, Joshua Cohen <jco...@apache.org>
> wrote:
> > >>
> > >>> Hi Aurorans,
> > >>>
> > >>> I plan to kick off the 0.16.0 release some time later this week.
> Please
> > >>> let
> > >>> me know if there are any outstanding patches you'd like to ship
> before
> > >>> this
> > >>> release.
> > >>>
> > >>> Thanks!
> > >>>
> > >>> Joshua
> > >>>
> > >>
> > >>
> >
>


Re: [PROPOSAL] New RPC `fetchJobUpdates`

2016-09-07 Thread Zameer Manji
Maxim and I discussed this offline after the Aurora dev sync today and we
think the best way forward is to improve the `getJobUpdateDetails` RPC to
take a `JobUpdateQuery`.

I'll be filing a ticket and moving forward on this.

On Fri, Sep 2, 2016 at 3:36 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> The pulseJobUpdate RPC was added as an entirely new feature. This
> proposal suggests a read-only "convenience" method to pull data that
> is already available via existing means.
>
> As for "taking down the scheduler", this argument is moot. It's
> possible to DDoS the scheduler via existing RPCs today and the suggested
> alternative has the same inherent risk of crafting an unscoped query
> that will pull all updates from the store. Also, unless there are a
> LOT of those queries (read: a DoS attack that OOMs the scheduler) it's
> unlikely that the scheduler will sustain any damage as read-only
> queries don't acquire any global or table locks.
>
> On Fri, Sep 2, 2016 at 3:24 PM, Zameer Manji <zma...@apache.org> wrote:
> > I'm not convinced by your argument that adding a read-only RPC that's
> > not covered by the traditional integration test is going to cause bit
> > rot. We
> > already have an RPC that is not covered by the traditional integration
> test
> > `pulseJobUpdate` and it hasn't bit rotted AFAIK.
> >
> > Further some members of the community have been writing alternative
> clients
> > around our API and I think adding this RPC will better support their use
> > cases. `JobUpdate` is a first class struct in our API and I think it
> makes
> > sense to expose a query interface for it.
> >
> > If integration tests are your primary concern, I can also add an
> > integration test for this.
> >
> > Regarding adding `JobUpdateQuery` to `getJobUpdateDetails`, I'm no longer
> > convinced that it's the best way to go because of the risk it contains.
> > It's all too easy to craft a query that can pull a lot of data that can
> take
> > down the scheduler. In my case, I sometimes see it being useful for
> getting
> > all of the `JobUpdates` of a role (completed and active), but that risks
> > pulling a lot of unnecessary data.
> >
> > I will also add that since `JobUpdate` is a first class struct in our
> > storage and APIs, this RPC contains very minimal code. I think it's
> highly
> > unlikely that it will bit rot.
> >
> > On Fri, Sep 2, 2016 at 2:52 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
> >
> >> As I mentioned in Slack, I am ok with the new RPC as long as there is
> >> a use for it elsewhere in the client or UI. Adding a read-only RPC
> >> that isn't going to be called by our traditional integration test
> >> clients sets a fertile ground for bit rot.
> >>
> >> I am actually warming up to your original proposal of adding
> >> JobUpdateQuery into the existing getJobUpdateDetails RPC. While it may
> >> be more expensive to pull multiple updates, we don't necessarily risk
> >> much after we migrated to MVStore on the H2 side. There are no table
> >> locks acquired and the only downside would be pulling events along
> >> with what you need. Provided the query is narrowly scoped, that should
> >> deliver acceptable performance.
> >>
> >> On Thu, Sep 1, 2016 at 2:24 PM, Zameer Manji <zma...@apache.org> wrote:
> >> > Hey,
> >> >
> >> > I've noticed a hole in our current API which makes it difficult to
> write
> >> > external clients and other tooling around fetching the state of
> updates.
> >> >
> >> > Currently, to fetch updates we are given two RPCs:
> >> > 
> >> > /** Gets job update summaries. */
> >> > Response getJobUpdateSummaries(1: JobUpdateQuery jobUpdateQuery)
> >> >
> >> > /** Gets job update details. */
> >> > Response getJobUpdateDetails(1: JobUpdateKey key)
> >> >
> >> > 
> >> >
> >> > The `getJobUpdateSummaries` RPC is not scoped to a single update and
> >> > returns a
> >> > set of `JobUpdateSummary` structs. The struct is defined:
> >> > 
> >> > /** Summary of the job update including job key, user and current
> state.
> >> */
> >> > struct JobUpdateSummary {
> >> >   /** Unique identifier for the update. */
> >> >   5: JobUpdateKey key
> >> >
> >> >   /** User initiated an update. */
> >> >   3: string user
> >> >
> >> >   /** Current job update state.

Re: Aurora 0.16.0 release

2016-09-06 Thread Zameer Manji
Should we fix AURORA-1707 in this release or bump it to the next release? I
noticed that it has been unloved for some time.

On Tue, Sep 6, 2016 at 10:54 AM, Zameer Manji <zma...@apache.org> wrote:

> + 1
>
> I think we are due for a release so folks can get Mesos 1.0 and GPU
> support.
>
> On Tue, Sep 6, 2016 at 8:36 AM, Joshua Cohen <jco...@apache.org> wrote:
>
>> Hi Aurorans,
>>
>> I plan to kick off the 0.16.0 release some time later this week. Please
>> let
>> me know if there are any outstanding patches you'd like to ship before
>> this
>> release.
>>
>> Thanks!
>>
>> Joshua
>>
>
>


Re: [DRAFT][REPORT] Apache Aurora - September 2016

2016-09-06 Thread Zameer Manji
+1

Thanks for drafting this up so quickly.

On Tue, Sep 6, 2016 at 11:52 AM, Jake Farrell  wrote:

>  Please take a second to review the draft board report below and let me
> know if there are any modifications that should be made. I will submit this
> in the next couple days if there are no objections
>
> -Jake
>
>
>
> Apache Aurora is a stateless and fault tolerant service scheduler used to
> schedule jobs onto Apache Mesos such as long-running services, cron jobs,
> and one off tasks.
>
> Project Status
> -
> The Apache Aurora community has continued to see growth from new users and
> contributors while releasing Apache Aurora 0.15.0 and making progress on
> our upcoming 0.16.0 release candidate. The upcoming release will contain a
> number of bug fixes and stability enhancements, as well as updated support
> for Apache Mesos 1.0.0, multiple executor support, and defaulting to Apache
> Curator for scheduler leader election. Community design discussions have
> started around dynamic reservations and job update configuration.
>
> Community
> ---
> Latest Additions:
>
> * PMC addition: Stephan Erb, 2.3.2016
>
> Issue backlog status since last report:
>
> * Created:   53
> * Resolved: 35
>
> Mailing list activity since last report:
>
> * @dev      270 messages
> * @user   73 messages
> * @reviews  863 messages
>
> Releases
> ---
> Last release: Apache Aurora 0.15.0 released 07.06.2016
> Release candidate: Apache Aurora 0.16.0 release candidate in progress
>


Re: Aurora 0.16.0 release

2016-09-06 Thread Zameer Manji
+ 1

I think we are due for a release so folks can get Mesos 1.0 and GPU
support.

On Tue, Sep 6, 2016 at 8:36 AM, Joshua Cohen  wrote:

> Hi Aurorans,
>
> I plan to kick off the 0.16.0 release some time later this week. Please let
> me know if there are any outstanding patches you'd like to ship before this
> release.
>
> Thanks!
>
> Joshua
>


Re: Re: Discussion on review request 51536

2016-09-02 Thread Zameer Manji
Kai,

We have had coupled deploys before; I don't think it's too terrible. It's
something to note in the release notes, plus some operational pain for large
users.

On Fri, Sep 2, 2016 at 4:42 PM, 黄 凯 <texasred2...@hotmail.com> wrote:

> Another concern is that once we roll out the new executor, we should also
> roll out a new client in order to use the health-check feature. Hence the
> executor and client rollout processes seem to be coupled.
>
>
>
>
> --
> *From:* 黄 凯 <texasred2...@hotmail.com>
> *Sent:* September 3, 2016 7:23
> *To:* Zameer Manji; dev@aurora.apache.org
> *Cc:* Joshua Cohen; s...@apache.org; cald...@gmail.com;
> rdelv...@binghamton.edu
> *Subject:* Re: Discussion on review request 51536
>
>
> Thanks for the new proposal, Zameer. It sounds good to me. The benefit is
> that it does not alter the current infrastructure too much.
>
>
> However, there is one thing to keep in mind:
>
> we currently do a check to ensure watch_sec is longer than
> initial_interval_secs. We will have to remove the alert message if we
> choose to skip watch_sec by setting it to zero.
>
>
> So the new configuration will not support executor-driven health check
> unless the executors are rolled out 100%.
>
>
> Does this tradeoff seems OK for us, Maxim?
>
>
> Kai
>
>
> --
> *From:* Zameer Manji <zma...@uber.com>
> *Sent:* September 3, 2016 6:53
> *To:* dev@aurora.apache.org
> *Cc:* 黄 凯; Joshua Cohen; s...@apache.org; cald...@gmail.com;
> rdelv...@binghamton.edu
> *Subject:* Re: Discussion on review request 51536
>
>
>
> On Fri, Sep 2, 2016 at 3:24 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
>> Need to correct a few previous statements:
>>
>> > Also we do not want to expose this message to users.
>> This is incorrect. The original design proposal suggested to show this
>> message in the UI as: "Task is healthy"
>>
>
> Does this mean the message in the status update is going to be exactly
> "Task is healthy" and the scheduler is going to check for this string in
> the `TASK_RUNNING` status update? This means we are going to establish a
> communication mechanism between the executor and scheduler that's not
> defined by a schema. I feel that's worse than putting JSON in there and
> having the scheduler parse it.
>
>
>> > The Mesos API isn't designed for packing arbitrary data
>> > in the status update message.
>> Don't think I agree, this is exactly what this field is for [1] and we
>> already use it for other states [2].
>>
>
> I guess I should have said 'structured arbitrary data'. The informational
> messages are fine and we plumb them blindly into our logging and UI. I'm
> not convinced we should start putting JSON or something more structured in
> there. That's yet another schema we have and yet another versioning story
> we have to go through. This also complicates matters for custom executor
> authors.
>
>
>>
>> > I would be open to just saying that scheduler version
>> > 0.16 (or 0.17) just assumes the executor transitions to
>> > RUNNING once a task is healthy and dropping
>> > `watch_secs`entirely.
>> We can't drop 'watch_secs' entirely as we still have to babysit job
>> updates that don't have health checks enabled.
>>
>
> Understood. I guess we can keep it but I'm now frustrated that we have a
> parameter that is ignored if we set some json in ExecutorConfig.data.
> Ideally, we don't accept `watch_secs` if we want health check driven
> updates. As mentioned before I don't like this implicit tightening of the
> executor and the scheduler.
>
>
>>
>> As for my take on the above, I favor #1 as the simplest answer to an
>> already simple question: "Should we use watch_secs for this instance
>> or not?". That's pretty much it. Scheduler does not need any schema
>> changes, know what health checks are or if a job has them enabled. At
>> least not until we attempt to move to centralized health checks
>> (AURORA-279) but that will be an entirely different design discussion.
>>
>> [1] - https://github.com/apache/mesos/blob/master/include/mesos/
>> mesos.proto#L1605.
>> [2] - https://github.com/apache/aurora/blob/5cad046fc0f0c4bb79a456
>> 3cfcff0442b7bf8383/src/main/python/apache/aurora/executor/
>> aurora_executor.py#L97
>
>
>
> With all of this in mind, I have another proposal. Why can't we have the
> executor changes (wait until the task is healthy for RUNNING) *and* read
> `watch_secs` if it is set? Why not have both of these features and if users
> want purely

Re: Discussion on review request 51536

2016-09-02 Thread Zameer Manji
On Fri, Sep 2, 2016 at 3:24 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> Need to correct a few previous statements:
>
> > Also we do not want to expose this message to users.
> This is incorrect. The original design proposal suggested to show this
> message in the UI as: "Task is healthy"
>

Does this mean the message in the status update is going to be exactly
"Task is healthy" and the scheduler is going to check for this string in
the `TASK_RUNNING` status update? This means we are going to establish a
communication mechanism between the executor and scheduler that's not
defined by a schema. I feel that's worse than putting JSON in there and
having the scheduler parse it.
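
For context, this is roughly what packing a message into a status update
looks like from the executor side with the Mesos Python bindings. A sketch
only; the actual helper and call site in aurora_executor.py differ:

from mesos.interface import mesos_pb2

def send_healthy_update(driver, task_id):
    # The only channel here is TaskStatus.message: a free-form string with
    # no schema, which the scheduler would have to parse by convention.
    status = mesos_pb2.TaskStatus()
    status.task_id.value = task_id
    status.state = mesos_pb2.TASK_RUNNING
    status.message = 'Task is healthy'
    driver.sendStatusUpdate(status)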


> > The Mesos API isn't designed for packing arbitrary data
> > in the status update message.
> Don't think I agree, this is exactly what this field is for [1] and we
> already use it for other states [2].
>

I guess I should have said 'structured arbitrary data'. The informational
messages are fine and we plumb them blindly into our logging and UI. I'm
not convinced we should start putting JSON or something more structured in
there. That's yet another schema we have and yet another versioning story
we have to go through. This also complicates matters for custom executor
authors.


>
> > I would be open to just saying that scheduler version
> > 0.16 (or 0.17) just assumes the executor transitions to
> > RUNNING once a task is healthy and dropping
> > `watch_secs`entirely.
> We can't drop 'watch_secs' entirely as we still have to babysit job
> updates that don't have health checks enabled.
>

Understood. I guess we can keep it but I'm now frustrated that we have a
parameter that is ignored if we set some json in ExecutorConfig.data.
Ideally, we don't accept `watch_secs` if we want health-check-driven
updates. As mentioned before, I don't like this implicit tightening of the
coupling between the executor and the scheduler.


>
> As for my take on the above, I favor #1 as the simplest answer to an
> already simple question: "Should we use watch_secs for this instance
> or not?". That's pretty much it. Scheduler does not need any schema
> changes, know what health checks are or if a job has them enabled. At
> least not until we attempt to move to centralized health checks
> (AURORA-279) but that will be an entirely different design discussion.
>
> [1] - https://github.com/apache/mesos/blob/master/include/
> mesos/mesos.proto#L1605.
> [2] - https://github.com/apache/aurora/blob/5cad046fc0f0c4bb79a4563cfcff04
> 42b7bf8383/src/main/python/apache/aurora/executor/aurora_executor.py#L97



With all of this in mind, I have another proposal. Why can't we have the
executor changes (wait until the task is healthy before RUNNING) *and* read
`watch_secs` if it is set? Why not have both of these features: if users
want purely health-check-driven updates, they can set this value to 0 and
enable health checks; if they want both health-check and time-driven
updates, they can set this value to the time they care about; and if they
just want time-driven updates, they can disable health checking and set
this value.

Then there is no coupling between the executor and the scheduler except for
status updates and there is no dependency on the `message` field of the
status update.

We could even treat `watch_secs` as the minimum time in STARTING + RUNNING
instead of just RUNNING; with this change it becomes a lower bound on the
update transition speed. This can ensure that users don't deploy "too fast"
and end up overwhelming other services.
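
In the .aurora DSL this proposal would boil down to something like the
sketch below. The values are illustrative, and the watch_secs=0 semantics
are the proposal above, not current behavior; both objects would be passed
to the Job via update_config= and health_check_config=:

# Purely health-check-driven updates: zero out the time floor.
update_config = UpdateConfig(
  batch_size=1,
  watch_secs=0)  # 0 => a healthy (RUNNING) instance ends its update

health_check_config = HealthCheckConfig(
  initial_interval_secs=15,
  interval_secs=10,
  max_consecutive_failures=3)

# Or keep a time floor as well: the minimum time an instance must spend in
# STARTING + RUNNING before it is considered successfully updated.
conservative_update_config = UpdateConfig(
  batch_size=1,
  watch_secs=45)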



>
>
> On Fri, Sep 2, 2016 at 2:26 PM, Zameer Manji <zma...@apache.org> wrote:
> > *cc: Renan*
> >
> > I think there is some disagreement/discussion on the review because we
> have
> > not achieved consensus on the design. Since the design doc was written,
> > Aurora adopted multiple executor support as well as non-HTTP-based
> > health checking. This invalidates some parts of the original design. I
> think
> > all of the solutions here are possible amendments to the design doc.
> >
> > I am not in favor of Solution 2 at all because status updates between
> > executor <-> agent <-> master <-> scheduler are designed to update the
> > framework of updates to the task and not really designed to send
> arbitrary
> > information. Just because the Mesos API provides us with a string field
> > doesn't mean we should try to pack in arbitrary data. Also, it isn't
> clear
> > what other capabilities we might add in the future so I'm unconvinced
> that
> > capabilities need to exist at all. My fear is that we will create the
> > infrastructure for capabilities just to serve this need and nothing else.
> &

Re: [PROPOSAL] New RPC `fetchJobUpdates`

2016-09-02 Thread Zameer Manji
I'm not convinced by your argument that adding a read-only RPC that's not
covered by the traditional integration test is going to cause bit rot. We
already have an RPC that is not covered by the traditional integration test,
`pulseJobUpdate`, and it hasn't bit rotted AFAIK.

Further some members of the community have been writing alternative clients
around our API and I think adding this RPC will better support their use
cases. `JobUpdate` is a first class struct in our API and I think it makes
sense to expose a query interface for it.

If integration tests are your primary concern, I can also add an
integration test for this.

Regarding adding `JobUpdateQuery` to `getJobUpdateDetails`, I'm no longer
convinced that it's the best way to go because of the risk it contains.
It's all too easy to craft a query that can pull a lot of data that can take
down the scheduler. In my case, I sometimes see it being useful for getting
all of the `JobUpdates` of a role (completed and active), but that risks
pulling a lot of unnecessary data.

I will also add that since `JobUpdate` is a first class struct in our
storage and APIs, this RPC contains very minimal code. I think it's highly
unlikely that it will bit rot.

On Fri, Sep 2, 2016 at 2:52 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> As I mentioned in Slack, I am ok with the new RPC as long as there is
> a use for it elsewhere in the client or UI. Adding a read-only RPC
> that isn't going to be called by our traditional integration test
> clients sets a fertile ground for bit rot.
>
> I am actually warming up to your original proposal of adding
> JobUpdateQuery into the existing getJobUpdateDetails RPC. While it may
> be more expensive to pull multiple updates, we don't necessarily risk
> much after we migrated to MVStore on the H2 side. There are no table
> locks acquired and the only downside would be pulling events along
> with what you need. Provided the query is narrowly scoped, that should
> deliver acceptable performance.
>
> On Thu, Sep 1, 2016 at 2:24 PM, Zameer Manji <zma...@apache.org> wrote:
> > Hey,
> >
> > I've noticed a hole in our current API which makes it difficult to write
> > external clients and other tooling around fetching the state of updates.
> >
> > Currently, to fetch updates we are given two RPCs:
> > 
> > /** Gets job update summaries. */
> > Response getJobUpdateSummaries(1: JobUpdateQuery jobUpdateQuery)
> >
> > /** Gets job update details. */
> > Response getJobUpdateDetails(1: JobUpdateKey key)
> >
> > 
> >
> > The `getJobUpdateSummaries` RPC is not scoped to a single update and
> > returns a
> > set of `JobUpdateSummary` structs. The struct is defined:
> > 
> > /** Summary of the job update including job key, user and current state.
> */
> > struct JobUpdateSummary {
> >   /** Unique identifier for the update. */
> >   5: JobUpdateKey key
> >
> >   /** User initiated an update. */
> >   3: string user
> >
> >   /** Current job update state. */
> >   4: JobUpdateState state
> > }
> > 
> >
> > The `getJobUpdateDetails` RPC is scoped to a single update and returns
> the
> > following struct:
> >
> > 
> > struct JobUpdateDetails {
> >   /** Update definition. */
> >   1: JobUpdate update
> >
> >   /** History for this update. */
> >   2: list updateEvents
> >
> >   /** History for the individual instances updated. */
> >   3: list instanceEvents
> > }
> >
> > 
> >
> > Maxim mentioned to me yesterday that this RPC is scoped to a single
> update
> > because assembling the `instanceEvents` can be extremely expensive. A
> query
> > that
> > could span more than a single update risks taking down the scheduler in a
> > large
> > cluster.
> >
> >
> > The problem I discovered is that there is no batch API to get the
> > inexpensive
> > information inside the `JobUpdate` struct. For reference this struct
> > contains:
> >
> > 
> > /** Full definition of the job update. */
> > struct JobUpdate {
> >   /** Update summary. */
> >   1: JobUpdateSummary summary
> >
> >   /** Update configuration. */
> >   2: JobUpdateInstructions instructions
> > }
> > 
> >
> > Consumers are forced to make several `getJobUpdateDetails` calls to get
> > multiple
> > `JobUpdate` structs. Since the `JobUpdate` struct is not expensive to
> > assemble,
> > I'm proposing a new RPC that will allow consumers to get several
> `JobUpdate`
> > structs in a single call.
> >
> > 
> > /** Gets job updates. */
> > Response getJobUpdates(1: JobUpdateQuery jobUpdateQuery)
> > 
> >
> > If there are no objections, I will file tickets and put up a patch to
> > implement
> > this.
> >
> > --
> > Zameer Manji
>


Re: Discussion on review request 51536

2016-09-02 Thread Zameer Manji
*cc: Renan*

I think there is some disagreement/discussion on the review because we have
not achieved consensus on the design. Since the design doc was written,
Aurora adopted multiple executor support as well as non-HTTP-based health
checking. This invalidates some parts of the original design. I think all
of the solutions here are possible amendments to the design doc.

I am not in favor of Solution 2 at all because status updates between
executor <-> agent <-> master <-> scheduler are designed to inform the
framework of updates to the task and not really designed to send arbitrary
information. Just because the Mesos API provides us with a string field
doesn't mean we should try to pack in arbitrary data. Also, it isn't clear
what other capabilities we might add in the future, so I'm unconvinced that
capabilities need to exist at all. My fear is that we will create the
infrastructure for capabilities just to serve this need and nothing else.

I object to Solution 1 along the same lines. The Mesos API isn't designed
for packing arbitrary data in the status update message and I don't think
we should abuse that and rely on that. Also our current infrastructure just
plumbs the message to the UI and I think displaying capabilities is not
something we should do.

I am in favor of Solution 3 which is as close as possible to the original
design in the design doc. The design doc says the following:

Scheduler updater will skip the minWaitInInstanceMs (aka watch_secs
> )
> grace period any time it detects a named port ‘health’ in task
> configuration. A RUNNING instance status will signify the end of instance
> update.


Instead of detecting the 'health' port in the task configuration, we make
enabling this feature explicit by adding an `executorDrivenUpdates` bit to
the task configuration.

I understand this option makes this feature more complex because it
requires a schema change and requires operators to deploy the executor to
all agents before upgrading the client. However, I think that's a one-time
operational cost as opposed to long-lived design choices that will affect
the code.

Further Solution 3 is the most amenable to custom executors and continues
our tradition of treating executors as opaque black boxes. I think there is
a lot of value in treating executors as black boxes as it leaves the door
open to switching our executor to something else and doesn't impose a
burden to others that want to write their own.

Alternatively, if amending the schema is too much work, I would be open to
simply saying that scheduler version 0.16 (or 0.17) assumes the executor
transitions to RUNNING once a task is healthy, and dropping `watch_secs`
entirely. We can put it in the release notes that operators must deploy the
executor to 100% before deploying the scheduler.


On Thu, Sep 1, 2016 at 6:40 PM, 黄 凯  wrote:

> Hi Folks,
>
> I'm currently working on a feature in the aurora scheduler and executor. The
> implementation strategy became controversial on the review board, so I was
> wondering if I should broadcast it to a wider audience and initiate a
> discussion. Please feel free to let me know your thoughts; your help is
> greatly appreciated!
>
> The high level goal of this feature is to improve reliability and
> performance of the Aurora scheduler job updater, by relying on health check
> status rather than watch_secs timeout when deciding an individual instance
> update state.
>
> Please see the original review request https://reviews.apache.org/r/51536/,
> the aurora JIRA ticket https://issues.apache.org/jira/browse/AURORA-894,
> and the design doc
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
> for more details and background.
>
> Note: The design doc becomes a little bit outdated on the "scheduler
> change summary" part (this is what the review request is trying to address).
> As a result, I've left some comments to clarify the latest proposed
> implementation plan for scheduler change.
>
> There are two questions I'm trying to address here:
> *1. How does the scheduler infer the executor version and be backward
> compatible?*
> *2. Where do we determine if health check is enabled?*
>
> In short, there are 3 different solutions proposed on the review board.
>
> In the first two approaches, the scheduler will rely on a string to
> determine the executor version. We determine whether health check is
> enabled merely on the executor side. There will be communication between the
> executor and the scheduler.
> *Solution 1: *
> *vCurrent executor sends a message in its health check thread during
> RUNNING state transition, and the vCurrent 

[PROPOSAL] New RPC `fetchJobUpdates`

2016-09-01 Thread Zameer Manji
Hey,

I've noticed a hole in our current API which makes it difficult to write
external clients and other tooling around fetching the state of updates.

Currently, to fetch updates we are given two RPCs:

/** Gets job update summaries. */
Response getJobUpdateSummaries(1: JobUpdateQuery jobUpdateQuery)

/** Gets job update details. */
Response getJobUpdateDetails(1: JobUpdateKey key)



The `getJobUpdateSummaries` RPC is not scoped to a single update and
returns a
set of `JobUpdateSummary` structs. The struct is defined:

/** Summary of the job update including job key, user and current state. */
struct JobUpdateSummary {
  /** Unique identifier for the update. */
  5: JobUpdateKey key

  /** User initiated an update. */
  3: string user

  /** Current job update state. */
  4: JobUpdateState state
}


The `getJobUpdateDetails` RPC is scoped to a single update and returns the
following struct:


struct JobUpdateDetails {
  /** Update definition. */
  1: JobUpdate update

  /** History for this update. */
  2: list<JobUpdateEvent> updateEvents

  /** History for the individual instances updated. */
  3: list<JobInstanceUpdateEvent> instanceEvents
}



Maxim mentioned to me yesterday that this RPC is scoped to a single update
because assembling the `instanceEvents` can be extremely expensive. A query
that could span more than a single update risks taking down the scheduler
in a large cluster.


The problem I discovered is that there is no batch API to get the
inexpensive information inside the `JobUpdate` struct. For reference, this
struct contains:


/** Full definition of the job update. */
struct JobUpdate {
  /** Update summary. */
  1: JobUpdateSummary summary

  /** Update configuration. */
  2: JobUpdateInstructions instructions
}


Consumers are forced to make several `getJobUpdateDetails` calls to get
multiple `JobUpdate` structs. Since the `JobUpdate` struct is not expensive
to assemble, I'm proposing a new RPC that will allow consumers to get
several `JobUpdate` structs in a single call.


/** Gets job updates. */
Response getJobUpdates(1: JobUpdateQuery jobUpdateQuery)


If there are no objections, I will file tickets and put up a patch to
implement this.
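
To make the consumer-side difference concrete, here is a rough sketch. The
thrift Response envelope and error handling are elided, `client` stands for
any generated Aurora API client, and `getJobUpdates` is the proposed call,
which does not exist yet:

def fetch_updates_today(client, query):
    # Today: one summary query, then one getJobUpdateDetails round trip per
    # update key just to obtain the inexpensive JobUpdate struct.
    summaries = client.getJobUpdateSummaries(query)
    return [client.getJobUpdateDetails(summary.key) for summary in summaries]

def fetch_updates_proposed(client, query):
    # Proposed: a single batch call returning the JobUpdate structs directly.
    return client.getJobUpdates(query)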

--
Zameer Manji


Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)

2016-08-24 Thread Zameer Manji
Could we change the default and drop the old code at the same time? I don't
see any benefit of letting that hang around.

I have not tested this code yet, but I hope to do it soon.

On Wed, Aug 24, 2016 at 5:19 AM, Erb, Stephan 
wrote:

> The curator backend has been working well for us so far. I believe it is
> safe to make it the default for the next release, and to drop the old code
> in the release after that.
>
>
>
> *From: *John Sirois 
> *Reply-To: *"u...@aurora.apache.org" , "
> jsir...@apache.org" 
> *Date: *Thursday 7 July 2016 at 01:13
> *To: *Martin Hrabovčin 
> *Cc: *"dev@aurora.apache.org" , Jake Farrell <
> jfarr...@apache.org>, "u...@aurora.apache.org" 
> *Subject: *Re: [FEEDBACK] Transitioning Aurora leader election to Apache
> Curator (`-zk_use_curator`)
>
>
>
> Now that 0.15.0 has been released, I thought I'd check in on any progress
> folks have made with testing/deploying the 0.14.0+ with the Aurora
> Scheduler `-zk_use_curator` flag in-place.
>
> There has been 1 fix that will go out in the 0.16.0 release to reduce
> logger noise on shutdown [1][2] but I have heard no negative (or positive)
> feedback otherwise.
>
>
>
> [1] https://issues.apache.org/jira/browse/AURORA-1729
>
> [2] https://reviews.apache.org/r/49578/
>
>
>
> On Thu, Jun 16, 2016 at 1:18 PM, John Sirois  wrote:
>
>
>
>
>
> On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin <
> martin.hrabov...@gmail.com> wrote:
>
> How should this flag be rolled out to an existing running cluster? Can it
> be done as a rolling update, instance by instance, or do we need to stop
> the whole cluster and then bring all nodes up with the new flag?
>
>
>
> I recommend a whole cluster down, upgrade +  new flag, up.
>
>
>
> A rolling update should work, but will likely be rocky.  My analysis:
>
>
>
> The Aurora leader election consists of 2 components, the actual leader
> election and the resulting advertisement by the leader of itself as the
> Aurora service endpoint.  These 2 components each use zookeeper and of the
> 2 I only ensured that the advertisement was compatible with old releases
> (old clients). The leader election portion is completely internal to the
> Aurora scheduler instances vying for leadership and, under Curator, uses a
> different (enhanced), zookeeper node scheme.  As a result, this is what
> could happen in a slow roll:
>
>
>
> before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
>
> upgrade 0: new-lead, 1: old-lead, 2: old-follow
>
>
>
> Here, node 0 will see itself as leader and nodes 1 and 2 will see node 1
> as leader. The result will be both node 0 and node 1 attempting to read the
> mesos distributed log.  Now the log uses its own leader election and the
> reader must be the leader as things stand, so the Aurora-level leadership
> "tie" will be broken by one of the 2 Aurora-level leaders failing to become
> the mesos distributed log leader, and that node will restart its lifecycle
> - i.e. flap.  This will continue to be the case with the second node
> upgrade and will not stabilize until the 3rd node is upgraded.
>
>
>
>
>
> 2016-06-16 5:03 GMT+02:00 Jake Farrell :
>
> +1, will enable on our test clusters to help verify
>
> -Jake
>
>
> On Tue, Jun 14, 2016 at 7:43 PM, John Sirois  wrote:
>
> > I'd like to move forward with
> > https://issues.apache.org/jira/browse/AURORA-1669 asap; ie: removing
> > legacy
> > (Twitter) commons zookeeper libraries used for Aurora leader election in
> > favor of Apache Curator libraries. The change submitted in
> > https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0 and
> > Apache
> > Curator based service discovery can be enabled with the Aurora scheduler
> > flag `-zk_use_curator`.  I'd like feedback from users who enable this
> > option.  If you have a test cluster where you can enable
> `-zk_use_curator`
> > and exercise leader failure and failover, I'd be grateful. If you have
> > moved to using this option in production with demonstrable improvements
> or
> > even maintenance of status quo, I'd also be grateful for this news. If
> > you've found regressions or new bugs, I'd love to know about those as
> well.
> >
> > Thanks in advance to all those who find time to test this out on real
> > systems!
> >
>
>
>
>
>
>
>
>


Re: [PROPOSAL] Move Aurora discussions to Slack?

2016-08-03 Thread Zameer Manji
I'm also +0 for the same reasons that John listed.

On Tue, Aug 2, 2016 at 2:44 PM, John Sirois  wrote:

> I'm also +0.
>
> I am not a Slack fan, but that said, the barrier to entry is, on the face
> of it, the same if we publicize self-signup like mesos has on their
> community page
> [1]: https://mesos-slackin.herokuapp.com/
> On the negative, we hurt folks like Jake who are connected to many projects
> via IRC and on the positive (maybe) we cater to the burgeoning set of
> corporate contributors who are connected to many projects via Slack.
>
> Being paired with mesos is what tilts me +0 instead of -0.
>
> [1] http://mesos.apache.org/community/
>
> On Tue, Aug 2, 2016 at 3:14 PM, Joshua Cohen  wrote:
>
> > I'm +0 on this. I agree that Slack is in many ways superior to IRC, but
> it
> > also feels like the barrier to entry for Slack is much higher than it is
> > for IRC which is potentially problematic for an open source community.
> >
> > On Tue, Aug 2, 2016 at 4:08 PM, Jake Farrell 
> wrote:
> >
> > > There is an irc bridge which relays all messages back as well as an
> > archive
> > > bot. Not a huge fan of slack due to it being yet another chat client I
> > have
> > > in the background, but if it helps the community stay connected and
> grow
> > > and makes things easier then +1 for whatever it is.
> > >
> > > Might make sense for us to use the mesos.slack.com #aurora channel
> that
> > > way
> > > the two communities stay close and its easier for users to find and ask
> > > questions rather than having multiple slacks they have to join and keep
> > > track of
> > >
> > > -Jake
> > >
> > >
> > >
> > >
> > > On Tue, Aug 2, 2016 at 4:48 PM, Steve Niemitz 
> > wrote:
> > >
> > > > +1, I pretty much never remember to open my IRC client anymore.  I've
> > > been
> > > > using the Mesos Slack for a few weeks now and its way better than
> > IRC.  I
> > > > believe they have chat logging still via a bot of some type too?
> > > >
> > > > On Tue, Aug 2, 2016 at 4:45 PM, Maxim Khutornenko 
> > > > wrote:
> > > >
> > > > > Mesos community has recently moved to Slack as their canonical chat
> > > > channel
> > > > > [1]. Thanks to Stephan, we already have some presence there via
> > #aurora
> > > > > channel in Apache Mesos team.
> > > > >
> > > > > Should we move our IRC discussions to Slack too?
> > > > >
> > > > > [1] - http://markmail.org/message/azd37j64wsozmuhe
> > > > >
> > > >
> > >
> >
>


Re: [DRAFT][REPORT]: Apache Aurora

2016-06-10 Thread Zameer Manji
Jake,

Shouldn't we mention that our PMC Chair has resigned?

On Fri, Jun 10, 2016 at 8:06 PM, Jake Farrell  wrote:

> Please take a second to review the board report below and provide any
> feedback (+1 or any desired modifications). I will submit this report
> pending any changes monday afternoon
>
> -Jake
>
>
>
> Apache Aurora is a stateless and fault tolerant service scheduler used to
> schedule jobs onto Apache Mesos such as long-running services, cron jobs,
> and one off tasks.
>
> Project Status
> -
> The Apache Aurora community has continued to see growth from new users and
> contributors while working towards our upcoming 0.14.0 release. The
> upcoming release will contain a number of bug fixes, stability enhancements,
> and new experimental features such as Mesos GPU resource support, external
> webhook support, and launching tasks from a filesystem image with the new
> Apache Mesos unified containerizer.
>
> Community
> ---
> Latest Additions:
>
> * PMC addition: Stephan Erb, 2.3.2016
>
> Issue backlog status since last report:
>
> * Created:   62
> * Resolved:  80
>
> Mailing list activity since last report:
>
> * @dev      329 messages
> * @user   103 messages
>
> Releases
> ---
> Last release: Apache Aurora 0.13.0 released 4.13.2016
> Release candidate: Apache Aurora 0.14.0 release candidate vote is currently
> in progress
>


Bug in pants and pex effecting python bdist resolution

2016-04-13 Thread Zameer Manji
Hey,

This is only a note for developers who maintain a fork where you have
changed some of the dependency versions. Our current version of
pants (0.0.80) depends on pex 1.1.4, which suffers from this issue
<https://github.com/pantsbuild/pex/issues/226>. As a result, if you attempt
to change any of the dependencies to versions that have a `_` or `-` in
them, pants will not be able to resolve the dependency and build the
artifact.

I have asked a pants developer to downgrade pex
<https://github.com/pantsbuild/pants/pull/3184> in pants to a version that
does not have this issue. Once that version is released, I will be sure to
upgrade pants in our repo.

-- 
Zameer Manji


Re: [VOTE] Release Apache Aurora 0.13.0 RC0

2016-04-13 Thread Zameer Manji
> > > > > >>> > I propose that we accept the following release candidate as
> the
> > > > > >>> official
> > > > > >>> > Apache Aurora 0.13.0 release.
> > > > > >>> >
> > > > > >>> > Aurora 0.13.0-rc0 includes the following:
> > > > > >>> > ---
> > > > > >>> > The NEWS for the release is available at:
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=NEWS=rel/0.13.0-rc0
> > > > > >>>
> > > > > >>>
> > > > > >>> The NEWS link above is broken, but was fixed here:
> > > > > >>> https://reviews.apache.org/r/46070/
> > > > > >>>
> > > > > >>>
> > > > > >>> >
> > > > > >>> >
> > > > > >>> > The CHANGELOG for the release is available at:
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=CHANGELOG=rel/0.13.0-rc0
> > > > > >>>
> > > > > >>>
> > > > > >>> The CHANGELOG looks light with 2 entries, but that may be
> > correct.
> > > > If
> > > > > >>> its
> > > > > >>> not correct, I'm not sure if this is an RC blocker or not ... I
> > > voted
> > > > > >>> assuming it was not.
> > > > > >>>
> > > > > >>>
> > > > > >>> >
> > > > > >>> >
> > > > > >>> > The tag used to create the release candidate is:
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/tags/rel/0.13.0-rc0
> > > > > >>> >
> > > > > >>> > The release candidate is available at:
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.13.0-rc0/apache-aurora-0.13.0-rc0.tar.gz
> > > > > >>> >
> > > > > >>> > The MD5 checksum of the release candidate can be found at:
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.13.0-rc0/apache-aurora-0.13.0-rc0.tar.gz.md5
> > > > > >>> >
> > > > > >>> > The signature of the release candidate can be found at:
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.13.0-rc0/apache-aurora-0.13.0-rc0.tar.gz.asc
> > > > > >>> >
> > > > > >>> > The GPG key used to sign the release are available at:
> > > > > >>> > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> > > > > >>> >
> > > > > >>> > Please download, verify, and test.
> > > > > >>> >
> > > > > >>> > The vote will close on Thu Apr 14 23:24:12 EDT 2016, please
> > vote
> > > > > >>> >
> > > > > >>> > [ ] +1 Release this as Apache Aurora 0.13.0
> > > > > >>> > [ ] +0
> > > > > >>> > [ ] -1 Do not release this as Apache Aurora 0.13.0 because...
> > > > > >>> >
> > > > > >>> >
> > > > > >>> > I'd like to get the voting started with my own +1
> > > > > >>> >
> > > > > >>> > -Jake
> > > > > >>> >
> > > > > >>>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
> --
> Zameer Manji
>
>


Re: Populate DiscoveryInfo in Mesos

2016-04-05 Thread Zameer Manji
zhitaoli...@gmail.com>
> > > > > > >> Sent: Tuesday, March 22, 2016 21:15
> > > > > > >> To: dev@aurora.apache.org
> > > > > > >> Subject: Re: Populate DiscoveryInfo in Mesos
> > > > > > >>
> > > > > > >> Hi Stephan,
> > > > > > >>
> > > > > > >> Sorry for the delay on follow up on this. I took a quick look
> at
> > > > > Aurora
> > > > > > >> code, and it's actually quite easy to pipe this information to
> > > Mesos
> > > > > > (see
> > > > > > >> https://reviews.apache.org/r/45177/ for quick prototype).
> > > > > > >>
> > > > > > >> I'll take a stab to see how I can get Mesos-DNS to work with
> > this
> > > > > > >> prototype.
> > > > > > >>
> > > > > > >> IMO, if this is something the community is interested, the
> main
> > > > > > questions
> > > > > > >> would be 1) how various fields would be mapped in different
> > Aurora
> > > > > > usages,
> > > > > > >> and 2) to which level should opt-in/opt-out configured for
> > > > populating
> > > > > > such
> > > > > > >> information.
> > > > > > >>
> > > > > > >> I actually don't have too much insights on how these usage
> > > > conventions
> > > > > > >> would be set (through command line of scheduler or job
> > > > configuration?)
> > > > > > >>
> > > > > > >> Do you think a design doc is the best action here, or a more
> > > > involved
> > > > > > >> questionnaire about which fields would be useful for
> community,
> > or
> > > > > what
> > > > > > >> value they should take?
> > > > > > >>
> > > > > > >> On Mon, Mar 7, 2016 at 1:00 AM, Erb, Stephan <
> > > > > > stephan@blue-yonder.com
> > > > > > >> >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > That sounds like a good idea! Great.
> > > > > > >> >
> > > > > > >> > If you go ahead with this, please be so kind and start by
> > > posting
> > > > a
> > > > > > >> short
> > > > > > >> > design document here on mailinglist (similar to those here
> > > > > > >> >
> > > > >
> > https://github.com/apache/aurora/blob/master/docs/design-documents.md
> > > > > > ,
> > > > > > >> > but probably shorter).
> > > > > > >> >
> > > > > > >> > This will allow us to split the discussion of the design
> from
> > > > > > discussing
> > > > > > >> > the actual implementation. I believe this is necessary, as
> the
> > > > > > >> > DiscoveryInfo protocol is quite flexible (
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> http://mesos.apache.org/documentation/latest/app-framework-development-guide/
> > > > > > >> > ).
> > > > > > >> >
> > > > > > >> > Thanks,
> > > > > > >> > Stephan
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > 
> > > > > > >> > From: Zhitao Li <zhitaoli...@gmail.com>
> > > > > > >> > Sent: Monday, March 7, 2016 00:05
> > > > > > >> > To: dev@aurora.apache.org
> > > > > > >> > Subject: Populate DiscoveryInfo in Mesos
> > > > > > >> >
> > > > > > >> > Hi,
> > > > > > >> >
> > > > > > >> > It seems like Aurora does not populate the "discovery" field
> > in
> > > > > either
> > > > > > >> > TaskInfo or ExecutorInfo in mesos.proto
> > > > > > >> > <
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L438
> > > > > > >> > >
> > > > > > >> > .
> > > > > > >> >
> > > > > > >> > I'm considering adding this to support retrieving port map
> in
> > > > Mesos
> > > > > > >> > directly. This would enable us to discovery this information
> > > > > directly
> > > > > > >> from
> > > > > > >> > Mesos side, and also enables us to build one universal
> service
> > > > > > discovery
> > > > > > >> > solution for multiple frameworks including Aurora.
> > > > > > >> >
> > > > > > >> > If no objection, I'll create a JIRA ticket for this task.
> > > > > > >> >
> > > > > > >> > Thanks.
> > > > > > >> > --
> > > > > > >> > Cheers,
> > > > > > >> >
> > > > > > >> > Zhitao Li
> > > > > > >> >
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Cheers,
> > > > > > >>
> > > > > > >> Zhitao Li
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Zhitao Li
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Cheers,
> > > > > >
> > > > > > Zhitao Li
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > Zhitao Li
> > > >
> > >
> > >
> > >
> > > --
> > > Cheers,
> > >
> > > Zhitao Li
> > >
> >
>
>
>
> --
> Cheers,
>
> Zhitao Li
>
> --
> Zameer Manji
>
>


Re: [DISCUSS]: 0.13.0 release candidate

2016-04-04 Thread Zameer Manji
+1

On Mon, Apr 4, 2016 at 12:27 PM, Bill Farner <wfar...@apache.org> wrote:

> +1, fire away
>
> On Mon, Apr 4, 2016 at 12:26 PM, Jake Farrell <jfarr...@apache.org> wrote:
>
> > Other than a couple deprecation clean up tickets, in AURORA-1584 [1], it
> > looks like we are about ready to cut the 0.13.0 release candidate and
> start
> > a vote. I wanted to open the floor up for any last minute requests or
> > patches people would like to see make it in before we finalize and cut
> the
> > release candidate. Currently planning on cutting the release candidate
> this
> > Wednesday, April 6th, pending no blockers coming out of this discussion
> > thread. Thoughts, objections?
> >
> > -Jake
> >
> >
> > [1]: https://issues.apache.org/jira/browse/AURORA-1584
> >
>
> --
> Zameer Manji
>
>


Re: aurora job scalability

2016-03-19 Thread Zameer Manji
I would like to chime in and say that, in my experience, Aurora scales
with the total number of instances across all jobs. From a scale
perspective there isn't much difference between a thousand 1-instance jobs
and a single job with 1k instances, as both cases take up roughly the same
amount of memory.

As Josh mentioned Twitter is running thousands of jobs and these jobs span
hundreds of thousands of instances.
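
To make that concrete, here is a rough back-of-envelope sketch; the
per-instance cost below is a made-up placeholder rather than a measured
number, and the only point is that the footprint tracks the total instance
count, not the job count.

final class SchedulerFootprintSketch {
  // Hypothetical per-task-config cost in scheduler memory; not a real
  // measurement.
  private static final long PER_INSTANCE_BYTES = 10 * 1024;

  static long estimateBytes(long jobs, long instancesPerJob) {
    // The footprint is driven by the total number of stored task configs.
    return jobs * instancesPerJob * PER_INSTANCE_BYTES;
  }

  public static void main(String[] args) {
    // 1000 jobs x 1 instance and 1 job x 1000 instances both store ~1000
    // task configs, so under this model they cost roughly the same.
    System.out.println(estimateBytes(1000, 1)); // ~10 MiB
    System.out.println(estimateBytes(1, 1000)); // ~10 MiB
  }
}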

On Fri, Mar 18, 2016 at 9:27 AM, Joshua Cohen <jco...@apache.org> wrote:

> Hi Christopher,
>
> I think you already got an answer from Stephan in IRC, but just wanted to
> follow up for the sake of posterity (in case anyone in the future has a
> similar question and finds this thread). The only limit on the number of
> jobs that Aurora can run would currently be the amount of memory available
> to the Scheduler. Suffice it to say that at Twitter we're running thousands
> of jobs with no issues.
>
> Let us know if you have any follow up questions.
>
> Cheers,
>
> Joshua
>
> On Fri, Mar 18, 2016 at 9:12 AM, Christopher M Luciano <
> cmluci...@us.ibm.com
> > wrote:
>
> >  Hi all. It seems that we may be outgrowing Marathon. We have a problem
> > with the amount of application that we are using, causing us to not
> exactly
> > be "compliant" with Marathon goals. It seems that the unit for Aurora is
> a
> > job+instance of that job. Does job map to a Marathon application? If
> > similar is there a known limitation to how many jobs one can have?
> >
> >  What we discovered for Marathon more application+bigger env_vars =
> bigger
> > zknode size and we come dangerously close to hitting the 1 MB default of
> > the zk zknode size. I'm wondering if this type of a thing has potentially
> > been fixed already in Aurora.
> >
> >
> >
> >
> >
> >
> > Christopher M Luciano
> >
> > Staff Software Engineer, Platform Services
> >
> > IBM Watson Core Technology
> >
> >
> >
> >
> >
>
> --
> Zameer Manji
>
>


Re: problem commiting the transaction to the log

2016-02-17 Thread Zameer Manji
> at
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at
> org.apache.aurora.scheduler.http.HttpStatsFilter.doFilter(HttpStatsFilter.java:70)
> at
> org.apache.aurora.scheduler.http.AbstractFilter.doFilter(AbstractFilter.java:44)
> at
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
> at
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
> at
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
> at
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
> at
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
> at
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at
> org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:82)
> at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:294)
> at
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
> at
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
> at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1288)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:443)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1044)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:372)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:978)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:317)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:369)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:486)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:944)
> at
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1005)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
> at
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
> at
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
> at
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745)
> Caused by:
> org.apache.aurora.scheduler.log.Log$Stream$StreamAccessException:
> Timeout performing log append
> at
> org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:352)
> at
> org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:367)
> at
> org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:315)
> at
> org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:145)
> at
> org.apache.aurora.scheduler.storage.log.StreamManagerImpl.appendAndGetPosition(StreamManagerImpl.java:238)
> at
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:84)
> at
> org.apache.aurora.scheduler.storage.log.StreamManagerImpl$StreamTransactionImpl.commit(StreamManagerImpl.java:267)
> at
> org.apache.aurora.scheduler.storage.log.LogStorage$24.apply(LogStorage.java:616)
> ... 76 more
> Caused by: java.util.concurrent.TimeoutException: Timed out while
> attempting to append
> at org.apache.mesos.Log$Writer.append(Native Method)
> at
> org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule$5.append(MesosLogStreamModule.java:188)
> at
> org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream$3.apply(MesosLog.java:319)
> at
> org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream$3.apply(MesosLog.java:315)
> at
> org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:365)
> ... 82 more
>
> --
> Zameer Manji
>
>


Re: [PROPOSAL] Change java thrift code gen

2016-01-27 Thread Zameer Manji
" wrapper generator [1] today, this new generator - even with
> the
> > python generator removed - represents a 5-6x increase in line count of
> > custom code (~4.1k lines of code and tests in the new custom gen, ~700
> > lines in the existing python custom gen)
> > 2. We conceptually fork from a sibling Apache project.
> >
> > The fork could be mitigated by turning our real experience iterating the
> > custom code generator into a well-founded patch back into the Apache
> Thrift
> > project, but saying we'll do that is easier than following through and
> > actually doing it.
> >
> > ==
> > Review guide / details:
> >
> > The technology stack:
> > The thrift IDL parsing and thrift wire parsing are both handled by the
> > Facebook swift project [4].  We only implement the middle bit that
> > generates java code stubs.  This gives higher confidence that the tricky
> > bits out at either edge are done right.
> > The thrift struct code generation is done using Square's javapoet [5] in
> > favor of templating for the purpose of easier to read generator code.
> This
> > characterization is debatable though and template are certainly more
> > flexible the minute you need to gen a second language (say we like this
> and
> > want to do javascript codegen this way too for example).
> > The MyBatis codegen is forced by the thrift codegen for technical
> > reasons.  In short, there is no simple way to teach MyBatis to read and
> > write immutable objects with builders.  So the MyBatis code is generated
> > via an annotation processor that runs after thrift code gen, but reading
> > thrift annotations that survive that codegen process.
> > The codegen unit testing is done with the help of Google's compile-tester
> > [6].  NB that this has an expected output comparison that checks the
> > generated AST and not the text, so its fairly lenient.  Whitepsace and
> > comments certainly don't matter.
> >
> > Review strategy:
> > The code generator RBs (#1 & #2 in the 3 part series) are probably easier
> > to review looking at samples of the generated code.  Both the thrift
> > codegen and MyBatis codegen samples are conveniently contained in the
> > MyBatis codegen RB (#2: https://reviews.apache.org/r/42749/).  The unit
> > test uses resource files that contain both the thrift codegen inputs the
> > annotation processor runs over and the annotation processor expected
> > outputs  - the MyBatis peer classes.  So have a look there if you need
> > something concrete and don't want to patch the RBs in and actually run
> the
> > codegen (`./gradlew api:compileJava`).
> > The conversion RB (#3) is large but the changes are mainly mechanical
> > conversions from the current mutable thrift + I* wrappers to pure
> immutable
> > thrift mutated via `.toBuilder` and `.with`'er methods.  The main changes
> > of note are to the portions of the codebase tightly tied to thrift as a
> > technology:
> > + Gson/thrift converters
> > + Shiro annotated auth param interception
> > + Thrift/Servlet binding
> >
> > [1]
> >
> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/tools/java/thrift_wrapper_codegen.py
> > [2] https://issues.apache.org/jira/browse/AURORA-987
> > [3]
> >
> https://docs.google.com/spreadsheets/d/1-CYMnEjzknAsY5_r_NVX8r85wxtrEByZ5YRiAbgMhP0/edit#gid=840229346
> > [4] https://github.com/facebook/swift
> > [5] https://github.com/square/javapoet
> > [6] https://github.com/google/compile-testing
> >
>
> --
> Zameer Manji
>
>
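
For anyone unfamiliar with the javapoet library referenced above ([5]),
here is a toy sketch of the kind of generation it enables; the generated
Job class, its single field, and the package name are invented for
illustration and are unrelated to the actual Aurora thrift schema.

import com.squareup.javapoet.FieldSpec;
import com.squareup.javapoet.JavaFile;
import com.squareup.javapoet.MethodSpec;
import com.squareup.javapoet.TypeSpec;

import java.io.IOException;

import javax.lang.model.element.Modifier;

public final class JavaPoetSketch {
  public static void main(String[] args) throws IOException {
    // An immutable "struct" with one field and a getter, emitted as Java
    // source.
    FieldSpec role = FieldSpec
        .builder(String.class, "role", Modifier.PRIVATE, Modifier.FINAL)
        .build();

    MethodSpec constructor = MethodSpec.constructorBuilder()
        .addModifiers(Modifier.PUBLIC)
        .addParameter(String.class, "role")
        .addStatement("this.$N = role", role)
        .build();

    MethodSpec getRole = MethodSpec.methodBuilder("getRole")
        .addModifiers(Modifier.PUBLIC)
        .returns(String.class)
        .addStatement("return $N", role)
        .build();

    TypeSpec job = TypeSpec.classBuilder("Job")
        .addModifiers(Modifier.PUBLIC, Modifier.FINAL)
        .addField(role)
        .addMethod(constructor)
        .addMethod(getRole)
        .build();

    // Prints the generated source for the org.example.gen.Job class.
    JavaFile.builder("org.example.gen", job).build().writeTo(System.out);
  }
}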


Re: PROPOSAL: Host and support nightly Aurora builds on Apache servers

2016-01-22 Thread Zameer Manji
+1 to just putting the nightly builds out there to let people test features
as they are committed. This will also ensure our RCs are of high quality.

Do we have to put them on apache servers? I thought we put our artifacts on
bintray: https://bintray.com/apache/aurora



On Fri, Jan 22, 2016 at 11:54 AM, Dmitriy Shirchenko <cald...@gmail.com>
wrote:

> Hi everyone,
>
> This is a proposal to provide nightly builds in three Linux flavors: Debian
> Jessie, Ubuntu Trusty and CentOS for public consumption. Exposing them
> through Jenkins build job is one option and would be enough.
>
> Reason for nightlies is to allow us quicker testing in production for
> features which we have contributed but have not yet made the official
> release cut.
>
> This is purely an RFC as I could not find other Apache projects doing
> nightlies, though I found that some projects eg Zookeeper provide alpha
> releases[1] while Mesos does not[2].
>
> Thanks,
> Dmitriy.
>
>
> [1] https://www.apache.org/dist/zookeeper/
> [2] http://archive.apache.org/dist/mesos/
>
> --
> Zameer Manji
>


Re: [PROPOSAL] Amend 0.12.0 release goals

2016-01-14 Thread Zameer Manji
+1

On Thu, Jan 14, 2016 at 9:15 AM, Maxim Khutornenko <ma...@apache.org> wrote:

> +1
>
> On Thu, Jan 14, 2016 at 9:14 AM, Joshua Cohen <jco...@apache.org> wrote:
> > Sounds good to me.
> >
> > On Thu, Jan 14, 2016 at 9:31 AM, Erb, Stephan <
> stephan@blue-yonder.com>
> > wrote:
> >
> >> +1 for catching up
> >> 
> >> From: John Sirois <j...@conductant.com>
> >> Sent: Thursday, January 14, 2016 4:18 PM
> >> To: dev@aurora.apache.org
> >> Subject: Re: [PROPOSAL] Amend 0.12.0 release goals
> >>
> >> On Thu, Jan 14, 2016 at 8:02 AM, Bill Farner <wfar...@apache.org>
> wrote:
> >>
> >> > Given that we are still playing catch-up to mesos releases (we are on
> >> > 0.25.0, latest is 0.26.0, there's talk of cutting 0.27.0 soon), i
> would
> >> > like to suggest that we remove these tickets from 0.12.0:
> >> >
> >> > https://issues.apache.org/jira/browse/AURORA-987
> >> > https://issues.apache.org/jira/browse/AURORA-1150
> >> >
> >> > It's slightly unfortunate as these tickets represent the largest
> planned
> >> > effort for 0.12.0, we had a handful of significant contributions that
> >> still
> >> > make this a featureful release:
> >> >
> >> >
> >> >
> >>
> https://github.com/apache/aurora/blob/d542bd1d58bc5dcf6ead95d902c0a8cecbbffe9e/NEWS#L3-L23
> >> >
> >> > Additionally, i would like to recommend that we add the following
> ticket
> >> to
> >> > 0.12.0 since we have an in-fight patch that looks close to completion:
> >> >
> >> > https://issues.apache.org/jira/browse/AURORA-1109
> >>
> >>
> >> +1 to the proposal, catching up would be good to do.
> >>
> >> I clarified the status of AURORA-987 by stopping progress and
> >> un-assigning.  Although I'm working towards that ticket, its by very
> >> indirect means still and I'll re-assign and re-start once I'm actually
> >> engaged in the meat of the API proposal that is needed assuming no one
> else
> >> has dived in.
> >>
> >>
> >> >
> >> >
> >> >
> >> > Cheers,
> >> >
> >> > Bill
> >> >
> >>
> >>
> >>
> >> --
> >> John Sirois
> >> 303-512-3301
> >>
>
> --
> Zameer Manji
>
>


Re: [PROPOSAL] Replace commons-args

2016-01-12 Thread Zameer Manji
h
> > the
> > > > args
> > > > >> > system (intellij/gradle not working nicely with apt)
> > > > >> > b. encourage better testability of Module classes by always
> > > injecting
> > > > all
> > > > >> > args
> > > > >> > c. leverage a well-maintained third-party argument parsing
> library
> > > > >> > d. stretch: enable user-friendly features like logical option
> > groups
> > > > for
> > > > >> > better help/usage output
> > > > >> > e. stretch: enable alternative configuration inputs like a
> > > > configuration
> > > > >> > file or environment variables
> > > > >> >
> > > > >> > (b) is currently an issue because command line arguments are
> > driven
> > > > from
> > > > >> > pseudo-constants within the code, for example:
> > > > >> >
> > > > >> > @NotNull
> > > > >> > @CmdLine(name = "cluster_name", help = "Name to identify the
> > > > cluster
> > > > >> > being served.")
> > > > >> > private static final Arg CLUSTER_NAME =
> Arg.create();
> > > > >> >
> > > > >> > @NotNull
> > > > >> > @NotEmpty
> > > > >> > @CmdLine(name = "serverset_path", help = "ZooKeeper
> ServerSet
> > > > path to
> > > > >> > register at.")
> > > > >> > private static final Arg SERVERSET_PATH =
> > Arg.create();
> > > > >> >
> > > > >> > This makes it simple to add command line arguments.  However, it
> > > means
> > > > >> that
> > > > >> > a level of indirection is needed to parameterize and test the
> code
> > > > >> > consuming arg values.  We have various examples of this
> throughout
> > > the
> > > > >> > project.
> > > > >> >
> > > > >> > I propose that we change the way command line arguments are
> > declared
> > > > such
> > > > >> > that a Module with the above arguments would instead declare an
> > > > interface
> > > > >> > that produces its parameters:
> > > > >> >
> > > > >> > public interface Params {
> > > > >> >   @Help("Name to identify the cluster being served.")
> > > > >> >   String clusterName();
> > > > >> >
> > > > >> >   @NotEmpty
> > > > >> >   @Help("ZooKeeper ServerSet path to register at.")
> > > > >> >   String serversetPath();
> > > > >> > }
> > > > >> >
> > > > >> > public SchedulerModule(Params params) {
> > > > >> >   // Params are supplied to the module constructor.
> > > > >> > }
> > > > >> >
> > > > >> > Please see this review for a complete example of this part of
> the
> > > > change:
> > > > >> > https://reviews.apache.org/r/41804
> > > > >> >
> > > > >> > This is roughly the same amount of overhead for declaring
> > arguments
> > > as
> > > > >> the
> > > > >> > current scenario, with the addition of a very obvious mechanism
> > for
> > > > >> > swapping the source of parameters.  This allows us to isolate
> the
> > > > body of
> > > > >> > code responsible for supplying configuration values, which we
> lack
> > > > today.
> > > > >> >
> > > > >> > The remaining work is to bridge the gap between a command line
> > > > argument
> > > > >> > system and Parameter interfaces.  This is relatively easy to do
> > with
> > > > >> > dynamic proxies.  I have posted a proof of concept here:
> > > > >> > https://reviews.apache.org/r/42042
> > > > >> >
> > > > >> > Regarding (c), i have done some analysis of libraries available
> > and
> > > i
> > > > >> > suggest argparse4j [3].  It has thorough documentation, no
> > > transitive
> > > > >> > dependencies, and is being actively developed (last release Dec
> > > 2015).
> > > > >> > However, i would like to emphasize that i think we should
> minimize
> > > > >> coupling
> > > > >> > to the argument parsing library so that we may switch in the
> > future.
> > > > >> > Argparse4j has a feature that makes the non-critical feature (d)
> > > > >> possible.
> > > > >> >
> > > > >> > With that, what do you think?  Are there other goals we should
> > add?
> > > > Does
> > > > >> > the plan make sense?
> > > > >> >
> > > > >> > [1]
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/aurora/tree/master/commons-args/src/main/java/org/apache/aurora/common/args
> > > > >> > [2] https://github.com/twitter/commons
> > > > >> > [3] https://argparse4j.github.io/
> > > > >>
> > > >
> > >
> >
> >
> >
> > --
> > John Sirois
> > 303-512-3301
> >
>
> --
> Zameer Manji
>
>
>


Re: [PROPOSAL] Use standard logging practices

2015-12-28 Thread Zameer Manji
+1

Could we still keep the Glog formatter class so folks who want to have the
same log formatting between the Aurora log lines and the Mesos driver
(which prints to stderr by default) just have to add a line to their
logging.properties?

The alternative means users would have to build their own glog formatter
and add it to the classpath, in addition to setting the formatter in
logging.properties, which is not straightforward if you want reasonably
formatted log entries across both the driver and the scheduler.
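
As a concrete illustration of what "build their own glog formatter" means,
here is a minimal sketch using plain java.util.logging; the
GlogStyleFormatter class and its output format are an approximation for
illustration, not the actual formatter shipped with Aurora.

import java.util.logging.ConsoleHandler;
import java.util.logging.Formatter;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public final class GlogStyleLoggingSketch {
  // Approximates a glog-like line: level initial, millis, thread id,
  // source, message.
  static final class GlogStyleFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
      return String.format("%s%d THREAD%d %s.%s: %s%n",
          record.getLevel().getName().charAt(0),
          record.getMillis(),
          record.getThreadID(),
          record.getSourceClassName(),
          record.getSourceMethodName(),
          formatMessage(record));
    }
  }

  public static void main(String[] args) {
    Logger root = Logger.getLogger("");
    for (Handler handler : root.getHandlers()) {
      root.removeHandler(handler);  // drop the default console formatter
    }
    Handler stderr = new ConsoleHandler();  // ConsoleHandler writes to stderr
    stderr.setFormatter(new GlogStyleFormatter());
    root.addHandler(stderr);

    Logger.getLogger(GlogStyleLoggingSketch.class.getName())
        .info("scheduler starting");
  }
}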

On Mon, Dec 28, 2015 at 2:54 PM, Bill Farner <wfar...@apache.org> wrote:

> We're currently using some logging scaffolding carried over from Twitter
> commons.  I would like to propose that we dismantle some of this in favor
> of more standard java application logging conventions.
>
> Concretely, i propose we remove the following scheduler command line
> arguments:
> -logtostderr
> -alsologtostderr
> -vlog
> -vmodule
> -use_glog_formatter
>
> Instead of these, we can allow users to customize logging via standard
> java.util.logging inputs (e.g. logging.properties).  We could explore using
> an alternative to java.util.logging, but i suggest we retain that backend
> for now (since it's what we're currently using).
>
> --
> Zameer Manji
>
>


Re: [VOTE] Release Apache Aurora 0.11.0 debs

2015-12-23 Thread Zameer Manji
John,

I think you have found a bug, either in the installation guide or in our
packages. We can either amend the "Installing Mesos" section to include
installing this package, or we can fix our packages to list this
dependency. I'm not sure how packages should behave, so I am not sure what
we should do here.

Maybe we could amend the guide for now, release these debs and figure out
how to prevent this issue for the next release? Perhaps we could assist the
Mesos project in hosting their own packages so we don't need to rely on
these incorrectly packaged artifacts from Mesosphere Inc?


On Wed, Dec 23, 2015 at 4:23 PM, John Sirois <j...@conductant.com> wrote:

> On Wed, Dec 23, 2015 at 2:14 PM, John Sirois <j...@conductant.com> wrote:
>
> > -1 non-binding
> >
> > Tested using new installing guide in Vagrant image using
> 'ubuntu/trusty64'
> > against mesos 0.24.1.
> > Everything worked after 2 tweaks:
> > 1. sudo apt-get install libcurl4-nss-dev
> > 2. $ diff /etc/init/thermos.conf.orig /etc/init/thermos.conf
> > 23a24
> > > --mesos-root=/tmp/mesos \
> >
> > Without item 1 the thermos-executor fails to operate:
> > Traceback (most recent call last):
> >   File "apache/aurora/executor/bin/thermos_executor_main.py", line 45, in
> > 
> > from mesos.native import MesosExecutorDriver
> >   File
> >
> "/root/.pex/install/mesos.native-0.24.1-py2.7-linux-x86_64.egg.c2a926cdb8d599d35c7a569171311edaebda9341/mesos.native-0.24.1-py2.7-linux-x86_64.egg/mesos/native/__init__.py",
> > line 17, in 
> > from ._mesos import MesosExecutorDriverImpl
> > ImportError: libcurl-nss.so.4: cannot open shared object file: No such
> > file or directory
> >
> > Seems like `libcurl4-nss-dev` should be a dependency of the
> > aurora-executor deb.
> >
>
> I guess libcurl is properly a dependency of mesos which just means the
> install guide rec to use the mesosphere mesos debs is suboptimal.  That
> said - aurora-executor and aurora-scheduler should really depend on mesos,
> but much of the install guide works around the fact these deps aren't
> expressed in the debs either.
> I think I'm realizing this means the current partial-working state of the
> debs is accepted as better than no debs ... so
>
> I change my vote to +1
>
>
> >
> >
> > On Wed, Dec 23, 2015 at 10:44 AM, Bill Farner <wfar...@apache.org>
> wrote:
> >
> >> Note that i've lengthened this vote to accommodate the holidays.
> >>
> >> Please consider verifying these debs using the recently-added install
> >> guide: https://github.com/apache/aurora/blob/master/docs/installing.md
> >>
> >> On Wed, Dec 23, 2015 at 9:43 AM, Bill Farner <wfar...@apache.org>
> wrote:
> >>
> >> > I propose that we accept the following artifacts as the official deb
> >> > packaging for
> >> > Apache Aurora 0.11.0.
> >> >
> >> >
> >> >
> >>
> http://people.apache.org/~wfarner/aurora/distributions/0.11.0/deb/ubuntu-trusty/
> >> >
> >> > The Aurora deb packaging includes the following:
> >> > ---
> >> > The CHANGELOG is available at:
> >> >
> >> >
> >>
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=blob_plain;f=specs/debian/changelog;hb=refs/heads/0.11.x
> >> >
> >> > The branch used to create the packaging is:
> >> >
> >> >
> >>
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;h=refs/heads/0.11.x
> >> >
> >> > The packages are available at:
> >> >
> >> >
> >>
> http://people.apache.org/~wfarner/aurora/distributions/0.11.0/deb/ubuntu-trusty/
> >> >
> >> > The GPG keys used to sign the packages are available at:
> >> > https://dist.apache.org/repos/dist/release/aurora/KEYS
> >> >
> >> > Please download, verify, and test.
> >> >
> >> > The vote will close on Wed Jan 6 20:00:00 PT 2015
> >> >
> >> > [ ] +1 Release these as the deb packages for Apache Aurora 0.11.0
> >> > [ ] +0
> >> > [ ] -1 Do not release these artifacts because...
> >> >
> >> > I would like to get the voting started off with my own +1
> >> >
> >>
> >
> >
> >
> > --
> > John Sirois
> > 303-512-3301
> >
>
>
>
> --
> John Sirois
> 303-512-3301
>
> --
> Zameer Manji
>


Re: [VOTE] Release Apache Aurora 0.11.0 RC1

2015-12-22 Thread Zameer Manji
+1 (binding)

The verification script passed for me on Mac OSX 10.10.5.

Also all the components of this release have been running on a production
cluster that I oversee and no issues have been observed.

On Mon, Dec 21, 2015 at 12:46 PM, Joshua Cohen <jco...@apache.org> wrote:

> +1 non-binding
>
> On Sun, Dec 20, 2015 at 10:21 AM, Bill Farner <wfar...@apache.org> wrote:
>
> > Looks like it was orphaned - not linked against the RC ticket and not
> > listed as a blocker of https://issues.apache.org/jira/browse/AURORA-1367
> .
> > I'll move the ticket underneath it to the 0.12.0.
> >
> > On Sun, Dec 20, 2015 at 7:23 AM, Erb, Stephan <
> stephan@blue-yonder.com
> > >
> > wrote:
> >
> > > What's up with this ticket here:
> > > https://issues.apache.org/jira/browse/AURORA-1520
> > >
> > > Was this forgotten? Should we do it now?
> > >
> > > Regards,
> > > Stephan
> > > 
> > > From: John Sirois <j...@conductant.com>
> > > Sent: Friday, December 18, 2015 3:37 AM
> > > To: dev@aurora.apache.org
> > > Subject: Re: [VOTE] Release Apache Aurora 0.11.0 RC1
> > >
> > > On Thu, Dec 17, 2015 at 5:17 PM, Bill Farner <wfar...@apache.org>
> wrote:
> > >
> > > > Friendly reminder that verifying a release can be as easy as
> > > >
> > > >   ./build-support/release/verify-release-candidate  0.11.0-rc1
> > > >
> > > > Of course, if you have a simulated production environment, we would
> > love
> > > to
> > > > hear how this build behaves there!
> > > >
> > > > On Thu, Dec 17, 2015 at 4:08 PM, Bill Farner <wfar...@apache.org>
> > wrote:
> > > >
> > > > > All,
> > > > >
> > > > > I propose that we accept the following release candidate as the
> > > official
> > > > > Apache Aurora 0.11.0 release.
> > > >
> > >
> > > +1 non-binding
> > >
> > > Verified on Arch Linux - kernel 4.2.5 + OpenJDK 1.8.0_66 + Python
> 2.7.11
> > >
> > > >
> > > > > Aurora 0.11.0-rc1 includes the following:
> > > > > ---
> > > > > The NEWS for this release is available at:
> > > > >
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=NEWS=0.11.0-rc1
> > > > >
> > > > > The CHANGELOG for the release is available at:
> > > > >
> > > > >
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=CHANGELOG=0.11.0-rc1
> > > > >
> > > > > The branch used to create the release candidate is:
> > > > >
> > > > >
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/heads/0.11.0-rc1
> > > > >
> > > > > The release candidate is available at:
> > > > >
> > > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.11.0-rc1/apache-aurora-0.11.0-rc1.tar.gz
> > > > >
> > > > > The MD5 checksum of the release candidate can be found at:
> > > > >
> > > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.11.0-rc1/apache-aurora-0.11.0-rc1.tar.gz.md5
> > > > >
> > > > > The signature of the release candidate can be found at:
> > > > >
> > > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.11.0-rc1/apache-aurora-0.11.0-rc1.tar.gz.asc
> > > > >
> > > > > The GPG key used to sign the release are available at:
> > > > > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> > > > >
> > > > > Please download, verify, and test.
> > > > >
> > > > > The vote will close on Tue Dec 22 15:18:55 PST 2015
> > > > >
> > > > > [ ] +1 Release this as Apache Aurora 0.11.0
> > > > > [ ] +0
> > > > > [ ] -1 Do not release this as Apache Aurora 0.11.0 because...
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > John Sirois
> > > 303-512-3301
> > >
> >
>



-- 
Zameer Manji


Re: Proposal - no IRC meeting until 1/4

2015-12-21 Thread Zameer Manji
+1

On Mon, Dec 21, 2015 at 3:46 PM, Dave Lester <d...@davelester.org> wrote:

> +1
>
> > On Dec 21, 2015, at 9:10 AM, Bill Farner <wfar...@apache.org> wrote:
> >
> > In light of the upcoming holidays and impact on everyone's schedules - I
> > suggest we reconvene IRC meetings on 1/4.
>
> --
> Zameer Manji
>
>


Re: Mac OSX brew aurora-cli support

2015-11-24 Thread Zameer Manji
Sounds reasonable to me. Someone can then update the formula to fetch our
official binary.

On Tue, Nov 24, 2015 at 11:50 AM, Bill Farner <wfar...@apache.org> wrote:

> I realized what i said was unclear.  I actually intended the same - use
> brew, but have brew just fetch a binary we host.  Is that reasonable?
>
> On Tue, Nov 24, 2015 at 11:44 AM, Zameer Manji <zma...@apache.org> wrote:
>
> > On Mon, Nov 23, 2015 at 9:16 PM, Bill Farner <wfar...@apache.org> wrote:
> >
> > > That's awesome, thanks for doing that!  Any sense if it would be better
> > for
> > > us to host an official OS X binary?  Would make the install much
> > snappier,
> > > at least.
> > >
> >
> > I think hosting an official OSX binary and perhaps
> suggesting/recommending
> > this installation method for OSX in our docs would good for our users.
> >
> > --
> > Zameer Manji
> >
>
> --
> Zameer Manji
>
>


[RESULT] [VOTE] Release Apache Aurora 0.10.0 RC2

2015-11-16 Thread Zameer Manji
All,

The vote to accept Apache Aurora 0.10.0 RC2 as the official Apache
Aurora 0.10.0
release has passed.

+1 (Binding)
-
Zameer Manji
Maxim Khutornenko
Bill Farner
Jake Farrell

+1

Joshua Cohen

There were 5 +1 votes (4 binding) and no 0 or -1 votes. Thank you to all
who helped make this release.

Aurora 0.10.0 includes the following:
---
The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=aurora.git=CHANGELOG=0.10.0

The tag used to create the release with is 0.10.0:
https://git-wip-us.apache.org/repos/asf?p=aurora.git=0.10.0

The release is available at:
https://dist.apache.org/repos/dist/release/aurora/0.10.0/apache-aurora-0.10.0.tar.gz

The MD5 checksum of the release can be found at:
https://dist.apache.org/repos/dist/release/aurora/0.10.0/apache-aurora-0.10.0.tar.gz.md5

The signature of the release can be found at:
https://dist.apache.org/repos/dist/release/aurora/0.10.0/apache-aurora-0.10.0.asc

The GPG key used to sign the release are available at:
https://dist.apache.org/repos/dist/release/aurora/KEYS

On Wed, Nov 11, 2015 at 8:11 PM, Zameer Manji <zma...@apache.org> wrote:

> All,
>
> I propose that we accept the following release candidate as the official
> Apache Aurora 0.10.0 release.
>
>
> Aurora 0.10.0-rc2 includes the following:
> ---
> The CHANGELOG for the release is available at:
>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=CHANGELOG=0.10.0-rc2
>
> The branch used to create the release candidate is:
>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/heads/0.10.0-rc2
>
> The release candidate is available at:
>
> https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc2/apache-aurora-0.10.0-rc2.tar.gz
>
> The MD5 checksum of the release candidate can be found at:
>
> https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc2/apache-aurora-0.10.0-rc2.tar.gz.md5
>
> The signature of the release candidate can be found at:
>
> https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc2/apache-aurora-0.10.0-rc2.tar.gz.asc
>
> The GPG key used to sign the release are available at:
> https://dist.apache.org/repos/dist/dev/aurora/KEYS
>
> Please download, verify, and test.
>
> The vote will close on Mon Nov 16 12:00:00 PST 2015
>
> [ ] +1 Release this as Apache Aurora 0.10.0
> [ ] +0
> [ ] -1 Do not release this as Apache Aurora 0.10.0 because...
>
> --
> Zameer Manji
>



-- 
Zameer Manji


Re: [VOTE] Release Apache Aurora 0.10.0 RC2

2015-11-12 Thread Zameer Manji
+1 (binding). I verified this release on a clean machine running OSX 10.10
using the ./build-support/release/verify-release-candidate script.

On Wed, Nov 11, 2015 at 8:11 PM, Zameer Manji <zma...@apache.org> wrote:

> All,
>
> I propose that we accept the following release candidate as the official
> Apache Aurora 0.10.0 release.
>
>
> Aurora 0.10.0-rc2 includes the following:
> ---
> The CHANGELOG for the release is available at:
>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=CHANGELOG=0.10.0-rc2
>
> The branch used to create the release candidate is:
>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/heads/0.10.0-rc2
>
> The release candidate is available at:
>
> https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc2/apache-aurora-0.10.0-rc2.tar.gz
>
> The MD5 checksum of the release candidate can be found at:
>
> https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc2/apache-aurora-0.10.0-rc2.tar.gz.md5
>
> The signature of the release candidate can be found at:
>
> https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc2/apache-aurora-0.10.0-rc2.tar.gz.asc
>
> The GPG key used to sign the release are available at:
> https://dist.apache.org/repos/dist/dev/aurora/KEYS
>
> Please download, verify, and test.
>
> The vote will close on Mon Nov 16 12:00:00 PST 2015
>
> [ ] +1 Release this as Apache Aurora 0.10.0
> [ ] +0
> [ ] -1 Do not release this as Apache Aurora 0.10.0 because...
>
> --
> Zameer Manji
>



-- 
Zameer Manji


[VOTE] Release Apache Aurora 0.10.0 RC0

2015-11-10 Thread Zameer Manji
All,

I propose that we accept the following release candidate as the official
Apache Aurora 0.10.0 release.


Aurora 0.10.0-rc0 includes the following:
---
The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=aurora.git=CHANGELOG=0.10.0-rc0

The branch used to create the release candidate is:
https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/heads/0.10.0-rc0

The release candidate is available at:
https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc0/apache-aurora-0.10.0-rc0.tar.gz

The MD5 checksum of the release candidate can be found at:
https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc0/apache-aurora-0.10.0-rc0.tar.gz.md5

The signature of the release candidate can be found at:
https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc0/apache-aurora-0.10.0-rc0.tar.gz.asc

The GPG key used to sign the release are available at:
https://dist.apache.org/repos/dist/dev/aurora/KEYS

Please download, verify, and test.

The vote will close on Fri Nov 13 12:30:00 PST 2015

[ ] +1 Release this as Apache Aurora 0.10.0
[ ] +0
[ ] -1 Do not release this as Apache Aurora 0.10.0 because...

-- 
Zameer Manji


Re: [VOTE] Release Apache Aurora 0.10.0 RC0

2015-11-10 Thread Zameer Manji
I ran the verify-release-candidate script on a clean machine and it passed
for me so +1 (binding) from me.

On Tue, Nov 10, 2015 at 1:08 PM, Bill Farner <wfar...@apache.org> wrote:

> Thanks, Zameer!
>
> Quick reminder to everyone - a basic verification of the RC is easy:
>
> ./build-support/release/verify-release-candidate 0.10.0-rc0
>
>
> Those of you installing Aurora in production environments, it would be wise
> of you to do some live testing and report back.
>
>
> On Tue, Nov 10, 2015 at 11:27 AM, Zameer Manji <zma...@apache.org> wrote:
>
> > All,
> >
> > I propose that we accept the following release candidate as the official
> > Apache Aurora 0.10.0 release.
> >
> >
> > Aurora 0.10.0-rc0 includes the following:
> > ---
> > The CHANGELOG for the release is available at:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=CHANGELOG=0.10.0-rc0
> >
> > The branch used to create the release candidate is:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/heads/0.10.0-rc0
> >
> > The release candidate is available at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc0/apache-aurora-0.10.0-rc0.tar.gz
> >
> > The MD5 checksum of the release candidate can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc0/apache-aurora-0.10.0-rc0.tar.gz.md5
> >
> > The signature of the release candidate can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.10.0-rc0/apache-aurora-0.10.0-rc0.tar.gz.asc
> >
> > The GPG key used to sign the release are available at:
> > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> >
> > Please download, verify, and test.
> >
> > The vote will close on Fri Nov 13 12:30:00 PST 2015
> >
> > [ ] +1 Release this as Apache Aurora 0.10.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Aurora 0.10.0 because...
> >
> > --
> > Zameer Manji
> >
>
> --
> Zameer Manji
>
>


Re: Multiple executor support

2015-11-02 Thread Zameer Manji
+wfarner

I believe Bill was heavily involved in reviewing the proposed patch and
design. Bill, care to comment on what you think here?

On Mon, Nov 2, 2015 at 12:55 PM, <meghdoo...@yahoo.com.invalid> wrote:

> Do we have a decision on this?
>
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/AURORA-1376
>
> It would help to know where we stand on this.
>
> Thx
>
>
> Sent from my iPhone
>
> --
> Zameer Manji
>
>


0.10.0 Release Update

2015-11-02 Thread Zameer Manji
As Maxim mentioned in today's IRC meeting, I elected to remove the
remaining deprecations and removals from the 0.10.0 release in favor of
doing a release now. This will permit us to upgrade our Mesos dependency
and allow our users to better follow the recent Mesos releases.

The dependencies of the RC ticket [0] have been resolved, and I will be
cutting a release within a few days. If anything must make this release,
please let me know ASAP.


[0]: https://issues.apache.org/jira/browse/AURORA-1250

-- 
Zameer Manji


Re: update mesos 0.23 to 0.24

2015-09-29 Thread Zameer Manji
Mauricio,

It seems that it is not possible to upgrade to Mesos 0.24 with Aurora
0.9.0: Aurora 0.9.0 was released against Mesos 0.22, which means it is not
compatible with Mesos 0.24 (only 0.23). I have filed
https://issues.apache.org/jira/browse/AURORA-1503 to figure out how the
project can best move forward from here.

You might be interested in this thread
<http://www.mail-archive.com/dev@mesos.apache.org/msg33307.html> on the
Mesos dev list, where they are reevaluating their current deprecation
policy.

On Sun, Sep 27, 2015 at 9:19 AM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> Hello guys
>
> I'm using aurora 0.9 and tried to update to mesos 0.24.  Right after the
> update I started to get this messages in the aurora leader and it crashed.
> Every new leader crashed in the same way. Mesos was updated in a rolling
> fashion, one node at the time, and it was looking healthy, even marathon
> was able to register itself and start jobs but aurora never did it. Here's
> a sample of the log I got on each leader, see the 'failed to parse data' at
> the end.
>
> I saw this comment in the mesos upgrades notes [1] "Master now publishes
> its information in ZooKeeper in JSON (instead of protobuf). Make sure
> schedulers are linked against >= 0.23.0 libmesos before upgrading the
> master." so I was wondering if it's supported or not.
>
> 2015-09-25 18:49:37,923:1(0x7fd9dc6b4700):ZOO_INFO@log_env@712: Client
> environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-09-25 18:49:37,923:1(0x7fd9dc6b4700):ZOO_INFO@log_env@716: Client
> environment:host.name=11f23e5685b3
> 2015-09-25 18:49:37,923:1(0x7fd9dc6b4700):ZOO_INFO@log_env@723: Client
> environment:os.name=Linux
> 2015-09-25 18:49:37,923:1(0x7fd9dc6b4700):ZOO_INFO@log_env@724: Client
> environment:os.arch=3.19.0-28-generic
> 2015-09-25 18:49:37,923:1(0x7fd9dc6b4700):ZOO_INFO@log_env@725: Client
> environment:os.version=#30~14.04.1-Ubuntu SMP Tue Sep 1 09:32:55 UTC 2015
> I0925 18:49:37.923105   871 sched.cpp:157] Version: 0.22.0
> 2015-09-25 18:49:37,923:1(0x7fd9dc6b4700):ZOO_INFO@log_env@733: Client
> environment:user.name=(null)
> 2015-09-25 18:49:37,923:1(0x7fd9dc6b4700):ZOO_INFO@log_env@741: Client
> environment:user.home=/root
> 2015-09-25 18:49:37,923:1(0x7fd9dc6b4700):ZOO_INFO@log_env@753: Client
> environment:user.dir=/
> 2015-09-25 18:49:37,923:1(0x7fd9dc6b4700):ZOO_INFO@zookeeper_init@786:
> Initiating client connection, host=192.168.255.31:2181,192.168.255.32:2181
> ,
> 192.168.255.33:2181,192.168.255.34:2181,192.168.255.35:2181
> sessionTimeout=1 watcher=0x7fd9e6d88cd0 sessionId=0
> sessionPasswd= context=0x7fd9a8000b70 flags=0
> I0925 18:49:37.923 THREAD800
> org.apache.aurora.scheduler.mesos.SchedulerDriverService.startUp: Driver
> started with code DRIVER_RUNNING
> 2015-09-25 18:49:37,923:1(0x7fd9c9333700):ZOO_INFO@check_events@1703:
> initiated connection to server [192.168.255.32:2181]
> I0925 18:49:37.924 THREAD133
>
> org.apache.aurora.scheduler.SchedulerLifecycle$DefaultDelayedActions.onRegistrationTimeout:
> Giving up on registration in (1, mins)
> 2015-09-25 18:49:37,930:1(0x7fd9c9333700):ZOO_INFO@check_events@1750:
> session establishment complete on server [192.168.255.32:2181],
> sessionId=0x250038b61690f12, negotiated timeout=1
> I0925 18:49:37.930192   224 group.cpp:313] Group process (group(3)@
> 10.224.255.23:8083) connected to ZooKeeper
> I0925 18:49:37.930253   224 group.cpp:790] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I0925 18:49:37.930297   224 group.cpp:385] Trying to create path '/mesos'
> in ZooKeeper
> I0925 18:49:37.930974   224 group.cpp:717] Found non-sequence node
> 'log_replicas' at '/mesos' in ZooKeeper
> I0925 18:49:37.931046   224 detector.cpp:138] Detected a new leader:
> (id='2513')
> I0925 18:49:37.931131   224 group.cpp:659] Trying to get
> '/mesos/json.info_002513' in ZooKeeper
> Failed to detect a master: Failed to parse data of unknown label '
> json.info'
>
>
> [1] http://mesos.apache.org/documentation/latest/upgrades/
>
> --
> Zameer Manji
>


Re: 0.10.0 Feature Requests

2015-09-16 Thread Zameer Manji
It looks like the tickets for 0.10.0 already existed and I have assigned
them to myself:
RC: https://issues.apache.org/jira/browse/AURORA-1250
Deprecations: https://issues.apache.org/jira/browse/AURORA-1367
Breaking Changes: https://issues.apache.org/jira/browse/AURORA-1251

It looks like the tickets Bill referenced are already linked to the ticket
so we are good to go.

On Tue, Sep 15, 2015 at 6:35 PM, Jake Farrell <jfarr...@apache.org> wrote:

> can agree with that, just dont want to see us forget about it and have
> stale code sitting around. moved to the 0.10 deprecations epic
>
> -Jake
>
> On Tue, Sep 15, 2015 at 8:26 PM, Bill Farner <wfar...@apache.org> wrote:
>
> > TODO cleanup seems to be an ongoing process, and much like documentation
> -
> > i think it's tricky to put on a roadmap.
> >
> > On Tue, Sep 15, 2015 at 1:19 PM, Jake Farrell <jfarr...@apache.org>
> wrote:
> >
> > > we should probably spend some time before .10 going through and
> cleaning
> > up
> > > deprecation todo's left in the code also. I've added this to the Aurora
> > > Roadmap Google doc
> > >
> > > -Jake
> > >
> > > On Thu, Sep 10, 2015 at 2:09 PM, Bill Farner <wfar...@apache.org>
> wrote:
> > >
> > > > I agree, documentation overhauls are best decoupled from releases.
> > > >
> > > > On Thu, Sep 10, 2015 at 10:49 AM, Jake Farrell <jfarr...@apache.org>
> > > > wrote:
> > > >
> > > > > Think that this is a great idea and agree that we need to spend
> some
> > > time
> > > > > improving our user content, but it should not be a blocker to the
> > next
> > > > > release candidate. Improving our documentation and website should
> be
> > an
> > > > > ongoing effort
> > > > >
> > > > > -Jake
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Sep 10, 2015 at 1:43 PM, Zameer Manji <zma...@apache.org>
> > > wrote:
> > > > >
> > > > > > One thing I would like to see in 0.10.0 is improvement to our
> > > > > > documentation. We have a lot of documentation but I don't think
> it
> > is
> > > > > well
> > > > > > organized or very accessible to a new user or a prospective user.
> > > This
> > > > > > might involve writing new documentation, improving our, website,
> > etc.
> > > > > >
> > > > > > On Tue, Sep 8, 2015 at 9:30 AM, Bill Farner <wfar...@apache.org>
> > > > wrote:
> > > > > >
> > > > > > > In 0.10.0 i would like to see:
> > > > > > >
> > > > > > > - groundwork and initial endpoints in a REST API (part of
> > > AURORA-987)
> > > > > > >
> > > > > > > - thermos executor support for a simple task description.  this
> > > would
> > > > > be
> > > > > > a
> > > > > > > dramatically reduced json schema that the executor can consume
> > for
> > > > > simple
> > > > > > > use cases where the user is invoking a single shell command
> > > > > > >
> > > > > > > - support for custom executors (AURORA-1376)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Aug 31, 2015 at 11:18 AM, Zameer Manji <
> > zma...@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > > As discussed in today's IRC meeting I will be heading up the
> > > 0.10.0
> > > > > > > > release. What would people like to see in this release?
> > > > > > > >
> > > > > > > > --
> > > > > > > > Zameer Manji
> > > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Zameer Manji
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
> --
> Zameer Manji
>
>


Re: JobConfig diff API

2015-09-15 Thread Zameer Manji
I'm a proponent of firming up our executor <-> scheduler contract. Since we
are going to get multiple executor support soon I think it would be nice if
we said that ExecutorConfig.data was JSON.
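
A minimal sketch of what requiring ExecutorConfig.data to be valid JSON
could look like on the scheduler side, using Gson; the method name and
error messages are illustrative, not an actual Aurora API. Gson's parser
is fairly lenient, so the check below also insists the payload is a JSON
object.

import com.google.gson.JsonElement;
import com.google.gson.JsonParser;
import com.google.gson.JsonSyntaxException;

final class ExecutorDataValidation {
  // Illustrative check: the data blob must parse as JSON and be a JSON
  // object.
  static void assertJsonObject(String executorData) {
    try {
      JsonElement parsed = new JsonParser().parse(executorData);
      if (!parsed.isJsonObject()) {
        throw new IllegalArgumentException(
            "ExecutorConfig.data must be a JSON object.");
      }
    } catch (JsonSyntaxException e) {
      throw new IllegalArgumentException(
          "ExecutorConfig.data is not valid JSON.", e);
    }
  }

  public static void main(String[] args) {
    assertJsonObject("{\"environment\": \"prod\"}");  // passes
    assertJsonObject("not json at all");              // throws
  }
}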

On Tue, Sep 15, 2015 at 10:47 AM, Maxim Khutornenko <ma...@apache.org>
wrote:

> | I hope this doesn't mean we would be returning a textual
> representation of a diff
>
> If we can make an assumption that executor data is always JSON, we can
> deliver a much more specific answer by applying JSON diff tools.
> Something like:
>
> - "environment": "prod"
> + "environment": "test"
>
> Otherwise, we would have to output the entire ExecutorConfig.data blob
> content for both left and right sides and let users figure out the
> problem. I don't think that's acceptable.
>
> Does it make sense? Any suggestions on the output format of the diff?
> I think it should be structured but at the same time we have to get
> down to text level at some point to report concrete discrepancies.
>
> On Mon, Sep 14, 2015 at 8:58 PM, Bill Farner <wfar...@apache.org> wrote:
> > The 'blob'-iness of ExecutorConfig is intentional so that we can support
> > alternative executors.  I'd hate for that to go away.
> >
> > On Mon, Sep 14, 2015 at 8:56 PM, Jake Farrell <jfarr...@apache.org>
> wrote:
> >
> >> This is one of the hoops encountered when using the Thrift api directly
> and
> >> not using the client, I'd love to see ExecutorConfig.data move to a
> thrift
> >> object and not be a string blob
> >>
> >> -Jake
> >>
> >> On Mon, Sep 14, 2015 at 9:28 PM, Bill Farner <wfar...@apache.org>
> wrote:
> >>
> >> > I like the idea of adding this API, but i don't see why it requires
> us to
> >> > make assumptions about ExecutorConfig.data.  I hope this doesn't mean
> we
> >> > would be returning a textual representation of a diff.  Can you
> elaborate
> >> > on that?
> >> >
> >> > On Mon, Sep 14, 2015 at 4:14 PM, Maxim Khutornenko <ma...@apache.org>
> >> > wrote:
> >> >
> >> > > Problem:
> >> > > We currently don't have a good user experience around "aurora job
> >> > > diff" command. The task configs are dumped as "prettified" JSON
> >> > > strings and diffed with the system diff tool. Anyone who tried to
> use
> >> > > it knows it can be very hard to read especially when it comes to
> >> > > executor data deltas. Also, the implementation is done completely
> >> > > within the Aurora client making it hard to reuse this feature by
> other
> >> > > clients (e.g.: an external deploy coordination tool).
> >> > >
> >> > > Proposal:
> >> > > Move the diff logic to the scheduler and expose it via a new
> >> > > jobConfigDiff thrift API.
> >> > >
> >> > > Benefits:
> >> > > - Client will no longer have the custom non-reusable logic moving us
> >> > > closer towards a "thin client" goal.
> >> > > - The new RPC can be fully used by any existing or new API clients.
> >> > > - The diff output will be improved via leveraging third party POJO
> >> > > and/or JSON diff libraries [1,2,3, etc.].
> >> > > - The server updater can be partially/fully unified with the new
> diff
> >> > > logic further improving the overall DRY-ness.
> >> > >
> >> > > Concerns:
> >> > > - The executor data is currently treated as an opaque string blob on
> >> > > the scheduler side. In reality, it's almost guaranteed to be JSON.
> In
> >> > > order to deliver the best UX, the scheduler would have to start
> >> > > requiring ExecutorConfig.data to be a valid JSON.
> >> > >
> >> > > Any other concerns/objections/comments? I would like to formalize
> the
> >> > > proposal be EOW if we reach consensus quickly.
> >> > >
> >> > > Thanks,
> >> > > Maxim
> >> > >
> >> > > [1] -
> >> > >
> >> >
> >>
> http://java-object-diff.readthedocs.org/en/latest/getting-started/#getting-started
> >> > > [2] - http://javers.org/documentation/diff-examples/
> >> > > [3] - https://github.com/skyscreamer/JSONassert
> >> > >
> >> >
> >>
>
> --
> Zameer Manji
>
>


Re: New email lists

2015-09-14 Thread Zameer Manji
The website is incorrect. The mailing list is
announceme...@aurora.apache.org and to subscribe one needs to email
announcements-subscr...@aurora.apache.org. I will update the website
immediately.

On Mon, Sep 14, 2015 at 1:15 PM, Chris Lambert <chrislamb...@gmail.com>
wrote:

> Hi folks,
>
> Looks like one of the new lists is broken...  :-/
>
> <announce-subscr...@aurora.apache.org>:
> > Sorry, no mailbox here by that name. (#5.1.1)
>
>
> Chris
>
> --
> Zameer Manji
>
>


Re: Roadmap

2015-09-08 Thread Zameer Manji
+1 Please start the doc.

Zameer Manji
On Sep 8, 2015 5:11 PM, "Bill Farner" <wfar...@apache.org> wrote:

> Reviving this thread - shall i go ahead and start the google doc for
> discussion?
>
> On Wed, Sep 2, 2015 at 12:06 PM, Zameer Manji <zma...@twopensource.com>
> wrote:
>
> > +1 to putting it in the project documentation
> > +1 to start drafting this
> >
> > On Wed, Sep 2, 2015 at 10:03 AM, Dave Lester <d...@davelester.org>
> wrote:
> >
> > > Would it make sense to first create a google doc to draft/discuss
> ideas?
> > >
> > > Also, how about making this part of the project documentation -- in the
> > > /docs folder and rendered on the website?
> > >
> > > Dave
> > >
> > > On Wed, Sep 2, 2015, at 09:35 AM, Jake Farrell wrote:
> > > > Sounds like we have pretty much reached an agreement here on making
> it
> > > > public facing on the website, will get mechanics setup and we can
> start
> > > > iterating on the content from there. Thanks everyone
> > > >
> > > > -Jake
> > > >
> > > > On Wed, Sep 2, 2015 at 12:20 PM, Chamin Nalinda <chm...@gmail.com>
> > > wrote:
> > > >
> > > > > +1 I've been in this mailing list for sometime and always wonder
> from
> > > where
> > > > > I should start. This is a great move.
> > > > >
> > > > > On Wed, Sep 2, 2015 at 8:24 AM, Dave Lester <d...@davelester.org>
> > > wrote:
> > > > >
> > > > > > +1!
> > > > > >
> > > > > > On Tue, Sep 1, 2015, at 07:45 PM, Bill Farner wrote:
> > > > > > > :-)
> > > > > > >
> > > > > > > On Tue, Sep 1, 2015 at 6:44 PM, Jeffrey Davis <
> > > > > jeffrey.n.da...@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1 to putting it on the website as well, but if and only if
> > it's
> > > > > titled
> > > > > > > > "squad goals"
> > > > > > > >
> > > > > > > > On Tue, Sep 1, 2015 at 8:41 PM, Zameer Manji <
> > zma...@apache.org>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 to putting it on the website.
> > > > > > > > >
> > > > > > > > > On Tue, Sep 1, 2015 at 4:55 PM, Joseph Smith <
> > > yasumo...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Yep, that sounds awesome.
> > > > > > > > > >
> > > > > > > > > > > On Sep 1, 2015, at 10:04 AM, Joshua Cohen <
> > > > > > jco...@twopensource.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > +1 to putting it on the website. I think it would help
> > > adoption
> > > > > > if we
> > > > > > > > > > have
> > > > > > > > > > > an up front and easily accessible document that first,
> > > > > indicates
> > > > > > > > we've
> > > > > > > > > > got
> > > > > > > > > > > plans!, and then goes on to explain what they are. The
> > > caveat
> > > > > is
> > > > > > that
> > > > > > > > > if
> > > > > > > > > > > the document gets stale it would serve the opposite
> > > purpose ;).
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 31, 2015 at 9:32 PM, Bill Farner <
> > > > > > terasur...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> I would love to see this come together!
> > > > > > > > > > >>
> > > > > > > > > > >> I'm leaning towards putting this on the website.  This
> > > gives
> > > > > the
> > > > > > > > list
> > > > > > > > > > >&g

Re: Creating Aurora user@ and announcements@ lists?

2015-09-08 Thread Zameer Manji
+1

On Tue, Sep 8, 2015 at 10:35 AM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> +1
>
> On Tue, Sep 8, 2015 at 2:29 PM, Hussein Elgridly <
> huss...@broadinstitute.org
> > wrote:
>
> > +1. As a user, it felt kinda weird for me to be posting my questions to
> > dev@
> > .
> >
> > Hussein Elgridly
> > Senior Software Engineer, DSDE
> > The Broad Institute of MIT and Harvard
> >
> >
> > On 8 September 2015 at 13:13, Dave Lester <d...@davelester.org> wrote:
> >
> > > I'd like to propose establishing a few additional mailing lists for the
> > > Aurora project:
> > >
> > > * a user@ list to field questions that may go unanswered related to
> the
> > > use and operations of Aurora. As the community grows, I increasingly
> see
> > > questions on IRC that go unanswered during non-business hours, or that
> > > are better handled asynchronously. This list would be a space for those
> > > discussions, along with other non-dev conversations.
> > >
> > > * an announcements@ list that only Aurora committers could publish to
> > > for release, committer, and project announcements.
> > >
> > > Why should be create new lists? To increase communication with non-dev
> > > members of the Aurora community. Additionally, I spent time digging
> > > through stats from other Apache projects and found that non-dev lists
> > > tend to have a higher subscriber rate -- particularly user@ lists.
> Let's
> > > give a space for those members of the community.
> > >
> > > Thoughts? If I don't hear any objections, I'll go ahead and file an
> > > INFRA ticket later this week.
> > >
> > > Dave
> > >
> >
>
> --
> Zameer Manji
>
>


Mechanics of Twitter Commons Import

2015-08-24 Thread Zameer Manji
Hey,

I would like to inform everyone that there is a review out,
https://reviews.apache.org/r/37666/, that imports the Java code we depend
on from Twitter Commons into our tree. The current approach of the import
is a large patch on RB with a description that explains the origin of the
code. The benefit of this approach is that our commit history remains
linear and easy to bisect. However, this approach means that our repo will
not contain the history of these files. Currently, I believe there is no
value in preserving that history, since we do not intend to maintain this
code for long and the history of these files is easily available on
GitHub.

Please +1 or -1 this approach.

-- 
Zameer Manji


Re: [VOTE] Release Apache Aurora 0.9.0 RC0

2015-07-22 Thread Zameer Manji
+1
Release candidate looks good!

On Tue, Jul 21, 2015 at 2:37 PM, Joseph Smith yasumo...@gmail.com wrote:

 +1

 + echo 'Release candidate looks good!’

  On Jul 21, 2015, at 2:29 PM, Kevin Sweeney kevi...@apache.org wrote:
 
  +1
 
  RC verification script passes
 
  On Tue, Jul 21, 2015 at 2:18 PM, Bill Farner wfar...@apache.org wrote:
 
  +1
 
  Successfully ran ./build-support/release/verify-release-candidate
 0.9.0-rc0
 
  -=Bill
 
  On Mon, Jul 20, 2015 at 10:59 AM, Jake Farrell jfarr...@apache.org
  wrote:
 
  I propose that we accept the following release candidate as the
 official
  Apache Aurora 0.9.0 release.
 
 
  Aurora 0.9.0-rc0 includes the following:
  ---
  The CHANGELOG for the release is available at:
 
 
 
 https://git-wip-us.apache.org/repos/asf?p=aurora.gitf=CHANGELOGhb=0.9.0-rc0
 
  The branch used to create the release candidate is:
 
 
 
 https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/heads/0.9.0-rc0
 
  The release candidate is available at:
 
 
 
 https://dist.apache.org/repos/dist/dev/aurora/0.9.0-rc0/apache-aurora-0.9.0-rc0.tar.gz
 
  The MD5 checksum of the release candidate can be found at:
 
 
 
 https://dist.apache.org/repos/dist/dev/aurora/0.9.0-rc0/apache-aurora-0.9.0-rc0.tar.gz.md5
 
  The signature of the release candidate can be found at:
 
 
 
 https://dist.apache.org/repos/dist/dev/aurora/0.9.0-rc0/apache-aurora-0.9.0-rc0.tar.gz.asc
 
  The GPG key used to sign the release are available at:
  https://dist.apache.org/repos/dist/dev/aurora/KEYS
 
  Please download, verify, and test.
 
  The vote will close on Thu Jul 23 14:00:00 EDT 2015
 
  [ ] +1 Release this as Apache Aurora 0.9.0
  [ ] +0
  [ ] -1 Do not release this as Apache Aurora 0.9.0 because...
 
 
  I would like to get the voting started off with my own +1
 
  -Jake
 
 




-- 
Zameer Manji


Re: Forking twitter-commons into our tree

2015-07-06 Thread Zameer Manji
Just to be clear, I'm proposing forking the java parts only.

On Mon, Jul 6, 2015 at 9:06 AM, Joseph Smith yasumo...@gmail.com wrote:

 Also a (tough to concede) +1. Although I’m not a fan of the fork, it will
 help improve velocity and empower a migration away from twitter-commons.

  On Jul 3, 2015, at 8:15 PM, Bill Farner wfar...@apache.org wrote:
 
  That's roughly the eventual plan, which this move would help us
  facilitate. We use guava heavily already; most of our current dependence is
  on ZK and args handling code... but we would look towards dep-shallow
  alternatives.
 
 
 
 _
  From: Chris Aniszczyk caniszc...@gmail.com
  Sent: Friday, July 3, 2015 8:03 AM
  Subject: Re: Forking twitter-commons into our tree
  To:  dev@aurora.apache.org
  Cc: Jake Farrell jfarr...@apache.org
 
 
  I'll see what I can do about IP clearance.
 
  For giggles, how much work do you think it would be to shed twitter-commons
  and just rely on guava and other libraries that I would consider more
  standard?
 
  On Thu, Jul 2, 2015 at 10:34 PM, Bill Farner wfar...@apache.org wrote:
 
  Thanks, Jake!
 
  -=Bill
 
  On Thu, Jul 2, 2015 at 8:10 PM, Jake Farrell jfarr...@apache.org
 wrote:
 
  yes, it makes it easier to donate when it's Apache License 2.0, but it still
  requires the IP clearance [1], which is handled through the IPMC. This
 is
  required so there is an audit trail of that software being donated to
 the
  ASF
 
  -Jake
 
  [1]: http://incubator.apache.org/ip-clearance/index.html
 
 
 
  On Thu, Jul 2, 2015 at 10:41 PM, Bill Farner wfar...@apache.org
 wrote:
 
  Jake - I'm not fully versed on licenses, but is that true even though it's
  all Apache License 2.0?
 
  -=Bill
 
  On Thu, Jul 2, 2015 at 5:28 PM, Jake Farrell jfarr...@apache.org
  wrote:
 
  no objections, but we would have to get an IP clearance doc from Twitter
  for this code in order to bring it into the ASF
 
  -Jake
 
  On Thu, Jul 2, 2015 at 3:20 PM, Zameer Manji zma...@apache.org
  wrote:
 
  Hey,
 
  Aurora depends heavily on twitter-commons for lots of
  functionality.
  However, upstream is not very active and I suspect that it will be
  less
  active in the future. Currently we depend on artifacts published
  from
  this
  project which causes us to depend on older versions of guava and
  guice.
 
  As a result, it seems that it will be difficult to address tickets
  like
  AURORA-1380 https://issues.apache.org/jira/browse/AURORA-1380
  without
  changing something. I propose we fork all of the java portions of
  twitter-commons into our tree, remove the parts we don't use and
  update
  guava and guice so we can move forward on this front.
 
  What are people's thoughts on this?
 
  --
  Zameer Manji
 
 
 
 
 
 
 
 
  --
  Cheers,
 
  Chris Aniszczyk
  http://aniszczyk.org
  +1 512 961 6719

 --
 Zameer Manji




Re: Forking twitter-commons into our tree

2015-07-02 Thread Zameer Manji
I'm glad I came up with the same idea twice. Before I dive too deeply into
this, does anyone else agree or object?

On Thu, Jul 2, 2015 at 1:03 PM, Bill Farner wfar...@apache.org wrote:

 I believe this came up in a previous conversation, with the same
 conclusion you have drawn.  Ticket:
 https://issues.apache.org/jira/browse/AURORA-1213



 _
 From: Zameer Manji zma...@apache.org
 Sent: Thursday, July 2, 2015 12:20 PM
 Subject: Forking twitter-commons into our tree
 To:  dev@aurora.apache.org


 Hey,

 Aurora depends heavily on twitter-commons for lots of functionality.
 However, upstream is not very active and I suspect that it will be less
 active in the future. Currently we depend on artifacts published from this
 project which causes us to depend on older versions of guava and guice.

 As a result, it seems that it will be difficult to address tickets like
 AURORA-1380 https://issues.apache.org/jira/browse/AURORA-1380 without
 changing something. I propose we fork all of the java portions of
 twitter-commons into our tree, remove the parts we don't use and update
 guava and guice so we can move forward on this front.

 What are people's thoughts on this?

 --
 Zameer Manji

 --
 Zameer Manji




Forking twitter-commons into our tree

2015-07-02 Thread Zameer Manji
Hey,

Aurora depends heavily on twitter-commons for lots of functionality.
However, upstream is not very active and I suspect that it will be less
active in the future. Currently we depend on artifacts published from this
project which causes us to depend on older versions of guava and guice.

As a result, it seems that it will be difficult to address tickets like
AURORA-1380 https://issues.apache.org/jira/browse/AURORA-1380 without
changing something. I propose we fork all of the java portions of
twitter-commons into our tree, remove the parts we don't use and update
guava and guice so we can move forward on this front.

What are people's thoughts on this?

-- 
Zameer Manji


Re: Using a config file to support custom executors: potential paradigm shift

2015-07-02 Thread Zameer Manji
I am in favor of #1 to prevent yak shaving.

On Thu, Jul 2, 2015 at 12:10 PM, Bill Farner wfar...@apache.org wrote:

 Thanks for starting this discussion, Renan!

 I think it's clear that the feature you're adding calls for a configuration
 file.  I'm realizing now that we do have some precedent for configuration
 files with the recently-introduced security controls [1].  In that case the
 sane path was obvious since we pass the configuration file in an
 established format to third-party code (Apache Shiro).

 I see several paths ahead:

 1.) start with individual feature-oriented configuration files and
 re-assess down the road

 2.) establish a convention for a single global configuration file

 3.) (2) and migrate command line arguments to a configuration file

 My personal preference is (1), so as to not force Renan to start a yak
 shave, and because I think willingness to change things down the road is
 important.

 I include (3) because people have inquired about that in the past.

 Does anyone have a preference for which path we take?  Are there other options
 I'm not thinking about?


 [1]

 https://github.com/apache/aurora/blob/master/docs/security.md#http-spnego-authentication-kerberos

 -=Bill

 On Wed, Jul 1, 2015 at 3:34 PM, Renan DelValle rdelv...@binghamton.edu
 wrote:

  Hi all,
 
  I'm currently working on bringing custom executor support to Aurora
  (AURORA-1288). As development and discussions about the most appropriate
  solution to this problem have moved along, I've reached a crossroads
  where I need the community's input on the implementation path this
  feature will take.
 
  Right now, after evaluating other options, it seems that the safest
  and most flexible way to provide users the ability to configure
  their own custom executor may be to use a configuration file.
 
  However, as there is no previous use of a config file (everything has
  been done through the command line up until now), a discussion about
  this possible paradigm shift is necessary, since taking this route
  will set a precedent for Aurora.
 
  As Bill Farner said in his comment on Jira, all in all, this discussion
  should be about how we should approach this potential paradigm shift.
 
  -Renan
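
For the record, a purely illustrative sketch of the kind of feature-oriented
configuration file being discussed is below. The path and every key are made
up for illustration; the actual format was still being worked out in
AURORA-1288 at this point.

$ cat /etc/aurora/custom-executor.json    # hypothetical path and format
{
  "executor": {
    "name": "my-custom-executor",
    "command": "/usr/local/bin/my-executor",
    "resources": { "cpus": 0.25, "ram_mb": 128 }
  }
}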
 

 --
 Zameer Manji




Re: building aurora client on mac

2015-05-22 Thread Zameer Manji
The binary should be available in the `./dist` directory adjacent to the
`./pants` command.


$ ls dist/
aurora.pex
$ ./dist/aurora.pex
usage: aurora.pex [-h] [--version] {task,quota,update,cron,job,config,sla}
...

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

commands:
  {task,quota,update,cron,job,config,sla}
taskWork with a task running in an Apache Aurora cluster
quota   Work with quota settings for an Apache Aurora
cluster
update  Interact with the aurora update service.
cronWork with entries in the aurora cron scheduler
job Work with an aurora job
config  Work with an aurora configuration file
sla Work with SLA data in Aurora cluster.
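
If it helps, a minimal end-to-end sketch of building and installing the client
on a Mac is below, using the pants target from the original question; the
install location is only a suggestion.

$ git clone https://github.com/apache/aurora.git && cd aurora
$ ./pants binary src/main/python/apache/aurora/client/cli:aurora
$ cp dist/aurora.pex /usr/local/bin/aurora    # the .pex is self-contained
$ aurora --version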


On Fri, May 22, 2015 at 2:59 PM, Poppy poppyd...@gmail.com wrote:

 I am trying to build the aurora client on a mac for my cli usage.
 In the Aurora git repo I do ./pants binary
 src/main/python/apache/aurora/client/cli:aurora
 Where does the aurora client get generated?

 Thx,
 Praveen

 
 

 --
 Zameer Manji




Re: Human-readable release notes

2015-05-11 Thread Zameer Manji
Should the release manager also add some additional prose at the top of the
document before the release is tagged? It could be content similar to other
projects' NEWS files, or similar to what we put into a blog post, but more
terse.

I envision the RM branching off master, adding the extra prose at the top, and
then tagging the commit that adds the additional prose as the RC.
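
Concretely, something like the rough flow below; the branch and tag names are
illustrative only, and the exact conventions would be up to the RM.

$ git checkout -b 0.9.0-rc origin/master
$ $EDITOR CHANGELOG    # add the human-readable prose at the top
$ git commit -am "Add 0.9.0 release notes prose"
$ git tag rel/0.9.0-rc0
$ git push origin 0.9.0-rc rel/0.9.0-rc0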

On Mon, May 11, 2015 at 3:07 PM, Bill Farner wfar...@apache.org wrote:

 Now that we have started doing more regular releases, one area in need of
 improvement is communicating in plain language what is in a release.  Our
 current change log [1] leaves much to be desired, as it makes it difficult
 for even project developers to know what happened in a release.

 Below I have outlined a straw man for how to put together release notes
 going forward, please attack it!

 During the development of a release, we record notes about major line items
 for features, backwards incompatibilities, deprecations, etc.  We commit
 these to the CHANGELOG as part of the relevant commits, so that they revert
 cleanly.

 An example of a worthy line item would be "Added the 'aurora update wait'
 subcommand to block while an update is in progress." As a counter-example,
 we should not include line items like "Upgrade to gradle 2.4" and "Fix link
 to contributing page."

 When *preparing* for a release, the release manager is responsible for
 editing/organizing these line items.  The release manager should enlist the
 help of other contributors in this process.

 When *creating* a release candidate, the release tool will do as it does
 today - collect all tickets with the fixVersion field matching the release
 number.  The release tool will add these to the CHANGELOG in a section
 below the prose.


 [1] https://github.com/apache/aurora/blob/master/CHANGELOG


 -=Bill
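
For illustration, the CHANGELOG that the proposal above describes might end up
laid out roughly as below. This is only a sketch, not a settled format, and
the ticket line is a placeholder.

0.9.0
-----
Release notes (recorded during development, edited by the RM):
  - Added the 'aurora update wait' subcommand to block while an update is in
    progress.

All resolved tickets (collected by the release tool at RC time):
  * AURORA-XXXX: ...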

 --
 Zameer Manji




Re: [DRAFT][REPORT] Apache Aurora

2015-05-07 Thread Zameer Manji
+1

On Thu, May 7, 2015 at 1:44 PM, Dave Lester d...@davelester.org wrote:

 Looks good!

 On Thu, May 7, 2015, at 01:42 PM, Kevin Sweeney wrote:
  +1
 
  On Thu, May 7, 2015 at 1:41 PM, Bill Farner wfar...@apache.org wrote:
 
   *Please find a draft of our board report below.  Happy to discuss any
   modifications anyone would like to see!  I'll be submitting this
 tomorrow,
   including any agreed edits.*
  
  
   ## Description:
  
   Aurora is a service scheduler used to schedule jobs onto Apache Mesos.
  
   ## Activity:
  
   - Vote for 0.8.0 release is underway
   - Starting monthly cadence for events on SF Bay Area Aurora Users
 Group [1]
  
   ## Issues:
  
   None to report at this time.
  
   ## PMC/Committership changes
  
   None since April report.
  
   ## Releases
  
   Last release: 0.7.0-incubating, Release Date: Feb 05, 2015
  
   ## Mailing list activity
  
   - dev@aurora.apache.org:
 - 133 subscribers
 - 93 e-mails in the last month [2]
  
   ## JIRA activity
  
   - 27 JIRA tickets created in the last month [3]
   - 47 JIRA tickets closed/resolved in the last month [4]
  
  
   [1] http://meetup.com/Bay-Area-Apache-Aurora-Users-Group
   [2] http://s.apache.org/ytb
   [3] http://s.apache.org/gmt
   [4] http://s.apache.org/KIm
  
  
   -=Bill
  
 
 
 
  --
  Kevin Sweeney
  @kts

 --
 Zameer Manji