Re: [VOTE] Move Aurora to Apache Attic

2020-02-03 Thread Bill Farner
+1

Aurora has always been about pragmatism, and right now, this is the best
route for new and existing users.

On Fri, Jan 31, 2020 at 5:13 PM Renan DelValle  wrote:

> +1 (with a fair bit of sadness but hope for the future of the project)
>
> 2020-01-31 17:11 GMT-08:00 Renan DelValle:
> > Folks,
> >
> > As discussed previously, the project activity has diminished to the
> point that the overhead of being an Apache project outweighs the benefits
> of being under the Apache umbrella.
> >
> > If this vote passes, the PMC will be dissolved, our current project
> resources will be moved into the Attic, and the project will reboot on
> GitHub under https://github.com/aurora-scheduler
> >
> > The vote will close on Fri Feb 7 12:00:00 2020 San Francisco Time
> >
> > [ ] +1 Move Aurora into the Apache Attic and dissolve the PMC
> > [ ] +0
> > [ ] -1 Move Aurora into the Apache Attic and dissolve the PMC because...


Re: Combining limit and dedicated constraint

2019-02-04 Thread Bill Farner
I’m pretty sure I’ve seen this combination working in the (distant) past,
for limit:1 and other values.

As a sanity check: if you remove the host where the task _is_ being
scheduled, does the task move to the other host?
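
For anyone who wants to reproduce this, here is a minimal sketch of a job
config combining the two constraints.  The cluster, role, attribute value, and
resource sizes are made-up placeholders, the limit is written in the common
'host': 'limit:1' form, and the schema names (Process, Task, Resources,
Service) are the ones the aurora client injects when it loads a .aurora file:

hello = Process(
  name = 'hello',
  cmdline = 'while true; do echo hello; sleep 10; done')

hello_task = Task(
  processes = [hello],
  resources = Resources(cpu = 0.1, ram = 16*MB, disk = 16*MB))

jobs = [
  Service(
    cluster = 'devcluster',
    role = 'www-data',
    environment = 'prod',
    name = 'hello',
    instances = 2,
    task = hello_task,
    # Pin instances to hosts carrying the dedicated attribute, and allow at
    # most one instance per host.
    constraints = {
      'dedicated': 'www-data/hello',
      'host': 'limit:1',
    })
]

If the combination behaves as it did in the past, the two instances should
land on separate dedicated hosts; comparing the veto reasons against the
attributes in the incoming offers is the next thing i would check.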

On Mon, Feb 4, 2019 at 5:10 PM Renan DelValle 
wrote:

> I have two dedicated nodes, each carrying the dedicated attribute X. I
> launched a job with two instances that has both the dedicated constraint on
> X and a limit:1 constraint.
> Only one instance is able to find a match, while the other alternates
> between being vetoed for not matching the dedicated attribute and for not
> satisfying the limit constraint.
>
> I've checked that the offers from the dedicated nodes are coming through
> with the right attributes.
> Has anyone else been able to do this successfully?
>
> -Renan
>


Re: Volunteers needed

2018-09-18 Thread Bill Farner
I’m happy to pitch in for periodic review.  Anyone is welcome to email me
requesting a review.  I don’t monitor incoming reviews, so unfortunately I
will need to be contacted out-of-band.

On Tue, Sep 18, 2018 at 10:45 AM Renan DelValle 
wrote:

> All,
>
> We are in dire need of folks who would be willing to commit time to review
> patches and submit patches to maintain the project. Small things like
> submitting a patch to upgrade our Mesos dependency (or any other dependency
> really) go a long way towards keeping the project up to date.
>
> Unfortunately, many of the members of the Aurora Project Management
> Committee (PMC) have either moved on from the project or are not in a
> position to dedicate time to the project.
>
> I brought up the topic of PMC inactivity to the PMC on September 5th.
> Until today I've only heard from two other PMC members about this topic
> privately. That is the situation the project is currently in.
>
> This means there is a very real chance that if we don't get volunteers,
> the project will fall behind and, ultimately, become unmaintained.
>
> Therefore if you use this project and would like to see its development
> continue, please consider helping us maintain it by submitting patches or
> code reviews.
>
> Thanks
>
> -Renan
>


Re: Transient task state timeout

2018-07-21 Thread Bill Farner
This is expected behavior, as STARTING is not a transient state.
I don't believe it ever was.  The rationale is that the ASSIGNED ->
STARTING transition acknowledges the handoff from scheduler to executor
control.  From then on, the executor manages the task state.  This allows for
tasks that have a long delay in STARTING -> RUNNING, which may commonly
occur due to slow package or container image fetching.  At this point, your
executor is responsible for any timeouts you deem necessary.
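
Since the scheduler will not time out a task once it reaches STARTING, a
custom executor that wants that safety net has to implement it itself.  A
minimal sketch in Python, assuming a hypothetical send_status_update(state)
hook into your executor's Mesos binding and a made-up timeout value (none of
this is an Aurora or Mesos API):

import threading

STARTING_TIMEOUT_SECS = 300  # made-up value; size it to your fetch times


class StartingWatchdog(object):
  """Fails a task that reports STARTING but never reports RUNNING."""

  def __init__(self, send_status_update):
    # send_status_update is the hypothetical callback into your Mesos binding.
    self._send_status_update = send_status_update
    self._timer = None

  def on_starting(self):
    # Arm the watchdog when the executor sends TASK_STARTING.
    self._timer = threading.Timer(STARTING_TIMEOUT_SECS, self._expire)
    self._timer.daemon = True
    self._timer.start()

  def on_running(self):
    # Disarm as soon as the executor sends TASK_RUNNING.
    if self._timer is not None:
      self._timer.cancel()

  def _expire(self):
    # Never saw RUNNING in time; surface a terminal state ourselves, since
    # the scheduler will not do it for a task already in STARTING.
    self._send_status_update('TASK_FAILED')

The same pattern works for bounding any other transition your executor owns;
the important part is cancelling the timer as soon as the next update goes
out.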

On Fri, Jul 20, 2018 at 11:19 PM, meghdoot bhattacharya <
meghdoo...@yahoo.com.invalid> wrote:

> http://aurora.apache.org/documentation/latest/reference/task-lifecycle/
>
>
> Unexpected Termination: LOST
>
> If a Task stays in a transient task state for too long (such
> as ASSIGNED or STARTING), the scheduler forces it into LOST state, creating
> a new Task in its place that’s sent into PENDING state.
>
> So, the behavior we are observing while testing with our custom executor:
> if the mesos task is in staging, i.e. the executor has not yet sent the
> task STARTING mesos status message, the transient timeout works and the
> task is marked LOST in aurora. However, if the executor has sent the
> STARTING status message but then never sends a RUNNING/FAILED status, the
> transient timeout does not kick in and aurora does not mark the task LOST.
> We waited a good 5+ minutes after the timeout in multiple tests without
> seeing a change.
> This is aurora 0.19.
> Thx
>
>


Re: [DISCUSS] Move project to GitBox service

2018-06-11 Thread Bill Farner
+1 for taking PRs, but I’m not opinionated about how that is achieved.

> On Jun 11, 2018, at 5:27 PM, Renan DelValle  wrote:
> 
> All,
> 
> I wanted to bring up for discussion moving the project from our current
> ReviewBoard based workflow to a GitHub pull request based workflow through
> the use of the ASF's GitBox service[1].
> 
> The GitBox service would allow us to read/write to our GitHub repository
> instead of only mirroring our Apache hosted repository. This means we'd be
> able to take pull requests through the GitHub platform instead of taking
> patches through ReviewBoard.
> 
> It's worth mentioning that the Aurora community is responsible for two
> repositories: aurora and aurora-packaging. I believe moving to this model
> would greatly simplify our patch submission process and would lower the
> barrier to entry for new contributors for both of these repositories.
> 
> Looking forward to hearing the community's thoughts on this idea.
> 
> - Renan
> 
> [1] https://reference.apache.org/committer/git


Re: snapshot analyze

2018-06-06 Thread Bill Farner
You might be thinking of something we had at Twitter geared towards capacity 
planning. It was intended to tell us how many more instances of a given resource 
shape would hypothetically “fit” in the cluster.
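
The arithmetic behind that kind of tool is simple enough to sketch.  This is
not that tool, just a per-host headroom count in Python over made-up
free-resource numbers:

def headroom(hosts, shape):
  """Counts how many more instances of `shape` would fit, host by host.

  hosts: iterable of dicts of free 'cpu', 'ram_mb', 'disk_mb' per host.
  shape: dict with the same keys describing a single instance.
  """
  total = 0
  for host in hosts:
    total += min(int(host[k] // shape[k]) for k in ('cpu', 'ram_mb', 'disk_mb'))
  return total

free = [
  {'cpu': 6.0, 'ram_mb': 12288, 'disk_mb': 50000},
  {'cpu': 2.5, 'ram_mb': 30720, 'disk_mb': 80000},
]
print(headroom(free, {'cpu': 1.0, 'ram_mb': 2048, 'disk_mb': 4096}))  # -> 8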

> On Jun 6, 2018, at 1:20 AM, meghdoot bhattacharya 
>  wrote:
> 
> A long time back, I remember reading some threads about tools that could
> introspect a snapshot directly and analyze records, etc. Do they exist? If
> so, any pointers would be greatly appreciated.
> Thx


Re: Recovery instructions updates

2018-06-05 Thread Bill Farner
>
> How does the site get updated? Is it auto-generated when we build releases?


The source lives in the project SVN repo:

$ svn co https://svn.apache.org/repos/asf/aurora/ aurora-svn

There are instructions for updating it.  It's a pretty mechanical process,
but not automated.

So, do we plan to add the patch for next release?


Meghdoot - i suspect David would appreciate an incoming patch for the docs,
assuming that's what you're referring to.

He mentioned this step in the end-to-end tests, which is (hopefully)
straightforward enough to try without assistance.



On Tue, Jun 5, 2018 at 12:47 AM, meghdoot bhattacharya <
meghdoo...@yahoo.com.invalid> wrote:

> Thx David. So, do we plan to add the patch for next release? We will be
> happy to validate it as part of rc validation.
>
>
>
>   From: David McLaughlin 
>  To: dev@aurora.apache.org
>  Sent: Monday, June 4, 2018 9:45 AM
>  Subject: Re: Recovery instructions updates
>
> We should definitely update that doc, Bill's patch makes this much easier
> (as can be seen by the e2e test) and we've been using it in our scale test
> environment. How does the site get updated? Is it auto-generated when we
> build releases?
>
> Having corrupted logs that frequently is concerning too, we haven't seen
> anything like this and we do explicit snapshots/backups as part of every
> Scheduler deploy. If there's a bug lurking, would be good to get in front
> of it.
>
> On Sun, Jun 3, 2018 at 10:40 AM, Meghdoot bhattacharya <
> meghdoo...@yahoo.com.invalid> wrote:
>
> > We will try to recover the log files on the snapshot loading error.
> >
> > + 1 to Bill’s approach on making offline recovery. We will try the patch
> > on our side.
> >
> > Renan, I would ask you to prepare a PR for the restoration docs proposing
> > the 2 additional steps required in the current world, as we look into
> > maybe using a different mechanism. The prep steps to get the scheduler
> > ready for backup can hopefully be eliminated with the alternative approach.
> >
> > On the side, let's see if we can recover the logs from the corrupted
> > snapshot loading.
> >
> >
> > Thx
> >
> > > On Jun 3, 2018, at 9:50 AM, Stephan Erb 
> > wrote:
> > >
> > > That sounds indeed concerning. Would be great if you could file an
> issue
> > and attach the related log files and tracebacks.
> > >
> > > Bill recently added a potential replacement for the existing restore
> > mechanism: https://github.com/apache/aurora/commit/
> > 2e1ca42887bc8ea1e8c6cddebe9d1cf29268c714. Given the set of issues you
> > have bumped into with the current restore, this new approach might be
> worth
> > exploring further.
> > >
> > > On 03.06.18, 08:43, "Meghdoot bhattacharya"
> >  wrote:
> > >
> > > Thx Renan for sharing the details. This backup restore happened under
> > > not-so-easy circumstances, so I would encourage the leads to keep the
> > > docs updated as much as possible and to include them in release
> > > validation.
> > >
> > > The other issue, of snapshots having task and other objects as nil,
> > > which causes the schedulers to fail, we have now seen 2 times in the
> > > past year. Other than finding the root cause of why that entry appears
> > > during snapshot creation, there needs to be defensive code to either
> > > ignore that entry on loading or provide a way to fix the snapshot.
> > > Otherwise we might have to go through a day's worth of snapshots to
> > > find which one did not have that entry and recover from there. Mean
> > > time to recover gets impacted under those circumstances. One extra
> > > piece of info, not sure whether it is relevant: the corrupted snapshot
> > > was created by the admin cli (the assumption is it should not matter
> > > whether the scheduler triggers it or it is forced via the cli), which
> > > reported success, as did the aurora logs, but then loading it exposed
> > > the issue.
> > >
> > >Thx
> > >
> > >> On Jun 2, 2018, at 3:54 PM, Renan DelValle 
> > wrote:
> > >>
> > >> Hi all,
> > >>
> > >> We tried following the recovery instructions from
> > >> http://aurora.apache.org/documentation/latest/
> > operations/backup-restore/
> > >>
> > >> After our change from the Twitter commons ZK library to Apache
> Curator,
> > >> these instructions are no longer valid.
> > >>
> > >> In order for Aurora to carry out a leader election in the current
> > >> state, Aurora has to first connect to a Mesos master. What we ended up
> > >> doing was connecting to a Mesos master that had nothing on it, to
> > >> bypass this new requirement.
> > >>
> > >> Next, wiping away -native_log_file_path did not seem to be enough to
> > >> recover from a corrupted mesos replicated log. We had to manually wipe
> > away
> > >> entries in ZK and move the snapshot backup directory in order for the
> > >> leader to not fall back on either a snapshot or the mesos-log to
> > rehydrate
> > >> the leader.
> > >>
> > >> Finall

Re: [VOTE] Discontinue Official Binary Package releases

2018-05-18 Thread Bill Farner
+1

> On May 18, 2018, at 6:44 PM, Renan DelValle  wrote:
> 
> +1
> 
> On Fri, May 18, 2018 at 6:35 PM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
> 
>> +1
>> 
>>> On Fri, May 18, 2018 at 8:01 PM, Renan DelValle  wrote:
>>> 
>>> All,
>>> 
>>> As has been brought up before, we lack the capacity to continue to hold
>>> votes separately for releases and binary releases.
>>> 
>>> Therefore, I propose that we stop releasing binary packages until such a
>>> time as we can combine voting for release packages and binary packages
>>> into a single vote.
>>> 
>>> 
>>> The vote will close on Wed May 25th 16:00:00 PST 2018
>>> 
>>> [ ] +1 Discontinue Official Binary packages releases
>>> [ ] +0
>>> [ ] -1 Continue Official Binary packages releases because...
>>> 
>> 
>> 


Re: Scheduling improvements followups

2018-05-02 Thread Bill Farner
>
> How much work (and feasibility?) would be to add other filter conditions


In my opinion, relatively little work.  An extension point exists as a thin
interface [1], injectable via a module [2].  OfferOrder [3] may look like a
promising alternative, but i see that as a less flexible option.
Regardless of the path taken, i advise starting with customization in a
downstream fork/extension of the apache codebase.

I will be happy to offer high-level guidance to anyone looking to pursue
this.  You may also be interested in a commit of mine on a branch [4] that
may be partially reusable as an approach to doing this while maintaining
high scheduling throughput.

[1] https://github.com/apache/aurora/blob/a3d596ead62404300edbbab1179476410c8284ad/src/main/java/org/apache/aurora/scheduler/offers/OfferSet.java#L35-L40
[2] https://github.com/apache/aurora/blob/a3d596ead62404300edbbab1179476410c8284ad/src/main/java/org/apache/aurora/scheduler/offers/OfferManagerModule.java#L98-L101
[3] https://github.com/apache/aurora/blob/a3d596ead62404300edbbab1179476410c8284ad/src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java#L16-L25
[4] https://github.com/wfarner/aurora/commit/f77d79a2d01c3b5b34f11b812d5dcff2789e0766

On Wed, May 2, 2018 at 10:40 AM, Meghdoot bhattacharya <
meghdoo...@yahoo.com.invalid> wrote:

> Just wanted to follow up on the current state.
>
> 1. Is there any work on implementing soft constraints?
>
> 2. How much work would it be (and how feasible?) to add other filter
> conditions to sort the list of hosts to schedule against, even without a
> full fitness-score kind of model? Say the default algorithm picks a set of
> probable hosts, but we apply external filters to sort it. For example, even
> though cpu/mem is available on a host, based on network or iops (as read
> from a real-time monitor) we choose to avoid it.
>
> Good write-up on what K8s provides today, with options to override:
>
> https://thenewstack.io/implementing-advanced-scheduling-techniques-with-kubernetes/
>
> Thx


Re: [VOTE] Release Apache Aurora 0.19.1 RC0

2018-02-08 Thread Bill Farner
+1, binding

I did encounter a unit test failure, but maintain my +1 as this test case
has been notoriously flaky on macOS 10.13.3 (especially for me, apparently).
All other checks in the verification script pass.

   def _run_collector_tests(collector, target, wait):
 assert collector.value == 0

 collector.sample()
 wait()
 assert collector.value == 0

 f1 = make_file(TEST_AMOUNT_1, dir=target)
 wait()
   > assert collector.value >= TEST_AMOUNT_1.as_(Data.BYTES)
   E assert 100728832 >= 104857600.0
   E  +  where 100728832 =
.value
   E  +  and   104857600.0 = ()
   E  +where  =
Amount(100, MB).as_
   E  +and=
Data.BYTES

   src/test/python/apache/thermos/monitoring/test_disk.py:44: AssertionError

On Wed, Feb 7, 2018 at 3:41 PM, Renan DelValle 
wrote:

> This point release fixes the list arg parsing regression introduced by the
> switch to JCommander, so that we may release binary packages for 0.19.x.
>
> Kicking off the voting with a +1 from me.
>
> Validated with ./build-support/release/verify-release-candidate 0.19.1-rc0
>
> On Wed, Feb 7, 2018 at 3:37 PM, Renan DelValle 
> wrote:
>
> > All,
> >
> > I propose that we accept the following release candidate as the official
> > Apache Aurora 0.19.1 release.
> >
> > Aurora 0.19.1-rc0 includes the following:
> > ---
> > The RELEASE NOTES for the release are available at:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=RELEA
> > SE-NOTES.md&hb=rel/0.19.1-rc0
> >
> > The CHANGELOG for the release is available at:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=CHANG
> > ELOG&hb=rel/0.19.1-rc0
> >
> > The tag used to create the release candidate is:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=short
> > log;h=refs/tags/rel/0.19.1-rc0
> >
> > The release candidate is available at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.19.1-rc0/apa
> > che-aurora-0.19.1-rc0.tar.gz
> >
> > The MD5 checksum of the release candidate can be found at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.19.1-rc0/apa
> > che-aurora-0.19.1-rc0.tar.gz.md5
> >
> > The signature of the release candidate can be found at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.19.1-rc0/apa
> > che-aurora-0.19.1-rc0.tar.gz.asc
> >
> > The GPG key used to sign the release are available at:
> > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> >
> > Please download, verify, and test.
> >
> > The vote will close on Sat Feb 10 14:00:33 PST 2018
> >
> > [ ] +1 Release this as Apache Aurora 0.19.1
> > [ ] +0
> > [ ] -1 Do not release this as Apache Aurora 0.19.1 because...
> >
>


Welcome new committers and PMC member!

2018-02-06 Thread Bill Farner
Folks,

I'm happy to announce that we have two new developers on the project!

Renan DelValle is now a committer and PMC member

Jordan Ly is now a committer


Welcome aboard, we're looking forward to your continued contributions!


-=Bill


Re: [RESULT] [VOTE] Release Apache Aurora 0.19.x packages

2018-01-16 Thread Bill Farner
We'll need to step back to a new point release.  I should be able to kick
this off next week unless i am beaten to the punch.

On Mon, Jan 15, 2018 at 4:00 PM, Renan DelValle 
wrote:

> Should we try releasing the binaries again now that we tackled this issue?
> There have been a few folks on the Slack channel asking when the
> binaries for 0.19 will be released.
>
> -Renan
>
> On Wed, Dec 13, 2017 at 5:23 PM, Bill Farner  wrote:
>
> > I reverse my vote to -1 and am closing the vote as failed.
> >
> > Turns out i had some old debs in my dist/ dir, and the test script picked
> > those up.  After clearing those, i encounter the same issue.
> >
> > Here is the culprit:
> >
> > $ ag -i THERMOS_EXECUTOR_RESOURCES
> > specs/debian/aurora-scheduler.startup.sh
> > 37:  -thermos_executor_resources="$THERMOS_EXECUTOR_RESOURCES" \
> >
> > specs/debian/aurora-scheduler.upstart
> > 34:  -thermos_executor_resources="$THERMOS_EXECUTOR_RESOURCES" \
> >
> > specs/debian/aurora-scheduler.init
> > 75:-thermos_executor_resources="$THERMOS_EXECUTOR_RESOURCES" \
> >
> > specs/debian/aurora-scheduler.default
> > 65:THERMOS_EXECUTOR_RESOURCES=""
> >
> > thermos_executor_resources is passed an empty string.  Here is the option
> > definition:
> >
> > @Parameter(names = "-thermos_executor_resources",
> > > description = "A comma separated list of additional resources to copy
> > > into the sandbox."
> > > + "Note: if thermos_executor_path is not the thermos_executor.pex file
> > > itself, "
> > > + "this must include it.")
> > > > public List<String> thermosExecutorResources = ImmutableList.of();
> >
> >
> > We expect this to become an empty list, however the parser emits a list
> of
> > size one, containing an empty string.  I've filed AURORA-1962
> > <https://issues.apache.org/jira/browse/AURORA-1962> for the issue.
> >
> >
> > On Wed, Dec 13, 2017 at 3:31 PM, Renan DelValle <
> renanidelva...@gmail.com>
> > wrote:
> >
> > > I'm running into the same issues as Stephan. I tried with Trusty,
> Xenial,
> > > and Jessie. Same issue with all.
> > >
> > > Somehow a Mesos fetcher entry with a URI value of '' gets injected into
> > the
> > > task protobuf.
> > >
> > > This is the command I ran for Trusty:
> > > ./test/test-artifact.sh test/deb/ubuntu-trusty/
> > > /repo/artifacts/aurora-ubuntu-trusty/dist
> > >
> > > Oddly enough, we have deployed 0.19.0 packages for trusty without any
> > issue
> > > on at least two of our test clusters so it may have to do with our
> > > artifacts tests?
> > >
> > > I tried upgrading the trusty box to Mesos 1.2.2 and the problem
> > persisted.
> > >
> > > -Renan
> > >
> > > On Wed, Dec 13, 2017 at 9:28 AM, Mohit Jaggi 
> > wrote:
> > >
> > > > +1
> > > >
> > > > On Wed, Dec 13, 2017 at 9:03 AM, Bill Farner 
> > wrote:
> > > >
> > > > > We would need at least 2 more binding votes to complete this
> release.
> > > Do
> > > > > folks need more time?
> > > > >
> > > > > On Tue, Dec 12, 2017 at 2:49 PM, thinker0 
> > wrote:
> > > > >
> > > > > > +1, we 0.19.0 small production tested
> > > > > > On Wed, Dec 13, 2017 at 04:05, Mohit Jaggi wrote:
> > > > > >
> > > > > > > +0, we don't use the packages. If you just need someone to test
> > and
> > > > > > verify,
> > > > > > > I can do that. Let me know.
> > > > > > >
> > > > > > > On Tue, Dec 12, 2017 at 9:53 AM, Bill Farner <
> wfar...@apache.org
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Friendly reminder that the vote is due to close tomorrow!
> > > > > > > >
> > > > > > > > Stephan - is the issue you described reproducible?  Did i run
> > the
> > > > > same
> > > > > > > test
> > > > > > > > command(s) as you?
> > > > > > > >
> > > > > > > > On Sun, Dec 10, 2017 at 8:32 PM, Bill Farner <
> > wfar...@apache.org
> > > >
> > > > > > wrote:

Re: replicated log improvement

2018-01-04 Thread Bill Farner
I think it aligns well with work that plausibly follows r/64288/, so i'd say
it's a likely outcome regardless of the direction taken for storage.

On Wed, Jan 3, 2018 at 11:44 PM, meghdoot bhattacharya <
meghdoo...@yahoo.com.invalid> wrote:

> MESOS-7973 was delivered in Mesos 1.5. Do we plan to leverage it?
>
> Thx
>


Re: executor id from task id?

2018-01-03 Thread Bill Farner
Yeah it looks like you will need to regenerate it.

On Jan 2, 2018, 9:19 PM -0800, Mohit Jaggi , wrote:
> Yes. I had linked to that in my email. I was hoping to retrieve it from
> "storage", but if it is not stored then I will try to use that logic to
> create one. I don't want to add it to storage if it is not there to avoid
> handling db migration.
>
> On Tue, Jan 2, 2018 at 8:05 PM, Bill Farner  wrote:
>
> > Did your code search lead you here
> > <https://github.com/apache/aurora/blob/2e1ca42887bc8ea1e8c6cddebe9d1c
> > f29268c714/src/main/java/org/apache/aurora/scheduler/mesos/
> > MesosTaskFactory.java#L112>?
> > It should show how the executor and task IDs relate.
> >
> > On Tue, Jan 2, 2018 at 1:08 PM, Mohit Jaggi  wrote:
> >
> > > Happy new year folks! Help with the following will be much appreciated.
> > >
> > > -- Forwarded message --
> > > From: Mohit Jaggi
> > > Date: Sat, Dec 23, 2017 at 9:08 PM
> > > Subject: executor id from task id?
> > > To: u...@aurora.apache.org
> > >
> > >
> > > Folks,
> > > I am trying to work on this:
> > > https://issues.apache.org/jira/browse/AURORA-1960
> > >
> > > In VersionedSchedulerDriver
> > > <https://github.com/apache/aurora/blob/47c689956f77ed635d26f7ec659689
> > > 002bd047af/src/main/java/org/apache/aurora/scheduler/mesos/
> > > VersionedSchedulerDriverService.java#L180-L185
> > > killTask() gets a taskId but the SHUTDOWN
> > > <https://mesos.apache.org/documentation/latest/
> > > scheduler-http-api/#shutdown
> > > call requires an executor id. How do I get that? I see it is created here
> > > <https://github.com/apache/aurora/blob/47c689956f77ed635d26f7ec659689
> > > 002bd047af/src/main/java/org/apache/aurora/scheduler/mesos/
> > > MesosTaskFactory.java#L111
> > > but
> > > I am not sure how to find it in the killTask() call.
> > >
> > > Mohit.
> > >
> >


Re: executor id from task id?

2018-01-02 Thread Bill Farner
Did your code search lead you here
<https://github.com/apache/aurora/blob/2e1ca42887bc8ea1e8c6cddebe9d1cf29268c714/src/main/java/org/apache/aurora/scheduler/mesos/MesosTaskFactory.java#L112>?
It should show how the executor and task IDs relate.

On Tue, Jan 2, 2018 at 1:08 PM, Mohit Jaggi  wrote:

> Happy new year folks! Help with the following will be much appreciated.
>
> -- Forwarded message --
> From: Mohit Jaggi 
> Date: Sat, Dec 23, 2017 at 9:08 PM
> Subject: executor id from task id?
> To: u...@aurora.apache.org
>
>
> Folks,
> I am trying to work on this:
> https://issues.apache.org/jira/browse/AURORA-1960
>
> In VersionedSchedulerDriver
> <https://github.com/apache/aurora/blob/47c689956f77ed635d26f7ec659689002bd047af/src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java#L180-L185>
> killTask() gets a taskId but the SHUTDOWN
> <https://mesos.apache.org/documentation/latest/scheduler-http-api/#shutdown>
> call requires an executor id. How do I get that? I see it is created here
> <https://github.com/apache/aurora/blob/47c689956f77ed635d26f7ec659689002bd047af/src/main/java/org/apache/aurora/scheduler/mesos/MesosTaskFactory.java#L111>
> but I am not sure how to find it in the killTask() call.
>
> Mohit.
>


Re: [REPORT] Apache Aurora - December 2017

2017-12-18 Thread Bill Farner
+1

Thanks, Jake!

On Mon, Dec 18, 2017 at 6:34 AM, Jake Farrell  wrote:

> Please find below the draft report for December, if anyone has any
> modifications or additions please let me know
>
> -Jake
>
>
>
> Apache Aurora is a stateless and fault-tolerant service scheduler used to
> schedule jobs, such as long-running services, cron jobs, and one-off tasks,
> onto Apache Mesos.
>
> Project Status
> -
> The Apache Aurora community has seen huge growth in new contributor
> and user activity over the last quarter. We have also successfully
> released two new versions of Apache Aurora during this time: the
> 0.18.1 security release to address CVE-2016-4437 [1] and the regular
> planned release of 0.19.0.
>
> Community
> ---
> Latest Additions:
>
> * Committer addition: Santhosh Kumar Shanmugham, 2.9.2017
> * PMC addition:  Mehrdad Nurolahzade, 2.24.2017
>
> Issue backlog status since last report:
>
> * Created:   17
> * Resolved: 22
>
> Mailing list activity since last report:
>
> * @dev 140 messages
> * @user 112 messages (3 in previous reporting cycle!!)
> * @reviews   1207 messages
>
> Releases
> ---
> Last release:
> * Apache Aurora 0.18.1 released 10.31.2017. Security release
> * Apache Aurora 0.19.0 released 11.9.2017
>
> [1]: https://www.cvedetails.com/cve/CVE-2016-4437/
>
>
>


[RESULT] [VOTE] Release Apache Aurora 0.19.x packages

2017-12-13 Thread Bill Farner
I reverse my vote to -1 and am closing the vote as failed.

Turns out i had some old debs in my dist/ dir, and the test script picked
those up.  After clearing those, i encounter the same issue.

Here is the culprit:

$ ag -i THERMOS_EXECUTOR_RESOURCES
specs/debian/aurora-scheduler.startup.sh
37:  -thermos_executor_resources="$THERMOS_EXECUTOR_RESOURCES" \

specs/debian/aurora-scheduler.upstart
34:  -thermos_executor_resources="$THERMOS_EXECUTOR_RESOURCES" \

specs/debian/aurora-scheduler.init
75:-thermos_executor_resources="$THERMOS_EXECUTOR_RESOURCES" \

specs/debian/aurora-scheduler.default
65:THERMOS_EXECUTOR_RESOURCES=""

thermos_executor_resources is passed an empty string.  Here is the option
definition:

@Parameter(names = "-thermos_executor_resources",
> description = "A comma separated list of additional resources to copy
> into the sandbox."
> + "Note: if thermos_executor_path is not the thermos_executor.pex file
> itself, "
> + "this must include it.")
> public List<String> thermosExecutorResources = ImmutableList.of();


We expect this to become an empty list; however, the parser emits a list of
size one, containing an empty string.  I've filed AURORA-1962
<https://issues.apache.org/jira/browse/AURORA-1962> for the issue.
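
As an aside for anyone reading the ticket later: the behavior is the same trap
you hit when naively splitting an empty string on commas, which yields one
empty element rather than an empty list.  A quick illustration (JCommander's
actual splitter is Java; this Python snippet is only the analogous pitfall,
not its code):

def parse_resources(flag_value):
  # Naive comma splitting, similar in spirit to a default list converter.
  return flag_value.split(',')

assert parse_resources('a.pex,b.cfg') == ['a.pex', 'b.cfg']
assert parse_resources('') == ['']  # one empty entry, not an empty list

def parse_resources_fixed(flag_value):
  # Treating an empty flag value as "no resources" avoids the phantom entry.
  return [r for r in flag_value.split(',') if r]

assert parse_resources_fixed('') == []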


On Wed, Dec 13, 2017 at 3:31 PM, Renan DelValle 
wrote:

> I'm running into the same issues as Stephan. I tried with Trusty, Xenial,
> and Jessie. Same issue with all.
>
> Somehow a Mesos fetcher entry with a URI value of '' gets injected into the
> task protobuf.
>
> This is the command I ran for Trusty:
> ./test/test-artifact.sh test/deb/ubuntu-trusty/
> /repo/artifacts/aurora-ubuntu-trusty/dist
>
> Oddly enough, we have deployed 0.19.0 packages for trusty without any issue
> on at least two of our test clusters so it may have to do with our
> artifacts tests?
>
> I tried upgrading the trusty box to Mesos 1.2.2 and the problem persisted.
>
> -Renan
>
> On Wed, Dec 13, 2017 at 9:28 AM, Mohit Jaggi  wrote:
>
> > +1
> >
> > On Wed, Dec 13, 2017 at 9:03 AM, Bill Farner  wrote:
> >
> > > We would need at least 2 more binding votes to complete this release.
> Do
> > > folks need more time?
> > >
> > > On Tue, Dec 12, 2017 at 2:49 PM, thinker0  wrote:
> > >
> > > > +1, we 0.19.0 small production tested
> > > > On Wed, Dec 13, 2017 at 04:05, Mohit Jaggi wrote:
> > > >
> > > > > +0, we don't use the packages. If you just need someone to test and
> > > > verify,
> > > > > I can do that. Let me know.
> > > > >
> > > > > On Tue, Dec 12, 2017 at 9:53 AM, Bill Farner 
> > > wrote:
> > > > >
> > > > > > Friendly reminder that the vote is due to close tomorrow!
> > > > > >
> > > > > > Stephan - is the issue you described reproducible?  Did i run the
> > > same
> > > > > test
> > > > > > command(s) as you?
> > > > > >
> > > > > > On Sun, Dec 10, 2017 at 8:32 PM, Bill Farner  >
> > > > wrote:
> > > > > >
> > > > > > > +1 from me, as the test script passes for all artifacts
> > > > > > >
> > > > > > > I did not have time to run them prior to opening the vote; but
> i
> > do
> > > > not
> > > > > > > encounter the failure you did:
> > > > > > >
> > > > > > > $ ./test/test-artifact.sh test/deb/debian-jessie/
> > > > > > > /repo/artifacts/aurora-debian-jessie/dist
> > > > > > > 
> > > > > > > OK (all tests passed)
> > > > > > > Connection to 127.0.0.1 closed.
> > > > > > > ==> aurora_jessie: Forcing shutdown of VM...
> > > > > > >
> > > > > > > the branch is missing
> > > > > > >
> > > > > > >
> > > > > > > That i cannot speak for, unfortunately.
> > > > > > >
> > > > > > > For those who have yet to try out the builds, here are the test
> > > > > commands
> > > > > > i
> > > > > > > ran.  You can reproduce these by first downloading the
> artifacts
> > > (in
> > > > my
> > > > > > > case, they were under artifacts/*)
> > > > > > >
> > > > > > > ./test/test-artifac

Re: [VOTE] Release Apache Aurora 0.19.x packages

2017-12-13 Thread Bill Farner
We would need at least 2 more binding votes to complete this release.  Do
folks need more time?

On Tue, Dec 12, 2017 at 2:49 PM, thinker0  wrote:

> +1, we 0.19.0 small production tested
> On Wed, Dec 13, 2017 at 04:05, Mohit Jaggi wrote:
>
> > +0, we don't use the packages. If you just need someone to test and
> verify,
> > I can do that. Let me know.
> >
> > On Tue, Dec 12, 2017 at 9:53 AM, Bill Farner  wrote:
> >
> > > Friendly reminder that the vote is due to close tomorrow!
> > >
> > > Stephan - is the issue you described reproducible?  Did i run the same
> > test
> > > command(s) as you?
> > >
> > > On Sun, Dec 10, 2017 at 8:32 PM, Bill Farner 
> wrote:
> > >
> > > > +1 from me, as the test script passes for all artifacts
> > > >
> > > > I did not have time to run them prior to opening the vote; but i do
> not
> > > > encounter the failure you did:
> > > >
> > > > $ ./test/test-artifact.sh test/deb/debian-jessie/
> > > > /repo/artifacts/aurora-debian-jessie/dist
> > > > 
> > > > OK (all tests passed)
> > > > Connection to 127.0.0.1 closed.
> > > > ==> aurora_jessie: Forcing shutdown of VM...
> > > >
> > > > the branch is missing
> > > >
> > > >
> > > > That i cannot speak for, unfortunately.
> > > >
> > > > For those who have yet to try out the builds, here are the test
> > commands
> > > i
> > > > ran.  You can reproduce these by first downloading the artifacts (in
> my
> > > > case, they were under artifacts/*)
> > > >
> > > > ./test/test-artifact.sh test/deb/debian-jessie/
> > > > /repo/artifacts/aurora-debian-jessie/dist/
> > > > ./test/test-artifact.sh test/deb/ubuntu-trusty/
> > > > /repo/artifacts/aurora-ubuntu-trusty/dist/
> > > > ./test/test-artifact.sh test/deb/ubuntu-xenial/
> > > > /repo/artifacts/aurora-ubuntu-xenial/dist/
> > > > ./test/test-artifact.sh test/rpm/centos-7/
> > /repo/artifacts/aurora-centos-
> > > > 7/dist/rpmbuild/RPMS/x86_64
> > > >
> > > > Each command took about 5 mins to run in my case.
> > > >
> > > > A belated thanks to Stephan for adding the test script!
> > > >
> > > > commit b904b5f
> > > >> Author: Stephan Erb 
> > > >> Date:   Tue Feb 14 23:33:41 2017 +0100
> > > >> Add basic test scripts for RPM and DEB packages
> > > >
> > > >
> > > >
> > > > On Sun, Dec 10, 2017 at 1:06 PM, Stephan Erb 
> wrote:
> > > >
> > > >> I was just trying to run the validation scripts for Debian Jessie
> and
> > > >> those are failing with the error:
> > > >>
> > > >>
> > > >> I1210 20:48:36.172399  7371 fetcher.cpp:283] Fetching directly into
> > the
> > > >> sandbox directory
> > > >> I1210 20:48:36.172417  7371 fetcher.cpp:220] Fetching URI ''
> > > >> Failed to fetch '': A relative path was passed for the resource but
> > the
> > > >> Mesos framework home was not specified. Please either provide this
> > > >> config option or avoid using a relative path
> > > >>
> > > >> End fetcher log for container 48b4029a-231d-441a-98a6-8c6538fe0efa
> > > >> E1210 20:48:36.220979  5590 fetcher.cpp:558] Failed to run mesos-
> > > >> fetcher: Failed to fetch all URIs for container '48b4029a-231d-441a-
> > > >> 98a6-8c6538fe0efa' with exit status: 256
> > > >> E1210 20:48:36.221174  5590 slave.cpp:4650] Container
> '48b4029a-231d-
> > > >> 441a-98a6-8c6538fe0efa' for executor
> > 'thermos-vagrant-test-hello_world-
> > > >> 0-74b87db4-0d97-4d03-a7f9-9482a1060f20' of framework 208bd6e7-5d17-
> > > >> 4257-9564-71af57900310- fail
> > > >> ed to start: Failed to fetch all URIs for container '48b4029a-231d-
> > > >> 441a-98a6-8c6538fe0efa' with exit status: 256
> > > >>
> > > >>
> > > >> Did those tests work for you?
> > > >>
> > > >>
> > > >> In addition, but most probably unrelated, the branch is missing on
> > http
> > > >> s://github.com/apache/aurora-packaging. The ASF bot might have
> missed
> > > >> 

Thrift 0.10.0

2017-12-12 Thread Bill Farner
After much battling with tools, tests, and Jenkins, John upgraded us to
thrift 0.10.0 by landing r/64290!

The next time you pull, you will likely notice that you have some stale
untracked files.  You can safely clean these up:

  $ rm -r build-support/thrift/thrift-0.*

You may also notice that you no longer have to bootstrap the thrift
compiler by compiling it, meaning that building from scratch should be
noticeably faster!

I scanned the changelog for thrift 0.9.2, 0.9.3, 0.10.0, and there is only
one feature i found potentially useful:

THRIFT-640: Support deprecation

There are a handful of modest performance improvements that we will benefit
from:

THRIFT-2877: Optimize generated hashCode
THRIFT-3306: Java: TBinaryProtocol: Use 1 temp buffer instead of allocating 8
THRIFT-2172: Java compiler allocates optionals array for every struct with an optional field
THRIFT-3431: Avoid "schemes" HashMap lookups during struct reads/writes


Many thanks to John and Stephan for their hard work getting us upgraded!


Re: [VOTE] Release Apache Aurora 0.19.x packages

2017-12-12 Thread Bill Farner
Friendly reminder that the vote is due to close tomorrow!

Stephan - is the issue you described reproducible?  Did i run the same test
command(s) as you?

On Sun, Dec 10, 2017 at 8:32 PM, Bill Farner  wrote:

> +1 from me, as the test script passes for all artifacts
>
> I did not have time to run them prior to opening the vote; but i do not
> encounter the failure you did:
>
> $ ./test/test-artifact.sh test/deb/debian-jessie/
> /repo/artifacts/aurora-debian-jessie/dist
> 
> OK (all tests passed)
> Connection to 127.0.0.1 closed.
> ==> aurora_jessie: Forcing shutdown of VM...
>
> the branch is missing
>
>
> That i cannot speak for, unfortunately.
>
> For those who have yet to try out the builds, here are the test commands i
> ran.  You can reproduce these by first downloading the artifacts (in my
> case, they were under artifacts/*)
>
> ./test/test-artifact.sh test/deb/debian-jessie/
> /repo/artifacts/aurora-debian-jessie/dist/
> ./test/test-artifact.sh test/deb/ubuntu-trusty/
> /repo/artifacts/aurora-ubuntu-trusty/dist/
> ./test/test-artifact.sh test/deb/ubuntu-xenial/
> /repo/artifacts/aurora-ubuntu-xenial/dist/
> ./test/test-artifact.sh test/rpm/centos-7/ /repo/artifacts/aurora-centos-
> 7/dist/rpmbuild/RPMS/x86_64
>
> Each command took about 5 mins to run in my case.
>
> A belated thanks to Stephan for adding the test script!
>
> commit b904b5f
>> Author: Stephan Erb 
>> Date:   Tue Feb 14 23:33:41 2017 +0100
>> Add basic test scripts for RPM and DEB packages
>
>
>
> On Sun, Dec 10, 2017 at 1:06 PM, Stephan Erb  wrote:
>
>> I was just trying to run the validation scripts for Debian Jessie and
>> those are failing with the error:
>>
>>
>> I1210 20:48:36.172399  7371 fetcher.cpp:283] Fetching directly into the
>> sandbox directory
>> I1210 20:48:36.172417  7371 fetcher.cpp:220] Fetching URI ''
>> Failed to fetch '': A relative path was passed for the resource but the
>> Mesos framework home was not specified. Please either provide this
>> config option or avoid using a relative path
>>
>> End fetcher log for container 48b4029a-231d-441a-98a6-8c6538fe0efa
>> E1210 20:48:36.220979  5590 fetcher.cpp:558] Failed to run mesos-
>> fetcher: Failed to fetch all URIs for container '48b4029a-231d-441a-
>> 98a6-8c6538fe0efa' with exit status: 256
>> E1210 20:48:36.221174  5590 slave.cpp:4650] Container '48b4029a-231d-
>> 441a-98a6-8c6538fe0efa' for executor 'thermos-vagrant-test-hello_world-
>> 0-74b87db4-0d97-4d03-a7f9-9482a1060f20' of framework 208bd6e7-5d17-
>> 4257-9564-71af57900310- fail
>> ed to start: Failed to fetch all URIs for container '48b4029a-231d-
>> 441a-98a6-8c6538fe0efa' with exit status: 256
>>
>>
>> Did those tests work for you?
>>
>>
>> In addition, but most probably unrelated, the branch is missing on
>> https://github.com/apache/aurora-packaging. The ASF bot might have missed
>> it.
>>
>>
>> On Fri, 2017-12-08 at 10:50 -0800, Bill Farner wrote:
>> > All,
>> >
>> > I propose that we accept the following artifacts as the official deb
>> > and
>> > rpm packaging for
>> > Apache Aurora 0.19.x:
>> >
>> > https://dl.bintray.com/bill/aurora/
>> >
>> > The Aurora deb and rpm packaging includes the following:
>> >
>> > ---
>> >
>> > The branch used to create the packaging is:
>> > https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tr
>> > ee;hb=refs/heads/0.19.x
>> >
>> > The packages are available at:
>> > https://dl.bintray.com/wfarner/aurora/
>> >
>> > The GPG keys used to sign the packages are available at:
>> > https://dist.apache.org/repos/dist/release/aurora/KEYS
>> >
>> > Please download, verify, and test. Detailed test instructions are
>> > available
>> > here:
>> > https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tr
>> > ee;f=test;hb=refs/heads/0.19.x
>> >
>> >
>> > The vote will close on Wed Dec 13 10:34:51 PST 2017
>> >
>> > [ ] +1 Release these as the deb and rpm packages for Apache Aurora
>> > 0.19.x
>> > [ ] +0
>> > [ ] -1 Do not release these artifacts because...
>>
>
>


Re: [VOTE] Release Apache Aurora 0.19.x packages

2017-12-10 Thread Bill Farner
+1 from me, as the test script passes for all artifacts

I did not have time to run them prior to opening the vote; but i do not
encounter the failure you did:

$ ./test/test-artifact.sh test/deb/debian-jessie/
/repo/artifacts/aurora-debian-jessie/dist

OK (all tests passed)
Connection to 127.0.0.1 closed.
==> aurora_jessie: Forcing shutdown of VM...

the branch is missing


That i cannot speak for, unfortunately.

For those who have yet to try out the builds, here are the test commands i
ran.  You can reproduce these by first downloading the artifacts (in my
case, they were under artifacts/*)

./test/test-artifact.sh test/deb/debian-jessie/
/repo/artifacts/aurora-debian-jessie/dist/
./test/test-artifact.sh test/deb/ubuntu-trusty/
/repo/artifacts/aurora-ubuntu-trusty/dist/
./test/test-artifact.sh test/deb/ubuntu-xenial/
/repo/artifacts/aurora-ubuntu-xenial/dist/
./test/test-artifact.sh test/rpm/centos-7/
/repo/artifacts/aurora-centos-7/dist/rpmbuild/RPMS/x86_64

Each command took about 5 mins to run in my case.

A belated thanks to Stephan for adding the test script!

commit b904b5f
> Author: Stephan Erb 
> Date:   Tue Feb 14 23:33:41 2017 +0100
> Add basic test scripts for RPM and DEB packages



On Sun, Dec 10, 2017 at 1:06 PM, Stephan Erb  wrote:

> I was just trying to run the validation scripts for Debian Jessie and
> those are failing with the error:
>
>
> I1210 20:48:36.172399  7371 fetcher.cpp:283] Fetching directly into the
> sandbox directory
> I1210 20:48:36.172417  7371 fetcher.cpp:220] Fetching URI ''
> Failed to fetch '': A relative path was passed for the resource but the
> Mesos framework home was not specified. Please either provide this
> config option or avoid using a relative path
>
> End fetcher log for container 48b4029a-231d-441a-98a6-8c6538fe0efa
> E1210 20:48:36.220979  5590 fetcher.cpp:558] Failed to run mesos-
> fetcher: Failed to fetch all URIs for container '48b4029a-231d-441a-
> 98a6-8c6538fe0efa' with exit status: 256
> E1210 20:48:36.221174  5590 slave.cpp:4650] Container '48b4029a-231d-
> 441a-98a6-8c6538fe0efa' for executor 'thermos-vagrant-test-hello_world-
> 0-74b87db4-0d97-4d03-a7f9-9482a1060f20' of framework 208bd6e7-5d17-
> 4257-9564-71af57900310- fail
> ed to start: Failed to fetch all URIs for container '48b4029a-231d-
> 441a-98a6-8c6538fe0efa' with exit status: 256
>
>
> Did those tests work for you?
>
>
> In addition, but most probably unrelated, the branch is missing on
> https://github.com/apache/aurora-packaging. The ASF bot might have missed
> it.
>
>
> On Fri, 2017-12-08 at 10:50 -0800, Bill Farner wrote:
> > All,
> >
> > I propose that we accept the following artifacts as the official deb
> > and
> > rpm packaging for
> > Apache Aurora 0.19.x:
> >
> > https://dl.bintray.com/bill/aurora/
> >
> > The Aurora deb and rpm packaging includes the following:
> >
> > ---
> >
> > The branch used to create the packaging is:
> > https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tr
> > ee;hb=refs/heads/0.19.x
> >
> > The packages are available at:
> > https://dl.bintray.com/wfarner/aurora/
> >
> > The GPG keys used to sign the packages are available at:
> > https://dist.apache.org/repos/dist/release/aurora/KEYS
> >
> > Please download, verify, and test. Detailed test instructions are
> > available
> > here:
> > https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tr
> > ee;f=test;hb=refs/heads/0.19.x
> >
> >
> > The vote will close on Wed Dec 13 10:34:51 PST 2017
> >
> > [ ] +1 Release these as the deb and rpm packages for Apache Aurora
> > 0.19.x
> > [ ] +0
> > [ ] -1 Do not release these artifacts because...
>


[VOTE] Release Apache Aurora 0.19.x packages

2017-12-08 Thread Bill Farner
All,

I propose that we accept the following artifacts as the official deb and
rpm packaging for
Apache Aurora 0.19.x:

https://dl.bintray.com/bill/aurora/

The Aurora deb and rpm packaging includes the following:

---

The branch used to create the packaging is:
https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;hb=refs/heads/0.19.x

The packages are available at:
https://dl.bintray.com/wfarner/aurora/

The GPG keys used to sign the packages are available at:
https://dist.apache.org/repos/dist/release/aurora/KEYS

Please download, verify, and test. Detailed test instructions are available
here:
https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;f=test;hb=refs/heads/0.19.x


The vote will close on Wed Dec 13 10:34:51 PST 2017

[ ] +1 Release these as the deb and rpm packages for Apache Aurora 0.19.x
[ ] +0
[ ] -1 Do not release these artifacts because...


[RESULT][VOTE] Release Apache Aurora 0.19.0 RC0

2017-11-10 Thread Bill Farner
All,
The vote to accept Apache Aurora 0.19.0 RC0 as the official Apache Aurora
0.19.0 release has passed.


+1 (Binding)
--
Bill Farner
David McLaughlin
Stephan Erb


+1 (Non-binding)
--
Mohit Jaggi


There were no 0 or -1 votes. Thank you to all who helped make this release.

On Fri, Nov 10, 2017 at 9:53 AM, David McLaughlin 
wrote:

> If it helps: we're running master in production at Twitter.
>
> On Fri, Nov 10, 2017 at 8:39 AM, Erb, Stephan  >
> wrote:
>
> > +1 from me.
> >
> > Verification script has passed. I also intended to deploy this to a test
> > cluster, but won’t be able to do so before the vote closes.
> >
> > On 09.11.17, 17:16, "Bill Farner"  wrote:
> >
> > Friendly reminder to vote, folks!  We are currently one binding vote
> > shy of
> > a release, and the vote closes tomorrow!
> >
> > If anyone else is getting stuck on the macOS build, a workaround is
> to
> > verify from vagrant:
> >
> > $ vagrant up
> > $ vagrant ssh
> > $ cd /vagrant
> > $ ./build-support/release/verify-release-candidate 0.19.0-rc0
> >
> >
> >
> > On Wed, Nov 8, 2017 at 10:48 AM, David McLaughlin <
> > dmclaugh...@apache.org>
> > wrote:
> >
> > > +1 from me. The Mac OS breakage is disappointing, but I'm fine with
> > it not
> > > being a blocker.
> > >
> > > On Tue, Nov 7, 2017 at 11:04 PM, Mohit Jaggi  >
> > wrote:
> > >
> > > > +1
> > > >
> > > > On Tue, Nov 7, 2017 at 10:51 PM, Bill Farner  >
> > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Successfully validated with ./build-support/release/
> > > > > verify-release-candidate
> > > > > 0.19.0-rc0
> > > > >
> > > > > Note that the above command fails on macOS due to AURORA-1956
> > > > > <https://issues.apache.org/jira/browse/AURORA-1956>.  However,
> > i am
> > > > still
> > > > > a
> > > > > +1 since i see mac builds as developer convenience rather than
> a
> > > > supported
> > > > > environment.  Others are welcome to feel differently.
> > > > >
> > > > > On Tue, Nov 7, 2017 at 8:49 PM, Bill Farner <
> wfar...@apache.org>
> > > wrote:
> > > > >
> > > > > > All,
> > > > > >
> > > > > > I propose that we accept the following release candidate as
> the
> > > > official
> > > > > > Apache Aurora 0.19.0 release.
> > > > > >
> > > > > > Aurora 0.19.0-rc0 includes the following:
> > > > > > ---
> > > > > > The RELEASE NOTES for the release are available at:
> > > > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> > > > > > RELEASE-NOTES.md&hb=rel/0.19.0-rc0
> > > > > >
> > > > > > The CHANGELOG for the release is available at:
> > > > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> > > > > > CHANGELOG&hb=rel/0.19.0-rc0
> > > > > >
> > > > > > The tag used to create the release candidate is:
> > > > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=
> > > > > > shortlog;h=refs/tags/rel/0.19.0-rc0
> > > > > >
> > > > > > The release candidate is available at:
> > > > > > https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> > > > > > apache-aurora-0.19.0-rc0.tar.gz
> > > > > >
> > > > > > The MD5 checksum of the release candidate can be found at:
> > > > > > https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> > > > > > apache-aurora-0.19.0-rc0.tar.gz.md5
> > > > > >
> > > > > > The signature of the release candidate can be found at:
> > > > > > https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> > > > > > apache-aurora-0.19.0-rc0.tar.gz.asc
> > > > > >
> > > > > > The GPG key used to sign the release are available at:
> > > > > > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> > > > > >
> > > > > > Please download, verify, and test.
> > > > > >
> > > > > > The vote will close on Fri Nov 10 20:48:05 PST 2017
> > > > > >
> > > > > > [ ] +1 Release this as Apache Aurora 0.19.0
> > > > > > [ ] +0
> > > > > > [ ] -1 Do not release this as Apache Aurora 0.19.0 because...
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
>


Re: [VOTE] Release Apache Aurora 0.19.0 RC0

2017-11-09 Thread Bill Farner
Friendly reminder to vote, folks!  We are currently one binding vote shy of
a release, and the vote closes tomorrow!

If anyone else is getting stuck on the macOS build, a workaround is to
verify from vagrant:

$ vagrant up
$ vagrant ssh
$ cd /vagrant
$ ./build-support/release/verify-release-candidate 0.19.0-rc0



On Wed, Nov 8, 2017 at 10:48 AM, David McLaughlin 
wrote:

> +1 from me. The Mac OS breakage is disappointing, but I'm fine with it not
> being a blocker.
>
> On Tue, Nov 7, 2017 at 11:04 PM, Mohit Jaggi  wrote:
>
> > +1
> >
> > On Tue, Nov 7, 2017 at 10:51 PM, Bill Farner  wrote:
> >
> > > +1
> > >
> > > Successfully validated with ./build-support/release/
> > > verify-release-candidate
> > > 0.19.0-rc0
> > >
> > > Note that the above command fails on macOS due to AURORA-1956
> > > <https://issues.apache.org/jira/browse/AURORA-1956>.  However, i am
> > still
> > > a
> > > +1 since i see mac builds as developer convenience rather than a
> > supported
> > > environment.  Others are welcome to feel differently.
> > >
> > > On Tue, Nov 7, 2017 at 8:49 PM, Bill Farner 
> wrote:
> > >
> > > > All,
> > > >
> > > > I propose that we accept the following release candidate as the
> > official
> > > > Apache Aurora 0.19.0 release.
> > > >
> > > > Aurora 0.19.0-rc0 includes the following:
> > > > ---
> > > > The RELEASE NOTES for the release are available at:
> > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> > > > RELEASE-NOTES.md&hb=rel/0.19.0-rc0
> > > >
> > > > The CHANGELOG for the release is available at:
> > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> > > > CHANGELOG&hb=rel/0.19.0-rc0
> > > >
> > > > The tag used to create the release candidate is:
> > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=
> > > > shortlog;h=refs/tags/rel/0.19.0-rc0
> > > >
> > > > The release candidate is available at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> > > > apache-aurora-0.19.0-rc0.tar.gz
> > > >
> > > > The MD5 checksum of the release candidate can be found at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> > > > apache-aurora-0.19.0-rc0.tar.gz.md5
> > > >
> > > > The signature of the release candidate can be found at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> > > > apache-aurora-0.19.0-rc0.tar.gz.asc
> > > >
> > > > The GPG key used to sign the release are available at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> > > >
> > > > Please download, verify, and test.
> > > >
> > > > The vote will close on Fri Nov 10 20:48:05 PST 2017
> > > >
> > > > [ ] +1 Release this as Apache Aurora 0.19.0
> > > > [ ] +0
> > > > [ ] -1 Do not release this as Apache Aurora 0.19.0 because...
> > > >
> > >
> >
>


Re: [VOTE] Release Apache Aurora 0.19.0 RC0

2017-11-07 Thread Bill Farner
+1

Successfully validated with ./build-support/release/verify-release-candidate
0.19.0-rc0

Note that the above command fails on macOS due to AURORA-1956
<https://issues.apache.org/jira/browse/AURORA-1956>.  However, i am still a
+1 since i see mac builds as developer convenience rather than a supported
environment.  Others are welcome to feel differently.

On Tue, Nov 7, 2017 at 8:49 PM, Bill Farner  wrote:

> All,
>
> I propose that we accept the following release candidate as the official
> Apache Aurora 0.19.0 release.
>
> Aurora 0.19.0-rc0 includes the following:
> ---
> The RELEASE NOTES for the release are available at:
> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> RELEASE-NOTES.md&hb=rel/0.19.0-rc0
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> CHANGELOG&hb=rel/0.19.0-rc0
>
> The tag used to create the release candidate is:
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=
> shortlog;h=refs/tags/rel/0.19.0-rc0
>
> The release candidate is available at:
> https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> apache-aurora-0.19.0-rc0.tar.gz
>
> The MD5 checksum of the release candidate can be found at:
> https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> apache-aurora-0.19.0-rc0.tar.gz.md5
>
> The signature of the release candidate can be found at:
> https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> apache-aurora-0.19.0-rc0.tar.gz.asc
>
> The GPG key used to sign the release are available at:
> https://dist.apache.org/repos/dist/dev/aurora/KEYS
>
> Please download, verify, and test.
>
> The vote will close on Fri Nov 10 20:48:05 PST 2017
>
> [ ] +1 Release this as Apache Aurora 0.19.0
> [ ] +0
> [ ] -1 Do not release this as Apache Aurora 0.19.0 because...
>


[VOTE] Release Apache Aurora 0.19.0 RC0

2017-11-07 Thread Bill Farner
All,

I propose that we accept the following release candidate as the official
Apache Aurora 0.19.0 release.

Aurora 0.19.0-rc0 includes the following:
---
The RELEASE NOTES for the release are available at:
https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=RELEASE-NOTES.md&hb=rel/0.19.0-rc0

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=CHANGELOG&hb=rel/0.19.0-rc0

The tag used to create the release candidate is:
https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/tags/rel/0.19.0-rc0

The release candidate is available at:
https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/apache-aurora-0.19.0-rc0.tar.gz

The MD5 checksum of the release candidate can be found at:
https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/apache-aurora-0.19.0-rc0.tar.gz.md5

The signature of the release candidate can be found at:
https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/apache-aurora-0.19.0-rc0.tar.gz.asc

The GPG key used to sign the release are available at:
https://dist.apache.org/repos/dist/dev/aurora/KEYS

Please download, verify, and test.

The vote will close on Fri Nov 10 20:48:05 PST 2017

[ ] +1 Release this as Apache Aurora 0.19.0
[ ] +0
[ ] -1 Do not release this as Apache Aurora 0.19.0 because...


Re: 0.19.0 release preparation

2017-11-07 Thread Bill Farner
Can you provide pointers to tickets?  Alternatively, i can make a ticket
for the release and we can start queueing blockers.  I'd like to make sure
we maintain momentum towards a release.

On Tue, Nov 7, 2017 at 10:11 AM, David McLaughlin 
wrote:

> We have two outstanding regressions I'll get to this week, one is minor and
> one is relatively serious (no longer showing pending reasons on the task
> list). If people are comfortable moving forward without those fixes, then
> go for it.
>
> On Tue, Nov 7, 2017 at 9:55 AM, Bill Farner  wrote:
>
> > David - now that a week has passed, do you see any further reason to
> wait?
> >
> > We are still seeing minor cosmetic issues and bug fixes pop up. It would
> be
> > > pretty harmful to have to deal with these for an entire release
> >
> >
> > It's worth stating the obvious - we can cut point releases when necessary
> > to address this type of issue.
> >
> > On Mon, Oct 30, 2017 at 8:41 AM, David McLaughlin <
> dmclaugh...@apache.org>
> > wrote:
> >
> > > I'd like another week of feedback to incorporate changes of the new UI.
> > We
> > > are still seeing minor cosmetic issues and bug fixes pop up. It would
> be
> > > pretty harmful to have to deal with these for an entire release.
> > >
> > > On Mon, Oct 30, 2017 at 8:34 AM, Erb, Stephan <
> > stephan@blue-yonder.com
> > > >
> > > wrote:
> > >
> > > > Sounds good to me. Getting the release out quickly will allow us to
> > > remove
> > > > the old mybatis/h2 code sooner.
> > > >
> > > > I planned on upgrading to Mesos 1.4. Unfortunately this is currently
> > > > blocked by a missing mesos.interface package on PyPI. I send a mail
> out
> > > to
> > > > the Mesos Dev list but I am still waiting for a response. So this
> will
> > > have
> > > > to wait for 0.20.
> > > >
> > > > On 29.10.17, 23:15, "Bill Farner"  wrote:
> > > >
> > > > Folks,
> > > >
> > > > I propose we cut our 0.19.0 release soon.  We have built up a
> > > > respectable
> > > > set of unreleased changes
> > > > <https://github.com/apache/aurora/blob/master/RELEASE-NOTES.md>.
> > I
> > > am
> > > > happy to perform this release, and should be able to do so as
> soon
> > as
> > > > this
> > > > week.
> > > >
> > > > Please chime in here if there is any outstanding work that should
> > > > block a
> > > > release.
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Bill
> > > >
> > > >
> > > >
> > >
> >
>


Re: 0.19.0 release preparation

2017-11-07 Thread Bill Farner
David - now that a week has passed, do you see any further reason to wait?

We are still seeing minor cosmetic issues and bug fixes pop up. It would be
> pretty harmful to have to deal with these for an entire release


It's worth stating the obvious - we can cut point releases when necessary
to address this type of issue.

On Mon, Oct 30, 2017 at 8:41 AM, David McLaughlin 
wrote:

> I'd like another week of feedback to incorporate changes of the new UI. We
> are still seeing minor cosmetic issues and bug fixes pop up. It would be
> pretty harmful to have to deal with these for an entire release.
>
> On Mon, Oct 30, 2017 at 8:34 AM, Erb, Stephan  >
> wrote:
>
> > Sounds good to me. Getting the release out quickly will allow us to
> remove
> > the old mybatis/h2 code sooner.
> >
> > I planned on upgrading to Mesos 1.4. Unfortunately this is currently
> > blocked by a missing mesos.interface package on PyPI. I sent a mail out
> to
> > the Mesos Dev list but I am still waiting for a response. So this will
> have
> > to wait for 0.20.
> >
> > On 29.10.17, 23:15, "Bill Farner"  wrote:
> >
> > Folks,
> >
> > I propose we cut our 0.19.0 release soon.  We have built up a
> > respectable
> > set of unreleased changes
> > <https://github.com/apache/aurora/blob/master/RELEASE-NOTES.md>.  I
> am
> > happy to perform this release, and should be able to do so as soon as
> > this
> > week.
> >
> > Please chime in here if there is any outstanding work that should
> > block a
> > release.
> >
> >
> > Cheers,
> >
> > Bill
> >
> >
> >
>


[RESULT][VOTE] Release Apache Aurora 0.18.1 RC1

2017-11-01 Thread Bill Farner
All,
The vote to accept Apache Aurora 0.18.1 RC1
as the official Apache Aurora 0.18.1 release has passed.


+1 (Binding)
--
Bill Farner
David McLaughlin
Joshua Cohen
Stephan Erb

+0 (Non-binding)
--
Mohit Jaggi

There were no -1 votes. Thank you to all who helped make this release.


Aurora 0.18.1 includes the following:
---
The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=CHANGELOG&hb=rel/0.18.1

The tag used to create the release with is rel/0.18.1:
https://git-wip-us.apache.org/repos/asf?p=aurora.git&a=shortlog&h=refs/tags/rel/0.18.1

The release is available at:
https://dist.apache.org/repos/dist/release/aurora/0.18.1/apache-aurora-0.18.1.tar.gz

The MD5 checksum of the release can be found at:
https://dist.apache.org/repos/dist/release/aurora/0.18.1/apache-aurora-0.18.1.tar.gz.md5

The signature of the release can be found at:
https://dist.apache.org/repos/dist/release/aurora/0.18.1/apache-aurora-0.18.1.tar.gz.asc

The GPG key used to sign the release are available at:
https://dist.apache.org/repos/dist/release/aurora/KEYS

On Wed, Nov 1, 2017 at 8:44 AM, Bill Farner  wrote:

> I'm in favor of AURORA-1955, but am -1 to blocking 0.18.1 for it.  0.19.0
> is arriving soon, and we can certainly get it in then.
>
> On Mon, Oct 30, 2017 at 3:33 PM, Mohit Jaggi  wrote:
>
>> would love to get https://issues.apache.org/jira/browse/AURORA-1955
>> resolved in 0.18.1 but no strong opinions.
>>
>> +0
>>
>> On Mon, Oct 30, 2017 at 12:06 PM, Stephan Erb  wrote:
>>
>> > +1
>> >
>> > Thanks for handling this, Bill.
>> >
>> > On Mon, 2017-10-30 at 10:05 -0500, Joshua Cohen wrote:
>> > > +1
>> > >
>> > > On Sun, Oct 29, 2017 at 6:30 PM, Bill Farner 
>> > > wrote:
>> > >
>> > > > >
>> > > > > sha512 signature
>> > > >
>> > > >
>> > > > Thanks, this is now fixed.  The release script runs from the
>> > > > released SHA,
>> > > > which pre-dates the inclusion of a sha512 signature.  The
>> > > > verification
>> > > > script on master should otherwise work against 0.18.1-rc1, but no
>> > > > guarantees.
>> > > >
>> > > > On Sun, Oct 29, 2017 at 3:20 PM, Joshua Cohen 
>> > > > wrote:
>> > > >
>> > > > > I'm trying to run the verify-release-candidate script, but
>> > > > > getting a 404
>> > > > > for the sha512 signature?
>> > > > >
>> > > > > + download_rc_file apache-aurora-0.18.1-rc1.tar.gz.sha512
>> > > > > + download_dist_file 0.18.1-rc1/apache-aurora-0.18.1-
>> > > > > rc1.tar.gz.sha512
>> > > > > + curl -f -O
>> > > > > https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
>> > > > > apache-aurora-0.18.1-rc1.tar.gz.sha512
>> > > > >   % Total% Received % Xferd  Average
>> > > > > Speed   TimeTime Time
>> > > > > Current
>> > > > >  Dload  Upload   Total   Spent
>> > > > >  Left
>> > > > > Speed
>> > > > >   0 00 00 0  0  0 --:--:-- --:--:--
>> > > > > --:--:--
>> > > > >  0
>> > > > > curl: (22) The requested URL returned error: 404 Not Found
>> > > > >
>> > > > >
>> > > > > On Sun, Oct 29, 2017 at 5:03 PM, Bill Farner 
>> > > > > wrote:
>> > > > >
>> > > > > > +1
>> > > > > >
>> > > > > > Verified by running ./build-support/release/verify-release-
>> > > > > > candidate
>> > > > > > 0.18.1-rc1
>> > > > > >
>> > > > > > On Sun, Oct 29, 2017 at 12:05 PM, David McLaughlin <
>> > > > >
>> > > > > dmclaugh...@apache.org
>> > > > > > >
>> > > > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > +1
>> > > > > > >
>> > > > > > > On Sun, Oct 29, 2017 at 10:32 AM, Bill Farner > > > > > > > .org>
>> > > > > >
>> > > > > > wrote:
>> > > > > > >
>> > > > > > &

Re: [VOTE] Release Apache Aurora 0.18.1 RC1

2017-11-01 Thread Bill Farner
I'm in favor of AURORA-1955, but am -1 to blocking 0.18.1 for it.  0.19.0
is arriving soon, and we can certainly get it in then.

On Mon, Oct 30, 2017 at 3:33 PM, Mohit Jaggi  wrote:

> would love to get https://issues.apache.org/jira/browse/AURORA-1955
> resolved in 0.18.1 but no strong opinions.
>
> +0
>
> On Mon, Oct 30, 2017 at 12:06 PM, Stephan Erb  wrote:
>
> > +1
> >
> > Thanks for handling this, Bill.
> >
> > On Mon, 2017-10-30 at 10:05 -0500, Joshua Cohen wrote:
> > > +1
> > >
> > > On Sun, Oct 29, 2017 at 6:30 PM, Bill Farner 
> > > wrote:
> > >
> > > > >
> > > > > sha512 signature
> > > >
> > > >
> > > > Thanks, this is now fixed.  The release script runs from the
> > > > released SHA,
> > > > which pre-dates the inclusion of a sha512 signature.  The
> > > > verification
> > > > script on master should otherwise work against 0.18.1-rc1, but no
> > > > guarantees.
> > > >
> > > > On Sun, Oct 29, 2017 at 3:20 PM, Joshua Cohen 
> > > > wrote:
> > > >
> > > > > I'm trying to run the verify-release-candidate script, but
> > > > > getting a 404
> > > > > for the sha512 signature?
> > > > >
> > > > > + download_rc_file apache-aurora-0.18.1-rc1.tar.gz.sha512
> > > > > + download_dist_file 0.18.1-rc1/apache-aurora-0.18.1-
> > > > > rc1.tar.gz.sha512
> > > > > + curl -f -O
> > > > > https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
> > > > > apache-aurora-0.18.1-rc1.tar.gz.sha512
> > > > >   % Total% Received % Xferd  Average
> > > > > Speed   TimeTime Time
> > > > > Current
> > > > >  Dload  Upload   Total   Spent
> > > > >  Left
> > > > > Speed
> > > > >   0 00 00 0  0  0 --:--:-- --:--:--
> > > > > --:--:--
> > > > >  0
> > > > > curl: (22) The requested URL returned error: 404 Not Found
> > > > >
> > > > >
> > > > > On Sun, Oct 29, 2017 at 5:03 PM, Bill Farner 
> > > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > Verified by running ./build-support/release/verify-release-
> > > > > > candidate
> > > > > > 0.18.1-rc1
> > > > > >
> > > > > > On Sun, Oct 29, 2017 at 12:05 PM, David McLaughlin <
> > > > >
> > > > > dmclaugh...@apache.org
> > > > > > >
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > On Sun, Oct 29, 2017 at 10:32 AM, Bill Farner  > > > > > > .org>
> > > > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > All,
> > > > > > > >
> > > > > > > > I propose that we accept the following release candidate as
> > > > > > > > the
> > > > > >
> > > > > > official
> > > > > > > > Apache Aurora 0.18.1 release.
> > > > > > > >
> > > > > > > > Aurora 0.18.1-rc1 includes the following:
> > > > > > > > ---
> > > > > > > > The RELEASE NOTES for the release are available at:
> > > > > > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> > > > > > > > RELEASE-NOTES.md&hb=rel/0.18.1-rc1
> > > > > > > >
> > > > > > > > The CHANGELOG for the release is available at:
> > > > > > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> > > > > > > > CHANGELOG&hb=rel/0.18.1-rc1
> > > > > > > >
> > > > > > > > The tag used to create the release candidate is:
> > > > > > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=
> > > > > > > > shortlog;h=refs/tags/rel/0.18.1-rc1
> > > > > > > >
> > > > > > > > The release candidate is available at:
> > > > > > > > https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
> > > > > > > > apache-aurora-0.18.1-rc1.tar.gz
> > > > > > > >
> > > > > > > > The MD5 checksum of the release candidate can be found at:
> > > > > > > > https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
> > > > > > > > apache-aurora-0.18.1-rc1.tar.gz.md5
> > > > > > > >
> > > > > > > > The signature of the release candidate can be found at:
> > > > > > > > https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
> > > > > > > > apache-aurora-0.18.1-rc1.tar.gz.asc
> > > > > > > >
> > > > > > > > The GPG key used to sign the release are available at:
> > > > > > > > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> > > > > > > >
> > > > > > > > Please download, verify, and test.
> > > > > > > >
> > > > > > > > The vote will close on Wed Nov  1 10:31:07 PDT 2017
> > > > > > > >
> > > > > > > > [ ] +1 Release this as Apache Aurora 0.18.1
> > > > > > > > [ ] +0
> > > > > > > > [ ] -1 Do not release this as Apache Aurora 0.18.1
> > > > > > > > because...
> > > > > > > >
> >
>


Re: Build failed in Jenkins: Aurora #1858

2017-10-30 Thread Bill Farner
I suppose the point is moot, since we have no control over available memory
being claimed after our build starts.  Looks like the jenkins slaves are
configured with 2 executors for the most part.  Next time we spot this,
let's try to determine the neighbor build; perhaps there's a common culprit
eating memory.

On Mon, Oct 30, 2017 at 12:21 PM, Stephan Erb  wrote:

> Actually there was never a discussion. I just enabled it as a test and
> then totally forgot about it because it worked surprisingly well.
>
> I believe we won't use more than 1-2 GB. I simply added the remaining 2
> as an additional safeguard when something else is launched on the
> Jenkins node shortly after we have passed the guard.
>
> On Tue, 2017-10-24 at 21:38 -0700, Bill Farner wrote:
> > Possibly rehashing an old discussion - does the build really require
> > 4 GB?
> >
> > On Mon, Oct 23, 2017 at 1:53 PM, Erb, Stephan  > r.com>
> > wrote:
> >
> > > Ah, again a node with insufficient memory. I once added a mechanism
> > > to
> > > abort the build early rather than running and eventually failing in
> > > these
> > > cases. This was very helpful for the regular reviewbot but is not
> > > that
> > > helpful for the normal SCM-triggered build.
> > >
> > > Can anyone think of a better way to handle this case here?
> > >
> > >
> > > On 23.10.17, 22:02, "Apache Jenkins Server"  > > org>
> > > wrote:
> > >
> > > See <https://builds.apache.org/job/Aurora/1858/display/
> > > redirect?page=changes>
> > >
> > > Changes:
> > >
> > > [david] Add sorting and filtering controls for TaskList
> > >
> > > --
> > > Started by an SCM change
> > > Started by an SCM change
> > > [EnvInject] - Loading node environment variables.
> > > Building remotely on ubuntu-4 (ubuntu trusty) in workspace <
> > > https://builds.apache.org/job/Aurora/ws/>;
> > > Wiping out workspace first.
> > > Cloning the remote Git repository
> > > Cloning repository https://git-wip-us.apache.org/repos/asf/auro
> > > ra.git
> > >  > git init <https://builds.apache.org/job/Aurora/ws/> #
> > > timeout=10
> > > Fetching upstream changes from https://git-wip-us.apache.org/
> > > repos/asf/aurora.git
> > >  > git --version # timeout=10
> > >  > git fetch --tags --progress https://git-wip-us.apache.org/
> > > repos/asf/aurora.git +refs/heads/*:refs/remotes/origin/*
> > >  > git config remote.origin.url https://git-wip-us.apache.org/
> > > repos/asf/aurora.git # timeout=10
> > >  > git config --add remote.origin.fetch
> > > +refs/heads/*:refs/remotes/origin/*
> > > # timeout=10
> > >  > git config remote.origin.url https://git-wip-us.apache.org/
> > > repos/asf/aurora.git # timeout=10
> > > Fetching upstream changes from https://git-wip-us.apache.org/
> > > repos/asf/aurora.git
> > >  > git fetch --tags --progress https://git-wip-us.apache.org/
> > > repos/asf/aurora.git +refs/heads/*:refs/remotes/origin/*
> > >  > git rev-parse origin/master^{commit} # timeout=10
> > > Checking out Revision 5b91150fd0668c23b178d80516427763764ac2d3
> > > (origin/master)
> > > Commit message: "Add sorting and filtering controls for
> > > TaskList"
> > >  > git config core.sparsecheckout # timeout=10
> > >  > git checkout -f 5b91150fd0668c23b178d80516427763764ac2d3
> > >  > git rev-list ec640117c273f51e26089cd83ba325be9e8a0e89 #
> > > timeout=10
> > > Cleaning workspace
> > >  > git rev-parse --verify HEAD # timeout=10
> > > Resetting working tree
> > >  > git reset --hard # timeout=10
> > >  > git clean -fdx # timeout=10
> > > [Aurora] $ /bin/bash -xe /tmp/jenkins2427407600627764864.sh
> > > + export HOME=<https://builds.apache.org/job/Aurora/ws/.home>
> > > + HOME=<https://builds.apache.org/job/Aurora/ws/.home>
> > > ++ awk '/^MemAvailable:/{print $2}' /proc/meminfo
> > > + available_mem_k=
> > > + echo
> > >
> > > + threshold_mem_k=4194304
> > > + ((  threshold_mem_k > available_mem_k  ))
> > > + echo 'Less than 4 GiB memory available. Bailing.'
> > > Less than 4 GiB memory available. Bailing.
> > > + exit 1
> > > Build step 'Execute shell' marked build as failure
> > > Recording test results
> > > ERROR: Step 'Publish JUnit test result report' failed: No test
> > > report
> > > files were found. Configuration error?
> > >
> > >
> > >
> > >
>


Re: [VOTE] Release Apache Aurora 0.18.1 RC1

2017-10-29 Thread Bill Farner
>
> sha512 signature


Thanks, this is now fixed.  The release script runs from the released SHA,
which pre-dates the inclusion of a sha512 signature.  The verification
script on master should otherwise work against 0.18.1-rc1, but no
guarantees.

On Sun, Oct 29, 2017 at 3:20 PM, Joshua Cohen  wrote:

> I'm trying to run the verify-release-candidate script, but getting a 404
> for the sha512 signature?
>
> + download_rc_file apache-aurora-0.18.1-rc1.tar.gz.sha512
> + download_dist_file 0.18.1-rc1/apache-aurora-0.18.1-rc1.tar.gz.sha512
> + curl -f -O
> https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
> apache-aurora-0.18.1-rc1.tar.gz.sha512
>   % Total% Received % Xferd  Average Speed   TimeTime Time
> Current
>  Dload  Upload   Total   SpentLeft
> Speed
>   0 00 00 0  0  0 --:--:-- --:--:-- --:--:--
>  0
> curl: (22) The requested URL returned error: 404 Not Found
>
>
> On Sun, Oct 29, 2017 at 5:03 PM, Bill Farner  wrote:
>
> > +1
> >
> > Verified by running ./build-support/release/verify-release-candidate
> > 0.18.1-rc1
> >
> > On Sun, Oct 29, 2017 at 12:05 PM, David McLaughlin <
> dmclaugh...@apache.org
> > >
> > wrote:
> >
> > > +1
> > >
> > > On Sun, Oct 29, 2017 at 10:32 AM, Bill Farner 
> > wrote:
> > >
> > > > All,
> > > >
> > > > I propose that we accept the following release candidate as the
> > official
> > > > Apache Aurora 0.18.1 release.
> > > >
> > > > Aurora 0.18.1-rc1 includes the following:
> > > > ---
> > > > The RELEASE NOTES for the release are available at:
> > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> > > > RELEASE-NOTES.md&hb=rel/0.18.1-rc1
> > > >
> > > > The CHANGELOG for the release is available at:
> > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> > > > CHANGELOG&hb=rel/0.18.1-rc1
> > > >
> > > > The tag used to create the release candidate is:
> > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=
> > > > shortlog;h=refs/tags/rel/0.18.1-rc1
> > > >
> > > > The release candidate is available at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
> > > > apache-aurora-0.18.1-rc1.tar.gz
> > > >
> > > > The MD5 checksum of the release candidate can be found at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
> > > > apache-aurora-0.18.1-rc1.tar.gz.md5
> > > >
> > > > The signature of the release candidate can be found at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
> > > > apache-aurora-0.18.1-rc1.tar.gz.asc
> > > >
> > > > The GPG key used to sign the release are available at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> > > >
> > > > Please download, verify, and test.
> > > >
> > > > The vote will close on Wed Nov  1 10:31:07 PDT 2017
> > > >
> > > > [ ] +1 Release this as Apache Aurora 0.18.1
> > > > [ ] +0
> > > > [ ] -1 Do not release this as Apache Aurora 0.18.1 because...
> > > >
> > >
> >
>


0.19.0 release preparation

2017-10-29 Thread Bill Farner
Folks,

I propose we cut our 0.19.0 release soon.  We have built up a respectable
set of unreleased changes
<https://github.com/apache/aurora/blob/master/RELEASE-NOTES.md>.  I am
happy to perform this release, and should be able to do so as soon as this
week.

Please chime in here if there is any outstanding work that should block a
release.


Cheers,

Bill


Re: [VOTE] Release Apache Aurora 0.18.1 RC1

2017-10-29 Thread Bill Farner
+1

Verified by running ./build-support/release/verify-release-candidate
0.18.1-rc1

On Sun, Oct 29, 2017 at 12:05 PM, David McLaughlin 
wrote:

> +1
>
> On Sun, Oct 29, 2017 at 10:32 AM, Bill Farner  wrote:
>
> > All,
> >
> > I propose that we accept the following release candidate as the official
> > Apache Aurora 0.18.1 release.
> >
> > Aurora 0.18.1-rc1 includes the following:
> > ---
> > The RELEASE NOTES for the release are available at:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> > RELEASE-NOTES.md&hb=rel/0.18.1-rc1
> >
> > The CHANGELOG for the release is available at:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=
> > CHANGELOG&hb=rel/0.18.1-rc1
> >
> > The tag used to create the release candidate is:
> > https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=
> > shortlog;h=refs/tags/rel/0.18.1-rc1
> >
> > The release candidate is available at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
> > apache-aurora-0.18.1-rc1.tar.gz
> >
> > The MD5 checksum of the release candidate can be found at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
> > apache-aurora-0.18.1-rc1.tar.gz.md5
> >
> > The signature of the release candidate can be found at:
> > https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/
> > apache-aurora-0.18.1-rc1.tar.gz.asc
> >
> > The GPG key used to sign the release are available at:
> > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> >
> > Please download, verify, and test.
> >
> > The vote will close on Wed Nov  1 10:31:07 PDT 2017
> >
> > [ ] +1 Release this as Apache Aurora 0.18.1
> > [ ] +0
> > [ ] -1 Do not release this as Apache Aurora 0.18.1 because...
> >
>


[VOTE] Release Apache Aurora 0.18.1 RC1

2017-10-29 Thread Bill Farner
All,

I propose that we accept the following release candidate as the official
Apache Aurora 0.18.1 release.

Aurora 0.18.1-rc1 includes the following:
---
The RELEASE NOTES for the release are available at:
https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=RELEASE-NOTES.md&hb=rel/0.18.1-rc1

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=CHANGELOG&hb=rel/0.18.1-rc1

The tag used to create the release candidate is:
https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/tags/rel/0.18.1-rc1

The release candidate is available at:
https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/apache-aurora-0.18.1-rc1.tar.gz

The MD5 checksum of the release candidate can be found at:
https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/apache-aurora-0.18.1-rc1.tar.gz.md5

The signature of the release candidate can be found at:
https://dist.apache.org/repos/dist/dev/aurora/0.18.1-rc1/apache-aurora-0.18.1-rc1.tar.gz.asc

The GPG key used to sign the release are available at:
https://dist.apache.org/repos/dist/dev/aurora/KEYS

Please download, verify, and test.

The vote will close on Wed Nov  1 10:31:07 PDT 2017

[ ] +1 Release this as Apache Aurora 0.18.1
[ ] +0
[ ] -1 Do not release this as Apache Aurora 0.18.1 because...


Re: Build failed in Jenkins: Aurora #1858

2017-10-24 Thread Bill Farner
Possibly rehashing an old discussion - does the build really require 4 GB?

On Mon, Oct 23, 2017 at 1:53 PM, Erb, Stephan 
wrote:

> Ah, again a node with insufficient memory. I once added a mechanism to
> abort the build early rather than running and eventually failing in these
> cases. This was very helpful for the regular reviewbot but is not that
> helpful for the normal SCM-triggered build.
>
> Can anyone think of a better way to handle this case here?
>
>
> On 23.10.17, 22:02, "Apache Jenkins Server" 
> wrote:
>
> See <https://builds.apache.org/job/Aurora/1858/display/redirect?page=changes>
>
> Changes:
>
> [david] Add sorting and filtering controls for TaskList
>
> --
> Started by an SCM change
> Started by an SCM change
> [EnvInject] - Loading node environment variables.
> Building remotely on ubuntu-4 (ubuntu trusty) in workspace <
> https://builds.apache.org/job/Aurora/ws/>
> Wiping out workspace first.
> Cloning the remote Git repository
> Cloning repository https://git-wip-us.apache.org/repos/asf/aurora.git
>  > git init <https://builds.apache.org/job/Aurora/ws/> # timeout=10
> Fetching upstream changes from https://git-wip-us.apache.org/
> repos/asf/aurora.git
>  > git --version # timeout=10
>  > git fetch --tags --progress https://git-wip-us.apache.org/
> repos/asf/aurora.git +refs/heads/*:refs/remotes/origin/*
>  > git config remote.origin.url https://git-wip-us.apache.org/
> repos/asf/aurora.git # timeout=10
>  > git config --add remote.origin.fetch 
> +refs/heads/*:refs/remotes/origin/*
> # timeout=10
>  > git config remote.origin.url https://git-wip-us.apache.org/
> repos/asf/aurora.git # timeout=10
> Fetching upstream changes from https://git-wip-us.apache.org/
> repos/asf/aurora.git
>  > git fetch --tags --progress https://git-wip-us.apache.org/
> repos/asf/aurora.git +refs/heads/*:refs/remotes/origin/*
>  > git rev-parse origin/master^{commit} # timeout=10
> Checking out Revision 5b91150fd0668c23b178d80516427763764ac2d3
> (origin/master)
> Commit message: "Add sorting and filtering controls for TaskList"
>  > git config core.sparsecheckout # timeout=10
>  > git checkout -f 5b91150fd0668c23b178d80516427763764ac2d3
>  > git rev-list ec640117c273f51e26089cd83ba325be9e8a0e89 # timeout=10
> Cleaning workspace
>  > git rev-parse --verify HEAD # timeout=10
> Resetting working tree
>  > git reset --hard # timeout=10
>  > git clean -fdx # timeout=10
> [Aurora] $ /bin/bash -xe /tmp/jenkins2427407600627764864.sh
> + export HOME=<https://builds.apache.org/job/Aurora/ws/.home>
> + HOME=<https://builds.apache.org/job/Aurora/ws/.home>
> ++ awk '/^MemAvailable:/{print $2}' /proc/meminfo
> + available_mem_k=
> + echo
>
> + threshold_mem_k=4194304
> + ((  threshold_mem_k > available_mem_k  ))
> + echo 'Less than 4 GiB memory available. Bailing.'
> Less than 4 GiB memory available. Bailing.
> + exit 1
> Build step 'Execute shell' marked build as failure
> Recording test results
> ERROR: Step 'Publish JUnit test result report' failed: No test report
> files were found. Configuration error?
>
>
>
>


Re: gorealis is now officially a PayPal Open Source Project

2017-10-16 Thread Bill Farner
Congrats on releasing!

On Oct 16, 2017, 4:15 PM -0500, Renan DelValle , 
wrote:
> Hi all,
>
> Just wanted to drop a note about a recent update for gorealis[1]. For those
> who aren't familiar with it, gorealis is a library that aims to enable
> users to programmatically interact with the Aurora scheduler without
> dealing with thrift directly.
>
> A few days ago the project was moved from my public repository to PayPal's
> Open Source Github repository. Although the team at PayPal has been behind
> it 100% since day one, this shift symbolizes a new level of commitment to
> the project.
>
> We hope that gorealis is helpful both as a library and as an example of
> how to engage the Aurora scheduler through thrift.
>
> If anyone has any questions regarding the project or how to get started
> using it, or better yet, wants to make a contribution, feel free to reach
> out to me by e-mail, through Github, through Slack (mesos.slack.com), or
> even Tweet at me @renandelvalle
>
> Thanks!
>
> -Renan
>
> [1] https://github.com/paypal/gorealis


Re: Future of storage in Aurora

2017-10-03 Thread Bill Farner
Good questions!


> What does this mean in terms of the original goals behind the storage system
> refactor?


My current effort is targeting goals (b), (c), and (d) in the above list.

Are we confident that Jordan's work for hot-followers will alleviate the
> problems w/ long failovers?


My plans will not rely on Jordan's work.  I do, however, hope that it will
enable straightforward support for warm standby and very fast failover.

I'd also like to know what our plans are for storage in the future


The plan is nascent and still undergoing prototyping, but i intend to
implement a log-structured storage on top of a key-value abstraction.  This
would eliminate the need for snapshots.  The first implementation will be
backed by ZooKeeper.  I'll send out a doc once i have confidence in the
approach.

Also, what does this mean for stores that have never existed as non-H2 (i.e.
> the job update store).


They will need to be reimplemented with map-based stores.  I'm not fazed
by this part of the effort, and should have a JobUpdateStore implementation
ready over the next few days.
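
As a purely illustrative sketch of the map-based approach (the generic store
below is a hypothetical stand-in, not the actual JobUpdateStore interface,
which is richer: fetch by query, history pruning, and so on), the idea is
simply thrift value objects held in plain Java containers:

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified shape only: thrift value objects keyed by their
// thrift key struct in a plain concurrent map.
final class InMemoryStore<K, V> {
  private final Map<K, V> values = new ConcurrentHashMap<>();

  void save(K key, V value) {
    values.put(key, value);
  }

  Optional<V> fetch(K key) {
    return Optional.ofNullable(values.get(key));
  }

  void delete(K key) {
    values.remove(key);
  }

  void deleteAll() {
    values.clear();
  }
}

Consistency across keys would still rely on the existing global write lock
mentioned below; the map only gives per-key atomicity.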


> Will converting it have an impact on, e.g., storage write-lock contention?


For JobUpdateStore, i expect scheduler performance to increase
significantly.  This store has been a performance quagmire in high-scale
clusters.

Looking around at the current state of write locking, we're still at the
whim of a global write lock in LogStorage, so we at least should not
regress!


On Tue, Oct 3, 2017 at 7:45 AM, Joshua Cohen  wrote:

> What does this mean in terms of the original goals behind the storage
> system refactor? Are we confident that Jordan's work for hot-followers will
> alleviate the problems w/ long failovers? I'm definitely in favor of
> killing the H2 code if its goals can never be realized and it's just a
> maintenance burden, but I'd also like to know what our plans are for
> storage in the future.
>
> Also, what does this mean for stores that have never existed as non-H2
> (i.e. the job update store). Will converting it have an impact on, e.g.,
> storage write-lock contention?
>
> On Sun, Oct 1, 2017 at 5:59 PM, Bill Farner  wrote:
>
> > I would like to revive this discussion in light of some work i have been
> > doing around the storage system.  The fruits of the DB storage system
> will
> > require a lot of additional effort to reach the beneficial outcomes i
> laid
> > out above, and i agree that we should cut our losses.
> >
> > I plan to introduce patches soon to introduce non-H2 in-memory store
> > implementations.  *If anyone disagrees with removing the H2
> implementations
> > as well, please chime in here.*
> >
> > Disclaimer - i may propose an alternative for the persistent storage in
> the
> > near future.
> >
> > On Mon, Apr 3, 2017 at 9:40 AM, Stephan Erb  wrote:
> >
> > > H2 could give us fine granular data access. However, most of our code
> > > performs massive joins to reconstruct fully hydrated thrift objects.
> > > Most of the time we are then only interested in very few properties of
> > > those thrift structs. This applies to internal usage, but also how we
> > > use the API.
> > >
> > > I therefore believe we have to improve and refine our domain model in
> > > order to significantly improve the storage situation.
> > >
> > > I really liked Maxim's proposal from last year, and I think it is worth
> > > reconsidering: https://docs.google.com/document/d/
> 1myYX3yuofGr8JIzud98x
> > > Xd5mqgpZ8q_RqKBpSff4-WE/edit
> > >
> > > Best regards,
> > > Stephan
> > >
> > > On Thu, 2017-03-30 at 15:53 -0700, David McLaughlin wrote:
> > > > So it sounds like before we make any decisions around removing the
> > > > work
> > > > done in H2 so far, we should figure out what is remaining to move to
> > > > external storage (or if it's even still a goal).
> > > >
> > > > I may still play around with reviving the in-memory stores, but will
> > > > separate that work from any goal to remove the H2 layer. Since it's
> > > > motivated by performance, I'd verify there is a benefit before
> > > > submitting
> > > > any review.
> > > >
> > > > Thanks all for the feedback.
> > > >
> > > >
> > > > On Thu, Mar 30, 2017 at 12:08 PM, Bill Farner <
> wfarnerapa...@gmail.co
> > > > m>
> > > > wrote:
> > > >
> > > > > Adding some background - there were several motivators to using SQL
> > > > > that
> >

Re: Future of storage in Aurora

2017-10-02 Thread Bill Farner
That’s right, nothing fancy.

On Oct 2, 2017, 4:24 AM -0700, Erb, Stephan , 
wrote:
> What do you have in mind for the in-memory replacement? Revert back to the 
> usage of thrift objects within plain Java containers like we do for the task 
> store?
>
> On 02.10.17, 00:59, "Bill Farner"  wrote:
>
> I would like to revive this discussion in light of some work i have been
> doing around the storage system. The fruits of the DB storage system will
> require a lot of additional effort to reach the beneficial outcomes i laid
> out above, and i agree that we should cut our losses.
>
> I plan to introduce patches soon to introduce non-H2 in-memory store
> implementations. *If anyone disagrees with removing the H2 implementations
> as well, please chime in here.*
>
> Disclaimer - i may propose an alternative for the persistent storage in the
> near future.
>
> On Mon, Apr 3, 2017 at 9:40 AM, Stephan Erb  wrote:
>
> > H2 could give us fine granular data access. However, most of our code
> > performs massive joins to reconstruct fully hydrated thrift objects.
> > Most of the time we are then only interested in very few properties of
> > those thrift structs. This applies to internal usage, but also how we
> > use the API.
> >
> > I therefore believe we have to improve and refine our domain model in
> > order to significantly improve the storage situation.
> >
> > I really liked Maxim's proposal from last year, and I think it is worth
> > reconsidering: https://docs.google.com/document/d/1myYX3yuofGr8JIzud98x
> > Xd5mqgpZ8q_RqKBpSff4-WE/edit
> >
> > Best regards,
> > Stephan
> >
> > On Thu, 2017-03-30 at 15:53 -0700, David McLaughlin wrote:
> > > So it sounds like before we make any decisions around removing the
> > > work
> > > done in H2 so far, we should figure out what is remaining to move to
> > > external storage (or if it's even still a goal).
> > >
> > > I may still play around with reviving the in-memory stores, but will
> > > separate that work from any goal to remove the H2 layer. Since it's
> > > motivated by performance, I'd verify there is a benefit before
> > > submitting
> > > any review.
> > >
> > > Thanks all for the feedback.
> > >
> > >
> > > On Thu, Mar 30, 2017 at 12:08 PM, Bill Farner  > > m
> > > wrote:
> > >
> > > > Adding some background - there were several motivators to using SQL
> > > > that
> > > > come to mind:
> > > > a) well-understood transaction isolation guarantees leading to a
> > > > simpler
> > > > programming model w.r.t. concurrency
> > > > b) ability to offload storage to a separate system (e.g. Postgres)
> > > > and
> > > > scale it separately
> > > > c) relief of computational burden of performing snapshots and
> > > > backups due
> > > > to (b)
> > > > d) simpler code and operations model due to (b)
> > > > e) schema backwards compatibility guarantees due to persistence-
> > > > friendly
> > > > migration-scripts
> > > > f) straightforward normalization to facilitate sharing of
> > > > otherwise-redundant state (I.e. TaskConfig)
> > > >
> > > > The storage overhaul comes with a huge caveat requiring the
> > > > approach to
> > > > scheduling rounds to change. I concur that the current model is
> > > > hostile to
> > > > offloaded storage, as ~all state must be read every scheduling
> > > > round. If
> > > > that cannot be worked around with lazy state or best-effort
> > > > concurrency
> > > > (I.e. in-memory caching), the approach is indeed flawed.
> > > >
> > > > On Mar 30, 2017, 10:29 AM -0700, Joshua Cohen ,
> > > > wrote:
> > > > > My understanding of the H2-backed stores is that at least part of
> > > > > the
> > > > > original rationale behind them was that they were meant to be an
> > > > > interim
> > > > > point on the way to external SQL-backed stores which should
> > > > > theoretically
> > > > > provide significant benefits w.r.t. to GC (obviously unproven,
> > > > > especially
> > > > > at scale).
> > > > >
> > > > > I don't disagree that the H2 stores themselves are problematic
> > > > > (to say
> > > >
> > > > the
> &

Re: Future of storage in Aurora

2017-10-01 Thread Bill Farner
I would like to revive this discussion in light of some work i have been
doing around the storage system.  The fruits of the DB storage system will
require a lot of additional effort to reach the beneficial outcomes i laid
out above, and i agree that we should cut our losses.

I plan to introduce patches soon to introduce non-H2 in-memory store
implementations.  *If anyone disagrees with removing the H2 implementations
as well, please chime in here.*

Disclaimer - i may propose an alternative for the persistent storage in the
near future.

On Mon, Apr 3, 2017 at 9:40 AM, Stephan Erb  wrote:

> H2 could give us fine granular data access. However, most of our code
> performs massive joins to reconstruct fully hydrated thrift objects.
> Most of the time we are then only interested in very few properties of
> those thrift structs. This applies to internal usage, but also how we
> use the API.
>
> I therefore believe we have to improve and refine our domain model in
> order to significantly improve the storage situation.
>
> I really liked Maxim's proposal from last year, and I think it is worth
> reconsidering: https://docs.google.com/document/d/1myYX3yuofGr8JIzud98x
> Xd5mqgpZ8q_RqKBpSff4-WE/edit
>
> Best regards,
> Stephan
>
> On Thu, 2017-03-30 at 15:53 -0700, David McLaughlin wrote:
> > So it sounds like before we make any decisions around removing the
> > work
> > done in H2 so far, we should figure out what is remaining to move to
> > external storage (or if it's even still a goal).
> >
> > I may still play around with reviving the in-memory stores, but will
> > separate that work from any goal to remove the H2 layer. Since it's
> > motivated by performance, I'd verify there is a benefit before
> > submitting
> > any review.
> >
> > Thanks all for the feedback.
> >
> >
> > On Thu, Mar 30, 2017 at 12:08 PM, Bill Farner  > m>
> > wrote:
> >
> > > Adding some background - there were several motivators to using SQL
> > > that
> > > come to mind:
> > > a) well-understood transaction isolation guarantees leading to a
> > > simpler
> > > programming model w.r.t. concurrency
> > > b) ability to offload storage to a separate system (e.g. Postgres)
> > > and
> > > scale it separately
> > > c) relief of computational burden of performing snapshots and
> > > backups due
> > > to (b)
> > > d) simpler code and operations model due to (b)
> > > e) schema backwards compatibility guarantees due to persistence-
> > > friendly
> > > migration-scripts
> > > f) straightforward normalization to facilitate sharing of
> > > otherwise-redundant state (I.e. TaskConfig)
> > >
> > > The storage overhaul comes with a huge caveat requiring the
> > > approach to
> > > scheduling rounds to change. I concur that the current model is
> > > hostile to
> > > offloaded storage, as ~all state must be read every scheduling
> > > round. If
> > > that cannot be worked around with lazy state or best-effort
> > > concurrency
> > > (I.e. in-memory caching), the approach is indeed flawed.
> > >
> > > On Mar 30, 2017, 10:29 AM -0700, Joshua Cohen ,
> > > wrote:
> > > > My understanding of the H2-backed stores is that at least part of
> > > > the
> > > > original rationale behind them was that they were meant to be an
> > > > interim
> > > > point on the way to external SQL-backed stores which should
> > > > theoretically
> > > > provide significant benefits w.r.t. to GC (obviously unproven,
> > > > especially
> > > > at scale).
> > > >
> > > > I don't disagree that the H2 stores themselves are problematic
> > > > (to say
> > >
> > > the
> > > > least); do we have evidence that returning to memory based stores
> > > > will be
> > > > an improvement on that?
> > > >
> > > > On Thu, Mar 30, 2017 at 12:16 PM, David McLaughlin <
> > >
> > > dmclaugh...@apache.org
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I'd like to start a discussion around storage in Aurora.
> > > > >
> > > > > I think one of the biggest mistakes we made in migrating our
> > > > > storage
> > >
> > > to H2
> > > > > was deleting the memory stores as we moved. We made a pretty
> > > > > big bet

Notice - removing rewriteConfigs API call

2017-09-27 Thread Bill Farner
FYI - i have a patch out for review 
that will remove the rewriteConfigs thrift API call, with no deprecation
period (see the review for rationale).  Currently there is movement to land
the patch.  If you use or care about this API call, please chime in ASAP!


Re: ResourceBag Ordering violates general contract

2017-07-28 Thread Bill Farner
I've implemented the above suggestion here:
https://reviews.apache.org/r/61238/

On Fri, Jul 28, 2017 at 7:19 PM, Bill Farner  wrote:

> Neat bug!
>
> The implementation gets into trouble when different resource types are in
> play, giving inconsistent results depending on which argument is 'left'.
>  e.g.
>
> host A: cpu=2, disk=1
> host B: cpu=2, ram=1
>
> In this example, compare(A, B) considers A > B, and compare(B, A)
> considers B > A.
>
> The fix would be to take the union of ResourceTypes between the two
> ResourceBags being compared, rather than only iterating over the
> ResourceTypes in one parameter.
>
>
> On Fri, Jul 28, 2017 at 4:07 PM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
>
>> Hi guys,
>>
>> There seems to be a bug in this comparator in PreemptionVictimFilter (
>> https://github.com/apache/aurora/blob/master/src/main/java/
>> org/apache/aurora/scheduler/preemptor/PreemptionVictimFilter.java#L142)
>> because the comparator doesn’t satisfy transitivity, which is required by
>> Array.sort's Tim Sort implementation. This fails with a
>> "java.lang.IllegalArgumentException: Comparison method violates its
>> general
>> contract!" from time to time.
>>
>> For example this input of A, B, and C doesn't work in the current
>> comparator.
>>
>> A: <1, 1, 1>
>> B: <2, 2, 2>
>> C: <0, 2, 1>
>>
>> Based on the comparator:
>>
>> B > A
>> B == C
>> C == A
>>
>> Constructs impossible situation where C == B > A == C. As a workaround we
>> patched to simply sort based on RAM, but anyone have any suggestions about
>> what a permanent fix for upstream should be? Thanks
>>
>> Mauricio
>>
>
>


Re: ResourceBag Ordering violates general contract

2017-07-28 Thread Bill Farner
Neat bug!

The implementation gets into trouble when different resource types are in
play, giving inconsistent results depending on which argument is 'left'.
 e.g.

host A: cpu=2, disk=1
host B: cpu=2, ram=1

In this example, compare(A, B) considers A > B, and compare(B, A) considers
B > A.

The fix would be to take the union of ResourceTypes between the two
ResourceBags being compared, rather than only iterating over the
ResourceTypes in one parameter.
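
For illustration, here is a minimal, self-contained sketch of that
union-of-types idea.  The ResourceType and ResourceBag classes below are
simplified stand-ins for the real Aurora types, and the fixed-order
lexicographic comparison is only one way to make the result independent of
argument order; the actual patch (https://reviews.apache.org/r/61238/) may
order victims differently:

import java.util.Comparator;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Simplified stand-ins for the real ResourceType/ResourceBag classes.
enum ResourceType { CPUS, RAM_MB, DISK_MB }

final class ResourceBag {
  private final Map<ResourceType, Double> amounts;

  ResourceBag(Map<ResourceType, Double> amounts) {
    this.amounts = new EnumMap<>(ResourceType.class);
    this.amounts.putAll(amounts);
  }

  // Types absent from the bag are treated as zero so that bags with
  // different resource type sets can still be compared consistently.
  double valueOf(ResourceType type) {
    return amounts.getOrDefault(type, 0.0);
  }

  Iterable<ResourceType> types() {
    return amounts.keySet();
  }
}

final class TransitiveBagComparator implements Comparator<ResourceBag> {
  @Override
  public int compare(ResourceBag left, ResourceBag right) {
    // Walk the union of the types present in either bag, in one fixed
    // global order, and compare lexicographically.  The traversal no
    // longer depends on which argument is 'left'.
    EnumSet<ResourceType> union = EnumSet.noneOf(ResourceType.class);
    for (ResourceType type : left.types()) {
      union.add(type);
    }
    for (ResourceType type : right.types()) {
      union.add(type);
    }
    for (ResourceType type : union) {
      int result = Double.compare(left.valueOf(type), right.valueOf(type));
      if (result != 0) {
        return result;
      }
    }
    return 0;
  }
}

Because the union is walked in a single fixed order, compare(A, B) and
compare(B, A) always agree and the ordering is transitive, which satisfies
the Arrays.sort contract.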


On Fri, Jul 28, 2017 at 4:07 PM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> Hi guys,
>
> There seems to be a bug in this comparator in PreemptionVictimFilter (
> https://github.com/apache/aurora/blob/master/src/main/
> java/org/apache/aurora/scheduler/preemptor/PreemptionVictimFilter.java#
> L142)
> because the comparator doesn’t satisfy transitivity, which is required by
> Arrays.sort's TimSort implementation. This fails with a
> "java.lang.IllegalArgumentException: Comparison method violates its
> general
> contract!" from time to time.
>
> For example this input of A, B, and C doesn't work in the current
> comparator.
>
> A: <1, 1, 1>
> B: <2, 2, 2>
> C: <0, 2, 1>
>
> Based on the comparator:
>
> B > A
> B == C
> C == A
>
> Constructs impossible situation where C == B > A == C. As a workaround we
> patched to simply sort based on RAM, but anyone have any suggestions about
> what a permanent fix for upstream should be? Thanks
>
> Mauricio
>


Re: Reducing Failover Time by Eagerly Reading/Replaying Log in Followers

2017-07-26 Thread Bill Farner
Some (hopefully) constructive criticism:

- the doc is very high-level on the problem statement and the proposal,
making it difficult to agree with prioritization over cheaper snapshots or
the oft-discussed support of an external DBMS.

- the supporting data is a single data point of the
scheduler_log_recover_nanos_total metric.  More data points and more detail
on this data (how many entries/bytes did this represent?) would help
normalize the metric, and possibly indicate whether recover time is linear
or non-linear.  Finer-grained information would also help (where was time
spent within the replay - GC?  reading log entries?  inflating snapshots?).

- the doc calls out parts (1) mesos log support and (2) scheduler support.
Is the planned approach to gain value from (1) before (2), or are both
needed?

- for (2) scheduler support, can you add detail on the implementation?
Much of the scheduler code assumes it is the leader
(CallOrderEnforcingStorage is currently a gatekeeper to avoid mistakes of
this type), so i would caution against replaying directly into the main
Storage.


On Wed, Jul 26, 2017 at 1:56 PM, Santhosh Kumar Shanmugham <
sshanmug...@twitter.com.invalid> wrote:

> +1
>
> This sets up the stage for more potential benefits by offloading work from
> the leading scheduler that consumes stable data (that is not affected by
> minor inconsistencies).
>
> On Wed, Jul 26, 2017 at 10:31 AM, David McLaughlin  >
> wrote:
>
> > I'm +1 to this approach over my proposal. With the enforced daily
> failover,
> > it's a much bigger win to make failovers "cheap" than making snapshots
> > cheap, and this is going to be backwards compatible too.
> >
> > On Wed, Jul 26, 2017 at 9:51 AM, Jordan Ly  wrote:
> >
> > > Hello everyone!
> > >
> > > I've created a document with an initial proposal to reduce leader
> > > failover time by eagerly reading and replaying the replicated log in
> > > followers:
> > >
> > > https://docs.google.com/document/d/10SYOq0ehLMFKQ9rX2TGC_xpM--
> > > GBnstzMFP-tXGQaVI/edit?usp=sharing
> > >
> > > We wanted to open up this topic for discussion with the community and
> > > see if anyone had any alternate opinions or recommendations before
> > > starting the work.
> > >
> > > If this solution seems reasonable, we will write and release a design
> > > document for a more formal discussion and review.
> > >
> > > Please feel free to comment on the doc, or let me know if you have any
> > > concerns.
> > >
> > > -Jordan
> > >
> >
>


Re: Future of storage in Aurora

2017-03-30 Thread Bill Farner
Adding some background - there were several motivators to using SQL that come 
to mind:
a) well-understood transaction isolation guarantees leading to a simpler 
programming model w.r.t. concurrency
b) ability to offload storage to a separate system (e.g. Postgres) and scale it 
separately
c) relief of computational burden of performing snapshots and backups due to (b)
d) simpler code and operations model due to (b)
e) schema backwards compatibility guarantees due to persistence-friendly 
migration-scripts
f) straightforward normalization to facilitate sharing of otherwise-redundant 
state (I.e. TaskConfig)

The storage overhaul comes with a huge caveat requiring the approach to 
scheduling rounds to change. I concur that the current model is hostile to 
offloaded storage, as ~all state must be read every scheduling round. If that 
cannot be worked around with lazy state or best-effort concurrency (I.e. 
in-memory caching), the approach is indeed flawed.

On Mar 30, 2017, 10:29 AM -0700, Joshua Cohen , wrote:
> My understanding of the H2-backed stores is that at least part of the
> original rationale behind them was that they were meant to be an interim
> point on the way to external SQL-backed stores which should theoretically
> provide significant benefits w.r.t. to GC (obviously unproven, especially
> at scale).
>
> I don't disagree that the H2 stores themselves are problematic (to say the
> least); do we have evidence that returning to memory based stores will be
> an improvement on that?
>
> On Thu, Mar 30, 2017 at 12:16 PM, David McLaughlin  wrote:
>
> > Hi all,
> >
> > I'd like to start a discussion around storage in Aurora.
> >
> > I think one of the biggest mistakes we made in migrating our storage to H2
> > was deleting the memory stores as we moved. We made a pretty big bet that
> > we could eventually make H2/relational databases work. I don't think that
> > bet has paid off and that we need to revisit the direction we're taking.
> >
> > My belief is that the current H2/MyBatis approach is untenable for large
> > production clusters, at least without changing our current single-master
> > architecture. At Twitter we are already having to fight to keep GC
> > manageable even without DbTaskStore enabled, so I don't see a path forward
> > where we could eventually enable that. So far experiments with H2 off-heap
> > storage have provided marginal (if any) gains.
> >
> > Would anyone object to restoring the in-memory stores and creating new
> > implementations for the missing ones (UpdateStore)? I'd even go further and
> > propose that we consider in-memory H2 and MyBatis a failed experiment and
> > we drop that storage layer completely.
> >
> > Cheers,
> > David
> >


Re: A sketch for supporting mesos maintenance

2016-11-09 Thread Bill Farner
(1) sounds like an inevitability, do you have a sense of what stands in the
way, or what it will take?

(2) is a win for ending behavior redundancy. This is probably in the doc,
but I'm lazy - are maintenance statuses surfaced in offers? IIRC the
original incarnation of maintenance modes in mesos didn't surface that
info, which eliminated important state for scheduling.

On Wed, Nov 9, 2016 at 3:09 PM Zameer Manji  wrote:

> Hey,
>
> This is not a design doc for supporting Mesos Maintenance, but more of a
> high level overview on how we *could* support it going forward. I just
> wanted to get this idea out there now to see where we all stand.
>
> As Ankit mentioned in AURORA-1800 Mesos has had Maintenance primitives
> since 0.25. You can read about them here
> . The
> primitives
> map pretty well to our existing concept of maintenance, but they allow
> operators to do work across multiple frameworks.
>
> Since the Mesos community is growing and new frameworks are emerging all
> the time, I think Aurora should support these primitives and drop our
> custom primitives to be a better player in the ecosystem.
>
> We cannot adopt these just yet however, because it is only accessible
> behind the Mesos HTTP API which Aurora does not use today. Further,
> `aurora_admin` has some SLA aware maintenance processes which are computed
> and coordinated from the client. I think for us to successfully adopt Mesos
> Maintenance, we need to do at least two things:
>
> 1. Adopt the Mesos HTTP API.
> 2. Move the SLA aware maintenance logic from the admin tool into the
> scheduler itself, so the scheduler can coordinate with the Mesos Master in
> an SLA aware fashion.
>
> What do folks think?
>
> --
> Zameer Manji
>


Re: /offers endpoint is being modified

2016-07-19 Thread Bill Farner
Sounds like the patch should include a comment in the release notes.

On Tuesday, July 19, 2016, Mehrdad Nurolahzade 
wrote:

> Hi All,
>
> As part of AURORA-1736 
> and to facilitate future development for Dynamic Reservation, I am
> modifying the JSON dump generated by the /offers http debug endpoint (see
> this  review board for more
> information).
>
> If you have any comments or concerns please let me know.
>
> Cheers,
> Mehrdad
>


Re: mesos-log health check HTTP endpoint

2016-06-22 Thread Bill Farner
>
> Once it's started and it can open the log it won't crash and starts mesos-log
> recovery


My memory is fuzzy here, but i was under the impression that holes in the
log were filled before open() returned.  Have you observed otherwise?

On Wed, Jun 22, 2016 at 8:02 AM, Martin Hrabovčin <
martin.hrabov...@gmail.com> wrote:

> If there is some obvious issue with the replicated log then the open() call would
> fail and cause aurora to exit or restart itself. I am looking at a
> different issue - if there are 3 aurora instances that need the update, it's
> hard to tell right now at which point it's safe to move from one instance to
> another. Let's say there is a rolling update going, applying the update on one
> aurora instance at a time. One instance is down and out of rotation. Once
> it's started and it can open the log it won't crash and starts mesos-log
> recovery. But if you start the upgrade on the 2nd instance before the mesos-log
> is replicated to the first one it's easy to lose quorum and data. I'd like to
> have some deterministic check that would allow us to ensure that it's safe to
> consider the log replicated.
>
> 2016-06-17 16:05 GMT+02:00 Bill Farner :
>
> > If i recall correctly, the current implementation of the mesos log
> requires
> > that the callers handle mutually-exclusive access for reads and writes.
> > This means that non-leading schdulers may not read or write to perform
> the
> > check you describe.
> >
> > What's the behavior of the scheduler when it starts and the log replica
> is
> > non-VOTING?  I thought the log open() call would fail, and the scheduler
> > process would exit (giving a strong signal that the scheduler is not
> > healthy).
> >
> > On Fri, Jun 17, 2016 at 2:44 AM, Martin Hrabovčin <
> > martin.hrabov...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I was asking same question in #aurora channel and I still haven't found
> > an
> > > answer so I am bringing this in mailing list with a proposal.
> > >
> > > Is there a way to check the state of mesos-log (whether it is
> > > writable
> > in
> > > VOTING state) through some HTTP check outside of aurora process on a
> > > non-leading aurora instance? We are trying to create external check
> that
> > > would determine whether the mesos-log is ready in case of aurora
> rolling
> > > update. When adding new instance to existing aurora cluster and we want
> > to
> > > make sure that mesos-log is replicated and replica is ready to serve
> > reads
> > > and writes. Currently we’re grep-ing java process log and looking for
> > > “Persisted replica status to VOTING”.
> > >
> > > I was pointed to /vars endpoint but I haven't found obvious answer
> there.
> > >
> > > I'd like to propose creating new HTTP endpoint "/loghealth" that would
> > > similarly to "/leaderhealth" return 200 when mesos-log is ready and 503
> > in
> > > case when mesos log throws exception. As for implementation I was
> > thinking
> > > about doing simple read from log or write noop to log directly.
> > >
> > > Thanks!
> > >
> >
>


Re: mesos-log health check HTTP endpoint

2016-06-17 Thread Bill Farner
If i recall correctly, the current implementation of the mesos log requires
that the callers handle mutually-exclusive access for reads and writes.
This means that non-leading schdulers may not read or write to perform the
check you describe.

What's the behavior of the scheduler when it starts and the log replica is
non-VOTING?  I thought the log open() call would fail, and the scheduler
process would exit (giving a strong signal that the scheduler is not
healthy).

On Fri, Jun 17, 2016 at 2:44 AM, Martin Hrabovčin <
martin.hrabov...@gmail.com> wrote:

> Hello,
>
> I was asking same question in #aurora channel and I still haven't found an
> answer so I am bringing this in mailing list with a proposal.
>
> Is there a way to check the state of mesos-log (whether it is writable in
> VOTING state) through some HTTP check outside of aurora process on a
> non-leading aurora instance? We are trying to create external check that
> would determine whether the mesos-log is ready in case of aurora rolling
> update. When adding new instance to existing aurora cluster and we want to
> make sure that mesos-log is replicated and replica is ready to serve reads
> and writes. Currently we’re grep-ing java process log and looking for
> “Persisted replica status to VOTING”.
>
> I was pointed to /vars endpoint but I haven't found obvious answer there.
>
> I'd like to propose creating new HTTP endpoint "/loghealth" that would
> similarly to "/leaderhealth" return 200 when mesos-log is ready and 503 in
> case when mesos log throws exception. As for implementation I was thinking
> about doing simple read from log or write noop to log directly.
>
> Thanks!
>
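
Purely as an illustration of the endpoint proposed above (nothing like this
exists in Aurora today), a rough servlet sketch; the LogProbe hook is
hypothetical and stands in for whatever cheap read or no-op append the
scheduler could perform against the mesos-log:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class LogHealthServlet extends HttpServlet {

  // Hypothetical hook: performs a cheap read (or no-op append) against the
  // replicated log and throws if the local replica is not ready.
  public interface LogProbe {
    void probe() throws Exception;
  }

  private final LogProbe probe;

  public LogHealthServlet(LogProbe probe) {
    this.probe = probe;
  }

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    try {
      probe.probe();
      // 200: the local replica appears ready to serve reads/writes.
      resp.setStatus(HttpServletResponse.SC_OK);
      resp.getWriter().println("OK");
    } catch (Exception e) {
      // 503: the mesos-log is not (yet) usable from this scheduler.
      resp.setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
      resp.getWriter().println("mesos-log not ready: " + e.getMessage());
    }
  }
}

Whether such a probe is even possible from a non-leading scheduler is exactly
the concern raised in the reply above, so treat this as a shape for the HTTP
contract (200 vs. 503) rather than a workable implementation.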


Re: Few things we would like to support in aurora scheduler

2016-06-16 Thread Bill Farner
>
> We don't have an easy way to assign a common unique identifier for
> all JobUpdates in different aurora clusters in order to reconcile them
> later into a single meta update job so to speak. Instead we need to
> generate that ID and keep it in every aurora's JobUpdate
> metadata (JobUpdateRequest.taskConfig). Then in order to get the status of the
> upgrade workflow running in different data centers we have to query all
> recent jobs and, based on their metadata content, try to filter in the ones that
> we think belong to a currently running upgrade for the service.


Can you elaborate on the shortcoming of using TaskConfig.metadata?  From a
quick read, it seems like your proposal does with an explicit field what
you can accomplish with the more versatile metadata field.  For example,
you could store a git commit SHA in TaskConfig.metadata, and identify the
commit in use by each instance as well as track the revision changes when a
job is updated.

However, i feel like i may be missing some context as "query all recent
jobs" sounds like a broader query scope than i would expect.

We propose a new convenience API to rollback a running or complete
> JobUpdate:
>   /** Rollback job update. */
>   Response rollbackJobUpdate(
>       /** The update to rollback. */
>       1: JobUpdateKey key,
>       /** A user-specified message to include with the induced job update
>           state change. */
>       3: string message)


I think this is a great idea!  It's something i've thought about for a
while, but haven't really had the personal need.

The next problem is related to the way we collect service cluster
> status. I couldn't find a way to quickly get the latest statuses for all
> instances/shards for a job in one query. Instead we query all task statuses
> for a job, then manually iterate through all the statuses and filter in the
> latest one grouped by instance ids. For services with lots of churn on
> task statuses that means huge blobs of thrift transferred every time we
> issue a query. I was thinking of adding something along these lines:


Does a TaskQuery filtering by job key and ACTIVE_STATES solve this?  Still
includes the TaskConfig, but it's a single query, and probably rarely
exceeds 1 MB in response payload.
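
To make that concrete, a hedged sketch of building such a query with the
thrift-generated Java bindings; the class, field, and enum names follow
api.thrift as of roughly this era and may differ between releases, and the
state list is abbreviated relative to the real ACTIVE_STATES constant:

import java.util.Collections;
import java.util.EnumSet;
import org.apache.aurora.gen.JobKey;
import org.apache.aurora.gen.ScheduleStatus;
import org.apache.aurora.gen.TaskQuery;

public final class ActiveTaskQueryExample {

  private ActiveTaskQueryExample() { }

  // Builds a query matching only the currently-active task per instance of a
  // single job.  NOTE: the state list is an abbreviated stand-in for the
  // ACTIVE_STATES constant defined in api.thrift.
  public static TaskQuery activeTasksFor(String role, String env, String name) {
    return new TaskQuery()
        .setJobKeys(Collections.singleton(
            new JobKey().setRole(role).setEnvironment(env).setName(name)))
        .setStatuses(EnumSet.of(
            ScheduleStatus.PENDING,
            ScheduleStatus.ASSIGNED,
            ScheduleStatus.STARTING,
            ScheduleStatus.RUNNING,
            ScheduleStatus.KILLING));
  }
}

The resulting query can then be handed to the scheduler's task status RPC in
a single call, rather than fetching the job's full status history and
reducing it client-side.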


On Thu, Jun 16, 2016 at 1:28 PM, Igor Morozov  wrote:

> Hi aurora people,
>
> I would like to start a discussion around few things we would like to see
> supported in aurora scheduler. It is based on our experience of integrating
> aurora into Uber infrastructure and I believe all the items I'm going to
> talk about will benefit the community and people running aurora clusters.
>
> 1. We support multiple aurora clusters in different failure domains and we
> run services in those domains. The upgrade workflow for those services
> includes rolling out the same version of a service software to all aurora
> clusters concurrently while monitoring the health status and other service
> vitals that includes like checking error logs, service stats,
> downstream/upstream services health. That means we occasionally need to
> manually trigger a rollback if things go south and rollback all the update
> jobs in all aurora clusters for that particular service. So here are the
> problems we discovered so far with this approach:
>
>- We don't have an easy way to assign a common unique identifier for
> all JobUpdates in different aurora clusters in order to reconcile them
> later into a single meta update job so to speak. Instead we need to
> generate that ID and keep it in every aurora's JobUpdate
> metadata(JobUpdateRequest.taskConfig). Then in order to get the status the
> upgrade workflow running in different data centers we have to query all
> recent jobs and based on their metadata content try to filter in ones that
> we thing belongs to a currently running upgrade for the service.
>
> We propose to change
> struct JobUpdateRequest {
>   /** Desired TaskConfig to apply. */
>   1: TaskConfig taskConfig
>
>   /** Desired number of instances of the task config. */
>   2: i32 instanceCount
>
>   /** Update settings and limits. */
>   3: JobUpdateSettings settings
>
>   /** Optional Job Update key's id, if not specified aurora will generate
>       one. */
>
>   4: optional string id
> }
>
> There is potentially another much more involved solution of supporting user
> defined metadata mentioned in this ticket:
> https://issues.apache.org/jira/browse/AURORA-1711
>
>
> -  All that brings us to a second problem we had to deal with during
> the upgrade:
> We don't have a good way to manually trigger a job update rollback in
> aurora. The use case is again the same, while running multiple update jobs
> in different aurora clusters we have a real production requirement to start
> rolling back update jobs if things are misbehaving and the nature of this
> misbehavior could be potentially very complex. Currently we abort the job
> update and start a new one that would essentially roll cluster forward to a
> previously ru

Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)

2016-06-15 Thread Bill Farner
>
> Assuming we don't run into any roadblocks: How about changing the
> default of `-zk_use_curator` from False to True for the next release?


+1

I believe that was the plan of action, though i can't recall if it was
recorded anywhere more official than the dev list.

On Wed, Jun 15, 2016 at 9:44 AM, Stephan Erb  wrote:

> Thanks for doing the follow-up! I'll gradually enable the option on our
> clusters sometime next week and let you know if we hit any issues.
>
> Assuming we don't run into any roadblocks: How about changing the
> default of `-zk_use_curator` from False to True for the next release?
> Then we can make significant progress with the deprecation while still
> giving operators the possibility to fall back if necessary.
>
> Cheers,
> Stephan
>
> On Di, 2016-06-14 at 17:43 -0600, John Sirois wrote:
> > I'd like to move forward with https://issues.apache.org/jira/browse/A
> > URORA-1669 asap; ie: removing legacy (Twitter) commons zookeeper
> > libraries used for Aurora leader election in favor of Apache Curator
> > libraries. The change submitted in https://reviews.apache.org/r/46286
> > / is now live in Aurora 0.14.0 and Apache Curator based service
> > discovery can be enabled with the Aurora scheduler flag `-
> > zk_use_curator`.  I'd like feedback from users who enable this
> > option.  If you have a test cluster where you can enable `-
> > zk_use_curator` and exercise leader failure and failover, I'd be
> > grateful.  If you have moved to using this option in production with
> > demonstrable improvements or even maintenance of status quo, I'd also
> > be grateful for this news. If you've found regressions or new bugs,
> > I'd love to know about those as well.
> >
> > Thanks in advance to all those who find time to test this out on real
> > systems!
>
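
For context on what the flag switches to, here is an illustrative sketch of
Curator-based leader election using the LeaderLatch recipe. This is the
general Curator pattern, not Aurora's actual implementation; the connect
string, latch path, and instance id are placeholders.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

final class LeaderElectionSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",           // placeholder ensemble
        new ExponentialBackoffRetry(1000, 3));  // retry policy
    client.start();

    // Each scheduler instance creates a latch on the same path; Curator elects
    // exactly one leader and re-elects another instance on failover.
    LeaderLatch latch = new LeaderLatch(client, "/aurora/scheduler/leader", "scheduler-1");
    latch.addListener(new LeaderLatchListener() {
      @Override public void isLeader() { /* start acting as the leading scheduler */ }
      @Override public void notLeader() { /* stop serving; typically restart to re-elect */ }
    });
    latch.start();
    latch.await();  // blocks until this instance becomes leader
  }
}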


Re: Aurora performance impact with hourly query runs

2016-06-12 Thread Bill Farner
MemTaskStore is the default

On Sunday, June 12, 2016,  wrote:

> Yes Maxim, really appreciate the tip. That's quite a difference.
> One follow-up question: any reason for not making MemTaskStore the default
> in aurora?
>
> Thx
>
> Sent from my iPhone
>
> > On Jun 12, 2016, at 9:48 AM, Shyam Patel  > wrote:
> >
> > The query performance improved drastically, It took only 29ms for 12K
> jobs/30K tasks.. (from an hour !)
> >
> > Thanks Maxim for quick lead, really appreciate your help.
> >
> >
> >
> > Thanks,
> > Sham
> >
> >> On Jun 9, 2016, at 10:06 AM, Maxim Khutornenko  > wrote:
> >>
> >> Scheduler persists its state in the Mesos replicated log regardless of
> >> the in-memory engine. If you change the flag and restart scheduler all
> >> tasks are going to be re-inserted into MemTaskStore instead of
> >> DBTaskStore. No data will be lost.
> >>
> >>> On Thu, Jun 9, 2016 at 9:55 AM, Shyam Patel  > wrote:
> >>> Thanks Maxim,
> >>>
> >>> If we move to mem task store, restart of aurora would lose the data ?
> (btw, I’m running aurora in a container)
> >>>
> >>>
> >>>
> >>>> On Jun 9, 2016, at 8:37 AM, Maxim Khutornenko  > wrote:
> >>>>
> >>>> There are plenty of factors that may contribute towards the behavior
> >>>> you're observing. Based on the logs though it appears you are using
> >>>> DBTaskStore (-use_beta_db_task_store=true)? If so, you may want to
> >>>> revert to the default in-mem task store
> >>>> (-use_beta_db_task_store=false) as DBTaskStore is known to perform
> >>>> subpar on large task counts. This is a known issue and we plan to
> >>>> invest into making it faster.
> >>>>
> >>>> On Thu, Jun 9, 2016 at 6:58 AM, Erb, Stephan
> >>>> > wrote:
> >>>>> I am no expert here, but I would assume that slow task store
> operations could result from a slow replicated log. Have you tried keeping
> it on an SSD? (
> https://github.com/apache/aurora/blob/e89521f1eebd9a5301eb02e2ed6ffebdecd54c9a/docs/operations/configuration.md#-native_log_file_path
> )
> >>>>>
> >>>>> FWIW, there was a recent RB by Maxim to reduce Master load unter
> task reconciliation:
> https://reviews.apache.org/r/47373/diff/2#index_header
> >>>>> 
> >>>>> From: Shyam Patel >
> >>>>> Sent: Thursday, June 9, 2016 07:48
> >>>>> To: dev@aurora.apache.org 
> >>>>> Subject: Re: Aurora performance impact with hourly query runs
> >>>>>
> >>>>> Hi Bill,
> >>>>>
> >>>>> Cluster Set up : AWS
> >>>>>
> >>>>> 1 Mesos , 1 ZK , 1 Aurora instance : 4 CPU, 16G mem
> >>>>>
> >>>>> Aurora : Xmx 14G
> >>>>>
> >>>>> 100 nodes agent cluster : 40 CPU, 160G mem each
> >>>>>
> >>>>> 8000 Jobs, each with 2 instances. So, total ~16K containers
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Sham
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Jun 8, 2016, at 9:18 PM, Bill Farner  > wrote:
> >>>>>>
> >>>>>> Can you give some insight into the machine specs and JVM options
> used?
> >>>>>>
> >>>>>> Also, is it 8000 jobs or tasks?  The terms are often mixed up, but
> will
> >>>>>> have a big difference here.
> >>>>>>
> >>>>>>> On Wednesday, June 8, 2016, Shyam Patel  > wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> While running LnP testing, I’m spinning off 8K docker jobs. During
> the run,
> >>>>>>> I ran into issue where TaskStatUpdate and TaskReconciler queries
> taking
> >>>>>>> real long times. During the time, Aurora is pretty much freezing
> and at a
> >>>>>>> point dying.  Also, tried the same run w/o the docker jobs and
> faced the
> >>>>>>> same issue.
> >>>>>>>
> >>>>>>>
> >>>>>>> Is there a way to keep the Aurora performance intact during the
> query runs
> >>>>>>> ?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Here is snipped from log :
> >>>>>>>
> >>>>>>>
> >>>>>>> I0602 00:53:37.527 [TaskStatUpdaterService RUNNING,
> DbTaskStore:104] Query
> >>>>>>> took 1243517 ms: TaskQuery(owner:null, role:null, environment:null,
> >>>>>>> jobName:null, taskIds:null, statuses:[STARTING, THROTTLED, RUNNING,
> >>>>>>> DRAINING, ASSIGNED, KILLING, RESTARTING, PENDING, PREEMPTING],
> >>>>>>> instanceIds:null, slaveHosts:null, jobKeys:null, offset:0, limit:0)
> >>>>>>>
> >>>>>>>
> >>>>>>> I0602 00:56:54.180 [TaskReconciler-0, DbTaskStore:104] Query took
> 1380169
> >>>>>>> ms: TaskQuery(owner:null, role:null, environment:null,
> jobName:null,
> >>>>>>> taskIds:null, statuses:[STARTING, RUNNING, DRAINING, ASSIGNED,
> KILLING,
> >>>>>>> RESTARTING, PREEMPTING], instanceIds:null, slaveHosts:null,
> jobKeys:null,
> >>>>>>> offset:0, limit:0)
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Appreciate any insights..
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Sham
> >
>


Re: Aurora performance impact with hourly query runs

2016-06-08 Thread Bill Farner
Can you give some insight into the machine specs and JVM options used?

Also, is it 8000 jobs or tasks?  The terms are often mixed up, but will
have a big difference here.

On Wednesday, June 8, 2016, Shyam Patel  wrote:

> Hi,
>
> While running LnP testing, I’m spinning off 8K docker jobs. During the run,
> I ran into issue where TaskStatUpdate and TaskReconciler queries taking
> real long times. During the time, Aurora is pretty much freezing and at a
> point dying.  Also, tried the same run w/o the docker jobs and faced the
> same issue.
>
>
> Is there a way to keep the Aurora performance intact during the query runs
> ?
>
>
>
> Here is snipped from log :
>
>
> I0602 00:53:37.527 [TaskStatUpdaterService RUNNING, DbTaskStore:104] Query
> took 1243517 ms: TaskQuery(owner:null, role:null, environment:null,
> jobName:null, taskIds:null, statuses:[STARTING, THROTTLED, RUNNING,
> DRAINING, ASSIGNED, KILLING, RESTARTING, PENDING, PREEMPTING],
> instanceIds:null, slaveHosts:null, jobKeys:null, offset:0, limit:0)
>
>
> I0602 00:56:54.180 [TaskReconciler-0, DbTaskStore:104] Query took 1380169
> ms: TaskQuery(owner:null, role:null, environment:null, jobName:null,
> taskIds:null, statuses:[STARTING, RUNNING, DRAINING, ASSIGNED, KILLING,
> RESTARTING, PREEMPTING], instanceIds:null, slaveHosts:null, jobKeys:null,
> offset:0, limit:0)
>
>
>
> Appreciate any insights..
>
>
> Thanks,
> Sham
>
>


Re: Subscribing to Aurora's tasks' event changes

2016-04-25 Thread Bill Farner
I don't have a firm example in mind, I just don't think the approach
recommended by Zameer is optimal.  It's only marginally better than forking
(and probably requires you to fork for the sake of sanity).

Thinking out loud, it would be nice if Kafka had a JMS interface/bridge, as
it would allow us to add support for a bunch of backends with one
implementation.  Unfortunately that does not appear to be the case.

On Monday, April 25, 2016, Dmitriy Shirchenko  wrote:

> @wfarner
>
> Can you help me out and clarify what you mean by 'First-class mechanism'?
> An example would be awesome.
>
> On Mon, Apr 25, 2016 at 4:40 PM Dmitriy Shirchenko  >
> wrote:
>
> > Has anyone built something that can subscribe to events and then send
> them
> > to a pub/sub system? Maybe can give pointers on how you would approach
> this?
> >
> > Our use case is sending TaskStateChange`s into an internal Kafka topic.
> >
> > Thanks!
> >
>


Re: Subscribing to Aurora's tasks' event changes

2016-04-25 Thread Bill Farner
Fwiw I would favor a first-class mechanism instead.  I think pointing to
auth modules as a successful pattern is a mistake.

On Monday, April 25, 2016, Zameer Manji  wrote:

> I think a good approach to take would be to have `PubsubEventModule` take
> modules as an arg like `HttpSecurityModule`. Those modules can create
> instances of `EventSubscriber` and those subscribers can consume the events
> you desire and do whatever you want.
>
> I think this is a low touch approach that is flexible and aligned with how
> we do other things.
>
> On Mon, Apr 25, 2016 at 4:40 PM, Dmitriy Shirchenko  >
> wrote:
>
> > Has anyone built something that can subscribe to events and then send
> them
> > to a pub/sub system? Maybe can give pointers on how you would approach
> > this?
> >
> > Our use case is sending TaskStateChange`s into an internal Kafka topic.
> >
> > Thanks!
> >
> > --
> > Zameer Manji
> >
> >
>
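
To make the module-based approach described above a bit more concrete, here is
a minimal sketch of such a subscriber. The EventSubscriber/TaskStateChange names
and the @Subscribe wiring are assumed from the scheduler's internal event bus
(they may differ by version), and the Kafka setup and serialization are
placeholders; a small Guice module would bind an instance of this class and
hand it to PubsubEventModule as suggested above.

import java.util.Properties;

import com.google.common.eventbus.Subscribe;
import org.apache.aurora.scheduler.events.PubsubEvent.EventSubscriber;
import org.apache.aurora.scheduler.events.PubsubEvent.TaskStateChange;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

class KafkaTaskEventForwarder implements EventSubscriber {
  private final KafkaProducer<String, String> producer;
  private final String topic;

  KafkaTaskEventForwarder(String bootstrapServers, String topic) {
    Properties props = new Properties();
    props.put("bootstrap.servers", bootstrapServers);
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    this.producer = new KafkaProducer<>(props);
    this.topic = topic;
  }

  // Invoked by the scheduler's event bus for every task state transition.
  @Subscribe
  public void taskChangedState(TaskStateChange stateChange) {
    // A real implementation would serialize the task id plus old/new state
    // (e.g. as JSON) instead of relying on toString().
    producer.send(new ProducerRecord<>(topic, stateChange.toString()));
  }
}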


Re: experimental require_docker_use_executor

2016-04-20 Thread Bill Farner
There's no real roadmap for this feature, i added it just because a bunch
of people had been asking and i finally found a personal need for it :-)

Adding support for CMD sounds great.  Could you approach it by adding a
field to DockerContainer [1]?

[1]
https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L192-L197

On Tue, Apr 19, 2016 at 6:46 AM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> Hi!
>
> I was happy to find the new require_docker_use_executor option included in
> 0.13. We have been using a similar hack
> <
> https://github.com/medallia/aurora/commit/2280cf104d5601cc86c50b482f4631d31b08d640
> >
> for a while and was a bit surprised to find something similar made it
> through upstream. What are the plans for this feature in the future?
>
> Something extremely useful would be being able to specify the 'CMD' of a docker
> image with an ENTRYPOINT, i.e. the args passed to the entrypoint.
>
> This can be achieved by setting the CommandInfo Value here
> <
> https://github.com/apache/aurora/commit/3806e626a244a9338d9040d4e7132e02deb5065e#diff-de7054c6a8ecbed836d6f4c49b2249d9R162
> >
> and leaving the Shell attr to false, but the question is how do we get this
> value. We were using the first process cmdline attribute, as a hack, but
> would like to hear what the plans are. I'll be glad to implement it.
>
> Mauricio
>


Re: [VOTE] Release Apache Aurora 0.13.0 RC0

2016-04-12 Thread Bill Farner
+1

verification passes for me

On Tue, Apr 12, 2016 at 1:34 PM, Joshua Cohen  wrote:

> Ahh, thanks for being on top of that John!
>
> On Tue, Apr 12, 2016 at 3:05 PM, John Sirois 
> wrote:
>
> > On Tue, Apr 12, 2016 at 1:48 PM, Joshua Cohen  wrote:
> >
> > > +1, ran the verification script, everything looks good to me.
> > >
> > > Nitpick: the NEWS link in the email 404s because we renamed it to
> > > RELEASE-NOTES.md. We should probably update the script that generates
> the
> > > email?
> > >
> >
> > See my response above - fixed here earlier this am:
> > https://reviews.apache.org/r/46070/
> >
> >
> > >
> > > On Tue, Apr 12, 2016 at 10:02 AM, Erb, Stephan <
> > > stephan@blue-yonder.com>
> > > wrote:
> > >
> > > > +1 for releasing 0.13.0-rc0 as Aurora 0.13.0
> > > >
> > > > * tested with the verification script
> > > > * deployed the RC to an inhouse test cluster
> > > >
> > > > I am also OK with fixing the changelog afterwards.
> > > >
> > > > 
> > > > From: Bill Farner 
> > > > Sent: Tuesday, April 12, 2016 16:18
> > > > To: jfarr...@apache.org
> > > > Cc: dev@aurora.apache.org
> > > > Subject: Re: [VOTE] Release Apache Aurora 0.13.0 RC0
> > > >
> > > > I'm okay with not blocking the release on it.
> > > >
> > > > On Tuesday, April 12, 2016, Jake Farrell 
> wrote:
> > > >
> > > > > My fault, I thought we had changed this awhile back to
> automatically
> > > pull
> > > > > in any tickets not having a set fix version that had been marked as
> > > > > resolved and that this had become a post step to clean up jira
> after
> > > the
> > > > > release when marking the version as released.
> > > > >
> > > > > This should not be a blocker for the 0.13.0-rc0 release candidate
> as
> > > the
> > > > > CHANGELOG is not a requirement for a release, just a nice best
> > > practice.
> > > > > Will follow up with a patch to trunk to update for the missing
> > section.
> > > > If
> > > > > this is a must have in anyone's opinion then we can cancel this vote
> > and
> > > > > start a new release candidate, thoughts?
> > > > >
> > > > > -Jake
> > > > >
> > > > > On Tue, Apr 12, 2016 at 1:59 AM, Bill Farner  > > > > > wrote:
> > > > >
> > > > >> Changelog looks sparse.  In the past we have tagged all resolved
> > > tickets
> > > > >> with fixVersion that don't already have fixVersion set, which I
> > > believe
> > > > >> lands them in the changelog.  Presumably that step is undocumented
> > and
> > > > not
> > > > >> automated, and as a result didn't happen?
> > > > >>
> > > > >> On Monday, April 11, 2016, John Sirois  > > > >> > wrote:
> > > > >>
> > > > >>> +1 - Tested with
> `./build-support/release/verify-release-candidate
> > > > >>> 0.13.0-rc0`.
> > > > >>>
> > > > >>> On Mon, Apr 11, 2016 at 9:26 PM, Jake Farrell <
> jfarr...@apache.org
> > >
> > > > >>> wrote:
> > > > >>>
> > > > >>> > All,
> > > > >>> >
> > > > >>> > I propose that we accept the following release candidate as the
> > > > >>> official
> > > > >>> > Apache Aurora 0.13.0 release.
> > > > >>> >
> > > > >>> > Aurora 0.13.0-rc0 includes the following:
> > > > >>> > ---
> > > > >>> > The NEWS for the release is available at:
> > > > >>> >
> > > > >>> >
> > > > >>>
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=NEWS&hb=rel/0.13.0-rc0
> > > > >>>
> > > > >>>
> > > > >>> The NEWS link above is broken, but was fixed here:
> > > > >>> https://reviews.apache.org/r/46070/
> > > > >>>
> > > > >>>
> > > > >>> >

Re: [VOTE] Release Apache Aurora 0.13.0 RC0

2016-04-12 Thread Bill Farner
I'm okay with not blocking the release on it.

On Tuesday, April 12, 2016, Jake Farrell  wrote:

> My fault, I thought we had changed this awhile back to automatically pull
> in any tickets not having a set fix version that had been marked as
> resolved and that this had become a post step to clean up jira after the
> release when marking the version as released.
>
> This should not be a blocker for the 0.13.0-rc0 release candidate as the
> CHANGELOG is not a requirement for a release, just a nice best practice.
> Will follow up with a patch to trunk to update for the missing section. If
> this is a must have in anyone's opinion then we can cancel this vote and
> start a new release candidate, thoughts?
>
> -Jake
>
> On Tue, Apr 12, 2016 at 1:59 AM, Bill Farner  > wrote:
>
>> Changelog looks sparse.  In the past we have tagged all resolved tickets
>> with fixVersion that don't already have fixVersion set, which I believe
>> lands them in the changelog.  Presumably that step is undocumented and not
>> automated, and as a result didn't happen?
>>
>> On Monday, April 11, 2016, John Sirois > > wrote:
>>
>>> +1 - Tested with `./build-support/release/verify-release-candidate
>>> 0.13.0-rc0`.
>>>
>>> On Mon, Apr 11, 2016 at 9:26 PM, Jake Farrell 
>>> wrote:
>>>
>>> > All,
>>> >
>>> > I propose that we accept the following release candidate as the
>>> official
>>> > Apache Aurora 0.13.0 release.
>>> >
>>> > Aurora 0.13.0-rc0 includes the following:
>>> > ---
>>> > The NEWS for the release is available at:
>>> >
>>> >
>>> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=NEWS&hb=rel/0.13.0-rc0
>>>
>>>
>>> The NEWS link above is broken, but was fixed here:
>>> https://reviews.apache.org/r/46070/
>>>
>>>
>>> >
>>> >
>>> > The CHANGELOG for the release is available at:
>>> >
>>> >
>>> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=CHANGELOG&hb=rel/0.13.0-rc0
>>>
>>>
>>> The CHANGELOG looks light with 2 entries, but that may be correct.  If
>>> it's
>>> not correct, I'm not sure if this is an RC blocker or not ... I voted
>>> assuming it was not.
>>>
>>>
>>> >
>>> >
>>> > The tag used to create the release candidate is:
>>> >
>>> >
>>> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/tags/rel/0.13.0-rc0
>>> >
>>> > The release candidate is available at:
>>> >
>>> >
>>> https://dist.apache.org/repos/dist/dev/aurora/0.13.0-rc0/apache-aurora-0.13.0-rc0.tar.gz
>>> >
>>> > The MD5 checksum of the release candidate can be found at:
>>> >
>>> >
>>> https://dist.apache.org/repos/dist/dev/aurora/0.13.0-rc0/apache-aurora-0.13.0-rc0.tar.gz.md5
>>> >
>>> > The signature of the release candidate can be found at:
>>> >
>>> >
>>> https://dist.apache.org/repos/dist/dev/aurora/0.13.0-rc0/apache-aurora-0.13.0-rc0.tar.gz.asc
>>> >
>>> > The GPG key used to sign the release are available at:
>>> > https://dist.apache.org/repos/dist/dev/aurora/KEYS
>>> >
>>> > Please download, verify, and test.
>>> >
>>> > The vote will close on Thu Apr 14 23:24:12 EDT 2016, please vote
>>> >
>>> > [ ] +1 Release this as Apache Aurora 0.13.0
>>> > [ ] +0
>>> > [ ] -1 Do not release this as Apache Aurora 0.13.0 because...
>>> >
>>> >
>>> > I'd like to get the voting started with my own +1
>>> >
>>> > -Jake
>>> >
>>>
>>
>


Re: [VOTE] Release Apache Aurora 0.13.0 RC0

2016-04-11 Thread Bill Farner
Changelog looks sparse.  In the past we have tagged all resolved tickets
with fixVersion that don't already have fixVersion set, which I believe
lands them in the changelog.  Presumably that step is undocumented and not
automated, and as a result didn't happen?

On Monday, April 11, 2016, John Sirois  wrote:

> +1 - Tested with `./build-support/release/verify-release-candidate
> 0.13.0-rc0`.
>
> On Mon, Apr 11, 2016 at 9:26 PM, Jake Farrell  > wrote:
>
> > All,
> >
> > I propose that we accept the following release candidate as the official
> > Apache Aurora 0.13.0 release.
> >
> > Aurora 0.13.0-rc0 includes the following:
> > ---
> > The NEWS for the release is available at:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=NEWS&hb=rel/0.13.0-rc0
>
>
> The NEWS link above is broken, but was fixed here:
> https://reviews.apache.org/r/46070/
>
>
> >
> >
> > The CHANGELOG for the release is available at:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=CHANGELOG&hb=rel/0.13.0-rc0
>
>
> The CHANGELOG looks light with 2 entries, but that may be correct.  If it's
> not correct, I'm not sure if this is an RC blocker or not ... I voted
> assuming it was not.
>
>
> >
> >
> > The tag used to create the release candidate is:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/tags/rel/0.13.0-rc0
> >
> > The release candidate is available at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.13.0-rc0/apache-aurora-0.13.0-rc0.tar.gz
> >
> > The MD5 checksum of the release candidate can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.13.0-rc0/apache-aurora-0.13.0-rc0.tar.gz.md5
> >
> > The signature of the release candidate can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.13.0-rc0/apache-aurora-0.13.0-rc0.tar.gz.asc
> >
> > The GPG key used to sign the release are available at:
> > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> >
> > Please download, verify, and test.
> >
> > The vote will close on Thu Apr 14 23:24:12 EDT 2016, please vote
> >
> > [ ] +1 Release this as Apache Aurora 0.13.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Aurora 0.13.0 because...
> >
> >
> > I'd like to get the voting started with my own +1
> >
> > -Jake
> >
>


Re: Upgrade executor without restarting task

2016-04-07 Thread Bill Farner
Correct, you cannot currently do this.  Can you share what motivates
this need?  Do you plan on iterating rapidly on the executor in production?

In general, one of the principles of a system like Aurora is to embrace the
reality that components can be restarted without warning, which enables
non-disruptive infrastructure upgrades.

On Thu, Apr 7, 2016 at 7:38 PM, Zhitao Li  wrote:

> Hi,
>
> I'm writing to see whether anyone can share some real life experience about
> managing the thermos executor, especially about upgrading it fleet-wise.
>
>
> AFAICT, we cannot upgrade thermos executor w/o restarting the running task
> because Mesos does not allow an executor to be restarted/replaced without
> losing tasks belonging to it.
>
> We are thinking about building such capability in Mesos and possibly extend
> it to Aurora. Given this seems like an involved project, I'd like to gather
> some feedback from the community on whether people consider this valuable.
>
> Thanks.
>
> --
> Cheers,
>
> Zhitao Li
>


Re: [DISCUSS]: 0.13.0 release candidate

2016-04-06 Thread Bill Farner
Patch has landed - tests came up green on the new image.  We're good to go!

On Wed, Apr 6, 2016 at 9:14 PM, Bill Farner  wrote:

> Nix that - i've updated the dev image, and have green e2e tests.  I will
> have a go at merging 45177 tomorrow, provided green tests when applied.
>
> On Wed, Apr 6, 2016 at 4:02 PM, Bill Farner  wrote:
>
>> Stephan - you could install jq within the vagrant provisioning as a
>> workaround in the meantime.
>>
>> On Wed, Apr 6, 2016 at 3:52 PM, Erb, Stephan > > wrote:
>>
>>> Short heads up: I believe I might be blocking the release candidate
>>> right now :-/.
>>>
>>> * Goal was to get https://reviews.apache.org/r/45177/ merged
>>> * Before we can merge this, we need  to rebuild of the vagrant base
>>> image due to this change https://reviews.apache.org/r/45782/
>>> * Unfortunately, I can't get the e2e tests to pass in the newly
>>> generated image. I therefore don't feel comfortable pushing forward here
>>>
>>> I won't be able to look into this before Friday evening. Hopefully I get
>>> it resolved on Friday (unless someone else wants to have a look first)
>>> 
>>> From: Maxim Khutornenko 
>>> Sent: Monday, April 4, 2016 23:14
>>> To: dev@aurora.apache.org
>>> Subject: Re: [DISCUSS]: 0.13.0 release candidate
>>>
>>> +1
>>>
>>> On Mon, Apr 4, 2016 at 2:10 PM, Zameer Manji  wrote:
>>> > +1
>>> >
>>> > On Mon, Apr 4, 2016 at 12:27 PM, Bill Farner 
>>> wrote:
>>> >
>>> >> +1, fire away
>>> >>
>>> >> On Mon, Apr 4, 2016 at 12:26 PM, Jake Farrell 
>>> wrote:
>>> >>
>>> >> > Other than a couple deprecation clean up tickets, in AURORA-1584
>>> [1], it
>>> >> > looks like we are about ready to cut the 0.13.0 release candidate
>>> and
>>> >> start
>>> >> > a vote. I wanted to open the floor up for any last minute requests
>>> or
>>> >> > patches people would like to see make it in before we finalize and
>>> cut
>>> >> the
>>> >> > release candidate. Currently planning on cutting the release
>>> candidate
>>> >> this
>>> >> > Wednesday, April 6th, pending no blockers coming out of this
>>> discussion
>>> >> > thread. Thoughts, objections?
>>> >> >
>>> >> > -Jake
>>> >> >
>>> >> >
>>> >> > [1]: https://issues.apache.org/jira/browse/AURORA-1584
>>> >> >
>>> >>
>>> >> --
>>> >> Zameer Manji
>>> >>
>>> >>
>>>
>>
>>
>


Re: [DISCUSS]: 0.13.0 release candidate

2016-04-06 Thread Bill Farner
Nix that - i've updated the dev image, and have green e2e tests.  I will
have a go at merging 45177 tomorrow, provided green tests when applied.

On Wed, Apr 6, 2016 at 4:02 PM, Bill Farner  wrote:

> Stephan - you could install jq within the vagrant provisioning as a
> workaround in the meantime.
>
> On Wed, Apr 6, 2016 at 3:52 PM, Erb, Stephan 
> wrote:
>
>> Short heads up: I believe I might be blocking the release candidate right
>> now :-/.
>>
>> * Goal was to get https://reviews.apache.org/r/45177/ merged
>> * Before we can merge this, we need  to rebuild of the vagrant base image
>> due to this change https://reviews.apache.org/r/45782/
>> * Unfortunately, I can't get the e2e tests to pass in the newly
>> generated image. I therefore don't feel comfortable pushing forward here
>>
>> I won't be able to look into this before Friday evening. Hopefully I get
>> it resolved on Friday (unless someone else wants to have a look first)
>> 
>> From: Maxim Khutornenko 
>> Sent: Monday, April 4, 2016 23:14
>> To: dev@aurora.apache.org
>> Subject: Re: [DISCUSS]: 0.13.0 release candidate
>>
>> +1
>>
>> On Mon, Apr 4, 2016 at 2:10 PM, Zameer Manji  wrote:
>> > +1
>> >
>> > On Mon, Apr 4, 2016 at 12:27 PM, Bill Farner 
>> wrote:
>> >
>> >> +1, fire away
>> >>
>> >> On Mon, Apr 4, 2016 at 12:26 PM, Jake Farrell 
>> wrote:
>> >>
>> >> > Other than a couple deprecation clean up tickets, in AURORA-1584
>> [1], it
>> >> > looks like we are about ready to cut the 0.13.0 release candidate and
>> >> start
>> >> > a vote. I wanted to open the floor up for any last minute requests or
>> >> > patches people would like to see make it in before we finalize and
>> cut
>> >> the
>> >> > release candidate. Currently planning on cutting the release
>> candidate
>> >> this
>> >> > Wednesday, April 6th, pending no blockers coming out of this
>> discussion
>> >> > thread. Thoughts, objections?
>> >> >
>> >> > -Jake
>> >> >
>> >> >
>> >> > [1]: https://issues.apache.org/jira/browse/AURORA-1584
>> >> >
>> >>
>> >> --
>> >> Zameer Manji
>> >>
>> >>
>>
>
>


Re: [DISCUSS]: 0.13.0 release candidate

2016-04-06 Thread Bill Farner
Stephan - you could install jq within the vagrant provisioning as a
workaround in the meantime.

On Wed, Apr 6, 2016 at 3:52 PM, Erb, Stephan 
wrote:

> Short heads up: I believe I might be blocking the release candidate right
> now :-/.
>
> * Goal was to get https://reviews.apache.org/r/45177/ merged
> * Before we can merge this, we need  to rebuild of the vagrant base image
> due to this change https://reviews.apache.org/r/45782/
> * Unfortunately, I can't get the e2e tests to pass in the newly
> generated image. I therefore don't feel comfortable pushing forward here
>
> I won't be able to look into this before Friday evening. Hopefully I get
> it resolved on Friday (unless someone else wants to have a look first)
> 
> From: Maxim Khutornenko 
> Sent: Monday, April 4, 2016 23:14
> To: dev@aurora.apache.org
> Subject: Re: [DISCUSS]: 0.13.0 release candidate
>
> +1
>
> On Mon, Apr 4, 2016 at 2:10 PM, Zameer Manji  wrote:
> > +1
> >
> > On Mon, Apr 4, 2016 at 12:27 PM, Bill Farner  wrote:
> >
> >> +1, fire away
> >>
> >> On Mon, Apr 4, 2016 at 12:26 PM, Jake Farrell 
> wrote:
> >>
> >> > Other than a couple deprecation clean up tickets, in AURORA-1584 [1],
> it
> >> > looks like we are about ready to cut the 0.13.0 release candidate and
> >> start
> >> > a vote. I wanted to open the floor up for any last minute requests or
> >> > patches people would like to see make it in before we finalize and cut
> >> the
> >> > release candidate. Currently planning on cutting the release candidate
> >> this
> >> > Wednesday, April 6th, pending no blockers coming out of this
> discussion
> >> > thread. Thoughts, objections?
> >> >
> >> > -Jake
> >> >
> >> >
> >> > [1]: https://issues.apache.org/jira/browse/AURORA-1584
> >> >
> >>
> >> --
> >> Zameer Manji
> >>
> >>
>


Re: [PROPOSAL] Support GPU resources in Aurora

2016-04-06 Thread Bill Farner
There has been separate discussion around supporting arbitrary resources.
Is it plausible to build that and get GPU support for free?

On Wed, Apr 6, 2016 at 10:41 AM, Maxim Khutornenko  wrote:

> Mesos community is finalizing their MVP for supporting GPUs:
> https://issues.apache.org/jira/browse/MESOS-4424
> Design doc:
> https://docs.google.com/document/d/10GJ1A80x4nIEo8kfdeo9B11PIbS1xJrrB4Z373Ifkpo/edit
>
> Would anyone have any reservations about supporting GPUs in Aurora?
> Initial walk through our resource management codebase shows we can do
> it without too many changes or excessive refactoring. If there are no
> objections I am willing to put up a design doc for review next.
>
> Thanks,
> Maxim
>


Re: Are we ready to remove the observer?

2016-04-04 Thread Bill Farner
>
> There is no process to host the observer UI when a task terminates.


Aha, i didn't realize that was in reference to Steve's comment "have the
executor itself host the observer web UI".
I agree, that approach is encumbered for terminal tasks.

On Mon, Apr 4, 2016 at 5:10 PM, Maxim Khutornenko  wrote:

> Sorry, I should have tried harder explaining myself. There is no
> process to host the observer UI when a task terminates. We still want
> (and arguably more so) to look at terminal task details but since
> there is no process left to host the http server, there is no way to
> access that data in the current way of things.
>
> On Mon, Apr 4, 2016 at 3:54 PM, Bill Farner  wrote:
> >>
> >> It falls apart for terminal tasks when executor process is not running
> >> anymore.
> >
> >
> > This sounds important...can you recall what would not work in that
> > scenario?  I figured it would work ~identically because the observer
> > follows the lifecycle of the task sandbox directory.
> >
> > On Mon, Apr 4, 2016 at 2:43 PM, Maxim Khutornenko 
> wrote:
> >
> >> I think we've discussed this option before. It falls apart for
> >> terminal tasks when executor process is not running anymore.
> >>
> >> One of the possible ways forward could be extending Mesos UI to
> >> opportunistically consume task data periodically dumped by an executor
> >> into a json file. That could cover the functionality gap created by
> >> killing the observer and let other frameworks customize their task
> >> views in a standard and pluggable way.
> >>
> >> On Mon, Apr 4, 2016 at 2:31 PM, Steve Niemitz 
> wrote:
> >> > It seems like the easiest path forward would be to have the executor
> >> itself
> >> > host the observer web UI, if the HTTP port for the UI were configured
> as
> >> > just another port on the task, the aurora UI could just link to /mname
> >> for
> >> > that instance.
> >> >
> >> > I think the overall "what is running on this machine" view the
> observer
> >> > displays (if you go to it without a task ID) is much less useful and
> >> could
> >> > probably be removed without much sadness.
> >> >
> >> > On Mon, Apr 4, 2016 at 5:23 PM, Bill Farner 
> wrote:
> >> >
> >> >> >
> >> >> > why don't we revisit the problem from the other direction and see
> if
> >> we
> >> >> > can remove checkpoints?
> >> >>
> >> >>
> >> >> Simplicity, again :-)  If it turns out we don't need the observer
> >> anyhow,
> >> >> it saves a lot of time.  I'm just poking at different parts to make
> >> sure we
> >> >> can still justify their weight.
> >> >>
> >> >> On Mon, Apr 4, 2016 at 1:54 PM, Zameer Manji 
> wrote:
> >> >>
> >> >> > On Mon, Apr 4, 2016 at 1:47 PM, Bill Farner 
> >> wrote:
> >> >> >
> >> >> > > We clearly have different experiences - i've never really
> benefited
> >> >> from
> >> >> > > viewing the process graph, as most jobs have very simple
> sequences
> >> that
> >> >> > > could be easily explained by a text file in the sandbox.  On the
> >> >> > contrary,
> >> >> > > i've encountered people confused by the process graph, the
> observer,
> >> >> and
> >> >> > > sandbox browsing...so i must respectfully disagree that it is
> >> >> universally
> >> >> > > appreciated.
> >> >> > >
> >> >> > > What i'm trying to achieve is simplicity.  The observer is an
> extra
> >> >> > moving
> >> >> > > part, and another thing for operators to understand and maintain.
> >> It
> >> >> > also
> >> >> > > couples Aurora to one relatively specific way of running tasks,
> >> which
> >> >> > makes
> >> >> > > it difficult to open new use cases like Docker tasks.  Removing
> the
> >> >> > > observer starts to pull on a thread of complexity that i don't
> think
> >> >> > Aurora
> >> >> > > benefits much from, for example state checkpointing by the
> executor.
> >> >> > >
> >> >> > > My goal is not to apply pressure, but to perform a gut check.  If
> >> the
> >> >> > > answer is "No", that's fine.
> >> >> > >
> >> >> >
> >> >> >
> >> >> > Bill,
> >> >> >
> >> >> > I think you are pulling on the right thread here but I think
> >> revisiting
> >> >> the
> >> >> > observer is the wrong way of approaching the problem. I also agree
> >> that
> >> >> > Aurora doesn't benefit much from state checkpointing by the
> executor
> >> and
> >> >> > the observer is an extension of that since it provides a read only
> >> human
> >> >> > friendly view of the data in the checkpoints. However, instead of
> >> >> removing
> >> >> > the observer (and degrading the UX around accessing the data in the
> >> >> > checkpoints), why don't we revisit the problem from the other
> >> direction
> >> >> and
> >> >> > see if we can remove checkpoints?
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Zameer Manji
> >> >> >
> >> >>
> >>
>


Re: Are we ready to remove the observer?

2016-04-04 Thread Bill Farner
>
> It falls apart for terminal tasks when executor process is not running
> anymore.


This sounds important...can you recall what would not work in that
scenario?  I figured it would work ~identically because the observer
follows the lifecycle of the task sandbox directory.

On Mon, Apr 4, 2016 at 2:43 PM, Maxim Khutornenko  wrote:

> I think we've discussed this option before. It falls apart for
> terminal tasks when executor process is not running anymore.
>
> One of the possible ways forward could be extending Mesos UI to
> opportunistically consume task data periodically dumped by an executor
> into a json file. That could cover the functionality gap created by
> killing the observer and let other frameworks customize their task
> views in a standard and pluggable way.
>
> On Mon, Apr 4, 2016 at 2:31 PM, Steve Niemitz  wrote:
> > It seems like the easiest path forward would be to have the executor
> itself
> > host the observer web UI, if the HTTP port for the UI were configured as
> > just another port on the task, the aurora UI could just link to /mname
> for
> > that instance.
> >
> > I think the overall "what is running on this machine" view the observer
> > displays (if you go to it without a task ID) is much less useful and
> could
> > probably be removed without much sadness.
> >
> > On Mon, Apr 4, 2016 at 5:23 PM, Bill Farner  wrote:
> >
> >> >
> >> > why don't we revisit the problem from the other direction and see if
> we
> >> > can remove checkpoints?
> >>
> >>
> >> Simplicity, again :-)  If it turns out we don't need the observer
> anyhow,
> >> it saves a lot of time.  I'm just poking at different parts to make
> sure we
> >> can still justify their weight.
> >>
> >> On Mon, Apr 4, 2016 at 1:54 PM, Zameer Manji  wrote:
> >>
> >> > On Mon, Apr 4, 2016 at 1:47 PM, Bill Farner 
> wrote:
> >> >
> >> > > We clearly have different experiences - i've never really benefited
> >> from
> >> > > viewing the process graph, as most jobs have very simple sequences
> that
> >> > > could be easily explained by a text file in the sandbox.  On the
> >> > contrary,
> >> > > i've encountered people confused by the process graph, the observer,
> >> and
> >> > > sandbox browsing...so i must respectfully disagree that it is
> >> universally
> >> > > appreciated.
> >> > >
> >> > > What i'm trying to achieve is simplicity.  The observer is an extra
> >> > moving
> >> > > part, and another thing for operators to understand and maintain.
> It
> >> > also
> >> > > couples Aurora to one relatively specific way of running tasks,
> which
> >> > makes
> >> > > it difficult to open new use cases like Docker tasks.  Removing the
> >> > > observer starts to pull on a thread of complexity that i don't think
> >> > Aurora
> >> > > benefits much from, for example state checkpointing by the executor.
> >> > >
> >> > > My goal is not to apply pressure, but to perform a gut check.  If
> the
> >> > > answer is "No", that's fine.
> >> > >
> >> >
> >> >
> >> > Bill,
> >> >
> >> > I think you are pulling on the right thread here but I think
> revisiting
> >> the
> >> > observer is the wrong way of approaching the problem. I also agree
> that
> >> > Aurora doesn't benefit much from state checkpointing by the executor
> and
> >> > the observer is an extension of that since it provides a read only
> human
> >> > friendly view of the data in the checkpoints. However, instead of
> >> removing
> >> > the observer (and degrading the UX around accessing the data in the
> >> > checkpoints), why don't we revisit the problem from the other
> direction
> >> and
> >> > see if we can remove checkpoints?
> >> >
> >> >
> >> > --
> >> > Zameer Manji
> >> >
> >>
>


Re: Are we ready to remove the observer?

2016-04-04 Thread Bill Farner
>
> why don't we revisit the problem from the other direction and see if we
> can remove checkpoints?


Simplicity, again :-)  If it turns out we don't need the observer anyhow,
it saves a lot of time.  I'm just poking at different parts to make sure we
can still justify their weight.

On Mon, Apr 4, 2016 at 1:54 PM, Zameer Manji  wrote:

> On Mon, Apr 4, 2016 at 1:47 PM, Bill Farner  wrote:
>
> > We clearly have different experiences - i've never really benefited from
> > viewing the process graph, as most jobs have very simple sequences that
> > could be easily explained by a text file in the sandbox.  On the
> contrary,
> > i've encountered people confused by the process graph, the observer, and
> > sandbox browsing...so i must respectfully disagree that it is universally
> > appreciated.
> >
> > What i'm trying to achieve is simplicity.  The observer is an extra
> moving
> > part, and another thing for operators to understand and maintain.  It
> also
> > couples Aurora to one relatively specific way of running tasks, which
> makes
> > it difficult to open new use cases like Docker tasks.  Removing the
> > observer starts to pull on a thread of complexity that i don't think
> Aurora
> > benefits much from, for example state checkpointing by the executor.
> >
> > My goal is not to apply pressure, but to perform a gut check.  If the
> > answer is "No", that's fine.
> >
>
>
> Bill,
>
> I think you are pulling on the right thread here but I think revisiting the
> observer is the wrong way of approaching the problem. I also agree that
> Aurora doesn't benefit much from state checkpointing by the executor and
> the observer is an extension of that since it provides a read only human
> friendly view of the data in the checkpoints. However, instead of removing
> the observer (and degrading the UX around accessing the data in the
> checkpoints), why don't we revisit the problem from the other direction and
> see if we can remove checkpoints?
>
>
> --
> Zameer Manji
>


Re: Are we ready to remove the observer?

2016-04-04 Thread Bill Farner
We clearly have different experiences - i've never really benefited from
viewing the process graph, as most jobs have very simple sequences that
could be easily explained by a text file in the sandbox.  On the contrary,
i've encountered people confused by the process graph, the observer, and
sandbox browsing...so i must respectfully disagree that it is universally
appreciated.

What i'm trying to achieve is simplicity.  The observer is an extra moving
part, and another thing for operators to understand and maintain.  It also
couples Aurora to one relatively specific way of running tasks, which makes
it difficult to open new use cases like Docker tasks.  Removing the
observer starts to pull on a thread of complexity that i don't think Aurora
benefits much from, for example state checkpointing by the executor.

My goal is not to apply pressure, but to perform a gut check.  If the
answer is "No", that's fine.

On Mon, Apr 4, 2016 at 1:01 PM, Maxim Khutornenko  wrote:

> I am with Josh on this one. Thermos Observer UI (and especially its
> process graph) is one of the features universally appreciated by our
> customers. I am all for deprecating the Observer but only in a way that
> retains parity with the existing functionality and hopefully enhances
> it. What are we trying to achieve here that would justify losing some
> of our feature set?
>
> On Mon, Apr 4, 2016 at 12:42 PM, Erb, Stephan
>  wrote:
> > Have you recently looked at the Mesos UI, Joshua? It offers sandbox
> browsing similar to the chroot link of Thermos. So at least you don't have
> to SSH into any box. We could link to that Mesos UI instead of the
> Thermos one, and Mesos could then serve a nice index.html that contains the
> content that was formerly served by Thermos.
> >
> > When dropping Thermos and relying on Mesos instead, we could profit from
> the recent additions such as authentication.
> >
> >
> > 
> > From: Joshua Cohen 
> > Sent: Monday, April 4, 2016 18:42
> > To: dev@aurora.apache.org
> > Subject: Re: Are we ready to remove the observer?
> >
> > If you're suggesting just going to the task directory and pulling them
> out
> > of the executor logs. Yes, I could ssh into the host the task is running
> on
> > and grep the task directory out of the mesos agent logs and then trawl
> the
> > logs (or cat task.json), but that's much more effort than going to the
> > observer's task UI (i.e. it'd take a minute, rather than a few seconds).
> > I'd also posit that it's much easier for new Aurora operators to come to
> > grips with the process tree via the UI rather than a JSON blob.
> >
> > If you're suggesting something else (i.e. new UI to expose these separate
> > from the Observer), I'm fine with that, that's what I was implying above
> > would be necessary before I think we could retire the Observer.
> >
> > A counter question: do people feel that updating/deploying the Observer
> is
> > a major obstacle? I know we've got the process well automated, so it's
> > relatively painless. I'd love to replace the Observer with something
> > better, but I don't feel like it's a major drag on our productivity as it
> > exists today to warrant killing it off entirely. My opinion may be
> colored
> > by the deploy automation we have in place though!
> >
> > On Mon, Apr 4, 2016 at 9:32 AM, Bill Farner  wrote:
> >
> >> >
> >> > 2) Providing an easy view of a process's command-line
> >> > 3) Providing a holistic view of the task config
> >>
> >>
> >> Just to check my understanding - these could be trivially handled in
> >> text/log format, right?
> >>
> >> On Mon, Apr 4, 2016 at 9:30 AM, Joshua Cohen  wrote:
> >>
> >> > I'm -1 on this until we have an actual replacement for the Observer. I
> >> > think that the observer provides significant value outside of just
> >> sandbox
> >> > browsing:
> >> >
> >> > 1) Exporting task-level statistics.
> >> > 2) Providing an easy view of a process's command-line
> >> > 3) Providing a holistic view of the task config
> >> > 4) Real time utilization stats
> >> >
> >> > As a cluster operator, I use all of these features on a daily basis
> >> > (especially when I'm on call) in addition to sandbox browsing, so I
> don't
> >> > think that these uses cases are that rare.
> >> >
> >> > On Fri, Apr 1, 2016 at 6:55 AM, Steve 

Re: [DISCUSS]: 0.13.0 release candidate

2016-04-04 Thread Bill Farner
+1, fire away

On Mon, Apr 4, 2016 at 12:26 PM, Jake Farrell  wrote:

> Other than a couple deprecation clean up tickets, in AURORA-1584 [1], it
> looks like we are about ready to cut the 0.13.0 release candidate and start
> a vote. I wanted to open the floor up for any last minute requests or
> patches people would like to see make it in before we finalize and cut the
> release candidate. Currently planning on cutting the release candidate this
> Wednesday, April 6th, pending no blockers coming out of this discussion
> thread. Thoughts, objections?
>
> -Jake
>
>
> [1]: https://issues.apache.org/jira/browse/AURORA-1584
>


Re: Are we ready to remove the observer?

2016-04-04 Thread Bill Farner
>
> 2) Providing an easy view of a process's command-line
> 3) Providing a holistic view of the task config


Just to check my understanding - these could be trivially handled in
text/log format, right?

On Mon, Apr 4, 2016 at 9:30 AM, Joshua Cohen  wrote:

> I'm -1 on this until we have an actual replacement for the Observer. I
> think that the observer provides significant value outside of just sandbox
> browsing:
>
> 1) Exporting task-level statistics.
> 2) Providing an easy view of a process's command-line
> 3) Providing a holistic view of the task config
> 4) Real time utilization stats
>
> As a cluster operator, I use all of these features on a daily basis
> (especially when I'm on call) in addition to sandbox browsing, so I don't
> think that these uses cases are that rare.
>
> On Fri, Apr 1, 2016 at 6:55 AM, Steve Niemitz  wrote:
>
> > The per-process stats have never been very useful to us (since they don't
> > work for docker); however, even being able to see the processes that are
> > running, how many times they've restarted, when they launched, etc. is
> > invaluable.
> >
> > I think there would be big pushback from users if they were to lose the
> > functionality it provided currently (beyond log viewing).
> >
> > On Fri, Apr 1, 2016 at 6:58 AM, Erb, Stephan <
> stephan@blue-yonder.com>
> > wrote:
> >
> > > From an operator and Aurora developer perspective, it would be really
> > > great to get rid of the thermos observer quickly.
> > >
> > > However, from a user perspective the usability gap between observer and
> > > plain Mesos sandbox browsing is quite large right now. I agree with
> > > Benjamin here that it would probably work if we generate html pages
> ready
> > > for user consumption.
> > >
> > > These are the relevant tickets in our tracker:
> > > * https://issues.apache.org/jira/browse/AURORA-725
> > > * https://issues.apache.org/jira/browse/AURORA-777
> > >
> > > 
> > > From: ben...@gmail.com 
> > > Sent: Friday, April 1, 2016 02:35
> > > To: dev@aurora.apache.org
> > > Subject: Re: Are we ready to remove the observer?
> > >
> > > Is there any chance we can keep the per-process cpu and ram utilization
> > > stats?  That's one of the coolest things about aurora, imo.  The
> executor
> > > is already writing those checkpoints inside the mesos sandbox (I
> think?),
> > > so perhaps it could also produce the html pages that the observer
> > currently
> > > renders?
> > >
> > > On Thu, Mar 31, 2016 at 4:33 PM Zhitao Li 
> wrote:
> > >
> > > > +1.
> > > >
> > > > On Thu, Mar 31, 2016 at 4:11 PM, Bill Farner 
> > wrote:
> > > >
> > > > > Assuming that the vast majority of utility provided by the observer
> > is
> > > > > sandbox/log browsing - can we remove it and link to sandbox
> browsing
> > > that
> > > > > mesos provides?
> > > > >
> > > > > The rest of the information could be (or already is) logged in the
> > > > sandbox
> > > > > for the rare debugging scenarios that call for it.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > Zhitao Li
> > > >
> > >
> >
>


Are we ready to remove the observer?

2016-03-31 Thread Bill Farner
Assuming that the vast majority of utility provided by the observer is
sandbox/log browsing - can we remove it and link to sandbox browsing that
mesos provides?

The rest of the information could be (or already is) logged in the sandbox
for the rare debugging scenarios that call for it.


Re: Looking for feedback - Setting CommandInfo.user by default when launching tasks.

2016-03-29 Thread Bill Farner
Aha, i think we have different notions of the proposal.  I was under the
impression that the executor itself would run as the target user (e.g. steve),
not as a system user (e.g. aurora).  I find the former more appealing, with
the exception that it leaves us without a solution for concealing the
credentials file.

On Tue, Mar 29, 2016 at 2:39 PM, John Sirois  wrote:

> On Tue, Mar 29, 2016 at 3:26 PM, Bill Farner  wrote:
>
> > If i'm understanding you correctly, that doesn't address preventing users
> > from reading the credentials though.
> >
>
> It does:
>
> Say:
> /var/lib/aurora/creds 400 root
>
> And then if the CommandInfo has user: aurora (executor user as Steve
> suggested), it will get a copy of /var/lib/aurora/creds  in its sandbox
> chowned to 400 aurora
>
> Now aurora's executor (thermos) launches a task in role www-data,
> announcing for it using the cred.  The www-data user will not be able to
> read the local sandbox 400 aurora creds.
>
>
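
For illustration, this is roughly what that would look like when the scheduler
builds the executor's CommandInfo with the Mesos Java protos. Field names
follow the Mesos CommandInfo proto; the path and user are just the examples
from this thread, and the rest of the executor command is omitted.

import org.apache.mesos.Protos.CommandInfo;

final class CredentialUriSketch {
  // Per the discussion above: Mesos copies the file:// URI into the sandbox and
  // chowns it to the CommandInfo user, so a 400 root creds file ends up 400 aurora.
  static CommandInfo withCredentials(CommandInfo.Builder executorCommand) {
    return executorCommand
        .setUser("aurora")  // run the executor as this user
        .addUris(CommandInfo.URI.newBuilder()
            .setValue("file:///var/lib/aurora/creds")
            .setExecutable(false)
            .setExtract(false))  // copy as-is rather than unpacking
        .build();
  }
}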
> > On Tue, Mar 29, 2016 at 1:52 PM, John Sirois  wrote:
> >
> > > On Tue, Mar 29, 2016 at 2:31 PM, Steve Niemitz 
> > > wrote:
> > >
> > > > So maybe we add it, but don't change the current default behavior?
> > > >
> > >
> > > Could we use the CommandInfo.uris [1] to solve this?  IE: the scheduler
> > > would need to learn the credential file path, and with that knowledge,
> > the
> > > local mesos (root) readable credential file could be specified as a URI
> > > dependency for all launched executors (or bare commands).  IIUC, if the
> > URI
> > > is a file:// the local secured credentials file will be copied into the
> > > sandbox where it can be read by the executor (as aurora).
> > >
> > > [1]
> > >
> >
> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L422
> > >
> > >
> > > >
> > > > On Tue, Mar 29, 2016 at 4:26 PM, Bill Farner 
> > wrote:
> > > >
> > > > > I'm in favor of moving forward.  There's no requirement to use the
> > > > > Announcer, and a non-root executor seems like a useful option.
> > > > >
> > > > > On Tue, Mar 29, 2016 at 1:00 PM, Steve Niemitz <
> sniem...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Makes sense, I guess it can be up to the cluster operator which
> > model
> > > > to
> > > > > > choose.  Is there any interest in the feature I proposed or
> should
> > I
> > > > just
> > > > > > drop it?  It's not a lot of code, but also it's not a requirement
> > for
> > > > > > anything we're working on either (the docker stuff however, is).
> > > > > >
> > > > > > On Tue, Mar 29, 2016 at 1:39 PM, Bill Farner  >
> > > > wrote:
> > > > > >
> > > > > > > That's correct - those credentials should require privileged
> > > access.
> > > > > > >
> > > > > > > On Tue, Mar 29, 2016 at 10:25 AM, Steve Niemitz <
> > > > > > > sniem...@twitter.com.invalid> wrote:
> > > > > > >
> > > > > > > > Re: ZK credential files, that's an interesting issue, I assume
> > you
> > > > > don't
> > > > > > > > want the role user to be able to read it either, and only
> root
> > or
> > > > > some
> > > > > > > > other privileged user?
> > > > > > > >
> > > > > > > > On Tue, Mar 29, 2016 at 12:14 PM, Erb, Stephan <
> > > > > > > > stephan@blue-yonder.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > I am in favor of your proposal. We offer less attack
> surface
> > if
> > > > the
> > > > > > > > > executor is not running as root.
> > > > > > > > >
> > > > > > > > > Interesting though, this introduces another security
> problem:
> > > The
> > > > > > > > > credentials file in the incoming Zookeeper  ACL patch (
> > > > > > > > > https://reviews.apache.org/r/45042/) will have to be
> > readable
> > > by
> > > > > > > > > everyone. That feels a little bit like being back to square
> > >

Re: Looking for feedback - Setting CommandInfo.user by default when launching tasks.

2016-03-29 Thread Bill Farner
If i'm understanding you correctly, that doesn't address preventing users
from reading the credentials though.

On Tue, Mar 29, 2016 at 1:52 PM, John Sirois  wrote:

> On Tue, Mar 29, 2016 at 2:31 PM, Steve Niemitz 
> wrote:
>
> > So maybe we add it, but don't change the current default behavior?
> >
>
> Could we use the CommandInfo.uris [1] to solve this?  IE: the scheduler
> would need to learn the credential file path, and with that knowledge, the
> local mesos (root) readable credential file could be specified as a uir
> dependency for all launched executors (or bare commands).  IIUC, if the URI
> if a file:// the local secured credentails file will be copied into the
> sandbox where it can be read by the executor (as aurora).
>
> [1]
> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L422
>
>
> >
> > On Tue, Mar 29, 2016 at 4:26 PM, Bill Farner  wrote:
> >
> > > I'm in favor of moving forward.  There's no requirement to use the
> > > Announcer, and a non-root executor seems like a useful option.
> > >
> > > On Tue, Mar 29, 2016 at 1:00 PM, Steve Niemitz 
> > > wrote:
> > >
> > > > Makes sense, I guess it can be up to the cluster operator which model
> > to
> > > > choose.  Is there any interest in the feature I proposed or should I
> > just
> > > > drop it?  It's not a lot of code, but also it's not a requirement for
> > > > anything we're working on either (the docker stuff however, is).
> > > >
> > > > On Tue, Mar 29, 2016 at 1:39 PM, Bill Farner 
> > wrote:
> > > >
> > > > > That's correct - those credentials should require privileged
> access.
> > > > >
> > > > > On Tue, Mar 29, 2016 at 10:25 AM, Steve Niemitz <
> > > > > sniem...@twitter.com.invalid> wrote:
> > > > >
> > > > > > Re: ZK credential files, thats an interesting issue, I assume you
> > > don't
> > > > > > want the role user to be able to read it either, and only root or
> > > some
> > > > > > other privileged user?
> > > > > >
> > > > > > On Tue, Mar 29, 2016 at 12:14 PM, Erb, Stephan <
> > > > > > stephan@blue-yonder.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I am in favor of your proposal. We offer less attack surface if
> > the
> > > > > > > executor is not running as root.
> > > > > > >
> > > > > > > Interesting though, this introduces another security problem:
> The
> > > > > > > credentials file in the incoming Zookeeper  ACL patch (
> > > > > > > https://reviews.apache.org/r/45042/) will have to be readable
> by
> > > > > > > everyone. That feels a little bit like being back to square
> one.
> > > > > > > 
> > > > > > > From: Steve Niemitz 
> > > > > > > Sent: Tuesday, March 29, 2016 17:34
> > > > > > > To: dev@aurora.apache.org
> > > > > > > Subject: Looking for feedback - Setting CommandInfo.user by
> > default
> > > > > when
> > > > > > > launching tasks.
> > > > > > >
> > > > > > > I've been working on some changes to how aurora submits tasks
> to
> > > > mesos,
> > > > > > > specifically around Docker tasks, but I'd also like to see how
> > > people
> > > > > > feel
> > > > > > > about making it more general.
> > > > > > >
> > > > > > > Currently, when Aurora submits a task to mesos, it does NOT set
> > > > > > > command.user on the ExecutorInfo, this means that mesos
> > configures
> > > > the
> > > > > > > sandbox (mesos sandbox that is) as root, and launches the
> > executor
> > > > > > > (thermos_executor in our case) as root as well.
> > > > > > >
> > > > > > > What then happens is that the executor then chown()s the
> sandbox
> > it
> > > > > > creates
> > > > > > > to the aurora role/user, and also setuid()s the runners it
> forks
> > to
> > > > > that
> > > > > > > role/user.  However, the executor itself is still running a

Re: Looking for feedback - Setting CommandInfo.user by default when launching tasks.

2016-03-29 Thread Bill Farner
I'm in favor of moving forward.  There's no requirement to use the
Announcer, and a non-root executor seems like a useful option.

On Tue, Mar 29, 2016 at 1:00 PM, Steve Niemitz  wrote:

> Makes sense, I guess it can be up to the cluster operator which model to
> choose.  Is there any interest in the feature I proposed or should I just
> drop it?  It's not a lot of code, but also it's not a requirement for
> anything we're working on either (the docker stuff however, is).
>
> On Tue, Mar 29, 2016 at 1:39 PM, Bill Farner  wrote:
>
> > That's correct - those credentials should require privileged access.
> >
> > On Tue, Mar 29, 2016 at 10:25 AM, Steve Niemitz <
> > sniem...@twitter.com.invalid> wrote:
> >
> > > Re: ZK credential files, thats an interesting issue, I assume you don't
> > > want the role user to be able to read it either, and only root or some
> > > other privileged user?
> > >
> > > On Tue, Mar 29, 2016 at 12:14 PM, Erb, Stephan <
> > > stephan@blue-yonder.com>
> > > wrote:
> > >
> > > > I am in favor of your proposal. We offer less attack surface if the
> > > > executor is not running as root.
> > > >
> > > > Interesting though, this introduces another security problem: The
> > > > credentials file in the incoming Zookeeper  ACL patch (
> > > > https://reviews.apache.org/r/45042/) will have to be readable by
> > > > everyone. That feels a little bit like being back to square one.
> > > > 
> > > > From: Steve Niemitz 
> > > > Sent: Tuesday, March 29, 2016 17:34
> > > > To: dev@aurora.apache.org
> > > > Subject: Looking for feedback - Setting CommandInfo.user by default
> > when
> > > > launching tasks.
> > > >
> > > > I've been working on some changes to how aurora submits tasks to
> mesos,
> > > > specifically around Docker tasks, but I'd also like to see how people
> > > feel
> > > > about making it more general.
> > > >
> > > > Currently, when Aurora submits a task to mesos, it does NOT set
> > > > command.user on the ExecutorInfo, this means that mesos configures
> the
> > > > sandbox (mesos sandbox that is) as root, and launches the executor
> > > > (thermos_executor in our case) as root as well.
> > > >
> > > > What then happens is that the executor then chown()s the sandbox it
> > > creates
> > > > to the aurora role/user, and also setuid()s the runners it forks to
> > that
> > > > role/user.  However, the executor itself is still running as root.
> > > >
> > > > My proposal / change is to set command.user to the aurora role by
> > > default,
> > > > which will cause the executor to run as that user.  I've tested this
> > > > already, and no changes are needed to the executor, it will still try
> > to
> > > > chown the sandbox (which is fine since it already owns it), and
> > setuid()
> > > > the runners it forks (again, fine, since they're already running as
> > that
> > > > user).
> > > >
> > > > *The controversial part of this* however is I'd like to enable this
> > > > behavior BY DEFAULT, and allow disabling it (reverting to the current
> > > > behavior now) via a flag to the scheduler.  My reasoning here is two
> > > fold.
> > > >  1) It's a more secure default, preventing a compromised executor
> from
> > > > doing things it shouldn't, and 2) we already have a lot of "flag
> > bloat",
> > > > and flags are hard enough to discover as they are.  However, I do
> > believe
> > > > this should be considered as a "breaking change", particularly
> because
> > of
> > > > finicky PEX extraction for the executor.
> > > >
> > > > I'd like to hear people's thoughts on this.
> > > >
> > >
> >
>


Re: Looking for feedback - Setting CommandInfo.user by default when launching tasks.

2016-03-29 Thread Bill Farner
That's correct - those credentials should require privileged access.

On Tue, Mar 29, 2016 at 10:25 AM, Steve Niemitz <
sniem...@twitter.com.invalid> wrote:

> Re: ZK credential files, thats an interesting issue, I assume you don't
> want the role user to be able to read it either, and only root or some
> other privileged user?
>
> On Tue, Mar 29, 2016 at 12:14 PM, Erb, Stephan <
> stephan@blue-yonder.com>
> wrote:
>
> > I am in favor of your proposal. We offer less attack surface if the
> > executor is not running as root.
> >
> > Interesting though, this introduces another security problem: The
> > credentials file in the incoming Zookeeper  ACL patch (
> > https://reviews.apache.org/r/45042/) will have to be readable by
> > everyone. That feels a little bit like being back to square one.
> > 
> > From: Steve Niemitz 
> > Sent: Tuesday, March 29, 2016 17:34
> > To: dev@aurora.apache.org
> > Subject: Looking for feedback - Setting CommandInfo.user by default when
> > launching tasks.
> >
> > I've been working on some changes to how aurora submits tasks to mesos,
> > specifically around Docker tasks, but I'd also like to see how people
> feel
> > about making it more general.
> >
> > Currently, when Aurora submits a task to mesos, it does NOT set
> > command.user on the ExecutorInfo, this means that mesos configures the
> > sandbox (mesos sandbox that is) as root, and launches the executor
> > (thermos_executor in our case) as root as well.
> >
> > What then happens is that the executor then chown()s the sandbox it
> creates
> > to the aurora role/user, and also setuid()s the runners it forks to that
> > role/user.  However, the executor itself is still running as root.
> >
> > My proposal / change is to set command.user to the aurora role by
> default,
> > which will cause the executor to run as that user.  I've tested this
> > already, and no changes are needed to the executor, it will still try to
> > chown the sandbox (which is fine since it already owns it), and setuid()
> > the runners it forks (again, fine, since they're already running as that
> > user).
> >
> > *The controversial part of this* however is I'd like to enable this
> > behavior BY DEFAULT, and allow disabling it (reverting to the current
> > behavior now) via a flag to the scheduler.  My reasoning here is two
> fold.
> >  1) It's a more secure default, preventing a compromised executor from
> > doing things it shouldn't, and 2) we already have a lot of "flag bloat",
> > and flags are hard enough to discover as they are.  However, I do believe
> > this should be considered as a "breaking change", particularly because of
> > finicky PEX extraction for the executor.
> >
> > I'd like to hear people's thoughts on this.
> >
>
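
The change being discussed above amounts to populating CommandInfo.user on the executor that the
scheduler hands to Mesos. A rough sketch of that idea using the Mesos Java protobuf bindings
(org.apache.mesos.Protos); this is illustrative only: the helper and flag names are invented, and
it is not Aurora's actual task-launching code.

    import org.apache.mesos.Protos.CommandInfo;
    import org.apache.mesos.Protos.ExecutorInfo;

    final class ExecutorUserSketch {
      // When enabled, Mesos creates the sandbox owned by `role` and launches the
      // executor as that user instead of the agent's user (typically root).
      static ExecutorInfo withRoleUser(ExecutorInfo executor, String role, boolean setUserByDefault) {
        if (!setUserByDefault) {
          return executor;  // current behavior: CommandInfo.user left unset
        }
        CommandInfo command = executor.getCommand().toBuilder()
            .setUser(role)  // e.g. "www-data"
            .build();
        return executor.toBuilder().setCommand(command).build();
      }
    }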


Re: Shepard for AURORA-1493

2016-03-24 Thread Bill Farner
I can help.  Feel free to ping me with any questions.

On Thu, Mar 24, 2016 at 5:22 PM, Ashwin Murthy 
wrote:

> I am looking for a shepard  to help me work on this.
>
> https://issues.apache.org/jira/browse/AURORA-1493
>
> Thanks
> Ashwin
>


Re: aurora job scalability

2016-03-19 Thread Bill Farner
>
> Does job map to a Marathon application?


I believe so, yes.  An Aurora job is multiple replicas of a [group of]
processes, usually (but not necessarily) homogeneous.

If similar is there a known limitation to how many jobs one can have?


This will depend on the hardware used for the scheduler machines, and the
total number of instances.  Hundreds of jobs totaling hundreds of thousands
of tasks have been proven stable.

What we discovered for Marathon more application+bigger env_vars = bigger
> zknode size and we come dangerously close to hitting the 1 MB default of
> the zk zknode size. I'm wondering if this type of a thing has potentially
> been fixed already in Aurora.


A key difference in implementation is that Aurora does not use ZK for
storage, but instead uses the replicated log implementation provided by
Mesos.  Aurora has no ceiling on transaction data size; however, the storage
must fit in JVM memory (which would be the same for equivalent data stored
in ZooKeeper).

I'm not sure what data Marathon stores in a ZK node such that you're hitting 1
MB.  So long as we're not talking about ~1 MB of env var data for a single
process, I doubt you would encounter a similar ceiling in Aurora.

Happy to talk through this more, so feel free to ask more questions!





On Fri, Mar 18, 2016 at 7:12 AM, Christopher M Luciano  wrote:

>  Hi all. It seems that we may be outgrowing Marathon. We have a problem
> with the amount of application that we are using, causing us to not exactly
> be "compliant" with Marathon goals. It seems that the unit for Aurora is a
> job+instance of that job. Does job map to a Marathon application? If
> similar is there a known limitation to how many jobs one can have?
>
>  What we discovered for Marathon more application+bigger env_vars = bigger
> zknode size and we come dangerously close to hitting the 1 MB default of
> the zk zknode size. I'm wondering if this type of a thing has potentially
> been fixed already in Aurora.
>
>
>
>
>
>
> Christopher M Luciano
>
> Staff Software Engineer, Platform Services
>
> IBM Watson Core Technology
>
>
>
>
>


Re: [PROPOSAL] Supporting Mesos Universal Containers

2016-03-15 Thread Bill Farner
+1, move forward and support both until the path is obvious

On Tue, Mar 15, 2016 at 9:11 AM, Erb, Stephan 
wrote:

> > Does anyone think we need a stop-the-world moment to come up with a long
> > term, holistic plan, or is it reasonable to assess the situation as we
> go?
>
> FWIW, I am ok with moving along and assessing the situation on the fly.
>
> We cannot tell right now when the unified containerizer is rock-solid, so
> a couple of improvement patches for the native Docker support will probably
> do more good than harm. We just have to keep an eye on what's happening in
> Mesos itself.
>
> Regards,
> Stephan
>
> 
> From: Joshua Cohen 
> Sent: Tuesday, March 15, 2016 15:55
> To: dev@aurora.apache.org
> Subject: Re: [PROPOSAL] Supporting Mesos Universal Containers
>
> I've gone ahead and filed some tickets breaking down the work involved in
> this effort. They're all contained within this epic:
> https://issues.apache.org/jira/browse/AURORA-1634.
>
> I agree that we should assess our plans re: containers as a whole. My
> understanding of the current state of the world is as follows:
>
>
>1. There are definite benefits to adopting the unified containerizer to
>launch image-based tasks (outlined in the design doc, but I'll reiterate
>here: better interaction w/ Mesos isolators, no need to rely on external
>daemons to coordinate containers, etc.).
>2. There are also considerations to keep in mind, specifically, we now
>rely on Mesos maintaining this image support, which is secondary to its
>primary goals, rather than relying on organizations that are invested in
>the formats. Additionally, we'll have to wait on Mesos to implement new
>features of image formats before they can be adopted by Aurora users.
>
> I don't believe that number 2 above should be a blocker to number 1, but
> it's a caveat that we must always keep in mind. I'll point out that the
> design doc takes a very cautious approach towards deprecating the current
> Docker support. It may be that we maintain both in perpetuity. It's also
> possible, and I'm hopeful that this is the case, that the Mesos community
> will show responsible custodianship of the unified container support,
> allaying the concerns outlined above and allowing us to deprecate the
> current native-Docker containerizer support. In either case described here,
> I think Aurora users will benefit.
>
> Does anyone think we need a stop-the-world moment to come up with a long
> term, holistic plan, or is it reasonable to assess the situation as we go?
>
>
> On Sun, Mar 13, 2016 at 7:23 AM, Erb, Stephan  >
> wrote:
>
> > As mentioned in IRC, I like the proposal.
> >
> > Still, we need a discussion regarding the future of current Docker
> > support. Especially since Bill and John have now started improving it.
> What
> > are our plans here? What  are the plans of the Mesos community (i.e.,
> > deprecation of the docker containerizer)?
> >
> > In addition, the switch implemented for disabling Thermos when running
> > Docker kind of reminded me of
> > https://issues.apache.org/jira/browse/AURORA-1288. It is probably
> > worthwhile to at least assess this as a whole.
> >
> > Kind Regards,
> > Stephan
> >
> > 
> > From: Joshua Cohen 
> > Sent: Monday, March 7, 2016 20:58
> > To: dev@aurora.apache.org
> > Subject: [PROPOSAL] Supporting Mesos Universal Containers
> >
> > Hi all,
> >
> > I'd like to propose we adopt the Mesos universal container support for
> > provisioning tasks from both Docker and AppC images. Please review the
> doc
> > below and let me know what you think.
> >
> >
> >
> https://docs.google.com/document/d/111T09NBF2zjjl7HE95xglsDpRdKoZqhCRM5hHmOfTLA/edit?usp=sharing
> >
> > Thanks!
> >
> > Joshua
> >
>


Re: [PROPOSAL] Rename NEWS to RELEASE-NOTES.md

2016-03-14 Thread Bill Farner
Thanks for the feedback, folks.  This patch has landed.

On Mon, Mar 14, 2016 at 6:00 PM, Jake Farrell  wrote:

> LICENSE and NOTICE are the only 2 required named files within the release.
> The changelog is an optional item and can be named whatever we like, +1 to
> making it easier to understand its purpose
>
> -Jake
>
> On Mon, Mar 14, 2016 at 4:27 PM, Bill Farner  wrote:
>
> > Our NEWS file has turned into a useful source of information, but i think
> > the name doesn't clearly illustrate its purpose.  Since we have been
> > primarily using it to manage release notes, i propose we rename the file
> to
> > RELEASE-NOTES.md.
> >
> > As the name implies, i also propose that we formally use markdown syntax.
> > This will make for more natural linking to other docs, and direct posting
> > to our blog at release time (this is currently a manual process,
> > translating NEWS to markdown).
> >
>


Re: [PROPOSAL] Rename NEWS to RELEASE-NOTES.md

2016-03-14 Thread Bill Farner
Patch is up here: https://reviews.apache.org/r/44806/


On Mon, Mar 14, 2016 at 1:40 PM, Maxim Khutornenko  wrote:

> +1 to both.
>
> On Mon, Mar 14, 2016 at 1:27 PM, Bill Farner  wrote:
> > Our NEWS file has turned into a useful source of information, but i think
> > the name doesn't clearly illustrate its purpose.  Since we have been
> > primarily using it to manage release notes, i propose we rename the file
> to
> > RELEASE-NOTES.md.
> >
> > As the name implies, i also propose that we formally use markdown syntax.
> > This will make for more natural linking to other docs, and direct posting
> > to our blog at release time (this is currently a manual process,
> > translating NEWS to markdown).
>


[PROPOSAL] Rename NEWS to RELEASE-NOTES.md

2016-03-14 Thread Bill Farner
Our NEWS file has turned into a useful source of information, but I think
the name doesn't clearly illustrate its purpose.  Since we have been
primarily using it to manage release notes, I propose we rename the file to
RELEASE-NOTES.md.

As the name implies, I also propose that we formally use Markdown syntax.
This will make for more natural linking to other docs, and direct posting
to our blog at release time (this is currently a manual process,
translating NEWS to markdown).


Re: [VOTE] Release Apache Aurora 0.12.0 rpms

2016-03-14 Thread Bill Farner
+1

Verified using instructions starting here:
https://github.com/apache/aurora-packaging/blob/master/test/rpm/centos-7/README.md#released

and
pkg_root="https://dl.bintray.com/john-sirois/aurora/centos-7/";


On Mon, Mar 14, 2016 at 12:10 PM, John Sirois  wrote:

> I propose that we accept the following artifacts as the official rpm
> packaging
> for Apache Aurora 0.12.0.
>
> *https://dl.bintray.com/john-sirois/aurora/centos-7/
> *
>
> The Aurora rpm packaging includes the following:
> ---
> The CHANGELOG is viewable at:
> *
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=log;h=refs/heads/0.12.x;hp=refs/heads/0.11.x
> <
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=log;h=refs/heads/0.12.x;hp=refs/heads/0.11.x
> >*
>
> The branch used to create the packaging is:
> https://git1-us-west.apache.org/repos/asf?p=aurora
> -packaging.git;a=tree;h=refs/heads/0.12.x
>
> The packages are available at:
> *https://dl.bintray.com/john-sirois/aurora/centos-7/
> *
>
> The GPG keys used to sign the packages are available at:
> https://dist.apache.org/repos/dist/release/aurora/KEYS
>
> Please download, verify, and test.
>
> The vote will close on Sun, 20 Mar 2016 10:00:00 -0700
>
> [ ] +1 Release these as the rpm packages for Apache Aurora 0.12.0
> [ ] +0
> [ ] -1 Do not release these artifacts because...
> ---
>
> Please consider verifying these rpms using the install guide:
> https://github.com/apache/aurora/blob/master/docs/installing.md
>
> Or by using the test guide for rpms:
>
> https://github.com/apache/aurora-packaging/blob/master/test/rpm/centos-7/README.md
> <
> https://github.com/apache/aurora-packaging/blob/master/test/rpm/centos-7/README.md
> >
>
>
>
> I'd like to kick off voting with my own +1
>


Re: [VOTE] Release Apache Aurora 0.12.0 rpms

2016-03-12 Thread Bill Farner
-1

I've had trouble getting these to work.  I used the Vagrant environment
here:
https://github.com/apache/aurora-packaging/tree/master/test/rpm/centos-7

*Executor:*
$ sudo rpm -ivh aurora-executor-0.12.0-1.el7.centos.aurora.x86_64.rpm
error: Failed dependencies:
docker is needed by aurora-executor-0.12.0-1.el7.centos.aurora.x86_64

Apparently the official docker package is called docker-engine
https://docs.docker.com/engine/installation/linux/centos/

Dependency naming aside, I think we should omit docker from our
dependencies, as it really should be a Mesos dep if anything.  *I can send
a patch for that if others agree.*

*Scheduler:*
I had trouble getting the scheduler to start: it exits due to an uncaught
exception in the main thread, and unfortunately a stack trace doesn't turn
up in journalctl.  We need to figure out why the errors don't show up,
possibly in conjunction with addressing the items below.

Doing some investigation, I noticed something strange - JAVA_OPTS (set in
/etc/sysconfig/aurora) doesn't make it to the process launched by systemd.
It seems to be discarded when /usr/bin/aurora-scheduler-startup calls
/usr/lib/aurora/bin/aurora-scheduler.  Other variables (e.g.
AURORA_SCHEDULER_OPTS) propagate fine.  I've probably been staring at this
too long and am missing something obvious, but I'm not making sense of it.

Sidestepping the above issue, I discovered two reasons the scheduler won't
start up:
- Default backup dir /var/lib/aurora/scheduler/backups does not exist, and
there is insufficient permission to create it

- Fails to load the Mesos native lib
aurora-scheduler-startup[8500]: Failed to load native Mesos library from
/usr/lib;/usr/lib64
I was able to fix this by removing ;/usr/lib64 from
-Djava.library.path='/usr/lib;/usr/lib64', or alternatively by removing the
library.path setting and exporting LD_LIBRARY_PATH=/usr/lib.  (Presumably
because java.library.path entries are separated by ':' on Linux, so
'/usr/lib;/usr/lib64' is read as a single nonexistent directory.)

Happy to pitch in on fixing these issues; curious what folks think of the
items above, especially the JAVA_OPTS issue.



On Fri, Mar 11, 2016 at 1:31 PM, John Sirois  wrote:

> Pinging this VOTE and noting that the close is Monday at 11am Mountain
> time.
>
> Please test!
>
> On Wed, Mar 9, 2016 at 11:03 AM, John Sirois  wrote:
>
> >
> >
> > On Wed, Mar 9, 2016 at 11:00 AM, John Sirois  wrote:
> >
> >> I propose that we accept the following artifacts as the official rpm
> packaging
> >> for Apache Aurora 0.12.0.
> >>
> >> *https://dl.bintray.com/john-sirois/aurora/centos-7/
> >> *
> >>
> >> The Aurora rpm packaging includes the following:
> >> ---
> >> The CHANGELOG is viewable at:
> >> *
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=log;h=refs/heads/0.12.x;hp=refs/heads/0.11.x
> >> <
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=log;h=refs/heads/0.12.x;hp=refs/heads/0.11.x
> >*
> >>
> >> The branch used to create the packaging is:
> >>
> >>
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;h=refs/heads/0.12.x
> >>
> >> The packages are available at:
> >> *https://dl.bintray.com/john-sirois/aurora/centos-7/
> >> *
> >>
> >> The GPG keys used to sign the packages are available at:
> >> https://dist.apache.org/repos/dist/release/aurora/KEYS
> >>
> >> Please download, verify, and test.
> >>
> >> The vote will close on Mon, 14 Mar 2016 11:00:00 -0700
> >>
> >> [ ] +1 Release these as the deb packages for Apache Aurora 0.12.0
> >>
> >
> > Correction - "Release these as the rpm packages for Apache Aurora 0.12.0"
> >
> > [ ] +0
> >> [ ] -1 Do not release these artifacts because...
> >> ---
> >>
> >>
> > And again, copypasta - "Please consider verifying these rpms using the
> > install guide:"
> >
> > Please consider verifying these debs using the install guide:
> >>   https://github.com/apache/aurora/blob/master/docs/installing.md
> >>
> >>
> >> I'd like to kick off voting with my own +1
> >>
> >
> >
>


Re: Non-exclusive dedicated constraint

2016-03-09 Thread Bill Farner
Ah, so in practice it only makes sense when the dedicated attribute is
*/something; a bare * would not make much sense.  Seems reasonable to me.

On Wed, Mar 9, 2016 at 2:32 PM, Maxim Khutornenko  wrote:

> It's an *easy* way to get a virtual cluster with specific
> requirements. One example: have a set of machines in a shared pool
> with a different OS. This would let any existing or new customers try
> their services for compliance. The alternative would be spinning off a
> completely new physical cluster, which is a huge overhead on both
> supply and demand sides.
>
> On Wed, Mar 9, 2016 at 2:26 PM, Bill Farner  wrote:
> > What does it mean to have a 'dedicated' host that's free-for-all like
> that?
> >
> > On Wed, Mar 9, 2016 at 2:16 PM, Maxim Khutornenko 
> wrote:
> >
> >> Reactivating this thread. I like Bill's suggestion to have scheduler
> >> dedicated constraint management system. It will, however, require a
> >> substantial effort to get done properly. Would anyone oppose adopting
> >> Steve's patch in the meantime? The ROI is so high it would be a crime
> >> NOT to take it :)
> >>
> >> On Wed, Jan 20, 2016 at 10:25 AM, Maxim Khutornenko 
> >> wrote:
> >> > I should have looked closely, you are right! This indeed addresses
> >> > both cases: a job with a named dedicated role is still allowed to get
> >> > though if it's role matches the constraint and everything else
> >> > (non-exclusive dedicated pool) is addressed with "*".
> >> >
> >> > What it does not solve though is the variety of non-exclusive
> >> > dedicated pools (e.g. GPU, OS, high network bandwidth and etc.). For
> >> > that we would need something similar to what Bill suggested.
> >> >
> >> > On Wed, Jan 20, 2016 at 10:03 AM, Steve Niemitz 
> >> wrote:
> >> >> An arbitrary job can't target a fully dedicated role with this
> patch, it
> >> >> will still get a "constraint not satisfied: dedicated" error.  The
> code
> >> in
> >> >> the scheduler that matches the constraints does a simple string
> match,
> >> so
> >> >> "*/test" will not match "role1/test" when trying to place the task,
> it
> >> will
> >> >> only match "*/test".
> >> >>
> >> >> On Wed, Jan 20, 2016 at 12:24 PM, Maxim Khutornenko <
> ma...@apache.org>
> >> >> wrote:
> >> >>
> >> >>> Thanks for the info, Steve! Yes, it would accomplish the same goal
> but
> >> >>> at the price of removing the exclusive dedicated constraint
> >> >>> enforcement. With this patch any job could target a fully dedicated
> >> >>> exclusive pool, which may be undesirable for dedicated pool owners.
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Wed, Jan 20, 2016 at 7:13 AM, Steve Niemitz  >
> >> >>> wrote:
> >> >>> > We've been running a trivial patch [1] that does what I believe
> >> you're
> >> >>> > talking about for awhile now.  It allows a * for the role name,
> >> basically
> >> >>> > allowing any role to match the constraint, so our constraints look
> >> like
> >> >>> > "*/secure"
> >> >>> >
> >> >>> > Our use case is we have a "secure" cluster of machines that is
> >> >>> constrained
> >> >>> > on what can run on it (via an external audit process) that
> multiple
> >> roles
> >> >>> > run on.
> >> >>> >
> >> >>> > I believe I had talked to Bill about this a few months ago, but I
> >> don't
> >> >>> > remember where it ended up.
> >> >>> >
> >> >>> > [1]
> >> >>> >
> >> >>>
> >>
> https://github.com/tellapart/aurora/commit/76f978c76cc1377e19e602f7e0d050f7ce353562
> >> >>> >
> >> >>> > On Tue, Jan 19, 2016 at 11:48 PM, Maxim Khutornenko <
> >> ma...@apache.org>
> >> >>> > wrote:
> >> >>> >
> >> >>> >> Oh, I didn't mean the memory GC pressure in the pure sense,
> rather a
> >> >>> >> logical garbage of orphaned hosts that never leave the scheduler

Re: Non-exclusive dedicated constraint

2016-03-09 Thread Bill Farner
What does it mean to have a 'dedicated' host that's free-for-all like that?

On Wed, Mar 9, 2016 at 2:16 PM, Maxim Khutornenko  wrote:

> Reactivating this thread. I like Bill's suggestion to have scheduler
> dedicated constraint management system. It will, however, require a
> substantial effort to get done properly. Would anyone oppose adopting
> Steve's patch in the meantime? The ROI is so high it would be a crime
> NOT to take it :)
>
> On Wed, Jan 20, 2016 at 10:25 AM, Maxim Khutornenko 
> wrote:
> > I should have looked closely, you are right! This indeed addresses
> > both cases: a job with a named dedicated role is still allowed to get
> > though if it's role matches the constraint and everything else
> > (non-exclusive dedicated pool) is addressed with "*".
> >
> > What it does not solve though is the variety of non-exclusive
> > dedicated pools (e.g. GPU, OS, high network bandwidth and etc.). For
> > that we would need something similar to what Bill suggested.
> >
> > On Wed, Jan 20, 2016 at 10:03 AM, Steve Niemitz 
> wrote:
> >> An arbitrary job can't target a fully dedicated role with this patch, it
> >> will still get a "constraint not satisfied: dedicated" error.  The code
> in
> >> the scheduler that matches the constraints does a simple string match,
> so
> >> "*/test" will not match "role1/test" when trying to place the task, it
> will
> >> only match "*/test".
> >>
> >> On Wed, Jan 20, 2016 at 12:24 PM, Maxim Khutornenko 
> >> wrote:
> >>
> >>> Thanks for the info, Steve! Yes, it would accomplish the same goal but
> >>> at the price of removing the exclusive dedicated constraint
> >>> enforcement. With this patch any job could target a fully dedicated
> >>> exclusive pool, which may be undesirable for dedicated pool owners.
> >>>
> >>>
> >>>
> >>> On Wed, Jan 20, 2016 at 7:13 AM, Steve Niemitz 
> >>> wrote:
> >>> > We've been running a trivial patch [1] that does what I believe
> you're
> >>> > talking about for awhile now.  It allows a * for the role name,
> basically
> >>> > allowing any role to match the constraint, so our constraints look
> like
> >>> > "*/secure"
> >>> >
> >>> > Our use case is we have a "secure" cluster of machines that is
> >>> constrained
> >>> > on what can run on it (via an external audit process) that multiple
> roles
> >>> > run on.
> >>> >
> >>> > I believe I had talked to Bill about this a few months ago, but I
> don't
> >>> > remember where it ended up.
> >>> >
> >>> > [1]
> >>> >
> >>>
> https://github.com/tellapart/aurora/commit/76f978c76cc1377e19e602f7e0d050f7ce353562
> >>> >
> >>> > On Tue, Jan 19, 2016 at 11:48 PM, Maxim Khutornenko <
> ma...@apache.org>
> >>> > wrote:
> >>> >
> >>> >> Oh, I didn't mean the memory GC pressure in the pure sense, rather a
> >>> >> logical garbage of orphaned hosts that never leave the scheduler.
> It's
> >>> >> not something to be concerned about from the performance standpoint.
> >>> >> It's, however, something operators need to be aware of when a host
> >>> >> from a dedicated pool gets dropped or replaced.
> >>> >>
> >>> >> On Tue, Jan 19, 2016 at 8:39 PM, Bill Farner 
> >>> wrote:
> >>> >> > What do you mean by GC burden?  What i'm proposing is effectively
> >>> >> > Map.  Even with an extremely forgetful operator
> (even
> >>> >> more
> >>> >> > than Joe!), it would require a huge oversight to put a dent in
> heap
> >>> >> usage.
> >>> >> > I'm sure there are ways we could even expose a useful stat to flag
> >>> such
> >>> >> an
> >>> >> > oversight.
> >>> >> >
> >>> >> > On Tue, Jan 19, 2016 at 8:31 PM, Maxim Khutornenko <
> ma...@apache.org>
> >>> >> wrote:
> >>> >> >
> >>> >> >> Right, that's what I thought. Yes, it sounds interesting. My only
> >>> >> >> concern is the GC burden of getting rid of hostnames that are
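
To make the wildcard scheme in this thread concrete: shared hosts carry a dedicated attribute whose
value literally begins with */ (e.g. */secure), jobs declare the exact same value as their dedicated
constraint, and placement stays an exact string match; only the role-ownership check is relaxed. A
small Java sketch of that relaxed check, illustrative only and not the scheduler's actual
validation code:

    final class DedicatedConstraintSketch {
      // Relaxed ownership rule: a job may use a dedicated value owned by its own role
      // ("role1/test") or an explicitly shared one ("*/secure").
      static boolean dedicatedValueAllowed(String jobRole, String dedicatedValue) {
        return dedicatedValue.startsWith(jobRole + "/") || dedicatedValue.startsWith("*/");
      }

      // Placement is still an exact string comparison: a constraint of "*/secure" only
      // matches hosts whose dedicated attribute is literally "*/secure".
      static boolean hostMatches(String jobConstraintValue, String hostDedicatedAttribute) {
        return jobConstraintValue.equals(hostDedicatedAttribute);
      }
    }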

Re: [VOTE] Release Apache Aurora 0.12.0 debs

2016-03-07 Thread Bill Farner
+1

Successfully installed and lightly exercised Ubuntu and Debian packages in
their respective systems (using test/ instructions in the aurora-packaging
repo).

On Mon, Mar 7, 2016 at 8:00 AM, John Sirois  wrote:

> I propose that we accept the following artifacts as the official deb
> packaging for
> Apache Aurora 0.12.0.
>
> https://dl.bintray.com/john-sirois/aurora/ubuntu-trusty/
> https://dl.bintray.com/john-sirois/aurora/debian-jessie/
>
> The Aurora deb packaging includes the following:
> ---
> The CHANGELOG is available at:
>
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=blob_plain;f=specs/debian/changelog;hb=refs/heads/0.12.x
>
> The branch used to create the packaging is:
>
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;h=refs/heads/0.12.x
>
> The packages are available at:
> https://dl.bintray.com/john-sirois/aurora/ubuntu-trusty/
> https://dl.bintray.com/john-sirois/aurora/debian-jessie/
>
> The GPG keys used to sign the packages are available at:
> https://dist.apache.org/repos/dist/release/aurora/KEYS
>
> Please download, verify, and test.
>
> The vote will close on Thu, 07 Mar 2016 09:00:00 -0700
>
> [ ] +1 Release these as the deb packages for Apache Aurora 0.12.0
> [ ] +0
> [ ] -1 Do not release these artifacts because...
> ---
>
> Please consider verifying these debs using the install guide:
>   https://github.com/apache/aurora/blob/master/docs/installing.md
>


Re: [RESULT][VOTE] Release Apache Aurora 0.12.0 RC4

2016-03-03 Thread Bill Farner
I'm taking care of things on the site now, should be up shortly!

On Thu, Mar 3, 2016 at 6:59 AM, John Sirois  wrote:

> Thanks Bill.
>
> Thanks to contributions to fix the thrift deb hosting issue for our ubuntu
> deb builds I was able to locally build our pacakegs but did not get time to
> finish up.  I likely won't get the docs published and RC out until Sunday.
>
> Apologies for the delays.
>
> On Feb 29, 2016 10:03 AM, "Bill Farner"  wrote:
> >
> > IIRC site changes aren't blocked on binaries.  I believe the only link is
> the bintray button, which automatically tracks the latest bintray version.
> >
> > https://aurora.apache.org/downloads/
> >
> > On Mon, Feb 29, 2016 at 8:10 AM, John Sirois  wrote:
> >>
> >> On Sun, Feb 28, 2016 at 11:43 AM, Jake Farrell 
> wrote:
> >>
> >> > Released source artifacts have all ready been promoted, any committer
> can
> >> > update the website to reflect this change (though ideally it is
> handled
> >> > immediately after the release is promoted).
> >> >
> >> > For the binary artifacts since they where not included in the initial
> vote
> >> > with the source artifacts they will need to be a separate vote
> >> >
> >>
> >> And I have been tardy getting an RC out for the debs (doc site update
> >> blocks on binaries since it has links).  I'll be getting the vote
> started
> >> for this later today.
> >>
> >>
> >> > -Jake
> >> >
> >> >
> >> >
> >> >
> >> > On Sun, Feb 28, 2016 at 12:55 PM, Erb, Stephan  >> > @blue-yonder.com> wrote:
> >> >
> >> >> Even though we have done the voting, the release is still pending. We
> >> >> still have to build the packages and update the website.
> >> >>
> >> >> Is there a way we can help out here?
> >> >>
> >> >> Best,
> >> >> Stephan
> >> >> 
> >> >> From: John Sirois 
> >> >> Sent: Monday, February 8, 2016 23:47
> >> >> To: dev@aurora.apache.org
> >> >> Subject: [RESULT][VOTE] Release Apache Aurora 0.12.0 RC4
> >> >>
> >> >> All,
> >> >> The vote to accept Apache Aurora 0.12.0 RC4
> >> >> as the official Apache Aurora 0.12.0 release has passed.
> >> >>
> >> >>
> >> >> +1 (Binding)
> >> >> --
> >> >> jsirois
> >> >> wfarner
> >> >> serb
> >> >> maxim
> >> >>
> >> >>
> >> >> +1 (Non-binding)
> >> >> --
> >> >>
> >> >>
> >> >> There were no 0 or -1 votes. Thank you to all who helped make this
> >> >> release.
> >> >>
> >> >>
> >> >> Aurora 0.12.0 includes the following:
> >> >> ---
> >> >> The CHANGELOG for the release is available at:
> >> >>
> >> >>
>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=CHANGELOG&hb=rel/0.12.0
> >> >>
> >> >> The tag used to create the release with is rel/0.12.0:
> >> >> https://git-wip-us.apache.org/repos/asf?p=aurora.git&hb=rel/0.12.0
> >> >>
> >> >> The release is available at:
> >> >>
> >> >>
>
> https://dist.apache.org/repos/dist/release/aurora/0.12.0/apache-aurora-0.12.0.tar.gz
> >> >>
> >> >> The MD5 checksum of the release can be found at:
> >> >>
> >> >>
>
> https://dist.apache.org/repos/dist/release/aurora/0.12.0/apache-aurora-0.12.0.tar.gz.md5
> >> >>
> >> >> The signature of the release can be found at:
> >> >>
> >> >>
>
> https://dist.apache.org/repos/dist/release/aurora/0.12.0/apache-aurora-0.12.0.asc
> >> >>
> >> >> The GPG key used to sign the release are available at:
> >> >> https://dist.apache.org/repos/dist/release/aurora/KEYS
> >> >>
> >> >>
> >> >> On Fri, Feb 5, 2016 at 3:14 PM, John Sirois 
> wrote:
> >> >>
> >> >> > All,
> >> >> >
> >> >> > I propose that we accept the following release candidate as the
>

Re: [PROPOSAL] DB snapshotting

2016-03-02 Thread Bill Farner
Seems prudent to explore rather than write off though.  For all we know it
simplifies a lot.

On Wednesday, March 2, 2016, Maxim Khutornenko  wrote:

> Ah, sorry, missed that conversation on IRC.
>
> I have not looked into that. Would be interesting to explore that
> route. Given our ultimate goal is to get rid of the replicated log
> altogether it does not stand as an immediate priority to me though.
>
> On Wed, Mar 2, 2016 at 11:51 AM, Erb, Stephan
> > wrote:
> > +1 for the plan and the ticket.
> >
> > In addition, for reference a couple of messages from IRC from yesterday:
> >
> > 23:42  mkhutornenko:  interesting storage proposal on the
> mailinglist! I only wondered one thing...
> > 23:42  it feeld kind of weird that we use H2 as a non-replicated
> database and build some scaffolding around it in order to distribute its
> state via the Mesos replicated log.
> > 23:42  Have you looked into H2, if it would be possible to
> replace/subclass their in-process transaction log with a replicated Mesos
> one?
> > 23:43  Then we would not need that logic that performs a
> simultaneous inserts into the log and the taskstore, as the backend would
> handle that by itself
> > 23:44  (I know close to nothing about the storage layer, so that's
> like my perspective from 10.000 feet)
> >
> > 00:22  serb: that crossed my mind as well.  I have only drilled
> in a bit, would love to more
> >
> > 
> > From: Maxim Khutornenko >
> > Sent: Wednesday, March 2, 2016 18:18
> > To: dev@aurora.apache.org 
> > Subject: Re: [PROPOSAL] DB snapshotting
> >
> > Thanks Bill! Filed https://issues.apache.org/jira/browse/AURORA-1627
> > to track it.
> >
> > On Mon, Feb 29, 2016 at 11:41 AM, Bill Farner  > wrote:
> >> Thanks for the detailed write up and real-world details!  I generally
> >> support momentum towards a single task store implementation, so +1
> >> on dealing with that.
> >>
> >> I anticipated there would be a performance win from straight-to-SQL
> >> snapshots, so I am a +1 on that as well.
> >>
> >> In summary, +1 on all fronts!
> >>
> >> On Monday, February 29, 2016, Maxim Khutornenko  > wrote:
> >>
> >>> (Apologies for the wordy problem statement but I feel it's really
> >>> necessary to justify the proposal).
> >>>
> >>> Over the past two weeks we have been battling a nasty scheduler issue
> >>> in production: the scheduler suddenly stops responding to any user
> >>> requests and subsequently gets killed by our health monitoring. Upon
> >>> restart, a leader may only function for a few seconds and almost
> >>> immediately hangs again.
> >>>
> >>> The long and painful investigation pointed towards internal H2 table
> >>> lock contention that resulted in a massive db-write starvation and a
> >>> state where a scheduler write lock would *never* be released. This was
> >>> relatively easy to replicate in Vagrant by creating a large update
> >>> (~4K instances) with a large batch_size (~1K), while bombarding the
> >>> scheduler with getJobUpdateDetails() requests for that job. The
> >>> scheduler would enter a locked up state on the very first write op
> >>> following the update creation (e.g. a status update for an instance
> >>> transition from the first batch) and stay in that state for minutes
> >>> until all getJobUpdateDetails() requests are served. This behavior is
> >>> well explained by the following sentence from [1]:
> >>>
> >>> "When a lock is released, and multiple connections are waiting for
> >>> it, one of them is picked at random."
> >>>
> >>> What happens here is that in a situation when many more read requests
> >>> are competing for a shared table lock, the H2 PageStore does not help
> >>> write requests requiring an exclusive table lock in any way to
> >>> succeed. This leads to db-write starvation and eventual scheduler
> >>> native store write starvation as there is no timeout on a scheduler
> >>> write lock.
> >>>
> >>> We have played with various available H2/MyBatis configuration
> >>> settings to mitigate the above with no noticeable impact. That, until
> >>> we switched to H2 MVStore [2], at which point we were able to
> >>> completely eliminate the scheduler lockup without making any other
> >>> code changes! So, the 

[DRAFT] [REPORT] Apache Aurora

2016-02-29 Thread Bill Farner
Please take a moment to read through a draft of the board report I have
prepared for Aurora.  I plan to submit it mid-week, so please let me know
if you would like to see any edits/additions!

## Description:
  Apache Aurora lets you use an Apache Mesos cluster as a private cloud. It
  supports running long-running services, cron jobs, and ad-hoc jobs.

## Issues:
 There are no issues requiring board attention at this time.

## Activity:
 - Significant number of contributors in recent 0.11.0 and 0.12.0 releases.
 - Positive feedback from prebuilt binary packages (including nightlies),
   signs of many users taking advantage of this to build their own.

## Health report:
 Aurora has been under active development for this period, with strong
 community engagement.  We have put deliberate effort into building release
 notes as we iterate towards future releases, which has received positive
 feedback from users.

## PMC changes:
 - Currently 17 PMC members.
 - New PMC members:
- Joshua Cohen was added to the PMC on Tue Dec 22 2015
- John Sirois was added to the PMC on Sun Jan 03 2016
- Stephan Erb was added to the PMC on Wed Feb 03 2016
- Steve Niemitz was added to the PMC on Mon Jan 11 2016

## Committer base changes:
 - Currently 18 committers.
 - New committers:
- John Sirois was added as a committer on Mon Jan 04 2016
- Stephan Erb was added as a committer on Wed Feb 03 2016
- Steve Niemitz was added as a committer on Tue Jan 12 2016

## Releases:
 - 0.11.0 was released on Wed Dec 23 2015
 - 0.12.0 was released on Sun Feb 07 2016

## JIRA activity:
 - 79 JIRA tickets created in the last 3 months
 - 185 JIRA tickets closed/resolved in the last 3 months


Re: Weekly community meeting

2016-02-29 Thread Bill Farner
I think it's worth raising the topic of whether to continue
regularly-scheduled meetings.  Given the distributed community, it is
difficult to find a time that routinely works for everyone.  Eliminating
this might also encourage more frequent use of the mailing list when
discussion topics arise, which I think would be better for everyone.

On Monday, February 29, 2016, Zameer Manji  wrote:

> I agree with Maxim, a weekly meeting is too frequent relative to the
> current activity on the project. I think changing it to bi-weekly schedule
> would be better overall.
>
> On Mon, Feb 29, 2016 at 9:23 AM, Joshua Cohen  > wrote:
>
> > I'm in the activated list of freenode-cloaks, but I'm not able to start
> > meetings.
> >
> > On Mon, Feb 29, 2016 at 11:00 AM, Jake Farrell  >
> > wrote:
> >
> > > A registered nick and ASF cloak are required in order to kick off
> > meetings
> > > with asfbot, registration requests can be made at [1]. Currently for
> > Aurora
> > > dlester, jfarrell, kts, wfarner , and zmanji are the only ones who
> > > are registered and setup to start the meetings
> > >
> > > -Jake
> > >
> > >
> > > [1]:
> > >
> https://svn.apache.org/repos/private/committers/docs/freenode-cloaks.txt
> > >
> > > On Mon, Feb 29, 2016 at 10:53 AM, Joshua Cohen  >
> > wrote:
> > >
> > > > I agree that it's unfortunate we've missed a few meetings.
> Historically
> > > > speaking it's always been one of a few people who kick the meeting
> off,
> > > and
> > > > if those folks aren't available the meetings seem to not happen. That
> > > said,
> > > > I think we should take a more decentralized approach. We don't need
> to
> > > rely
> > > > on anyone in particular being around to kick the meeting off. Anyone
> > > who's
> > > > around at the scheduled time should feel free to start it. If you
> don't
> > > > have the karma necessary to get the bot to work, that's fine, just
> send
> > > out
> > > > manual notes.
> > > >
> > > > That said, Jake, any idea what we need to do to ensure committers (or
> > > maybe
> > > > just PMC members, whatever) have the necessary ASFBot karma?
> > > >
> > > > On Sun, Feb 28, 2016 at 12:07 PM, Erb, Stephan <
> > > > stephan@blue-yonder.com >
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > seems like we have been sloppy with the community meeting in the
> last
> > > > > weeks. It doesn't feel right to have a regular meeting that is
> > skipped
> > > > > silently.
> > > > >
> > > > > Any thoughts or ideas what we could do about that?
> > > > >
> > > > > Best Regards,
> > > > > Stephan
> > > >
> > >
> >
> > --
> > Zameer Manji
> >
> >
>


Re: [PROPOSAL] DB snapshotting

2016-02-29 Thread Bill Farner
Thanks for the detailed write-up and real-world details!  I generally
support momentum towards a single task store implementation, so +1
on dealing with that.

I anticipated there would be a performance win from straight-to-SQL
snapshots, so I am a +1 on that as well.

In summary, +1 on all fronts!

On Monday, February 29, 2016, Maxim Khutornenko  wrote:

> (Apologies for the wordy problem statement but I feel it's really
> necessary to justify the proposal).
>
> Over the past two weeks we have been battling a nasty scheduler issue
> in production: the scheduler suddenly stops responding to any user
> requests and subsequently gets killed by our health monitoring. Upon
> restart, a leader may only function for a few seconds and almost
> immediately hangs again.
>
> The long and painful investigation pointed towards internal H2 table
> lock contention that resulted in a massive db-write starvation and a
> state where a scheduler write lock would *never* be released. This was
> relatively easy to replicate in Vagrant by creating a large update
> (~4K instances) with a large batch_size (~1K), while bombarding the
> scheduler with getJobUpdateDetails() requests for that job. The
> scheduler would enter a locked up state on the very first write op
> following the update creation (e.g. a status update for an instance
> transition from the first batch) and stay in that state for minutes
> until all getJobUpdateDetails() requests are served. This behavior is
> well explained by the following sentence from [1]:
>
> "When a lock is released, and multiple connections are waiting for
> it, one of them is picked at random."
>
> What happens here is that in a situation when many more read requests
> are competing for a shared table lock, the H2 PageStore does not help
> write requests requiring an exclusive table lock in any way to
> succeed. This leads to db-write starvation and eventual scheduler
> native store write starvation as there is no timeout on a scheduler
> write lock.
>
> We have played with various available H2/MyBatis configuration
> settings to mitigate the above with no noticeable impact. That, until
> we switched to H2 MVStore [2], at which point we were able to
> completely eliminate the scheduler lockup without making any other
> code changes! So, the solution has finally been found? The answer
> would be YES until you try MVStore-enabled H2 with any reasonable size
> production DB on scheduler restart. There was a reason why we disabled
> MVStore in the scheduler [3] in the first place and that reason was
> poor MVStore performance with bulk inserts. Re-populating
> MVStore-enabled H2 DB took at least 2.5 times longer than normal. This
> is unacceptable in prod where every second of scheduler downtime
> counts.
>
> Back to the drawing board, we tried all relevant settings and
> approaches to speed up MVStore inserts on restart but nothing really
> helped. Finally, the only reasonable way forward was to eliminate the
> point of slowness altogether - namely remove thrift-to-sql migration
> on restart. Fortunately, H2 supports an easy to operate command to
> generate the entire DB dump with a single statement [4]. We were now
> able to bypass the lengthly DB repopulation on restart by storing the
> entire DB dump in snapshot and replaying it on scheduler restart.
>
>
> Now, the proposal. Given that MVStore vastly outperforms PageStore we
> currently use, I suggest we move our H2 to it AND adopt db
> snapshotting instead of thrift snapshotting to speed up scheduler
> restarts. The rough POC is available here [5]. We are running a
> version of this build in production since last week and were able to
> completely eliminate scheduler lockups. As a welcome side effect, we
> also observed faster scheduler restart times due to eliminating
> thrift-to-sql chattiness. Depending on the snapshot freshness the
> observed failover downtimes got reduced by ~40%.
>
> Moving to db snapshotting will require us to rethink DB schema
> versioning and thrift deprecating/removal policy. We will have to move
> to pre-/post- snapshot restore SQL migration scripts to handle any
> schema changes, which is a common industry pattern but something we
> have not tried yet. The upside though is that we can get an early
> start here as we will have to adopt strict SQL migration rules anyway
> when we move to persistent DB storage. Also, given that migrating to
> H2 TaskStore will likely further degrade scheduler restart times,
> having a better performing DB snapshotting solution in place will
> definitely aid migration.
>
> Thanks,
> Maxim
>
> [1] - http://www.h2database.com/html/advanced.html?#transaction_isolation
> [2] - http://www.h2database.com/html/mvstore.html
> [3] -
> https://github.com/apache/aurora/blob/824e396ab80874cfea98ef47829279126838a3b2/src/main/java/org/apache/aurora/scheduler/storage/db/DbModule.java#L119
> [4] - http://www.h2database.com/html/grammar.html#script
> [5] -
> https://github.com/maxim111333/incu
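
For reference, the two H2 features the proposal leans on are the MVStore engine ([2] above) and the
SCRIPT/RUNSCRIPT commands ([4] above). A minimal, self-contained Java sketch of both follows; it
assumes only that the H2 jar is on the classpath, and the table and file names are made up rather
than taken from the scheduler:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public final class H2SnapshotSketch {
      public static void main(String[] args) throws Exception {
        // MV_STORE=TRUE selects the MVStore engine instead of the older PageStore.
        try (Connection db = DriverManager.getConnection("jdbc:h2:mem:aurora;MV_STORE=TRUE");
             Statement stmt = db.createStatement()) {
          stmt.execute("CREATE TABLE tasks (id VARCHAR PRIMARY KEY, state VARCHAR)");
          stmt.execute("INSERT INTO tasks VALUES ('task-1', 'RUNNING')");

          // SCRIPT TO dumps schema plus data as SQL; RUNSCRIPT replays it, which is the
          // "replay the dump on restart" idea rather than row-by-row re-insertion.
          stmt.execute("SCRIPT TO 'snapshot.sql'");
          stmt.execute("DROP ALL OBJECTS");
          stmt.execute("RUNSCRIPT FROM 'snapshot.sql'");
        }
      }
    }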

Re: [RESULT][VOTE] Release Apache Aurora 0.12.0 RC4

2016-02-29 Thread Bill Farner
IIRC site changes aren't blocked on binaries.  I believe the only link is
the bintray button, which automatically tracks the latest bintray version.

https://aurora.apache.org/downloads/

On Mon, Feb 29, 2016 at 8:10 AM, John Sirois  wrote:

> On Sun, Feb 28, 2016 at 11:43 AM, Jake Farrell 
> wrote:
>
> > Released source artifacts have all ready been promoted, any committer can
> > update the website to reflect this change (though ideally it is handled
> > immediately after the release is promoted).
> >
> > For the binary artifacts since they where not included in the initial
> vote
> > with the source artifacts they will need to be a separate vote
> >
>
> And I have been tardy getting an RC out for the debs (doc site update
> blocks on binaries since it has links).  I'll be getting the vote started
> for this later today.
>
>
> > -Jake
> >
> >
> >
> >
> > On Sun, Feb 28, 2016 at 12:55 PM, Erb, Stephan  > @blue-yonder.com> wrote:
> >
> >> Even though we have done the voting, the release is still pending. We
> >> still have to build the packages and update the website.
> >>
> >> Is there a way we can help out here?
> >>
> >> Best,
> >> Stephan
> >> 
> >> From: John Sirois 
> >> Sent: Monday, February 8, 2016 23:47
> >> To: dev@aurora.apache.org
> >> Subject: [RESULT][VOTE] Release Apache Aurora 0.12.0 RC4
> >>
> >> All,
> >> The vote to accept Apache Aurora 0.12.0 RC4
> >> as the official Apache Aurora 0.12.0 release has passed.
> >>
> >>
> >> +1 (Binding)
> >> --
> >> jsirois
> >> wfarner
> >> serb
> >> maxim
> >>
> >>
> >> +1 (Non-binding)
> >> --
> >>
> >>
> >> There were no 0 or -1 votes. Thank you to all who helped make this
> >> release.
> >>
> >>
> >> Aurora 0.12.0 includes the following:
> >> ---
> >> The CHANGELOG for the release is available at:
> >>
> >>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=CHANGELOG&hb=rel/0.12.0
> >>
> >> The tag used to create the release with is rel/0.12.0:
> >> https://git-wip-us.apache.org/repos/asf?p=aurora.git&hb=rel/0.12.0
> >>
> >> The release is available at:
> >>
> >>
> https://dist.apache.org/repos/dist/release/aurora/0.12.0/apache-aurora-0.12.0.tar.gz
> >>
> >> The MD5 checksum of the release can be found at:
> >>
> >>
> https://dist.apache.org/repos/dist/release/aurora/0.12.0/apache-aurora-0.12.0.tar.gz.md5
> >>
> >> The signature of the release can be found at:
> >>
> >>
> https://dist.apache.org/repos/dist/release/aurora/0.12.0/apache-aurora-0.12.0.asc
> >>
> >> The GPG key used to sign the release are available at:
> >> https://dist.apache.org/repos/dist/release/aurora/KEYS
> >>
> >>
> >> On Fri, Feb 5, 2016 at 3:14 PM, John Sirois  wrote:
> >>
> >> > All,
> >> >
> >> > I propose that we accept the following release candidate as the
> official
> >> > Apache Aurora 0.12.0 release.
> >> >
> >> > Aurora 0.12.0-rc4 includes the following:
> >> > ---
> >> > The NEWS for the release is available at:
> >> >
> >> >
> >>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=NEWS&hb=rel/0.12.0-rc4
> >> >
> >> > The CHANGELOG for the release is available at:
> >> >
> >> >
> >>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=CHANGELOG&hb=rel/0.12.0-rc4
> >> >
> >> > The tag used to create the release candidate is:
> >> >
> >> >
> >>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/tags/rel/0.12.0-rc4
> >> >
> >> > The release candidate is available at:
> >> >
> >> >
> >>
> https://dist.apache.org/repos/dist/dev/aurora/0.12.0-rc4/apache-aurora-0.12.0-rc4.tar.gz
> >> >
> >> > The MD5 checksum of the release candidate can be found at:
> >> >
> >> >
> >>
> https://dist.apache.org/repos/dist/dev/aurora/0.12.0-rc4/apache-aurora-0.12.0-rc4.tar.gz.md5
> >> >
> >> > The signature of the release candidate can be found at:
> >> >
> >> >
> >>
> https://dist.apache.org/repos/dist/dev/aurora/0.12.0-rc4/apache-aurora-0.12.0-rc4.tar.gz.asc
> >> >
> >> > The GPG key used to sign the release are available at:
> >> > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> >> >
> >> > Please download, verify, and test.
> >> >
> >> > The vote will close on Mon Feb  8 15:12:45 MST 2016
> >> >
> >> > [ ] +1 Release this as Apache Aurora 0.12.0
> >> > [ ] +0
> >> > [ ] -1 Do not release this as Apache Aurora 0.12.0 because...
> >> > ---
> >> >
> >> > Reminder: you can verify the release candidate via:
> >> >
> >> >   ./build-support/release/verify-release-candidate 0.12.0-rc4
> >> >
> >> > If you can deploy the RC to a test cluster and evaluate it there, even
> >> > better.
> >> >
> >>
> >
> >
>


Re: [PROPOSAL] Revisit task ID format

2016-02-19 Thread Bill Farner
It has now been submitted.

On Fri, Feb 19, 2016 at 9:32 AM, Chris Lambert 
wrote:

> 'Morning,
>
> What's the eta for this patch, Bill?  Planning to commit soon?
>
> Chris
>
>
> On Wed, Jan 27, 2016 at 8:53 PM, Bill Farner  wrote:
>
> > Thanks for the input, folks.  Patch is up here:
> > https://reviews.apache.org/r/42896/
> >
> > On Wed, Jan 27, 2016 at 8:09 PM, Maxim Khutornenko 
> > wrote:
> >
> > > You're right, ignore me. I guess I was mourning the loss of
> > StorageBackfill
> > > too hard :) Obviously, we don't have to force this change over existing
> > > tasks and let them die out naturally. Some user scraping tools may be
> > > broken due to this but we always advised against taking a dependency on
> > our
> > > task ID format.
> > >
> > > +1 to the proposal.
> > >
> > > On Wed, Jan 27, 2016 at 5:46 PM, Bill Farner 
> wrote:
> > >
> > > > I don't believe backwards compatibility is an issue here.  This would
> > be
> > > an
> > > > alteration to generation of new task IDs.  AFAIK we don't do anything
> > > that
> > > > requires comprehension or synthesis of previously-generated task IDs.
> > > >
> > > > On Wed, Jan 27, 2016 at 5:39 PM, Maxim Khutornenko  >
> > > > wrote:
> > > >
> > > > > What's the cluster upgrade story going to look like? Since task IDs
> > are
> > > > > used as unique identifiers for Mesos, I expect this change would
> > > require
> > > > > rebooting the entire cluster to the new format? I am not really
> sure
> > > how
> > > > > this can be done in a graceful manner without sending the entire
> > > cluster
> > > > to
> > > > > LOST on first reconciliation run.
> > > > >
> > > > > Unless we have answers to the above, I am -1 to this proposal. The
> > > > benefits
> > > > > don't seem significant enough to offset the pain of migrating to
> the
> > > new
> > > > > format.
> > > > >
> > > > > On Tue, Jan 26, 2016 at 6:25 PM, Mauricio Garavaglia <
> > > > > mauriciogaravag...@gmail.com> wrote:
> > > > >
> > > > > > +1 to dropping the timestamp.
> > > > > >
> > > > > > Agree that having the jobkey at hand has been helpful for
> > debugging.
> > > > > >
> > > > > > On Tue, Jan 26, 2016 at 10:39 PM, Zameer Manji <
> zma...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > +1 to removing the timestamp.
> > > > > > >
> > > > > > > The timestamp has not provided me with any benefit as an
> > operator.
> > > > The
> > > > > > > mangled jobkey and UUID have been very useful in grepping logs
> > and
> > > > > > > diagnosing failing jobs.
> > > > > > >
> > > > > > > On Tue, Jan 26, 2016 at 3:49 PM, Zhitao Li <
> > zhitaoli...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > +1 for dropping the time and keeping the mangled jobkey.
> Unless
> > > we
> > > > > are
> > > > > > > sure
> > > > > > > > that all internal logging of Mesos master and agent contains
> an
> > > > > > > identifier
> > > > > > > > with user some user generated data, changing it to UUID will
> > make
> > > > > adhoc
> > > > > > > > debugging through Mesos logging harder.
> > > > > > > >
> > > > > > > > On Tue, Jan 26, 2016 at 3:17 PM, Erb, Stephan <
> > > > > > > stephan@blue-yonder.com
> > > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 for dropping the timestamp
> > > > > > > > >
> > > > > > > > > However, I am not sure regarding the mangled jobkey. It
> tends
> > > to
> > > > > make
> > > > > > > it
> > > > > > > > > easier to correlate Mesos tasks to Aurora jobs when
> skimming
> > > log
> > > > > > files,
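
For context, the format being converged on here is roughly the mangled job key plus a UUID, with
the timestamp dropped. A hedged sketch of what such an ID generator could look like; this is purely
illustrative, and the scheduler's real format is defined in its own task ID generation code:

    import java.util.UUID;

    final class TaskIdSketch {
      // e.g. "www-data-prod-hello-3-5b9c...": readable enough to grep scheduler and
      // agent logs by job, unique thanks to the UUID, and with no timestamp component.
      static String taskId(String role, String env, String jobName, int instanceId) {
        return String.join("-", role, env, jobName,
            Integer.toString(instanceId), UUID.randomUUID().toString());
      }
    }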

Re: Run a Single Java Test

2016-02-17 Thread Bill Farner
Interesting.  That command works fine for me.  (Well, it fails - but
because a test coverage task erroneously runs.)  What git sha are you at?

On Wed, Feb 17, 2016 at 10:45 AM, Zane Silver 
wrote:

> Hi developers, I'm trying to figure out how to run a single java test with
> gradlew. I'm sure there's a simple answer to my question, but I can seem to
> figure this one out.
>
> Specifically, I'm trying to run the scheduler tests in
> (/src/test/java/org/apache/aurora/scheduler) ResourcesTest.java. I've tried
> the following (and variation) to no avail:
>
> $ ./gradlew test --tests org.apache.aurora.scheduler.Resources*
>
> ...
>
> FAILURE: Build failed with an exception.
>
> * What went wrong:
> Execution failed for task ':commons:test'.
> > No tests found for given includes:
> [org.apache.aurora.scheduler.Resources*]
> ...
>
>
> I appreciate the help!
>


Re: Jenkins build is back to normal : aurora-packaging-nightly #194

2016-02-12 Thread Bill Farner
I would also be interested in knowing.  The failure that occurred here does
seem to be a semi-frequent one.  I have not invested time to understand why
that might happen.

Error response from daemon: lstat
/var/lib/docker/aufs/mnt/2bf3bb359dec47586c324e03f190404b312fa6237618c36d23d935105d1b6172/dist:
no such file or directory


On Fri, Feb 12, 2016 at 5:56 PM, John Sirois  wrote:

> I would love to understand:
> 1.) Why this job fails intermittently.
> 2.) Why every time (4?) I click re-build, it goes green.
>
> Does anyone have clues from prior investigations?
>
> On Fri, Feb 12, 2016 at 6:51 PM, Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
>
> > See 
> >
> >
>
>
> --
> John Sirois
> 303-512-3301
>


Re: [PROPOSAL] Disallow instance removal in job update

2016-02-05 Thread Bill Farner
Or without any persistence at all.  The client could refuse to adjust the
instance count on a job unless there's an additional command line argument.
The same arguments about responsibility apply here to users of old clients
or custom clients.
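
Roughly what that could look like from the user's side (the flag name is
hypothetical and does not exist today; treat the exact client verbs as a
sketch):

  # An update that would shrink the job is rejected by the client...
  aurora update start devcluster/www-data/prod/hello hello.aurora

  # ...unless the operator explicitly acknowledges the instance removal.
  aurora update start --allow-instance-removal devcluster/www-data/prod/hello hello.aurora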

On Fri, Feb 5, 2016 at 3:17 PM, John Sirois  wrote:

> On Fri, Feb 5, 2016 at 4:07 PM, Maxim Khutornenko 
> wrote:
>
> > We have had attempts to safeguard client updater command with a
> > "dangerous change" warning before but it did not get good feedback.
> > Besides, automated tools/scripts just ignored it.
> >
> > An alternative could be what George suggested on the scaling API thread
> > mentioned earlier: automatically bump up the instance count to the job's
> > active task count. I'd say this could be an implementation of the
> > proposal above rather than a safeguard, as it accomplishes the exact
> > same goal.
> >
> > Bill, do you have any ideas of what that safeguard could be?
> >
>
> I'd recommend that an API call that reduces instance count require a
> `confirm_instance_reduction=true` parameter - this could be plumbed back
> to a flag in the official Aurora client.
> That said, since Aurora immediately forgets jobs and splits things into
> tasks, I'm not sure this is even sanely possible today.
>
> Assuming it is possible, any human that turns that flag on by default with
> a shell alias or an rc file can take responsibility for their own problem.
> If a tool passes the boolean, again - that's the tool's problem.  Hopefully
> it's a carefully developed and vetted auto-scaling tool.
>
>
> > On Fri, Feb 5, 2016 at 2:56 PM, Bill Farner  wrote:
> > >>
> > >> the outdated instance count problem will only get worse as automated
> > >> scaling tools will quickly render existing .aurora config value
> obsolete
> > >
> > >
> > > This is not a compelling reason to remove functionality.  Sounds like a
> > > safeguard is needed instead.
> > >
> > > On Fri, Feb 5, 2016 at 2:43 PM, Maxim Khutornenko 
> > wrote:
> > >
> > >> This is mostly a survey rather than a proposal. How would people think
> > >> about limiting updater to only adding/updating instances and let
> > >> killTasks take care of instance removals?
> > >>
> > >> We have all heard stories (or happened to create some ourselves) where an
> > >> outdated instance count value in the .aurora config caused unexpected
> > >> instance removals. Granted, there are plenty of other values in the
> > >> config that can cause service-wide outage but instance count seems to
> > >> be the worst in that sense.
> > >>
> > >> After the recent refactoring of addInstances and killTasks to act as
> > >> scaleOut/scaleIn APIs [1], the outdated instance count problem will
> > >> only get worse as automated scaling tools will quickly render existing
> > >> .aurora config value obsolete. With that in mind, should we block
> > >> instance removal in the updater and let an explicit killTasks call be
> > >> the only acceptable action to reduce instance count? Is there any
> > >> value (aside from arguable convenience factor) in having
> > >> startJobUpdate ever killing instances?
> > >>
> > >> Thanks,
> > >> Maxim
> > >>
> > >> [1] - http://markmail.org/message/2smaej5n5e54li3g
> > >>
> >
>
>
>
> --
> John Sirois
> 303-512-3301
>


Re: Subject: [VOTE] Release Apache Aurora 0.12.0 RC4

2016-02-05 Thread Bill Farner
+1

Successfully ran ./build-support/release/verify-release-candidate 0.12.0-rc4
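
For anyone spot-checking by hand in addition to the helper script, the
usual steps look roughly like this (URLs taken from the vote email below):

  curl -O https://dist.apache.org/repos/dist/dev/aurora/0.12.0-rc4/apache-aurora-0.12.0-rc4.tar.gz
  curl -O https://dist.apache.org/repos/dist/dev/aurora/0.12.0-rc4/apache-aurora-0.12.0-rc4.tar.gz.asc
  curl -O https://dist.apache.org/repos/dist/dev/aurora/KEYS
  gpg --import KEYS
  gpg --verify apache-aurora-0.12.0-rc4.tar.gz.asc apache-aurora-0.12.0-rc4.tar.gz
  md5sum apache-aurora-0.12.0-rc4.tar.gz   # compare against the published .md5 file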

On Fri, Feb 5, 2016 at 3:18 PM, John Sirois  wrote:

> I'd like to kick off voting with my own +1 (binding).
> ./build-support/release/verify-release-candidate 0.12.0-rc4 is green for
> me.
>
> On Fri, Feb 5, 2016 at 3:14 PM, John Sirois  wrote:
>
> > All,
> >
> > I propose that we accept the following release candidate as the official
> > Apache Aurora 0.12.0 release.
> >
> > Aurora 0.12.0-rc4 includes the following:
> > ---
> > The NEWS for the release is available at:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=NEWS&hb=rel/0.12.0-rc4
> >
> > The CHANGELOG for the release is available at:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git&f=CHANGELOG&hb=rel/0.12.0-rc4
> >
> > The tag used to create the release candidate is:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/tags/rel/0.12.0-rc4
> >
> > The release candidate is available at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.12.0-rc4/apache-aurora-0.12.0-rc4.tar.gz
> >
> > The MD5 checksum of the release candidate can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.12.0-rc4/apache-aurora-0.12.0-rc4.tar.gz.md5
> >
> > The signature of the release candidate can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.12.0-rc4/apache-aurora-0.12.0-rc4.tar.gz.asc
> >
> > The GPG key used to sign the release are available at:
> > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> >
> > Please download, verify, and test.
> >
> > The vote will close on Mon Feb  8 15:12:45 MST 2016
> >
> > [ ] +1 Release this as Apache Aurora 0.12.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Aurora 0.12.0 because...
> > ---
> >
> > Reminder: you can verify the release candidate via:
> >
> >   ./build-support/release/verify-release-candidate 0.12.0-rc4
> >
> > If you can deploy the RC to a test cluster and evaluate it there, even
> > better.
> >
>


Re: [PROPOSAL] Disallow instance removal in job update

2016-02-05 Thread Bill Farner
>
> the outdated instance count problem will only get worse as automated
> scaling tools will quickly render existing .aurora config value obsolete


This is not a compelling reason to remove functionality.  Sounds like a
safeguard is needed instead.

On Fri, Feb 5, 2016 at 2:43 PM, Maxim Khutornenko  wrote:

> This is mostly a survey rather than a proposal. How would people think
> about limiting updater to only adding/updating instances and let
> killTasks take care of instance removals?
>
> We have all heard stories (or happened to create some ourselves) where an
> outdated instance count value in the .aurora config caused unexpected
> instance removals. Granted, there are plenty of other values in the
> config that can cause service-wide outage but instance count seems to
> be the worst in that sense.
>
> After the recent refactoring of addInstances and killTasks to act as
> scaleOut/scaleIn APIs [1], the outdated instance count problem will
> only get worse as automated scaling tools will quickly render existing
> .aurora config value obsolete. With that in mind, should we block
> instance removal in the updater and let an explicit killTasks call be
> the only acceptable action to reduce instance count? Is there any
> value (aside from arguable convenience factor) in having
> startJobUpdate ever killing instances?
>
> Thanks,
> Maxim
>
> [1] - http://markmail.org/message/2smaej5n5e54li3g
>
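
For reference, the explicit scale-in path described above would look
something like this from the client (the job key and instance range are
invented for illustration):

  # Remove instances 8 and 9 explicitly, rather than relying on a lower
  # instance count in the .aurora config to trigger their removal.
  aurora job kill devcluster/www-data/prod/hello/8-9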


New committer and PMC member: Stephan Erb

2016-02-03 Thread Bill Farner
Folks,

Please join me in welcoming Stephan Erb, who is now an Aurora committer and
PMC member!  I'm sure anyone paying attention has noticed Stephan's
involvement in the community and commitment to improving Aurora!

Welcome aboard, Stephan!


Re: NEWS Layout

2016-02-02 Thread Bill Farner
+1
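
Concretely, a release entry under the proposed scheme might look roughly
like this (exact heading markup aside, with the bullet contents elided):

  0.12.0
  ------
  New/updated:
    * ...
  Deprecations and removals:
    * ...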

On Tuesday, February 2, 2016, Erb, Stephan 
wrote:

> Hi everyone,
>
> I'd like to propose that we give our NEWS file a little bit more
> structure. Currently, it is quite cluttered [1].
>
> To keep it simple, I'd suggest that we adopt the style from the 0.11
> Aurora blog post:
>
> * New/updated
> * Deprecations and removals
>
> [1] https://github.com/apache/aurora/blob/master/NEWS
> [2] https://aurora.apache.org/blog/aurora-0-11-0-released/
>
>
> Thoughts?
>

