Re: Move to new CI

2017-12-06 Thread Marco de Abreu
Hi,

I'm pleased to announce that we just moved from Apache CI to our new CI
hosted at http://jenkins.mxnet-ci.amazon-ml.com/. We are now in an
environment which is entirely controlled by the MXNet community - if any
issues around CI arise, we'll be able to solve them ourselves without
involving Apache Infra or other third parties. This will allow us to get
back to the development speed that MXNet deserves - no more waiting for a
job in the queue! But just to set some expectations: We only migrated the
previous CI. While we've made sure to update dependencies, set up slaves
from scratch and document all steps as well as mitigate possible pitfalls,
we're still facing issues due to flaky tests and other irregular behaviour
which we will have to address in future.

Unfortunately, this change requires everybody to rebase their PR in order
to include necessary changes to the testing infrastructure. As soon as you
submit a new commit, it will be built and tested automatically -
indicated by Chris Olivier's nice profile picture below your PR.

A short summary of the changes and actions which we have executed during
the migration:
- Moved Dockerfiles to Ubuntu16.04 due to incompatibility issues on
Ubuntu14.04 with OpenBlas on C5 instances
- Moved Dockerfiles to CUDA8 due to CUDA7.5 (previously tested version) not
being available for Ubuntu16.04
- CPU and GPU tasks are getting executed on appropriate machines
(C5.18xlarge and G3.8xlarge). CPU-tasks are being parallelised.
- Defined Jenkins as infrastructure-as-code in Terraform, allowing a clean
redeployment at any time.
- Created an entirely new setup for Ubuntu CPU and GPU slaves, based on
Ubuntu16.04. All steps have been documented and will be available in
future. This will allow everybody to reproduce the entire stack.
- Fixed a few bugs in the existing Windows CPU and GPU slaves, causing
builds to hang or crash. We were not able to create entirely new images due
to time constraints and the vast amount of dependencies on Windows.
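
As a rough illustration of the CPU/GPU routing described in the list above,
the following Python sketch maps a task kind to the instance type named
there and runs independent CPU tasks in parallel. The names
`INSTANCE_TYPES`, `instance_type_for` and `run_cpu_tasks`, as well as the use
of a thread pool, are assumptions made for illustration only; the actual CI
is driven by Jenkins, and its configuration is not shown in this email.

```python
from concurrent.futures import ThreadPoolExecutor

# Instance types are taken from the list above; the mapping is hypothetical.
INSTANCE_TYPES = {"cpu": "C5.18xlarge", "gpu": "G3.8xlarge"}


def instance_type_for(task_kind):
    """Return the instance type a task of the given kind should run on."""
    try:
        return INSTANCE_TYPES[task_kind]
    except KeyError:
        raise ValueError(f"unknown task kind: {task_kind!r}")


def run_cpu_tasks(tasks, max_workers=4):
    """Run independent CPU tasks concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda task: task(), tasks))
```

Here each task is assumed to be a callable without arguments, for example
`run_cpu_tasks([build_docs, run_unit_tests])` with hypothetical functions.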

The following points are on our roadmap for MXNet CI:
- Provide debug artefacts in the Jenkins web interface. This allows you to
download the generated artefacts so you can reproduce a test failure
locally without having to compile that specific configuration yourself.
- Reactive auto scaling based on queue length: In future, slaves will be
started and stopped, depending on the queue length. This will ensure that a
PR is always going to start being built and tested within a few minutes -
no more queues!
- Parallelise GPU tests and increase performance: We plan to investigate how
far it is possible to parallelise GPU tests, reduce the required
execution time and increase the performance of the CI in general.
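
The reactive auto scaling item above could be sketched roughly as follows.
This is a minimal sketch and not the planned implementation: the function
names, the `jobs_per_node` parameter and the `min_nodes`/`max_nodes` limits
are assumptions, and `pending_jobs` is assumed to count both queued and
currently running jobs.

```python
import math


def desired_node_count(pending_jobs, jobs_per_node=1, min_nodes=1, max_nodes=10):
    """Return how many build nodes should run for the given number of jobs.

    All parameters and default limits are hypothetical.
    """
    if pending_jobs < 0:
        raise ValueError("pending_jobs must be non-negative")
    needed = math.ceil(pending_jobs / jobs_per_node)
    # Clamp to the allowed range so the fleet never drops to zero or grows unbounded.
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes, pending_jobs, **limits):
    """Return ("start", n), ("stop", n) or ("keep", 0) to reach the target size."""
    target = desired_node_count(pending_jobs, **limits)
    delta = target - current_nodes
    if delta > 0:
        return ("start", delta)
    if delta < 0:
        return ("stop", -delta)
    return ("keep", 0)
```

For example, `scaling_action(2, 5)` asks for three more nodes under these
assumed defaults, while an empty queue shrinks the fleet down to `min_nodes`.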

Thanks to the following people who helped create and launch this CI:
Meghna Baijal, Gautam Kumar, Bhavin Thaker, Steffen Rochel, Chris Olivier,
Eric Junyuan Xie, Kellen Sunderland, Pedro Larroy, Marco de Abreu, Daniel
Bay, Asmus Hetzel, Daniel Takamori and Sebastian Schelter.

Feel free to reach out to me if any questions arise.

Best regards,
Marco de Abreu



On Wed, Dec 6, 2017 at 9:14 AM, Marco de Abreu  wrote:

> Thanks for your feedback, Bhavin.
>
> I've uploaded the documents to Google Docs:
> - Auto scaling: https://docs.google.com/document/d/1a_bj2wmmnFFG70wWoghK3YxlNFv9CKo8GomFSK4vh3g/edit?usp=sharing
> - Security design: https://docs.google.com/document/d/1YZaHHQr5f4j-XQ2y8PjK3ACCI-ybXshcRubmbu_MCWk/edit?usp=sharing
>
> Please note that these documents are still WIP.
>
> -Marco
>
> On 06.12.2017 at 8:11 AM, "Bhavin Thaker" wrote:
>
> Hi Marco,
>
> Thanks for your work on the CI.
>
> Is it possible/ok to share the docs via a Google Docs link until you get
> write permissions for the Apache wiki page?
>
> Bhavin Thaker.
>
> On Tue, Dec 5, 2017 at 4:06 PM, Marco de Abreu <marco.g.ab...@googlemail.com> wrote:
>
> > Hello MXNet community,
> >
> > as discussed in
> > https://lists.apache.org/thread.html/bb0f63cd6dbab9cf7e5f857f60b758a774a76876b8135ec9cf67a57c@%3Cdev.mxnet.apache.org%3E
> > and tracked at https://issues.apache.org/jira/browse/MXNET-1, we've decided
> > to switch to a new CI running Jenkins and move away from the current setup
> > running under builds.apache.org. Within the past weeks, my team and I have
> > been working on setting up the new CI.
> >
> > We've reached a state which is, besides a few structural changes described
> > in https://github.com/apache/incubator-mxnet/pull/8960, compatible with the
> > MXNet repository. I'd like to make the transition within the next few days
> > and thus request a review on my PR.
> >
> > I would like to point out that at this stage, we're only migrating to the
> > new CI. This means we're in control of the master, can start and stop
> > slaves as desired, don't rely on Apache Infra-tickets and thus increase
> > stability and reaction time. Just to set some expectations: During 

Re: Move to new CI

2017-12-05 Thread Bhavin Thaker
Hi Marco,

Thanks for your work on the CI.

Is it possible/ok to share the docs via a Google Docs link until you get
write permissions for the Apache wiki page?

Bhavin Thaker.

On Tue, Dec 5, 2017 at 4:06 PM, Marco de Abreu  wrote:

> Hello MXNet community,
>
> as discussed in
> https://lists.apache.org/thread.html/bb0f63cd6dbab9cf7e5f857f60b758a774a76876b8135ec9cf67a57c@%3Cdev.mxnet.apache.org%3E
> and tracked at https://issues.apache.org/jira/browse/MXNET-1, we've decided
> to switch to a new CI running Jenkins and move away from the current setup
> running under builds.apache.org. Within the past weeks, my team and I have
> been working on setting up the new CI.
>
> We've reached a state which is, besides a few structural changes described
> in https://github.com/apache/incubator-mxnet/pull/8960, compatible with the
> MXNet repository. I'd like to make the transition within the next few days
> and thus request a review on my PR.
>
> I would like to point out that at this stage, we're only migrating to the
> new CI. This means we're in control of the master, can start and stop
> slaves as desired, don't rely on Apache Infra-tickets and thus increase
> stability and reaction time. Just to set some expectations: During my tests
> I've noticed seg-faults, flaky tests and other irregular build failures
> like those already happening on the current CI setup. These are not
> related to the CI itself, but we're going to look into these issues after
> the initial migration has been completed. While the new CI is running,
> we'll work on implementing reactive auto scaling to avoid any queue times.
>
> I would appreciate some feedback and I'd like to hear if there's anything
> you'd like to have in the new CI. Additionally, I've created a few design
> documents regarding auto scaling and security design, but unfortunately,
> I'm not able to share them yet due to permission issues on the MXNet wiki.
> As soon as they're available, I'll send a follow-up email.
>
> Looking forward to your feedback.
>
> Best regards,
> Marco
>