+1 for sanity check - that's fast.
-1 for unix-cpu - that's slow and can just hang.

So my suggestion would be to look at the data separately: what's the
failure rate on the sanity check, and what is it on unix-cpu? Actually,
can we get a table of all of the tests with this data?
If the sanity check fails, let's say, 20% of the time but only takes
a couple of minutes, then yeah, let's stack it and run that one first.
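
To make the table idea concrete, here is a rough sketch of what I mean.
The records below are made-up placeholders, not real Jenkins data, and
the "score" column (failure rate divided by mean duration) is just my
own assumption for ranking which check is worth running first.

```python
from collections import defaultdict

# Hypothetical build records: (pipeline, passed, minutes).
# These numbers are placeholders, NOT real Jenkins data.
RECORDS = [
    ("sanity", False, 2.0),
    ("sanity", True, 3.0),
    ("sanity", True, 2.0),
    ("sanity", True, 3.0),
    ("sanity", True, 2.0),
    ("unix-cpu", False, 150.0),
    ("unix-cpu", True, 170.0),
    ("unix-cpu", False, 160.0),
    ("unix-cpu", True, 180.0),
]


def summarize(records):
    """Return {pipeline: (runs, failure_rate, mean_minutes)}."""
    grouped = defaultdict(list)
    for pipeline, passed, minutes in records:
        grouped[pipeline].append((passed, minutes))
    summary = {}
    for pipeline, runs in grouped.items():
        failures = sum(1 for passed, _ in runs if not passed)
        mean_minutes = sum(m for _, m in runs) / len(runs)
        summary[pipeline] = (len(runs), failures / len(runs), mean_minutes)
    return summary


def failures_caught_per_minute(failure_rate, mean_minutes):
    """Rough gate score (an assumption): high means cheap and likely to fail."""
    return failure_rate / mean_minutes


if __name__ == "__main__":
    table = summarize(RECORDS)
    print(f"{'pipeline':<10} {'runs':>4} {'fail %':>7} {'mean min':>9} {'score':>8}")
    for name, (runs, rate, minutes) in sorted(table.items()):
        score = failures_caught_per_minute(rate, minutes)
        print(f"{name:<10} {runs:>4} {rate * 100:>6.1f}% {minutes:>9.1f} {score:>8.4f}")
```

With real numbers in place of the placeholders, a check that fails often
and finishes quickly would float to the top, and that is the one to run
first.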

I think unix-cpu needs to be broken apart. It's too complex and fails
in multiple ways. Isolate the brittle parts. Then we can
restart/disable those as needed, while all of the other parts pass and
don't have to be rerun.
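
Roughly what I have in mind, as a sketch only. The part names are
hypothetical, and this is not how the current Jenkins job is laid out;
it just shows the bookkeeping once unix-cpu reports a result per part.

```python
def parts_to_rerun(results, disabled=frozenset()):
    """Return the failed parts that are still enabled, sorted by name.

    results maps a part name to True (passed) or False (failed).
    """
    return sorted(
        name
        for name, passed in results.items()
        if not passed and name not in disabled
    )


def overall_passed(results, disabled=frozenset()):
    """The stage passes if every enabled part passed."""
    return all(
        passed for name, passed in results.items() if name not in disabled
    )


if __name__ == "__main__":
    # Hypothetical split of unix-cpu into independent parts.
    results = {
        "build": True,
        "unit-tests": True,
        "brittle-integration": False,
    }
    print(parts_to_rerun(results))
    print(overall_passed(results, disabled={"brittle-integration"}))
```

Only the failed part gets restarted, and a known-brittle part can be
disabled without throwing away the results of the parts that passed.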

On Wed, Mar 25, 2020 at 1:32 AM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>
> We had this structure in the past and the community was bothered by CI
> taking more time, thus we moved to the current model with everything
> parallelized. We'd basically revert that then.
>
> Can you show by how much the duration will increase?
>
> Also, we have zero test parallelisation, i.e. we are running one test at a
> time on 72-core machines (although with multiple workers). Wouldn't it be way more
> efficient to add parallelisation and thus heavily reduce the time spent on
> the tasks instead of staggering?
>
> I feel concerned that these measures to save cost are paid in the form of a
> worse user experience. I see a big potential to save costs by increasing
> efficiency while actually improving the user experience due to CI being
> faster.
>
> -Marco
>
> Joe Evans <joseph.ev...@gmail.com> wrote on Wed, 25 Mar 2020, 04:58:
>
> > Hi,
> >
> >
> > First, I just wanted to introduce myself to the MXNet community. I’m Joe
> > and will be working with Chai and the AWS team to improve some issues
> > around MXNet CI. One of our goals is to reduce the costs associated with
> > running MXNet CI. The task I’m working on now is this issue:
> >
> >
> > https://github.com/apache/incubator-mxnet/issues/17802
> >
> >
> > Proposal: Staggered Jenkins CI pipeline
> >
> >
> > Based on data collected from Jenkins, around 55% of the time when the
> > mxnet-validation CI build is triggered by a PR, either the sanity or
> > unix-cpu builds fail. When either of these builds fails, it doesn’t make
> > sense to run the rest of the pipelines and utilize all those resources if
> > we’ve already identified a build or unit test failure.
> >
> >
> > We are proposing changing the MXNet Jenkins CI pipeline by requiring the
> > *sanity* and *unix-cpu* builds to complete and pass tests successfully
> > before starting the other build pipelines (centos-cpu/gpu, unix-gpu,
> > windows-cpu/gpu, etc.) Once the sanity builds successfully complete, the
> > remaining build pipelines will be triggered and run in parallel (as they
> > currently do.) The purpose of this change is to identify faulty code or
> > compatibility issues early and prevent further execution of CI builds. This
> > will increase the time required to test a PR, but will prevent unnecessary
> > builds from running.
> >
> >
> > Does anyone have any concerns with this change or suggestions?
> >
> >
> > Thanks.
> >
> > Joe Evans
> >
> > joseph.ev...@gmail.com
> >
