Podling Mxnet Report Reminder - April 2020

2020-03-25 Thread jmclean
Dear podling,

This email was sent by an automated system on behalf of the Apache
Incubator PMC. It is an initial reminder to give you plenty of time to
prepare your quarterly board report.

The board meeting is scheduled for Wed, 15 April 2020, 10:30 am PDT.
The report for your podling will form a part of the Incubator PMC
report. The Incubator PMC requires your report to be submitted 2 weeks
before the board meeting, to allow sufficient time for review and
submission (Wed, April 01).

Please submit your report with sufficient time to allow the Incubator
PMC, and subsequently the board members, to review and digest it. Again,
the very latest you should submit your report is 2 weeks prior to the
board meeting.

Candidate names should not be made public before people are actually
elected, so please do not include the names of potential committers or
PPMC members in your report.

Thanks,

The Apache Incubator PMC

Submitting your Report
----------------------

Your report should contain the following:

*   Your project name
*   A brief description of your project, which assumes no knowledge of
the project or necessarily of its field
*   A list of the three most important issues to address in the move
towards graduation
*   Any issues that the Incubator PMC or ASF Board might wish/need to be
aware of
*   How has the community developed since the last report?
*   How has the project developed since the last report?
*   How does the podling rate their own maturity?

This should be appended to the Incubator Wiki page at:

https://cwiki.apache.org/confluence/display/INCUBATOR/April2020

Note: This is manually populated. You may need to wait a little before
this page is created from a template.

Note: The format of the report has changed to use markdown.

Mentors
-------

Mentors should review reports for their project(s) and sign them off on
the Incubator wiki page. Signing off reports shows that you are
following the project; reports that are not signed off may raise alarms
for the Incubator PMC.

Incubator PMC


Re: Podling Mxnet Report Reminder - April 2020

2020-03-25 Thread Sheng Zha
I'm working on a draft now and will report back with a link once it's ready.

-sz


Re: CI Pipeline Change Proposal

2020-03-25 Thread Marco de Abreu
Back then I created a system which exports all Jenkins results to
CloudWatch. It does not include individual test results, but rather stages
and jobs. The data for the sanity check should be available there.
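
For illustration, a minimal sketch of what such an export could look like
with boto3; the namespace, metric, and dimension names here are made up,
and the real exporter may record things differently:

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

    def record_stage_result(job_name, stage_name, passed):
        # Record the stage outcome as a 0/1 metric so failure rates can be graphed.
        cloudwatch.put_metric_data(
            Namespace="MXNetCI",  # hypothetical namespace
            MetricData=[{
                "MetricName": "StageFailed",
                "Dimensions": [
                    {"Name": "Job", "Value": job_name},
                    {"Name": "Stage", "Value": stage_name},
                ],
                "Value": 0.0 if passed else 1.0,
                "Unit": "Count",
            }],
        )

    # e.g. record_stage_result("mxnet-validation/sanity", "sanity-lint", passed=False)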

Something I'd also be curious about is the percentage of failures within a
single run. That is, if a commit failed, were there multiple jobs failing
(indicating an error in the code) or only one or two (indicating
flakiness)? This should give us a proper understanding of how unnecessary
these runs really are.
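
A small sketch of how that breakdown could be computed, with made-up run
data standing in for whatever the CloudWatch export actually provides:

    from collections import Counter

    # Failed-job lists per CI run; the values here are invented for illustration.
    failed_jobs_per_run = {
        "run-1": ["unix-cpu"],
        "run-2": ["sanity", "unix-cpu", "unix-gpu", "centos-cpu"],
        "run-3": [],
    }

    def classify(failed_jobs):
        if not failed_jobs:
            return "passed"
        # Many red jobs suggest the code is broken; one or two suggest flakiness.
        return "broad failure" if len(failed_jobs) > 2 else "isolated (possibly flaky)"

    print(Counter(classify(jobs) for jobs in failed_jobs_per_run.values()))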

-Marco


Re: CI Pipeline Change Proposal

2020-03-25 Thread Aaron Markham
+1 for sanity check - that's fast.
-1 for unix-cpu - that's slow and can just hang.

So my suggestion would be to tease the data apart - what's the failure
rate on the sanity check versus unix-cpu? Actually, can we get a
table of all of the tests with this data?!
If the sanity check fails, let's say, 20% of the time, but only takes
a couple of minutes, then yes, let's stagger it and run that one first.
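
A rough sketch of how such a failure-rate table could be pulled together
from the Jenkins JSON API; the Jenkins URL and job names below are
placeholders, not the real MXNet CI endpoints:

    import requests

    JENKINS = "https://jenkins.example.org"  # placeholder URL
    JOBS = ["sanity", "unix-cpu", "unix-gpu", "centos-cpu", "windows-cpu"]  # assumed names

    def failure_rate(job, limit=100):
        # The Jenkins JSON API can return just the result of the last N builds.
        url = f"{JENKINS}/job/{job}/api/json?tree=builds[result]{{0,{limit}}}"
        builds = requests.get(url, timeout=30).json()["builds"]
        finished = [b for b in builds if b["result"] is not None]
        failed = [b for b in finished if b["result"] != "SUCCESS"]
        return len(failed) / max(len(finished), 1)

    for job in JOBS:
        print(f"{job:12s} {failure_rate(job):6.1%}")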

I think unix-cpu needs to be broken apart. It's too complex and fails
in multiple ways. Isolate the brittle parts. Then we can
restart/disable those as needed, while all of the other parts pass and
don't have to be rerun.


Re: CI Pipeline Change Proposal

2020-03-25 Thread Marco de Abreu
We had this structure in the past and the community was bothered by CI
taking more time, so we moved to the current model with everything
parallelized. We'd basically be reverting that.

Can you show by how much the duration will increase?
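
As a back-of-envelope illustration (not measured data), using the ~55% gate
failure rate cited in the proposal quoted below and assumed stage durations:

    # The 55% figure is from the proposal; every other number is an assumption.
    p_gate_fail = 0.55     # runs where sanity or unix-cpu fails
    gate_wall = 60         # assumed wall-clock minutes for sanity + unix-cpu
    rest_wall = 120        # assumed wall-clock minutes for the remaining pipelines
    rest_compute = 600     # assumed total machine-minutes of the remaining pipelines

    # Today everything starts in parallel; staggered, the rest only runs if the gate passes.
    current_wall = max(gate_wall, rest_wall)
    staggered_wall_expected = gate_wall + (1 - p_gate_fail) * rest_wall
    staggered_wall_worst = gate_wall + rest_wall
    compute_saved = p_gate_fail * rest_compute

    print(f"wall-clock per run: {current_wall:.0f} min now, "
          f"{staggered_wall_expected:.0f} min expected / {staggered_wall_worst:.0f} min worst case staggered")
    print(f"expected machine-minutes saved per run: {compute_saved:.0f}")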

Also, we have zero test parallelisation; that is, we are running one test at
a time on 72-core machines (albeit across multiple workers). Wouldn't it be
way more efficient to add parallelisation and thus heavily reduce the time
spent on the tasks, instead of staggering?
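
For example, if the suite runs under pytest, pytest-xdist could spread tests
across the available cores; whether the MXNet tests are safe to run this way
(shared state, GPU contention) is a separate question, and the test path
below is only illustrative:

    import pytest

    if __name__ == "__main__":
        # "-n auto" comes from the pytest-xdist plugin and starts one worker per core.
        raise SystemExit(pytest.main(["-n", "auto", "tests/python/unittest"]))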

I'm concerned that these cost-saving measures are being paid for in the form
of a worse user experience. I see big potential to save costs by increasing
efficiency while actually improving the user experience, since CI would
become faster.

-Marco

Joe Evans  wrote on Wed., 25 March 2020, 04:58:

> Hi,
>
>
> First, I just wanted to introduce myself to the MXNet community. I’m Joe
> and will be working with Chai and the AWS team to improve some issues
> around MXNet CI. One of our goals is to reduce the costs associated with
> running MXNet CI. The task I’m working on now is this issue:
>
>
> https://github.com/apache/incubator-mxnet/issues/17802
>
>
> Proposal: Staggered Jenkins CI pipeline
>
>
> Based on data collected from Jenkins, around 55% of the time when the
> mxnet-validation CI build is triggered by a PR, either the sanity or
> unix-cpu builds fail. When either of these builds fails, it doesn’t make
> sense to run the rest of the pipelines and utilize all those resources if
> we’ve already identified a build or unit test failure.
>
>
> We are proposing changing the MXNet Jenkins CI pipeline by requiring the
> *sanity* and *unix-cpu* builds to complete and pass tests successfully
> before starting the other build pipelines (centos-cpu/gpu, unix-gpu,
> windows-cpu/gpu, etc.). Once the sanity builds successfully complete, the
> remaining build pipelines will be triggered and run in parallel (as they
> currently do). The purpose of this change is to identify faulty code or
> compatibility issues early and prevent further execution of CI builds. This
> will increase the time required to test a PR, but will prevent unnecessary
> builds from running.
>
>
> Does anyone have any concerns with this change or suggestions?
>
>
> Thanks.
>
> Joe Evans
>
> joseph.ev...@gmail.com
>