Re: [NOTIFICATION] CI BACK ONLINE

2020-03-18 Thread sandeep krishnamurthy
Thank you for taking care of it Zi.

Best,
Sandeep

On Wed, 18 Mar 2020, 6:52 pm Chaitanya Bapat,  wrote:

> Thanks Zi!
>
> On Wed, 18 Mar 2020 at 15:48, Patrick Mu  wrote:
>
> > Dear Community,
> >
> > We have restarted the CI master, with a more powerful instance with
> larger
> > network and IO bandwidth.
> >
> > Now CI is fully back online, and you can retrigger any pending PRs now.
> >
> > Thanks,
> > Ziyi
> >
>
>
> --
> *Chaitanya Prakash Bapat*
> *+1 (973) 953-6299*
>
> [image: https://www.linkedin.com//in/chaibapat25]
> [image: https://www.facebook.com/chaibapat
> ]
> [image:
> https://twitter.com/ChaiBapchya] [image:
> https://www.linkedin.com//in/chaibapat25]
> 
>


Re: [NOTIFICATION] CI BACK ONLINE

2020-03-18 Thread Chaitanya Bapat
Thanks Zi!

On Wed, 18 Mar 2020 at 15:48, Patrick Mu  wrote:

> Dear Community,
>
> We have restarted the CI master, with a more powerful instance with larger
> network and IO bandwidth.
>
> Now CI is fully back online, and you can retrigger any pending PRs now.
>
> Thanks,
> Ziyi
>


-- 
*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*

[image: https://www.linkedin.com//in/chaibapat25]
[image: https://www.facebook.com/chaibapat]
[image:
https://twitter.com/ChaiBapchya] [image:
https://www.linkedin.com//in/chaibapat25]



Re: MXNet Bot Demo

2020-03-18 Thread Marco de Abreu
Hi, that's correct. But as stated previously, it's not an option to remove
the hook. For now, I'd like to see how the system behaves while it's
optional. Later on, we can talk about revisiting this decision. But to me
it's not an option to deploy an entirely new system and approach without
having a transition or even a timeframe in which we are able to fall back.

I'm happy to support the deployment of the bot and add an additional
webhook to enable it's functionality to support selective triggering by PR
authors and committers, but I will not support the disabling of automatic
triggering of branches or PRs.

-Marco

Chaitanya Bapat  schrieb am Mi., 18. März 2020, 21:00:

> Hey Marco,
>
> I thought currently every commit on PR and master triggers CI
> because
> a. github webhook points to Jenkins Server
> b. GH Webhook events trigger builds on Jenkins for all commits to any
> branch in apache/incubator-mxnet
> may it be master/PR/non-PR
> Reason:
> Because all the 3 types of branches are discovered by Jenkins (non-PR
> (including master) and PR)
>
> Proposal: Remove GitHub WebHook to Jenkins and replace with GH Webhook to
> Lambda
> But after I remove the github webhook that points to Jenkins : *N**o commit
> will trigger Jenkins build by default* (as Jenkins wont receive GH events)
> Only those that Bot deems fit will be triggered (using Jenkins API invoked
> by Lambda).
> Hence its needed to handle that case of master merge.
> Am I understanding this correctly?
>
> On Wed, 18 Mar 2020 at 04:23, Marco de Abreu 
> wrote:
>
> > Thanks Chai, sounds good to me. Could you elaborate a bit on the point
> > about triggering a CI run after the PR has been merged? We already to
> that
> > automatically for the master, so what's the benefit to do it twice?
> >
> > -Marco
> >
> > Chaitanya Bapat  schrieb am Mi., 18. März 2020,
> > 09:30:
> >
> > > Update:
> > >
> > > >  we can ensure that all CI runs ran on the commit that will be merged
> > > @Sam Skalicky  Branch Protection is added to
> > public
> > > MXNet repo. It ensures that for every PR to be merged, the CI passes.
> All
> > > the jobs selected "required" jobs will have to be green for the PR to
> be
> > > merged. Ofcourse, users with "Adminstrator" access can merge without it
> > but
> > > that's just a backdoor. It is the case now and will continue to be the
> > case
> > > with the inclusion of Bot.
> > >
> > > > easily verify that the CI has executed all runs on the commit that
> will
> > > be merged
> > > GitHub UI shows all the jobs and the status corresponding to it on
> every
> > > commit. That should suffice. For the merged commits, Repo -> Commits ->
> > > Commit ID (Status) can be tracked currently (only way that I know of).
> > > Moreover, it is beyond the scope of this project (and possibly out of
> our
> > > control since this is purely GitHub UI specific use-case).
> > >
> > > Thanks @przemyslaw for supporting the opt-in.
> > >
> > > Thanks everyone in the community for sharing concerns, voicing your
> > opinion
> > > and participating in the discussion.
> > > Thanks to those who attended the demo last Friday.
> > >
> > > Action items from that discussion
> > > 1. Handle master merge builds [Done]
> > > Bot runs entire CI suite after the PR is merged and comments on the PR
> > > about the same.
> > > Design decision :
> > > MXNet Bot comment about master merge build on the *merge commit vs PR*.
> > > After the PR is merged, Bot runs entire CI and comments the result of
> CI
> > > trigger on the PR (because it is easy to track on a PR rather than
> > > commenting inside the merge commit)
> > >
> > > 2. Idempotent condition
> > > In case of already running build, if an attempt is made to retrigger
> the
> > > job then what should be the response
> > > a. Not to re-trigger, let the ongoing build continue till completion
> > > b. End the ongoing build and re-trigger
> > > c. Let the ongoing build continue, re-trigger new build
> > >
> > > From resource saving point of view, *c* looks costly and a can be
> > > avoided/optimized by B.
> > > In case when a re-trigger was started "erroneously" then killing
> ongoing
> > > build and re-trigger is a waste.
> > > In case when ongoing build failed in one sub-part, then re-triggering
> is
> > > justified.
> > > Erroneous re-triggers would be less often while conscious re-triggers
> to
> > > suppress failure is more common use-case. It looks like a safe
> assumption
> > > to make given the trade-off.
> > > [Open to debate]
> > >
> > > 3. Add security consideration [Use of secret manager, but without
> > > auto-rotation due to Jenkins manual config requirement] [Done]
> > > 4. New PR Instruction message by the Bot [Done]
> > > Thanks to the suggestion of Leonard, supported by others. I've now
> added
> > > the feature where the Bot comments a help message. [For reference -
> > > https://github.com/ChaiBapchya/incubator-mxnet/pull/52]
> > >
> > > Barring the opt-in vs opt-out debate & idempotency, consensus 

[NOTIFICATION] CI BACK ONLINE

2020-03-18 Thread Patrick Mu
Dear Community,

We have restarted the CI master, with a more powerful instance with larger 
network and IO bandwidth.
 
Now CI is fully back online, and you can retrigger any pending PRs now.

Thanks,
Ziyi


Re: MXNet Bot Demo

2020-03-18 Thread Chaitanya Bapat
Hey Marco,

I thought currently every commit on PR and master triggers CI
because
a. github webhook points to Jenkins Server
b. GH Webhook events trigger builds on Jenkins for all commits to any
branch in apache/incubator-mxnet
may it be master/PR/non-PR
Reason:
Because all the 3 types of branches are discovered by Jenkins (non-PR
(including master) and PR)

Proposal: Remove GitHub WebHook to Jenkins and replace with GH Webhook to
Lambda
But after I remove the github webhook that points to Jenkins : *N**o commit
will trigger Jenkins build by default* (as Jenkins wont receive GH events)
Only those that Bot deems fit will be triggered (using Jenkins API invoked
by Lambda).
Hence its needed to handle that case of master merge.
Am I understanding this correctly?

On Wed, 18 Mar 2020 at 04:23, Marco de Abreu 
wrote:

> Thanks Chai, sounds good to me. Could you elaborate a bit on the point
> about triggering a CI run after the PR has been merged? We already to that
> automatically for the master, so what's the benefit to do it twice?
>
> -Marco
>
> Chaitanya Bapat  schrieb am Mi., 18. März 2020,
> 09:30:
>
> > Update:
> >
> > >  we can ensure that all CI runs ran on the commit that will be merged
> > @Sam Skalicky  Branch Protection is added to
> public
> > MXNet repo. It ensures that for every PR to be merged, the CI passes. All
> > the jobs selected "required" jobs will have to be green for the PR to be
> > merged. Ofcourse, users with "Adminstrator" access can merge without it
> but
> > that's just a backdoor. It is the case now and will continue to be the
> case
> > with the inclusion of Bot.
> >
> > > easily verify that the CI has executed all runs on the commit that will
> > be merged
> > GitHub UI shows all the jobs and the status corresponding to it on every
> > commit. That should suffice. For the merged commits, Repo -> Commits ->
> > Commit ID (Status) can be tracked currently (only way that I know of).
> > Moreover, it is beyond the scope of this project (and possibly out of our
> > control since this is purely GitHub UI specific use-case).
> >
> > Thanks @przemyslaw for supporting the opt-in.
> >
> > Thanks everyone in the community for sharing concerns, voicing your
> opinion
> > and participating in the discussion.
> > Thanks to those who attended the demo last Friday.
> >
> > Action items from that discussion
> > 1. Handle master merge builds [Done]
> > Bot runs entire CI suite after the PR is merged and comments on the PR
> > about the same.
> > Design decision :
> > MXNet Bot comment about master merge build on the *merge commit vs PR*.
> > After the PR is merged, Bot runs entire CI and comments the result of CI
> > trigger on the PR (because it is easy to track on a PR rather than
> > commenting inside the merge commit)
> >
> > 2. Idempotent condition
> > In case of already running build, if an attempt is made to retrigger the
> > job then what should be the response
> > a. Not to re-trigger, let the ongoing build continue till completion
> > b. End the ongoing build and re-trigger
> > c. Let the ongoing build continue, re-trigger new build
> >
> > From resource saving point of view, *c* looks costly and a can be
> > avoided/optimized by B.
> > In case when a re-trigger was started "erroneously" then killing ongoing
> > build and re-trigger is a waste.
> > In case when ongoing build failed in one sub-part, then re-triggering is
> > justified.
> > Erroneous re-triggers would be less often while conscious re-triggers to
> > suppress failure is more common use-case. It looks like a safe assumption
> > to make given the trade-off.
> > [Open to debate]
> >
> > 3. Add security consideration [Use of secret manager, but without
> > auto-rotation due to Jenkins manual config requirement] [Done]
> > 4. New PR Instruction message by the Bot [Done]
> > Thanks to the suggestion of Leonard, supported by others. I've now added
> > the feature where the Bot comments a help message. [For reference -
> > https://github.com/ChaiBapchya/incubator-mxnet/pull/52]
> >
> > Barring the opt-in vs opt-out debate & idempotency, consensus was quickly
> > reached for the rest.
> >
> > In the coming days, I hope to roll-out this feature into Prod (public
> > MXNet) for all devs to use.
> >
> > Thanks,
> > Chai
> >
> > On Mon, 16 Mar 2020 at 11:57, Marco de Abreu 
> > wrote:
> >
> > > Well that's generally a problem with a deferred CI approach (CI is run
> at
> > > commit and not at merge time). This can either be solved through the
> > other
> > > proposal that's currently on dev@, by having a bot which does merges
> by
> > > having a global lock and a merge queue or by accepting the issue.
> Reality
> > > right now is that we're running that model where two PRs which are
> merged
> > > in parallel might break one another. One thing to consider though is
> that
> > > this breakage would have to be introduced in two separate parts since
> > > otherwise there'd be merge conflicts. I think we had that situation
> twice
> > > 

Re: [NOTIFICATION] CI Restart

2020-03-18 Thread Aaron Markham
I noticed at point the iops limit was getting hit.
Upgrading the storage specs could be a great idea too.

On Wed, Mar 18, 2020, 11:59 Patrick Mu  wrote:

> Dear Community,
>
> Our developers have identified frequently occurrence of "Cannot contact
> " issue
> in our CI system. Sheng and Leonard have helped to investigate this and
> have found the CI master's network bandwidth reaching limit is probably the
> culprit of the issue. To remove the burden of repeated CI retriggering from
> developers, we decided to take the following steps:
>
> 1) Stop the CI Jenkins master
> 2) Resize the CI master instance to a larger instance for more network
> bandwidth capacity
> 3) Restart the master
>
> The workflow will take less than 1 hour to complete (ideally 5-10 mins).
>
> In the meanwhile, if you already have PRs currently running in the CI,
> please resubmit your PRs to make sure they will run the pipeline after
> restart.
>
> We are sorry for any inconvenience caused.
>
> Best Regards,
>
> Ziyi
>


[NOTIFICATION] CI Restart

2020-03-18 Thread Patrick Mu
Dear Community,

Our developers have identified frequently occurrence of "Cannot contact 
" issue 
in our CI system. Sheng and Leonard have helped to investigate this and have 
found the CI master's network bandwidth reaching limit is probably the culprit 
of the issue. To remove the burden of repeated CI retriggering from developers, 
we decided to take the following steps:

1) Stop the CI Jenkins master
2) Resize the CI master instance to a larger instance for more network 
bandwidth capacity
3) Restart the master

The workflow will take less than 1 hour to complete (ideally 5-10 mins).

In the meanwhile, if you already have PRs currently running in the CI, please 
resubmit your PRs to make sure they will run the pipeline after restart.

We are sorry for any inconvenience caused.

Best Regards,

Ziyi


Re: Requesting slack access

2020-03-18 Thread Chaitanya Bapat
Hey Haohao,

Welcome to the MXNet Community!
Have sent you a request on the email id (wei...@sjtu.edu.cn)

Thanks
Chai

On Wed, 18 Mar 2020 at 11:06, Haohao  wrote:

> Hello, I am writing to request the slack access of mxnet slack channel.
>
> Thank you.
>


-- 
*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*

[image: https://www.linkedin.com//in/chaibapat25]
[image: https://www.facebook.com/chaibapat]
[image:
https://twitter.com/ChaiBapchya] [image:
https://www.linkedin.com//in/chaibapat25]



Re: [apache/incubator-mxnet] [RFC] Custom subgraph property enhancements (#17236)

2020-03-18 Thread Sam Skalicky
Current implementation (as of 3/18/2020) only supports combining ops into 
subgraphs generically. does not allow for fine grained control to say which ops 
should go into which subgraphs. Need to modify `supportedOps` API to change the 
`ids` vector from bool (supported or not) to integer (to say which subgraph 
this node should go in to). This will allow for coloring and fine grained 
control of ops in a subgraph. 
https://github.com/apache/incubator-mxnet/blob/07d0b738384cba011ce519bcc41aeeaaafcd22c1/example/extensions/lib_subgraph/subgraph_lib.cc#L179-L181
Will also need to modify custom_subgraph_property.h selector to use these int 
values to define the subgraph instead of just looking for existence of the 
operator name.
https://github.com/apache/incubator-mxnet/blob/41d534be3ffd7b45846f555366230f4e811d8751/src/operator/subgraph/partitioner/custom_subgraph_property.h#L47-L54

Open question:
How do we handle the cases where we want to remove args/aux from the top level 
of the model, migrating those into the subgraph? If we compile a subgraph with 
Args/Aux then those tensor values are effectively frozen in the compiled binary 
so we need to be clear with users that they can no longer be changed in the top 
level model

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17236#issuecomment-600786482

Requesting slack access

2020-03-18 Thread Haohao
Hello, I am writing to request the slack access of mxnet slack channel.

Thank you.


Re: MXNet Bot Demo

2020-03-18 Thread Marco de Abreu
Thanks Chai, sounds good to me. Could you elaborate a bit on the point
about triggering a CI run after the PR has been merged? We already to that
automatically for the master, so what's the benefit to do it twice?

-Marco

Chaitanya Bapat  schrieb am Mi., 18. März 2020, 09:30:

> Update:
>
> >  we can ensure that all CI runs ran on the commit that will be merged
> @Sam Skalicky  Branch Protection is added to public
> MXNet repo. It ensures that for every PR to be merged, the CI passes. All
> the jobs selected "required" jobs will have to be green for the PR to be
> merged. Ofcourse, users with "Adminstrator" access can merge without it but
> that's just a backdoor. It is the case now and will continue to be the case
> with the inclusion of Bot.
>
> > easily verify that the CI has executed all runs on the commit that will
> be merged
> GitHub UI shows all the jobs and the status corresponding to it on every
> commit. That should suffice. For the merged commits, Repo -> Commits ->
> Commit ID (Status) can be tracked currently (only way that I know of).
> Moreover, it is beyond the scope of this project (and possibly out of our
> control since this is purely GitHub UI specific use-case).
>
> Thanks @przemyslaw for supporting the opt-in.
>
> Thanks everyone in the community for sharing concerns, voicing your opinion
> and participating in the discussion.
> Thanks to those who attended the demo last Friday.
>
> Action items from that discussion
> 1. Handle master merge builds [Done]
> Bot runs entire CI suite after the PR is merged and comments on the PR
> about the same.
> Design decision :
> MXNet Bot comment about master merge build on the *merge commit vs PR*.
> After the PR is merged, Bot runs entire CI and comments the result of CI
> trigger on the PR (because it is easy to track on a PR rather than
> commenting inside the merge commit)
>
> 2. Idempotent condition
> In case of already running build, if an attempt is made to retrigger the
> job then what should be the response
> a. Not to re-trigger, let the ongoing build continue till completion
> b. End the ongoing build and re-trigger
> c. Let the ongoing build continue, re-trigger new build
>
> From resource saving point of view, *c* looks costly and a can be
> avoided/optimized by B.
> In case when a re-trigger was started "erroneously" then killing ongoing
> build and re-trigger is a waste.
> In case when ongoing build failed in one sub-part, then re-triggering is
> justified.
> Erroneous re-triggers would be less often while conscious re-triggers to
> suppress failure is more common use-case. It looks like a safe assumption
> to make given the trade-off.
> [Open to debate]
>
> 3. Add security consideration [Use of secret manager, but without
> auto-rotation due to Jenkins manual config requirement] [Done]
> 4. New PR Instruction message by the Bot [Done]
> Thanks to the suggestion of Leonard, supported by others. I've now added
> the feature where the Bot comments a help message. [For reference -
> https://github.com/ChaiBapchya/incubator-mxnet/pull/52]
>
> Barring the opt-in vs opt-out debate & idempotency, consensus was quickly
> reached for the rest.
>
> In the coming days, I hope to roll-out this feature into Prod (public
> MXNet) for all devs to use.
>
> Thanks,
> Chai
>
> On Mon, 16 Mar 2020 at 11:57, Marco de Abreu 
> wrote:
>
> > Well that's generally a problem with a deferred CI approach (CI is run at
> > commit and not at merge time). This can either be solved through the
> other
> > proposal that's currently on dev@, by having a bot which does merges by
> > having a global lock and a merge queue or by accepting the issue. Reality
> > right now is that we're running that model where two PRs which are merged
> > in parallel might break one another. One thing to consider though is that
> > this breakage would have to be introduced in two separate parts since
> > otherwise there'd be merge conflicts. I think we had that situation twice
> > so far and the result was a quick revert, so I'd say that it's a problem
> > that can happily be accepted. All other solutions basically require some
> > form of single-threaded and globally locked solution which limits us in
> > scalability. I'd recommend to just accept that risk and revert a PR in
> case
> > it actually had a conflict.
> >
> > -Marco
> >
> > On Mon, Mar 16, 2020 at 6:29 PM Skalicky, Sam  >
> > wrote:
> >
> > > We probably need some way to track which CI runs ran for which commit
> > too,
> > > that way we can ensure that all CI runs ran on the commit that will be
> > > merged.  Maybe the bot can comment with the commit hash when users
> > command
> > > it to do something. Although since users can trigger individual CI runs
> > its
> > > possible to have some commits run some CI runs but not others. We need
> > some
> > > way to easily verify that the CI has executed all runs on the commit
> that
> > > will be merged.
> > >
> > > Sam
> > >
> > > > On Mar 13, 2020, at 8:28 PM, Przemysław 

Re: MXNet Bot Demo

2020-03-18 Thread Chaitanya Bapat
Update:

>  we can ensure that all CI runs ran on the commit that will be merged
@Sam Skalicky  Branch Protection is added to public
MXNet repo. It ensures that for every PR to be merged, the CI passes. All
the jobs selected "required" jobs will have to be green for the PR to be
merged. Ofcourse, users with "Adminstrator" access can merge without it but
that's just a backdoor. It is the case now and will continue to be the case
with the inclusion of Bot.

> easily verify that the CI has executed all runs on the commit that will
be merged
GitHub UI shows all the jobs and the status corresponding to it on every
commit. That should suffice. For the merged commits, Repo -> Commits ->
Commit ID (Status) can be tracked currently (only way that I know of).
Moreover, it is beyond the scope of this project (and possibly out of our
control since this is purely GitHub UI specific use-case).

Thanks @przemyslaw for supporting the opt-in.

Thanks everyone in the community for sharing concerns, voicing your opinion
and participating in the discussion.
Thanks to those who attended the demo last Friday.

Action items from that discussion
1. Handle master merge builds [Done]
Bot runs entire CI suite after the PR is merged and comments on the PR
about the same.
Design decision :
MXNet Bot comment about master merge build on the *merge commit vs PR*.
After the PR is merged, Bot runs entire CI and comments the result of CI
trigger on the PR (because it is easy to track on a PR rather than
commenting inside the merge commit)

2. Idempotent condition
In case of already running build, if an attempt is made to retrigger the
job then what should be the response
a. Not to re-trigger, let the ongoing build continue till completion
b. End the ongoing build and re-trigger
c. Let the ongoing build continue, re-trigger new build

>From resource saving point of view, *c* looks costly and a can be
avoided/optimized by B.
In case when a re-trigger was started "erroneously" then killing ongoing
build and re-trigger is a waste.
In case when ongoing build failed in one sub-part, then re-triggering is
justified.
Erroneous re-triggers would be less often while conscious re-triggers to
suppress failure is more common use-case. It looks like a safe assumption
to make given the trade-off.
[Open to debate]

3. Add security consideration [Use of secret manager, but without
auto-rotation due to Jenkins manual config requirement] [Done]
4. New PR Instruction message by the Bot [Done]
Thanks to the suggestion of Leonard, supported by others. I've now added
the feature where the Bot comments a help message. [For reference -
https://github.com/ChaiBapchya/incubator-mxnet/pull/52]

Barring the opt-in vs opt-out debate & idempotency, consensus was quickly
reached for the rest.

In the coming days, I hope to roll-out this feature into Prod (public
MXNet) for all devs to use.

Thanks,
Chai

On Mon, 16 Mar 2020 at 11:57, Marco de Abreu 
wrote:

> Well that's generally a problem with a deferred CI approach (CI is run at
> commit and not at merge time). This can either be solved through the other
> proposal that's currently on dev@, by having a bot which does merges by
> having a global lock and a merge queue or by accepting the issue. Reality
> right now is that we're running that model where two PRs which are merged
> in parallel might break one another. One thing to consider though is that
> this breakage would have to be introduced in two separate parts since
> otherwise there'd be merge conflicts. I think we had that situation twice
> so far and the result was a quick revert, so I'd say that it's a problem
> that can happily be accepted. All other solutions basically require some
> form of single-threaded and globally locked solution which limits us in
> scalability. I'd recommend to just accept that risk and revert a PR in case
> it actually had a conflict.
>
> -Marco
>
> On Mon, Mar 16, 2020 at 6:29 PM Skalicky, Sam 
> wrote:
>
> > We probably need some way to track which CI runs ran for which commit
> too,
> > that way we can ensure that all CI runs ran on the commit that will be
> > merged.  Maybe the bot can comment with the commit hash when users
> command
> > it to do something. Although since users can trigger individual CI runs
> its
> > possible to have some commits run some CI runs but not others. We need
> some
> > way to easily verify that the CI has executed all runs on the commit that
> > will be merged.
> >
> > Sam
> >
> > > On Mar 13, 2020, at 8:28 PM, Przemysław Trędak 
> > wrote:
> > >
> > > CAUTION: This email originated from outside of the organization. Do not
> > click links or open attachments unless you can confirm the sender and
> know
> > the content is safe.
> > >
> > >
> > >
> > > I personally like the idea of opt-in more than opt-out:
> > > - ultimately PR author wants the PR to be merged so they (or committer
> > reviewing the PR) will trigger the CI
> > > - if it is easy to trigger the PR via the bot command then the amount