CI Broke AF

2018-03-29 Thread Chris Olivier
Something got checked in and CI is nuked. I tried reverting my last commit,
and that didn’t fix it, so apparently it wasn’t that.

Anyone have any ideas? It is super-broken: unit tests failing like crazy,
GPU builds hanging on shutdown. No successful builds today at all.


Re: PR build failed because of git errors

2018-03-29 Thread Haibin Lin
I've seen this before. Try rebasing and force pushing.

On Thu, Mar 29, 2018 at 3:51 PM, Indhu  wrote:

> Hi,
>
> Looks like PR #10039 build failed because of git errors. Here is the error
> log:
> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/PR-10039/4/console.
> Does someone know what could be happening here?
>
> Build error:
>
> Adding as 3rdparty/dlpack~7c28089749287f42ea8f41abd1358e6dbac54187 instead
> Automatic merge failed; fix conflicts and then commit the result.
>
> stderr:
> at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1990)
> at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1958)
> at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1954)
> at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1592)
> at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$3.execute(CliGitAPIImpl.java:692)
> at jenkins.plugins.git.MergeWithGitSCMExtension.decorateRevisionToBuild(MergeWithGitSCMExtension.java:122)
> at hudson.plugins.git.GitSCM.determineRevisionToBuild(GitSCM.java:1068)
> at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1161)
> at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:113)
> at org.jenkinsci.plugins.workflow.cps.CpsScmFlowDefinition.create(CpsScmFlowDefinition.java:130)
> at org.jenkinsci.plugins.workflow.multibranch.SCMBinder.create(SCMBinder.java:120)
> at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:263)
> at hudson.model.ResourceController.execute(ResourceController.java:97)
> at hudson.model.Executor.run(Executor.java:429)
> Finished: FAILURE
>
> Thanks,
> Indu
>


PR build failed because of git errors

2018-03-29 Thread Indhu
Hi,

Looks like PR #10039 build failed because of git errors. Here is the error
log:
http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/PR-10039/4/console.
Does someone know what could be happening here?

Build error:

Adding as 3rdparty/dlpack~7c28089749287f42ea8f41abd1358e6dbac54187 instead
Automatic merge failed; fix conflicts and then commit the result.

stderr:
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1990)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1958)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1954)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1592)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$3.execute(CliGitAPIImpl.java:692)
at jenkins.plugins.git.MergeWithGitSCMExtension.decorateRevisionToBuild(MergeWithGitSCMExtension.java:122)
at hudson.plugins.git.GitSCM.determineRevisionToBuild(GitSCM.java:1068)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1161)
at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:113)
at org.jenkinsci.plugins.workflow.cps.CpsScmFlowDefinition.create(CpsScmFlowDefinition.java:130)
at org.jenkinsci.plugins.workflow.multibranch.SCMBinder.create(SCMBinder.java:120)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:263)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:429)
Finished: FAILURE

Thanks,
Indu


Re: Killed builds

2018-03-29 Thread Marco de Abreu
Thank you, Chris!

What's interesting here (e.g. at [1]) is the fact that all tests actually
finish, but the process does not terminate. I have seen this behaviour in my
past C# and Java projects. In those cases it was caused by threads being
created as foreground (non-daemon) threads, or by thread pools that were never
shut down, which kept the process alive until it was explicitly terminated.
Does anybody remember a change to the threading in the last few days, or have
a better idea what this could be related to?
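
To illustrate the kind of thing I mean, here is a minimal generic Python
sketch (not MXNet code): a non-daemon worker thread, or a thread pool that is
never shut down, keeps the interpreter alive even after the main work is done.

    import threading
    import time

    def background_worker():
        while True:            # never exits
            time.sleep(60)

    # daemon=False (the default) means the interpreter waits for this thread
    # at shutdown, so the process looks "hung" after the test run completes.
    t = threading.Thread(target=background_worker, daemon=False)
    t.start()

    print("all tests finished")   # prints, yet the process never terminates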

-Marco

[1]:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10308/5/pipeline/586

On Thu, Mar 29, 2018 at 7:44 PM, Chris Olivier 
wrote:

> I killed several builds which were > 11 hours old -- all stuck at this
> python3 GPU hang problem
>


Re: CI Python 3 GPU

2018-03-29 Thread Marco de Abreu
Thanks for looking into this! Did this happen in no particular job, or could
it be pinned down to a single configuration? We have never had hangs like this
before, so it definitely seems related to a recent change.

-Marco

On Thu, Mar 29, 2018 at 7:26 PM, kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Debugging this a bit with Chris.  I haven't looked at it closely but it
> seems like there might be a genuine hang here between
> CuDNNConvolutionOp::SelectAlgo  and a customop lambda invoke.  What
> do you guys think?
>
> Stack is here:
> https://gist.github.com/KellenSunderland/84aa9bb7270c0483eeccde6f08e91489
>
> -Kellen
>
> On Thu, Mar 29, 2018 at 6:24 PM, Chris Olivier 
> wrote:
>
> > Seems like a lot of PR builds are hanging at (what appears to be) the end
> > of Python 3 GPU unit tests.  Anyone have any idea why that might be?
> >
>


Killed builds

2018-03-29 Thread Chris Olivier
I killed several builds which were > 11 hours old -- all stuck at this
python3 GPU hang problem


Re: CI Python 3 GPU

2018-03-29 Thread kellen sunderland
Debugging this a bit with Chris.  I haven't looked at it closely but it
seems like there might be a genuine hang here between
CuDNNConvolutionOp::SelectAlgo  and a customop lambda invoke.  What
do you guys think?

Stack is here:
https://gist.github.com/KellenSunderland/84aa9bb7270c0483eeccde6f08e91489
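
In case a generic picture helps: this is a hypothetical Python sketch of the
classic lock-ordering deadlock, not the actual MXNet code paths. Two threads
acquiring the same pair of locks in opposite order will each wait on the other
forever, which would look exactly like these idle, never-finishing GPU jobs.

    import threading
    import time

    lock_a = threading.Lock()   # stand-in for state guarded during algo selection
    lock_b = threading.Lock()   # stand-in for state guarded by custom-op dispatch

    def thread_one():
        with lock_a:
            time.sleep(0.1)     # widen the race window so the hang is reproducible
            with lock_b:        # blocks forever: thread_two already holds lock_b
                pass

    def thread_two():
        with lock_b:
            time.sleep(0.1)
            with lock_a:        # blocks forever: thread_one already holds lock_a
                pass

    t1 = threading.Thread(target=thread_one)
    t2 = threading.Thread(target=thread_two)
    t1.start(); t2.start()
    t1.join(); t2.join()        # never returns once both first locks are taken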

-Kellen

On Thu, Mar 29, 2018 at 6:24 PM, Chris Olivier 
wrote:

> Seems like a lot of PR builds are hanging at (what appears to be) the end
> of Python 3 GPU unit tests.  Anyone have any idea why that might be?
>


RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Ye, Zhouhai
For our current POC we chose option (b): add mpi.kvstore in Python. It depends
on a new mxnet submodule, mpi_collectives (a C++ library that in turn depends
on mxnet), and adds a new type of kvstore at the Python layer.

mpi_collectives doesn't need to be a standalone C++ library; its source code
can be compiled into libmxnet.so.
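
Purely for illustration, a rough Python sketch of what such a python-layer
kvstore delegating to an mpi_collectives backend might look like. All names
and signatures below are hypothetical; the real interface is the one in the
design doc.

    # Hypothetical sketch only -- module, class and method names are
    # illustrative, not the actual mpi_collectives API from the design doc.
    class MPIKVStore(object):
        """Python-layer kvstore delegating to an MPI allreduce backend."""

        def __init__(self, backend):
            # 'backend' stands for the compiled mpi_collectives code, whether
            # built as a separate library or compiled into libmxnet.so.
            self._backend = backend

        def broadcast(self, key, value, root=0):
            # rank 'root' distributes its value so all workers start identical
            self._backend.broadcast(key, value, root)

        def pushpull(self, key, grad, out):
            # sum the gradient across workers, then average locally;
            # this replaces the separate push/pull of the PS-based kvstore
            self._backend.allreduce(key, grad, out)
            out /= self._backend.size()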


From: Ye, Zhouhai
Sent: Tuesday, March 27, 2018 11:21 AM
To: Nan Zhu ; dev@mxnet.incubator.apache.org
Cc: Li, Mu ; Lv, Tao A ; Ma, Guokai 
; Rahul Huilgol ; Ye, Jason Y 
; Zhang, Rong A ; Zhao, Patric 

Subject: RE: Extend MXNET distributed training with MPI AllReduce

You can check the mpi.kvstore API spec in our design doc.

For example, we add pushpull and broadcast interfaces and disable the original
push and pull in the new kvstore.

From: Ye, Zhouhai
Sent: Tuesday, March 27, 2018 11:18 AM
To: 'Nan Zhu'; dev@mxnet.incubator.apache.org
Cc: Li, Mu; Lv, Tao A; Ma, Guokai; Rahul Huilgol; Ye, Jason Y; Zhang, Rong A; Zhao, Patric
Subject: RE: Extend MXNET distributed training with MPI AllReduce

Hi Nan Zhu,

As we described in our design doc, there are two possible code structures
(implementations); we currently implement the second in our POC:


a.  Implement mpi.kvstore at the same level as the current kvstores (C++,
src/kvstore), adhering to the original kvstore factory pattern.


b.  Add mpi.kvstore in Python. It depends on a new mxnet submodule,
mpi_collectives (a C++ library that in turn depends on mxnet), and adds a new
type of kvstore at the Python layer.


For your second question, I think making a single communication submodule is
OK (just like a.), but a unified abstraction for both PS and AllReduce is very
hard.


From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Tuesday, March 27, 2018 10:39 AM
To: dev@mxnet.incubator.apache.org
Cc: Li, Mu; Lv, Tao A; Ma, Guokai; Rahul Huilgol; Ye, Jason Y; Ye, Zhouhai; Zhang, Rong A; Zhao, Patric
Subject: Re: Extend MXNET distributed training with MPI AllReduce

Hi Patric,

It's pretty nice work!

A question:

How would the future code structure look when this allreduce module is added
as a submodule? Will we have two communication submodules?

Is there any plan to provide a unified abstraction for communication so that a
single communication submodule is possible?

Best,

Nan


On Mon, Mar 26, 2018 at 7:20 PM, Chris Olivier wrote:
great! nice work!

On Mon, Mar 26, 2018 at 6:31 PM Zhao, Patric wrote:

> Hi MXNET owners/developers,
>
> As you know, AllReduce and Parameter Server are two very popular
> distributed training modes in DL.
>
> Currently, MXNET only supports the parameter server mode and lacks an
> AllReduce mode. Other frameworks, like TensorFlow, PyTorch, Caffe, etc., can
> work with AllReduce.
> Based on our analysis and experiments, AllReduce mode can achieve
> better scalability and efficiency.
>
> So, we propose to extend MXNET distributed training with an MPI AllReduce
> mode.
> We have implemented an AllReduce prototype in MXNET and the results are
> very positive.
> AllReduce mode reaches 94.7% scaling efficiency with 8 compute nodes for VGG16,
> while the Parameter Server mode requires 16 nodes in total (8 compute nodes + 8
> parameter servers) to reach 93.2%.
>
> The whole proposal is available in the MXNET wiki. Any feedback is highly
> appreciated.
>
> https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce
>
> Thanks in advance.
>
> BR,
>
> --Patric
>
>
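
For anyone less familiar with the allreduce mode described in the quoted
proposal, here is a minimal, generic gradient-averaging sketch using mpi4py.
It is illustrative only and is not the POC implementation.

    # Generic allreduce-based gradient averaging, e.g. run with:
    #   mpirun -np 8 python allreduce_demo.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    local_grad = np.random.rand(1000).astype(np.float32)  # this worker's gradient
    global_grad = np.empty_like(local_grad)

    # Sum gradients across all workers instead of pushing them to parameter servers.
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= comm.Get_size()  # average

    # Every worker now holds the same averaged gradient and applies the update
    # locally, so no dedicated parameter-server nodes are needed.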



RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Ye, Zhouhai
You can check the mpi.kvstore API spec in our design doc.

For example, we add pushpull and broadcast interfaces and disable the original
push and pull in the new kvstore.
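
For flavour, a hypothetical sketch of how such an interface might be used from
a training loop. The kvstore type string and method names here are
illustrative guesses, not the actual spec; please refer to the design doc for
that.

    # Hypothetical usage sketch -- 'dist_sync_mpi', broadcast() and pushpull()
    # are illustrative names, not the actual API from the design doc.
    import mxnet as mx

    kv = mx.kv.create('dist_sync_mpi')     # assumed new kvstore type

    learning_rate = 0.01
    weight = mx.nd.ones((1024,))

    # Broadcast rank 0's initial parameters so all workers start identical.
    kv.broadcast('weight', weight, root=0)

    for step in range(100):
        grad = mx.nd.ones((1024,)) * 0.1   # stand-in for the real backward pass
        # pushpull replaces the separate push/pull of the PS-based kvstore:
        # allreduce the gradient across workers and write the result back.
        kv.pushpull('weight', grad, out=grad)
        weight -= learning_rate * grad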

From: Ye, Zhouhai
Sent: Tuesday, March 27, 2018 11:18 AM
To: 'Nan Zhu' ; dev@mxnet.incubator.apache.org
Cc: Li, Mu ; Lv, Tao A ; Ma, Guokai 
; Rahul Huilgol ; Ye, Jason Y 
; Zhang, Rong A ; Zhao, Patric 

Subject: RE: Extend MXNET distributed training with MPI AllReduce

Hi Nan Zhu,

As we described in our design doc, there are two possible code structures
(implementations); we currently implement the second in our POC:


a.  Implement mpi.kvstore at the same level as the current kvstores (C++,
src/kvstore), adhering to the original kvstore factory pattern.


b.  Add mpi.kvstore in Python. It depends on a new mxnet submodule,
mpi_collectives (a C++ library that in turn depends on mxnet), and adds a new
type of kvstore at the Python layer.


For your second question, I think making a single communication submodule is
OK (just like a.), but a unified abstraction for both PS and AllReduce is very
hard.


From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Tuesday, March 27, 2018 10:39 AM
To: dev@mxnet.incubator.apache.org
Cc: Li, Mu; Lv, Tao A; Ma, Guokai; Rahul Huilgol; Ye, Jason Y; Ye, Zhouhai; Zhang, Rong A; Zhao, Patric
Subject: Re: Extend MXNET distributed training with MPI AllReduce

Hi Patric,

It's pretty nice work!

A question:

How would the future code structure look when this allreduce module is added
as a submodule? Will we have two communication submodules?

Is there any plan to provide a unified abstraction for communication so that a
single communication submodule is possible?

Best,

Nan


On Mon, Mar 26, 2018 at 7:20 PM, Chris Olivier wrote:
great! nice work!

On Mon, Mar 26, 2018 at 6:31 PM Zhao, Patric wrote:

> Hi MXNET owners/developers,
>
> As you know, AllReduce and Parameter Server are two very popular
> distributed training modes in DL.
>
> Currently, MXNET only supports the parameter server mode and lacks an
> AllReduce mode. Other frameworks, like TensorFlow, PyTorch, Caffe, etc., can
> work with AllReduce.
> Based on our analysis and experiments, AllReduce mode can achieve
> better scalability and efficiency.
>
> So, we propose to extend MXNET distributed training with an MPI AllReduce
> mode.
> We have implemented an AllReduce prototype in MXNET and the results are
> very positive.
> AllReduce mode reaches 94.7% scaling efficiency with 8 compute nodes for VGG16,
> while the Parameter Server mode requires 16 nodes in total (8 compute nodes + 8
> parameter servers) to reach 93.2%.
>
> The whole proposal is available in the MXNET wiki. Any feedback is highly
> appreciated.
>
> https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce
>
> Thanks in advance.
>
> BR,
>
> --Patric
>
>



RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Ye, Zhouhai
Hi Nan Zhu,

As we described in our design doc, there are two possible code structures
(implementations); we currently implement the second in our POC:


a.  Implement mpi.kvstore at the same level as the current kvstores (C++,
src/kvstore), adhering to the original kvstore factory pattern.


b.  Add mpi.kvstore in Python. It depends on a new mxnet submodule,
mpi_collectives (a C++ library that in turn depends on mxnet), and adds a new
type of kvstore at the Python layer.


For your second question, I think making a single communication submodule is
OK (just like a.), but a unified abstraction for both PS and AllReduce is very
hard.


From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Tuesday, March 27, 2018 10:39 AM
To: dev@mxnet.incubator.apache.org
Cc: Li, Mu ; Lv, Tao A ; Ma, Guokai 
; Rahul Huilgol ; Ye, Jason Y 
; Ye, Zhouhai ; Zhang, Rong A 
; Zhao, Patric 
Subject: Re: Extend MXNET distributed training with MPI AllReduce

Hi Patric,

It's pretty nice work!

A question:

How would the future code structure look when this allreduce module is added
as a submodule? Will we have two communication submodules?

Is there any plan to provide a unified abstraction for communication so that a
single communication submodule is possible?

Best,

Nan


On Mon, Mar 26, 2018 at 7:20 PM, Chris Olivier wrote:
great! nice work!

On Mon, Mar 26, 2018 at 6:31 PM Zhao, Patric wrote:

> Hi MXNET owners/developers,
>
> As you know, AllReduce and Parameter Server are two very popular
> distributed training modes in DL.
>
> Currently, MXNET only supports the parameter server mode and lacks an
> AllReduce mode. Other frameworks, like TensorFlow, PyTorch, Caffe, etc., can
> work with AllReduce.
> Based on our analysis and experiments, AllReduce mode can achieve
> better scalability and efficiency.
>
> So, we propose to extend MXNET distributed training with an MPI AllReduce
> mode.
> We have implemented an AllReduce prototype in MXNET and the results are
> very positive.
> AllReduce mode reaches 94.7% scaling efficiency with 8 compute nodes for VGG16,
> while the Parameter Server mode requires 16 nodes in total (8 compute nodes + 8
> parameter servers) to reach 93.2%.
>
> The whole proposal is available in the MXNET wiki. Any feedback is highly
> appreciated.
>
> https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce
>
> Thanks in advance.
>
> BR,
>
> --Patric
>
>



RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Zhao, Patric
Actually, the current design structure is very similar to kvstore_nccl, as the
attached picture shows.

I have updated the proposal into a Google doc as well; it's easier to add
comments and make changes there.

https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094

Thanks,

--Patric


From: Ye, Zhouhai
Sent: Tuesday, March 27, 2018 4:30 PM
To: 'Nan Zhu' ; 'dev@mxnet.incubator.apache.org' 

Cc: 'Li, Mu' ; Lv, Tao A ; Ma, Guokai 
; 'Rahul Huilgol' ; Ye, Jason Y 
; Zhang, Rong A ; Zhao, Patric 

Subject: RE: Extend MXNET distributed training with MPI AllReduce

For our current POC we chose option (b): add mpi.kvstore in Python. It depends
on a new mxnet submodule, mpi_collectives (a C++ library that in turn depends
on mxnet), and adds a new type of kvstore at the Python layer.

mpi_collectives doesn't need to be a standalone C++ library; its source code
can be compiled into libmxnet.so.


From: Ye, Zhouhai
Sent: Tuesday, March 27, 2018 11:21 AM
To: Nan Zhu; dev@mxnet.incubator.apache.org
Cc: Li, Mu; Lv, Tao A; Ma, Guokai; Rahul Huilgol; Ye, Jason Y; Zhang, Rong A; Zhao, Patric
Subject: RE: Extend MXNET distributed training with MPI AllReduce

You can check the mpi.kvstore API spec in our design doc.

For example, we add pushpull and broadcast interfaces and disable the original
push and pull in the new kvstore.

From: Ye, Zhouhai
Sent: Tuesday, March 27, 2018 11:18 AM
To: 'Nan Zhu'; dev@mxnet.incubator.apache.org
Cc: Li, Mu; Lv, Tao A; Ma, Guokai; Rahul Huilgol; Ye, Jason Y; Zhang, Rong A; Zhao, Patric
Subject: RE: Extend MXNET distributed training with MPI AllReduce

Hi Nan Zhu,

As we described in our design doc, there are two possible code structures
(implementations); we currently implement the second in our POC:


a.  Implement mpi.kvstore at the same level as the current kvstores (C++,
src/kvstore), adhering to the original kvstore factory pattern.


b.  Add mpi.kvstore in Python. It depends on a new mxnet submodule,
mpi_collectives (a C++ library that in turn depends on mxnet), and adds a new
type of kvstore at the Python layer.


For your second question, I think making a single communication submodule is
OK (just like a.), but a unified abstraction for both PS and AllReduce is very
hard.


From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Tuesday, March 27, 2018 10:39 AM
To: dev@mxnet.incubator.apache.org
Cc: Li, Mu; Lv, Tao A; Ma, Guokai; Rahul Huilgol; Ye, Jason Y; Ye, Zhouhai; Zhang, Rong A; Zhao, Patric
Subject: Re: Extend MXNET distributed training with MPI AllReduce

Hi Patric,

It's pretty nice work!

A question:

How would the future code structure look when this allreduce module is added
as a submodule? Will we have two communication submodules?

Is there any plan to provide a unified abstraction for communication so that a
single communication submodule is possible?

Best,

Nan


On Mon, Mar 26, 2018 at 7:20 PM, Chris Olivier wrote:
great! nice work!

On Mon, Mar 26, 2018 at 6:31 PM Zhao, Patric wrote:

> Hi MXNET owners/developers,
>
> As you know, AllReduce and Parameter Server are two very popular
> distributed training modes in DL.
>
> Currently, MXNET only supports the parameter server mode and lacks an
> AllReduce mode. Other frameworks, like TensorFlow, PyTorch, Caffe, etc., can
> work with AllReduce.
>