CI Broke AF

2018-03-29 Thread Chris Olivier
Something got checked in and CI is nuked. I tried reverting my last commit, and that didn't fix it, so apparently it wasn't that. Anyone have any ideas? It is super-broken: unit tests failing like crazy, GPU builds hanging on shutdown. No successful builds today at all.

Re: PR build failed because of git errors

2018-03-29 Thread Haibin Lin
I've seen this before. Try rebasing and force pushing.

PR build failed because of git errors

2018-03-29 Thread Indhu
Hi, Looks like the PR #10039 build failed because of git errors. Here is the error log: http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/PR-10039/4/console. Does anyone know what could be happening here? Build error: Adding as 3rdparty/dlpack~7c28089749287f42ea8f41abd1358e6dbac54187

Re: Killed builds

2018-03-29 Thread Marco de Abreu
Thank you, Chris! What's interesting here (e.g. at [1]) is the fact that all tests actually finish, but the process does not terminate. I have experienced such behaviour in my past C# and Java projects. In those cases, it was related to threads being created as
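
The Python equivalent of that failure mode is a non-daemon thread that outlives the test body; a minimal sketch (not taken from the MXNet code base):

    import threading
    import time

    def worker():
        # Stand-in for a background task that never returns (e.g. polling a queue).
        while True:
            time.sleep(1)

    # threading.Thread is non-daemon by default, so it keeps the interpreter
    # alive even after the main thread has finished.
    t = threading.Thread(target=worker)
    t.start()

    print("tests finished")  # this prints, yet the process hangs here forever

Marking such helper threads as daemons (daemon=True) or joining them explicitly avoids this kind of hang.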

Re: CI Python 3 GPU

2018-03-29 Thread Marco de Abreu
Thanks for looking into this! Did this happen across jobs in general, or could it be pinned down to a single configuration? We have never had hangs like this before, so this definitely seems related to a recent change. -Marco

Killed builds

2018-03-29 Thread Chris Olivier
I killed several builds that were > 11 hours old -- all stuck on this Python 3 GPU hang problem.

Re: CI Python 3 GPU

2018-03-29 Thread kellen sunderland
Debugging this a bit with Chris. I haven't looked at it closely, but it seems like there might be a genuine hang here between CuDNNConvolutionOp::SelectAlgo and a custom-op lambda invoke. What do you guys think? Stack is here:
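
For anyone unfamiliar with the pattern being suggested here, a generic lock-ordering deadlock looks like the following minimal Python sketch; the function names are placeholders for the two call sites mentioned above, not the actual MXNet internals.

    import threading
    import time

    lock_a = threading.Lock()  # stands in for a resource held during algo selection (hypothetical)
    lock_b = threading.Lock()  # stands in for a resource held by the custom-op callback (hypothetical)

    def select_algo():
        with lock_a:
            time.sleep(0.1)
            with lock_b:  # waits for the lock the callback already holds
                pass

    def custom_op_callback():
        with lock_b:
            time.sleep(0.1)
            with lock_a:  # waits for the lock select_algo already holds: circular wait
                pass

    t1 = threading.Thread(target=select_algo, daemon=True)
    t2 = threading.Thread(target=custom_op_callback, daemon=True)
    t1.start(); t2.start()
    t1.join(timeout=2); t2.join(timeout=2)
    print("still deadlocked:", t1.is_alive() and t2.is_alive())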

RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Ye, Zhouhai
For our current POC we chose (b): add mpi.kvstore in Python. It depends on a new mxnet submodule, mpi_collectives (a C++ library that itself depends on mxnet), and adds a new type of kvstore at the Python layer. mpi_collectives doesn't need to be a single C++ library; its source code can be
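
For readers new to the AllReduce approach, the core idea can be sketched in a few lines of Python with mpi4py and NumPy. This is purely illustrative; the actual POC goes through the C++ mpi_collectives submodule, not mpi4py.

    # Launch with e.g.: mpirun -np 4 python allreduce_sketch.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    # Each worker computes its own gradient (here just a dummy value per rank).
    local_grad = np.full(4, comm.Get_rank(), dtype=np.float32)
    summed = np.empty_like(local_grad)

    # AllReduce sums the gradients across all workers, replacing the
    # push/pull round trip to a parameter server.
    comm.Allreduce(local_grad, summed, op=MPI.SUM)
    averaged = summed / comm.Get_size()

    if comm.Get_rank() == 0:
        print("averaged gradient:", averaged)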

RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Ye, Zhouhai
You can check the mpi.kvstore API spec in our design doc: e.g. we add pushpull and broadcast interfaces and disable the original push and pull in the new kvstore.
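
A hypothetical Python sketch of that API shape (the real signatures live in the design doc and may differ):

    class MPIKVStore:
        """Sketch of an MPI-backed kvstore that exposes only broadcast and pushpull."""

        def broadcast(self, key, value, root=0):
            # Broadcast initial parameter values from `root` to every worker.
            raise NotImplementedError("backed by mpi_collectives in the POC")

        def pushpull(self, key, value, out=None):
            # Fused push + pull: allreduce `value` across workers and write the result to `out`.
            raise NotImplementedError("backed by mpi_collectives in the POC")

        # The original parameter-server style push/pull are disabled.
        def push(self, key, value):
            raise NotImplementedError("use pushpull instead")

        def pull(self, key, out=None):
            raise NotImplementedError("use pushpull instead")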

RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Ye, Zhouhai
Hi Nan Zhu, As we described in our design doc, there are two possible code structures (implementations); we currently implement the second in our POC: a. Implement mpi.kvstore at the same level as the current kvstores (C++, src/kvstore), adhering to the original kvstore factory pattern. b. Add mpi.kvstore at the Python layer.
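
Option (b) can be pictured as a thin dispatch at the Python layer, falling back to the existing factory for the built-in store types. This is only a sketch under assumed names; MPIKVStore is a hypothetical placeholder, not actual MXNet code.

    import mxnet as mx

    class MPIKVStore:
        """Placeholder for the hypothetical MPI-backed kvstore."""

    def create_kvstore(name):
        # Hypothetical selection logic for option (b): route the new 'mpi' type
        # to the Python-level store, keep everything else on the existing factory.
        if name == "mpi":
            return MPIKVStore()        # would be backed by the mpi_collectives submodule
        return mx.kv.create(name)      # existing types: 'local', 'device', 'dist_sync', ...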

RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Zhao, Patric
Actually, the current design structure is very similar to kvstore_nccl, as the attached picture shows. I have updated the proposal into a Google doc as well; that makes it easier to add comments and make modifications.