Hi,

It's great to see the MXNet-Horovod integration get merged: https://github.com/uber/horovod/pull/542
Is there any future plan for this? I've been working on Kubeflow's
MPI-Operator (https://github.com/kubeflow/mpi-operator) lately, and it would
be interesting to see an example of using Horovod + MXNet + Kubeflow via the
MPI Operator. Feel free to reach out (@terrytangyuan
<https://github.com/terrytangyuan>) if you encounter any issues.

Best,
Yuan

On Fri, Nov 2, 2018 at 6:51 PM Lin Yuan <apefor...@gmail.com> wrote:

> Hi Mu,
>
> Darren (@yuxihu <https://github.com/yuxihu>) and I have been working on
> releasing the MXNet-Horovod integration in production. We have made some
> changes on both the MXNet and Horovod sides. The changes on the MXNet side
> have mostly been merged, and we are working to merge code into the Horovod
> repo. We will send a design doc to you for review again next week.
>
> Thanks for your feedback,
>
> Lin
>
> On Wed, Oct 31, 2018 at 12:03 PM Mu Li <muli....@gmail.com> wrote:
>
> > Thanks for your contribution, Carl.
> >
> > I remember I left a comment on the proposal, but today I found it had
> > disappeared. My suggestion is to try our best not to change the existing
> > API. The reason is that we would need to change all trainers on the
> > frontend that use the existing kvstore APIs, which may cause confusion
> > for users.
> >
> > The current proposal wants to add the following four APIs to kvstore:
> >
> > - kv.pushpull
> > - kv.broadcast
> > - kv.local_rank
> > - kv.num_local_workers
> >
> > Pushpull can be done with a sequential push and pull: you can do nothing
> > in push and put all the workload into pushpull. Broadcast can be
> > implemented with pull.
> >
> > What are local workers? GPUs in a single machine? If so, we can query
> > that directly.
> >
> > On Fri, Sep 14, 2018 at 4:46 PM Carl Yang <carl14...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Currently, MXNet distributed training can only be done using a
> > > parameter server.
> > > Horovod is an open-source distributed training framework that has
> > > shown a 2x speedup compared to TensorFlow using a parameter server.
> > > We propose to add Horovod support to MXNet. This will help our users
> > > achieve the goal of linear scalability to 256 GPUs and beyond. The
> > > design proposal is on cwiki:
> > >
> > > https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
> > >
> > > Please feel free to let me know if you have any suggestions or
> > > feedback.
> > >
> > > Regards,
> > > Carl
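[Editor's note] Mu Li's suggestion in the thread above — that pushpull can be expressed as a sequential push followed by a pull, and broadcast implemented via pull — can be sketched with a toy in-memory key-value store. This class and its semantics are illustrative assumptions for the discussion, not MXNet's actual KVStore implementation:

```python
class ToyKVStore:
    """Minimal single-process stand-in for a kvstore with push/pull.

    Illustrative only; real MXNet KVStore semantics may differ.
    """

    def __init__(self):
        self._store = {}

    def push(self, key, values):
        # Aggregate (elementwise sum) the pushed values under the key,
        # mimicking how a parameter server reduces gradients from workers.
        self._store[key] = [sum(col) for col in zip(*values)]

    def pull(self, key):
        # Return a copy of the current value for the key.
        return list(self._store[key])

    def pushpull(self, key, values):
        # The proposed fused op, expressed as a sequential push then pull,
        # per Mu Li's suggestion in the thread.
        self.push(key, values)
        return self.pull(key)

    def broadcast(self, key, root_value):
        # Broadcast expressed via pull: initialize the key once, then every
        # worker pulls the same value.
        if key not in self._store:
            self._store[key] = list(root_value)
        return self.pull(key)


if __name__ == "__main__":
    kv = ToyKVStore()
    # Two "workers" contribute gradients; pushpull returns the reduction.
    print(kv.pushpull("w", [[1, 2], [3, 4]]))   # elementwise sum -> [4, 6]
    print(kv.broadcast("b", [0.5, 0.5]))        # every worker sees [0.5, 0.5]
```

This also illustrates why the frontend-compatibility concern matters: if `pushpull` is layered on the existing `push`/`pull` pair like this, existing trainers that call `push` and `pull` separately keep working unchanged.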