Congrats on the Horovod integration, everyone. That's really great to hear.
On Wed, Jan 30, 2019 at 10:08 AM Lin Yuan <apefor...@gmail.com> wrote:

> Hi Yuan,
>
> Thanks for your interest. We have just supported MXNet in Horovod and are
> working on performance tuning and adding more examples. We are definitely
> interested in further extending its support with Kubeflow.
>
> Let's set up some time to have a more detailed discussion.
>
> Best,
>
> Lin
>
> On Wed, Jan 30, 2019 at 7:42 AM Yuan Tang <terrytangy...@gmail.com> wrote:
>
> > Hi,
> >
> > It's great to see that the MXNet-Horovod integration got merged:
> > https://github.com/uber/horovod/pull/542
> >
> > Is there any future plan for this? I've been working on Kubeflow's
> > MPI-Operator (https://github.com/kubeflow/mpi-operator) lately and it
> > would be interesting to see an example of using Horovod + MXNet +
> > Kubeflow using MPI Operator. Feel free to reach out (@terrytangyuan
> > <https://github.com/terrytangyuan>) if you encounter any issues.
> >
> > Best,
> > Yuan
> >
> > On Fri, Nov 2, 2018 at 6:51 PM Lin Yuan <apefor...@gmail.com> wrote:
> >
> > > Hi Mu,
> > >
> > > Darren (@yuxihu <https://github.com/yuxihu>) and I have been working
> > > on releasing MXNet-Horovod integration in production. We have made
> > > some changes on both the MXNet and Horovod sides. The changes on the
> > > MXNet side have mostly been merged and we are working to merge code
> > > to the horovod repo. We will send a design doc to you for review
> > > again next week.
> > >
> > > Thanks for your feedback,
> > >
> > > Lin
> > >
> > > On Wed, Oct 31, 2018 at 12:03 PM Mu Li <muli....@gmail.com> wrote:
> > >
> > > > Thanks for your contribution, Carl.
> > > >
> > > > I remember I left a comment on the proposal, but today I found it
> > > > had disappeared. My suggestion is to try our best not to change
> > > > the existing API. The reason is that we would need to change all
> > > > trainers on the frontend that use the existing kvstore APIs, which
> > > > may cause confusion to users.
> > > >
> > > > The current proposal wants to add the following 4 APIs into
> > > > kvstore:
> > > >
> > > >    - kv.pushpull
> > > >    - kv.broadcast
> > > >    - kv.local_rank
> > > >    - kv.num_local_workers
> > > >
> > > > Pushpull can be done with a sequential push and pull; you can do
> > > > nothing in push and put all workloads into pushpull. Broadcast can
> > > > be implemented by pull.
> > > >
> > > > What are local workers? GPUs in a single machine? If so, we can
> > > > query it directly.
> > > >
> > > > On Fri, Sep 14, 2018 at 4:46 PM Carl Yang <carl14...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Currently, MXNet distributed training can only be done using a
> > > > > parameter server. Horovod is an open-source distributed training
> > > > > framework that has shown a 2x speedup compared to TensorFlow
> > > > > using Parameter Server. We propose to add Horovod support to
> > > > > MXNet. This will help our users achieve the goal of linear
> > > > > scalability to 256 GPUs and beyond. Design proposal on cwiki:
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
> > > > >
> > > > > Please feel free to let me know if you have any suggestions or
> > > > > feedback.
> > > > >
> > > > > Regards,
> > > > > Carl
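For readers following the API discussion, here is a minimal sketch of Mu's point that pushpull can be expressed as a sequential push and pull, and broadcast as an init followed by a pull. It uses a toy in-memory store: `ToyKVStore` and all of its method signatures are hypothetical and written only for illustration. It is not MXNet's kvstore, not Horovod, and not the design in the proposal.

```python
from typing import Dict, List


class ToyKVStore:
    """Hypothetical in-memory key-value store (illustration only, not MXNet's kvstore)."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}

    def init(self, key: str, value: List[float]) -> None:
        # Set the initial value stored under a key.
        self._store[key] = list(value)

    def push(self, key: str, values: List[List[float]]) -> None:
        # Aggregate the per-worker values by element-wise sum.
        self._store[key] = [sum(column) for column in zip(*values)]

    def pull(self, key: str, num_workers: int) -> List[List[float]]:
        # Give every worker its own copy of the stored value.
        return [list(self._store[key]) for _ in range(num_workers)]

    def pushpull(self, key: str, values: List[List[float]]) -> List[List[float]]:
        # pushpull expressed as a sequential push and pull.
        self.push(key, values)
        return self.pull(key, len(values))

    def broadcast(self, key: str, value: List[float], num_workers: int) -> List[List[float]]:
        # broadcast expressed as init (by one worker) followed by a pull.
        self.init(key, value)
        return self.pull(key, num_workers)


kv = ToyKVStore()
# Two workers contribute gradients; each receives the element-wise sum.
print(kv.pushpull("w", [[1.0, 2.0], [3.0, 4.0]]))  # [[4.0, 6.0], [4.0, 6.0]]
# One initial value is handed to three workers.
print(kv.broadcast("w", [0.5, 0.5], 3))
```

In this sketch the aggregation happens in `push`; the variant Mu describes, where push does nothing and the combined call carries the whole workload, would keep the same outer interface.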