Hi Mu,

Darren (@yuxihu <https://github.com/yuxihu>) and I have been working on
releasing the MXNet-Horovod integration in production. We have made some
changes on both the MXNet and Horovod sides. The changes on the MXNet side
have mostly been merged, and we are working on merging code into the
Horovod repo. We will send you a design doc for review again next week.

Thanks for your feedback,

Lin

On Wed, Oct 31, 2018 at 12:03 PM Mu Li <muli....@gmail.com> wrote:

> Thanks for your contribution, Carl.
>
> I remember leaving a comment on the proposal, but today I found it had
> disappeared. My suggestion is to try our best not to change the existing
> API. Otherwise we would need to change every frontend trainer that uses
> the existing kvstore APIs, which may confuse users.
>
> The current proposal wants to add the following four APIs to kvstore
> (a usage sketch follows the list):
>
>
>    - kv.pushpull
>    - kv.broadcast
>    - kv.local_rank
>    - kv.num_local_workers
>
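> A minimal usage sketch of the proposal, assuming these (not yet merged)
> APIs and a hypothetical 'horovod' kvstore type:
>
>     import mxnet as mx
>
>     kv = mx.kv.create('horovod')        # hypothetical backend name
>     w = mx.nd.ones((2, 3))
>     kv.broadcast('w', value=w, out=w)   # proposed API: sync initial weights
>     grad = mx.nd.ones((2, 3))
>     kv.pushpull('w', grad, out=grad)    # proposed API: fused push + pull
>     print(kv.local_rank, kv.num_local_workers)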
>
> Pushpull can be done with a sequential push and pull: make push a no-op
> and put all of the work into pull. Broadcast can be implemented with
> pull.
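> For example, with the existing kvstore API (a minimal sketch; the
> 'dist_sync' store type and key are only for illustration, and what pull
> returns depends on the updater the store is configured with):
>
>     import mxnet as mx
>
>     kv = mx.kv.create('dist_sync')
>     shape = (2, 3)
>     kv.init('w', mx.nd.ones(shape))
>
>     grad = mx.nd.ones(shape)
>     out = mx.nd.zeros(shape)
>     kv.push('w', grad)      # aggregate gradients across workers
>     kv.pull('w', out=out)   # fetch the aggregated result
>
>     # broadcast amounts to every worker pulling the value set by init
>     kv.pull('w', out=out)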
>
> What are local workers? GPUs on a single machine? If so, we can query
> that directly.
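> For example (assuming MXNet 1.3+, where mx.context.num_gpus() is
> available):
>
>     import mxnet as mx
>
>     # number of GPUs visible on this machine
>     num_local_gpus = mx.context.num_gpus()
>     ctxs = [mx.gpu(i) for i in range(num_local_gpus)]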
>
>
> On Fri, Sep 14, 2018 at 4:46 PM Carl Yang <carl14...@gmail.com> wrote:
>
> > Hi,
> >
> > Currently, distributed training in MXNet can only be done using the
> > parameter server. Horovod is an open-source distributed training
> > framework that has shown a 2x speedup compared to TensorFlow with the
> > parameter server. We propose adding Horovod support to MXNet. This
> > will help our users achieve the goal of linear scalability to 256 GPUs
> > and beyond. The design proposal is on cwiki:
> >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
> >
> > Please feel free to let me know if you have any suggestions or feedback.
> >
> > Regards,
> > Carl
> >
>
