Horovod-MXNet Integration

2018-09-14 Thread Carl Yang
Hi,

Currently, distributed training in MXNet can only be done using a
parameter server. Horovod is an open-source distributed training
framework that has shown a 2x speedup compared to TensorFlow using a
parameter server. We propose to add Horovod support to MXNet. This will
help our users achieve the goal of linear scalability to 256 GPUs and
beyond. The design proposal is on cwiki:

https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration

Please feel free to let me know if you have any suggestions or feedback.

Regards,
Carl


Single-Machine Topology-aware Communication

2018-06-18 Thread Carl Yang
Hi,

Currently, we have two methods for single-machine communication:
parameter server and NCCL ring reduction. Both of these methods have
some downsides. The parameter server does not differentiate between
NVLink and PCI-E connections, so it ends up using the higher-latency,
slower PCI-E connections as frequently as it does NVLink. NCCL uses
the ring reduce algorithm, which has higher theoretical latency than
other algorithms. NCCL also requires users to install another
dependency in order to use it. I am working on a topology-aware
approach that can address these limitations. The design proposal is on
cwiki:
https://cwiki.apache.org/confluence/display/MXNET/Single+machine+All+Reduce+Topology-aware+Communication
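
To illustrate the latency point about ring reduce, here is a rough sketch
that counts sequential communication steps only. It assumes a simple
latency-only cost model (bandwidth is ignored) and uses a binary-tree
reduce followed by a broadcast as one example of an "other algorithm".
The function names are hypothetical and this is not the algorithm from
the design proposal.

```python
def ring_allreduce_steps(num_gpus: int) -> int:
    """Sequential steps for ring all-reduce (assumed latency-only model).

    A ring all-reduce runs a reduce-scatter phase and an all-gather
    phase, each taking (num_gpus - 1) steps around the ring.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return 2 * (num_gpus - 1)


def tree_allreduce_steps(num_gpus: int) -> int:
    """Sequential steps for a binary-tree reduce followed by a broadcast.

    Each phase needs ceil(log2(num_gpus)) steps; (n - 1).bit_length()
    computes that value without floating-point rounding.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    depth = (num_gpus - 1).bit_length()
    return 2 * depth


if __name__ == "__main__":
    # Compare step counts for a few single-machine GPU counts.
    for n in (2, 4, 8, 16):
        print(n, ring_allreduce_steps(n), tree_allreduce_steps(n))
```

Under this model the ring step count grows linearly with the number of
GPUs while the tree step count grows logarithmically. Real performance
also depends on message size and link bandwidth, which this sketch does
not model.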

Please feel free to let me know if you have any suggestions.

Regards,
Carl