Horovod-MXNet Integration
Hi, Currently, distributed training in MXNet can only be done with a parameter server. Horovod is an open-source distributed training framework that has shown a 2x speedup over TensorFlow with a parameter server. We propose adding Horovod support to MXNet. This will help our users reach the goal of linear scalability to 256 GPUs and beyond. The design proposal is on cwiki: https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration Please let me know if you have any suggestions or feedback. Regards, Carl
Single-Machine Topology-aware Communication
Hi, Currently, we have two methods for single-machine communication: parameter server and NCCL ring reduction. Both methods have downsides. Parameter server does not distinguish NVLink connections from PCI-E, so it uses the higher-latency, slower PCI-E connections as often as it uses NVLink. NCCL uses the ring reduce algorithm, which has higher theoretical latency than other algorithms. I am working on a topology-aware approach that can address these limitations. The design proposal is on cwiki: https://cwiki.apache.org/confluence/display/MXNET/Single+machine+All+Reduce+Topology-aware+Communication Please let me know if you have any suggestions. Regards, Carl
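To make the latency point concrete, here is a minimal pure-Python sketch of a ring all-reduce, with a step-count comparison against a binary-tree reduce plus broadcast. It is an illustration only, not NCCL's or MXNet's implementation; the function names (ring_allreduce, ring_steps, tree_steps) and the step-count model (one step per round of simultaneous transfers) are assumptions made for this example.

```python
def ring_allreduce(worker_data):
    """Simulate ring all-reduce (sum) over equal-length lists, one per worker."""
    n = len(worker_data)
    length = len(worker_data[0])
    # Split each buffer into n contiguous chunks.
    bounds = [(k * length // n, (k + 1) * length // n) for k in range(n)]
    bufs = [list(v) for v in worker_data]

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all transfers in a step are simultaneous.
        sends = []
        for i in range(n):
            c = (i - step) % n
            lo, hi = bounds[c]
            sends.append((c, bufs[i][lo:hi]))
        for i in range(n):
            dst = (i + 1) % n
            c, payload = sends[i]
            lo, _ = bounds[c]
            for j, val in enumerate(payload):
                bufs[dst][lo + j] += val

    # Phase 2, all-gather: pass the completed chunks around the ring.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            c = (i + 1 - step) % n
            lo, hi = bounds[c]
            sends.append((c, bufs[i][lo:hi]))
        for i in range(n):
            dst = (i + 1) % n
            c, payload = sends[i]
            lo, hi = bounds[c]
            bufs[dst][lo:hi] = payload
    return bufs


def ring_steps(n):
    """Communication steps for ring all-reduce: (n - 1) + (n - 1)."""
    return 2 * (n - 1)


def tree_steps(n):
    """Steps for a binary-tree reduce followed by a broadcast (assumed model)."""
    return 2 * (n - 1).bit_length()  # 2 * ceil(log2(n)) for n >= 1


if __name__ == "__main__":
    data = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    print(ring_allreduce(data))
    for n in (2, 4, 8, 16):
        print(n, ring_steps(n), tree_steps(n))
```

Under this model the ring needs a number of steps that grows linearly with the GPU count, while the tree grows logarithmically. The ring still moves less data per link, which is why it tends to do well on bandwidth for large tensors.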
Single-Machine Topology-aware Communication
Hi, Currently, we have two methods for single-machine communication: parameter server and NCCL ring reduction. Both methods have downsides. Parameter server does not distinguish NVLink connections from PCI-E, so it uses the higher-latency, slower PCI-E connections as often as it uses NVLink. NCCL uses the ring reduce algorithm, which has higher theoretical latency than other algorithms. NCCL also requires users to install an additional dependency. I am working on a topology-aware approach that can address these limitations. The design proposal is on cwiki: https://cwiki.apache.org/confluence/display/MXNET/Single+machine+All+Reduce+Topology-aware+Communication Please let me know if you have any suggestions. Regards, Carl
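As a rough illustration of what "topology-aware" could mean here, the sketch below builds a reduction tree that prefers faster links. It is not the algorithm from the cwiki proposal: it simply runs Prim's algorithm for a maximum spanning tree over a link-weight matrix. The matrix values (2 for NVLink, 1 for PCI-E), the 4-GPU layout, and the function name max_spanning_tree are all hypothetical.

```python
import heapq


def max_spanning_tree(link_weight, root=0):
    """Return tree edges (parent, child, weight) that favor high-weight links.

    link_weight[u][v] is an assumed relative bandwidth between GPU u and
    GPU v; 0 means no direct link.
    """
    n = len(link_weight)
    in_tree = [False] * n
    in_tree[root] = True
    edges = []
    # heapq is a min-heap, so store negated weights to pop the largest first.
    heap = [(-link_weight[root][v], root, v)
            for v in range(n) if v != root and link_weight[root][v] > 0]
    heapq.heapify(heap)
    while heap and len(edges) < n - 1:
        neg_w, u, v = heapq.heappop(heap)
        if in_tree[v]:
            continue
        in_tree[v] = True
        edges.append((u, v, -neg_w))
        for w in range(n):
            if not in_tree[w] and link_weight[v][w] > 0:
                heapq.heappush(heap, (-link_weight[v][w], v, w))
    return edges


if __name__ == "__main__":
    # Hypothetical 4-GPU machine: NVLink (2) on 0-1, 1-2, 2-3; PCI-E (1) elsewhere.
    W = [
        [0, 2, 1, 1],
        [2, 0, 2, 1],
        [1, 2, 0, 2],
        [1, 1, 2, 0],
    ]
    print(max_spanning_tree(W))
```

With these assumed weights, every edge in the resulting tree is an NVLink link, so a reduction along the tree would avoid PCI-E entirely. A real design would also need to balance tree depth and account for link contention, which this sketch ignores.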