Hi,

We currently have two methods for single-machine communication: parameter server and NCCL ring reduction. Both have downsides.

Parameter server does not differentiate between NVLink and PCI-E connections, so it uses the slower, higher-latency PCI-E links as often as it uses NVLink.

NCCL uses the ring reduce algorithm, which has higher theoretical latency than other algorithms because its number of communication steps grows linearly with the number of GPUs. NCCL also requires users to install an additional dependency.

I am working on a topology-aware approach that addresses these limitations. The design proposal is on cwiki: https://cwiki.apache.org/confluence/display/MXNET/Single+machine+All+Reduce+Topology-aware+Communication
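To give a rough sense of the latency point about ring reduce, here is a minimal sketch that counts communication steps only. The function names are hypothetical and are not taken from MXNet or NCCL, and the binomial tree is just one example of an alternative algorithm, not necessarily what the proposal uses. The model ignores bandwidth, message size, and link topology.

```python
def ring_allreduce_steps(num_gpus: int) -> int:
    """Steps for a ring all-reduce: reduce-scatter plus all-gather,
    each taking num_gpus - 1 sequential steps."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return 2 * (num_gpus - 1)


def tree_allreduce_steps(num_gpus: int) -> int:
    """Steps for a binomial-tree all-reduce: reduce to the root, then
    broadcast back, each taking ceil(log2(num_gpus)) steps."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, with no float error.
    depth = (num_gpus - 1).bit_length()
    return 2 * depth


if __name__ == "__main__":
    # Illustrative GPU counts only; these are not measurements.
    for n in (2, 4, 8, 16):
        print(n, ring_allreduce_steps(n), tree_allreduce_steps(n))
```

Under these assumptions, 8 GPUs need 14 steps on a ring and 6 on a binomial tree. Step count is only a proxy for latency; for large messages the ring's bandwidth efficiency can still win.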
Please let me know if you have any suggestions.

Regards,
Carl