rahul003 commented on a change in pull request #8766: NDArray Indexing tutorial and Gradient Compression FAQ
URL: https://github.com/apache/incubator-mxnet/pull/8766#discussion_r152468880
##########
File path: docs/faq/gradient_compression.md
##########
@@ -0,0 +1,107 @@
+# Gradient Compression
+
+Gradient Compression reduces communication bandwidth to make distributed training with GPUs more scalable and efficient without significant loss in convergence rate or accuracy.
+
+
+## Benefits
+
+**Increased Speed**
+
+For architectures with fully connected components, gradient compression has been observed to speed up training by about 2x, depending on the size of the model and the network bandwidth of the instance. Bigger models see a larger speedup with gradient compression.
+
+**Minimal Accuracy Loss**
+
+Gradient compression delays the synchronization of weight updates that are small. Although small weight updates might not be sent for a given batch, this information is not discarded. Once the weight updates for a location accumulate to a larger value, they are propagated. Since there is no information loss, only delayed updates, gradient compression does not lead to a significant loss in accuracy or convergence rate. In distributed training experiments[1], an accuracy loss as low as 1% has been observed with this technique.
+
+
+## When to Use Gradient Compression
+
+When training models whose architectures include large fully connected components, it can be helpful to use gradient compression. For larger models, the communication cost becomes a major factor, so such models stand to benefit greatly from gradient compression.
+
+
+### GPU versus CPU
+
+The greatest benefits from gradient compression are realized when using multi-node (single or multi-GPU) distributed training. Training on CPU provides a much lower compute density per node than training on GPU, so the required communication bandwidth for CPU-based nodes during training is not as high as for GPU-based nodes. Hence, the benefits of gradient compression are lower for CPU-based nodes than for GPU-based nodes.
+
+
+### Network Latency
+
+Benefits of gradient compression can be found when using distributed training with network-connected nodes. Depending on the network latency between nodes and the size of the model, communication can slow training down enough that gradient compression provides a speed improvement.
+
+You may not want to use gradient compression if you have low-latency network communication.
+
+
+### Model Size
+
+Distributed training involves synchronization of weights after each batch. Larger models have much higher communication costs during training, hence such models stand to benefit much more from gradient compression.
+When running distributed training with gradient compression, the quantize and dequantize operations happen on the CPU, parallelized with OpenMP. For smaller models, when training on GPUs, it helps to set `OMP_NUM_THREADS=1` on each node, so that the overhead of launching OMP threads doesn't slow down the compression and decompression.
+
+### Model Architecture
+
+The communication bandwidth requirements during training vary across neural network architectures, and hence the benefits of gradient compression vary accordingly.
+
+In networks with significant fully connected components, such layers have a low compute cost on GPUs, so communication becomes a bottleneck that limits the speed of distributed training. Gradient compression can help reduce the communication cost, and thus speed up training in such cases. We have observed a speedup of about 2x on large fully connected neural networks. Models like AlexNet and VGG have large fully connected components as part of the network, hence stand to benefit from gradient compression. Long Short-Term Memory architectures require more communication bandwidth, so they also exhibit speed improvements with gradient compression.
+
+Architectures like Convolutional Neural Networks, on the other hand, have a higher compute cost, in which case some communication can be parallelized with compute. Since communication is not the bottleneck in such networks, gradient compression doesn't help much.
+
+
+### Single Node Gradient Compression
+
+When the training is configured to use device-to-device communication on a single node with multiple GPUs, gradient compression can be used to reduce the cost communication. This can provide about 20% speedup for large models using older generation architectures. However, speed benefits may be negligible on a machine with a newer generation architecture where GPUs can communicate at low latency.

Review comment:
    cost communication -> cost of communication
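To make the "Minimal Accuracy Loss" explanation above concrete, here is a minimal NumPy sketch of the delayed-update idea: per-element gradient values below a threshold are held back in a local residual and only transmitted once they accumulate past the threshold. This is an illustration of the concept only, not MXNet's actual 2-bit implementation; the function name and the `threshold=0.5` value are assumptions for the example.

```python
import numpy as np

def compress_with_residual(grad, residual, threshold=0.5):
    """Illustrative threshold compression: each element is sent as
    +threshold, -threshold, or 0; the unsent remainder stays in `residual`."""
    acc = grad + residual                 # add this batch's gradient to the held-back remainder
    sent = np.zeros_like(acc)
    sent[acc >= threshold] = threshold    # large positive updates are sent now
    sent[acc <= -threshold] = -threshold  # large negative updates are sent now
    new_residual = acc - sent             # small updates are delayed, not discarded
    return sent, new_residual

# Small gradients accumulate over batches until they cross the threshold and get sent.
residual = np.zeros(3)
for step in range(3):
    grad = np.array([0.2, 0.6, -0.3])
    sent, residual = compress_with_residual(grad, residual)
    print(step, sent, residual)
```

No update is ever dropped outright: whatever is not sent in one batch remains in the residual and is eventually propagated, which is why the accuracy impact stays small.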
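For readers looking for how to turn this on, the kvstore-based setup looks roughly like the sketch below, assuming the 2-bit compressor is exposed through `KVStore.set_gradient_compression` with a `threshold` parameter (check the rest of this FAQ for the exact parameters). The `OMP_NUM_THREADS` line reflects the tip in the "Model Size" section and would normally be exported in the environment that launches each worker rather than set inside the script.

```python
import os

# Per the "Model Size" section: for smaller models trained on GPUs, limit OpenMP
# threads so CPU-side quantize/dequantize isn't dominated by thread-launch overhead.
# Usually exported when launching each worker, e.g.: OMP_NUM_THREADS=1 python train.py
os.environ.setdefault('OMP_NUM_THREADS', '1')

import mxnet as mx

# Create a distributed kvstore and enable 2-bit gradient compression.
# Gradient elements whose accumulated magnitude stays below `threshold`
# are delayed rather than sent, as described under "Minimal Accuracy Loss".
kv = mx.kvstore.create('dist_sync')
kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})

# The kvstore object is then passed to the training API (e.g. Module.fit)
# through its `kvstore` argument.
```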