rahul003 commented on a change in pull request #8762: Gradient compression faq
URL: https://github.com/apache/incubator-mxnet/pull/8762#discussion_r152454945
 
 

 ##########
 File path: docs/faq/gradient_compression.md
 ##########
 @@ -0,0 +1,95 @@
+# Gradient Compression
+
+Gradient Compression reduces communication bandwidth to make distributed 
training with GPUs more scalable and efficient without significant loss in 
convergence rate or accuracy.
+
+
+## Benefits
+
+**Increased Speed**
+
+For tasks such as acoustic modeling in speech recognition (as in Alexa), gradient compression has been observed to speed up training by about 2 times, depending on the size of the model and the network bandwidth of the instance. Bigger models see a larger speedup from gradient compression.
+
+**Minimal Accuracy Loss**
+
+Gradient compression works by delaying the synchronization of weight updates that are small. Although small weight updates might not be sent for a given batch, this information is not discarded. Once the updates for a location accumulate to a larger value, they are propagated. Since no information is lost, only delayed, gradient compression does not lead to a significant loss in accuracy or convergence rate. In distributed training experiments[1], an accuracy loss as low as 1% has been observed for this technique.
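
To make the delayed-update idea concrete, here is a rough, self-contained sketch of threshold quantization with residual (error) feedback. It is illustrative only, not MXNet's exact 2-bit implementation; the function name `two_bit_compress` and the `threshold` value are placeholders.

```python
import numpy as np

def two_bit_compress(grad, residual, threshold=0.5):
    """Quantize a gradient against a threshold, keeping the remainder.

    Values whose accumulated magnitude exceeds `threshold` are sent as
    +threshold or -threshold; everything else is held back in `residual`
    and added to the next batch's gradient, so small updates are delayed
    rather than discarded.
    """
    acc = grad + residual                       # add back what was held over
    compressed = np.zeros_like(acc)
    compressed[acc >= threshold] = threshold    # large positive updates
    compressed[acc <= -threshold] = -threshold  # large negative updates
    residual = acc - compressed                 # the delayed part
    return compressed, residual

# Toy usage: small components accumulate until they cross the threshold.
grad = np.array([0.2, -0.7, 0.3])
residual = np.zeros_like(grad)
for step in range(3):
    sent, residual = two_bit_compress(grad, residual)
    print(step, sent, residual)
```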
+
+
+## When to Use Gradient Compression
+
+When training models whose architectures include large fully connected components, it can be helpful to use gradient compression. For larger models, the communication cost becomes a major factor, so such models stand to benefit greatly from gradient compression.
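
For reference, a minimal sketch of turning gradient compression on through the Module API, assuming an MXNet version that ships gradient compression (1.0 or later); the network `net` and the GPU context are placeholders:

```python
import mxnet as mx

# Placeholder network standing in for a model with large fully connected parts.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data=data, num_hidden=1000)
net = mx.sym.SoftmaxOutput(data=net, name='softmax')

# '2bit' compression with a threshold: gradient values whose accumulated
# magnitude stays below 0.5 are held back until they grow large enough to send.
mod = mx.mod.Module(symbol=net, context=[mx.gpu(0)],
                    compression_params={'type': '2bit', 'threshold': 0.5})
```

A similar `compression_params` dictionary can also be passed to a Gluon `Trainer` or set on a kvstore, though the exact entry point depends on the MXNet version and training API in use.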
+
+
+### GPU versus CPU
+
+The greatest benefits from gradient compression are realized when using GPUs, for both single-node multi-GPU and multi-node (single or multi-GPU) distributed training. Training on CPUs provides a much lower compute density per node than training on GPUs, so the communication bandwidth required by CPU-based nodes during training is not as high as that required by GPU-based nodes. Hence, the benefits of gradient compression are lower for CPU-based nodes than for GPU-based nodes.
+
+
+### Scaling
+
+When training is configured to use device-to-device communication on a single node with multiple GPUs, gradient compression can be used to reduce the cost of communication. This can provide about a 20% speedup for large models on older generation architectures where GPU communication goes through the CPU. However, the speed benefit may be negligible on an 8-GPU machine with a newer generation architecture where GPUs can communicate without going through the CPU first.
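
For this single-node multi-GPU case, a sketch of the kvstore route, again assuming MXNet 1.0 or later; whether it helps depends on the interconnect as described above:

```python
import mxnet as mx

# A 'device' kvstore aggregates gradients on the GPUs themselves; compression
# reduces the volume of gradient data moved between devices (or through the
# CPU on older architectures).
kv = mx.kv.create('device')
kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})
```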
+
+
+### Network Latency
+
+Gradient compression is beneficial when using distributed training across network-connected nodes. Depending on the network latency between nodes and the size of the model, communication can become slow enough that gradient compression provides a noticeable speed improvement.
+
+You may not want to use gradient compression if you have low-latency communication. The performance gain may be negligible when GPUs can communicate at low latency, as in newer architectures.
+
+
+### Model Size
+
+If the model is small, gradient compression can actually decrease speed. More 
examples of this are covered in the Benchmarking section.
 
 Review comment:
   When running distributed training with gradient compression, the quantize and dequantize operations happen on the CPU, parallelized with OpenMP. For smaller models, when training on GPUs, it helps to set `OMP_NUM_THREADS=1` on each node so that the overhead of launching OMP threads doesn't slow down compression and decompression.
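
A minimal sketch of that suggestion, assuming the variable is set inside each worker's training script before MXNet is imported (exporting `OMP_NUM_THREADS=1` in the shell that launches each worker works just as well):

```python
import os

# Use a single OpenMP thread for the CPU-side quantize/dequantize calls so
# that thread-pool startup overhead does not dominate for small gradients.
# This must be set before mxnet (and its OpenMP runtime) is loaded.
os.environ['OMP_NUM_THREADS'] = '1'

import mxnet as mx  # noqa: E402  imported after setting the variable on purpose
```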
