rahul003 commented on a change in pull request #8762: Gradient compression faq
URL: https://github.com/apache/incubator-mxnet/pull/8762#discussion_r152460220
 
 

 ##########
 File path: docs/faq/gradient_compression.md
 ##########
 @@ -0,0 +1,98 @@
+# Gradient Compression
+
+Gradient Compression reduces communication bandwidth to make distributed 
training with GPUs more scalable and efficient without significant loss in 
convergence rate or accuracy.
+
+
+## Benefits
+
+**Increased Speed**
+
+For tasks like acoustic modeling in speech recognition (such as in Alexa), gradient compression has been observed to speed up training by about 2 times, depending on the size of the model and the network bandwidth of the instance. Bigger models see a larger speedup with gradient compression.
+
+**Minimal Accuracy Loss**
+
+Gradient compression delays the synchronization of weight updates that are small. Although small weight updates might not be sent for that batch, this information is not discarded. Once the weight updates for this location accumulate to become a larger value, they will be propagated. Since there is no information loss, but only delayed updates, gradient compression does not lead to a significant loss in accuracy or convergence rate. In distributed training experiments [1], the observed loss of accuracy was as low as 1% with this technique.
+
+
+## When to Use Gradient Compression
+
+When training models whose architectures include large fully connected components, it can be helpful to use gradient compression. For larger models, the communication cost becomes a major factor, so such models stand to benefit greatly from gradient compression.
+
+
+### GPU versus CPU
+
+The greatest benefits from gradient compression are realized when using GPUs for both single-node multi-GPU and multi-node (single or multi-GPU) distributed training. Training on CPUs provides a much lower compute density per node than training on GPUs, so the required communication bandwidth for CPU-based nodes during training is not as high as for GPU-based nodes. Hence, the benefits of gradient compression are lower for CPU-based nodes than for GPU-based nodes.
+
+
+### Network Latency
+
+Gradient compression is most beneficial when using distributed training across network-connected nodes. Depending on the network latency between nodes and the size of the model, communication can slow training down enough that gradient compression provides a speed improvement.
+
+You may not want to use gradient compression if you have low-latency network communication.
+
+
+### Model Size
+
+Distributed training involves the synchronization of weights after each batch. Larger models have much higher communication costs during training, so such models stand to benefit much more from gradient compression.
+
+When running distributed training with gradient compression, the quantize and dequantize operations happen on the CPU, parallelized with OpenMP. For smaller models, when training on GPUs, it helps to set `OMP_NUM_THREADS=1` on each node, so that the overhead of launching OpenMP threads doesn't slow down compression and decompression.
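As a minimal sketch (an assumption about one way to apply this, not text from the FAQ), the variable can be set in the worker script before MXNet is imported, or exported through whatever launcher starts the workers:

```python
import os

# Limit OpenMP to a single thread per worker so that the overhead of
# launching OMP threads doesn't dominate quantize/dequantize for small models.
# Set this before importing mxnet so the setting takes effect.
os.environ['OMP_NUM_THREADS'] = '1'

import mxnet as mx
```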
+
+### Model Architecture
+
+The communication bandwidth requirements during training vary across various 
neural network architectures and hence the benefits of gradient compression 
vary accordingly.
+
+In networks that have significant fully connected components, such layers have a low compute cost on GPUs, so communication becomes the bottleneck limiting the speed of distributed training. Gradient compression can help reduce the communication cost and thus speed up training in such cases. We have observed a speedup of about 2x on large fully connected neural networks. Models like AlexNet and VGG have large fully connected components as part of the network, and hence stand to benefit from gradient compression. Long Short-Term Memory architectures require more communication bandwidth, so they also exhibit speed improvements with gradient compression.
+
+Architectures like Convolutional Neural Networks, on the other hand, have a higher compute cost, in which case some communication can be parallelized with computation. Since communication is not the bottleneck in such networks, gradient compression doesn't help much.
+
+
+### Single Node Gradient Compression
+
+When the training is configured to use device-to-device communication on a single node with multiple GPUs, gradient compression can be used to reduce the cost of communication. This can provide about a 20% speedup for large models using older generation architectures. However, speed benefits may be negligible on a machine with a newer generation architecture where GPUs can communicate at low latency.
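For illustration, a hedged sketch of single-node multi-GPU training with compression enabled; the `compression_params` argument on `gluon.Trainer` and the `2bit`/`threshold` keys reflect my understanding of the MXNet API and should be verified against the current code:

```python
import mxnet as mx
from mxnet import gluon

# A small network replicated across two GPUs on one machine.
ctx = [mx.gpu(0), mx.gpu(1)]
net = gluon.nn.Dense(10)
net.initialize(mx.init.Xavier(), ctx=ctx)

# kvstore='device' aggregates gradients on GPU; compression_params enables
# 2-bit compression of the device-to-device communication.
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1},
                        kvstore='device',
                        compression_params={'type': '2bit', 'threshold': 0.5})
```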
+
+
+## Deep Neural Networks and Sparse Data
+
+It is well known that the weights of a fully connected DNN (Deep Neural Network) are typically sparsely distributed, with most weights close to zero, so it is not surprising that sub-gradients are also sparse [1]. Since sub-gradients are computed from a small part of the training data, they are even sparser than the weights. Hence, only a small fraction of the weights needs to be updated after each mini-batch. In other words, elements of the gradient that are near zero can safely be delayed longer than the typical mini-batch size. The sub-gradients are compressed significantly by considering only gradient elements whose absolute values exceed a threshold. The resulting sparse gradients are then encoded using 2-bit quantization, thereby reducing the communication bandwidth. The delayed gradient values are aggregated into a gradient residual which is communicated when it reaches the threshold.
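The following NumPy sketch only illustrates the thresholding-and-residual idea described above; it is not the MXNet implementation (which runs in C++/CUDA), and the function name is hypothetical:

```python
import numpy as np

def threshold_with_residual(grad, residual, threshold=0.5):
    """Send only elements whose accumulated magnitude crosses the threshold;
    keep everything else (including quantization error) in the residual."""
    acc = grad + residual                           # add previously delayed updates
    sent = np.where(np.abs(acc) >= threshold,
                    np.sign(acc) * threshold, 0.0)  # what gets communicated
    return sent, acc - sent                         # new residual of delayed values

grad = np.array([0.1, -0.7, 0.2, 0.6])
residual = np.zeros_like(grad)
sent, residual = threshold_with_residual(grad, residual)
# sent     is approximately [ 0. , -0.5,  0. ,  0.5]
# residual is approximately [ 0.1, -0.2,  0.2,  0.1], carried to the next batch
```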
+
+
+## Technical Implementation
+
+For data-parallel training, the model is replicated across compute nodes with 
the weight-updates synchronized across all the model replicas. The massive 
local computational density of the GPU nodes increases the required 
communication bandwidth for weight updates across model replicas in 
data-parallel distributed training. Instead of the uniform update-rate of 
weights imposed by the mini-batch size, the gradient compression capability 
controls the rate of weight-update per individual weight. Gradient compression 
uses the approach of delaying synchronization of weights whose updates (aka 
gradients) are small, and compressing the weight-updates which are 
synchronized. This reduction in communication bandwidth enables distributed 
training to be more efficient and scalable to more GPU nodes without 
significant loss in convergence rate or accuracy.
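As another hedged sketch (my assumption about the symbolic Module API at the time of writing, not text from this FAQ), compression parameters can also be passed to `mx.mod.Module`, which forwards them to the kvstore used during `fit()`:

```python
import mxnet as mx

# Build a small symbolic network.
data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data, num_hidden=10)
out = mx.sym.SoftmaxOutput(fc, name='softmax')

# compression_params is assumed to enable 2-bit gradient compression for the
# kvstore that the module creates when training starts.
mod = mx.mod.Module(out, context=[mx.gpu(0), mx.gpu(1)],
                    compression_params={'type': '2bit', 'threshold': 0.5})
```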
 
 Review comment:
   Please remove the current Technical Implementation and the Deep Neural Networks sections and create a new section for the text below. I think it's valuable to include these details in the technical implementation because this is important information which a user would otherwise get only by reading and understanding the code.
   
   ## Approach
   The idea behind gradient compression comes from two observations.
   Firstly, the gradients of weights computed for a small mini-batch of training data, when training large neural networks, are typically sparse. Only a small fraction of the weights have significant updates after each mini-batch. The synchronization of updates that are near zero can safely be delayed longer than the typical mini-batch size. This essentially means that the rate of weight update can vary for each individual weight.
   Secondly, gradients can be compressed significantly by considering only those gradient elements whose absolute values exceed a threshold, and then quantizing them to use fewer bits per gradient value. By compressing the gradients, we can reduce the communication bandwidth. The delayed gradient values, in the form of the quantization error and the values that don't meet the threshold, are aggregated into a gradient residual which is communicated when it reaches the threshold.
   
   
   ## Technical Implementation
   ### 2-bit Quantization
   Currently, the supported type of quantization uses 2 bits for each gradient value. Any positive value greater than or equal to the threshold sets the two bits to `11`, any negative value whose absolute value is greater than or equal to the threshold sets the two bits to `10`, and all other values set them to `00`. This enables us to store 16 quantized gradients in one float. The quantization error, which is `original_value - quantized_value`, is stored in a gradient residual.
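Purely as an illustration of the per-element rule above (hypothetical Python; the actual implementation packs sixteen 2-bit codes into one float in C++/CUDA):

```python
def encode_2bit(value, threshold):
    """Return the 2-bit code and the quantization error for one gradient value."""
    if value >= threshold:
        code, quantized = 0b11, threshold
    elif value <= -threshold:
        code, quantized = 0b10, -threshold
    else:
        code, quantized = 0b00, 0.0
    return code, value - quantized   # the error goes into the gradient residual

print(encode_2bit(0.7, 0.5))    # code 3 (0b11), error ~0.2
print(encode_2bit(-0.6, 0.5))   # code 2 (0b10), error ~-0.1
print(encode_2bit(0.1, 0.5))    # code 0 (0b00), error 0.1
```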
   
   ### Types of kvstore
   Supported types of kvstore are `device` and all distributed kvstores (such as `dist_sync`, `dist_async`, and `dist_sync_device`). When the kvstore is `device`, the communication between GPUs is compressed. Please note that this increases the memory usage of GPUs because of the additional residual stored. When using a distributed kvstore, worker-to-server communication is compressed. Server-to-worker communication and device-to-device communication are not compressed, to avoid multiple levels of compression. In this case, compression and decompression happen on the CPU, and gradient residuals are stored on the CPU.
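As a reference sketch (to be checked against the actual API), compression on a distributed kvstore could be enabled as follows; `set_gradient_compression` and the parameter keys are my understanding of the KVStore interface, and the script is assumed to be launched as part of a distributed job (for example through MXNet's launcher):

```python
import mxnet as mx

# Worker-to-server communication is compressed; server-to-worker and
# device-to-device communication stay uncompressed, as described above.
kv = mx.kvstore.create('dist_sync')
kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})
```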
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
