CUDA Graphs support in MXNet 1.8
--------------------------------
When optimizing both training and inference Deep Learning models on GPUs, a lot 
of effort is often put into optimizing kernel runtimes. However, especially for 
workloads with small per-GPU batch sizes like cluster-scale training jobs or 
latency-optimized inference, the CPU portion of the operators' execution time 
can limit the overall throughput. This CPU overhead may include the logic for 
choosing the right parameters and kernel to launch, and for performing the 
launch itself. CUDA Graphs is an NVIDIA CUDA feature that can mitigate these 
overheads.

By using CUDA Graphs, one can capture a sequence of driver calls, then replay 
that sequence going forward with greater efficiency, thereby eliminating the 
CPU from the critical path. CUDA Graphs became a viable enhancement to MXNet 
once the feature gained full support in cuDNN v8 with CUDA 11, and it is now an 
environment-variable-enabled feature of MXNet 1.8. In this talk we present an 
overview of CUDA Graphs to give insight into its performance advantages.  We 
describe the history and details of the CUDA Graphs integration into MXNet.  
Since CUDA Graphs captures kernel arguments as well as the kernel launches 
themselves, we discuss how this motivated the integration approach. Finally, we 
describe how to enable CUDA Graphs use within MXNet, what modeling scenarios 
are most likely to benefit from CUDA Graphs, and how to create new GPU 
operators that are compatible with CUDA Graphs.
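As a minimal sketch of the enablement step described above, the feature is switched on through an environment variable set before MXNet is imported. The variable name `MXNET_ENABLE_CUDA_GRAPHS` below is an assumption; consult the MXNet 1.8 release notes for the exact name and any related tuning variables:

```python
import os

# Assumed variable name for the MXNet 1.8 CUDA Graphs switch; the engine
# reads it at initialization, so it must be set before importing mxnet.
os.environ["MXNET_ENABLE_CUDA_GRAPHS"] = "1"
flag = os.environ["MXNET_ENABLE_CUDA_GRAPHS"]

# import mxnet as mx
# ... build and run the model as usual; for operators that support it,
# graph capture and replay then happen transparently inside the engine.
```

No model code changes are required with this approach: the capture of kernel launches and their arguments, and the subsequent replay, are handled inside the engine for the supported GPU operators.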
