CUDA Graphs support in MXNet 1.8
--------------------------------

When optimizing Deep Learning models on GPUs for training or inference, a lot of effort is often put into optimizing kernel runtimes. However, especially for workloads with small per-GPU batch sizes, such as cluster-scale training jobs or latency-optimized inference, the CPU portion of the operators' execution time can limit overall throughput. This CPU overhead includes the logic for choosing the right parameters and kernel to launch, as well as the cost of the launch itself. CUDA Graphs is an NVIDIA CUDA feature that can mitigate these overheads.
By using CUDA Graphs, one can capture a sequence of driver calls once and then replay that sequence with greater efficiency, removing the CPU from the critical path. CUDA Graphs became a viable enhancement to MXNet once the feature was fully supported by cuDNN v8 on CUDA 11, and it is now an environment-variable-enabled feature of MXNet 1.8. In this talk we present an overview of CUDA Graphs to give insight into its performance advantages. We describe the history and details of the CUDA Graphs integration into MXNet. Since CUDA Graphs captures kernel arguments as well as the kernel launches themselves, we discuss how this motivated the integration approach. Finally, we describe how to enable CUDA Graphs within MXNet, which modeling scenarios are most likely to benefit, and how to write new GPU operators that are compatible with CUDA Graphs.
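To make the capture-and-replay idea concrete, the sketch below shows the underlying CUDA runtime mechanism (independent of MXNet): a stream's kernel launches are recorded into a graph, instantiated once, and then replayed with a single launch call per iteration. The `scale` kernel and the iteration counts are illustrative stand-ins for a framework's operator sequence, not anything from MXNet itself; it requires a CUDA-capable GPU to run.

```cuda
#include <cuda_runtime.h>

// Trivial kernel standing in for a framework operator.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a sequence of kernel launches into a graph instead of
    // executing them immediately. The kernels' arguments are recorded
    // as part of the graph, which is why argument stability matters
    // for the framework integration.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int step = 0; step < 10; ++step)
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d, 1.001f, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then replay: each iteration costs one
    // cudaGraphLaunch on the CPU instead of ten kernel launches.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(d);
    return 0;
}
```

Because the captured graph bakes in kernel arguments (such as the device pointer `d` above), a framework replaying a graph must ensure those arguments remain valid across replays, which is the design constraint the talk discusses for the MXNet integration.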

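As a usage sketch, enabling the feature in MXNet 1.8 is a matter of setting an environment variable before running an unmodified script. `MXNET_ENABLE_CUDA_GRAPHS` is the switch as we understand it from the MXNet 1.8 release; verify the exact variable name against the documentation for your build, and `train.py` is a hypothetical script name.

```shell
# Run an existing MXNet 1.8 training or inference script with
# CUDA Graphs enabled (no code changes required).
# Variable name assumed from MXNet 1.8; check your build's docs.
export MXNET_ENABLE_CUDA_GRAPHS=1
echo "MXNET_ENABLE_CUDA_GRAPHS=$MXNET_ENABLE_CUDA_GRAPHS"
# python train.py   # hypothetical script, run as usual
```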