DickJC123 commented on issue #14006: Dual stream cudnn Convolution backward() with MXNET_GPU_WORKER_NSTREAMS=2. URL: https://github.com/apache/incubator-mxnet/pull/14006#issuecomment-465283374

After the rework of this PR to make it far simpler to use within operators, I went back and re-measured the single-GPU training speeds. The gains I measured on a run of Resnet50 v1b across one 32 GB Volta GPU (also with DALI, in NVIDIA's MXNet container) were:

```
batchsize 32:  2.9%  speedup
batchsize 64:  1.4%  speedup
batchsize 128: 0.95% speedup
batchsize 256: 0.15% speedup
```

The speedup compares the 2nd-epoch "time cost"; the 1st epoch is excluded because of cudnnFind() and DALI overheads that are unique to the 1st epoch. Single-GPU training is not really the target of this PR, but this at least shows there is still a roughly 1% improvement at a typical batch size of 128. I don't recommend enabling 2 streams by default, however, because the increased use of global memory might make some users' models too big to run.

Looking for any further reviewer input.
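For anyone wanting to try the feature, a minimal sketch of opting in from a training script: since `MXNET_GPU_WORKER_NSTREAMS` is read from the environment, it must be set before MXNet initializes. The variable name comes from this PR's title; everything else here (setting it via `os.environ` rather than the shell) is just one convenient way to do it.

```python
import os

# Opt in to the dual-stream cuDNN convolution backward() path.
# Note: the second stream increases global-memory use, so leave this
# unset (default: 1 stream) for models that are already near the
# memory limit of the GPU.
os.environ.setdefault("MXNET_GPU_WORKER_NSTREAMS", "2")

# Import MXNet only *after* the variable is set, so the engine
# picks it up when it creates its GPU worker streams.
# import mxnet as mx  # then build and train the model as usual
```

Equivalently, `MXNET_GPU_WORKER_NSTREAMS=2 python train.py` from the shell achieves the same thing without touching the script.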