DickJC123 commented on issue #14006: Dual stream cudnn Convolution backward() 
with MXNET_GPU_WORKER_NSTREAMS=2.
URL: https://github.com/apache/incubator-mxnet/pull/14006#issuecomment-465283374
 
 
   After reworking this PR to make it far simpler to use within operators, 
I went back and re-measured single-GPU training speeds.  The perf gains I 
measured for Resnet50 v1b on a single 32 GB Volta GPU (also with DALI, in 
NVIDIA's MXNet container) were:
   
   ```
   batchsize  32: 2.9% speedup
   batchsize  64: 1.4% speedup
   batchsize 128: 0.95% speedup
   batchsize 256: 0.15% speedup
   ```
   
   The speedup is based on a comparison of the 2nd-epoch "time cost"; the 
1st-epoch time is excluded because of cuDNNFind() and DALI overheads that 
are unique to the 1st epoch.
   
   Single-GPU training is not really the target of this PR, but at least this 
shows there is still a roughly 1% improvement at a typical batchsize of 128.  
I don't recommend enabling 2 streams by default, however, because the 
increased use of global memory might make some users' models too big to run.
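   Since the feature is opt-in, enabling it is just a matter of setting the environment variable named in this PR's title before launching training. A minimal sketch (the training command on the commented line is illustrative, not from the PR):

   ```shell
   # Opt in to the second per-GPU-worker stream (env var from this PR).
   export MXNET_GPU_WORKER_NSTREAMS=2
   echo "MXNET_GPU_WORKER_NSTREAMS=$MXNET_GPU_WORKER_NSTREAMS"
   # python train_imagenet.py --network resnet50_v1b --batch-size 128 --gpus 0
   ```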
   
   Looking for any further reviewer input.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services