stephenrawls opened a new issue #14210: Slow Async GPU Copy
URL: https://github.com/apache/incubator-mxnet/issues/14210
 
 
   Some of our NLP models take multiple data arrays as input, sometimes up to 6 different ndarrays. This is a big problem for multi-GPU training because the CPU -> GPU copy time is very slow. In particular, whereas the *actual* time to copy the data via cuda memcpy calls is fast, the Python overhead of calling into the MXNet C API appears to be the bottleneck.
   
   Here is a sample script that shows what I mean:
   ```
   import time
   import mxnet as mx

   def generate_data(num_inputs, ngpus):
       # One small CPU ndarray per (gpu, input) pair.
       return [[mx.nd.random.randn(1) for _ in range(num_inputs)] for _ in range(ngpus)]

   def send_to_gpu(data, ctx_list):
       # Launch an async CPU -> GPU copy for every input array on every GPU.
       return [[data[i][j].as_in_context(ctx_list[i]) for j in range(len(data[0]))] for i in range(len(data))]

   ctx_list = [mx.gpu(i) for i in range(8)]

   # Need to do this as warmup
   data = generate_data(num_inputs=1, ngpus=8)
   send_to_gpu(data, ctx_list)
   mx.nd.waitall()  # make sure warmup work has finished before timing

   for num_inputs in range(1, 11):
       data = generate_data(num_inputs, ngpus=8)

       start = time.time()
       data = send_to_gpu(data, ctx_list)
       end = time.time()

       print("Num Inputs: %d. Took %f ms to set off all async copies" % (num_inputs, 1000*(end-start)))
   ```
   
   The output on a p3.16xlarge instance is:
   ```
   % python3 ~/test_send_to_gpu_gh.py
   Num Inputs: 1. Took 0.680685 ms to set off all async copies
   Num Inputs: 2. Took 2.131939 ms to set off all async copies
   Num Inputs: 3. Took 2.979040 ms to set off all async copies
   Num Inputs: 4. Took 4.072189 ms to set off all async copies
   Num Inputs: 5. Took 4.901409 ms to set off all async copies
   Num Inputs: 6. Took 7.693768 ms to set off all async copies
   Num Inputs: 7. Took 6.579638 ms to set off all async copies
   Num Inputs: 8. Took 7.871866 ms to set off all async copies
   Num Inputs: 9. Took 8.602858 ms to set off all async copies
   Num Inputs: 10. Took 9.772539 ms to set off all async copies
   ```
   
   My thoughts are:

   (1) Can I move this overhead out of the main training thread and hide the latency the same way we hide data-loading latency? I think to do that I would need something like the CUDA IPC support that PyTorch has: https://github.com/pytorch/pytorch/blob/220ce8046e5fcf1434f948795bcdefda33e95e9a/torch/multiprocessing/reductions.py
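   In the meantime, a rough sketch of what I mean, reusing `generate_data` and `send_to_gpu` from the script above, and assuming it is safe to issue NDArray ops from a second Python thread in the MXNet version at hand (the GIL still serializes the Python-side work, so this would only hide the launch overhead behind other main-thread work, not remove it):
   ```
   import concurrent.futures
   import mxnet as mx

   # Single worker thread whose only job is to launch the async copies.
   executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

   def send_to_gpu_in_background(data, ctx_list):
       # Returns a Future; call .result() right before the GPU data is needed.
       return executor.submit(send_to_gpu, data, ctx_list)

   ctx_list = [mx.gpu(i) for i in range(8)]
   data = generate_data(num_inputs=6, ngpus=8)
   future = send_to_gpu_in_background(data, ctx_list)
   # ... other main-thread work, e.g. preparing the next batch ...
   gpu_data = future.result()
   ```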
   
   (2) Can I reduce the overhead of each `.as_in_context()` call? I am *mostly* sure the call itself is asynchronous and is simply suffering from high per-call overhead. My thought there was to try the Cython bindings, which at least some places on the internet suggest have lower overhead when calling into the C API, but it looks like that support is currently broken pending this patch: https://github.com/apache/incubator-mxnet/pull/10951
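   A related workaround I have considered is amortizing the per-call overhead by packing all inputs for a device into a single CPU array and issuing one copy per GPU (8 calls instead of 48), then splitting back into the original shapes on the device. A hedged sketch (`send_packed` is my own helper, not an MXNet API), assuming all inputs share one dtype and that the extra concat/reshape launches cost less than the calls they replace:
   ```
   import numpy as np
   import mxnet as mx

   def send_packed(arrays, ctx):
       # Copy a list of same-dtype CPU ndarrays to `ctx` with one CPU -> GPU copy.
       shapes = [a.shape for a in arrays]
       flat = mx.nd.concat(*[a.reshape(-1) for a in arrays], dim=0)
       gpu_flat = flat.as_in_context(ctx)  # the single async copy
       # Slice the flat buffer back into the original shapes on the GPU.
       out, offset = [], 0
       for s in shapes:
           n = int(np.prod(s))
           out.append(gpu_flat[offset:offset + n].reshape(s))
           offset += n
       return out

   # Usage with the per-GPU lists from the script above:
   gpu_data = [send_packed(arrays, ctx) for arrays, ctx in zip(data, ctx_list)]
   ```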
   
   Can someone look at this and let me know if I'm doing anything obviously wrong with the way I am trying to asynchronously copy data to each GPU? And is there an easy way to get faster CPU -> GPU copy times? (Again, the actual data is relatively small and the copy itself is quick; the problem is the overhead of needing to call copy 8 (gpus) * 6 (input arrays) = 48 times.)
   
   Thanks!
