stephenrawls opened a new issue #14210: Slow Async GPU Copy
URL: https://github.com/apache/incubator-mxnet/issues/14210

Some of our NLP models need multiple data arrays as input, sometimes up to 6 different data ndarrays. This is a big problem for multi-GPU training because the CPU -> GPU copy time is very slow. In particular, while the *actual* time to copy the data via CUDA memcpy calls is fast, the Python overhead of calling into the MXNet C API appears to be the bottleneck.

Here is a sample script that shows what I mean:

```
import time
import mxnet as mx

def generate_data(num_inputs, ngpus):
    # One list of `num_inputs` small CPU arrays per GPU.
    return [[mx.nd.random.randn(1) for _ in range(num_inputs)] for _ in range(ngpus)]

def send_to_gpu(data, ctx_list):
    # Queue an async CPU -> GPU copy for every array on every GPU.
    return [[data[i][j].as_in_context(ctx_list[i]) for j in range(len(data[0]))]
            for i in range(len(data))]

ctx_list = [mx.gpu(i) for i in range(8)]

# Need to do this as warmup
data = generate_data(num_inputs=1, ngpus=8)
send_to_gpu(data, ctx_list)

for num_inputs in range(1, 11):
    data = generate_data(num_inputs, ngpus=8)
    start = time.time()
    data = send_to_gpu(data, ctx_list)
    end = time.time()
    print("Num Inputs: %d. Took %f ms to set off all async copies" % (num_inputs, 1000 * (end - start)))
```

The output on a p3.16xlarge instance is:

```
% python3 ~/test_send_to_gpu_gh.py
Num Inputs: 1. Took 0.680685 ms to set off all async copies
Num Inputs: 2. Took 2.131939 ms to set off all async copies
Num Inputs: 3. Took 2.979040 ms to set off all async copies
Num Inputs: 4. Took 4.072189 ms to set off all async copies
Num Inputs: 5. Took 4.901409 ms to set off all async copies
Num Inputs: 6. Took 7.693768 ms to set off all async copies
Num Inputs: 7. Took 6.579638 ms to set off all async copies
Num Inputs: 8. Took 7.871866 ms to set off all async copies
Num Inputs: 9. Took 8.602858 ms to set off all async copies
Num Inputs: 10. Took 9.772539 ms to set off all async copies
```

My thoughts are:

(1) Can I move this overhead out of the main training thread and hide the latency the same way we hide data-loading latency? I think to do that I would need something like the CUDA IPC support that PyTorch has: https://github.com/pytorch/pytorch/blob/220ce8046e5fcf1434f948795bcdefda33e95e9a/torch/multiprocessing/reductions.py

(2) Can I reduce the per-call overhead of `.as_in_context()`? I am *mostly* sure the call itself is asynchronous and is simply suffering from high dispatch overhead. My thought there was to try the Cython bindings, which at least some places on the internet suggest have lower overhead when calling into the C API, but it looks like those are currently broken pending this patch: https://github.com/apache/incubator-mxnet/pull/10951

Can someone look at this and let me know if I'm doing anything obviously wrong with the way I am trying to asynchronously copy data to each GPU? And is there an easy way to get faster CPU -> GPU copy times? (Again, the actual data is relatively small and the copy itself is quick; the problem is the overhead of needing to call copy 8 (GPUs) * 6 (input arrays) = 48 times.)

Thanks!
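For reference, to double-check that these really are asynchronous dispatches (i.e. that the times above are Python/C API overhead rather than the copy itself), the same loop can be timed once as-is and once followed by an explicit `mx.nd.waitall()`; the gap between the two timings is the actual copy time. A minimal sketch (sizes and names are just illustrative):

```
import time
import mxnet as mx

ctx_list = [mx.gpu(i) for i in range(8)]
data = [[mx.nd.random.randn(1) for _ in range(6)] for _ in range(8)]

start = time.time()
copied = [[a.as_in_context(ctx) for a in arrays]
          for arrays, ctx in zip(data, ctx_list)]
dispatch = time.time() - start   # time to queue the 48 copies only

mx.nd.waitall()                  # block until every queued copy has finished
total = time.time() - start      # dispatch overhead + actual copy time

print("dispatch: %.3f ms, total: %.3f ms" % (1000 * dispatch, 1000 * total))
```

Regarding (1), one workaround I have considered but not verified is dispatching the copies from a small thread pool, so the per-call overhead overlaps across GPUs; whether this helps at all depends on how much of that overhead is spent holding the GIL:

```
from concurrent.futures import ThreadPoolExecutor

def copy_one_gpu(arrays, ctx):
    # Queue all async copies for a single GPU from a worker thread.
    return [a.as_in_context(ctx) for a in arrays]

with ThreadPoolExecutor(max_workers=len(ctx_list)) as pool:
    futures = [pool.submit(copy_one_gpu, arrays, ctx)
               for arrays, ctx in zip(data, ctx_list)]
    copied = [f.result() for f in futures]
```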