stephenrawls commented on issue #14210: Slow Async GPU Copy
URL: https://github.com/apache/incubator-mxnet/issues/14210#issuecomment-465442428

As a follow-up, I decided to time how long a single call into the C API takes for a no-op style operation:

```
import time
import mxnet as mx
import numpy as np

x = mx.nd.array([0])

# warmup
y = mx.nd.identity(x)

times = []
for _ in range(1000):
    start = time.time()
    y = mx.nd.identity(x)
    end = time.time()
    times.append(1000 * (end - start))

print("Took %0.3f +/- %0.2fms" % (np.mean(times), np.std(times)))
```

The output I get is:

```
Took 0.037 +/- 0.01ms
```

Multiplying this by 48 gives an expected overhead of 1.776 ms. That is still too much to pay at the start of every training loop, but there is a big gap between that and the 6-7 ms I am observing.

It turns out `.as_in_context` is also calling the following code to create an NDArray handle:

```
NDArray(_new_alloc_handle(self.shape, other, True, self.dtype))
```

From my timing tests this adds another 0.816 ms of latency when doing 48 calls. So I guess what we have, with 8 GPUs and 6 input arrays per GPU, is roughly:

```
1.776 ms latency from calls into the C API
0.816 ms from new NDArray handle allocation
3-5  ms from, I guess, CopyFromToSimple() & its call graph
```