Hi Jean-Matthieu,

Jean-Matthieu Etancelin <[email protected]> writes:
> While optimizing some host-device data transfers, I came to the little piece
> of code given below.
> My questions are : why such a long time is spent on the non-blocking copy
> launching ? What can I do to have a ‘real’ non-blocking call in order to do
> some computations on the host before waiting the copy completion ?
> In the example launch time ~ profile time where launch time is the
> cl.enqueue_copy calling time and profile time come from the event profiling
> informations. I was expecting that wait time ~ profile time.
> The result on a K20m is :
> In [15]: print "Launch time=", t_wait - t_start
> Launch time= 0.373787879944
>
> In [16]: print "Wait time", t_end - t_wait
> Wait time 0.0372970104218
>
> In [17]: print "Profile time", 1e-9 * (evt.profile.end - evt.profile.start)
> Profile time 0.338622592

On Nvidia implementations, the host memory from which you want to do
async copies has to be "page-locked", which in terms of their OpenCL
implementation means that it has to be allocated as a buffer with the
ALLOC_HOST_PTR flag.

Hope that helps,
Andreas
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
