Hi Jean-Matthieu,
Jean-Matthieu Etancelin <[email protected]> writes:
> While optimizing some host-device data transfers, I arrived at the small piece
> of code given below.
> My questions are: why is so much time spent launching the non-blocking copy?
> What can I do to get a 'real' non-blocking call, so that I can do some
> computations on the host before waiting for the copy to complete?
> In the example, launch time ~ profile time, where launch time is the time
> taken by the cl.enqueue_copy call and profile time comes from the event
> profiling information. I was expecting wait time ~ profile time.
> The result on a K20m is:
> In [15]: print "Launch time=", t_wait - t_start
> Launch time= 0.373787879944
>
> In [16]: print "Wait time", t_end - t_wait
> Wait time 0.0372970104218
>
> In [17]: print "Profile time", 1e-9 * (evt.profile.end - evt.profile.start)
> Profile time 0.338622592

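On Nvidia implementations, the host memory that you want to copy from
asynchronously has to be "page-locked" (pinned). In terms of their
OpenCL implementation, that means it has to be allocated as a buffer
with the ALLOC_HOST_PTR flag. With ordinary pageable host memory the
enqueue call does not return until the transfer is essentially done,
which would explain why your launch time ~ profile time.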
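
To make that concrete, here is a minimal sketch of the usual pattern in
PyOpenCL, not a tested drop-in for your code. The function name
pinned_async_upload and its parameters are made up for illustration, and
whether the driver really hands out page-locked memory for ALLOC_HOST_PTR
(and really overlaps the copy with host work) depends on the implementation.
The imports are done inside the function only so that the definition loads
on a machine without OpenCL.

```python
def pinned_async_upload(ctx, queue, n, dtype="float32"):
    """Stage n elements in an ALLOC_HOST_PTR buffer and start a
    non-blocking host-to-device copy. Hypothetical helper, for illustration."""
    # Lazy imports: the function can be defined without OpenCL installed.
    import numpy as np
    import pyopencl as cl

    mf = cl.mem_flags
    nbytes = n * np.dtype(dtype).itemsize

    # Staging buffer that the implementation allocates in host memory.
    # On Nvidia this is the way to obtain page-locked memory.
    pinned_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=nbytes)

    # Ordinary device buffer that receives the data.
    dev_buf = cl.Buffer(ctx, mf.READ_WRITE, size=nbytes)

    # Map the staging buffer to get a numpy array backed by that memory.
    host_view, _map_evt = cl.enqueue_map_buffer(
        queue, pinned_buf, cl.map_flags.WRITE, 0, (n,), dtype,
        is_blocking=True)

    # Fill the host-side data in place (placeholder content).
    host_view[:] = np.arange(n, dtype=dtype)

    # Non-blocking copy from the mapped host memory to the device.
    evt = cl.enqueue_copy(queue, dev_buf, host_view, is_blocking=False)

    # Keep pinned_buf and host_view alive until evt has completed.
    return evt, dev_buf, host_view, pinned_buf
```

You would call it with your own context and a queue, do the host-side
computations, and only then call evt.wait(). If the memory is in fact
pinned, the enqueue_copy call should return quickly and most of the
transfer time should show up in the wait instead.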

Hope that helps,
Andreas

_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl