I have updated the code to reuse the allocated memory, both device memory
and page-locked host memory (since I suspect those allocations were the
cause of the blocking calls).
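
For reference, the reuse pattern looks roughly like this (a minimal sketch,
not the attached code; the sizes and names are just placeholders):

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    N = 1024 * 1024  # placeholder size

    # Page-locked (pinned) host buffers, allocated once and reused for
    # every transfer instead of being re-allocated per iteration.
    h_in = drv.pagelocked_empty(N, dtype=np.complex64)
    h_out = drv.pagelocked_empty(N, dtype=np.complex64)

    # Device buffer, likewise allocated once up front.
    d_buf = gpuarray.empty(N, dtype=np.complex64)

    stream = drv.Stream()

    # Async copies only stay asynchronous when the host side is pinned
    # and nothing inside the loop triggers a new allocation.
    drv.memcpy_htod_async(d_buf.gpudata, h_in, stream)
    # ... launch kernels / the FFT on `stream` here ...
    drv.memcpy_dtoh_async(h_out, d_buf.gpudata, stream)
    stream.synchronize()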

The streamed version is now faster than the "serial" code. However, I
still think the speed increase comes simply from faster mem-copies
(to/from page-locked memory) and not from any overlap between streams 1
and 2, at least if I trust the Visual Profiler.
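
For comparison, the way I understand copy/compute overlap is supposed to be
structured is roughly the following (again only a sketch with placeholder
sizes; the actual FFT call is elided):

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    n_chunks, chunk = 4, 256 * 1024  # placeholder sizes
    streams = [drv.Stream(), drv.Stream()]
    h_bufs = [drv.pagelocked_empty(chunk, np.complex64) for _ in range(n_chunks)]
    d_bufs = [gpuarray.empty(chunk, np.complex64) for _ in range(2)]

    for i in range(n_chunks):
        s = streams[i % 2]
        d = d_bufs[i % 2]
        # Upload chunk i on its own stream; while this copy is in flight,
        # the kernel queued on the other stream can still be executing.
        drv.memcpy_htod_async(d.gpudata, h_bufs[i], s)
        # ... enqueue the FFT / kernel for chunk i on stream `s` here ...
        drv.memcpy_dtoh_async(h_bufs[i], d.gpudata, s)

    for s in streams:
        s.synchronize()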

So ... am I doing something wrong?

-Magnus

ps: GTX 470, which should be able to compute and copy memory at the same
time. PyCUDA from git, CUDA 3.2, pyfft 0.34.
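
As a sanity check, the overlap capability can be queried from PyCUDA like
this (sketch):

    import pycuda.autoinit
    import pycuda.driver as drv

    dev = pycuda.autoinit.device
    # 1 means the device can overlap kernel execution with host<->device copies.
    print(dev.get_attribute(drv.device_attribute.GPU_OVERLAP))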

-----------------------------------------------
Magnus Paulsson
Assistant Professor
School of Computer Science, Physics and Mathematics
Linnaeus University
Phone: +46-480-446308
Mobile: +46-70-6942987

Attachment: serial.py
Description: Binary data

Attachment: streams.py
Description: Binary data
