I have updated the code to reuse allocated memory, both on the device and in page-locked host memory (since I think those allocations were the cause of the blocking calls).
The streamed version is now faster than the "serial" code. However, I still think the speed increase is simply due to faster memcopies (to/from page-locked memory) and not due to any overlap between streams 1 and 2, at least if I trust the visual profiler. So ... am I doing something wrong?

-Magnus

ps. GTX 470, which should be able to compute and copy memory at the same time. PyCUDA from git, CUDA 3.2, pyfft 0.34.

-----------------------------------------------
Magnus Paulsson
Assistant Professor
School of Computer Science, Physics and Mathematics
Linnaeus University
Phone: +46-480-446308
Mobile: +46-70-6942987
serial.py
Description: Binary data
streams.py
Description: Binary data
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda