Hi Magnus,

On Thu, 3 Mar 2011 16:50:29 +0100, Magnus Paulsson <paulsso...@gmail.com> wrote:
> I have updated the code to reuse allocated memory, both on the device
> and host page-locked memory (since I think those allocations were the
> cause of the blocking calls).
>
> The streamed version is now faster than the "serial" code. However, I
> still think that the speed increase is simply due to faster mem-copies
> (from/to page-locked memory) and not from any overlap between stream 1
> and 2. At least if I trust the visual profiler.
>
> So ... am I doing things wrong?
Looking at your streams.py code, I'm wondering why you're expecting things to run in parallel when you're synchronizing with both stream1 and stream2 after you're done with each of them. Wouldn't that explicitly prevent any parallelism between them? What am I missing?

Andreas
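P.S. The effect of synchronizing right after issuing each stream's work can be sketched in plain Python with threads — this is only an analogy, not PyCUDA, and all the names in it are hypothetical. Joining a worker immediately after starting it serializes the two workers, just as calling `stream.synchronize()` after each stream's commands would prevent any overlap between the streams:

```python
# Analogy in plain Python threads (NOT PyCUDA): a join() right after
# start() plays the role of stream.synchronize() right after issuing
# a stream's async copies/kernels, and forces serial execution.
import threading
import time

def work():
    time.sleep(0.2)  # stands in for an async memcpy + kernel launch

def serialized():
    """Synchronize after each unit of work -> no overlap (~0.4 s)."""
    start = time.monotonic()
    for _ in range(2):
        t = threading.Thread(target=work)
        t.start()
        t.join()  # "synchronize" immediately after issuing
    return time.monotonic() - start

def overlapped():
    """Issue everything first, synchronize once at the end (~0.2 s)."""
    start = time.monotonic()
    threads = [threading.Thread(target=work) for _ in range(2)]
    for t in threads:
        t.start()   # issue all work up front...
    for t in threads:
        t.join()    # ...synchronize only when everything is in flight
    return time.monotonic() - start

print(serialized() > overlapped())  # the deferred-sync version wins
```

The same ordering applies to streams: queue the copies and kernels on both streams first, and only then synchronize, otherwise the driver has nothing to overlap.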
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda