Hi Magnus,

On Thu, 3 Mar 2011 16:50:29 +0100, Magnus Paulsson <paulsso...@gmail.com> wrote:
> I have updated the code to reuse allocated memory both on the device
> and host page-locked memory (since I think those allocations were the
> cause of the blocking calls).
> 
> The streamed version is now faster than the "serial" code. However, I
> still think that the speed increase is simply due to faster mem-copies
> (from/to page-locked memory) and not from any overlap between stream 1
> and 2. At least if I trust the visual profiler.
> 
> So ... am I doing things wrong?

Looking at your streams.py code, I'm wondering why you're expecting
things to run in parallel if you're synchronizing on stream1 and
stream2 right after you finish issuing work to each of them. Wouldn't
that explicitly prevent any overlap between the two streams?
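
For comparison, here's a minimal sketch (with a made-up toy kernel,
not your actual code) of the pattern I'd expect to overlap: enqueue
the async copies and kernel launches on both streams first, and only
synchronize once at the very end.

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
""")
scale = mod.get_function("scale")

n = 1 << 20
streams = [cuda.Stream(), cuda.Stream()]

# reusable page-locked host buffers and device buffers, one per stream
h_bufs = [cuda.pagelocked_empty(n, np.float32) for _ in streams]
d_bufs = [cuda.mem_alloc(h.nbytes) for h in h_bufs]

for h in h_bufs:
    h[:] = np.random.rand(n).astype(np.float32)

# enqueue copy-in, kernel, copy-out on each stream *without* syncing in between
for h, d, s in zip(h_bufs, d_bufs, streams):
    cuda.memcpy_htod_async(d, h, s)
    scale(d, np.int32(n), block=(256, 1, 1), grid=((n + 255) // 256, 1),
          stream=s)
    cuda.memcpy_dtoh_async(h, d, s)

# only now wait for both streams to finish
for s in streams:
    s.synchronize()

With the per-stream synchronize calls moved to the end like this, the
driver at least has the chance to overlap stream1's kernel with
stream2's copies (hardware permitting).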

What am I missing?

Andreas

_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda