On Thu, Feb 16, 2012, at 11:57, Jesse Lu wrote:
> Hi everyone,
> 
> I ran a simple experiment today, which consisted of trying to maximize
> the memory (device memory) throughput of a very simple kernel.

I don't have pycuda installed at the moment, so I can't try it, but your
benchmark reads from and writes to two arrays that aren't declared with
__restrict__ (can nvcc assume they aren't aliased?). If you declare
the arrays as unaliased, do you see an improvement? Also, if you only
write to memory (e.g., a[ind] = 1.0;), do you get better bandwidth
utilization?
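
A minimal sketch of the two experiments suggested above, assuming the
benchmark is a plain element-wise copy over float arrays. The kernel names
(copy_plain, copy_restrict, write_only), the array size, and the helper
functions are hypothetical and not taken from the original benchmark; the
GPU part only runs when run_benchmark() is called on a machine with pycuda
and a CUDA device.

```python
# CUDA C source for three variants of the same trivial kernel.
KERNEL_SOURCE = r"""
// Baseline: the compiler must assume a and b may alias.
__global__ void copy_plain(float *a, const float *b, int n)
{
    int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind < n) a[ind] = b[ind];
}

// Same copy, but both pointers are promised not to alias.
__global__ void copy_restrict(float * __restrict__ a,
                              const float * __restrict__ b, int n)
{
    int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind < n) a[ind] = b[ind];
}

// Write-only variant: no global memory reads at all.
__global__ void write_only(float *a, int n)
{
    int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind < n) a[ind] = 1.0f;
}
"""


def effective_bandwidth_gb_s(bytes_moved, elapsed_ms):
    """Bytes read plus bytes written, divided by elapsed time, in GB/s."""
    return bytes_moved / (elapsed_ms * 1e-3) / 1e9


def run_benchmark(n=1 << 22, block=256, repeats=20):
    """Time each kernel variant and return {name: GB/s}. Needs a GPU."""
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates a context on import)
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule(KERNEL_SOURCE)
    a = gpuarray.zeros(n, dtype=np.float32)
    b = gpuarray.zeros(n, dtype=np.float32)
    grid = ((n + block - 1) // block, 1)
    launch_cfg = dict(block=(block, 1, 1), grid=grid)
    n_arg = np.int32(n)
    item = 4  # sizeof(float)

    def time_ms(launch):
        launch()  # warm-up launch, not timed
        start, end = drv.Event(), drv.Event()
        start.record()
        for _ in range(repeats):
            launch()
        end.record()
        end.synchronize()
        return start.time_till(end) / repeats

    results = {}
    for name in ("copy_plain", "copy_restrict"):
        fn = mod.get_function(name)
        ms = time_ms(lambda fn=fn: fn(a.gpudata, b.gpudata, n_arg, **launch_cfg))
        # One read and one write per element.
        results[name] = effective_bandwidth_gb_s(2 * n * item, ms)

    fn = mod.get_function("write_only")
    ms = time_ms(lambda: fn(a.gpudata, n_arg, **launch_cfg))
    # One write per element, no reads.
    results["write_only"] = effective_bandwidth_gb_s(n * item, ms)
    return results
```

Calling run_benchmark() and comparing the three numbers against the
card's theoretical peak would show whether aliasing or the read side is
what limits throughput; whether __restrict__ makes any difference for a
kernel this simple is exactly the open question.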

The CUDA programming guide also mentions that kernels that mix memory
access and computational work tend to perform better overall, because
the compiler can schedule memory access and computation to overlap,
but I don't know how significant that is.

Marmaduke

_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
