On Thu, Feb 16, 2012, at 11:57, Jesse Lu wrote:
> Hi everyone,
>
> I ran a simple experiment today, which consisted of trying to maximize the
> memory (device memory) throughput of a very simple kernel.

I don't have PyCUDA installed at the moment, so I can't try this myself, but your benchmark reads from and writes to two arrays that aren't declared with __restrict__. Can nvcc assume they aren't aliased? If you declare the arrays as unaliased, do you see an improvement?

Also, if you only write to memory (e.g., a[ind] = 1.0;), do you get better bandwidth utilization?

The CUDA Programming Guide also mentions that kernels which mix memory access with computational work tend to do better on both counts, because memory access and computation can be scheduled to overlap, but I don't know how significant that is.

Marmaduke
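
For concreteness, here is an untested sketch of the two variants I have in mind. I haven't run it, and it is not your original kernel: the kernel names (copy_restrict, write_only), the array size, and the launch configuration are all made up for illustration. It is written as standalone CUDA C so it can be built with nvcc directly, but the kernel source could just as well be passed to PyCUDA.

```cuda
// Untested sketch: names, sizes, and launch configuration are assumptions.
// Error checking is omitted to keep the example short.
#include <cstdio>
#include <cuda_runtime.h>

// Variant 1: copy kernel with both pointers declared as non-aliasing.
__global__ void copy_restrict(const float* __restrict__ src,
                              float* __restrict__ dst, int n)
{
    int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind < n)
        dst[ind] = src[ind];
}

// Variant 2: write-only kernel, no global memory reads at all.
__global__ void write_only(float* __restrict__ a, int n)
{
    int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind < n)
        a[ind] = 1.0f;
}

int main()
{
    const int n = 1 << 22;                  // 4M floats (assumed size)
    const size_t bytes = n * sizeof(float);
    const int block = 256;
    const int grid = (n + block - 1) / block;

    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaMemset(src, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Warm-up launch so the timed runs exclude one-time startup cost.
    copy_restrict<<<grid, block>>>(src, dst, n);
    cudaDeviceSynchronize();

    // Copy: each element is read once and written once (2 * bytes moved).
    cudaEventRecord(start);
    copy_restrict<<<grid, block>>>(src, dst, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("copy_restrict: %.1f GB/s\n", 2.0 * bytes / (ms * 1e6));

    // Write-only: each element is written once (1 * bytes moved).
    cudaEventRecord(start);
    write_only<<<grid, block>>>(dst, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("write_only:    %.1f GB/s\n", 1.0 * bytes / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

Each kernel is timed over a single launch, so the numbers will be noisy; averaging over many launches would give a fairer comparison against the theoretical peak.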