Have you tried the double2 or double4 data types? From memory, I have seen float benchmarks showing that vectorized float{2,3,4} loads are faster than plain float.
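For what it's worth, the idea is that each thread then moves 16 bytes per load/store instead of 8, which tends to use the memory bus more efficiently. A minimal sketch of a vectorized copy kernel (the kernel name and parameters are illustrative, not taken from Jesse's attached script):

```cuda
// Copy kernel using double2: each thread moves one 16-byte element.
// n2 is the number of double2 elements, i.e. n / 2 for n doubles
// (n assumed even here).
__global__ void copy_vec2(const double2 *in, double2 *out, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        out[i] = in[i];
}
```

Note that double2 loads require 16-byte alignment, which pycuda.gpuarray allocations normally satisfy.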
Fred

On Thu, Feb 16, 2012 at 2:57 PM, Jesse Lu <jess...@stanford.edu> wrote:
> Hi everyone,
>
> I ran a simple experiment today, which consisted of trying to maximize the
> memory (device memory) throughput of a very simple kernel. I was slightly
> disappointed that I was only able to achieve 72% of the theoretical maximum
> bandwidth. My GPU is a C2070. The file is attached and is executed using:
>
> $ python test_pycuda_speed.py
> 0.72196600476 utilization (1.0 is perfect utilization).
> Achieved bandwidth: 98 GB/s
> Theoretical maximum bandwidth: 136 GB/s
> Fastest kernel execution time: 0.000777023971081
> Optimum block shape: (160, 1, 1)
> .
> ----------------------------------------------------------------------
> Ran 1 test in 0.814s
>
> OK
>
> The questions that I have are:
>
> How close can others get to the theoretical peak bandwidth?
> Any suggested tweaks to increase performance?
>
> Thanks!
>
> Jesse
>
> _______________________________________________
> PyCUDA mailing list
> PyCUDA@tiker.net
> http://lists.tiker.net/listinfo/pycuda

_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
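As a rough cross-check of the "theoretical maximum" figure: peak device-memory bandwidth is commonly estimated as 2 (DDR) times the memory clock times the bus width, the same formula NVIDIA's bandwidthTest sample uses. The clock and bus width below are assumed C2070 values rather than numbers read from the device, so the result need not match the 136 GB/s reported by the script:

```python
# Sketch: estimating peak bandwidth and utilization for a C2070.
# Assumed values -- not queried from the device.
memory_clock_hz = 1.494e9        # GDDR5 memory clock (assumed)
bus_width_bytes = 384 // 8       # 384-bit memory bus

# DDR transfers twice per clock, hence the factor of 2.
peak_gb_s = 2 * memory_clock_hz * bus_width_bytes / 1e9

achieved_gb_s = 98.0             # measured figure from the post
utilization = achieved_gb_s / peak_gb_s
print("peak %.1f GB/s, utilization %.2f" % (peak_gb_s, utilization))
```

The gap between this estimate and the script's 136 GB/s figure may come from how the script queries the memory clock (e.g. via device attributes) or from ECC being enabled, which costs bandwidth on Fermi cards.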