Thanks Andreas, you're quite right, and you steered me in the right
direction :-) . If I cache the data in local memory outside the inner
loop in the benchmark_all example and increase the local work size, I
manage 47 GFLOPS (out of a theoretical 100 GFLOPS), which is much more
like what I was expecting. Thanks for your help.
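
For anyone reading along, a minimal sketch of the kind of change
described above. This is not the actual benchmark_all code: the kernel
name cached_loop, its arithmetic, the problem size, and the local work
size of 256 are all assumptions made for illustration. The point is
only that global memory is read once before the loop and written once
after it.

```python
# Hypothetical kernel (an assumption, not the benchmark_all source):
# each work item copies its inputs into local memory once, loops on
# the cached values, and writes its result back once.
KERNEL_SRC = """
__kernel void cached_loop(__global const float *a,
                          __global const float *b,
                          __global float *c,
                          __local float *a_loc,
                          __local float *b_loc,
                          const int n_iter)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    /* Read global memory once, before the inner loop. */
    a_loc[lid] = a[gid];
    b_loc[lid] = b[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    float acc = 0.0f;
    for (int i = 0; i < n_iter; i++) {
        acc = acc * a_loc[lid] + b_loc[lid];  /* 2 flops per iteration */
    }

    /* Write the result back once, after the loop. */
    c[gid] = acc;
}
"""


def gflops(flop_count, seconds):
    """Convert a floating-point operation count and a time to GFLOPS."""
    return flop_count / seconds / 1e9


def run_on_device(n=2**20, n_iter=1000, local_size=256):
    """Run the sketch kernel; n must be a multiple of local_size."""
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    mf = cl.mem_flags

    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prg = cl.Program(ctx, KERNEL_SRC).build()
    local_bytes = 4 * local_size  # one float32 per work item
    evt = prg.cached_loop(queue, (n,), (local_size,),
                          a_buf, b_buf, c_buf,
                          cl.LocalMemory(local_bytes),
                          cl.LocalMemory(local_bytes),
                          np.int32(n_iter))
    evt.wait()

    # Profiling timestamps are in nanoseconds.
    seconds = 1e-9 * (evt.profile.end - evt.profile.start)
    return gflops(2 * n * n_iter, seconds)


if __name__ == "__main__":
    print("Achieved GFLOPS:", run_on_device())
```

Timing with the event's profiling counters measures the kernel itself
rather than the time taken to enqueue it, so the GFLOPS figure is not
inflated by an asynchronous launch returning early.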
Execution time of test without OpenCL: 10.1647880077 s
===============================================================
Platform name: NVIDIA
Platform profile: FULL_PROFILE
Platform vendor: NVIDIA Corporation
Platform version: OpenCL 1.0
---------------------------------------------------------------
Device name: GeForce 8600 GT
Device type: GPU
Device memory: 255 MB
Device max clock speed: 1188 MHz
Device compute units: 4
Execution time of test: 9.9648e-05 s
Results OK
Andreas Klöckner wrote:
> On Thursday 17 September 2009, Lyndon Whaite wrote:
>> Thanks Andreas. No, I don't think so. I was using a kernel very similar
>> to the benchmark_all example.
>
> That's a contradiction--benchmark-all is purely memory-bound. :)
>
> Andreas
------------------------------------------------------------------------
_______________________________________________
PyOpenCL mailing list
[email protected]
http://tiker.net/mailman/listinfo/pyopencl_tiker.net