Thanks Andreas, you're quite right, and you steered me in the right direction :-). If I cache the data in local memory outside the inner loop in the benchmark_all example and increase the local work size, I get 47 GFLOPS (out of a theoretical 100 GFLOPS), which is much more like what I was expecting. Thanks for your help.
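
A minimal sketch of the kind of change described above, not the kernel from the original test: each work-item copies its inputs from global to local memory once, before the loop, so the loop itself only touches on-chip memory. The kernel name `cached`, the arithmetic inside the loop, `n_iter` and the default `local_size` of 128 are all assumptions made for illustration. `reference` is a NumPy version of the same arithmetic for checking results, and `run_on_device` needs PyOpenCL and an OpenCL device.

```python
import numpy as np

# Illustrative kernel (hypothetical, not the benchmark_all source):
# global memory is read once before the loop and written once after it.
KERNEL_SRC = """
__kernel void cached(__global const float *a,
                     __global const float *b,
                     __global float *c,
                     __local float *a_loc,
                     __local float *b_loc,
                     const int n_iter)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    /* One global read per input, outside the loop. */
    a_loc[lid] = a[gid];
    b_loc[lid] = b[gid];

    float acc = 0.0f;
    for (int i = 0; i < n_iter; i++)
        acc = acc * 0.5f + a_loc[lid] * b_loc[lid];

    /* One global write, after the loop. */
    c[gid] = acc;
}
"""


def reference(a, b, n_iter):
    """NumPy version of the loop above, for checking device results."""
    acc = np.zeros_like(a)
    for _ in range(n_iter):
        acc = acc * np.float32(0.5) + a * b
    return acc


def run_on_device(a, b, n_iter=1000, local_size=128):
    """Run the kernel; len(a) is assumed to be a multiple of local_size."""
    import pyopencl as cl  # imported here so reference() works without it

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prg = cl.Program(ctx, KERNEL_SRC).build()
    # One float of local memory per work-item in the group, per input array.
    local_bytes = local_size * a.dtype.itemsize
    prg.cached(queue, a.shape, (local_size,),
               a_buf, b_buf, c_buf,
               cl.LocalMemory(local_bytes), cl.LocalMemory(local_bytes),
               np.int32(n_iter))

    c = np.empty_like(a)
    cl.enqueue_copy(queue, c, c_buf)
    return c
```

With float32 arrays `a` and `b` of equal length, `run_on_device(a, b)` can be compared against `reference(a, b, 1000)` using `np.allclose`. Because each work-item only reads the local slot it wrote itself, this sketch needs no barrier.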


Execution time of test without OpenCL:  10.1647880077 s
===============================================================
Platform name: NVIDIA
Platform profile: FULL_PROFILE
Platform vendor: NVIDIA Corporation
Platform version: OpenCL 1.0
---------------------------------------------------------------
Device name: GeForce 8600 GT
Device type: GPU
Device memory:  255 MB
Device max clock speed: 1188 MHz
Device compute units: 4
Execution time of test: 9.9648e-05 s
Results OK



Andreas Klöckner wrote:
On Thursday 17 September 2009, Lyndon Whaite wrote:
Thanks Andreas. No, I don't think so. I was using a kernel very similar
to the benchmark_all example.

That's a contradiction--benchmark-all is purely memory-bound. :)

Andreas
------------------------------------------------------------------------

_______________________________________________
PyOpenCL mailing list
[email protected]
http://tiker.net/mailman/listinfo/pyopencl_tiker.net
