On Tue, 20 Sep 2011 17:19:50 -0500, Robert L Cloud <[email protected]> wrote:
> However, even for small domains, where most of everything should fit into
> cache, my program is far slower than an OpenMP program.

Just one more suggestion from my side: try to do more work per work item. The AMD implementation may have a fairly high setup cost for each work item, so having fewer (larger) work items is likely to be beneficial. In my experience, the AMD implementation gives performance about as good as gcc, while Intel can be significantly better, depending on what you're trying to do.

Andreas
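
To make the "more per work item" suggestion concrete, here is a minimal sketch, not taken from the original program. The kernel name scale_chunked, the scaling operation, and the per_item chunk size are all hypothetical; the point is only that each work item loops over a contiguous chunk instead of touching a single element, so fewer work items are launched.

```python
# Hypothetical kernel: each work item processes `per_item` consecutive
# elements, so the launch needs roughly n / per_item work items, not n.
KERNEL_SRC = """
__kernel void scale_chunked(__global float *a, const float factor,
                            const int n, const int per_item)
{
    int start = get_global_id(0) * per_item;
    int end = min(start + per_item, n);   /* last chunk may be short */
    for (int i = start; i < end; ++i)
        a[i] *= factor;
}
"""


def num_work_items(n, per_item):
    """Number of work items needed so chunks of per_item cover n elements."""
    return (n + per_item - 1) // per_item


def chunk_bounds(gid, n, per_item):
    """Host-side mirror of the kernel's index arithmetic, for checking."""
    start = gid * per_item
    return start, min(start + per_item, n)


def run_on_device(n=1 << 20, per_item=64, factor=2.0):
    """Launch the kernel; needs numpy, pyopencl and an OpenCL device."""
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a = np.ones(n, dtype=np.float32)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

    prg = cl.Program(ctx, KERNEL_SRC).build()
    # Global size is the number of chunks, not the number of elements.
    prg.scale_chunked(queue, (num_work_items(n, per_item),), None,
                      a_buf, np.float32(factor),
                      np.int32(n), np.int32(per_item))
    cl.enqueue_copy(queue, a, a_buf)
    return a
```

Timing run_on_device for a few values of per_item (say 1, 16, 256) should show whether per-work-item overhead is what dominates on a given implementation. The contiguous chunk layout is an assumption that suits a CPU device; other devices may prefer a different access pattern.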
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
