Aron, I realize that this list is primarily for GPU computing, PyOpenCL being a descendant of PyCUDA, and I am aware of Vasiliy's work on optimizing matrix operations and know that he is very well respected in the GPGPU community. I have done some GPU programming in the past, but in my current work environment we are using traditional (funny how Beowulf machines are now considered traditional) MPI clusters with little to no shared-memory programming. I'm trying to make the argument that we should change this and leverage heterogeneous environments for our codes.
Right now I am primarily concerned with making the argument that OpenCL is a viable option for numerical computing on shared-memory multicore computers, but the results I am seeing do not support this. This may be due (and probably is) to my inadequacy as a programmer, or it may be that the current CPU implementations of OpenCL do not utilize the full resources of the machine. Perhaps there is a better forum for discussing OpenCL on the CPU, but it is still immature and I thought I might find some insight on this list.

I know that 'top' is not a means of measuring the efficiency of a numerical algorithm, but my understanding is that it can measure the occupancy of the CPU. It was troubling to see that the plain Python process was utilizing nearly 800% of the processors (on my 8-core Xeon), whereas the PyOpenCL process was only utilizing 400-500%, with no other significant CPU programs running concurrently. I used the same program semantics on both the CPU and GPU.

It could be (but I doubt it) that the poor performance is a result of creating memory buffers rather than directly using host-allocated memory. However, the copies are only performed every several hundred iterations and should not be the bottleneck of the program.

I have only looked at domains of up to 1000 x 1000. For smaller domains, up to 500 x 500, my OpenCL program is faster than a sequential Cython implementation, probably because most of the matrix fits in cache. At larger sizes, though, the OpenCL runtimes explode.

I've read two books on OpenCL (Gaster's and Munshi's), and at least Gaster's states that for the CPU you should let the OpenCL driver choose the appropriate work-group size. However, I am wondering whether, if I am getting many cache misses, it would be beneficial to break up the work groups myself. Even for small domains, where nearly everything should fit in cache, my program is far slower than an OpenMP program.
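To make the work-group question concrete, here is a minimal pure-Python sketch (not my actual kernel) of the blocked traversal I have in mind when I talk about breaking up the work groups myself: a Jacobi-style sweep visited tile by tile so each tile's data can stay cache-resident. The tile edge of 64 is just an illustrative guess, not a tuned value; since both versions read only from the source grid, any traversal order gives identical results.

```python
TILE = 64  # illustrative tile edge; a real choice would be tuned to the cache


def sweep_naive(src, n):
    """One Jacobi-style averaging sweep over the n x n interior, row by row."""
    dst = [row[:] for row in src]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j] +
                                src[i][j - 1] + src[i][j + 1])
    return dst


def sweep_tiled(src, n, tile=TILE):
    """The same sweep, but visiting the interior in tile-sized blocks."""
    dst = [row[:] for row in src]
    for bi in range(1, n - 1, tile):
        for bj in range(1, n - 1, tile):
            for i in range(bi, min(bi + tile, n - 1)):
                for j in range(bj, min(bj + tile, n - 1)):
                    dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j] +
                                        src[i][j - 1] + src[i][j + 1])
    return dst


n = 130
grid = [[float((i * n + j) % 7) for j in range(n)] for i in range(n)]
same = sweep_naive(grid, n) == sweep_tiled(grid, n)
print("tiled order matches row order:", same)  # prints: tiled order matches row order: True
```

In pure Python the blocking will not pay off, of course; the point is only the traversal pattern I would be imposing on the OpenCL work groups instead of letting the driver pick.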
Anyway, you will have to pardon any stupidity or unsophistication on my part, as I am from Alabama :)

--
Robert L Cloud
Student, School of Engineering
The University of Alabama at Birmingham
http://www.robertlouiscloud.com
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
