Hi David,

> A point we are trying to drive at, but haven't stated outright, is
> that the vast majority of bottlenecks in CUDA code are due to
> inefficient memory access. High-performance CUDA code works very hard
> to limit certain kinds of memory access, but you seem to think these
> operations will be cheap. This is why Eli and I keep suggesting you
> use the CPU to do the calculation.
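The transfer cost is easy to measure directly. Here is a minimal PyCUDA sketch of that kind of comparison (the array size is an arbitrary choice and nothing is tuned): it times the host-to-device copy against a trivial on-device multiply of the same data, using CUDA events so Python overhead stays out of the numbers.

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    import pycuda.gpuarray as gpuarray

    n = 4 * 1024 * 1024
    host = np.random.randn(n).astype(np.float32)

    start, end = cuda.Event(), cuda.Event()

    # Time the PCIe transfer (this also includes the device allocation).
    start.record()
    dev = gpuarray.to_gpu(host)
    end.record()
    end.synchronize()
    print("h2d copy:     %.3f ms" % start.time_till(end))

    # Time a simple elementwise operation on the same data, on-device.
    start.record()
    dev2 = 2.0 * dev
    end.record()
    end.synchronize()
    print("gpu multiply: %.3f ms" % start.time_till(end))

(The multiply pays one-time kernel-compilation overhead on its first use, so only a repeat run gives a fair number.)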
Right, I do realize it's sometimes better to do the work on the host than to move the data to the device and do it there. By memory access, do you mean moving data between the host and the device, or access within the device itself? I have noticed, especially when testing on an NVIDIA Tesla C1060 system (8 Xeon cores on the host; the GPU has 240 streaming-processor cores across 30 multiprocessors), that unless a matrix multiplication, to take one example, is around 2048 x 2048 or larger, the host effortlessly wins the speed competition.

> It seems like you decided to work on this as a toy problem to learn
> CUDA. If that's the case, you will be better served by looking at some
> of the examples in the CUDA SDK instead. I'm sure people on this list
> could give even better suggestions if you asked.

Alright, thanks a lot.

Best regards,
./francis
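P.S. For concreteness, here is a rough sketch of the kind of comparison I have in mind (not my exact script): a naive global-memory matmul kernel built with SourceModule, timed against numpy.dot, with the host-device transfers charged against the GPU. The kernel, block size, and matrix sizes are all arbitrary illustration.

    import time
    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    # Naive matmul: one thread per output element, every operand read
    # straight from global memory (no shared-memory tiling).
    mod = SourceModule("""
    __global__ void matmul(const float *a, const float *b, float *c, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += a[row * n + k] * b[k * n + col];
            c[row * n + col] = acc;
        }
    }
    """)
    matmul = mod.get_function("matmul")

    for n in (256, 1024, 2048):
        a = np.random.randn(n, n).astype(np.float32)
        b = np.random.randn(n, n).astype(np.float32)

        t0 = time.time()
        np.dot(a, b)                            # host-side reference
        t_host = time.time() - t0

        t0 = time.time()
        a_gpu = gpuarray.to_gpu(a)              # transfers count toward
        b_gpu = gpuarray.to_gpu(b)              # the GPU's total
        c_gpu = gpuarray.empty((n, n), np.float32)
        block = (16, 16, 1)
        grid = ((n + 15) // 16, (n + 15) // 16)
        matmul(a_gpu.gpudata, b_gpu.gpudata, c_gpu.gpudata, np.int32(n),
               block=block, grid=grid)
        c_gpu.get()                             # ...as does the copy back
        t_gpu = time.time() - t0

        print("n=%4d  host %.3f s   gpu incl. transfers %.3f s"
              % (n, t_host, t_gpu))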