Hi David,

A point we have been trying to drive at, but haven't stated outright, is that
the vast majority of bottlenecks in CUDA code are due to inefficient memory
access. High-performance CUDA code works very hard to limit certain kinds of
memory access, but you seem to assume these operations will be cheap. This is
why Eli and I keep suggesting you use the CPU for this calculation.
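
To make that concrete, here is a small PyCUDA sketch (the array size is
picked arbitrarily) that times the host-device round trip separately from
arithmetic on data already resident in device memory; the PCIe copy will
typically dwarf the elementwise work:

    import numpy as np
    import pycuda.autoinit                 # sets up a CUDA context
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    n = 1 << 22                            # ~4M floats, ~16 MB (arbitrary)
    a = np.random.randn(n).astype(np.float32)

    start, end = drv.Event(), drv.Event()

    # Cost 1: host -> device -> host round trip, no computation at all.
    start.record()
    a_gpu = gpuarray.to_gpu(a)
    a_gpu.get()
    end.record()
    end.synchronize()
    print("PCIe round trip: %.2f ms" % start.time_till(end))

    # Cost 2: arithmetic on data already sitting in device memory.
    start.record()
    b_gpu = 2 * a_gpu + 1                  # elementwise kernel, device only
    end.record()
    end.synchronize()
    print("on-device work:  %.2f ms" % start.time_till(end))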

Right, I do realize it's sometimes better to do things on the host than to
move the data and do the task on the device. By memory access, do you mean
moving data between host and device, or access within the device itself? I
have noticed, especially when testing on a machine with 8 Xeon cores and an
NVIDIA Tesla C1060 (240 stream processors across 30 SMs), that unless the
matrices in a matrix multiplication, to take one example, are around
2k x 2k or larger, the host effortlessly wins the speed competition.
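
For what it's worth, a crude way to see that crossover is something like the
sketch below. The naive kernel is purely illustrative, not how one would
write a production GPU matmul, and the GPU timing deliberately includes both
transfers:

    import time
    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void matmul(const float *a, const float *b, float *c, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += a[row * n + k] * b[k * n + col];
            c[row * n + col] = acc;
        }
    }
    """)
    matmul = mod.get_function("matmul")

    for n in (256, 1024, 2048):
        a = np.random.randn(n, n).astype(np.float32)
        b = np.random.randn(n, n).astype(np.float32)

        t0 = time.time()
        np.dot(a, b)                       # host-side BLAS
        cpu_s = time.time() - t0

        t0 = time.time()
        a_gpu = gpuarray.to_gpu(a)         # GPU time includes both transfers
        b_gpu = gpuarray.to_gpu(b)
        c_gpu = gpuarray.empty((n, n), np.float32)
        matmul(a_gpu.gpudata, b_gpu.gpudata, c_gpu.gpudata, np.int32(n),
               block=(16, 16, 1), grid=((n + 15) // 16, (n + 15) // 16))
        c_gpu.get()                        # device -> host; also synchronizes
        gpu_s = time.time() - t0

        print("n=%4d  cpu %.3f s  gpu %.3f s (incl. transfers)"
              % (n, cpu_s, gpu_s))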

It seems like you picked this as a toy problem to learn CUDA. If that's the
case, you would be better served by looking at some of the examples in the
CUDA SDK instead. I'm sure people on this list could offer even better
suggestions if you asked.
Alright, thanks a lot.

Best regards,

./francis