Dear Michiel,
On a Windows machine (CUDA 4.1) with a consumer-level GT 330, the timings
were about the same without any tuning.
On Linux (CUDA 4.0) with a Tesla card I could reproduce the difference
you report. Judging from the PTX files mentioned previously, it seems to
originate in an "-arch=sm_20" switch.
If I add that switch (-arch sm_20) to the nvcc command line, the nvcc
build slows down to match the PyCUDA performance.
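A minimal sketch of the flag difference described above, assuming that
PyCUDA's compiler appends its `arch` and `code` arguments to nvcc as
`-arch`/`-code` and, when no `arch` is given, derives `sm_XY` from the
active device's compute capability (which would be consistent with the
default PyCUDA build behaving like `nvcc -arch sm_20` here). The helper
`nvcc_arch_flags` is hypothetical and only mirrors that behaviour; it
does not call PyCUDA or nvcc.

```python
from typing import List, Optional, Tuple


def nvcc_arch_flags(arch: Optional[str] = None,
                    code: Optional[str] = None,
                    compute_capability: Optional[Tuple[int, int]] = None) -> List[str]:
    """Hypothetical helper: build the -arch/-code part of an nvcc command line.

    Assumption: like PyCUDA's compiler, fall back to sm_XY derived from the
    device's compute capability when no explicit arch is requested.
    """
    if arch is None and compute_capability is not None:
        arch = "sm_%d%d" % compute_capability
    flags: List[str] = []
    if arch is not None:
        flags.extend(["-arch", arch])
    if code is not None:
        flags.extend(["-code", code])
    return flags


# Default build on a compute capability 2.0 device: same as `nvcc -arch sm_20`.
print(nvcc_arch_flags(compute_capability=(2, 0)))  # prints ['-arch', 'sm_20']
# Explicit workaround: PTX for the compute_10 virtual arch, binary for sm_20.
print(nvcc_arch_flags(arch="compute_10", code="sm_20"))
# prints ['-arch', 'compute_10', '-code', 'sm_20']
```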
To get PyCUDA to match the nvcc speed, I had to pass extra arguments to
SourceModule: arch='compute_10', code='sm_20'.
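A sketch of that SourceModule workaround, using a simple single-precision
elementwise kernel as a stand-in for the attached example programs (which
are not reproduced here). `KERNEL_SOURCE`, `apply_erf`, `build_apply_erf`
and `demo` are hypothetical names; `arch` and `code` are SourceModule
keyword arguments that PyCUDA passes on to nvcc.

```python
# Hypothetical stand-in kernel: applies erf elementwise in single precision.
KERNEL_SOURCE = r"""
__global__ void apply_erf(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = erf(in[i]);
}
"""


def build_apply_erf(arch="compute_10", code="sm_20"):
    """Compile KERNEL_SOURCE with explicit nvcc -arch/-code targets."""
    # PyCUDA is imported lazily so this file can be loaded without a GPU.
    import pycuda.autoinit  # noqa: F401  (creates a context on the default device)
    from pycuda.compiler import SourceModule

    mod = SourceModule(KERNEL_SOURCE, arch=arch, code=code)
    return mod.get_function("apply_erf")


def demo(n=1 << 20):
    """Run the kernel once on n evenly spaced inputs and return the result."""
    import numpy as np
    import pycuda.driver as drv

    x = np.linspace(-3.0, 3.0, n).astype(np.float32)
    y = np.empty_like(x)
    apply_erf = build_apply_erf()
    apply_erf(drv.Out(y), drv.In(x), np.int32(n),
              block=(256, 1, 1), grid=((n + 255) // 256, 1))
    return y
```

Passing `arch=None, code=None` to `build_apply_erf` should fall back to
PyCUDA's default target, i.e. the slower configuration reported above.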
Looking at the PTX output, the compiler seems to be doing something odd
with double versus single precision. Surprisingly, it gets even slower
for me if I ask for erff instead of erf.
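Since the slowdown appears tied to single versus double precision, it
can help to build both variants of the same kernel and compare them. The
sketch below, with the hypothetical names `erf_kernel_source` and
`_KERNEL_TEMPLATE`, generates the CUDA source for either precision, using
`erff` for float and `erf` for double as in the C math library naming.

```python
# Hypothetical template: the same elementwise kernel in either precision.
_KERNEL_TEMPLATE = r"""
__global__ void apply_erf(%(t)s *out, const %(t)s *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = %(fn)s(in[i]);
}
"""

_VARIANTS = {
    "single": {"t": "float", "fn": "erff"},  # single-precision erf
    "double": {"t": "double", "fn": "erf"},  # double-precision erf
}


def erf_kernel_source(precision: str) -> str:
    """Return CUDA source for apply_erf in 'single' or 'double' precision."""
    try:
        params = _VARIANTS[precision]
    except KeyError:
        raise ValueError("precision must be 'single' or 'double'") from None
    return _KERNEL_TEMPLATE % params
```

One caveat, to the best of my knowledge: targets below compute_13 (such
as the compute_10 used above) have no double-precision support, and nvcc
demotes double to float with a warning, so the double variant only stays
double when compiled for compute_13 or later (e.g. sm_20). Comparing the
results as well as the timings makes such a demotion visible.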
Which NVIDIA SDK, driver and card are you using? The difference seems to
come down to a slightly different default nvcc invocation. Perhaps you
will need to tune the compiler options according to the precision your
problem requires.
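When tuning the options for the precision a problem needs, a small timing
helper keeps the comparisons repeatable. This is a minimal sketch assuming
an active PyCUDA context and kernel arguments that already live on the
device (for example allocations from `drv.mem_alloc`), so that host
transfers are not included in the measurement. `time_kernel` is a
hypothetical name; `Event.record`, `Event.synchronize` and
`Event.time_till` are PyCUDA driver calls.

```python
def time_kernel(func, args, block, grid, repeats=20):
    """Return the mean runtime of one launch of `func`, in milliseconds.

    Assumes a PyCUDA context is already active and that `args` are device
    buffers or numpy scalars, so no host<->device copies are timed.
    """
    import pycuda.driver as drv

    func(*args, block=block, grid=grid)  # warm-up launch, excluded from timing

    start, end = drv.Event(), drv.Event()
    start.record()
    for _ in range(repeats):
        func(*args, block=block, grid=grid)
    end.record()
    end.synchronize()  # wait until all timed launches have finished
    return start.time_till(end) / repeats
```

Calling it once per combination of precision and arch/code settings gives
directly comparable numbers for the configurations discussed here.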
Best,
Jon
On 10/04/2012 14:37, Michiel Bruinink wrote:
Hello,
When you use the erf function with PyCUDA, execution takes almost twice
as long as with nvcc. The exp function takes about 30% longer.
I use these functions a lot in my PyCUDA program, and there they make
the whole program 3x slower than when using nvcc.
I attached two small example programs that can be run straight away.
Regards,
Michiel.
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda