Dear Michiel,

So, going back to your original question about the timing difference between the C code and the pycuda code: the compiler is the same, but the compilation was apparently not identical. I am only guessing, but I wonder if the difference you are seeing is related to floats versus doubles. In the first output you have:

> .target sm_10, map_f64_to_f32

In the second:

> // Target:ptx, ISA:sm_20, Endian:little, Pointer Size:64

Are you using plain Python floats (which are doubles) or numpy float32 types from python/pycuda?
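
To make sure I am being clear: a bare Python float is 64-bit, so unless you wrap your scalars and arrays in numpy float32 the kernel gets fed doubles. A minimal sketch of what I mean, using a throwaway kernel (not your GaussFit code) and assuming the usual SourceModule interface:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Throwaway kernel, just to show the argument types.
mod = SourceModule("""
__global__ void scale(float *x, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= a;
}
""")  # SourceModule also takes an arch="sm_20" keyword if you want to pin the target

scale = mod.get_function("scale")

x = np.arange(256, dtype=np.float32)   # float32, not the numpy default float64
scale(drv.InOut(x), np.float32(2.0),   # a bare Python 2.0 would go in as a double
      block=(256, 1, 1), grid=(1, 1))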

There is also a tool called "decuda" which might take you from the cubins back to an assembly listing, to hunt for differences there.

Otherwise, to track this down further, you could build pycuda with API tracing enabled to see which CUDA calls are being made. You could also try running your code under the CUDA profiler to see whether the extra time is Python overhead or actual kernel execution time.
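
Even before reaching for the profiler, pycuda's CUDA events give a quick split between kernel time and the total wall-clock time seen from Python. A rough sketch, with a do-nothing kernel standing in for yours:

import time
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Stand-in kernel, just to have something to time.
mod = SourceModule("""
__global__ void busy(float *x) { x[threadIdx.x] += 1.0f; }
""")
busy = mod.get_function("busy")
x_gpu = drv.mem_alloc(256 * np.float32().nbytes)

start, stop = drv.Event(), drv.Event()

t0 = time.time()
start.record()
busy(x_gpu, block=(256, 1, 1), grid=(1, 1))
stop.record()
stop.synchronize()
t1 = time.time()

print("kernel time from GPU events : %.2f ms" % start.time_till(stop))
print("wall-clock time from Python : %.2f ms" % ((t1 - t0) * 1e3))

If the two numbers are close, the kernel itself really is slower; if the event time matches your C build, the overhead is on the Python side.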

HTH,

Jon


On 06/04/2012 14:39, Michiel Bruinink wrote:
I did what you said and I have a file GaussFit2D.ptx from nvcc and a
file kernel.ptx from pyCuda.
GaussFit2D.ptx is 88 kB and kernel.ptx is 546 kB. I will show the top
sections of the files:
GaussFit2D.ptx:
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with /usr/local/cuda/open64/lib//be
// nvopencc 3.2 built on 2010-11-03
//-----------------------------------------------------------
// Compiling /tmp/tmpxft_000042fc_00000000-9_GaussFit2D_M2.cpp3.i (/tmp/ccBI#.fcgE6j)
//-----------------------------------------------------------
//-----------------------------------------------------------
// Options:
//-----------------------------------------------------------
// Target:ptx, ISA:sm_10, Endian:little, Pointer Size:64
// -O3 (Optimization level)
// -g0 (Debug level)
// -m2 (Report advisories)
//-----------------------------------------------------------
.file 1 "<command-line>"
.file 2 "/tmp/tmpxft_000042fc_00000000-8_GaussFit2D_M2.cudafe2.gpu"
.file 3 "/usr/lib/gcc/x86_64-linux-gnu/4.3.5/include/stddef.h"
.file 4 "/usr/local/cuda/bin/../include/crt/device_runtime.h"
.file 5 "/usr/local/cuda/bin/../include/host_defines.h"
.file 6 "/usr/local/cuda/bin/../include/builtin_types.h"
.file 7 "/usr/local/cuda/bin/../include/device_types.h"
.file 8 "/usr/local/cuda/bin/../include/driver_types.h"
.file 9 "/usr/local/cuda/bin/../include/surface_types.h"
.file 10 "/usr/local/cuda/bin/../include/texture_types.h"
.file 11 "/usr/local/cuda/bin/../include/vector_types.h"
.file 12 "/usr/local/cuda/bin/../include/device_launch_parameters.h"
.file 13 "/usr/local/cuda/bin/../include/crt/storage_class.h"
.file 14 "/usr/include/bits/types.h"
.file 15 "/usr/include/time.h"
.file 16 "GaussFit2D_M2.cu"
.file 17 "/usr/local/cuda/bin/../include/common_functions.h"
.file 18 "/usr/local/cuda/bin/../include/math_functions.h"
.file 19 "/usr/local/cuda/bin/../include/math_constants.h"
.file 20 "/usr/local/cuda/bin/../include/device_functions.h"
.file 21 "/usr/local/cuda/bin/../include/sm_11_atomic_functions.h"
.file 22 "/usr/local/cuda/bin/../include/sm_12_atomic_functions.h"
.file 23 "/usr/local/cuda/bin/../include/sm_13_double_functions.h"
.file 24 "/usr/local/cuda/bin/../include/sm_20_atomic_functions.h"
.file 25 "/usr/local/cuda/bin/../include/sm_20_intrinsics.h"
.file 26 "/usr/local/cuda/bin/../include/surface_functions.h"
.file 27 "/usr/local/cuda/bin/../include/texture_fetch_functions.h"
.file 28 "/usr/local/cuda/bin/../include/math_functions_dbl_ptx1.h"

.entry _Z6doFitsPfS_S_S_S_Pi (
.param .u64 __cudaparm__Z6doFitsPfS_S_S_S_Pi_p,
...
...
kernel.ptx:
.version 2.2
.target sm_20
// compiled with /usr/local/cuda/open64/lib//be
// nvopencc 3.2 built on 2010-11-03
//-----------------------------------------------------------
// Compiling kernel.cpp3.i (/tmp/ccBI#.sl67va)
//-----------------------------------------------------------
//-----------------------------------------------------------
// Options:
//-----------------------------------------------------------
// Target:ptx, ISA:sm_20, Endian:little, Pointer Size:64
// -O3 (Optimization level)
// -g0 (Debug level)
// -m2 (Report advisories)
//-----------------------------------------------------------
.file 1 "<command-line>"
.file 2 "kernel.cudafe2.gpu"
.file 3 "/usr/lib/gcc/x86_64-linux-gnu/4.3.5/include/stddef.h"
.file 4 "/usr/local/cuda/bin/../include/crt/device_runtime.h"
.file 5 "/usr/local/cuda/bin/../include/host_defines.h"
.file 6 "/usr/local/cuda/bin/../include/builtin_types.h"
.file 7 "/usr/local/cuda/bin/../include/device_types.h"
.file 8 "/usr/local/cuda/bin/../include/driver_types.h"
.file 9 "/usr/local/cuda/bin/../include/surface_types.h"
.file 10 "/usr/local/cuda/bin/../include/texture_types.h"
.file 11 "/usr/local/cuda/bin/../include/vector_types.h"
.file 12 "/usr/local/cuda/bin/../include/device_launch_parameters.h"
.file 13 "/usr/local/cuda/bin/../include/crt/storage_class.h"
.file 14 "/usr/include/bits/types.h"
.file 15 "/usr/include/time.h"
.file 16 "kernel.cu"
.file 17 "/usr/local/cuda/bin/../include/common_functions.h"
.file 18 "/usr/local/cuda/bin/../include/math_functions.h"
.file 19 "/usr/local/cuda/bin/../include/math_constants.h"
.file 20 "/usr/local/cuda/bin/../include/device_functions.h"
.file 21 "/usr/local/cuda/bin/../include/sm_11_atomic_functions.h"
.file 22 "/usr/local/cuda/bin/../include/sm_12_atomic_functions.h"
.file 23 "/usr/local/cuda/bin/../include/sm_13_double_functions.h"
.file 24 "/usr/local/cuda/bin/../include/sm_20_atomic_functions.h"
.file 25 "/usr/local/cuda/bin/../include/sm_20_intrinsics.h"
.file 26 "/usr/local/cuda/bin/../include/surface_functions.h"
.file 27 "/usr/local/cuda/bin/../include/texture_fetch_functions.h"
.file 28 "/usr/local/cuda/bin/../include/math_functions_dbl_ptx3.h"

.visible .func gaussj (.param .u64 __cudaparmf1_gaussj, .param .u64 __cudaparmf2_gaussj)
{
.reg .u32 %r<55>;
...
...
What does this mean?
Should they have been identical?
Michiel.

 >>> Jonathan WRIGHT <wri...@esrf.fr> 4/4/2012 3:01 PM >>>
Hello,

If you compile with the keep=True option you should find the ptx file
generated by the compiler, eg:

In [127]: mod = SourceModule(s, keep=True )
*** compiler output in c:\users\wright\appdata\local\temp\tmpvzledt

Over in that folder I find "kernel.ptx", which contains the details of the nvcc
compiler and options used, plus the assembler output. If you compile your
C-based kernel using nvcc with the -ptx option, you should be able to diff the
two outputs.
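
If driving the comparison from Python is easier, something along these lines prints just the differing lines; "kernel.ptx" is the pycuda one, and the other name is a placeholder for whatever your nvcc -ptx run produces:

import sys
import difflib

# File names are placeholders -- use whatever your two compiles produced.
with open("kernel.ptx") as f1, open("GaussFit2D.ptx") as f2:
    pycuda_ptx, nvcc_ptx = f1.readlines(), f2.readlines()

# n=0 suppresses context lines, so only real differences are shown; the
# .version/.target/"Options" lines near the top are the first thing to check.
for line in difflib.unified_diff(nvcc_ptx, pycuda_ptx,
                                 fromfile="nvcc", tofile="pycuda", n=0):
    sys.stdout.write(line)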

If the ptx files match and the timing still does not, then you might want to
try configuring pycuda with --cuda-trace as another way to track down the
differences.

Cheers

Jon

On 04/04/2012 10:39, Michiel Bruinink wrote:
 > Hello,
 > I have written a Cuda program that calculates lots of Gauss fits. When I
 > use that same program with PyCuda, the time it takes to do the
 > calculations is almost 3x the time it takes with nvcc.
 > With nvcc it takes 380 ms and with PyCuda it takes 1110 ms, while the
 > outcome of the calculations is the same.
 > There is no difference in the device code, because I use the same file
 > for the device code in both cases.
 > How is this possible?
 > Does anybody have an idea?
 > I am not sure, but could it have something to do with array declarations
 > inside a device function?
 > # define lenP 6
 > # define nPoints 100000
 > ...
 > __device__ void someFunction()
 > {
 > float residu[nPoints], newResidu[nPoints], pNew[lenP], b[lenP], deltaP[lenP];
 > float A[lenP*lenP], Jacobian[nPoints*lenP], B[lenP*lenP];
 > ...
 > }
 > Thanks,
 > Michiel.
 >
 >
