Hi Andreas,

On Jul 19, 2013, at 8:13 AM, Andreas Kloeckner <li...@coyote.tiker.net> wrote:

> I'm not in principle opposed to including such a thing. But I do have
> one question: Have you measured that this is really a
> performance-limiting issue for you?

Yes.  Here is a sample profile from a loop that repeats an FFT and then a kernel 
application (apply_K and apply_V), without my PreparedElementwiseKernel:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   320        40           54      1.4      0.0          for _n in xrange(steps-1):
   321        39         2648     67.9      0.8              fft_plan.forward(psi_, out=psi_)
   322        39       102632   2631.6     32.0              apply_K(psi_pycu)
   323        39         2689     68.9      0.8              fft_plan.inverse(psi_, out=psi_)
   324        39       102356   2624.5     31.9              apply_V(psi_pycu)
   325        39        95372   2445.4     29.8              alpha = math.sqrt(Ntot/self.get_N_cu(psi_1D, blas))
   326        39         2609     66.9      0.8              blas.scal(alpha=alpha, x=psi_1D)


And with the PreparedElementwiseKernel:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   320        40           53      1.3      0.0          for _n in xrange(steps-1):
   321        39         2611     66.9      0.9              fft_plan.forward(psi_, out=psi_)
   322        39         1160     29.7      0.4              apply_K(psi_pycu)
   323        39         2248     57.6      0.8              fft_plan.inverse(psi_, out=psi_)
   324        39         1160     29.7      0.4              apply_V(psi_pycu)
   325        39        83660   2145.1     28.5              alpha = math.sqrt(Ntot/self.get_N_cu(psi_1D, blas))
   326        39         2487     63.8      0.8              blas.scal(alpha=alpha, x=psi_1D)

As you can see, the kernel applications only become comparable in cost to the 
FFTs after preparing.
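
For context, apply_K and apply_V go through ordinary 
pycuda.elementwise.ElementwiseKernel calls in the first profile.  The snippet 
below is only a hypothetical stand-in (the real kernel bodies aren't shown in 
this message), just to make the calling pattern concrete:

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.elementwise import ElementwiseKernel

    # Hypothetical stand-in for apply_K -- the real kernel body is not
    # shown here; a trivial scaling just illustrates the calling pattern.
    apply_K = ElementwiseKernel(
        "double *psi",
        "psi[i] = 0.5 * psi[i]",
        "apply_k")

    psi_pycu = gpuarray.to_gpu(np.ones(1024, dtype=np.float64))
    apply_K(psi_pycu)   # each call runs ElementwiseKernel.__call__ in full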

Here is the profile of the slow __call__.  Nearly all of the time is spent in 
generate_stride_kernel_and_types:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   192                                               def __call__(self, *args, **kwargs):
   193        78          145      1.9      0.1          vectors = []
   ...
   204        78          104      1.3      0.1          func, arguments = self.generate_stride_kernel_and_types(
   205        78       199968   2563.7     97.3                  range_ is not None or slice_ is not None)
   206
   207       156          354      2.3      0.2          for arg, arg_descr in zip(args, arguments):
   ...
   241
   242        78         2780     35.6      1.4          func.prepared_async_call(grid, block, stream, *invocation_args)
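
To illustrate, here is a minimal sketch of the kind of caching that "preparing" 
buys.  This is not my actual PreparedElementwiseKernel, just an illustration; it 
assumes generate_stride_kernel_and_types depends only on the boolean argument 
visible in the profile above, so its result can be reused across calls:

    from pycuda.elementwise import ElementwiseKernel

    class CachingElementwiseKernel(ElementwiseKernel):
        # Sketch only: cache the generated (func, arguments) pair so that
        # repeated calls skip the code generation that accounts for 97.3%
        # of __call__ above.  Assumes the result depends only on the
        # range_given flag.
        def generate_stride_kernel_and_types(self, range_given):
            cache = getattr(self, "_kernel_cache", None)
            if cache is None:
                cache = self._kernel_cache = {}
            if range_given not in cache:
                cache[range_given] = super(CachingElementwiseKernel, self) \
                    .generate_stride_kernel_and_types(range_given)
            return cache[range_given]

    # Usage is identical to ElementwiseKernel; only the first call pays
    # for code generation (hypothetical kernel body, as above).
    apply_K = CachingElementwiseKernel(
        "double *psi", "psi[i] = 0.5 * psi[i]", "apply_k")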

Michael.