Hi Andreas,

On Jul 19, 2013, at 8:13 AM, Andreas Kloeckner <li...@coyote.tiker.net> wrote:
> I'm not in principle opposed to including such a thing. But I do have
> one question: Have you measured that this is really a
> performance-limiting issue for you?

Yes. Here is a sample profile from a loop that repeats an FFT and then a kernel application (apply_K and apply_V), without my PreparedElementwiseKernel:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   320        40           54      1.4      0.0  for _n in xrange(steps-1):
   321        39         2648     67.9      0.8      fft_plan.forward(psi_, out=psi_)
   322        39       102632   2631.6     32.0      apply_K(psi_pycu)
   323        39         2689     68.9      0.8      fft_plan.inverse(psi_, out=psi_)
   324        39       102356   2624.5     31.9      apply_V(psi_pycu)
   325        39        95372   2445.4     29.8      alpha = math.sqrt(Ntot/self.get_N_cu(psi_1D, blas))
   326        39         2609     66.9      0.8      blas.scal(alpha=alpha, x=psi_1D)

And with the PreparedElementwiseKernel:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   320        40           53      1.3      0.0  for _n in xrange(steps-1):
   321        39         2611     66.9      0.9      fft_plan.forward(psi_, out=psi_)
   322        39         1160     29.7      0.4      apply_K(psi_pycu)
   323        39         2248     57.6      0.8      fft_plan.inverse(psi_, out=psi_)
   324        39         1160     29.7      0.4      apply_V(psi_pycu)
   325        39        83660   2145.1     28.5      alpha = math.sqrt(Ntot/self.get_N_cu(psi_1D, blas))
   326        39         2487     63.8      0.8      blas.scal(alpha=alpha, x=psi_1D)

As you can see, the kernel applications only become comparable in cost to the FFT once the kernel is prepared.

Here is the profile of the slow __call__. Nearly all of the time (97.3%) is spent in generate_stride_kernel_and_types:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   192                                           def __call__(self, *args, **kwargs):
   193        78          145      1.9      0.1      vectors = []
...
   204        78          104      1.3      0.1      func, arguments = self.generate_stride_kernel_and_types(
   205        78       199968   2563.7     97.3              range_ is not None or slice_ is not None)
   206
   207       156          354      2.3      0.2      for arg, arg_descr in zip(args, arguments):
...
   241
   242        78         2780     35.6      1.4      func.prepared_async_call(grid, block, stream, *invocation_args)

Michael.
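
For reference, the idea of paying the setup cost once instead of on every call can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the actual PreparedElementwiseKernel code: the class names SlowKernel and PreparedKernel, the generate method, and the use_range flag are all hypothetical stand-ins, and no PyCUDA calls are involved.

```python
class SlowKernel(object):
    """Hypothetical stand-in: redoes the expensive setup on every call."""

    def __init__(self):
        self.generate_calls = 0

    def generate(self, use_range):
        # Stand-in for an expensive step such as generating the kernel
        # and its argument types; here it only counts invocations.
        self.generate_calls += 1
        return lambda xs: [2 * x for x in xs]

    def __call__(self, xs, use_range=False):
        func = self.generate(use_range)  # repeated on every call
        return func(xs)


class PreparedKernel(object):
    """Hypothetical wrapper: runs the setup once per configuration."""

    def __init__(self, kernel):
        self.kernel = kernel
        self._cache = {}  # maps use_range -> generated function

    def __call__(self, xs, use_range=False):
        func = self._cache.get(use_range)
        if func is None:
            # First call with this configuration: generate and remember.
            func = self.kernel.generate(use_range)
            self._cache[use_range] = func
        return func(xs)
```

Calling a SlowKernel instance 39 times in a loop runs generate 39 times, whereas wrapping it in PreparedKernel runs generate once and reuses the result for the remaining calls. The trade-off in this sketch is that the cached function is only valid as long as the configuration used as the cache key captures everything the generation step depends on.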

_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda