Michael McNeil Forbes <michael.forbes+pyt...@gmail.com> writes:
> Here is the profile of the slow __call__. All the time is spent in 
> generate_stride_kernel_and_types:
>
> Line #      Hits         Time  Per Hit   % Time  Line Contents
> ==============================================================
>    192                                               def __call__(self, *args, **kwargs):
>    193        78          145      1.9      0.1          vectors = []
>    ...
>    204        78          104      1.3      0.1          func, arguments = self.generate_stride_kernel_and_types(
>    205        78       199968   2563.7     97.3                  range_ is not None or slice_ is not None)
>    206
>    207       156          354      2.3      0.2          for arg, arg_descr in zip(args, arguments):
>    ...
>    241
>    242        78         2780     35.6      1.4          func.prepared_async_call(grid, block, stream, *invocation_args)

Now this is just confusing to me. generate_stride_kernel_and_types has a
@memoize_method decorator, which should take care of caching the built
kernel. Unless you're instantiating a new ElementwiseKernel for each
call, generate_stride_kernel_and_types should only ever get called
once. The default (cached) case should amount to one dictionary lookup,
so I don't see how that would eat up so much time. Can you perhaps
create a small reproducer for this?
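
To illustrate the caching behaviour described above, here is a minimal
sketch in plain Python. The memoize_method below is a simplified stand-in
written for this example, not the decorator PyCUDA actually uses, and
FakeKernel is a hypothetical placeholder for ElementwiseKernel that only
counts how often the "expensive" build step runs. It assumes the cache is
stored on the instance, so a freshly constructed object starts with an
empty cache.

```python
import functools


def memoize_method(method):
    """Simplified per-instance memoizer (illustrative stand-in only)."""
    cache_name = "_memoize_cache_" + method.__name__

    @functools.wraps(method)
    def wrapper(self, *args):
        # The cache lives in the instance's __dict__, so every new
        # instance starts out with nothing cached.
        cache = self.__dict__.setdefault(cache_name, {})
        try:
            return cache[args]          # cached case: one dict lookup
        except KeyError:
            result = cache[args] = method(self, *args)
            return result

    return wrapper


class FakeKernel:
    """Hypothetical stand-in for ElementwiseKernel."""

    builds = 0  # number of times the expensive build step ran

    @memoize_method
    def generate_stride_kernel_and_types(self, use_range):
        FakeKernel.builds += 1          # pretend this compiles a kernel
        return ("kernel", use_range)

    def __call__(self, use_range=False):
        return self.generate_stride_kernel_and_types(use_range)


# Case 1: one instance reused for 78 calls -> a single build.
kernel = FakeKernel()
for _ in range(78):
    kernel()
builds_reused = FakeKernel.builds

# Case 2: a new instance per call -> the cache never gets a hit.
FakeKernel.builds = 0
for _ in range(78):
    FakeKernel()()
builds_fresh = FakeKernel.builds

print(builds_reused, builds_fresh)
```

If the real code follows the second pattern, every __call__ would pay the
full build cost, which would be consistent with a profile like the one
quoted above.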

Thanks,
Andreas


_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
