Ok, thanks!
This sounds reasonable indeed.
Philippe
2014-06-26 23:51 GMT+02:00 Karl Rupp :
> Hi,
>
> the cases 5, 6, and 7 are handled by running a kernel for four vectors,
> then subtract '4' and run a dedicated kernel on the remaining 1, 2, or 3
> vectors. This could also be handled by a gene
Hi,
the cases 5, 6, and 7 are handled by running a kernel for four vectors,
then subtracting 4 and running a dedicated kernel on the remaining 1, 2, or 3
vectors. This could also be handled by a generated kernel, yes, but I
haven't implemented this for two reasons:
1. fewer kernels to compile
2.
I'll add something. I assume that multiple kernels are launched thanks to
current_index. Wouldn't it be better to launch a single kernel? I think
that many users would prefer better performance in exchange for a
slightly longer JIT overhead (since we'll provide a caching mechanism).
Phi
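
For illustration, the chunked dispatch discussed in this thread (a four-vector kernel, then a dedicated kernel for the remaining 1, 2, or 3 vectors) can be sketched as follows. This is a minimal sketch and not the actual multi_inner_prod code: plan_launches is a hypothetical helper, it only records how many vectors each launch would process, and it assumes that four is the largest chunk handled by one kernel. Sizes above seven are not described in the thread and are simply treated by repeating the four-vector step.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of any library): for a tuple of n vectors,
// return how many vectors each successive kernel launch would process.
std::vector<std::size_t> plan_launches(std::size_t n)
{
  std::vector<std::size_t> launches;
  std::size_t current_index = 0;   // number of vectors already processed

  while (current_index < n)
  {
    std::size_t remaining = n - current_index;

    // Assumption: one kernel handles at most four vectors at a time.
    // With 5, 6, or 7 remaining, four are processed now and the
    // remaining 1, 2, or 3 go to a dedicated kernel in the next pass.
    std::size_t chunk = (remaining >= 4) ? 4 : remaining;

    launches.push_back(chunk);
    current_index += chunk;
  }
  return launches;
}
```

Under these assumptions, a tuple of seven vectors results in two launches (four vectors, then three), which is the extra launch that a single generated kernel would avoid.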
Hello!
I note this in the implementation of multi_inner_prod:
  switch (vec_tuple.const_size() - current_index)
  {
    case 7:
    case 6:
    case 5:
    case 4:
      // do stuff
However, there is a test for 5,6,7 so I assume that these h