
> I was in fact wondering why one passed reciprocal_alpha and flip_sign
> into the kernel. After thinking more about it, I have noticed that this
> permits us to do the corresponding inversion/multiplication within the
> kernel, and therefore avoid some latency penalty / kernel launch
> overhead when the scalar is pointed out, that's smart!
> On the other hand, modifying the generator to not actually generate a
> specific kernel would be absurd imho. This brings another question,
> then. How could ambm benefit from the auto-tuning environment?
> I propose the following solution:
> check the size of the matrices/vectors.
> If the computation is dominated by the kernel launch time (say, less
> than 100,000 elements), then we use the current ambm kernel. Otherwise,
> we transfer the scalars to the CPU, perform the corresponding a' = +- OP
> a, b' = +- OP b, and either generate the kernel or use a BLAS library.
> This way, we benefit from kernel launch time optimization for small
> data, and high bandwidth for large data. Does this sound good?

In terms of execution time, this is probably the best solution. On the
other hand, it does not solve the problem of compilation overhead: if we
only dispatch into the generator for large data, we still have to
generate the respective kernels and go through the OpenCL JIT compiler
each time. This compilation overhead is even likely to outweigh
any gains we get from faster execution.
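
To make the trade-off concrete, here is a minimal sketch of the size-based dispatch from the proposal above. All names (Path, Dispatcher, kLaunchBoundThreshold) are hypothetical and not taken from the code base; the threshold is simply the "100,000 elements" guess from the quoted mail, and the counter pessimistically assumes that every dispatch into the generator costs one JIT compilation.

```cpp
#include <cstddef>

// Hypothetical names throughout; this is not the library's dispatch code.
enum class Path { FusedAmbmKernel, GeneratedKernel };

// "Say, less than 100,000 elements" from the proposal; a guess, not a measurement.
constexpr std::size_t kLaunchBoundThreshold = 100000;

struct Dispatcher {
    // Number of times we had to go through the generator and the JIT compiler.
    // Assumption: no caching of the compiled program between calls.
    std::size_t jit_compilations = 0;

    Path dispatch(std::size_t num_elements) {
        if (num_elements < kLaunchBoundThreshold)
            return Path::FusedAmbmKernel;   // launch-bound: precompiled kernel
        ++jit_compilations;                 // bandwidth-bound: generate + compile
        return Path::GeneratedKernel;
    }
};
```

Under these assumptions every large operation adds to jit_compilations, which is exactly the cost that the faster kernel would have to amortize.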

Instead, what about opening up the generator a bit? It is enough if we
have some mechanism to access a batch-generation of axpy-like
operations; for all other operations the generator can remain as-is.
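
A rough sketch of what such a batch-generation could look like: all variants of an axpy-like operation are emitted into one program string, so that the JIT compiler is only invoked once for the whole family. The struct, the function names, the avbv_ naming scheme and the kernel signature are assumptions made for illustration, not the generator's actual interface; scalars are assumed to reside on the device and are therefore passed as pointers.

```cpp
#include <cassert>
#include <string>

// Hypothetical flags for one variant of x = (+-)(y OP alpha) (+-)(z OP beta),
// where OP is '*' or, if the reciprocal flag is set, '/'.
struct AxpyVariant {
    bool reciprocal_alpha;
    bool flip_sign_alpha;
    bool reciprocal_beta;
    bool flip_sign_beta;
};

// Decode a 4-bit mask (bit 0 = reciprocal_alpha, ..., bit 3 = flip_sign_beta).
AxpyVariant variant_from_mask(unsigned mask) {
    assert(mask < 16u);
    return AxpyVariant{(mask & 1u) != 0, (mask & 2u) != 0,
                       (mask & 4u) != 0, (mask & 8u) != 0};
}

// Kernel name with one digit per flag, in the order of the struct members.
std::string kernel_name(const AxpyVariant& v) {
    std::string name = "avbv_";
    name += v.reciprocal_alpha ? '1' : '0';
    name += v.flip_sign_alpha ? '1' : '0';
    name += v.reciprocal_beta ? '1' : '0';
    name += v.flip_sign_beta ? '1' : '0';
    return name;
}

// One signed term of the right-hand side, e.g. "- y[i] / a".
std::string term(const std::string& vec, const std::string& scalar,
                 bool reciprocal, bool flip_sign) {
    return std::string(flip_sign ? "- " : "+ ") + vec + "[i] " +
           (reciprocal ? "/ " : "* ") + scalar;
}

// OpenCL C source for a single variant.
std::string generate_kernel(const AxpyVariant& v) {
    std::string src;
    src += "__kernel void " + kernel_name(v) + "(\n";
    src += "    __global float* x,\n";
    src += "    __global const float* y, __global const float* alpha,\n";
    src += "    __global const float* z, __global const float* beta,\n";
    src += "    unsigned int n)\n{\n";
    src += "  const float a = *alpha;\n";
    src += "  const float b = *beta;\n";
    src += "  for (unsigned int i = get_global_id(0); i < n; i += get_global_size(0))\n";
    src += "    x[i] = " + term("y", "a", v.reciprocal_alpha, v.flip_sign_alpha) + " " +
           term("z", "b", v.reciprocal_beta, v.flip_sign_beta) + ";\n";
    src += "}\n\n";
    return src;
}

// All 16 variants in one program string, to be compiled in a single JIT pass.
std::string generate_batch() {
    std::string program;
    for (unsigned mask = 0; mask < 16u; ++mask)
        program += generate_kernel(variant_from_mask(mask));
    return program;
}
```

The string returned by generate_batch() would be handed to the OpenCL compiler once, and the individual kernels would then be looked up by name.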

Another option is to move only the axpy template from the generator over
to linalg/opencl/kernels/*, because the generation of these kernels is
fairly lightweight. Sure, it is a little bit of code duplication, but
it will keep the generator clean.

Another possible improvement is to separate operations on full vectors 
from operations on ranges and slices. For full vectors we can use the 
built-in vector-types in OpenCL, which allows further optimizations not 
possible with ranges and strides, where we cannot use vector types in 

What do you think?

Best regards,

