Hey,
> The integration of the generator is going slowly, but "surely". All the
> OpenCL vector kernels (including plane rotations, multi-inner_prod,
> etc.) should be device-specific in a few days. This will be a good leap
> forward, in terms of maintainability, peak performance and
> performance-portability
I forgot to add that, right now, the kernel generator is used only when
start1=start2=start3=0 and stride1=stride2=stride3=1. Otherwise, we fall
back to the good old kernel.
I'd like to change this because I think that ranges are more common than
strides (ranges are part of the BLAS3
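
To make the current dispatch rule concrete, here is a minimal sketch of the condition described above. All the names (vector_layout, is_plain, kernel_path, choose_kernel) are made up for illustration and are not the actual library code; it only assumes that each of the three operands carries a start offset and a stride.

```cpp
#include <cstddef>

// Hypothetical description of a vector (sub-)view: offset of the first
// element and step between consecutive elements.
struct vector_layout {
  std::size_t start;
  std::size_t stride;
};

// True when the view is a plain vector: starts at element 0, unit stride.
constexpr bool is_plain(vector_layout v) {
  return v.start == 0 && v.stride == 1;
}

// The two code paths mentioned in the mail (names are assumptions).
enum class kernel_path { generated, legacy };

// Use the generated kernel only if all three operands are plain;
// any range (start != 0) or stride (stride != 1) falls back to the old kernel.
constexpr kernel_path choose_kernel(vector_layout x,
                                    vector_layout y,
                                    vector_layout z) {
  return (is_plain(x) && is_plain(y) && is_plain(z))
             ? kernel_path::generated
             : kernel_path::legacy;
}
```

With this rule, an operand such as {4, 1} (a range with unit stride) already takes the legacy path, which is the case I would like to move over to the generator.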
Hi,
The integration of the generator is going slowly, but "surely". All the
OpenCL vector kernels (including plane rotations, multi-inner_prod, etc.)
should be device-specific in a few days. This will be a good leap forward,
in terms of maintainability, peak performance and performance-portability