> I have finally started my work at the crossroads between Machine
> Learning and HPC. This is an excellent example of how PyViennaCL and
> ViennaCL can interact.
> Goal
> --------------------
> We want to execute a routine (GEMM, GEMV, DOT, FFT, etc.) on some
> hardware and a set of inputs.
> For now, the auto-tuner / generator optimizes the routine only with
> respect to the hardware. I'm working on optimizing it with respect to
> the properties of the inputs as well (in the case of GEMM: the three
> sizes involved).

Yes, this is the right thing to do. There is still a lot of potential in 
optimizing the routines for the respective sizes, particularly for the 
non-square case.

> Solution
> --------------------
> The idea is to run a large enough number of auto-tuning procedures and
> to record the best profiles for different given inputs (different M, N,
> K for GEMM). One can then do supervised learning to find the most
> suitable profile to execute (i.e. kernel to generate) given new inputs,
> without re-running the auto-tuning procedure.

Once practical aspects come in, there is still the issue of kernel
compilation. Even if we are able to find the best kernel for every
possible combination of matrix dimensions, covering them all may require
compiling far too many kernels. Clearly, users won't be happy about
1 TFLOP/s during a 1 second execution if it takes 10 seconds to compile
the kernels... ;-)
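
To put a rough number on that, here is a back-of-the-envelope sketch. It
reads the figures above as 10^12 floating-point operations executed in
1 second after 10 seconds of compilation; the function name and the reuse
count of 100 are only illustrative assumptions.

```python
def effective_tflops(flop, exec_seconds, compile_seconds, calls=1):
    """Throughput seen by the user when compilation is paid once
    and the kernel is then executed `calls` times."""
    total_seconds = compile_seconds + calls * exec_seconds
    return calls * flop / total_seconds / 1e12


# One call: 1e12 FLOP in 1 s, but 10 s of compilation up front.
single = effective_tflops(1e12, exec_seconds=1.0, compile_seconds=10.0)

# The same kernel reused 100 times amortizes the compilation.
reused = effective_tflops(1e12, exec_seconds=1.0, compile_seconds=10.0,
                          calls=100)

print(f"{single:.3f} TFLOP/s, {reused:.3f} TFLOP/s")
```

So a kernel that is compiled once and used once delivers less than a tenth
of its nominal rate, which is why the number of distinct kernels matters
more than the cost of picking one.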

> I have basically carried out about 30 carefully-chosen GEMM auto-tuning
> procedures on Hawaii SGEMM. And I can tell that the size and the shape
> both matter... Basically, if you use the wrong kernel, you might lose
> up to 20-30% of the performance.
> Anyway, 30 is an extremely small number considering that we are spanning
> three dimensions. I obtained 13 different optimal kernels, and in many
> cases the optimal kernel appears only once. Things can get better if we
> accept that different inputs share the same optimum as long as it
> doesn't hurt the performance too much (say, not more than 5%).

I'd say that if we stay within 10% of the best configuration for the 
possible sizes, that would be fantastic. I doubt that vendor-tuned 
libraries get anywhere close to that for the non-square case.
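
One possible way to turn such a tolerance into a small kernel set is a
greedy set cover over the auto-tuning results. The sketch below is only an
assumption about how this could be done, not what the auto-tuner does
today; the kernel names and GFLOP/s values are made up for illustration and
are not measurements.

```python
def select_kernels(perf, tolerance=0.10):
    """Greedy set cover: pick few kernels such that every input size
    runs within `tolerance` of its own best measured performance.

    perf maps (M, N, K) -> {kernel_name: measured GFLOP/s}.
    """
    # For each input, collect the kernels that are "good enough" for it.
    acceptable = {}
    for size, results in perf.items():
        best = max(results.values())
        acceptable[size] = {k for k, v in results.items()
                            if v >= (1.0 - tolerance) * best}

    uncovered = set(perf)
    chosen = []
    while uncovered:
        # Pick the kernel that is acceptable for the most uncovered inputs.
        candidates = sorted({k for s in uncovered for k in acceptable[s]})
        kernel = max(candidates,
                     key=lambda k: sum(k in acceptable[s] for s in uncovered))
        chosen.append(kernel)
        uncovered = {s for s in uncovered if kernel not in acceptable[s]}
    return chosen


# Made-up numbers, for illustration only.
perf = {
    (1024, 1024, 1024): {"A": 3000.0, "B": 2800.0, "C": 2000.0},
    (4096, 64, 4096):   {"A": 1500.0, "B": 2000.0, "C": 1950.0},
    (64, 4096, 64):     {"A": 850.0,  "B": 980.0,  "C": 1000.0},
}
within_10_percent = select_kernels(perf, tolerance=0.10)
exact_optimum = select_kernels(perf, tolerance=0.0)
print(within_10_percent, exact_optimum)
```

With these toy numbers a single kernel is enough at 10%, whereas insisting
on the exact optimum needs one kernel per input. Greedy set cover is not
guaranteed to find the smallest set, but it is simple and usually close.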

> For now, the results I have obtained running an SVM-classifier seem to
> make sense, but I think that we should have between 50 and 100 examples
> to make it work properly. This is not very tractable as of now, but
> another part of my research is to find a way to speed up the auto-tuning
> procedure. Anyway, this is altogether an interesting research direction
> which could potentially lead to nice performance improvements (on
> average). I'm not sure whether or not SVM is the most appropriate
> classifier for this, but it is what seems to make most sense to me in
> this particular case.
> Discussion
> -------------
> There are two very distinct steps in that procedure, that I'll recall for
> those who don't have a ML background:
> -> The training step: the parameters of the classifier are found. This
> is where all the auto-tuning procedures execute, and this is basically
> what takes potentially forever (a couple of days, perhaps). That is, we
> don't care about the overhead here. The point is that this is also a
> separate routine, so there's absolutely no reason to write it in C++!
> Plus, the whole Machine Learning community uses Python. That is, what we
> want to do here is to provide a few wrappers in PyViennaCL to generate a
> kernel for a given profile. From that point, we can re-use the existing
> work of other researchers to speed up the auto-tuning procedure, and
> /train/ some classifier for input-dependent kernel generation. Once the
> classifier is trained, we can export the model to a file (most ML
> libraries allow this). We could ideally replace the vendor-specific
> model file by some header-only C++ source code.
> -> The prediction step: This is executed every time a
> matrix-multiplication is carried out. A prediction is made at run-time
> given the inputs, the hardware and the model created during the training
> step. This triggers the generation/compilation of a
> hardware/input-specific kernel for optimal performance. I'm not afraid
> of the prediction overhead if we use C++ (since the input is only
> 3-dimensional).

Yes, I'm not worried about the prediction step either. I'm more worried 
about the number of kernels we need to compile. If the number 13 you 
mentioned above is representative, then this is tractable. Is this for a 
single layout?
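
To make the concern concrete, here is a minimal sketch of a prediction
step with a kernel cache, so that the number of compilations is bounded by
the number of distinct profiles rather than by the number of distinct
sizes. Everything in it is hypothetical: the class, the `compile_fn`
callback and the profile names are placeholders, not PyViennaCL API, and a
nearest-neighbour lookup in log2 space merely stands in for the trained
SVM.

```python
import math


class KernelSelector:
    """Nearest-neighbour stand-in for the trained classifier, plus a
    cache so that each profile is compiled at most once."""

    def __init__(self, training_set, compile_fn):
        # training_set: list of ((M, N, K), profile_name) pairs.
        self._examples = [(self._features(s), p) for s, p in training_set]
        self._compile = compile_fn   # hypothetical: profile -> kernel
        self._cache = {}             # profile -> compiled kernel

    @staticmethod
    def _features(size):
        # Sizes span orders of magnitude, so compare them in log2 space.
        return tuple(math.log2(d) for d in size)

    def predict(self, size):
        x = self._features(size)
        _, profile = min(self._examples, key=lambda e: math.dist(x, e[0]))
        return profile

    def kernel_for(self, size):
        profile = self.predict(size)
        if profile not in self._cache:
            self._cache[profile] = self._compile(profile)
        return self._cache[profile]


compiled = []  # records which profiles actually got compiled

def fake_compile(profile):
    compiled.append(profile)
    return f"<kernel {profile}>"

selector = KernelSelector(
    [((1024, 1024, 1024), "square"), ((4096, 64, 4096), "skinny")],
    fake_compile,
)
for size in [(2048, 2048, 2048), (900, 1100, 1000), (8192, 32, 8192)]:
    selector.kernel_for(size)
print(compiled)
```

Three different sizes trigger only two compilations here. With 13 profiles
per layout the worst case stays at 13 compilations, however many sizes the
user throws at it.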

> This is imho a perfect example of how PyViennaCL could be used to
> increase the productivity on the core.

Absolutely. :-)

