Hello,

The integration of the kernel generator has been a nightmare! Anyway, I've
realized that thousands of kernels per scalar type are required in order to
obtain optimal performance. Why so many?
- flip_a, reciprocal_a, flip_b and reciprocal_b each require their own kernel.
- The generator treats x = a*y + b*z, x = a*y + b*x, x = a*x + b*y, etc. as
distinct operations.
- Each avbv requires two kernels, because we need a fallback for the case
where the offset is not a multiple of the simd_width (see the sketch after
this list). There are tricks on AMD implementations to avoid this, but I
know of no portable one.
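
To make the last point concrete, here is a rough sketch of the two kernels a
single avbv statement like x = a*y + b*z ends up needing. This is not the
generator's actual output; names, signatures and the grid-stride loop are
made up for illustration, and the offset arguments are omitted.

/* Fast path: float4 accesses, only valid when the offsets and the size are
 * multiples of the SIMD width (4 here). */
__kernel void avbv_float4(__global float4 *x, __global const float4 *y,
                          __global const float4 *z, float a, float b,
                          unsigned int size4)
{
  for (unsigned int i = get_global_id(0); i < size4; i += get_global_size(0))
    x[i] = a * y[i] + b * z[i];
}

/* Fallback: plain float accesses, works for arbitrary offsets and sizes. */
__kernel void avbv_float1(__global float *x, __global const float *y,
                          __global const float *z, float a, float b,
                          unsigned int size)
{
  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0))
    x[i] = a * y[i] + b * z[i];
}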

As you might have guessed, this makes me uncomfortable.
On the one hand, it cannot hurt performance to have a specific
implementation for operations such as x = a*x + b*y, x = a*x + b*x, etc. On
the other hand, I seriously wonder whether the practical gain would be
noticeable, and what practical overhead it would induce.
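
To illustrate, a dedicated in-place kernel for x = a*x + b*y could look
roughly like the following (again made-up names, not the generated code).
The alternative is to route the operation through the generic avbv_float1
kernel above by binding x's buffer to both the output and the first input,
which is safe because each work-item only touches its own element.

__kernel void axbypx_float(__global float *x, __global const float *y,
                           float a, float b, unsigned int size)
{
  /* Reads x in place: one buffer argument fewer, same memory traffic. */
  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0))
    x[i] = a * x[i] + b * y[i];
}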

Note, however, that a kernel for x = a*x + a*x is at least as efficient as
x = (2*a)*x, and more efficient if a is a device scalar, since forming 2*a
would then require an extra kernel launch or a host read-back.
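
A sketch of the device-scalar case (made-up names again): the fused kernel
simply reads a from its buffer, whereas x = (2*a)*x would first need 2*a to
be formed by an extra kernel or a host read-back.

__kernel void axpax_devscal_float(__global float *x, __global const float *a,
                                  unsigned int size)
{
  float alpha = *a;  /* read the device scalar directly, no pre-pass needed */
  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0))
    x[i] = alpha * x[i] + alpha * x[i];
}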

I need your advice here. Should I add an option to force the generator to
treat each vector as a different object (so that x = a*x + b*z would use
the kernel for x = a*y + b*z with y <- x, as in the host-side sketch
below), or should I leave it as-is, accepting that we may get higher
throughput at the price of more latency? Has anyone ever had a bad
experience with very large programs?
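
For reference, a host-side sketch of that aliasing option, assuming the
generic avbv_float1 kernel from the first sketch has already been built
(illustrative only, not existing ViennaCL code):

#include <CL/cl.h>

/* Issues x = a*x + b*z through the generic kernel by binding x's buffer to
 * both the output and the y argument. */
static cl_int enqueue_avbv_aliased(cl_command_queue queue, cl_kernel avbv_float1,
                                   cl_mem x_buf, cl_mem z_buf,
                                   cl_float a, cl_float b, cl_uint size)
{
  size_t global = 128 * 128;  /* arbitrary launch size for the strided loop */
  clSetKernelArg(avbv_float1, 0, sizeof(cl_mem), &x_buf);    /* x (output) */
  clSetKernelArg(avbv_float1, 1, sizeof(cl_mem), &x_buf);    /* y  <- x    */
  clSetKernelArg(avbv_float1, 2, sizeof(cl_mem), &z_buf);    /* z          */
  clSetKernelArg(avbv_float1, 3, sizeof(cl_float), &a);
  clSetKernelArg(avbv_float1, 4, sizeof(cl_float), &b);
  clSetKernelArg(avbv_float1, 5, sizeof(cl_uint), &size);
  return clEnqueueNDRangeKernel(queue, avbv_float1, 1, NULL, &global,
                                NULL, 0, NULL, NULL);
}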

Philippe