Hi Karl,


2014/1/24 Karl Rupp <r...@iue.tuwien.ac.at>

> Hey,
>
>  > I am a bit confused, is there any reason for using "reciprocal" and
> > "flip_sign", instead of just changing the scalar accordingly?
>
> yes (with a drawback I'll discuss at the end): Consider the family of
> operations
>
>   x = +- y OP1 a +- z OP2 b
>
> where x, y, and z are vectors, OP1 and OP2 are either multiplication or
> division, and a,b are host scalars. If I did the math correctly, these
> are 16 different kernels when coded explicitly. Hence, if you put all
> these into separate OpenCL kernels, you'll get fairly long compilation
> times. However, note that you cannot do this if a and b stem from device
> scalars, because then the manipulation of a and b would result in
> additional buffer allocations and kernel launches -> way too slow.
>
> For floating point operations, one can reduce the number of operations a
> lot when (+- OP1 a) and (+- OP2 b) are computed once in a preprocessing
> step. Then, only the kernel
>
>   x = y * a' + z * b'
>
> is needed, cutting the number of OpenCL kernels from 16 to 1. Since (-a)
> and (1/a) cannot be computed outside the kernel if a is a GPU scalar,
> this is always computed in a preprocessing step inside the OpenCL kernel
> for unification purposes. I think we can even apply some more cleverness
> here if we delegate all the work to a suitable implementation function.
>
> And now for the drawback: When using integers, the operation n/m is no
> longer the same as n * (1/m). Even worse, for unsigned integers it is
> also no longer possible to replace n - m by n + (-m). Thus, we certainly
> have to bite the bullet and generate kernels for all 16 combinations
> when using unsigned integers. However, I'm reluctant to generate all 16
> combinations for floating point arguments if this is not needed...
>
>
Thanks for the clarification. I absolutely don't want to generate the 16
kernels either!

I was in fact wondering why reciprocal_alpha and flip_sign were passed into
the kernel. After thinking about it some more, I noticed that this permits
us to do the corresponding inversion/sign flip within the kernel, and
therefore avoid some latency penalty / kernel launch overhead when the
scalar resides on the device. That's smart!
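
If I understand the idea correctly, it can be sketched as follows (names
are hypothetical; this host-side code just mimics what the single fused
kernel does with its two flags):

```cpp
#include <cassert>

// Per-scalar preprocessing, done once inside the fused kernel: the two
// flags fold the four (+-, *|/) variants into one effective scalar.
double preprocess_scalar(double a, bool flip_sign, bool reciprocal)
{
    double a_prime = reciprocal ? 1.0 / a : a;  // OP is division -> use 1/a
    return flip_sign ? -a_prime : a_prime;      // leading minus -> negate
}

// The single kernel body then only ever computes x = y * a' + z * b',
// covering all 16 sign/op combinations without extra kernel launches.
double fused_element(double y, double a_prime, double z, double b_prime)
{
    return y * a_prime + z * b_prime;
}
```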
On the other hand, modifying the generator so that it does not actually
generate a specific kernel would be absurd imho. This raises another
question, then: how could ambm benefit from the auto-tuning environment?
I propose the following solution:

Check the size of the matrices/vectors involved.

If the computation is dominated by the kernel launch time (say, fewer than
100,000 elements), then we use the current ambm kernel. Otherwise, we
transfer the scalars to the CPU, perform the corresponding a' = +- OP a,
b' = +- OP b, and either generate the kernel or use a BLAS library. This
way, we benefit from the low kernel launch overhead for small data, and
from high bandwidth for large data. Does this sound good?
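
In code, the dispatch I have in mind would look roughly like this (the
names and the 100,000-element threshold are placeholders for whatever the
auto-tuner settles on):

```cpp
#include <cstddef>

// Below this size, kernel launch overhead dominates; above it, bandwidth
// does. The concrete value would come from benchmarking / auto-tuning.
const std::size_t LAUNCH_OVERHEAD_THRESHOLD = 100000;

enum class AmbmPath
{
    FusedKernel,           // current ambm kernel, scalars stay on the device
    HostPreprocessedBlas   // transfer scalars, compute a', b', call BLAS
};

AmbmPath choose_ambm_path(std::size_t num_elements)
{
    return (num_elements < LAUNCH_OVERHEAD_THRESHOLD)
               ? AmbmPath::FusedKernel
               : AmbmPath::HostPreprocessedBlas;
}
```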

Best regards,
Philippe


> Best regards,
> Karli
>
>
>
> _______________________________________________
> ViennaCL-devel mailing list
> ViennaCL-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/viennacl-devel
>
