Hey,

2014/1/24 Karl Rupp <[email protected]>

> Hi,
>
>
>> I was in fact wondering why one passed reciprocal_alpha and flip_sign
>> into the kernel. After thinking more about it, I have noticed that this
>> permits us to do the corresponding inversion/multiplication within the
>> kernel, and therefore avoid some latency penalty / kernel launch
>> overhead when the scalar resides on the device. That's smart!
>> On the other hand, modifying the generator to not actually generate a
>> specific kernel would be absurd imho. This brings up another question,
>> then: how could ambm benefit from the auto-tuning environment?
>> I propose the following solution:
>>
>> Check the size of the matrices/vectors.
>>
>> If the computation is dominated by the kernel launch time (say, fewer
>> than 100,000 elements), then we use the current ambm kernel. Otherwise,
>> we transfer the scalars to the CPU, perform the corresponding a' = +- OP
>> a, b' = +- OP b, and either generate the kernel or use a BLAS library.
>> This way, we benefit from the kernel launch time optimization for small
>> data and from high bandwidth for large data. Does this sound good?
>>
>
> In terms of execution time, this is probably the best solution. On the
> other hand, it does not solve the problem of compilation overhead: If we
> only dispatch into the generator for large data, we still have to generate
> the respective kernels and go through the OpenCL jit-compiler each time.
> The compilation overhead of this is even likely to dominate any gains we
> get from a faster execution.
>
> Instead, what about opening up the generator a bit? It is enough if we
> have some mechanism to access a batch-generation of axpy-like operations;
> for all other operations the generator can remain as-is.
>
> Another option is to move only the axpy-template from the generator over
> to linalg/opencl/kernels/*, because the generation of these kernels is
> fairly light-weight. Sure, it is a little bit of code-duplication, but it
> will keep the generator clean.
>
> Another possible improvement is to separate operations on full vectors
> from operations on ranges and slices. For full vectors we can use the
> built-in vector-types in OpenCL, which allows further optimizations not
> possible with ranges and strides, where we cannot use vector types in
> general.
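
Just to check that I read this correctly: for a full, contiguous vector the
generated kernel could use the built-in vector types, along the lines of the
sketch below. This is purely illustrative on my side; the kernel name, the
work distribution, and the tail handling are made up:

// Bandwidth-oriented kernel for full, contiguous vectors, using the
// built-in float4 vector type (one float4 load/store per iteration).
__kernel void avbv_full(__global float4       *x,
                        __global const float4 *y,
                        __global const float4 *z,
                        float alpha, float beta,
                        unsigned int size4)  // size4 = size / 4
{
  for (unsigned int i = get_global_id(0); i < size4; i += get_global_size(0))
    x[i] = alpha * y[i] + beta * z[i];
  // tail elements (size % 4) omitted for brevity
}
// With a nonzero start or a stride > 1 this does not work in general,
// so ranges and slices would keep the plain float version.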


> What do you think?
>
>
I prefer option 3. This would allow for something like:

if (size(x) > 1e5 && stride == 1 && start == 0) {

  // The following steps are costly for small vectors:
  NumericT cpu_alpha = alpha; // copy back to host when the scalar is in
                              // global device memory
  if (alpha_flip) cpu_alpha *= -1;
  if (reciprocal) cpu_alpha = 1 / cpu_alpha;
  // ... same for beta

  // Optimized routines
  if (external_blas)
    call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
  else
    generate_execute(x = cpu_alpha*y + cpu_beta*z);
}
else {
  // fallback: scalars stay on the device, use the current ambm kernel
}

This way, we generate at most two kernels: one for small vectors, designed
to optimize latency, and one for large vectors, designed to optimize
bandwidth. Are we converging? :)
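
For the small-vector branch, I imagine we simply keep today's approach:
the scalars never leave the device, and the flip/reciprocal options are
applied inside the kernel, roughly along these lines (again only a sketch
from my side; the exact signature and names are invented):

// Latency-oriented small-vector kernel: the scalars stay in device
// memory, and the sign flip / reciprocal is applied in-kernel, so no
// extra kernel launch or host round-trip is needed.
__kernel void ambm_small(__global float *x,
                         __global const float *y,
                         __global const float *z,
                         __global const float *alpha,
                         unsigned int flip_alpha,
                         unsigned int reciprocal_alpha,
                         __global const float *beta,
                         unsigned int flip_beta,
                         unsigned int reciprocal_beta,
                         unsigned int size)
{
  float a = *alpha;
  if (flip_alpha)       a = -a;
  if (reciprocal_alpha) a = 1.0f / a;
  float b = *beta;
  if (flip_beta)        b = -b;
  if (reciprocal_beta)  b = 1.0f / b;
  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0))
    x[i] = a * y[i] + b * z[i];
}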


Best regards,
Philippe


> Best regards,
> Karli