Hey,
2014/1/24 Karl Rupp <[email protected]>
> Hi,
>
>
>> I was in fact wondering why one passed reciprocal_alpha and flip_sign
>> into the kernel. After thinking more about it, I have noticed that this
>> permits us to do the corresponding inversion/multiplication within the
>> kernel, and therefore avoid some latency penalty / kernel launch
>> overhead when the scalar resides on the device. That's smart!
>> On the other hand, modifying the generator so that it does not actually
>> generate a specific kernel would be absurd imho. This brings another
>> question, then: how could ambm benefit from the auto-tuning environment?
>> I propose the following solution:
>>
>> Check the size of the matrices/vectors.
>>
>> If the computation is dominated by the kernel launch time (say, fewer
>> than 100,000 elements), then we use the current ambm kernel. Otherwise,
>> we transfer the scalars to the CPU, compute the corresponding a' = +- OP a
>> and b' = +- OP b on the host, and either generate the kernel or use a
>> BLAS library. This way, we benefit from low kernel launch overhead for
>> small data and high bandwidth for large data. Does this sound good?
>>
>
> In terms of execution time, this is probably the best solution. On the
> other hand, it does not solve the problem of compilation overhead: If we
> only dispatch into the generator for large data, we still have to generate
> the respective kernels and go through the OpenCL JIT compiler each time.
> The compilation overhead is then likely to dominate any gains we get
> from faster execution.
>
> Instead, what about opening up the generator a bit? It is enough if we have
> some mechanism to access a batch-generation of axpy-like operations, for
> all other operations the generator can remain as-is.
>
> Another option is to move only the axpy-template from the generator over
> to linalg/opencl/kernels/*, because the generation of these kernels is
> fairly light-weight. Sure, it is a little bit of code-duplication, but it
> will keep the generator clean.
>
> Another possible improvement is to separate operations on full vectors
> from operations on ranges and slices. For full vectors we can use the
> built-in vector types in OpenCL, which allow further optimizations not
> possible with ranges and strides, where we cannot use vector types in
> general.
> What do you think?
>
>
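To make the in-kernel scalar handling explicit first: as far as I
understand it, the current latency-optimized kernel keeps the scalars in
global memory and applies the sign flip / inversion on the device,
roughly along these lines (only a sketch, the argument names are
illustrative):

__kernel void avbv_gpu_scalars(__global float       *x,
                               __global const float *alpha,
                               int                   flip_alpha,
                               int                   reciprocal_alpha,
                               __global const float *y,
                               __global const float *beta,
                               int                   flip_beta,
                               int                   reciprocal_beta,
                               __global const float *z,
                               unsigned int          size)
{
  /* the scalars never leave the device: flip/reciprocal are applied
     here, so no host<->device transfer is needed before the launch */
  float a = *alpha;
  if (flip_alpha)       a = -a;
  if (reciprocal_alpha) a = 1.0f / a;
  float b = *beta;
  if (flip_beta)        b = -b;
  if (reciprocal_beta)  b = 1.0f / b;
  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0))
    x[i] = a * y[i] + b * z[i];
}

The few extra branches per work-item are negligible compared to the
launch overhead this saves for small vectors.
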
I prefer option 3. This would allow for something like:
if (size(x) > 1e5 && stride == 1 && start == 0) {
  // The following steps are costly for small vectors
  NumericT cpu_alpha = alpha; // copy back to host when the scalar is in
                              // global device memory
  if (alpha_flip) cpu_alpha *= -1;
  if (reciprocal) cpu_alpha = 1 / cpu_alpha;
  // ... same for beta

  // Optimized routines
  if (external_blas)
    call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
  else
    generate_execute(x = cpu_alpha*y + cpu_beta*z);
}
else {
  // fallback: current latency-optimized kernel
}
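
For the large-vector branch, since we have guaranteed stride == 1 and
start == 0, the generated kernel could additionally use the built-in
vector types, e.g. something like this (again only a sketch; I assume
here that the buffer is padded to a multiple of 4 elements):

__kernel void avbv_float4(__global float4       *x,
                          float                  alpha,
                          __global const float4 *y,
                          float                  beta,
                          __global const float4 *z,
                          unsigned int           size4) /* = size / 4 */
{
  /* alpha and beta were already flipped/inverted on the host;
     float4 accesses give wide, coalesced loads and stores */
  for (unsigned int i = get_global_id(0); i < size4; i += get_global_size(0))
    x[i] = alpha * y[i] + beta * z[i];
}
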
This way, we generate at most two kernels: one for small vectors, designed
to optimize latency, and one for big vectors, designed to optimize
bandwidth. Are we converging? :)
Best regards,
Philippe
> Best regards,
> Karli
>
>