Hi hi,

2013/10/27 Karl Rupp <r...@iue.tuwien.ac.at>

> Hi,
>
>
> > This makes the assumption that the 2-way reduction will always be the
>
>> best way to compute an inner-product on any OpenCL device. We want the
>> reduction-based programs to be device-specific, so these "sometimes
>> truncated operations" will have to be forwarded somehow to the kernel
>> generator, and therefore the expression tree. Does it mean that we need
>> an additional parameter in the statement which basically says "don't
>> execute the last kernel!". This would introduce a lot of complexity in
>> the scheduler and the generator, for too little benefit imho.
>>
>
> You are right, this is indeed a bit tricky. There is preparation for this
> case already in the 'standard' vector kernels, where each GPU scalar
> argument may include an additional 'mini reduction' before computing the
> actual operation. However, this functionality is currently unused. The
> motivation for this were operations of type
>  z = inner_prod(u,v) * w;
> where the second reduction could go into the z <- alpha * w assignment.


Oh I see :)
When the kernels are generated, this is actually what happens, i.e.
z = inner_prod(u,v) * w
leads to two kernels.


>
>
>  What about input-dependent kernels? For small inputs where the second
>> kernel would not be negligible, we would actually be better off
>> performing the full reduction computation in one, big, work group. I
>> think that, for small vectors, this is also more cache-efficient than the
>> first kernel of the dual-reduction approach plus a final reduction on
>> the CPU... This would preserve the benefit of saving one kernel launch,
>> and at the same time more smoothly integrate within the
>> scheduler/generator framework...
>>
>
> Yes, I thought about that already. I think we don't need separate kernels,
> only a proper kernel-calling logic. What is quite tricky is to get the
> 'cross-over' point right, because that depends not only on the device
> performance, but also on the latency, which is OS-specific.
>
>
Ah... This gets tricky indeed. Is there any measure of how the OS affects
the latency? Specifically, if the OS-dependence is independent of the
device-dependence, there should be static ways out of this mess...

Another simple way out is to pick a reasonable default "cross-over size"
value, and to integrate such platform-specific information into the
autotuning software. The user could then override the default for optimal
results, typically via a #define... until we are able to interact with the
autotuner's results at runtime (through some I/O mechanism).


Best regards,
> Karli
>
>
Philippe
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
