Hi,

2014-05-28 10:25 GMT+02:00 Karl Rupp <r...@iue.tuwien.ac.at>:

> Hi,
>
>
> > Some additional information: now that the test passes, I can safely
>
>> compare the JIT compilation times.
>> Without the optimized kernels: 1.6 seconds
>> With the optimized kernels: 6.5 seconds
>> I fear that this might become intractable when further operations are
>> added. I should probably find a way to disable the optimized kernels if
>> the VIENNACL_CACHE_PATH environment variable is not set...
>>
>
> Pfuh, that is tough; 6.5 seconds is too much and will upset users quite a
> lot. What is the number of programs and kernels for this?
>
>
>
>
>      The integration of the kernel generator has been a nightmare!
>>
>
> Heads up: If it were easy, others would have done it already ;-)
>
>
>
>      Anyway, I've realized that thousands of kernels per scalartype are
>>     required, in order to obtain optimal performance.
>>
>
> Well, 'required' is probably not the correct term here ;-)
>
>
>  Why so many?
>>     - flip_a, reciprocal_a, flip_b, reciprocal_b each requiring their own
>> kernel
>>
>
> Well, this is only required for unsigned integers. For signed integers one
> only needs the reciprocals, and for floating-point types one can merge this
> into a single kernel.
>
>
There is one program of around 1000 kernels. For comparison, the current
vector operations program takes 2 seconds to compile on my laptop. Since the
1.6 seconds include all the flip/reciprocal variants, but not the x = a*x +
b*y -like operations, I suspect that it would be better in every respect to
have multiple kernels for float/double too. I suspect that it is faster
for the compiler to process multiple simple kernels rather than one more
complicated one (the compiler probably has a hard time optimizing the
conditional statements if it cannot assume that the condition will have the
same value for all the threads). I really think that having separate
kernels for flip/reciprocal is the way to go. Plus, it allows us to have
the same implementation for all the numeric types, which is much, much
better.
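
To make the trade-off concrete, here is a minimal sketch (illustrative only,
not what the generator actually emits; kernel names are made up) of the two
options for a plain x = a*y, written as the source strings we would append:

#include <string>

// Option 1: one generic kernel, with flip/reciprocal as runtime arguments.
// The uniform branches survive into the compiled kernel unless the compiler
// can prove the condition is identical for all work items.
std::string const avbv_generic = R"(
__kernel void avbv_generic(__global float * x, __global float const * y,
                           float a, int flip_a, int reciprocal_a)
{
  size_t i = get_global_id(0);
  float alpha = reciprocal_a ? 1.0f / a : a;
  if (flip_a)
    alpha = -alpha;
  x[i] = alpha * y[i];
}
)";

// Option 2: one specialized, branch-free kernel per flip/reciprocal
// combination (here: flip_a only), with the scalar treatment resolved at
// generation time.
std::string const avbv_flip_a = R"(
__kernel void avbv_flip_a(__global float * x, __global float const * y,
                          float a)
{
  size_t i = get_global_id(0);
  x[i] = -a * y[i];
}
)";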


>      - The generator interprets differently x = a*y + b*z, x = a*y + b*x,
>>     x = a*x + b*y, etc...
>>
>
> Hmm, this needs fixing. I see two possible paths:
>  - Only query the kernel sources from the generator, but manage them in a
> separate entity just like it is done now. This way one can deal with the
> necessary extra logic outside the generator, just like it is done now in
> viennacl/linalg/opencl/kernels/*.hpp


I don't see it as a "bug", though. x = a*y + b*z and x = a*y + b*x are two
different expression trees. They eventually lead to two equivalent kernels
(2N reads, N writes) because of the triviality of avbv kernels, not because
of the expression tree. I think it's normal that they are interpreted
differently, but we should have a way to control it to allow for a simpler
behavior.
Unfortunately, there are many problems with handling the enqueueing manually,
without using the generator:
- We shouldn't have to modify vector_operations.hpp when we modify a given
kernel template. For example, the number of kernels required by the
template (1 for AXPY, 2 for DOT, 1 for GEMV/GEMM) should remain entirely
internal. If we ever find out that GEMV is better off with 2 kernels on
CPUs, then we shouldn't have to modify anything other than the GEMV profile.
Similarly, if for some reason we realize that a kernel could be optimized
by having an additional argument, we shouldn't have to modify the ViennaCL
core. While generation and enqueueing should be clearly separated, it is
fundamental that they remain encapsulated (see the sketch below).
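
To illustrate what I mean by encapsulation, a hypothetical interface could
look like the following sketch (names invented for the example, not the
actual code):

#include <string>
#include <vector>

// Hypothetical sketch: each kernel template owns both its generation and
// its enqueueing logic, so the number of kernels (1 for AXPY, 2 for DOT,
// ...) never leaks into the ViennaCL core.
class template_profile
{
public:
  virtual ~template_profile() {}

  // Generation side: the sources of all kernels this template needs.
  virtual std::vector<std::string> sources() const = 0;

  // Enqueueing side: bind arguments and launch however many kernels the
  // template expands to. Adding an internal kernel or an extra kernel
  // argument then only changes the profile, never vector_operations.hpp.
  virtual void enqueue() const = 0;
};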

>
>
>
>      - Each avbv requires 2 kernels, because we need one fallback when the
>>     offset is not a multiple of the simd_width. There are some tricks on
>>     AMD implementations to avoid doing this, but I know of no portable one.
>>
>
> Do you have performance numbers on this? As this is heavily
> memory-bandwidth limited, it may not make any notable difference overall.
> Btw: Could you please explain in a sentence or two what this new
> simd_width parameter is about? I know what SIMD is, yet I want to make sure
> that we are talking about the same thing.


Yes, the name will change. I should call it vector_width, to conform with
the OpenCL standard. It's about using float4* instead of float*, for
example. On my AMD HD5850, loading through float4* rather than float* makes
a huge bandwidth difference. I guess you observed that as well when you
auto-tuned avbv.
The problem is that using float4* restricts pointer arithmetic, so the
offset is forced to be a multiple of 4. On AMD hardware, one may safely do

union ptr_t
{
   __global float  * fp;
   __global float4 * f4p;
};

union ptr_t p;
p.f4p = my_float4_ptr;  /* aligned float4 base pointer          */
p.fp += offset;         /* scalar arithmetic reaches any offset */

to handle all offsets, but that does not sound to me like a reasonable,
portable solution...
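
For completeness, the portable alternative is the two-kernel approach from
above: a minimal sketch (illustrative names, not the generated code) of the
vectorized kernel, the scalar fallback, and the host-side condition:

#include <cstddef>
#include <string>

// Vectorized variant: legal only when the offset and size are multiples of
// the vector width (4 floats here).
std::string const avbv_float4 = R"(
__kernel void avbv_float4(__global float4 * x, __global float4 const * y,
                          float a)
{
  size_t i = get_global_id(0);
  x[i] = a * y[i];
}
)";

// Scalar fallback: handles arbitrary offsets and sizes.
std::string const avbv_float = R"(
__kernel void avbv_float(__global float * x, __global float const * y,
                         float a)
{
  size_t i = get_global_id(0);
  x[i] = a * y[i];
}
)";

// Host side: pick the vectorized kernel only when it is legal.
bool can_use_vector_width(std::size_t offset, std::size_t size,
                          std::size_t width)
{
  return (offset % width == 0) && (size % width == 0);
}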


>
>      As you might have guessed, this makes me uncomfortable and upset.
>>     On the one hand, it cannot be bad performance-wise to have a
>>     specific implementation for operations such as x = a*x + b*y, x =
>>     a*x + b*x, etc. On the other hand, I'm seriously wondering if the
>>     practical gain would be noticeable, and what practical overhead it
>>     would induce.
>>
>
> It's reasonable to expect that
>  a) the scheduler can take care of some optimizations
>  b) the user/developer won't specify such operations (or at least won't
> expect record-setting performance from it)
> Thus, we should use this freedom in the design space and cut down the
> kernel numbers appropriately to the benefit of all users.
>
>
>      Note, however, that a kernel x = a*x + a*x is at least as efficient
>>     as x = (2*a)*x (and more efficient if a is a device scalar).
>>
>
> Which is cool :-)
>
>
>
>      I need your advice here. Should I add an option to force the
>>     generator to treat each vector as a different object (so that x =
>>     a*x + b*z would use the kernel x = a*y + b*z with y<-x), or should I
>>     leave it as-is, considering that we might have a higher throughput
>>     at the price of more latency? Has anyone ever had bad experience
>>     with very large programs?
>>
>
> I haven't looked at how you integrate all the operations currently, but
> from what I've written above it sounds most reasonable to me to not taint
> the generator with such a 'force flag' and instead use the logic outside
> the generator to take care of this. I think that replacing the manual
> kernel generation in viennacl/linalg/opencl/kernels/*.hpp with source
> strings from the generator is at least close to the global optimum already,
> given the current overheads of program and kernel compilation in the SDKs
> out there. If there is a better approach, I'm happy to adapt to it, yet I
> can't see it right now. Thoughts?
>
>

As a sketch of how it is implemented, it does something like this:

in linalg/opencl/kernels/vector.hpp:generate_avbv:

source.append(device_specific::generate::opencl_sources(
                 database::get<T>(database::axpy),
                 scheduler::preset::axpy(&x, &y, &alpha, reciprocal_a, flip_a,
                                         &z, &beta, reciprocal_b, flip_b)));

and in linalg/opencl/vector_operations.hpp:avbv:

device_specific::enqueue(database::get<T>(database::axpy),
                         kernels::vector<T>::program_name,
                         scheduler::preset::axpy(&x, &y, &alpha, reciprocal_a,
                                                 flip_a, &z, &beta,
                                                 reciprocal_b, flip_b));

The whole problem could be sorted out by adding a bool parameter to
opencl_sources() and enqueue(), telling the generator to ignore whether x, y
and z are the same object (i.e., point to the same memory handle) in the
same statement. Initially, this feature was added to identify handles when
there are multiple statements, in order to load z only once in things like:

s = reduce<max>(z) - reduce<min>(z)
{x = y + z, y = x + z}
x = y + z + z

It turns out that in the last case the benefit is limited, but it's just a
special case!
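
Concretely, the extra parameter could look roughly like this (hypothetical
signatures, for illustration only; the parameter name is made up):

// Hypothetical sketch, not committed code: a flag telling the generator to
// treat every leaf of the statement as a distinct object, so that
// x = a*x + b*z reuses the source generated for x = a*y + b*z.
std::vector<std::string> opencl_sources(template_base const & profile,
                                        scheduler::statement const & statement,
                                        bool ignore_aliasing = false);

void enqueue(template_base const & profile,
             std::string const & program_name,
             scheduler::statement const & statement,
             bool ignore_aliasing = false);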

Best regards,
Philippe


> Best regards,
> Karli
>
>