
> There is one program of around 1000 kernels. For comparison, the current
> vector operations program takes 2 seconds to compile on my laptop. Since
> the 1.6 seconds include all the flip and reciprocal variants, but not the
> x = a*x + b*y -like operations, I suspect that it would be better in every
> respect to have multiple kernels for float/double too.

It would certainly be more uniform and save us quite a few special cases,
yes. So, let me ask the question differently then: How large is the
performance difference of 'non-optimized' versus 'optimized' for these
vector kernels? For BLAS level 1 operations I'd expect that it is on the
order of 10 percent, if at all (of course assuming that we pick suitable
device-specific profiles). Even for the matrix-vector product it
shouldn't get above ~30%. I'd also expect that the older the hardware,
the more pronounced the effect.

Another suggestion: What if we use the 'non-optimized' kernels by default,
but provide users with a switch to explicitly enable the optimized 
kernels? This way users are made aware of the trade-off if they're 
hunting for the last bit of performance, while the average user isn't 
annoyed by large jit-overhead.
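
To make the suggestion a bit more concrete, here is a minimal sketch of what such a switch could look like on the host side. All names (kernel_mode, kernel_options, program_name) are made up for illustration and are not existing ViennaCL API; the only point is that the default stays on the cheap-to-compile path.

```cpp
#include <cassert>
#include <string>

// Hypothetical names throughout; none of this is existing ViennaCL API.
enum class kernel_mode { generic, tuned };

struct kernel_options {
  // Default: the 'non-optimized' kernels with low jit overhead.
  kernel_mode mode = kernel_mode::generic;
};

// Returns the name of the program to build for a given operation family.
inline std::string program_name(const std::string& family,
                                const kernel_options& opts) {
  assert(!family.empty());
  return family + (opts.mode == kernel_mode::tuned ? "_tuned" : "_generic");
}
```

A user hunting for the last bit of performance would set opts.mode = kernel_mode::tuned once and accept the longer compilation; everybody else never touches it.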

> I suspect that it
> is faster for the compiler to have multiple simple kernels, rather than
> one more complicated kernel (the compiler probably has a hard time optimizing
> the conditional statements if it cannot assume that the condition will
> have the same value for all the threads). I really think that having
> separate kernels for flip/reciprocal is the way to go. Plus, it allows
> us to have the same implementation for all the numeric types, which is
> much, much better.

It is a trade-off. If we can keep compilation times sufficiently low, 
then we can follow this route. ~1 second jit-overhead is kind of a 
magical threshold: Make it .2 and nobody will really notice, but make it 
5 and everybody will complain.

>              - The generator interprets differently x = a*y + b*z,
>              x = a*y + b*x, x = a*x + b*y, etc...
>     Hmm, this needs fixing. I see two possible paths:
>       - Only query the kernel sources from the generator, but manage
>     them in a separate entity just like it is done now. This way one can
>     deal with the necessary extra logic outside the generator, just like
>     it is done now in viennacl/linalg/opencl/__kernels/*.hpp
> I don't see it as a "bug", though.  x = a*y + b*z and x = a*y + b*x are
> two different expression trees. It eventually leads to two equivalent
> kernels (2N reads, N write) because of the triviality of avbv kernels,
> not because of the expression tree. I think it's normal that it is
> interpreted differently, but we should have a way to control it to allow
> for a simpler behavior.

It's not a bug in the generator, but it's a bug in the way we handle the 
kernels then. If jit-compilation overhead is becoming an issue, we have 
to take appropriate measures to fight it, and not just hide behind the 
academic barrel and justify excessive overhead via 'but the 
implementation is nice and abstract'. Providing a flag in order to 
control the behavior sounds like a legitimate approach here.

> There are many problems with handling the enqueueing manually, without
> using the generator, unfortunately:
> - We shouldn't have to modify vector_operations.hpp when we modify a
> given kernel template. For example, the number of kernels required by
> the template (1 for AXPY, 2 for DOT, 1 for GEMV/GEMM) should remain
> entirely internal. If we ever find out that GEMV is better off with 2
> kernels on CPUs, then we shouldn't have to modify anything other than the
> GEMV profile. Similarly, if for some reason we realize that a kernel
> could be optimized by having an additional argument, we shouldn't have
> to modify the viennacl core. While the generation and the enqueueing
> should be clearly separated, it is fundamental that they remain
> encapsulated.
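
A rough sketch of the encapsulation asked for above, under the assumption that each template exposes a single enqueue() and keeps its kernel count to itself. The types here (recording_queue, kernel_template, axpy_template, dot_template) are hypothetical stand-ins, and the split of DOT into a partial and a final reduction kernel is my assumption about what the two kernels are.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a command queue: it only records which kernels were launched.
struct recording_queue {
  std::vector<std::string> launched;
  void launch(const std::string& kernel) { launched.push_back(kernel); }
};

// Hypothetical interface: callers only ever see enqueue().
struct kernel_template {
  virtual ~kernel_template() = default;
  virtual void enqueue(recording_queue& q) const = 0;
};

// AXPY-like operations need a single kernel.
struct axpy_template : kernel_template {
  void enqueue(recording_queue& q) const override { q.launch("axpy"); }
};

// DOT needs two kernels (assumed: partial reduction, then final reduction).
struct dot_template : kernel_template {
  void enqueue(recording_queue& q) const override {
    q.launch("dot_partial");
    q.launch("dot_final");
  }
};

// Caller-side code is identical no matter how many kernels a template uses.
// Returns the number of kernels that were launched.
inline std::size_t run(const kernel_template& t, recording_queue& q) {
  const std::size_t before = q.launched.size();
  t.enqueue(q);
  assert(q.launched.size() >= before);
  return q.launched.size() - before;
}
```

If GEMV ever turned out to be better off with two kernels on CPUs, only the corresponding template would change; run() and everything above it would stay as it is.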


>              - Each avbv requires 2 kernels, because we need one fallback
>              when the offset is not a multiple of the simd_width. There
>              are some tricks on AMD implementations to avoid doing this,
>              but I know no portable trick.
>     Do you have performance numbers on this? As this is heavily
>     memory-bandwidth limited, it may not make any notable difference
>     overall.
>     Btw: Could you please explain in a sentence or two what this new
>     simd_width parameter is about? I know what SIMD is, yet I want to
>     make sure that we are talking about the same thing.
> Yes, the name will change. I should call it vector_width, to conform
> with the OpenCL standards. It's about using float4* instead of float*,
> for example. It does make a huge bandwidth difference to load float4*
> rather than float* on my AMD HD5850. I guess you observed that as well
> when you auto-tuned avbv.

Yes, I observed that. I also observed that without these vector types 
one can still get fairly close to peak performance if the right work 
group sizes/dimensions are chosen, particularly on newer hardware.
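
For comparing the float4 and float variants it may help to look at effective bandwidth rather than raw timings. Below is a small sketch based on the 2N reads and N writes of an avbv operation mentioned earlier in this thread; the function names are illustrative only, and the peak bandwidth has to come from the device's data sheet.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s for x = a*y + b*z on n elements:
// 2n elements are read and n elements are written, i.e. 3n in total.
inline double avbv_bandwidth_gbs(std::size_t n,
                                 std::size_t bytes_per_element,
                                 double seconds) {
  assert(seconds > 0.0);
  const double bytes = 3.0 * static_cast<double>(n)
                           * static_cast<double>(bytes_per_element);
  return bytes / seconds / 1e9;
}

// Fraction of the device's peak bandwidth that a kernel reaches.
inline double fraction_of_peak(double measured_gbs, double peak_gbs) {
  assert(peak_gbs > 0.0);
  return measured_gbs / peak_gbs;
}
```

As a worked example: one million floats processed in one millisecond move 12 MB, which corresponds to about 12 GB/s.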

> The problem is that using float4* restricts pointer arithmetic so the
> offset is forced to be a multiple of 4. On AMD hardware, one may safely do
> union ptr_t {
>     float* fp;
>     float4* f4p;
> } ptr;
> ptr.f4p = my_float4_ptr;
> ptr.fp += offset;
> to handle all offsets, but it does not sound to me like a reasonable
> portable solution...

You certainly know this thread: ;-)

The endianness should not be a problem here, as both float* and float4* 
have the same size. My interpretation of the standard is that you can 
safely go from float4 to float, but the other way only works within the 
alignment guarantees. A more practical problem is that of performance: 
If a compiler sees such constructs, it may quickly switch to worst-case 
assumptions, not providing much benefit.
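
The portable alternative discussed above, i.e. keeping a scalar fallback kernel next to the vectorized one, boils down to a host-side check like the following. This is only a sketch: avbv_variant and pick_avbv_variant are hypothetical names, and I only look at the offsets here, as in the discussion, not at sizes or strides.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>

enum class avbv_variant { vectorized, scalar_fallback };

// Picks the float4-style kernel only if every operand offset is a multiple
// of the vector width; otherwise the plain float kernel has to be used.
inline avbv_variant pick_avbv_variant(std::initializer_list<std::size_t> offsets,
                                      std::size_t vector_width) {
  assert(vector_width > 0);
  for (std::size_t off : offsets) {
    if (off % vector_width != 0) {
      return avbv_variant::scalar_fallback;
    }
  }
  return avbv_variant::vectorized;
}
```

With a vector width of 4, offsets {0, 8, 4} select the vectorized kernel, whereas {0, 3, 4} fall back to the scalar one.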

> As a sketch of how it is implemented, it does something like this
> in linalg/opencl/kernels/vector.hpp:generate_avbv:
> source.append(device_specific::generate::opencl_sources(database::get<T>(database::axpy),
> scheduler::preset::axpy(&x, &y, &alpha, reciprocal_a, flip_a, &z, &beta,
> reciprocal_b, flip_b)));
> And in linalg/opencl/vector_operations.hpp:avbv
> device_specific::enqueue(database::get<T>(database::axpy),
> kernels::vector<T>::program_name, scheduler::preset::axpy(&x, &y,
> &alpha, reciprocal_a, flip_a, &z, &beta, reciprocal_b, flip_b));


> The whole problem could be sorted out by adding a bool parameter to
> opencl_sources() and enqueue() that tells the generator to ignore whether
> x, y and z are the same objects or point to the same memory handle in the
> same statement. Initially, this feature was added to identify handles when
> there are multiple statements, in order to load z only once in things like:
> s = reduce<max>(z) - reduce<min>(z)
> {x = y + z, y = x + z}
> x = y + z + z
> It turns out that in the last case the interest is limited, but it's
> just a special case!

Hmm, actually I find it better to provide the boolean flag to a 
generator object rather than passing it each time when I want to 
enqueue. The reason is that the generator knows which kernels have been 
compiled earlier, whereas that knowledge isn't usually available in the 
routines taking care of the enqueue().
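
Roughly what I have in mind, as a sketch only: the class kernel_generator and its members are hypothetical and the 'compilation' is reduced to remembering a signature string. The flag is set once on the object, and the object itself tracks what has been compiled already.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical generator object; not existing ViennaCL API.
class kernel_generator {
public:
  explicit kernel_generator(bool ignore_aliasing)
      : ignore_aliasing_(ignore_aliasing) {}

  // operands: memory handles of x, y, z in a statement like x = a*y + b*z.
  // Returns true if a new kernel had to be 'compiled' for this statement.
  bool require(const std::vector<const void*>& operands) {
    assert(!operands.empty());
    return compiled_.insert(signature(operands)).second;
  }

  std::size_t compiled_count() const { return compiled_.size(); }

private:
  // Builds a key from operand positions. Unless aliasing is ignored, an
  // operand that shares its handle with an earlier one reuses that index.
  std::string signature(const std::vector<const void*>& operands) const {
    std::string key;
    for (std::size_t i = 0; i < operands.size(); ++i) {
      std::size_t id = i;
      if (!ignore_aliasing_) {
        for (std::size_t j = 0; j < i; ++j) {
          if (operands[j] == operands[i]) { id = j; break; }
        }
      }
      key += std::to_string(id) + ",";
    }
    return key;
  }

  bool ignore_aliasing_;
  std::set<std::string> compiled_;
};
```

With ignore_aliasing set, x = a*y + b*z and x = a*y + b*x map to the same kernel and only one compilation is triggered; without it, the two statements yield two kernels. Ignoring aliasing is of course only valid where the resulting kernels are equivalent anyway, as is the case for avbv.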

Best regards,

ViennaCL-devel mailing list
