Hello,

Timo Betcke <[email protected]> wrote:

> I have on my Kaby Lake laptop hand-vectorized the routine by
> processing items in multiple of 8.
> The performance of POCL is now within a factor of two of the Intel
> OpenCL runtime with auto-vectorization.

I've tried changing your code in a few ways to get its loops
vectorized, but I came to the conclusion that LLVM's loop vectorizer is
extremely picky and will fail to vectorize for many reasons. A few
that I found:

1) If it cannot prove that a vector instruction would handle all corner
cases exactly the same way as the scalar instruction (the unsafe/fast-math
options seem to alleviate this problem somewhat).

2) In some cases, branches and switches. Pocl may contain these in
surprising places (e.g. distance(x, y) is the same as length(x - y),
and length contains branches to handle denorms / infinity).

3) Certain memory access patterns and the extractelement instruction
(extracting elements of a vector).

So the bad news is that LLVM will need a lot of improvement to vectorize
this code automagically. The tiny bit of good news is that >99% of the
kernel library functions listed here

https://github.com/pocl/pocl/tree/master/lib/kernel/libclc-pocl

and here

https://github.com/pocl/pocl/tree/master/lib/kernel/sleef-pocl

have been hand-vectorized, and LLVM does indeed generate AVX/AVX2/AVX-512
code for them. So hand-vectorized code should be able to achieve speeds
reasonably close to Intel's runtime.

Regards,
  -- mb

_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel