Hello,

Timo Betcke <[email protected]> wrote:
> I have on my Kaby Lake laptop hand-vectorized the routine by
> processing items in multiple of 8.
> The performance of POCL is now within a factor of two of the Intel
> OpenCL runtime with auto-vectorization.

I've tried changing your code in a few ways to get its loops vectorized,
but I came to the conclusion that LLVM's loop vectorizer is extremely picky
and will fail to vectorize for many reasons. A few that I found:

1) It cannot prove that a vector instruction would handle all corner cases
   exactly the same way as the scalar instruction (the unsafe/fast-math
   options seem to alleviate this problem somewhat).

2) In some cases, branches and switches. Pocl may contain these in
   surprising places (e.g. distance(x, y), which is the same as
   length(x - y); length contains branches to handle denorms / infinity).

3) Certain memory access patterns and the extractelement instruction
   (extracting elements of a vector).

So the bad news is that LLVM will need a lot of improvement to vectorize
this code automatically. The tiny good news is that >99% of the kernel
library functions listed here

https://github.com/pocl/pocl/tree/master/lib/kernel/libclc-pocl

and here

https://github.com/pocl/pocl/tree/master/lib/kernel/sleef-pocl

have been hand-vectorized, and LLVM does indeed generate AVX1/2/512 code
for them. So hand-vectorized code should be able to achieve speeds
reasonably close to Intel's runtime.

Regards,
-- mb

_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel
