Re: [pocl-devel] Debugging auto vectorizer

Michal Babej Wed, 07 Feb 2018 02:42:08 -0800

Hi,

> we noticed for one of our OpenCL kernels that pocl is over 4 times
> slower than the Intel OpenCL runtime on a Xeon W processor.


1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
is likely using fully. LLVM 4 has absolutely horrible AVX-512 support,
LLVM 5 is better but still has bugs, and you'll want LLVM 6 for
AVX-512 to work (at least I know they fixed the few AVX-512 bugs I
found; I no longer have a machine to test it on).

2) It could be the autovectorizer, or it could be something else. Are
your machines NUMA? If so, you'll likely see very bad performance, as
pocl currently has no NUMA tuning. Also, I've occasionally seen pocl
unroll too much and overflow the L1 caches (you could try experimenting
with various local work-group sizes passed to clEnqueueNDRangeKernel).
Unfortunately, this part of pocl has received little attention lately...
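
A rough sketch of how the work-group size experiment could be set up,
assuming a 1D kernel and a runtime that requires the global size to be a
multiple of the local size. The helper name candidate_local_sizes is made
up for illustration; it is not part of pocl or the OpenCL API. It only
lists power-of-two local sizes worth trying; the timing itself is left
to the host code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of pocl or OpenCL): fills `out` with the
 * power-of-two local work-group sizes, from 1 up to `max_local`, that
 * evenly divide `global_size`. Returns the number of candidates written. */
size_t candidate_local_sizes(size_t global_size, size_t max_local,
                             size_t *out, size_t out_cap)
{
    assert(out != NULL);

    size_t n = 0;
    if (global_size == 0)
        return 0;

    for (size_t wg = 1; wg <= max_local && n < out_cap; wg *= 2) {
        /* Keep only sizes that divide the global size evenly. */
        if (global_size % wg == 0)
            out[n++] = wg;

        /* Stop before doubling would exceed max_local (also avoids overflow). */
        if (wg > max_local / 2)
            break;
    }
    return n;
}
```

Each candidate would then be passed as the local_work_size argument of
clEnqueueNDRangeKernel and the kernel timed, to see whether smaller
work-groups avoid the cache overflow described above.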

Cheers,
-- mb

