Hi,

> we noticed for one of our OpenCL kernels that pocl is over 4 times
> slower than the Intel OpenCL runtime on a Xeon W processor.

1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime is likely fully using. LLVM 4 has absolutely horrible AVX-512 support, LLVM 5 is better but still has bugs, and you'll want LLVM 6 for AVX-512 to work (at least I know they fixed the few AVX-512 bugs I found; I no longer have a machine to test it on).

2) It could be the autovectorizer, or it could be something else. Are your machines NUMA? If so, you'll likely see very bad performance, as pocl currently has no NUMA tuning. I've also occasionally seen pocl unroll too much and overflow the L1 caches (you could try experimenting with various local work-group sizes passed to clEnqueueNDRangeKernel). Unfortunately this part of pocl has received little attention lately...

Cheers,
-- mb
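
P.S. For the local work-group size experiment, here is a minimal sketch of how one might generate the sizes to try. The helper name candidate_local_sizes is made up for illustration and is not part of pocl or OpenCL. It assumes a 1-D NDRange and only proposes powers of two that divide the global size evenly, since OpenCL 1.x rejects a local size that does not divide the global size.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a pocl or OpenCL API).
 * Fills `out` with candidate local work-group sizes for a 1-D NDRange:
 * every power of two, up to min(max_local, global), that divides `global`.
 * Returns the number of candidates written (never more than `cap`).
 */
size_t candidate_local_sizes(size_t global, size_t max_local,
                             size_t *out, size_t cap)
{
    size_t n = 0;
    size_t limit = (max_local < global) ? max_local : global;

    if (limit == 0)
        return 0;

    for (size_t l = 1; n < cap; l *= 2) {
        if (global % l == 0)
            out[n++] = l;
        if (l > limit / 2)   /* doubling again would exceed the limit */
            break;
    }
    return n;
}

/*
 * Intended use (sketch only; queue, kernel and global are assumed to
 * exist already in the host program):
 *
 *   for each candidate `local`:
 *       clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                              &global, &local, 0, NULL, &event);
 *       wait for `event` and record the elapsed time
 */
```

A sensible value for max_local would be whatever the device reports for CL_DEVICE_MAX_WORK_GROUP_SIZE (or CL_KERNEL_WORK_GROUP_SIZE for the specific kernel). Timing each candidate should show whether the smaller sizes avoid the L1 overflow.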
