Hi Timo,
Too bad I personally cannot spend more time on this right now due to
urgent deadlines, but here are some quick insights:
I added a ticket so we remember to check why you didn't get vectorizer
remarks, which can be really useful: https://github.com/pocl/pocl/issues/613
Do you pass FP relaxation flags to clBuildProgram? Strict FP
reordering rules sometimes prevent vectorization.
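Something along these lines (just a host-side sketch; the program/device
handles, the helper name and the error handling are my assumptions, not
your code):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Sketch only: build with relaxed-FP options.  "program" and "device"
 * come from whatever OpenCL setup you already have. */
static void build_with_relaxed_fp(cl_program program, cl_device_id device)
{
    /* -cl-fast-relaxed-math permits FP reassociation, which often lets
     * the loop vectorizer kick in; -DnumQuadPoints=3 matches the macro
     * mentioned later in this thread. */
    const char *opts =
        "-cl-fast-relaxed-math -cl-mad-enable -DnumQuadPoints=3";
    if (clBuildProgram(program, 1, &device, opts, NULL, NULL) != CL_SUCCESS) {
        size_t len = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &len);
        char *log = (char *)malloc(len);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              len, log, NULL);
        fprintf(stderr, "build log:\n%s\n", log);
        free(log);
    }
}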
If you aim for horizontal (work-group) vectorization of your kernel
loops, the debug output below can indeed indicate a reason.
I haven't followed the progress of outer loop vectorization in upstream
LLVM, but the way pocl currently tries to enforce it is by forcing the
parallel work-item (WI) loop inside your kernel loops. It does that by
adding an implicit barrier inside your loop, which produces that effect.
It cannot do this if it cannot prove it is legal to do so (all WIs
have to go through all kernel loop iterations). In your case the
analysis failed to prove that, so it might be worthwhile to track down
why. I think upstream LLVM also has a divergence analysis that could
now be adopted in pocl.
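To illustrate the legality condition with a toy kernel (not your code):
the trick is only legal when the kernel-loop trip count is the same for
every WI, e.g.

// Toy kernel with a uniform trip count: every WI executes exactly n
// iterations, so placing a barrier inside the loop -- and thus running
// the WI loop inside it -- is legal.
__kernel void uniform_loop(__global const float *in,
                           __global float *out,
                           int n)              /* same n for every WI */
{
    size_t gid = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += in[(size_t)i * get_global_size(0) + gid];
        /* pocl's implicit loop barrier would conceptually sit here,
         * which lets the WI loop be interchanged into this loop and
         * vectorized horizontally across work-items. */
    }
    out[gid] = acc;
}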
VariableUniformityAnalysis.cc is the one that analyses whether a
variable is "uniform" (known to always contain the same value for
all WIs) or not. There are also debug outputs that can be enabled to
figure out why your loop iteration variables were not detected as such.
The early exit might cause difficulties for various analyses:
if (myTrialIndex - trialOffset >= nTrial) return;
In fact, it could cause all sorts of trouble for static fine-grained
parallelization, as it can mean WI divergence at the end of the grid
(even if it really doesn't, the kernel compiler cannot prove that,
because nTrial is a kernel argument).
So, if you can avoid this by specializing your kernel into an edge
kernel and one that is known never to go out of bounds, it might help
the pocl kernel compiler to cope with this case.
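Roughly like this (hypothetical names and a placeholder body, not your
actual kernel):

// Bulk kernel: launched only over a range the host knows is in bounds,
// so the guard (and the divergence it implies) disappears.
__kernel void evaluate_bulk(__global const float *in, __global float *out)
{
    size_t myTrialIndex = get_global_id(0);
    /* ... original kernel body, without the early-exit guard ... */
    out[myTrialIndex] = in[myTrialIndex];   /* placeholder body */
}

// Edge kernel: handles the leftover work-items and keeps the guard.
__kernel void evaluate_edge(__global const float *in, __global float *out,
                            int trialOffset, int nTrial)
{
    size_t myTrialIndex = get_global_id(0) + trialOffset;
    if (myTrialIndex - trialOffset >= nTrial) return;
    out[myTrialIndex] = in[myTrialIndex];   /* placeholder body */
}

The host would then enqueue evaluate_bulk over, say, the largest
multiple of the work-group size that fits, and evaluate_edge over the
few remaining items.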
All of this could be done by the kernel compiler, but currently it
isn't. If someone would like to add handling for it, that would be
really useful, as this is quite a common pattern in OpenCL C kernels.
I hope these insights help,
Pekka
On 02/08/2018 02:24 AM, Timo Betcke wrote:
Hi,
one more hint: I followed Pekka's suggestion to enable debug output in
ImplicitLoopBarriers.cc and ImplicitConditionalBarriers.cc. Some
interesting output is generated. It states:
### ILB: The kernel has no barriers, let's not add implicit ones either to
avoid WI context switch overheads
### ILB: The kernel has no barriers, let's not add implicit ones either to
avoid WI context switch overheads
### trying to add a loop barrier to force horizontal parallelization
### the loop is not uniform because loop entry '' is not uniform
### trying to add a loop barrier to force horizontal parallelization
### the loop is not uniform because loop entry '' is not uniform
What does this mean, and does it prevent work-group level parallelization?
Best wishes
Timo
On 7 February 2018 at 23:41, Timo Betcke <[email protected]> wrote:
Hi,
I have tried to dive a bit more into the code now, using Pekka's and
Jeff's hints. Analyzing with VTune showed that no AVX2 code is
generated by pocl, which I already suspected. I tried
POCL_VECTORIZER_REMARKS=1 to activate vectorizer remarks, but it does
not produce any output. However, I could obtain the LLVM-generated
code using POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES=1. I am not
experienced with LLVM IR, but it seems that no vectorized code is
created. I have uploaded a gist with the disassembled output here:
https://gist.github.com/tbetcke/c5f71dca27cc20c611c35b67f5faa36b
The question is what prevents the auto-vectorizer from working at all.
The code seems quite straightforward, with very simple for-loops with
hard-coded bounds (numQuadPoints is a compiler macro, set to 3 in the
experiments). I would be grateful for any pointers on how to proceed
to figure out what is going on with the vectorizer.
By the way, I have recompiled pocl with LLVM 6. There was no change in
behavior compared to versions 4 and 5.
Best wishes
Timo
On 7 February 2018 at 16:37, Timo Betcke <[email protected]> wrote:
Dear Jeff,
thanks for the explanations. I have now installed pocl on my Xeon W
workstation, and the benchmarks are as follows (pure kernel runtime via
event timers this time, to exclude Python overhead; a rough timing
sketch follows the numbers):
1.) Intel OpenCL Driver: 0.0965s
2.) POCL: 0.937s
3.) AMD CPU OpenCL Driver: 0.64s
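The per-kernel times above were taken roughly like this (a sketch; the
queue/kernel handles and the work sizes are placeholders, and the queue
must be created with CL_QUEUE_PROFILING_ENABLE):

cl_event ev;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &lws, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong t_start = 0, t_end = 0;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);
printf("kernel time: %.6f s\n", (t_end - t_start) * 1e-9); /* values in ns */
clReleaseEvent(ev);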
The CPU is a Xeon W-2155 with 3.3 GHz and 10 cores. I have not had
time to investigate the LLVM IR code as suggested, but will do so as
soon as possible. AMD is included because I have a Radeon Pro card,
which automatically also installed OpenCL CPU drivers.
Best wishes
Timo
On 7 February 2018 at 16:03, Jeff Hammond <[email protected]> wrote:
On Wed, Feb 7, 2018 at 2:41 AM, Michal Babej <[email protected]> wrote:
Hi,
> we noticed for one of our OpenCL kernels that pocl is over 4 times
> slower than the Intel OpenCL runtime on a Xeon W processor.
1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
is likely fully using. LLVM 4 has absolutely horrible AVX-512 support,
LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
AVX-512 to work (at least I know they fixed the few AVX-512 bugs I
found; I don't have a machine anymore to test it).
Indeed, Xeon W [1] is a sibling of Xeon Scalable and Core
X-series of the Skylake generation, which I'll refer to as SKX
since they are microarchitecturally the same. All of these
support AVX-512, which I'm going to refer to as AVX3 in the
following, for reasons that will become clear.
An important detail when evaluating vectorization on these
processors is that the frequency drops when transitioning from
scalar/SSE2 code to AVX2 code to AVX3 (i.e. AVX-512) code [2],
which corresponds to the use of xmm (128b), ymm (256b), and zmm
(512b) registers respectively. AVX3 instructions with ymm
registers should run at AVX2 frequency.
While most (but not all - see [3]) parts have 2 VPUs, the first
of these is implemented via port fusion [4]. What this means is
that the core can dispatch 2 512b AVX3 instructions on ports 0+1
and 5, or it can dispatch 3 256b instructions (AVX2 or AVX3) on
ports 0, 1 and 5. Thus, one can get 1024b throughput at one
frequency or 768b throughput at a slightly higher frequency.
What this means is that 512b vectorization pays off for code
that is thoroughly compute-bound and heavily vectorized (e.g.
dense linear algebra and molecular dynamics) but that 256b
vectorization is likely better for code that is more
memory-bound or doesn't vectorize as well.
The Intel C/C++ compiler has a flag -qopt-zmm-usage={low,high}
to address this, where "-xCORE-AVX512 -qopt-zmm-usage=low" is
going to take advantage of all the AVX3 instructions but favor
256b ymm registers, which will behave exactly like AVX2 in some
cases (i.e. ones where the AVX3 instruction features aren't used).
Anyways, the short version of this story is that you should not
assume 512b SIMD code generation is the reason for a performance
benefit from the Intel OpenCL compiler, since it may in fact not
generate those instructions if it thinks that 256b is better.
It would be useful to force both POCL and Intel OpenCL to use
SSE2 and AVX2, respectively, in experiments, to see how they
compare when targeting the same vector ISA. This sort of
comparison would also be helpful to resolve an older bug report
of a similar nature [5].
What I wrote here is one engineer's attempt to summarize a large
amount of information in a user-friendly format. I apologize
for any errors - they are certainly not intentional.
[1] https://ark.intel.com/products/series/125035/Intel-Xeon-Processor-W-Family
[2] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
[3] https://github.com/jeffhammond/vpu-count
[4] https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Scheduler_.26_512-SIMD_addition
[5] https://github.com/pocl/pocl/issues/292
2) It could be the autovectorizer, or it could be something else. Are
your machines NUMA? If so, you'll likely see very bad performance, as
pocl currently has no NUMA tuning. Also, I've occasionally seen pocl
unroll too much and overflow the L1 cache (you could try experimenting
with various local WG sizes passed to clEnqueueNDRangeKernel).
Unfortunately this part of pocl has received little attention lately...
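Passing an explicit local work-group size means something like this (a
sketch; the queue/kernel handles and the sizes are placeholders, and
gws must be a multiple of lws):

size_t gws = 4096;   /* global work size */
size_t lws = 64;     /* candidate local WG size: try 8, 16, 32, 64, ... */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &gws, &lws, 0, NULL, NULL);
/* or pass NULL instead of &lws to let the runtime pick the local size */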
I don't know what POCL uses for threading, but Intel OpenCL uses
the TBB runtime [6]. The TBB runtime has some very smart
features for load-balancing and automatic cache blocking that
are not implemented in OpenMP and are hard to implement by hand
in Pthreads.
[6] https://software.intel.com/en-us/articles/whats-new-opencl-runtime-1611
Jeff
--
Jeff Hammond
[email protected]
http://jeffhammond.github.io/
--
Dr. Timo Betcke
Reader in Mathematics
University College London
Department of Mathematics
E-Mail: [email protected]
Tel.: +44 (0) 20-3108-4068
Fax.: +44 (0) 20-7383-5519
--
Pekka
_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel