Hi Timo,
Too bad I personally cannot spend more time on this right now due to
urgent deadlines, but here are some quick insights:
I added a ticket so we remember to check why you didn't get vectorizer
remarks, which can be really useful: https://github.com/pocl/pocl/issues/613
Do you pass FP relaxation flags to clBuildProgram? Strict FP
reordering rules sometimes prevent vectorization.
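Something along these lines (just a host-side sketch; the program/device
handles, the helper name and the error handling are my assumptions, not
your code):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Sketch only: build with relaxed-FP options.  "program" and "device"
 * come from whatever OpenCL setup you already have. */
static void build_with_relaxed_fp(cl_program program, cl_device_id device)
{
    /* -cl-fast-relaxed-math permits FP reassociation, which often lets
     * the loop vectorizer kick in; -DnumQuadPoints=3 matches the macro
     * mentioned later in this thread. */
    const char *opts =
        "-cl-fast-relaxed-math -cl-mad-enable -DnumQuadPoints=3";
    if (clBuildProgram(program, 1, &device, opts, NULL, NULL) != CL_SUCCESS) {
        size_t len = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &len);
        char *log = (char *)malloc(len);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              len, log, NULL);
        fprintf(stderr, "build log:\n%s\n", log);
        free(log);
    }
}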
If you aim for horizontal (work-group) vectorization of your kernel
loops, the debug output below can indeed indicate a reason.
I haven't followed the progress of outer loop vectorization in upstream
LLVM, but the way pocl currently tries to enforce it is by forcing the
parallel work-item (WI) loop inside your kernel loops. It does that by
adding an implicit barrier inside your loop, which produces that effect.
It cannot do this if it cannot prove it is legal to do so (all WIs
have to go through all kernel loop iterations). In your case the
analysis failed to prove that, so it might be worthwhile to track down
why. I think upstream LLVM also has a divergence analysis that could
now be adopted in pocl.
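To illustrate the legality condition with a toy kernel (not your code):
the trick is only legal when the kernel-loop trip count is the same for
every WI, e.g.

// Toy kernel with a uniform trip count: every WI executes exactly n
// iterations, so placing a barrier inside the loop -- and thus running
// the WI loop inside it -- is legal.
__kernel void uniform_loop(__global const float *in,
                           __global float *out,
                           int n)              /* same n for every WI */
{
    size_t gid = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += in[(size_t)i * get_global_size(0) + gid];
        /* pocl's implicit loop barrier would conceptually sit here,
         * which lets the WI loop be interchanged into this loop and
         * vectorized horizontally across work-items. */
    }
    out[gid] = acc;
}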
VariableUniformityAnalysis.cc is the one that analyses whether a
variable is "uniform" (known to always contain the same value for
all WIs) or not. There are also debug outputs that can be enabled to
figure out why your loop iteration variables were not detected as such.
The early exit might cause difficulties for various analyses:
if (myTrialIndex - trialOffset >= nTrial) return;
In fact, it could cause all sorts of trouble for static fine-grained
parallelization, as it can mean WI divergence at the end of the grid
(even if it really doesn't, the kernel compiler cannot prove that,
because nTrial is a kernel argument).
So, if you can avoid this by specializing your kernel into an edge
kernel and one that is known never to go out of bounds, it might help
the pocl kernel compiler to cope with this case.
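Roughly like this (hypothetical names and a placeholder body, not your
actual kernel):

// Bulk kernel: launched only over a range the host knows is in bounds,
// so the guard (and the divergence it implies) disappears.
__kernel void evaluate_bulk(__global const float *in, __global float *out)
{
    size_t myTrialIndex = get_global_id(0);
    /* ... original kernel body, without the early-exit guard ... */
    out[myTrialIndex] = in[myTrialIndex];   /* placeholder body */
}

// Edge kernel: handles the leftover work-items and keeps the guard.
__kernel void evaluate_edge(__global const float *in, __global float *out,
                            int trialOffset, int nTrial)
{
    size_t myTrialIndex = get_global_id(0) + trialOffset;
    if (myTrialIndex - trialOffset >= nTrial) return;
    out[myTrialIndex] = in[myTrialIndex];   /* placeholder body */
}

The host would then enqueue evaluate_bulk over, say, the largest
multiple of the work-group size that fits, and evaluate_edge over the
few remaining items.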
All of this could be done by the kernel compiler, but currently it
isn't. If someone would like to add handling for it, that would be
really useful, as this is quite a common pattern in OpenCL C kernels.
I hope these insights help,
Pekka
On 02/08/2018 02:24 AM, Timo Betcke wrote:
Hi,
one more hint: I followed Pekka's suggestion to enable debug output in
ImplicitLoopBarriers.cc and ImplicitConditionalBarriers.cc. Some
interesting output is generated. It states:
### ILB: The kernel has no barriers, let's not add implicit ones either to
avoid WI context switch overheads
### ILB: The kernel has no barriers, let's not add implicit ones either to
avoid WI context switch overheads
### trying to add a loop barrier to force horizontal parallelization
### the loop is not uniform because loop entry '' is not uniform
### trying to add a loop barrier to force horizontal parallelization
### the loop is not uniform because loop entry '' is not uniform
What does this mean, and does it prevent work-group level parallelization?
Best wishes
Timo
On 7 February 2018 at 23:41, Timo Betcke <[email protected]> wrote:
Hi,
I have tried to dive a bit more into the code now, using Pekka's and
Jeff's hints. Analyzing with VTune showed that no AVX2 code is
generated by pocl, which I already suspected. I tried
POCL_VECTORIZER_REMARKS=1 to activate vectorizer remarks, but it does
not produce any output. However, I could obtain the LLVM-generated
code using POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES=1. I am not
experienced with LLVM IR, but it seems that no vectorized code is
created. I have uploaded a gist with the disassembled output here:
https://gist.github.com/tbetcke/c5f71dca27cc20c611c35b67f5faa36b
The question is what prevents the auto-vectorizer from working at all.
The code seems quite straightforward, with very simple for-loops with
hard-coded bounds (numQuadPoints is a compiler macro, set to 3 in the
experiments). I would be grateful for any pointers on how to proceed
to figure out what is going on with the vectorizer.
By the way, I have recompiled pocl with LLVM 6. There was no change in
behavior compared to versions 4 and 5.
Best wishes
Timo
On 7 February 2018 at 16:37, Timo Betcke <[email protected]> wrote:
Dear Jeff,
thanks for the explanations. I have now installed pocl on my Xeon W
workstation, and the benchmarks are as follows (pure kernel runtime via
event timers this time, to exclude Python overhead; a rough timing
sketch follows the numbers):
1.) Intel OpenCL Driver: 0.0965s
2.) POCL: 0.937s
3.) AMD CPU OpenCL Driver: 0.64s
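The per-kernel times above were taken roughly like this (a sketch; the
queue/kernel handles and the work sizes are placeholders, and the queue
must be created with CL_QUEUE_PROFILING_ENABLE):

cl_event ev;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &lws, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong t_start = 0, t_end = 0;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);
printf("kernel time: %.6f s\n", (t_end - t_start) * 1e-9); /* values in ns */
clReleaseEvent(ev);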
The CPU is a Xeon W-2155 with 3.3 GHz and 10 cores. I have not had
time to investigate the LLVM IR code as suggested, but will do so as
soon as possible. AMD is included because I have a Radeon Pro card,
which automatically also installed OpenCL CPU drivers.
Best wishes
Timo
On 7 February 2018 at 16:03, Jeff Hammond <[email protected]> wrote:
On Wed, Feb 7, 2018 at 2:41 AM, Michal Babej <[email protected]> wrote:
Hi,
> we noticed for one of our OpenCL kernels that pocl is over 4 times
> slower than the Intel OpenCL runtime on a Xeon W processor.
1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
is likely fully using. LLVM 4 has absolutely horrible AVX-512 support,
LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
AVX-512 to work (at least I know they fixed the few AVX-512 bugs I
found; I don't have a machine anymore to test it).
Indeed, Xeon W [1] is a sibling of Xeon Scalable and Core
X-series of the Skylake generation, which I'll refer to as SKX
since they are microarchitecturally the same. All of these
support AVX-512, which I'm going to refer to as AVX3 in the
following, for reasons that will become clear.
An important detail when evaluating vectorization on these
processors is that the frequency drops when transitioning from
scalar/SSE2 code to AVX2 code to AVX3 (i.e. AVX-512) code [2],
which corresponds to the use of xmm (128b), ymm (256b), and zmm
(512b) registers respectively. AVX3 instructions with ymm
registers should run at AVX2 frequency.
While most (but not all - see [3]) parts have 2 VPUs, the first
of these is implemented via port fusion [4]. What this means is
that the core can dispatch 2 512b AVX3 instructions on ports 0+1
and 5, or it can dispatch 3 256b instructions (AVX2 or AVX3) on
ports 0, 1 and 5. Thus, one can get 1024b throughput at one
frequency or 768b throughput at a slightly higher frequency.
What this means is that 512b vectorization pays off for code
that is thoroughly compute-bound and heavily vectorized (e.g.
dense linear algebra and molecular dynamics) but that 256b
vectorization is likely better for code that is more
memory-bound or doesn't vectorize as well.
The Intel C/C++ compiler has a flag -qopt-zmm-usage={low,high}
to address this, where "-xCORE-AVX512 -qopt-zmm-usage=low" is
going to take advantage of all the AVX3 instructions but favor
256b ymm registers, which will behave exactly like AVX2 in some
cases (i.e. ones where the AVX3 instruction features aren't used).
Anyways, the short version of this story is that you should not
assume 512b SIMD code generation is the reason for a performance
benefit from the Intel OpenCL compiler, since it may in fact not
generate those instructions if it thinks that 256b is better.
It would be useful to force both POCL and Intel OpenCL to use
SSE2 and AVX2, respectively, in experiments, to see how they
compare when targeting the same vector ISA. This sort of
comparison would also be helpful to resolve an older bug report
of a similar nature [5].
What I wrote here is one engineer's attempt to summarize a large
amount of information in a user-friendly format. I apologize
for any errors - they are certainly not intentional.
[1] https://ark.intel.com/products/series/125035/Intel-Xeon-Processor-W-Family
[2] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
[3] https://github.com/jeffhammond/vpu-count
[4] https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Scheduler_.26_512-SIMD_addition
[5] https://github.com/pocl/pocl/issues/292
2) It could be the autovectorizer, or it could be something else. Are
your machines NUMA? If so, you'll likely see very bad performance, as
pocl currently has no NUMA tuning. Also, I've occasionally seen pocl
unroll too much and overflow the L1 cache (you could try experimenting
with various local WG sizes passed to clEnqueueNDRangeKernel).
Unfortunately this part of pocl has received little attention lately...
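Passing an explicit local work-group size means something like this (a
sketch; the queue/kernel handles and the sizes are placeholders, and
gws must be a multiple of lws):

size_t gws = 4096;   /* global work size */
size_t lws = 64;     /* candidate local WG size: try 8, 16, 32, 64, ... */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &gws, &lws, 0, NULL, NULL);
/* or pass NULL instead of &lws to let the runtime pick the local size */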
I don't know what POCL uses for threading, but Intel OpenCL uses
the TBB runtime [6]. The TBB runtime has some very smart
features for load-balancing and automatic cache blocking that
are not implemented in OpenMP and are hard to implement by hand
in Pthreads.
[6] https://software.intel.com/en-us/articles/whats-new-opencl-runtime-1611
Jeff
--
Jeff Hammond
[email protected]
http://jeffhammond.github.io/
--
Dr. Timo Betcke
Reader in Mathematics
University College London
Department of Mathematics
E-Mail: [email protected]
Tel.: +44 (0) 20-3108-4068
Fax.: +44 (0) 20-7383-5519
--
Pekka
_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel