On 09.01.19 10:45, Kyrill Tkachov wrote:
> There's plenty of work being done on auto-vectorisation in GCC.
> Auto-vectorisation is a performance optimisation and as such is not really
> a user-visible feature that absolutely requires user documentation.
Since I'm trying to deliberately exploit it, a more user-visible guise
would help ;)
>> - repeated use of vectorizable functions
>>
>>     for ( int i = 0 ; i < vsz ; i++ )
>>       A [ i ] = sqrt ( B [ i ] ) ;
>>
>> Here, replacing the repeated call of sqrt with the vectorized equivalent
>> gives a dramatic speedup (ca. 4X).
The above is a typical example. To make it concrete, here is a complete source file, 'vec_sqrt.cc':
#include <cmath>

extern float data [ 32768 ] ;

extern void vf1()
{
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::sqrt ( data [ i ] ) ;
}
The trip count is large and the loop body is trivial, so it's an ideal candidate for autovectorization. When I compile this source, using

g++ -O3 -mavx2 -S -o sqrt.s vec_sqrt.cc

the inner loop translates to:
.L2:
vmovss (%rbx), %xmm0
vucomiss %xmm0, %xmm2
vsqrtss %xmm0, %xmm1, %xmm1
jbe .L3
vmovss %xmm2, 12(%rsp)
addq $4, %rbx
vmovss %xmm1, 8(%rsp)
call sqrtf@PLT
vmovss 8(%rsp), %xmm1
vmovss %xmm1, -4(%rbx)
cmpq %rbp, %rbx
vmovss 12(%rsp), %xmm2
jne .L2
AFAICT this is not vectorized; it processes only a single float at a time. In vectorized code, I'd expect the vsqrtps mnemonic to show up.
> I believe GCC will do some of that already given a high-enough
> optimisation level and floating-point constraints.
> Do you have examples where it doesn't? Testcases with self-contained
> source code and compiler flags would be useful to analyse.
So, see above. With -Ofast the output is similar; just the inner loop is unrolled. But maybe I'm missing something? Any hints for additional flags?
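
Thinking about it some more (these are untested guesses on my part): the vucomiss / jbe / call sqrtf sequence above looks like a fallback path that calls the library sqrtf only for negative inputs, presumably so that errno gets set, and a call in the loop body would block vectorization. If that is right, then

g++ -O3 -mavx2 -fno-math-errno -S -o sqrt.s vec_sqrt.cc

might already be enough to get vsqrtps. And

g++ -O3 -mavx2 -fopt-info-vec-missed -S -o sqrt.s vec_sqrt.cc

should at least make the vectorizer report why it bailed out.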
>> If the compiler were to provide the autovectorization facilities, and if
>> the patterns it recognizes were well-documented, users could rely on
>> certain code patterns being recognized and autovectorized - sort of a
>> contract between the user and the compiler. With a well-chosen spectrum
>> of patterns, this would make it unnecessary to rely on explicit
>> vectorization in many cases. My hope is that such an interface would
>> help vectorization to become more frequently used - as I understand the
>> status quo, this is still a niche topic, even though many processors
>> provide suitable hardware nowadays.
> I wouldn't say it's a niche topic :)
> From my monitoring of GCC development over the last few years, there have
> been lots of improvements in auto-vectorisation in compilers (at least in
> GCC).
Okay, I'll take your word for it.
> The thing is, auto-vectorisation is not always profitable for performance.
> Sometimes the runtime loop iteration count is so low that setting up the
> vectorised loop (alignment checks, loads/permutes) is slower than just
> doing the scalar form, especially since SIMD performance varies from CPU
> to CPU.
> So we would want the compiler to have the freedom to make its own
> judgement on when to auto-vectorise rather than enforce a "contract". If
> the user really only wants vector code, they should use one of the
> explicit programming paradigms.
I know that these issues are important. I am using Vc for explicit vectorization, so I can easily produce vector code for common targets, and I can compare the performance. I have tried the example given above on my AVX2 machine, linking with a main program which calls 'vf1' 32768 times, to get one gigaroot (giggle). The vectorized version takes about half a second; the unvectorized one takes about three. With functions like sqrt, the trigonometric functions, exp and pow, vectorization is very profitable. Some further details:
Here's the main program 'memaxs.cc':
float data [ 32768 ] ;

extern void vf1() ;

int main ( int argc , char * argv[] )
{
  for ( int k = 0 ; k < 32768 ; k++ )
  {
    vf1() ;
  }
}
And the compiler call to get a binary:
g++ -O3 -mavx2 -o memaxs sqrt.s memaxs.cc
Here's the performance:
$ time ./memaxs
real 0m3,205s
user 0m3,200s
sys 0m0,004s
This variant of vec_sqrt.cc uses Vc ('vc_vec_sqrt.cc'):
#include <Vc/Vc>

extern float data [ 32768 ] ;

extern void vf1()
{
  for ( int k = 0 ; k < 32768 ; k += 8 )
  {
    Vc::float_v fv ( data + k ) ;
    fv = sqrt ( fv ) ;
    fv.store ( data + k ) ;
  }
}
Translated to assembler (via g++ -O3 -mavx2 -S -o sqrt_vc.s vc_vec_sqrt.cc), I get this inner loop:
.L2:
vmovups (%rax), %xmm0
addq $32, %rax
vinsertf128 $0x1, -16(%rax), %ymm0, %ymm0
vsqrtps %ymm0, %ymm0
vmovups %xmm0, -32(%rax)
vextractf128 $0x1, %ymm0, -16(%rax)
cmpq %rax, %rdx
jne .L2
vzeroupper
ret
.cfi_endproc
Note how the data are read 32 bytes at a time and processed with vsqrtps.
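
For comparison, here is what I believe an equivalent inner loop looks like when written directly with AVX intrinsics - my own untested sketch, with 'vf2' just an illustrative name:

#include <immintrin.h>

extern float data [ 32768 ] ;

extern void vf2()
{
  for ( int k = 0 ; k < 32768 ; k += 8 )
  {
    // unaligned 32-byte load of eight floats
    __m256 v = _mm256_loadu_ps ( data + k ) ;
    // a single vsqrtps handles all eight lanes
    v = _mm256_sqrt_ps ( v ) ;
    // unaligned 32-byte store back to the array
    _mm256_storeu_ps ( data + k , v ) ;
  }
}

This is essentially what I would hope the autovectorizer could produce from the plain loop in vec_sqrt.cc.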
Creating the binary from the Vc variant and executing it:
$ g++ -O3 -mavx2 -o memaxs sqrt_vc.s memaxs.cc -lVc
$ time ./memaxs
real 0m0,548s
user 0m0,544s
sys 0m0,004s
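
To put numbers on it: 32768 calls times 32768 elements is 2^30, about 1.07 billion square roots. That works out to roughly 0.33 billion roots per second for the scalar build (1.07e9 / 3.2s) versus roughly 1.96 billion for the Vc build (1.07e9 / 0.548s), a speedup of about 5.8X.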
So, I think this performance difference is a large enough gain to justify considering my vectorization-of-math-functions proposal. When it comes to gather/scatter with arbitrary indexes (see the sketch below), I suppose that's less profitable and probably harder to scan for.
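
To make the gather case concrete, the pattern I have in mind is something like this sketch, where 'A', 'B', 'indexes' and 'vf3' are made-up names and indexes holds an arbitrary permutation, so the loads are non-contiguous:

#include <cmath>

extern float A [ 32768 ] , B [ 32768 ] ;
extern int indexes [ 32768 ] ;  // arbitrary permutation of 0 ... 32767

extern void vf3()
{
  for ( int i = 0 ; i < 32768 ; i++ )
    A [ i ] = std::sqrt ( B [ indexes [ i ] ] ) ;
}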
Kay