On 09.01.19 10:45, Kyrill Tkachov wrote:

There's plenty of work being done on auto-vectorisation in GCC.
Auto-vectorisation is a performance optimisation and as such is not really
a user-visible feature that absolutely requires user documentation.

Since I'm trying to deliberately exploit it, a more user-visible guise would help ;)

- repeated use of vectorizable functions

for ( int i = 0 ; i < vsz ; i++ )
   A [ i ] = sqrt ( B [ i ] ) ;

Here, replacing the repeated calls to sqrt with the vectorized equivalent
gives a dramatic speedup (ca. 4X).

The above is a typical example. To make it concrete, here is a complete source file, 'vec_sqrt.cc':

#include <cmath>

extern float data [ 32768 ] ;

extern void vf1()
{
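  // NB: g++ does not recognise this pragma and will ignore it (see -Wunknown-pragmas)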
  #pragma vectorize enable
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::sqrt ( data [ i ] ) ;
}

This has a large trip count and a trivial loop body, so it's an ideal candidate for autovectorization. When I compile this source with

g++ -O3 -mavx2 -S -o sqrt.s vec_sqrt.cc

the inner loop translates to:

.L2:
        vmovss  (%rbx), %xmm0
        vucomiss        %xmm0, %xmm2
        vsqrtss %xmm0, %xmm1, %xmm1
        jbe     .L3
        vmovss  %xmm2, 12(%rsp)
        addq    $4, %rbx
        vmovss  %xmm1, 8(%rsp)
        call    sqrtf@PLT
        vmovss  8(%rsp), %xmm1
        vmovss  %xmm1, -4(%rbx)
        cmpq    %rbp, %rbx
        vmovss  12(%rsp), %xmm2
        jne     .L2

AFAICT this is not vectorized; it processes only a single float at a time. The call to sqrtf@PLT looks like the errno-setting fallback for negative inputs, so presumably the compiler is preserving scalar math semantics. In vector code, I'd expect the vsqrtps mnemonic to show up.

I believe GCC will do some of that already given a high-enough optimisation level
and floating-point constraints.
Do you have examples where it doesn't? Testcases with self-contained source code
and compiler flags would be useful to analyse.

So, see above. With -Ofast the output is similar; the inner loop is merely unrolled. But maybe I'm missing something? Any hints for additional flags?
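The closest knob I'm aware of is OpenMP's simd pragma, which g++ accepts with -fopenmp-simd; a sketch of what that would look like for my loop (I haven't verified that it changes the code generated here):

#include <cmath>

extern float data [ 32768 ] ;

void vf1_omp()
{
  // promises the compiler that the iterations are independent;
  // takes effect only with -fopenmp-simd (or -fopenmp)
  #pragma omp simd
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::sqrt ( data [ i ] ) ;
}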

If the compiler were to provide the autovectorization facilities, and if
the patterns it recognizes were well-documented, users could rely on
certain code patterns being recognized and autovectorized - sort of a
contract between the user and the compiler. With a well-chosen spectrum
of patterns, this would make it unnecessary to have to rely on explicit
vectorization in many cases. My hope is that such an interface would
help vectorization to become more frequently used - as I understand the
status quo, this is still a niche topic, even though many processors
provide suitable hardware nowadays.


I wouldn't say it's a niche topic :)
From my monitoring of GCC development over the last few years, there have been lots
of improvements in auto-vectorisation in compilers (at least in GCC).

Okay, I'll take your word for it.

The thing is, auto-vectorisation is not always profitable for performance.
Sometimes the runtime loop iteration count is so low that setting up the vectorised loop (alignment checks, loads/permutes) is slower than just doing the scalar form,
especially since SIMD performance varies from CPU to CPU.
So we would want the compiler to have the freedom to make its own judgement on when to auto-vectorise rather than enforce a "contract". If the user really only wants
vector code, they should use one of the explicit programming paradigms.

I know that these issues are important. I am using Vc for explicit vectorization, so I can easily write code that produces vector instructions for common targets, and I can compare the performance. I have tried the example given above on my AVX2 machine, linking with a main program which calls 'vf1' 32768 times, to get one gigaroot (giggle). The vectorized version takes about half a second, the unvectorized one about three. With functions like sqrt, the trigonometric functions, exp and pow, vectorization is very profitable. Some further details:

Here's the main program 'memaxs.cc':

float data [ 32768 ] ;
extern void vf1() ;

int main ( int argc , char * argv[] )
{
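  // 32768 calls x 32768 elements per call = 2^30, roughly 1.07e9 square roots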
  for ( int k = 0 ; k < 32768 ; k++ )
  {
    vf1() ;
  }
}

And the compiler call to get a binary:

g++ -O3 -mavx2 -o memaxs sqrt.s memaxs.cc

Here's the performance:

$ time ./memaxs

real    0m3,205s
user    0m3,200s
sys     0m0,004s

This variant of vec_sqrt.cc uses Vc ('vc_vec_sqrt.cc'):

#include <Vc/Vc>

extern float data [ 32768 ] ;

extern void vf1()
{
  for ( int k = 0 ; k < 32768 ; k += 8 )
  {
    Vc::float_v fv ( data + k ) ;
    fv = sqrt ( fv ) ;
    fv.store ( data + k ) ;
  }
}
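As an aside, the hardcoded step of 8 assumes an 8-lane float_v (AVX/AVX2). If I remember the Vc API correctly, Vc::float_v::size() gives the lane count, so a target-portable version would look roughly like this (untested sketch):

#include <cstddef>
#include <Vc/Vc>

extern float data [ 32768 ] ;

void vf1_portable()
{
  // float_v::size() is 8 with AVX/AVX2 and 4 with SSE;
  // 32768 divides evenly by any power-of-two lane count
  for ( std::size_t k = 0 ; k < 32768 ; k += Vc::float_v::size() )
  {
    Vc::float_v fv ( data + k ) ;
    fv = Vc::sqrt ( fv ) ;
    fv.store ( data + k ) ;
  }
}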

Translating vc_vec_sqrt.cc to assembler (as 'sqrt_vc.s', used in the link step below), I get this inner loop:

.L2:
        vmovups (%rax), %xmm0
        addq    $32, %rax
        vinsertf128     $0x1, -16(%rax), %ymm0, %ymm0
        vsqrtps %ymm0, %ymm0
        vmovups %xmm0, -32(%rax)
        vextractf128    $0x1, %ymm0, -16(%rax)
        cmpq    %rax, %rdx
        jne     .L2
        vzeroupper
        ret
        .cfi_endproc

Note how the data are read 32 bytes at a time (as two 16-byte halves; I believe g++'s generic tuning splits unaligned 256-bit accesses) and processed with vsqrtps.

Creating the corresponding binary and executing it:

$ g++ -O3 -mavx2 -o memaxs sqrt_vc.s memaxs.cc -lVc
$ time ./memaxs

real    0m0,548s
user    0m0,544s
sys     0m0,004s

So I think this performance difference (almost 6X here) makes a good enough case to consider my vectorization-of-math-functions proposal. When it comes to gather/scatter with arbitrary indices, I suppose that's less profitable and probably harder to scan for.
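To be concrete, by gather/scatter I mean indexed access patterns like these two fragments (A, B and the index array K are stand-ins, as in my first example):

for ( int i = 0 ; i < vsz ; i++ )
  A [ i ] = B [ K [ i ] ] ;   // gather: read through arbitrary indices

for ( int i = 0 ; i < vsz ; i++ )
  A [ K [ i ] ] = B [ i ] ;   // scatter: write through arbitrary indices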

Kay
