On 09.01.19 10:45, Kyrill Tkachov wrote:
> There's plenty of work being done on auto-vectorisation in GCC.
> Auto-vectorisation is a performance optimisation and as such is not really
> a user-visible feature that absolutely requires user documentation.
Since I'm trying to deliberately exploit it, a more user-visible guise
would help ;)
>> - repeated use of vectorizable functions
>>
>>     for ( int i = 0 ; i < vsz ; i++ )
>>       A [ i ] = sqrt ( B [ i ] ) ;
>>
>> Here, replacing the repeated call of sqrt with the vectorized equivalent
>> gives a dramatic speedup (ca. 4X).
The above is a typical example. To make it concrete, here is a complete source file, 'vec_sqrt.cc':
#include <cmath>

extern float data [ 32768 ] ;

extern void vf1()
{
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::sqrt ( data [ i ] ) ;
}
The trip count is large and the loop body is trivial, so it's an ideal candidate for autovectorization. When I compile this source, using

g++ -O3 -mavx2 -S -o sqrt.s vec_sqrt.cc

the inner loop translates to:
.L2:
vmovss (%rbx), %xmm0
vucomiss %xmm0, %xmm2
vsqrtss %xmm0, %xmm1, %xmm1
jbe .L3
vmovss %xmm2, 12(%rsp)
addq $4, %rbx
vmovss %xmm1, 8(%rsp)
call sqrtf@PLT
vmovss 8(%rsp), %xmm1
vmovss %xmm1, -4(%rbx)
cmpq %rbp, %rbx
vmovss 12(%rsp), %xmm2
jne .L2
AFAICT this is not vectorized; it processes only a single float at a time. In vectorized code, I'd expect the vsqrtps mnemonic to show up.
> I believe GCC will do some of that already given a high-enough
> optimisation level and floating-point constraints.
> Do you have examples where it doesn't? Testcases with self-contained
> source code and compiler flags would be useful to analyse.
So, see above. With -Ofast the output is similar; just the inner loop is unrolled. But maybe I'm missing something? Any hints for additional flags?
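
Thinking about it some more (these are untested guesses on my part): the vucomiss / jbe / call sqrtf sequence above looks like a fallback path that calls the library sqrtf only for negative inputs, presumably so that errno gets set, and a call in the loop body would block vectorization. If that is right, then

g++ -O3 -mavx2 -fno-math-errno -S -o sqrt.s vec_sqrt.cc

might already be enough to get vsqrtps. And

g++ -O3 -mavx2 -fopt-info-vec-missed -S -o sqrt.s vec_sqrt.cc

should at least make the vectorizer report why it bailed out.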
>> If the compiler were to provide the autovectorization facilities, and if
>> the patterns it recognizes were well-documented, users could rely on
>> certain code patterns being recognized and autovectorized - sort of a
>> contract between the user and the compiler. With a well-chosen spectrum
>> of patterns, this would make it unnecessary to rely on explicit
>> vectorization in many cases. My hope is that such an interface would
>> help vectorization to become more frequently used - as I understand the
>> status quo, this is still a niche topic, even though many processors
>> provide suitable hardware nowadays.
> I wouldn't say it's a niche topic :)
> From my monitoring of GCC development over the last few years, there have
> been lots of improvements in auto-vectorisation in compilers (at least in
> GCC).
Okay, I'll take your word for it.
> The thing is, auto-vectorisation is not always profitable for performance.
> Sometimes the runtime loop iteration count is so low that setting up the
> vectorised loop (alignment checks, loads/permutes) is slower than just
> doing the scalar form, especially since SIMD performance varies from CPU
> to CPU.
> So we would want the compiler to have the freedom to make its own
> judgement on when to auto-vectorise rather than enforce a "contract". If
> the user really only wants vector code, they should use one of the
> explicit programming paradigms.
I know that these issues are important. I am using Vc for explicit vectorization, so I can easily produce vector code for common targets, and I can compare the performance. I have tried the example given above on my AVX2 machine, linking with a main program which calls 'vf1' 32768 times, to get one gigaroot (giggle). The vectorized version takes about half a second; the unvectorized one takes about three. With functions like sqrt, the trigonometric functions, exp and pow, vectorization is very profitable. Some further details:
Here's the main program 'memaxs.cc':
float data [ 32768 ] ;

extern void vf1() ;

int main ( int argc , char * argv[] )
{
  for ( int k = 0 ; k < 32768 ; k++ )
  {
    vf1() ;
  }
}
And the compiler call to get a binary:
g++ -O3 -mavx2 -o memaxs sqrt.s memaxs.cc
Here's the performance:
$ time ./memaxs
real 0m3,205s
user 0m3,200s
sys 0m0,004s
This variant of vec_sqrt.cc uses Vc ('vc_vec_sqrt.cc'):
#include <Vc/Vc>

extern float data [ 32768 ] ;

extern void vf1()
{
  for ( int k = 0 ; k < 32768 ; k += 8 )
  {
    Vc::float_v fv ( data + k ) ;
    fv = sqrt ( fv ) ;
    fv.store ( data + k ) ;
  }
}
Translated to assembler (via g++ -O3 -mavx2 -S -o sqrt_vc.s vc_vec_sqrt.cc), I get this inner loop:
.L2:
vmovups (%rax), %xmm0
addq $32, %rax
vinsertf128 $0x1, -16(%rax), %ymm0, %ymm0
vsqrtps %ymm0, %ymm0
vmovups %xmm0, -32(%rax)
vextractf128 $0x1, %ymm0, -16(%rax)
cmpq %rax, %rdx
jne .L2
vzeroupper
ret
.cfi_endproc
Note how the data are read 32 bytes at a time and processed with vsqrtps.
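
For comparison, here is what I believe an equivalent inner loop looks like when written directly with AVX intrinsics - my own untested sketch, with 'vf2' just an illustrative name:

#include <immintrin.h>

extern float data [ 32768 ] ;

extern void vf2()
{
  for ( int k = 0 ; k < 32768 ; k += 8 )
  {
    // unaligned 32-byte load of eight floats
    __m256 v = _mm256_loadu_ps ( data + k ) ;
    // a single vsqrtps handles all eight lanes
    v = _mm256_sqrt_ps ( v ) ;
    // unaligned 32-byte store back to the array
    _mm256_storeu_ps ( data + k , v ) ;
  }
}

This is essentially what I would hope the autovectorizer could produce from the plain loop in vec_sqrt.cc.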
Creating the binary from the Vc variant and executing it:
$ g++ -O3 -mavx2 -o memaxs sqrt_vc.s memaxs.cc -lVc
$ time ./memaxs
real 0m0,548s
user 0m0,544s
sys 0m0,004s
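
To put numbers on it: 32768 calls times 32768 elements is 2^30, about 1.07 billion square roots. That works out to roughly 0.33 billion roots per second for the scalar build (1.07e9 / 3.2s) versus roughly 1.96 billion for the Vc build (1.07e9 / 0.548s), a speedup of about 5.8X.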
So, I think this performance difference is a large enough gain to justify considering my vectorization-of-math-functions proposal. When it comes to gather/scatter with arbitrary indexes (see the sketch below), I suppose that's less profitable and probably harder to scan for.
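
To make the gather case concrete, the pattern I have in mind is something like this sketch, where 'A', 'B', 'indexes' and 'vf3' are made-up names and indexes holds an arbitrary permutation, so the loads are non-contiguous:

#include <cmath>

extern float A [ 32768 ] , B [ 32768 ] ;
extern int indexes [ 32768 ] ;  // arbitrary permutation of 0 ... 32767

extern void vf3()
{
  for ( int i = 0 ; i < 32768 ; i++ )
    A [ i ] = std::sqrt ( B [ indexes [ i ] ] ) ;
}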
Kay