https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119187
Bug ID: 119187
Summary: vectorizer should be able to SLP already vectorized
code
Product: gcc
Version: unknown
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
Today there's a lot of code written with intrinsics for older microarchitectures
that isn't optimal for newer designs.

One of the promises of intrinsics is that the compiler should be able to do
better when it knows it can.  One way to take advantage of all the cost
modelling support in the vectorizer is to be able to vectorize already
vectorized code.
As an example:
#include <arm_neon.h>

void foo (uint8_t *a, uint8_t *b, uint8_t *c, int n)
{
  for (int i = 0; i < n; i+=16, a+=16, b+=16, c+=16)
    {
      uint8x8_t av1 = vld1_u8 (a);
      uint8x8_t av2 = vld1_u8 (a+8);
      uint8x8_t bv1 = vld1_u8 (b);
      uint8x8_t bv2 = vld1_u8 (b+8);
      vst1_u8 (c, vadd_u8 (av1, bv1));
      vst1_u8 (c+8, vadd_u8 (av2, bv2));
    }
}
At -O3 this generates:
.L3:
ldr d29, [x0, x4]
ldr d28, [x1, x4]
ldr d31, [x7, x4]
ldr d30, [x6, x4]
add v28.8b, v29.8b, v28.8b
add v30.8b, v31.8b, v30.8b
str d28, [x2, x4]
str d30, [x5, x4]
add x4, x4, 16
cmp w3, w4
bgt .L3
This underutilizes the load bandwidth.  Ideally these would be Q-sized loads
and one ADD, i.e. we'd SLP them.
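For reference, a hand-written Q-register version of the same kernel would look
roughly like the sketch below (illustrative only; foo_q is just a name I'm
using here, and the loop structure is kept identical to the original):

#include <arm_neon.h>

/* Illustrative hand-written Q-register equivalent of foo above.  */
void foo_q (uint8_t *a, uint8_t *b, uint8_t *c, int n)
{
  for (int i = 0; i < n; i+=16, a+=16, b+=16, c+=16)
    {
      uint8x16_t av = vld1q_u8 (a);      /* one 128-bit load instead of two 64-bit loads */
      uint8x16_t bv = vld1q_u8 (b);
      vst1q_u8 (c, vaddq_u8 (av, bv));   /* one 128-bit add and one 128-bit store */
    }
}

That is the shape the vectorizer should ideally be able to recover
automatically from the D-register code.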
This ticket documents the work and asks for feedback on how best to do this.
I have a WIP patch against trunk that is able to re-vectorize the above into:
.L4:
ldr q29, [x0, x4]
ldr q31, [x1, x4]
ldr q0, [x10, x4]
ldr q30, [x9, x4]
add v31.16b, v29.16b, v31.16b
add v30.16b, v0.16b, v30.16b
str q31, [x2, x4]
str q30, [x8, x4]
add x4, x4, 16
cmp x7, x4
bne .L4
tst x5, 15
beq .L1
and w4, w5, -16
lsl w8, w4, 4
add x9, x2, w4, uxtw 4
add x5, x1, w4, uxtw 4
add x7, x0, w4, uxtw 4
.L3:
sub w6, w6, w4
cmp w6, 6
bls .L6
ubfiz x4, x4, 4, 32
add w6, w6, 1
add x10, x4, 8
ldr d26, [x0, x4]
ldr d28, [x1, x4]
ldr d1, [x0, x10]
ldr d27, [x1, x10]
add v28.8b, v26.8b, v28.8b
add v27.8b, v1.8b, v27.8b
str d28, [x2, x4]
str d27, [x2, x10]
tst x6, 7
beq .L1
This happens mostly because it gets the unroll factor wrong and the loop
increment is also not correct.  However, the SLP tree itself and the vectypes
look correct:
note: === vect_analyze_data_refs ===
note: got vectype for stmt: _13 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30];
vector(16) unsigned char
note: got vectype for stmt: _14 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30 + 8B];
vector(16) unsigned char
note: got vectype for stmt: _15 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})b_31];
vector(16) unsigned char
note: got vectype for stmt: _16 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})b_31 + 8B];
vector(16) unsigned char
note: got vectype for stmt: MEM <__Uint8x8_t> [(unsigned char *
{ref-all})c_32] = _17;
vector(16) unsigned char
note: got vectype for stmt: MEM <__Uint8x8_t> [(unsigned char *
{ref-all})c_32 + 8B] = _18;
vector(16) unsigned char
...
note: === vect_analyze_data_ref_accesses ===
note: Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30]
note: Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30 + 8B]
note: Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})b_31]
note: Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})b_31 + 8B]
note: Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})c_32]
note: Detected vector linear access in MEM <__Uint8x8_t> [(unsigned char *
{ref-all})c_32 + 8B]
...
note: ==> examining statement: _13 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30];
note: precomputed vectype: vector(16) unsigned char
note: get vectype for smallest scalar type: __Uint8x8_t
note: nunits vectype: vector(16) unsigned char
note: nunits = 16
note: ==> examining statement: _14 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30 + 8B];
note: precomputed vectype: vector(16) unsigned char
note: get vectype for smallest scalar type: __Uint8x8_t
note: nunits vectype: vector(16) unsigned char
note: nunits = 16
...
The costing is off, though:
note: === vect_compute_single_scalar_iteration_cost ===
MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30] 1 times scalar_load costs 1
in prologue
MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30 + 8B] 1 times scalar_load
costs 1 in prologue
MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31] 1 times scalar_load costs 1
in prologue
MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31 + 8B] 1 times scalar_load
costs 1 in prologue
_13 + _15 1 times scalar_stmt costs 1 in prologue
_17 1 times scalar_store costs 1 in prologue
_14 + _16 1 times scalar_stmt costs 1 in prologue
_18 1 times scalar_store costs 1 in prologue
The VF also looks wrong to me: I think VF=2 here, since we consider the scalar
mode to be V8QI, no?  Or should we consider the scalar mode to be QI, in which
case VF=16 is correct?
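For concreteness, the arithmetic behind the two readings (my assumption about
how the factor should be counted, not verified against the code):

  treating V8QI as the scalar unit:  VF = 16 QI lanes / 8 QI lanes = 2
  treating QI as the scalar unit:    VF = 16 QI lanes / 1 QI lane  = 16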
Here I think the detected unroll factor is wrong; I'd expect unroll factor == 1:
note: SLP graph after lowering permutations:
note: node 0x5896350 (max_nunits=16, refcnt=2) vector(16) unsigned char
note: op template: MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32] =
_17;
note: stmt 0 MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32] = _17;
note: children 0x58963e8
note: node 0x58963e8 (max_nunits=16, refcnt=2) vector(16) unsigned char
note: op template: _17 = _13 + _15;
note: stmt 0 _17 = _13 + _15;
note: children 0x5896480 0x5896518
note: node 0x5896480 (max_nunits=16, refcnt=2) vector(16) unsigned char
note: op template: _13 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})a_30];
note: stmt 0 _13 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30];
note: load permutation { 0 }
note: node 0x5896518 (max_nunits=16, refcnt=2) vector(16) unsigned char
note: op template: _15 = MEM <__Uint8x8_t> [(unsigned char *
{ref-all})b_31];
note: stmt 0 _15 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31];
note: load permutation { 0 }
note: node 0x58965b0 (max_nunits=16, refcnt=2) vector(16) unsigned char
note: op template: MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32 + 8B]
= _18;
note: stmt 0 MEM <__Uint8x8_t> [(unsigned char * {ref-all})c_32 + 8B] =
_18;
note: children 0x5896648
note: node 0x5896648 (max_nunits=16, refcnt=2) vector(16) unsigned char
note: op template: _18 = _14 + _16;
note: stmt 0 _18 = _14 + _16;
note: children 0x58966e0 0x5896778
note: node 0x58966e0 (max_nunits=16, refcnt=2) vector(16) unsigned char
note: op template: _14 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30
+ 8B];
note: stmt 0 _14 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})a_30 +
8B];
note: load permutation { 0 }
note: node 0x5896778 (max_nunits=16, refcnt=2) vector(16) unsigned char
note: op template: _16 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31
+ 8B];
note: stmt 0 _16 = MEM <__Uint8x8_t> [(unsigned char * {ref-all})b_31 +
8B];
note: load permutation { 0 }
note: === vect_make_slp_decision ===
note: Decided to SLP 2 instances. Unrolling factor 16
This, I think, is what's causing the incorrect codegen.
So far I've had to modify:
* vect_analyze_group_access_1: Don't treat vector loads as strided accesses
unless there's a gap between group members.
* vect_analyze_data_ref_accesses: Don't consider vector loads as interleaving
by default
* vectorizable_operation, vectorizable_load: Check the scalar type precision
rather than the "scalar vector" type (see the sketch after this list).
* get_related_vectype_for_scalar_type: Support vector types as scalar types.
* vect_get_vector_types_for_stmt: Allow vector inputs.
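To illustrate what the precision change means (a rough sketch of the idea
only, not the actual patch): wherever the analysis takes TYPE_PRECISION of a
statement's scalar type, a vector "scalar" has to be looked through to its
element type first, along the lines of:

  /* Sketch only: when re-vectorizing, the "scalar" type of a statement may
     itself be a vector (e.g. __Uint8x8_t).  Use its element type when
     checking precisions against the chosen vectype.  */
  tree scalar_type = TREE_TYPE (scalar_dest);
  if (VECTOR_TYPE_P (scalar_type))
    scalar_type = TREE_TYPE (scalar_type);
  /* ... precision checks now operate on the element type ... */
  unsigned int prec = TYPE_PRECISION (scalar_type);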