https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120383
Bug ID: 120383
Summary: Improving early break unrolled sequences with Adv.
SIMD
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Blocks: 53947, 115130
Target Milestone: ---
Target: aarch64*
Today if we unroll an early break loop such as:
#define N 640

long long a[N] = {};
long long b[N] = {};

int f1 (long long c)
{
  for (int i = 0; i < N; i++)
    {
      if (a[i] == c)
        return 1;
    }
  return 0;
}
we generate an ORR reduction followed by a compression sequence when using Adv.
SIMD.
        ldp     q31, q30, [x1], 32
        cmeq    v31.2d, v31.2d, v27.2d
        cmeq    v30.2d, v30.2d, v27.2d
        orr     v31.16b, v31.16b, v30.16b
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x4, d31
        cbz     x4, .L2
        fmov    w1, s29
However, the dependency chain here is long enough that it removes the
profitability of vectorization.
This sequence can instead be replaced by:
        ldp     q31, q30, [x0], 32
        cmeq    v31.2d, v31.2d, v29.2d
        cmeq    v30.2d, v30.2d, v29.2d
        addhn   v31.2s, v31.2d, v30.2d
        fmov    x2, d31
        cbz     x2, .L15
with addhn replacing both the ORR reduction and the UMAXP compression.
With 3 compare statements we can use nested addhn; with 4 we can create
2 ORRs + an ADDHN.
The AArch64 ADDHN (Add returning High Narrow) instruction adds two
vectors with the same element size, then narrows the result by keeping
only the high half of each element of the sum.
This means it is hard to do in RTL: combine would not be able to match
such a long sequence, and the intermediate combinations aren't valid on
their own.  For instance, the transformation is only possible when the
vector mask values are 0 and -1, so we would need to know that the
registers hold vector mask values.
An RTL folder won't work either, as it won't let us get the nested
variant.  Which leaves a gimple folder, or support in the vectorizer.
By far the simplest option is the vectorizer, as it knows about mask
types (e.g. VECTOR_BOOLEAN_P (..)), it knows the precision of the mask
type, and it is the one generating the sequence, so it can choose how
to do the reductions.
However, for this to work we have to introduce an optab for addhn.
Open coding the sequence doesn't work, as we have no way to describe
that the addition is done at a higher precision.
With the optab, the final codegen can generate a scalar cbranch rather
than a vector one, and we get the result we want.
This makes unrolling an early break loop much more profitable on AArch64.
Having looked into the options and described the limitations above, are
you OK with a new optab for addhn, Richi?
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130
[Bug 115130] [meta-bug] early break vectorization