On Tue, 15 Oct 2024, Tamar Christina wrote:
> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Tuesday, October 15, 2024 1:42 PM
> > To: Tamar Christina <[email protected]>
> > Cc: [email protected]; nd <[email protected]>
> > Subject: Re: [PATCH 4/4]middle-end: create the longest possible zero extend
> > chain
> > after overwidening
> >
> > On Mon, 14 Oct 2024, Tamar Christina wrote:
> >
> > > Hi All,
> > >
> > > Consider loops such as:
> > >
> > > void test9(unsigned char *x, long long *y, int n, unsigned char k) {
> > >     for(int i = 0; i < n; i++) {
> > >         y[i] = k + x[i];
> > >     }
> > > }
> > >
> > > where today we generate:
> > >
> > > .L5:
> > >         ldr     q29, [x5], 16
> > >         add     x4, x4, 128
> > >         uaddl   v1.8h, v29.8b, v30.8b
> > >         uaddl2  v29.8h, v29.16b, v30.16b
> > >         zip1    v2.8h, v1.8h, v31.8h
> > >         zip1    v0.8h, v29.8h, v31.8h
> > >         zip2    v1.8h, v1.8h, v31.8h
> > >         zip2    v29.8h, v29.8h, v31.8h
> > >         sxtl    v25.2d, v2.2s
> > >         sxtl    v28.2d, v0.2s
> > >         sxtl    v27.2d, v1.2s
> > >         sxtl    v26.2d, v29.2s
> > >         sxtl2   v2.2d, v2.4s
> > >         sxtl2   v0.2d, v0.4s
> > >         sxtl2   v1.2d, v1.4s
> > >         sxtl2   v29.2d, v29.4s
> > >         stp     q25, q2, [x4, -128]
> > >         stp     q27, q1, [x4, -96]
> > >         stp     q28, q0, [x4, -64]
> > >         stp     q26, q29, [x4, -32]
> > >         cmp     x5, x6
> > >         bne     .L5
> > >
> > > Note how the zero extend from short to long is, halfway through the
> > > chain, transformed into a sign extend.  There are two problems with this:
> > >
> > > 1. sign extends are typically slower than zero extends on many uArches.
> > > 2. it prevents vectorizable_conversion from attempting to do a single
> > >    step promotion.
> > >
> > > These sign extends happen due to the various range reduction
> > > optimizations and patterns we have, such as multiplication widening, etc.
> > >
> > > My first attempt to fix this was just updating the patterns so that,
> > > when the original source is a zero extend, they do not add the
> > > intermediate sign extend.
> > >
> > > However this behavior happens in many other places, and as new
> > > patterns get added the problem can be re-introduced.
> > >
> > > Instead I have added a new pattern vect_recog_zero_extend_chain_pattern
> > > that attempts to simplify and extend an existing zero extend over
> > > multiple conversion statements.
> > >
> > > As an example, T3 a = (T3)(signed T2)(unsigned T1)x, where bitsize
> > > T3 > T2 > T1, gets transformed into T3 a = (T3)(signed T2)(unsigned T2)x.
> > >
> > > The final cast to signed is kept so the types in the tree still match.
> > > It will be correctly elided later on.
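> > >
> > > Spelled out with concrete types (picked purely for illustration, this
> > > is not dump output): with T1 = unsigned char, T2 = short and
> > > T3 = long long,
> > >
> > >   a = (long long)(short)(unsigned char)x;
> > >
> > > becomes
> > >
> > >   a = (long long)(short)(unsigned short)x;
> > >
> > > i.e. the zero extend is widened from char to short, and the (short)
> > > cast becomes a sign-preserving no-op that is elided later.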
> > >
> > > This representation is optimal, as vectorizable_conversion is already
> > > able to decompose a long promotion into multiple steps if the target
> > > does not support it in a single step.  More importantly it allows us
> > > to do proper costing and to support conversions such as (double)x,
> > > where bitsize(x) < int, in an efficient manner.
> > >
> > > To do this I have used Ranger's on-demand analysis to check whether
> > > an extension can be removed and the zero extend widened.  The reason
> > > for this is that the vectorizer introduces several patterns that are
> > > not in the IL, but also lots of widening IFNs, for which handling in
> > > a switch wouldn't be very future proof.
> > >
> > > I did try to do it without Ranger, but Ranger had three benefits:
> > >
> > > 1. It simplified the handling of the IL changes the vectorizer
> > >    introduces, and makes it future proof.
> > > 2. Ranger has the advantage of doing the transformation in cases
> > >    where it knows that the top bits of the value are zero, which we
> > >    wouldn't be able to tell by looking purely at statements.
> > > 3. Ranger simplified the handling of corner cases.  Without it the
> > >    handling was quite complex and I wasn't very confident in its
> > >    correctness.
> > >
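> > > The core of the query is small.  Roughly, as a standalone sketch (an
> > > illustrative helper, not code from the patch; the name is made up):
> > >
> > >   /* Return true if Ranger can prove the value defined by STMT has its
> > >      top bit clear, i.e. sign extending it is equivalent to zero
> > >      extending it.  */
> > >   static bool
> > >   top_bit_clear_p (gimple_ranger &ranger, gimple *stmt)
> > >   {
> > >     int_range_max r;
> > >     ranger.range_of_stmt (r, stmt);
> > >     wide_int nz = r.get_nonzero_bits ();
> > >     return !wi::neg_p (nz) && wi::clz (nz) != 0;
> > >   }
> > >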
> > > So I think ranger is the right way to go here...  With these changes
> > > the above now generates:
> > >
> > > .L5:
> > >         add     x4, x4, 128
> > >         ldr     q26, [x5], 16
> > >         uaddl   v2.8h, v26.8b, v31.8b
> > >         uaddl2  v26.8h, v26.16b, v31.16b
> > >         tbl     v4.16b, {v2.16b}, v30.16b
> > >         tbl     v3.16b, {v2.16b}, v29.16b
> > >         tbl     v24.16b, {v2.16b}, v28.16b
> > >         tbl     v1.16b, {v26.16b}, v30.16b
> > >         tbl     v0.16b, {v26.16b}, v29.16b
> > >         tbl     v25.16b, {v26.16b}, v28.16b
> > >         tbl     v2.16b, {v2.16b}, v27.16b
> > >         tbl     v26.16b, {v26.16b}, v27.16b
> > >         stp     q4, q3, [x4, -128]
> > >         stp     q1, q0, [x4, -64]
> > >         stp     q24, q2, [x4, -96]
> > >         stp     q25, q26, [x4, -32]
> > >         cmp     x5, x6
> > >         bne     .L5
> > >
> > > I have also seen similar improvements in codegen on Arm and x86_64,
> > > especially with AVX512.
> > >
> > > Bootstrapped and regtested on aarch64-none-linux-gnu,
> > > arm-none-linux-gnueabihf and x86_64-pc-linux-gnu (-m32 and -m64)
> > > with no issues.
> > >
> > > Hopefully Ok for master?
> >
> > Hohumm. So I looked at one of the examples and I don't see any
> > sign-extends in the IL we vectorize. So your pattern is about
> > changing int -> double to unsigned int -> double but only so
> > a required intermediate int -> long conversion is done as
> > zero-extend? IMO this doesn't belong to patterns but to
> > vectorizable_conversion, specifically the step determining the
> > intermediate types.
>
> There is, it's not in C but created by the vectorizer.
> So the way I saw it was that vectorizable_conversion should be
> given the choice of what it wants to do, whether it wants to do
> it in one operation or multiple.
>
> If the goal was to just get rid of the final zero extend, yes I would agree.
>
> But that's just part of the goal; the other is to have the zero extend
> explicitly exposed to vectorizable_conversion.  That way the patch
> that uses TBL actually sees the long multi-step conversion.
>
> My worry was that if done in vectorizable_conversion, while I can walk
> the IL, we'd cost the intermediate casts.  On AArch64 the cost model
> takes into account the throughput of sequences, not just latencies,
> and the TBLs have better throughput.  So for costing you really do
> want to see the full thing and not cost the intermediate conversions.
>
> This is why I used a pattern, since the IL is actually changed from
> the input.  But see below...
>
> >
> > I don't quite understand what scalar pattern IL you feed to
> > the vectorizer in the end; a few comments - also to this effect -
> > below.
>
> Sure, I'll answer this and the question below in one go:
>
> Let's pick a simple example:
>
> void test8(unsigned char *x, double *y, int n, unsigned char k) {
>     for(int i = 0; i < n; i++) {
>         y[i] = k + x[i];
>     }
> }
>
> In GIMPLE this generates:
>
> _4 = *_3;
> _5 = (int) _4;
> _6 = _5 + _29;
> _9 = (double) _6;
>
> i.e. the unsigned char is widened to int, added to k as int, and then
> converted to a double.
>
> When we start vectorizing, overwidening detection runs:
>
> note: vect_recog_over_widening_pattern: detected: _6 = _5 + _29;
> note: demoting int to unsigned short
> note: Splitting statement: _5 = (int) _4;
> note: into pattern statements: patt_32 = (unsigned short) _4;
> note: and: patt_31 = (int) patt_32;
> note: created pattern stmt: patt_28 = patt_32 + patt_30;
> note: over_widening pattern recognized: patt_27 = (int) patt_28;
> note: extra pattern stmt: patt_28 = patt_32 + patt_30;
> note: vect_is_simple_use: operand (unsigned short) _4, type of def: internal
>
> and it correctly demotes it from int to unsigned short, performs the
> operation as unsigned, and then sign extends the result to int.
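>
> Putting the pattern statements from the dump together, the add is now
> performed as:
>
>   patt_32 = (unsigned short) _4;
>   patt_28 = patt_32 + patt_30;
>   patt_27 = (int) patt_28;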
>
> The final IL the vectorizer builds is this:
>
> note: node 0x473d790 (max_nunits=16, refcnt=2) vector(2) double
> note: op template: *_8 = _9;
> note: stmt 0 *_8 = _9;
> note: children 0x473d828
> note: node 0x473d828 (max_nunits=16, refcnt=2) vector(2) double
> note: op template: _9 = (double) _6;
> note: stmt 0 _9 = (double) _6;
> note: children 0x473d8c0
> note: node 0x473d8c0 (max_nunits=16, refcnt=2) vector(4) int
> note: op template: patt_27 = (int) patt_28;
> note: stmt 0 patt_27 = (int) patt_28;
> note: children 0x473d958
> note: node 0x473d958 (max_nunits=16, refcnt=2) vector(8) unsigned short
> note: op template: patt_28 = .VEC_WIDEN_PLUS (_4, k_14(D));
> note: stmt 0 patt_28 = .VEC_WIDEN_PLUS (_4, k_14(D));
> note: children 0x473d9f0 0x473da88
> note: node 0x473d9f0 (max_nunits=16, refcnt=2) vector(16) unsigned char
> note: op template: _4 = *_3;
> note: stmt 0 _4 = *_3;
> note: load permutation { 0 }
> note: node (external) 0x473da88 (max_nunits=1, refcnt=1)
> note: { k_14(D) }
>
> We already relax the cast to int into a zero extend today.  However
> the final promotion is the FLOAT_EXPR.  There it sees a widening from
> int to long and assumes that this has to be a sign extend because of
> the signed input; it doesn't know that the signed input was created by
> a zero extending operation.
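>
> Schematically (hand-written, not dump output) the promotion today is
>
>   int --sign extend--> long --FLOAT_EXPR--> double
>
> whereas with the zero extend made explicit the vectorizer instead sees
>
>   int --zero extend--> unsigned long --FLOAT_EXPR--> double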
OK, so it's like I thought then.  The fix should be in
vectorizable_conversion, to consider using a zero-extend for the
conversion to the intermediate long type.  I'm not sure how far
we can use stmt-vinfo min_output_precision for such analysis;
maybe Richard can answer this.  But there's the bad (because it's
wrongly implemented for SLP) example of using range info
in supportable_indirect_convert_operation for this already.
> What my pattern does is make this explicit in the tree.
>
> My first attempt was to update the code that does:
>
> /app/example.c:2:22: note: demoting int to unsigned short
> /app/example.c:2:22: note: Splitting statement: _5 = (int) _4;
> /app/example.c:2:22: note: into pattern statements: patt_32 = (unsigned short) _4;
> /app/example.c:2:22: note: and: patt_31 = (int) patt_32;
>
> in vect_split_statement.  But even doing so, the problem is that it
> splits the range of the conversions.  And so while this fixed some
> cases, code that uses the result never knows that the top bits are
> zero.
>
> So it worked for some cases but missed out on plenty of others.
I think the pattern would have to be for the (double) _5 conversion
and we do not want to expose (double) (long) (unsigned) _5 at that
point because some targets (x86!) _can_ do SImode to DFmode float
conversions and don't require an intermediate widening at all
(vectorizable_conversion knows this).
Richard.
> > So for (double)(int)x you produce (double)(int)(unsigned int)x? It
> > feels like the above example misses a conversion?
> >
>
> No, (double)(int)x is basically (double)(long)(int)x
>
> And the pattern creates
>
> (double)(long)(unsigned long)x.  Basically it just makes the
> conversions explicit and extends the zero extend as wide as possible.
>
> > I don't think the pure integer type sequence happens - we do
> > (and should if not) already perform canonicalization of equivalent
> > conversion sequences, otherwise you'd see a missed CSE (I couldn't
> > produce such an example).
>
> It does.  This is a pure integer sequence with the same problem:
>
> void test8(unsigned char *x, long *y, int n, unsigned char k) {
>     for(int i = 0; i < n; i++) {
>         y[i] = k + x[i];
>     }
> }
>
> And there is other code that introduces this; for instance there's a
> cleanup after multiplication widening specifically that also tries to
> split types, and it will do this as well.
>
> >
> > You say you found the "issue" to be exposed in several (or found in one,
> > suspected in several) existing patterns. Can you elaborate and point
> > out where this happens? I don't think it covers the int -> double
> > case, does it?
>
> Not this case on its own. You have to do *some* operation in between.
>
> Thanks,
> Tamar
>
> >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > * tree-vect-patterns.cc (vect_recog_zero_extend_chain_pattern): New.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.dg/vect/bb-slp-pattern-1.c: Update tests.
> > > * gcc.dg/vect/slp-widen-mult-half.c: Likewise.
> > > * gcc.dg/vect/vect-over-widen-10.c: Likewise.
> > > * gcc.dg/vect/vect-over-widen-12.c: Likewise.
> > > * gcc.dg/vect/vect-over-widen-14.c: Likewise.
> > > * gcc.dg/vect/vect-over-widen-16.c: Likewise.
> > > * gcc.dg/vect/vect-over-widen-6.c: Likewise.
> > > * gcc.dg/vect/vect-over-widen-8.c: Likewise.
> > > * gcc.dg/vect/vect-widen-mult-u16.c: Likewise.
> > > * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: Likewise.
> > > * lib/target-supports.exp
> > > (check_effective_target_vect_widen_mult_hi_to_si_pattern,
> > > check_effective_target_vect_widen_mult_si_to_di_pattern): Enable
> > > AArch64.
> > > * gcc.target/aarch64/vect-tbl-zero-extend_2.c: New test.
> > >
> > > ---
> > > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > index
> > 5ae99225273ca5f915f60ecba3a5aaedebe46e96..627de78af4e48581575beda9
> > 7bf2a0708ac091cb 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > @@ -52,4 +52,4 @@ int main (void)
> > >
> > > /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1
> > > "slp2" {
> > target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > > /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 8 "slp2" {
> > > target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 8
> > > "slp2" {
> > target vect_widen_mult_hi_to_si_pattern } } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > index
> > b69ade338862cda4f44f5206d195eef1cb5e8d36..aecc085a51c93e0e7bed122df
> > 0a77a0a099ad6ef 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > @@ -52,5 +52,5 @@ int main (void)
> > > /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
> > > target
> > vect_widen_mult_hi_to_si } } } */
> > > /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2
> > > "vect" {
> > target vect_widen_mult_hi_to_si } } } */
> > > /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 2 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 2 "vect" {
> > > target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "pattern recognized" 4 "vect" {
> > > target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > index
> > f0140e4ef6d70cd61aa7dbb3ba39b1da142a79b2..bd798fae7e8136975d48820
> > 6cfef9e39fac2bfea 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > @@ -11,7 +11,7 @@
> > >
> > > #include "vect-over-widen-9.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } }
> > > */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 2} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > index
> > ddb3bd8c0d378f0138c8cc7f9c6ea3300744b8a8..8c0544e35c29de60e76759f4
> > ed13206278c72925 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > @@ -11,7 +11,7 @@
> > >
> > > #include "vect-over-widen-11.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } }
> > > */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 2} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > index
> > dfa09f5d2cafe329e6d57b5cc681786cc2c7d215..1fe0305c1c4f61d05864ef9778
> > 9726a1dc6ec8b1 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > @@ -11,7 +11,7 @@
> > >
> > > #include "vect-over-widen-13.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } }
> > > */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+} "vect" } } */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > > /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> > detected:[^\n]* = \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > index
> > 4584c586da1e6f13e8c8de4c1291cea0141ebab5..4ecdadf7a035a4f83b1767a06
> > 3a1b0f47bdd543d 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > @@ -11,7 +11,7 @@
> > >
> > > #include "vect-over-widen-15.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } }
> > > */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+} "vect" } } */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > > /* { dg-final { scan-tree-dump-not {vect_recog_cast_forwprop_pattern:
> > detected} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > index
> > bda92c965e080dd3f48ec42b6bea16e79d9416cd..6b8c3dfa2c89ce04d7673607
> > ef2d2f14a14eb32f 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > @@ -9,7 +9,7 @@
> > >
> > > #include "vect-over-widen-5.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } }
> > > */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > > /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> > detected:[^\n]* \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > index
> > 553c0712a79a1d19195dbdab7cbd6fa330685bea..1cf725ff4b7f151097192db1
> > a0b65173c4c83b19 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > @@ -12,7 +12,7 @@
> > >
> > > #include "vect-over-widen-7.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } }
> > > */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > > /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 2} "vect" } } */
> > > /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> > detected:[^\n]* \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > index
> > 258d253f401459d448d1ae86f56b0c97815d5b61..b5018f855a72534b4d64d2
> > dc2b7ab2ac0deb674b 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > @@ -47,5 +47,5 @@ int main (void)
> > >
> > > /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
> > > target {
> > vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > > /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" {
> > > target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1
> > > "vect" {
> > target vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > index
> > 3baafca7b548124ae5c48fdf3c2f07c319155967..ab523ca77652e1f1533889fda9
> > c0eb31c987ffe9 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > @@ -47,5 +47,5 @@ int main (void)
> > >
> > > /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
> > > target {
> > vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > > /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" {
> > > target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1
> > > "vect" {
> > target vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > > new file mode 100644
> > > index
> > 0000000000000000000000000000000000000000..1577eacd9dbbb52274d9f
> > 86c77406555b7726482
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> > > +
> > > +void test6(unsigned char *x, double *y, int n) {
> > > + for(int i = 0; i < (n & -8); i++) {
> > > + y[i] += x[i];
> > > + }
> > > +}
> > > +
> > > +void test7(unsigned char *x, double *y, int n, unsigned char k) {
> > > + for(int i = 0; i < (n & -8); i++) {
> > > + y[i] += k * x[i];
> > > + }
> > > +}
> > > +
> > > +void test8(unsigned char *x, double *y, int n, unsigned char k) {
> > > + for(int i = 0; i < (n & -8); i++) {
> > > + y[i] = k + x[i];
> > > + }
> > > +}
> > > +
> > > +void test9(unsigned char *x, long long *y, int n, unsigned char k) {
> > > + for(int i = 0; i < (n & -8); i++) {
> > > + y[i] = k + x[i];
> > > + }
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times {\tuxtl} 1 } } */
> > > +/* { dg-final { scan-assembler-not {\tuxtl2} } } */
> > > +/* { dg-final { scan-assembler-not {\tzip1} } } */
> > > +/* { dg-final { scan-assembler-not {\tzip2} } } */
> > > +/* { dg-final { scan-assembler-times {\ttbl} 44 } } */
> > > +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */
> > > diff --git a/gcc/testsuite/lib/target-supports.exp
> > > b/gcc/testsuite/lib/target-
> > supports.exp
> > > index
> > d113a08dff7b2a8ab5bdfe24386d271bff255afc..feae1b8fcf8cd7ab56a8c76c0cd3
> > 034c0a828724 100644
> > > --- a/gcc/testsuite/lib/target-supports.exp
> > > +++ b/gcc/testsuite/lib/target-supports.exp
> > > @@ -8240,6 +8240,7 @@ proc
> > check_effective_target_vect_widen_mult_hi_to_si_pattern { } {
> > > return [check_cached_effective_target_indexed
> > vect_widen_mult_hi_to_si_pattern {
> > > expr { [istarget powerpc*-*-*]
> > > || [istarget ia64-*-*]
> > > + || [istarget aarch64*-*-*]
> > > || [istarget loongarch*-*-*]
> > > || [istarget i?86-*-*] || [istarget x86_64-*-*]
> > > || ([is-effective-target arm_neon]
> > > @@ -8259,6 +8260,7 @@ proc
> > check_effective_target_vect_widen_mult_si_to_di_pattern { } {
> > > expr { [istarget ia64-*-*]
> > > || [istarget i?86-*-*] || [istarget x86_64-*-*]
> > > || [istarget loongarch*-*-*]
> > > + || [istarget aarch64*-*-*]
> > > || ([istarget s390*-*-*]
> > > && [check_effective_target_s390_vx]) }}]
> > > }
> > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > > index
> > 9bf8526ac995c6c2678b25f5df4316aec41333e0..74c7269a3ab15cba1ee2ef055
> > 6d25afda851f7f0 100644
> > > --- a/gcc/tree-vect-patterns.cc
> > > +++ b/gcc/tree-vect-patterns.cc
> > > @@ -5524,6 +5524,122 @@ vect_recog_mixed_size_cond_pattern (vec_info
> > *vinfo,
> > > return pattern_stmt;
> > > }
> > >
> > > +/* Function vect_recog_zero_extend_chain_pattern
> > > +
> > > + Try to find the following pattern:
> > > +
> > > + type x_t;
> > > + TYPE a_T, b_T, c_T;
> >
> > But a_T, b_T and c_T are different types - what types?
> >
> > > + loop:
> > > + S1 a_T = (b_T)(c_T)x_t;
> > > +
> > > + where type 'TYPE' is an integral type which has different size
> > > + from 'type' and c_T is a zero extend or a sign extend on a value
> > > whose top
> > > + bit is known to be zero. a_T can be signed or unsigned.
> > > +
> > > + Input:
> > > +
> > > + * STMT_VINFO: The stmt from which the pattern search begins.
> > > +
> > > + Output:
> > > +
> > > + * TYPE_OUT: The type of the output of this pattern.
> > > +
> > > + * Return value: A new stmt that will be used to replace the pattern.
> > > + This replaces multiple chained extensions with the longest possible
> > > + chain of zero extends and a final convert to the required sign.
> > > +
> > > + S1 a_T = (a_T)(unsigned a_T)x_t; */
> >
> > So for (double)(int)x you produce (double)(int)(unsigned int)x? It
> > feels like the above example misses a conversion?
> >
> > I don't think the pure integer type sequence happens - we do
> > (and should if not) already perform canonicalization of equivalent
> > conversion sequences, otherwise you'd see a missed CSE (I couldn't
> > produce such an example).
> >
> > You say you found the "issue" to be exposed in several (or found in one,
> > suspected in several) existing patterns. Can you elaborate and point
> > out where this happens? I don't think it covers the int -> double
> > case, does it?
> >
> > Thanks,
> > Richard.
> >
> > > +
> > > +static gimple *
> > > +vect_recog_zero_extend_chain_pattern (vec_info *vinfo,
> > > + stmt_vec_info stmt_vinfo, tree *type_out)
> > > +{
> > > + gimple *last_stmt = STMT_VINFO_STMT (vect_stmt_to_vectorize
> > (stmt_vinfo));
> > > +
> > > + if (!is_gimple_assign (last_stmt))
> > > + return NULL;
> > > +
> > > + tree_code code = gimple_assign_rhs_code (last_stmt);
> > > + tree lhs = gimple_assign_lhs (last_stmt);
> > > + tree rhs = gimple_assign_rhs1 (last_stmt);
> > > + tree lhs_type = TREE_TYPE (lhs);
> > > + tree rhs_type = TREE_TYPE (rhs);
> > > +
> > > + if ((code != FLOAT_EXPR && code != NOP_EXPR)
> > > + || TYPE_UNSIGNED (lhs_type)
> > > + || TREE_CODE (rhs_type) != INTEGER_TYPE
> > > + || TREE_CODE (rhs) != SSA_NAME
> > > + || STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_internal_def)
> > > + return NULL;
> > > +
> > > + /* Check to see if it's safe to extend the zero extend to the new type.
> > > + In general this is safe if the rhs1 type is unsigned or if we know
> > > that
> > > + the top bits are zero, this can happen due to all the widening
> > > operations
> > > + we have. For instance a widening addition will have top bits
> > > zero. */
> > > + if (!TYPE_UNSIGNED (rhs_type))
> > > + {
> > > + wide_int wcst = get_nonzero_bits (rhs);
> > > + if (wi::neg_p (wcst) || wi::clz (wcst) == 0)
> > > + return NULL;
> > > + }
> > > +
> > > + tree cvt_type = unsigned_type_for (lhs_type);
> > > +
> > > + tree cvt_vectype = get_vectype_for_scalar_type (vinfo, cvt_type);
> > > + if (!cvt_vectype || !VECTOR_TYPE_P (cvt_vectype))
> > > + return NULL;
> > > +
> > > + tree out_vectype = get_vectype_for_scalar_type (vinfo, lhs_type);
> > > + if (!out_vectype || !VECTOR_TYPE_P (out_vectype))
> > > + return NULL;
> > > +
> > > + stmt_vec_info irhs;
> > > +
> > > + gimple_ranger ranger;
> > > +
> > > + /* Dig through any existing conversions to see if we can extend the
> > > zero
> > > + extend chain across multiple converts. */
> > > + while ((irhs = vect_get_internal_def (vinfo, rhs)))
> > > + {
> > > + gimple *g_irhs = STMT_VINFO_STMT (irhs);
> > > + if (!is_gimple_assign (g_irhs)
> > > + || gimple_assign_rhs_code (g_irhs) != NOP_EXPR)
> > > + break;
> > > +
> > > + /* See if we can consume the next conversion as well. To do this
> > > it's
> > > + best to use Ranger as it can see through the intermediate IL that the
> > > + vectorizer creates throughout pattern matching. */
> > > + int_range_max r;
> > > + ranger.range_of_stmt (r, g_irhs);
> > > + wide_int nz = r.get_nonzero_bits ();
> > > + if (wi::neg_p (nz) || wi::clz (nz) == 0)
> > > + break;
> > > +
> > > + rhs = gimple_assign_rhs1 (g_irhs);
> > > + }
> > > +
> > > + /* If the result is a no-op, or we've jumped over a truncate of sorts,
> > > + or if nothing would change materially, just leave it alone. */
> > > + if (TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (TREE_TYPE (rhs))
> > > + || (code == FLOAT_EXPR && rhs == gimple_assign_rhs1 (last_stmt)))
> > > + return NULL;
> > > +
> > > + vect_pattern_detected ("vect_recog_zero_extend_chain_pattern",
> > > last_stmt);
> > > +
> > > + tree cast_var = vect_recog_temp_ssa_var (cvt_type, NULL);
> > > + gimple *pattern_stmt = NULL;
> > > + pattern_stmt = gimple_build_assign (cast_var, NOP_EXPR, rhs);
> > > + append_pattern_def_seq (vinfo, stmt_vinfo, pattern_stmt, cvt_vectype);
> > > +
> > > + tree cvt_var = vect_recog_temp_ssa_var (lhs_type, NULL);
> > > + pattern_stmt = gimple_build_assign (cvt_var, code, cast_var);
> > > +
> > > + *type_out = out_vectype;
> > > +
> > > + return pattern_stmt;
> > > +}
> > > +
> > >
> > > /* Helper function of vect_recog_bool_pattern. Called recursively,
> > > return
> > > true if bool VAR can and should be optimized that way. Assume it
> > > shouldn't
> > > @@ -7509,6 +7625,7 @@ static vect_recog_func vect_vect_recog_func_ptrs[] =
> > {
> > > { vect_recog_widen_minus_pattern, "widen_minus" },
> > > { vect_recog_widen_abd_pattern, "widen_abd" },
> > > /* These must come after the double widening ones. */
> > > + { vect_recog_zero_extend_chain_pattern, "zero_extend_chain" },
> > > };
> > >
> > > /* Mark statements that are involved in a pattern. */
> > >
> > >
> > >
> > >
> > >
> >
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)