[Bug tree-optimization/106594] [13 Regression] sign-extensions no longer merged into addressing mode

2022-08-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106594

--- Comment #2 from Tamar Christina  ---
(In reply to Richard Biener from comment #1)
> I believe this was some match.pd simplification - why does this affect the
> addressing mode?  Is the IVOPTs result different or does it differ later?

The results in IVOpts don't change aside from the changes mentioned above.  It
looks like this was previously being matched in RTL through combine.

Before we had

(insn 25 24 26 4 (set (reg:DI 118)
(sign_extend:DI (reg:SI 117))) "/app/example.c":12:35 109
{*extendsidi2_aarch64}
 (expr_list:REG_DEAD (reg:SI 117)
(nil)))
(insn 26 25 27 4 (set (reg:SI 119 [ constellation_64qam[_6] ])
(mem/u:SI (plus:DI (mult:DI (reg:DI 118)
(const_int 4 [0x4]))
(reg/f:DI 120)) [1 constellation_64qam[_6]+0 S4 A32]))
"/app/example.c":12:14 52 {*movsi_aarch64}

Trying 25 -> 26:
   25: r118:DI=sign_extend(r117:SI)
  REG_DEAD r117:SI
   26: r119:SI=[r118:DI*0x4+r120:DI]
  REG_DEAD r118:DI
  REG_EQUAL [r118:DI*0x4+`constellation_64qam']
Successfully matched this instruction:
(set (reg:SI 119 [ constellation_64qamD.3590[_6] ])
(mem/u:SI (plus:DI (mult:DI (sign_extend:DI (reg:SI 117))
(const_int 4 [0x4]))
(reg/f:DI 120)) [1 constellation_64qamD.3590[_6]+0 S4 A32]))
allowing combination of insns 25 and 26
original costs 4 + 16 = 20
replacement cost 16
deferring deletion of insn with uid = 25.
modifying insn i3    26: r119:SI=[sign_extend(r117:SI)*0x4+r120:DI]
  REG_DEAD r117:SI
deferring rescan insn with uid = 26.

now, instead we have


(insn 24 23 25 4 (set (reg:DI 116)
(sign_extend:DI (reg:SI 115))) "/app/example.c":12:35 126
{*extendsidi2_aarch64}
 (expr_list:REG_DEAD (reg:SI 115)
(nil)))
(insn 25 24 26 4 (set (reg:SI 117 [ constellation_64qam[_5] ])
(mem/u:SI (plus:DI (mult:DI (reg:DI 116)
(const_int 4 [0x4]))
(reg/f:DI 118)) [1 constellation_64qam[_5]+0 S4 A32]))
"/app/example.c":12:14 52 {*movsi_aarch64}

Trying 24 -> 25:
   24: r116:DI=sign_extend(r115:SI)
  REG_DEAD r115:SI
   25: r117:SI=[r116:DI*0x4+r118:DI]
  REG_DEAD r116:DI
  REG_EQUAL [r116:DI*0x4+`constellation_64qam']
Failed to match this instruction:
(set (reg:SI 117 [ constellation_64qamD.3641[_5] ])
(mem/u:SI (plus:DI (and:DI (mult:DI (subreg:DI (reg:SI 115) 0)
(const_int 4 [0x4]))
(const_int 252 [0xfc]))
(reg/f:DI 118)) [1 constellation_64qamD.3641[_5]+0 S4 A32]))


where the sign extend has suddenly turned into an and.

I don't know why, though; the two input RTLs look identical to me.

[Bug target/106524] [12 Regression] ICE in extract_insn, at recog.cc:2791 (error: unrecognizable insn) since r12-4349-ge36206c9940d22.

2022-08-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106524

Tamar Christina  changed:

   What|Removed |Added

Summary|[12/13 Regression] ICE in   |[12 Regression] ICE in
   |extract_insn, at|extract_insn, at
   |recog.cc:2791 (error:   |recog.cc:2791 (error:
   |unrecognizable insn) since  |unrecognizable insn) since
   |r12-4349-ge36206c9940d22.   |r12-4349-ge36206c9940d22.
   Target Milestone|12.2|12.3

--- Comment #3 from Tamar Christina  ---
Fixed on trunk, will backport to 12 once the branch is unfrozen.

[Bug tree-optimization/106594] New: [13 Regression] sign-extensions no longer merged into addressing mode

2022-08-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106594

Bug ID: 106594
   Summary: [13 Regression] sign-extensions no longer merged into
addressing mode
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

Since around the 5th of August (need to bisect) we no longer merge the sign
extensions into the load's addressing mode.

The following example:

extern const int constellation_64qam[64];

void foo(int nbits,
         const char *p_src,
         int *p_dst) {

  while (nbits > 0U) {
    char first = *p_src++;

    char index1 = ((first & 0x3) << 4) | (first >> 4);

    *p_dst++ = constellation_64qam[index1];

    nbits--;
  }
}

used to generate

orr w3, w4, w3, lsr 4
ldr w3, [x6, w3, sxtw 2]

and now generates:

orr w0, w4, w0, lsr 4
sxtw w0, w0
ldr w0, [x6, x0, lsl 2]

at -O2 (-O1 seems to still be fine).  This is causing a performance regression
in some of our libraries.

Looks like there's a change in how the operation is expressed.  It used to be

  first_17 = *p_src_28;
  _1 = (int) first_17;
  _2 = _1 << 4;
  _3 = (signed char) _2;

where the shift is done as an int, whereas now it's

  first_16 = *p_src_27;
  first.1_1 = (signed char) first_16;
  _2 = first.1_1 << 4;
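
In C terms the two GIMPLE sequences correspond roughly to the following (a
sketch only, to show where the truncation happens; real C would promote the
narrow shift operand back to int):

  /* before: widen to int, shift in int, then truncate  */
  signed char idx_old = (signed char) (((int) first) << 4);

  /* after: truncate to signed char first, then shift in the narrow type  */
  signed char tmp     = (signed char) first;
  signed char idx_new = (signed char) (tmp << 4);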

[Bug target/106524] [12/13 Regression] ICE in extract_insn, at recog.cc:2791 (error: unrecognizable insn) since r12-4349-ge36206c9940d22.

2022-08-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106524

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

[Bug rtl-optimization/106553] pre-register allocation scheduler is not RMW aware

2022-08-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553

--- Comment #2 from Tamar Christina  ---
(In reply to Alexander Monakov from comment #1)
> Are you sure the testcase is correctly reduced, i.e. does it show the same
> performance degradation? Latency-wise the scheduler is making the correct
> decision here: we really want to schedule second-to-last FMA


The reduction wasn't tested for performance; I'm not even claiming the final
result is optimal, because scheduling was completely disabled.

> 
>   y = v_fma_f32 (y, r2, x);
> 
> earlier than its predecessor
> 
>   r = v_fma_f32 (y, r2, z);
> 
> because we need to compute y*r2 before the last FMA.

The relative order of the instructions didn't change as far as I can tell in
the reduced example.

I had expected the mul to be moved earlier.

__v_sinf(__Float32x4_t, __Float32x4_t, __Float32x4_t, __Float32x4_t):
fmul v3.4s, v3.4s, v3.4s
mov v5.16b, v0.16b
mov v4.16b, v0.16b
fmla v5.4s, v2.4s, v3.4s
fmla v4.4s, v5.4s, v3.4s
fmla v0.4s, v4.4s, v3.4s
fmul v6.4s, v3.4s, v0.4s
mov v0.16b, v1.16b
fmla v0.4s, v4.4s, v3.4s
fmla v0.4s, v6.4s, v0.4s
ret

as the copy of v0 into v2 is still redundant. However, looking at the RTL of the
reduction, I don't really understand why the mov existed.

The bad case is

(insn 13 11 12 2 (set (reg:V4SF 94 [ _9 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg/v:V4SF 99 [ x ]))) "":14605:10 2206 {fmav4sf4}
 (expr_list:REG_DEAD (reg/v:V4SF 99 [ x ])
(nil)))
(insn 12 13 14 2 (set (reg:V4SF 95 [ _10 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg:V4SF 106))) "":14605:10 2206 {fmav4sf4}
 (expr_list:REG_DEAD (reg:V4SF 106)
(expr_list:REG_DEAD (reg:V4SF 96 [ _11 ])
(nil
(insn 14 12 19 2 (set (reg:V4SF 103)
(mult:V4SF (reg/v:V4SF 93 [ r2 ])
(reg:V4SF 94 [ _9 ]))) "":20:7 2186 {mulv4sf3}
 (expr_list:REG_DEAD (reg:V4SF 94 [ _9 ])
(expr_list:REG_DEAD (reg/v:V4SF 93 [ r2 ])
(nil

and the good case

(insn 12 11 13 2 (set (reg:V4SF 95 [ _10 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg:V4SF 106))) "":14605:10 2206 {fmav4sf4}
 (expr_list:REG_DEAD (reg:V4SF 106)
(nil)))
(insn 13 12 14 2 (set (reg:V4SF 94 [ _9 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg/v:V4SF 99 [ x ]))) "":14605:10 2206 {fmav4sf4}
 (expr_list:REG_DEAD (reg/v:V4SF 99 [ x ])
(expr_list:REG_DEAD (reg:V4SF 96 [ _11 ])
(nil
(insn 14 13 15 2 (set (reg:V4SF 103)
(mult:V4SF (reg/v:V4SF 93 [ r2 ])
(reg:V4SF 94 [ _9 ]))) "":20:7 2186 {mulv4sf3}
 (expr_list:REG_DEAD (reg:V4SF 94 [ _9 ])
(expr_list:REG_DEAD (reg/v:V4SF 93 [ r2 ])
(nil

So I don't really see why the live range of _9 was extended... so maybe this is
register allocation after all.

It's weird that -fno-schedule-insns removes the redundant moves in all cases
though, but perhaps that's just a coincidence...

[Bug rtl-optimization/106553] New: pre-register allocation scheduler is not RMW aware

2022-08-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553

Bug ID: 106553
   Summary: pre-register allocation scheduler is not RMW aware
   Product: gcc
   Version: 11.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

The following example is minimized from the math routines in glibc:

#include <arm_neon.h>

typedef float32x4_t v_f32_t;

static inline v_f32_t
v_fma_f32 (v_f32_t x, v_f32_t y, v_f32_t z)
{
  return vfmaq_f32 (z, x, y);
}

v_f32_t
__v_sinf (v_f32_t x,v_f32_t z, v_f32_t n, v_f32_t r)
{
  v_f32_t r2, y;
  r2 = r * r;
  y = v_fma_f32 (n, r2, x);
  y = v_fma_f32 (y, r2, x);
  r = v_fma_f32 (y, r2, z);
  y = v_fma_f32 (y, r2, x);
  y = v_fma_f32 (y * r2, r, r);

  return y;
}

here we generate at -O2

__v_sinf(__Float32x4_t, __Float32x4_t, __Float32x4_t, __Float32x4_t):
fmul v3.4s, v3.4s, v3.4s
mov v5.16b, v0.16b
mov v4.16b, v0.16b
fmla v5.4s, v2.4s, v3.4s
fmla v4.4s, v5.4s, v3.4s
fmla v0.4s, v4.4s, v3.4s
mov v2.16b, v0.16b
mov v0.16b, v1.16b
fmla v0.4s, v4.4s, v3.4s
fmul v3.4s, v3.4s, v2.4s
fmla v0.4s, v3.4s, v0.4s
ret

the 3rd move is there because the pre-register allocation scheduler created a
false dependency by scheduling the fmul after the fmla. This forces reload to
create an extra copy to keep `v0` alive after the destructive operation.

with -O2  -fno-schedule-insns we get

__v_sinf(__Float32x4_t, __Float32x4_t, __Float32x4_t, __Float32x4_t):
fmul v3.4s, v3.4s, v3.4s
mov v4.16b, v0.16b
fmla v0.4s, v2.4s, v3.4s
mov v2.16b, v4.16b
fmla v2.4s, v0.4s, v3.4s
mov v0.16b, v1.16b
fmla v4.4s, v2.4s, v3.4s
fmla v0.4s, v2.4s, v3.4s
fmul v3.4s, v3.4s, v4.4s
fmla v0.4s, v3.4s, v0.4s
ret

In glibc these additional moves cost double-digit percentages of performance by
breaking up the fmla chains.


Should we perhaps use a special RMW scheduling attribute to make it treat the
last input as an output too?
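
For illustration, the read-modify-write constraint at the source level (a
sketch using the intrinsic from the testcase; the function and variable names
are made up):

#include <arm_neon.h>

/* fmla overwrites its accumulator operand, so when the old accumulator value
   is still live after the FMA the register allocator has to copy it first --
   that copy is the extra mov discussed above.  Sketch only.  */
float32x4_t use_both (float32x4_t acc, float32x4_t x, float32x4_t y)
{
  float32x4_t r = vfmaq_f32 (acc, x, y);  /* wants to reuse acc's register  */
  return vaddq_f32 (r, acc);              /* but acc is still needed here   */
}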

[Bug middle-end/106534] [13 Regression][gcn] ICE 'verify_ssa failed' / 'error: definition in block 18 does not dominate use in block 19' during libgfortran bootstrap

2022-08-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106534

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #11 from Tamar Christina  ---
Fixed now.

[Bug middle-end/106534] [13 Regression][gcn] ICE 'verify_ssa failed' / 'error: definition in block 18 does not dominate use in block 19' during libgfortran bootstrap

2022-08-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106534

--- Comment #9 from Tamar Christina  ---
(In reply to Tamar Christina from comment #6)
> (In reply to rguent...@suse.de from comment #5)
> > On Fri, 5 Aug 2022, burnus at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106534
> > > 
> > > --- Comment #4 from Tobias Burnus  ---
> > > Seems as if the ICE is fixed for that file by reverting the following 
> > > commit:
> > > 
> > > commit r13-1963-gc832ec4c3ec4853ad89ff3b0dbf6e9454e75e8cc
> > > middle-end: Fix phi-ssa assertion triggers.  [PR106519]
> > 
> > As hinted in the e-mail exchange the earlier two transforms might need
> > guarding with !diamond_p since IIRC only minmax is capable of dealing
> > with it?
> 
> Yes, though I didn't find any failing cases, so I didn't guard store elim.
> value_replacement is a weird one as it looks like it's already looking for
> some kind of diamond form.  I can guard both to be safe, but that's
> essentially doing a blind change if I can't reproduce this locally.

This does seem to fix it, so bootstrapping now and will put up a patch once
done.

[Bug middle-end/106534] [13 Regression][gcn] ICE 'verify_ssa failed' / 'error: definition in block 18 does not dominate use in block 19' during libgfortran bootstrap

2022-08-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106534

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
   Last reconfirmed||2022-08-05
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED

--- Comment #8 from Tamar Christina  ---
(In reply to Tobias Burnus from comment #7)
> (In reply to Tamar Christina from comment #3)
> > Thanks for the repro, could you tell me what target I need to use or what
> > configure options to build amdgcn-amdhsa?
> 
> See
> https://gcc.gnu.org/wiki/Offloading#How_to_build_an_offloading-enabled_GCC
> for some guidance. (But that's more for building a full compiler and more
> than you need.)
> 
> I think you just need to know:  --target=amdgcn-amdhsa

Thanks! Able to reproduce it now!

[Bug middle-end/106534] [13 Regression][gcn] ICE 'verify_ssa failed' / 'error: definition in block 18 does not dominate use in block 19' during libgfortran bootstrap

2022-08-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106534

--- Comment #6 from Tamar Christina  ---
(In reply to rguent...@suse.de from comment #5)
> On Fri, 5 Aug 2022, burnus at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106534
> > 
> > --- Comment #4 from Tobias Burnus  ---
> > Seems as if the ICE is fixed for that file by reverting the following 
> > commit:
> > 
> > commit r13-1963-gc832ec4c3ec4853ad89ff3b0dbf6e9454e75e8cc
> > middle-end: Fix phi-ssa assertion triggers.  [PR106519]
> 
> As hinted in the e-mail exchange the earlier two transforms might need
> guarding with !diamond_p since IIRC only minmax is capable of dealing
> with it?

Yes, though I didn't find any failing cases, so I didn't guard store elim.
value_replacement is a weird one as it looks like it's already looking for some
kind of diamond form.  I can guard both to be safe, but that's essentially
doing a blind change if I can't reproduce this locally.

[Bug middle-end/106534] [13 Regression][gcn] ICE 'verify_ssa failed' / 'error: definition in block 18 does not dominate use in block 19' during libgfortran bootstrap

2022-08-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106534

--- Comment #3 from Tamar Christina  ---
Thanks for the repro, could you tell me what target I need to use or what
configure options to build amdgcn-amdhsa?

[Bug middle-end/106519] [13 Regression] internal compiler error: in gimple_phi_arg, at gimple.h:4594 by r13-1950

2022-08-04 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106519

Tamar Christina  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Tamar Christina  ---
(In reply to Martin Liška from comment #7)
> Is it fixed now?

Yes, it should be; I tested the x86 -m32 failures, but the other arches should be
the same.

[Bug middle-end/106519] [13 Regression] internal compiler error: in gimple_phi_arg, at gimple.h:4594 by r13-1950

2022-08-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106519

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2022-08-03
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Ever confirmed|0   |1

--- Comment #2 from Tamar Christina  ---
The condition checks that the two BBs share the same successor but forgot to
check that both BBs have only one successor.

It looks like with -m32 (and powerpc) the order of the edges just happens to
match and the assert triggers.

Testing a patch overnight and will post tomorrow.
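
For reference, the missing guard is roughly of this shape (a sketch, not the
actual tree-ssa-phiopt.cc code; single_succ_p/single_succ are the existing
helpers from basic-block.h):

  /* Besides sharing a successor, each block must have exactly one successor
     for the transform to be safe.  Sketch only.  */
  if (single_succ_p (bb1) && single_succ_p (bb2)
      && single_succ (bb1) == single_succ (bb2))
    /* ... ok to proceed ... */;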

[Bug target/106346] [11/12/13 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94

2022-07-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

--- Comment #6 from Tamar Christina  ---
(In reply to Richard Biener from comment #5)
> (In reply to Tamar Christina from comment #4)
> > I believe the problem is actually g:27842e2a1eb26a7eae80b8efd98fb8c8bd74a68e
> > 
> > We added an optab for the widening left shift pattern there however the
> > operation requires a uniform shift constant to work. See
> > https://godbolt.org/z/4hqKc69Ke
> > 
> > The existing pattern that deals with this is vect_recog_widen_shift_pattern
> > which is a scalar pattern.  during build_slp it validates that constants are
> > the same and when they're not it aborts SLP.  This is why we lose
> > vectorization.  Eventually we hit V4HI for which we have no widening shift
> > optab for and it vectorizes using that low VF.
> > 
> > This example shows a number of things wrong:
> > 
> > 1. The generic costing seems off, this sequence shouldn't have been
> > generated, as a vector sequence it's more inefficient than the scalar
> > sequence. Using -mcpu=neoverse-n1 or any other costing structure correctly
> > only gives scalar.
> > 
> > 2. vect_recog_widen_shift_pattern is implemented in the wrong place.  It
> > predates the existence of the SLP pattern matcher. Because of the uniform
> > requirements it's better to use the SLP pattern matcher where we have access
> > to all the constants to decide whether the pattern is a match or not.  That
> > way we don't abort SLP. Are you ok with this as a fix Richi?
> 
> patterns are difficult beasts - I think vect_recog_widen_shift_pattern is
> at the correct place but instead what is lacking is SLP discovery support
> for scrapping it - that is, ideally the vectorizer would take patterns as
> a hint and ignore them when they are not helpful.

Hmm, yes, but the problem is that we've consumed additional related statements
which now need to be handled by build_slp as well. I suppose you could do an
in-place build_slp on the pattern stmt seq iterator, though that seems like
undoing a mistake.

> 
> Now - in theory, for SLP vectorization, all patterns could be handled
> as SLP patterns and scalar patterns disabled.  But that isn't easy to
> do either.

As long as we still have the non-SLP loop vect it's probably not a good idea,
no? Since we then lose all patterns for it.  The widening shift was already
sufficiently limited that it wouldn't really regress here.

> 
> I fear to fight this regression the easiest route is to pretend the
> ISA can do widen shift by vector and fixup in the expander ...

I can do this, but we're hiding the cost then. Or did you want me to fudge the
numbers in the costing hooks?

> 
> > 3. The epilogue costing seems off..
> > 
> > This example https://godbolt.org/z/YoPcWv6Td ends up generating an
> > exceptionally high epilogue cost and so thinks vectorization at the higher
> > VF is not profitable.
> > 
> > *src1_18(D) 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 8B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 10B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 12B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 14B] 1 times vec_to_scalar costs 2 in epilogue
> > /app/example.c:16:12: note: Cost model analysis for part in loop 0:
> >   Vector cost: 23
> >   Scalar cost: 17
> 
> I don't see any epilogue cost - the example doesn't have a loop.  With BB
> vect you could see no epilogue costs?

That was my expectation, but see e.g. https://godbolt.org/z/MGEMYEe86: the SLP
path shows the above output.  I don't understand where the vec_to_scalar costs
come from.

> 
> > For some reason it thinks it needs a scalar epilogue? using
> > -fno-vect-cost-model gives the desired codegen.

[Bug target/106346] [11/12/13 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94

2022-07-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

Tamar Christina  changed:

   What|Removed |Added

Summary|Potential regression on |[11/12/13 Regression]
   |vectorization of left shift |Potential regression on
   |with constants since|vectorization of left shift
   |r11-5160-g9fc9573f9a5e94|with constants since
   ||r11-5160-g9fc9573f9a5e94
   Priority|P3  |P2
 CC||rguenth at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
   Target Milestone|--- |11.5
 Status|NEW |ASSIGNED

--- Comment #4 from Tamar Christina  ---
I believe the problem is actually g:27842e2a1eb26a7eae80b8efd98fb8c8bd74a68e

We added an optab for the widening left shift pattern there however the
operation requires a uniform shift constant to work. See
https://godbolt.org/z/4hqKc69Ke
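
As a hypothetical illustration of the uniformity requirement (not the PR's
testcase): the AArch64 widening shifts take a single immediate shift amount,
so per-lane shift amounts like the ones below cannot use the widening-shift
form.

#include <stdint.h>

/* Hypothetical example: adjacent lanes use different shift amounts, so a
   widening shift with one immediate cannot cover the pair.  */
void widen_shift (int16_t *restrict out, const int8_t *restrict in)
{
  for (int i = 0; i < 64; i += 2)
    {
      out[i]     = in[i]     << 4;
      out[i + 1] = in[i + 1] << 6;
    }
}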

The existing pattern that deals with this is vect_recog_widen_shift_pattern
which is a scalar pattern.  during build_slp it validates that constants are
the same and when they're not it aborts SLP.  This is why we lose
vectorization.  Eventually we hit V4HI for which we have no widening shift
optab for and it vectorizes using that low VF.

This example shows a number of things wrong:

1. The generic costing seems off: this sequence shouldn't have been generated,
as a vector sequence it's more inefficient than the scalar sequence. Using
-mcpu=neoverse-n1 or any other costing structure correctly only gives scalar.

2. vect_recog_widen_shift_pattern is implemented in the wrong place.  It
predates the existence of the SLP pattern matcher. Because of the uniform
requirements it's better to use the SLP pattern matcher, where we have access to
all the constants to decide whether the pattern is a match or not.  That way we
don't abort SLP. Are you ok with this as a fix, Richi?

3. The epilogue costing seems off..

This example https://godbolt.org/z/YoPcWv6Td ends up generating an
exceptionally high epilogue cost and so thinks vectorization at the higher VF
is not profitable.

*src1_18(D) 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 8B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 10B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 12B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 14B] 1 times vec_to_scalar costs 2 in epilogue
/app/example.c:16:12: note: Cost model analysis for part in loop 0:
  Vector cost: 23
  Scalar cost: 17

For some reason it thinks it needs a scalar epilogue? using
-fno-vect-cost-model gives the desired codegen.

[Bug target/106346] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94

2022-07-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina  ---
Joel is no longer working on GCC; I'll fix the regression next week when back
from holidays.

[Bug target/106253] [13 Regression] ICE in vect_transform_loops, at tree-vectorizer.cc:1032

2022-07-13 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106253

--- Comment #10 from Tamar Christina  ---
For completeness, I reduced the Armhf failure and that seems to happen on
bswap.

#include <stdint.h>
#include <byteswap.h>

void
__sha256_process_block (uint32_t *buffer, size_t len, uint32_t *W)
{
 for (unsigned int t = 0; t < 16; ++t)
 {
   W[t] = __bswap_32 (*buffer);
   ++buffer;
 }
}

will ICE at -O3

[Bug target/106253] [13 Regression] ICE in vect_transform_loops, at tree-vectorizer.cc:1032

2022-07-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106253

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org
 Target|aarch64-linux-gnu   |aarch64-linux-gnu,
   ||arm-none-linux-gnueabihf

--- Comment #6 from Tamar Christina  ---
Same problem happens with Armhf when building libc.

during GIMPLE pass: vect
In file included from sha256.c:213:
./sha256-block.c: In function ‘__sha256_process_block’:
./sha256-block.c:6:1: internal compiler error: in vect_transform_loops, at
tree-vectorizer.cc:1032
6 | __sha256_process_block (const void *buffer, size_t len, struct
sha256_ctx *ctx)
  | ^~

[Bug middle-end/106196] [13 Regression] vect_do_peeling ICE since g:3769ad4ccea9589b3f7edaef901cb542aa10f49a

2022-07-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106196

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|REOPENED|RESOLVED

--- Comment #7 from Tamar Christina  ---
(In reply to Tamar Christina from comment #6)
> (In reply to Richard Biener from comment #5)
> > (In reply to Tamar Christina from comment #4)
> > > Some benchmarks are still failing with the same error, just different line
> > > I am reducing a testcase now.
> > 
> > Also after r13-1575-gcf3a120084e946?
> 
> Ah no, that's not included yet. I've kicked off a build, should know soon.

Looks like some of the benchmarks are fixed with that, but I see some are
failing with another vectorizer ICE. It's probably unrelated, so I'll open a new
ticket.

[Bug middle-end/106196] [13 Regression] vect_do_peeling ICE since g:3769ad4ccea9589b3f7edaef901cb542aa10f49a

2022-07-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106196

--- Comment #6 from Tamar Christina  ---
(In reply to Richard Biener from comment #5)
> (In reply to Tamar Christina from comment #4)
> > Some benchmarks are still failing with the same error, just different line
> > I am reducing a testcase now.
> 
> Also after r13-1575-gcf3a120084e946?

Ah no, that's not included yet. I've kicked off a build, should know soon.

[Bug middle-end/106196] [13 Regression] vect_do_peeling ICE since g:3769ad4ccea9589b3f7edaef901cb542aa10f49a

2022-07-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106196

Tamar Christina  changed:

   What|Removed |Added

 Resolution|FIXED   |---
 Status|RESOLVED|REOPENED

--- Comment #4 from Tamar Christina  ---
Some benchmarks are still failing with the same error, just different line
numbers

during GIMPLE pass: vect
include/dofs/dof_tools.h: In function 'make_sparsity_pattern':
include/dofs/dof_tools.h:1971: internal compiler error: in vect_do_peeling, at
tree-vect-loop-manip.cc:2702
 1971 | DoFTools::make_sparsity_pattern (
  | 
0xdc683f vect_do_peeling(_loop_vec_info*, tree_node*, tree_node*, tree_node**,
tree_node**, tree_node**, int, bool, bool, tree_node**)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop-manip.cc:2702
0xdbbbdb vect_transform_loop(_loop_vec_info*, gimple*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:9877
0xde9483 vect_transform_loops
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1006
0xde9483 try_vectorize_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1136
0xde9483 try_vectorize_loop
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1165
0xde9bbf execute
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1281

I am reducing a testcase now.

[Bug tree-optimization/106217] New: [11/12/13 Regression] sinking of loads prevents vectorization

2022-07-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106217

Bug ID: 106217
   Summary: [11/12/13 Regression] sinking of loads prevents
vectorization
   Product: gcc
   Version: 11.3.1
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following example

void f(int n, float * __restrict x, float *mass, float max, float &ax)
{
    float lax = 0.0f;

    for (int i = 0; i < n; ++i) {
        float f = x[i] * mass[i];
        lax += x[i] >= max ? 0 : f * x[i];
    }

    ax += lax;
    return;
}

vectorizes with GCC 10 with -Ofast but fails starting with GCC 11.
The difference is that in the sink1 pass we now sink the unconditional load of
mass[i] into the conditional.

Sinking # VUSE <.MEM_14(D)>
_7 = *_6;
 from bb 3 to bb 4
Sinking _6 = mass_18(D) + _2;
 from bb 3 to bb 4

Consequently, this causes if-convert to no longer be able to if-convert the loop,
because the load is now conditional, and we only accept unconditional loads in
ifcvt_memrefs_wont_trap.
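
A hand-written sketch of what the loop effectively looks like after sinking
(illustration only, not the actual GIMPLE): the load of mass[i] now happens
only on the else branch, so ifcvt_memrefs_wont_trap has to treat it as a
conditional, possibly-trapping access.

  for (int i = 0; i < n; ++i) {
      float xi = x[i];
      if (xi >= max)
          lax += 0.0f;
      else
          lax += (xi * mass[i]) * xi;   /* mass[i] is now loaded conditionally */
  }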

-
_6 = mass_18(D) + _2;
-
_7 = *_6;
tree could trap...

Perhaps during the sinking we should mark the load as non-trapping, as it was
originally unconditional anyway?

[Bug middle-end/106196] New: [13 Regression] vect_do_peeling ICE since g:3769ad4ccea9589b3f7edaef901cb542aa10f49a

2022-07-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106196

Bug ID: 106196
   Summary: [13 Regression] vect_do_peeling ICE since
g:3769ad4ccea9589b3f7edaef901cb542aa10f49a
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

The following testcase

char a;
char *b;
void e() {
  char *d;
  int c;
  d = &a;
  for (; c; c++) {
    d[2] = d[1] = d[0] = b[c];
    d += 3;
  }
}

ICEs with g++ -O3 -std=c++14

during GIMPLE pass: vect
jdcolor.ii: In function ‘void e()’:
jdcolor.ii:3:6: internal compiler error: in vect_do_peeling, at
tree-vect-loop-manip.cc:2689
3 | void e() {
  |  ^
0x102183f vect_do_peeling(_loop_vec_info*, tree_node*, tree_node*, tree_node**,
tree_node**, tree_node**, int, bool, bool, tree_node**)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop-manip.cc:2689
0x1016b1b vect_transform_loop(_loop_vec_info*, gimple*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:9747
0x1044363 vect_transform_loops
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1006
0x1044363 try_vectorize_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1136
0x1044363 try_vectorize_loop
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1165
0x1044a9f execute
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1281

[Bug bootstrap/106137] baremetal cross builds broken in libgfortran since g:133d0d422ebd18dbd215cfa5394aff9f938e7060

2022-06-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106137

--- Comment #4 from Tamar Christina  ---
(In reply to Jakub Jelinek from comment #2)
> Created attachment 53224 [details]
> gcc13-pr106137.patch
> 
> Perhaps this patch could fix this?

The patch does fix the build! I also have the 4 files you asked for, but I assume
you no longer need them?

[Bug bootstrap/106137] baremetal cross builds broken in libgfortran since g:133d0d422ebd18dbd215cfa5394aff9f938e7060

2022-06-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106137

--- Comment #3 from Tamar Christina  ---
(In reply to Jakub Jelinek from comment #1)
> Could you please attach
> */libgfortran/Makefile
> */libgfortran/config.h
> from the build dir before/after that commit?

Waiting for a build to finish to grab them

> Perhaps this patch could fix this?

Will kick off a build with the patch! Thanks!

[Bug bootstrap/106137] New: baremetal cross builds broken in libgfortran since g:133d0d422ebd18dbd215cfa5394aff9f938e7060

2022-06-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106137

Bug ID: 106137
   Summary: baremetal cross builds broken in libgfortran since
g:133d0d422ebd18dbd215cfa5394aff9f938e7060
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: build
  Severity: normal
  Priority: P3
 Component: bootstrap
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
  Target Milestone: ---
  Host: aarch64-none-linux-gnu
Target: aarch64-none-elf

Cross builds fail with:

make[1]: [Makefile:1786: aarch64-none-elf/bits/largefile-config.h] Error 1
(ignored)
make[1]: [Makefile:1787: aarch64-none-elf/bits/largefile-config.h] Error 1
(ignored)
In file included from
/opt/buildAgent/work/ca23804305872d31/gcc-src/libgfortran/caf/libcaf.h:32,
 from
/opt/buildAgent/work/ca23804305872d31/gcc-src/libgfortran/caf/single.c:26:
/opt/buildAgent/work/ca23804305872d31/gcc-src/libgfortran/libgfortran.h:62:11:
fatal error: quadmath_weak.h: No such file or directory
   62 | # include "quadmath_weak.h"
  |   ^
compilation terminated.

But admittedly I don't understand why the change in the header, which seems to
tighten the condition, makes it fail now...

[Bug tree-optimization/106106] SRA scalarizes structure copies

2022-06-28 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106

--- Comment #4 from Tamar Christina  ---
> 
> Is the fact that float32x2x2_t is an aggregate with a field named 'val'
> part of the neon API?

Yeah, it's mandated by ACLE
https://arm-software.github.io/acle/main/acle.html#vector-array-data-types-1
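
For reference, ACLE specifies the array-of-vectors type roughly as the
following aggregate (paraphrased; this mirrors the arm_neon.h definition):

  typedef struct float32x2x2_t
  {
    float32x2_t val[2];
  } float32x2x2_t;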

> 
> We could heuristically avoid to scalarize arrays when the aggregate has
> a vector mode.  Alternatively instead of scalarizing to the array
> element type we could choose the type of the aggregate mode (but only
> when doing total scalarization, that is, when there are no component
> uses or defs).

I was wondering about this as well, since in principle this would have been a
win if the user had manually extracted one of the vectors.  The scalarization
would have allowed us to ignore the rest of the vector earlier in gimple.  It's
the fact that the type is used as a whole that seems to be the problem.

[Bug tree-optimization/106106] SRA scalarizes structure copies

2022-06-28 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106

--- Comment #2 from Tamar Christina  ---
(In reply to Richard Biener from comment #1)
> SRA is eliding 'v' by doing what it does, so it essentially changes
> it looks like providing __builtin_neon_vld2_lanev2sf with float32x2x2
> argument and return type might avoid one copy.
> 

We already do; the UNSPEC is:

(insn 11 10 12 2 (set (reg:V2x2SF 95 [ D.22913 ])
(unspec:V2x2SF [
(mem:BLK (reg/v/f:DI 100 [ p2 ]) [0  S8 A8])
(reg/v:V2x2SF 97 [ __b ])
(const_int 1 [0x1])
] UNSPEC_LD2_LANE))
"/opt/compiler-explorer/arm64/gcc-trunk-20220628/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/13.0.0/include/arm_neon.h":17515:10
-1
 (nil))

> In any case improving register allocation or massaging the RTL before it
> is the way to go here.  How does the RTL IL fed to RA differ with/without
> SRA?

I am not sure this is a reload problem. The underlying type of float32x2x2_t,
which is V2x2SF, always reserves two sequential registers.

without SRA we get

(insn 8 7 9 2 (set (reg/v:V2x2SF 95 [ v ])
(reg:V2x2SF 92 [ D.22915 ])) -1
 (nil))
(insn 9 8 10 2 (set (reg/v:V2x2SF 96 [ __b ])
(reg/v:V2x2SF 95 [ v ])) -1
 (nil))

which is simple to eliminate as it's copying the whole structure in one go, and
reload eliminates the extra move fine.  With SRA scalarization you end up with
a series of subregs:

(insn 8 7 9 2 (set (reg:V2SF 93 [ v$val$1 ])
(subreg:V2SF (reg:V2x2SF 94 [ D.22915 ]) 8)) -1
 (nil))
(insn 9 8 10 2 (set (subreg:V2SF (reg/v:V2x2SF 97 [ __b ]) 0)
(subreg:V2SF (reg:V2x2SF 94 [ D.22915 ]) 0)) -1
 (nil))
(insn 10 9 11 2 (set (subreg:V2SF (reg/v:V2x2SF 97 [ __b ]) 8)
(reg:V2SF 93 [ v$val$1 ])) -1
 (nil))

So we get an explicit extract and a piecewise recreation of the V2x2SF; 94 will
take two sequential registers and 97 two different ones.  Reload is just doing
as it was told.

[Bug tree-optimization/106106] New: SRA scalarizes structure copies

2022-06-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106

Bug ID: 106106
   Summary: SRA scalarizes structure copies
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

The following example

#include <arm_neon.h>

float32x2x2_t f2(const float *p1, const float *p2)
{
    float32x2x2_t v = vld2_f32(p1);
    return vld2_lane_f32(p2, v, 1);
}

uses a type `float32x2x2_t` which is an array consisting of two `float32x2_t`
types.  This type fits within the maximum object size for SRA so it tries to
scalarize it.

However, doing so makes some useless copies:

  D.22939 = __builtin_aarch64_ld2v2sf (p1_2(D));
  v = D.22939;
  __b = v;
  D.22937 = __builtin_aarch64_ld2_lanev2sf (p2_3(D), __b, 1); [tail call]

becomes

  D.22939 = __builtin_aarch64_ld2v2sf (p1_2(D));
  v$val$0_3 = D.22939.val[0];
  v$val$1_5 = D.22939.val[1];
  __b.val[0] = v$val$0_3;
  __b.val[1] = v$val$1_5;
  D.22937 = __builtin_aarch64_ld2_lanev2sf (p2_4(D), __b, 1); [tail call]

Having broken the structure up, it causes problems for register allocation, as
these types require sequential register allocation and reload is unable to
consolidate all the copies, resulting in superfluous register moves:

f2:
ld2 {v2.2s - v3.2s}, [x0]
mov v0.8b, v2.8b
mov v1.8b, v3.8b
ld2 {v0.s - v1.s}[1], [x1]
ret

The following snippet from a real library using intrinsics shows the resulting
carnage https://godbolt.org/z/xnre3Pe34.

Perhaps SRA should not scalarize a type if it's just being used in a copy? Or
have a way to prevent scalarization of certain types?

[Bug tree-optimization/106063] [12/13 Regression] ICE: in gimple_expand_vec_cond_expr, at gimple-isel.cc:281 with -O2 -fno-tree-forwprop --param=evrp-mode=legacy-first

2022-06-26 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106063

--- Comment #4 from Tamar Christina  ---
Ah, there's optimize_vectors_before_lowering_p.

Would you prefer I check the operation or just gate the pattern on the above,
Richi?
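
For illustration, the kind of gate being discussed could look roughly like this
(a sketch only; everything except optimize_vectors_before_lowering_p and
VECTOR_TYPE_P is a made-up placeholder):

  /* Only introduce the vector operation when veclower can still legalize it,
     or when the target is known to support it.  Sketch only.  */
  if (!VECTOR_TYPE_P (type)
      || optimize_vectors_before_lowering_p ()
      || target_supports_the_operation)   /* placeholder condition */
    apply_simplification ();              /* placeholder */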

[Bug tree-optimization/106063] [12/13 Regression] ICE: in gimple_expand_vec_cond_expr, at gimple-isel.cc:281 with -O2 -fno-tree-forwprop --param=evrp-mode=legacy-first

2022-06-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106063

--- Comment #3 from Tamar Christina  ---
(In reply to Richard Biener from comment #2)
> 
> but after vector lowering only vector operations that are handled by the
> target may be introduced.  The pattern
> 

We can't tell that we're after veclower, can we? So does it make sense to just
never introduce a vector operation the target has no optab for?

[Bug middle-end/105874] [13 Regression] Incorrect codegen and ICE since g:ed6fd2aed58f2cca99f15331bf68999c0e6df370

2022-06-09 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105874

--- Comment #8 from Tamar Christina  ---
Can confirm that the benchmark works again.

Thanks!

[Bug middle-end/105874] [13 Regression] Incorrect codegen and ICE since g:ed6fd2aed58f2cca99f15331bf68999c0e6df370

2022-06-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105874

--- Comment #5 from Tamar Christina  ---
Ah, great, Thanks Roger!

I did end up reducing it to:

template <int a> class b {
public:
  int c[a];
  int operator[](long d) const { return c[d]; }
};
class board {
  bool is_eye(int, int);
  static const b<2> e;
};
const b<2> board::e{{}};
bool board::is_eye(int d, int) {
  int f(e[d]);
  if (f)
return false;
  return true;
}

but looks like you already have a patch :)

Nice patch btw!

[Bug middle-end/105874] [13 Regression] Incorrect codegen and ICE since g:ed6fd2aed58f2cca99f15331bf68999c0e6df370

2022-06-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105874

--- Comment #3 from Tamar Christina  ---
(In reply to Roger Sayle from comment #1)
> Hi Tamar.
> I'm truly sorry for the inconvenience.  Can you try reducing again now that
> the load_register_parameters issue with the "small const structs as
> immediate constants" patch has been resolved?  A small failing testcase from
> Leela's FastBoard::is_eye would be extremely helpful.  Thanks in advance.

No worries, I can see the ICE is indeed resolved; I'll minimize the benchmark.

Cheers!

[Bug middle-end/105874] New: [13 Regression] Incorrect codegen and ICE since g:ed6fd2aed58f2cca99f15331bf68999c0e6df370

2022-06-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105874

Bug ID: 105874
   Summary: [13 Regression] Incorrect codegen and ICE since
g:ed6fd2aed58f2cca99f15331bf68999c0e6df370
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code, wrong-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
CC: sayle at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

In SPECCPU 2017 Leela no longer terminates since
g:ed6fd2aed58f2cca99f15331bf68999c0e6df370

Looking at the differences in the code, there seems to be a lot of additional
useless calculation around functions such as _ZN9FastBoard6is_eyeEii.

Filtering through them, it looks like the change is causing loads from
uninitialized stack space.

Before the change the code generated

```
_ZN9FastBoard6is_eyeEii:
adrp x3, <>
mov x4, #0x1ba4 // #7076
add x4, x0, x4
add x3, x3, #0xb20
ldrh w4, [x4, w2, sxtw #1]
ldr w3, [x3, w1, sxtw #2]
tst w4, w3
```

So it loaded from a fixed anchor into rdata.  After the change

```
_ZN9FastBoard6is_eyeEii:
sub sp, sp, #0x20
mov x4, #0x1ba4
add x5, x0, x4
add x4, sp, #0x8
ldrh w5, [x5, w2, sxtw #1]
ldr w4, [x4, w1, sxtw #2]
tst w5, w4
```

So it allocates 32 bytes of stack and then decides to load from uninitialized
space at sp+0x8.

I tried to create a minimal reproducer, but the compiler ICEs as you get close;
e.g. even the example from ticket PR95126:

struct small{ short a,b; signed char c; };
extern int func(struct small X);
void call_func(void)
{
static struct small const s = { 1, 2, 0 };
func(s);
}

ICEs at -O2 with:

during RTL pass: expand
../borked.c: In function 'call_func':
../borked.c:6:5: internal compiler error: in emit_move_insn, at expr.cc:4011
6 | func(s);
  | ^~~
0x909253 emit_move_insn(rtx_def*, rtx_def*)
/ci/work/5c94c4ced6ebfcd0/gcc/expr.cc:4011
0x7eda3f load_register_parameters
/ci/work/5c94c4ced6ebfcd0/gcc/calls.cc:2192
0x7f2183 expand_call(tree_node*, rtx_def*, int)
/ci/work/5c94c4ced6ebfcd0/gcc/calls.cc:3593
0x905ccb expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
/ci/work/5c94c4ced6ebfcd0/gcc/expr.cc:11621
0x8057e3 expand_expr
/ci/work/5c94c4ced6ebfcd0/gcc/expr.h:301
0x8057e3 expand_call_stmt
/ci/work/5c94c4ced6ebfcd0/gcc/cfgexpand.cc:2831
0x8057e3 expand_gimple_stmt_1
/ci/work/5c94c4ced6ebfcd0/gcc/cfgexpand.cc:3869
0x8057e3 expand_gimple_stmt
/ci/work/5c94c4ced6ebfcd0/gcc/cfgexpand.cc:4033
0x80a44b expand_gimple_tailcall
/ci/work/5c94c4ced6ebfcd0/gcc/cfgexpand.cc:4079
0x80a44b expand_gimple_basic_block
/ci/work/5c94c4ced6ebfcd0/gcc/cfgexpand.cc:6059
0x80cbbf execute
/ci/work/5c94c4ced6ebfcd0/gcc/cfgexpand.cc:6811

So I can't really reduce it at this point.

[Bug target/102218] 128-bit atomic compare and exchange does not honor memory model on AArch64 and Arm

2022-05-26 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102218

Tamar Christina  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2022-05-26
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

[Bug tree-optimization/94793] Failure to optimize clz idiom

2022-05-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94793

Tamar Christina  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |andrew.carlotti at arm 
dot com

--- Comment #4 from Tamar Christina  ---
Assigning it to Andrew, who'll take a crack at it in GCC 13.

[Bug tree-optimization/105451] New: miss optimizations due to inconsistency in complex numbers associativity

2022-05-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105451

Bug ID: 105451
   Summary: miss optimizations due to inconsistency in complex
numbers associativity
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: tnfchris at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The two functions

#include <complex.h>

void f (complex double * restrict a, complex double *restrict b,
complex double *restrict c, complex double *res, int n)
{
  for (int i = 0; i < n; i++)
   res[i] = a[i] * (b[i] * c[i]);
}

and

void g (complex double * restrict a, complex double *restrict b,
complex double *restrict c, complex double *res, int n)
{
  for (int i = 0; i < n; i++)
   res[i] = (a[i] * b[i]) * c[i];
}

At -Ofast they produce the same code, but internally they get there using
different SLP trees.

The former creates a chain of VEC_PERM_EXPR nodes, as is expected
(tinyurl.com/cmulslp1); however, the latter avoids the need for the permutes by
duplicating the elements of the complex number (https://tinyurl.com/cmulslp2).

The former we can detect as back-to-back complex multiplication, but the latter
we can't.

Not sure what the best way to get consistency here is.

[Bug testsuite/105095] gcc.dg/vect/complex/fast-math-complex-* tests are not executed

2022-04-28 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105095

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|REOPENED|RESOLVED

--- Comment #9 from Tamar Christina  ---
.

[Bug target/105219] [12 Regression] SVE: Wrong code with -O3 -msve-vector-bits=128 -mtune=thunderx

2022-04-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105219

Tamar Christina  changed:

   What|Removed |Added

   Assignee|tnfchris at gcc dot gnu.org|avieira at gcc dot 
gnu.org

--- Comment #12 from Tamar Christina  ---
Won't have time to continue on this this week, so I'm passing it to Andre in the
interest of getting it unblocked sooner.

[Bug target/105219] [12 Regression] SVE: Wrong code with -O3 -msve-vector-bits=128 -mtune=thunderx

2022-04-13 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105219

--- Comment #10 from Tamar Christina  ---
nb_iterations_upper_bound is indeed set incorrectly; I tracked it down to this commit:

commit 7ed1cd9665d8ca0fa07b2483e604c25e704584af
Author: Andre Vieira 
Date:   Thu Jun 3 13:55:24 2021 +0100

vect: Use main loop's thresholds and VF to narrow upper_bound of epilogue

    This patch uses the knowledge of the conditions to enter an epilogue loop to
    help come up with a potentially more restrictive upper bound.

gcc/ChangeLog:

* tree-vect-loop.c (vect_transform_loop): Use main loop's various'
thresholds to narrow the upper bound on epilogue iterations.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/part_vect_single_iter_epilog.c: New test.

I don't quite understand why, when comparing the two bounds, there's a -1 in the
min comparison.

And indeed:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 5c7b163f01c..19235ea79fe 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -9971,7 +9971,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple
*loop_vectorized_call)
LOOP_VINFO_VECT_FACTOR (loop_vinfo),
))
loop->nb_iterations_upper_bound
- = wi::umin ((widest_int) (bound - 1),
+ = wi::umin ((widest_int)bound,
  loop->nb_iterations_upper_bound);
   }
   }

fixes it.  It looks to me like, when comparing the bounds of the main loop and
epilogue, you shouldn't subtract 1 again.  But I need to ask why it is there.

[Bug target/105219] [12 Regression] SVE: Wrong code with -O3 -msve-vector-bits=128 -mtune=thunderx

2022-04-13 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105219

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #9 from Tamar Christina  ---
This was a bit annoying to track down...

So BB 68 gets merged into BB 51.

BB 51 is loading 4 ints, 8 - 12, out of 14.

So after it there are still 2 ints left to process.  However, it looks like the
compiler thinks that there can't be any more elements left.

It turns:

  next_mask_416 = .WHILE_ULT (12, bnd.38_326, { 0, 0, 0, 0 });
  if (next_mask_416 != { 0, 0, 0, 0 })
goto ; [75.00%]
  else
goto ; [25.00%]

into

  next_mask_373 = .WHILE_ULT (12, bnd.38_326, { 0, 0, 0, 0 });
  goto ; [100.00%]

where bb 56 is the lastb reduction bit.  So basically we're 2 scalar loop
iterations short.

This seems to indicate, like Richi suggested, that nb_iterations_upper_bound is
set wrong.

Perhaps it was confused with the profitability threshold, which went from 14 to
12.

Mine.

[Bug target/105219] [12 Regression] SVE: Wrong code with -O3 -msve-vector-bits=128 -mtune=thunderx

2022-04-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105219

--- Comment #8 from Tamar Christina  ---
Looks like the code out of the vectorizer is fine, and it doesn't change between
the two versions.

The big change is in the unroller, which decides to drop one of the BBs for some
reason.

It turns the loop mask phi node from

 # loop_mask_374 = PHI 

into

 # loop_mask_374 = PHI 

and deletes BB 68.  And indeed -fdisable-tree-cunroll fixes the problem.

It looks like BB68 in the broken version is being marked as unreachable by DFA
and removed.

Curiously, with no codegen difference in the output of vect, or in dce6 and pcom,
the DFA in cunroll finds a different number of BBs:

good:

Incremental SSA update started at block: 3
Number of blocks in CFG: 88
Number of blocks to update: 37 ( 42%)
Affected blocks: 4 18 28 34 45 47 50 51 53 54 56 58 59 60 61 62 63 64 65 66 67
68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83

bad:

Incremental SSA update started at block: 3
Number of blocks in CFG: 86
Number of blocks to update: 35 ( 41%)
Affected blocks: 4 18 28 34 45 47 50 51 53 54 56 58 59 60 61 62 63 64 65 66 67
68 69 70 71 72 73 74 75 76 77 78 79 80 81

Why did it lose 2 BBs?

Is there some info missing from the dump files here?

[Bug target/105219] [12 Regression] SVE: Wrong code with -O3 -msve-vector-bits=128 -mtune=thunderx

2022-04-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105219

--- Comment #7 from Tamar Christina  ---
I have managed to find the commit where this starts failing:

commit 61fc5e098e76c9809f35f449a70c9c8d74773d9d (HEAD)
Author: Richard Biener 
Date:   Fri Feb 18 11:34:52 2022 +0100

tree-optimization/104582 - Simplify vectorizer cost API and fixes

This simplifies the vectorizer cost API by providing overloads
to add_stmt_cost and record_stmt_cost suitable for scalar stmt
and branch stmt costing which do not need information like
a vector type or alignment.  It also fixes two mistakes where
costs for versioning tests were recorded as vector stmt rather
than scalar stmt.

This is a first patch to simplify the actual fix for PR104582.

2022-02-18  Richard Biener  

PR tree-optimization/104582
* tree-vectorizer.h (add_stmt_cost): New overload.
(record_stmt_cost): Likewise.
* tree-vect-loop.cc (vect_compute_single_scalar_iteration_cost):
Use add_stmt_costs.
(vect_get_known_peeling_cost): Use new overloads.
(vect_estimate_min_profitable_iters): Likewise.  Consistently
use scalar_stmt for costing versioning checks.
* tree-vect-stmts.cc (record_stmt_cost): New overload.

but I think this just exposes the issue.  The weird thing is that while this does
change the cost calculations during vect, the final vectorized code is exactly
the same out of both versions.

and indeed there's no change at all in the dumps until cunrolli, which in one
version unrolls more than the other, but I think that's just a costing
difference.  So I'm disabling it to continue to see at which pass there's
actually a codegen difference between these two commits.

[Bug target/105219] [12 Regression] SVE: Wrong code with -O3 -msve-vector-bits=128 -mtune=thunderx

2022-04-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105219

--- Comment #5 from Tamar Christina  ---
OK, I think this is an alignment issue.

When using the thunderx cost model it needs to peel the loop for alignment in
order to vectorize, and it looks like the error is there.

[Bug target/105219] [12 Regression] SVE: Wrong code with -O3 -msve-vector-bits=128 -mtune=thunderx

2022-04-11 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105219

--- Comment #4 from Tamar Christina  ---
(In reply to Alex Coplan from comment #3)
> (In reply to Tamar Christina from comment #1)
> > Smaller reproducer getting rid of the loop nest and simplify the inner
> > condition.
> 
> Hmm, I can't reproduce the issue with this locally (with or without the
> -mtune option). I also seem to need the -mtune option with the original
> reproducer. Are you definitely setting the right VL when running the
> compiled program?

Ah, my sve-max-vq was in the wrong place... I indeed see that the thunderx tuning
is significant now. Will try to reduce a bit again.

[Bug target/105219] [12 Regression] SVE: Wrong code with -O3 -msve-vector-bits=128 -mtune=thunderx

2022-04-11 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105219

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org
   Last reconfirmed||2022-4-11

--- Comment #1 from Tamar Christina  ---
Smaller reproducer, getting rid of the loop nest and simplifying the inner
condition.

int a;
char b[60];
short d[19];
long long f;

__attribute__ ((noinline, noipa))
void e(int g, int h, short k[19]) {
for (signed j = 1; j < h + 14; j++) {
  int i = 0;
  b[j] = 1;
  i = 2;
  b[i * 14 + j] = 1;
  a = g ? k[j] : 0;
}
}

__attribute__ ((optimize("O0")))
int main() {
  e(9, 1, d);
  for (long l = 0; l < 6; ++l)
    for (long m = 0; m < 4; ++m)
      f ^= b[l + m * 4];
  if (f)
    __builtin_abort ();
}

And the -mtune=thunderx doesn't seem to be needed. I can reproduce with just -O3
-march=armv8.2-a+sve -msve-vector-bits=128.

The write to `a` itself is significant, not for the result but to block the loop
from being optimized into a memset.

[Bug target/105197] [12 Regression] SVE: wrong code with -O -ftree-vectorize

2022-04-11 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105197

Tamar Christina  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Tamar Christina  ---
Fixed and no regression from codegen in GCC-11.

[Bug target/105197] [12 Regression] SVE: wrong code with -O -ftree-vectorize

2022-04-11 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105197

Tamar Christina  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #3 from Tamar Christina  ---
Looks like this started with

commit d846f225c25c5885250c303c8d118caa08c447ab
Author: Richard Biener 
Date:   Tue May 4 15:51:20 2021 +0200

tree-optimization/79333 - fold stmts following SSA edges in VN

This makes sure to follow SSA edges when folding eliminated stmts.
This reaps the same benefit as forwprop folding all stmts, not
waiting for one to produce copysign in the new testcase.

2021-05-04  Richard Biener  

PR tree-optimization/79333
* tree-ssa-sccvn.c (eliminate_dom_walker::eliminate_stmt):
Fold stmt following SSA edges.

* gcc.dg/tree-ssa/ssa-fre-94.c: New testcase.
* gcc.dg/graphite/fuse-1.c: Adjust.
* gcc.dg/pr43864-4.c: Likewise.

and what's happening is that the vectorizer relies on a mask A and its inverse
~A being represented by a negation of the mask.  Ifcvt used to enforce this, but
with the change it now pushes the ~ into the mask operation if it can.

So previously we would generate

  _26 = ~_25;
  _43 = ~_44;

out of ifcvt, and now we generate

  _26 = _6 == 0;
  _43 = _4 == 0;

which forces the creation of two extra masks and de-optimizes the vectorizer's
ability to immediately see a mask inversion.

We however still detect that those two are inverses of

  _25 = _6 != 0;
  _44 = _4 != 0;

and when generating the second VEC_COND for the operation we end up flipping
the arguments somehow

-->vectorizing statement: _ifc__41 = _43 ? 0 : _ifc__40;
created new init_stmt: vect_cst__136 = { 0, ... }
add new stmt: _137 = mask__43.26_135 & loop_mask_111
note:  add new stmt: vect__ifc__41.27_138 = VEC_COND_EXPR <_137,
vect__ifc__40.25_133, vect_cst__136>;

So we've vectorized _ifc__41 = _43 ? _ifc__40 : 0; instead, without negating
_137, which is where the contradiction gets introduced. I'll fix that bug, but
the question remains whether we want this simplification to happen in ifcvt for
masks at all.

It makes the vectorizer generate a lot more intermediate masks, which are
cleaned up by the RPO pass I added at the end, but we also lose the fact that
they are simple inverses; i.e. at -O3 on these integer masks we could have just
generated a NOT.
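
As a plain-C illustration of the inverse-mask relationship that gets lost (this
is not the vectorizer's IR, and the helper names are made up):

#include <stdint.h>

/* For an all-ones/all-zeros integer mask, the inverse predicate is just a
   bitwise NOT of the mask, which is the sharing ifcvt used to expose.  */
int32_t mask_ne_zero (int32_t x) { return -(x != 0); }           /* like _25/_44 */
int32_t mask_eq_zero (int32_t x) { return ~mask_ne_zero (x); }   /* like _26/_43 */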

[Bug target/105197] [12 Regression] SVE: wrong code with -O -ftree-vectorize

2022-04-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105197

Tamar Christina  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org
 CC||tnfchris at gcc dot gnu.org
   Last reconfirmed||2022-04-10
 Ever confirmed|0   |1

--- Comment #2 from Tamar Christina  ---
(In reply to Richard Biener from comment #1)
> the GIMPLE doesn't look wrong.  We're using an EXTRACT_LAST, so that might
> be the special thing.  Vectorization of the first loop is probably not
> necessary to trigger the failure.

Hmmm, it looks like the GIMPLE is wrong after all: the masks it combines create
a contradiction.

At GIMPLE we have

  mask__44.14_114 = vect__4.13_112 != { 0, ... };
  mask__26.22_128 = vect__6.17_121 == { 0, ... };
  mask_patt_65.24_130 = mask__44.14_114 & mask__26.22_128;
  mask__43.26_135 = vect__4.13_112 == { 0, ... };
  mask__25.18_123 = vect__6.17_121 != { 0, ... };
  _137 = mask__43.26_135 & loop_mask_111;
  _163 = mask_patt_65.24_130 & _137;

where _163 demands vect__4.13_112 != 0 && vect__4.13_112 == 0

  _163 should have been _163 = mask_patt_65.24_130 & loop_mask_111;

So it looks like the wrong loop masks are combined.
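
As a trivial scalar sketch of why that combined mask is a contradiction (the
function is hypothetical, just spelling out the logic):

/* x != 0 and x == 0 can never both hold, so a mask formed by ANDing the two
   predicates is all-false and every lane it guards is dead.  */
int contradiction (int x)
{
  return (x != 0) & (x == 0);   /* always 0 */
}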

Mine.

[Bug target/104039] [9/10/11/12 Regression] AArch64 Redundant instruction moving general to vector register

2022-04-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104039

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina  ---
This has been fixed in GCC 12, where we optimize the merge of two vec_duplicates
into a simple vec_concat: https://godbolt.org/z/dWYfnjKnq

Can we drop the 12 regression marker?

[Bug testsuite/105196] [12 regression] gcc.dg/vect/complex/fast-math-complex-add-pattern-float.c fails after r12-8044-gfdd81afcf18d1a

2022-04-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105196

Tamar Christina  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Tamar Christina  ---
thanks

[Bug target/104409] [12 Regression] -march=armv8.6-a+ls64 crashes, LS64 builtins causes ICE

2022-04-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104409

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #10 from Tamar Christina  ---
Fixed, thanks for the report.

[Bug target/104409] [12 Regression] -march=armv8.6-a+ls64 crashes, LS64 builtins causes ICE

2022-04-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104409

Tamar Christina  changed:

   What|Removed |Added

Summary|-march=armv8.6-a+ls64   |[12 Regression]
   |crashes, LS64 builtins  |-march=armv8.6-a+ls64
   |causes ICE  |crashes, LS64 builtins
   ||causes ICE
   Assignee|wirkus at gcc dot gnu.org  |tnfchris at gcc dot gnu.org
   Priority|P3  |P1
 CC||tnfchris at gcc dot gnu.org
   Target Milestone|--- |12.0

--- Comment #8 from Tamar Christina  ---
Testing patch.

[Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves

2022-04-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

Tamar Christina  changed:

   What|Removed |Added

   Target Milestone|12.0|13.0
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org
   Priority|P1  |P2

--- Comment #17 from Tamar Christina  ---
Fixed the immediate regression; will handle the general codegen issue in GCC 13.

[Bug testsuite/105095] gcc.dg/vect/complex/fast-math-complex-* tests are not executed

2022-04-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105095

Tamar Christina  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Tamar Christina  ---
fixed, thanks!

[Bug target/105157] [12 Regression] compile-time regressions with generic tuning since r12-7756-g27d8748df59fe6

2022-04-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105157

--- Comment #7 from Tamar Christina  ---
(In reply to Richard Biener from comment #6)
> More to the point the cited rev. doesn't look like it should change anything
> for -mtune=generic.  Maybe the "generic" config is always the last one on
> aarch64 and now "demeter"?  At least there doesn't seem to be explicit
> "generic" configs like on x86_64.

We do have an explicit generic tuning struct, the ones with generic_* in the
name, e.g. generic_tunings, but as you say, this patch doesn't change it at all.

We'll take a look.

[Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves

2022-04-04 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #15 from Tamar Christina  ---
(In reply to rsand...@gcc.gnu.org from comment #14)
> 
> So IMO we should fix RTL representation problem that Andrew pointed
> out in comment 7 as the P1 fix, then accept the other cases as a P2
> regression caused by bigger improvements elsewhere.

Alright, mine then.

[Bug testsuite/105095] gcc.dg/vect/complex/fast-math-complex-* tests are not executed

2022-03-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105095

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina  ---
Thanks,

Hadn't noticed those weren't picked up. It looks like the fast-math- prefix has
a filter on the first character to distinguish those tests from the fast-math-bb
ones.

I'll fix it on Friday.

[Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves

2022-03-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #13 from Tamar Christina  ---
That said, I'll wait for Richard S to respond, but I don't think this is a P1
any longer: we know why it can't be done during reload, and neither sequence is
really significantly better or worse.

[Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves

2022-03-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #12 from Tamar Christina  ---
(In reply to Jakub Jelinek from comment #10)
> Could we fix this up in postreload or later?
> (insn 35 18 21 2 (set (reg:SI 0 x0 [125])
> (reg:SI 32 v0 [117])) "pr104049.c":16:33 52 {*movsi_aarch64}
>  (nil))
> (insn 21 35 36 2 (set (reg:SI 0 x0 [120])
> (and:SI (reg:SI 0 x0 [125])
> (const_int 65535 [0x]))) "pr104049.c":16:33 500 {andsi3}
>  (nil))
> (insn 36 21 22 2 (set (reg:SI 1 x1 [126])
> (reg:SI 32 v0 [117])) "pr104049.c":16:33 52 {*movsi_aarch64}
>  (nil))
> (insn 22 36 28 2 (set (reg:SI 0 x0 [121])
> (plus:SI (lshiftrt:SI (reg:SI 1 x1 [126])
> (const_int 16 [0x10]))
> (reg:SI 0 x0 [120]))) "pr104049.c":16:33 211 {*add_lsr_si}
>  (nil))
>
> transformation into:
> (insn 35 18 21 2 (set (reg:SI 0 x0 [125])
> (reg:SI 32 v0 [117])) "pr104049.c":16:33 52 {*movsi_aarch64}
>  (nil))
> (insn 21 35 22 2 (set (reg:SI 0 x1 [120])
> (and:SI (reg:SI 0 x0 [125])
> (const_int 65535 [0x]))) "pr104049.c":16:33 500 {andsi3}
>  (nil))
> (insn 22 21 28 2 (set (reg:SI 0 x0 [121])
> (plus:SI (lshiftrt:SI (reg:SI 1 x0 [126])
> (const_int 16 [0x10]))
> (reg:SI 0 x1 [120]))) "pr104049.c":16:33 211 {*add_lsr_si}
>  (nil))
> ?

Yeah, I think so. Normally -frename-registers would have done it, but it doesn't
find the opportunity in this case because of the zero-extend, which clobbers x0
before the second extract. So I think it's purely the instruction scheduling that
causes the problem here, and it's probably better to clean it up pre-reload if
possible.

[Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves

2022-03-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #11 from Tamar Christina  ---
(In reply to Jakub Jelinek from comment #9)
> Perhaps the r12-2288-g8695bf78dad1a42636 change wasn't a good idea?

I think it's still a good idea as it fixes a bigger problem (unneeded SIMD
partial extracts) and makes it easier to write RTL as you don't have to deal
with both VEC_SELECT and subregs.  So having one canonical form is better.

> I mean, if we add some hack for the .REDUC_* stuff so that we don't have the
> lowpart vec_select that r12-2288 folds into a subreg, won't we still suffer
> the same problem when doing anything similar?

Yes, but I think the problem is in how we do the transfers to start with. While
looking at this issue I noticed that the SIMD <-> genreg transfers for sizes that
have no exact general register (i.e. 8-bit and 16-bit) were already suboptimal in
a number of cases (even before this), and dealing with that underlying problem
first is better, so I postponed it to GCC 13.

That is to say, even

typedef int V __attribute__((vector_size (4 * sizeof (int))));

int
test (V a)
{
  int sum = a[0];
  return (unsigned int)sum >> 16;
}

is suboptimal.

> E.g. with -O2:
> 
> typedef int V __attribute__((vector_size (4 * sizeof (int))));
> 
> int
> test (V a)
> {
>   int sum = a[0];
>   return (((unsigned short)sum) + ((unsigned int)sum >> 16)) >> 1;
> }
> 
> The assembly difference is then:
> - fmovw0, s0
> - lsr w1, w0, 16
> - add w0, w1, w0, uxth
> + umovw0, v0.h[0]
> + fmovw1, s0
> + add w0, w0, w1, lsr 16
>   lsr w0, w0, 1
>   ret
> Dunno how costly on aarch64 is Neon -> GPR register move.
> Is fmov w0, s0; fmov w1, s0 or fmov w0, s0; mov w1, w0 cheaper?

The answer is quite uarch-specific, but in general fmov w0, s0; mov w1, w0 is
cheaper. That said, for the sequence you pasted above it's really a bit of a
wash.

The old codegen had a longer dependency chain and needed both a shift and a
zero-extend.

The new codegen removes the zero-extend and folds the shift into the add, but
adds a transfer, so it roughly cancels out.

Ideally we'd want here:

umovw0, v0.h[0]
umovw1, v0.h[1]
add w0, w0, w1
lsr w0, w0, 1
ret

where the shift and the zero extend are gone and the moves could be done in
parallel.
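
For comparison, the same tail computation written directly with NEON intrinsics
would be one way to express the two-umov form above (a hand-written sketch, not
the testcase from this report; the function name is made up and little-endian
lane numbering is assumed):

#include <arm_neon.h>
#include <stdint.h>

/* Extract the low and high 16-bit halves of lane 0 directly instead of moving
   the whole 32-bit lane to a GPR and shifting/masking there.  */
uint32_t sum_halves (int32x4_t v)
{
  uint16x8_t h = vreinterpretq_u16_s32 (v);
  return (uint32_t) (vgetq_lane_u16 (h, 0) + vgetq_lane_u16 (h, 1)) >> 1;
}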

[Bug lto/105012] New: [12 Regression] wrf from SPECCPU2017 ICEs during LTO linking

2022-03-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105012

Bug ID: 105012
   Summary: [12 Regression] wrf from SPECCPU2017 ICEs during LTO
linking
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: lto
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
CC: marxin at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64-*

wrf when linking with -mcpu=native -Ofast -fomit-frame-pointer -flto=auto
--param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 (have not reduced
flags yet) fails during final link with:

during GIMPLE pass: ifcvt
module_cam_mp_ndrop.fppized.f90: In function 'dropmixnuc':
module_cam_mp_ndrop.fppized.f90:33:27: internal compiler error: Segmentation
fault
   33 |   subroutine dropmixnuc(lchnk, ncol, ncldwtr,tendnd, temp,omega,  &
  |   ^
0xb9bee3 crash_signal
/data/tamar/ci/work/5c94c4ced6ebfcd0/gcc/toplev.cc:322
0xc93978 first_imm_use_stmt
/data/tamar/ci/work/5c94c4ced6ebfcd0/gcc/ssa-iterators.h:932
0xc93978 dse_classify_store(ao_ref*, gimple*, bool, simple_bitmap_def*, bool*,
tree_node*)
/data/tamar/ci/work/5c94c4ced6ebfcd0/gcc/tree-ssa-dse.cc:954
0xbf9263 ifcvt_local_dce
/data/tamar/ci/work/5c94c4ced6ebfcd0/gcc/tree-if-conv.cc:3154
0xbfe1b7 tree_if_conversion(loop*, vec*)
/data/tamar/ci/work/5c94c4ced6ebfcd0/gcc/tree-if-conv.cc:3383
0xc000af execute
/data/tamar/ci/work/5c94c4ced6ebfcd0/gcc/tree-if-conv.cc:3461
0xc000af execute
/data/tamar/ci/work/5c94c4ced6ebfcd0/gcc/tree-if-conv.cc:3449
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
make[1]: *** [/opt/buildAgent/temp/buildTmp/ccGjHi3g.ltrans8.ltrans.o] Error 1

This has started somewhere between g:79e210f0c8e1fad875333e93b5ae2fe9b4879b7a
and g:9fc8f278ebebc57537dc0cb9d33e36d932be0bc3

Sorry for the lack of a small reproducer; I still need to learn how to reduce
LTO cases properly.

Note that this range does contain a fix for the threader failure for wrf in
PR102943, but I'm not sure whether it's related.

[Bug tree-optimization/104755] gcc.dg/vect/vect-bic-bitmask-10.c etc. FAIL

2022-03-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104755

Tamar Christina  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Tamar Christina  ---
Fixed by skipping the tests, since they don't vectorize on this target.

[Bug target/104529] [12 Regression] inefficient codegen around new/delete

2022-03-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104529

--- Comment #7 from Tamar Christina  ---
(In reply to Jakub Jelinek from comment #6)
> I don't see the code mentioned in #c0 on x86_64, I see also loads and stores
> like on aarch64.

Yes, that was my mistake: I was accidentally comparing GCC 11 x86_64 with GCC
12 AArch64.  That's how I noticed later that it was a 12 regression.  I should
have clarified.

[Bug tree-optimization/104755] gcc.dg/vect/vect-bic-bitmask-10.c etc. FAIL

2022-03-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104755

--- Comment #4 from Tamar Christina  ---
Hmmm, it looks like it doesn't support vector comparisons:

 missed:   not vectorized: relevant stmt not supported: _6 = _5 <= 255;

I'll probably just have to skip them on sparc*-* then.

[Bug testsuite/104730] gcc.dg/vect/complex/pr102819-9.c FAILs

2022-03-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104730

Tamar Christina  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Tamar Christina  ---
fixed on master and branch

[Bug tree-optimization/104755] gcc.dg/vect/vect-bic-bitmask-10.c etc. FAIL

2022-03-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104755

--- Comment #2 from Tamar Christina  ---
Hmmm, the tests are gated on vect_int, which sparc declares to support, but the
code didn't vectorize, so there's probably an unsupported operation somewhere.

Could you attach the output of -fdump-tree-vect-all?

Thanks

[Bug testsuite/104730] gcc.dg/vect/complex/pr102819-9.c FAILs

2022-03-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104730

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org
  Component|tree-optimization   |testsuite

--- Comment #2 from Tamar Christina  ---
Thanks,

Just a testism; we need to disable the test for targets that can't vectorize floats.
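
A sketch of the kind of gating meant, as a dg directive at the top of the test
(the directive name is from the GCC testsuite, but the exact effective-target
selector for these tests is an assumption on my part):

/* { dg-require-effective-target vect_float } */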

[Bug tree-optimization/102819] [11 Regression] IFN_COMPLEX_MUL matches things that it shouldn't

2022-02-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102819

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #10 from Tamar Christina  ---
fixed in all affected branches.

[Bug target/104529] [12 Regression] inefficient codegen around new/delete

2022-02-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104529

Tamar Christina  changed:

   What|Removed |Added

Summary|[missed optimization]   |[12 Regression] inefficient
   |inefficient codegen around  |codegen around new/delete
   |new/delete  |
 Status|RESOLVED|UNCONFIRMED
 Resolution|DUPLICATE   |---

--- Comment #3 from Tamar Christina  ---
I'm re-opening because I don't think it has anything to do with #94294

This is a GCC 12 regression.

In GCC 11 we generated in the mid-end

   [local count: 536870913]:
  _32 = operator new (6);
  MEM  [(char * {ref-all})_32] = 255;
  MEM  [(char * {ref-all})_32 + 4B] = 0;
  operator delete (_32, 6);
  return 56;

and in GCC 12 we now generate

   [local count: 536870913]:
  MEM  [(unsigned char *)] = { 255, 0, 0, 0 };
  MEM  [(unsigned char *) + 4B] = { 0, 0 };
  _34 = operator new (6);
  MEM  [(char * {ref-all})_34] = MEM 
[(char * {ref-all})];
  D.24688 ={v} {CLOBBER(eol)};
  operator delete (_34, 6);
  return 56;

See https://godbolt.org/z/KKfhxTxnd

Forcing it to keep the stores before the call to new.

[Bug target/104529] [missed optimization] inefficient codegen around new/delete

2022-02-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104529

--- Comment #2 from Tamar Christina  ---
I don't quite see how this is a CSE problem.

There's only one of each constant, and none of them are needed before the call,
unlike in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86892.

You don't need the values of your array until you allocate memory for said
array.

x86 has the following sequence in GIMPLE

  _32 = operator new (6);
  MEM  [(char * {ref-all})_32] = 255;
  MEM  [(char * {ref-all})_32 + 4B] = 0;
  operator delete (_32, 6);

which is optimal, you create the object, store the values, and remove it.

AArch64 however has this

  MEM  [(unsigned char *)] = 255;
  MEM  [(unsigned char *) + 4B] = 0;
  _34 = operator new (6);
  MEM  [(char * {ref-all})_34] = MEM 
[(char * {ref-all})];
  D.24688 ={v} {CLOBBER(eol)};
  operator delete (_34, 6);

which is where the issue comes from. So this has nothing to do with CSE as far
as I can tell.  The GIMPLE is just suboptimal.

[Bug target/104529] New: [missed optimization] inefficient codegen around new/delete

2022-02-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104529

Bug ID: 104529
   Summary: [missed optimization] inefficient codegen around
new/delete
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64-*

Consider the following example

#include 
#include 
#include 

struct param {
  uint32_t k;
  std::vector src;
  std::vector ref0;
};

size_t foo() {
param test[] = {
{48, {255, 0, 0, 0, 0, 0}
}};
return sizeof(test);
}

where the entire thing should have been elided, but that is already reported in
#94294.

Instead this code also shows that we are generating quite inefficient code
(even at -Ofast)

on AArch64 we generate:

foo():
stp x29, x30, [sp, -32]!
mov w1, 255 <-- 1
mov x0, 6
mov x29, sp
str w1, [sp, 24] <-- 1
strhwzr, [sp, 28] <-- 2
bl  operator new(unsigned long)
ldrhw3, [sp, 28] <-- 2
mov x1, 6
ldr w4, [sp, 24] <-- 1
str w4, [x0]
strhw3, [x0, 4]
bl  operator delete(void*, unsigned long)
mov x0, 56
ldp x29, x30, [sp], 32
ret

There's no reason to spill and rematerialize a constant when the constant is
representable in a single move.

It's also unclear to me why it thinks the 255 and 0 need to be stored before the
call to new.  But even if they did, it would be better to re-create the constants
after the call rather than spilling and reloading them.

However x86 gets this right, which is why I've opened this as a target bug:

foo():
sub rsp, 8
mov edi, 6
calloperator new(unsigned long)
mov esi, 6
mov DWORD PTR [rax], 255
mov rdi, rax
xor eax, eax
mov WORD PTR [rdi+4], ax
calloperator delete(void*, unsigned long)
mov eax, 56
add rsp, 8
ret

[Bug tree-optimization/94294] [missed optimization] new+delete of unused local string not removed

2022-02-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94294

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #5 from Tamar Christina  ---
(In reply to Marc Glisse from comment #4)
> I don't believe there is a "new/delete" issue.

I wonder... Looking at

#include 
#include 
#include 

struct param {
  uint32_t k;
  std::vector<uint8_t> src;
  std::vector<uint8_t> ref0;
};

size_t foo() {
param test[] = {
{48, {255, 0, 0, 0, 0, 0}
}};
return sizeof(test);
}

I had expected the answer to simply be

mov w0, #56
ret

since the new and delete can be elided.
...but GCC generates some pretty bad code here... 
https://godbolt.org/z/EPoYYohPa

In this case nothing of the structure is used at all, yet we keep all the
construction code.

[Bug middle-end/103641] [11 regression] Severe compile time regression in SLP vectorize step

2022-02-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

--- Comment #34 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #33)
> (In reply to Tamar Christina from comment #32)
> > I'm not sure... I can't seem to get the same granularity level that Andrew
> > got... How did you get that report Andrew?
> 
> I was using perf record/perf report to get that report.

Ah! Doh... thanks, I'll take a look.

[Bug middle-end/103641] [11 regression] Severe compile time regression in SLP vectorize step

2022-02-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

--- Comment #32 from Tamar Christina  ---
> 
> I suppose the slowness is still entirely within synth_mult?

I'm not sure... I can't seem to get the same granularity level that Andrew
got... How did you get that report Andrew?

[Bug middle-end/103641] [11/12 regression] Severe compile time regression in SLP vectorize step

2022-02-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

--- Comment #30 from Tamar Christina  ---
No problem during nightlies. No real changes in other workloads in compile time
or runtime.

I can confirm no perf change for xxhash, and compile time decreased from 8
seconds to 1 second.

tree vectorization     :   0.28 (  3%)   0.00 (  0%)   0.28 (  3%)   135k (  0%)
tree slp vectorization :   7.43 ( 89%)   0.00 (  0%)   7.41 ( 87%)  3450k (  8%)

into

tree vectorization     :   0.02 (  2%)   0.00 (  0%)   0.02 (  2%)   135k (  0%)
tree slp vectorization :   0.37 ( 35%)   0.00 (  0%)   0.39 ( 31%)  3400k (  8%)

[Bug tree-optimization/104408] SLP discovery fails due to -Ofast rewriting

2022-02-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> match.pd just does canonicalization here.  SLP discovery could handle this
> in the existing swap operands or reassoc support but I guess the desire here
> is to pull out a Complex SLP pattern.

Yes, though also to optimize the case where you don't have the optab; currently
the generated code is much worse at -Ofast.

> 
> So - no perfect idea yet how to reliably match a Complex pattern here but
> trying to attack this from the match.pd side sounds wrong.

Well, the problem is that the scalar code is suboptimal too, even without
matching a complex pattern, so the epilogue here does an extra sub on each
unrolled step.

So I initially figured we'd want to not perform the canonicalization if it comes
at the expense of sharing. However, that looks harder than I thought at first,
as there are multiple points in fold-const.c that will try to force this form.

I can probably fix the epilogue post-vectorization, but that seemed like a worse
solution.
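
As a minimal scalar sketch of the sharing that gets lost (variable and function
names are mine, not from the testcase):

/* Both outputs reuse the shared subtractions t1 and t2; folding the negation
   into the operand turns one use of t2 into (di - ci), a separate value that
   costs an extra subtraction per step.  */
void before (float er, float ei, float cr, float ci, float dr, float di,
             float *re, float *im)
{
  float t1 = cr - dr, t2 = ci - di;    /* each computed once, used twice */
  *re = er * t1 - ei * t2;
  *im = er * t2 + ei * t1;
}

void after (float er, float ei, float cr, float ci, float dr, float di,
            float *re, float *im)
{
  float t1 = cr - dr, t2 = ci - di;
  float t3 = di - ci;                  /* the extra subtraction (== -t2) */
  *re = er * t1 + ei * t3;             /* A - x*t2 canonicalized to A + x*(di - ci) */
  *im = er * t2 + ei * t1;
}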

[Bug tree-optimization/104408] SLP discovery fails due to -Ofast rewriting

2022-02-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org

--- Comment #2 from Tamar Christina  ---
Mine for GCC 13.

[Bug tree-optimization/104408] SLP discovery fails due to -Ofast rewriting

2022-02-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408

--- Comment #1 from Tamar Christina  ---
In particular, the rewrite should probably be gated on the expression being
single use.

[Bug rtl-optimization/104405] Inefficient register allocation on complex arithmetic

2022-02-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104405

--- Comment #4 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #2)
> The big question becomes now is really an issue in real world code or just a
> toy benchmark which is testing argument/return passing optimizations?

Can't say I've gotten it from real-world code; I'm just cataloging issues I'm
finding while working on vectorization support for complex numbers.

But it seems to me a simple enough thing that we should be able to handle.

[Bug tree-optimization/104406] SLP discovery doesn't use TWO_OPERAND nodes as seeds

2022-02-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104406

--- Comment #2 from Tamar Christina  ---
Yeah it looks like there's an overlap with
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485 indeed, but that ticket
seems to be trying to address multiple things at once including an x86 costing
issue.

I'm happy to mark this as a dup though.

[Bug tree-optimization/104408] New: SLP discovery fails due to -Ofast rewriting

2022-02-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408

Bug ID: 104408
   Summary: SLP discovery fails due to -Ofast rewriting
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following testcase:

typedef struct { float r, i; } cf;
void
f (cf *restrict a, cf *restrict b, cf *restrict c, cf *restrict d, cf e)
{
  for (int i = 0; i < 100; ++i)
{
  b[i].r = e.r * (c[i].r - d[i].r) - e.i * (c[i].i - d[i].i);
  b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);
}
}

when compiled at -O3 forms an SLP tree but fails at -Ofast because match.pd
rewrites the expression into 

  b[i].r = e.r * (c[i].r - d[i].r) + e.i * (d[i].i - c[i].i);
  b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);

and so introduces a different interleaving in the second multiply operation.

It's unclear to me what the gain of actually doing this is, as it results in
worse vector and scalar code due to losing the sharing of the computed values of
the nodes.

Without the rewriting the first code can re-use the load from the first vector
and just reverse the elements:

.L2:
ldr q1, [x3, x0]
ldr q0, [x2, x0]
fsubv0.4s, v0.4s, v1.4s
fmulv1.4s, v2.4s, v0.4s
fmulv0.4s, v3.4s, v0.4s
rev64   v1.4s, v1.4s
fnegv0.2d, v0.2d
faddv0.4s, v0.4s, v1.4s
str q0, [x1, x0]
add x0, x0, 16
cmp x0, 800
bne .L2

While with the rewrite it forces an increase in VF to be able to handle the
interleaving

.L2:
ld2 {v0.4s - v1.4s}, [x3], 32
ld2 {v4.4s - v5.4s}, [x2], 32
fsubv2.4s, v1.4s, v5.4s
fsubv3.4s, v4.4s, v0.4s
fsubv5.4s, v5.4s, v1.4s
fmulv2.4s, v2.4s, v6.4s
fmulv4.4s, v6.4s, v3.4s
fmlav2.4s, v7.4s, v3.4s
fmlav4.4s, v5.4s, v7.4s
mov v0.16b, v2.16b
mov v1.16b, v4.16b
st2 {v0.4s - v1.4s}, [x1], 32
cmp x5, x1
bne .L2

In scalar code you lose the ability to re-use the subtract, so you get an extra sub.

[Bug tree-optimization/104406] New: SLP discovery doesn't use TWO_OPERAND nodes as seeds

2022-02-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104406

Bug ID: 104406
   Summary: SLP discovery doesn't use TWO_OPERAND nodes as seeds
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

the following example

#include <complex.h>

complex double f (complex double a, complex double b)
{
return a * b;
}

compiled at -Ofast fails to SLP because at tree level it generates

  a$real_5 = REALPART_EXPR <a>;
  a$imag_6 = IMAGPART_EXPR <a>;
  b$real_7 = REALPART_EXPR <b>;
  b$imag_8 = IMAGPART_EXPR <b>;
  _9 = a$real_5 * b$real_7;
  _10 = a$imag_6 * b$imag_8;
  _11 = a$real_5 * b$imag_8;
  _12 = a$imag_6 * b$real_7;
  _13 = _9 - _10;
  _14 = _11 + _12;
  _3 = COMPLEX_EXPR <_13, _14>;

But SLP discovery does not attempt to start at _13 and _14, instead starts at
the multiplies:

 note:   Starting SLP discovery for
 note: _11 = a$real_5 * b$imag_8;
 note: _12 = a$imag_6 * b$real_7;

And ultimately fails because it doesn't know what to do with REALPART_EXPR and
IMAGPART_EXPR.

[Bug c/104405] New: Inefficient register allocation on complex arithmetic

2022-02-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104405

Bug ID: 104405
   Summary: Inefficient register allocation on complex arithmetic
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following testcase

#include <complex.h>

complex double f (complex double a, complex double b)
{
return a * b * I;
}

compiled at -Ofast has unneeded copies of input

i.e.

f:
fmovd4, d1
fmovd1, d0
fmuld0, d4, d2
fmuld4, d4, d3
fnmadd  d0, d1, d3, d0
fnmsub  d1, d1, d2, d4
ret

the first two moves are unneeded and look to be an artifact of how IMAGPART_EXPR
and REALPART_EXPR are expanded.  This seems to be a generic issue, as both x86
and Arm targets have the same problem.
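
For reference, expanding the arithmetic by hand (a scalarised sketch with a
hypothetical helper, not code from the report) shows the data flow behind the
fmul/fnmadd/fnmsub sequence above: four multiplies feed one negated sum and one
difference, so the inputs don't inherently need to be copied first.

#include <complex.h>

complex double mul_then_rotate (double ar, double ai, double br, double bi)
{
  /* a*b = (ar*br - ai*bi) + (ar*bi + ai*br)*I; multiplying by I then swaps
     the parts and negates the new real part.  */
  double re = -(ar * bi + ai * br);
  double im =   ar * br - ai * bi;
  return re + im * I;
}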

[Bug middle-end/103641] [11/12 regression] Severe compile time regression in SLP vectorize step

2022-02-04 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

--- Comment #29 from Tamar Christina  ---
(In reply to Richard Biener from comment #28)
> I'm not removing the regression marker yet - can ARM folks please update the
> trunk numbers with a fully built compiler (w/o checking)?

Sure, I'll come back on Monday when it's gathered data in the CI.

[Bug tree-optimization/102819] [11 Regression] IFN_COMPLEX_MUL matches things that it shouldn't

2022-02-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102819

Tamar Christina  changed:

   What|Removed |Added

Summary|[11/12 Regression]  |[11 Regression]
   |IFN_COMPLEX_MUL matches |IFN_COMPLEX_MUL matches
   |things that it shouldn't|things that it shouldn't

--- Comment #6 from Tamar Christina  ---
Fixed on master; will backport after it has had a few days to stew.

[Bug tree-optimization/103169] [12 Regression] ICE: verify_ssa failed (error: definition in block 3 does not dominate use in block 4) since r12-4785-ged3de62ac949c92ad41ef6de7cc926fbb2a510ce

2022-02-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103169

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Tamar Christina  ---
fixed

[Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves

2022-02-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #8 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #7)
> Hmm,
> ;; _43 = .REDUC_PLUS (vect__7.11_47);
> 
> (insn 23 22 24 (set (reg:V4SI 112)
> (unspec:V4SI [
> (reg:V4SI 103 [ vect__7.11 ])
> ] UNSPEC_ADDV)) -1
>  (nil))
> 
> (insn 24 23 0 (set (reg:SI 102 [ _43 ])
> (vec_select:SI (reg:V4SI 112)
> (parallel [
> (const_int 0 [0])
> ]))) -1
>  (nil))
> 
> Could we just represent REDUC_PLUS as:
> 
> (insn 23 22 24 (set (reg:SI 112)
> (unspec:SI [
> (reg:V4SI 103 [ vect__7.11 ])
> ] UNSPEC_ADDV_1)) -1
>  (nil))
> 
> And then I don't think we will have the issue with the vec_select and subreg
> really.

Hmm that does look like the RTL in the optab is wrong, The V2F optab is right
but the one with the mode iterator looks wrong.. I'll give it a shot.

[Bug rtl-optimization/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves

2022-01-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #4 from Tamar Christina  ---
(In reply to Vladimir Makarov from comment #3)
> (In reply to Richard Biener from comment #2)
> > We need to understand the issue at least.
> 
> I think that it is not an RA problem.
> 
> IRA assigns quite reasonable registers.  LRA just generates 2 reloads for
> this test, one for insn *add_lsr_si which has only one alternative and one
> for insn andsi3 which needs reload insns for any alternative and LRA in this
> case chooses the best one.
> 
> I guess the problem of the code generation regression is in some recent
> changes of combiner or most probably aarch64 machine dependent code
> directing the combiner (as Tamar wrote).

Aside from the duplicate move, the new code is actually better: it's better that
it combined the shift into the add instead of the zero-extend.  So that part is
fine.

> 
> It would be nice if somebody bisected and found what commit resulted in the
> regression.

It's caused by g:8695bf78dad1a42636775843ca832a2f4dba4da3, which is a general
change: subreg 0 is the canonical RTL for the operation, so this now folds to it
more often.

> 
> As for double transfer of the value, it could be removed by inheritance in
> LRA but it is impossible as an input reload pseudo got the same hard
> register (in LRA assignment subpass) as one of the insn output pseudo (the
> assignment was done in IRA) and the reloaded value is still used in
> subsequent insn.   Unfortunately it can happen as RA can not make allocation
> and code selection optimally in general case.
> 
>   Some coordination between LRA-assignment subpass and LRA-inheritance
> subpass could help to avoid the double transfer but right now I have no idea
> how to do this.  It is also dangerous to implement such coordination at this
> stage as LRA-inheritance sub-pass is very complicated.

That's fair if there's no easy fix for 12.  The benefits of the subreg change
outweigh
this downside.  So I'm ok with fixing it in 13.

[Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves

2022-01-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

Bug ID: 104049
   Summary: [12 Regression] vec_select to subreg lowering causes
superfluous moves
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64-*

Consider:

int test (uint8_t *p, uint32_t t[1][1], int n) {

  int sum = 0;
  uint32_t a0;
  for (int i = 0; i < 4; i++, p++)
t[i][0] = p[0];

  for (int i = 0; i < 4; i++) {
{
  int t0 = t[0][i] + t[0][i];
  a0 = t0;
};
sum += a0;
  }
  return (((uint16_t)sum) + ((uint32_t)sum >> 16)) >> 1;
}

Which after the reduction gets SLP'd used to generate at -O3

addvs0, v0.4s
fmovw0, s0
lsr w1, w0, 16
add w0, w1, w0, uxth
lsr w0, w0, 1

which was pretty good. However in GCC 12 we now generate worse code:

addvs0, v0.4s
fmovw0, s0
fmovw1, s0
and w0, w0, 65535
add w0, w0, w1, lsr 16
lsr w0, w0, 1

Notice the double transfer of the same value.

This is because at the RTL level the original mov becomes a vec_select

(insn 19 18 20 2 (set (reg:SI 102 [ _43 ])
(vec_select:SI (reg:V4SI 117)
(parallel [
(const_int 0 [0])
]))) -1
 (nil))

which previously stayed as a vec_select and the RA would use this pattern for
the w -> r move.

Now however this vec_select gets transformed into a subreg 0, which causes
combine to push the subreg into each instruction using reg 102.

(insn 21 18 22 2 (set (reg:SI 120)
(and:SI (subreg:SI (reg:V4SI 117) 0)
(const_int 65535 [0xffff]))) "/app/example.c":30:27 492 {andsi3}
 (nil))
(insn 22 21 28 2 (set (reg:SI 121)
(plus:SI (lshiftrt:SI (subreg:SI (reg:V4SI 117) 0)
(const_int 16 [0x10]))
(reg:SI 120))) "/app/example.c":30:27 211 {*add_lsr_si}
 (expr_list:REG_DEAD (reg:SI 120)
(expr_list:REG_DEAD (reg:V4SI 117)
(nil

and because these operations don't exist on the w side, reload is forced to
materialize many duplicate moves from w -> r.  So every operation that gets the
subreg pushed into it, and for which there is no equivalent operation on the w
side, gets an extra move.

Aside from that, we seem to lose the fact that the & can be folded into the
subreg by simply truncating the subreg from SI to HI and zero-extending it back
out.
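
The equivalence being described, as a scalar C sketch (the function names are
mine):

#include <stdint.h>

uint32_t mask_lo16 (uint32_t x) { return x & 0xffff; }    /* and w0, w0, 65535 */
uint32_t zext_lo16 (uint32_t x) { return (uint16_t) x; }  /* same value: truncate
                                                             to HI, zero-extend */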

A different reproducer is

#include <arm_neon.h>

typedef int v4si __attribute__ ((vector_size (16)));

int bar (v4si x)
{
  unsigned int sum = vaddvq_s32 (x);
  return (((uint16_t)(sum & 0xffff)) + ((uint32_t)sum >> 16));
}

Note that using -frename-registers does get us to an optimal sequence here
which is better than GCC 11.

[Bug target/103771] [12 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489

2022-01-04 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771

--- Comment #1 from Tamar Christina  ---
Looks like the change causes the simpler conditional to be detected by the
vectorizer as a masked operation, which in principle makes sense:

note:   vect_recog_mask_conversion_pattern: detected: iftmp.0_21 = x.1_14 > 255
? iftmp.0_19 : iftmp.0_20;
note:   mask_conversion pattern recognized: patt_43 = patt_42 ? iftmp.0_19 :
iftmp.0_20;
note:   extra pattern stmt: patt_40 = x.1_14 > 255;
note:   extra pattern stmt: patt_42 = () patt_40;

However, I'm not quite sure how the masking works on x86.  The additional statement
generated for patt_42 causes it to fail during vectorization:

note:   ==> examining pattern def statement: patt_42 = ()
patt_40;
note:   ==> examining statement: patt_42 = () patt_40;
note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
note:   vect_is_simple_use: vectype vector(8) 
missed:   conversion not supported by target.
note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
note:   vect_is_simple_use: vectype vector(8) 
note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
note:   vect_is_simple_use: vectype vector(8) 
missed:   not vectorized: relevant stmt not supported: patt_42 =
() patt_40;
missed:  bad operation or unsupported loop bound.
note:  * Analysis  failed with vector mode V32QI

as there are no conversion patterns for `VEC_UNPACK_LO_EXPR` between a bool and
a mask.

which explains why it works for AVX2 and AVX512BW. AVX512F doesn't seem to allow
any QI-mode conversions [1], so it fails.

I'm not sure why it's doing the replacement without checking that the target is
able to vectorize the statements it generates later. Specifically, it doesn't
check whether what's returned by build_mask_conversion is supported.

My guess is because vectorizable_condition will fail anyway without the type of
the conditional being a vector boolean.

With -mavx512vl V32QI seems to generate in the pattern mask conversions between
vector (8)  and without it vector(32) . I
think some x86 person needs to give a hint here :)

[1] https://www.felixcloutier.com/x86/kunpckbw:kunpckwd:kunpckdq

[Bug tree-optimization/103741] [12 Regression] ICE in prepare_vec_mask, at tree-vect-stmts.c:1808 since r12-5772-g06f2e7d49fc6341ea0128ccd83fd13705dd2c523

2021-12-17 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103741

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Tamar Christina  ---
fixed on trunk, thanks for the report!

[Bug tree-optimization/103741] [12 Regression] ICE in prepare_vec_mask, at tree-vect-stmts.c:1808 since r12-5772-g06f2e7d49fc6341ea0128ccd83fd13705dd2c523

2021-12-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103741

--- Comment #4 from Tamar Christina  ---
The patch was missing a check on vec_mode; testing a fix.

[Bug tree-optimization/103741] [12 Regression] ICE in prepare_vec_mask, at tree-vect-stmts.c:1808

2021-12-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103741

--- Comment #2 from Tamar Christina  ---
AH!

 
unit-size 
align:64 warn_if_not_align:0 symtab:0 alias-set 1 canonical-type
0x7fdf17594738 precision:64 min  max 
pointer_to_this >
VNx2DI
size 
constant
elt0:  elt1: >
unit-size 
constant
elt0:  elt1: >
align:128 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fdf17400690 nunits:[2,2]>
var 
def_stmt vect__2.35_68 = [vec_unpack_lo_expr] vect_x_1.34_67;
version:68>

That is most definitely not a mask..

[Bug tree-optimization/103741] [12 Regression] ICE in prepare_vec_mask, at tree-vect-stmts.c:1808

2021-12-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103741

Tamar Christina  changed:

   What|Removed |Added

   Last reconfirmed||2021-12-16
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org
 CC||tnfchris at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED
   Priority|P3  |P1

--- Comment #1 from Tamar Christina  ---
probably mine..

the assert

gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));

is failing but the types are the same

>>> p (TREE_TYPE (loop_mask))
$2 = (tree) 0x7fdf174631f8

>>> dbgtree ($2)
 
unit-size 
align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fdf17463150 precision:8 min  max
>
VNx2BI
size 
unit-size 
align:128 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fdf175940a8 precision:128 min  max
>
constant
elt0:  elt1: >
unit-size 
unit-size 
align:64 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fdf17594000 precision:64 min  max >
constant
elt0:  elt1: >
align:16 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fdf174631f8 nunits:[2,2]>
>>> dbgtree (mask_type)
 
unit-size 
align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fdf17463150 precision:8 min  max
>
VNx2BI
size 
unit-size 
align:128 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fdf175940a8 precision:128 min  max
>
constant
elt0:  elt1: >
unit-size 
unit-size 
align:64 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fdf17594000 precision:64 min  max >
constant
elt0:  elt1: >
align:16 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x7fdf174631f8 nunits:[2,2]>

So need to see why useless_type_conversion_p thinks they are not.

[Bug testsuite/78136] gcc.dg/cpp/trad/include.c fails with newer glibc versions

2021-12-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78136

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 CC||tnfchris at gcc dot gnu.org
 Status|NEW |RESOLVED

--- Comment #5 from Tamar Christina  ---
Sorry, I missed the message; these tests now pass.

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-12-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #30 from Tamar Christina  ---
(In reply to Martin Jambor from comment #29)
> (In reply to Tamar Christina from comment #23)
> > I wonder if we can get rid of the final magic parameter too, we run with
> > --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> > exchange, though still a slight effect on leela but a 12% gain in imagick.
> > 
> > This looks like without the parameter we lose constprop on MorphologyApply
> > resulting in a much bigger function.  Do you want me to make a new ticket
> > for this?
> 
> Indeed I did not know about this.  I think tracking it in a (separate) PR
> would make sense.  Thanks a lot for the pointer!

Thanks, I created a new ticket:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103734.  The threshold seems to be a
lot lower than 80%; 20% already seems to be enough.

[Bug ipa/103734] New: IPA-CP opportunity for imagick in SPECCPU 2017

2021-12-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103734

Bug ID: 103734
   Summary: IPA-CP opportunity for imagick in SPECCPU 2017
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
CC: hubicka at gcc dot gnu.org, marxin at gcc dot gnu.org
  Target Milestone: ---

When using --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=20 on
imagick the hot functions MorphologyApply and GetVirtualPixelsFromNexus get
replaced by specialized versions that are much smaller and faster.

Some other benchmarks like leela also get very small uplifts but the imagick
one is worth 14%.  Both flags seem to be needed.

[Bug rtl-optimization/103350] [12 Regression] wrong code with -Os -fno-tree-ter on aarch64-unknown-linux-gnu since r12-2288-g8695bf78dad1a42636775843ca832a2f4dba4da3

2021-12-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103350

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #8 from Tamar Christina  ---
Fixed on master
