[Bug tree-optimization/115120] New: Bad interaction between ivcanon and early break vectorization

2024-05-16 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

Bug ID: 115120
   Summary: Bad interaction between ivcanon and early break
vectorization
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

Consider the following testcase on aarch64:

int arr[1024];
int *f()
{
  int i;
  for (i = 0; i < 1024; i++)
    if (arr[i] == 42)
      break;
  return arr + i;
}

compiled with -O3 we get the following vector loop body:

.L2:
cmp x2, x1
beq .L9
.L6:
ldr q31, [x1]
add x1, x1, 16
mov v27.16b, v29.16b
mov v28.16b, v30.16b
cmeq v31.4s, v31.4s, v26.4s
add v29.4s, v29.4s, v24.4s
add v30.4s, v30.4s, v25.4s
umaxp   v31.4s, v31.4s, v31.4s
fmov x3, d31
cbz x3, .L2

it's somewhat surprising that there are two vector adds, looking at the
optimized dump:

 [local count: 1063004408]:
  # vect_vec_iv_.6_28 = PHI <_29(10), { 0, 1, 2, 3 }(2)>
  # vect_vec_iv_.7_33 = PHI <_34(10), { 1024, 1023, 1022, 1021 }(2)>
  # ivtmp.18_19 = PHI 
  _34 = vect_vec_iv_.7_33 + { 4294967292, 4294967292, 4294967292, 4294967292 };
  _29 = vect_vec_iv_.6_28 + { 4, 4, 4, 4 };
  _25 = (void *) ivtmp.18_19;
  vect__1.10_39 = MEM  [(int *)_25];
  mask_patt_9.11_41 = vect__1.10_39 == { 42, 42, 42, 42 };
  if (mask_patt_9.11_41 != { 0, 0, 0, 0 })
goto ; [5.50%]
  else
goto ; [94.50%]

we can see that there are two IV updates that got vectorized.  It turns out that
one of these comes from the ivcanon pass.  If I add -fno-tree-loop-ivcanon we
instead get the following vector loop body:

.L2:
cmp x2, x1
beq .L9
.L6:
ldr q31, [x1]
add x1, x1, 16
mov v29.16b, v30.16b
add v30.4s, v30.4s, v27.4s
cmeq v31.4s, v31.4s, v28.4s
umaxp   v31.4s, v31.4s, v31.4s
fmov x3, d31
cbz x3, .L2

which is much cleaner.  Looking at the tree dumps, the ivcanon pass makes the
following transformation:

--- cddce2.tree 2024-05-16 13:49:10.426703350 +
+++ ivcanon.tree 2024-05-16 13:49:17.678874925 +
@@ -4,6 +4,8 @@
   int i;
   int _1;
   int * _8;
+  unsigned int ivtmp_11;
+  unsigned int ivtmp_12;
   long unsigned int _13;
   long unsigned int _15;
   long unsigned int prephitmp_16;
@@ -12,6 +14,7 @@

[local count: 1063004408]:
   # i_10 = PHI 
+  # ivtmp_12 = PHI 
   _1 = arr[i_10];
   if (_1 == 42)
 goto ; [5.50%]
@@ -20,7 +23,8 @@

[local count: 1004539166]:
   i_7 = i_10 + 1;
-  if (i_7 != 1024)
+  ivtmp_11 = ivtmp_12 - 1;
+  if (ivtmp_11 != 0)
 goto ; [98.93%]
   else
 goto ; [1.07%]

i.e. it introduces the backwards-counting IV.  It seems in the general case
without vectorization ivopts then cleans this up and ensures we only have a
single IV.
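
As a rough illustration of the transformation above, here is a hand-written C
sketch of the loop before and after ivcanon.  The names f_forward, f_countdown
and ivtmp are made up for this sketch; the second function only mirrors the
shape of the GIMPLE in the diff and is not actual compiler output.

```c
int arr[1024];

/* Original form: a single forward-counting IV drives both the address
   computation and the exit test.  */
int *f_forward (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    if (arr[i] == 42)
      break;
  return arr + i;
}

/* Hand-written equivalent of the loop after ivcanon (an assumption based on
   the diff above): the exit test now uses a second, down-counting IV, while
   i is still needed for the load and for the return value.  */
int *f_countdown (void)
{
  int i = 0;
  unsigned int ivtmp = 1024;
  for (;;)
    {
      if (arr[i] == 42)
        break;
      i++;
      ivtmp--;
      if (ivtmp == 0)
        break;
    }
  return arr + i;
}
```

Both functions return the same pointer for any contents of arr; the only
difference is that the second one carries two IVs, which is what the early
break vectorizer then appears to vectorize twice.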

In the vectorized case it seems this problem only shows up with early break
vectorization. Looking at a simple reduction, such as:

int a[1024];
int g()
{
  int sum = 0;
  for (int i = 0; i < 1024; i++)
    sum += a[i];
  return sum;
}

although we still have the backwards-counting IV in ifcvt:

   [local count: 1063004408]:
  # sum_9 = PHI 
  # i_11 = PHI 
  # ivtmp_8 = PHI 
  _1 = a[i_11];
  sum_5 = _1 + sum_9;
  i_6 = i_11 + 1;
  ivtmp_7 = ivtmp_8 - 1;
  if (ivtmp_7 != 0)
goto ; [98.99%]
  else
goto ; [1.01%]

we end up with only scalar IVs after vectorization, and the backwards scalar IV
ends up getting deleted by dce6:

Deleting : ivtmp_7 = ivtmp_8 - 1;

I'm not sure what the right solution is but we should avoid having duplicated
IVs with early break vectorization.

[Bug tree-optimization/113787] [12/13/14/15 Regression] Wrong code at -O with ipa-modref on aarch64

2024-05-16 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #20 from Alex Coplan  ---
I'd just like to ping this serious wrong code bug.  It's unfortunate that this
wasn't addressed for the 14.1 release.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-09 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Alex Coplan  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org

--- Comment #3 from Alex Coplan  ---
Mine for the aliasing issues/investigation, might be worth splitting off the RA
problem to track that separately.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-09 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

--- Comment #2 from Alex Coplan  ---
Here is some analysis on why we miss some of these opportunities in ldp_fusion.
So initially in 267r.vregs we have some very clean RTL:

6: r101:DI=sfp:DI-0x40
7: x0:DI=r101:DI
8: call [`g'] argc:0
  REG_CALL_DECL `g'
9: r102:DI=sfp:DI-0x80
   10: r103:DI=sfp:DI-0x40
   11: r104:V4SI=[r103:DI]
   13: r105:V4SI=[r103:DI+0x10]
   15: r106:V4SI=[r103:DI+0x20]
   17: r107:V4SI=[r103:DI+0x30]
   12: [r102:DI]=r104:V4SI
   14: [r102:DI+0x10]=r105:V4SI
   16: [r102:DI+0x20]=r106:V4SI
   18: [r102:DI+0x30]=r107:V4SI

if we were to run the ldp/stp pass on this it should form the pairs without a
problem.  Of course things go downhill from here.  The first slightly strange
thing is that fwprop propagates the sfp into the first of each group of
accesses (i.e. with offset 0), but not the others:

9: r102:DI=sfp:DI-0x80
   11: r104:V4SI=[sfp:DI-0x40]
   13: r105:V4SI=[r101:DI+0x10]
   15: r106:V4SI=[r101:DI+0x20]
   17: r107:V4SI=[r101:DI+0x30]
  REG_DEAD r103:DI
   12: [sfp:DI-0x80]=r104:V4SI
   14: [r102:DI+0x10]=r105:V4SI
  REG_DEAD r105:V4SI
   16: [r102:DI+0x20]=r106:V4SI
  REG_DEAD r106:V4SI
   18: [r102:DI+0x30]=r107:V4SI

the RTL then stays mostly unchanged until sched1, where things really start to
go downhill:

   11: r104:V4SI=[sfp:DI-0x40]
9: r102:DI=sfp:DI-0x80
   13: r105:V4SI=[r101:DI+0x10]
   20: x0:DI=r102:DI
  REG_DEAD r102:DI
  REG_EQUAL sfp:DI-0x80
   15: r106:V4SI=[r101:DI+0x20]
   12: [sfp:DI-0x80]=r104:V4SI
  REG_DEAD r104:V4SI
   17: r107:V4SI=[r101:DI+0x30]
  REG_DEAD r101:DI
   14: [r102:DI+0x10]=r105:V4SI
  REG_DEAD r105:V4SI
   16: [r102:DI+0x20]=r106:V4SI
  REG_DEAD r106:V4SI
   18: [r102:DI+0x30]=r107:V4SI

here the first of the stores (i12) has been moved up between the last pair of
loads (i15, i17).  Now the interesting thing is how sched1 knows that it is
safe to perform this transformation.  In the ldp_fusion1 pass we miss this pair
because we think that the loads may alias with i12:

cannot form pair (15,17) due to alias conflicts (12,12)

so it would be good to look into how our alias analysis differs from what
sched1 is doing.  It's worth further noting that while the loads have MEM_EXPR
information (they point to the var_decl for s) the stores do not.  Presumably
this is because the copy of s mandated by the ABI doesn't necessarily have a
tree decl representation that the MEM_EXPRs could point to.

Separately to the aliasing issue, because:
 - there is no MEM_EXPR information for the stores, and
 - fwprop1 substituted the sfp in for the first store
ldp_fusion fails to discover the (i12,i14) store pair opportunity.  As a result
we unfortunately end up forming an stp in the middle.

Interestingly if I turn off fwprop1 then we still fail to form the
(12,14) pair due to aliasing.

So it seems the main thing to investigate is how sched1 does its alias
analysis and how that differs from what we're doing in ldp_fusion.

I have some WIP patches that should improve the pair discovery and could
potentially be extended to help with the case of the (12,14) pair here.
Another thing that could help with that is if we populated the MEM_EXPR for the
stores of the structure copy.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-09 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Alex Coplan  changed:

   What|Removed |Added

   Last reconfirmed||2024-05-09
 Status|UNCONFIRMED |NEW
 CC||acoplan at gcc dot gnu.org,
   ||vmakarov at gcc dot gnu.org
 Ever confirmed|0   |1
   Keywords||missed-optimization, ra

--- Comment #1 from Alex Coplan  ---
Confirmed.  There is a lot to unpack here.  Of course, the include isn't needed
in this testcase and the problem can be seen more clearly with a slightly
smaller array size:

typedef struct { int arr[16]; } S;

void g (S *);
void h (S);
void f(int x)
{
  S s;
  g (&s);
  h (s);
}

In this case sizeof(S) = 64 so we should be able to do the copy with 2 LDPs + 2
STPs.
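
The arithmetic behind that estimate can be sketched as follows.  The constants
and the helper q_pairs_needed are hypothetical names for this sketch; it
assumes 4-byte int and 16-byte Q registers, so that one LDP or STP of Q
registers moves 32 bytes.

```c
typedef struct { int arr[16]; } S;

/* Assumed sizes: one Q register is 16 bytes, so a Q-register pair
   instruction (LDP or STP) transfers 32 bytes.  */
enum { Q_REG_BYTES = 16, PAIR_BYTES = 2 * Q_REG_BYTES };

/* Number of Q-register pair instructions needed to load (or to store)
   nbytes, assuming nbytes is a multiple of PAIR_BYTES.  */
unsigned int q_pairs_needed (unsigned int nbytes)
{
  return nbytes / PAIR_BYTES;
}
```

With sizeof(S) = 64 this gives 2 pair loads plus 2 pair stores for the copy.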

So just for clarity, the missed ldp/stp started when we turned off the early
ldp/stp formation in memcpy expansion, i.e. with
r14-9373-g19b23bf3c32df3cbb96b3d898a1d7142f7bea4a0 .

However, things already started to regress earlier for this testcase with
r14-4944-gf55cdce3f8dd8503e080e35be59c5f5390f6d95e i.e.

commit f55cdce3f8dd8503e080e35be59c5f5390f6d95e
Author: Vladimir N. Makarov 
Date:   Thu Oct 26 14:50:40 2023

[RA]: Modfify cost calculation for dealing with equivalences

before that RA change we get:

f:
stp x29, x30, [sp, -144]!
mov x29, sp
add x0, sp, 80
bl  g
ldp q29, q28, [sp, 80]
add x0, sp, 16
ldp q31, q30, [sp, 112]
stp q29, q28, [sp, 16]
stp q31, q30, [sp, 48]
bl  h
ldp x29, x30, [sp], 144
ret

and afterwards we get:

f:
stp x29, x30, [sp, -160]!
mov x29, sp
str x19, [sp, 16]
add x19, sp, 96
mov x0, x19
bl  g
add x0, sp, 32
ldp q29, q28, [x19]
ldp q31, q30, [x19, 32]
stp q29, q28, [x0]
stp q31, q30, [x0, 32]
bl  h
ldr x19, [sp, 16]
ldp x29, x30, [sp], 160
ret

which is really not great as now we have a save/restore of x19 and the accesses
end up using different (non-sp) registers which I suspect doesn't help with the
ldp/stp formation (on trunk).

I will try to give a detailed analysis on what goes wrong with the ldp/stp
formation at the RTL level shortly (there are a lot of different issues), but I
think that RA change is a contributing factor.

[Bug target/114936] [14 Regression] Typo in aarch64-ldp-fusion.cc:combine_reg_notes

2024-05-08 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114936

Alex Coplan  changed:

   What|Removed |Added

Summary|[14/15 Regression] Typo in  |[14 Regression] Typo in
   |aarch64-ldp-fusion.cc:combi |aarch64-ldp-fusion.cc:combi
   |ne_reg_notes|ne_reg_notes

--- Comment #2 from Alex Coplan  ---
Fixed on trunk, will backport to 14 after a week or so.

[Bug rtl-optimization/114674] [aarch64] ldp_fusion fails to merge 2 strs due to imprecise alignment info

2024-05-07 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114674

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Alex Coplan  ---
Fixed for GCC 15, thanks for the report.

[Bug target/114936] [14/15 Regression] Typo in aarch64-ldp-fusion.cc:combine_reg_notes

2024-05-03 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114936

Alex Coplan  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org
   Last reconfirmed||2024-05-03
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED

[Bug target/114936] New: [14/15 Regression] Typo in aarch64-ldp-fusion.cc:combine_reg_notes

2024-05-03 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114936

Bug ID: 114936
   Summary: [14/15 Regression] Typo in
aarch64-ldp-fusion.cc:combine_reg_notes
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

aarch64-ldp-fusion.cc:combine_reg_notes has:

  result = filter_notes (REG_NOTES (i2->rtl ()), result,
                         _eh_region, fr_expr);
  result = filter_notes (REG_NOTES (i1->rtl ()), result,
                         _eh_region, fr_expr + 1);

  if (!load_p)
    {
      // Simple frame-related sp-relative saves don't need CFI notes, but when
      // we combine them into an stp we will need a CFI note as dwarf2cfi can't
      // interpret the unspec pair representation directly.
      if (RTX_FRAME_RELATED_P (i1->rtl ()) && !fr_expr[0])
        fr_expr[0] = copy_rtx (PATTERN (i1->rtl ()));
      if (RTX_FRAME_RELATED_P (i2->rtl ()) && !fr_expr[1])
        fr_expr[1] = copy_rtx (PATTERN (i2->rtl ()));
    }

so any REG_FRAME_RELATED_EXPR from i2 goes to fr_expr[0] and likewise i1 goes
to fr_expr[1], but then we have the opposite association inside the if
statement.

Many thanks to Matthew Malcomson for pointing this out to me.

After testing, I'm going to post the (arguably obvious) patch, which makes the
first filter_notes call (the one for i2) write to fr_expr + 1.  We may want to
consider a backport to GCC 14 too.

[Bug rtl-optimization/114924] [11/12/13/14/15 Regression] Wrong update of MEM_EXPR by RTL loop unrolling since r11-2963-gd6a05b494b4b71

2024-05-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114924

Alex Coplan  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org
   Last reconfirmed||2024-05-02
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1

[Bug rtl-optimization/114924] New: [11/12/13/14/15 Regression] Wrong update of MEM_EXPR by RTL loop unrolling since r11-2963-gd6a05b494b4b71

2024-05-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114924

Bug ID: 114924
   Summary: [11/12/13/14/15 Regression] Wrong update of MEM_EXPR
by RTL loop unrolling since r11-2963-gd6a05b494b4b71
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following testcase is reduced from
libgomp/testsuite/libgomp.fortran/imperfect-destructor.f90:

module m
  type t
contains
  final fini
  end type
  integer ccount(3)
  contains
subroutine init(x, n)
  type(t) x
  xi = n
  ccount = 1
end
subroutine fini(x)
  type(t) x
  dcount= s1 (a3)
  do i = 1, 1
block
  do j = 1, 2
block
  do k = 1, a3
block
  type (t) local3
  call init (local3, 3)
end block
  end do
end block
  end do
end block
  end do
end
end

compiling with -O2 -funroll-loops -da and looking at the RTL dumps, I see the
following insn in 284r.loop2_invariant:

(insn 44 40 45 8 (set (mem/c:SI (plus:DI (reg/f:DI 121)
(const_int 8 [0x8])) [3 ccount[2]+0 S4 A64])
(subreg:SI (reg:V2SI 111) 0)) "t.f90":11:16 discrim 2 69
{*movsi_aarch64}
 (expr_list:REG_DEAD (reg:V2SI 111)
(nil)))

then in 285r.loop2_unroll, I see:

(insn 44 40 45 8 (set (mem/c:SI (plus:DI (reg/f:DI 121)
(const_int 8 [0x8])) [3 ccount+0 S4 A64])
(subreg:SI (reg:V2SI 111) 0)) "t.f90":11:16 discrim 2 69
{*movsi_aarch64}
 (expr_list:REG_DEAD (reg/f:DI 121)
(expr_list:REG_DEAD (reg:V2SI 111)
(nil))))

notably the MEM_EXPR has been changed from ccount[2] to ccount, without a
corresponding change in offset.  This is incorrect.  Setting a watchpoint on the
MEM_ATTRS of the relevant MEM showed that the update happens in
cfgrtl.cc:duplicate_insn_chain, which does the following:

/* We cannot adjust MR_DEPENDENCE_CLIQUE in-place
   since MEM_EXPR is shared so make a copy and
   walk to the subtree again.  */
tree new_expr = unshare_expr (MEM_EXPR (*iter));
if (TREE_CODE (new_expr) == WITH_SIZE_EXPR)
  new_expr = TREE_OPERAND (new_expr, 0);
while (handled_component_p (new_expr))
  new_expr = TREE_OPERAND (new_expr, 0);
MR_DEPENDENCE_CLIQUE (new_expr) = newc;
set_mem_expr (const_cast  (*iter), new_expr);

so the code (correctly) looks through the ARRAY_REF in this case to find
the underlying MEM_REF and updates MR_DEPENDENCE_CLIQUE for that
MEM_REF, but then proceeds to pass the MEM_REF to set_mem_expr, thereby
incorrectly dropping the ARRAY_REF in this case.
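
The pattern described above can be modelled with a toy sketch.  The struct
expr type and the functions below are hypothetical and are not GCC's tree API;
they only show how reusing the walk variable for the final store drops the
outer component, and one possible way to avoid that (the actual patch may
differ).

```c
#include <stddef.h>

/* Toy expression node: a "component" (such as an array reference) wraps an
   inner expression; the base node has inner == NULL.  */
struct expr
{
  struct expr *inner;
  int clique;   /* only meaningful on the base node */
};

/* Walk through the components down to the base node.  */
static struct expr *find_base (struct expr *e)
{
  while (e->inner)
    e = e->inner;
  return e;
}

/* Buggy pattern: the walk overwrites the variable that is later returned
   (stored), so the caller gets the base and the outer component is lost.  */
struct expr *update_clique_buggy (struct expr *e, int newc)
{
  e = find_base (e);
  e->clique = newc;
  return e;
}

/* Possible fix: update the base through a separate pointer and return the
   original top-level expression unchanged.  */
struct expr *update_clique_fixed (struct expr *e, int newc)
{
  find_base (e)->clique = newc;
  return e;
}
```

In the terms of this report, the base node plays the role of the MEM_REF and
the wrapping node plays the role of the ARRAY_REF ccount[2].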

The code above was introduced in
r11-2963-gd6a05b494b4b714e996a5ca09c5a4a1c41dbd648 so I assume this is a
regression in GCC 11 and beyond.

I have a straightforward patch to fix this which passes bootstrap on
aarch64-linux-gnu, I will post that shortly.

While I don't have a wrong-code reproducer at the moment, we may want to
consider backporting the fix as incorrect MEM_EXPR information could
lead to wrong code.  I found the issue while working on a patch series
that has the side effect of introducing some consistency checking of the
MEM_EXPR information.

[Bug target/114801] New: [14 Regression] arm: ICE in find_cached_value, at rtx-vector-builder.cc:100 with MVE intrinsics

2024-04-22 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114801

Bug ID: 114801
   Summary: [14 Regression] arm: ICE in find_cached_value, at
rtx-vector-builder.cc:100 with MVE intrinsics
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following testcase:

#include <arm_mve.h>
uint32x4_t test_9() {
  return vdupq_m_n_u32(vdupq_n_u32(0), 0, 0x);
}

ICEs with -march=armv8.1-m.main+mve -mfloat-abi=hard on the trunk. This appears
to be a regression from GCC 13.

For a preprocessed reproducer, take the following:

$ cat t.c
#pragma GCC arm "arm_mve_types.h"
#pragma GCC arm "arm_mve.h" false
uint32x4_t test_9() {
  return vdupq_m_n_u32(vdupq_n_u32(0), 0, 0x);
}
$ gcc/xgcc -B gcc -c t.c -S -o /dev/null -march=armv8.1-m.main+mve
-mfloat-abi=hard
during RTL pass: expand
t.c: In function ‘test_9’:
t.c:4:10: internal compiler error: in find_cached_value, at
rtx-vector-builder.cc:100
4 |   return vdupq_m_n_u32(vdupq_n_u32(0), 0, 0x);
  |  ^~~~
0x2a7fc16 rtx_vector_builder::find_cached_value()
/home/alecop01/toolchain/src/gcc/gcc/rtx-vector-builder.cc:100
0x2a7f9c9 rtx_vector_builder::build()
/home/alecop01/toolchain/src/gcc/gcc/rtx-vector-builder.cc:64
0x2adff41 native_decode_vector_rtx(machine_mode, vec const&, unsigned int, unsigned int, unsigned int)
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7269
0x2ae0068 native_decode_rtx(machine_mode, vec
const&, unsigned int)
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7289
0x2ae10c4 simplify_immed_subreg
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7529
0x2ae1807 simplify_context::simplify_subreg(machine_mode, rtx_def*,
machine_mode, poly_int<1u, unsigned long>)
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7603
0x2ae31f2 simplify_context::simplify_gen_subreg(machine_mode, rtx_def*,
machine_mode, poly_int<1u, unsigned long>)
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7875
0x2ae3644 simplify_context::lowpart_subreg(machine_mode, rtx_def*,
machine_mode)
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7904
0x1e92c3e lowpart_subreg(machine_mode, rtx_def*, machine_mode)
/home/alecop01/toolchain/src/gcc/gcc/rtl.h:3565
0x22f4d11 gen_lowpart_common(machine_mode, rtx_def*)
/home/alecop01/toolchain/src/gcc/gcc/emit-rtl.cc:1627
0x2a7f336 gen_lowpart_general(machine_mode, rtx_def*)
/home/alecop01/toolchain/src/gcc/gcc/rtlhooks.cc:48
0x327a20e arm_mve::function_expander::add_input_operand(insn_code, rtx_def*)
   
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-mve-builtins.cc:2103
0x327a887 arm_mve::function_expander::use_cond_insn(insn_code, unsigned int)
   
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-mve-builtins.cc:2227
0x3282fe2
arm_mve::unspec_mve_function_exact_insn::expand(arm_mve::function_expander&)
const
   
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-mve-builtins-functions.h:339
0x327ab65 arm_mve::function_expander::expand()
   
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-mve-builtins.cc:2287
0x327ae1d arm_mve::expand_builtin(unsigned int, tree_node*, rtx_def*)
   
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-mve-builtins.cc:2352
0x3275215 arm_expand_builtin(tree_node*, rtx_def*, rtx_def*, machine_mode, int)
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-builtins.cc:4103
0x20fd3b9 expand_builtin(tree_node*, rtx_def*, rtx_def*, machine_mode, int)
/home/alecop01/toolchain/src/gcc/gcc/builtins.cc:7769
0x236a0ed expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
/home/alecop01/toolchain/src/gcc/gcc/expr.cc:12350
0x235c6d1 expand_expr_real(tree_node*, rtx_def*, machine_mode, expand_modifier,
rtx_def**, bool)
/home/alecop01/toolchain/src/gcc/gcc/expr.cc:9440
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See  for instructions.

[Bug rtl-optimization/114674] [aarch64] ldp_fusion fails to merge 2 strs due to imprecise alignment info

2024-04-10 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114674

Alex Coplan  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #4 from Alex Coplan  ---
After discussing this offline with Richard S, an alternative approach would be
to use replace_equiv_address[_nv] instead of adjust_address[_nv]; that way we
preserve most properties of the original mem and just take the address from the
other.

In fact this is what aarch64_check_consecutive_mems already does so I think we
should follow that.

I'll try a patch along those lines for stage 1.

[Bug rtl-optimization/114674] [aarch64] ldp_fusion fails to merge 2 strs due to imprecise alignment info

2024-04-10 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114674

Alex Coplan  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2024-04-10

--- Comment #3 from Alex Coplan  ---
Confirmed.

I think it might be best to take the maximum MEM_ALIGN between the adjusted mem
(new_mem) and the original mem (change_mem).  In this case it happens that the
original mem (change_mem) has a stronger alignment guarantee, but in general it
could be the case that the adjusted mem gives a better alignment guarantee. 
Since both locations are known to point to the same address, it seems best to
me to take the largest alignment of the two.
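
A minimal sketch of that idea, assuming alignments are expressed in bits as
MEM_ALIGN reports them.  The helper merged_align is a hypothetical name and is
not a function in GCC.

```c
/* Hypothetical helper: both mems are known to refer to the same address, so
   either alignment guarantee holds and the stronger one can be used.  */
unsigned int merged_align (unsigned int adjusted_align,
                           unsigned int original_align)
{
  return adjusted_align > original_align ? adjusted_align : original_align;
}
```

For example, merged_align (32, 64) yields 64, which matches the case here where
the original mem has the stronger guarantee.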

[Bug rtl-optimization/114674] [aarch64] ldp_fusion fails to merge 2 strs due to imprecise alignment info

2024-04-10 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114674

Alex Coplan  changed:

   What|Removed |Added

 CC||acoplan at gcc dot gnu.org
   Keywords||missed-optimization

--- Comment #2 from Alex Coplan  ---
Thanks for the report (and patch), I'll look into this.

[Bug target/114492] Invalid use of gcc_assert (notably in gcc/config/aarch64/aarch64-ldp-fusion.cc)

2024-04-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114492

Alex Coplan  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org

--- Comment #4 from Alex Coplan  ---
I think these should be OK. In the case of:

  for (unsigned i = 0; i < changes.length (); i++)
gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i],
is_changing));

I think this is OK because the pass guarantees to have chosen a singleton move
range for the pair, so we don't rely on the call to restrict_movement_ignoring
updating the move range for any of the changes.  Other changes in the set are
either deletions or no-ops in terms of movement.  So we call this purely for
checking purposes to make sure we're not attempting something invalid.

Similarly in the case of:

  gcc_assert (crtl->ssa->verify_insn_changes (changes));

this is OK because the function doesn't have side effects (other than possibly
dumping).

When I discussed this offline with Richard S, he suggested asserting that we
have singleton move ranges before calling restrict_movement_ignoring to make
that case more obviously correct, so mine for that improvement (but either way I
think this should be OK).
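
For context, the general hazard behind this report can be sketched as follows.
The two macros are hypothetical stand-ins and are not GCC's actual gcc_assert
definitions; the sketch assumes an assertion macro that does not evaluate its
argument when checking is disabled.

```c
#include <stdlib.h>

/* Hypothetical assertion macros for illustration only.  */
#define ASSERT_ENABLED(e)  ((e) ? (void) 0 : abort ())
#define ASSERT_DISABLED(e) ((void) (0 && (e)))   /* e is never evaluated */

static int calls;

/* Stand-in for a checking function that also has a side effect.  */
static int check_and_count (void)
{
  calls++;
  return 1;
}

/* Returns how many times the asserted expression ran with checking on.  */
int calls_with_checking (void)
{
  calls = 0;
  ASSERT_ENABLED (check_and_count ());
  return calls;
}

/* Returns how many times the asserted expression ran with checking off.  */
int calls_without_checking (void)
{
  calls = 0;
  ASSERT_DISABLED (check_and_count ());
  return calls;
}
```

If the asserted call were relied upon for its side effect, the two
configurations would behave differently, which is why it matters that the
calls discussed above are used purely for checking.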

[Bug target/114323] [14 Regression] MVE vector load intrinsic miscompiled since r14-5622-g4d7647edfd7d98

2024-03-15 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114323

--- Comment #4 from Alex Coplan  ---
I think the problem is that the arm backend incorrectly sets the const
attribute on this builtin, but it can't be const because it reads memory (it
should be pure instead):

 
sizes-gimplified unsigned V4SI
size 
unit-size 
align:64 warn_if_not_align:0 symtab:0 alias-set -1
structural-equality
attributes 
value >> nunits:4>
HI
size 
unit-size 
align:16 warn_if_not_align:0 symtab:0 alias-set -1 structural-equality
arg-types 
chain >>
pointer_to_this >
readonly addressable used nothrow public external built-in decl_5 decl_6 SI
t.c:2:9
align:16 warn_if_not_align:0 built-in: BUILT_IN_MD:3923 context

attributes 
chain 
chain >>> chain
>
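
To illustrate the const/pure distinction mentioned above, here is a small
sketch using the GCC function attributes directly.  The functions load_first
and caller are made up for this sketch and are unrelated to the MVE builtin.

```c
/* Reads memory through its argument, so it may be declared pure (no side
   effects; the result depends on the arguments and on memory), but it must
   not be declared const, which would also promise that it reads no memory
   through pointer arguments.  */
__attribute__((pure, noinline))
int load_first (const int *p)
{
  return p[0];
}

int caller (void)
{
  int tmp[4] = { 1, 2, 3, 4 };   /* this store feeds the call below */
  return load_first (tmp);
}
```

If load_first were marked const instead, the compiler would be entitled to
assume the initialization of tmp is dead, which is the same shape of problem as
the deleted store in this bug.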

[Bug target/114323] [14 Regression] MVE vector load intrinsic miscompiled since r14-5622-g4d7647edfd7d98

2024-03-13 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114323

--- Comment #1 from Alex Coplan  ---
Hmm, so in 043t.mergephi1 we have:

uint32x4_t foo ()
{
  const uint32_t D.13439[4];
  uint32x4_t V0;

   :
  D.13439 = *.LC0;
  V0_3 = vld1q_u32 (&D.13439);
  D.13439 ={v} {CLOBBER(eos)};
  return V0_3;

}

but then 044t.dse1 says:

  Deleted dead store: D.13439 = *.LC0;

leaving us with a load of uninitialized memory.

[Bug target/114323] New: [14 Regression] MVE vector load intrinsic miscompiled since r14-5622-g4d7647edfd7d98

2024-03-13 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114323

Bug ID: 114323
   Summary: [14 Regression] MVE vector load intrinsic miscompiled
since r14-5622-g4d7647edfd7d98
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following testcase:

#include <arm_mve.h>

uint32x4_t foo (void) {
  uint32x4_t V0 = vld1q_u32(((const uint32_t[4]){1, 2, 3, 4}));
  return V0;
}

is miscompiled with -O2 -march=armv8.1-m.main+mve -mfloat-abi=hard on
arm-none-eabi.  Since r14-5622-g4d7647edfd7d985fbefe13de03c8bc2e3a74fc61 we
generate:

foo:
sub sp, sp, #16
vldrw.32 q0, [sp]
add sp, sp, #16
bx  lr

i.e. we do a vector load from uninitialized stack memory.  GCC 13 used to give:

foo:
sub sp, sp, #16
mov ip, sp
ldr r3, .L4
ldm r3, {r0, r1, r2, r3}
stm ip, {r0, r1, r2, r3}
vldrw.32 q0, [ip]
add sp, sp, #16
bx  lr
.align  2
.L4:
.word   .LANCHOR0
.size   foo, .-foo
.section .rodata
.align  2
.set .LANCHOR0,. + 0
.word   1
.word   2
.word   3
.word   4

which, while not optimal, is at least correct.  Here is a full executable
testcase for the testsuite:

#include <arm_mve.h>

__attribute__((noipa))
uint32x4_t foo (void) {
  uint32x4_t V0 = vld1q_u32(((const uint32_t[4]){1, 2, 3, 4}));
  return V0;
}

int main(void)
{
  uint32_t buf[4];
  vst1q_u32 (buf, foo());

  for (int i = 0; i < 4; i++)
    if (buf[i] != i+1)
      __builtin_abort ();
}

[Bug middle-end/114291] New: -fcompare-debug failure (length) with -fprofile-use at -O0

2024-03-09 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114291

Bug ID: 114291
   Summary: -fcompare-debug failure (length) with -fprofile-use at
-O0
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following is an -fcompare-debug failure that shows up with PGO (here on
aarch64-linux-gnu):

$ cat t.c
void foo() {}
int main(void) {}
$ gcc t.c -fprofile-generate
$ ./a.out
$ gcc t.c -fprofile-use -fcompare-debug
gcc: error: t.c: ‘-fcompare-debug’ failure (length)

The difference seems to be as follows:

$ gcc t.c -fprofile-use -fdump-final-insns=nodebug.final
$ gcc t.c -fprofile-use -g -fcompare-debug-second
-fdump-final-insns=debug.final
$ diff -u nodebug.final debug.final
--- nodebug.final   2024-03-09 12:00:43.875729773 +
+++ debug.final 2024-03-09 12:00:52.555650670 +
@@ -1,5 +1,6 @@

-;; Function foo (foo, funcdef_no=0, decl_uid=4426, cgraph_uid=1,
symbol_order=0) (unlikely executed)
+
+;; Function foo (foo, funcdef_no=0, cgraph_uid=1, symbol_order=0) (unlikely
executed)

 (note # 0 0 NOTE_INSN_DELETED)
 (note # 0 0 NOTE_INSN_PROLOGUE_END)
@@ -18,7 +19,10 @@
 (barrier # 0 0)
 (note # 0 0 NOTE_INSN_DELETED)

-;; Function main (main, funcdef_no=1, decl_uid=4429, cgraph_uid=2,
symbol_order=1)
+Declarations used by main, sorted by DECL_UID:
+0:   void ;
+
+;; Function main (main, funcdef_no=1, cgraph_uid=2, symbol_order=1)

 (note # 0 0 NOTE_INSN_DELETED)
 (note # 0 0 [bb 2] NOTE_INSN_BASIC_BLOCK)

[Bug target/114284] [14 Regression] arm: Load of volatile short gets miscompiled (loaded twice) since r14-8319

2024-03-08 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114284

--- Comment #3 from Alex Coplan  ---
I think this has been fixed by
r14-9379-ga0e945888d973fc1a4a9d2944aa7e96d2a4d7581

[Bug target/114284] New: [14 Regression] arm: Load of volatile short gets miscompiled (loaded twice)

2024-03-08 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114284

Bug ID: 114284
   Summary: [14 Regression] arm: Load of volatile short gets
miscompiled (loaded twice)
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following is a wrong code regression in GCC 14:

volatile short x;
short foo() {
  return x;
}

with -march=armv8-m.base -O2 on the trunk we get:

foo:
movw r3, #:lower16:.LANCHOR0
movt r3, #:upper16:.LANCHOR0
ldrh r2, [r3]
movs r0, #0
ldrsh   r0, [r3, r0]
bx  lr

i.e. x is loaded twice, but with GCC 13 we get:

foo:
movw r3, #:lower16:.LANCHOR0
movt r3, #:upper16:.LANCHOR0
ldrh r0, [r3]
sxth r0, r0
bx  lr

I suppose ideally we would have just one ldrsh, but the GCC 13 code is OK.

[Bug tree-optimization/114193] New: missed early break vectorization of reduction

2024-03-01 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114193

Bug ID: 114193
   Summary: missed early break vectorization of reduction
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

For the following loop:

int a[1024];
int f(int *x, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      if (a[i] == 42)
        break;
      sum += a[i];
    }
  return sum;
}

at -O3 on aarch64 we miss vectorizing it.  It works if I move the early exit
down below the update of sum.  It looks like vect_analyze_scalar_cycles fails
to detect this as a reduction:

/app/example.c:5:23: note:   Analyze phi: sum_10 = PHI 
/app/example.c:5:23: missed:   intermediate value used outside loop.
/app/example.c:5:23: missed:   Unknown def-use cycle pattern.

[Bug tree-optimization/114192] New: scalar code left around following early break vectorization of reduction

2024-03-01 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192

Bug ID: 114192
   Summary: scalar code left around following early break
vectorization of reduction
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

For the following testcase:

int a[1024];
int f4(int *x, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      sum += a[i];
      if (a[i] == 42)
        break;
    }
  return sum;
}

at -O3 on aarch64 we vectorize it and get the following vector loop:

.L4:
cmp x7, x2
beq .L23
.L6:
ubfiz   x3, x2, 4, 32
ldr w6, [x4, x2, lsl 2]// scalar load
mov v27.16b, v30.16b
mov w0, w5
add v30.4s, v30.4s, v25.4s
add w5, w5, w6 // scalar add
ldr q29, [x4, x3]
add x2, x2, 1
cmeq    v31.4s, v29.4s, v26.4s
add v28.4s, v28.4s, v29.4s
umaxp   v31.4s, v31.4s, v31.4s
fmov    x3, d31
cbz x3, .L4

but here the old scalar code has been left around.  If we remove the early exit
from the loop, then although we still leave the scalar code around in the
vectorizer, it gets optimized away immediately by the following DCE pass.

Without the early exit, in the vectorizer dump we have:

   [local count: 860067200]:
  # sum_10 = PHI 
  # i_12 = PHI 
  # vect_sum_10.8_25 = PHI 
  # vectp_a.9_26 = PHI 
  # ivtmp_32 = PHI 
  vect__1.11_28 = MEM  [(int *)vectp_a.9_26];
  _1 = a[i_12]; // scalar load
  vect_sum_6.12_29 = vect__1.11_28 + vect_sum_10.8_25;
  sum_6 = _1 + sum_10;
  i_7 = i_12 + 1;
  vectp_a.9_27 = vectp_a.9_26 + 16;
  ivtmp_33 = ivtmp_32 + 1;
  if (ivtmp_33 < bnd.5_22)
goto ; [89.00%]
  else
goto ; [11.00%]

i.e. the scalar load is left around, but it seems to get cleaned up by the
(immediately following) dce pass:

   [local count: 860067200]:
  # vect_sum_10.8_25 = PHI 
  # vectp_a.9_26 = PHI 
  # ivtmp_32 = PHI 
  vect__1.11_28 = MEM  [(int *)vectp_a.9_26];
  vect_sum_6.12_29 = vect__1.11_28 + vect_sum_10.8_25;
  vectp_a.9_27 = vectp_a.9_26 + 16;
  ivtmp_33 = ivtmp_32 + 1;
  if (ivtmp_33 < bnd.5_22)
goto ; [89.00%]
  else
goto ; [11.00%]

perhaps the dce needs improving to clean up the dead scalar code in the early
exit case, too.

[Bug tree-optimization/111770] predicated loads inactive lane values not modelled

2024-02-22 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

--- Comment #4 from Alex Coplan  ---
(In reply to Richard Biener from comment #3)
> As said X + 0. -> X is an invalid transform with FP unless there are no
> signed zeros (maybe also problematic with sign-dependent rounding).

Yeah, I was thinking about the integer case above.
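
For reference, the signed-zero problem with the FP transform can be sketched as
follows (assuming IEEE 754 arithmetic in the default round-to-nearest mode and
no -ffast-math; add_pos_zero is a name made up for the example):

```cpp
#include <cassert>
#include <cmath>

// Computes x + 0.0.  The volatile keeps the compiler from folding the
// addition away at compile time.
double add_pos_zero(double x)
{
  volatile double zero = 0.0;
  return x + zero;
}

// Usage sketch: x + 0.0 is not the identity for x == -0.0.
void signed_zero_demo()
{
  double r = add_pos_zero(-0.0);
  assert(r == -0.0);           // the values compare equal...
  assert(std::signbit(-0.0));  // ...but -0.0 has its sign bit set,
  assert(!std::signbit(r));    // whereas -0.0 + 0.0 yields +0.0.
}
```

There is no integer analogue of this: x + 0 is x for every integer x.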

> 
> I think we agree to define .MASK_LOAD to zero masked elements.  When we need
> something else we need to add an explicit ELSE value.  That needs documenting
> of course and also possibly testsuite coverage - I _think_ you should be able
> to do a GIMPLE frontend testcase for this.

Sounds good, thanks.

> 
> Note this behavior would extend to .MASK_GATHER_LOAD as well as
> the load-lanes and -len variants.
> 
> Unfortunately we do not have _any_ internals manual documentation for
> internal functions - but you can backtrack to the optabs documentation
> where this would need documenting.
> 
> Now, if-conversion could indeed elide the .COND_ADD for integers.  It's
> problematic there only because of signed overflow undefinedness, so
> you shouldn't see it for 'unsigned' already, and adding zero is safe.

Can you elaborate on this a bit? Do you mean to say that the .COND_ADD is only
there to avoid if-conversion introducing UB due to signed overflow? ISTM it's
needed for correctness even without that, as the addend needn't be guaranteed
to be zero in the general case.

> if-conversion would need to have an idea of all the ranges involved here
> so it might be a bit sophisticated to get it right.

Does what I suggested above make any sense, or do you have in mind a different
way of handling this in if-conversion? I'm wondering how ifcvt should determine
that the addend is zero in the case where the predicate is false.

Thanks

[Bug tree-optimization/111770] predicated loads inactive lane values not modelled

2024-02-21 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

--- Comment #2 from Alex Coplan  ---
I think to progress this and related cases we need to have .MASK_LOAD defined
to zero in the case that the predicate is false (either unconditionally for all
targets if possible or otherwise conditionally for targets where that is safe).

Here is a related case:

int bar(int n, char *a, char *b, char *c) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
if (c[i] == 0)
  sum += a[i] * b[i];
  return sum;
}

in this case we get the missed optimization even before vectorization during
ifcvt (in some ways it is a simpler case to consider as only scalars are
involved).  Here with -O3 -march=armv9-a from ifcvt we get:

   [local count: 955630224]:
  # sum_23 = PHI <_ifc__41(8), 0(18)>
  # i_25 = PHI 
  _1 = (sizetype) i_25;
  _2 = c_16(D) + _1;
  _3 = *_2;
  _29 = _3 == 0;
  _43 = _42 + _1;
  _4 = (char *) _43;
  _5 = .MASK_LOAD (_4, 8B, _29);
  _6 = (int) _5;
  _45 = _44 + _1;
  _7 = (char *) _45;
  _8 = .MASK_LOAD (_7, 8B, _29);
  _9 = (int) _8;
  _46 = (unsigned int) _6;
  _47 = (unsigned int) _9;
  _48 = _46 * _47;
  _10 = (int) _48;
  _ifc__41 = .COND_ADD (_29, sum_23, _10, sum_23);

for this case it should be possible to use an unpredicated add instead of a
.COND_ADD.  We essentially need to show that this transformation is valid:

  _29 ? sum_23 + _10 : sum_23 --> sum_23 + _10

and this essentially boils down to showing that:

  _29 = false => _10 = 0

now I'm not sure if there's a way of match-and-simplifying some GIMPLE
expression under the assumption that a given SSA name takes a particular value;
but if there were, and we defined .MASK_LOAD to zero given a false predicate,
then we could evaluate _10 under the assumption that _29 = false, which if we
added some simple match.pd rule for .MASK_LOAD with a false predicate would
allow it to evaluate to zero, and thus we could establish _10 = 0 proving the
transformation is correct.  If such an approach is possible then I guess ifcvt
could use it to avoid conditionalizing statements unnecessarily.
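
As a scalar sketch of the argument (this models the *proposed* zeroing
semantics for .MASK_LOAD, which is not something GCC guarantees today; all
function names below are hypothetical):

```cpp
#include <cassert>

// Model of .MASK_LOAD under the proposed semantics: a false predicate
// yields zero rather than an unspecified value.
char mask_load_zeroing(const char *p, bool pred)
{
  return pred ? *p : 0;
}

// One loop iteration as ifcvt emits it today: the add is predicated,
// i.e. .COND_ADD (pred, sum, prod, sum).
int step_cond_add(bool pred, int sum, const char *a, const char *b)
{
  int prod = (int) mask_load_zeroing(a, pred) * (int) mask_load_zeroing(b, pred);
  return pred ? sum + prod : sum;
}

// The same iteration with an unpredicated add.  This is only equivalent
// because pred == false implies prod == 0 under the zeroing model.
int step_plain_add(bool pred, int sum, const char *a, const char *b)
{
  int prod = (int) mask_load_zeroing(a, pred) * (int) mask_load_zeroing(b, pred);
  return sum + prod;
}

// Usage sketch: both forms agree for either value of the predicate.
void cond_add_demo()
{
  const char a = 3, b = 5;
  assert(step_cond_add(true, 10, &a, &b) == step_plain_add(true, 10, &a, &b));
  assert(step_cond_add(false, 10, &a, &b) == step_plain_add(false, 10, &a, &b));
}
```

If the masked-off load could return an arbitrary value instead, step_plain_add
would no longer be a valid replacement.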

Richi: any thoughts on the above or on how we should handle this sort of thing?

[Bug target/112922] [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto

2024-02-20 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922

Alex Coplan  changed:

   What|Removed |Added

 CC||acoplan at gcc dot gnu.org,
   ||rsandifo at gcc dot gnu.org

--- Comment #1 from Alex Coplan  ---
So I did some bisection on this, and indeed it seems to have started with
r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 i.e.

commit 9f0f7d802482a8958d6cdc72f1fe0c8549db2182
Author: Richard Sandiford 
Date:   Thu Dec 7 19:41:19 2023

aarch64: Add an early RA for strided registers

but then it seemed to get fixed shortly afterwards by
r14-6339-g8b5cd6c4519cc120badd2b35a9e30d4deb82c012 i.e.

commit 8b5cd6c4519cc120badd2b35a9e30d4deb82c012
Author: Richard Sandiford 
Date:   Fri Dec 8 16:27:40 2023

aarch64: Some tweaks to the early-ra pass

CCing Richard S who can hopefully confirm if that change was expected to fix
correctness / wrong code issues.

[Bug target/111677] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-02-20 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #34 from Alex Coplan  ---
Fixed for all active branches.

[Bug target/111677] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-02-14 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

Summary|[12 Regression] darktable   |darktable build on aarch64
   |build on aarch64 fails with |fails with unrecognizable
   |unrecognizable insn due to  |insn due to
   |-fstack-protector changes   |-fstack-protector changes

--- Comment #32 from Alex Coplan  ---
Fixed for GCC 12, keeping open for a final backport to GCC 11 (since the stack
protector patches were also backported there, and the underlying issue is
latent there).

[Bug c++/113658] GCC 14 has incomplete impl for declared feature "cxx_constexpr_string_builtins"

2024-02-13 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113658

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Alex Coplan  ---
Fixed, thanks for the report.

[Bug target/111677] [12 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-02-12 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

--- Comment #30 from Alex Coplan  ---
Backport for GCC 12 submitted:
https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645415.html

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-08 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #12 from Alex Coplan  ---
Here is an alternative testcase that also fails in the same way on the GCC 12
and 13 branches:

void foo(int x, int y, int z, int d, int *buf)
{
  for(int i = z; i < y-z; ++i)
for(int j = 0; j < d; ++j)
  buf[i*x+(z-j-1)] = buf[i*x+(z+j)];
}

void bar(int x, int y, int z, int d, int *buf)
{
  for(int i = 0; i < d; ++i)
for(int j = z; j < x-z; ++j)
  buf[j+(z-i-1)*x] = buf[j+(z+i)*x];
}

__attribute__((noipa))
void baz(int x, int y, int d, int *buf)
{
  foo(x, y, 0, d, buf);
  bar(x, y, 0, d, buf);
}

int main(void)
{
  int a[] = { 1, 2, 3 };
  baz (1, 2, 1, a+1);
  /* buf = a+1.
 foo does:
 buf[-1] = buf[0]; // { 2, 2, 3 }
 buf[0] = buf[1];  // { 2, 3, 3 }

 bar does:
 buf[-1] = buf[0]; // { 3, 3, 3 }  */
  for (int i = 0; i < 2; i++)
if (a[i] != 3)
  __builtin_abort ();
}

[Bug target/111677] [12 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-02-07 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

Summary|[12/13 Regression]  |[12 Regression] darktable
   |darktable build on aarch64  |build on aarch64 fails with
   |fails with unrecognizable   |unrecognizable insn due to
   |insn due to |-fstack-protector changes
   |-fstack-protector changes   |

--- Comment #29 from Alex Coplan  ---
Should be fixed for GCC 13, I'll work on a backport for GCC 12 too.

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-06 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #7 from Alex Coplan  ---
(In reply to Andrew Pinski from comment #6)
> (In reply to Jakub Jelinek from comment #5)
> > My bisection points to r12-5915-ge93809f62363ba4b233858005aef652fb550e896
> 
> Which means it is related to bug 110702 .
> 
> Again try -fno-ivopts . I suspect ivopts is producing some odd ir that is
> confusing modref here.

Yeah, it seems -fno-ivopts makes the execution test pass too (so -O
-fno-ivopts).

[Bug tree-optimization/113787] [14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-06 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #4 from Alex Coplan  ---
Same with the head of the GCC 12 branch, but I agree it isn't a [14 Regression]
as I can reproduce the issue with basepoints/gcc-14, so maybe something was
backported to 12/13 that is making it latent on the branches?

[Bug tree-optimization/113787] [14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-06 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #3 from Alex Coplan  ---
(In reply to Jakub Jelinek from comment #1)
> Why do you think it is a 14 Regression?
> Seems r12-5166 works fine while r12-6600 already doesn't, so that would make
> it [12/13/14 Regression], no?

Well on the head of the GCC 13 branch the execution test seems to pass for me
and I see no difference with/without ipa-modref, I'll double check with GCC 12.

[Bug tree-optimization/113787] New: [14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-06 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

Bug ID: 113787
   Summary: [14 Regression] Wrong code at -O with ipa-modref on
aarch64
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following testcase appears to be miscompiled on the trunk, on
aarch64-linux-gnu:

$ cat t.c
void foo(int x, int y, int z, int d, int *buf)
{
  for(int i = z; i < y-z; ++i)
for(int j = 0; j < d; ++j)
  /* buf[x(i+1) + j] = buf[x(i+1)-j-1] */
  buf[i*x+(x-z+j)] = buf[i*x+(x-z-1-j)];
}

void bar(int x, int y, int z, int d, int *buf)
{
  for(int i = 0; i < d; ++i)
for(int j = z; j < x-z; ++j)
  /* buf[j+(y+i)*x] = buf[j+(y-1-i)*x] */
  buf[j+(y-z+i)*x] = buf[j+(y-z-1-i)*x];
}

__attribute__((noipa))
void baz(int x, int y, int d, int *buf)
{
  foo(x, y, 0, d, buf);
  bar(x, y, 0, d, buf);
}

int main(void)
{
  int a[] = { 1, 2, 3 };
  baz (1, 2, 1, a);
  /* foo does:
 buf[1] = buf[0];
 buf[2] = buf[1];

 bar does:
 buf[2] = buf[1]; (no-op)
 so we should have { 1, 1, 1 }.  */
  for (int i = 0; i < 3; i++)
if (a[i] != 1)
  __builtin_abort ();
}
$ gcc t.c -O -fno-ipa-modref
$ ./a.out
$ gcc t.c -O
$ ./a.out
Aborted

The problem seems to be that the call to foo gets incorrectly optimized
out from baz when ipa-modref is enabled:

$ gcc -c -S -o /dev/null t.c -O -fno-ipa-modref -fdump-tree-optimized=good.tree
$ gcc -c -S -o /dev/null t.c -O -fdump-tree-optimized=bad.tree
$ diff -u good.tree bad.tree
--- good.tree   2024-02-06 13:23:36.080926703 +
+++ bad.tree    2024-02-06 13:23:38.356916302 +
@@ -223,7 +223,6 @@
 void baz (int x, int y, int d, int * buf)
 {
[local count: 1073741824]:
-  foo (x_2(D), y_3(D), 0, d_4(D), buf_5(D));
   bar (x_2(D), y_3(D), 0, d_4(D), buf_5(D));
   return;

I can't seem to reproduce the issue with GCC 13 or on x86_64.

[Bug middle-end/113705] [14 Regression] ICE in decompose, at wide-int.h:1049 on aarch64-linux-gnu since r14-8680-g2f14c0dbb78985

2024-02-01 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113705

Alex Coplan  changed:

   What|Removed |Added

Summary|[14 Regression] ICE in  |[14 Regression] ICE in
   |decompose, at   |decompose, at
   |wide-int.h:1049 on  |wide-int.h:1049 on
   |aarch64-linux-gnu   |aarch64-linux-gnu since
   ||r14-8680-g2f14c0dbb78985

--- Comment #3 from Alex Coplan  ---
Started with r14-8680-g2f14c0dbb789852947cb58fdf7d3162413f053fa :

commit 2f14c0dbb789852947cb58fdf7d3162413f053fa
Author: Roger Sayle 
Date:   Thu Feb 1 06:10:42 2024

PR target/113560: Enhance is_widening_mult_rhs_p.

[Bug middle-end/113705] [14 Regression] ICE in decompose, at wide-int.h:1049 on aarch64-linux-gnu

2024-02-01 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113705

Alex Coplan  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 CC||acoplan at gcc dot gnu.org
   Last reconfirmed||2024-02-01
 Ever confirmed|0   |1

--- Comment #2 from Alex Coplan  ---
Confirmed. Here is a reduced testcase that ICEs with -O2 on aarch64-linux-gnu:

void free();
template <typename storage> struct generic_wide_int : storage {
  long elt() const;
};
int elt_i;
template <typename storage> long generic_wide_int<storage>::elt() const {
  return this->get_val()[elt_i];
}
struct wide_int_storage {
  struct {
long val[0];
long valp;
  } u;
  unsigned len;
  int precision;
  wide_int_storage(const wide_int_storage &);
  ~wide_int_storage();
  const long *get_val() const;
  unsigned get_len() const;
};
wide_int_storage::wide_int_storage(const wide_int_storage &) {
  if (__builtin_expect(precision, 0))
u.valp = 0;
}
wide_int_storage::~wide_int_storage() {
  if (__builtin_expect(precision, 0))
free();
}
const long *wide_int_storage::get_val() const { return u.val; }
unsigned wide_int_storage::get_len() const { return len; }
struct irange {
  generic_wide_int<wide_int_storage> upper_bound() const;
  generic_wide_int<wide_int_storage> *m_base;
};
generic_wide_int<wide_int_storage> irange::upper_bound() const {
  return m_base[1];
}
void set_irange() {
  irange r;
  for (unsigned i;;) {
generic_wide_int<wide_int_storage> __trans_tmp_1 = r.upper_bound();
long *__trans_tmp_2;
unsigned short *len;
*len = __trans_tmp_1.get_len();
for (i = 0; i < *len; ++i)
  *__trans_tmp_2++ = __trans_tmp_1.elt();
  }
}

[Bug target/111677] [12/13 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-31 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

Summary|[12/13/14 Regression]   |[12/13 Regression]
   |darktable build on aarch64  |darktable build on aarch64
   |fails with unrecognizable   |fails with unrecognizable
   |insn due to |insn due to
   |-fstack-protector changes   |-fstack-protector changes

--- Comment #27 from Alex Coplan  ---
Fixed on trunk for GCC 14, keeping open for backports.

[Bug target/111677] [12/13/14 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-30 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

--- Comment #25 from Alex Coplan  ---
Proposed fix for GCC 13:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/644459.html

[Bug target/111677] [12/13/14 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-30 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

   Keywords||patch

--- Comment #24 from Alex Coplan  ---
Proposed fix for trunk:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/61.html

[Bug target/111677] [12/13/14 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-30 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

   Keywords|needs-bisection |
  Known to fail|13.2.1  |14.0
  Known to work|14.0|
Version|13.2.0  |13.2.1
Summary|[12/13 Regression]  |[12/13/14 Regression]
   |darktable build on aarch64  |darktable build on aarch64
   |fails with unrecognizable   |fails with unrecognizable
   |insn due to |insn due to
   |-fstack-protector changes   |-fstack-protector changes

--- Comment #23 from Alex Coplan  ---
Discovered by accident while working on a patch for trunk, but adding
-funroll-loops to the testcase in #c20 is enough to make the ICE trigger on the
trunk, too.

Testing a fix for trunk and a backport to 13 (to start with).

To reproduce on the trunk (t.c as in #c20):

$ gcc/xgcc -B gcc -c t.c -O3 -ffast-math -fopenmp -fstack-protector-strong -funroll-loops
t.c: In function ‘dt_bilateral_splat.simdclone.1’:
t.c:25:1: error: unrecognizable insn:
   25 | }
  | ^
(insn 2182 2181 406 85 (set (mem/c:TF (plus:DI (reg/f:DI 31 sp)
(const_int 512 [0x200])) [7  S16 A8])
(reg:TF 55 v23)) -1
 (expr_list:REG_DEAD (reg:TF 55 v23)
(nil)))
during RTL pass: sched_fusion
t.c:25:1: internal compiler error: in get_attr_type, at
config/aarch64/aarch64.md:29678
0x74a68f _fatal_insn(char const*, rtx_def const*, char const*, int, char
const*)
/home/alecop01/toolchain/src/gcc/gcc/rtl-error.cc:108
0x74a6c3 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
/home/alecop01/toolchain/src/gcc/gcc/rtl-error.cc:116
0x18cf03b get_attr_type(rtx_insn*)
/home/alecop01/toolchain/src/gcc/gcc/config/aarch64/aarch64.md:29678
0x13278b7 aarch64_sched_variable_issue
/home/alecop01/toolchain/src/gcc/gcc/config/aarch64/aarch64.cc:15827
0x13278b7 aarch64_sched_variable_issue
/home/alecop01/toolchain/src/gcc/gcc/config/aarch64/aarch64.cc:15818
0x1e25057 schedule_block(basic_block_def**, void*)
/home/alecop01/toolchain/src/gcc/gcc/haifa-sched.cc:6912
0xeb307f schedule_region
/home/alecop01/toolchain/src/gcc/gcc/sched-rgn.cc:3203
0xeb307f schedule_insns()
/home/alecop01/toolchain/src/gcc/gcc/sched-rgn.cc:3525
0xeb34a3 schedule_insns()
/home/alecop01/toolchain/src/gcc/gcc/sched-rgn.cc:3511
0xeb34a3 rest_of_handle_sched_fusion
/home/alecop01/toolchain/src/gcc/gcc/sched-rgn.cc:3760
0xeb34a3 execute
/home/alecop01/toolchain/src/gcc/gcc/sched-rgn.cc:3938
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See  for instructions.

[Bug target/111677] [12/13 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-30 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #22 from Alex Coplan  ---
(In reply to Richard Sandiford from comment #21)
> 
> aarch64_get_separate_components is supposed to vet shrink-wrappable
> offsets, but in this case the offset looks valid, since:
> 
> str q22, [sp, #512]
> 
> is a valid instruction.  Perhaps the constraints are too narrow?

Yeah, as discussed offline, for T{I,F}mode we deliberately restrict the range
to the ldp x-reg range, since at least for TImode we don't know pre-RA how it
will be allocated (a single q reg or a pair of x regs).

We could look at using a different mode for the save that doesn't have those
restrictions, I'll try to do that.
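
To make the mismatch concrete, here is a sketch of the two architectural
immediate ranges involved (it only models the A64 scaled-immediate encodings,
not the actual address predicates in the aarch64 backend; the function names
are hypothetical):

```cpp
#include <cassert>

// STR/LDR Qn, [base, #imm]: unsigned 12-bit immediate scaled by 16,
// i.e. multiples of 16 in [0, 65520].
bool q_reg_offset_ok(long off)
{
  return off >= 0 && off <= 4095 * 16 && off % 16 == 0;
}

// LDP/STP Xn, Xm, [base, #imm]: signed 7-bit immediate scaled by 8,
// i.e. multiples of 8 in [-512, 504].
bool x_pair_offset_ok(long off)
{
  return off >= -512 && off <= 504 && off % 8 == 0;
}

// Usage sketch: the offset from the ICE is fine for a single q-register
// store, but just outside what a pair of x-registers can address.
void offset_demo()
{
  assert(q_reg_offset_ok(512));
  assert(!x_pair_offset_ok(512));
  assert(x_pair_offset_ok(504));
}
```

so an offset of 512 is rejected for a TFmode access even though the str q form
could encode it.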

[Bug c++/113658] GCC 14 has incomplete impl for declared feature "cxx_constexpr_string_builtins"

2024-01-30 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113658

Alex Coplan  changed:

   What|Removed |Added

   Last reconfirmed||2024-01-30
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org

--- Comment #5 from Alex Coplan  ---
(In reply to Jakub Jelinek from comment #3)
> Obviously using __has_builtin is much better than using the really badly
> designed __has_feature/__has_extension.
> That said, wcs{chr,cmp,len,ncmp} and wmem{chr,cmp} aren't builtins in gcc
> either, so I guess we shouldn't announce this "feature".

Mine, then.  I can prepare a patch to stop advertising the feature.

[Bug tree-optimization/113661] New: [14 Regression] xalancbmk miscompiled on aarch64 since r14-7194-g6cb155a6cf3142

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113661

Bug ID: 113661
   Summary: [14 Regression] xalancbmk miscompiled on aarch64 since
r14-7194-g6cb155a6cf3142
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

xalancbmk (both from SPEC 2006 and SPEC 2017) seems to be miscompiled on
aarch64 since r14-7194-g6cb155a6cf314232248a12bdd395ed4151ae5a28 i.e.

commit 6cb155a6cf314232248a12bdd395ed4151ae5a28 (refs/bisect/bad)
Author: Tamar Christina 
Date:   Fri Jan 12 15:24:49 2024 +

middle-end: make memory analysis for early break more deterministic
[PR113135]

I see:

*** Miscompare of ref-t5.out

with the options -Ofast -fomit-frame-pointer -mcpu=neoverse-v1 -flto=auto .

[Bug target/113616] [14 Regression] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113616

Alex Coplan  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Alex Coplan  ---
Should be fixed, thanks for the report.

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

--- Comment #5 from Alex Coplan  ---
Indeed passing -mearly-ra=none makes the ICE go away as well.

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|NEW
   Assignee|acoplan at gcc dot gnu.org |unassigned at gcc dot gnu.org

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

Alex Coplan  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #4 from Alex Coplan  ---
I think this is an early RA problem.  In asmcons (in function qux), we have:

   29: x1:DI=[r122:DI+0x8]
   30: x0:DI=[r122:DI]

and then in early_ra, we get:

   29: x1:DI=[v31:DI+0x8]
   30: x0:DI=[v31:DI]

CCing Richard S for an opinion.

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

--- Comment #3 from Alex Coplan  ---
I think ldp_fusion is exposing a latent issue here.  We trip the assert:

gcc_assert (aarch64_mem_pair_lanes_operand (mem, pair_mode));

on the RTL:

(rr) pr mem
(mem/f:V2x8QI (reg:DI 63 v31) [0 +0 S16 A64])

because v31 isn't a valid base register according to
aarch64_regno_ok_for_base_p.  This comes from the following RTL in sched1,
where we already have:

   30: x0:DI=[v31:DI]
   29: x1:DI=[v31:DI+0x8]

but again these mems look invalid as per aarch64_regno_ok_for_base_p.

[Bug target/113616] [14 Regression] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113616

Alex Coplan  changed:

   What|Removed |Added

URL||https://gcc.gnu.org/pipermail/gcc-patches/2024-January/644167.html
   Keywords||patch

--- Comment #4 from Alex Coplan  ---
Patch submitted:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/644167.html

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

Alex Coplan  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
  Known to fail||14.0
 Target||aarch64-*-*
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org

--- Comment #2 from Alex Coplan  ---
Confirmed, mine.

[Bug target/113616] [14 Regression] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113616

--- Comment #3 from Alex Coplan  ---
Testing a patch.

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

Alex Coplan  changed:

   What|Removed |Added

   Last reconfirmed||2024-01-26
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
 CC||acoplan at gcc dot gnu.org

--- Comment #1 from Alex Coplan  ---
Confirmed.

(In reply to Wilco from comment #0)
> A possible fix would be to avoid emitting LDP/STP in memcpy/memmove/memset
> expansions.

Yeah, so I had posted
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636855.html for that
but held off from committing it at the time as IMO there wasn't enough evidence
to show that this helps in general (and the pass could in theory miss
opportunities which would lead to regressions). 

But perhaps this is a good argument for going ahead with that change (of course
it will need rebasing).

[Bug target/113616] [14 Regression] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113616

--- Comment #2 from Alex Coplan  ---
I think the problem is this loop (and others that iterate over debug
uses in this way):

  // Now that we've characterized the defs involved, go through the
  // debug uses and determine how to update them (if needed).
  for (auto use : set->debug_insn_uses ())
{
  if (*pair_dst < *use->insn () && defs[1])
// We're re-ordering defs[1] above a previous use of the
// same resource.
update_debug_use (use, defs[1], writeback_pats[1]);
  else if (*pair_dst >= *use->insn ())
// We're re-ordering defs[0] below its use.
update_debug_use (use, defs[0], writeback_pats[0]);
}

because `update_debug_use` can remove uses from the list of debug uses,
we can't use a for-range loop as the iterator will become invalidated
before getting advanced.

Should be fairly straightforward to fix, sorry for the oversight.
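
For illustration, the general shape of a safe traversal looks like this (a
generic sketch using std::list rather than the rtl-ssa use lists, and not the
actual patch; maybe_remove stands in for update_debug_use):

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Stand-in for a callback that may unlink the element it is given.
// Here it removes even values.
void maybe_remove(std::list<int> &uses, std::list<int>::iterator it)
{
  if (*it % 2 == 0)
    uses.erase(it);
}

// Safe traversal: compute the successor *before* the call that may
// invalidate the current iterator.  A range-based for would instead
// increment the (possibly erased) iterator after the call.
void process_all(std::list<int> &uses)
{
  for (auto it = uses.begin(); it != uses.end();)
    {
      auto next = std::next(it);
      maybe_remove(uses, it);
      it = next;
    }
}

// Usage sketch: all even elements are removed without touching a
// dangling iterator.
void traversal_demo()
{
  std::list<int> uses = { 1, 2, 3, 4, 6 };
  process_all(uses);
  assert((uses == std::list<int>{ 1, 3 }));
}
```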

[Bug target/113616] [14 Regression] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113616

Alex Coplan  changed:

   What|Removed |Added

   Keywords||ice-on-valid-code
   Last reconfirmed||2024-01-26
  Known to fail||14.0
 Ever confirmed|0   |1
   See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089
 Target||aarch64-*-*
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED

--- Comment #1 from Alex Coplan  ---
Confirmed, mine.

[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

--- Comment #6 from Alex Coplan  ---
FWIW, if I move ldp_fusion1 before early_ra, with:

diff --git a/gcc/config/aarch64/aarch64-passes.def
b/gcc/config/aarch64/aarch64-passes.def
index 769d48f4faa..3853f6bf7a4 100644
--- a/gcc/config/aarch64/aarch64-passes.def
+++ b/gcc/config/aarch64/aarch64-passes.def
@@ -18,6 +18,7 @@
along with GCC; see the file COPYING3.  If not see
.  */

+INSERT_PASS_BEFORE (pass_sched, 1, pass_ldp_fusion);
 INSERT_PASS_BEFORE (pass_sched, 1, pass_aarch64_early_ra);
 INSERT_PASS_AFTER (pass_regrename, 1, pass_fma_steering);
 INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
@@ -25,5 +26,4 @@ INSERT_PASS_BEFORE (pass_late_thread_prologue_and_epilogue,
1, pass_switch_pstat
 INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
 INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
 INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
-INSERT_PASS_BEFORE (pass_early_remat, 1, pass_ldp_fusion);
 INSERT_PASS_BEFORE (pass_peephole2, 1, pass_ldp_fusion);

we get:

f:
.LFB0:
.cfi_startproc
adrp    x0, .LANCHOR0
add x0, x0, :lo12:.LANCHOR0
ldp d31, d30, [x0]
ldp d29, d28, [x0, 32]
fadd    v29.2s, v31.2s, v29.2s
fadd    v28.2s, v30.2s, v28.2s
stp d29, d28, [x0]
ret

note that this does use more registers, though, so it's not necessarily a clear
win in the general case (particularly if register pressure is already high).

[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

--- Comment #5 from Alex Coplan  ---
It looks like the current ordering of passes is:

early_ra
sched1
ldp_fusion1
early_remat

ISTM that ldp_fusion1 should probably be running before early_ra, but we found
that running ldp_fusion1 before sched1 could lead to increased register
pressure. Hmm.

[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|NEW
   Assignee|acoplan at gcc dot gnu.org |unassigned at gcc dot gnu.org

[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Alex Coplan  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org
Summary|[14 Regression] Missing ldp/stp optimization sometimes |[14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8

--- Comment #4 from Alex Coplan  ---
Interestingly we started to miss this with the introduction of aarch64
early RA i.e. r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182.

My ldp/stp pattern rewrite was:
r14-6604-gd7ee988c491cde43d04fe25f2b3dbad9d85ded45
so we started to miss this before any of my ldp/stp patches.

Looking at what happens with the ldp/stp pass, I can see that in sched1 we've
already allocated hard regs to the vector load destinations:

3: NOTE_INSN_BASIC_BLOCK 2
2: NOTE_INSN_FUNCTION_BEG
   13: NOTE_INSN_DELETED
5: debug begin stmt marker
6: r107:DI=high(`*.LANCHOR0')
7: r106:DI=r107:DI+low(`*.LANCHOR0')
  REG_EQUAL `*.LANCHOR0'
   14: v31:V2SF=[r107:DI+low(`*.LANCHOR0')]
   15: v30:V2SF=[r106:DI+0x20]
   16: v30:V2SF=v31:V2SF+v30:V2SF
  REG_DEAD v31:V2SF
   27: v31:V2SF=[r106:DI+0x8]
   17: [r107:DI+low(`*.LANCHOR0')]=v30:V2SF
  REG_DEAD r107:DI
  REG_DEAD v30:V2SF
   18: debug begin stmt marker
   28: v30:V2SF=[r106:DI+0x28]
   29: v30:V2SF=v31:V2SF+v30:V2SF
  REG_DEAD v31:V2SF
   30: [r106:DI+0x8]=v30:V2SF
  REG_DEAD r106:DI
  REG_DEAD v30:V2SF
   33: NOTE_INSN_DELETED

and then there's nothing that the early ldp/stp pass can do because the
would-be load pair candidates already use the same (hard) transfer register due
to early RA:

merge_pairs [L=1], cand vecs (14) x (27)
analyzing pair (load=1): (14,27)
punting on ldp due to reg conflcits (14,27)
merge_pairs [L=1], cand vecs (15) x (28)
analyzing pair (load=1): (15,28)
punting on ldp due to reg conflcits (15,28)
merge_pairs [L=0], cand vecs (17) x (30)
analyzing pair (load=0): (17,30)
pair (17,30): rejecting base 106 due to dataflow hazards (28,29)
can't form pair (17,30) due to dataflow hazards
starting the processing of deferred insns
ending the processing of deferred insns

CCing Richard S for an opinion.

[Bug target/113613] [14 Regression] Missing ldp/stp optimization sometimes

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Alex Coplan  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org
   Last reconfirmed||2024-01-26
 Status|UNCONFIRMED |ASSIGNED

--- Comment #3 from Alex Coplan  ---
Confirmed, I'll take a look.

[Bug target/111677] [12/13/14 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

--- Comment #20 from Alex Coplan  ---
I think the testcase in #c10 went latent on the 13 branch but the following
(reduced from the attachment) still ICEs on the tip of the 13 branch with
-Ofast -fopenmp -fstack-protector-strong:

typedef struct {
  long size_z;
  int width;
} dt_bilateral_t;
typedef float dt_aligned_pixel_t[4];
#pragma omp declare simd
void dt_bilateral_splat(dt_bilateral_t *b) {
  float *buf;
  long offsets[8];
  for (; b;) {
    int firstrow;
    for (int j = firstrow; j; j++)
      for (int i; i < b->width; i++) {
        dt_aligned_pixel_t contrib;
        for (int k = 0; k < 4; k++)
          buf[offsets[k]] += contrib[k];
      }
    float *dest;
    for (int j = (long)b; j; j++) {
      float *src = (float *)b->size_z;
      for (int i = 0; i < (long)b; i++)
        dest[i] += src[i];
    }
  }
}

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #9 from Alex Coplan  ---
(In reply to Andrew Pinski from comment #8)
> (In reply to Alex Coplan from comment #7)
> > I expect the store pairs come from memcpy lowering/expansion in the aarch64
> > backend, that is the only way we get store pairs so early in the RTL
> > pipeline IIRC.
> 
> In this case, memset is more likely.

Right, yeah.  I was using "memcpy lowering" to refer to all the
mem{cpy,set,move} expansion we have in the backend.

> 
> Either:
> for (int i = 0; i < j; i++)
> m[i] = vdupq_n_f32(0.F);
> Or
> for (int i = 0; i < l; i++)
> n[i] = vdupq_n_f32(0.F);

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #7 from Alex Coplan  ---
I expect the store pairs come from memcpy lowering/expansion in the aarch64
backend, that is the only way we get store pairs so early in the RTL pipeline
IIRC.

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #6 from Alex Coplan  ---
Looking at the dump files, the first difference seems to be in 292r.dse1:

 8: NOTE_INSN_BASIC_BLOCK 2
 2: r116:SI=zero_extend(x0:HI)
   REG_DEAD x0:HI
@@ -178,7 +161,26 @@
 5: NOTE_INSN_FUNCTION_BEG
10: r119:DI=sfp:DI-0x200
12: r121:V16QI=const_vector
+   13: [r119:DI]=unspec[r121:V16QI,r121:V16QI] 38
+   14: [r119:DI+0x20]=unspec[r121:V16QI,r121:V16QI] 38
+   15: [r119:DI+0x40]=unspec[r121:V16QI,r121:V16QI] 38
+   16: [r119:DI+0x60]=unspec[r121:V16QI,r121:V16QI] 38
+   17: [r119:DI+0x80]=unspec[r121:V16QI,r121:V16QI] 38
+   18: [r119:DI+0xa0]=unspec[r121:V16QI,r121:V16QI] 38
+   19: [r119:DI+0xc0]=unspec[r121:V16QI,r121:V16QI] 38
+   20: [r119:DI+0xe0]=unspec[r121:V16QI,r121:V16QI] 38
+  REG_DEAD r119:DI
21: r122:DI=sfp:DI-0x100
+   24: [r122:DI]=unspec[r121:V16QI,r121:V16QI] 38
+   25: [r122:DI+0x20]=unspec[r121:V16QI,r121:V16QI] 38
+   26: [r122:DI+0x40]=unspec[r121:V16QI,r121:V16QI] 38
+   27: [r122:DI+0x60]=unspec[r121:V16QI,r121:V16QI] 38
+   28: [r122:DI+0x80]=unspec[r121:V16QI,r121:V16QI] 38
+   29: [r122:DI+0xa0]=unspec[r121:V16QI,r121:V16QI] 38
+   30: [r122:DI+0xc0]=unspec[r121:V16QI,r121:V16QI] 38
+   31: [r122:DI+0xe0]=unspec[r121:V16QI,r121:V16QI] 38
+  REG_DEAD r122:DI
+  REG_DEAD r121:V16QI
 6: r100:V4SF=const_vector
 7: r106:SI=0
32: cc:CC=cmp(r116:SI,0)
@@ -254,6 +256,7 @@
73: r100:V4SF={r147:V4SF*r147:V4SF+r115:V4SF}
   REG_DEAD r147:V4SF
   REG_DEAD r115:V4SF
+   74: [sfp:DI-0x200]=r100:V4SF
75: r148:SI=r106:SI+0x2
   REG_DEAD r106:SI
76: r106:SI=zero_extend(r148:SI#0)

(the unspec 38s are store pairs).
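
As a sanity check on the counts in that dump, assuming the two runs of stores
are the zero-initialisation of m and n from the testcase in the original
report (instantiated as f<4>, so each array has 4 * 4 = 16 elements of 16
bytes): each array is 0x100 bytes, each store pair writes two Q registers
(0x20 bytes), so 8 store pairs per array at offsets 0x00 to 0xe0. The helper
names below are hypothetical and only there to show the arithmetic.

```c
#include <stddef.h>

enum
{
  TOY_VEC_BYTES = 16,                 /* one 128-bit vector (V16QI/V4SF) */
  TOY_STP_BYTES = 2 * TOY_VEC_BYTES   /* one store pair of Q registers */
};

/* Bytes occupied by an array of e * e 128-bit vectors.  */
static size_t
toy_array_bytes (unsigned e)
{
  return (size_t) e * e * TOY_VEC_BYTES;
}

/* Number of store pairs needed to fill such an array.  */
static size_t
toy_stp_count (unsigned e)
{
  return toy_array_bytes (e) / TOY_STP_BYTES;
}

/* Byte offset written by the i-th store pair.  */
static size_t
toy_stp_offset (size_t i)
{
  return i * TOY_STP_BYTES;
}
```

Two arrays at 8 store pairs each also accounts for the 16 stp q31, q31
instructions in the asm diff from the original report.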

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #4 from Alex Coplan  ---
Created attachment 57211
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57211&action=edit
after.s

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #3 from Alex Coplan  ---
Created attachment 57210
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57210&action=edit
before.s

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #2 from Alex Coplan  ---
(In reply to Richard Biener from comment #1)
> I will have a look - but can you explain for me what I see?  I suppose the
> testcase was reduced from something?

Yeah, the testcase is reduced.

> 
> Is the assembly diff complete?  That is, do we really have more fmla or
> are they just moved?

I think the diff is complete, I can upload the full before/after asm.

> 
> + stp q31, q31, [sp, 256] 
> 
> that's a store?  A paired store?  Aka, the sequence fills a stack(?)
> region with replications of q31?

That's right.

I'll also try to take a look at the RTL dumps to see if I can figure anything
out.

[Bug rtl-optimization/113597] New: [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

Bug ID: 113597
   Summary: [14 Regression] aarch64: Significant code quality
regression since r14-8346-ga98d5130a6dcff
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following testcase shows a significant regression in code quality
since r14-8346-ga98d5130a6dcff2ed4db371e500550134777b8cf on aarch64:

$ cat t.cc
#include <arm_neon.h>
typedef struct {
  float b;
  float c;
} d;
template <uint16_t e> void f(uint16_t g, d *u, d *v) {
  uint16_t j, l = j = e * e;
  float32_t b[j];
  float32_t c[l];
  float32x4_t m[j];
  for (int i = 0; i < j; i++)
    m[i] = vdupq_n_f32(0.F);
  float32x4_t n[l];
  for (int i = 0; i < l; i++)
    n[i] = vdupq_n_f32(0.F);
  for (uint16_t k = 0; k < g; k += 2) {
    float32x4_t o[e];
    for (int i = 0; i < e; i++)
      o[i] = vld1q_f32((float32_t *)&u[k]);
    int idx = 0;
    for (int a = 0; a < e; a++)
      for (int ah = a; ah < e; ah++)
        m[idx] = vfmaq_f32(m[idx], o[a], o[ah]);
    float32x4_t p[e];
    for (int i; i; i++)
      for (int a; a;)
        for (int ah;;)
          vfmsq_f32(n[idx], o[a], p[ah]);
  }
  for (int i = 0; i < j; i++)
    b[i] = vaddvq_f32(m[i]);
  for (int i = 0; i < l; i++)
    c[i] = vaddvq_f32(n[i]);
  constexpr uint16_t q(e * e);
  float32x4_t r[q];
  float32x2_t s;
  r[4] = float32x4_t{b[5] - c[3]};
  for (int i = 0; i < q; i++)
    vst1q_f32((float32_t *)[2 * i], r[i]);
  if (e % 2)
    vst1_f32((float32_t *)v, s);
}
void t() {
  d v, u;
  f<4>(0, , );
}

$ cat cmp.sh
#!/bin/bash
set -e

BEFORE=/work/builds/r14-8345/gcc
AFTER=/work/builds/r14-8346/gcc
SRC=t.cc

$BEFORE/xgcc -B $BEFORE -c -S -o before.s $SRC -Wall -Werror -Ofast -mcpu=neoverse-v2
$AFTER/xgcc -B $AFTER -c -S -o after.s $SRC -Wall -Werror -Ofast -mcpu=neoverse-v2

diff -u before.s after.s

$ ./cmp.sh
--- before.s    2024-01-25 10:35:56.977090552 +
+++ after.s     2024-01-25 10:35:57.385086341 +
@@ -9,16 +9,47 @@
 _Z1fILt4EEvtP1dS1_:
 .LFB3918:
        .cfi_startproc
-       ands    w0, w0, 65535
+       movi    v31.4s, 0
        sub     sp, sp, #768
        .cfi_def_cfa_offset 768
+       ands    w0, w0, 65535
        mov     w3, 0
+       stp     q31, q31, [sp, 256]
+       stp     q31, q31, [sp, 288]
+       stp     q31, q31, [sp, 320]
+       stp     q31, q31, [sp, 352]
+       stp     q31, q31, [sp, 384]
+       stp     q31, q31, [sp, 416]
+       stp     q31, q31, [sp, 448]
+       stp     q31, q31, [sp, 480]
+       stp     q31, q31, [sp, 512]
+       stp     q31, q31, [sp, 544]
+       stp     q31, q31, [sp, 576]
+       stp     q31, q31, [sp, 608]
+       stp     q31, q31, [sp, 640]
+       stp     q31, q31, [sp, 672]
+       stp     q31, q31, [sp, 704]
+       stp     q31, q31, [sp, 736]
+       movi    v31.4s, 0
        beq     .L3
        .p2align 5,,15
 .L2:
-       add     w1, w3, 2
-       and     w3, w1, 65535
-       cmp     w0, w1, uxth
+       ubfiz   x5, x3, 3, 16
+       add     w4, w3, 2
+       and     w3, w4, 65535
+       ldr     q30, [x1, x5]
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       str     q31, [sp, 256]
+       cmp     w0, w4, uxth
        bhi     .L2
 .L3:
        ldp     q30, q31, [sp]

[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9

2024-01-23 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Alex Coplan  ---
Should be fixed, thanks for the report.

[Bug target/113356] [14 Regression][aarch64] ICE in try_fuse_pair, at config/aarch64/aarch64-ldp-fusion.cc:2203 since r14-6947-g4b67ec7ff5b1aa

2024-01-23 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113356

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Alex Coplan  ---
Fixed, thanks for the report.

[Bug target/113070] [14 regression] [AArch64] [PGO/LTO] Miscompilation of go compiler

2024-01-23 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113070

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #13 from Alex Coplan  ---
Should be fixed, sorry for the delay, and thanks for the report.

[Bug target/113114] [14 Regression] ICE compiling gcc.c-torture/execute/pr59643.c with -mabi=ilp32; in try_promote_writeback aarch64-ldp-fusion.cc

2024-01-23 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113114

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from Alex Coplan  ---
Should be fixed, thanks for the report.

[Bug rtl-optimization/113546] [13/14 Regression] aarch64: bootstrap-debug-lean broken with -fcompare-debug failure since r13-2921-gf1adf45b17f7f1

2024-01-22 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113546

--- Comment #5 from Alex Coplan  ---
FWIW the original preprocessed testcase (regex.i) also started failing with the
same commit (as the reduced testcase).

[Bug rtl-optimization/113546] [13/14 Regression] aarch64: bootstrap-debug-lean broken with -fcompare-debug failure since r13-2921-gf1adf45b17f7f1

2024-01-22 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113546

Alex Coplan  changed:

   What|Removed |Added

   Keywords||compare-debug-failure
Summary|aarch64: bootstrap-debug-lean broken with -fcompare-debug failure |[13/14 Regression] aarch64: bootstrap-debug-lean broken with -fcompare-debug failure since r13-2921-gf1adf45b17f7f1
 Target||aarch64-*-*

--- Comment #1 from Alex Coplan  ---
The reduced testcase started failing with
r13-2921-gf1adf45b17f7f1ede463524d80032bb2ec866ead:

commit f1adf45b17f7f1ede463524d80032bb2ec866ead
Author: Eugene Rozenfeld 
Date:   Thu Apr 21 23:42:15 2022

Add instruction level discriminator support.

This is the first in a series of patches to enable discriminator support
in AutoFDO.

[Bug rtl-optimization/113546] New: aarch64: bootstrap-debug-lean broken with -fcompare-debug failure

2024-01-22 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113546

Bug ID: 113546
   Summary: aarch64: bootstrap-debug-lean broken with
-fcompare-debug failure
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

I tried a bootstrap --with-build-config=bootstrap-debug-lean on aarch64 and it
failed with an -fcompare-debug failure building libiberty/regex.c:

make[3]: Entering directory '/data/ajc/toolchain/builds/bstrap-lean/libiberty'
if [ x"-fPIC" != x ]; then \
  /home/alecop01/toolchain/builds/bstrap-lean/./prev-gcc/xgcc
-B/home/alecop01/toolchain/builds/bstrap-lean/./prev-gcc/
-B/home/alecop01/toolchain/builds/bstrap-lean/aarch64-unknown-linux-gnu/bin/
-B/home/alecop01/toolchain/builds/bstrap-lean/aarch64-unknown-linux-gnu/bin/
-B/home/alecop01/toolchain/builds/bstrap-lean/aarch64-unknown-linux-gnu/lib/
-isystem
/home/alecop01/toolchain/builds/bstrap-lean/aarch64-unknown-linux-gnu/include
-isystem
/home/alecop01/toolchain/builds/bstrap-lean/aarch64-unknown-linux-gnu/sys-include
  -fchecking=1 -c -DHAVE_CONFIG_H -g -O2 -fchecking=1 -fcompare-debug  -I.
-I/home/alecop01/toolchain/src/gcc/libiberty/../include  -W -Wall
-Wwrite-strings -Wc++-compat -Wstrict-prototypes -Wshadow=local -pedantic 
-D_GNU_SOURCE  -fPIC /home/alecop01/toolchain/src/gcc/libiberty/regex.c -o
pic/regex.o; \
else true; fi
xgcc: error: /home/alecop01/toolchain/src/gcc/libiberty/regex.c:
‘-fcompare-debug’ failure
Makefile:1219: recipe for target 'regex.o' failed
make[3]: *** [regex.o] Error 1
make[3]: Leaving directory '/data/ajc/toolchain/builds/bstrap-lean/libiberty'
Makefile:11725: recipe for target 'all-stage3-libiberty' failed
make[2]: *** [all-stage3-libiberty] Error 2
make[2]: Leaving directory '/data/ajc/toolchain/builds/bstrap-lean'
Makefile:26292: recipe for target 'stage3-bubble' failed
make[1]: *** [stage3-bubble] Error 2
make[1]: Leaving directory '/data/ajc/toolchain/builds/bstrap-lean'
Makefile:1099: recipe for target 'all' failed
make: *** [all] Error 2

Here is a reduced testcase for that:

$ cat t.c
int x;
void f() {
fail:
  switch (x) { case 0: goto fail;; }
}
$ ./xgcc -B . -c t.c -fcompare-debug -O -S -o /dev/null
xgcc: error: t.c: ‘-fcompare-debug’ failure

[Bug tree-optimization/113539] [14 Regression] perlbench miscompiled on aarch64 since r14-8223-g1c1853a70f

2024-01-22 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113539

--- Comment #4 from Alex Coplan  ---
Reproduces with just -O3 -fno-strict-aliasing FWIW, no LTO or -mcpu needed.

[Bug tree-optimization/113539] New: [14 Regression] perlbench miscompiled on aarch64 since r14-8223-g1c1853a70f

2024-01-22 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113539

Bug ID: 113539
   Summary: [14 Regression] perlbench miscompiled on aarch64 since
r14-8223-g1c1853a70f
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

I'm seeing miscompares of perlbench (both from SPEC CPU 2006 and SPEC CPU 2017)
on aarch64 with recent trunk changes, a bisect pointed to
r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c :

commit 1c1853a70f9422169190e65e568dcccbce02d95c
Author: Richard Biener 
Date:   Thu Jan 18 10:22:34 2024

Fix memory leak in vect_analyze_loop_form

The miscompares are with the checkspam.pl workload; I see:

*** Miscompare of checkspam.2500.5.25.11.150.1.1.1.1.out

I've seen this with:

-flto=auto -fomit-frame-pointer -O3 -fno-strict-aliasing

and various -mcpu options (at least -mcpu=cortex-a72 and -mcpu=neoverse-v1).

[Bug target/113114] [14 Regression] ICE compiling gcc.c-torture/execute/pr59643.c with -mabi=ilp32; in try_promote_writeback aarch64-ldp-fusion.cc

2024-01-19 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113114

Alex Coplan  changed:

   What|Removed |Added

URL||https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643460.html
   Keywords||patch

--- Comment #8 from Alex Coplan  ---
Patch submitted:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643460.html

[Bug target/113114] [14 Regression] ICE compiling gcc.c-torture/execute/pr59643.c with -mabi=ilp32; in try_promote_writeback aarch64-ldp-fusion.cc

2024-01-19 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113114

--- Comment #7 from Alex Coplan  ---
Testing a fix.

[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9

2024-01-19 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089

Alex Coplan  changed:

   What|Removed |Added

URL||https://patchwork.sourceware.org/project/gcc/list/?series=29928
   Keywords||patch

--- Comment #12 from Alex Coplan  ---
Patches submitted:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643430.html
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643431.html
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643432.html

[Bug middle-end/113494] New: [14 Regression] ICE (segfault) in slpeel_tree_duplicate_loop_to_edge_cfg since r14-8206-g0f38666680d6ad0e

2024-01-18 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113494

Bug ID: 113494
   Summary: [14 Regression] ICE (segfault) in
slpeel_tree_duplicate_loop_to_edge_cfg since
r14-8206-g0f3880d6ad0e
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following fails on aarch64:

./xgcc -B . -c ~/toolchain/src/gcc/gcc/testsuite/gcc.c-torture/execute/20150611-1.c -S -o /dev/null -O3
during GIMPLE pass: vect
/home/alecop01/toolchain/src/gcc/gcc/testsuite/gcc.c-torture/execute/20150611-1.c:
In function ‘main’:
/home/alecop01/toolchain/src/gcc/gcc/testsuite/gcc.c-torture/execute/20150611-1.c:5:1:
internal compiler error: Segmentation fault
5 | main ()
  | ^~~~
0x30bdd30 crash_signal
/home/alecop01/toolchain/src/gcc/gcc/toplev.cc:317
0x7f38d1b6708f ???
   
/build/glibc-wuryBv/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x26e6cf6 gimple_phi_result(gphi const*)
/home/alecop01/toolchain/src/gcc/gcc/gimple.h:4608
0x34b7f10 slpeel_tree_duplicate_loop_to_edge_cfg(loop*, edge_def*, loop*,
edge_def*, edge_def*, edge_def**, bool, vec*)
/home/alecop01/toolchain/src/gcc/gcc/tree-vect-loop-manip.cc:1751
0x34bceea vect_do_peeling(_loop_vec_info*, tree_node*, tree_node*, tree_node**,
tree_node**, tree_node**, int, bool, bool, tree_node**)
/home/alecop01/toolchain/src/gcc/gcc/tree-vect-loop-manip.cc:3342
0x34a85f3 vect_transform_loop(_loop_vec_info*, gimple*)
/home/alecop01/toolchain/src/gcc/gcc/tree-vect-loop.cc:11929
0x34fdcda vect_transform_loops
/home/alecop01/toolchain/src/gcc/gcc/tree-vectorizer.cc:1006
0x34fe445 try_vectorize_loop_1
/home/alecop01/toolchain/src/gcc/gcc/tree-vectorizer.cc:1152
0x34fe57e try_vectorize_loop
/home/alecop01/toolchain/src/gcc/gcc/tree-vectorizer.cc:1182
0x34fe844 execute
/home/alecop01/toolchain/src/gcc/gcc/tree-vectorizer.cc:1298
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

[Bug target/113114] [14 Regression] ICE compiling gcc.c-torture/execute/pr59643.c with -mabi=ilp32; in try_promote_writeback aarch64-ldp-fusion.cc

2024-01-18 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113114

--- Comment #6 from Alex Coplan  ---
Hmm, it's worth noting that the ILP32 case is a bit different, though, in that
we have:

(rr) call debug (insn->rtl ())
(insn 16 21 19 3 (parallel [
(set (reg:DF 62 v30)
(unspec:DF [
(mem:V2x8QI (reg/v/f:DI 0 x0 [orig:123 a ] [123]) [0 +0
S16 A64])
] UNSPEC_LDP_FST))
(set (reg:DF 63 v31)
(unspec:DF [
(mem:V2x8QI (reg/v/f:DI 0 x0 [orig:123 a ] [123]) [0 +0
S16 A64])
] UNSPEC_LDP_SND))
]) 88 {*load_pair_8}
 (nil))
(rr) call debug (trailing_add->rtl ())
(insn 20 18 41 3 (set (reg:SI 0 x0 [orig:118 ivtmp.22 ] [118])
(plus:SI (reg:SI 0 x0 [orig:123 a ] [123])
(const_int 8 [0x8]))) 119 {*addsi3_aarch64}
 (nil))

i.e. x0 appears as DImode in the load pair addresses but the trailing update is
done in SImode, which means we end up not matching when forming the final
pattern.

I don't think either case is particularly interesting, so I'm leaning towards
just bailing out if recog fails in the pass (in which case both of these just
become missed-optimizations).
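
To illustrate the mismatch and the proposed bail-out, here is a toy model in
which the "pattern" only matches when the base register in the load pair
address and the destination of the trailing add agree in both register number
and mode. All names are hypothetical stand-ins (toy_recog_writeback_pair is
not recog, and toy_try_promote_writeback is not the real function); it is just
a sketch of returning failure instead of asserting.

```c
#include <stdbool.h>

enum toy_mode { TOY_SImode, TOY_DImode };

struct toy_reg
{
  int regno;            /* hard register number, 0 for x0 */
  enum toy_mode mode;   /* mode the register is accessed in */
};

/* Stand-in for pattern matching: the writeback base must be the same
   register in the same mode as the base used in the memory address.  */
static bool
toy_recog_writeback_pair (struct toy_reg mem_base, struct toy_reg add_dest)
{
  return mem_base.regno == add_dest.regno && mem_base.mode == add_dest.mode;
}

/* Return true if the pair would be promoted to a writeback form.
   A failed match is treated as a missed optimization, not an ICE.  */
static bool
toy_try_promote_writeback (struct toy_reg mem_base, struct toy_reg add_dest)
{
  if (!toy_recog_writeback_pair (mem_base, add_dest))
    return false;   /* bail out and leave the insns untouched */
  return true;
}
```

In the ILP32 dump above the address uses x0 in DImode while the add writes x0
in SImode, so the model returns false rather than aborting.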

[Bug target/113184] [14 Regression] ICE: in extract_insn, at recog.cc:2812 (unrecognizable insn) with -O -frounding-math -fnon-call-exceptions since r14-6605

2024-01-18 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113184

Alex Coplan  changed:

   What|Removed |Added

 Resolution|--- |DUPLICATE
 Status|NEW |RESOLVED

--- Comment #2 from Alex Coplan  ---
Fixed by Andrew's r14-8194-g7a8124e341aebcc544b4720e920b625f4ffe4e8a (thanks!)
so a dup of PR113221.

*** This bug has been marked as a duplicate of bug 113221 ***

[Bug target/113221] [14 Regression][aarch64]ICE in extract_insn, at recog.cc:2812 since r14-6605-gc0911c6b357ba9

2024-01-18 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113221

Alex Coplan  changed:

   What|Removed |Added

 CC||zsojka at seznam dot cz

--- Comment #9 from Alex Coplan  ---
*** Bug 113184 has been marked as a duplicate of this bug. ***

[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9

2024-01-18 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089

--- Comment #11 from Alex Coplan  ---
Testing a patch, sorry for the delay on this.

[Bug target/113114] [14 Regression] ICE compiling gcc.c-torture/execute/pr59643.c with -mabi=ilp32; in try_promote_writeback aarch64-ldp-fusion.cc

2024-01-17 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113114

--- Comment #5 from Alex Coplan  ---
Hmm, so initially (with the testcase in c3) we have:

ldp s30, s29, [x0, #-4]
...
add x0, x0, #-4

and we try to form:

ldp s30, s29, [x0, #-4]!

with this RTL:

(rr) call debug (pair_change.m_insn->rtl ())
(insn 47 18 20 3 (parallel [
            (set (reg:DI 0 x0 [119])
                (plus:DI (reg:DI 0 x0 [orig:101 ivtmp.12 ] [101])
                    (const_int -4 [0xfffffffffffffffc])))
            (set (reg:SF 62 v30 [orig:122 MEM[(float *)_18] ] [122])
                (mem:SF (plus:DI (reg:DI 0 x0 [orig:101 ivtmp.12 ] [101])
                        (const_int -4 [0xfffffffffffffffc])) [0 +0 S4 A32]))
            (set (reg:SF 61 v29 [orig:116 MEM[(float *)_18] ] [116])
                (mem:SF (reg:DI 0 x0 [orig:101 ivtmp.12 ] [101]) [0 +4 S4 A32]))
        ]) "t.c":6:7 -1
     (nil))

but the problem is that we're expecting to match this pattern:

;; Load pair with pre-index writeback.
(define_insn "*loadwb_pre_pair_"
  [(set (match_operand 0 "pmode_register_operand")
(match_operator 8 "pmode_plus_operator" [
  (match_operand 1 "pmode_register_operand")
  (match_operand 4 "const_int_operand")]))
   (set (match_operand:GPI 2 "aarch64_ldp_reg_operand")
(match_operator 6 "memory_operand" [
  (match_operator 9 "pmode_plus_operator" [
(match_dup 1)
(match_dup 4)
  ])]))
   (set (match_operand:GPI 3 "aarch64_ldp_reg_operand")
(match_operator 7 "memory_operand" [
  (match_operator 10 "pmode_plus_operator" [
 (match_dup 1)
 (match_operand 5 "const_int_operand")
  ])]))]
  "aarch64_mem_pair_offset (operands[4], <MODE>mode)
   && known_eq (INTVAL (operands[5]),
                INTVAL (operands[4]) + GET_MODE_SIZE (<MODE>mode))"
  {@ [cons: =&0, 1, =2, =3; attrs: type ]
 [   rk, 0,  r,  r; load_] ldp\t%2, %3, [%0, %4]!
 [   rk, 0,  w,  w; neon_load1_2reg ] ldp\t%2, %3, [%0, %4]!
  }
)

which simply doesn't match due to the shape of the RTL: that is, the pattern
hard-codes two plus operands, but due to the offset of -4 here we end up with
the second operand accessing memory directly at (the initial value of) x0.

We could add a second pattern to handle this specific case, or we could just
adjust try_promote_writeback to not assert that recog succeeds and accept the
missed optimization for the time being.
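
The arithmetic behind the mismatch, as a small sketch: the second access sits
at the first offset plus the access size, and an address with a zero offset is
written as a bare register rather than as a plus. With a first offset of -4
and SFmode (4 bytes) the second offset is 0, so there is no plus for the
pattern to match. The names below are made up for illustration and are not
GCC functions.

```c
#include <stdbool.h>

enum toy_addr_shape { TOY_ADDR_REG, TOY_ADDR_PLUS };

/* Shape of the address (base + offset) in this model: a zero offset
   collapses to the bare base register.  */
static enum toy_addr_shape
toy_addr_shape_for (long offset)
{
  return offset == 0 ? TOY_ADDR_REG : TOY_ADDR_PLUS;
}

/* The pre-index pair pattern above expects a plus in both memory
   addresses; return whether that holds for the given first offset and
   access size in bytes.  */
static bool
toy_matches_pre_pair (long first_offset, long mode_size)
{
  long second_offset = first_offset + mode_size;
  return toy_addr_shape_for (first_offset) == TOY_ADDR_PLUS
         && toy_addr_shape_for (second_offset) == TOY_ADDR_PLUS;
}
```

So in this model toy_matches_pre_pair (-4, 4) fails, while e.g. an offset of
-8 with the same access size would have both addresses in plus form.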

[Bug target/113114] [14 Regression] ICE compiling gcc.c-torture/execute/pr59643.c with -mabi=ilp32; in try_promote_writeback aarch64-ldp-fusion.cc

2024-01-17 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113114

--- Comment #4 from Alex Coplan  ---
(The above was reduced from gcc/testsuite/gcc.dg/torture/pr45720.c FWIW).

[Bug target/113114] [14 Regression] ICE compiling gcc.c-torture/execute/pr59643.c with -mabi=ilp32; in try_promote_writeback aarch64-ldp-fusion.cc

2024-01-17 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113114

--- Comment #3 from Alex Coplan  ---
The following ICEs in the same way without ILP32 (reduced from a testsuite run
with -funroll-loops):

$ cat t.c
float val[128];
float x;
void bar() {
  int i = 55;
  for (; i >= 0; --i)
    x += val[i];
}
$ gcc/xgcc -B gcc -c t.c -O -funroll-loops -mearly-ldp-fusion -mlate-ldp-fusion
during RTL pass: ldp_fusion
t.c: In function ‘bar’:
t.c:7:1: internal compiler error: in try_promote_writeback, at
config/aarch64/aarch64-ldp-fusion.cc:2675
7 | }
  | ^
0x14671b3 try_promote_writeback
   
/home/alecop01/toolchain/src/other_gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2675
0x14671b3 ldp_fusion_bb(rtl_ssa::bb_info*)
   
/home/alecop01/toolchain/src/other_gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2706
0x1467307 ldp_fusion()
   
/home/alecop01/toolchain/src/other_gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2726
0x146737b execute
   
/home/alecop01/toolchain/src/other_gcc/gcc/config/aarch64/aarch64-ldp-fusion.cc:2776
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

I'll investigate.  Probably we just shouldn't assert that recog succeeds here,
but I'll take a closer look at what's going on.

[Bug bootstrap/113449] [14 Regression] Bootstrap comparison failure on f95-lang.o since r14-8174

2024-01-17 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113449

Alex Coplan  changed:

   What|Removed |Added

 CC||acoplan at gcc dot gnu.org

--- Comment #1 from Alex Coplan  ---
Looks like a dup of PR113445

[Bug c/113438] ICE (segfault) in dwarf2out_decl with -g -std=c23 on c23-tag-composite-2.c

2024-01-17 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113438

--- Comment #1 from Alex Coplan  ---
I also noticed the following C23 failures, not sure if these are worth tracking
separately or not:

FAIL: gcc.dg/gnu23-tag-1.c (internal compiler error: 'verify_type' failed)
FAIL: gcc.dg/gnu23-tag-4.c (internal compiler error: 'verify_type' failed)
FAIL: gcc.dg/gnu23-tag-alias-1.c (internal compiler error: 'verify_type' failed)

[Bug c/113438] New: ICE (segfault) in dwarf2out_decl with -g -std=c23 on c23-tag-composite-2.c

2024-01-17 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113438

Bug ID: 113438
   Summary: ICE (segfault) in dwarf2out_decl with -g -std=c23 on
c23-tag-composite-2.c
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

I did a testsuite run with -g on aarch64 and noticed the following ICE:

$ gcc/xgcc -B gcc -c ~/toolchain/src/gcc/gcc/testsuite/gcc.dg/c23-tag-composite-2.c -g -std=c23
/home/alecop01/toolchain/src/gcc/gcc/testsuite/gcc.dg/c23-tag-composite-2.c:15:21:
internal compiler error: Segmentation fault
   15 | void g(const struct foo { int x; } a);
  | ^~~
0xf15783 crash_signal
/home/alecop01/toolchain/src/other_gcc/gcc/toplev.cc:317
0xa23f04 dwarf2out_decl
/home/alecop01/toolchain/src/other_gcc/gcc/dwarf2out.cc:27611
0xa242c3 dwarf2out_type_decl
/home/alecop01/toolchain/src/other_gcc/gcc/dwarf2out.cc:27405
0xa242c3 dwarf2out_type_decl
/home/alecop01/toolchain/src/other_gcc/gcc/dwarf2out.cc:27400
0xde0d8b rest_of_type_compilation(tree_node*, int)
/home/alecop01/toolchain/src/other_gcc/gcc/passes.cc:339
0x79ce17 finish_struct(unsigned int, tree_node*, tree_node*, tree_node*, c_struct_parse_info*, tree_node**)
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-decl.cc:9733
0x7b29a7 composite_type_internal(tree_node*, tree_node*, composite_cache*)
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-typeck.cc:588
0x7b1867 composite_type_internal(tree_node*, tree_node*, composite_cache*)
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-typeck.cc:408
0x7b1867 composite_type_internal(tree_node*, tree_node*, composite_cache*)
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-typeck.cc:730
0x7b376f composite_type_internal(tree_node*, tree_node*, composite_cache*)
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-typeck.cc:408
0x7b376f composite_type(tree_node*, tree_node*)
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-typeck.cc:748
0x77c613 merge_decls
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-decl.cc:2798
0x77c613 duplicate_decls
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-decl.cc:3185
0x780347 pushdecl(tree_node*)
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-decl.cc:3374
0x798ba7 start_decl(c_declarator*, c_declspecs*, bool, tree_node*, bool, unsigned int*)
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-decl.cc:5703
0x81bce3 c_parser_declaration_or_fndef
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-parser.cc:2766
0x82962f c_parser_external_declaration
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-parser.cc:2046
0x82a16f c_parser_translation_unit
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-parser.cc:1900
0x82a16f c_parse_file()
/home/alecop01/toolchain/src/other_gcc/gcc/c/c-parser.cc:26815
0x8aa8df c_common_parse_file()
/home/alecop01/toolchain/src/other_gcc/gcc/c-family/c-opts.cc:1301
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See  for instructions.

[Bug target/113356] [14 Regression][aarch64] ICE in try_fuse_pair, at config/aarch64/aarch64-ldp-fusion.cc:2203 since r14-6947-g4b67ec7ff5b1aa

2024-01-15 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113356

Alex Coplan  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |patch
                URL|                            |https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643011.html

--- Comment #4 from Alex Coplan  ---
Patch posted:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643011.html

[Bug target/113070] [14 regression] [AArch64] [PGO/LTO] Miscompilation of go compiler

2024-01-13 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113070

Alex Coplan  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |patch
                URL|                            |https://patchwork.sourceware.org/project/gcc/list/?series=29671

--- Comment #8 from Alex Coplan  ---
Patch series posted:
https://patchwork.sourceware.org/project/gcc/list/?series=29671

[Bug target/113070] [14 regression] [AArch64] [PGO/LTO] Miscompilation of go compiler

2024-01-12 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113070

--- Comment #7 from Alex Coplan  ---
Just to give a concrete example / reduced testcase where this goes wrong (to
aid review).  For the following testcase (reduced from libiberty) with -O2
-mlate-ldp-fusion:

struct {
  unsigned D;
  int E;
} * sha1_process_block_ctx;
void *sha1_process_block_buffer;
int sha1_process_block_ctx_1, sha1_process_block_ctx_0,
sha1_process_block_ctx_3, sha1_process_block_d, sha1_process_block_e,
sha1_process_block_tm, sha1_process_block_a, sha1_process_block_x_6,
sha1_process_block_x_14, sha1_process_block_x_15;
unsigned sha1_process_block_ctx_2;
void sha1_process_block() {
  int *words = sha1_process_block_buffer;
  int endp = *words, x_0;
  int x[6];
  unsigned b, c;
  while (endp) {
    int t = 0;
    for (; t < 6;)
      t = *words;
    sha1_process_block_a +=
        sha1_process_block_ctx_2 + 8348 + sha1_process_block_tm;
    x_0 += sha1_process_block_tm = x[73];
    b += sha1_process_block_x_15 = sha1_process_block_tm;
    sha1_process_block_a += b | 1;
    sha1_process_block_tm = sha1_process_block_x_14 ^ 8;
    sha1_process_block_e = sha1_process_block_tm;
    sha1_process_block_tm = x[8];
    c += sha1_process_block_x_14 = sha1_process_block_tm;
    b += sha1_process_block_x_15;
    sha1_process_block_tm = x_0 ^ x[3];
    sha1_process_block_a += sha1_process_block_tm;
    sha1_process_block_tm = x[4] ^ x[15];
    sha1_process_block_e +=
        sha1_process_block_a + b ^ sha1_process_block_d +
        sha1_process_block_tm;
    sha1_process_block_tm = sha1_process_block_x_6 ^ x[15];
    sha1_process_block_d +=
        sha1_process_block_e >>
        5 + (sha1_process_block_x_6 = sha1_process_block_tm);
    sha1_process_block_ctx_0 += sha1_process_block_ctx_1 +=
        sha1_process_block_ctx_2 += c;
    sha1_process_block_ctx_3 += sha1_process_block_ctx->E +=
        sha1_process_block_e;
  }
}

we try to do this:

fusing pair [L=0] (200,199), base=31, hazards: (27,54), move_range: (54,54)

with the initial IR:

insn i200 in bb3 [ebb3] at point 102:
  +---
  |  200: [sp:DI+0x64]=x0:SI
  |  REG_DEAD x0:SI
  +---
  uses:
use of set r0:i37 (x0:SI)
use of phi node r31:a12 (sp:DI)
  appears inside an address
  defines:
set mem:i200

insn i198 in bb3 [ebb3] at point 104:
  +---
  |  198: [sp:DI+0x6c]=x2:SI
  |  REG_DEAD x2:SI
  +---
  uses:
use of set r2:i81 (x2:SI)
use of phi node r31:a12 (sp:DI)
  appears inside an address
  defines:
set mem:i198
  used by insn i27 in bb3 [ebb3] at point 108

insn i54 in bb3 [ebb3] at point 106:
  +--
  |   54: x2:SI=x16:SI<<0x1
  +--
  uses:
SI use of set r16:i28 (x16:DI)
  defines:
set r2:i54 (x2:SI)
  used by insn i199 in bb3 [ebb3] at point 110

insn i27 in bb3 [ebb3] at point 108:
  +
  |   27: x0:DI=zero_extend([x1:DI+0x18])
  |  REG_EQUAL [const(`*.LANCHOR0'+0x18)]
  +
  uses:
use of set r1:i223 (x1:DI)
  appears inside an address
use of set mem:i198
  defines:
set r0:i27 (x0:DI)
  live out from bb3 [ebb3] at point 114
  used by phi node r0:a15 (x0:DI) in ebb6 at point 116

insn i199 in bb3 [ebb3] at point 110:
  +---
  |  199: [sp:DI+0x68]=x2:SI
  |  REG_DEAD x2:SI
  +---
  uses:
use of set r2:i54 (x2:SI)
use of phi node r31:a12 (sp:DI)
  appears inside an address
  defines:
set mem:i199
  used by phi node mem:a15 in ebb6 at point 116
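
To make the numbers in the "fusing pair" line easier to follow, here is a small C model of how the hazards appear to bound where the fused pair can land. This is only one reading of the dump, not code from aarch64-ldp-fusion.cc: the struct and function names are made up for illustration. The assumption is that i200 (point 102) cannot sink past i27 (point 108), which redefines x0, and that i199 (point 110) cannot be hoisted above i54 (point 106), which defines x2.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one candidate pair.  These names are illustrative
   only and do not come from the ldp-fusion pass.  */
struct pair_candidate {
    int first_point;   /* program point of the earlier access (i200: 102) */
    int second_point;  /* program point of the later access (i199: 110) */
    int down_hazard;   /* first access must stay before this point (i27: 108) */
    int up_hazard;     /* second access must stay after this point (i54: 106) */
};

/* Return true if placing the fused pair at POINT respects both hazards
   and stays between the two original accesses.  */
static bool placement_ok(const struct pair_candidate *c, int point)
{
    assert(c->first_point < c->second_point);
    return point >= c->first_point
        && point <= c->second_point
        && point > c->up_hazard
        && point < c->down_hazard;
}
```

Under these assumptions the only point that satisfies both hazards is 107, immediately after i54, which would be consistent with move_range: (54,54).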

as it stands, after fusing that pair, we have:

insn i200 in bb3 [ebb3] at point 102:
  +--
  |  200: clobber [scratch]
  +--
  defines:
set mem:i200

insn i198 in bb3 [ebb3] at point 104:
  +---
  |  198: [sp:DI+0x6c]=x2:SI
  |  REG_DEAD x2:SI
  +---
  uses:
use of set r2:i81 (x2:SI)
use of phi node r31:a12 (sp:DI)
  appears inside an address
  defines:
set mem:i198
  used by insn i27 in bb3 [ebb3] at point 108

insn i54 in bb3 [ebb3] at point 106:
  +--
  |   54: x2:SI=x16:SI<<0x1
  +--
  uses:
SI use of set r16:i28 (x16:DI)
  defines:
set r2:i54 (x2:SI)
  used by insn i244 in bb3 [ebb3] at point 107

insn i244 in bb3 [ebb3] at point 107:
  +
  |  244: [sp:DI+0x64]=unspec[x0:SI,x2:SI] 38
  

[Bug target/113070] [14 regression] [AArch64] [PGO/LTO] Miscompilation of go compiler

2024-01-12 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113070

--- Comment #6 from Alex Coplan  ---
And with those fixes it indeed looks like profiledbootstrap + LTO with all
frontends on aarch64 is working again (with the passes enabled).

[Bug target/113356] [14 Regression][aarch64] ICE in try_fuse_pair, at config/aarch64/aarch64-ldp-fusion.cc:2203 since r14-6947-g4b67ec7ff5b1aa

2024-01-12 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113356

--- Comment #3 from Alex Coplan  ---
... i13 to be a hazard w.r.t. itself, then we might not even need the clause in
the follow-up fix.  I'll investigate.

Alternatively the assert can probably be relaxed to include the previous
nondebug insn, as we're inserting after the insn in the move_range, anyway.

[Bug target/113356] [14 Regression][aarch64] ICE in try_fuse_pair, at config/aarch64/aarch64-ldp-fusion.cc:2203 since r14-6947-g4b67ec7ff5b1aa

2024-01-12 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113356

--- Comment #2 from Alex Coplan  ---
So we have this IR:

insn i8 in bb2 [ebb2] at point 18:
  +
  |8: [r104:DI++]=r101:DI
  |  REG_DEAD r101:DI
  |  REG_INC r104:DI
  +
  has pre/post-modify operations
  uses:
use of set r101:i7 (DI pseudo)
use of set r104:i17 (DI pseudo)
  appears in a read/write context
  defines:
set r104:i8 (DI pseudo)
  set by a pre/post-modify
  appears in a read/write context
  used by insn i13 in bb2 [ebb2] at point 24
set mem:i8

insn i11 in bb2 [ebb2] at point 20:
  +
  |   11: r106:DI=high(const(`_ZTV6Class1'+0x10))
  +
  defines:
set r106:i11 (DI pseudo)
  used by insn i12 in bb2 [ebb2] at point 22

insn i12 in bb2 [ebb2] at point 22:
  +---
  |   12: r105:DI=r106:DI+low(const(`_ZTV6Class1'+0x10))
  |  REG_DEAD r106:DI
  |  REG_EQUAL const(`_ZTV6Class1'+0x10)
  +---
  uses:
use of set r106:i11 (DI pseudo)
  defines:
set r105:i12 (DI pseudo)
  used by insn i13 in bb2 [ebb2] at point 24

insn i13 in bb2 [ebb2] at point 24:
  +
  |   13: [r104:DI]=r105:DI
  |  REG_DEAD r105:DI
  |  REG_DEAD r104:DI
  |  REG_EH_REGION 0x
  +
  uses:
use of set r104:i8 (DI pseudo)
  appears inside an address
use of set r105:i12 (DI pseudo)
  defines:
set mem:i13
  used by phi node mem:a7 in ebb1 at point 30

and we're trying to form (8,13).  i8 has i13 as a hazard due to the writeback
dataflow and i13 has i12 as a hazard (due to the initial fix for non-call
exceptions introducing a hazard on the previous nondebug insn).  I wonder if it
would be enough to get i

[Bug target/113356] [14 Regression][aarch64] ICE in try_fuse_pair, at config/aarch64/aarch64-ldp-fusion.cc:2203 since r14-6947-g4b67ec7ff5b1aa

2024-01-12 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113356

Alex Coplan  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
     Ever confirmed|0                              |1
   Last reconfirmed|                               |2024-01-12
             Status|UNCONFIRMED                    |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org
