[Bug testsuite/116683] new test g++.dg/ext/pragma-unroll-lambda-lto.C from r15-3585-g9759f6299d9633 fails

2024-10-07 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116683

Alex Coplan  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Alex Coplan  ---
Fixed.

[Bug testsuite/116683] new test g++.dg/ext/pragma-unroll-lambda-lto.C from r15-3585-g9759f6299d9633 fails

2024-09-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116683

--- Comment #5 from Alex Coplan  ---
Ah, so the problem seems to be that we're scanning for "Unrolled loop 3 times"
appearing exactly once in the dump, but on powerpc it appears twice; that is
because the loop in main gets unrolled too (presumably due to different
unrolling heuristics on power).

The following, therefore, seems to fix the test failures on powerpc:

diff --git a/gcc/testsuite/g++.dg/ext/pragma-unroll-lambda-lto.C b/gcc/testsuite/g++.dg/ext/pragma-unroll-lambda-lto.C
index 64cdf90f34d..20cbd2d15cf 100644
--- a/gcc/testsuite/g++.dg/ext/pragma-unroll-lambda-lto.C
+++ b/gcc/testsuite/g++.dg/ext/pragma-unroll-lambda-lto.C
@@ -24,6 +24,7 @@ short *use_find(short *p)
 int main(void)
 {
   short a[1024];
+#pragma GCC unroll 0
   for (int i = 0; i < 1024; i++)
 a[i] = rand ();

I'll submit that fix if it still passes on aarch64 with that change too.
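
To make the failure mode concrete, here is a rough C++ model of an "appears exactly once" dump scan.  This is only a sketch of the idea, not the DejaGnu implementation; the helper names count_occurrences and scan_passes are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Count non-overlapping occurrences of `needle` in `dump`.
// Simplified model of a "must appear N times" scan; not the DejaGnu code.
std::size_t count_occurrences(const std::string &dump, const std::string &needle)
{
    if (needle.empty())
        return 0;
    std::size_t count = 0;
    for (std::size_t pos = dump.find(needle); pos != std::string::npos;
         pos = dump.find(needle, pos + needle.size()))
        ++count;
    return count;
}

// Hypothetical check: the test expects exactly one unrolled loop in the dump.
bool scan_passes(const std::string &dump)
{
    return count_occurrences(dump, "Unrolled loop 3 times") == 1;
}
```

With a check of this shape, a second "Unrolled loop 3 times" line (such as one coming from the loop in main) is enough to make the scan fail, even though the loop of interest was unrolled as intended.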

[Bug testsuite/116683] new test g++.dg/ext/pragma-unroll-lambda-lto.C from r15-3585-g9759f6299d9633 fails

2024-09-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116683

Alex Coplan  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Last reconfirmed||2024-09-26

--- Comment #4 from Alex Coplan  ---
I can reproduce the failure on cfarm29, I'll try and see if I can figure out
what's going on.

[Bug testsuite/116683] new test g++.dg/ext/pragma-unroll-lambda-lto.C from r15-3585-g9759f6299d9633 fails

2024-09-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116683

--- Comment #3 from Alex Coplan  ---
Sorry for the delay in looking into this.

So it looks like the unrolling works as expected without LTO, at least I see:

;; Unrolled loop 3 times, constant # of iterations 26 insns

in the dump with a powerpc cc1.  So unfortunately the problem seems to be both
powerpc- and LTO-specific, meaning I'll need to build a full native powerpc
toolchain on a cfarm machine to reproduce.  In general, it would help if IBM
folks could provide more triage with such issues.  I can try to look into it
but it will take me much longer as I'm not familiar with powerpc and don't have
a suitable environment set up.

[Bug rtl-optimization/116783] [14/15 Regression] Wrong code at -O2 with late pair fusion pass (wrong alias analysis)

2024-09-20 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116783

--- Comment #4 from Alex Coplan  ---
Testing a fix for the trunk.

[Bug rtl-optimization/116783] [14/15 Regression] Wrong code at -O2 with late pair fusion pass (wrong alias analysis)

2024-09-20 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116783

Alex Coplan  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #3 from Alex Coplan  ---
Ah, right, thanks both for the input.  In that case, it sounds like we need to
be more conservative around calls to RTL alias analysis in pair-fusion.cc. 
Mine, then.

[Bug rtl-optimization/116783] New: [14/15 Regression] Wrong code at -O2 with late pair fusion pass (wrong alias analysis)

2024-09-19 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116783

Bug ID: 116783
   Summary: [14/15 Regression] Wrong code at -O2 with late pair
fusion pass (wrong alias analysis)
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

Created attachment 59150
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59150&action=edit
Executable reduced testcase for the testsuite

The attached executable reproducer (exec.cc) is reduced from a Debian package
(kf6-ktexttemplate) which is getting miscompiled on AArch64 (see
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1080974).

The problem can be reproduced on aarch64 as follows:

$ g++ exec.cc -O2 -fstack-protector-strong -fno-late-combine-instructions -mno-late-ldp-fusion
$ ./a.out 
$ g++ exec.cc -O2 -fstack-protector-strong -fno-late-combine-instructions 
$ ./a.out 
Aborted

Note that late-combine hides the problem on the trunk, such that
-fno-late-combine-instructions isn't needed to reproduce the problem with GCC
14 (but is on trunk).

Looking at what's going on in late ldp_fusion, I see only a single pair getting
formed:

fusing pair [L=1] (92,94), base=19, hazards: (-,106), move_range: (94,94)

and we have the following RTL fragment:

  174: x1:DI=sp:DI+0x200
   92: v30:V4SI=[x1:DI-0xb8]
  REG_DEAD x1:DI
  176: x1:DI=x19:DI
  106: [x1:DI]=const_vector
  REG_DEAD x1:DI
  177: x1:DI=sp:DI+0x200
   94: v29:V4SI=[x19:DI+0x10]
  REG_EQUIV [x19:DI+0x10]

now looking back to the last assignment to x19, we have:

  x19:DI=sp:DI+0x148

so substituting through, we have:

  x1 - 0xb8 = sp + 0x200 - 0xb8 = sp + 0x148 = x19

i.e. the load i92 is to the exact same address as the store i106, yet we fail
to detect this aliasing hazard (in the forward direction) and thus form the
load pair at i94, incorrectly re-ordering the load (i92) over the store.
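
The address arithmetic above can be double-checked with a small C++ sketch.  The function names are made up for illustration, and `sp` is just an arbitrary stand-in value for the stack pointer; only the offsets come from the RTL fragment.

```cpp
#include <cstdint>

// Address read by the load in insn 92, given a stack pointer value.
constexpr std::uint64_t load_addr_i92(std::uint64_t sp)
{
    std::uint64_t x1 = sp + 0x200;   // insn 174: x1 = sp + 0x200
    return x1 - 0xb8;                // insn 92 loads from [x1 - 0xb8]
}

// Address written by the store in insn 106, given the same stack pointer.
constexpr std::uint64_t store_addr_i106(std::uint64_t sp)
{
    std::uint64_t x19 = sp + 0x148;  // last assignment to x19
    std::uint64_t x1 = x19;          // insn 176: x1 = x19
    return x1;                       // insn 106 stores to [x1]
}

// 0x200 - 0xb8 == 0x148, so both accesses hit the same address.
static_assert(load_addr_i92(0x1000) == store_addr_i106(0x1000),
              "load i92 and store i106 access the same address");
```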

The problem seems to be not necessarily in pair-fusion.cc itself, however,
since memory_modified_in_insn_p fails to return true for the following
arguments:

(rr) pr mem
(mem/c:V4SI (plus:DI (reg:DI 1 x1 [195])
(const_int -184 [0xff48])) [0 D.5008.d+0 S16 A64])
(rr) pr insn
(insn 106 176 177 5 (set (mem/c:V4SI (reg:DI 1 x1 [198]) [0 MEM  [(struct Private *)&D.5008]+0 S16 A64])
(const_vector:V4SI [
(const_int 0 [0]) repeated x4
])) "exec.cc":20:13 discrim 1 1270 {*aarch64_simd_movv4si}
 (expr_list:REG_DEAD (reg:DI 1 x1 [198])
(nil)))

where (naively) it looks like the MEM_EXPRs alias, so I would have expected the
alias analysis machinery to figure this out.

I'll try to dig into why memory_modified_in_insn_p ends up returning false
here.

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-09-11 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #14 from Alex Coplan  ---
This should be largely fixed now (and in a position to get further improvements
from vectorisation further down the line).  Perhaps folks who monitor x86_64
performance can confirm whether they see the expected improvement too.

[Bug tree-optimization/116674] [15 regression] ICE in vectorizable_simd_clone_call bisected to r15-3509-gd34cda72098867

2024-09-11 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116674

Alex Coplan  changed:

   What|Removed |Added

 Target||x86_64-linux-gnu,
   ||aarch64-linux-gnu
   Last reconfirmed||2024-09-11
 CC||acoplan at gcc dot gnu.org
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1

--- Comment #2 from Alex Coplan  ---
Confirmed, also ICEs on aarch64 with -Ofast -march=armv9-a.

[Bug target/116600] internal compiler error: in maybe_record_trace_start, at dwarf2cfi.cc:2584 since r7-5127-g827ab47ab1f

2024-09-06 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116600

Alex Coplan  changed:

   What|Removed |Added

Summary|internal compiler error: in |internal compiler error: in
   |maybe_record_trace_start,   |maybe_record_trace_start,
   |at dwarf2cfi.cc:2584|at dwarf2cfi.cc:2584 since
   ||r7-5127-g827ab47ab1f
 CC||ktkachov at gcc dot gnu.org

--- Comment #5 from Alex Coplan  ---
Started with r7-5127-g827ab47ab1f9f9b9b108a252b7a43c3c7bc828b7 (bisected with
-O3 -fno-common -g), so indeed seems shrink wrapping related.

[Bug target/116600] internal compiler error: in maybe_record_trace_start, at dwarf2cfi.cc:2584

2024-09-04 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116600

Alex Coplan  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
 CC||acoplan at gcc dot gnu.org
   Last reconfirmed||2024-09-04

--- Comment #2 from Alex Coplan  ---
Confirmed.  ICEs with -O3 -fno-common all the way back to GCC 7.

[Bug tree-optimization/116569] [15 Regression] ICE in to_constant, at poly-int.h:592

2024-09-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116569

Alex Coplan  changed:

   What|Removed |Added

   Last reconfirmed||2024-09-02
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
 CC||acoplan at gcc dot gnu.org,
   ||jschmitz at gcc dot gnu.org

--- Comment #1 from Alex Coplan  ---
Confirmed.  Started with r15-3082-g9bbad3685131ec95d970f81bf75f9556d4d92742.

[Bug target/116564] [12/13/14/15 Regression] aarch64: gcc hangs when compiling vst2_f64 instrinsic at -O1 and above since r12-4910-g66f206b853

2024-09-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116564

Alex Coplan  changed:

   What|Removed |Added

Summary|[12/13/14/15 Regression]|[12/13/14/15 Regression]
   |aarch64: gcc can't finish   |aarch64: gcc hangs when
   |when compiling vst2_f64 |compiling vst2_f64
   |instrinsic with opt level   |instrinsic at -O1 and above
   |>= O1   |since r12-4910-g66f206b853

--- Comment #3 from Alex Coplan  ---
Started with r12-4910-g66f206b85395c273980e2b81a54dbddc4897e4a7, FWIW (could
well be a latent issue, though).

[Bug target/116564] [12/13/14/15 Regression] aarch64: gcc can't finish when compiling vst2_f64 instrinsic with opt level >= O1

2024-09-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116564

--- Comment #2 from Alex Coplan  ---
Here's a preprocessed testcase (not for the testsuite, just to make it easier
to reproduce using only cc1):

#pragma GCC aarch64 "arm_neon.h"

typedef double float64_t;

__extension__ extern __inline void
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
vst2_f64 (float64_t * __a, float64x1x2_t __val)
{
  __builtin_aarch64_st2df ((__builtin_aarch64_simd_df *) __a, __val);
}

void test()
{
  for (int L = 0; L < 4; ++L) {
    float64_t ResData[1 * 2];
    float64x1x2_t Src1;
    vst2_f64(ResData, Src1);
  }
}

[Bug target/116564] [12/13/14/15 Regression] aarch64: gcc can't finish when compiling vst2_f64 instrinsic with opt level >= O1

2024-09-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116564

Alex Coplan  changed:

   What|Removed |Added

 CC||acoplan at gcc dot gnu.org

--- Comment #1 from Alex Coplan  ---
Looks like it's spinning running some DCE right at the start of combine:

#10 0x03068ab0 in mark_insn (insn=0xf4fd3880, fast=true) at
/home/alecop01/toolchain/src/gcc/gcc/dce.cc:227
#11 0x0306b034 in dce_process_block (bb=0xf50220c0, redo_out=false,
au=0x4774cd8, global_debug=0xea08) at
/home/alecop01/toolchain/src/gcc/gcc/dce.cc:1035
#12 0x0306b41c in fast_dce (word_level=false) at
/home/alecop01/toolchain/src/gcc/gcc/dce.cc:1128
#13 0x0306b618 in rest_of_handle_fast_dce () at
/home/alecop01/toolchain/src/gcc/gcc/dce.cc:1197
#14 0x0306b6d4 in run_fast_df_dce () at
/home/alecop01/toolchain/src/gcc/gcc/dce.cc:1245
#15 0x01158380 in df_lr_dce_finalize (all_blocks=0x466c6c0) at
/home/alecop01/toolchain/src/gcc/gcc/df-problems.cc:1252
#16 0x01151bd8 in df_analyze_problem (dflow=0x4519c00,
blocks_to_consider=0x466c6c0, postorder=0x446f9a0, n_blocks=5) at
/home/alecop01/toolchain/src/gcc/gcc/df-core.cc:1191
#17 0x01151d2c in df_analyze_1 () at
/home/alecop01/toolchain/src/gcc/gcc/df-core.cc:1236
#18 0x01152180 in df_analyze () at
/home/alecop01/toolchain/src/gcc/gcc/df-core.cc:1306
#19 0x030479b0 in rest_of_handle_combine () at
/home/alecop01/toolchain/src/gcc/gcc/combine.cc:15127
#20 0x03047ac4 in (anonymous namespace)::pass_combine::execute
(this=0x43b35a0) at /home/alecop01/toolchain/src/gcc/gcc/combine.cc:15177

the dump file shows the following:

Finding needed instructions:
  Adding insn 17 to worklist
  Adding insn 10 to worklist
  Adding insn 8 to worklist
Finished finding needed instructions:
processing block 4 lr out =  29 [x29] 31 [sp] 64 [sfp] 65 [ap]
processing block 3 lr out =  29 [x29] 31 [sp] 64 [sfp] 65 [ap] 101 104
  Adding insn 16 to worklist
  Adding insn 14 to worklist
  Adding insn 13 to worklist
processing block 2 lr out =  29 [x29] 31 [sp] 64 [sfp] 65 [ap] 101
  Adding insn 3 to worklist
DCE: Deleting insn 42
deleting insn with uid = 42.
DCE: Deleting insn 40
deleting insn with uid = 40.
DCE: Deleting insn 39
deleting insn with uid = 39.
DCE: Deleting insn 37
deleting insn with uid = 37.
df_worklist_dataflow_doublequeue: n_basic_blocks 5 n_edges 5 count 5 (1)
Finding needed instructions:
  Adding insn 17 to worklist
  Adding insn 10 to worklist
  Adding insn 8 to worklist
Finished finding needed instructions:
processing block 4 lr out =  29 [x29] 31 [sp] 64 [sfp] 65 [ap]
processing block 3 lr out =  29 [x29] 31 [sp] 64 [sfp] 65 [ap] 101 104
  Adding insn 16 to worklist
  Adding insn 14 to worklist
  Adding insn 13 to worklist
processing block 2 lr out =  29 [x29] 31 [sp] 64 [sfp] 65 [ap] 101
  Adding insn 3 to worklist
df_worklist_dataflow_doublequeue: n_basic_blocks 5 n_edges 5 count 5 (1)
Finding needed instructions:
  Adding insn 17 to worklist
  Adding insn 10 to worklist
  Adding insn 8 to worklist
Finished finding needed instructions:
processing block 4 lr out =  29 [x29] 31 [sp] 64 [sfp] 65 [ap]
processing block 3 lr out =  29 [x29] 31 [sp] 64 [sfp] 65 [ap] 101 104
  Adding insn 16 to worklist
  Adding insn 14 to worklist
  Adding insn 13 to worklist

and it just seems to repeat adding insns {17,10,8}, then {16,14,13}, then 3.

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-08-30 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #9 from Alex Coplan  ---
I think all except the first patch in the series (C++ patch) have been approved
now, so the rest are waiting on review for that:
https://gcc.gnu.org/pipermail/gcc-patches/2024-August/661559.html

[Bug testsuite/116522] [15 regression] gcc.dg/ipa/ipa-icf-38.c: error executing dg-final after r15-3254-g3f51f0dc88ec21

2024-08-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116522

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Alex Coplan  ---
Should be fixed, sorry for the breakage.

[Bug testsuite/116522] [15 regression] gcc.dg/ipa/ipa-icf-38.c: error executing dg-final after r15-3254-g3f51f0dc88ec21

2024-08-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116522

--- Comment #5 from Alex Coplan  ---
The following should fix it:

diff --git a/gcc/testsuite/lib/scanltranstree.exp b/gcc/testsuite/lib/scanltranstree.exp
index a7d4de3765f..3d85813ea2f 100644
--- a/gcc/testsuite/lib/scanltranstree.exp
+++ b/gcc/testsuite/lib/scanltranstree.exp
@@ -24,7 +24,7 @@ load_lib scandump.exp
 foreach ir { tree rtl } {
 foreach modifier { {} -not -dem -dem-not } {
eval [string map [list @NAME@ scan-ltrans-$ir-dump$modifier \
-  @SCAN@ scan$modifier \
+  @SCAN@ scan-dump$modifier \
   @TYPE@ ltrans-$ir \
   @SUFFIX@ [string index $ir 0]] {
proc @NAME@ { args } {

will submit the above if testing goes OK.

[Bug testsuite/116522] [15 regression] gcc.dg/ipa/ipa-icf-38.c: error executing dg-final after r15-3254-g3f51f0dc88ec21

2024-08-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116522

--- Comment #4 from Alex Coplan  ---
Testing a fix.

[Bug testsuite/116522] [15 regression] gcc.dg/ipa/ipa-icf-38.c: error executing dg-final after r15-3254-g3f51f0dc88ec21

2024-08-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116522

Alex Coplan  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org

--- Comment #3 from Alex Coplan  ---
Apologies for the breakage, I think this is the usual problem of
dg-cmp-results.sh not reporting new ERRORs (which is why I didn't see this in
my regression testing). I need to work out a better way of comparing test
results.

I'll take a look.

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-08-07 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

Alex Coplan  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org
 Status|NEW |ASSIGNED

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-08-05 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #7 from Alex Coplan  ---
So it turns out the reason #pragma GCC unroll doesn't work under LTO is that
we don't propagate the `has_unroll` flag when streaming functions during LTO,
so RTL loop2_unroll ends up not running at all.

The following patch allows us to recover it:

diff --git a/gcc/lto-streamer-in.cc b/gcc/lto-streamer-in.cc
index 2e592be8082..93877065d86 100644
--- a/gcc/lto-streamer-in.cc
+++ b/gcc/lto-streamer-in.cc
@@ -1136,6 +1136,8 @@ input_cfg (class lto_input_block *ib, class data_in *data_in,
   /* Read OMP SIMD related info.  */
   loop->safelen = streamer_read_hwi (ib);
   loop->unroll = streamer_read_hwi (ib);
+  if (loop->unroll > 1)
+   fn->has_unroll = true;
   loop->owned_clique = streamer_read_hwi (ib);
   loop->dont_vectorize = streamer_read_hwi (ib);
   loop->force_vectorize = streamer_read_hwi (ib);

a more conservative fix might be to explicitly stream has_unroll out and in
again, but the above is simpler and I don't currently see a reason why we can't
infer it like this (comments welcome).
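
As a standalone illustration of the "infer rather than stream" idea, here is a heavily simplified C++ model.  The types LoopInfo and FunctionInfo and the function read_function are hypothetical stand-ins, not GCC's actual structures or streamer API; only the `unroll > 1` test mirrors the patch above.

```cpp
#include <vector>

// Hypothetical stand-in for the per-loop data that gets streamed.
struct LoopInfo {
    int unroll = 0;   // requested unroll factor; 0 means "no request" here
};

// Hypothetical stand-in for the per-function state.
struct FunctionInfo {
    bool has_unroll = false;       // "some loop in this function wants unrolling"
    std::vector<LoopInfo> loops;
};

// Model of reading loops back in: the per-loop factor is what was streamed,
// and the function-level flag is recomputed from it instead of being streamed.
FunctionInfo read_function(const std::vector<int> &streamed_unroll_factors)
{
    FunctionInfo fn;
    for (int factor : streamed_unroll_factors) {
        LoopInfo loop;
        loop.unroll = factor;
        if (loop.unroll > 1)
            fn.has_unroll = true;
        fn.loops.push_back(loop);
    }
    return fn;
}
```

The trade-off is the one described above: recomputing the flag keeps the stream format unchanged, at the cost of relying on the flag being fully determined by the per-loop values.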

Anyway, this (together with the above C++ patch and adding the #pragma to
std::__find_if) gives us back ~3.9% on Neoverse V1.  That recovers about 71% of
the regression, leaving the effective regression (relative to the hand-unrolled
code) at 1.7% instead of 5.8%.

It's possible there are further improvements to be had by tweaking the unrolled
codegen or making inlining heuristics take #pragma GCC unroll into account
(assuming they don't currently, I haven't checked).  I'll try to do some more
analysis on the remaining difference.

In any case, I'll aim to polish and submit these patches unless there are any
objections at this point.

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-08-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #6 from Alex Coplan  ---
Just to give an update on this, the following testcase shows why adding:

#pragma GCC unroll 4

in libstdc++ doesn't immediately seem to help.  The testcase is:

$ cat lambda.cc
template <typename Iter, typename Pred>
inline Iter
my_find(Iter first, Iter last, Pred pred)
{
#pragma GCC unroll 4
    while (first != last && !pred(*first))
        ++first;
    return first;
}

short *use_find(short *p)
{
    auto pred = [](short x) { return x == 42; };
    return my_find(p, p + 1024, pred);
}

compiling, we get:

$ /xgcc -B . -c lambda.cc -S -o /dev/null
lambda.cc: In function ‘Iter my_find(Iter, Iter, Pred) [with Iter = short int*;
Pred = use_find(short int*)::]’:
lambda.cc:6:5: warning: ignoring loop annotation
6 | while (first != last && !pred(*first))
  | ^

so the #pragma is indeed getting dropped.  This warning comes from
tree-cfg.cc:replace_loop_annotate.  The exiting basic block here is:

 :
D.4524 = .ANNOTATE (iftmp.1, 1, 4);
retval.0 = D.4524;
if (retval.0 != 0)
  goto ; [INV]
else
  goto ; [INV]

and the code in replace_loop_annotate_in_block (which looks for the .ANNOTATE
ifn call) iterates backwards over the gimple in that block, skipping over the
gcond, but it then expects to find any .ANNOTATE calls immediately before the
gcond.
In this case it doesn't, so we end up dropping the .ANNOTATE call on the floor
and emitting the warning (and not unrolling).
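
The behaviour described above can be modelled with a toy C++ sketch.  This is a simplified model and not the code in tree-cfg.cc: the Kind enum and annotation_found are made up for illustration, and a block is reduced to a list of statement kinds.

```cpp
#include <cstddef>
#include <vector>

// Toy statement kinds; not GCC's gimple representation.
enum class Kind { Annotate, Assign, Cond };

// Returns true if an annotation would be picked up by a scan that starts at
// the block's final conditional and walks backwards, stopping at the first
// statement that is not an annotation.
bool annotation_found(const std::vector<Kind> &block)
{
    if (block.empty() || block.back() != Kind::Cond)
        return false;
    bool found = false;
    // Start just before the final conditional and walk towards the front.
    for (std::size_t i = block.size() - 1; i-- > 0; ) {
        if (block[i] != Kind::Annotate)
            break;
        found = true;
    }
    return found;
}
```

In this model the block { Annotate, Assign, Cond } (the lambda case, where the copy to retval.0 sits between the .ANNOTATE call and the gcond) yields false, whereas { Annotate, Cond } yields true.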

Consider the simpler testcase without the lambda:

template <typename Iter>
inline Iter
find_nolambda(Iter first, Iter last)
{
#pragma GCC unroll 4
    while (first != last && *first != 42)
        ++first;
    return first;
}

short *use_nolambda(short *p)
{
    return find_nolambda (p, p + 1024);
}

for this testcase, we don't get the warning, and indeed the exiting block for
this loop is just:

 :
D.4460 = .ANNOTATE (iftmp.0, 1, 4);
if (D.4460 != 0)
  goto ; [INV]
else
  goto ; [INV]

i.e. the .ANNOTATE comes immediately before the gcond.  To see what is really
going on we can look at -fdump-tree-original.  For the problematic testcase we
have:

if (<::operator() (&pred, *first), unroll 4>>>) goto
; else goto ;

and the simpler testcase without the lambda has:

if (ANNOTATE_EXPR ) goto ;
else
goto ;

so I think the problem is the CLEANUP_POINT_EXPR wrapping the ANNOTATE_EXPR in
the lambda case.  The following fixes that:

diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc
index a9abf32e01f..b2c29fbb028 100644
--- a/gcc/cp/semantics.cc
+++ b/gcc/cp/semantics.cc
@@ -966,6 +966,16 @@ maybe_convert_cond (tree cond)
   if (type_dependent_expression_p (cond))
 return cond;

+  /* If the condition has an ANNOTATE_EXPR, that must remain the outermost
+ expression of the condition.  Strip it off and re-apply it after the
+ conversion to maintain this invariant.  */
+  tree annotate = NULL_TREE;
+  if (TREE_CODE (cond) == ANNOTATE_EXPR)
+{
+  annotate = cond;
+  cond = TREE_OPERAND (cond, 0);
+}
+
   /* For structured binding used in condition, the conversion needs to be
  evaluated before the individual variables are initialized in the
  std::tuple_{size,elemenet} case.  cp_finish_decomp saved the conversion
@@ -983,7 +993,15 @@ maybe_convert_cond (tree cond)

   /* Do the conversion.  */
   cond = convert_from_reference (cond);
-  return condition_conversion (cond);
+  cond = condition_conversion (cond);
+
+  /* Restore the ANNOTATE_EXPR, if there was one.  */
+  if (annotate)
+{
+  TREE_OPERAND (annotate, 0) = cond;
+  cond = annotate;
+}
+  return cond;
 }

 /* Finish an expression-statement, whose EXPRESSION is as indicated.  */

where the CLEANUP_POINT_EXPR was getting added in condition_conversion.
That passes bootstrap on aarch64.  With that patch, adding:

#pragma GCC unroll 4

above the __find_if loop in stl_algobase.h, we get unrolled std::find
again.  E.g. for the following testcase I get:

#include <algorithm>
long *f(long *p)
{
  return std::find (p, p + 1024, 42);
}

_Z1fPl:
.LFB675:
.cfi_startproc
mov x1, x0
add x0, x0, 8192
.p2align 5,,15
.L3:
ldr x2, [x1]
cmp x2, 42
beq .L4
ldr x2, [x1, 8]
add x1, x1, 8
mov x3, x1
cmp x2, 42
beq .L4
ldr x2, [x1, 8]!
cmp x2, 42
beq .L4
ldr x2, [x3, 16]
add x1, x3, 16
cmp x2, 42
beq .L4
add x1, x3, 24
cmp x0, x1
bne .L3
ret

at -O2.  But importantly this version should still be vectorizable
further down the line (unlike the hand-unrolled version).

Now for xalancbmk this seems to give back about 4.8% on Neoverse V1
_without_ LTO.  Unfortunately for some reason there is no difference in
the relevant hot function _with_ LTO, so that needs debugging (I'm
looking into that).

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-08-01 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #5 from Alex Coplan  ---
Yeah, I'm looking into this as Tamar mentioned above.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-07-05 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

--- Comment #4 from Alex Coplan  ---
So the following is enough to fix the missed ldp due to alias analysis:

diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
index 31d2c21c88f..ab49d955ccf 100644
--- a/gcc/pair-fusion.cc
+++ b/gcc/pair-fusion.cc
@@ -128,8 +128,12 @@ pair_fusion::run ()
   if (!track_loads_p () && !track_stores_p ())
 return;

+  init_alias_analysis ();
+
   for (auto bb : crtl->ssa->bbs ())
 process_block (bb);
+
+  end_alias_analysis ();
 }

 // State used by the pass for a given basic block.

that explains why sched1 was able to do the re-ordering but we weren't able to
do it in ldp_fusion1 (sched1 makes these calls).  Essentially this enables a
mini-pass that establishes register equivalences and allows the calls to
canon_rtx inside the alias machinery to re-write the memcpy accesses in terms
of the sfp for alias disambiguation purposes.  For the testcase in #c1:

--- without-patch.s 2024-07-05 11:33:57.395927975 +0100
+++ with-patch.s 2024-07-05 11:33:32.164155523 +0100
@@ -17,9 +17,8 @@
bl  g
add x0, sp, 32
ldp q31, q30, [x19]
-   ldr q29, [x19, 32]
str q31, [sp, 32]
-   ldr q31, [x19, 48]
+   ldp q29, q31, [x19, 32]
stp q30, q29, [x0, 16]
str q31, [x0, 48]
bl  h

we still miss the stp in this case since the stores have different RTL bases
(sfp vs memcpy pseudo) and no MEM_EXPR information.  If we go ahead with the
above change then in theory we could also make use of this register equivalence
information during discovery (not just for alias analysis), allowing us to get
the remaining stp.
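
To illustrate why register equivalences matter for disambiguation, here is a small C++ model.  It is only a sketch under simplifying assumptions: Address, Equivalences, canonicalize and may_alias are hypothetical names, the equivalence table is assumed to be acyclic, and the register names and offsets in example_equivalences are illustrative.  It is not how GCC's alias machinery is implemented.

```cpp
#include <map>
#include <string>

// A memory address expressed as base register plus byte offset.
struct Address {
    std::string base;
    long offset;
};

// Hypothetical table: pseudo register -> address it is known to be equal to.
using Equivalences = std::map<std::string, Address>;

// Rewrite an address in terms of the most basic known base register.
// Assumes the table contains no cycles.
Address canonicalize(Address addr, const Equivalences &equiv)
{
    for (auto it = equiv.find(addr.base); it != equiv.end();
         it = equiv.find(addr.base)) {
        addr.offset += it->second.offset;
        addr.base = it->second.base;
    }
    return addr;
}

// Conservative query for two accesses of `size` bytes: true means "might alias".
bool may_alias(Address a, Address b, long size, const Equivalences &equiv)
{
    a = canonicalize(a, equiv);
    b = canonicalize(b, equiv);
    if (a.base != b.base)
        return true;  // unrelated bases: nothing can be proven
    return a.offset < b.offset + size && b.offset < a.offset + size;
}

// Illustrative table: r101 is known to equal sfp - 0x40.
Equivalences example_equivalences()
{
    return Equivalences{{"r101", {"sfp", -0x40}}};
}
```

Without the table, a 16-byte load from [r101 + 0x20] and a 16-byte store to [sfp - 0x80] have different bases, so the model has to answer "might alias".  With the table, both are expressed relative to sfp and the ranges are seen to be disjoint.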

While the above patch seems to improve performance overall, there is one
workload with a significant compile-time regression which needs investigating.

There are also some codesize regressions which I think occur because forming
more stack-based LDPs scuppers the IRA REG_EQUIV optimization that avoids
spilling registers that were loaded from the stack.

So a bit more work needed before we can go ahead with this.

[Bug target/114936] [14 Regression] Typo in aarch64-ldp-fusion.cc:combine_reg_notes

2024-07-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114936

Alex Coplan  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #4 from Alex Coplan  ---
Fixed on all affected branches.

[Bug tree-optimization/115120] New: Bad interaction between ivcanon and early break vectorization

2024-05-16 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

Bug ID: 115120
   Summary: Bad interaction between ivcanon and early break
vectorization
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

Consider the following testcase on aarch64:

int arr[1024];
int *f()
{
    int i;
    for (i = 0; i < 1024; i++)
        if (arr[i] == 42)
            break;
    return arr + i;
}

compiled with -O3 we get the following vector loop body:

.L2:
cmp x2, x1
beq .L9
.L6:
ldr q31, [x1]
add x1, x1, 16
mov v27.16b, v29.16b
mov v28.16b, v30.16b
cmeq    v31.4s, v31.4s, v26.4s
add v29.4s, v29.4s, v24.4s
add v30.4s, v30.4s, v25.4s
umaxp   v31.4s, v31.4s, v31.4s
fmov    x3, d31
cbz x3, .L2

it's somewhat surprising that there are two vector adds, looking at the
optimized dump:

 [local count: 1063004408]:
  # vect_vec_iv_.6_28 = PHI <_29(10), { 0, 1, 2, 3 }(2)>
  # vect_vec_iv_.7_33 = PHI <_34(10), { 1024, 1023, 1022, 1021 }(2)>
  # ivtmp.18_19 = PHI 
  _34 = vect_vec_iv_.7_33 + { 4294967292, 4294967292, 4294967292, 4294967292 };
  _29 = vect_vec_iv_.6_28 + { 4, 4, 4, 4 };
  _25 = (void *) ivtmp.18_19;
  vect__1.10_39 = MEM  [(int *)_25];
  mask_patt_9.11_41 = vect__1.10_39 == { 42, 42, 42, 42 };
  if (mask_patt_9.11_41 != { 0, 0, 0, 0 })
goto ; [5.50%]
  else
goto ; [94.50%]

we can see that there are two IV updates that got vectorized.  It turns out that
one of these comes from the ivcanon pass.  If I add -fno-tree-loop-ivcanon we
instead get the following vector loop body:

.L2:
cmp x2, x1
beq .L9
.L6:
ldr q31, [x1]
add x1, x1, 16
mov v29.16b, v30.16b
add v30.4s, v30.4s, v27.4s
cmeq    v31.4s, v31.4s, v28.4s
umaxp   v31.4s, v31.4s, v31.4s
fmov    x3, d31
cbz x3, .L2

which is much cleaner.  Looking at the tree dumps, the ivcanon pass makes the
following transformation:

--- cddce2.tree 2024-05-16 13:49:10.426703350 +
+++ ivcanon.tree 2024-05-16 13:49:17.678874925 +
@@ -4,6 +4,8 @@
   int i;
   int _1;
   int * _8;
+  unsigned int ivtmp_11;
+  unsigned int ivtmp_12;
   long unsigned int _13;
   long unsigned int _15;
   long unsigned int prephitmp_16;
@@ -12,6 +14,7 @@

[local count: 1063004408]:
   # i_10 = PHI 
+  # ivtmp_12 = PHI 
   _1 = arr[i_10];
   if (_1 == 42)
 goto ; [5.50%]
@@ -20,7 +23,8 @@

[local count: 1004539166]:
   i_7 = i_10 + 1;
-  if (i_7 != 1024)
+  ivtmp_11 = ivtmp_12 - 1;
+  if (ivtmp_11 != 0)
 goto ; [98.93%]
   else
 goto ; [1.07%]

i.e. it introduces the backwards-counting IV.  It seems in the general case
without vectorization ivopts then cleans this up and ensures we only have a
single IV.
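
For reference, the effect of the transformation can be written out by hand at the source level.  The following C++ sketch is an analogue of the tree dump above, not compiler output; the function names and the array/length parameters are made up so that the two forms can be compared directly.

```cpp
// Forward-counting search, in the shape of the original testcase.
int find_forward(const int *arr, int n, int value)
{
    int i;
    for (i = 0; i < n; i++)
        if (arr[i] == value)
            break;
    return i;
}

// Hand-written analogue of the loop after ivcanon: a second, backwards-counting
// IV (ivtmp) controls the exit while `i` is still used to index the array.
int find_with_countdown(const int *arr, int n, int value)
{
    if (n <= 0)
        return 0;
    int i = 0;
    unsigned int ivtmp = static_cast<unsigned int>(n);
    for (;;) {
        if (arr[i] == value)
            break;
        i = i + 1;
        ivtmp = ivtmp - 1;
        if (ivtmp == 0)
            break;
    }
    return i;
}
```

Both functions return the same index for any input; the second simply carries two induction variables, which is the redundancy that shows up as the extra vector add in the early break case.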

In the vectorized case it seems this problem only shows up with early break
vectorization. Looking at a simple reduction, such as:

int a[1024];
int g()
{
    int sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += a[i];
    return sum;
}

although we still have the backwards-counting IV in ifcvt:

   [local count: 1063004408]:
  # sum_9 = PHI 
  # i_11 = PHI 
  # ivtmp_8 = PHI 
  _1 = a[i_11];
  sum_5 = _1 + sum_9;
  i_6 = i_11 + 1;
  ivtmp_7 = ivtmp_8 - 1;
  if (ivtmp_7 != 0)
goto ; [98.99%]
  else
goto ; [1.01%]

we end up with only scalar IVs after vectorization, and the backwards scalar IV
ends up getting deleted by dce6:

Deleting : ivtmp_7 = ivtmp_8 - 1;

I'm not sure what the right solution is but we should avoid having duplicated
IVs with early break vectorization.

[Bug tree-optimization/113787] [12/13/14/15 Regression] Wrong code at -O with ipa-modref on aarch64

2024-05-16 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #20 from Alex Coplan  ---
I'd just like to ping this serious wrong code bug.  It's unfortunate that this
wasn't addressed for the 14.1 release.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-09 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Alex Coplan  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot gnu.org

--- Comment #3 from Alex Coplan  ---
Mine for the aliasing issues/investigation, might be worth splitting off the RA
problem to track that separately.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-09 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

--- Comment #2 from Alex Coplan  ---
Here is some analysis on why we miss some of these opportunities in ldp_fusion.
So initially in 267r.vregs we have some very clean RTL:

6: r101:DI=sfp:DI-0x40
7: x0:DI=r101:DI
8: call [`g'] argc:0
  REG_CALL_DECL `g'
9: r102:DI=sfp:DI-0x80
   10: r103:DI=sfp:DI-0x40
   11: r104:V4SI=[r103:DI]
   13: r105:V4SI=[r103:DI+0x10]
   15: r106:V4SI=[r103:DI+0x20]
   17: r107:V4SI=[r103:DI+0x30]
   12: [r102:DI]=r104:V4SI
   14: [r102:DI+0x10]=r105:V4SI
   16: [r102:DI+0x20]=r106:V4SI
   18: [r102:DI+0x30]=r107:V4SI

if we were to run the ldp/stp pass on this it should form the pairs without a
problem.  Of course things go downhill from here.  The first slightly strange
thing is that fwprop propagates the sfp into the first of each group of
accesses (i.e. with offset 0), but not the others:

9: r102:DI=sfp:DI-0x80
   11: r104:V4SI=[sfp:DI-0x40]
   13: r105:V4SI=[r101:DI+0x10]
   15: r106:V4SI=[r101:DI+0x20]
   17: r107:V4SI=[r101:DI+0x30]
  REG_DEAD r103:DI
   12: [sfp:DI-0x80]=r104:V4SI
   14: [r102:DI+0x10]=r105:V4SI
  REG_DEAD r105:V4SI
   16: [r102:DI+0x20]=r106:V4SI
  REG_DEAD r106:V4SI
   18: [r102:DI+0x30]=r107:V4SI

the RTL then stays mostly unchanged until sched1, where things really start to
go downhill:

   11: r104:V4SI=[sfp:DI-0x40]
9: r102:DI=sfp:DI-0x80
   13: r105:V4SI=[r101:DI+0x10]
   20: x0:DI=r102:DI
  REG_DEAD r102:DI
  REG_EQUAL sfp:DI-0x80
   15: r106:V4SI=[r101:DI+0x20]
   12: [sfp:DI-0x80]=r104:V4SI
  REG_DEAD r104:V4SI
   17: r107:V4SI=[r101:DI+0x30]
  REG_DEAD r101:DI
   14: [r102:DI+0x10]=r105:V4SI
  REG_DEAD r105:V4SI
   16: [r102:DI+0x20]=r106:V4SI
  REG_DEAD r106:V4SI
   18: [r102:DI+0x30]=r107:V4SI

here the first of the stores (i12) has been moved up between the last pair of
loads (i15, i17).  Now the interesting thing is how sched1 knows that it is
safe to perform this transformation.  In the ldp_fusion1 pass we miss this pair
because we think that the loads may alias with i12:

cannot form pair (15,17) due to alias conflicts (12,12)

so it would be good to look into how our alias analysis differs from what
sched1 is doing.  It's worth further noting that while the loads have MEM_EXPR
information (they point to the var_decl for s) the stores do not.  Presumably
this is because the copy of s mandated by the ABI doesn't necessarily have a
tree decl representation that the MEM_EXPRs could point to.
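
To make the aliasing question concrete, here is a minimal sketch of one way
such accesses could be disambiguated.  It is a toy model, not the code in
sched1 or ldp_fusion: the names toy_access and toy_may_alias are hypothetical,
and it assumes each access has already been described as a base, a byte offset
and a size.  If r101 is resolved to sfp-0x40, the 16-byte store i12 covers
[sfp-0x80, sfp-0x70) and the 16-byte load i15 covers [sfp-0x20, sfp-0x10), so
the ranges are disjoint; if the bases are left as different registers the check
has to give up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of a memory access: a base identifier, a byte
   offset from that base, and an access size in bytes.  */
struct toy_access
{
  int base_id;
  long offset;
  long size;
};

/* Return true if A and B might overlap.  Accesses off different bases are
   conservatively assumed to alias; accesses off the same base alias only
   if their byte ranges intersect.  */
static bool
toy_may_alias (struct toy_access a, struct toy_access b)
{
  assert (a.size > 0 && b.size > 0);
  if (a.base_id != b.base_id)
    return true;
  return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}
```

With both accesses expressed relative to the same base the store and the load
are shown not to overlap, whereas the same load described relative to a
different base register is conservatively treated as a conflict.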

Separately to the aliasing issue, because:
 - there is no MEM_EXPR information for the stores, and
 - fwprop1 substituted the sfp in for the first store
ldp_fusion fails to discover the (i12,i14) store pair opportunity.  As a result
we unfortunately end up forming an stp in the middle.

Interestingly if I turn off fwprop1 then we still fail to form the
(12,14) pair due to aliasing.

So it seems the main thing to investigate is how sched1 does its alias
analysis and how that differs from what we're doing in ldp_fusion.

I have some WIP patches that should improve the pair discovery and could
potentially be extended to help with the case of the (12,14) pair here.
Another thing that could help with that is if we populated the MEM_EXPR for the
stores of the structure copy.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-09 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Alex Coplan  changed:

   What|Removed |Added

   Last reconfirmed||2024-05-09
 Status|UNCONFIRMED |NEW
 CC||acoplan at gcc dot gnu.org,
   ||vmakarov at gcc dot gnu.org
 Ever confirmed|0   |1
   Keywords||missed-optimization, ra

--- Comment #1 from Alex Coplan  ---
Confirmed.  There is a lot to unpack here.  Of course, the include isn't needed
in this testcase and the problem can be seen more clearly with a slightly
smaller array size:

typedef struct { int arr[16]; } S;

void g (S *);
void h (S);
void f(int x)
{
  S s;
  g (&s);
  h (s);
}

In this case sizeof(S) = 64 so we should be able to do the copy with 2 LDPs + 2
STPs.
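
As a quick check of that arithmetic, here is a small sketch.  The helper name
toy_pairs_needed and the constants are hypothetical; it assumes a Q register
holds 16 bytes, so a single LDP or STP of Q registers moves 32 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes: a Q register holds 16 bytes, so one LDP/STP of Q
   registers moves 32 bytes.  */
enum { TOY_Q_REG_BYTES = 16, TOY_PAIR_BYTES = 2 * TOY_Q_REG_BYTES };

/* Number of Q-register pair instructions needed to move NBYTES in one
   direction (load or store), rounding up.  */
static size_t
toy_pairs_needed (size_t nbytes)
{
  assert (nbytes > 0);
  return (nbytes + TOY_PAIR_BYTES - 1) / TOY_PAIR_BYTES;
}
```

For a 64-byte structure this gives two pair loads and, by the same count, two
pair stores.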

So just for clarity, the missed ldp/stp started when we turned off the early
ldp/stp formation in memcpy expansion, i.e. with
r14-9373-g19b23bf3c32df3cbb96b3d898a1d7142f7bea4a0 .

However, things already started to regress earlier for this testcase with
r14-4944-gf55cdce3f8dd8503e080e35be59c5f5390f6d95e i.e.

commit f55cdce3f8dd8503e080e35be59c5f5390f6d95e
Author: Vladimir N. Makarov 
Date:   Thu Oct 26 14:50:40 2023

[RA]: Modfify cost calculation for dealing with equivalences

before that RA change we get:

f:
stp x29, x30, [sp, -144]!
mov x29, sp
add x0, sp, 80
bl  g
ldp q29, q28, [sp, 80]
add x0, sp, 16
ldp q31, q30, [sp, 112]
stp q29, q28, [sp, 16]
stp q31, q30, [sp, 48]
bl  h
ldp x29, x30, [sp], 144
ret

and afterwards we get:

f:
stp x29, x30, [sp, -160]!
mov x29, sp
str x19, [sp, 16]
add x19, sp, 96
mov x0, x19
bl  g
add x0, sp, 32
ldp q29, q28, [x19]
ldp q31, q30, [x19, 32]
stp q29, q28, [x0]
stp q31, q30, [x0, 32]
bl  h
ldr x19, [sp, 16]
ldp x29, x30, [sp], 160
ret

which is really not great as now we have a save/restore of x19 and the accesses
end up using different (non-sp) registers which I suspect doesn't help with the
ldp/stp formation (on trunk).

I will try to give a detailed analysis on what goes wrong with the ldp/stp
formation at the RTL level shortly (there are a lot of different issues), but I
think that RA change is a contributing factor.

[Bug target/114936] [14 Regression] Typo in aarch64-ldp-fusion.cc:combine_reg_notes

2024-05-08 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114936

Alex Coplan  changed:

   What|Removed |Added

Summary|[14/15 Regression] Typo in  |[14 Regression] Typo in
   |aarch64-ldp-fusion.cc:combi |aarch64-ldp-fusion.cc:combi
   |ne_reg_notes|ne_reg_notes

--- Comment #2 from Alex Coplan  ---
Fixed on trunk, will backport to 14 after a week or so.

[Bug rtl-optimization/114674] [aarch64] ldp_fusion fails to merge 2 strs due to imprecise alignment info

2024-05-07 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114674

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Alex Coplan  ---
Fixed for GCC 15, thanks for the report.

[Bug target/114936] [14/15 Regression] Typo in aarch64-ldp-fusion.cc:combine_reg_notes

2024-05-03 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114936

Alex Coplan  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot 
gnu.org
   Last reconfirmed||2024-05-03
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED

[Bug target/114936] New: [14/15 Regression] Typo in aarch64-ldp-fusion.cc:combine_reg_notes

2024-05-03 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114936

Bug ID: 114936
   Summary: [14/15 Regression] Typo in
aarch64-ldp-fusion.cc:combine_reg_notes
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

aarch64-ldp-fusion.cc:combine_reg_notes has:

  result = filter_notes (REG_NOTES (i2->rtl ()), result,
 &found_eh_region, fr_expr);
  result = filter_notes (REG_NOTES (i1->rtl ()), result,
 &found_eh_region, fr_expr + 1);

  if (!load_p)
{
  // Simple frame-related sp-relative saves don't need CFI notes, but when
  // we combine them into an stp we will need a CFI note as dwarf2cfi can't
  // interpret the unspec pair representation directly.
  if (RTX_FRAME_RELATED_P (i1->rtl ()) && !fr_expr[0])
fr_expr[0] = copy_rtx (PATTERN (i1->rtl ()));
  if (RTX_FRAME_RELATED_P (i2->rtl ()) && !fr_expr[1])
fr_expr[1] = copy_rtx (PATTERN (i2->rtl ()));
}

so any REG_FRAME_RELATED_EXPR from i2 goes to fr_expr[0] and likewise i1 goes
to fr_expr[1], but then we have the opposite association inside the if
statement.

Many thanks to Matthew Malcomson for pointing this out to me.

I'm going to post the (arguably obvious) patch after testing that writes to
fr_expr + 1 first when we call filter_notes for i2.  We may want to consider a
backport to GCC 14 too.
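
For illustration only, here is a toy model of the intended association.  This
is not the patch itself: toy_insn and toy_collect_fr_exprs are hypothetical
names, and strings stand in for the RTL expressions.  The point is just that
slot 0 describes i1 and slot 1 describes i2 in both places.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an insn: NOTE_EXPR models an explicit
   REG_FRAME_RELATED_EXPR note (NULL if absent), PATTERN models the insn
   pattern used as a fallback.  */
struct toy_insn
{
  const char *note_expr;
  const char *pattern;
  int frame_related;
};

/* Fill FR_EXPR so that slot 0 always describes I1 and slot 1 always
   describes I2, both when taking the explicit note and when falling back
   to the pattern.  */
static void
toy_collect_fr_exprs (const struct toy_insn *i1, const struct toy_insn *i2,
                      const char *fr_expr[2])
{
  assert (i1 != NULL && i2 != NULL);
  fr_expr[1] = i2->note_expr;   /* i2 -> slot 1 */
  fr_expr[0] = i1->note_expr;   /* i1 -> slot 0 */

  if (i1->frame_related && !fr_expr[0])
    fr_expr[0] = i1->pattern;
  if (i2->frame_related && !fr_expr[1])
    fr_expr[1] = i2->pattern;
}
```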

[Bug rtl-optimization/114924] [11/12/13/14/15 Regression] Wrong update of MEM_EXPR by RTL loop unrolling since r11-2963-gd6a05b494b4b71

2024-05-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114924

Alex Coplan  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot 
gnu.org
   Last reconfirmed||2024-05-02
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1

[Bug rtl-optimization/114924] New: [11/12/13/14/15 Regression] Wrong update of MEM_EXPR by RTL loop unrolling since r11-2963-gd6a05b494b4b71

2024-05-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114924

Bug ID: 114924
   Summary: [11/12/13/14/15 Regression] Wrong update of MEM_EXPR
by RTL loop unrolling since r11-2963-gd6a05b494b4b71
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following testcase is reduced from
libgomp/testsuite/libgomp.fortran/imperfect-destructor.f90:

module m
  type t
contains
  final fini
  end type
  integer ccount(3)
  contains
subroutine init(x, n)
  type(t) x
  xi = n
  ccount = 1
end
subroutine fini(x)
  type(t) x
  dcount= s1 (a3)
  do i = 1, 1
block
  do j = 1, 2
block
  do k = 1, a3
block
  type (t) local3
  call init (local3, 3)
end block
  end do
end block
  end do
end block
  end do
end
end

compiling with -O2 -funroll-loops -da and looking at the RTL dumps, I see the
following insn in 284r.loop2_invariant:

(insn 44 40 45 8 (set (mem/c:SI (plus:DI (reg/f:DI 121)
(const_int 8 [0x8])) [3 ccount[2]+0 S4 A64])
(subreg:SI (reg:V2SI 111) 0)) "t.f90":11:16 discrim 2 69
{*movsi_aarch64}
 (expr_list:REG_DEAD (reg:V2SI 111)
(nil)))

then in 285r.loop2_unroll, I see:

(insn 44 40 45 8 (set (mem/c:SI (plus:DI (reg/f:DI 121)
(const_int 8 [0x8])) [3 ccount+0 S4 A64])
(subreg:SI (reg:V2SI 111) 0)) "t.f90":11:16 discrim 2 69
{*movsi_aarch64}
 (expr_list:REG_DEAD (reg/f:DI 121)
(expr_list:REG_DEAD (reg:V2SI 111)
(nil

notably the MEM_EXPR has been changed from ccount[2] to ccount, without a
corresponding change in offset.  This is incorrect.  Setting a watchpoint on
the MEM_ATTRS of the relevant MEM showed that the update happens in
cfgrtl.cc:duplicate_insn_chain, which does the following:

/* We cannot adjust MR_DEPENDENCE_CLIQUE in-place
   since MEM_EXPR is shared so make a copy and
   walk to the subtree again.  */
tree new_expr = unshare_expr (MEM_EXPR (*iter));
if (TREE_CODE (new_expr) == WITH_SIZE_EXPR)
  new_expr = TREE_OPERAND (new_expr, 0);
while (handled_component_p (new_expr))
  new_expr = TREE_OPERAND (new_expr, 0);
MR_DEPENDENCE_CLIQUE (new_expr) = newc;
set_mem_expr (const_cast  (*iter), new_expr);

so the code (correctly) looks through the ARRAY_REF in this case to find
the underlying MEM_REF and updates MR_DEPENDENCE_CLIQUE for that
MEM_REF, but then proceeds to pass the MEM_REF to set_mem_expr, thereby
incorrectly dropping the ARRAY_REF in this case.
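
The shape of the problem can be shown with a toy expression tree.  This is a
sketch under stated assumptions, not GCC code and not the actual patch:
toy_expr, toy_code and toy_update_clique are hypothetical, with TOY_ARRAY_REF
and TOY_MEM_REF standing in for the corresponding tree codes.  The idea is to
walk down to the base with a separate cursor so that the outermost expression
is still the one recorded afterwards.

```c
#include <assert.h>
#include <stddef.h>

enum toy_code { TOY_MEM_REF, TOY_ARRAY_REF };

/* Hypothetical miniature expression node: an ARRAY_REF wraps an inner
   expression in OPERAND; a MEM_REF is the base and carries CLIQUE.  */
struct toy_expr
{
  enum toy_code code;
  struct toy_expr *operand;
  int clique;
};

/* Update the clique on the base of EXPR and return the expression that
   should be recorded for the access.  A separate cursor walks down to the
   base, so the outermost node (e.g. the ARRAY_REF) is what gets returned
   rather than the stripped base.  */
static struct toy_expr *
toy_update_clique (struct toy_expr *expr, int new_clique)
{
  assert (expr != NULL);
  struct toy_expr *base = expr;
  while (base->code == TOY_ARRAY_REF)
    base = base->operand;
  base->clique = new_clique;
  return expr;
}
```

Reusing the walking variable as the return value would hand back the bare base,
which is the analogue of ccount[2] turning into ccount above.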

The code above was introduced in
r11-2963-gd6a05b494b4b714e996a5ca09c5a4a1c41dbd648 so I assume this is a
regression in GCC 11 and beyond.

I have a straightforward patch to fix this which passes bootstrap on
aarch64-linux-gnu, I will post that shortly.

While I don't have a wrong-code reproducer at the moment, we may want to
consider backporting the fix as incorrect MEM_EXPR information could
lead to wrong code.  I found the issue while working on a patch series
that has the side effect of introducing some consistency checking of the
MEM_EXPR information.

[Bug target/114801] New: [14 Regression] arm: ICE in find_cached_value, at rtx-vector-builder.cc:100 with MVE intrinsics

2024-04-22 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114801

Bug ID: 114801
   Summary: [14 Regression] arm: ICE in find_cached_value, at
rtx-vector-builder.cc:100 with MVE intrinsics
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following testcase:

#include 
uint32x4_t test_9() {
  return vdupq_m_n_u32(vdupq_n_u32(0), 0, 0x);
}

ICEs with -march=armv8.1-m.main+mve -mfloat-abi=hard on the trunk. This appears
to be a regression from GCC 13.

For a preprocessed reproducer, take the following:

$ cat t.c
#pragma GCC arm "arm_mve_types.h"
#pragma GCC arm "arm_mve.h" false
uint32x4_t test_9() {
  return vdupq_m_n_u32(vdupq_n_u32(0), 0, 0x);
}
$ gcc/xgcc -B gcc -c t.c -S -o /dev/null -march=armv8.1-m.main+mve
-mfloat-abi=hard
during RTL pass: expand
t.c: In function ‘test_9’:
t.c:4:10: internal compiler error: in find_cached_value, at
rtx-vector-builder.cc:100
4 |   return vdupq_m_n_u32(vdupq_n_u32(0), 0, 0x);
  |  ^~~~
0x2a7fc16 rtx_vector_builder::find_cached_value()
/home/alecop01/toolchain/src/gcc/gcc/rtx-vector-builder.cc:100
0x2a7f9c9 rtx_vector_builder::build()
/home/alecop01/toolchain/src/gcc/gcc/rtx-vector-builder.cc:64
0x2adff41 native_decode_vector_rtx(machine_mode, vec const&, unsigned int, unsigned int, unsigned int)
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7269
0x2ae0068 native_decode_rtx(machine_mode, vec
const&, unsigned int)
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7289
0x2ae10c4 simplify_immed_subreg
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7529
0x2ae1807 simplify_context::simplify_subreg(machine_mode, rtx_def*,
machine_mode, poly_int<1u, unsigned long>)
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7603
0x2ae31f2 simplify_context::simplify_gen_subreg(machine_mode, rtx_def*,
machine_mode, poly_int<1u, unsigned long>)
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7875
0x2ae3644 simplify_context::lowpart_subreg(machine_mode, rtx_def*,
machine_mode)
/home/alecop01/toolchain/src/gcc/gcc/simplify-rtx.cc:7904
0x1e92c3e lowpart_subreg(machine_mode, rtx_def*, machine_mode)
/home/alecop01/toolchain/src/gcc/gcc/rtl.h:3565
0x22f4d11 gen_lowpart_common(machine_mode, rtx_def*)
/home/alecop01/toolchain/src/gcc/gcc/emit-rtl.cc:1627
0x2a7f336 gen_lowpart_general(machine_mode, rtx_def*)
/home/alecop01/toolchain/src/gcc/gcc/rtlhooks.cc:48
0x327a20e arm_mve::function_expander::add_input_operand(insn_code, rtx_def*)
   
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-mve-builtins.cc:2103
0x327a887 arm_mve::function_expander::use_cond_insn(insn_code, unsigned int)
   
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-mve-builtins.cc:2227
0x3282fe2
arm_mve::unspec_mve_function_exact_insn::expand(arm_mve::function_expander&)
const
   
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-mve-builtins-functions.h:339
0x327ab65 arm_mve::function_expander::expand()
   
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-mve-builtins.cc:2287
0x327ae1d arm_mve::expand_builtin(unsigned int, tree_node*, rtx_def*)
   
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-mve-builtins.cc:2352
0x3275215 arm_expand_builtin(tree_node*, rtx_def*, rtx_def*, machine_mode, int)
/home/alecop01/toolchain/src/gcc/gcc/config/arm/arm-builtins.cc:4103
0x20fd3b9 expand_builtin(tree_node*, rtx_def*, rtx_def*, machine_mode, int)
/home/alecop01/toolchain/src/gcc/gcc/builtins.cc:7769
0x236a0ed expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
/home/alecop01/toolchain/src/gcc/gcc/expr.cc:12350
0x235c6d1 expand_expr_real(tree_node*, rtx_def*, machine_mode, expand_modifier,
rtx_def**, bool)
/home/alecop01/toolchain/src/gcc/gcc/expr.cc:9440
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See  for instructions.

[Bug rtl-optimization/114674] [aarch64] ldp_fusion fails to merge 2 strs due to imprecise alignment info

2024-04-10 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114674

Alex Coplan  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #4 from Alex Coplan  ---
Discussing offline with Richard S, an alternative approach would be to use
replace_equiv_address[_nv] instead of adjust_address[_nv]; that way we preserve
most properties of the original mem and just take the address from the other.

In fact this is what aarch64_check_consecutive_mems already does so I think we
should follow that.

I'll try a patch along those lines for stage 1.

[Bug rtl-optimization/114674] [aarch64] ldp_fusion fails to merge 2 strs due to imprecise alignment info

2024-04-10 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114674

Alex Coplan  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2024-04-10

--- Comment #3 from Alex Coplan  ---
Confirmed.

I think it might be best to take the maximum MEM_ALIGN between the adjusted mem
(new_mem) and the original mem (change_mem).  In this case it happens that the
original mem (change_mem) has a stronger alignment guarantee, but in general it
could be the case that the adjusted mem gives a better alignment guarantee. 
Since both locations are known to point to the same address, it seems best to
me to take the largest alignment of the two.
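
In other words, something along the lines of the following sketch, where
toy_merge_align is a hypothetical helper operating on plain integers (alignment
in bits) rather than on the actual mem attributes:

```c
#include <assert.h>

/* Given two alignment guarantees (in bits) that are known to describe the
   same address, return the stronger one.  */
static unsigned int
toy_merge_align (unsigned int align_a, unsigned int align_b)
{
  assert (align_a > 0 && align_b > 0);
  return align_a > align_b ? align_a : align_b;
}
```

This is only sound because both mems refer to the same location; for unrelated
addresses the larger guarantee would not carry over.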

[Bug rtl-optimization/114674] [aarch64] ldp_fusion fails to merge 2 strs due to imprecise alignment info

2024-04-10 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114674

Alex Coplan  changed:

   What|Removed |Added

 CC||acoplan at gcc dot gnu.org
   Keywords||missed-optimization

--- Comment #2 from Alex Coplan  ---
Thanks for the report (and patch), I'll look into this.

[Bug target/114492] Invalid use of gcc_assert (notably in gcc/config/aarch64/aarch64-ldp-fusion.cc)

2024-04-02 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114492

Alex Coplan  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot 
gnu.org

--- Comment #4 from Alex Coplan  ---
I think these should be OK. In the case of:

  for (unsigned i = 0; i < changes.length (); i++)
gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i],
is_changing));

I think this is OK because the pass guarantees to have chosen a singleton move
range for the pair, so we don't rely on the call to restrict_movement_ignoring
updating the move range for any of the changes.  Other changes in the set are
either deletions or no-ops in terms of movement.  So we call this purely for
checking purposes to make sure we're not attempting something invalid.

Similarly in the case of:

  gcc_assert (crtl->ssa->verify_insn_changes (changes));

this is OK because the function doesn't have side effects (other than possibly
dumping).

Discussing this offline with Richard S he suggested asserting that we have
singleton move ranges before calling restrict_movement_ignoring to make that
case more obviously correct, so mine for that improvement (but either way I
think this should be OK).
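
For reference, the general pattern that avoids relying on an assertion's
argument being evaluated looks like the sketch below.  It uses the standard C
assert macro as an analogue, on the assumption that the concern in this PR is
about assertion arguments with side effects; toy_restrict and toy_apply are
hypothetical names and do not correspond to the rtl-ssa API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operation with a side effect: narrows *RANGE to 1 and
   reports success.  */
static bool
toy_restrict (int *range)
{
  *range = 1;
  return true;
}

/* Pattern that stays correct even if assertions are compiled out: perform
   the call unconditionally, then assert only on the saved result.  */
static int
toy_apply (int range)
{
  bool ok = toy_restrict (&range);
  assert (ok);
  (void) ok;    /* avoid an unused-variable warning when NDEBUG is defined */
  return range;
}
```

If the call were written directly inside assert and NDEBUG were defined, the
range would never be narrowed, which is the failure mode to guard against when
the side effect matters.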

[Bug target/114323] [14 Regression] MVE vector load intrinsic miscompiled since r14-5622-g4d7647edfd7d98

2024-03-15 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114323

--- Comment #4 from Alex Coplan  ---
I think the problem is that the arm backend incorrectly sets the const
attribute on this builtin, but it can't be const because it reads memory (it
should be pure instead):

 
sizes-gimplified unsigned V4SI
size 
unit-size 
align:64 warn_if_not_align:0 symtab:0 alias-set -1
structural-equality
attributes 
value >> nunits:4>
HI
size 
unit-size 
align:16 warn_if_not_align:0 symtab:0 alias-set -1 structural-equality
arg-types 
chain >>
pointer_to_this >
readonly addressable used nothrow public external built-in decl_5 decl_6 SI
t.c:2:9
align:16 warn_if_not_align:0 built-in: BUILT_IN_MD:3923 context

attributes 
chain 
chain >>> chain
>
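
To spell out the const/pure distinction with a plain C example (a sketch using
the GNU function attribute syntax; toy_load_sum and toy_caller are hypothetical
and unrelated to the MVE builtins): a function that dereferences a pointer
argument can be declared pure, but not const, because its result depends on the
memory it reads.

```c
#include <assert.h>
#include <stddef.h>

/* "pure": the result depends only on the arguments and on memory that the
   function reads, and it has no other observable effects.  "const" would be
   wrong here because the function dereferences P.  */
__attribute__ ((pure))
static unsigned int
toy_load_sum (const unsigned int *p, size_t n)
{
  unsigned int sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += p[i];
  return sum;
}

/* Store into a local buffer, then read it through the pure function.  The
   stores to BUF must be kept, since the callee reads them.  */
static unsigned int
toy_caller (void)
{
  unsigned int buf[4] = { 1, 2, 3, 4 };
  unsigned int sum = toy_load_sum (buf, 4);
  assert (sum == 10);
  return sum;
}
```

Marking toy_load_sum as const instead would tell the compiler that the call
does not read buf, which is the kind of mismatch that would let the
initializing stores be treated as dead.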

[Bug target/114323] [14 Regression] MVE vector load intrinsic miscompiled since r14-5622-g4d7647edfd7d98

2024-03-13 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114323

--- Comment #1 from Alex Coplan  ---
Hmm, so in 043t.mergephi1 we have:

uint32x4_t foo ()
{
  const uint32_t D.13439[4];
  uint32x4_t V0;

   :
  D.13439 = *.LC0;
  V0_3 = vld1q_u32 (&D.13439);
  D.13439 ={v} {CLOBBER(eos)};
  return V0_3;

}

but then 044t.dse1 says:

  Deleted dead store: D.13439 = *.LC0;

leaving us with a load of uninitialized memory.

[Bug target/114323] New: [14 Regression] MVE vector load intrinsic miscompiled since r14-5622-g4d7647edfd7d98

2024-03-13 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114323

Bug ID: 114323
   Summary: [14 Regression] MVE vector load intrinsic miscompiled
since r14-5622-g4d7647edfd7d98
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following testcase:

#include 

uint32x4_t foo (void) {
  uint32x4_t V0 = vld1q_u32(((const uint32_t[4]){1, 2, 3, 4}));
  return V0;
}

is miscompiled with -O2 -march=armv8.1-m.main+mve -mfloat-abi=hard on
arm-none-eabi.  Since r14-5622-g4d7647edfd7d985fbefe13de03c8bc2e3a74fc61 we
generate:

foo:
sub sp, sp, #16
vldrw.32        q0, [sp]
add sp, sp, #16
bx  lr

i.e. we do a vector load from uninitialized stack memory.  GCC 13 used to give:

foo:
sub sp, sp, #16
mov ip, sp
ldr r3, .L4
ldm r3, {r0, r1, r2, r3}
stm ip, {r0, r1, r2, r3}
vldrw.32        q0, [ip]
add sp, sp, #16
bx  lr
.align  2
.L4:
.word   .LANCHOR0
.size   foo, .-foo
.section        .rodata
.align  2
.set    .LANCHOR0,. + 0
.word   1
.word   2
.word   3
.word   4

which, while not optimal, is at least correct.  Here is a full executable
testcase for the testsuite:

#include 

__attribute__((noipa))
uint32x4_t foo (void) {
  uint32x4_t V0 = vld1q_u32(((const uint32_t[4]){1, 2, 3, 4}));
  return V0;
}

int main(void)
{
  uint32_t buf[4];
  vst1q_u32 (buf, foo());

  for (int i = 0; i < 4; i++)
if (buf[i] != i+1)
  __builtin_abort ();
}

[Bug middle-end/114291] New: -fcompare-debug failure (length) with -fprofile-use at -O0

2024-03-09 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114291

Bug ID: 114291
   Summary: -fcompare-debug failure (length) with -fprofile-use at
-O0
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following is an -fcompare-debug failure that shows up with PGO (here on
aarch64-linux-gnu):

$ cat t.c
void foo() {}
int main(void) {}
$ gcc t.c -fprofile-generate
$ ./a.out
$ gcc t.c -fprofile-use -fcompare-debug
gcc: error: t.c: ‘-fcompare-debug’ failure (length)

The difference seems to be as follows:

$ gcc t.c -fprofile-use -fdump-final-insns=nodebug.final
$ gcc t.c -fprofile-use -g -fcompare-debug-second
-fdump-final-insns=debug.final
$ diff -u nodebug.final debug.final
--- nodebug.final   2024-03-09 12:00:43.875729773 +
+++ debug.final 2024-03-09 12:00:52.555650670 +
@@ -1,5 +1,6 @@

-;; Function foo (foo, funcdef_no=0, decl_uid=4426, cgraph_uid=1,
symbol_order=0) (unlikely executed)
+
+;; Function foo (foo, funcdef_no=0, cgraph_uid=1, symbol_order=0) (unlikely
executed)

 (note # 0 0 NOTE_INSN_DELETED)
 (note # 0 0 NOTE_INSN_PROLOGUE_END)
@@ -18,7 +19,10 @@
 (barrier # 0 0)
 (note # 0 0 NOTE_INSN_DELETED)

-;; Function main (main, funcdef_no=1, decl_uid=4429, cgraph_uid=2,
symbol_order=1)
+Declarations used by main, sorted by DECL_UID:
+0:   void ;
+
+;; Function main (main, funcdef_no=1, cgraph_uid=2, symbol_order=1)

 (note # 0 0 NOTE_INSN_DELETED)
 (note # 0 0 [bb 2] NOTE_INSN_BASIC_BLOCK)

[Bug target/114284] [14 Regression] arm: Load of volatile short gets miscompiled (loaded twice) since r14-8319

2024-03-08 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114284

--- Comment #3 from Alex Coplan  ---
I think this has been fixed by
r14-9379-ga0e945888d973fc1a4a9d2944aa7e96d2a4d7581

[Bug target/114284] New: [14 Regression] arm: Load of volatile short gets miscompiled (loaded twice)

2024-03-08 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114284

Bug ID: 114284
   Summary: [14 Regression] arm: Load of volatile short gets
miscompiled (loaded twice)
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following is a wrong code regression in GCC 14:

volatile short x;
short foo() {
  return x;
}

with -march=armv8-m.base -O2 on the trunk we get:

foo:
movw    r3, #:lower16:.LANCHOR0
movt    r3, #:upper16:.LANCHOR0
ldrh    r2, [r3]
movs    r0, #0
ldrsh   r0, [r3, r0]
bx  lr

i.e. x is loaded twice, but with GCC 13 we get:

foo:
movw    r3, #:lower16:.LANCHOR0
movt    r3, #:upper16:.LANCHOR0
ldrh    r0, [r3]
sxth    r0, r0
bx  lr

I suppose ideally we would have just one ldrsh, but the GCC 13 code is OK.

[Bug tree-optimization/114193] New: missed early break vectorization of reduction

2024-03-01 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114193

Bug ID: 114193
   Summary: missed early break vectorization of reduction
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

For the following loop:

int a[1024];
int f(int *x, int n)
{
int sum = 0;
for (int i = 0; i < n; i++)
{
if (a[i] == 42)
break;
sum += a[i];
}
return sum;
}

at -O3 on aarch64 we miss vectorizing it.  It works if I move the early exit
down below the update of sum.  It looks like vect_analyze_scalar_cycles fails
to detect this as a reduction:

/app/example.c:5:23: note:   Analyze phi: sum_10 = PHI 
/app/example.c:5:23: missed:   intermediate value used outside loop.
/app/example.c:5:23: missed:   Unknown def-use cycle pattern.
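
Note that moving the early exit is not a source-level workaround, since the two
orderings compute different results when the exit is taken.  The sketch below
makes that explicit; toy_sum_exit_first and toy_sum_update_first are
hypothetical variants that take the array as a parameter instead of using the
global a.

```c
#include <assert.h>

/* Exit test before the update: the element equal to 42 is not added.  */
static int
toy_sum_exit_first (const int *a, int n)
{
  assert (n >= 0);
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      if (a[i] == 42)
        break;
      sum += a[i];
    }
  return sum;
}

/* Exit test after the update: the element equal to 42 is added before the
   loop exits.  */
static int
toy_sum_update_first (const int *a, int n)
{
  assert (n >= 0);
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      sum += a[i];
      if (a[i] == 42)
        break;
    }
  return sum;
}
```

The two agree only when no element equals 42; otherwise the second form
includes the matching element in the sum.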

[Bug tree-optimization/114192] New: scalar code left around following early break vectorization of reduction

2024-03-01 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192

Bug ID: 114192
   Summary: scalar code left around following early break
vectorization of reduction
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

For the following testcase:

int a[1024];
int f4(int *x, int n)
{
int sum = 0;
for (int i = 0; i < n; i++)
{
sum += a[i];
if (a[i] == 42)
break;
}
return sum;
}

at -O3 on aarch64 we vectorize it and get the following vector loop:

.L4:
cmp x7, x2
beq .L23
.L6:
ubfiz   x3, x2, 4, 32
ldr w6, [x4, x2, lsl 2]// scalar load
mov v27.16b, v30.16b
mov w0, w5
add v30.4s, v30.4s, v25.4s
add w5, w5, w6 // scalar add
ldr q29, [x4, x3]
add x2, x2, 1
cmeq    v31.4s, v29.4s, v26.4s
add v28.4s, v28.4s, v29.4s
umaxp   v31.4s, v31.4s, v31.4s
fmov    x3, d31
cbz x3, .L4

but here the old scalar code has been left around.  If we remove the early exit
from the loop, then although we still leave the scalar code around in the
vectorizer, it gets optimized away immediately by the following DCE pass.

Without the early exit, in the vectorizer dump we have:

   [local count: 860067200]:
  # sum_10 = PHI 
  # i_12 = PHI 
  # vect_sum_10.8_25 = PHI 
  # vectp_a.9_26 = PHI 
  # ivtmp_32 = PHI 
  vect__1.11_28 = MEM  [(int *)vectp_a.9_26];
  _1 = a[i_12]; // scalar load
  vect_sum_6.12_29 = vect__1.11_28 + vect_sum_10.8_25;
  sum_6 = _1 + sum_10;
  i_7 = i_12 + 1;
  vectp_a.9_27 = vectp_a.9_26 + 16;
  ivtmp_33 = ivtmp_32 + 1;
  if (ivtmp_33 < bnd.5_22)
goto ; [89.00%]
  else
goto ; [11.00%]

i.e. the scalar load is left around, but it seems to get cleaned up by the
(immediately following) dce pass:

   [local count: 860067200]:
  # vect_sum_10.8_25 = PHI 
  # vectp_a.9_26 = PHI 
  # ivtmp_32 = PHI 
  vect__1.11_28 = MEM  [(int *)vectp_a.9_26];
  vect_sum_6.12_29 = vect__1.11_28 + vect_sum_10.8_25;
  vectp_a.9_27 = vectp_a.9_26 + 16;
  ivtmp_33 = ivtmp_32 + 1;
  if (ivtmp_33 < bnd.5_22)
goto ; [89.00%]
  else
goto ; [11.00%]

perhaps the dce needs improving to clean up the dead scalar code in the early
exit case, too.

[Bug tree-optimization/111770] predicated loads inactive lane values not modelled

2024-02-22 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

--- Comment #4 from Alex Coplan  ---
(In reply to Richard Biener from comment #3)
> As said X + 0. -> X is an invalid transform with FP unless there are no
> signed zeros (maybe also problematic with sign-dependent rounding).

Yeah, I was thinking about the integer case above.

> 
> I think we agree to define .MASK_LOAD to zero masked elements.  When we need
> something else we need to add an explicit ELSE value.  That needs documenting
> of course and also possibly testsuite coverage - I _think_ you should be able
> to do a GIMPLE frontend testcase for this.

Sounds good, thanks.

> 
> Note this behavior would extend to .MASK_GATHER_LOAD as well as
> the load-lanes and -len variants.
> 
> Unfortunately we do not have _any_ internals manual documentation for
> internal functions - but you can backtrack to the optabs documentation
> where this would need documenting.
> 
> Now, if-conversion could indeed elide the .COND_ADD for integers.  It's
> problematic there only because of signed overflow undefinedness, so
> you shouldn't see it for 'unsigned' already, and adding zero is safe.

Can you elaborate on this a bit? Do you mean to say that the .COND_ADD is only
there to avoid if-conversion introducing UB due to signed overflow? ISTM it's
needed for correctness even without that, as the addend needn't be guaranteed
to be zero in the general case.

> if-conversion would need to have an idea of all the ranges involved here
> so it might be a bit sophisticated to get it right.

Does what I suggested above make any sense, or do you have in mind a different
way of handling this in if-conversion? I'm wondering how ifcvt should determine
that the addend is zero in the case where the predicate is false.

Thanks

[Bug tree-optimization/111770] predicated loads inactive lane values not modelled

2024-02-21 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

--- Comment #2 from Alex Coplan  ---
I think to progress this and related cases we need to have .MASK_LOAD defined
to zero in the case that the predicate is false (either unconditionally for all
targets if possible or otherwise conditionally for targets where that is safe).

Here is a related case:

int bar(int n, char *a, char *b, char *c) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
if (c[i] == 0)
  sum += a[i] * b[i];
  return sum;
}

in this case we get the missed optimization even before vectorization during
ifcvt (in some ways it is a simpler case to consider as only scalars are
involved).  Here with -O3 -march=armv9-a from ifcvt we get:

   [local count: 955630224]:
  # sum_23 = PHI <_ifc__41(8), 0(18)>
  # i_25 = PHI 
  _1 = (sizetype) i_25;
  _2 = c_16(D) + _1;
  _3 = *_2;
  _29 = _3 == 0;
  _43 = _42 + _1;
  _4 = (char *) _43;
  _5 = .MASK_LOAD (_4, 8B, _29);
  _6 = (int) _5;
  _45 = _44 + _1;
  _7 = (char *) _45;
  _8 = .MASK_LOAD (_7, 8B, _29);
  _9 = (int) _8;
  _46 = (unsigned int) _6;
  _47 = (unsigned int) _9;
  _48 = _46 * _47;
  _10 = (int) _48;
  _ifc__41 = .COND_ADD (_29, sum_23, _10, sum_23);

for this case it should be possible to use an unpredicated add instead of a
.COND_ADD.  We essentially need to show that this transformation is valid:

  _29 ? sum_23 + _10 : sum_23 --> sum_23 + _10

and this essentially boils down to showing that:

  _29 = false => _10 = 0

Now, I'm not sure if there's a way of match-and-simplifying some GIMPLE
expression under the assumption that a given SSA name takes a particular value.
But if there were, and we defined .MASK_LOAD to yield zero given a false
predicate, then we could evaluate _10 under the assumption that _29 = false.
With a simple match.pd rule folding .MASK_LOAD with a false predicate to zero,
_10 would then evaluate to zero, establishing _10 = 0 and proving the
transformation correct.  If such an approach is possible then I guess ifcvt
could use it to avoid conditionalizing statements unnecessarily.

Richi: any thoughts on the above or on how we should handle this sort of thing?
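
To make the argument concrete, here is a scalar sketch of the identity in C++.
The helpers mask_load_zero and cond_add are made-up models of .MASK_LOAD and
.COND_ADD, not GCC code, and mask_load_zero assumes the proposed semantics
(an inactive load yields zero), which is not guaranteed today.  Inputs are kept
small so the plain signed add cannot overflow.

```cpp
#include <cassert>

// Hypothetical model of .MASK_LOAD under the *proposed* semantics:
// a false predicate yields 0 instead of an unspecified value.
static int mask_load_zero(const char *p, bool pred) {
  return pred ? *p : 0;
}

// Hypothetical model of .COND_ADD (pred, a, b, else_val).
static int cond_add(bool pred, int a, int b, int else_val) {
  return pred ? a + b : else_val;
}

// The reduction as ifcvt emits it: a predicated add.
int sum_predicated(int n, const char *a, const char *b, const char *c) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    bool pred = c[i] == 0;
    int prod = mask_load_zero(a + i, pred) * mask_load_zero(b + i, pred);
    sum = cond_add(pred, sum, prod, sum);
  }
  return sum;
}

// The same reduction with a plain add.  This is only equivalent because
// pred == false implies prod == 0 under the assumed load semantics.
int sum_unpredicated(int n, const char *a, const char *b, const char *c) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    bool pred = c[i] == 0;
    int prod = mask_load_zero(a + i, pred) * mask_load_zero(b + i, pred);
    sum = sum + prod;
  }
  return sum;
}

// Small usage check: both forms agree on a mixed active/inactive input.
void check_equivalence() {
  const char a[] = {1, 2, 3, 4};
  const char b[] = {5, 6, 7, 8};
  const char c[] = {0, 1, 0, 1};
  assert(sum_predicated(4, a, b, c) == sum_unpredicated(4, a, b, c));
}
```

If the inactive lanes could hold arbitrary values instead, prod would no longer
be zero when the predicate is false and the two functions would differ, which is
why the zeroing guarantee matters here.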

[Bug target/112922] [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto

2024-02-20 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922

Alex Coplan  changed:

   What|Removed |Added

 CC||acoplan at gcc dot gnu.org,
   ||rsandifo at gcc dot gnu.org

--- Comment #1 from Alex Coplan  ---
So I did some bisection on this, and indeed it seems to have started with
r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 i.e.

commit 9f0f7d802482a8958d6cdc72f1fe0c8549db2182
Author: Richard Sandiford 
Date:   Thu Dec 7 19:41:19 2023

aarch64: Add an early RA for strided registers

but then it seemed to get fixed shortly afterwards by
r14-6339-g8b5cd6c4519cc120badd2b35a9e30d4deb82c012 i.e.

commit 8b5cd6c4519cc120badd2b35a9e30d4deb82c012
Author: Richard Sandiford 
Date:   Fri Dec 8 16:27:40 2023

aarch64: Some tweaks to the early-ra pass

CCing Richard S who can hopefully confirm if that change was expected to fix
correctness / wrong code issues.

[Bug target/111677] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-02-20 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #34 from Alex Coplan  ---
Fixed for all active branches.

[Bug target/111677] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-02-14 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

Summary|[12 Regression] darktable   |darktable build on aarch64
   |build on aarch64 fails with |fails with unrecognizable
   |unrecognizable insn due to  |insn due to
   |-fstack-protector changes   |-fstack-protector changes

--- Comment #32 from Alex Coplan  ---
Fixed for GCC 12, keeping open for a final backport to GCC 11 (since the stack
protector patches were also backported there, and the underlying issue is
latent there).

[Bug c++/113658] GCC 14 has incomplete impl for declared feature "cxx_constexpr_string_builtins"

2024-02-13 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113658

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Alex Coplan  ---
Fixed, thanks for the report.

[Bug target/111677] [12 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-02-12 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

--- Comment #30 from Alex Coplan  ---
Backport for GCC 12 submitted:
https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645415.html

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-08 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #12 from Alex Coplan  ---
Here is an alternative testcase that also fails in the same way on the GCC 12
and 13 branches:

void foo(int x, int y, int z, int d, int *buf)
{
  for(int i = z; i < y-z; ++i)
    for(int j = 0; j < d; ++j)
      buf[i*x+(z-j-1)] = buf[i*x+(z+j)];
}

void bar(int x, int y, int z, int d, int *buf)
{
  for(int i = 0; i < d; ++i)
    for(int j = z; j < x-z; ++j)
      buf[j+(z-i-1)*x] = buf[j+(z+i)*x];
}

__attribute__((noipa))
void baz(int x, int y, int d, int *buf)
{
  foo(x, y, 0, d, buf);
  bar(x, y, 0, d, buf);
}

int main(void)
{
  int a[] = { 1, 2, 3 };
  baz (1, 2, 1, a+1);
  /* buf = a+1.
     foo does:
     buf[-1] = buf[0]; // { 2, 2, 3 }
     buf[0] = buf[1];  // { 2, 3, 3 }

     bar does:
     buf[-1] = buf[0]; // { 3, 3, 3 }  */
  for (int i = 0; i < 2; i++)
    if (a[i] != 3)
      __builtin_abort ();
}

[Bug target/111677] [12 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-02-07 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

Summary|[12/13 Regression]  |[12 Regression] darktable
   |darktable build on aarch64  |build on aarch64 fails with
   |fails with unrecognizable   |unrecognizable insn due to
   |insn due to |-fstack-protector changes
   |-fstack-protector changes   |

--- Comment #29 from Alex Coplan  ---
Should be fixed for GCC 13, I'll work on a backport for GCC 12 too.

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-06 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #7 from Alex Coplan  ---
(In reply to Andrew Pinski from comment #6)
> (In reply to Jakub Jelinek from comment #5)
> > My bisection points to r12-5915-ge93809f62363ba4b233858005aef652fb550e896
> 
> Which means it is related to bug 110702 .
> 
> Again try -fno-ivopts . I suspect ivopts is producing some odd ir that is
> confusing modref here.

Yeah, it seems -fno-ivopts makes the execution test pass too (so -O
-fno-ivopts).

[Bug tree-optimization/113787] [14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-06 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #4 from Alex Coplan  ---
Same with the head of the GCC 12 branch, but I agree it isn't a [14 Regression]
as I can reproduce the issue with basepoints/gcc-14, so maybe something was
backported to 12/13 that is making it latent on the branches?

[Bug tree-optimization/113787] [14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-06 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #3 from Alex Coplan  ---
(In reply to Jakub Jelinek from comment #1)
> Why do you think it is a 14 Regression?
> Seems r12-5166 works fine while r12-6600 already doesn't, so that would make
> it [12/13/14 Regression], no?

Well, on the head of the GCC 13 branch the execution test seems to pass for me
and I see no difference with/without ipa-modref.  I'll double check with GCC 12.

[Bug tree-optimization/113787] New: [14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-06 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

Bug ID: 113787
   Summary: [14 Regression] Wrong code at -O with ipa-modref on
aarch64
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following testcase appears to be miscompiled on the trunk, on
aarch64-linux-gnu:

$ cat t.c
void foo(int x, int y, int z, int d, int *buf)
{
  for(int i = z; i < y-z; ++i)
    for(int j = 0; j < d; ++j)
      /* buf[x(i+1) + j] = buf[x(i+1)-j-1] */
      buf[i*x+(x-z+j)] = buf[i*x+(x-z-1-j)];
}

void bar(int x, int y, int z, int d, int *buf)
{
  for(int i = 0; i < d; ++i)
    for(int j = z; j < x-z; ++j)
      /* buf[j+(y+i)*x] = buf[j+(y-1-i)*x] */
      buf[j+(y-z+i)*x] = buf[j+(y-z-1-i)*x];
}

__attribute__((noipa))
void baz(int x, int y, int d, int *buf)
{
  foo(x, y, 0, d, buf);
  bar(x, y, 0, d, buf);
}

int main(void)
{
  int a[] = { 1, 2, 3 };
  baz (1, 2, 1, a);
  /* foo does:
     buf[1] = buf[0];
     buf[2] = buf[1];

     bar does:
     buf[2] = buf[1]; (no-op)
     so we should have { 1, 1, 1 }.  */
  for (int i = 0; i < 3; i++)
    if (a[i] != 1)
      __builtin_abort ();
}
$ gcc t.c -O -fno-ipa-modref
$ ./a.out
$ gcc t.c -O
$ ./a.out
Aborted

The problem seems to be that the call to foo gets incorrectly optimized
out from baz when ipa-modref is enabled:

$ gcc -c -S -o /dev/null t.c -O -fno-ipa-modref -fdump-tree-optimized=good.tree
$ gcc -c -S -o /dev/null t.c -O -fdump-tree-optimized=bad.tree
$ diff -u good.tree bad.tree
--- good.tree   2024-02-06 13:23:36.080926703 +
+++ bad.tree2024-02-06 13:23:38.356916302 +
@@ -223,7 +223,6 @@
 void baz (int x, int y, int d, int * buf)
 {
[local count: 1073741824]:
-  foo (x_2(D), y_3(D), 0, d_4(D), buf_5(D));
   bar (x_2(D), y_3(D), 0, d_4(D), buf_5(D));
   return;

I can't seem to reproduce the issue with GCC 13 or on x86_64.

[Bug middle-end/113705] [14 Regression] ICE in decompose, at wide-int.h:1049 on aarch64-linux-gnu since r14-8680-g2f14c0dbb78985

2024-02-01 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113705

Alex Coplan  changed:

   What|Removed |Added

Summary|[14 Regression] ICE in  |[14 Regression] ICE in
   |decompose, at   |decompose, at
   |wide-int.h:1049 on  |wide-int.h:1049 on
   |aarch64-linux-gnu   |aarch64-linux-gnu since
   ||r14-8680-g2f14c0dbb78985

--- Comment #3 from Alex Coplan  ---
Started with r14-8680-g2f14c0dbb789852947cb58fdf7d3162413f053fa :

commit 2f14c0dbb789852947cb58fdf7d3162413f053fa
Author: Roger Sayle 
Date:   Thu Feb 1 06:10:42 2024

PR target/113560: Enhance is_widening_mult_rhs_p.

[Bug middle-end/113705] [14 Regression] ICE in decompose, at wide-int.h:1049 on aarch64-linux-gnu

2024-02-01 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113705

Alex Coplan  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 CC||acoplan at gcc dot gnu.org
   Last reconfirmed||2024-02-01
 Ever confirmed|0   |1

--- Comment #2 from Alex Coplan  ---
Confirmed. Here is a reduced testcase that ICEs with -O2 on aarch64-linux-gnu:

void free();
template <typename storage> struct generic_wide_int : storage {
  long elt() const;
};
int elt_i;
template <typename storage> long generic_wide_int<storage>::elt() const {
  return this->get_val()[elt_i];
}
struct wide_int_storage {
  struct {
    long val[0];
    long valp;
  } u;
  unsigned len;
  int precision;
  wide_int_storage(const wide_int_storage &);
  ~wide_int_storage();
  const long *get_val() const;
  unsigned get_len() const;
};
wide_int_storage::wide_int_storage(const wide_int_storage &) {
  if (__builtin_expect(precision, 0))
    u.valp = 0;
}
wide_int_storage::~wide_int_storage() {
  if (__builtin_expect(precision, 0))
    free();
}
const long *wide_int_storage::get_val() const { return u.val; }
unsigned wide_int_storage::get_len() const { return len; }
struct irange {
  generic_wide_int<wide_int_storage> upper_bound() const;
  generic_wide_int<wide_int_storage> *m_base;
};
generic_wide_int<wide_int_storage> irange::upper_bound() const {
  return m_base[1];
}
void set_irange() {
  irange r;
  for (unsigned i;;) {
    generic_wide_int<wide_int_storage> __trans_tmp_1 = r.upper_bound();
    long *__trans_tmp_2;
    unsigned short *len;
    *len = __trans_tmp_1.get_len();
    for (i = 0; i < *len; ++i)
      *__trans_tmp_2++ = __trans_tmp_1.elt();
  }
}

[Bug target/111677] [12/13 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-31 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

Summary|[12/13/14 Regression]   |[12/13 Regression]
   |darktable build on aarch64  |darktable build on aarch64
   |fails with unrecognizable   |fails with unrecognizable
   |insn due to |insn due to
   |-fstack-protector changes   |-fstack-protector changes

--- Comment #27 from Alex Coplan  ---
Fixed on trunk for GCC 14, keeping open for backports.

[Bug target/111677] [12/13/14 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-30 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

--- Comment #25 from Alex Coplan  ---
Proposed fix for GCC 13:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/644459.html

[Bug target/111677] [12/13/14 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-30 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

   Keywords||patch

--- Comment #24 from Alex Coplan  ---
Proposed fix for trunk:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/61.html

[Bug target/111677] [12/13/14 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-30 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

   Keywords|needs-bisection |
  Known to fail|13.2.1  |14.0
  Known to work|14.0|
Version|13.2.0  |13.2.1
Summary|[12/13 Regression]  |[12/13/14 Regression]
   |darktable build on aarch64  |darktable build on aarch64
   |fails with unrecognizable   |fails with unrecognizable
   |insn due to |insn due to
   |-fstack-protector changes   |-fstack-protector changes

--- Comment #23 from Alex Coplan  ---
I discovered this by accident while working on a patch for trunk: adding
-funroll-loops to the testcase in #c20 is enough to make the ICE trigger on the
trunk, too.

Testing a fix for trunk and a backport to 13 (to start with).

To reproduce on the trunk (t.c as in #c20):

$ gcc/xgcc -B gcc -c t.c -O3 -ffast-math -fopenmp -fstack-protector-strong -funroll-loops
t.c: In function ‘dt_bilateral_splat.simdclone.1’:
t.c:25:1: error: unrecognizable insn:
   25 | }
  | ^
(insn 2182 2181 406 85 (set (mem/c:TF (plus:DI (reg/f:DI 31 sp)
(const_int 512 [0x200])) [7  S16 A8])
(reg:TF 55 v23)) -1
 (expr_list:REG_DEAD (reg:TF 55 v23)
(nil)))
during RTL pass: sched_fusion
t.c:25:1: internal compiler error: in get_attr_type, at
config/aarch64/aarch64.md:29678
0x74a68f _fatal_insn(char const*, rtx_def const*, char const*, int, char
const*)
/home/alecop01/toolchain/src/gcc/gcc/rtl-error.cc:108
0x74a6c3 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
/home/alecop01/toolchain/src/gcc/gcc/rtl-error.cc:116
0x18cf03b get_attr_type(rtx_insn*)
/home/alecop01/toolchain/src/gcc/gcc/config/aarch64/aarch64.md:29678
0x13278b7 aarch64_sched_variable_issue
/home/alecop01/toolchain/src/gcc/gcc/config/aarch64/aarch64.cc:15827
0x13278b7 aarch64_sched_variable_issue
/home/alecop01/toolchain/src/gcc/gcc/config/aarch64/aarch64.cc:15818
0x1e25057 schedule_block(basic_block_def**, void*)
/home/alecop01/toolchain/src/gcc/gcc/haifa-sched.cc:6912
0xeb307f schedule_region
/home/alecop01/toolchain/src/gcc/gcc/sched-rgn.cc:3203
0xeb307f schedule_insns()
/home/alecop01/toolchain/src/gcc/gcc/sched-rgn.cc:3525
0xeb34a3 schedule_insns()
/home/alecop01/toolchain/src/gcc/gcc/sched-rgn.cc:3511
0xeb34a3 rest_of_handle_sched_fusion
/home/alecop01/toolchain/src/gcc/gcc/sched-rgn.cc:3760
0xeb34a3 execute
/home/alecop01/toolchain/src/gcc/gcc/sched-rgn.cc:3938
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See  for instructions.

[Bug target/111677] [12/13 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-30 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

Alex Coplan  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #22 from Alex Coplan  ---
(In reply to Richard Sandiford from comment #21)
> 
> aarch64_get_separate_components is supposed to vet shrink-wrappable
> offsets, but in this case the offset looks valid, since:
> 
> str q22, [sp, #512]
> 
> is a valid instruction.  Perhaps the constraints are too narrow?

Yeah, as discussed offline, for T{I,F}mode we deliberately restrict the range
to the ldp x-reg range, since at least for TImode we don't know pre-RA how it
will be allocated (a single q reg or a pair of x regs).

We could look at using a different mode for the save that doesn't have those
restrictions, I'll try to do that.
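
As a rough illustration of why offset 512 falls between the two ranges, here is
a small C++ sketch.  The helper names are made up for this example and are not
GCC functions; the ranges are the architectural immediate ranges for LDP/STP of
X registers (signed 7-bit immediate scaled by 8) and for the scaled
unsigned-offset form of LDR/STR of a Q register (unsigned 12-bit immediate
scaled by 16).

```cpp
#include <cassert>

// Hypothetical helper: offset accepted by LDP/STP of two X registers.
// Signed 7-bit immediate scaled by 8, i.e. multiples of 8 in [-512, 504].
bool ldp_x_offset_ok(long off) {
  return off % 8 == 0 && off >= -512 && off <= 504;
}

// Hypothetical helper: offset accepted by the scaled unsigned-offset form of
// LDR/STR of a Q register.  Unsigned 12-bit immediate scaled by 16, i.e.
// multiples of 16 in [0, 65520].
bool str_q_offset_ok(long off) {
  return off % 16 == 0 && off >= 0 && off <= 65520;
}

// The offset from the unrecognizable insn: fine for a single Q-register
// store, but just outside the X-register pair range that T{I,F}mode
// addresses are restricted to.
void check_offset_512() {
  assert(str_q_offset_ok(512));
  assert(!ldp_x_offset_ok(512));
  assert(ldp_x_offset_ok(504));
}
```

So "str q22, [sp, #512]" is a valid instruction on its own, but a TFmode
address at sp+512 is rejected because it could not be used if the value ended
up in a pair of x registers.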

[Bug c++/113658] GCC 14 has incomplete impl for declared feature "cxx_constexpr_string_builtins"

2024-01-30 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113658

Alex Coplan  changed:

   What|Removed |Added

   Last reconfirmed||2024-01-30
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot 
gnu.org

--- Comment #5 from Alex Coplan  ---
(In reply to Jakub Jelinek from comment #3)
> Obviously using __has_builtin is much better than using the really badly
> designed __has_feature/__has_extension.
> That said, wcs{chr,cmp,len,ncmp} and wmem{chr,cmp} aren't builtins in gcc
> either, so I guess we shouldn't announce this "feature".

Mine, then.  I can prepare a patch to stop advertising the feature.

[Bug tree-optimization/113661] New: [14 Regression] xalancbmk miscompiled on aarch64 since r14-7194-g6cb155a6cf3142

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113661

Bug ID: 113661
   Summary: [14 Regression] xalancbmk miscompiled on aarch64 since
r14-7194-g6cb155a6cf3142
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

xalancbmk (both from SPEC 2006 and SPEC 2017) seems to be miscompiled on
aarch64 since r14-7194-g6cb155a6cf314232248a12bdd395ed4151ae5a28 i.e.

commit 6cb155a6cf314232248a12bdd395ed4151ae5a28 (refs/bisect/bad)
Author: Tamar Christina 
Date:   Fri Jan 12 15:24:49 2024 +

middle-end: make memory analysis for early break more deterministic
[PR113135]

I see:

*** Miscompare of ref-t5.out

with the options -Ofast -fomit-frame-pointer -mcpu=neoverse-v1 -flto=auto .

[Bug target/113616] [14 Regression] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113616

Alex Coplan  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Alex Coplan  ---
Should be fixed, thanks for the report.

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

--- Comment #5 from Alex Coplan  ---
Indeed passing -mearly-ra=none makes the ICE go away as well.

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|NEW
   Assignee|acoplan at gcc dot gnu.org |unassigned at gcc dot 
gnu.org

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

Alex Coplan  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #4 from Alex Coplan  ---
I think this is an early RA problem.  In asmcons (in function qux), we have:

   29: x1:DI=[r122:DI+0x8]
   30: x0:DI=[r122:DI]

and then in early_ra, we get:

   29: x1:DI=[v31:DI+0x8]
   30: x0:DI=[v31:DI]

CCing Richard S for an opinion.

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

--- Comment #3 from Alex Coplan  ---
I think ldp_fusion is exposing a latent issue here.  We trip the assert:

gcc_assert (aarch64_mem_pair_lanes_operand (mem, pair_mode));

on the RTL:

(rr) pr mem
(mem/f:V2x8QI (reg:DI 63 v31) [0 +0 S16 A64])

because v31 isn't a valid base register according to
aarch64_regno_ok_for_base_p.  This comes from the following RTL in sched1,
where we already have:

   30: x0:DI=[v31:DI]
   29: x1:DI=[v31:DI+0x8]

but again these mems look invalid as per aarch64_regno_ok_for_base_p.

[Bug target/113616] [14 Regression] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252

2024-01-29 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113616

Alex Coplan  changed:

   What|Removed |Added

URL||https://gcc.gnu.org/piperma
   ||il/gcc-patches/2024-January
   ||/644167.html
   Keywords||patch

--- Comment #4 from Alex Coplan  ---
Patch submitted:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/644167.html

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

Alex Coplan  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
  Known to fail||14.0
 Target||aarch64-*-*
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot 
gnu.org

--- Comment #2 from Alex Coplan  ---
Confirmed, mine.

[Bug target/113616] [14 Regression] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113616

--- Comment #3 from Alex Coplan  ---
Testing a patch.

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

Alex Coplan  changed:

   What|Removed |Added

   Last reconfirmed||2024-01-26
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
 CC||acoplan at gcc dot gnu.org

--- Comment #1 from Alex Coplan  ---
Confirmed.

(In reply to Wilco from comment #0)
> A possible fix would be to avoid emitting LDP/STP in memcpy/memmove/memset
> expansions.

Yeah, so I had posted
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636855.html for that
but held off from committing it at the time as IMO there wasn't enough evidence
to show that this helps in general (and the pass could in theory miss
opportunities which would lead to regressions). 

But perhaps this is a good argument for going ahead with that change (of course
it will need rebasing).

[Bug target/113616] [14 Regression] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113616

--- Comment #2 from Alex Coplan  ---
I think the problem is this loop (and others that iterate over debug
uses in this way):

  // Now that we've characterized the defs involved, go through the
  // debug uses and determine how to update them (if needed).
  for (auto use : set->debug_insn_uses ())
{
  if (*pair_dst < *use->insn () && defs[1])
// We're re-ordering defs[1] above a previous use of the
// same resource.
update_debug_use (use, defs[1], writeback_pats[1]);
  else if (*pair_dst >= *use->insn ())
// We're re-ordering defs[0] below its use.
update_debug_use (use, defs[0], writeback_pats[0]);
}

because `update_debug_use` can remove uses from the list of debug uses,
we can't use a for-range loop as the iterator will become invalidated
before getting advanced.

Should be fairly straightforward to fix, sorry for the oversight.
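
For anyone unfamiliar with the hazard, here is a self-contained C++ sketch of
it.  It uses std::list as a stand-in for the list of debug uses and a made-up
update_use in place of update_debug_use; neither is the rtl-ssa API, and this
is not the actual patch.  The point is only that the successor has to be
fetched before the call that may unlink the current element.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical stand-in for update_debug_use: it may unlink the element it
// is given from the list.  Here every odd value is removed.
void update_use(std::list<int> &uses, std::list<int>::iterator it) {
  if (*it % 2 != 0)
    uses.erase(it);  // invalidates 'it', but no other iterator into the list
}

// A range-based for over 'uses' would increment the erased iterator, which
// is undefined behaviour.  Saving the successor first avoids that.
void update_all(std::list<int> &uses) {
  for (auto it = uses.begin(); it != uses.end();) {
    auto next = std::next(it);  // fetch the successor before any erase
    update_use(uses, it);
    it = next;
  }
}

// Usage: the odd entries are dropped and the walk still visits every entry.
void check_update_all() {
  std::list<int> uses{1, 2, 3, 4, 5};
  update_all(uses);
  assert((uses == std::list<int>{2, 4}));
}
```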

[Bug target/113616] [14 Regression] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113616

Alex Coplan  changed:

   What|Removed |Added

   Keywords||ice-on-valid-code
   Last reconfirmed||2024-01-26
  Known to fail||14.0
 Ever confirmed|0   |1
   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=113089
 Target||aarch64-*-*
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot 
gnu.org
 Status|UNCONFIRMED |ASSIGNED

--- Comment #1 from Alex Coplan  ---
Confirmed, mine.

[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

--- Comment #6 from Alex Coplan  ---
FWIW, if I move ldp_fusion1 before early_ra, with:

diff --git a/gcc/config/aarch64/aarch64-passes.def
b/gcc/config/aarch64/aarch64-passes.def
index 769d48f4faa..3853f6bf7a4 100644
--- a/gcc/config/aarch64/aarch64-passes.def
+++ b/gcc/config/aarch64/aarch64-passes.def
@@ -18,6 +18,7 @@
along with GCC; see the file COPYING3.  If not see
.  */

+INSERT_PASS_BEFORE (pass_sched, 1, pass_ldp_fusion);
 INSERT_PASS_BEFORE (pass_sched, 1, pass_aarch64_early_ra);
 INSERT_PASS_AFTER (pass_regrename, 1, pass_fma_steering);
 INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
@@ -25,5 +26,4 @@ INSERT_PASS_BEFORE (pass_late_thread_prologue_and_epilogue,
1, pass_switch_pstat
 INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
 INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
 INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
-INSERT_PASS_BEFORE (pass_early_remat, 1, pass_ldp_fusion);
 INSERT_PASS_BEFORE (pass_peephole2, 1, pass_ldp_fusion);

we get:

f:
.LFB0:
        .cfi_startproc
        adrp    x0, .LANCHOR0
        add     x0, x0, :lo12:.LANCHOR0
        ldp     d31, d30, [x0]
        ldp     d29, d28, [x0, 32]
        fadd    v29.2s, v31.2s, v29.2s
        fadd    v28.2s, v30.2s, v28.2s
        stp     d29, d28, [x0]
        ret

Note that this does use more registers, though, so it's not necessarily a clear
win in the general case (particularly if register pressure is already high).

[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

--- Comment #5 from Alex Coplan  ---
It looks like the current ordering of passes is:

early_ra
sched1
ldp_fusion1
early_remat

ISTM that ldp_fusion1 should probably be running before early_ra, but we found
that running ldp_fusion1 before sched1 could lead to increased register
pressure. Hmm.

[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|NEW
   Assignee|acoplan at gcc dot gnu.org |unassigned at gcc dot 
gnu.org

[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Alex Coplan  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org
Summary|[14 Regression] Missing |[14 Regression] Missing
   |ldp/stp optimization|ldp/stp optimization since
   |sometimes   |r14-6290-g9f0f7d802482a8

--- Comment #4 from Alex Coplan  ---
Interestingly we started to miss this with the introduction of aarch64
early RA i.e. r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182.

My ldp/stp pattern rewrite was:
r14-6604-gd7ee988c491cde43d04fe25f2b3dbad9d85ded45
so we started to miss this before any of my ldp/stp patches.

Looking at what happens with the ldp/stp pass, I can see that in sched1 we've
already allocated hard regs to the vector load destinations:

3: NOTE_INSN_BASIC_BLOCK 2
2: NOTE_INSN_FUNCTION_BEG
   13: NOTE_INSN_DELETED
5: debug begin stmt marker
6: r107:DI=high(`*.LANCHOR0')
7: r106:DI=r107:DI+low(`*.LANCHOR0')
  REG_EQUAL `*.LANCHOR0'
   14: v31:V2SF=[r107:DI+low(`*.LANCHOR0')]
   15: v30:V2SF=[r106:DI+0x20]
   16: v30:V2SF=v31:V2SF+v30:V2SF
  REG_DEAD v31:V2SF
   27: v31:V2SF=[r106:DI+0x8]
   17: [r107:DI+low(`*.LANCHOR0')]=v30:V2SF
  REG_DEAD r107:DI
  REG_DEAD v30:V2SF
   18: debug begin stmt marker
   28: v30:V2SF=[r106:DI+0x28]
   29: v30:V2SF=v31:V2SF+v30:V2SF
  REG_DEAD v31:V2SF
   30: [r106:DI+0x8]=v30:V2SF
  REG_DEAD r106:DI
  REG_DEAD v30:V2SF
   33: NOTE_INSN_DELETED

and then there's nothing that the early ldp/stp pass can do because the
would-be load pair candidates already use the same (hard) transfer register due
to early RA:

merge_pairs [L=1], cand vecs (14) x (27)
analyzing pair (load=1): (14,27)
punting on ldp due to reg conflcits (14,27)
merge_pairs [L=1], cand vecs (15) x (28)
analyzing pair (load=1): (15,28)
punting on ldp due to reg conflcits (15,28)
merge_pairs [L=0], cand vecs (17) x (30)
analyzing pair (load=0): (17,30)
pair (17,30): rejecting base 106 due to dataflow hazards (28,29)
can't form pair (17,30) due to dataflow hazards
starting the processing of deferred insns
ending the processing of deferred insns

CCing Richard S for an opinion.
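
To spell out the register conflict, here is a deliberately simplified C++ model
(not the pass's code; the Load struct and the helper are made up for
illustration).  A load pair writes two transfer registers, and an LDP whose two
destinations are the same register is not a usable encoding, so two loads that
early RA has already assigned to the same hard register cannot be fused.

```cpp
#include <cassert>

// Hypothetical description of one load candidate.
struct Load {
  int dest_regno;  // transfer (destination) register number
  int base_regno;  // base address register number
  long offset;     // byte offset from the base
};

// Simplified check: a load pair needs two distinct transfer registers.
bool dest_regs_allow_ldp(const Load &first, const Load &second) {
  return first.dest_regno != second.dest_regno;
}

// Insns 14 and 27 from the dump above both load into v31 (hard reg 63),
// so they conflict; with distinct destinations the pair would be viable.
void check_dump_candidates() {
  Load insn14{63, 106, 0x0};
  Load insn27{63, 106, 0x8};
  assert(!dest_regs_allow_ldp(insn14, insn27));

  Load insn27_other_reg{62, 106, 0x8};
  assert(dest_regs_allow_ldp(insn14, insn27_other_reg));
}
```

Before early RA the destinations would still be distinct pseudos, which is
presumably why the pass could form the pairs then.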

[Bug target/113613] [14 Regression] Missing ldp/stp optimization sometimes

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Alex Coplan  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot 
gnu.org
   Last reconfirmed||2024-01-26
 Status|UNCONFIRMED |ASSIGNED

--- Comment #3 from Alex Coplan  ---
Confirmed, I'll take a look.

[Bug target/111677] [12/13/14 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

--- Comment #20 from Alex Coplan  ---
I think the testcase in #c10 went latent on the 13 branch but the following
(reduced from the attachment) still ICEs on the tip of the 13 branch with
-Ofast -fopenmp -fstack-protector-strong:

typedef struct {
  long size_z;
  int width;
} dt_bilateral_t;
typedef float dt_aligned_pixel_t[4];
#pragma omp declare simd
void dt_bilateral_splat(dt_bilateral_t *b) {
  float *buf;
  long offsets[8];
  for (; b;) {
int firstrow;
for (int j = firstrow; j; j++)
  for (int i; i < b->width; i++) {
dt_aligned_pixel_t contrib;
for (int k = 0; k < 4; k++)
  buf[offsets[k]] += contrib[k];
  }
float *dest;
for (int j = (long)b; j; j++) {
  float *src = (float *)b->size_z;
  for (int i = 0; i < (long)b; i++)
dest[i] += src[i];
}
  }
}

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #9 from Alex Coplan  ---
(In reply to Andrew Pinski from comment #8)
> (In reply to Alex Coplan from comment #7)
> > I expect the store pairs come from memcpy lowering/expansion in the aarch64
> > backend, that is the only way we get store pairs so early in the RTL
> > pipeline IIRC.
> 
> In this case, memset is more likely.

Right, yeah.  I was using "memcpy lowering" to refer to all the
mem{cpy,set,move} expansion we have in the backend.

> 
> Either:
> for (int i = 0; i < j; i++)
> m[i] = vdupq_n_f32(0.F);
> Or
> for (int i = 0; i < l; i++)
> n[i] = vdupq_n_f32(0.F);

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #7 from Alex Coplan  ---
I expect the store pairs come from memcpy lowering/expansion in the aarch64
backend, that is the only way we get store pairs so early in the RTL pipeline
IIRC.

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #6 from Alex Coplan  ---
Looking at the dump files, the first difference seems to be in 292r.dse1:

 8: NOTE_INSN_BASIC_BLOCK 2
 2: r116:SI=zero_extend(x0:HI)
   REG_DEAD x0:HI
@@ -178,7 +161,26 @@
 5: NOTE_INSN_FUNCTION_BEG
10: r119:DI=sfp:DI-0x200
12: r121:V16QI=const_vector
+   13: [r119:DI]=unspec[r121:V16QI,r121:V16QI] 38
+   14: [r119:DI+0x20]=unspec[r121:V16QI,r121:V16QI] 38
+   15: [r119:DI+0x40]=unspec[r121:V16QI,r121:V16QI] 38
+   16: [r119:DI+0x60]=unspec[r121:V16QI,r121:V16QI] 38
+   17: [r119:DI+0x80]=unspec[r121:V16QI,r121:V16QI] 38
+   18: [r119:DI+0xa0]=unspec[r121:V16QI,r121:V16QI] 38
+   19: [r119:DI+0xc0]=unspec[r121:V16QI,r121:V16QI] 38
+   20: [r119:DI+0xe0]=unspec[r121:V16QI,r121:V16QI] 38
+  REG_DEAD r119:DI
21: r122:DI=sfp:DI-0x100
+   24: [r122:DI]=unspec[r121:V16QI,r121:V16QI] 38
+   25: [r122:DI+0x20]=unspec[r121:V16QI,r121:V16QI] 38
+   26: [r122:DI+0x40]=unspec[r121:V16QI,r121:V16QI] 38
+   27: [r122:DI+0x60]=unspec[r121:V16QI,r121:V16QI] 38
+   28: [r122:DI+0x80]=unspec[r121:V16QI,r121:V16QI] 38
+   29: [r122:DI+0xa0]=unspec[r121:V16QI,r121:V16QI] 38
+   30: [r122:DI+0xc0]=unspec[r121:V16QI,r121:V16QI] 38
+   31: [r122:DI+0xe0]=unspec[r121:V16QI,r121:V16QI] 38
+  REG_DEAD r122:DI
+  REG_DEAD r121:V16QI
 6: r100:V4SF=const_vector
 7: r106:SI=0
32: cc:CC=cmp(r116:SI,0)
@@ -254,6 +256,7 @@
73: r100:V4SF={r147:V4SF*r147:V4SF+r115:V4SF}
   REG_DEAD r147:V4SF
   REG_DEAD r115:V4SF
+   74: [sfp:DI-0x200]=r100:V4SF
75: r148:SI=r106:SI+0x2
   REG_DEAD r106:SI
76: r106:SI=zero_extend(r148:SI#0)

(the unspec 38s are store pairs).
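
As a rough model of the dump above: each group has eight store pairs at
offsets 0x0, 0x20, ..., 0xe0, and each pair writes two 16-byte V16QI values,
so one group covers 8 * 32 = 256 bytes. That would be consistent with one
16-element float32x4_t array when e = 4. The sketch below replays those
offsets on a plain byte buffer; Vec128, store_pair and fill_with_store_pairs
are hypothetical names used only for illustration, not GCC or ACLE APIs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for one 128-bit vector register value (e.g. q31).
using Vec128 = std::array<std::uint8_t, 16>;

// Model of a single store pair: write the same 16-byte value twice,
// back to back, starting at base + offset.
inline void store_pair(std::uint8_t *base, std::size_t offset, const Vec128 &v) {
  std::memcpy(base + offset, v.data(), v.size());
  std::memcpy(base + offset + v.size(), v.data(), v.size());
}

// Replay the eight offsets seen in the dump (0x0, 0x20, ..., 0xe0) and
// return the total number of bytes written.
inline std::size_t fill_with_store_pairs(std::uint8_t *base, const Vec128 &v) {
  assert(base != nullptr);
  std::size_t written = 0;
  for (std::size_t off = 0; off <= 0xe0; off += 0x20) {
    store_pair(base, off, v);
    written += 2 * v.size();
  }
  return written;
}
```

Running fill_with_store_pairs on a 256-byte buffer with an all-zero Vec128
touches every byte exactly once, which matches the reading that each group of
stores zero-fills one whole array.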

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #4 from Alex Coplan  ---
Created attachment 57211
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57211&action=edit
after.s

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #3 from Alex Coplan  ---
Created attachment 57210
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57210&action=edit
before.s

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #2 from Alex Coplan  ---
(In reply to Richard Biener from comment #1)
> I will have a look - but can you explain for me what I see?  I suppose the
> testcase was reduced from something?

Yeah, the testcase is reduced.

> 
> Is the assembly diff complete?  That is, do we really have more fmla or
> are they just moved?

I think the diff is complete, I can upload the full before/after asm.

> 
> + stp q31, q31, [sp, 256] 
> 
> that's a store?  A paired store?  Aka, the sequence fills a stack(?)
> region with replications of q31?

That's right.

I'll also take a look at the RTL dumps to see if I can figure anything out.

[Bug rtl-optimization/113597] New: [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

            Bug ID: 113597
           Summary: [14 Regression] aarch64: Significant code quality
                    regression since r14-8346-ga98d5130a6dcff
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

The following testcase shows a significant regression in code quality
since r14-8346-ga98d5130a6dcff2ed4db371e500550134777b8cf on aarch64:

$ cat t.cc
#include <arm_neon.h>
typedef struct {
  float b;
  float c;
} d;
template <uint16_t e> void f(uint16_t g, d *u, d *v) {
  uint16_t j, l = j = e * e;
  float32_t b[j];
  float32_t c[l];
  float32x4_t m[j];
  for (int i = 0; i < j; i++)
    m[i] = vdupq_n_f32(0.F);
  float32x4_t n[l];
  for (int i = 0; i < l; i++)
    n[i] = vdupq_n_f32(0.F);
  for (uint16_t k = 0; k < g; k += 2) {
    float32x4_t o[e];
    for (int i = 0; i < e; i++)
      o[i] = vld1q_f32((float32_t *)&u[k]);
    int idx = 0;
    for (int a = 0; a < e; a++)
      for (int ah = a; ah < e; ah++)
        m[idx] = vfmaq_f32(m[idx], o[a], o[ah]);
    float32x4_t p[e];
    for (int i; i; i++)
      for (int a; a;)
        for (int ah;;)
          vfmsq_f32(n[idx], o[a], p[ah]);
  }
  for (int i = 0; i < j; i++)
    b[i] = vaddvq_f32(m[i]);
  for (int i = 0; i < l; i++)
    c[i] = vaddvq_f32(n[i]);
  constexpr uint16_t q(e * e);
  float32x4_t r[q];
  float32x2_t s;
  r[4] = float32x4_t{b[5] - c[3]};
  for (int i = 0; i < q; i++)
    vst1q_f32((float32_t *)&v[2 * i], r[i]);
  if (e % 2)
    vst1_f32((float32_t *)v, s);
}
void t() {
  d v, u;
  f<4>(0, &u, &v);
}

$ cat cmp.sh
#!/bin/bash
set -e

BEFORE=/work/builds/r14-8345/gcc
AFTER=/work/builds/r14-8346/gcc
SRC=t.cc

$BEFORE/xgcc -B $BEFORE -c -S -o before.s $SRC -Wall -Werror -Ofast -mcpu=neoverse-v2
$AFTER/xgcc -B $AFTER -c -S -o after.s $SRC -Wall -Werror -Ofast -mcpu=neoverse-v2

diff -u before.s after.s

$ ./cmp.sh
--- before.s 2024-01-25 10:35:56.977090552 +
+++ after.s 2024-01-25 10:35:57.385086341 +
@@ -9,16 +9,47 @@
 _Z1fILt4EEvtP1dS1_:
 .LFB3918:
        .cfi_startproc
-       ands    w0, w0, 65535
+       movi    v31.4s, 0
        sub     sp, sp, #768
        .cfi_def_cfa_offset 768
+       ands    w0, w0, 65535
        mov     w3, 0
+       stp     q31, q31, [sp, 256]
+       stp     q31, q31, [sp, 288]
+       stp     q31, q31, [sp, 320]
+       stp     q31, q31, [sp, 352]
+       stp     q31, q31, [sp, 384]
+       stp     q31, q31, [sp, 416]
+       stp     q31, q31, [sp, 448]
+       stp     q31, q31, [sp, 480]
+       stp     q31, q31, [sp, 512]
+       stp     q31, q31, [sp, 544]
+       stp     q31, q31, [sp, 576]
+       stp     q31, q31, [sp, 608]
+       stp     q31, q31, [sp, 640]
+       stp     q31, q31, [sp, 672]
+       stp     q31, q31, [sp, 704]
+       stp     q31, q31, [sp, 736]
+       movi    v31.4s, 0
        beq     .L3
        .p2align 5,,15
 .L2:
-       add     w1, w3, 2
-       and     w3, w1, 65535
-       cmp     w0, w1, uxth
+       ubfiz   x5, x3, 3, 16
+       add     w4, w3, 2
+       and     w3, w4, 65535
+       ldr     q30, [x1, x5]
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       fmla    v31.4s, v30.4s, v30.4s
+       str     q31, [sp, 256]
+       cmp     w0, w4, uxth
        bhi     .L2
 .L3:
        ldp     q30, q31, [sp]

[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9

2024-01-23 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Alex Coplan  ---
Should be fixed, thanks for the report.

[Bug target/113356] [14 Regression][aarch64] ICE in try_fuse_pair, at config/aarch64/aarch64-ldp-fusion.cc:2203 since r14-6947-g4b67ec7ff5b1aa

2024-01-23 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113356

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Alex Coplan  ---
Fixed, thanks for the report.

[Bug target/113070] [14 regression] [AArch64] [PGO/LTO] Miscompilation of go compiler

2024-01-23 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113070

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #13 from Alex Coplan  ---
Should be fixed, sorry for the delay, and thanks for the report.

[Bug target/113114] [14 Regression] ICE compiling gcc.c-torture/execute/pr59643.c with -mabi=ilp32; in try_promote_writeback aarch64-ldp-fusion.cc

2024-01-23 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113114

Alex Coplan  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from Alex Coplan  ---
Should be fixed, thanks for the report.
