[Bug middle-end/112406] [14 Regression] Several SPECCPU 2017 benchmarks fail with on internal compiler error: in expand_insn, at optabs.cc:8305 after g:01c18f58d37865d5f3bbe93e666183b54ec608c7

2023-11-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112406

Tamar Christina  changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution|--- |FIXED

--- Comment #23 from Tamar Christina  ---
Thanks! That seems to be all we've noticed.

Thanks for the quick fixes!

[Bug sanitizer/112644] [14 Regression] Some of the hwasan testcase fail after the recent merge

2023-11-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112644

--- Comment #4 from Tamar Christina  ---
I've asked Matthew to take a look since he wrote the initial support.

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2023-11-21 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 111370, which changed state.

Bug 111370 Summary: On Aarch64 4% 511.povray_r regression between 
g:6cd85273071b5f13 (2023-08-23 00:17) and g:e1f096a3cc96c719 (2023-08-25 22:34)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111370

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug target/111370] On Aarch64 4% 511.povray_r regression between g:6cd85273071b5f13 (2023-08-23 00:17) and g:e1f096a3cc96c719 (2023-08-25 22:34)

2023-11-21 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111370

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #6 from Tamar Christina  ---
Fixed.

[Bug middle-end/112406] [14 Regression] Several SPECCPU 2017 benchmarks fail with on internal compiler error: in expand_insn, at optabs.cc:8305 after g:01c18f58d37865d5f3bbe93e666183b54ec608c7

2023-11-21 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112406

--- Comment #19 from Tamar Christina  ---
(In reply to Robin Dapp from comment #18)
> Already in ifcvt we have:
> 
> _ifc__60 = .COND_ADD (_2, _6, MADPictureC1_lsm.10_25,
> MADPictureC1_lsm.10_25);
> 
> which we should not.  This is similar on riscv.
> 
> But during value numbering it still is
> 
> Value numbering stmt = _ifc__60 = .COND_ADD (_47, MADPictureC1_lsm.10_25,
> _6, MADPictureC1_lsm.10_25);
> 
> so we originally created the right thing.
> Hmm, no simplification is happening.  Is there still a swap somewhere that
> should not be?

ADD is commutative, so commutative_op declares COND_ADD as commutative and the
first commutative operand starts at op1.

So swapping _6 and MADPictureC1_lsm.10_25 should be legal to do.  Are you
depending on a specific order?
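
To make the operand roles concrete, here is a minimal scalar sketch of the
semantics, assuming the usual reading of .COND_ADD (cond, a, b, else) as
"cond ? a + b : else".  The name cond_add_model is made up for illustration
and is not a GCC function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scalar model of .COND_ADD (cond, a, b, els):
   a + b when cond is true, els otherwise.  */
static double cond_add_model(bool cond, double a, double b, double els) {
  return cond ? a + b : els;
}

/* Swapping the two addends (op1 and op2) never changes the result.
   The condition (op0) and the else value are not part of the
   commutative pair.  */
static void check_cond_add_swap(void) {
  double acc = 10.0, inc = 2.5;
  assert(cond_add_model(true, inc, acc, acc) ==
         cond_add_model(true, acc, inc, acc));
  assert(cond_add_model(false, inc, acc, acc) ==
         cond_add_model(false, acc, inc, acc));
}
```

Under this model the two statements quoted above compute the same value, so
any problem would have to come from code that relies on the accumulator
being in a fixed operand position.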

[Bug middle-end/112406] [14 Regression] Several SPECCPU 2017 benchmarks fail with on internal compiler error: in expand_insn, at optabs.cc:8305 after g:01c18f58d37865d5f3bbe93e666183b54ec608c7

2023-11-21 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112406

--- Comment #16 from Tamar Christina  ---
Ah, saves me the bisect then :)

Morning, new reproducer is:

> cat ratectl.i
double MADPictureC1;
extern int PictureRejected[];
int PictureMAD_0, MADModelEstimator_n_windowSize_i,
MADModelEstimator_n_windowSize_oneSampleQ;

void MADModelEstimator_n_windowSize() {
  int estimateX2 = 0;
  for (; MADModelEstimator_n_windowSize_i; MADModelEstimator_n_windowSize_i++) {
    if (MADModelEstimator_n_windowSize_oneSampleQ &&
        !PictureRejected[MADModelEstimator_n_windowSize_i])
      estimateX2 = 1;
    if (!PictureRejected[MADModelEstimator_n_windowSize_i])
      MADPictureC1 += PictureMAD_0;
  }
  if (estimateX2)
    for (;;)
      ;
}


and called with:

gcc -c -o ratectl.o -Ofast -march=armv8-a+sve  ratectl.i
during GIMPLE pass: vect
ratectl.i: In function 'MADModelEstimator_n_windowSize':
ratectl.i:5:6: internal compiler error: in vect_transform_reduction, at
tree-vect-loop.cc:8442
5 | void MADModelEstimator_n_windowSize() {
  |  ^~
0xa9fc2e0f __libc_start_main
../csu/libc-start.c:308
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See  for instructions.

[Bug middle-end/112406] [14 Regression] Several SPECCPU 2017 benchmarks fail with on internal compiler error: in expand_insn, at optabs.cc:8305 after g:01c18f58d37865d5f3bbe93e666183b54ec608c7

2023-11-20 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112406

--- Comment #14 from Tamar Christina  ---
Thanks, those cases seem fixed now.

I do however still see another LTO failure that looks related in SPECCPU 2006:

ratectl.c:1566:6: internal compiler error: in vect_transform_reduction, at
tree-vect-loop.cc:8458
 1566 | void RCModelEstimator (int n_windowSize)
  |  ^
0xeb0adf vect_transform_reduction(_loop_vec_info*, _stmt_vec_info*,
gimple_stmt_iterator*, gimple**, _slp_tree*)
  /opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:8458
0x182840b vect_transform_stmt(vec_info*, _stmt_vec_info*,
gimple_stmt_iterator*, _slp_tree*, _slp_instance*)
  /opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-stmts.cc:13085
0xe9f7d3 vect_transform_loop_stmt
  /opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:11395
0xebd7c7 vect_transform_loop(_loop_vec_info*, gimple*)
  /opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:11840
0xef321b vect_transform_loops
  /opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1006
0xef385f try_vectorize_loop_1
  /opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1152
0xef385f try_vectorize_loop
  /opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1182
0xef3adf execute
  /opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1298

I will reduce and bisect this morning.  If it turns out not to be related I
will close this ticket and file a new one.
Still waiting on the results of the other non SPECCPU workloads, but so far
looks good!

[Bug tree-optimization/111970] [14 regression] SLP for non-IFN gathers result in RISC-V test failure on gather since r14-4745-gbeab5b95c58145

2023-11-20 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111970

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #19 from Tamar Christina  ---
(In reply to JuzheZhong from comment #14)
> 
> Hi, @Tamar. Could you double-check whether my analysis (This bug not only
> happens on RVV, but also on ARM SVE) is correct or not ?

Hi, indeed it does:

test@sve-1:~/temp$ ./gcc/bin/gcc -march=armv8-a+sve -O3 -msve-vector-bits=256
sve.c -o sve.exe
test@sve-1:~/temp$ ./sve.exe
test@sve-1:~/temp$ ./gcc/bin/gcc -march=armv8-a+sve -O3 -msve-vector-bits=256
-fno-vect-cost-model sve.c -o sve-no-cost.exe
test@sve-1:~/temp$ ./sve-no-cost.exe
sve-no-cost.exe: sve.c:46: main: Assertion `dest_int16_t_int8_t[i * 2] ==
(src_int16_t_int8_t [index_int16_t_int8_t[i * 2]] + 1)' failed.
Aborted (core dumped)

I have noticed some other gather related failures but haven't had time to
triage them to file bugs.  Hoping to get to that soon.

[Bug rtl-optimization/112606] [14 Regression] powerpc64le-linux-gnu: 'FAIL: gcc.target/powerpc/p8vector-fp.c scan-assembler xsnabsdp'

2023-11-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112606

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #1 from Tamar Christina  ---
This looks like a target bug. You seem to have an fneg (fabs (..)) instruction
on powerpc.  This means your copysign pattern needs to either reject the
copysign expansion when the second operand is negative, or it needs to emit
xsnabsdp in this case rather than copysign.

The generic optimization is correct and is doing what the target has requested:
your copysign optab implementation says you prefer it for any operand.
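
As a small illustration of why the two forms are interchangeable, the sketch
below compares copysign with a known-negative second operand against
fneg (fabs (x)).  The function names are hypothetical and only standard
<math.h> calls are used.

```c
#include <assert.h>
#include <math.h>

/* copysign with a known-negative sign operand: magnitude of x,
   sign forced to negative.  */
static double copysign_negative(double x) {
  return copysign(x, -1.0);
}

/* The same value written as fneg (fabs (x)), which is the shape a
   negative-absolute-value instruction computes.  */
static double neg_abs(double x) {
  return -fabs(x);
}
```

Both functions return the same value for any non-NaN input, including
zeros, where both give -0.0.  That is why a target with a dedicated
negative-abs instruction can pick it when the sign operand is known to be
negative.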

[Bug tree-optimization/111878] [14 Regression] ICE: in get_loop_exit_edges, at cfgloop.cc:1204 with -O3 -fgraphite-identity -fsave-optimization-record/-fdump-tree-graphite/-fopt-info since r14-4708-gd

2023-11-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111878

Tamar Christina  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Tamar Christina  ---
Fixed, thanks for the report.

[Bug rtl-optimization/112483] [14 Regression] gfortran.dg/ieee/ieee_2.f90 fails on loongarch64-linux-gnu at -O1 or above

2023-11-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112483

Tamar Christina  changed:

   What|Removed |Added

 CC|tamar.christina at arm dot com |

--- Comment #15 from Tamar Christina  ---
removing duplicate mail

[Bug tree-optimization/112468] [14 Regression] Missed phi-opt after recent change

2023-11-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112468

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #7 from Tamar Christina  ---
testing patch

[Bug rtl-optimization/112483] [14 Regression] gfortran.dg/ieee/ieee_2.f90 fails on loongarch64-linux-gnu at -O1 or above

2023-11-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112483

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #7 from Tamar Christina  ---
Yeah, that fold-rtx code is bogus. It's a latent bug.

Optimizing copysign(x, -y) to neg(x) is just wrong.

Will you be sending a patch Xi or do you want me to?
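
A quick counterexample, as a sketch with made-up helper names: copysign (x, -y)
keeps the magnitude of x and takes the opposite of y's sign, while neg (x)
flips x's own sign.  They only agree when x and y have the same sign.

```c
#include <assert.h>
#include <math.h>

/* What the source expression computes.  */
static double via_copysign(double x, double y) {
  return copysign(x, -y);
}

/* What the bogus fold replaces it with.  */
static double via_neg(double x) {
  return -x;
}
```

For x = -2.0 and y = 1.0, via_copysign gives -2.0 while via_neg gives 2.0.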

[Bug tree-optimization/112468] [14 Regression] Missed phi-opt after recent change

2023-11-09 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112468

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina  ---
Hmm I rather think PHI ops should handle that null like other passes do. The
folding is supposed to already test and only happen if it succeeds.

This prevents match.pd from having to have a second check on every IFN usage.

[Bug tree-optimization/109154] [13 regression] jump threading de-optimizes nested floating point comparisons

2023-11-09 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

Tamar Christina  changed:

   What|Removed |Added

Summary|[13/14 regression] jump |[13 regression] jump
   |threading de-optimizes  |threading de-optimizes
   |nested floating point   |nested floating point
   |comparisons |comparisons
 Status|NEW |RESOLVED
   Target Milestone|13.3|14.0
 Resolution|--- |FIXED

--- Comment #82 from Tamar Christina  ---
This should give better performance than GCC-12.  The patches are not
backportable so closing as resolved in GCC-14.

[Bug middle-end/112406] [14 Regression] Several SPECCPU 2017 benchmarks fail with on internal compiler error: in expand_insn, at optabs.cc:8305 after g:01c18f58d37865d5f3bbe93e666183b54ec608c7

2023-11-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112406

--- Comment #10 from Tamar Christina  ---
Just finished second bisect and reduce.  Came out to this commit as well.

---

  module brute_force
integer, parameter :: r=9
 integer sudoku1(1, r)
contains
  subroutine brute
  integer l(r), u(r)
 where(sudoku1(1, :) /= 1)
  l = 1
u = 1
 end where
  do i1 = 1, u(1)
 do
end do
 end do
  end
  end

---

gfortran -w -c exchange2.f90 -fprofile-generate -march=armv8-a+sve -Ofast -o
exchange2.f90.o

gives:

during GIMPLE pass: vect
exchange2.fppized2.f90:5:18:

5 |   subroutine brute
  |  ^
internal compiler error: in vect_get_vec_defs_for_operand, at
tree-vect-stmts.cc:1257

which is probably related to your last message.

[Bug middle-end/112406] [14 Regression] Several SPECCPU 2017 benchmarks fail with on internal compiler error: in expand_insn, at optabs.cc:8305 after g:01c18f58d37865d5f3bbe93e666183b54ec608c7

2023-11-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112406

Tamar Christina  changed:

   What|Removed |Added

   Priority|P3  |P1
Summary|[14 Regression] Several |[14 Regression] Several
   |SPECCPU 2017 benchmarks |SPECCPU 2017 benchmarks
   |fail with LTO on internal   |fail with on internal
   |compiler error: in  |compiler error: in
   |expand_insn, at |expand_insn, at
   |optabs.cc:8305  |optabs.cc:8305 after
   ||g:01c18f58d37865d5f3bbe93e6
   ||66183b54ec608c7

[Bug middle-end/112406] [14 Regression] Several SPECCPU 2017 benchmarks fail with LTO on internal compiler error: in expand_insn, at optabs.cc:8305

2023-11-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112406

--- Comment #6 from Tamar Christina  ---
First reduction:

typedef struct {
  int red
} MagickPixelPacket;
GetImageChannelMoments_image, GetImageChannelMoments_image_0,
GetImageChannelMoments___trans_tmp_1, GetImageChannelMoments_M11_0,
GetImageChannelMoments_pixel_3, GetImageChannelMoments_y,
GetImageChannelMoments_p;
double GetImageChannelMoments_M00_0, GetImageChannelMoments_M00_1,
GetImageChannelMoments_M01_1;
MagickPixelPacket GetImageChannelMoments_pixel;
SetMagickPixelPacket(int color, MagickPixelPacket *pixel) {
  pixel->red = color;
}
GetImageChannelMoments() {
  for (; GetImageChannelMoments_y; GetImageChannelMoments_y++) {
SetMagickPixelPacket(GetImageChannelMoments_p,
                     &GetImageChannelMoments_pixel);
GetImageChannelMoments_M00_1 += GetImageChannelMoments_pixel.red;
if (GetImageChannelMoments_image)
  GetImageChannelMoments_M00_1++;
GetImageChannelMoments_M01_1 +=
GetImageChannelMoments_y * GetImageChannelMoments_pixel_3;
if (GetImageChannelMoments_image_0)
  GetImageChannelMoments_M00_0++;
GetImageChannelMoments_M01_1 +=
GetImageChannelMoments_y * GetImageChannelMoments_p++;
  }
  GetImageChannelMoments___trans_tmp_1 = atan(GetImageChannelMoments_M11_0);
}

reproduce with:

gcc -march=armv8-a+sve -w -Ofast statistic.i -o statistic.o

bisected to:

01c18f58d37865d5f3bbe93e666183b54ec608c7 is the first bad commit
commit 01c18f58d37865d5f3bbe93e666183b54ec608c7
Author: Robin Dapp 
Date:   Wed Sep 13 22:19:35 2023 +0200

ifcvt/vect: Emit COND_OP for conditional scalar reduction.

As described in PR111401 we currently emit a COND and a PLUS expression
for conditional reductions.  This makes it difficult to combine both
into a masked reduction statement later.
This patch improves that by directly emitting a COND_ADD/COND_OP during
ifcvt and adjusting some vectorizer code to handle it.

It also makes neutral_op_for_reduction return -0 if HONOR_SIGNED_ZEROS
is true.

gcc/ChangeLog:

PR middle-end/111401
* internal-fn.cc (internal_fn_else_index): New function.
* internal-fn.h (internal_fn_else_index): Define.
* tree-if-conv.cc (convert_scalar_cond_reduction): Emit COND_OP
if supported.
(predicate_scalar_phi): Add whitespace.
* tree-vect-loop.cc (fold_left_reduction_fn): Add IFN_COND_OP.
(neutral_op_for_reduction): Return -0 for PLUS.
(check_reduction_path): Don't count else operand in COND_OP.
(vect_is_simple_reduction): Ditto.
(vect_create_epilog_for_reduction): Fix whitespace.
(vectorize_fold_left_reduction): Add COND_OP handling.
(vectorizable_reduction): Don't count else operand in COND_OP.
(vect_transform_reduction): Add COND_OP handling.
* tree-vectorizer.h (neutral_op_for_reduction): Add default
parameter.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c: New test.
* gcc.target/riscv/rvv/autovec/cond/pr111401.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c: Adjust.
* gcc.target/riscv/rvv/autovec/reduc/reduc_call-4.c: Ditto.

--

I'll start on the exchange one now.
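
For reference, the -0 neutral value mentioned in the commit message can be
checked with a few lines of C.  This is a sketch assuming default IEEE
round-to-nearest arithmetic without -ffast-math, and add_neutral is a
hypothetical helper name.

```c
#include <assert.h>
#include <math.h>

/* Add a candidate neutral value to x, the way an inactive lane of a
   conditional sum reduction would contribute to the accumulator.  */
static double add_neutral(double x, double neutral) {
  return x + neutral;
}
```

With -0.0 as the neutral value every input is preserved, including its sign.
With +0.0 an accumulator holding -0.0 becomes +0.0, which is observable when
signed zeros are honored.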

[Bug middle-end/112406] [14 Regression] Several SPECCPU 2017 benchmarks fail with LTO on internal compiler error: in expand_insn, at optabs.cc:8305

2023-11-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112406

Tamar Christina  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2023-11-08
 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

--- Comment #5 from Tamar Christina  ---
No, ICE is still there on imagick and on exchange it's changed to

during GIMPLE pass: vect
exchange2.fppized.f90: In function 'digits_2.isra':
exchange2.fppized.f90:998:31: internal compiler error: in
vect_get_vec_defs_for_operand, at tree-vect-stmts.cc:1257
  998 |   recursive subroutine digits_2(row)
  |   ^
0x1813b6b vect_get_vec_defs_for_operand(vec_info*, _stmt_vec_info*, unsigned
int, tree_node*, vec*, tree_node*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-stmts.cc:1257
0x1813ceb vect_get_vec_defs(vec_info*, _stmt_vec_info*, _slp_tree*, unsigned
int, tree_node*, vec*, tree_node*, tree_node*,
vec*, tree_node*, tree_node*, vec*, tree_node*, tree_node*, vec*,
tree_node*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-stmts.cc:1289
0x1813dcf vect_get_vec_defs(vec_info*, _stmt_vec_info*, _slp_tree*, unsigned
int, tree_node*, vec*, tree_node*, vec*, tree_node*, vec*, tree_node*,
vec*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-stmts.cc:1311
0xea9bc3 vect_transform_reduction(_loop_vec_info*, _stmt_vec_info*,
gimple_stmt_iterator*, gimple**, _slp_tree*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:8470
0x18311fb vect_transform_stmt(vec_info*, _stmt_vec_info*,
gimple_stmt_iterator*, _slp_tree*, _slp_instance*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-stmts.cc:13100
0xe9a223 vect_transform_loop_stmt
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:11322
0xeb79ff vect_transform_loop(_loop_vec_info*, gimple*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:11774
0xeedb4b vect_transform_loops
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1006
0xeee18f try_vectorize_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1152
0xeee18f try_vectorize_loop
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1182
0xeee40f execute
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1298

I'll reduce exchange first since it has fewer object files.

[Bug middle-end/112406] [14 Regression] Several SPECCPU 2017 benchmarks fail with LTO on internal compiler error: in expand_insn, at optabs.cc:8305

2023-11-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112406

--- Comment #2 from Tamar Christina  ---
(In reply to Richard Biener from comment #1)
> Possibly the same as PR112359?

Some were yeah, looks like there are still 2 ICEs in imagick and exchange, I'll
start reducing those.

[Bug middle-end/112406] New: [14 Regression] Several SPECCPU 2017 benchmarks fail with internal compiler error: in expand_insn, at optabs.cc:8305

2023-11-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112406

Bug ID: 112406
   Summary: [14 Regression] Several SPECCPU 2017 benchmarks fail
with internal compiler error: in expand_insn, at
optabs.cc:8305
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

blender, wrf, imagick and fotonik in SPECCPU 2017 all fail with:

blender/source/blender/blenkernel/intern/curve.c:1403:8: internal compiler
error: in expand_insn, at optabs.cc:8305
 1403 | float *BKE_curve_surf_make_orco(Object *ob)
  |^
0x76681f expand_insn(insn_code, unsigned int, expand_operand*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/optabs.cc:8305
0xb43ac3 expand_insn(insn_code, unsigned int, expand_operand*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/optabs.cc:8274
0x9f0fd7 expand_fn_using_insn
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/internal-fn.cc:260
0x80026b expand_call_stmt
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/cfgexpand.cc:2737
0x80026b expand_gimple_stmt_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/cfgexpand.cc:3880
0x80026b expand_gimple_stmt
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/cfgexpand.cc:4044
0x804917 expand_gimple_basic_block
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/cfgexpand.cc:6100
0x8065c7 execute
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/cfgexpand.cc:6835

Will reduce and bisect...

[Bug tree-optimization/112404] [14 Regression] 521.wrf_r fails to build with internal compiler error: in get_vectype_for_scalar_type, at tree-vect-stmts.cc:13311

2023-11-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112404

Tamar Christina  changed:

   What|Removed |Added

   Last reconfirmed||2023-11-6
 CC||tnfchris at gcc dot gnu.org

--- Comment #1 from Tamar Christina  ---
Same failure on AArch64 with SVE.

[Bug tree-optimization/111950] [14 Regression] ICE in compute_live_loop_exits, at tree-ssa-loop-manip.cc:250 since r14-4786-gd118738e71c

2023-11-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111950

--- Comment #9 from Tamar Christina  ---
Right, I've tried to apply that patch to my early break patch series and many
of the tests fail, all the same way in compute_live_loop_exits.

I guess we'll have a conflict here. So I'll post my patches without taking this
change into account and we can sort it out upstream.

[Bug tree-optimization/111878] [14 Regression] ICE: in get_loop_exit_edges, at cfgloop.cc:1204 with -O3 -fgraphite-identity -fsave-optimization-record/-fdump-tree-graphite/-fopt-info since r14-4708-gd

2023-10-31 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111878

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org

--- Comment #7 from Tamar Christina  ---
Thanks for the report.

Graphite is feeding it a junk loop.  The loop's latch block is invalid. In fact
the block isn't even part of the loop.

Since the loop structure Graphite passes it is broken, get_loop_exit_edges (..)
asserts.

Previously for this situation the call would silently return NULL for such
loops.
I'd argue this is a bug in Graphite, but will restore the return NULL for
broken loops.

[Bug tree-optimization/112282] [14 Regression] wrong code (generated code hangs) at -O3 on x86_64-linux-gnu since r14-4777-g88c27070c25309

2023-10-30 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112282

--- Comment #8 from Tamar Christina  ---
Thanks for the report, that's very odd.

It looks like loop control is broken and `u` never gets incremented.  It's even
more strange since the structures getting lowered are both unused so should not
have had any effect at all.

Will take a look.

[Bug tree-optimization/111950] [14 Regression] ICE in compute_live_loop_exits, at tree-ssa-loop-manip.cc:250 since r14-4786-gd118738e71c

2023-10-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111950

--- Comment #4 from Tamar Christina  ---
> turning c_I_lsm.18_38 into a fully invariant reduction def which likely isn't
supported - we had bugs here in the past with not relevant but live stmts.
But if-conversion also performs the (now valid) hoisting, this is maybe
why it was triggered by that rev.

Ah yeah this is something different from what I just fixed.

Indeed, this causes find_guard_arg to no longer find the tie to the original
PHI.

It's trying to match 

  # c_I_lsm.18_60 = PHI 

and

  # c_I_lsm.18_79 = PHI 

after it adds the edge. Normally loop invariant values are left in the
guard block for this.  In this case we've left

 # c_I_lsm.18_65 = PHI 

Normally instead of 

  # c_I_lsm.18_60 = PHI 

we'd find c_I_lsm.18_65 here.  The value is as you mentioned loop invariant.

but since c_I_lsm.18_38 is no longer a PHI node the link was broken.

I don't think we can really recover this in the vectorizer can we?

Would the proper fix perhaps be to have ifconvert fully convert things?

It seems to have missed that

  # c_I_lsm.18_60 = PHI 

is just

  c_I_lsm.18_60 = _58 ? 1 : a.3_23;

that would prevent the PHI node confusion
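
In source terms the difference is roughly the one sketched below.  The PHI
arguments did not survive in the archived message, so this only mirrors the
select form quoted above, and both function names are made up.

```c
/* Branchy form: the join after the if/else is where a two-argument
   PHI node would sit in GIMPLE.  */
static int merge_with_branch(int cond, int a) {
  int r;
  if (cond)
    r = 1;
  else
    r = a;
  return r;
}

/* Fully if-converted form: a single select, no PHI at a join point.  */
static int merge_with_select(int cond, int a) {
  return cond ? 1 : a;
}
```

Both compute the same value; the second simply leaves no PHI for later
passes to match against.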

[Bug tree-optimization/111950] [14 Regression] ICE in compute_live_loop_exits, at tree-ssa-loop-manip.cc:250 since r14-4786-gd118738e71c

2023-10-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111950

--- Comment #3 from Tamar Christina  ---
(In reply to Richard Biener from comment #2)
> For the epilog LC-SSA we lack the correct SSA name for the skip edge:
> 
> 
>  [local count: 16140304]:
> # prephitmp_78 = PHI 
> # c_I_lsm.18_79 = PHI 
> # iftmp.0_80 = PHI 
> 

FWIW I just fixed a similar bug in my early break rebasing branch.
I can check if it fixes this one too.

[Bug target/112105] [14 Regression] vector by lane operation costing broken since g:21416caf221fae4351319ef8ca8d41c0234bdfa7

2023-10-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112105

Tamar Christina  changed:

   What|Removed |Added

   Keywords||missed-optimization
   Target Milestone|--- |14.0

[Bug target/112105] New: [14 Regression] vector by lane operation costing broken since g:21416caf221fae4351319ef8ca8d41c0234bdfa7

2023-10-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112105

Bug ID: 112105
   Summary: [14 Regression] vector by lane operation costing
broken since
g:21416caf221fae4351319ef8ca8d41c0234bdfa7
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
CC: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64-*

After this commit g:21416caf221fae4351319ef8ca8d41c0234bdfa7

commit 21416caf221fae4351319ef8ca8d41c0234bdfa7
Author: Richard Sandiford 
Date:   Tue Oct 24 11:01:52 2023 +0100

aarch64: Define TARGET_INSN_COST

This patch adds a bare-bones TARGET_INSN_COST.  See the comment
in the patch for the rationale.

we now fail to form by lane instructions when they're not single use:

> cat test.c

#include <arm_neon.h>
typedef struct {
  float re;
  float im;
} cmplx_f32_t;

void test2x2_f32(const cmplx_f32_t *p_src_a,
 const cmplx_f32_t *p_src_b,
 cmplx_f32_t *p_dst) {
  const float32_t *a_ptr = (const float32_t *)p_src_a;
  const float32_t *b_ptr = (const float32_t *)p_src_b;
  float32_t *out_ptr = (float32_t *)p_dst;

  float32x2x2_t a_col[2];
  float32x2x2_t b[2];
  float32x2x2_t result[2];

  a_col[0] = vld2_f32(a_ptr);
  b[0] = vld2_f32(b_ptr);

  result[0].val[0] = vmul_lane_f32(a_col[0].val[0], b[0].val[0], 0);
  result[0].val[1] = vmul_lane_f32(a_col[0].val[1], b[0].val[0], 0);

  vst2_f32(out_ptr, result[0]);
  out_ptr = out_ptr + 4;
}

---
> ./bin/gcc test.c -O1 -S -o -
...
test2x2_f32:
ld2 {v27.2s - v28.2s}, [x0]
ld2 {v30.2s - v31.2s}, [x1]
dup v31.2s, v30.s[0]
fmul v29.2s, v31.2s, v27.2s
fmul v30.2s, v31.2s, v28.2s
st2 {v29.2s - v30.2s}, [x2]
ret

which has an unneeded dup.  Before this we generated:

test2x2_f32:
ld2 {v0.2s - v1.2s}, [x1]
ld2 {v4.2s - v5.2s}, [x0]
fmul v2.2s, v4.2s, v0.s[0]
fmul v3.2s, v5.2s, v0.s[0]
st2 {v2.2s - v3.2s}, [x2]
ret

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2023-10-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #15 from Tamar Christina  ---
(In reply to Mikael Morin from comment #14)
> Created attachment 56313 [details]
> inline minloc with mask
> 
> This patch adds support for {min,max}loc with mask.

Awesome, thank you!

> It is not 100% testsuite clean as there are (runtime) error messages that
> regress slightly for maxloc_bounds_{4,5,6,7}.f90
> 
> 
> (In reply to Mikael Morin from comment #11)
> > 
> > > The problem could be with the initialization of loop iteration variables.
> > > (...)
> > > Unfortunately, this conditional initialization seems to
> > > confuse the optimizers a lot.
> > > 
> > On closer look, the conditional initialization doesn't seem to be that
> > confusing (at least in the problematic case), as it's removed early (ccp1)
> > in the pipeline.  The loop iteration variables remain initialized with phis,
> > but that's because of the loops.
> 
> Unfortunately, this is true for rank 1 arrays, but not for higher ranks.
> Constant values are slowly propagated to the phi arguments as optimization
> passes are run, but no simplification of the control flow happens as soon as
> multiple loop levels are involved.
> 
> Need to look into the dim argument next.

It's very much appreciated! This should help greatly! Sorry I hadn't replied to
the previous message. Finishing up some work for stage-1.

[Bug tree-optimization/111860] [14 Regression] incorrect vUSE after guard block loop skip block during vectorization.

2023-10-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|REOPENED|RESOLVED

--- Comment #24 from Tamar Christina  ---
ok, should be actually fixed now

[Bug tree-optimization/111860] [14 Regression] incorrect vUSE after guard block loop skip block during vectorization.

2023-10-20 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

--- Comment #21 from Tamar Christina  ---
patch submitted
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/633734.html

[Bug tree-optimization/111866] [14 regression] ICE when compiling gcc.target/powerpc/p9-vec-length-full-7.c

2023-10-20 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111866

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Tamar Christina  ---
Fixed, sorry for the breakage!

[Bug tree-optimization/111866] [14 regression] ICE when compiling gcc.target/powerpc/p9-vec-length-full-7.c

2023-10-20 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111866

--- Comment #4 from Tamar Christina  ---
patch submitted
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/633713.html

[Bug tree-optimization/111860] [14 Regression] incorrect vUSE after guard block loop skip block during vectorization.

2023-10-19 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

--- Comment #20 from Tamar Christina  ---
(In reply to David Binderman from comment #19)
> Created attachment 56154 [details]
> C source code
> 
> You might like to have a go at getting the attached code working:
> 
> $ ~/gcc/results/bin/gcc -c -w -O3  bug967B.c
> bug967B.c: In function ‘__wcstod128_l_internal’:
> bug967B.c:10:1: error: stmt with wrong VUSE
>10 | __wcstod128_l_internal() {
>   | ^~
> 
> I have 20+ other cases. I can provide them, if you like.

No need :) They're all the same bug.  The idea for the fix was correct, but the
way I checked if the loop was versioned wasn't strong enough.

All the reported testcases now pass. I'll start regressions.

[Bug tree-optimization/111860] [14 Regression] incorrect vUSE after guard block loop skip block during vectorization.

2023-10-19 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

Tamar Christina  changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

--- Comment #18 from Tamar Christina  ---
The fix is too conservative: when there's no use in either loop it fails, as
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111877 shows.

[Bug tree-optimization/111860] [14 Regression] incorrect vUSE after guard block loop skip block during vectorization.

2023-10-19 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

Tamar Christina  changed:

   What|Removed |Added

 CC||zsojka at seznam dot cz

--- Comment #17 from Tamar Christina  ---
*** Bug 111877 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/111877] [14 Regression] ICE: verify_ssa failed: PHI node with wrong VUSE on edge from BB 25 with -O -fno-tree-sink -ftree-vectorize

2023-10-19 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111877

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |DUPLICATE
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from Tamar Christina  ---
merging the two

*** This bug has been marked as a duplicate of bug 111860 ***

[Bug tree-optimization/111877] [14 Regression] ICE: verify_ssa failed: PHI node with wrong VUSE on edge from BB 25 with -O -fno-tree-sink -ftree-vectorize

2023-10-19 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111877

Tamar Christina  changed:

   What|Removed |Added

   Last reconfirmed||2023-10-19
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org
   Priority|P3  |P1

--- Comment #2 from Tamar Christina  ---
(In reply to Richard Biener from comment #1)
> possibly fixed already

Sadly no, this is a third case where neither loop uses the value at all.

It's kept because the tree gets versioned and so it thinks the second loop
needs it.  I should probably always remove it if the first loop doesn't use it
and fix it up in the guard creation instead.

[Bug tree-optimization/111860] [14 Regression] incorrect vUSE after guard block loop skip block during vectorization.

2023-10-19 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

Tamar Christina  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #15 from Tamar Christina  ---
Fixed, thanks for the report

[Bug tree-optimization/111860] [14 Regression] incorrect vUSE after guard block loop skip block during vectorization.

2023-10-19 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

--- Comment #13 from Tamar Christina  ---
Patch posted https://gcc.gnu.org/pipermail/gcc-patches/2023-October/633569.html

[Bug tree-optimization/111866] [14 regression] ICE when compiling gcc.target/powerpc/p9-vec-length-full-7.c

2023-10-19 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111866

--- Comment #3 from Tamar Christina  ---
ok, so the crash looks like it's due to rgroups_control being empty during
prologue peeling.

It looks like the loop isn't masked so LOOP_VINFO_LENS (loop_vinfo) is being
used in this case, but (!rgc->controls.is_empty ()) fails.

As far as I can tell these are only filled in by vect_get_loop_len.

It looks like after the refactoring vect_get_loop_len is not being called.  It
was expected to be called from vectorizable_store.

Debugging why it doesn't get there.

[Bug tree-optimization/111860] [14 Regression] incorrect vUSE after guard block loop skip block during vectorization.

2023-10-19 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

--- Comment #12 from Tamar Christina  ---
yes, patch was tested on both aarch64 and x86, but I did not test libgomp
indeed.

In any case, waiting for regression run to finish and will submit patch.

[Bug middle-end/111868] [14 regression] many ICEs after r14-4710

2023-10-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111868

Tamar Christina  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #1 from Tamar Christina  ---
Duplicate of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860 patch going
through regression testing.

*** This bug has been marked as a duplicate of bug 111860 ***

[Bug tree-optimization/111860] [14 Regression] incorrect vUSE after guard block loop skip block during vectorization.

2023-10-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

Tamar Christina  changed:

   What|Removed |Added

 CC||seurer at gcc dot gnu.org

--- Comment #7 from Tamar Christina  ---
*** Bug 111868 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/111866] [14 regression] ICE when compiling gcc.target/powerpc/p9-vec-length-full-7.c

2023-10-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111866

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org
   Priority|P3  |P1
   Target Milestone|--- |14.0
  Component|middle-end  |tree-optimization

[Bug middle-end/111866] [14 regression] ICE when compiling gcc.target/powerpc/p9-vec-length-full-7.c

2023-10-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111866

--- Comment #1 from Tamar Christina  ---
Thanks for reporting! I'll debug.

I suspect another case where the vectorized and scalar loop were sneakily
swapped.

[Bug tree-optimization/111860] [14 Regression] incorrect vUSE after guard block loop skip block during vectorization.

2023-10-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

--- Comment #6 from Tamar Christina  ---
Ok, so the problem is that the loop never creates memory references, and so
after redirecting the edges when we update the new references we do so by
trying to update the PHI nodes.

But since the loop has no MEM phi node there's nothing to update but we created
a new artificial node during redirect.

Because there's no PHI node that means that adjust_phi_and_debug_stmts isn't
strong enough here.

So I can either remove PHI nodes whose SSA vars have not been defined inside
the loop body, or I'll need to replace adjust_phi_and_debug_stmts with
something that goes through all uses inside the new loop and exit.

Which do you prefer richi? It seems like removing the PHI node after redirect
is the simplest one and one less thing to keep updated.

[Bug tree-optimization/111860] error: stmt with wrong VUSE

2023-10-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

Tamar Christina  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2023-10-18
 Ever confirmed|0   |1

--- Comment #5 from Tamar Christina  ---
Confirmed,

looks like the rename failed on one edge, BB 15 has:

;;   basic block 15, loop depth 0, count 94607391 (estimated locally, freq
0.8010), maybe hot
;;prev block 10, next block 18, flags: (NEW, VISITED)
;;pred:   7 [11.0% (guessed)]  count:94607391 (estimated locally, freq
0.8010) (FALSE_VALUE,EXECUTABLE)
  # length_13 = PHI 
  # .MEM_8 = PHI <.MEM_30(7)>

and BB 19 has:

;;   basic block 19, loop depth 0, count 105119324 (estimated locally, freq
0.8900), maybe hot
;;prev block 17, next block 8, flags: (NEW, REACHABLE, VISITED)
;;pred:   16 [33.3% (guessed)]  count:81467477 (estimated locally, freq
0.6898) (FALSE_VALUE,EXECUTABLE)
;;15 [25.0% (guessed)]  count:23651848 (estimated locally, freq
0.2003) (TRUE_VALUE)
  # length_47 = PHI 
  # .MEM_48 = PHI <.MEM_30(16), .MEM_30(15)>

It looks like, when the loop guard was added after peeling, the use on the
edge coming in from BB 15 wasn't updated.

Most likely find_guard failed.  Working on it.

Odd that it only fails on x86 though.

[Bug tree-optimization/111860] error: stmt with wrong VUSE

2023-10-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

Tamar Christina  changed:

   What|Removed |Added

   Priority|P3  |P1
Version|unknown |14.0
  Component|c   |tree-optimization
   Target Milestone|--- |14.0

[Bug c/111860] error: stmt with wrong VUSE

2023-10-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111860

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #4 from Tamar Christina  ---
Hmm how odd, probably an incorrect edge.

Thanks! taking a look.

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2023-10-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #12 from Tamar Christina  ---
(In reply to Mikael Morin from comment #11)
> Created attachment 56094 [details]
> Improved patch
> 
> This improved patch (still single argument only) passes the fortran
> regression testsuite.
> 

Awesome! Thanks! it looks like the benchmark always uses dim=1 or the mask
argument.

Can you give a hint into what I'd need to do to add the additional params?

[Bug tree-optimization/111770] New: predicated loads inactive lane values not modelled

2023-10-11 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

Bug ID: 111770
   Summary: predicated loads inactive lane values not modelled
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

For this example:

int foo(int n, char *a, char *b) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}

we generate with -O3 -march=armv8-a+sve

.L3:
        ld1b    z29.b, p7/z, [x1, x3]
        ld1b    z31.b, p7/z, [x2, x3]
        add     x3, x3, x4
        sel     z31.b, p7, z31.b, z28.b
        whilelo p7.b, w3, w0
        udot    z30.s, z29.b, z31.b
        b.any   .L3
        uaddv   d30, p6, z30.s
        fmov    w0, s30
        ret

Which is pretty good, but we completely ruin it with the SEL.

In gimple this is:

  vect__7.12_81 = .MASK_LOAD (_21, 8B, loop_mask_77);
  masked_op1_82 = .VCOND_MASK (loop_mask_77, vect__7.12_81, { 0, ... });
  vect_patt_33.13_83 = DOT_PROD_EXPR ;

The missed optimization here is that we don't model what happens with
predicated operations that zero inactive lanes.

i.e. in this case .MASK_LOAD will zero the inactive lanes, so the .VCOND_MASK
is completely superfluous.

I'm not entirely sure how we should go about fixing this generally.
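
To make the redundancy concrete, here is a minimal scalar model in C. It
assumes, as described above, that the predicated load zeroes inactive lanes.
The names masked_load_zeroing, vcond_mask, select_is_redundant and the fixed
LANES count are hypothetical stand-ins for illustration, not GCC internals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANES 16 /* hypothetical fixed vector length for this model */

/* Models a zeroing predicated load: active lanes are read from memory,
   inactive lanes are set to 0 and their memory is never touched.  */
void masked_load_zeroing(uint8_t dst[LANES], const uint8_t *src,
                         const uint8_t mask[LANES])
{
    for (size_t i = 0; i < LANES; ++i)
        dst[i] = mask[i] ? src[i] : 0;
}

/* Models the select: keep v in active lanes, use `other` elsewhere.  */
void vcond_mask(uint8_t dst[LANES], const uint8_t mask[LANES],
                const uint8_t v[LANES], uint8_t other)
{
    for (size_t i = 0; i < LANES; ++i)
        dst[i] = mask[i] ? v[i] : other;
}

/* Returns 1 when selecting against zero with the same mask leaves the
   loaded vector unchanged, i.e. when the select is a no-op.  */
int select_is_redundant(const uint8_t *src, const uint8_t mask[LANES])
{
    uint8_t loaded[LANES], selected[LANES];

    assert(src != NULL && mask != NULL);
    masked_load_zeroing(loaded, src, mask);
    vcond_mask(selected, mask, loaded, 0);
    return memcmp(loaded, selected, LANES) == 0;
}
```

In this model the select can only be dropped because the else value is zero
and the mask is the same one used for the load; with any other else value or
mask the two vectors can differ.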

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2023-10-11 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #9 from Tamar Christina  ---
(In reply to Mikael Morin from comment #8)
> Created attachment 56091 [details]
> Rough patch
> 
> Here is a rough patch to make the scalarizer support minloc calls.
> It regresses on minloc_1.f90 at least, but I haven't be able to pinpoint the
> problem in the original tree dump so far.
> 
> The problem could be with the initialization of loop iteration variables.
> The existing code used for scalar minloc was versioning loops, that is it
> was using too loops in a row in some cases.  With scalar minloc, the
> initialization of the loop variable could just be disabled in the second
> loop, but if there is more than one dimension as in the array case, this
> can't work. So the patch above initializes the loop variables conditionally
> on a "loop_break" boolean variable, which I hoped the optimizers would be
> able to remove.  Unfortunately, this conditional initialization seems to
> confuse the optimizers a lot.
> 
> Anyway, the patch is there; not sure how much I can pursue this further in
> the future.

Thanks Mikael!

That's already plenty of help! I can try to debug further after I finish my
current patches.  Would it be ok if I ask questions when I do?

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2023-09-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org,
   ||toon at gcc dot gnu.org

--- Comment #6 from Tamar Christina  ---
This is the ticket I meant toon.

Do you or Thomas have any ideas how we can inline this?

[Bug target/111370] On Aarch64 4% 511.povray_r regression between g:6cd85273071b5f13 (2023-08-23 00:17) and g:e1f096a3cc96c719 (2023-08-25 22:34)

2023-09-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111370

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org
   Last reconfirmed||2023-9-12

--- Comment #1 from Tamar Christina  ---
Ok, I can reproduce this with the generic cost model on Neoverse N1 hardware.

The generic cost model is based on a 10+ year old CPU and is no longer fit for
modern CPUs.

We are planning to replace it this GCC release so the regression should go away
then.

I've tested with -mcpu=neoverse-n1 and it does go away and gives a much better
score.

[Bug target/89967] Inefficient code generation for vld2q_lane_u8 under aarch64

2023-08-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89967

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org
   See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106

--- Comment #3 from Tamar Christina  ---
This is caused by SRA scalarizing the structural registers, i.e. it breaks
apart the uint8x16x2_t into two uint8x16_t.  For use with vld2 we need them as
a whole, and so we recreate the type again.

This causes a copy through scalarization and then constructing the type again
in RTL. Reload is able to remove one copy but not the other.


The fix for #106106 will also fix this.

[Bug target/95958] [meta-bug] Inefficient arm_neon.h code for AArch64

2023-08-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
Bug 95958 depends on bug 88212, which changed state.

Bug 88212 Summary: IRA Register Coalescing not working for the testcase
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88212

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug rtl-optimization/88212] IRA Register Coalescing not working for the testcase

2023-08-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88212

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org
 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED
   Target Milestone|--- |11.0

--- Comment #2 from Tamar Christina  ---
Fixed in GCC 12 with changes in how we pass structural types.

[Bug target/106346] [11/12/13/14 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94

2023-08-04 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #9 from Tamar Christina  ---
Fixed in GCC 14.

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2023-08-04 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 106346, which changed state.

Bug 106346 Summary: [11/12/13/14 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-08-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #18 from Tamar Christina  ---
Hi, here's the reduced case:


> cat analyse.i

double x264_weights_analyse___trans_tmp_1;
float x264_weights_analyse_ref_mean;
x264_weights_analyse() {
  x264_weights_analyse___trans_tmp_1 = floor(x264_weights_analyse_ref_mean);
}


> cat pixel.i

unsigned x264_pixel_satd_8x4___trans_tmp_1;
x264_pixel_satd_8x4_sum;
x264_pixel_satd_8x4() {
  for (int i; i; i++) {
    x264_pixel_satd_8x4___trans_tmp_1 = i;
    x264_pixel_satd_8x4_sum += x264_pixel_satd_8x4___trans_tmp_1;
  }
  return (unsigned)x264_pixel_satd_8x4_sum >> 1;
}

---

reproduce with:

gcc -c -o pixel.o pixel.i -mcpu=neoverse-v1 -flto=auto -Ofast -w
gcc -c -o analyse.o analyse.i -mcpu=neoverse-v1 -flto=auto -Ofast -w
gcc -flto=auto -Ofast pixel.o analyse.o -lm -o x264_r -r -w

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-08-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #16 from Tamar Christina  ---
(In reply to Hao Liu from comment #15)
> Ah, I see.
> 
> I've sent out a quick fix patch for code review.  I'll investigate more
> about this and find out the root cause.

Thanks! I can reduce a testcase for you if you want :)

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-08-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #14 from Tamar Christina  ---
Or rather, info_for_reduction looks at the original statement if it's a
pattern, whereas vect_is_reduction only looks at the direct statement.

You'll probably want to check vect_orig_stmt if using info_for_reduction.

[Bug target/110625] [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large

2023-08-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #13 from Tamar Christina  ---
Hi,

This patch is causing several ICEs:

For instance in x264,

during GIMPLE pass: vect
x264_src/common/pixel.c: In function 'x264_pixel_satd_8x4.constprop':
x264_src/common/pixel.c:234:21: internal compiler error: in info_for_reduction, at tree-vect-loop.cc:5473
  234 | static NOINLINE int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
      | ^
0xe45e23 info_for_reduction(vec_info*, _stmt_vec_info*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:5473
0xf1e317 aarch64_force_single_cycle
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/config/aarch64/aarch64.cc:16782
0xf1e317 aarch64_vector_costs::count_ops(unsigned int, vect_cost_for_stmt, _stmt_vec_info*, aarch64_vec_op_count*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/config/aarch64/aarch64.cc:16807
0xf31fbb aarch64_vector_costs::add_stmt_cost(int, vect_cost_for_stmt, _stmt_vec_info*, _slp_tree*, tree_node*, int, vect_cost_model_location)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/config/aarch64/aarch64.cc:17074
0xe59edb add_stmt_cost(vector_costs*, int, vect_cost_for_stmt, _stmt_vec_info*, _slp_tree*, tree_node*, int, vect_cost_model_location)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.h:1823
0xe59edb add_stmt_costs(vector_costs*, vec*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.h:1870
0xe59edb vect_compute_single_scalar_iteration_cost
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:1624
0xe59edb vect_analyze_loop_2
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:2710
0xe5bb07 vect_analyze_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3329
0xe5c1cb vect_analyze_loop(loop*, vec_info_shared*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3483
0xe90797 try_vectorize_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1064
0xe90797 try_vectorize_loop
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1180
0xe90cb3 execute
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1296

This seems to be caused by aarch64_force_single_cycle unconditionally calling
info_for_reduction without checking whether this stmt is actually a reduction.

You'll want to check STMT_VINFO_REDUC_DEF or STMT_VINFO_DEF_TYPE before calling
this.
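
As a rough illustration of the "check before you ask" shape, here is a toy
model in C. Everything in it is hypothetical: struct stmt_info, enum def_type,
toy_info_for_reduction and toy_force_single_cycle are simplified stand-ins and
not GCC's real types or functions.  It only shows the guard, not the actual
fix.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the vectorizer's statement info; not GCC's types.  */
enum def_type { DEF_INTERNAL, DEF_REDUCTION };

struct stmt_info {
    enum def_type def_type;
    struct stmt_info *reduc_def; /* NULL when not part of a reduction */
    int single_cycle;
};

/* Like the real helper in spirit: only valid for reduction statements,
   and asserts (the analogue of an ICE) when called on anything else.  */
struct stmt_info *toy_info_for_reduction(struct stmt_info *s)
{
    assert(s->reduc_def != NULL);
    return s->reduc_def;
}

/* Guarded caller: classify the statement first and only then query the
   reduction info, so non-reduction statements never reach the assert.  */
int toy_force_single_cycle(struct stmt_info *s)
{
    if (s->def_type != DEF_REDUCTION || s->reduc_def == NULL)
        return 0;
    return toy_info_for_reduction(s)->single_cycle;
}
```

The model checks both conditions for simplicity; in the real code either the
definition type or the reduction definition may be enough on its own.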

[Bug target/106346] [11/12/13/14 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94

2023-07-31 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

Tamar Christina  changed:

   What|Removed |Added

   Target Milestone|11.5|14.0

[Bug tree-optimization/109156] Support Absolute Difference detection in GCC

2023-07-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109156

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Tamar Christina  ---
This is now implemented

[Bug target/86486] GCC 8 stack clash protection on AArch64 is incomplete

2023-07-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86486

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #11 from Tamar Christina  ---
GCC 8 is long gone

[Bug tree-optimization/109154] [13/14 regression] jump threading de-optimizes nested floating point comparisons

2023-07-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #65 from Tamar Christina  ---
> > 
> > In which case ifcvt could move the cond to just before the first shared
> > statement?
> 
> I don't think PRE "knows" where the operation was created from since it's
> transforms from a global dataflow problem solution.
> 
> Btw, what's the testcase your last examples are from?

It's from https://gcc.gnu.org/bugzilla/attachment.cgi?id=54777

See https://godbolt.org/z/KfzW4ob4Y

[Bug tree-optimization/109154] [13/14 regression] jump threading de-optimizes nested floating point comparisons

2023-07-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #63 from Tamar Christina  ---
> > It looks like `-fno-tree-pre` does the trick, but then of course, messes up
> > elsewhere.  The conditional statement seem to stay in the most complicated
> > form possible in scalar code.
> > 
> > I'll try to track down what to turn off and experiment with a pre2 after
> > vect.
> > Is before predcom a good place?
> 
> I would avoid putting it into the loop pipeline.  Instead I'd turn the
> FRE pass that runs after tracer into PRE.  Maybe conditional on whether
> there are any loops.
> 
> Note it's not so easy to "tame" PRE, the existing things happen at
> elimination time in eliminate_dom_walker::eliminate_stmt.  I would
> experiment with restricting the use of inserted PHIs in innermost(!)
> loops containing invariants, maybe only if the number of PHI args is
> more than two ... (but that's somewhat artificial).
> 
> That said, I'm not really convinced this is a good idea.

I hear you. There's also the added complexity that this is likely only
beneficial for fully masked architectures.  I wonder if it might be feasible
and better to pass on additional information from PRE to ifcvt to indicate that
the operation was created from a common block.

In which case ifcvt could move the cond to just before the first shared
statement?

[Bug tree-optimization/109154] [13/14 regression] jump threading de-optimizes nested floating point comparisons

2023-07-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #61 from Tamar Christina  ---
(In reply to Richard Biener from comment #60)
> (In reply to Tamar Christina from comment #59)
> > after ifcvt we end up with:
> > 
> >   _162 = chrg_init_70 * iftmp.8_76;
> >   _164 = ABS_EXPR <_162>;
> >   _167 = -_164;
> >   _ifc__166 = distbb_74 < iftmp.0_97 ? _167 : 0.0;
> >   prephitmp_169 = distbb_74 >= 0.0 ? _ifc__166 : _168;
> >   
> > instead of
> > 
> >   _160 = chrg_init_75 * iftmp.8_80;
> >   prephitmp_161 = distbb_79 < 0.0 ? chrg_init_75 : _160;
> >   _164 = ABS_EXPR ;
> >   _166 = -_164;
> >   prephitmp_167 = distbb_79 < iftmp.0_96 ? _166 : 0.0;
> > 
> > previously we'd make COND_MUL and COND_NEG and so don't need a VCOND in the
> > end,
> > now we select after the multiplication, so we only have a COND_NEG followed
> > by a VCOND.
> > 
> > This is obviously worse, but I have no idea how to recover it.  Any ideas?
> 
> None.  This is with -O3, right?  Can you try selectively disabling parts
> of PRE with -fno-tree-partial-pre -fno-code-hoisting?  But I suspect it's
> the improvement for general PRE that we hit here.
> 

Those don't seem to make a difference sadly.

> One idea that was always floating around was to move PRE after loop opts
> like we did with predcom.  But the no PRE before loop will likely hurt as
> well
> so we might instead want to limit PRE when it involves generating
> constants in PHIs and schedule another PRE after loop opts (at some cost
> then).  It's something to experiment with ...

It looks like `-fno-tree-pre` does the trick, but then of course, messes up
elsewhere.  The conditional statements seem to stay in the most complicated
form possible in scalar code.

I'll try to track down what to turn off and experiment with a pre2 after vect.
Is before predcom a good place?

[Bug tree-optimization/109154] [13/14 regression] jump threading de-optimizes nested floating point comparisons

2023-07-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #59 from Tamar Christina  ---
I've sent two patches upstream this morning to fix the remaining ifcvt issues:

https://gcc.gnu.org/pipermail/gcc-patches/2023-July/623848.html
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/623849.html

This brings us within 5% of GCC-12, but not all the way there.  The reason is
that PRE behaves differently since GCC-13.

In GCC-12 after PRE we'd have the following CFG:

   [local count: 623751662]:
  _16 = distbb_79 * iftmp.1_100;
  iftmp.8_80 = 1.0e+0 - _16;
  _160 = chrg_init_75 * iftmp.8_80;

   [local count: 1057206200]:
  # iftmp.8_39 = PHI 
  # prephitmp_161 = PHI <_160(15), chrg_init_75(14)>
  if (distbb_79 < iftmp.0_96)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 528603100]:
  _164 = ABS_EXPR ;
  _166 = -_164;

   [local count: 1057206200]:
  # iftmp.9_40 = PHI <1.0e+0(17), 0.0(16)>
  # prephitmp_163 = PHI 
  # prephitmp_167 = PHI <_166(17), 0.0(16)>
  if (iftmp.2_38 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 528603100]:

   [local count: 1057206200]:
  # iftmp.10_41 = PHI 

That is to say, in both branches we always do the multiply; gimple-isel then
correctly turns this into a COND_MUL based on the mask.

Since GCC-13 PRE now does some extra optimizations:

   [local count: 1057206200]:
  # l_107 = PHI 
  _13 = lpos_x[l_107];
  x_72 = _13 - p_atom$x_81;
  powmult_73 = x_72 * x_72;
  distbb_74 = powmult_73 - radij_58;
  if (distbb_74 >= 0.0)
goto ; [59.00%]
  else
goto ; [41.00%]

   [local count: 433454538]:
  _165 = ABS_EXPR ;
  _168 = -_165;
  goto ; [100.00%]

   [local count: 623751662]:
  _14 = distbb_74 * iftmp.1_101;
  iftmp.8_76 = 1.0e+0 - _14;
  if (distbb_74 < iftmp.0_97)
goto ; [20.00%]
  else
goto ; [80.00%]

   [local count: 124750334]:
  _162 = chrg_init_70 * iftmp.8_76;
  _164 = ABS_EXPR <_162>;
  _167 = -_164;

   [local count: 1057206200]:
  # iftmp.9_38 = PHI <1.0e+0(18), 0.0(17), 1.0e+0(16)>
  # iftmp.8_102 = PHI 
  # prephitmp_163 = PHI <_162(18), 0.0(17), chrg_init_70(16)>
  # prephitmp_169 = PHI <_167(18), 0.0(17), _168(16)>
  if (iftmp.2_36 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

That is to say, the multiplication is now completely skipped in one branch.
This should be better for scalar code, but for vector we have to do the
multiplication anyway.

after ifcvt we end up with:

  _162 = chrg_init_70 * iftmp.8_76;
  _164 = ABS_EXPR <_162>;
  _167 = -_164;
  _ifc__166 = distbb_74 < iftmp.0_97 ? _167 : 0.0;
  prephitmp_169 = distbb_74 >= 0.0 ? _ifc__166 : _168;

instead of

  _160 = chrg_init_75 * iftmp.8_80;
  prephitmp_161 = distbb_79 < 0.0 ? chrg_init_75 : _160;
  _164 = ABS_EXPR ;
  _166 = -_164;
  prephitmp_167 = distbb_79 < iftmp.0_96 ? _166 : 0.0;

Previously we'd make COND_MUL and COND_NEG and so didn't need a VCOND in the
end; now we select after the multiplication, so we only have a COND_NEG
followed by a VCOND.

This is obviously worse, but I have no idea how to recover it.  Any ideas?
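
For readers less familiar with the conditional internal functions, a small
scalar model in C of the two shapes.  It assumes the usual lane-wise meaning
of a conditional multiply (mask ? a * b : else); the function names cond_mul
and mul_then_select are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Fused form: one conditional multiply per lane,
   r[i] = mask[i] ? a[i] * b[i] : els[i].  */
void cond_mul(double *r, const unsigned char *mask, const double *a,
              const double *b, const double *els, size_t n)
{
    assert(n == 0 || (r && mask && a && b && els));
    for (size_t i = 0; i < n; ++i)
        r[i] = mask[i] ? a[i] * b[i] : els[i];
}

/* Split form: an unconditional multiply on every lane followed by a
   separate select (the extra VCOND in the vector code).  */
void mul_then_select(double *r, const unsigned char *mask, const double *a,
                     const double *b, const double *els, size_t n)
{
    assert(n == 0 || (r && mask && a && b && els));
    for (size_t i = 0; i < n; ++i) {
        double t = a[i] * b[i];      /* done for all lanes */
        r[i] = mask[i] ? t : els[i]; /* separate select */
    }
}
```

Both give the same values lane by lane; the difference is only that the split
form needs one more vector operation, which is where the cost comes from.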

[Bug bootstrap/54179] please split insn-emit.c !

2023-07-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54179

--- Comment #33 from Tamar Christina  ---
(In reply to Sam James from comment #32)
> I'll tentatively reopen as IIRC tamar mentioned they've had some ideas about
> this, apologies if I'm misremembering.

Hello, yes I have a patch locally that I need to finish (there's a lot of gen-
machinery).

I'll try to get it upstream soon :)

[Bug ada/110336] New: Ada doesn't build with coverage enabled on Arm

2023-06-21 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110336

Bug ID: 110336
   Summary: Ada doesn't build with coverage enabled on Arm
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: build
  Severity: normal
  Priority: P3
 Component: ada
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
CC: dkm at gcc dot gnu.org
  Target Milestone: ---
  Host: arm-none-linux-gnueabihf
Target: arm-none-linux-gnueabihf
 Build: arm-none-linux-gnueabihf

When building Ada with coverage enabled using

./configure --target=arm-none-linux-gnueabihf --build=arm-none-linux-gnueabihf
--host=arm-none-linux-gnueabihf --with-arch=armv7-a --with-fpu=neon
--with-float=hard --with-mode=thumb  --disable-bootstrap --enable-coverage=opt
--enable-languages=all --enable-host-shared

the build fails with lots of link errors:

/usr/bin/ld: ../../libcommon-target.a(vec.o): in function `_sub_I_00100_0':
vec.cc:(.text.startup+0x44): undefined reference to `__gcov_init'
/usr/bin/ld: ../../libcommon-target.a(vec.o): in function `_sub_D_00100_1':
vec.cc:(.text.exit+0x0): undefined reference to `__gcov_exit'
/usr/bin/ld: ../../libcommon-target.a(vec.o):(.data.rel+0x10): undefined reference to `__gcov_merge_add'
/usr/bin/ld: ../../libcommon-target.a(hooks.o): in function `_sub_I_00100_0':
hooks.cc:(.text.startup+0x4): undefined reference to `__gcov_init'
/usr/bin/ld: ../../libcommon-target.a(hooks.o): in function `_sub_D_00100_1':
hooks.cc:(.text.exit+0x0): undefined reference to `__gcov_exit'

seems like libgcov is missing somewhere?

[Bug other/110329] [14 regression] build fails building documentation after r14-1949-g957ae904065917

2023-06-21 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110329

Tamar Christina  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Tamar Christina  ---
Fixed, thanks for the report!

[Bug bootstrap/110324] [14 Regression][build][nvptx] build/genpreds: Internal error: RTL check: expected elt 2 type 'T', have 's' due to r14-1949-g957ae904065917

2023-06-20 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110324

Tamar Christina  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Tamar Christina  ---
Fixed, thanks for the report.

[Bug tree-optimization/110223] New: Missed optimization vectorizing booleans comparisons

2023-06-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110223

Bug ID: 110223
   Summary: Missed optimization vectorizing booleans comparisons
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

== truncate before bool

float a[1024], b[1024], c[1024], d[1024];
int k[1024];
_Bool res[1024];

int main ()
{
  int i;
  for (i = 0; i < 1024; i++)
res[i] = k[i] != ((i - 3) == 0);
}

This vectorizes, but does the bit clear before the truncate.  Because of the
high unroll factor, doing it the other way around would save the extra bit clears.
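
A scalar sketch of why the order does not change the result, assuming the
wide comparison lanes hold 0 or -1 (an assumption made here for illustration;
the helper names are hypothetical and this is not the vectorizer's code).
Clearing in the 32-bit lanes costs one AND per wide vector, while clearing
after the truncate needs one AND on the single narrow vector:

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16  /* one V16QI worth of results, i.e. four V4SI inputs */

/* Order described in the report: clear bits in the wide lanes, then narrow. */
static void clear_then_truncate(const int32_t *mask, uint8_t *out)
{
    for (int i = 0; i < LANES; i++) {
        int32_t bit = mask[i] & 1;      /* AND done on 32-bit lanes */
        out[i] = (uint8_t)bit;          /* truncate to 8 bits */
    }
}

/* Suggested order: narrow first, then clear bits in the narrow lanes. */
static void truncate_then_clear(const int32_t *mask, uint8_t *out)
{
    for (int i = 0; i < LANES; i++) {
        uint8_t narrow = (uint8_t)mask[i];  /* truncate to 8 bits */
        out[i] = narrow & 1;                /* AND done on 8-bit lanes */
    }
}

/* Returns 1 if both orders produce the same bytes for a sample mask. */
int orders_agree(void)
{
    int32_t mask[LANES];
    uint8_t a[LANES], b[LANES];

    for (int i = 0; i < LANES; i++)
        mask[i] = (i % 3 == 0) ? -1 : 0;    /* assumed 0 / -1 mask lanes */

    clear_then_truncate(mask, a);
    truncate_then_clear(mask, b);

    for (int i = 0; i < LANES; i++)
        assert(a[i] == b[i]);
    return 1;
}
```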

== reduce using unpack

float a[1024], b[1024], c[1024], d[1024];
_Bool k[1024];
_Bool res[1024];

int main ()
{
  int i;
  for (i = 0; i < 1024; i++)
res[i] = k[i] != (i == 0);
}

Doesn't vectorize as the compiler doesn't know how to compare different boolean
vector element sizes.  Because i is an integer the result is a V4SI backed
boolean type, vs the V16QI one for k[i].  So it has to compare 4 V4SI vectors
against 1 V16QI, which it can do by truncating the 4 V4SI bools to 1 V16QI
bool.
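
A scalar model of that truncation, as a rough sketch only: plain arrays stand
in for the V4SI and V16QI boolean vectors, mask lanes are assumed to be 0 or
-1, and all names below are hypothetical rather than anything from the
vectorizer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COUNT 1024

static bool bk[COUNT], res_scalar[COUNT], res_model[COUNT];

/* Reference: the loop from the report. */
static void scalar_loop(void)
{
    for (int i = 0; i < COUNT; i++)
        res_scalar[i] = bk[i] != (i == 0);
}

/* Model: 16 elements per step, wide masks narrowed before the compare. */
static void modelled_loop(void)
{
    for (int base = 0; base < COUNT; base += 16) {
        int32_t wide[16];   /* stands in for 4 V4SI boolean vectors */
        int8_t narrow[16];  /* stands in for 1 V16QI boolean vector */

        for (int l = 0; l < 16; l++)
            wide[l] = (base + l == 0) ? -1 : 0;   /* i == 0 */
        for (int l = 0; l < 16; l++)
            narrow[l] = (int8_t)wide[l];          /* truncate 32 -> 8 bits */
        for (int l = 0; l < 16; l++) {
            int8_t kmask = bk[base + l] ? -1 : 0; /* k[i] as a byte mask */
            res_model[base + l] = kmask != narrow[l];
        }
    }
}

/* Returns 1 if the modelled loop matches the scalar loop. */
int loops_agree(void)
{
    for (int i = 0; i < COUNT; i++)
        bk[i] = (i % 3 == 0);

    scalar_loop();
    modelled_loop();

    for (int i = 0; i < COUNT; i++)
        assert(res_scalar[i] == res_model[i]);
    return 1;
}
```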

== mask vs non-mask type

_Bool k[1024];
_Bool res[1024];

int main ()
{
  char i;
  for (i = 0; i < 64; i++)
res[i] = k[i] != (i == 0);
}

doesn't vectorize because the compiler doesn't know how to compare a boolean
mask vs a non-mask boolean.  There's a comment in the source code that this can
be done using a pattern (presumably casting the types earlier).

In my case I need these to work on gcond as well, not just assigns, and since
we don't codegen conds, it might be better to handle them in vectorizable_*.

[Bug middle-end/110142] [14 Regression] x264 from SPECCPU 2017 miscompares from g:2f482a07365d9f4a94a56edd13b7f01b8f78b5a0

2023-06-07 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110142

--- Comment #2 from Tamar Christina  ---
Thank you!

[Bug middle-end/110142] New: [14 Regression] x264 from SPECCPU 2017 miscompares from g:2f482a07365d9f4a94a56edd13b7f01b8f78b5a0

2023-06-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110142

Bug ID: 110142
   Summary: [14 Regression] x264 from SPECCPU 2017 miscompares
from g:2f482a07365d9f4a94a56edd13b7f01b8f78b5a0
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
CC: avieira at gcc dot gnu.org
  Target Milestone: ---
  Host: aarch64*
Target: aarch64*
 Build: aarch64*

Benchmark miscompiles after g:2f482a07365d9f4a94a56edd13b7f01b8f78b5a0

From 2f482a07365d9f4a94a56edd13b7f01b8f78b5a0 Mon Sep 17 00:00:00 2001
From: Andre Vieira 
Date: Mon, 5 Jun 2023 17:53:10 +0100
Subject: [PATCH] internal-fn,vect: Refactor widen_plus as internal_fn

 DEF_INTERNAL_WIDENING_OPTAB_FN and DEF_INTERNAL_NARROWING_OPTAB_FN
are like DEF_INTERNAL_SIGNED_OPTAB_FN and DEF_INTERNAL_OPTAB_FN
respectively. With the exception that they provide convenience wrappers
for a single vector to vector conversion, a hi/lo split or an even/odd
split.  Each definition for  will require either signed optabs
named  and  (for widening) or a single  (for
narrowing) for each of the five functions it creates.

[Bug rtl-optimization/109940] [13 Regression] ICE in decide_candidate_validity since g:53dddbfeb213ac4ec39f550aa81eaa4264375d2c

2023-05-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109940

Tamar Christina  changed:

   What|Removed |Added

  Known to work|13.1.0  |
 CC||rguenth at gcc dot gnu.org,
   ||tnfchris at gcc dot gnu.org
  Component|c   |rtl-optimization
  Known to fail|14.0|
Summary|[14 Regression] ICE in  |[13 Regression] ICE in
   |decide_candidate_validity,  |decide_candidate_validity
   |bisected|since
   ||g:53dddbfeb213ac4ec39f550aa
   ||81eaa4264375d2c
Version|unknown |14.0
   Keywords||ice-on-valid-code

--- Comment #2 from Tamar Christina  ---
confirmed, started with g:53dddbfeb213ac4ec39f550aa81eaa4264375d2c

[Bug ipa/109711] [14 regression] ICE (tree check: expected class ‘type’, have ‘exceptional’ (error_mark) in verify_range, at value-range.cc:1060) when building ffmpeg-4.4.4 since r14-377-gc92b8be9b52b

2023-05-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109711

--- Comment #6 from Tamar Christina  ---
My own bisect does indeed end up at r14-377-gc92b8be9b52b7e and I cannot
reproduce it on GCC 13.

[Bug ipa/109711] [14 regression] ICE (tree check: expected class ‘type’, have ‘exceptional’ (error_mark) in verify_range, at value-range.cc:1060) when building ffmpeg-4.4.4 since r14-377-gc92b8be9b52b

2023-05-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109711

--- Comment #5 from Tamar Christina  ---
(In reply to Martin Liška from comment #3)
> Hm, on x86_64-linux-gnu, it started with r13-6616-g2246d576f922ba.

$ cat prtest2.c
void lspf2lpc();

int interpolate_lpc_q_0;

void
interpolate_lpc(int subframe_num) {
  float weight;
  if (interpolate_lpc_q_0)
weight = subframe_num;
  else
weight = 1.0;
  if (weight != 1.0)
lspf2lpc();
}

void
qcelp_decode_frame() {
  int i;
  for (;; i++)
interpolate_lpc(i);
}

$ ./install/bin/gcc --version
gcc (GCC) 13.0.1 20230312 (experimental)

$ git log -1
commit 2246d576f922bae3629da0fe1dbfcc6ff06769ad (HEAD)
Author: Tamar Christina 
Date:   Sun Mar 12 18:39:33 2023 +

middle-end: Revert can_special_div_by_const changes [PR108583]

This reverts the changes for the CAN_SPECIAL_DIV_BY_CONST hook.

gcc/ChangeLog:

PR target/108583
* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
* doc/tm.texi.in: Likewise.
* explow.cc (round_push, align_dynamic_address): Revert previous
patch.
* expmed.cc (expand_divmod): Likewise.
* expmed.h (expand_divmod): Likewise.
* expr.cc (force_operand, expand_expr_divmod): Likewise.
* optabs.cc (expand_doubleword_mod, expand_doubleword_divmod):
Likewise.
* target.def (can_special_div_by_const): Remove.
* target.h: Remove tree-core.h include
* targhooks.cc (default_can_special_div_by_const): Remove.
* targhooks.h (default_can_special_div_by_const): Remove.
* tree-vect-generic.cc (expand_vector_operation): Remove hook.
* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook.
* tree-vect-stmts.cc (vectorizable_operation): Remove hook.

$ ./install/bin/gcc -O2 -S -o - prtest2.c
.file   "prtest2.c"
.text
.p2align 4
.globl  interpolate_lpc
.type   interpolate_lpc, @function
interpolate_lpc:
.LFB0:
.cfi_startproc
movl    interpolate_lpc_q_0(%rip), %eax
testl   %eax, %eax
je  .L1
pxor    %xmm0, %xmm0
cvtsi2ssl   %edi, %xmm0
ucomiss .LC0(%rip), %xmm0
jp  .L4
jne .L4
.L1:
ret
.p2align 4,,10
.p2align 3
...

Also that commit doesn't build because I forgot to cp tm.texi to the source
directory after the revert.

So I think the bisect probably didn't find it in that range.

https://godbolt.org/z/r44xGzarY indicates GCC 13.1 is fine.  So I don't think
this one is mine.

[Bug target/109632] Inefficient codegen when complex numbers are emulated with structs

2023-04-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632

--- Comment #9 from Tamar Christina  ---
Thank you!

[Bug target/109632] Inefficient codegen when complex numbers are emulated with structs

2023-04-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632

--- Comment #6 from Tamar Christina  ---
That's an interesting approach, I think it would also fix
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109391 would it not? Since the
int16x8x3_t return would be "scalarized" avoiding the bad expansion?

[Bug target/109632] Inefficient codegen when complex numbers are emulated with structs

2023-04-26 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632

--- Comment #3 from Tamar Christina  ---
note that even if we can't stop SLP, we should be able to generate as efficient
code by being creative about the instruction selection, that's why I marked it
as a target bug :)

[Bug target/109632] Inefficient codegen when complex numbers are emulated with structs

2023-04-26 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632

--- Comment #2 from Tamar Christina  ---
(In reply to Richard Biener from comment #1)
> Well, the usual unknown ABI boundary at function entry/exit.

Yes, but LLVM gets it right, so it should be a solvable computer science
problem. :)

Note that this was reduced from a bigger routine, but the end result is the
same: the thing shouldn't have been vectorized.

[Bug target/109632] New: Inefficient codegen when complex numbers are emulated with structs

2023-04-26 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632

Bug ID: 109632
   Summary: Inefficient codegen when complex numbers are emulated
with structs
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

The following two cases are the same

struct complx_t {
float re;
float im;
};

complx_t
add(const complx_t &a, const complx_t &b) {
  return {a.re + b.re, a.im + b.im};
}

_Complex float
add(const _Complex float *a, const _Complex float *b) {
  return {__real__ *a + __real__ *b, __imag__ *a + __imag__ *b};
}

But we generate very different code (looking at -O2).  For the first one we do:

ldr d1, [x1]
ldr d0, [x0]
fadd    v0.2s, v0.2s, v1.2s
fmov    x0, d0
lsr x1, x0, 32
lsr w0, w0, 0
fmov    s1, w1
fmov    s0, w0
ret

which is bad for obvious reasons, but it also never needed to go through the
genreg for such a reversal.  We could have used many other NEON instructions.

For the second one we generate the good instructions:

add(float _Complex const*, float _Complex const*):
ldp s3, s2, [x0]
ldp s0, s1, [x1]
fadd    s1, s2, s1
fadd    s0, s3, s0
ret

The difference being that in the second one we have decomposed the initial
structure by loading the elements:

   [local count: 1073741824]:
  _1 = REALPART_EXPR <*a_8(D)>;
  _2 = REALPART_EXPR <*b_9(D)>;
  _3 = _1 + _2;
  _4 = IMAGPART_EXPR <*a_8(D)>;
  _5 = IMAGPART_EXPR <*b_9(D)>;
  _6 = _4 + _5;
  _10 = COMPLEX_EXPR <_3, _6>;
  return _10;

In the first one we've kept them as vectors:

   [local count: 1073741824]:
  vect__1.6_13 = MEM  [(float *)a_8(D)];
  vect__2.9_15 = MEM  [(float *)b_9(D)];
  vect__3.10_16 = vect__1.6_13 + vect__2.9_15;
  MEM  [(float *)] = vect__3.10_16;
  return D.4435;

This part is probably a costing issue: we SLP them even though it's not
profitable, because for the APCS we have to return them in separate registers.

Using -fno-tree-vectorize gets the gimple code right:

   [local count: 1073741824]:
  _1 = a_8(D)->re;
  _2 = b_9(D)->re;
  _3 = _1 + _2;
  D.4435.re = _3;
  _4 = a_8(D)->im;
  _5 = b_9(D)->im;
  _6 = _4 + _5;
  D.4435.im = _6;
  return D.4435;

But we generate worse code:

ldp s1, s0, [x0]
mov x2, 0
ldp s3, s2, [x1]
fadd    s1, s1, s3
fadd    s0, s0, s2
fmov    w1, s1
fmov    w0, s0
bfi x2, x1, 0, 32
bfi x2, x0, 32, 32
lsr x0, x2, 32
lsr w2, w2, 0
fmov    s1, w0
fmov    s0, w2

where we again use genreg as a very complicated way to do a no-op.

So there are two bugs here:

1. a costing issue: we shouldn't SLP
2. an expansion issue: the code out of expand is bad to begin with.
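
To illustrate why the bfi/lsr sequence above is a no-op, here is a small C
model of it.  This is only a sketch: the helper names are hypothetical, and
the register-to-statement mapping in the comments is my reading of the
assembly, not compiler output.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct complx_t {
    float re;
    float im;
};

/* Reinterpret a float as its 32-bit pattern (like fmov wN, sN). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret a 32-bit pattern as a float (like fmov sN, wN). */
static float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Pack both halves into one 64-bit value, then split them again. */
struct complx_t round_trip(float re, float im)
{
    uint64_t x = 0;                                            /* mov x2, 0 */
    x = (x & ~0xffffffffull) | float_bits(re);                 /* bfi x2, x1, 0, 32 */
    x = (x & 0xffffffffull) | ((uint64_t)float_bits(im) << 32); /* bfi x2, x0, 32, 32 */

    struct complx_t r;
    r.im = bits_float((uint32_t)(x >> 32));                    /* lsr x0, x2, 32 */
    r.re = bits_float((uint32_t)x);                            /* lsr w2, w2, 0 */
    return r;
}
```

The values come back unchanged, so the whole integer-register detour could be
dropped once the two sums are already in s registers.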

[Bug tree-optimization/109154] [13/14 regression] jump threading de-optimizes nested floating point comparisons

2023-04-26 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org

--- Comment #57 from Tamar Christina  ---
Ah, Cool, will take the remaining work then.

Thanks for all the patches in stage 4 everyone!

[Bug tree-optimization/109154] [13/14 regression] jump threading de-optimizes nested floating point comparisons

2023-04-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #54 from Tamar Christina  ---
@Jakub, just to check to avoid doing duplicate work, did you intend to do the
remaining ifcvt changes or should we?

[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3

2023-04-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

--- Comment #7 from Tamar Christina  ---
(In reply to Richard Biener from comment #5)
> (In reply to Tamar Christina from comment #4)
> > (In reply to Richard Biener from comment #3)
> > > The issue isn't unrolling but invariant motion.  We unroll the innermost
> > > loop, vectorizer the middle loop and then unroll that as well.  That
> > > leaves us with 64 invariant loads from b[] in the outer loop which I
> > > think RTL opts never "schedule back", even with -fsched-pressure.
> > > 
> > 
> > Aside from the loads, by fully unrolling the inner loop, that means we need
> > 16 unique registers live for the destination every iteration.  That's
> > already half the SIMD register file on AArch64 gone, not counting the
> > invariant loads.
> 
> Why?  You can try -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning

Oh, I was basing that on the output of the existing using a lower loop count
with e.g.
template void f<16, 16, 4>

But yes, those options avoid the spills, but of course without them you leave
all the loads inside the loop iteration.

I was hoping more we could get closer to https://godbolt.org/z/7c5YfxE5j which
is a lot better code. i.e. the invariants moved inside the outer loop.  But
yes, I do understand this may be hard to do automatically.

> 
> > The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE unrolling
> > and does it at RTL instead.
> 
> ... because on GIMPLE we only can fully unroll or not.

But is this an intrinsic limitation or just because atm we only unroll for SLP?

> 
> > At the moment a way for the user to locally control the unroll amount would
> > already be a good step. I know there's the param, but that's global and
> > typically the unroll factor would depend on the GEMM kernel.
> 
> As said it should already work to the extent that on GIMPLE we do not
> perform classical loop unrolling.

Right, but the RTL unroller produces horrible code, e.g. the addressing modes
are pretty bad.

[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3

2023-04-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> The issue isn't unrolling but invariant motion.  We unroll the innermost
> loop, vectorizer the middle loop and then unroll that as well.  That leaves
> us with 64 invariant loads from b[] in the outer loop which I think RTL
> opts never "schedule back", even with -fsched-pressure.
> 

Aside from the loads, by fully unrolling the inner loop, that means we need 16
unique registers live for the destination every iteration.  That's already half
the SIMD register file on AArch64 gone, not counting the invariant loads.

> Estimating register pressure on GIMPLE is hard and we heavily rely on
> "optimistic" transforms with regard to things being optimized in followup
> passes during the GIMPLE phase.

Understood, but if we can't do it automatically, is there a way to tell the
unroller not to fully unroll this?

The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE unrolling
and does it at RTL instead.

At the moment a way for the user to locally control the unroll amount would
already be a good step. I know there's the param, but that's global and
typically the unroll factor would depend on the GEMM kernel.

[Bug tree-optimization/109587] New: Deeply nested loop unrolling overwhelms register allocator

2023-04-21 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

Bug ID: 109587
   Summary: Deeply nested loop unrolling overwhelms register
allocator
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

On matrix multiplication routines such as 

#include <arm_neon.h>

template<int N, int M, int K>
void f(const float32_t *__restrict a, const float32_t *__restrict b, float32_t
*c) {
for (int i = 0; i < N; ++i) {
for (int j=0; j < M; ++j) {
for (int k=0; k < K; ++k) {
c[i*N + j] += a[i*K + k] * b[k*M + j];
}
}
}
}

template void f<16, 16, 16>(const float32_t *__restrict a, const float32_t
*__restrict b, float32_t *c);

the loop unroller fully unrolls the inner loop because the iteration count 16
is below the threshold.  Because this results in an RMW operation in particular,
we don't have enough registers to deal with it and we spill heavily.

The loop can be split in two but this requires manual work on each GEMM kernel.
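
One reading of "split in two" is to break the k loop into two halves by hand.
The sketch below does that in plain C with the sizes fixed at 16; the names
are hypothetical, and whether this is the split meant here, or whether it
avoids the spills, is an assumption rather than something measured.

```c
#include <assert.h>

enum { GEMM_N = 16, GEMM_M = 16, GEMM_K = 16 };

/* The kernel from the report, with the template parameters fixed at 16. */
void gemm_ref(const float *restrict a, const float *restrict b, float *c)
{
    for (int i = 0; i < GEMM_N; ++i)
        for (int j = 0; j < GEMM_M; ++j)
            for (int k = 0; k < GEMM_K; ++k)
                c[i * GEMM_N + j] += a[i * GEMM_K + k] * b[k * GEMM_M + j];
}

/* Hand-split variant: the k loop becomes two loops of GEMM_K / 2 iterations. */
void gemm_split(const float *restrict a, const float *restrict b, float *c)
{
    for (int i = 0; i < GEMM_N; ++i)
        for (int j = 0; j < GEMM_M; ++j) {
            for (int k = 0; k < GEMM_K / 2; ++k)
                c[i * GEMM_N + j] += a[i * GEMM_K + k] * b[k * GEMM_M + j];
            for (int k = GEMM_K / 2; k < GEMM_K; ++k)
                c[i * GEMM_N + j] += a[i * GEMM_K + k] * b[k * GEMM_M + j];
        }
}

/* Returns 1 if both variants compute the same matrix. */
int gemm_variants_agree(void)
{
    static float a[GEMM_N * GEMM_K], b[GEMM_K * GEMM_M];
    static float c1[GEMM_N * GEMM_M], c2[GEMM_N * GEMM_M];

    /* Small multiples of 0.5 keep every product and sum exact in float. */
    for (int i = 0; i < GEMM_N * GEMM_K; ++i)
        a[i] = (float)(i % 7) * 0.5f;
    for (int i = 0; i < GEMM_K * GEMM_M; ++i)
        b[i] = (float)(i % 5) - 2.0f;
    for (int i = 0; i < GEMM_N * GEMM_M; ++i)
        c1[i] = c2[i] = 0.0f;

    gemm_ref(a, b, c1);
    gemm_split(a, b, c2);

    for (int i = 0; i < GEMM_N * GEMM_M; ++i)
        assert(c1[i] == c2[i]);
    return 1;
}
```

The accumulation order per output element is unchanged, so the split is purely
a restructuring that has to be repeated for each kernel shape.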

Perhaps the loop unroller can use a better heuristic here?

It also looks like adding a pragma

template<int N, int M, int K>
void f(const float32_t *__restrict a, const float32_t *__restrict b, float32_t
*c) {
for (int i = 0; i < N; ++i) {
for (int j=0; j < M; ++j) {
#pragma GCC unroll 8
for (int k=0; k < K; ++k) {
c[i*N + j] += a[i*K + k] * b[k*M + j];
}
}
}
}

helps, but because this blocks cunrolli and instead unrolls in RTL we lose
scheduling and end up with less efficient code.

So can we do better here with early unrolling?

[Bug tree-optimization/109154] [13 regression] jump threading de-optimizes nested floating point comparisons

2023-04-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #42 from Tamar Christina  ---
Thanks for all the work so far folks!

Just to clarify the current state, it looks like the first reduced testcase is
now correct.

The larger example as in c26 is slightly better, but still suboptimal.
https://godbolt.org/z/7vbrG8EMj

[Bug rtl-optimization/109391] New: Inefficient codegen on AArch64 when structure types are returned

2023-04-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109391

Bug ID: 109391
   Summary: Inefficient codegen on AArch64 when structure types
are returned
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ra
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
CC: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

This example https://godbolt.org/z/Pe3f3ozGf

---

#include <arm_neon.h>

int16x8x3_t bsl(const uint16x8x3_t *check, const int16x8x3_t *in1,
  const int16x8x3_t *in2) {
  int16x8x3_t out;
  for (uint32_t j = 0; j < 3; j++) {
out.val[j] = vbslq_s16(check->val[j], in1->val[j], in2->val[j]);
  }
  return out;
}


---

Generates:

bsl:
ldp q6, q16, [x1]
ldp q0, q4, [x2]
ldp q5, q7, [x0]
bsl v5.16b, v6.16b, v0.16b
ldr q0, [x2, 32]
bsl v7.16b, v16.16b, v4.16b
ldr q6, [x1, 32]
mov v1.16b, v5.16b
ldr q5, [x0, 32]
bsl v5.16b, v6.16b, v0.16b
mov v0.16b, v1.16b
mov v1.16b, v7.16b
mov v2.16b, v5.16b
ret

with 3 superfluous moves.  It looks like reload is having trouble dealing
with the new compound types as return arguments.

So in RTL we have:

(insn 17 20 22 2 (set (subreg:V8HI (reg/v:V3x8HI 105 [ out ]) 16)
(xor:V8HI (and:V8HI (xor:V8HI (reg:V8HI 115 [ in2_11(D)->val[1] ])
(reg:V8HI 114 [ in1_10(D)->val[1] ]))
(reg:V8HI 113 [ check_9(D)->val[1] ]))
(reg:V8HI 115 [ in2_11(D)->val[1] ]))) "/app/example.c":7:16
discrim 1 2558 {aarch64_simd_bslv8hi_internal}
 (expr_list:REG_DEAD (reg:V8HI 115 [ in2_11(D)->val[1] ])
(expr_list:REG_DEAD (reg:V8HI 114 [ in1_10(D)->val[1] ])
(expr_list:REG_DEAD (reg:V8HI 113 [ check_9(D)->val[1] ])
(nil)
(insn 22 17 29 2 (set (subreg:V8HI (reg/v:V3x8HI 105 [ out ]) 32)
(xor:V8HI (and:V8HI (xor:V8HI (reg:V8HI 118 [ in2_11(D)->val[2] ])
(reg:V8HI 117 [ in1_10(D)->val[2] ]))
(reg:V8HI 116 [ check_9(D)->val[2] ]))
(reg:V8HI 118 [ in2_11(D)->val[2] ]))) "/app/example.c":7:16
discrim 1 2558 {aarch64_simd_bslv8hi_internal}
 (expr_list:REG_DEAD (reg:V8HI 118 [ in2_11(D)->val[2] ])
(expr_list:REG_DEAD (reg:V8HI 117 [ in1_10(D)->val[2] ])
(expr_list:REG_DEAD (reg:V8HI 116 [ check_9(D)->val[2] ])
(nil)
(insn 29 22 30 2 (set (reg/i:V3x8HI 32 v0)
(reg/v:V3x8HI 105 [ out ])) "/app/example.c":10:1 3964
{*aarch64_movv3x8hi}
 (expr_list:REG_DEAD (reg/v:V3x8HI 105 [ out ])
(nil)))
(insn 30 29 37 2 (use (reg/i:V3x8HI 32 v0)) "/app/example.c":10:1 -1
 (nil))

Reload then decides to insert a bunch of reloads:

 Choosing alt 0 in insn 17:  (0) =w  (1) 0  (2) w  (3) w
{aarch64_simd_bslv8hi_internal}
  Creating newreg=126 from oldreg=113, assigning class FP_REGS to r126
   17: r126:V8HI=r115:V8HI^r114:V8HI:V8HI^r115:V8HI
  REG_DEAD r115:V8HI
  REG_DEAD r114:V8HI
  REG_DEAD r113:V8HI
Inserting insn reload before:
   43: r126:V8HI=r113:V8HI
Inserting insn reload after:
   44: r105:V3x8HI#16=r126:V8HI

which introduces these moves.  The problem existed with the previous structure
types as well (OImode etc) so it's not new but costs us lots of perf.

I don't think I can fix this with the same pass as
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106 can I? It looks like in
this case the RTL looks fine.
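
For reference, the xor/and/xor shape in the aarch64_simd_bslv8hi_internal RTL
above is the usual bit-select identity.  A small C check on 16-bit values
(the function names are hypothetical and chosen for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct bit-select: take in1 where check has a 1 bit, in2 elsewhere. */
static uint16_t bsl_select(uint16_t check, uint16_t in1, uint16_t in2)
{
    return (uint16_t)((check & in1) | (~check & in2));
}

/* Shape used by the RTL pattern: ((in2 ^ in1) & check) ^ in2. */
uint16_t bsl_xor_form(uint16_t check, uint16_t in1, uint16_t in2)
{
    return (uint16_t)(((in2 ^ in1) & check) ^ in2);
}

/* Returns 1 if both forms agree on a handful of sample values. */
int bsl_forms_agree(void)
{
    static const uint16_t v[] = { 0x0000, 0xffff, 0x00ff, 0xa5a5, 0x1234 };
    const size_t n = sizeof v / sizeof v[0];

    for (size_t c = 0; c < n; c++)
        for (size_t a = 0; a < n; a++)
            for (size_t b = 0; b < n; b++)
                assert(bsl_select(v[c], v[a], v[b]) ==
                       bsl_xor_form(v[c], v[a], v[b]));
    return 1;
}
```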

[Bug tree-optimization/109154] [13 regression] jump threading de-optimizes nested floating point comparisons

2023-03-28 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #25 from Tamar Christina  ---
Created attachment 54777
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54777&action=edit
extracted codegen

[Bug tree-optimization/109154] [13 regression] jump threading de-optimizes nested floating point comparisons

2023-03-28 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #24 from Tamar Christina  ---
(In reply to Jakub Jelinek from comment #12)
> (In reply to Richard Biener from comment #11)
> > _1 shoud be [-Inf, nextafter (0.0, -Inf)], not [-Inf, -0.0]
> The reduced testcase is invalid because it uses uninitialized l.

Sure, let's fix that, it was reduced a bit too far:

https://godbolt.org/z/he3rT5Exq

Has the extracted codegen part.

Note how GCC 14 does at least 2x the number of floating point comparisons in
the hot loops.

The scalar code doesn't look (off the top of my head) that bad, but the
additional entries in the phi nodes are still causing major headaches for
vector code.

  # iftmp.2_36 = PHI <1(10), _95(11), 0(9)>
  # iftmp.0_97 = PHI <2.0e+0(10), 2.0e+0(11), 4.0e+0(9)>
  # iftmp.1_101 = PHI <5.0e-1(10), 5.0e-1(11), 2.5e-1(9)>

vs before

  # iftmp.2_38 = PHI <1(11), _95(12)>
  # iftmp.0_96 = PHI <2.0e+0(11), iftmp.0_94(12)>
  # iftmp.1_100 = PHI <5.0e-1(11), iftmp.1_98(12)>

which causes it to generate:

fcmge   p3.s, p0/z, z0.s, z6.s
fcmlt   p1.s, p0/z, z0.s, z6.s
fcmge   p1.s, p1/z, z0.s, #0.0
fcmge   p1.s, p3/z, z0.s, #0.0
fcmlt   p3.s, p0/z, z0.s, #0.0

vs

fcmge   p3.s, p0/z, z0.s, #0.0
fcmlt   p2.s, p0/z, z0.s, z16.s

The split in threading is causing it to miss that it can do the comparison with
0 just once on all the elements.

[Bug tree-optimization/109230] [13 Regression] Maybe wrong code for opus package on aarch64 since r13-4122-g1bc7efa948f751

2023-03-21 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109230

--- Comment #11 from Tamar Christina  ---
Neither of those vec_perms are valid targets for this optimization.

It looks like sel.series_p is not doing what I expected. It's matching even
elements and ignoring the odd ones.
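
A simplified model of how a strided series check can behave this way, assuming
the check has the form "output index out_base + i * out_step selects input
element in_base + i * in_step".  This is not GCC's implementation;
model_series_p and its parameters are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if sel[out_base + i * out_step] == in_base + i * in_step for every
   visited index below n.  Indices the stride skips are never inspected.
   out_step must be non-zero.  */
bool model_series_p(const unsigned *sel, size_t n,
                    size_t out_base, size_t out_step,
                    unsigned in_base, unsigned in_step)
{
    unsigned expected = in_base;

    for (size_t i = out_base; i < n; i += out_step) {
        if (sel[i] != expected)
            return false;
        expected += in_step;
    }
    return true;
}

/* Returns 1 if a selector with arbitrary odd elements still matches the
   even-element series, while the odd-element series is rejected.  */
int series_model_demo(void)
{
    const unsigned sel[4] = { 0, 7, 2, 5 };

    /* Only sel[0] and sel[2] are checked, so this matches.  */
    assert(model_series_p(sel, 4, 0, 2, 0, 2));
    /* Checking the odd positions as well exposes the mismatch.  */
    assert(!model_series_p(sel, 4, 1, 2, 1, 2));
    return 1;
}
```

Under this model, a single call with a step of 2 says nothing about the odd
elements, so a second check covering them would be needed before treating the
permute as a valid target.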

[Bug tree-optimization/109230] [13 Regression] Maybe wrong code for opus package on aarch64 since r13-4122-g1bc7efa948f751

2023-03-21 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109230

--- Comment #1 from Tamar Christina  ---
That patch only fixed the bootstrap, in any case I'm on holidays so have asked
someone else to look.

[Bug target/109154] [13 regression] jump threading with de-optimizes nested floating point comparisons

2023-03-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

Tamar Christina  changed:

   What|Removed |Added

Summary|[13 regression] aarch64 |[13 regression] jump
   |-mcpu=neoverse-v1 microbude |threading with de-optimizes
   |performance regression  |nested floating point
   ||comparisons
 Status|UNCONFIRMED |NEW
 CC||aldyh at gcc dot gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2023-03-16

--- Comment #3 from Tamar Christina  ---
Aldy, any thoughts here?

[Bug target/109154] [13 regression] aarch64 -mcpu=neoverse-v1 microbude performance regression

2023-03-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #2 from Tamar Christina  ---
Confirmed.  It looks like the extra range information from
g:4fbe3e6aa74dae5c75a73c46ae6683fdecd1a75d is leading jump threading down the
wrong path.

Reduced testcase:
---

int etot_0, fasten_main_natpro_chrg_init;

void fasten_main_natpro() {
  float elcdst = 1;
  for (int l; l < 1; l++) {
int zone1 = l < 0.0f, chrg_e = fasten_main_natpro_chrg_init * (zone1 ?: 1)
*
   (l < elcdst ? 1 : 0.0f);
etot_0 += chrg_e;
  }
}

---

and compile with `-O1`.  The issue also affects all targets, not just AArch64:
https://godbolt.org/z/qes4K4oTz.  Using `-fno-thread-jumps` is confirmed to
"fix" it.

With the new case jump threading seems to duplicate the edges on the l < 0.0f
check.

the dump says:

"Jump threading proved probability of edge 5->7 too small (it is 41.0%
(guessed) should be 69.5% (guessed))"

In BB 3 the branch probabilities are guessed as:

if (_1 < 0.0)
  goto ; [41.00%]
else
  goto ; [59.00%]

and in BB 5:

if (_1 < 1.0e+0)   
  goto ; [41.00%]
else
  goto ; [59.00%]

and so it thinks that the chance of _1 >= 0.0 && _1 < 1.0 is very small:

if (_1 < 1.0e+0)
  goto ; [14.80%]
else
  goto ; [85.20%]

The problem is that BB 4 falls through to BB 5, and BB 6 falls through to
BB 7.

Jump threading optimizes BB 5 by splitting the work to be done in BB 5 for the
fall-through from BB 4 back into BB 4.
It then threads the additional edge to BB 7, where the final calculation is now
much more expensive than before (three-way phi node).

But because the hot path in BB 6 also falls into BB 7, the overall result is
that all paths become slower, and the hot path actually gets an additional
comparison.

This is why the code slows down: for each instance of this occurrence (and in
the example provided by microbude it happens often) we get an additional branch
in a few paths.

This has a bigger slowdown in SVE (vs the scalar slowdown) because it then
creates a longer dependency chain on producing the predicate for the BB.

It looks like this threading shouldn't be done if both hot and cold branches
end up in the same place?
