[Bug target/109519] aarch64: wrong code with NEON intrinsics on gcc-10 and later

2023-04-15 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109519

--- Comment #5 from Sebastian Pop  ---
Thanks Andrew for the patch, it fixes the issue.

[Bug target/109519] New: aarch64: wrong code with NEON intrinsics on gcc-10 and later

2023-04-14 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109519

            Bug ID: 109519
           Summary: aarch64: wrong code with NEON intrinsics on gcc-10 and later
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org
  Target Milestone: ---

Steps to reproduce:
$ git clone https://github.com/sebpop/bitshuffle.git -b gcc-10-bug
$ cd bitshuffle/reproduce
$ make
$ ./a.out

The expected output is produced by gcc-7, gcc-9, and clang-15. 
16384
4
14
16
33
39
45
51
57
67
102
108
120
126
128
134
138
140
[...]

gcc-9 is the last version of gcc I tested that works.

gcc-10 produces the following output:
./a.out
16384
0
0
0
0
39
45
51
57

gcc-11 and gcc-trunk produce the following output:
./a.out
16384
0
0
0
0
0
0
0

The output is also correct after reverting the second-to-last patch in the git
repo: https://github.com/kiyo-masui/bitshuffle/pull/140
This patch exposes the bug in gcc by using NEON intrinsics instead of scalar
computations to translate move_mask instructions from SSE2 to NEON.
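
For reference, a minimal scalar sketch of what a 16-byte move_mask computes:
bit i of the result is the top bit of byte i, which is what SSE2
_mm_movemask_epi8 returns. The name movemask_u8x16_ref is hypothetical and this
is not the code from the bitshuffle repo; it only illustrates the operation
that the NEON translation has to reproduce.

```c
#include <stdint.h>

/* Scalar reference (hypothetical helper, not taken from bitshuffle):
   collect the most significant bit of each of the 16 input bytes into
   one 16-bit mask, with byte 0 mapped to bit 0. */
static inline uint16_t movemask_u8x16_ref(const uint8_t bytes[16])
{
    uint16_t mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (uint16_t)((bytes[i] >> 7) << i);
    return mask;
}
```

Comparing the output of a NEON version against a scalar reference like this one
is one way to narrow down which step is miscompiled.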

[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba

2023-02-02 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409

Sebastian Pop  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |spop at gcc dot gnu.org

--- Comment #18 from Sebastian Pop  ---
A new 5% regression happened in gcc-trunk more recently and may be due to
another patch.

Rama was bisecting a 15% perf regression on lbm when updating from gcc-7 to
gcc-10. The regression can be seen on the LNT graph linked from comment #3:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=633.477.0=683.477.0=664.477.0=648.477.0=618.477.0=605.477.0=759.477.0=584.477.0

Execution times:
gcc-6:  213 seconds
gcc-7:  215 seconds
gcc-8:  266 seconds
gcc-9:  259 seconds
gcc-10: 260 seconds

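Honza's patch seems to be unrelated to this regression: it was committed to
trunk in November 2019, during gcc-10 development (gcc-10 was released on
May 7, 2020), whereas the slowdown is already present in gcc-8: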

commit a9a4edf0e71bbac9f1b5dcecdcf9250111d16889
Author: Jan Hubicka 
Date:   Sat Nov 30 22:25:24 2019 +0100

Update max_bb_count in execute_fixup_cfg


We need to git-bisect between gcc-7 and gcc-8.

[Bug debug/98776] DW_AT_low_pc is inconsistent with function entry address, when enabling -fpatchable-function-entry

2022-12-15 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98776

Sebastian Pop  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #15 from Sebastian Pop  ---
Fixed for arm64 as well on master, and backported to active branches gcc-12,
11, and 10.

[Bug debug/98776] DW_AT_low_pc is inconsistent with function entry address, when enabling -fpatchable-function-entry

2022-11-30 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98776

--- Comment #10 from Sebastian Pop  ---
Patch for arm64:
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/607601.html

[Bug middle-end/107485] [10 Regression] gcc-10 ICE with -fnon-call-exception

2022-11-14 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107485

--- Comment #10 from Sebastian Pop  ---
Thanks Richard.
The patch fixed the larger test as well.

[Bug middle-end/107485] New: gcc-10 ICE with -fnon-call-exception

2022-10-31 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107485

            Bug ID: 107485
           Summary: gcc-10 ICE with -fnon-call-exception
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org
  Target Milestone: ---

On arm64-linux I see the following crash only on gcc-10.
I do not see the ICE on gcc-11, 12, and trunk. 

$ ~/gcc-10/bld/gcc/cc1plus -fnon-call-exceptions f.ii
[...]
f.ii:29:23: internal compiler error: Segmentation fault
   29 |   template <int> void x(double *, b, unsigned long *) { f(); }
      |                       ^
0x134e58b crash_signal
../../gcc/toplev.c:328
0x1639464 tree_vec_extract(gimple_stmt_iterator*, tree_node*, tree_node*,
tree_node*, tree_node*)
../../gcc/tree-vect-generic.c:140
0x163ca0f expand_vector_condition
../../gcc/tree-vect-generic.c:1044
0x164081f expand_vector_operations_1
../../gcc/tree-vect-generic.c:1988
0x16419f7 expand_vector_operations
../../gcc/tree-vect-generic.c:2240
0x1641b3f execute
../../gcc/tree-vect-generic.c:2284
[...]

$ cat f.ii
typedef long a;
typedef double b;
typedef struct {
  a c __attribute__((__vector_size__(32)));
  b d __attribute__((__vector_size__(32)));
} e;
__attribute__((__always_inline__)) b f() {
  e g, h, i;
  g.c = h.d < i.d;
}
class j {
  bool k();
};
template <typename aa, typename l, typename n> void ab(aa, l, n) {
  int o;
  typename n::p q;
  unsigned long r;
  q(0, o, &r);
}
namespace s {
template <typename n>
void t(j *, long, long, unsigned long *, int u) {
  n ac;
  void v();
  ab(v, u, ac);
}
} // namespace s
struct w {
  template <int> void x(double *, b, unsigned long *) { f(); }
  double ad;
  void operator()(double, double, unsigned long *) {
    unsigned long m;
    x<0>(&ad, 0, &m);
  }
};
using s::t;
struct y {
  using p = w;
};
long ag, ah;
unsigned long ai;
double aj;
bool j::k() {
  using n = y;
  t<n>(this, ag, ah, &ai, aj);
}



git bisect stops on this patch:

commit 1e676cfbe1e13fba2c636b560362ed4f0a56893d
Author: Richard Biener 
Date:   Mon May 18 08:51:23 2020 +0200

middle-end/95171 - inlining of trapping compare into non-call EH fn

This fixes always-inlining across -fnon-call-exception boundaries
for conditions which we do not allow to throw.

2020-05-18  Richard Biener  

PR middle-end/95171
* tree-inline.c (remap_gimple_stmt): Split out trapping compares
when inlining into a non-call EH function.

* gcc.dg/pr95171.c: New testcase.

(cherry picked from commit fe168751c5c1c517c7c89c9a1e4e561d66b24663)
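
The statement in f() that ends up in the vector lowering code is the generic
vector comparison g.c = h.d < i.d. A minimal sketch of what that comparison
computes, using the GNU vector extensions: each result lane is -1 where the
comparison holds and 0 where it does not. The type and function names are
hypothetical, and the sketch assumes an LP64 target where long and double are
both 64 bits wide. An ordered floating-point comparison can raise an FP
exception on NaN operands, which is presumably why the commit above treats it
as a trapping compare.

```c
/* Generic vector types shaped like the ones in the reduced test case:
   four 64-bit lanes in 32 bytes.  Names are hypothetical. */
typedef long   vlong4   __attribute__((__vector_size__(32)));
typedef double vdouble4 __attribute__((__vector_size__(32)));

/* Compare a[i] < b[i] lane by lane and store the resulting mask:
   -1 (all bits set) for true, 0 for false. */
static void lanes_less_than(const double a[4], const double b[4], long out[4])
{
    vdouble4 va = {0}, vb = {0};
    for (int i = 0; i < 4; i++) {
        va[i] = a[i];
        vb[i] = b[i];
    }
    vlong4 mask = va < vb;   /* same operation as g.c = h.d < i.d */
    for (int i = 0; i < 4; i++)
        out[i] = mask[i];
}
```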

[Bug debug/98776] DW_AT_low_pc is inconsistent with function entry address, when enabling -fpatchable-function-entry

2022-09-29 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98776

Sebastian Pop  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |spop at gcc dot gnu.org

--- Comment #9 from Sebastian Pop  ---
Hi, is somebody working on fixing this on arm64?  If not I will be working on
it.

The linux kernel needs this fixed for systemtap and perf probe.

[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins

2022-05-16 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162

Sebastian Pop  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #14 from Sebastian Pop  ---
Fixed.

[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins

2022-04-18 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162

Sebastian Pop  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #52762|0                           |1
        is obsolete|                            |

--- Comment #8 from Sebastian Pop  ---
Created attachment 52826
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52826&action=edit
patch

You are right.  Please see attached an amended patch that only adds the
barriers to __sync builtins.

[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins

2022-04-06 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162

Sebastian Pop  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #52755|0                           |1
        is obsolete|                            |

--- Comment #5 from Sebastian Pop  ---
Created attachment 52762
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52762&action=edit
patch

The attached patch fixes the issue for __sync builtins by adding the missing
barrier to -march=armv8-a+nolse path in the outline-atomics functions.

The patch also changes the behavior of __atomic builtins for -moutline-atomics
-march=armv8-a+nolse to be the same as for -march=armv8-a+lse.

[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins

2022-04-06 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162

--- Comment #4 from Sebastian Pop  ---
The attached patch degrades performance on cpus with LSE: the barrier is not
needed when outline-atomics execute an LSE instruction.

I am thinking of adding the barrier to the armv8.0 generic path (no LSE) in
the outline-atomics functions instead.

[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins

2022-04-05 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162

Sebastian Pop  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #52750|0                           |1
        is obsolete|                            |

--- Comment #3 from Sebastian Pop  ---
Created attachment 52755
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52755&action=edit
patch

LSE atomics do not need a barrier.

Updated the patch to only generate the barriers after outline-atomics calls.

[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins

2022-04-05 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162

--- Comment #2 from Sebastian Pop  ---
Created attachment 52750
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52750&action=edit
patch

Fix.

[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins

2022-04-05 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162

--- Comment #1 from Sebastian Pop  ---
Also happens when compiling with LSE: -march=armv8.1-a or later.

[Bug target/105162] New: [AArch64] outline-atomics drops dmb ish barrier on __sync builtins

2022-04-05 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162

            Bug ID: 105162
           Summary: [AArch64] outline-atomics drops dmb ish barrier on __sync builtins
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org
  Target Milestone: ---

With -mno-outline-atomics gcc produces a `dmb ish` barrier on __sync builtins
as required by the Intel specification 
(see fix for https://gcc.gnu.org/PR65697 
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=f70fb3b635f9618c6d2ee3848ba836914f7951c2
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=ab876106eb689947cdd8203f8ecc6e8ac38bf5ba
)

$ cat a.c
int foo(int a)
{
  return __sync_bool_compare_and_swap(&a, 4, 5);
}
$ gcc -O2 a.c -S -o- -mno-outline-atomics 
foo:
sub sp, sp, #16
mov w1, 5
str w0, [sp, 12]
add x0, sp, 12
.L4:
ldxr    w2, [x0]
cmp w2, 4
bne .L5
stlxr   w3, w1, [x0]
cbnz    w3, .L4
.L5:
dmb ish
cset    w0, eq
add sp, sp, 16
ret

With -moutline-atomics gcc does not generate the barrier:

$ gcc -O2 a.c -S -o-  -moutline-atomics 
foo:
stp x29, x30, [sp, -32]!
mov w1, 5
mov x29, sp
add x2, sp, 28
str w0, [sp, 28]
mov w0, 4
bl  __aarch64_cas4_acq_rel
cmp w0, 4
cset    w0, eq
ldp x29, x30, [sp], 32
ret

Happens on gcc-8, 9, 10, 11, and trunk.
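
For context, a minimal sketch of the builtin's visible behavior. The GCC
documentation describes the __sync builtins as full barriers in most cases,
which is why the trailing `dmb ish` is expected on the non-LSE path. The
function name cas_4_to_5 is hypothetical; unlike foo() above it takes a pointer
so that the effect on memory can be observed by the caller.

```c
#include <stdbool.h>

/* Hypothetical helper: atomically replace *p with 5 if it currently
   holds 4.  Returns true when the swap happened.  The __sync builtin
   is documented as a full memory barrier. */
static bool cas_4_to_5(int *p)
{
    return __sync_bool_compare_and_swap(p, 4, 5);
}
```

Calling it on a variable holding 4 returns true and leaves 5 in the variable;
a second call returns false and leaves the value unchanged.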

[Bug rtl-optimization/99346] New: [aarch64] ICE in gen_rtx_SUBREG, at emit-rtl.c:1021

2021-03-02 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99346

            Bug ID: 99346
           Summary: [aarch64] ICE in gen_rtx_SUBREG, at emit-rtl.c:1021
           Product: gcc
           Version: 8.4.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org
  Target Milestone: ---

Created attachment 50289
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50289&action=edit
pre-processed reduced testcase

gcc-8, gcc-9, and gcc-10 from Ubuntu 20.04 are failing to compile the attached
test at -O2 and -O3 on Graviton2 aarch64-linux.

$ g++-10 -O2 a.ii
[...]
a.ii:362:50: internal compiler error: in gen_rtx_SUBREG, at emit-rtl.c:1021


$ g++-8 -O2 a.ii
[...]
a.ii:493:11: internal compiler error: in gen_rtx_SUBREG, at emit-rtl.c:1010

A similar bug was reported and fixed on x86:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83723

[Bug c++/99012] gcc-8.4.0 on aarch64 hits internal error during RTL pass: expand if `std::copysign` is used

2021-02-08 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99012

--- Comment #3 from Sebastian Pop  ---
I do not see the bug with today's cc1plus from origin/releases/gcc-8

[Bug c++/99012] gcc-8.4.0 on aarch64 hits internal error during RTL pass: expand if `std::copysign` is used

2021-02-08 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99012

Sebastian Pop  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |spop at gcc dot gnu.org

--- Comment #2 from Sebastian Pop  ---
I see the bug with

$ gcc-8 --version
gcc-8 (Ubuntu/Linaro 8.4.0-1ubuntu1~18.04) 8.4.0

[Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics

2021-01-28 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

            Bug ID: 98877
           Summary: [AArch64] Inefficient code generated for tbl NEON intrinsics
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org
  Target Milestone: ---

The code that GCC generates for NEON intrinsics is inefficient, which leads
developers to prefer inline assembly over intrinsics.

A similar performance bug for vmlal intrinsics was reported in
https://gcc.gnu.org/PR92665
The code generated by GCC for table lookups is also inefficient:

$ cat red.c
#include "arm_neon.h"

uint8x16_t fun(uint8x16_t lo, uint8x16_t hi, uint8x16_t idx) {
  uint8x16x2_t tab = { .val = {lo, hi} };
  uint8x16_t res = vqtbl2q_u8(tab, idx);
  return res;
}

$ gcc -O3 -S -o- red.c
fun:
mov v4.16b, v0.16b
mov v5.16b, v1.16b
tbl v0.16b, {v4.16b - v5.16b}, v2.16b
ret

$ clang -O3 -S -o- red.c
fun:
tbl v0.16b, { v0.16b, v1.16b }, v2.16b
ret
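
For reference, a scalar sketch of what the two-register table lookup computes.
The tbl instruction takes its table from consecutive vector registers; v0 and
v1 already are consecutive on entry, so the two mov instructions in the GCC
output are redundant. The name tbl2_u8x16_ref is hypothetical; the sketch only
models the result of vqtbl2q_u8, not how GCC or clang implement it.

```c
#include <stdint.h>

/* Scalar model (hypothetical helper) of a two-register byte table lookup:
   indices 0..15 select from lo, 16..31 select from hi, and any index
   outside the 32-byte table yields 0. */
static void tbl2_u8x16_ref(const uint8_t lo[16], const uint8_t hi[16],
                           const uint8_t idx[16], uint8_t res[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t k = idx[i];
        if (k < 16)
            res[i] = lo[k];
        else if (k < 32)
            res[i] = hi[k - 16];
        else
            res[i] = 0;
    }
}
```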

[Bug target/97802] New: [AArch64] Incorrect documentation for Arm64 NEON

2020-11-11 Thread spop at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97802

            Bug ID: 97802
           Summary: [AArch64] Incorrect documentation for Arm64 NEON
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org
  Target Milestone: ---

The following text in doc/invoke.texi seems to be outdated.  To avoid confusion
the text needs to be more specific about which NEON implementations it applies to:

"If the selected floating-point hardware includes the NEON extension
(e.g.@: @option{-mfpu=neon}), note that floating-point
operations are not generated by GCC's auto-vectorization pass unless
@option{-funsafe-math-optimizations} is also specified.  This is
because NEON hardware does not fully implement the IEEE 754 standard for
floating-point arithmetic (in particular denormal values are treated as
zero), so the use of NEON instructions may lead to a loss of precision."

This used to be true for older NEON implementations.
The NEON implementation in Armv8 and later is IEEE 754 compliant.
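
A minimal sketch of the behavior in question: under IEEE 754 gradual underflow,
halving the smallest normal float gives a nonzero denormal, whereas hardware
that treats denormals as zero would return 0. The function name is
hypothetical, and the sketch assumes the default floating-point environment
(no -ffast-math and no flush-to-zero mode enabled).

```c
#include <float.h>

/* Hypothetical helper: halve the smallest positive normal float.
   The volatile qualifiers keep the compiler from folding the
   multiplication at compile time, so the hardware result is observed. */
static float half_of_smallest_normal(void)
{
    volatile float x = FLT_MIN;
    volatile float half = 0.5f;
    return x * half;
}
```

A result that is greater than zero and smaller than FLT_MIN indicates that
denormal values are preserved.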