[Bug middle-end/115388] [15 Regression] wrong code at -O3 on x86_64-linux-gnu since r15-571-g1e0ae1f52741f7

2024-06-10 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115388

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #7 from Wilco  ---
(In reply to Richard Biener from comment #6)
> Fixed.  Unfortunately this didn't fix PR115256 if I checked correctly.  Keep
> searching!

The testcase hangs on AArch64, so this commit didn't fix it...

[Bug target/115342] New: [14/15 Regression] AArch64: Function multiversioning initialization incorrect

2024-06-04 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115342

Bug ID: 115342
   Summary: [14/15 Regression] AArch64: Function multiversioning
initialization incorrect
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

The CPU features initialization code uses CPUID registers. It uses incorrect
comparisons so that for example SVE is not set if SVE2 is available. Using
HWCAPs for these is both simpler and works correctly. The initialization must
also be done atomically to avoid multiple threads causing corruption due to a
non-atomic RMW of the global.
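
To make that concrete, here is a minimal sketch (not the actual libgcc code) of
HWCAP-based detection published with a single atomic store; the global name and
bit layout are illustrative, and HWCAP_SVE/HWCAP2_SVE2 come from Linux's
<asm/hwcap.h>:

#include <stdint.h>
#include <sys/auxv.h>     /* getauxval, AT_HWCAP, AT_HWCAP2 */
#include <asm/hwcap.h>    /* HWCAP_SVE, HWCAP2_SVE2 (Linux/AArch64) */

static uint64_t cpu_features;            /* illustrative global */

#define FEAT_INIT (1ull << 0)            /* 'initialized' marker */
#define FEAT_SVE  (1ull << 1)
#define FEAT_SVE2 (1ull << 2)

void init_cpu_features (void)
{
  uint64_t features = FEAT_INIT;

  if (getauxval (AT_HWCAP) & HWCAP_SVE)
    features |= FEAT_SVE;
  if (getauxval (AT_HWCAP2) & HWCAP2_SVE2)
    features |= FEAT_SVE2;

  /* Publish with one atomic store instead of a non-atomic read-modify-write
     of the global, so concurrent first calls cannot corrupt it.  */
  __atomic_store_n (&cpu_features, features, __ATOMIC_RELAXED);
}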

[Bug target/115188] [14/15 regression] invalid Thumb assembly for atomic store in loop on ARMv6

2024-05-23 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115188

Wilco  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org
   Last reconfirmed||2024-05-23

--- Comment #2 from Wilco  ---
(In reply to Andrew Pinski from comment #1)
> At first I thought it was the same failure as PR 115153 but it is different.

It's similar in that 'm' apparently allows LDMIA/STMIA with writeback in
Thumb-1. The correct constraint is 'Uw'.

[Bug target/115153] [14/15 Regression] Error: bad immediate value for 8-bit offset - armv7ve

2024-05-20 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115153

Wilco  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org
 CC||wilco at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #8 from Wilco  ---
Confirmed. Small example:

long long f(long long *p)
{
   return __atomic_load_n (p+32, __ATOMIC_RELAXED);
}

It only fails on ARM and if the offset is between 256 and 1024. This is a
latent bug: arm_legitimate_index_p has an explicit check that disallows
anything over 256, but an earlier check for VALID_NEON_DREG_MODE oddly enough
allows DImode and a larger range. Moving the Neon check after the LDRD check
fixes this.

Note using ldrd_strd_offset_operand/Do should also work, but the existing code
for 'm' is supposed to handle this correctly.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-08 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Wilco  changed:

   What|Removed |Added

 Target||aarch64-*-*
   Target Milestone|--- |15.0

[Bug target/114991] New: [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-08 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Bug ID: 114991
   Summary: [14/15 Regression] AArch64: LDP pass does not handle
some structure copies
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

The following example no longer emits LDP/STP since GCC14:

#include 

typedef struct { int arr[20]; } S;

void g (S *);
void h (S);
void f(int x)
{
  S s;
  g (&s);
  h (s);
}

f:
stp x29, x30, [sp, -176]!
add x1, sp, 96
mov x29, sp
add x0, sp, 16
ldp q29, q31, [x1]
ldr q30, [x1, 32]
str q29, [sp, 16]
ldr q29, [x1, 48]
str q31, [x0, 16]
ldr q31, [x1, 64]
stp q30, q29, [x0, 32]
str q31, [x0, 64]
bl  h
ldp x29, x30, [sp], 176
ret

The expansions for memcpy/move/memset no longer emit LDP directly in RTL and
now rely on the new LDP pass. Stack based loads/stores seem to confuse its
alias checks and it gives up.

Using -fno-schedule-insns fixes this example, but not all cases.

[Bug target/114890] [14/15 Regression] Big-endian addp intrinsics reorder operands

2024-04-29 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114890

Wilco  changed:

   What|Removed |Added

   Target Milestone|--- |15.0
 Target||aarch64-*-*

[Bug target/114890] New: [14/15 Regression] Big-endian addp intrinsics reorder operands

2024-04-29 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114890

Bug ID: 114890
   Summary: [14/15 Regression] Big-endian addp intrinsics reorder
operands
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

The following example:

#include "arm_neon.h"

uint32x4_t test (uint32x4_t v1, uint32x4_t v2)
{
  return vpaddq_u32 (v1, v2);
}

compiles with -O2 -mlittle-endian into:

test:
addp v0.4s, v0.4s, v1.4s
ret

However -O2 -mbig-endian gives the incorrect:

addp v0.4s, v1.4s, v0.4s
ret

[Bug target/114843] aarch64: epilogue in _Unwind_RaiseException corrupts return value due to __builtin_eh_return

2024-04-26 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114843

--- Comment #17 from Wilco  ---
(In reply to Andrew Pinski from comment #16)
> Patch posted with all of the testcases included:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-April/650080.html

Not nearly enough testcases... What about:

extern int arr[];  /* assumed declaration so the example compiles */
void g(void);
int f(long offset, void *handler)
{
  g();
  if (offset > 5)
return arr[offset];
  __builtin_eh_return (offset, handler);
}

With -O2 -fomit-frame-pointer:

f:
.LFB0:
.cfi_startproc
stp x30, x0, [sp, -64]!
.cfi_def_cfa_offset 64
.cfi_offset 30, -64
.cfi_offset 0, -56
stp x1, x2, [sp, 16]
stp x3, x19, [sp, 32]
.cfi_offset 1, -48
.cfi_offset 2, -40
.cfi_offset 3, -32
.cfi_offset 19, -24
mov x19, x0
str x20, [sp, 48]
.cfi_offset 20, -16
mov x20, x1
bl  g
cmp x19, 5
ble .L8
mov w0, w19
ldp x19, x20, [sp, 40]
ldp x30, x0, [sp], 64   ** oops
.cfi_remember_state
.cfi_restore 0
.cfi_restore 30
.cfi_restore 19
.cfi_restore 20
.cfi_def_cfa_offset 0
ret
.L8:
.cfi_restore_state
mov x5, x19
ldp x1, x2, [sp, 16]
mov x6, x20
ldp x3, x19, [sp, 32]
ldr x20, [sp, 48]
ldp x30, x0, [sp], 64
.cfi_restore 0
.cfi_restore 30
.cfi_restore 20
.cfi_restore 3
.cfi_restore 19
.cfi_restore 1
.cfi_restore 2
.cfi_def_cfa_offset 0
add sp, sp, x5
br  x6
.cfi_endproc

So I don't believe you should change aarch64_pop_regs at all - it's too late to
change things and just adds unnecessary complexity and more bugs. The best
option would be to handle eh_return explicitly and insert the extra push/pops
rather than treating them like a generic callee-save (because clearly they are
not anymore).

[Bug target/114843] aarch64: epilogue in _Unwind_RaiseException corrupts return value due to __builtin_eh_return

2024-04-25 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114843

--- Comment #13 from Wilco  ---
(In reply to Andrew Pinski from comment #11)
> I have a fix for aarch64, able to produce now:
> ```
> f:
> .LFB0:
> .cfi_startproc
> stp x0, x1, [sp, -32]!
> .cfi_def_cfa_offset 32
> .cfi_offset 0, -32
> .cfi_offset 1, -24
> stp x2, x3, [sp, 16]
> .cfi_offset 2, -16
> .cfi_offset 3, -8
> ldr w0, [x0]
> cmp w0, 5
> bne .L8
> add sp, sp, 32
> .cfi_remember_state
> .cfi_def_cfa_offset 0
> ret
> .L8:
> .cfi_restore_state
> mov x5, x1
> ldp x2, x3, [sp, 16]
> ldp x0, x1, [sp], 32
> .cfi_restore 1
> .cfi_restore 0
> .cfi_restore 2
> .cfi_restore 3
> .cfi_def_cfa_offset 0
> add sp, sp, x5
> ret
> .cfi_endproc
> ```
> 
> Which is exactly what we should produce I think.
> The patch is a bit more complex than I expected but that is due to how
> aarch64 has some of the most complex epilogues.

I'm not convinced that is an easy solution. Try various cases with large stack
sizes, alloca and other scalar and FP callee-saves. Getting all cases right and
writing good tests for them is a lot of work.

[Bug target/114843] aarch64: epilogue in _Unwind_RaiseException corrupts return value due to __builtin_eh_return

2024-04-25 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114843

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #10 from Wilco  ---
(In reply to Andrew Pinski from comment #9)
> Just a quick note here. Even though eh_return pattern was removed with
> r7-6051-g8144a493ddc008, it was broken before that patch.

Yeah I only fixed the broken behaviours that I encountered at the time - no
tests tried to return a value on the non-exception path. There is no clear
specification (eg. making it clear that EH_RETURN_DATA_REGNO must not overlap
with registers used to return or if they do, you need to conditionally restore
them), so no wonder that many targets get this wrong. Who knew that introducing
lots of complex builtins that affect prolog and epilog generation in a major
way to avoid a few lines of assembly code was such a bad idea...

Since the whole eh_return is an internal ABI in libgcc, a fix would be to
change EH_RETURN_DATA_REGNO(N) to avoid x0 and x1. Since eh_return already
reserves 7 registers(!) and now also needs to avoid x0/x1, using x2-x5 plus
x6, x7 and x9 for the other special registers should work.
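
A sketch of that direction (an assumption, not a tested patch; the macro shape
follows the usual GCC convention for this hook):

/* Hypothetical: start the EH data registers at x2 so they can never
   overlap the normal return registers x0/x1.  */
#define EH_RETURN_DATA_REGNO(N) ((N) < 4 ? R2_REGNUM + (N) : INVALID_REGNUM)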

[Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations

2024-04-17 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

--- Comment #7 from Wilco  ---
(In reply to Tamar Christina from comment #6)
> and the exact armv9-a cost model you quoted, also does the right codegen.
> https://godbolt.org/z/obafoT6cj
> 
> There is just an inexplicable penalty being applied to the r->r alternative.

Indeed it is not related to the cost model - building SPEC shows a significant
regression (~1%) with -mcpu=neoverse-v1 due to AND immediate being quite common
in scalar code. The '^' incorrectly forces many cases to use the SVE
alternative.

[Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations

2024-04-16 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

--- Comment #2 from Wilco  ---
It looks like the underlying bug is '^' being incorrectly treated like '?' in
record_reg_classes (which is never used during reload). Fixing that results in
the expected code being generated in all cases. It looks like this issue was
introduced in the original commit d1457701461d5a49ca6b5d8a6d1c83a37a6dc771

[Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations

2024-04-16 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #1 from Wilco  ---
This example always goes wrong:

void foo2(unsigned *p)
{
*p &= 1;
}

Eg. with -mcpu=neoverse-v1:

ldr s31, [x0]
and z31.s, z31.s, #1
str s31, [x0]
ret

This doesn't make any sense since there are usually fewer vector units than
integer ALUs, and they typically have higher latency.

[Bug target/113986] [14 regression] Build failure on aarch64-linux-musl or if ifunc support is disabled (error: 'export_load_16' aliased to undefined symbol 'libat_load_16')

2024-04-08 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113986

Wilco  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Wilco  ---
Fixed

[Bug middle-end/110773] [Aarch64] crash (SIGBUS) due to atomic instructions on under-aligned memory

2024-04-04 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110773

--- Comment #8 from Wilco  ---
(In reply to Sainan from comment #7)
> (In reply to Wilco from comment #6)
> > That does not make any sense. The only thing I think might happen is that
> > your structure is not correctly aligned (for example by using a custom
> > memory allocator). Can you check the address of count when it fails? (should
> > be in the crash logs, or you can see it in gdb or just printf it).
> 
> I feel silly for not thinking of printing the address, but now that I did, I
> see the final hexit is '9' and so it just so happens this CPU can't deal
> with that...

So it's unaligned then, and that's not supported. And you're lucky your
specific alignment happens to work on v8.4 cores - it would fail for other
offsets.

[Bug middle-end/110773] [Aarch64] crash (SIGBUS) due to atomic instructions on under-aligned memory

2024-04-04 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110773

--- Comment #6 from Wilco  ---
(In reply to Sainan from comment #5)
> (In reply to Wilco from comment #4)
> > The atomic will also set correct struct alignment.
> 
> My thinking was that maybe this is not the case (= standard library issue)
> since both GCC and Clang seem to be causing this issue, but manually adding
> alignas(16) also didn't help.
> 
> > You would get a crash if you build for LSE so you get a LDADDAL instruction
> > and then run it on a CPU that doesn't. So try -mcpu=native and it should
> > work.
> 
> -mcpu=native didn't fix the SIGBUS, only removed __aarch64_ldadd4_acq_rel
> from the stack trace.
> 
> FWIW, the CPU on this system where I get the SIGBUS is Cortex-A76, which
> should support LSE and atomics, but it seems everytime it encounters
> atomics, it just throws a SIGBUS. It works fine on Snapdragon 8cx Gen 3.

That does not make any sense. The only thing I think might happen is that your
structure is not correctly aligned (for example by using a custom memory
allocator). Can you check the address of count when it fails? (should be in the
crash logs, or you can see it in gdb or just printf it).

[Bug middle-end/110773] [Aarch64] crash (SIGBUS) due to atomic instructions on under-aligned memory

2024-04-04 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110773

--- Comment #4 from Wilco  ---
(In reply to Sainan from comment #3)
> I seem to be having a related issue, although in my case the struct looks
> like this:
> 
> template <typename T>
> struct Data
> {
> T* data;
> std::atomic_uint count;
> bool flag;
> };
> 
> And it's crashing on `--count;`
> 
> Surely this is not a user issue in this case because the pointer should
> always be 8 bytes, so count should be evenly aligned on a 8-byte boundary.
> (Unless the atomic operation needs 16-byte alignment?)

The atomic will also set correct struct alignment.

> Same code also runs fine when compiled via MSVC and run on Windows, although
> it's unclear if this might simply be my Linux test machine running an older
> ARM CPU compared to my Windows on ARM test machine.

You would get a crash if you build for LSE so you get a LDADDAL instruction and
then run it on a CPU that doesn't. So try -mcpu=native and it should work.

[Bug rtl-optimization/93565] [11/12/13 Regression] Combine duplicates instructions

2024-04-03 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #31 from Wilco  ---
(In reply to Andrew Pinski from comment #29)
> Looking back at this one, I (In reply to Wilco from comment #8)
> > Here is a much simpler example:
> > 
> > void f (int *p, int y)
> > {
> >   int a = y & 14;
> >   *p = a | p[a];
> > }
> After r14-9692-g839bc42772ba7af66af3bd16efed4a69511312ae, we now get:
> f:
> .LFB0:
> .cfi_startproc
> and w2, w1, 14
> mov x1, x2
> ldr w2, [x0, x2, lsl 2]
> orr w1, w2, w1
> str w1, [x0]
> ret
> .cfi_endproc
> 
> There is an extra move still but the duplicated and is gone. (with
> -frename-registers added, the move is gone as REE is able to remove the zero
> extend but then there is a life range conflict so can't remove the move too).

Even with the mov it is better since the move can be done with zero latency in
the rename stage on most CPUs.

> So maybe this should be closed as fixed for GCC 14 and the cost changes for
> clz reverted.

The ctz costs are correct since it is a 2-instruction sequence - it only needs
adjusting for CSSC.

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-03-13 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

Wilco  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #6 from Wilco  ---
Fixed.

[Bug target/113915] [14 regression] glibc's _dl_find_object_update_1 miscompiled for armv7a since r14-4365-g0731889c026bfe

2024-03-06 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113915

Wilco  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #15 from Wilco  ---
Fixed on trunk.

[Bug target/113986] [14 regression] Build failure on aarch64-linux-musl or if ifunc support is disabled (error: 'export_load_16' aliased to undefined symbol 'libat_load_16')

2024-02-23 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113986

--- Comment #4 from Wilco  ---
Patch: https://gcc.gnu.org/pipermail/gcc-patches/2024-February/646408.html

[Bug target/113915] [14 regression] glibc's _dl_find_object_update_1 miscompiled for armv7a since r14-4365-g0731889c026bfe

2024-02-21 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113915

--- Comment #13 from Wilco  ---
Patch: https://gcc.gnu.org/pipermail/gcc-patches/2024-February/646189.html

[Bug target/113986] [14 regression] Build failure on aarch64-linux-musl or if ifunc support is disabled (error: 'export_load_16' aliased to undefined symbol 'libat_load_16')

2024-02-19 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113986

Wilco  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
(In reply to Andrew Pinski from comment #1)
> I am 99% sure it was caused by r14-6589-g3fa689f6ed8387 .
> 
> It is reproducible with --disable-gnu-indirect-function on the gcc configure
> line for a glibc build even without the patch for PR 113971 so confirmed. 
> 
> 
> Moving the definition of DONE for N==16 case to be under the `#if
> HAVE_IFUNC` case fixes the issue. I don't know if that is the correct fix or
> not ...

Confirmed. I had never heard of that config - at the time I tried it on an old
system with GCC4.8 and that built and passed all tests. I can't see a reason to
ever switch off ifuncs...

The !HAVE_IFUNC case needs to also define IFUNC_ALT 1 and atomic_16.S must add
aliases to __atomic_load_16 etc. Then it works fine. I'll send a patch.

[Bug target/113915] [14 regression] glibc's _dl_find_object_update_1 miscompiled for armv7a since r14-4365-g0731889c026bfe

2024-02-14 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113915

Wilco  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org

--- Comment #11 from Wilco  ---
Yes the default for "conds" attribute is incorrect and at odds with the
"predicable" attribute. The fix should work but will disable conditional
execution on a few ARM-only patterns that just have "conds" attribute. Any
shared patterns will be OK since they already need to set "predicable" for
Thumb-2.

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-01-31 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

Wilco  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org

--- Comment #4 from Wilco  ---
(In reply to Alex Coplan from comment #1)
> Confirmed.
> 
> (In reply to Wilco from comment #0)
> > A possible fix would be to avoid emitting LDP/STP in memcpy/memmove/memset
> > expansions.
> 
> Yeah, so I had posted
> https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636855.html for that
> but held off from committing it at the time as IMO there wasn't enough
> evidence to show that this helps in general (and the pass could in theory
> miss opportunities which would lead to regressions). 
> 
> But perhaps this is a good argument for going ahead with that change (of
> course it will need rebasing).

Yes I have a patch based on current trunk + my outstanding memset cleanup
patch. It's slightly faster but causes a small codesize regression. This
appears mostly due to GCC being overly aggressive in changing loads/stores with
a zero offset into indexing, a non-zero offset or a lo_sym. This not only
blocks LDP opportunities but also increases register pressure and spilling.

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-01-29 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

--- Comment #3 from Wilco  ---
(In reply to Richard Biener from comment #2)
> It might be good to recognize this pattern in strlenopt or a related pass.
> 
> A purely local transform would turn it into
> 
> memcpy (temp, a, 64);
> memmove (b, a, 64);
> 
> relying on DSE to eliminate the copy to temp if possible.  Not sure if
> that possibly would be a bad transform if copying to temp is required.

This would only be beneficial if you know memmove is inlined if memcpy is - on
almost all targets memmove becomes a library call, so the transformation would
be worse if memcpy can be inlined.

> stp q30, q31, [sp]
> ldp q30, q31, [sp]
> 
> why is CSE not able to catch this?

The new RTL now has UNSPECs in them, so CSE doesn't know it is a plain
load/store:

STP: 

(insn 12 11 13 2 (set (mem/c:V2x16QI (reg:DI 102) [0 +0 S32 A128])
(unspec:V2x16QI [
(reg:V4SI 104)
(reg:V4SI 105)
] UNSPEC_STP)) "/app/example.c":5:5 -1
 (nil))

LDP:

(insn 16 15 17 2 (parallel [
(set (reg:V4SI 108)
(unspec:V4SI [
(mem/c:V2x16QI (reg:DI 107) [0 +0 S32 A128])
] UNSPEC_LDP_FST))
(set (reg:V4SI 109)
(unspec:V4SI [
(mem/c:V2x16QI (reg:DI 107) [0 +0 S32 A128])
] UNSPEC_LDP_SND))
]) "/app/example.c":6:5 -1
 (nil))

[Bug target/113618] New: [14 Regression] AArch64: memmove idiom regression

2024-01-26 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

Bug ID: 113618
   Summary: [14 Regression] AArch64: memmove idiom regression
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

The following is often used as an idiom for memmove since GCC mid-end and most
back-ends have no support for inlining memmove:

void move64 (char *a, char *b)
{
char temp[64];
memcpy (temp, a, 64);
memcpy (b, temp, 64);
}

On trunk this generates:

ldp q30, q31, [x0]
sub sp, sp, #64
ldp q28, q29, [x0, 32]
stp q30, q31, [sp]
ldp q30, q31, [sp]
stp q28, q29, [sp, 32]
ldp q28, q29, [sp, 32]
stp q30, q31, [x1]
stp q28, q29, [x1, 32]
add sp, sp, 64
ret

This is a significant regression from GCC13 which has redundant stores but
avoids load-after-store forwarding penalties:

ldp q2, q3, [x0]
sub sp, sp, #64
ldp q0, q1, [x0, 32]
stp q2, q3, [sp]
stp q2, q3, [x1]
stp q0, q1, [sp, 32]
stp q0, q1, [x1, 32]
add sp, sp, 64
ret

LLVM avoids writing to the temporary and removes the stackframe altogether:

ldp q1, q0, [x0, #32]
ldp q2, q3, [x0]
stp q1, q0, [x1, #32]
stp q2, q3, [x1]
ret

The reason for the regression appears to be the changed RTL representation of
LDP/STP. The RTL optimizer does not understand LDP/STP, so emitting LDP/STP
early in memcpy expansion means it cannot remove the redundant stack stores.

A possible fix would be to avoid emitting LDP/STP in memcpy/memmove/memset
expansions.

[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64

2023-12-22 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061

Wilco  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

--- Comment #16 from Wilco  ---
Fixed by
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=3fa689f6ed8387d315e58169bb9bace3bd508c0a

libatomic: Enable lock-free 128-bit atomics on AArch64

Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible with
existing binaries (as for these GCC always calls into libatomic, so all 128-bit
atomic uses in a process are switched), gives better performance than locking
atomics and is what most users expect.

128-bit atomic loads use a load/store exclusive loop if LSE2 is not supported.
This results in an implicit store which is invisible to software as long as the
given address is writeable (which will be true when using atomics in real
code).

This doesn't yet change __atomic_is_lock_free even though all atomics are
finally lock-free on AArch64.

libatomic:
* config/linux/aarch64/atomic_16.S: Implement lock-free ARMv8.0
atomics.
(libat_exchange_16): Merge RELEASE and ACQ_REL/SEQ_CST cases.
* config/linux/aarch64/host-config.h: Use atomic_16.S for baseline
v8.0.

[Bug target/112573] Suboptimal code generation with `-fdata-sections` on aarch64

2023-11-20 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112573

Wilco  changed:

   What|Removed |Added

   Last reconfirmed||2023-11-20
 Ever confirmed|0   |1
 CC||wilco at gcc dot gnu.org
 Status|UNCONFIRMED |NEW

--- Comment #3 from Wilco  ---
We should reassociate the immediate last for more optimal addressing like LLVM:

adrp x8, a
add x8, x8, :lo12:a
lsr w9, w0, #8
add x8, x8, w1, sxtw
strb w9, [x8, #1]
lsr w9, w0, #16
strb w0, [x8, #3]
strb w9, [x8, #2]
lsr w9, w0, #24
strb w9, [x8]
ret

However GCC's reassociation is incorrect - it has been for many years and
things got much worse in GCC12...

As a result we may merge the immediate offset into the base address like in
'h'. Using -fdata-sections behaves like -fno-section-anchors, so it works as
expected (and 'extern' is the same as well). We could block merging offsets to
get more address CSEs if that ends up better overall.

[Bug tree-optimization/90693] Missing popcount simplifications

2023-11-20 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90693

--- Comment #6 from Wilco  ---
Thanks Jakub - with the 2nd patch we get the expected sequence on AArch64:

sub x1, x0, #1
eor x0, x0, x1
cmp x0, x1
cset x0, hi
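
For reference, the kind of source this corresponds to (illustrative, not the
PR's exact testcase) is a popcount compared against 1:

int is_power_of_2 (unsigned long x)
{
  /* Simplified to (x ^ (x - 1)) > (x - 1), i.e. the sub/eor/cmp/cset
     sequence above, instead of a popcount instruction.  */
  return __builtin_popcountl (x) == 1;
}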

[Bug target/112426] sched1 pessimizes codegen on aarch64 by increasing register pressure

2023-11-09 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112426

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #4 from Wilco  ---
That first REG_DEAD note after scheduling looks wrong:

   15: x0:DI=r93:DI+0x10
  REG_DEAD r93:DI
8: [r93:DI]=r98:DI
  REG_DEAD r98:DI
9: [r93:DI+0x8]=r99:DI

[Bug target/112465] libgcc: aarch64: lse runtime does not work with big data segments

2023-11-09 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112465

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
-mcmodel=large is not well supported in general (no support for PIC/PIE, not
well optimized or tested). The newly designed medium model will be far better,
but until that is implemented it is best to use -mcpu=native and only use
-mcmodel=large if there is no other option.

[Bug target/111416] [Armv7/v8 Mixing Bug]: 64-bit Sequentially Consistent Load can be Reordered before Store of RMW when v7 and v8 Implementations are Mixed

2023-10-31 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111416

Wilco  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=111235

--- Comment #3 from Wilco  ---
Fixed by commit r14-4365-g0731889c026bfe8d55c4851422ca5ec9d037f7a0 

#include <stdatomic.h>
#include <stdint.h>

int64_t f (_Atomic int64_t *p)
{
  return atomic_load (p);
}

now generates with -O2 -mcpu=cortex-a15:

dmb ish
ldrd r0, r1, [r0]
dmb ish
bx  lr

[Bug target/111416] [Armv7/v8 Mixing Bug]: 64-bit Sequentially Consistent Load can be Reordered before Store of RMW when v7 and v8 Implementations are Mixed

2023-10-31 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111416

Wilco  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Wilco  ---
Fixed by commit r14-4365-g0731889c026bfe8d55c4851422ca5ec9d037f7a0 

#include <stdatomic.h>
#include <stdint.h>

int64_t f (_Atomic int64_t *p)
{
  return atomic_load (p);
}

now generates with -O2 -mcpu=cortex-a15:

dmb ish
ldrd r0, r1, [r0]
dmb ish
bx  lr

[Bug target/111235] [Armv7-a]: Control-dependency between atomic accesses removed by -O1.

2023-10-31 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111235

Wilco  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #6 from Wilco  ---
Fixed

[Bug target/111121] AArch64: MOPS memmove operand corruption

2023-09-29 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111121

Wilco  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Wilco  ---
Fixed on trunk, backported to GCC13 and GCC12.

[Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64

2023-09-28 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611

Wilco  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2023-09-28

--- Comment #5 from Wilco  ---
(In reply to Mathias Stearn from comment #4)
> clang has already been using the optimized memcmp code since v16, even at
> -O1: https://www.godbolt.org/z/qEd768TKr. Older versions (at least since v9)
> were still branch-free, but via a less optimal sequence of instructions.
> 
> GCC's code gets even more ridiculous at 32 bytes, because it does a branch
> after every 8-byte compare, while the clang code is fully branch-free (not
> that branch-free is always better, but it seems clearly so in this case).
> 
> Judging by the codegen, there seems to be three deficiencies in GCC: 1) an
> inability to take advantage of the load-pair instructions to load 16-bytes
> at a time, and 2) an inability to use ccmp to combine comparisons. 3) using
> branching rather than cset to fill the output register. Ideally these could
> all be done in the general case by the low level instruction optimizer, but
> even getting them special cased for memcmp (and friends) would be an
> improvement.

I think 1, 2 and 3 are all related due to not having a TImode compare pattern,
so GCC splits things into 8-byte chunks using branches. We could add that and
see whether the result is better or add a backend expander for memcmp similar
to memset and memcpy.

Note what LLVM does is terrible: a 64-byte memcmp is ridiculously inefficient
due to long dependency chains, loading and comparing every byte even if there
is a mismatch in byte 0. So it's obviously better to use branches.
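
For reference, the kind of call this covers (illustrative) is an equality-only
test, where the expansion could use load-pair plus ccmp/cset instead of a
branch after every 8-byte compare:

#include <string.h>

int equal32 (const void *a, const void *b)
{
  /* Only the == 0 result is used, so the byte-wise ordering result of
     memcmp is not needed.  */
  return memcmp (a, b, 32) == 0;
}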

[Bug target/103100] [11/12/13/14 Regression] unaligned access generated with memset or {} and -O2 -mstrict-align

2023-09-20 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103100

--- Comment #24 from Wilco  ---
Patch to avoid emitting unaligned LDP/STP with -mstrict-align:
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631022.html

[Bug target/105928] [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)

2023-09-18 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928

Wilco  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED
 Target|arm64-*-*   |aarch64
   Target Milestone|--- |14.0

--- Comment #5 from Wilco  ---
Fixed

[Bug target/111404] [AArch64] 128-bit __sync_val_compare_and_swap is not atomic

2023-09-14 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111404

Wilco  changed:

   What|Removed |Added

   Last reconfirmed||2023-09-14
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
 Target|arm64-*-*   |aarch64

--- Comment #1 from Wilco  ---
Patch: https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630198.html

[Bug target/111416] [Armv7/v8 Mixing Bug]: 64-bit Sequentially Consistent Load can be Reordered before Store of RMW when v7 and v8 Implementations are Mixed

2023-09-14 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111416

Wilco  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 CC||wilco at gcc dot gnu.org
   Last reconfirmed||2023-09-14
 Target||arm-*
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org
  Component|translation |target

--- Comment #1 from Wilco  ---
This will be fixed by
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629607.html

[Bug target/105928] [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)

2023-09-14 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928

--- Comment #3 from Wilco  ---
Patch: https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630358.html

[Bug target/111404] New: [AArch64] 128-bit __sync_val_compare_and_swap is not atomic

2023-09-13 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111404

Bug ID: 111404
   Summary: [AArch64] 128-bit __sync_val_compare_and_swap is not
atomic
   Product: gcc
   Version: 8.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

This compiles

__int128 f(__int128 *p, __int128 *q, __int128 x)
{
  return __sync_val_compare_and_swap (p, *q, x);
}

into:

f:
ldp x6, x7, [x1]
mov x4, x0
.L3:
ldxp x0, x1, [x4]
cmp x0, x6
ccmp x1, x7, 0, eq
bne .L4
stlxp   w5, x2, x3, [x4]
cbnz w5, .L3
.L4:
dmb ish
ret

This means that if the compare fails, we return the value loaded via LDXP.
However, unless the STLXP succeeds, this returned value is not single-copy
atomic.

So on failure we still need to execute STLXP.

[Bug target/105928] [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)

2023-09-13 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
Shifted logical operations are single cycle on all recent cores.
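
For reference, the constant pattern the PR is about (illustrative example):
both 32-bit halves are identical, so the low half can be materialized once and
then combined with itself shifted left by 32 using an ADD or ORR with lsl 32:

unsigned long long repeated_halves (void)
{
  /* High and low 32-bit halves are both 0x12345678.  */
  return 0x1234567812345678ULL;
}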

[Bug middle-end/110773] [Aarch64] crash (SIGBUS) due to atomic instructions on under-aligned memory

2023-09-07 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110773

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
This is really a user error, not a compiler issue. Just write it like:

struct Storage {
std::atomic<float> fp1;
float padding;
std::atomic<float> fp2;
} storage;

This ensures the correct alignment required for atomic accesses of fp1/fp2.

[Bug target/95751] [aarch64] Consider using ldapr for __atomic_load_n(acquire) on ARMv8.3-RCPC

2023-09-07 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95751

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org
 Resolution|--- |FIXED
 Status|NEW |RESOLVED
   Target Milestone|--- |13.0

--- Comment #2 from Wilco  ---
Fixed in GCC13.

[Bug target/111121] AArch64: MOPS memmove operand corruption

2023-08-23 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111121

Wilco  changed:

   What|Removed |Added

   Target Milestone|--- |14.0
 Target||AArch64

[Bug target/111121] AArch64: MOPS memmove operand corruption

2023-08-23 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111121

Wilco  changed:

   What|Removed |Added

   Last reconfirmed||2023-08-23
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org
 Ever confirmed|0   |1
  Known to fail||12.0, 13.0
 Status|UNCONFIRMED |ASSIGNED

[Bug target/111121] New: AArch64: MOPS memmove operand corruption

2023-08-23 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111121

Bug ID: 111121
   Summary: AArch64: MOPS memmove operand corruption
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

Since GCC 12.0 the following example corrupts x0 when built with -O2
-march=armv8.6-a+mops:

int *f (int *p, int *q, long n) { memmove (p, q, n); return p; }

f:
cpyp [x0]!, [x1]!, x2!
cpym [x0]!, [x1]!, x2!
cpye [x0]!, [x1]!, x2!
ret

The expansion for memcpy works differently and inserts a copy to a temporary.

-mstrict-align is ignored with small constant-sized memcpy if MOPS is enabled:

void g(int *p, int *q) { memcpy (p, q, 32); }

g:
ldp q0, q1, [x1]
stp q0, q1, [x0]
ret

[Bug target/106671] aarch64: BTI instruction are not inserted for cross-section direct calls

2023-08-21 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671

--- Comment #17 from Wilco  ---
(In reply to Mark Brown from comment #13)
> The kernel hasn't got any problem with BTI as far as I am aware - when built
> with clang we run the kernel with BTI enabled since clang does just insert a
> BTI C at the start of every function, and GCC works fine so long as we don't
> get any out of range jumps being generated. The issue is that we don't have
> anything to insert veneers in the case where section placement puts static
> functions into a distant enough part of memory to need an indirect jump but
> GCC has decided to omit the landing pad.

Is the kernel's .text already larger than 128 MBytes? Or do people do weird stuff
with section placement that causes branches to be out of range?

[Bug target/106671] aarch64: BTI instruction are not inserted for cross-section direct calls

2023-08-11 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #10 from Wilco  ---
(In reply to Feng Xue from comment #9)
> On some occasions, we may not use the new ld, the kernel-building relies on
> its own runtime linker which is used for kernel modules. So I created a
> patch (https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626084.html),
> and this provides user another option that could be done at the compiler
> side.

Reducing BTI is important for security. With LTO a binary should only have BTI
on functions that are indirectly called. So I don't like the idea of adding
more BTI with a new option - it means we will need a linker optimization to
remove those redundant BTIs (eg. by changing them into NOPs).

Note that branch offsets up to 256MB don't need special veneer handling: one
should place a direct branch about halfway to the destination.

Does Linux do any weird hacks with -fpatchable-function-entry that make it hard
to use BTI?

[Bug middle-end/110791] [12/13/14 Regression] arm: Wrong code with -Os -march=armv8.1-m.main

2023-07-24 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110791

Wilco  changed:

   What|Removed |Added

 Ever confirmed|0   |1
  Component|rtl-optimization|middle-end
 Status|UNCONFIRMED |NEW

--- Comment #6 from Wilco  ---
(In reply to Alex Coplan from comment #5)
> Thanks Wilco for the simpler example.
> 
> It seems to have started with
> r13-1268-g8c99e307b20c502e55c425897fb3884ba8f05882 with both of these
> testcases, but it's probably a latent issue elsewhere (since it doesn't seem
> to show up on other targets). Needs more analysis.

The bug happens on all targets with -fmodulo-sched -Os, eg. AArch64:

bug:
sub x2, x1, x0
add x2, x2, 1
cmp x0, x1
bhi .L7
cmn x0, #1
bne .L9
.L7:
mov x2, 1
.L9:
subs x2, x2, #1
beq .L1
ldrb w0, [x1, -1]
cmp w0, 47
beq .L6
.L1:
ret
.L6:
sub x1, x1, #1
b   .L9

[Bug rtl-optimization/110791] [12/13/14 Regression] arm: Wrong code with -Os -march=armv8.1-m.main

2023-07-24 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110791

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #4 from Wilco  ---
Simpler example:

void f(void);

void bug (char *path, char *p)
{
  while( p > path && p[-1] == '/' )
p--;
  if (p < path)
f();
}

bug:
subs r3, r1, r0
cmp r0, r1
add r3, r3, #1
bhi .L7
adds r0, r0, #1
bne .L5
.L7:
movs r3, #1
.L5:
subs r3, r3, #1
bne .L2
bcc .L3   ** this is obviously never taken
bx  lr
.L2:
ldrb r2, [r1, #-1]!  @ zero_extendqisi2
cmp r2, #47
beq .L5
bx  lr
.L3:
b   f

[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64

2023-06-02 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061

--- Comment #14 from Wilco  ---
(In reply to Wilco from comment #13)
> (In reply to Xi Ruoyao from comment #12)
> > (In reply to Wilco from comment #11)
> > 
> > > > Then the compiler (and the standard) is not what they consider.  Such
> > > > misunderstandings are everywhere and this has no difference.
> > > 
> > > Where is int128 in "the standard"?
> > 
> > Consider this:
> > 
> > const _Atomic long double x = 0.1;
> > 
> > int main()
> > {
> > double y = x;
> > return y != 0.1;
> > }
> > 
> > If CAS is used here, the program will just segfault.  Does the standard say
> > this is ill-formed or not?
> 
> I'd say this is ill formed yes. And it will crash on Atom laptops.

Correction - it crashes on all AMD cpus too. Are you going to file bugreports
for this?

[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64

2023-06-02 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061

--- Comment #13 from Wilco  ---
(In reply to Xi Ruoyao from comment #12)
> (In reply to Wilco from comment #11)
> 
> > > Then the compiler (and the standard) is not what they consider.  Such
> > > misunderstandings are everywhere and this has no difference.
> > 
> > Where is int128 in "the standard"?
> 
> Consider this:
> 
> const _Atomic long double x = 0.1;
> 
> int main()
> {
>   double y = x;
>   return y != 0.1;
> }
> 
> If CAS is used here, the program will just segfault.  Does the standard say
> this is ill-formed or not?

I'd say this is ill formed yes. And it will crash on Atom laptops.

[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64

2023-06-02 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061

--- Comment #11 from Wilco  ---
(In reply to Xi Ruoyao from comment #10)
> (In reply to Wilco from comment #9)
> > (In reply to Xi Ruoyao from comment #8)
> > > (In reply to Wilco from comment #7)
> > > > I don't see the issue you have here. GCC for x86/x86_64 has been using
> > > > compare exchange for atomic load (which always does a write even if the
> > > > compare fails) for many years.
> > > 
> > > No we don't, since r7-6454.
> > 
> > Incorrect - libatomic still uses cmpxchg16b depending on the CPU.
> 
> You are incorrect.  It checks cmpxchg16b bit in CPUID but does not use the
> cmpxchg16b instruction.

No, it will use the cmpxchg16b instruction in the other ifunc when AVX is not
available. Libatomic will fall back to locking atomics if neither AVX nor
cmpxchg16b are available (first few generations of x86_64).

> The reason to check cmpxchg16b is both Intel and AMD guarantee that if both
> cmpxchg16b and AVX are available, then an aligned 16-byte load with vmovdqa
> is atomic.  So we can use vmovdqa to do a lock-free load then.  But using
> cmpxchg16b for a load is still wrong, and libatomic do NOT use it.
> 
> > > > The question is, do you believe compilers should provide users with 
> > > > fast and
> > > > efficient atomics they need? Or do you want to force every application 
> > > > to
> > > > implement their own version of 128-bit atomics?
> > > 
> > > But a compiler must generate correct code first.  They can use the 
> > > wonderful
> > > inline assembly because they know CAS is safe in their case, but the
> > > compiler does not know.
> > 
> > Many developers consider locking atomics fundamentally incorrect. If we emit
> > lock-free atomics they don't need to write inline assembler.
> 
> Then the compiler (and the standard) is not what they consider.  Such
> misunderstandings are everywhere and this has no difference.

Where is int128 in "the standard"?

[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64

2023-06-02 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061

--- Comment #9 from Wilco  ---
(In reply to Xi Ruoyao from comment #8)
> (In reply to Wilco from comment #7)
> > I don't see the issue you have here. GCC for x86/x86_64 has been using
> > compare exchange for atomic load (which always does a write even if the
> > compare fails) for many years.
> 
> No we don't, since r7-6454.

Incorrect - libatomic still uses cmpxchg16b depending on the CPU.

> > The question is, do you believe compilers should provide users with fast and
> > efficient atomics they need? Or do you want to force every application to
> > implement their own version of 128-bit atomics?
> 
> But a compiler must generate correct code first.  They can use the wonderful
> inline assembly because they know CAS is safe in their case, but the
> compiler does not know.

Many developers consider locking atomics fundamentally incorrect. If we emit
lock-free atomics they don't need to write inline assembler.

[Bug rtl-optimization/109930] transform atomic exchange to unconditional store when old value is unused?

2023-05-31 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109930

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #4 from Wilco  ---
(In reply to Simon Richter from comment #3)
> I was looking at ARMv7 initially.
> 
> If I understood the implementation correctly, this can be a generic
> optimization.

This optimization is only valid for release or relaxed semantics, otherwise you
remove the acquire semantics of the exchange (without proof this is 100% safe,
this will likely allow an illegal reordering).

Btw if you know the old state then there is presumably no concurrent access
here and so you don't need atomic, let alone sequential consistency.
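
A minimal illustration of the distinction (using C11 <stdatomic.h>; not code
from the PR):

#include <stdatomic.h>

void reset_relaxed (atomic_int *p)
{
  /* Result unused and relaxed ordering: may become a plain atomic store.  */
  atomic_exchange_explicit (p, 0, memory_order_relaxed);
}

void reset_seq_cst (atomic_int *p)
{
  /* Result unused, but seq_cst: turning this into a store would drop the
     acquire half of the exchange, so the transform is not safe without
     further proof.  */
  atomic_exchange_explicit (p, 0, memory_order_seq_cst);
}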

[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64

2023-05-31 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061

Wilco  changed:

   What|Removed |Added

 Resolution|DUPLICATE   |---
 Status|RESOLVED|NEW

--- Comment #7 from Wilco  ---
I don't see the issue you have here. GCC for x86/x86_64 has been using compare
exchange for atomic load (which always does a write even if the compare fails)
for many years. LLVM does the same for AArch64/x86/x86_64.

If you believe this is incorrect/invalid, do you have any evidence this causes
crashes in real applications?

As a result of GCC's bad choice of using locking atomics on AArch64, many
applications are forced to implement 128-bit atomics themselves using hacky
inline assembler. Just one example for reference:

https://github.com/boostorg/atomic/blob/08bd4e20338c503d2acfdddfdaa8f5e0bcf9006c/include/boost/atomic/detail/core_arch_ops_gcc_aarch64.hpp#L1635

The question is, do you believe compilers should provide users with fast and
efficient atomics they need? Or do you want to force every application to
implement their own version of 128-bit atomics?

[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64

2023-05-31 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061

Wilco  changed:

   What|Removed |Added

   Last reconfirmed||2023-05-31
   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=80878
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org
 Status|RESOLVED|NEW
 Resolution|DUPLICATE   |---
 Ever confirmed|0   |1

--- Comment #4 from Wilco  ---
Reopened. Please don't close bugs without allowing for discussion first. I'll
send a patch soon that shows it's possible and valid.

And if there is a better solution that results in the same benefits (fast
lock-free atomics, allowing inlining and use of latest instructions without ABI
issues) then I would love to hear ideas and suggestions.

[Bug target/110061] New: libatomic: 128-bit atomics should be lock-free on AArch64

2023-05-31 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061

Bug ID: 110061
   Summary: libatomic: 128-bit atomics should be lock-free on
AArch64
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

128-bit atomics should be lock-free on AArch64. This is what most users expect,
gives better performance and makes it possible to inline/outline the recently
added 128-bit atomic instructions. It also makes GCC and LLVM ABI compatible
(since LLVM atomics are always lock-free).
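
For reference, an example of the operations affected (illustrative): today such
calls go through libatomic, and with lock-free support they can use the 128-bit
atomic instructions directly:

__int128 exchange128 (__int128 *p, __int128 v)
{
  return __atomic_exchange_n (p, v, __ATOMIC_SEQ_CST);
}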

[Bug c/109553] Atomic operations vs const locations

2023-04-19 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109553

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
(In reply to Xi Ruoyao from comment #1)

> > but even for atomic load we may want to hint to the user to avoid doing an
> > atomic load from const types.
> 
> this does not make sense.  The "const" in "const T *" only means you cannot
> modify the object via the pointer, not mean the value of the object won't
> change.  Consider:
> 
> void thread1(int *ptr)
> {
>   /* ... */
>   __atomic_add_fetch (ptr, 1, __ATOMIC_SEQ_CST);
>   /* ... */
> }
> 
> void thread2(const int *ptr)
> {
>   /* ... */
>   int t = __atomic_load_n (ptr, __ATOMIC_SEQ_CST);
>   /* ... */
> }
> 
> It's perfectly legal the two "ptr" can point to the same object.  Then if
> you use the usual load intead of __atomic_load_n, a race will happen.

It would be legal if __atomic_load_n were documented to accept a const-qualified
argument, but it doesn't allow const:
https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins:~:text=Built%2Din%20Function%3A%20type%20__atomic_load_n%20(type%20*ptr%2C%20int%20memorder)

Since atomic accesses are about synchronizing writes with reads, a diagnostic
would be useful, particularly for the case Kyrill mentioned.

[Bug libgcc/108891] libatomic: AArch64 SEQ_CST 16-byte load missing barrier

2023-03-24 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108891

Wilco  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Wilco  ---
Fixed

[Bug libgcc/108891] libatomic: AArch64 SEQ_CST 16-byte load missing barrier

2023-02-23 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108891

Wilco  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2023-02-23
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org

[Bug libgcc/108891] New: libatomic: AArch64 SEQ_CST 16-byte load missing barrier

2023-02-22 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108891

Bug ID: 108891
   Summary: libatomic: AArch64 SEQ_CST 16-byte load missing
barrier
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libgcc
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

LSE2 uses the following sequence for a 16-byte atomic load:

ldp res0, res1, [x0]
dmb ish

The AArch64 memory model allows the LDP to be reordered with an earlier STLXP
(eg. a SEQ_CST exchange), thus breaking SEQ_CST ordering.

To avoid this, atomic loads need a barrier before the LDP - either DMB ISHLD or
LDAR works.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-17 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #21 from Wilco  ---
(In reply to Gabriel Ravier from comment #19)

> If the original code being branchless makes it faster, wouldn't that imply
> that we should use the table-based implementation when generating code for
> `__builtin_ctz` ?

__builtin_ctz is 3-4 times faster than the table implementation, so this
optimization is always worth it. This is why I believe the current situation is
not ideal since various targets still set CTZ_DEFINED_VALUE_AT_ZERO to 0 or 1.
One option would be to always allow it in Gimple (perhaps add an extra argument
for the value to return for a zero input), and at expand time check whether the
backend supports the requested value. If it doesn't, emit branches.
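
For reference, a typical instance of the table-based ctz idiom this PR detects
(the classic de Bruijn version; illustrative, not the PR's exact testcase):

static const char table[32] = {
   0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
  31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

int my_ctz (unsigned int x)
{
  /* Equal to __builtin_ctz (x) for x != 0; the open question above is what
     value should be defined for x == 0.  */
  return table[((x & -x) * 0x077CB531u) >> 27];
}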

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-17 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #17 from Wilco  ---
(In reply to Jakub Jelinek from comment #16)
> (In reply to Wilco from comment #15)
> > It would make more sense to move x86 backends to CTZ_DEFINED_VALUE_AT_ZERO
> > == 2 so that you always get the same result even when you don't have tzcnt.
> > A conditional move would be possible, so it adds an extra 2 instructions at
> > worst (ie. still significantly faster than doing the table lookup, multiply
> > etc). And it could be optimized when you know CLZ/CTZ input is non-zero.
> 
> Conditional moves are a lottery on x86, in many cases very bad idea.  And
> when people actually use __builtin_clz*, they state that they don't care
> about the 0 value, so emitting terribly performing code for it just in case
> would be wrong.
> If forwprop emits the conditional in separate blocks for the CTZ_DVAZ!=2
> case, on targets where conditional moves are beneficial for it it can also
> emit them, or emit the jump which say on x86 will be most likely faster than
> cmov.

Well GCC emits a cmov for this (-O2 -march=x86-64-v2):

int ctz(long a)
{
  return (a == 0) ? 64 : __builtin_ctzl (a);
}

ctz:
xor edx, edx
mov eax, 64
rep bsf rdx, rdi
test rdi, rdi
cmovne  eax, edx
ret

Note the extra 'test' seems redundant since IIRC bsf sets Z=1 if the input is
zero.

On Zen 2 this has identical performance to the plain builtin when you loop it
as res = ctz (res) + 1; (ie. measuring latency of non-zero case). So I find it
hard to believe cmov is expensive on modern cores.

[Bug tree-optimization/90838] Detect table-based ctz implementation

2023-02-17 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #15 from Wilco  ---
(In reply to Jakub Jelinek from comment #14)
> The patch does:
> +  bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval)
> == 2;
> +
> +  /* Skip if there is no value defined at zero, or if we can't easily
> +return the correct value for zero.  */
> +  if (!zero_ok)
> +   return false;
> +  if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size))
> +   return false;
> For CTZ_DEFINED_VALUE_AT_ZERO == 1 we could support it the same way but we'd
> need
> to emit into the IL an equivalent of val == 0 ? zero_val : .CTZ (val) (with
> GIMPLE_COND and a separate bb - not sure if anything in forwprop creates new
> basic blocks right now), where there is a high chance that RTL opts would
> turn it back into unconditional
> ctz.
> That still wouldn't help non--mbmi x86, because CTZ_DEFINED_VALUE_AT_ZERO is
> 0 there.
> We could handle even that case by doing the branches around, but those would
> stay there
> in the generated code, at which point I wonder whether it would be a win. 
> The original
> code is branchless...

It would make more sense to move x86 backends to CTZ_DEFINED_VALUE_AT_ZERO == 2
so that you always get the same result even when you don't have tzcnt. A
conditional move would be possible, so it adds an extra 2 instructions at worst
(ie. still significantly faster than doing the table lookup, multiply etc). And
it could be optimized when you know CLZ/CTZ input is non-zero.

[Bug target/108659] Suboptimal 128 bit atomics codegen on AArch64 and x64

2023-02-03 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108659

--- Comment #11 from Wilco  ---
(In reply to Niall Douglas from comment #10)
> (In reply to Jakub Jelinek from comment #9)
> > (In reply to Wilco from comment #8)
> > > Yes that sounds like a reasonable approach.
> > 
> > I don't think so.  Not all variables on which __atomic_* intrinsics are used
> > are actually _Atomic, the vars can be embedded in const aggregates etc.
> 
> I'd have the attribute propagate to enclosing types, like over-alignment.

Yes, a structure with a 128-bit Atomic type in a subfield/union would be forced
to rwdata.

And arbitrary casts (eg. from char* to an atomic type) wouldn't work due to
Atomics requiring strict alignment. A 128-bit atomic type might have a higher
alignment than a 128-bit integer so even casting that seems questionable.
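
As a rough illustration of the intent (no such attribute exists today, and the
names below are mine): the effect of the proposal is roughly "keep this object
in writable data even if the program never writes it", which today can only be
emulated by not declaring the object const.

typedef unsigned __int128 u128;

/* The proposed attribute would let 'static const u128 seq;' stay usable with
   CAS-based atomic loads; here the object is simply left non-const so it
   lands in writable data.  Link with -latomic where 16-byte atomics are not
   inlined.  */
static u128 seq;

u128 read_seq (void)
{
  return __atomic_load_n (&seq, __ATOMIC_RELAXED);
}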

[Bug target/108659] Suboptimal 128 bit atomics codegen on AArch64 and x64

2023-02-03 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108659

--- Comment #8 from Wilco  ---
(In reply to Niall Douglas from comment #7)
> (In reply to Andrew Pinski from comment #4)
> > (In reply to Niall Douglas from comment #3) 
> > > You may be interested in reading https://reviews.llvm.org/D110069. It 
> > > wanted
> > > to have LLVM generate a 128 bit AArch64 CAS for atomics. LLVM merged that
> > > change, it'll be in the next release.
> > 
> > Using CAS for atomic load is not valid thing to do ...
> > Because atomic load from constant rodata needs to work.
> > LLVM breaks this case as they don't care about it. GCC does though.
> 
> I've heard that argument before, and I've always wondered why _Atomic128
> types couldn't have an attribute which applies attribute section to their
> static const variable incarnations to force them into r/w memory. That would
> also solve the LLVM issue. Said attribute is not unuseful in general
> actually, it would help avoid having to mess with mprotect to apply copy on
> write perms on regions in .rodata when you need to modify static const
> variable values.
> 
> I don't think that the standard *guarantees* that static const variables go
> into read only memory, and besides, before C23 128 bit integers weren't
> supported anyway so one could argue as a proprietary extension (__int128)
> you get proprietary special casing.

Yes, that sounds like a reasonable approach. There will be language lawyers who
say it must also work on mmap'd memory after mprotect of course, but that seems
even more unlikely in the real world...

I believe that the vast majority of developers just want 128-bit atomics to
work efficiently without locks when possible.

Currently various packages are forced to create 128-bit atomics using inline
assembler - and that seems a much worse hack than supporting lock-free atomics
in the compiler.

[Bug target/108659] Suboptimal 128 bit atomics codegen on AArch64 and x64

2023-02-03 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108659

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #5 from Wilco  ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Niall Douglas from comment #3) 
> > You may be interested in reading https://reviews.llvm.org/D110069. It wanted
> > to have LLVM generate a 128 bit AArch64 CAS for atomics. LLVM merged that
> > change, it'll be in the next release.
> 
> Using CAS for atomic load is not valid thing to do ...
> Because atomic load from constant rodata needs to work.
> LLVM breaks this case as they don't care about it. GCC does though.

The question is how useful is this in reality? If memory is not writeable then
you can use atomic loads but no other atomic accesses.

We could be pragmatic and say that using 128-bit atomic loads from
non-writeable memory is a user error just like unaligned atomic accesses.

To me a far worse issue is that this difference for 128-bit atomics means that
LLVM and GCC are binary incompatible. AFAIK there isn't an option to make them
compatible either (on AArch64 GCC13 will use a compatible sequence only if LSE2
is available).

[Bug target/107678] [13 Regression] Segfault in aarch64_fallback_frame_state when running SVE code

2023-01-23 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678

Wilco  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Wilco  ---
Fixed

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-18 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #21 from Wilco  ---
(In reply to Jakub Jelinek from comment #20)
> __attribute__((noinline, optimize ("rounding-math"))) static int
> round_to_nearest (void) { return 1.0f - __FLT_MIN__ == 1.0f + __FLT_MIN__; }

Wouldn't that always set inexact?

> and
>   if (round_to_nearest ()) \
> _fcw = FP_RND_NEAREST; \
>   else \
> __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw)); \
> 
> Except that from _fcw we don't determine just the rounding mode but also
> what exceptions are enabled.

Yes that wouldn't work in fenv but FP emulation functions don't need to read
the exception flags.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-18 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #18 from Wilco  ---
(In reply to Michael_S from comment #12)

> This set of options does not map too well into real difficulties of
> implementation.
> There are only 2 things that are expensive:
> 1. Inexact Exception
> 2. Fetching of the current rounding mode.
> The rest of IEEE-754 features is so cheap that creating separate variants
> without them simply is not worth the effort of maintaining distinct
> variants, even if all difference is a single three-lines #ifdef

In general reading the current rounding mode is relatively cheap, but modifying
can be expensive, so optimized fenv implementations in GLIBC only modify the FP
status if a change is required. It should be feasible to check for
round-to-even and use optimized code for that case.
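
For instance, an emulation routine could take a fast path when the default
mode is active - a minimal sketch, where soft_addq and the two helpers are
hypothetical names, not libgcc functions:

#include <fenv.h>

extern _Float128 addq_to_nearest (_Float128, _Float128);
extern _Float128 addq_any_rounding (_Float128, _Float128, int mode);

_Float128 soft_addq (_Float128 x, _Float128 y)
{
  /* Reading the rounding mode is cheap; changing it is what costs.  */
  int mode = fegetround ();
  if (__builtin_expect (mode == FE_TONEAREST, 1))
    return addq_to_nearest (x, y);
  return addq_any_rounding (x, y, mode);
}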

> BTW, Inexact Exception can be made fairly affordable with a little help from
> compiler. All we need for that is ability to say "don't remove this floating
> point addition even if you don't see that it produces any effect".
> Something similar to 'volatile', but with volatile compiler currently puts
> result of addition on stack, which adds undesirable cost.
> However, judged by comment of Jakub, compiler maintainers are not
> particularly interested in this enterprise.

There are macros in GLIBC math-barriers.h which do what you want - eg. AArch64:

#define math_opt_barrier(x) \
  ({ __typeof (x) __x = (x); __asm ("" : "+w" (__x)); __x; })
#define math_force_eval(x)  \
  ({ __typeof (x) __x = (x); __asm __volatile__ ("" : : "w" (__x)); })

The first blocks optimizations (like constant folding) across the barrier, the
2nd forces evaluation of an expression even if it is deemed useless. These are
used in many math functions in GLIBC. They are target specific due to needing
inline assembler operands, but it should be easy to add similar definitions to
libgcc.
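
For instance, raising the inexact flag without the compiler folding the
operation away could look like this (a sketch reusing the AArch64 "w"
constraint from the definitions above; force_inexact is my own name):

#define math_opt_barrier(x) \
  ({ __typeof (x) __x = (x); __asm ("" : "+w" (__x)); __x; })
#define math_force_eval(x)  \
  ({ __typeof (x) __x = (x); __asm __volatile__ ("" : : "w" (__x)); })

static inline void force_inexact (void)
{
  /* 1.0f + 0x1p-149f is not exactly representable, so the addition raises
     inexact in every rounding mode; the result itself is never used.  */
  float one = math_opt_barrier (1.0f);
  math_force_eval (one + 0x1p-149f);
}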

[Bug target/108006] [13 Regression] ICE in aarch64_move_imm building 502.gcc_r

2022-12-07 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006

Wilco  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org
 Status|UNCONFIRMED |RESOLVED
 CC||wilco at gcc dot gnu.org
 Resolution|--- |FIXED

--- Comment #3 from Wilco  ---
Fixed now.

[Bug target/107678] [13 Regression] Segfault in aarch64_fallback_frame_state when running SVE code

2022-12-01 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678

Wilco  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org

--- Comment #6 from Wilco  ---
So the issue is that AArch64 return address signing was using the loc.offset
field which is now uninitialized. The fix is to check for REG_UNSAVED first and
use that to initialize the offset. I'm testing a patch.

[Bug target/107678] [13 Regression] Segfault in aarch64_fallback_frame_state when running SVE code

2022-12-01 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #5 from Wilco  ---
It's not related to SVE unwinding since adding +nosve still fails. The crash
happens after reading the correct return address from the stack. However the
top bits of this value get corrupted, resulting in an illegal access in
aarch64_fallback_frame_state (it seems doing an unconditional read is a bad
idea since at this point the RA may be corrupted).

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2022-12-01 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 107413, which changed state.

Bug 107413 Summary: Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with 
r8-7132-gb5b33e113434be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-12-01 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

Wilco  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #17 from Wilco  ---
(In reply to Rama Malladi from comment #16)
> (In reply to Wilco from comment #15)
> > (In reply to Rama Malladi from comment #14)
> > > This fix also improved performance of 538.imagick_r by 15%. Did you have a
> > > similar observation? Thank you.
> > 
> > No, but I was using -mcpu=neoverse-n1 as my baseline. It's possible
> > -mcpu=neoverse-v1 shows larger speedups, what gain do you get on the overall
> > FP score?
> 
> I was using -mcpu=native and run on a Neoverse V1 arch (Graviton3). Here are
> the scores I got (relative gains of latest mainline vs. an earlier mainline).
> 
> Latest mainline: 0976b012d89e3d819d83cdaf0dab05925b3eb3a0
> Earlier mainline: f896c13489d22b30d01257bc8316ab97b3359d1c

Right that's about 3 weeks of changes, I think
1b9a5cc9ec08e9f239dd2096edcc447b7a72f64a has improved imagick_r.

> geomean   1.03

That's a nice gain in 3 weeks!

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-11-29 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #15 from Wilco  ---
(In reply to Rama Malladi from comment #14)
> This fix also improved performance of 538.imagick_r by 15%. Did you have a
> similar observation? Thank you.

No, but I was using -mcpu=neoverse-n1 as my baseline. It's possible
-mcpu=neoverse-v1 shows larger speedups, what gain do you get on the overall FP
score?

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-11-04 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

Wilco  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2022-11-04
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org

--- Comment #10 from Wilco  ---
(In reply to Rama Malladi from comment #9)
> (In reply to Rama Malladi from comment #8)
> > (In reply to Wilco from comment #7)
> > > The revert results in about 0.5% loss on Neoverse N1, so it looks like the
> > > reassociation pass is still splitting FMAs into separate MUL and ADD 
> > > (which
> > > is bad for narrow cores).
> > 
> > Thank you for checking on N1. Did you happen to check on V1 too to reproduce
> > the perf results I had? Any other experiments/ tests I can do to help on
> > this filing? Thanks again for the debug/ fix.
> 
> I ran SPEC cpu2017 fprate 1-copy benchmark built with the patch reverted and
> using option 'neoverse-n1' on the Graviton 3 processor (which has support
> for SVE). The performance was up by 0.4%, primary contributor being
> 519.lbm_r which was up 13%.

I'm seeing about 1.5% gain on Neoverse V1 and 0.5% loss on Neoverse N1. I'll
post a patch that allows per-CPU settings for FMA reassociation, so you'll get
good performance with -mcpu=native. However reassociation really needs to be
taught about the existence of FMAs.

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be

2022-11-01 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

--- Comment #7 from Wilco  ---
(In reply to Rama Malladi from comment #5)

> So, looks like we aren't impacted much with this commit revert.
> 
> I haven't yet tried fp_reassoc_width. Will try shortly.

The revert results in about 0.5% loss on Neoverse N1, so it looks like the
reassociation pass is still splitting FMAs into separate MUL and ADD (which is
bad for narrow cores).

[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark

2022-10-26 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
That's interesting - if the reassociation pass has become a bit smarter in the
last 5 years, we might no longer need this workaround. What is the effect on
the overall SPECFP score? Did you try other values like fp_reassoc_width = 2 or
3?

[Bug target/107316] [aarch64] Init big const value should be improved compare to llvm

2022-10-25 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107316

Wilco  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 CC||wilco at gcc dot gnu.org
 Resolution|--- |FIXED

--- Comment #3 from Wilco  ---
As Andrew says, it's a duplicate so fixed now.

[Bug target/107316] [aarch64] Init big const value should be improved compare to llvm

2022-10-25 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107316
Bug 107316 depends on bug 106583, which changed state.

Bug 106583 Summary: Suboptimal immediate generation on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106583

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug target/106583] Suboptimal immediate generation on aarch64

2022-10-25 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106583

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org
 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #3 from Wilco  ---
Fixed

[Bug target/105773] [Aarch64] Failure to optimize and+cmp to tst

2022-10-13 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105773

Wilco  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #3 from Wilco  ---
Fixed.

[Bug middle-end/106323] [Suboptimal] memcmp(s1, s2, n) == 0 expansion on AArch64 compare to llvm

2022-07-18 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106323

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #3 from Wilco  ---
(In reply to Andrew Pinski from comment #1)
> GCC might be better if the first bytes are in cache but the next bytes are
> not and then branch is predictable (which it might be).
> 
> So this is much more complex than just changing this really.

Neither sequence is efficient. Caches are not really relevant here; it's more
about giving a wide out-of-order core lots of useful parallel work to do, and
avoiding unnecessary instructions and branches that just slow it down. Hence 4
loads and CMP+CCMP is best.
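
A source-level sketch of that shape for a fixed-size, equality-only compare
(my own example - 16 bytes here; the point is four independent loads feeding
one combined test, which the backend can then expand without early branches):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static inline uint64_t load64 (const void *p)
{
  uint64_t v;
  memcpy (&v, p, sizeof v);   /* unaligned-safe load */
  return v;
}

bool equal16 (const void *a, const void *b)
{
  uint64_t x = load64 (a) ^ load64 (b);
  uint64_t y = load64 ((const char *) a + 8) ^ load64 ((const char *) b + 8);
  return (x | y) == 0;
}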

[Bug target/106279] reload problem on arm iwmmxt

2022-07-13 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106279

--- Comment #3 from Wilco  ---
(In reply to David Binderman from comment #2)
> (In reply to Wilco from comment #1)
> > iwmmxt has been dead for 2 decades now - it's support has most likely
> > bitrotted, so I'm surprised anyone is trying to use it...
> 
> Time to remove support for it, or just mark this bug report as WONTFIX ?

Both!

[Bug target/106279] reload problem on arm iwmmxt

2022-07-13 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106279

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #1 from Wilco  ---
iwmmxt has been dead for two decades now - its support has most likely
bitrotted, so I'm surprised anyone is still trying to use it...

[Bug target/106270] [Aarch64] -mlong-calls should be provided on aarch64 for users with large applications

2022-07-12 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106270

--- Comment #6 from Wilco  ---
(In reply to Qing Zhao from comment #4)
> > On Jul 12, 2022, at 1:02 PM, wilco at gcc dot gnu.org 
> >  wrote:
> > 
> > Note that GCC could split huge .text sections automatically to allow 
> > insertion
> > of linker veneers every 128MB.
> 
> Does GCC do this by default? Any option is needed for this functionality?

No, currently it is not able to reach this limit, but once it can, it should be
done automatically.

[Bug target/106270] [Aarch64] -mlong-calls should be provided on aarch64 for users with large applications

2022-07-12 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106270

--- Comment #5 from Wilco  ---
(In reply to Jose E. Marchesi from comment #3)
> Wilco: The assessment in comment 1 was extracted from an internal discussion
> on an issue that is still under investigation.  We are certainly hitting a
> cant-reach-the-linker-generated-veneer problem, but it is not fully clear to
> us how since it is getting difficult to get proper reproducers.

It is worth checking you're using a recent binutils since old ones had a bug in
the veneer code (https://sourceware.org/bugzilla/show_bug.cgi?id=25665).

You can hit offset limits easily if you use a linker script which places text
sections very far apart. As the example shows, incorrect use of alignment
directives can cause issues as well. Ideally the assembler should give a
warning if there is a text section larger than 127 MB.

> In any case, the idea of splitting of the text section by the compiler is
> interesting, and a much better solution than -mlong-calls since it wouldn't
> involve generate unnecessary indirect branches.
> 
> But how would the back-end keep track on the size of the code it generates? 
> Using insn size attributes?

Yes, GCC tracks branch ranges. CBZ and TBZ have a small range and are
automatically handled if out of range. IIRC GCC doesn't yet extend Bcc, so if a
single function is over 1MB, GCC won't be able to compile it.

[Bug target/106270] [Aarch64] -mlong-calls should be provided on aarch64 for users with large applications

2022-07-12 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106270

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
GCC will crash well before reaching 128 MBytes of .text. So what is the real
underlying problem?

Note that GCC could split huge .text sections automatically to allow insertion
of linker veneers every 128MB. So -mlong-calls is simply an incorrect solution
for a problem that doesn't exist yet...

[Bug libgcc/105708] libgcc: aarch64: init_lse_atomics can race with user-defined constructors

2022-05-24 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105708

Wilco  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|RESOLVED|ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org
   Last reconfirmed||2022-05-24
 Resolution|WONTFIX |---

[Bug libgcc/105708] libgcc: aarch64: init_lse_atomics can race with user-defined constructors

2022-05-24 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105708

--- Comment #12 from Wilco  ---
(In reply to Jakub Jelinek from comment #11)
> How can changing the constructor priority in libgcc affect anything?
> Constructor priorities are within the same shared library or within the same
> executable, not inside of the same process.  So, e.g. when using
> libgcc_s.so.1,
> it might change order with other constructors inside of that shared library
> (there are likely none), but nothing else.  For libgcc.a, it might affect
> even ctors of the other objects with which the library is linked, but still
> not between different shared libraries or binaries.

The outline atomics are linked with each .so (to avoid the PLT and GOT), so
there are multiple copies and any initialization order issues are within the
.so.

[Bug libgcc/105708] libgcc: aarch64: init_lse_atomics can race with user-defined constructors

2022-05-24 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105708

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #10 from Wilco  ---
Increasing the priority of the constructor is perfectly reasonable given that
it has no effect on correctness and doing it as early as possible is better for
performance if other constructors use atomics.
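
A rough sketch of the shape of the fix (the flag name and priority value are
illustrative only; priorities up to 100 are reserved for the implementation,
which is exactly what libgcc is):

#include <stdbool.h>
#include <sys/auxv.h>

#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1UL << 8)   /* AArch64 LSE hwcap bit */
#endif

static bool have_lse_atomics;

/* Run before ordinary (default-priority) constructors so any of them that
   use the outline atomics already see the flag set.  */
static void __attribute__ ((constructor (90)))
init_have_lse_atomics (void)
{
  have_lse_atomics = (getauxval (AT_HWCAP) & HWCAP_ATOMICS) != 0;
}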

[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins

2022-04-14 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #7 from Wilco  ---
(In reply to Sebastian Pop from comment #5)
> Created attachment 52762 [details]
> patch
> 
> The attached patch fixes the issue for __sync builtins by adding the missing
> barrier to -march=armv8-a+nolse path in the outline-atomics functions.
> 
> The patch also changes the behavior of __atomic builtins for
> -moutline-atomics -march=armv8-a+nolse to be the same as for
> -march=armv8-a+lse.

So what is your reasoning for adding the barrier to __atomic as well? Only
__sync needs the extra full barrier, but __atomic does not.

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2022-03-15 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #49 from Wilco  ---
(In reply to d_vampile from comment #48)
> (In reply to Jiu Fu Guo from comment #41)
> > (In reply to Wilco from comment #40)
> > > (In reply to Jiu Fu Guo from comment #39)
> > > > I’m thinking to draft a patch for this optimization.  If any 
> > > > suggestions,
> > > > please point out, thanks.
> > > 
> > > Which optimization to be precise? Besides unrolling I haven't seen a
> > > proposal for an optimization which is both safe and generally applicable.
> > 
> > 1. For unroll, there are still branches in the loop. And then need careful
> > merge on those reading and comparison.  Another thing about unroll would be
> > that, if we prefer to optimize this early in GIMPLE, we still not GIMPLE
> > unroll on it.
> > while (len != max)
> > {
> > if (p[len] != cur[len])
> >   break; ++len;
> > if (p[len] != cur[len])
> >   break; ++len;
> > if (p[len] != cur[len])
> >   break; ++len;
> > 
> > }
> > 
> > 2. Also thinking about if it makes sense to enhance GIMPLE vectorization
> > pass.  In an aspect that using a vector to read and compare, also need to
> > handle/merge compares into vector compare and handle early exit carefully.
> > if (len + 8 < max && buffers not cross page) ///(p&4K) == (p+8)&4k?
> > 4k:pagesize
> >  while (len != max)
> > {
> >  vec a = xx p;
> >  vec b = xx cur;
> >  if (a != b) /// may not only for comparison 
> >   {;break;}
> >  len += 8;
> > }
> > 
> > 3. Introduce a new stand-alone pass to optimize reading/computing shorter
> > types into large(dword/vector) reading/computing.
> > 
> > Thanks a lot for your comments/suggestions!
> 
> Any progress or patches for the new pass mentioned in point 3? Or new ideas?

A new standalone pass would have to redo a lot of the work of the existing loop
optimizers and vectorizer. And you wouldn't be able to skip the extra checks to
guarantee safety of the optimization. This is why these optimizations are
typically done in the source code.

[Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64

2022-02-21 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #1 from Wilco  ---
(In reply to Andrew Pinski from comment #0)
> Take:
> 
> bool f(char *a)
> {
> char t[] = "0123456789012345678901234567890";
> return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
> }
> 
> Right now GCC uses branches to optimize this but this could be done via a
> few loads followed by xor (eor) of the two sides and then oring the results
> of xor
> and then umavx and then comparing that to 0. This can be done for the
> middle-end code too if there is a max reduction opcode.

It's not worth optimizing a small inline memcmp using vector instructions - the
umaxv and the move back to the integer side add extra latency.

However the expansion could be more efficient and use the same sequence used in
GLIBC memcmp:

ldp     data1, data3, [src1, 16]
ldp     data2, data4, [src2, 16]
cmp     data1, data2
ccmp    data3, data4, 0, eq
b.ne    L(return2)

Also the array t[] gets copied onto the stack instead of just using the string
literal directly.
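
For comparison, passing the literal directly avoids the stack copy altogether
(a sketch of the source-level alternative):

#include <stdbool.h>

bool f2 (const char *a)
{
  /* 31 characters plus the terminating NUL - the same 32 bytes as t[].  */
  return __builtin_memcmp (a, "0123456789012345678901234567890", 32) == 0;
}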
