[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061 Wilco changed: What|Removed |Added Target Milestone|14.3|13.4 --- Comment #21 from Wilco --- Fixed on trunk and GCC14 - keeping open for backports.
[Bug target/116032] [12/13/14/15 Regression] gcc.target/arm/pr40457-2.c produces larger code for armv7ve+neon
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116032 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #3 from Wilco --- With -O2 I get:

void f(int *p) { p[2] = 1; p[3] = 0; }

        vldr    d16, .L3
        adds    r0, r0, #8
        vst1.32 {d16}, [r0]
        bx      lr

With -Os it produces far more efficient code:

        movs    r2, #1
        movs    r3, #0
        strd    r2, r3, [r0, #8]
        bx      lr

So the test is correct in flagging this as a regression. I imagine this is particularly bad on old cores or where loading literals from code regions is expensive.
[Bug target/115954] Alignment of _Atomic structs incompatible between GCC and LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115954 --- Comment #12 from Wilco --- This came out of the AArch64 Atomic ABI design work: https://github.com/ARM-software/abi-aa/pull/256
[Bug target/115954] Alignment of _Atomic structs incompatible between GCC and LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115954 --- Comment #7 from Wilco --- (In reply to Andrew Pinski from comment #6) > https://gitlab.com/x86-psABIs/i386-ABI/-/issues/1 for x86_64 abi. > > Aarch64 should most likely also do the same ... Yes, that's why I raised this - GCC only over-aligning some sizes seems more an accident than a designed ABI.
[Bug middle-end/115954] Alignment of _Atomic structs incompatible between GCC and LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115954 --- Comment #5 from Wilco --- (In reply to Richard Biener from comment #2) > (In reply to Richard Biener from comment #1) > > Not sure what the x86 psABI says here (possibly nothing for aggregate > > _Atomic). > > It doesn't consider _Atomic [influencing the ABI] at all. > > Note I think your test queries actual object alignment which a compiler > can of course increase vs. what the ABI requires as minimum alignment, > you should possibly cross-check with alignof/sizeof of the type. > > On x86 clang returns size 8 and align 8 for the atomic A7 type (GCC does > not). I tried using the type for sizeof/alignof, and it returns the same values. So GCC overaligns structs that are an exact power of 2.
[Bug middle-end/115954] New: Alignment of _Atomic structs incompatible between GCC and LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115954 Bug ID: 115954 Summary: Alignment of _Atomic structs incompatible between GCC and LLVM Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- The following code shows ABI inconsistencies between GCC and LLVM:

#include <stdio.h>
#include <stdalign.h>
#include <stdatomic.h>

_Atomic struct A3 { char a[3]; } a3;
_Atomic struct A7 { char a[7]; } a7;
_Atomic struct A8 { char a[8]; } a8;
_Atomic struct A9 { char a[9]; } a9;
_Atomic struct A16 { char a[16]; } a16;

int main (void)
{
  printf("size %ld align %ld lockfree %d\n", sizeof (a3), alignof (a3), atomic_is_lock_free (&a3));
  printf("size %ld align %ld lockfree %d\n", sizeof (a7), alignof (a7), atomic_is_lock_free (&a7));
  printf("size %ld align %ld lockfree %d\n", sizeof (a8), alignof (a8), atomic_is_lock_free (&a8));
  printf("size %ld align %ld lockfree %d\n", sizeof (a9), alignof (a9), atomic_is_lock_free (&a9));
  printf("size %ld align %ld lockfree %d\n", sizeof (a16), alignof (a16), atomic_is_lock_free (&a16));
  return 0;
}

Compiled with GCC -O2 -latomic I get this on AArch64:

size 3 align 1 lockfree 1
size 7 align 1 lockfree 1
size 8 align 8 lockfree 1
size 9 align 1 lockfree 0
size 16 align 16 lockfree 0

However LLVM reports:

size 4 align 4 lockfree 1
size 8 align 8 lockfree 1
size 8 align 8 lockfree 1
size 16 align 16 lockfree 1
size 16 align 16 lockfree 1

The same is true for x86_64 GCC:

size 3 align 1 lockfree 0
size 7 align 1 lockfree 1 (due to alignment in libatomic)
size 8 align 8 lockfree 1
size 9 align 1 lockfree 0
size 16 align 16 lockfree 0

and LLVM:

size 4 align 4 lockfree 1
size 8 align 8 lockfree 1
size 8 align 8 lockfree 1
size 16 align 16 lockfree 0
size 16 align 16 lockfree 0

Increasing the alignment of small _Atomic structs to a power of 2 means these will always be lock-free rather than only sometimes, depending on alignment.
This also has the nice property that all types smaller than the maximum supported atomic size are always lock free so there is no need to make libatomic calls.
[Bug target/115153] [14/15 Regression] Error: bad immediate value for 8-bit offset - armv7ve
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115153 Wilco changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #11 from Wilco --- Fixed on trunk and GCC14.
[Bug target/114890] [14/15 Regression] Big-endian addp intrinsics reorder operands
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114890 Wilco changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED Target Milestone|15.0|14.2 --- Comment #3 from Wilco --- Fixed on trunk and GCC-14.
[Bug target/103100] [11 Regression] unaligned access generated with memset or {} and -O2 -mstrict-align
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103100 Wilco changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED Summary|[11/12/13 Regression] |[11 Regression] unaligned |unaligned access generated |access generated with |with memset or {} and -O2 |memset or {} and -O2 |-mstrict-align |-mstrict-align --- Comment #30 from Wilco --- Fixed on GCC12 branch too. It doesn't apply to GCC11, so it's unlikely to be worth fixing since GCC11 branch will be closed soon.
[Bug target/105886] -mstrict-align is ignorning unalign in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105886 Bug 105886 depends on bug 103100, which changed state. Bug 103100 Summary: [11 Regression] unaligned access generated with memset or {} and -O2 -mstrict-align https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103100 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug target/115188] [14/15 regression] invalid Thumb assembly for atomic store in loop on ARMv6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115188 Wilco changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #5 from Wilco --- Fixed on trunk and GCC14.2.
[Bug ipa/114531] Feature proposal for an `-finline-functions-aggressive` compiler option
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114531 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #10 from Wilco --- A 1.1% overall performance gain looks good - is there a significant codesize hit from this? If so, are there slightly less aggressive settings that still get most of the performance gains but at a lower (acceptable) codesize cost? It seems there may be scope to improve the default settings of -O2.
[Bug target/115342] [14/15 Regression] AArch64: Function multiversioning initialization incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115342 Wilco changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Wilco --- Fixed on trunk and gcc-14 branch.
[Bug target/115342] [14/15 Regression] AArch64: Function multiversioning initialization incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115342 Wilco changed: What|Removed |Added Status|NEW |ASSIGNED --- Comment #3 from Wilco --- (In reply to Ramana Radhakrishnan from comment #2) > Confirmed ? And fixed on trunk. Will do the backport later.
[Bug middle-end/115388] [15 Regression] wrong code at -O3 on x86_64-linux-gnu since r15-571-g1e0ae1f52741f7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115388 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #7 from Wilco --- (In reply to Richard Biener from comment #6) > Fixed. Unfortunately this didn't fix PR115256 if I checked correctly. Keep > searching! The testcase hangs on AArch64, so this commit didn't fix it...
[Bug target/115342] New: [14/15 Regression] AArch64: Function multiversioning initialization incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115342 Bug ID: 115342 Summary: [14/15 Regression] AArch64: Function multiversioning initialization incorrect Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- The CPU features initialization code uses CPUID registers. It uses incorrect comparisons, so that for example SVE is not set if SVE2 is available. Using HWCAPs for these is both simpler and works correctly. The initialization must also be done atomically, to avoid multiple threads corrupting the global through non-atomic read-modify-writes.
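The atomic-initialization requirement can be sketched as follows (hypothetical names and bit assignments, not the actual libgcc code): compute the feature mask in a local, then publish it with a single atomic store, so racing initializers never perform partial read-modify-writes of the global.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch: publish CPU feature flags with one atomic store
   instead of multiple non-atomic read-modify-writes of a global.  */
static _Atomic unsigned long feature_mask;
static _Atomic bool features_initialized;

#define FEAT_SVE  (1UL << 0)   /* hypothetical bit assignments */
#define FEAT_SVE2 (1UL << 1)

static void init_features (bool hwcap_sve, bool hwcap_sve2)
{
  unsigned long mask = 0;
  /* SVE2 implies SVE; derive both directly from HWCAPs rather than
     comparing CPUID register fields.  */
  if (hwcap_sve || hwcap_sve2)
    mask |= FEAT_SVE;
  if (hwcap_sve2)
    mask |= FEAT_SVE2;
  /* Single atomic store: racing threads all write the same full value.  */
  atomic_store (&feature_mask, mask);
  atomic_store (&features_initialized, true);
}
```

Even if several threads run `init_features` concurrently, each stores a complete, identical mask, so no torn value can be observed.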
[Bug target/115188] [14/15 regression] invalid Thumb assembly for atomic store in loop on ARMv6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115188 Wilco changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org Last reconfirmed||2024-05-23 --- Comment #2 from Wilco --- (In reply to Andrew Pinski from comment #1) > At first I thought it was the same failure as PR 115153 but it is different. It's similar in that 'm' apparently allows LDMIA/STMIA with writeback in Thumb-1. The correct constraint is 'Uw'.
[Bug target/115153] [14/15 Regression] Error: bad immediate value for 8-bit offset - armv7ve
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115153 Wilco changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org CC||wilco at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #8 from Wilco --- Confirmed. Small example:

long long f(long long *p) { return __atomic_load_n (p+32, __ATOMIC_RELAXED); }

It only fails on ARM and if the offset is between 256 and 1024. This is a latent bug: arm_legitimate_index_p has an explicit check that disallows anything over 256, but an earlier check for VALID_NEON_DREG_MODE oddly enough allows DImode and a larger range. Moving the Neon check after the LDRD check fixes this. Note using ldrd_strd_offset_operand/Do should also work, but the existing code for 'm' is supposed to handle this correctly.
[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 Wilco changed: What|Removed |Added Target||aarch64-*-* Target Milestone|--- |15.0
[Bug target/114991] New: [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 Bug ID: 114991 Summary: [14/15 Regression] AArch64: LDP pass does not handle some structure copies Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- The following example no longer emits LDP/STP since GCC14:

#include

typedef struct { int arr[20]; } S;

void g (S *);
void h (S);

void f(int x)
{
  S s;
  g (&s);
  h (s);
}

f:
        stp     x29, x30, [sp, -176]!
        add     x1, sp, 96
        mov     x29, sp
        add     x0, sp, 16
        ldp     q29, q31, [x1]
        ldr     q30, [x1, 32]
        str     q29, [sp, 16]
        ldr     q29, [x1, 48]
        str     q31, [x0, 16]
        ldr     q31, [x1, 64]
        stp     q30, q29, [x0, 32]
        str     q31, [x0, 64]
        bl      h
        ldp     x29, x30, [sp], 176
        ret

The expansions for memcpy/move/memset no longer emit LDP directly in RTL and now rely on the new LDP pass. Stack-based loads/stores seem to confuse its alias checks, and it gives up. Using -fno-schedule-insns fixes this example, but not all cases.
[Bug target/114890] [14/15 Regression] Big-endian addp intrinsics reorder operands
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114890 Wilco changed: What|Removed |Added Target Milestone|--- |15.0 Target||aarch64-*-*
[Bug target/114890] New: [14/15 Regression] Big-endian addp intrinsics reorder operands
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114890 Bug ID: 114890 Summary: [14/15 Regression] Big-endian addp intrinsics reorder operands Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- The following example:

#include "arm_neon.h"

uint32x4_t test (uint32x4_t v1, uint32x4_t v2)
{
  return vpaddq_u32 (v1, v2);
}

compiles with -O2 -mlittle-endian into:

test:
        addp    v0.4s, v0.4s, v1.4s
        ret

However -O2 -mbig-endian gives the incorrect:

        addp    v0.4s, v1.4s, v0.4s
        ret
[Bug target/114843] aarch64: epilogue in _Unwind_RaiseException corrupts return value due to __builtin_eh_return
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114843 --- Comment #17 from Wilco --- (In reply to Andrew Pinski from comment #16) > Patch posted with all of the testcases included: > https://gcc.gnu.org/pipermail/gcc-patches/2024-April/650080.html Not nearly enough testcases... What about:

void g(void);
extern int arr[];

int f(long offset, void *handler)
{
  g();
  if (offset > 5)
    return arr[offset];
  __builtin_eh_return (offset, handler);
}

With -O2 -fomit-frame-pointer:

f:
.LFB0:
        .cfi_startproc
        stp     x30, x0, [sp, -64]!
        .cfi_def_cfa_offset 64
        .cfi_offset 30, -64
        .cfi_offset 0, -56
        stp     x1, x2, [sp, 16]
        stp     x3, x19, [sp, 32]
        .cfi_offset 1, -48
        .cfi_offset 2, -40
        .cfi_offset 3, -32
        .cfi_offset 19, -24
        mov     x19, x0
        str     x20, [sp, 48]
        .cfi_offset 20, -16
        mov     x20, x1
        bl      g
        cmp     x19, 5
        ble     .L8
        mov     w0, w19
        ldp     x19, x20, [sp, 40]
        ldp     x30, x0, [sp], 64      ** oops
        .cfi_remember_state
        .cfi_restore 0
        .cfi_restore 30
        .cfi_restore 19
        .cfi_restore 20
        .cfi_def_cfa_offset 0
        ret
.L8:
        .cfi_restore_state
        mov     x5, x19
        ldp     x1, x2, [sp, 16]
        mov     x6, x20
        ldp     x3, x19, [sp, 32]
        ldr     x20, [sp, 48]
        ldp     x30, x0, [sp], 64
        .cfi_restore 0
        .cfi_restore 30
        .cfi_restore 20
        .cfi_restore 3
        .cfi_restore 19
        .cfi_restore 1
        .cfi_restore 2
        .cfi_def_cfa_offset 0
        add     sp, sp, x5
        br      x6
        .cfi_endproc

So I don't believe you should change aarch64_pop_regs at all - it's too late to change things, and it just adds unnecessary complexity and more bugs. The best option would be to handle eh_return explicitly and insert the extra pushes/pops rather than treating them like generic callee-saves (because clearly they are not anymore).
[Bug target/114843] aarch64: epilogue in _Unwind_RaiseException corrupts return value due to __builtin_eh_return
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114843 --- Comment #13 from Wilco --- (In reply to Andrew Pinski from comment #11)
> I have a fix for aarch64, able to produce now:
> ```
> f:
> .LFB0:
>         .cfi_startproc
>         stp     x0, x1, [sp, -32]!
>         .cfi_def_cfa_offset 32
>         .cfi_offset 0, -32
>         .cfi_offset 1, -24
>         stp     x2, x3, [sp, 16]
>         .cfi_offset 2, -16
>         .cfi_offset 3, -8
>         ldr     w0, [x0]
>         cmp     w0, 5
>         bne     .L8
>         add     sp, sp, 32
>         .cfi_remember_state
>         .cfi_def_cfa_offset 0
>         ret
> .L8:
>         .cfi_restore_state
>         mov     x5, x1
>         ldp     x2, x3, [sp, 16]
>         ldp     x0, x1, [sp], 32
>         .cfi_restore 1
>         .cfi_restore 0
>         .cfi_restore 2
>         .cfi_restore 3
>         .cfi_def_cfa_offset 0
>         add     sp, sp, x5
>         ret
> ```
>
> Which is exactly what we should produce I think.
> The patch is a bit more complex than I expected but that is due to how
> aarch64 has some of the most complex epilogues.

I'm not convinced that is an easy solution. Try various cases with large stack sizes, alloca and other scalar and FP callee-saves. Getting all cases right and writing good tests for them is a lot of work.
[Bug target/114843] aarch64: epilogue in _Unwind_RaiseException corrupts return value due to __builtin_eh_return
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114843 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #10 from Wilco --- (In reply to Andrew Pinski from comment #9) > Just a quick note here. Even though eh_return pattern was removed with > r7-6051-g8144a493ddc008, it was broken before that patch. Yeah, I only fixed the broken behaviours that I encountered at the time - no tests tried to return a value on the non-exception path. There is no clear specification (eg. making it clear that EH_RETURN_DATA_REGNO must not overlap with registers used to return, or that if they do, you need to conditionally restore them), so it is no wonder that many targets get this wrong. Who knew that introducing lots of complex builtins that affect prolog and epilog generation in a major way, just to avoid a few lines of assembly code, was such a bad idea... Since the whole eh_return mechanism is an internal ABI in libgcc, a fix would be to change EH_RETURN_DATA_REGNO(N) to avoid x0 and x1. Since eh_return already reserves 7 registers(!) and now needs to avoid using x0/x1 too, using x2-x5 plus x6, x7 and x9 for the other special registers should work.
[Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741 --- Comment #7 from Wilco --- (In reply to Tamar Christina from comment #6) > and the exact armv9-a cost model you quoted, also does the right codegen. > https://godbolt.org/z/obafoT6cj > > There is just an inexplicable penalty being applied to the r->r alternative. Indeed it is not related to cost model - building SPEC shows a significant regression (~1%) with -mcpu=neoverse-v1 due to AND immediate being quite common in scalar code. The '^' incorrectly forces many cases to use the SVE alternative.
[Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741 --- Comment #2 from Wilco --- It looks like the underlying bug is '^' being incorrectly treated like '?' in record_reg_classes (which is never used during reload). Fixing that results in the expected code being generated in all cases. It looks this issue was introduced in the original commit d1457701461d5a49ca6b5d8a6d1c83a37a6dc771
[Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #1 from Wilco --- This example always goes wrong:

void foo2(unsigned *p) { *p &= 1; }

Eg. with -mcpu=neoverse-v1:

        ldr     s31, [x0]
        and     z31.s, z31.s, #1
        str     s31, [x0]
        ret

This doesn't make any sense since there are usually fewer vector units than integer ALUs, and they typically have higher latency.
[Bug target/113986] [14 regression] Build failure on aarch64-linux-musl or if ifunc support is disabled (error: 'export_load_16' aliased to undefined symbol 'libat_load_16')
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113986 Wilco changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #7 from Wilco --- Fixed
[Bug middle-end/110773] [Aarch64] crash (SIGBUS) due to atomic instructions on under-aligned memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110773 --- Comment #8 from Wilco --- (In reply to Sainan from comment #7) > (In reply to Wilco from comment #6) > > That does not make any sense. The only thing I think might happen is that > > your structure is not correctly aligned (for example by using a custom > > memory allocator). Can you check the address of count when it fails? (should > > be in the crash logs, or you can see it in gdb or just printf it). > > I feel silly for not thinking of printing the address, but now that I did, I > see the final hexit is '9' and so it just so happens this CPU can't deal > with that... So it's unaligned then, and that's not supported. And you're lucky your specific alignment happens to work on v8.4 cores - it would fail for other offsets.
[Bug middle-end/110773] [Aarch64] crash (SIGBUS) due to atomic instructions on under-aligned memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110773 --- Comment #6 from Wilco --- (In reply to Sainan from comment #5) > (In reply to Wilco from comment #4) > > The atomic will also set correct struct alignment. > > My thinking was that maybe this is not the case (= standard library issue) > since both GCC and Clang seem to be causing this issue, but manually adding > alignas(16) also didn't help. > > > You would get a crash if you build for LSE so you get a LDADDAL instruction > > and then run it on a CPU that doesn't. So try -mcpu=native and it should > > work. > > -mcpu=native didn't fix the SIGBUS, only removed __aarch64_ldadd4_acq_rel > from the stack trace. > > FWIW, the CPU on this system where I get the SIGBUS is Cortex-A76, which > should support LSE and atomics, but it seems everytime it encounters > atomics, it just throws a SIGBUS. It works fine on Snapdragon 8cx Gen 3. That does not make any sense. The only thing I think might happen is that your structure is not correctly aligned (for example by using a custom memory allocator). Can you check the address of count when it fails? (should be in the crash logs, or you can see it in gdb or just printf it).
[Bug middle-end/110773] [Aarch64] crash (SIGBUS) due to atomic instructions on under-aligned memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110773 --- Comment #4 from Wilco --- (In reply to Sainan from comment #3)
> I seem to be having a related issue, although in my case the struct looks
> like this:
>
> template <typename T>
> struct Data
> {
>     T* data;
>     std::atomic_uint count;
>     bool flag;
> };
>
> And it's crashing on `--count;`
>
> Surely this is not a user issue in this case because the pointer should
> always be 8 bytes, so count should be evenly aligned on a 8-byte boundary.
> (Unless the atomic operation needs 16-byte alignment?)

The atomic type will also set the correct struct alignment.

> Same code also runs fine when compiled via MSVC and run on Windows, although
> it's unclear if this might simply be my Linux test machine running an older
> ARM CPU compared to my Windows on ARM test machine.

You would get a crash if you build for LSE so you get a LDADDAL instruction and then run it on a CPU that doesn't support it. So try -mcpu=native and it should work.
[Bug rtl-optimization/93565] [11/12/13 Regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 --- Comment #31 from Wilco --- (In reply to Andrew Pinski from comment #29)
> Looking back at this one, I (In reply to Wilco from comment #8)
> > Here is a much simpler example:
> >
> > void f (int *p, int y)
> > {
> >   int a = y & 14;
> >   *p = a | p[a];
> > }
> After r14-9692-g839bc42772ba7af66af3bd16efed4a69511312ae, we now get:
> f:
> .LFB0:
>         .cfi_startproc
>         and     w2, w1, 14
>         mov     x1, x2
>         ldr     w2, [x0, x2, lsl 2]
>         orr     w1, w2, w1
>         str     w1, [x0]
>         ret
>         .cfi_endproc
>
> There is an extra move still but the duplicated and is gone. (with
> -frename-registers added, the move is gone as REE is able to remove the zero
> extend but then there is a life range conflict so can't remove the move too).

Even with the mov it is better, since the mov can be done with zero latency in the rename stage of most CPUs.

> So maybe this should be closed as fixed for GCC 14 and the cost changes for
> clz reverted.

The ctz costs are correct since it is a 2-instruction sequence - it only needs adjusting for CSSC.
[Bug target/113618] [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 Wilco changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #6 from Wilco --- Fixed.
[Bug target/113915] [14 regression] glibc's _dl_find_object_update_1 miscompiled for armv7a since r14-4365-g0731889c026bfe
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113915 Wilco changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #15 from Wilco --- Fixed on trunk.
[Bug target/113986] [14 regression] Build failure on aarch64-linux-musl or if ifunc support is disabled (error: 'export_load_16' aliased to undefined symbol 'libat_load_16')
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113986 --- Comment #4 from Wilco --- Patch: https://gcc.gnu.org/pipermail/gcc-patches/2024-February/646408.html
[Bug target/113915] [14 regression] glibc's _dl_find_object_update_1 miscompiled for armv7a since r14-4365-g0731889c026bfe
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113915 --- Comment #13 from Wilco --- Patch: https://gcc.gnu.org/pipermail/gcc-patches/2024-February/646189.html
[Bug target/113986] [14 regression] Build failure on aarch64-linux-musl or if ifunc support is disabled (error: 'export_load_16' aliased to undefined symbol 'libat_load_16')
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113986 Wilco changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #2 from Wilco --- (In reply to Andrew Pinski from comment #1) > I am 99% sure it was caused by r14-6589-g3fa689f6ed8387 . > > It is reproducible with --disable-gnu-indirect-function on the gcc configure > line for a glibc build even without the patch for PR 113971 so confirmed. > > > Moving the definition of DONE for N==16 case to be under the `#if > HAVE_IFUNC` case fixes the issue. I don't know if that is the correct fix or > not ... Confirmed. I had never heard of that config - at the time I tried it on an old system with GCC 4.8, and that built and passed all tests. I can't see a reason to ever switch off ifuncs... The !HAVE_IFUNC case needs to also define IFUNC_ALT 1, and atomic_16.S must add aliases for __atomic_load_16 etc. Then it works fine. I'll send a patch.
[Bug target/113915] [14 regression] glibc's _dl_find_object_update_1 miscompiled for armv7a since r14-4365-g0731889c026bfe
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113915 Wilco changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #11 from Wilco --- Yes the default for "conds" attribute is incorrect and at odds with the "predicable" attribute. The fix should work but will disable conditional execution on a few ARM-only patterns that just have "conds" attribute. Any shared patterns will be OK since they already need to set "predicable" for Thumb-2.
[Bug target/113618] [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 Wilco changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #4 from Wilco --- (In reply to Alex Coplan from comment #1) > Confirmed. > > (In reply to Wilco from comment #0) > > A possible fix would be to avoid emitting LDP/STP in memcpy/memmove/memset > > expansions. > > Yeah, so I had posted > https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636855.html for that > but held off from committing it at the time as IMO there wasn't enough > evidence to show that this helps in general (and the pass could in theory > miss opportunities which would lead to regressions). > > But perhaps this is a good argument for going ahead with that change (of > course it will need rebasing). Yes I have a patch based on current trunk + my outstanding memset cleanup patch. It's slightly faster but causes a small codesize regression. This appears mostly due to GCC being overly aggressive in changing loads/stores with a zero offset into indexing, a non-zero offset or a lo_sym. This not only blocks LDP opportunities but also increases register pressure and spilling.
[Bug target/113618] [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 --- Comment #3 from Wilco --- (In reply to Richard Biener from comment #2) > It might be good to recognize this pattern in strlenopt or a related pass. > > A purely local transform would turn it into > > memcpy (temp, a, 64); > memmove (b, a, 64); > > relying on DSE to eliminate the copy to temp if possible. Not sure if > that possibly would be a bad transform if copying to temp is required. This would only be beneficial if you know memmove is inlined whenever memcpy is - on almost all targets memmove becomes a library call, so the transformation would be worse if memcpy can be inlined. > stp q30, q31, [sp] > ldp q30, q31, [sp] > > why is CSE not able to catch this? The new RTL now has UNSPECs in it, so CSE doesn't know it is a plain load/store:

STP:

(insn 12 11 13 2 (set (mem/c:V2x16QI (reg:DI 102) [0 +0 S32 A128])
        (unspec:V2x16QI [
                (reg:V4SI 104)
                (reg:V4SI 105)
            ] UNSPEC_STP)) "/app/example.c":5:5 -1
     (nil))

LDP:

(insn 16 15 17 2 (parallel [
            (set (reg:V4SI 108)
                (unspec:V4SI [
                    (mem/c:V2x16QI (reg:DI 107) [0 +0 S32 A128])
                ] UNSPEC_LDP_FST))
            (set (reg:V4SI 109)
                (unspec:V4SI [
                    (mem/c:V2x16QI (reg:DI 107) [0 +0 S32 A128])
                ] UNSPEC_LDP_SND))
        ]) "/app/example.c":6:5 -1
     (nil))
[Bug target/113618] New: [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 Bug ID: 113618 Summary: [14 Regression] AArch64: memmove idiom regression Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- The following is often used as an idiom for memmove, since the GCC mid-end and most back-ends have no support for inlining memmove:

#include <string.h>

void move64 (char *a, char *b)
{
  char temp[64];
  memcpy (temp, a, 64);
  memcpy (b, temp, 64);
}

On trunk this generates:

        ldp     q30, q31, [x0]
        sub     sp, sp, #64
        ldp     q28, q29, [x0, 32]
        stp     q30, q31, [sp]
        ldp     q30, q31, [sp]
        stp     q28, q29, [sp, 32]
        ldp     q28, q29, [sp, 32]
        stp     q30, q31, [x1]
        stp     q28, q29, [x1, 32]
        add     sp, sp, 64
        ret

This is a significant regression from GCC13, which has redundant stores but avoids load-after-store forwarding penalties:

        ldp     q2, q3, [x0]
        sub     sp, sp, #64
        ldp     q0, q1, [x0, 32]
        stp     q2, q3, [sp]
        stp     q2, q3, [x1]
        stp     q0, q1, [sp, 32]
        stp     q0, q1, [x1, 32]
        add     sp, sp, 64
        ret

LLVM avoids writing to the temporary and removes the stack frame altogether:

        ldp     q1, q0, [x0, #32]
        ldp     q2, q3, [x0]
        stp     q1, q0, [x1, #32]
        stp     q2, q3, [x1]
        ret

The reason for the regression appears to be the changed RTL representation of LDP/STP. The RTL optimizer does not understand LDP/STP, so emitting LDP/STP early in memcpy expansion means it cannot remove the redundant stack stores. A possible fix would be to avoid emitting LDP/STP in memcpy/memmove/memset expansions.
[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061 Wilco changed: What|Removed |Added Target Milestone|--- |14.0 --- Comment #16 from Wilco --- Fixed by https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=3fa689f6ed8387d315e58169bb9bace3bd508c0a

libatomic: Enable lock-free 128-bit atomics on AArch64

Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with existing binaries (as for these GCC always calls into libatomic, so all 128-bit atomic uses in a process are switched), gives better performance than locking atomics and is what most users expect.

128-bit atomic loads use a load/store exclusive loop if LSE2 is not supported. This results in an implicit store which is invisible to software as long as the given address is writeable (which will be true when using atomics in real code).

This doesn't yet change __atomic_is_lock_free even though all atomics are finally lock-free on AArch64.

libatomic:
        * config/linux/aarch64/atomic_16.S: Implement lock-free ARMv8.0 atomics.
        (libat_exchange_16): Merge RELEASE and ACQ_REL/SEQ_CST cases.
        * config/linux/aarch64/host-config.h: Use atomic_16.S for baseline v8.0.
[Bug target/112573] Suboptimal code generation with `-fdata-sections` on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112573 Wilco changed: What|Removed |Added Last reconfirmed||2023-11-20 Ever confirmed|0 |1 CC||wilco at gcc dot gnu.org Status|UNCONFIRMED |NEW --- Comment #3 from Wilco --- We should reassociate the immediate last for more optimal addressing, like LLVM:

        adrp    x8, a
        add     x8, x8, :lo12:a
        lsr     w9, w0, #8
        add     x8, x8, w1, sxtw
        strb    w9, [x8, #1]
        lsr     w9, w0, #16
        strb    w0, [x8, #3]
        strb    w9, [x8, #2]
        lsr     w9, w0, #24
        strb    w9, [x8]
        ret

However GCC's reassociation is incorrect - it has been for many years, and things got much worse in GCC12... As a result we may merge the immediate offset into the base address, like in 'h'. Using -fdata-sections behaves like -fno-section-anchors, so it works as expected (and 'extern' is the same as well). We could block merging offsets to get more address CSEs if that ends up better overall.
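The addressing pattern above presumably comes from source along these lines (a hypothetical reconstruction matching the assembly's shape - a global byte array indexed by a variable offset, with a 32-bit value stored byte by byte; the exact byte order in the real testcase may differ):

```c
#include <assert.h>

unsigned char a[100];   /* hypothetical global matching the symbol in the asm */

/* Store v at a[i]..a[i+3] one byte at a time: each store needs the
   base address of `a` plus the variable index plus a small immediate,
   which is where address reassociation matters.  */
void store_bytes (long i, unsigned v)
{
  a[i]     = (unsigned char) (v >> 24);
  a[i + 1] = (unsigned char) (v >> 16);
  a[i + 2] = (unsigned char) (v >> 8);
  a[i + 3] = (unsigned char) v;
}
```

Computing `a + i` once and reusing it with immediate offsets (as LLVM does) needs fewer address calculations than folding each immediate into a fresh base.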
[Bug tree-optimization/90693] Missing popcount simplifications
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90693 --- Comment #6 from Wilco --- Thanks Jakub - with the 2nd patch we get the expected sequence on AArch64:

        sub     x1, x0, #1
        eor     x0, x0, x1
        cmp     x0, x1
        cset    x0, hi
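The sequence above computes a popcount(x) == 1 test branchlessly: it relies on the identity that x has exactly one bit set iff `(x ^ (x - 1)) > (x - 1)` in unsigned arithmetic. A C rendering of that identity (my sketch, not the PR's testcase):

```c
#include <assert.h>
#include <stdint.h>

/* For a power of two, x ^ (x - 1) sets all bits up to and including the
   single set bit, which is strictly greater than x - 1.  For x == 0 and
   for values with multiple bits set, the comparison is false.  */
static int is_single_bit (uint64_t x)
{
  return (x ^ (x - 1)) > (x - 1);
}
```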
[Bug target/112426] sched1 pessimizes codegen on aarch64 by increasing register pressure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112426 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #4 from Wilco --- That first REG_DEAD note after scheduling looks wrong:

   15: x0:DI=r93:DI+0x10
      REG_DEAD r93:DI
    8: [r93:DI]=r98:DI
      REG_DEAD r98:DI
    9: [r93:DI+0x8]=r99:DI
[Bug target/112465] libgcc: aarch64: lse runtime does not work with big data segments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112465 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #2 from Wilco --- -mcmodel=large is not well supported in general (no support for PIC/PIE, not well optimized or tested). The newly designed medium model will be far better, but until that is implemented it is best to use -mcpu=native and only use -mcmodel=large if there is no other option.
[Bug target/111416] [Armv7/v8 Mixing Bug]: 64-bit Sequentially Consistent Load can be Reordered before Store of RMW when v7 and v8 Implementations are Mixed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111416 Wilco changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=111235 --- Comment #3 from Wilco --- Fixed by commit r14-4365-g0731889c026bfe8d55c4851422ca5ec9d037f7a0

#include <stdint.h>
#include <stdatomic.h>

int64_t f (_Atomic int64_t *p)
{
  return atomic_load (p);
}

now generates with -O2 -mcpu=cortex-a15:

        dmb     ish
        ldrd    r0, r1, [r0]
        dmb     ish
        bx      lr
[Bug target/111416] [Armv7/v8 Mixing Bug]: 64-bit Sequentially Consistent Load can be Reordered before Store of RMW when v7 and v8 Implementations are Mixed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111416 Wilco changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #2 from Wilco --- Fixed by commit r14-4365-g0731889c026bfe8d55c4851422ca5ec9d037f7a0

#include <stdint.h>
#include <stdatomic.h>

int64_t f (_Atomic int64_t *p)
{
  return atomic_load (p);
}

now generates with -O2 -mcpu=cortex-a15:

        dmb     ish
        ldrd    r0, r1, [r0]
        dmb     ish
        bx      lr
[Bug target/111235] [Armv7-a]: Control-dependency between atomic accesses removed by -O1.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111235 Wilco changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #6 from Wilco --- Fixed
[Bug target/111121] AArch64: MOPS memmove operand corruption
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111121 Wilco changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #5 from Wilco --- Fixed on trunk, backported to GCC13 and GCC12.
[Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611 Wilco changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Last reconfirmed||2023-09-28 --- Comment #5 from Wilco --- (In reply to Mathias Stearn from comment #4)
> clang has already been using the optimized memcmp code since v16, even at
> -O1: https://www.godbolt.org/z/qEd768TKr. Older versions (at least since v9)
> were still branch-free, but via a less optimal sequence of instructions.
>
> GCC's code gets even more ridiculous at 32 bytes, because it does a branch
> after every 8-byte compare, while the clang code is fully branch-free (not
> that branch-free is always better, but it seems clearly so in this case).
>
> Judging by the codegen, there seem to be three deficiencies in GCC: 1) an
> inability to take advantage of the load-pair instructions to load 16 bytes
> at a time, 2) an inability to use ccmp to combine comparisons, and 3) using
> branching rather than cset to fill the output register. Ideally these could
> all be done in the general case by the low-level instruction optimizer, but
> even getting them special-cased for memcmp (and friends) would be an
> improvement.

I think 1, 2 and 3 are all related due to not having a TImode compare pattern, so GCC splits things into 8-byte chunks using branches. We could add that and see whether the result is better, or add a backend expander for memcmp similar to memset and memcpy. Note what LLVM does is terrible: a 64-byte memcmp is ridiculously inefficient due to long dependency chains, loading and comparing every byte even if there is a mismatch in byte 0. So it's obviously better to use branches.
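For the 16-byte equality case under discussion, the shape a load-pair/ccmp/cset expansion would take can be sketched in portable C: two 8-byte loads per side, OR the XOR differences together, and one final compare. This is an illustration of the idiom, not GCC's or LLVM's actual expansion:

```c
#include <stdint.h>
#include <string.h>

/* Branchless 16-byte equality: on AArch64 this maps naturally onto one
   ldp per side, eor/orr, and a single cmp + cset - no branch per 8-byte
   chunk.  memcpy is used for the loads to keep the accesses well-defined
   regardless of alignment.  */
static int eq16 (const void *p, const void *q)
{
  uint64_t a0, a1, b0, b1;
  memcpy (&a0, p, 8);
  memcpy (&a1, (const char *) p + 8, 8);
  memcpy (&b0, q, 8);
  memcpy (&b1, (const char *) q + 8, 8);
  return ((a0 ^ b0) | (a1 ^ b1)) == 0;
}
```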
[Bug target/103100] [11/12/13/14 Regression] unaligned access generated with memset or {} and -O2 -mstrict-align
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103100 --- Comment #24 from Wilco --- Patch to avoid emitting unaligned LDP/STP with -mstrict-align: https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631022.html
[Bug target/105928] [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928 Wilco changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED Target|arm64-*-* |aarch64 Target Milestone|--- |14.0 --- Comment #5 from Wilco --- Fixed
[Bug target/111404] [AArch64] 128-bit __sync_val_compare_and_swap is not atomic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111404 Wilco changed: What|Removed |Added Last reconfirmed||2023-09-14 Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Target|arm64-*-* |aarch64 --- Comment #1 from Wilco --- Patch: https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630198.html
[Bug target/111416] [Armv7/v8 Mixing Bug]: 64-bit Sequentially Consistent Load can be Reordered before Store of RMW when v7 and v8 Implementations are Mixed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111416 Wilco changed: What|Removed |Added Status|UNCONFIRMED |NEW CC||wilco at gcc dot gnu.org Last reconfirmed||2023-09-14 Target||arm-* Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org Component|translation |target --- Comment #1 from Wilco --- This will be fixed by https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629607.html
[Bug target/105928] [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928 --- Comment #3 from Wilco --- Patch: https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630358.html
[Bug target/111404] New: [AArch64] 128-bit __sync_val_compare_and_swap is not atomic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111404 Bug ID: 111404 Summary: [AArch64] 128-bit __sync_val_compare_and_swap is not atomic Product: gcc Version: 8.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- This compiles

__int128 f (__int128 *p, __int128 *q, __int128 x)
{
  return __sync_val_compare_and_swap (p, *q, x);
}

into:

f:
        ldp     x6, x7, [x1]
        mov     x4, x0
.L3:
        ldxp    x0, x1, [x4]
        cmp     x0, x6
        ccmp    x1, x7, 0, eq
        bne     .L4
        stlxp   w5, x2, x3, [x4]
        cbnz    w5, .L3
.L4:
        dmb     ish
        ret

This means if the compare fails, we return the value loaded via LDXP. However unless the STLXP succeeds, this returned value is not single-copy atomic. So on failure we still need to execute STLXP.
[Bug target/105928] [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #2 from Wilco --- Shifted logical operations are single cycle on all recent cores.
[Bug middle-end/110773] [Aarch64] crash (SIGBUS) due to atomic instructions on under-aligned memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110773 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #2 from Wilco --- This is really a user error, not a compiler issue. Just write it like:

struct Storage
{
  std::atomic<float> fp1;
  float padding;
  std::atomic<float> fp2;
} storage;

This ensures the correct alignment required for atomic accesses of fp1/fp2.
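The layout constraint can be checked statically. A minimal C11 sketch of the same idea (the report uses C++ std::atomic; _Atomic behaves the same way for alignment purposes):

```c
#include <assert.h>
#include <stddef.h>

/* Equivalent of the fixed layout: with the explicit padding member, both
   atomics fall on their natural boundaries, so the hardware atomic
   accesses cannot fault on alignment.  */
struct Storage
{
  _Atomic float fp1;
  float padding;
  _Atomic float fp2;
};
```

A static_assert on `offsetof (struct Storage, fp2) % _Alignof (_Atomic float)` in the source would catch a regression of the layout at compile time.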
[Bug target/95751] [aarch64] Consider using ldapr for __atomic_load_n(acquire) on ARMv8.3-RCPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95751 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org Resolution|--- |FIXED Status|NEW |RESOLVED Target Milestone|--- |13.0 --- Comment #2 from Wilco --- Fixed in GCC13.
[Bug target/111121] AArch64: MOPS memmove operand corruption
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111121 Wilco changed: What|Removed |Added Target Milestone|--- |14.0 Target||AArch64
[Bug target/111121] AArch64: MOPS memmove operand corruption
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111121 Wilco changed: What|Removed |Added Last reconfirmed||2023-08-23 Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org Ever confirmed|0 |1 Known to fail||12.0, 13.0 Status|UNCONFIRMED |ASSIGNED
[Bug target/111121] New: AArch64: MOPS memmove operand corruption
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111121 Bug ID: 111121 Summary: AArch64: MOPS memmove operand corruption Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- Since GCC 12.0 the following example corrupts x0 when built with -O2 -march=armv8.6-a+mops:

int *f (int *p, int *q, long n)
{
  memmove (p, q, n);
  return p;
}

f:
        cpyp    [x0]!, [x1]!, x2!
        cpym    [x0]!, [x1]!, x2!
        cpye    [x0]!, [x1]!, x2!
        ret

The expansion for memcpy works differently and inserts a copy to a temporary. -mstrict-align is ignored with small constant-sized memcpy if MOPS is enabled:

void g (int *p, int *q)
{
  memcpy (p, q, 32);
}

g:
        ldp     q0, q1, [x1]
        stp     q0, q1, [x0]
        ret
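The semantics the buggy expansion violated are simply that f returns the original p. The MOPS cpyp/cpym/cpye instructions destructively advance their operand registers, so the expander has to work on a scratch copy of the destination (as the memcpy expansion already does); that fix lives in the backend, but the required behaviour can be pinned down in plain C:

```c
#include <string.h>

/* The returned pointer must be the caller's p, untouched by the copy -
   exactly what a cpyp/cpym/cpye sequence operating directly on x0
   destroys.  */
int *f (int *p, int *q, long n)
{
  memmove (p, q, n);
  return p;
}
```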
[Bug target/106671] aarch64: BTI instruction are not inserted for cross-section direct calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671 --- Comment #17 from Wilco --- (In reply to Mark Brown from comment #13) > The kernel hasn't got any problem with BTI as far as I am aware - when built > with clang we run the kernel with BTI enabled since clang does just insert a > BTI C at the start of every function, and GCC works fine so long as we don't > get any out of range jumps being generated. The issue is that we don't have > anything to insert veneers in the case where section placement puts static > functions into a distant enough part of memory to need an indirect jump but > GCC has decided to omit the landing pad. Is the kernel already larger than 128 MBytes .text? Or do people do weird stuff with section placement that causes branches to be out of range?
[Bug target/106671] aarch64: BTI instruction are not inserted for cross-section direct calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #10 from Wilco --- (In reply to Feng Xue from comment #9) > On some occasions, we may not use the new ld, the kernel-building relies on > its own runtime linker which is used for kernel modules. So I created a > patch (https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626084.html), > and this provides user another option that could be done at the compiler > side. Reducing BTI is important for security. With LTO a binary should only have BTI on functions that are indirectly called. So I don't like the idea of adding more BTI with a new option - it means we will need a linker optimization to remove those redundant BTIs (eg. by changing them into NOPs). Note that branch offsets up to 256MB don't need special veneer handling: one should place a direct branch about halfway to the destination. Does Linux do any weird hacks in -fpatchable-function-entry that makes it hard to use BTI?
[Bug middle-end/110791] [12/13/14 Regression] arm: Wrong code with -Os -march=armv8.1-m.main
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110791 Wilco changed: What|Removed |Added Ever confirmed|0 |1 Component|rtl-optimization|middle-end Status|UNCONFIRMED |NEW --- Comment #6 from Wilco --- (In reply to Alex Coplan from comment #5)
> Thanks Wilco for the simpler example.
>
> It seems to have started with
> r13-1268-g8c99e307b20c502e55c425897fb3884ba8f05882 with both of these
> testcases, but it's probably a latent issue elsewhere (since it doesn't seem
> to show up on other targets). Needs more analysis.

The bug happens on all targets with -fmodulo-sched -Os, eg. AArch64:

bug:
        sub     x2, x1, x0
        add     x2, x2, 1
        cmp     x0, x1
        bhi     .L7
        cmn     x0, #1
        bne     .L9
.L7:
        mov     x2, 1
.L9:
        subs    x2, x2, #1
        beq     .L1
        ldrb    w0, [x1, -1]
        cmp     w0, 47
        beq     .L6
.L1:
        ret
.L6:
        sub     x1, x1, #1
        b       .L9
[Bug rtl-optimization/110791] [12/13/14 Regression] arm: Wrong code with -Os -march=armv8.1-m.main
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110791 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #4 from Wilco --- Simpler example:

void f(void);
void bug (char *path, char *p)
{
  while (p > path && p[-1] == '/')
    p--;
  if (p < path)
    f ();
}

bug:
        subs    r3, r1, r0
        cmp     r0, r1
        add     r3, r3, #1
        bhi     .L7
        adds    r0, r0, #1
        bne     .L5
.L7:
        movs    r3, #1
.L5:
        subs    r3, r3, #1
        bne     .L2
        bcc     .L3        ** this is obviously never taken
        bx      lr
.L2:
        ldrb    r2, [r1, #-1]!  @ zero_extendqisi2
        cmp     r2, #47
        beq     .L5
        bx      lr
.L3:
        b       f
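The invariant the miscompilation breaks can be checked at runtime: the loop only decrements p while p > path, so p < path is unreachable and f() must never be called. A runnable version of the reduced testcase (the harness is mine):

```c
static int called;
static void f (void) { called = 1; }

/* Reduced testcase from the PR: the f() branch is dead code, but
   -fmodulo-sched -Os miscompiled the loop exit condition so that it
   could be reached.  */
static void bug (char *path, char *p)
{
  while (p > path && p[-1] == '/')
    p--;
  if (p < path)
    f ();
}
```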
[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061 --- Comment #14 from Wilco --- (In reply to Wilco from comment #13)
> (In reply to Xi Ruoyao from comment #12)
> > (In reply to Wilco from comment #11)
> > > > Then the compiler (and the standard) is not what they consider. Such
> > > > misunderstandings are everywhere and this has no difference.
> > >
> > > Where is int128 in "the standard"?
> >
> > Consider this:
> >
> > const _Atomic long double x = 0.1;
> >
> > int main()
> > {
> >   double y = x;
> >   return y != 0.1;
> > }
> >
> > If CAS is used here, the program will just segfault. Does the standard say
> > this is ill-formed or not?
>
> I'd say this is ill formed yes. And it will crash on Atom laptops.

Correction - it crashes on all AMD cpus too. Are you going to file bug reports for this?
[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061 --- Comment #13 from Wilco --- (In reply to Xi Ruoyao from comment #12)
> (In reply to Wilco from comment #11)
> > > Then the compiler (and the standard) is not what they consider. Such
> > > misunderstandings are everywhere and this has no difference.
> >
> > Where is int128 in "the standard"?
>
> Consider this:
>
> const _Atomic long double x = 0.1;
>
> int main()
> {
>   double y = x;
>   return y != 0.1;
> }
>
> If CAS is used here, the program will just segfault. Does the standard say
> this is ill-formed or not?

I'd say this is ill-formed, yes. And it will crash on Atom laptops.
[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061 --- Comment #11 from Wilco --- (In reply to Xi Ruoyao from comment #10)
> (In reply to Wilco from comment #9)
> > (In reply to Xi Ruoyao from comment #8)
> > > (In reply to Wilco from comment #7)
> > > > I don't see the issue you have here. GCC for x86/x86_64 has been using
> > > > compare exchange for atomic load (which always does a write even if the
> > > > compare fails) for many years.
> > >
> > > No we don't, since r7-6454.
> >
> > Incorrect - libatomic still uses cmpxchg16b depending on the CPU.
>
> You are incorrect. It checks cmpxchg16b bit in CPUID but does not use the
> cmpxchg16b instruction.

No, it will use the cmpxchg16b instruction in the other ifunc when AVX is not available. Libatomic will fall back to locking atomics if neither AVX nor cmpxchg16b are available (first few generations of x86_64).

> The reason to check cmpxchg16b is both Intel and AMD guarantee that if both
> cmpxchg16b and AVX are available, then an aligned 16-byte load with vmovdqa
> is atomic. So we can use vmovdqa to do a lock-free load then. But using
> cmpxchg16b for a load is still wrong, and libatomic do NOT use it.
>
> > > > The question is, do you believe compilers should provide users with
> > > > fast and
> > > > efficient atomics they need? Or do you want to force every application
> > > > to
> > > > implement their own version of 128-bit atomics?
> > >
> > > But a compiler must generate correct code first. They can use the
> > > wonderful
> > > inline assembly because they know CAS is safe in their case, but the
> > > compiler does not know.
> >
> > Many developers consider locking atomics fundamentally incorrect. If we emit
> > lock-free atomics they don't need to write inline assembler.
>
> Then the compiler (and the standard) is not what they consider. Such
> misunderstandings are everywhere and this has no difference.

Where is int128 in "the standard"?
[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061 --- Comment #9 from Wilco --- (In reply to Xi Ruoyao from comment #8) > (In reply to Wilco from comment #7) > > I don't see the issue you have here. GCC for x86/x86_64 has been using > > compare exchange for atomic load (which always does a write even if the > > compare fails) for many years. > > No we don't, since r7-6454. Incorrect - libatomic still uses cmpxchg16b depending on the CPU. > > The question is, do you believe compilers should provide users with fast and > > efficient atomics they need? Or do you want to force every application to > > implement their own version of 128-bit atomics? > > But a compiler must generate correct code first. They can use the wonderful > inline assembly because they know CAS is safe in their case, but the > compiler does not know. Many developers consider locking atomics fundamentally incorrect. If we emit lock-free atomics they don't need to write inline assembler.
[Bug rtl-optimization/109930] transform atomic exchange to unconditional store when old value is unused?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109930 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #4 from Wilco --- (In reply to Simon Richter from comment #3)
> I was looking at ARMv7 initially.
>
> If I understood the implementation correctly, this can be a generic
> optimization.

This optimization is only valid for release or relaxed semantics, otherwise you remove the acquire semantics of the exchange (without a proof that this is 100% safe, it would likely allow an illegal reordering). Btw if you know the old state then there is presumably no concurrent access here and so you don't need atomics, let alone sequential consistency.
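A sketch of the one case where the exchange-to-store rewrite is safe: the old value is discarded and the ordering carries no acquire half (function and variable names are mine, for illustration):

```c
#include <stdatomic.h>

static _Atomic int flag = 1;

/* The returned old value is unused and the ordering is release-only, so
   this exchange may legally become a plain atomic store.  With
   memory_order_acq_rel or memory_order_seq_cst the rewrite is unsafe:
   the load half of the exchange carries acquire semantics that a plain
   store does not provide.  */
static void clear_flag (void)
{
  atomic_exchange_explicit (&flag, 0, memory_order_release);
}
```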
[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061 Wilco changed: What|Removed |Added Resolution|DUPLICATE |--- Status|RESOLVED|NEW --- Comment #7 from Wilco --- I don't see the issue you have here. GCC for x86/x86_64 has been using compare exchange for atomic load (which always does a write even if the compare fails) for many years. LLVM does the same for AArch64/x86/x86_64. If you believe this is incorrect/invalid, do you have any evidence this causes crashes in real applications? As a result of GCC's bad choice of using locking atomics on AArch64, many applications are forced to implement 128-bit atomics themselves using hacky inline assembler. Just one example for reference: https://github.com/boostorg/atomic/blob/08bd4e20338c503d2acfdddfdaa8f5e0bcf9006c/include/boost/atomic/detail/core_arch_ops_gcc_aarch64.hpp#L1635 The question is, do you believe compilers should provide users with fast and efficient atomics they need? Or do you want to force every application to implement their own version of 128-bit atomics?
[Bug target/110061] libatomic: 128-bit atomics should be lock-free on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061 Wilco changed: What|Removed |Added Last reconfirmed||2023-05-31 See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=80878 Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org Status|RESOLVED|NEW Resolution|DUPLICATE |--- Ever confirmed|0 |1 --- Comment #4 from Wilco --- Reopened. Please don't close bugs without allowing for discussion first. I'll send a patch soon that shows it's possible and valid. And if there is a better solution that results in the same benefits (fast lock-free atomics, allowing inlining and use of latest instructions without ABI issues) then I would love to hear ideas and suggestions.
[Bug target/110061] New: libatomic: 128-bit atomics should be lock-free on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110061 Bug ID: 110061 Summary: libatomic: 128-bit atomics should be lock-free on AArch64 Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- 128-bit atomics should be lock-free on AArch64. This is what most users expect, gives better performance and makes it possible to inline/outline the recently added 128-bit atomic instructions. It also makes GCC and LLVM ABI compatible (since LLVM atomics are always lock-free).
[Bug c/109553] Atomic operations vs const locations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109553 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #2 from Wilco --- (In reply to Xi Ruoyao from comment #1)
> > but even for atomic load we may want to hint to the user to avoid doing an
> > atomic load from const types.
>
> this does not make sense. The "const" in "const T *" only means you cannot
> modify the object via the pointer, not mean the value of the object won't
> change. Consider:
>
> void thread1 (int *ptr)
> {
>   /* ... */
>   __atomic_add_fetch (ptr, 1, __ATOMIC_SEQ_CST);
>   /* ... */
> }
>
> void thread2 (const int *ptr)
> {
>   /* ... */
>   int t = __atomic_load_n (ptr, __ATOMIC_SEQ_CST);
>   /* ... */
> }
>
> It's perfectly legal the two "ptr" can point to the same object. Then if
> you use the usual load intead of __atomic_load_n, a race will happen.

It would be legal if __atomic_load_n is documented to use a const argument, but it doesn't allow const: https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins:~:text=Built%2Din%20Function%3A%20type%20__atomic_load_n%20(type%20*ptr%2C%20int%20memorder) Since atomic accesses are about synchronizing writes with reads, a diagnostic would be useful, particularly for the case Kyrill mentioned.
[Bug libgcc/108891] libatomic: AArch64 SEQ_CST 16-byte load missing barrier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108891 Wilco changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #2 from Wilco --- Fixed
[Bug libgcc/108891] libatomic: AArch64 SEQ_CST 16-byte load missing barrier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108891 Wilco changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2023-02-23 Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org
[Bug libgcc/108891] New: libatomic: AArch64 SEQ_CST 16-byte load missing barrier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108891 Bug ID: 108891 Summary: libatomic: AArch64 SEQ_CST 16-byte load missing barrier Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libgcc Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- LSE2 uses the following sequence for a 16-byte atomic load:

        ldp     res0, res1, [x0]
        dmb     ish

The AArch64 memory model allows the LDP to be reordered with an earlier STLXP (eg. a SEQ_CST exchange), thus breaking SEQ_CST ordering. To avoid this, atomic loads need a barrier before the LDP - either DMB ISHLD or LDAR works.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838 --- Comment #21 from Wilco --- (In reply to Gabriel Ravier from comment #19)
> If the original code being branchless makes it faster, wouldn't that imply
> that we should use the table-based implementation when generating code for
> `__builtin_ctz` ?

__builtin_ctz is 3-4 times faster than the table implementation, so this optimization is always worth it. This is why I believe the current situation is not ideal since various targets still set CTZ_DEFINED_VALUE_AT_ZERO to 0 or 1. One option would be to always allow it in Gimple (perhaps add an extra argument for the value to return for a zero input), and at expand time check whether the backend supports the requested value. If it doesn't, emit branches.
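The table-based ctz idiom this PR teaches GCC to detect is the classic de Bruijn multiplication trick. The version below is the well-known 32-bit bit-twiddling form, shown as a reference for the shape of code being matched (it is not the PR's exact testcase):

```c
#include <stdint.h>

/* Table-based ctz: x & -x isolates the lowest set bit; multiplying by
   the de Bruijn constant 0x077CB531 places a unique 5-bit pattern in
   the top bits, which indexes the position table.  The whole idiom is
   equivalent to __builtin_ctz for nonzero x (it returns 0 for x == 0).  */
static const int debruijn_tab[32] = {
   0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
  31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

static int table_ctz (uint32_t x)
{
  return debruijn_tab[((x & -x) * 0x077CB531u) >> 27];
}
```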
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838 --- Comment #17 from Wilco --- (In reply to Jakub Jelinek from comment #16)
> (In reply to Wilco from comment #15)
> > It would make more sense to move x86 backends to CTZ_DEFINED_VALUE_AT_ZERO
> > == 2 so that you always get the same result even when you don't have tzcnt.
> > A conditional move would be possible, so it adds an extra 2 instructions at
> > worst (ie. still significantly faster than doing the table lookup, multiply
> > etc). And it could be optimized when you know CLZ/CTZ input is non-zero.
>
> Conditional moves are a lottery on x86, in many cases very bad idea. And
> when people actually use __builtin_clz*, they state that they don't care
> about the 0 value, so emitting terribly performing code for it just in case
> would be wrong.
> If forwprop emits the conditional in separate blocks for the CTZ_DVAZ!=2
> case, on targets where conditional moves are beneficial for it it can also
> emit them, or emit the jump which say on x86 will be most likely faster than
> cmov.

Well GCC emits a cmov for this (-O2 -march=x86-64-v2):

int ctz (long a)
{
  return (a == 0) ? 64 : __builtin_ctzl (a);
}

ctz:
        xor     edx, edx
        mov     eax, 64
        rep bsf rdx, rdi
        test    rdi, rdi
        cmovne  eax, edx
        ret

Note the extra 'test' seems redundant since IIRC bsf sets Z=1 if the input is zero. On Zen 2 this has identical performance as the plain builtin when you loop it as res = ctz (res) + 1; (ie. measuring latency of the non-zero case). So I find it hard to believe cmov is expensive on modern cores.
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838 --- Comment #15 from Wilco --- (In reply to Jakub Jelinek from comment #14) > The patch does: > + bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval) > == 2; > + > + /* Skip if there is no value defined at zero, or if we can't easily > +return the correct value for zero. */ > + if (!zero_ok) > + return false; > + if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size)) > + return false; > For CTZ_DEFINED_VALUE_AT_ZERO == 1 we could support it the same way but we'd > need > to emit into the IL an equivalent of val == 0 ? zero_val : .CTZ (val) (with > GIMPLE_COND and a separate bb - not sure if anything in forwprop creates new > basic blocks right now), where there is a high chance that RTL opts would > turn it back into unconditional > ctz. > That still wouldn't help non--mbmi x86, because CTZ_DEFINED_VALUE_AT_ZERO is > 0 there. > We could handle even that case by doing the branches around, but those would > stay there > in the generated code, at which point I wonder whether it would be a win. > The original > code is branchless... It would make more sense to move x86 backends to CTZ_DEFINED_VALUE_AT_ZERO == 2 so that you always get the same result even when you don't have tzcnt. A conditional move would be possible, so it adds an extra 2 instructions at worst (ie. still significantly faster than doing the table lookup, multiply etc). And it could be optimized when you know CLZ/CTZ input is non-zero.
[Bug target/108659] Suboptimal 128 bit atomics codegen on AArch64 and x64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108659 --- Comment #11 from Wilco --- (In reply to Niall Douglas from comment #10) > (In reply to Jakub Jelinek from comment #9) > > (In reply to Wilco from comment #8) > > > Yes that sounds like a reasonable approach. > > > > I don't think so. Not all variables on which __atomic_* intrinsics are used > > are actually _Atomic, the vars can be embedded in const aggregates etc. > > I'd have the attribute propagate to enclosing types, like over-alignment. Yes, a structure with a 128-bit Atomic type in a subfield/union would be forced to rwdata. And arbitrary casts (eg. from char* to an atomic type) wouldn't work due to Atomics requiring strict alignment. A 128-bit atomic type might have a higher alignment than a 128-bit integer so even casting that seems questionable.
[Bug target/108659] Suboptimal 128 bit atomics codegen on AArch64 and x64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108659 --- Comment #8 from Wilco --- (In reply to Niall Douglas from comment #7)
> (In reply to Andrew Pinski from comment #4)
> > (In reply to Niall Douglas from comment #3)
> > > You may be interested in reading https://reviews.llvm.org/D110069. It
> > > wanted
> > > to have LLVM generate a 128 bit AArch64 CAS for atomics. LLVM merged that
> > > change, it'll be in the next release.
> >
> > Using CAS for atomic load is not valid thing to do ...
> > Because atomic load from constant rodata needs to work.
> > LLVM breaks this case as they don't care about it. GCC does though.
>
> I've heard that argument before, and I've always wondered why _Atomic128
> types couldn't have an attribute which applies attribute section to their
> static const variable incarnations to force them into r/w memory. That would
> also solve the LLVM issue. Said attribute is not unuseful in general
> actually, it would help avoid having to mess with mprotect to apply copy on
> write perms on regions in .rodata when you need to modify static const
> variable values.
>
> I don't think that the standard *guarantees* that static const variables go
> into read only memory, and besides, before C23 128 bit integers weren't
> supported anyway so one could argue as a proprietary extension (__int128)
> you get proprietary special casing.

Yes that sounds like a reasonable approach. There will be language lawyers who say it must also work on mmap after mprotect of course, but that seems even more unlikely in the real world... I believe that the vast majority of developers just want 128-bit atomics to work efficiently without locks when possible. Currently various packages are forced to create 128-bit atomics using inline assembler - and that seems a much worse hack than supporting lock-free atomics in the compiler.
[Bug target/108659] Suboptimal 128 bit atomics codegen on AArch64 and x64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108659 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #5 from Wilco --- (In reply to Andrew Pinski from comment #4)
> (In reply to Niall Douglas from comment #3)
> > You may be interested in reading https://reviews.llvm.org/D110069. It wanted
> > to have LLVM generate a 128 bit AArch64 CAS for atomics. LLVM merged that
> > change, it'll be in the next release.
>
> Using CAS for atomic load is not valid thing to do ...
> Because atomic load from constant rodata needs to work.
> LLVM breaks this case as they don't care about it. GCC does though.

The question is how useful is this in reality? If memory is not writeable then you can use atomic loads but no other atomic accesses. We could be pragmatic and say that using 128-bit atomic loads from non-writeable memory is a user error just like unaligned atomic accesses. To me a far worse issue is that this difference for 128-bit atomics means that LLVM and GCC are binary incompatible. AFAIK it isn't an option to make them compatible either (on AArch64 GCC13 will use a compatible sequence only if LSE2 is available).
[Bug target/107678] [13 Regression] Segfault in aarch64_fallback_frame_state when running SVE code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678 Wilco changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #9 from Wilco --- Fixed
[Bug libgcc/108279] Improved speed for float128 routines
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #21 from Wilco --- (In reply to Jakub Jelinek from comment #20)
> __attribute__((noinline, optimize ("rounding-math"))) static int
> round_to_nearest (void) { return 1.0f - __FLT_MIN__ == 1.0f + __FLT_MIN__; }

Wouldn't that always set inexact?

> and
> if (round_to_nearest ()) \
>   _fcw = FP_RND_NEAREST; \
> else \
>   __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw)); \
>
> Except that from _fcw we don't determine just the rounding mode but also
> what exceptions are enabled.

Yes, that wouldn't work for fenv, but FP emulation functions don't need to read the exception flags.
[Bug libgcc/108279] Improved speed for float128 routines
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #18 from Wilco --- (In reply to Michael_S from comment #12)
> This set of options does not map too well into the real difficulties of
> implementation. There are only 2 things that are expensive:
> 1. Inexact Exception
> 2. Fetching of the current rounding mode.
> The rest of the IEEE-754 features are so cheap that creating separate
> variants without them simply is not worth the effort of maintaining
> distinct variants, even if all the difference is a single three-line #ifdef

In general reading the current rounding mode is relatively cheap, but modifying it can be expensive, so optimized fenv implementations in GLIBC only modify the FP status if a change is required. It should be feasible to check for round-to-even and use optimized code for that case.

> BTW, the Inexact Exception can be made fairly affordable with a little help
> from the compiler. All we need for that is the ability to say "don't remove
> this floating point addition even if you don't see that it produces any
> effect". Something similar to 'volatile', but with volatile the compiler
> currently puts the result of the addition on the stack, which adds
> undesirable cost. However, judging by the comment of Jakub, compiler
> maintainers are not particularly interested in this enterprise.

There are macros in GLIBC's math-barriers.h which do what you want, e.g. for AArch64:

#define math_opt_barrier(x) \
  ({ __typeof (x) __x = (x); __asm ("" : "+w" (__x)); __x; })
#define math_force_eval(x) \
  ({ __typeof (x) __x = (x); __asm __volatile__ ("" : : "w" (__x)); })

The first blocks optimizations (like constant folding) across the barrier; the second forces evaluation of an expression even if it is deemed useless. These are used in many math functions in GLIBC. They are target specific due to needing inline assembler operands, but it should be easy to add similar definitions to libgcc.
[Bug target/108006] [13 Regression] ICE in aarch64_move_imm building 502.gcc_r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006 Wilco changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org Status|UNCONFIRMED |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |FIXED --- Comment #3 from Wilco --- Fixed now.
[Bug target/107678] [13 Regression] Segfault in aarch64_fallback_frame_state when running SVE code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678 Wilco changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #6 from Wilco --- So the issue is that AArch64 return address signing was using the loc.offset field, which is now uninitialized. The fix is to check for REG_UNSAVED first and use that to initialize the offset. I'm testing a patch.
[Bug target/107678] [13 Regression] Segfault in aarch64_fallback_frame_state when running SVE code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #5 from Wilco --- It's not related to SVE unwinding, since it still fails with +nosve. The crash happens after reading the correct return address from the stack. However, the top bits of this value get corrupted, resulting in an illegal access in aarch64_fallback_frame_state (doing an unconditional read seems like a bad idea, since at this point the RA may be corrupted).
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 107413, which changed state. Bug 107413 Summary: Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413 Wilco changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #17 from Wilco --- (In reply to Rama Malladi from comment #16)
> (In reply to Wilco from comment #15)
> > (In reply to Rama Malladi from comment #14)
> > > This fix also improved performance of 538.imagick_r by 15%. Did you
> > > have a similar observation? Thank you.
> >
> > No, but I was using -mcpu=neoverse-n1 as my baseline. It's possible
> > -mcpu=neoverse-v1 shows larger speedups; what gain do you get on the
> > overall FP score?
>
> I was using -mcpu=native and ran on a Neoverse V1 (Graviton3). Here are
> the scores I got (relative gains of the latest mainline vs. an earlier
> mainline).
>
> Latest mainline: 0976b012d89e3d819d83cdaf0dab05925b3eb3a0
> Earlier mainline: f896c13489d22b30d01257bc8316ab97b3359d1c

Right, that's about 3 weeks of changes; I think 1b9a5cc9ec08e9f239dd2096edcc447b7a72f64a has improved imagick_r.

> geomean 1.03

That's a nice gain in 3 weeks!
[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413 --- Comment #15 from Wilco --- (In reply to Rama Malladi from comment #14)
> This fix also improved performance of 538.imagick_r by 15%. Did you have a
> similar observation? Thank you.

No, but I was using -mcpu=neoverse-n1 as my baseline. It's possible -mcpu=neoverse-v1 shows larger speedups; what gain do you get on the overall FP score?
[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413 Wilco changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2022-11-04 Status|UNCONFIRMED |ASSIGNED Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #10 from Wilco --- (In reply to Rama Malladi from comment #9)
> (In reply to Rama Malladi from comment #8)
> > (In reply to Wilco from comment #7)
> > > The revert results in about a 0.5% loss on Neoverse N1, so it looks
> > > like the reassociation pass is still splitting FMAs into separate MUL
> > > and ADD (which is bad for narrow cores).
> >
> > Thank you for checking on N1. Did you happen to check on V1 too to
> > reproduce the perf results I had? Any other experiments/tests I can do
> > to help on this filing? Thanks again for the debug/fix.
>
> I ran the SPEC cpu2017 fprate 1-copy benchmark built with the patch
> reverted, using option 'neoverse-n1', on the Graviton 3 processor (which
> has support for SVE). Performance was up by 0.4%, the primary contributor
> being 519.lbm_r, which was up 13%.

I'm seeing about a 1.5% gain on Neoverse V1 and a 0.5% loss on Neoverse N1. I'll post a patch that allows per-CPU settings for FMA reassociation, so you'll get good performance with -mcpu=native. However, reassociation really needs to be taught about the existence of FMAs.
[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark with r8-7132-gb5b33e113434be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413 --- Comment #7 from Wilco --- (In reply to Rama Malladi from comment #5)
> So, it looks like we aren't impacted much by this commit revert.
>
> I haven't yet tried fp_reassoc_width. Will try shortly.

The revert results in about a 0.5% loss on Neoverse N1, so it looks like the reassociation pass is still splitting FMAs into separate MUL and ADD (which is bad for narrow cores).
[Bug tree-optimization/107413] Perf loss ~14% on 519.lbm_r SPEC cpu2017 benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #2 from Wilco --- That's interesting - if the reassociation pass has become a bit smarter in the last 5 years, we might no longer need this workaround. What is the effect on the overall SPECFP score? Did you try other values like fp_reassoc_width = 2 or 3?
[Bug target/107316] [aarch64] Init big const value should be improved compare to llvm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107316 Wilco changed: What|Removed |Added Status|NEW |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |FIXED --- Comment #3 from Wilco --- As Andrew says, it's a duplicate, so it's fixed now.
[Bug target/107316] [aarch64] Init big const value should be improved compare to llvm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107316 Bug 107316 depends on bug 106583, which changed state. Bug 106583 Summary: Suboptimal immediate generation on aarch64 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106583 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED