[Bug middle-end/50481] builtin to reverse the bit order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50481 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #5 from Wilco --- Yes it would be good to have a generic builtin, this issue keeps coming up: https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01187.html
[Bug target/84521] aarch64: Frame-pointer corruption with __builtin_setjmp/__builtin_longjmp and -fomit-frame-pointer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84521 --- Comment #29 from Wilco --- Author: wilco Date: Wed Jun 19 12:52:43 2019 New Revision: 272473 URL: https://gcc.gnu.org/viewcvs?rev=272473&root=gcc&view=rev

Log: Simplify setjmp and non-local goto implementation (PR84521)

This fixes and simplifies the setjmp and non-local goto implementation. Currently the virtual frame pointer is saved when using __builtin_setjmp or a non-local goto. Depending on whether a frame pointer is used, this may save either SP or FP with an immediate offset. However the goto or longjmp always updates the hard frame pointer.

A receiver veneer in the original function then assigns the hard frame pointer to the virtual frame pointer, which should, if it works correctly, again assign SP or FP. However the special elimination code in eliminate_regs_in_insn doesn't do this correctly unless the frame pointer is used, and even if it worked by writing SP, the frame pointer would still be corrupted.

A much simpler implementation is to always save and restore the hard frame pointer. This avoids two redundant instructions which add/subtract the virtual frame offset. A large amount of code can be removed as a result, including all implementations of TARGET_BUILTIN_SETJMP_FRAME_VALUE (all of which already use the hard frame pointer). The expansion of nonlocal_goto on PA can be simplified to just restore the hard frame pointer.

This fixes the most obvious issues, however there are still issues on targets which define HARD_FRAME_POINTER_IS_FRAME_POINTER (arm, mips). Each function could have a different hard frame pointer, so a non-local goto may restore the wrong frame pointer (TARGET_BUILTIN_SETJMP_FRAME_VALUE could be useful for this).

The i386 TARGET_BUILTIN_SETJMP_FRAME_VALUE was incorrect: if stack_realign_fp is true, it would save the hard frame pointer value but restore the virtual frame pointer, which according to ix86_initial_elimination_offset can have a non-zero offset from the hard frame pointer.
The ia64 implementation of nonlocal_goto seems incorrect since the helper function moves the frame pointer value into the static chain register (so this patch does nothing to make it better or worse).

AArch64 + x86-64 bootstrap OK, new test passes on AArch64, x86-64 and Arm.

gcc/
	PR middle-end/84521
	* builtins.c (expand_builtin_setjmp_setup): Save
	hard_frame_pointer_rtx.
	(expand_builtin_setjmp_receiver): Do not emit sfp = fp move since we
	restore fp.
	* function.c (expand_function_start): Save hard_frame_pointer_rtx for
	non-local goto.
	* lra-eliminations.c (eliminate_regs_in_insn): Remove sfp = fp
	elimination code.
	(remove_reg_equal_offset_note): Remove unused function.
	* reload1.c (eliminate_regs_in_insn): Remove sfp = hfp elimination
	code.
	* config/arc/arc.c (TARGET_BUILTIN_SETJMP_FRAME_VALUE): Remove.
	(arc_builtin_setjmp_frame_value): Remove function.
	* config/avr/avr.c (TARGET_BUILTIN_SETJMP_FRAME_VALUE): Remove.
	(avr_builtin_setjmp_frame_value): Remove function.
	* config/i386/i386.c (TARGET_BUILTIN_SETJMP_FRAME_VALUE): Remove.
	(ix86_builtin_setjmp_frame_value): Remove function.
	* config/pa/pa.md (nonlocal_goto): Remove FP adjustment.
	* config/sparc/sparc.c (TARGET_BUILTIN_SETJMP_FRAME_VALUE): Remove.
	(sparc_builtin_setjmp_frame_value): Remove function.
	* config/vax/vax.c (TARGET_BUILTIN_SETJMP_FRAME_VALUE): Remove.
	(vax_builtin_setjmp_frame_value): Remove function.
	* config/xtensa/xtensa.c (xtensa_frame_pointer_required): Force frame
	pointer if has_nonlocal_label.

testsuite/
	PR middle-end/84521
	* gcc.c-torture/execute/pr84521.c: New test.

Added: trunk/gcc/testsuite/gcc.c-torture/execute/pr84521.c
Modified: trunk/gcc/ChangeLog trunk/gcc/builtins.c trunk/gcc/config/arc/arc.c trunk/gcc/config/avr/avr.c trunk/gcc/config/i386/i386.c trunk/gcc/config/pa/pa.md trunk/gcc/config/sparc/sparc.c trunk/gcc/config/vax/vax.c trunk/gcc/config/xtensa/xtensa.c trunk/gcc/function.c trunk/gcc/lra-eliminations.c trunk/gcc/reload1.c trunk/gcc/testsuite/ChangeLog
[Bug middle-end/64242] Longjmp expansion incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242 --- Comment #33 from Wilco --- Author: wilco Date: Mon Jun 17 11:25:12 2019 New Revision: 272382 URL: https://gcc.gnu.org/viewcvs?rev=272382&root=gcc&view=rev

Log: Improve PR64242 testcase

Clear the input array to avoid the testcase accidentally passing with an incorrect frame pointer. Committed as obvious.

testsuite/
	PR middle-end/64242
	* gcc.c-torture/execute/pr64242.c: Improve test.

Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.c-torture/execute/pr64242.c
[Bug middle-end/64242] Longjmp expansion incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242 --- Comment #27 from Wilco --- (In reply to dave.anglin from comment #26)
> On 2019-06-10 9:51 a.m., wilco at gcc dot gnu.org wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242
> >
> > --- Comment #25 from Wilco ---
> > I believe this is now fixed for generic code - however targets which
> > implement the nonlocal_goto expander (eg. pa, sparc, vax) need similar
> > fixes in their backends.
>
> The test no longer fails on pa. We need a test that demonstrates the
> problem in the current pa expander.

It allocates the buf variables at exactly the same offset in both frames, so the code below accidentally reads the right values from the wrong address:

broken_longjmp:
	.PROC
	.CALLINFO FRAME=192,CALLS,SAVE_RP,SAVE_SP,ENTRY_GR=3
	.ENTRY
	copy %r3,%r1
	stw %r2,-20(%r30)
	copy %r30,%r3
	stwm %r1,192(%r30)
	copy %r26,%r25
	ldi 20,%r24
	bl memcpy,%r2
	ldo 8(%r3),%r26
	ldw 8(%r3),%r28    -> load frame pointer of parent function
	ldo -8(%r28),%r3   -> set hard_frame_pointer
	ldw 16(%r3),%r30   -> use wrong frame pointer to read stack pointer
	ldw 12(%r3),%r1    -> use wrong frame pointer to read return address
	be,n 0(%sr4,%r1)

Given there are many possible stack layouts, the easiest option would be to clear the input buffer so it will jump to a null pointer. Eg.

__attribute ((noinline)) void
broken_longjmp (void *p)
{
  void *buf[32];
  __builtin_memcpy (buf, p, 5 * sizeof (void*));
  __builtin_memset (p, 0, 5 * sizeof (void*));
  /* Corrupts stack pointer...  */
  __builtin_longjmp (buf, 1);
}
[Bug tree-optimization/90836] Missing popcount pattern matching
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90836 --- Comment #2 from Wilco --- (In reply to Richard Biener from comment #1) > While it could be matched in match.pd variations would be quite a lot. > I don't see where else it fits (apart from forwprop which already does > fancy pattern matching...). > > You probably want to match generic INTEGER_CSTs and delay the actual > verification into a with {}. > > I suppose popcount on integers other than long long work with the same > pattern? There are lots of possible variations and given proving that a particular sequence is equivalent to popcount may be too difficult, it's about supporting the cases that occur in code that people care about.
[Bug middle-end/64242] Longjmp expansion incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242 --- Comment #25 from Wilco --- I believe this is now fixed for generic code - however targets which implement the nonlocal_goto expander (eg. pa, sparc, vax) need similar fixes in their backends.
[Bug middle-end/90693] Missing popcount simplifications
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90693 --- Comment #2 from Wilco --- (In reply to Dávid Bolvanský from comment #1)
> > __builtin_popcount (x) == 1 into x == (x & -x)
>
> This will not work for x = 0.
>
> Should work:
> x && x == (x & -x)
> x && (x & x-1) == 0

Good point, though that's not needed for (x & (x-1)) != 0 given you can only have 2 or more bits set if x was non-zero to start with. It's worth finding a branchless sequence, otherwise it may not be faster. Maybe (x << clz (x)) == INT_MIN or (x-1)
[Bug middle-end/64242] Longjmp expansion incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242 --- Comment #24 from Wilco --- Author: wilco Date: Mon Jun 3 13:55:15 2019 New Revision: 271870 URL: https://gcc.gnu.org/viewcvs?rev=271870&root=gcc&view=rev

Log: Fix PR64242 - Longjmp expansion incorrect

Improve the fix for PR64242. Various optimizations can change a memory reference into a frame access. Given there are multiple virtual frame pointers which may be replaced by multiple hard frame pointers, there are no checks for writes to the various frame pointers. So updates to a frame pointer tend to generate incorrect code. Improve the previous fix to also add clobbers of several frame pointers and add a scheduling barrier. This should work in most cases until GCC supports a generic "don't optimize across this instruction" feature.

Bootstrap OK. Testcase passes on AArch64 and x86-64. Inspected x86, Arm, Thumb-1 and Thumb-2 assembler which looks correct.

gcc/
	PR middle-end/64242
	* builtins.c (expand_builtin_longjmp): Add frame clobbers and schedule
	block.
	(expand_builtin_nonlocal_goto): Likewise.

testsuite/
	PR middle-end/64242
	* gcc.c-torture/execute/pr64242.c: Update test.

Modified: trunk/gcc/ChangeLog trunk/gcc/builtins.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.c-torture/execute/pr64242.c
[Bug driver/90684] New alignment options incorrectly report error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90684 --- Comment #3 from Wilco --- Author: wilco Date: Mon Jun 3 11:27:50 2019 New Revision: 271864 URL: https://gcc.gnu.org/viewcvs?rev=271864&root=gcc&view=rev

Log: Fix alignment option parser (PR90684)

Fix the alignment option parser to always allow up to 4 alignments. Now -falign-functions=16:8:8:8 no longer reports an error.

gcc/
	PR driver/90684
	* opts.c (parse_and_check_align_values): Allow 4 alignment values.

Modified: trunk/gcc/ChangeLog trunk/gcc/opts.c
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #36 from Wilco --- (In reply to Orr Shalom Dvory from comment #35) > Hi, thanks for your respond. can someone mark this bug as need to be > improved? > Does anyone agree/disagree with my new proposed method? It's best to create a new bug if there are still easy cases that could be optimized. Also it seems the constants it uses are quite complex - it may be feasible to simplify them. Eg. long f(long x) { return x % 35 == 0; }
[Bug middle-end/90693] New: Missing popcount simplifications
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90693 Bug ID: 90693 Summary: Missing popcount simplifications Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- While GCC optimizes __builtin_popcount (x) != 0 into x != 0, we can also optimize __builtin_popcount (x) == 1 into x == (x & -x), and __builtin_popcount (x) > 1 into (x & (x-1)) != 0.
[Bug driver/90684] New alignment options incorrectly report error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90684 Wilco changed: What|Removed |Added Component|middle-end |driver Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #1 from Wilco --- Proposed patch: https://gcc.gnu.org/ml/gcc-patches/2019-05/msg02030.html
[Bug middle-end/90684] New: New alignment options incorrectly report error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90684 Bug ID: 90684 Summary: New alignment options incorrectly report error Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- GCC9 always reports an error when using -falign-functions=16:8:8

cc1: error: invalid number of arguments for ‘-falign-functions’ option: ‘16:8:8’

This is not working as documented in https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options. The issue is that the parser expects an undocumented macro SUBALIGN_LOG to be defined even though that is not necessary for the option parsing.
[Bug middle-end/12849] testing divisibility by constant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12849 Wilco changed: What|Removed |Added Status|NEW |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |FIXED --- Comment #6 from Wilco --- Fixed in GCC9.
[Bug target/90317] [7/8/9/10] ICE for arm sha1h and wrong optimisations on sha1h/c/m/p
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90317 Wilco changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-05-29 CC||wilco at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #1 from Wilco --- Confirmed
[Bug tree-optimization/90594] New: [9/10 regression] Spurious popcount emitted
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90594 Bug ID: 90594 Summary: [9/10 regression] Spurious popcount emitted Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: ---

The following testcase emits a popcount which computes the final pointer value. This is redundant given the loop already computes the pointer value. The popcount causes the code to be significantly larger and slower than previous GCC versions.

int *bad_popcount (unsigned x, int *p)
{
  for (; x != 0; )
    {
      int tmp = __builtin_ctz (x);
      x = x & (x - 1);
      *p++ = tmp;
    }
  return p;
}

GCC8:
	cbz	w0, .L2
.L3:
	rbit	w2, w0
	clz	w2, w2
	str	w2, [x1], 4
	sub	w2, w0, #1
	ands	w0, w0, w2
	bne	.L3
.L2:
	mov	x0, x1
	ret

GCC9:
	cbz	w0, .L12
	mov	x4, x1
	mov	w2, w0
.L11:
	rbit	w3, w2
	clz	w3, w3
	str	w3, [x4], 4
	sub	w3, w2, #1
	ands	w2, w2, w3
	bne	.L11
	fmov	s0, w0
	cnt	v0.8b, v0.8b
	addv	b0, v0.8b
	umov	w0, v0.b[0]
	sub	w0, w0, #1
	add	x0, x0, 1
	add	x0, x1, x0, lsl 2
	ret
.L12:
	mov	x0, x1
	ret
[Bug target/41999] Bug in generation of interrupt function code for ARM processor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=41999 Wilco changed: What|Removed |Added Status|NEW |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |WORKSFORME --- Comment #4 from Wilco --- Works since at least GCC4.5.4.
[Bug other/16996] [meta-bug] code size improvements
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=16996 Bug 16996 depends on bug 38570, which changed state. Bug 38570 Summary: [arm] -mthumb generates sub-optimal prolog/epilog https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |WORKSFORME
[Bug target/38570] [arm] -mthumb generates sub-optimal prolog/epilog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570 Wilco changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |WORKSFORME --- Comment #13 from Wilco --- Close
[Bug target/38570] [arm] -mthumb generates sub-optimal prolog/epilog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #12 from Wilco --- This has been fixed since at least GCC5.4.1: https://www.godbolt.org/z/BlaE1o
[Bug target/9831] [ARM] Peephole for multiple load/store could be more effective.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831 Wilco changed: What|Removed |Added Status|NEW |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |WONTFIX --- Comment #8 from Wilco --- There doesn't appear to be anything that can be improved here. Literal pool loads can't be easily peepholed into LDM, and there aren't many opportunities anyway.
[Bug target/42017] gcc compiling C for ARM has stopped using r14 in leaf functions?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42017 Wilco changed: What|Removed |Added Status|NEW |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |WORKSFORME --- Comment #6 from Wilco --- This has been fixed since at least GCC5.4: https://www.godbolt.org/z/6IAGfh
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #20 from Wilco --- (In reply to Martin Liška from comment #19) > Created attachment 46265 [details] > Patch candidate v2 > > Update patch that should be fine. Tests on x86_64 work except: > FAIL: gcc.c-torture/execute/builtins/mempcpy.c execution, -O1 > > Where the test expects that mempcpy without LHS should be transformed into > memcpy. Thanks, this version works as expected on AArch64. In principle we could keep mempcpy for tailcalls with -Os, but I'm not sure it's worth the trouble.
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #18 from Wilco --- (In reply to Martin Liška from comment #14)
> Created attachment 46262 [details]
> Patch candidate
>
> Patch candidate that handles:
>
> $ cat ~/Programming/testcases/mempcpy.c
> int *mempcopy2 (int *p, int *q, long n)
> {
>   return __builtin_mempcpy (p, q, n);
> }
>
> $ ./xgcc -B. -O2 -S ~/Programming/testcases/mempcpy.c -o/dev/stdout
> ...
> mempcopy2:
> .LFB0:
> 	.cfi_startproc
> 	pushq	%rbx
> 	.cfi_def_cfa_offset 16
> 	.cfi_offset 3, -16
> 	movq	%rdx, %rbx
> 	call	memcpy
> 	addq	%rbx, %rax
> 	popq	%rbx
> 	.cfi_def_cfa_offset 8
> 	ret
> ...
>
> There's a single failing test: gcc.dg/20050503-1.c

On AArch64 it calls mempcpy but memcpy on x86, which is the wrong way around...

-  if (retmode == RETURN_END && target != const0_rtx)
+  if (!TARGET_HAS_FAST_MEMPCPY_ROUTINE && retmode == RETURN_END && target != const0_rtx)

When I uncomment this, it uses memcpy. Also:

+	TARGET_HAS_FAST_MEMPCPY_ROUTINE
+	&& retmode == RETURN_END_MINUS_ONE,
+	&is_move_done);

Should that be RETURN_END? And is_move_done isn't used, is that right?
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #17 from Wilco --- (In reply to Wilco from comment #16) > (In reply to Martin Sebor from comment #15) > > I just noticed I have been misreading mempcpy as memccpy and so making no > > sense. Sorry about that! Please ignore my comments. > > I see, yes we have too many and the names are too similar. Now all we need > is someone to propose strlpcpy or stplcpy... https://github.com/fishilico/shared/blob/master/linux/nolibc/uname.c Aargh...
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #16 from Wilco --- (In reply to Martin Sebor from comment #15) > I just noticed I have been misreading mempcpy as memccpy and so making no > sense. Sorry about that! Please ignore my comments. I see, yes we have too many and the names are too similar. Now all we need is someone to propose strlpcpy or stplcpy...
[Bug rtl-optimization/90249] [9/10 Regression] Code size regression on thumb2 due to sub-optimal register allocation starting with r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90249 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #5 from Wilco --- (In reply to Segher Boessenkool from comment #4) > That is code *size*. Code size is expected to grow a tiny bit, because of > *better* register allocation. > > But we could not do make_more_copies at -Os, if that helps? (The hard > register > changes themselves are required for correctness). Better register allocation implies lower average codesize due to fewer spills, fewer callee-saves, fewer moves etc. I still don't understand what specific problem make_more_copies is trying to solve. Is it trying to do life-range splitting of argument registers?
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #12 from Wilco --- (In reply to Martin Sebor from comment #11)
> My concern is that transforming memccpy to memcpy would leave little
> incentive for libraries like glibc to provide a more optimal
> implementation. Would implementing the function simply as memcpy and
> having the latter return the result of the former be a viable option?
> Basically, rename memcpy to memccpy, parameterizing it on the termination
> character in the process, and change memcpy to call memccpy and return
> the first pointer.

I have no idea what you mean - there is no way you can implement memcpy using memccpy. It never makes sense to slow down a performance-critical function in order to speed up an infrequently used one.
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #9 from Wilco --- (In reply to Martin Sebor from comment #7) > Rather than unconditionally transforming mempcpy to memcpy I would prefer to > see libc implementations of memccpy optimized. WG14 N2349 discusses a > rationale for this: > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2349.htm I don't think one precludes the other. memccpy is an odd interface, basically just a shorthand for memchr+memcpy. GLIBC already has an efficient generic implementation doing this - and like mempcpy it's unlikely many targets will add an optimized assembler version. So inlining in GCC would be best. Having so many non-standard string functions just means they are not optimized at all on most targets and never get significant use. I've spent a lot of time optimizing the generic string functions in GLIBC so they are at least not ridiculously slow (a loop doing 1 byte per iteration is unacceptable today).
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #6 from Wilco --- (In reply to Martin Liška from comment #5) > The discussion looks familiar to me. Isn't that PR70140, where I was > suggesting something like: > > https://marc.info/?l=gcc-patches=150166433909242=2 > > with a new target macro: TARGET_HAS_FAST_MEMPCPY_ROUTINE ? Yes something like that should be fine. Unfortunately that patch doesn't apply anymore, and when I got it to compile it didn't do anything.
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #4 from Wilco --- (In reply to Jakub Jelinek from comment #3) > Because then you penalize properly maintained targets which do have > efficient mempcpy. And even if some targets don't have efficient mempcpy > right now, that doesn't mean they can't have it in the future. On the > caller side, when not expanded inline, calling mempcpy instead of memcpy is > smaller, often requires fewer registers to be live across the call etc. Few targets have an optimized mempcpy, particularly when you look at other libraries besides GLIBC. We could easily have a target hook to emit mempcpy, but I believe the default should be to do what's fastest for most targets.
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #2 from Wilco --- (In reply to Jakub Jelinek from comment #1) > As stated several times in the past, I strongly disagree. Why? GCC already does this for bzero and bcopy.
[Bug middle-end/90263] New: Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 Bug ID: 90263 Summary: Calls to mempcpy should use memcpy Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: ---

While GCC now inlines fixed-size mempcpy like memcpy, GCC still emits calls to mempcpy rather than converting to memcpy. Since most libraries, including GLIBC, do not have optimized mempcpy for most targets, calling mempcpy is less efficient than calling memcpy and emitting an extra addition to compute the result.

int *mempcopy1 (int *p, int *q)
{
  return __builtin_mempcpy (p, q, 16);
}

int *mempcopy2 (int *p, int *q, long n)
{
  return __builtin_mempcpy (p, q, n);
}

mempcopy1:
	ldp	x2, x3, [x1]
	stp	x2, x3, [x0], 16
	ret
mempcopy2:
	b	mempcpy
[Bug middle-end/90262] New: Inline small constant memmoves
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90262 Bug ID: 90262 Summary: Inline small constant memmoves Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: ---

GCC does not inline fixed-size memmoves. However memmove can be as easily inlined as memcpy. The existing memcpy infrastructure could be reused/expanded for this - all loads would be emitted first, followed by the stores. Large memmoves could be inlined on targets with vector registers. Targets without vector registers could emit an overlap check and use inlined memcpy for the no overlap case.

void copy(int *p, int *q) { __builtin_memcpy(p, q, 24); }
void move(int *p, int *q) { __builtin_memmove(p, q, 24); }

copy:
	ldp	x2, x3, [x1]
	stp	x2, x3, [x0]
	ldr	x1, [x1, 16]
	str	x1, [x0, 16]
	ret
move:
	mov	x2, 24
	b	memmove

The memmove could be expanded using the same number of registers:

	ldp	x2, x3, [x1]
	ldr	x1, [x1, 16]
	stp	x2, x3, [x0]
	str	x1, [x0, 16]
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 --- Comment #52 from Wilco --- (In reply to Segher Boessenkool from comment #48) > With just Peter's and Jakub's patch, it *improves* code size by 0.090%. > That does not fix this PR though :-/ But it does fix most of the codesize regression. The shrinkwrapping testcase seems a preexisting problem that was exposed by the combine changes, so it doesn't need to hold up the release. The regalloc change might fix addr-modes-float.c too.
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 --- Comment #47 from Wilco --- (In reply to Segher Boessenkool from comment #46) > With all three patches together (Peter's, mine, Jakub's), I get a code size > increase of only 0.047%, much more acceptable. Now looking what that diff > really *is* :-) I think with Jakub's change you don't need to disable the movsi_compare0 pattern in combine. If regalloc works as expected, it will get split into a compare so shrinkwrap can handle it.
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 --- Comment #38 from Wilco --- (In reply to Segher Boessenkool from comment #37) > Yes, it is a balancing act. Which option works better? Well the question really is what is bad about movsi_compare0 that could be easily fixed? The move is for free so there is no need for the "r,0" variant in principle, so if that helps reducing constraints on register allocation then we could remove or reorder that alternative.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #54 from Wilco --- (In reply to Jeffrey A. Law from comment #53) > Realistically the register allocation issues are not going to get addressed > this cycle nor are improvements to the overall handling of RMW insns in > combine. So we're going to be stuck with bandaids. > > I've got an updated backend pattern that should address the remainder of the > insv_1 and insv_2 regressions and Steve has a backend pattern to address the > other regression in this BZ. I'd prefer not to add quick hacks that aren't beneficial in the long term. Adding a very general pattern to handle any bitfield insert of any constant would be much more useful. There is no issue with xfailing these tests - neither insv pattern was used frequently so the regression for these is not significant compared to the register allocation and move issues.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #52 from Wilco --- (In reply to Jeffrey A. Law from comment #49) > I think the insv_1 (and it's closely related insv_2) regressions can be > fixed by a single ior/and pattern in the backend or by hacking up combine a > bit. I'm still playing with the latter, but may have to put it on the back > burner because of the pain of note management :( Hoping to draw a > conclusion on that by the end of this week. If I can't get a clean combine > solution, then my recommendation would be to build the suitable backend > pattern. It just has to match stuff like > > (set (reg1) (ior (and (reg2) ...)) with a matching constraint on reg1 and > reg2 to ensure it's a RMW operand. I don't think the current insv patterns are very useful, a more general approach would be to support bitfield insert of any immediate which is not currently supported. This can then be expanded into a bic/orr, bic/add, mov/bfi or movk depending on the mask/immediate. Note the register allocation issue as discussed in PR87871 which causes the codesize regressions after combine inserts extra moves is still the worst part of this issue.
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 --- Comment #21 from Wilco --- (In reply to Vladimir Makarov from comment #20)
> (In reply to Wilco from comment #19)
> > (In reply to Peter Bergner from comment #18)
> > > (In reply to Segher Boessenkool from comment #15)
> > > > Popping a5(r116,l0) -- assign reg 3
> > > > Popping a3(r112,l0) -- assign reg 4
> > > > Popping a2(r114,l0) -- assign reg 3
> > > > Popping a0(r111,l0) -- assign reg 0
> > > > Popping a4(r117,l0) -- assign reg 0
> > > > Popping a1(r113,l0) -- assign reg 2
> > > > Assigning 4 to a5r116
> > > > Disposition:
> > > >     0:r111 l0 0    3:r112 l0 4    1:r113 l0 2    2:r114 l0 3
> > > >     5:r116 l0 4    4:r117 l0 0
> > > >
> > > > r116 does not conflict with *any* other pseudo. It is alive in the
> > > > first two insns of the function, which are
> > >
> > > So we initially assign r3 to r116 presumably because it has the same
> > > cost as the other gprs and it occurs first in REG_ALLOC_ORDER. Then
> > > improve_allocation() decides that r4 is a better hard reg and switches
> > > the assignment to that. I'm not sure why it wouldn't choose r0 there
> > > instead.
> >
> > I would expect that r116 has a strong preference for r0 given the
> > r116 = mov r0 and thus allocating r116 to r0 should have the lowest cost
> > by a large margin.
>
> p116 conflicts with hr0. Therefore it can not get hr0. p112 is connected
> with p116. p112 got hr4 and p116 got 3. Assigning 4 to 116 is profitable.
> Therefore assignment of p116 is changed to 4.
>
> The question is why p116 conflicts with hr0. Before RA we have

That's a bug since register copies should not create a conflict. It's one of the most basic optimizations of a register allocator. And there is also the question why we move r0 into a virtual register but do not assign the virtual register to an argument register.
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 --- Comment #19 from Wilco --- (In reply to Peter Bergner from comment #18)
> (In reply to Segher Boessenkool from comment #15)
> > Popping a5(r116,l0) -- assign reg 3
> > Popping a3(r112,l0) -- assign reg 4
> > Popping a2(r114,l0) -- assign reg 3
> > Popping a0(r111,l0) -- assign reg 0
> > Popping a4(r117,l0) -- assign reg 0
> > Popping a1(r113,l0) -- assign reg 2
> > Assigning 4 to a5r116
> > Disposition:
> >     0:r111 l0 0    3:r112 l0 4    1:r113 l0 2    2:r114 l0 3
> >     5:r116 l0 4    4:r117 l0 0
> >
> > r116 does not conflict with *any* other pseudo. It is alive in the first
> > two insns of the function, which are
>
> So we initially assign r3 to r116 presumably because it has the same cost
> as the other gprs and it occurs first in REG_ALLOC_ORDER. Then
> improve_allocation() decides that r4 is a better hard reg and switches the
> assignment to that. I'm not sure why it wouldn't choose r0 there instead.

I would expect that r116 has a strong preference for r0 given the r116 = mov r0 and thus allocating r116 to r0 should have the lowest cost by a large margin.
[Bug target/81800] [8/9 regression] on aarch64 ilp32 lrint should not be inlined as two instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81800

--- Comment #16 from Wilco ---
(In reply to Jakub Jelinek from comment #15)
> (In reply to Wilco from comment #14)
> > (In reply to Jakub Jelinek from comment #13)
> > > Patches should be pinged after a week if they aren't reviewed,
> > > furthermore, it is better to CC explicitly relevant maintainers.
> >
> > I've got about 10 patches waiting, I'll ping after stage 1 opens.
>
> This PR is marked as 8/9 Regression though, therefore it should be resolved
> if possible already for GCC 9 and maybe even for GCC 8.4 too.

It's a minor issue. I can't see how it could be more important than serious
code generation bugs (e.g. setjmp/longjmp) that affect many GCC versions and
targets.
[Bug target/81800] [8/9 regression] on aarch64 ilp32 lrint should not be inlined as two instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81800 --- Comment #14 from Wilco --- (In reply to Jakub Jelinek from comment #13) > Patches should be pinged after a week if they aren't reviewed, furthermore, > it is better to CC explicitly relevant maintainers. I've got about 10 patches waiting, I'll ping after stage 1 opens.
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #16 from Wilco ---
(In reply to kugan from comment #15)
> (In reply to Wilco from comment #11)
> > There is also something odd with the way the loop iterates, this doesn't
> > look right:
> >
> > whilelo p0.s, x3, x4
> > incw    x3
> > ptest   p1, p0.b
> > bne     .L3
>
> I am not sure I understand this. I tried with qemu using an execution
> testcase and it seems to work.
>
> whilelo p0.s, x4, x5
> incw    x4
> ptest   p1, p0.b
> bne     .L3
>
> In my case I have the above (register allocation difference only). incw is
> correct considering two vector word registers? Am I missing something here?

I'm talking about the completely redundant ptest, where does that come from?
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #11 from Wilco --- There is also something odd with the way the loop iterates, this doesn't look right:

whilelo p0.s, x3, x4
incw    x3
ptest   p1, p0.b
bne     .L3
[Bug target/89607] Missing optimization for store of multiple registers on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607 Wilco changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |FIXED Target Milestone|--- |9.0 --- Comment #9 from Wilco --- Fixed in GCC9 already, so closing.
[Bug ada/89493] [9 Regression] Stack smashing on armv7hl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89493 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #2 from Wilco --- So is this an Ada unwind bug? What kind of unwinder does Ada use?
[Bug target/89752] [8/9 Regression] ICE in emit_move_insn, at expr.c:3723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89752 --- Comment #10 from Wilco --- It seems that rewriting "+rm" into "=rm" and "0" is not equivalent. Eg. __asm__ ("" : [a0] "=m" (A0) : "0" (A0)); gives a million warnings "matching constraint does not allow a register", so "0" appears to imply a register operand. Reload seems to deal with "=rm" and "rm" or "=rm" and "0m", but a single "0" prefers a register and that's why it tries to insert an incorrect BLK move.
[Bug target/89752] [8/9 Regression] ICE in emit_move_insn, at expr.c:3723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89752 --- Comment #4 from Wilco --- Small example which generates the same ICE on every GCC version:

typedef struct { int x, y, z; } X;

void f (void)
{
  X A0, A1;
  __asm__ ("" : [a0] "+rm" (A0), [a1] "+rm" (A1));
}

So it's completely invalid inline asm...
[Bug target/89752] [8/9 Regression] ICE in emit_move_insn, at expr.c:3723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89752 --- Comment #3 from Wilco --- Full instruction: (insn 531 530 532 19 (parallel [ (set (mem/c:BLK (reg:DI 3842) [29 A0+0 S2 A64]) (asm_operands:BLK ("") ("=rm") 0 [ (mem/c:BLK (reg:DI 3846) [29 A0+0 S2 A64]) (mem/c:BLK (reg:DI 3848) [29 A1+0 S2 A128]) ] [ (asm_input:BLK ("0") external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418) (asm_input:BLK ("1") external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418) ] [] external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418)) (set (mem/c:BLK (reg:DI 3844) [29 A1+0 S2 A128]) (asm_operands:BLK ("") ("=rm") 1 [ (mem/c:BLK (reg:DI 3846) [29 A0+0 S2 A64]) (mem/c:BLK (reg:DI 3848) [29 A1+0 S2 A128]) ] [ (asm_input:BLK ("0") external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418) (asm_input:BLK ("1") external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418) ] [] external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418)) ]) "external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h":1418:511 -1 (nil)) I guess this comes from the source like this with some of the arguments being immediates due to template substitution: __asm__ ("" : [a0] "+rm" (A0),[a1] "+rm" (A1)); It's not obvious what this is supposed to achieve, let alone whether it is valid...
[Bug target/89752] [8/9 Regression] ICE in emit_move_insn, at expr.c:3723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89752 Wilco changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-03-18 CC||wilco at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #2 from Wilco --- Confirmed. It ICEs in Eigen::internal::gebp_kernel, 2, 4, false, false>::operator() It seems to choke on this asm during reload: 531: {[r3842:DI]=asm_operands;[r3844:DI]=asm_operands;} and somehow emit a move between these operands: (reg:BLK 3849) (mem/c:BLK (reg:DI 3846) [29 A0+0 S2 A64])
[Bug target/89222] [7/8 regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222 Wilco changed: What|Removed |Added Target Milestone|--- |8.5 Summary|[7/8/9 regression] ARM |[7/8 regression] ARM |thumb-2 misoptimisation of |thumb-2 misoptimisation of |func ptr call with -O2 or |func ptr call with -O2 or |-Os |-Os
[Bug target/89222] [7/8/9 regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222

--- Comment #9 from Wilco ---
Author: wilco
Date: Tue Mar  5 15:04:01 2019
New Revision: 269390

URL: https://gcc.gnu.org/viewcvs?rev=269390&root=gcc&view=rev
Log:
[ARM] Fix PR89222

The GCC optimizer can generate symbols with non-zero offset from simple
if-statements. Bit zero is used for the Arm/Thumb state bit, so relocations
with offsets fail if it changes bit zero and the relocation forces bit zero
to true. The fix is to disable offsets on function pointer symbols.

    gcc/
        PR target/89222
        * config/arm/arm.md (movsi): Use targetm.cannot_force_const_mem
        to decide when to split off a non-zero offset from a symbol.
        * config/arm/arm.c (arm_cannot_force_const_mem): Disallow offsets
        in function symbols.

    testsuite/
        PR target/89222
        * gcc.target/arm/pr89222.c: Add new test.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr89222.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
    trunk/gcc/config/arm/arm.md
    trunk/gcc/testsuite/ChangeLog
[Bug tree-optimization/89437] [9 regression] incorrect result for sinl (atanl (x))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89437 Wilco changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #2 from Wilco --- Fixed
[Bug tree-optimization/89437] [9 regression] incorrect result for sinl (atanl (x))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89437

--- Comment #1 from Wilco ---
Author: wilco
Date: Mon Mar  4 12:36:04 2019
New Revision: 269364

URL: https://gcc.gnu.org/viewcvs?rev=269364&root=gcc&view=rev
Log:
Fix PR89437

Fix PR89437. Fix the sinatan-1.c testcase to not run without a C99 target
system. Use nextafterl for long double initialization. Fix an issue with
sinl (atanl (sqrtl (LDBL_MAX))) returning 0.0 instead of 1.0 by using
x < sqrtl (LDBL_MAX) in match.pd.

    gcc/
        PR tree-optimization/89437
        * match.pd: Use lt in sin(atan(x)) and cos(atan(x)) simplifications.

    testsuite/
        PR tree-optimization/89437
        * gcc.dg/sinatan-1.c: Fix testcase.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/match.pd
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/sinatan-1.c
[Bug tree-optimization/86829] Missing sin(atan(x)) and cos(atan(x)) optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86829 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #8 from Wilco --- This has caused a regression, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89437 and https://gcc.gnu.org/ml/gcc-patches/2019-02/msg01739.html
[Bug tree-optimization/89437] New: [9 regression] incorrect result for sinl (atanl (x))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89437

            Bug ID: 89437
           Summary: [9 regression] incorrect result for sinl (atanl (x))
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

A recently added optimization uses an inline expansion for sinl (atanl (x)).
As it involves computing sqrtl (x * x + 1) which can overflow for large x,
there is a check to ensure x is within the right range. The check emitted is
x <= sqrtl (LDBL_MAX). It's not clear whether that's a bug in mpfr or
build_sinatan_real; however it is safe to change the test to
x < sqrtl (LDBL_MAX). In principle the comparison should use a much smaller
constant given that x * x + 1 equals x * x for x > 2^113 for any long double
format, so we know sinl (atanl (x)) must be 1.0 for all larger values.

Testcase is gcc.dg/sinatan-1.c after using nextafterl instead of incorrect
nextafter for the long double. I'll post a patch to fix this.
[Bug libfortran/78314] [aarch64] ieee_support_halting does not report unsupported fpu traps correctly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78314 --- Comment #25 from Wilco --- (In reply to Steve Ellcey from comment #24) > See email strings at: > > https://gcc.gnu.org/ml/fortran/2019-01/msg00276.html > https://gcc.gnu.org/ml/fortran/2019-02/msg00057.html > > For more discussion. Sure, it looks like we simply need to reinstate Szabolc's patch with a #ifdef arm/aarch64 around it. I need to find out whether a feclearexcept (FE_ALL_EXCEPT) is necessary on Arm targets which can trap (very few exist IIRC).
[Bug libfortran/78314] [aarch64] ieee_support_halting does not report unsupported fpu traps correctly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78314 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #23 from Wilco --- (In reply to Steve Ellcey from comment #22) > It looks like the recent checkins are causing gfortran.dg/ieee/ieee_6.f90 > to fail on aarch64. Reading through the comments it looks like this isn't a > new problem but the failure showing up in the GCC testsuite run is new. Well it's a regression since it used to work in GCC7/8/9. Note it's incorrect to allow folding of these calls, for example support_fpu_flag can return true or false depending on the flag (FE_DENORMAL if not supported on most targets), and yet the constant folding always returns true for all flags.
[Bug middle-end/89037] checking ice emitting 128-bit bit-field initializer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89037 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org Version|9.0 |6.5.0 Target Milestone|--- |9.0 Known to fail||6.0, 7.0, 8.0 --- Comment #5 from Wilco --- Note this fails on GCC6, GCC7 and GCC8 too.
[Bug target/85711] [8 regression] ICE in aarch64_classify_address, at config/aarch64/aarch64.c:5678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85711 Wilco changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-02-15 CC||wilco at gcc dot gnu.org Summary|ICE in |[8 regression] ICE in |aarch64_classify_address, |aarch64_classify_address, |at |at |config/aarch64/aarch64.c:56 |config/aarch64/aarch64.c:56 |78 |78 Ever confirmed|0 |1 Known to fail||8.0 --- Comment #4 from Wilco --- Fixed in GCC9, still failing in GCC8.
[Bug target/89190] [8/9 regression][ARM] armv8-m.base invalid ldm ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89190

--- Comment #2 from Wilco ---
Author: wilco
Date: Wed Feb 13 16:22:25 2019
New Revision: 268848

URL: https://gcc.gnu.org/viewcvs?rev=268848&root=gcc&view=rev
Log:
[ARM] Fix Thumb-1 ldm (PR89190)

This patch fixes an ICE in the Thumb-1 LDM peepholer. Thumb-1 LDMs always
update the base register except if the base is loaded. The current
implementation rejects LDMs where the base is not dead, however this doesn't
exclude the case where the base is loaded as well as dead. Fix this by
explicitly checking whether the base is loaded. Also enable LDMs which load
the first register.

    gcc/
        PR target/89190
        * config/arm/arm.c (ldm_stm_operation_p): Set addr_reg_in_reglist
        correctly for first register.
        (load_multiple_sequence): Remove dead base check.
        (gen_ldm_seq): Correctly set write_back for Thumb-1.

    testsuite/
        PR target/89190
        * gcc.target/arm/pr89190.c: New test.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
    trunk/gcc/testsuite/ChangeLog
[Bug target/89222] [7/8/9 regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222 --- Comment #8 from Wilco --- (In reply to Wilco from comment #7) > Patch: https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00780.html Updated patch: https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00947.html
[Bug tree-optimization/86637] [9 Regression] ICE: tree check: expected block, have in inlining_chain_to_json, at optinfo-emit-json.cc:293
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86637

--- Comment #14 from Wilco ---
Author: wilco
Date: Mon Feb 11 18:14:37 2019
New Revision: 268777

URL: https://gcc.gnu.org/viewcvs?rev=268777&root=gcc&view=rev
Log:
[COMMITTED] Fix pthread errors in pr86637-2.c

Fix test errors on targets which do not support pthreads. Committed as
obvious.

    testsuite/
        PR tree-optimization/86637
        * gcc.c-torture/compile/pr86637-2.c: Test pthread and graphite target.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.c-torture/compile/pr86637-2.c
[Bug target/89222] [7/8/9 regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222 --- Comment #7 from Wilco --- Patch: https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00780.html
[Bug target/89222] [7.x regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222 Wilco changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2019-02-08 Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org Ever confirmed|0 |1
[Bug target/89222] [7.x regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #5 from Wilco --- (In reply to Andrew Pinski from comment #4) > I think the bug is in the assembler or the linker: > .L22: > .word myhandler2-1 > > > > Basically what is happening is: > (__handler != ((__sighandler_t) 2)) && (__handler != ((__sighandler_t) > SIG_DFL)) > > is converted to: > > (((size_t)__handler)-1) <= 1 > > And then GCC emits myhandler2-1 in the constant pool which is correct but > the assembler/linker decides to put 0x23a7 in that location (See the > .L22 above) and then GCC adds +1 to it to try to make it myhandler2 (again). > > This is why using SIG_DFL of 5 works, it is just by accident because GCC > decides not to do the transformation or put myhandler2-1 in the constant > pool. > > Again I think this is an assembler/linker issue of putting the wrong value > for > > .L22: > .word myhandler2-1 The +1 is added by the assembler since it is a Thumb function. Then the compiler adds another +1, making it +2. Basically one shouldn't do arithmetic with function symbols since bit 0 encodes the Arm/Thumb state.
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #8 from Wilco ---
(In reply to Segher Boessenkool from comment #5)
> The first one just needs an xfail. I don't know if it should be *-*-* there
> or only arm*-*-* should be added.
>
> The other two need some debugging by someone who knows the target and/or
> these tests.

The previous code for Arm was:

        cbz     r0, .L5
        push    {r4, lr}
        mov     r4, r0
        bl      foo
        movw    r2, #:lower16:.LANCHOR0
        movt    r2, #:upper16:.LANCHOR0
        add     r4, r4, r0
        str     r4, [r2]
        pop     {r4, pc}
.L5:
        movs    r0, #1
        bx      lr

Now it fails to shrinkwrap:

        push    {r4, lr}
        mov     r4, r0
        cmp     r4, #0
        moveq   r0, #1
        beq     .L3
        bl      foo
        ldr     r2, .L7
        add     r3, r4, r0
        str     r3, [r2]
.L3:
        pop     {r4, lr}
        bx      lr

It seems shrinkwrapping is more random: sometimes it's done as expected,
sometimes it is not. It was more consistent on older GCCs.
[Bug rtl-optimization/89195] [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195 --- Comment #11 from Wilco --- (In reply to Segher Boessenkool from comment #9) > That patch is pre-approved if it regchecks fine (on more than just x86). > Thanks! check-gcc is clean on aarch64_be-none-elf
[Bug rtl-optimization/89195] [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195

--- Comment #10 from Wilco ---
(In reply to Jakub Jelinek from comment #8)
> Created attachment 45606 [details]
> gcc9-pr89195.patch
>
> Now in patch form (untested so far).

That works fine indeed. It avoids accessing the object out of bounds but
still allows the optimization for e.g. i:16 using a 16-bit access within the
struct.
[Bug rtl-optimization/89195] [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195 --- Comment #4 from Wilco --- (In reply to Segher Boessenkool from comment #3) > (In reply to Wilco from comment #1) > > len is unsigned HOST_WIDE_INT, so bits_to_bytes_round_down does an unsigned > > division... > > That shouldn't make a difference though, both dividend and divisor should be > non-negative. Are they? Well I guess not... So pos points outside of the > register here?! pos is a frame offset so always negative. And yes this code is then creating an access outside the original object (starting at -1 or +1 depending on endianness). > Was it correct before that? At least it was symmetric so it *seemed* > correct.. It was broken in the same way. If I cast just len to HOST_WIDE_INT it works fine.
[Bug rtl-optimization/89195] [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195 --- Comment #1 from Wilco --- make_extraction does:

  if (MEM_P (inner))
    {
      poly_int64 offset;

      /* POS counts from lsb, but make OFFSET count in memory order.  */
      if (BYTES_BIG_ENDIAN)
        offset = bits_to_bytes_round_down (GET_MODE_PRECISION (is_mode)
                                           - len - pos);
      else
        offset = pos / BITS_PER_UNIT;

      new_rtx = adjust_address_nv (inner, tmode, offset);

len is unsigned HOST_WIDE_INT, so bits_to_bytes_round_down does an unsigned division...
[Bug rtl-optimization/89195] [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195 Wilco changed: What|Removed |Added Target||aarch64 Target Milestone|--- |9.0 Known to fail||7.0, 8.0, 9.0
[Bug rtl-optimization/89195] New: [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195

            Bug ID: 89195
           Summary: [7/8/9 regression] Corrupted stack offset after combine
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

The following testcase generates incorrect stack offsets on AArch64 since
GCC7 when compiled with -O1 -mbig-endian:

struct S
{
  unsigned i : 24;
};

volatile unsigned char x;

int f (struct S d)
{
  return d.i & x;
}

This produces:

        sub     sp, sp, #16
        str     x0, [sp, 8]
        adrp    x0, x
        ldrb    w0, [x0, #:lo12:x]
        and     w0, w0, 255
        mov     x1, 2305843009213693952
        add     x1, sp, x1
        ldr     w1, [x1, 7]
        and     w0, w1, w0
        add     sp, sp, 16
        ret

The combine output is as follows:

Trying 10, 11 -> 12:
   10: r100:SI=[sfp:DI-0x8]
   11: r101:SI#0=zero_extract(r100:SI#0,0x18,0x8)
      REG_DEAD r100:SI
   12: r98:SI=r101:SI&r92:SI
      REG_DEAD r101:SI
      REG_DEAD r92:SI
Failed to match this instruction:
(set (reg:SI 98)
    (and:SI (mem/c:SI (plus:DI (reg/f:DI 64 sfp)
                (const_int 2305843009213693943 [0x1ffffffffffffff7])) [1
dD.3407+2305843009213693951 S4 A8])
        (reg:SI 92 [ x.0_3+-3 ])))
Successfully matched this instruction:
(set (reg:SI 101)
    (mem/c:SI (plus:DI (reg/f:DI 64 sfp)
            (const_int 2305843009213693943 [0x1ffffffffffffff7])) [1
dD.3407+2305843009213693951 S4 A8]))
Successfully matched this instruction:
(set (reg:SI 98)
    (and:SI (reg:SI 101)
        (reg:SI 92 [ x.0_3+-3 ])))
allowing combination of insns 10, 11 and 12
original costs 16 + 4 + 4 = 24
replacement costs 16 + 4 = 20
deferring deletion of insn with uid = 10.
modifying insn i2    11: r101:SI=[sfp:DI+0x1ffffffffffffff7]
deferring rescan insn with uid = 11.
modifying insn i3    12: r98:SI=r101:SI&r92:SI
      REG_DEAD r92:SI
      REG_DEAD r101:SI
deferring rescan insn with uid = 12.

So it appears to want to change the offset -8 to -7 to optimize the
zero-extend away (this is an out-of-bound access, but maybe OK for locals?).
It must be converting a bit offset to a byte offset using an unsigned shift, losing the top 3 bits, which results in a wildly out of range offset...
[Bug target/89190] [8/9 regression][ARM] armv8-m.base invalid ldm ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89190 Wilco changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2019-02-04 Ever confirmed|0 |1 --- Comment #1 from Wilco --- Fix: https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00201.html
[Bug target/89190] New: [8/9 regression][ARM] armv8-m.base invalid ldm ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89190

            Bug ID: 89190
           Summary: [8/9 regression][ARM] armv8-m.base invalid ldm ICE
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

The following testcase ICEs with -march=armv8-m.base on arm-none-eabi:

long long a;
int b, c;

int d (int e, int f)
{
  return e << f;
}

void g (void)
{
  long long h;
  char i = d (b >= 7, 2);
  c = i == 0 ?: 1 / i;
  h = c && a ?: c + a;
  b = h;
}

error: unrecognizable insn:
   10 | }
      | ^
(insn 133 114 105 3 (parallel [
            (set (reg:SI 2 r2)
                (plus:SI (reg:SI 2 r2)
                    (const_int 8 [0x8])))
            (set (reg:SI 0 r0)
                (mem/c:SI (reg/f:SI 2 r2 [127]) [2 a+0 S4 A64]))
            (set (reg:SI 2 r2)
                (mem/c:SI (plus:SI (reg/f:SI 2 r2 [127])
                        (const_int 4 [0x4])) [2 a+4 S4 A32]))
        ]) -1
     (nil))

The reason is that the Thumb-1 LDM code sets writeback if the base register
is dead. However it does not ensure the base register is not also loaded. The
fix is to disallow writeback if the base is loaded.
[Bug ipa/89104] ICE: Segmentation fault (in tree_int_cst_elt_check)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89104 --- Comment #5 from Wilco --- (In reply to Jakub Jelinek from comment #4) > I really don't like these aarch64 warnings, declare simd is an optimization > (admittedly with ABI consequences) and warning about this by default is > weird, > + it is going to be a pain, any time any declare simd testcase is added > there is potential "regression" on aarch64. > Plus it really looks like a bug in this case, there is no mixed type at all, > the int * argument is uniform, so should be passed as any other pointer, and > all the others are int and so should use the same vector int type. I agree backend specific warnings are not ideal but it's unclear whether a better solution exists beyond just not emitting these warnings at all and letting the user figure it out. However the key question is why do testcases which do not specifically test for a specific warning fail if an extra warning is emitted? That's completely harmless in most cases.
[Bug ipa/89104] ICE: Segmentation fault (in tree_int_cst_elt_check)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89104 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #3 from Wilco --- (In reply to Jakub Jelinek from comment #2) gcc.dg/gomp/pr89104.c fails with a warning: gcc/gcc/testsuite/gcc.dg/gomp/pr89104.c:8:1: warning: GCC does not currently support mixed size types for 'simd' functions Should we just ignore warnings with -w?
[Bug target/89101] [Aarch64] vfmaq_laneq_f32 generates unnecessary dup instrcutions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101 Wilco changed: What|Removed |Added Status|WAITING |NEW Known to work||9.0 Version|unknown |8.2.0 Target Milestone|--- |9.0 Known to fail||8.2.0 --- Comment #3 from Wilco ---
(In reply to Gael Guennebaud from comment #2)
> Indeed, it fails to remove the dup only if the coefficient is used multiple
> times as in the following reduced exemple: (https://godbolt.org/z/hmSaE0)
>
> #include <arm_neon.h>
>
> void foo(const float* a, const float * b, float * c, int n) {
>   float32x4_t c0, c1, c2, c3;
>   c0 = vld1q_f32(c+0*4);
>   c1 = vld1q_f32(c+1*4);
>   for(int k=0; k<n; k++)
>   {
>     float32x4_t a0 = vld1q_f32(a+0*4+k*4);
>     float32x4_t b0 = vld1q_f32(b+k*4);
>     c0 = vfmaq_laneq_f32(c0, a0, b0, 0);
>     c1 = vfmaq_laneq_f32(c1, a0, b0, 0);
>   }
>   vst1q_f32(c+0*4, c0);
>   vst1q_f32(c+1*4, c1);
> }
>
> I tested with gcc 7 and 8.

Confirmed for GCC8, fixed on trunk. I tried the above example with up to 4
uses and it always generates the expected code on trunk. So this is fixed for
GCC9, however it seems unlikely the fix (multi-use support in Combine) could
be backported.
[Bug target/89101] [Aarch64] vfmaq_laneq_f32 generates unnecessary dup instrcutions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #1 from Wilco ---
(In reply to Gael Guennebaud from comment #0)
> vfmaq_laneq_f32 is currently implemented as:
>
> __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
> vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
>                  float32x4_t __c, const int __lane)
> {
>   return __builtin_aarch64_fmav4sf (__b,
>                                     __aarch64_vdupq_laneq_f32 (__c, __lane),
>                                     __a);
> }
>
> thus leading to unoptimized code as:
>
>     ldr  q1, [x2, 16]
>     dup  v28.4s, v1.s[0]
>     dup  v27.4s, v1.s[1]
>     dup  v26.4s, v1.s[2]
>     dup  v1.4s, v1.s[3]
>     fmla v22.4s, v25.4s, v28.4s
>     fmla v3.4s, v25.4s, v27.4s
>     fmla v6.4s, v25.4s, v26.4s
>     fmla v17.4s, v25.4s, v1.4s
>
> instead of:
>
>     ldr  q1, [x2, 16]
>     fmla v22.4s, v25.4s, v1.s[0]
>     fmla v3.4s, v25.4s, v1.s[1]
>     fmla v6.4s, v25.4s, v1.s[2]
>     fmla v17.4s, v25.4s, v1.s[3]
>
> I guess several other *lane* intrinsics exhibit the same shortcoming.

Which compiler version did you use? I tried this on GCC6, 7, 8, and 9 with -O2:

#include "arm_neon.h"

float32x4_t f(float32x4_t a, float32x4_t b, float32x4_t c)
{
  a = vfmaq_laneq_f32 (a, b, c, 0);
  a = vfmaq_laneq_f32 (a, b, c, 1);
  return a;
}

    fmla v0.4s, v1.4s, v2.4s[0]
    fmla v0.4s, v1.4s, v2.4s[1]
    ret

In all cases the optimizer is able to merge the dups as expected. If it still
fails for you, could you provide a compilable example like above that shows
the issue?

> For the record, I managed to partly workaround this issue by writing my own
> version as:
>
> if(LaneID==0)      asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==1) asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==2) asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==3) asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );
>
> but that's of course not ideal.
> This change yields a 32% speed up in Eigen's matrix product:
> http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633

I'd strongly advise against using inline assembler since most people make
mistakes writing it, and GCC cannot optimize code that uses it.
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #23 from Wilco ---
(In reply to ktkachov from comment #22)
> helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%)
> improvement on parest, and about 5.3% on a more aggressive CPU.
> I tried unrolling 8x in a similar manner and that was not faster than 4x on
> either target.

The 4x unrolled version has 19 instructions (and micro-ops) vs 7 x 4 = 28 for
the non-unrolled version, a significant reduction (without LDP it would be 21
vs 28). There is potential to use 2 more LDPs and use load+writeback, which
would make it 15 vs 28 instructions, close to a 2x reduction in instruction
count.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763

--- Comment #32 from Wilco ---
Author: wilco
Date: Fri Jan 25 13:29:06 2019
New Revision: 268265

URL: https://gcc.gnu.org/viewcvs?rev=268265&root=gcc&view=rev
Log:
[PATCH][AArch64] Fix generation of tst (PR87763)

The TST instruction no longer matches in all cases due to changes in Combine.
The fix is simple: we now need to allow a subreg as well when selecting the
cc_mode. This fixes the tst_5.c and tst_6.c failures.

AArch64 regress & bootstrap OK.

    PR rtl-optimization/87763
    * config/aarch64/aarch64.c (aarch64_select_cc_mode): Allow SUBREG when
    matching CC_NZmode compare.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.c
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #21 from Wilco --- (In reply to rguent...@suse.de from comment #20) > On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > > > --- Comment #19 from Wilco --- > > (In reply to rguent...@suse.de from comment #18) > > > > > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's > > > > suggestion) > > > > > > If that helps, sure (I'd have guessed uarchs are going to split > > > load-multiple into separate loads, but eventually it avoids > > > load-port contention?) > > > > Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a > > 128-bit LDP in a single cycle (see Optimization Guide). > > > > > > 2) Unrolling and breaking accumulator dependencies. > > > > > > IIRC RTL unrolling can do this (as side-effect, not as main > > > cost motivation) guarded with an extra switch. > > > > > > > I think more general unrolling and the peeling associated with it can be > > > > discussed independently of 1) and 2) once we collect more data on more > > > > microarchitectures. > > > > > > I think both of these can be "implemented" on the RTL unroller > > > side. > > > > You still need dependence analysis, alias info, ivopt to run again. The > > goal is > > to remove the increment of the index, use efficient addressing modes > > (base+imm) > > and allow scheduling to move instructions between iterations. I don't > > believe > > the RTL unroller supports any of this today. > > There's no way to encode load-multiple on GIMPLE that wouldn't be > awkward to other GIMPLE optimizers. I don't think anyone want LDP/STP directly in GIMPLE - that doesn't seem useful. We don't even form LDP until quite late in RTL. The key to forming LDP/STP is using base+imm addressing modes and having correct alias info (so loads/stores from different iterations can be interleaved and then combined into LDP/STP). 
The main thing a backend would need to do is tune address costs to take future LDP formation into account (and yes, the existing cost models need to be improved anyway). > Yes, the RTL unroller supports scheduling (sched runs after unrolling) > and scheduling can do dependence analysis. Yes, the RTL unroller > does _not_ do dependence analysis at the moment, so if you want to > know beforehand whether you can interleave iterations you have to > actually perform dependence analysis. Of course you can do that > on RTL. And of course you can do IVOPTs on RTL (yes, we don't do that > at the moment). Sure we *could* duplicate all high-level loop optimizations to work on RTL. However is that worth the effort given we have them already at tree level? > Note I'm not opposed to have IVOPTs on GIMPLE itself perform > unrolling (I know Bin was against this given IVOPTs is already > so comples) and a do accumulator breakup. But I don't see how > to do the load-multiple thing (yes, you could represent it > as a vector load plus N element extracts on GIMPLE and it > would be easy to teach SLP vectorization to perform this > transform on its own if it is really profitable - which I > doubt you can reasonably argue before RA, let alone on GIMPLE). Let's forget about load-multiple in GIMPLE. Kyrill's example shows that unrolling at the high level means the existing loop optimizations and analysis work as expected and we end up with good addressing modes, LDPs and interleaving of different iterations. With the existing RTL unroller this just isn't happening.
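To make the point concrete, here is a hypothetical C sketch (not from the bug report) of what unrolling with base+imm addressing looks like at the source level: every access is a constant offset from a single base, so adjacent same-base accesses such as s[i]/s[i+1] are exactly the pairs a late RTL pass can merge into LDP/STP.

```c
#include <assert.h>

/* Illustrative 2x-unrolled copy.  The function name and shape are made
   up for this sketch; the point is that s[i]/s[i+1] and d[i]/d[i+1]
   are base+imm addresses off one base register each, which is the form
   the AArch64 backend can later pair into a single LDP and STP.  */
void copy_pairs (long *restrict d, const long *restrict s, int n)
{
  int i;
  for (i = 0; i + 1 < n; i += 2)
    {
      d[i]     = s[i];       /* s[i], s[i+1]: candidate LDP */
      d[i + 1] = s[i + 1];   /* d[i], d[i+1]: candidate STP */
    }
  if (i < n)                 /* trailing iteration for odd n */
    d[i] = s[i];
}
```

Whether the pairing actually happens depends on the alias information proving the interleaved accesses independent, which is the point made above.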
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #19 from Wilco --- (In reply to rguent...@suse.de from comment #18) > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's > > suggestion) > > If that helps, sure (I'd have guessed uarchs are going to split > load-multiple into separate loads, but eventually it avoids > load-port contention?) Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a 128-bit LDP in a single cycle (see Optimization Guide). > > 2) Unrolling and breaking accumulator dependencies. > > IIRC RTL unrolling can do this (as side-effect, not as main > cost motivation) guarded with an extra switch. > > > I think more general unrolling and the peeling associated with it can be > > discussed independently of 1) and 2) once we collect more data on more > > microarchitectures. > > I think both of these can be "implemented" on the RTL unroller > side. You still need dependence analysis, alias info, ivopt to run again. The goal is to remove the increment of the index, use efficient addressing modes (base+imm) and allow scheduling to move instructions between iterations. I don't believe the RTL unroller supports any of this today.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #23 from Wilco --- Author: wilco Date: Tue Jan 22 17:49:46 2019 New Revision: 268159 URL: https://gcc.gnu.org/viewcvs?rev=268159&root=gcc&view=rev Log: Fix vect-nop-move.c test Fix a failing test - changes in Combine mean the test now fails even though the generated code is the same. Given there are several AArch64-specific tests for vec-select, remove the scanning of Combine output. Committed as trivial fix. testsuite/ PR rtl-optimization/87763 * gcc.dg/vect/vect-nop-move.c: Fix testcase on AArch64. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/vect/vect-nop-move.c
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #22 from Wilco --- (In reply to Steve Ellcey from comment #21) > If I look at this specific example: > > int f2 (int x, int y) > { > return (x & ~0x0ff000) | ((y & 0x0ff) << 12); > } > > Is this because of x0 (a hard register) at the destination in insn 15? Yes, the hard reg change affects the rtl shape in these tests so it fails to match in combine. I have a simple fix for the tst_5 and tst_6 failures. You can check this by ensuring there are no hard registers in the test: int f2a (int x, int y) { x++; y++; return (x & ~0x0ff000) | ((y & 0x0ff) << 12); } This also fails to match before r265398.
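For reference, the pattern the failing tests expect is a bitfield insert; the helper name bfi_12_8 below is invented for illustration and simply models the semantics of the AArch64 instruction "bfi w0, w1, 12, 8" that combine should form from f2.

```c
#include <assert.h>
#include <stdint.h>

/* The testcase from the comment: insert the low 8 bits of y into bits
   12-19 of x.  When combine matches the insert pattern this becomes a
   single bfi instruction on AArch64.  */
int f2 (int x, int y)
{
  return (x & ~0x0ff000) | ((y & 0x0ff) << 12);
}

/* Reference model of bfi #12, #8: copy an 8-bit field from the low
   bits of y into x at bit position 12, leaving other bits of x alone.
   Hypothetical helper, not part of the testsuite.  */
uint32_t bfi_12_8 (uint32_t x, uint32_t y)
{
  uint32_t mask = 0xffu << 12;
  return (x & ~mask) | ((y & 0xffu) << 12);
}
```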
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #19 from Wilco --- (In reply to Segher Boessenkool from comment #18) > https://gcc.gnu.org/ml/gcc/2019-01/msg00112.html Thanks, I hadn't noticed that yet... I need to look at it in more detail, but are you saying that combine no longer generates zero_extracts today and all patterns that rely on it must be changed to a different canonical form? I suspect the tst_5/6 cases are similar if the canonical form of the RTL has changed. Note that (x & 63) != 0 works just fine; it's only 255 and 65535 that appear to be treated differently...
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #17 from Wilco --- (In reply to Vladimir Makarov from comment #14) > I've checked cvtf_1.c generated code and I don't see additional fmov > anymore. I guess it was fixed by an ira-costs.c change (a special > consideration of moves containing hard regs). I think this PR is resolved > (although the change resulted in a new PR 88560 but it is a different story). Some failures disappeared, however various failures still exist. It appears the ones reported above are not all directly related to the move change, but another change around the same time. For example all of these are due to combine no longer creating bfi or tst instructions in trivial cases:

FAIL: gcc.target/aarch64/combine_bfi_1.c scan-assembler-times \\tbfi\\t 5
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 0, 8
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 16, 5
FAIL: gcc.target/aarch64/insv_1.c scan-assembler movk\tx[0-9]+, 0x1d6b, lsl 32
FAIL: gcc.target/aarch64/insv_2.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 43, 5
FAIL: gcc.target/aarch64/insv_2.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 56, 8
FAIL: gcc.target/aarch64/insv_2.c scan-assembler movk\tx[0-9]+, 0x1d6b, lsl 16
FAIL: gcc.target/aarch64/lsl_asr_sbfiz.c scan-assembler sbfiz\tw
FAIL: gcc.target/aarch64/tst_5.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*255
FAIL: gcc.target/aarch64/tst_5.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*65535
FAIL: gcc.target/aarch64/tst_6.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*65535

> If we want to improve the generated code size, it would be better to find > a small testcase from SPEC2006 showing what is wrong. I understand it is a > hard work but otherwise we could only speculate what is going wrongly. I > don't think that reverting the combiner change would be a solution. Yes it's hard to get good reproducible examples, especially from large functions. It's easier to first sort out the known test failures.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #13 from Wilco --- (In reply to Segher Boessenkool from comment #12) > Before the change combine forwarded all argument (etc.) hard registers > wherever > it could, doing part of RA's job (and doing a lousy job of it). If after the > change it no longer two ranges, than a) that is a good decision, or b) there > is > some bug in RA. I think it's important not to conflate avoiding increasing live ranges of hard registers and inserting redundant extra moves which cannot be optimized. The former unconditionally looks like a good idea, however I can't see any valid reasoning for the latter. Basically we now always get 2 moves for every physical register, and this seems to thwart register preferencing. > 0.05% code size is nothing, most other biggish changes have a much bigger > effect. Compiler optimization is all about making many small changes which add up to a large improvement. This is actually quite significant given practically all code is affected negatively.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #11 from Wilco --- A SPEC2006 run shows the codesize cost of make_more_copies is 0.05%. Practically all tests are worse, the largest increases are perlbench at 0.20%, gromacs 0.12%, calculix 0.12%, soplex 0.08%, xalancbmk 0.07%, wrf 0.06%. In total ~26KBytes of extra moves... Overall this doesn't look like a good change. So the question is, what is the purpose of adding extra live ranges when the register allocator doesn't appear to be able to merge them?
[Bug middle-end/88560] [9 Regression] armv8_2-fp16-move-1.c and related regressions after r266385
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88560 --- Comment #6 from Wilco --- (In reply to Vladimir Makarov from comment #5) > We have too many tests checking expected generated code. We should more > focus on overall effect of the change. SPEC would be a good criterium > although it is hard to check SPEC for each patch. > > I've checked the generated code of arm8_2-fp16-move-1.c and found that in > most cases GCC generates better code with the patch:
>
> @@ -80,7 +80,6 @@ test_load_store_1:
>  	@ frame_needed = 0, uses_anonymous_args = 0
>  	@ link register save eliminated.
>  	lsl	r1, r1, #1
> -	vmov.f16	s0, r3	@ __fp16
>  	ldrh	r3, [r2, r1]	@ __fp16
>  	strh	r3, [r0, r1]	@ __fp16
>  	bx	lr

When I tested it, this test added that vmov, not removed it - see comment #2.
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #16 from Wilco --- (In reply to rguent...@suse.de from comment #15) > which is what I refered to for branch prediction. Your & prompts me > to a way to do sth similar as duffs device, turning the loop into a nest.
>
> head:
>    if (n == 0) exit;
>    <1 iteration>
>    if (n & 1)
>      n -= 1, goto head;
>    <1 iteration>
>    if (n & 2)
>      n -= 2, goto head;
>    <2 iterations>
>    if (n & 4)
>      n -= 4, goto head;
>    <4 iterations>
>    n -= 8, goto head;
>
> the inner loop tests should become well-predicted quickly.
>
> But as always - more branches make it more likely that one is > mispredicted. For a single non-unrolled loop you usually only > get the actual exit mispredicted.

Yes, overlapping the branches of the tail loop and the main loop will result in more mispredictions. And there are still 4 branches for an 8x unrolled loop, blocking optimizations and scheduling. So Duff's device is always inefficient - the above loop is much faster like this:

   if (n & 1)
     do 1 iteration
   if (n & 2)
     do 2 iterations
   if (n >= 4)
     do 4 iterations
     while ((n -= 4) > 0)
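The restructured remainder handling sketched above can be written out as a runnable example; inc_all is a hypothetical stand-in body (add 1 to each element) chosen only to make the control flow concrete, not code from the bug report.

```c
#include <assert.h>

/* Handle the n&1 and n&2 remainders up front, then run the 4x-unrolled
   body with a single backward branch - instead of jumping into the
   middle of the unrolled body as Duff's device does.  */
void inc_all (int *a, int n)
{
  int i = 0;
  if (n & 1)                       /* 1 leftover iteration */
    a[i]++, i++;
  if (n & 2)                       /* 2 leftover iterations */
    {
      a[i]++; a[i + 1]++; i += 2;
    }
  n &= ~3;                         /* remainder handled above */
  if (n)
    do                             /* 4x-unrolled main loop */
      {
        a[i]++; a[i + 1]++; a[i + 2]++; a[i + 3]++;
        i += 4;
      }
    while ((n -= 4) > 0);
}
```

The three forward tests are taken at most once each, so the only branch executed repeatedly is the well-predicted loop-back branch.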
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #14 from Wilco --- (In reply to rguent...@suse.de from comment #13) > Usually the peeling is done to improve branch prediction on the > prologue/epilogue. Modern branch predictors do much better on a loop than with this kind of code:

	ands	x12, x11, 7
	beq	.L70
	cmp	x12, 1
	beq	.L55
	cmp	x12, 2
	beq	.L57
	cmp	x12, 3
	beq	.L59
	cmp	x12, 4
	beq	.L61
	cmp	x12, 5
	beq	.L63
	cmp	x12, 6
	bne	.L72

That's way too many branches close together so most predictors will hit the maximum branches per fetch block limit and not predict them. If you wanted to peel it would have to be like:

	if (n & 4) do 4 iterations
	if (n & 2) do 2 iterations
	if (n & 1) do 1 iteration

However that's still way too much code explosion for little or no gain. The only case where this makes sense is in a handwritten memcpy.
[Bug tree-optimization/88739] [7/8/9 Regression] Big-endian union bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739 --- Comment #37 from Wilco --- (In reply to rsand...@gcc.gnu.org from comment #35) > Yeah, the expr.c patch makes the original testcase work, but we still fail > for: That's the folding in ccp1 after inlining, which will require a similar fix. There are likely more places that need to be fixed to handle the 'short' bit types.
[Bug tree-optimization/88739] [7/8/9 Regression] Big-endian union bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739 --- Comment #34 from Wilco --- With just the expr.c patch the gcc regression tests all pass on big-endian AArch64. Interestingly, this includes the new torture test, i.e. it does not trigger the union bug.
[Bug tree-optimization/88739] [7/8/9 Regression] Big-endian union bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739 --- Comment #33 from Wilco --- (In reply to Richard Biener from comment #32)
>
> > Index: gcc/expr.c
> > ===
> > --- gcc/expr.c	(revision 267553)
> > +++ gcc/expr.c	(working copy)
> > @@ -10562,6 +10562,15 @@ expand_expr_real_1 (tree exp, rtx target
> >	     infinitely recurse.  */
> >	  gcc_assert (tem != exp);
> >
> > +	  /* When extracting from non-mode bitsize entities adjust the
> > +	     bit position for BYTES_BIG_ENDIAN.  */
> > +	  if (INTEGRAL_TYPE_P (TREE_TYPE (tem))
> > +	      && (TYPE_PRECISION (TREE_TYPE (tem))
> > +		  < GET_MODE_BITSIZE (as_a <scalar_int_mode> (TYPE_MODE (TREE_TYPE (tem)))))
> > +	      && BYTES_BIG_ENDIAN)
> > +	    bitpos += (GET_MODE_BITSIZE (as_a <scalar_int_mode> (TYPE_MODE (TREE_TYPE (tem))))
> > +		       - TYPE_PRECISION (TREE_TYPE (tem)));
> > +
> >	  /* If TEM's type is a union of variable size, pass TARGET to the inner
> >	     computation, since it will need a temporary and TARGET is known
> >	     to have to do.  This occurs in unchecked conversion in Ada.  */
>
> Btw, this needs to be amended for WORDS_BIG_ENDIAN of course. I guess > we might even run into the case that such BIT_FIELD_REF references > a non-contiguous set of bits... (that's also true for BITS_BIG_ENDIAN != > BYTES_BIG_ENDIAN I guess).

Was that meant to be instead of, or in addition to, the tree-ssa-sccvn.c patch? With both I get:

	lsr	w20, w1, 2
	...
	and	w1, w20, 65535

With only the expr.c patch it starts to look as expected:

	lsr	w20, w1, 2
	...
	lsr	w1, w20, 14

And with the latter case the new torture test now passes on big-endian!
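The adjustment the patch makes can be modelled in isolation. adjust_bitpos below is an illustrative helper (not GCC code) implementing the arithmetic bitpos += container - precision for the big-endian case, applied to the 30-bit-value-in-32-bit-mode situation from this bug.

```c
#include <assert.h>

/* Model of the expr.c adjustment: a value with PRECISION significant
   bits lives in a CONTAINER-bit mode.  On little-endian, a field at
   bit position POS of the value is at container bit POS; on
   byte-big-endian targets the value sits at the top of the container,
   so the position must be shifted past the unused padding bits.  */
static unsigned adjust_bitpos (unsigned pos, unsigned precision,
                               unsigned container, int bytes_big_endian)
{
  if (bytes_big_endian && precision < container)
    pos += container - precision;   /* bitpos += padding bits */
  return pos;
}
```

For the testcase here (extract 16 bits at position 14 of a 30-bit value in a 32-bit mode), big-endian extraction must start at bit 16, matching the lsr #14 of the 30-bit value (14 + 2 padding bits) that the fixed compiler emits.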
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #7 from Wilco --- (In reply to rguent...@suse.de from comment #6) > On Wed, 9 Jan 2019, wilco at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > > > --- Comment #5 from Wilco --- > > (In reply to Wilco from comment #4) > > > (In reply to ktkachov from comment #2) > > > > Created attachment 45386 [details] > > > > aarch64-llvm output with -Ofast -mcpu=cortex-a57 > > > > > > > > I'm attaching the full LLVM aarch64 output. > > > > > > > > The output you quoted is with -funroll-loops. If that's not given, GCC > > > > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my > > > > testing). > > > > > > > > Is there anything we can do to make the default unrolling a bit more > > > > aggressive? > > > > > > I don't think the RTL unroller works at all. It doesn't have the right > > > settings, and doesn't understand how to unroll, so we always get > > > inefficient > > > and bloated code. > > > > > > To do unrolling correctly it has to be integrated at tree level - for > > > example when vectorization isn't possible/beneficial, unrolling might > > > still > > > be a good idea. > > > > To add some numbers to the conversation, the gain LLVM gets from default > > unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017. > > > > This clearly shows there is huge potential from unrolling, *if* we can teach > > GCC to unroll properly like LLVM. That means early unrolling, using good > > default settings and using a trailing loop rather than inefficient peeling. > > I don't see why this cannot be done on RTL where we have vastly more > information of whether there are execution resources that can be > used by unrolling. 
> Note we also want unrolling to interleave instructions to not rely on > pre-reload scheduling which in turn means having a good eye on register > pressure (again sth not very well handled on GIMPLE)

The main issue is that other loop optimizations are done at the tree level, so things like addressing modes, loop invariants and CSEs are decided on the non-unrolled version. Then when we unroll in RTL we end up with very non-optimal code. A typical unrolled loop starts like this:

	add	x13, x2, 1
	add	x14, x2, 2
	add	x11, x2, 3
	add	x10, x2, 4
	ldr	w30, [x4, x13, lsl 2]
	add	x9, x2, 5
	add	x5, x2, 6
	add	x12, x2, 7
	ldr	d23, [x3, x2, lsl 3]
	... rest of unrolled loop

So basically it decides to create a new induction variable for every unrolled copy in the loop. This often leads to spills simply because it creates far too many redundant addressing instructions. It also blocks scheduling between iterations since the alias optimization doesn't appear to understand simple constant differences between indices. So unrolling should definitely be done at a high level, just like vectorization.
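The two unrolling styles can be contrasted as C functions that compute the same result; both function names are made up for illustration. The first mirrors what the RTL unroller effectively produces (a fresh induction variable per unrolled copy), the second what tree-level unrolling gives (one index, constant offsets that map directly onto base+imm addressing modes).

```c
#include <assert.h>

/* Style 1: one induction variable per unrolled copy, each used once -
   extra adds and extra register pressure, as in the asm above.  */
long sum_per_copy_ivs (const long *a, long n)   /* n a multiple of 4 */
{
  long s = 0;
  for (long i = 0; i < n; i += 4)
    {
      long i1 = i + 1, i2 = i + 2, i3 = i + 3;  /* redundant IVs */
      s += a[i]; s += a[i1]; s += a[i2]; s += a[i3];
    }
  return s;
}

/* Style 2: a single IV with constant offsets; each access is
   base + small immediate, so no per-copy address arithmetic.  */
long sum_base_imm (const long *a, long n)       /* n a multiple of 4 */
{
  long s = 0;
  for (long i = 0; i < n; i += 4)
    s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
  return s;
}
```

At the C level the two are equivalent; the difference only matters for the RTL the compiler has to clean up afterwards.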
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #5 from Wilco --- (In reply to Wilco from comment #4) > (In reply to ktkachov from comment #2) > > Created attachment 45386 [details] > > aarch64-llvm output with -Ofast -mcpu=cortex-a57 > > > > I'm attaching the full LLVM aarch64 output. > > > > The output you quoted is with -funroll-loops. If that's not given, GCC > > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my > > testing). > > > > Is there anything we can do to make the default unrolling a bit more > > aggressive? > > I don't think the RTL unroller works at all. It doesn't have the right > settings, and doesn't understand how to unroll, so we always get inefficient > and bloated code. > > To do unrolling correctly it has to be integrated at tree level - for > example when vectorization isn't possible/beneficial, unrolling might still > be a good idea. To add some numbers to the conversation, the gain LLVM gets from default unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017. This clearly shows there is huge potential from unrolling, *if* we can teach GCC to unroll properly like LLVM. That means early unrolling, using good default settings and using a trailing loop rather than inefficient peeling.
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #4 from Wilco --- (In reply to ktkachov from comment #2) > Created attachment 45386 [details] > aarch64-llvm output with -Ofast -mcpu=cortex-a57 > > I'm attaching the full LLVM aarch64 output. > > The output you quoted is with -funroll-loops. If that's not given, GCC > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my > testing). > > Is there anything we can do to make the default unrolling a bit more > aggressive? I don't think the RTL unroller works at all. It doesn't have the right settings, and doesn't understand how to unroll, so we always get inefficient and bloated code. To do unrolling correctly it has to be integrated at tree level - for example when vectorization isn't possible/beneficial, unrolling might still be a good idea.
[Bug tree-optimization/88739] [7/8/9 Regression] Big-endian union bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739 --- Comment #29 from Wilco --- (In reply to Richard Biener from comment #26) > Did anybody test the patch? Testing on x86_64 will be quite pointless... Well that generates _18 = BIT_FIELD_REF <_2, 16, 14>; and becomes: ubfx x1, x20, 2, 16 This extracts bits 2-17 of the 30-bit value instead of bits 14-29. The issue is that we're using a bitfield reference on a value that is claimed not to be a bitfield in comment 6. So I can't see how using BIT_FIELD_REF could ever work correctly.
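As a sanity check of the mismatch described, here is a small model of the ubfx semantics (extract width bits starting at bit lsb, zero-extended); the helper is illustrative, not compiler or architecture code.

```c
#include <assert.h>
#include <stdint.h>

/* Reference semantics of AArch64 "ubfx dst, src, #lsb, #width":
   extract WIDTH bits starting at bit LSB, zero-extended.
   Assumes width < 64 so the mask computation doesn't overflow.  */
static uint64_t ubfx (uint64_t src, unsigned lsb, unsigned width)
{
  return (src >> lsb) & ((1ull << width) - 1);
}
```

With a test value whose bits 14-29 are all set (0x3FFFC000), extracting at lsb 14 yields the full 0xFFFF field, while the wrongly-generated extraction at lsb 2 yields 0xF000 - a different value, demonstrating the bug.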