[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583 --- Comment #10 from Richard Sandiford --- I have a proof-of-concept hack (far from submission quality). But it looks like some cases will also require us to extend aarch64_evpc_reencode to handle SVE modes, which is also worthwhile for its own sake. Have a hack for that too.
[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #9 from Richard Sandiford --- Mine. Lets see if I can remember how this “vectoriser” thing works…
[Bug target/116629] [14/15 Regression] Building openvino with -flto ICEs in aarch64_sve::gimple_folder::redirect_pred_x
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116629 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #13 from Richard Sandiford --- Mine then.
[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583 --- Comment #7 from Richard Sandiford --- ...actually, they probably don't need to bijective. I suppose [0, 0] for two-lane SLP is handled too.
[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583 --- Comment #6 from Richard Sandiford --- Sorry for the slow response (here and in general). Been having to spend my time on other things recently :( I agree that this case is regular enough to handle for VLA, but it seems to me like a separate case from the one that the current "repeating" code handles. The current repeating code is really for bijective 1-vector-to-1-vector permutes, rather than for packing operations.
[Bug rtl-optimization/116326] [lra] internal compiler error: in get_reload_reg, at lra-constraints.cc:755
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116326 --- Comment #8 from Richard Sandiford --- (In reply to Georg-Johann Lay from comment #7) > What about the following line in reload1.h: > > // Used during roload -> LRA transition because ELIMINABLE_REGS may depend > // on command line options. Used in avr.h for example. > #define IN_RELOAD1_CC > > and the use #ifdef IN_RELOAD1_CC in avr.h? > > It's still a permanent change, but a no-op (except for avr.h) because the > macro is unused currently. Ah, yeah, and it keeps the change local to the code that is slated to disappear. That sounds good to me FWIW.
[Bug rtl-optimization/116326] [lra] internal compiler error: in get_reload_reg, at lra-constraints.cc:755
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116326 --- Comment #6 from Richard Sandiford --- (In reply to Georg-Johann Lay from comment #5) > > But either way, I think we should start with the assumption that the entry > > should be removed and make everything else work to that. Unfortunately that > > probably excludes reload though, if the original problem that motivated the > > entry was there. > > With the curret state of affairs, switching avr to LRA is not a good idea; > you can't even build neither libgcc nor libc. Sure, but I meant: remove it locally while testing/working on LRA support. > Couldn't the elimination code in IRA / LRA just skip the high part? Or let > reload handle multi-reg FPs correctly, even though reload will come to an > end? > > Or maybe make eliminable regs an array that can be accessed in the backend, > so that avr.cc could adjust the entries in adjust_reg_alloc_order or > init_expanders to something like: > > eliminables[3] = eliminables[2] > > having a duplicate entry shouldn't hurt? I don't think we should make any permanent changes to support this kind of manipulation, since it's only needed during the transition.
[Bug rtl-optimization/116326] [lra] internal compiler error: in get_reload_reg, at lra-constraints.cc:755
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116326 --- Comment #4 from Richard Sandiford --- (In reply to Georg-Johann Lay from comment #3) > It was due to problems with multi-reg frame-pointer. (AFAIR, using a > hard-frame-poiner besides frame-poiner didn't resolve the issues.) > > My problem now is that avr.h has: > > #define ELIMINABLE_REGS { \ > { ARG_POINTER_REGNUM, STACK_POINTER_REGNUM }, \ > { ARG_POINTER_REGNUM, FRAME_POINTER_REGNUM }, \ > { FRAME_POINTER_REGNUM, STACK_POINTER_REGNUM }, \ > { FRAME_POINTER_REGNUM + 1, STACK_POINTER_REGNUM + 1 } } > > but ira.cc uses that to initialize a static const array, hence there's no > way to depend that on options I guess? Yeah. And I don't think it's worth adding a way to make it depend on an option, since it will only be a temporary state until LRA is fully supported. Could you "just" remove the final entry locally while testing -mlra? > One solution would be to skip upper parts of multi-part pointers in > elimination. (And maybe other places: TARGET_HARD_REGNO_SCRATCH_OK and > HARD_REGNO_RENAME_OK assume special handling for the high part of multi-reg > FP, too). In backend code, or in middle-end code? But either way, I think we should start with the assumption that the entry should be removed and make everything else work to that. Unfortunately that probably excludes reload though, if the original problem that motivated the entry was there.
[Bug rtl-optimization/116326] [lra] internal compiler error: in get_reload_reg, at lra-constraints.cc:755
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116326 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #2 from Richard Sandiford --- This is caused by the final entry in ELIMINABLE_REGS: { FRAME_POINTER_REGNUM + 1, STACK_POINTER_REGNUM + 1 } I guess this was added to work around a reload bug involving multi-register pointers, but it shouldn't be necessary for LRA (and as this PR shows, it's actively harmful for LRA). LRA treats each entry in ELIMINABLE_REGS as a pointer of mode Pmode. This means that (as far as LRA is concerned) the final entry above occupies FRAME_POINTER_REGNUM + 1 and FRAME_POINTER_REGNUM + 2. But FRAME_POINTER_REGNUM + 2 is the Z register, so this takes the Z register out of action, leading to a reload loop. I'm sure there are lingering bugs around multi-register pointers :) But I think we should remove the entry above and try to fix the bugs at source.
[Bug rtl-optimization/116516] [15 Regression] [lra] ICE in decompose_normal_address, at rtlanal.cc:6712 by r15-3213-g708ee71808ea61
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116516 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #7 from Richard Sandiford --- Hopefully fixed. Thanks for the report.
[Bug tree-optimization/116481] [12/13/14/15 Regression] `arrays of functions are not meaningful` error message happens with -W -Wall -O2 even though there are no arrays of function types used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116481 --- Comment #10 from Richard Sandiford --- (In reply to Richard Sandiford from comment #9) > If the tag bit is dropped by going > out of bounds, that's a feature, not a bug, and would happen equally for > void*/char* arithmetic as for (u)intptr_t arithmetic. “out of bounds” here meaning “out of the representable bounds” of a capability (as distinct from “out of the object's bounds”, which is checked at dereference).
[Bug tree-optimization/116481] [12/13/14/15 Regression] `arrays of functions are not meaningful` error message happens with -W -Wall -O2 even though there are no arrays of function types used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116481 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #9 from Richard Sandiford --- (In reply to Bruno Haible from comment #7) > Thanks for the attempted advice. But, no, converting a pointer to uintptr_t > and later back to a pointer is bad for portability: It does not work on the > CHERI architecture and its implementation on the ARM Morello chip > https://developer.arm.com/Architectures/Morello, because it loses the tag > bits that are essential for accessing the pointer. See > https://www.gnu.org/software/gnulib/manual/html_node/C99-features-avoided. > html and https://lists.gnu.org/archive/html/bug-gnulib/2023-12/msg00021.html > . I think Richard's point still stands though. Conversion to and from (u)intptr_t does not in itself drop the tag bit. CHERI/Morello (u)intptr_t can hold exactly what a void* can hold. If the tag bit is dropped by going out of bounds, that's a feature, not a bug, and would happen equally for void*/char* arithmetic as for (u)intptr_t arithmetic.
[Bug rtl-optimization/116516] [15 Regression] [lra] ICE in decompose_normal_address, at rtlanal.cc:6712 by r15-3213-g708ee71808ea61
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116516 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #3 from Richard Sandiford --- Gah, mine then.
[Bug fortran/116254] new test case gfortran.dg/class_transformational_2.f90 from r15-2739-g4cb07a38233aad fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116254 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2024-08-21 --- Comment #10 from Richard Sandiford --- A bit more info: valgrind succeeds for -O0. But with optimisation enabled (-O is enough), it flags: ==12989== Conditional jump or move depends on uninitialised value(s) ==12989==at 0x49B47A0: _gfortran_spread (spread_generic.c:278) ==12989==by 0x401227: check_spread (class_transformational_2.f90:56) ==12989==by 0x401227: MAIN__ (class_transformational_2.f90:20) ==12989==by 0x4035C3: main (class_transformational_2.f90:24) It seems that the _data._desc field of the spread results are being copied from uninitialised memory. .gimple has: __attribute__((fn spec (". "))) void check_spread () { … { … struct array02_character(kind=1) atmp.98; struct __class_MAIN___T_2_0a ctmp.99; … try { … ctmp.99 = b; ctmp.99._data = atmp.98; ctmp.99._data.span = D.3687; ctmp.99._data.data = 0B; ctmp.99._data.offset = 0; _gfortran_spread (&ctmp.99._data, D.3682, D.3684, D.3686); where nothing has initialised atmp.98 before the copy.
[Bug fortran/116254] new test case gfortran.dg/class_transformational_2.f90 from r15-2739-g4cb07a38233aad fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116254 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #9 from Richard Sandiford --- FWIW, the failure seems intermittent for me. If I run the test individually multiple times for the same toolchain, it sometimes passes, and sometimes fails with different options. This is on aarch64 Ubuntu 22.04 (so the same as mentioned in #c8).
[Bug rtl-optimization/116238] [12/13/14 Regression] ICE building 526.blender_r on aarch64 SVE after r15-1619-g3b9b8d6cfdf593
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116238 Richard Sandiford changed: What|Removed |Added Known to fail|15.0|14.2.1 Summary|[12/13/14/15 Regression]|[12/13/14 Regression] ICE |ICE building 526.blender_r |building 526.blender_r on |on aarch64 SVE after|aarch64 SVE after |r15-1619-g3b9b8d6cfdf593|r15-1619-g3b9b8d6cfdf593 Known to work||15.0 --- Comment #8 from Richard Sandiford --- Fixed on trunk. Will backport after a while if there is no fallout.
[Bug target/116413] [LRA] [M68K] ICE: unrecognized insn in lra_set_insn_recog_data, at lra.cc:1036
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116413 --- Comment #17 from Richard Sandiford --- Hmm, but if the ICE is coming from vregs then it doesn't sound like it's related to LRA. vregs is the first RTL pass to run.
[Bug target/116413] [LRA] [M68K] ICE: unrecognized insn in lra_set_insn_recog_data, at lra.cc:1036
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116413 --- Comment #14 from Richard Sandiford --- Created attachment 58967 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58967&action=edit Patch for the decompse_mem_address ICE Thanks. Can you try the attached patch?
[Bug target/116413] [LRA] [M68K] ICE: unrecognized insn in lra_set_insn_recog_data, at lra.cc:1036
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116413 --- Comment #11 from Richard Sandiford --- I can't reproduce this with m68k-elf. Do you have any local changes beyond making TARGET_LRA_P return true?
[Bug rtl-optimization/116321] [lra][avr] internal compiler error: in avr_out_lpm_no_lpmx, at config/avr/avr.cc:4572
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116321 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #5 from Richard Sandiford --- Created attachment 58964 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58964&action=edit Proof of concept patch The problem seems to be a bad interaction between spilling and register elimination. I think it would only occur on targets that use a real target register as FRAME_POINTER_REGNUM, rather than having a separate FRAME_POINTER_REGNUM and HARD_FRAME_POINTER_REGNUM. IRA sees: (insn 47 46 48 2 (set (reg:QI 24 r24 [+6 ]) (mem:QI (reg/f:HI 58) [1 *p_2(D)+6 S1 A8 AS1])) "pr116321.c":6:1 86 {movqi_insn_split} (expr_list:REG_DEAD (reg/f:HI 58) (nil))) (insn 48 47 49 2 (set (reg:QI 25 r25 [+7 ]) (mem:QI (reg/f:HI 60) [1 *p_2(D)+7 S1 A8 AS1])) "pr116321.c":6:1 86 {movqi_insn_split} (expr_list:REG_DEAD (reg/f:HI 60) (nil))) where both instructions need an r30 base. As it happens, IRA assigns r30 to r60 . However, when it comes to reloading earlier insns, we have to use r30 for the base even though r60 is live, so we need to spill r60: ** Assignment #1: ** Assigning to 91 (cl=POINTER_Z_REGS, orig=58, freq=2000, tfirst=91, tfreq=2000)... Trying 30: spill 60(freq=2000) Now best 30(cost=0, bad_spills=0, insn_pseudos=0) Trying 31: spill 60(freq=2000) Spill r60(hr=30, freq=2000) for r91 Assign 30 to reload r91 (freq=2000) Assigning to 85 (cl=LD_REGS, orig=50, freq=3000, tfirst=85, tfreq=3000)... Assign 30 to reload r85 (freq=3000) Assigning to 86 (cl=LD_REGS, orig=52, freq=3000, tfirst=86, tfreq=3000)... Assign 30 to reload r86 (freq=3000) Assigning to 87 (cl=LD_REGS, orig=54, freq=3000, tfirst=87, tfreq=3000)... Assign 30 to reload r87 (freq=3000) Assigning to 88 (cl=LD_REGS, orig=56, freq=3000, tfirst=88, tfreq=3000)... Assign 30 to reload r88 (freq=3000) Assigning to 89 (cl=LD_REGS, orig=58, freq=3000, tfirst=89, tfreq=3000)... Assign 26 to reload r89 (freq=3000) Assigning to 90 (cl=LD_REGS, orig=77, freq=3000, tfirst=90, tfreq=3000)... Assign 24 to reload r90 (freq=3000) Reassigning non-reload pseudos Assign 28 to r60 (freq=2000) where r28 is the eliminated FRAME_POINTER_REGNUM. So far so good (AFAICT). But when trying to reload insn 48 after the spill, we use get_reg_class to work out what class the base register (r60) has. This is structured as: if (! HARD_REGISTER_NUM_P (hard_regno = regno)) hard_regno = lra_get_regno_hard_regno (regno); if (hard_regno >= 0) { hard_regno = lra_get_elimination_hard_regno (hard_regno); return REGNO_REG_CLASS (hard_regno); } So we replace r60 with r28 thanks to the assignment, but then apply eliminations to r28 to get r32 (the stack pointer). Things go downhill from there. This patch is a naive proof-of-concept fix, on the basis that eliminations should only apply to registers that aren't subject to allocation. But I imagine something somewhere else will break.
[Bug rtl-optimization/116238] [15 Regression] ICE building 526.blender_r on aarch64 SVE after r15-1619-g3b9b8d6cfdf593
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116238 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #5 from Richard Sandiford --- Yeah, seems to be a latent bug in aarch64_hard_regno_caller_save_mode. A brute-force reproducer is: void foo(); typedef unsigned char v2qi __attribute__((vector_size(2))); void f(v2qi *ptr) { v2qi x = *ptr; asm volatile ("" :: "w" (x)); asm volatile ("" ::: "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15"); foo(); asm volatile ("" :: "w" (x)); *ptr = x; }
[Bug target/116413] [LRA] [M68K] ICE: unrecognized insn in lra_set_insn_recog_data, at lra.cc:1036
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116413 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #9 from Richard Sandiford --- (In reply to Michael Matz from comment #8) > Yeah, reload doesn't expect ASHIFTs within operands here. The below would > fix > that: > > diff --git a/gcc/final.cc b/gcc/final.cc > index eb9e065d9f0..5d911586de5 100644 > --- a/gcc/final.cc > +++ b/gcc/final.cc > @@ -3146,6 +3146,7 @@ walk_alter_subreg (rtx *xp, bool *changed) > case PLUS: > case MULT: > case AND: > +case ASHIFT: >XEXP (x, 0) = walk_alter_subreg (&XEXP (x, 0), changed); >XEXP (x, 1) = walk_alter_subreg (&XEXP (x, 1), changed); >break; LGTM FWIW. The different canonicalisation rules around ASHIFT/MULT in memories vs outside memories is an unfortunate wart, but what LRA is doing for lea is the correct behaviour.
[Bug target/115683] [15 Regression] SSE2 regressions after obselete of vcond{,u,eq}.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115683 --- Comment #8 from Richard Sandiford --- (In reply to Uroš Bizjak from comment #7) > Richi, maybe tree optimizers can perform their optimizations with > vec_cmp{,u} and vcond_mask, and at the end provide the true coditional > vector move (that calls "vcond{,u}") as a compound operation of these two > operations? Yeah, if a target can only handle a comparison by inverting the result, then IMO the long-term outcome should be that the target doesn't provide that comparison, and that target-independent code instead does the obvious adjustment. At the moment (or at least, before the removal of the optabs), every target did this internally, and the true picture was hidden from gimple.
[Bug target/114603] aarch64: Invalid SVE cnot optimisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114603 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED Known to work||12.4.1, 13.3.1, 15.0 --- Comment #6 from Richard Sandiford --- Fixed on all active branches.
[Bug target/114607] aarch64: Incorrect expansion of svsudot
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607 Richard Sandiford changed: What|Removed |Added Known to work||12.4.1, 13.3.1, 14.1.0, ||15.0 Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #6 from Richard Sandiford --- Fixed on all active branches.
[Bug target/115464] [14 Backport] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #18 from Richard Sandiford --- The patches I pushed should be enough to fix the PR on GCC 14. (Oops, I forgot that I said I would wait until Monday for comments. :( )
[Bug target/113939] Switch m68k to LRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113939 Bug 113939 depends on bug 116236, which changed state. Bug 116236 Summary: [LRA] [M68K] ICE insn does not satisfy its constraints https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #24 from Richard Sandiford --- Hopefully fixed, but please let me know if this code causes similar problems elsewhere.
[Bug target/116343] [15 regression] ICE on valid code at -Os with "-fschedule-insns -fno-thread-jumps -fno-dce" on x86_64-linux-gnu: in extract_insn, at recog.cc:2869
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116343 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #7 from Richard Sandiford --- Fixed. Thanks for the report.
[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236 --- Comment #22 from Richard Sandiford --- (In reply to Michael Matz from comment #21) > (In reply to Richard Sandiford from comment #17) > > > But if LRA needs to be extended for correctness, then, ... meh. > > But this is how it's always worked. The corresponding reload code is in > > find_reloads_address_1, which similarly tries to tear apart an address, work > > out whether something is a base or an index, and use that to calculate the > > appropriate register class. ISTM this PR is just about an inconsistency > > between the base/index identification in LRA vs reload. > > > > The reg_renumber indirection performed by GO_IF_LEGITIMATE_ADDRESS for > > reload only handled the initial register allocation. It didn't help for > > pseudos that were spilled, or were allocated to the wrong class. The > > BASE_REG_CLASS/INDEX_REG_CLASS mechanism was always required to work for > > correctness, not just optimisation. > > Yes, and I can't say that I ever liked it very much in reload either (for > basically the same reasons I'm whining now here) :-) Perhaps I have just > hoped > LRA would come up with some other solution. Yeah, I definitely agree that a different solution would nicer. Part of the problem with the current approach is obvious from the way we had BASE_REG_CLASS, then (when that wasn't enough) MODE_BASE_REG_CLASS, then MODE_BASE_REG_REG_CLASS, and finally(?) MODE_CODE_BASE_REG_CLASS. :) So I agree that it would be better for the target to say what parts of an address need reloading, and how, without any of the current guesswork.
[Bug target/116371] The SME2 svpext intrinsics are missing a _lane suffix
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116371 --- Comment #2 from Richard Sandiford --- Fixed on trunk. I'll wait a bit before backporting.
[Bug target/116371] The SME2 svpext intrinsics are missing a _lane suffix
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116371 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Ever confirmed|0 |1 Last reconfirmed||2024-08-14 Status|UNCONFIRMED |ASSIGNED
[Bug target/116371] New: The SME2 svpext intrinsics are missing a _lane suffix
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116371 Bug ID: 116371 Summary: The SME2 svpext intrinsics are missing a _lane suffix Product: gcc Version: 14.1.0 Status: UNCONFIRMED Keywords: rejects-valid Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* The intrinsics for PEXT are supposed to be called svpext_lane_cN and svpext_lane_cN_x2, but GCC is missing the _lane suffix.
[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236 --- Comment #19 from Richard Sandiford --- Of course, immediately after posting I realise it should be address_mode instead of pointer_mode, but that shouldn't affect m68k.
[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236 --- Comment #18 from Richard Sandiford --- Created attachment 58927 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58927&action=edit Candidate patch Could you try the attached patch? It seems to fix the reduced testcase for me.
[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #17 from Richard Sandiford --- (In reply to Michael Matz from comment #16) > (In reply to Richard Sandiford from comment #15) > > > Yes, I considered adding this handling of (zero_extend Rx) to LRA. I'm > > > confident > > > it would fix this instance of the problem, if done correctly (it > > > essentially > > > already has code to deal with shortening subregs, which is somewhat > > > similar > > > in structure). > > It should already be present, via the decompose_*_address mechanism. > > As get_index_term doesn't accept any _EXTEND as index outer code I don't see > how (currently). It basically gets into the tie-breaker of leaving the > operands > after the loop in decompose_normal_address for baseness() to decide which of > course gets it "wrong" on m68k (see my comment #5 where I checked :) ) ... > > > FWIW, aarch64 also supports (mem (plus (zero_extend R1) R2)) and works with > > LRA, so I don't think there's a fundamental limitation. > > ... but obviously gets it right by luck on aarch64. Hmm, ok. I think that's probably my bug then, so... taking. > > > invalid, do something". Alternatively (and better) it needs to have a way > > > to say "this > > > address, while structurally valid, will need these regsets in those reg > > > operands", generally (i.e. it should be possible to have targets with > > > e.g. 4 > > > register operands, or such). > > > > > > If that's not possible then the design of the LRA-target interface is not > > > yet complete IMHO. > > Yeah, the current approach is the latter one (which I agree is better). > > legitimate_address_p answers the question “is this address structurally > > valid?” while BASE_REG_CLASS and INDEX_REG_CLASS specify the regsets that > > should be used to reload registers in structurally valid addresses. > > Well, that then rules out targets that allow three registers in addresses. > Or ones that have more complicated validity rules than "op1 here, op2 there". > E.g. "when op1 is an even-numbered register, then op2 needs to be > odd-numbered, > and vice versa". Contrived, sure, but ISA might want to save a bit here or > there in encoding. > > > Like you say, in practice it has to be done by using regsets, since the RA > > needs to know “what do I need to do to make this valid?”. It shouldn't have > > to use trial and error (trying particular hard registers to see if they're > > valid). > > Agreed. I just think that BASE/INDEX_REG_CLASS as the only way of > communicating that from the target to LRA is quite constraining. It means > (like > here) to always having to adjust LRA whenever a new target with other forms > comes along. Basically: when the target has to look at the structure of > an address to check validity (reasonably so), then it irks me that also > LRA needs to get at the inner structure of those addresses for correctness. > The latter part, the correctness, is what triggers me. If it's necessary for > optimality: sure. But if LRA needs to be extended for correctness, then, ... > meh. But this is how it's always worked. The corresponding reload code is in find_reloads_address_1, which similarly tries to tear apart an address, work out whether something is a base or an index, and use that to calculate the appropriate register class. ISTM this PR is just about an inconsistency between the base/index identification in LRA vs reload. The reg_renumber indirection performed by GO_IF_LEGITIMATE_ADDRESS for reload only handled the initial register allocation. It didn't help for pseudos that were spilled, or were allocated to the wrong class. The BASE_REG_CLASS/INDEX_REG_CLASS mechanism was always required to work for correctness, not just optimisation.
[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236 --- Comment #15 from Richard Sandiford --- (In reply to Michael Matz from comment #14) > (In reply to Richard Sandiford from comment #13) > > (In reply to Michael Matz from comment #12) > > > That's why I struggle a bit, I lack the bigger picture, how LRA is > > > envisioned to deal with these situations given how targets are supposed > > > (and > > > still documented) > > > to use the strict/nonstrict distinction of testing for legal operands. > > The generic address decomposition logic should realise that, in: > > > > (plus (zero_extend R1) R2) > > > > R2 must be the base and R1 must be the index. Therefore, R2 must be > > constrained to BASE_REG_CLASS (or its variants) and R1 must be constrained > > to INDEX_REG_CLASS (or its variants). > > Yes, I considered adding this handling of (zero_extend Rx) to LRA. I'm > confident > it would fix this instance of the problem, if done correctly (it essentially > already has code to deal with shortening subregs, which is somewhat similar > in structure). It should already be present, via the decompose_*_address mechanism. FWIW, aarch64 also supports (mem (plus (zero_extend R1) R2)) and works with LRA, so I don't think there's a fundamental limitation. > But then I asked myself "what if another target has still > different representations of address?": Should LRA be continuously extended > to deal with these and those representations? Sure, for optimality that > might > be necessary. But in the grand scheme of things, the target should have the > opportunity, without any changes to LRA code, to be made completely working > with whatever representations it's using internally. It needs to have a way > to say "this address, if written with the assigned hardregs, will be > invalid, do something". Alternatively (and better) it needs to have a way > to say "this > address, while structurally valid, will need these regsets in those reg > operands", generally (i.e. it should be possible to have targets with e.g. 4 > register operands, or such). > > If that's not possible then the design of the LRA-target interface is not > yet complete IMHO. Yeah, the current approach is the latter one (which I agree is better). legitimate_address_p answers the question “is this address structurally valid?” while BASE_REG_CLASS and INDEX_REG_CLASS specify the regsets that should be used to reload registers in structurally valid addresses. Like you say, in practice it has to be done by using regsets, since the RA needs to know “what do I need to do to make this valid?”. It shouldn't have to use trial and error (trying particular hard registers to see if they're valid). > > (FWIW, I think it's correct that LRA never passes strict_p==true. > > The strict_p distinction should go away once reload does.) > > But how could it? The various target hooks are called from LRA when the > pseudos are _not_ yet replaced with their MEMs or their hardregs. How would > the target know to do this substitution internally for correctness checking? > How would it know that one pseudo is equivalent to a MEM, and another to a > hardreg? Looking at lra_in_progress as some targets are currently doing (and > my hack does)? That doesn't sound like a good design, it's simply a hidden > parameter for the target hook in a global variable. > > So, how else? Substituting before the hooks are asked is just busy work, > not asking the target at all means continuously adding stuff to LRA whenever > a > target wants something more. BASE_REG_CLASS and INDEX_REG_CLASS (and their variants) specify what needs to happen for an address that doesn't currently use hard registers. And legitimate_address_p should test whether addresses that *do* currently use hard registers are using the right registers. There should be no need for the hooks to do their own pseudo-to-hard-register lookup. The idea that one pseudo register maps to one hard register is very simplistic and doesn't take region-based allocation or inheritance into account. IMO it's a concept that should go away when reload does. And it doesn't make sense IMO to accept (say) %aN index registers or %dN base registers at any stage, even before allocation.
[Bug target/116343] [15 regression] ICE on valid code at -Os with "-fschedule-insns -fno-thread-jumps -fno-dce" on x86_64-linux-gnu: in extract_insn, at recog.cc:2869
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116343 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #5 from Richard Sandiford --- Argh, substituting into the REG_EQUAL note is causing INSN_CODE to be reset. Testing a patch…
[Bug target/116343] [15 regression] ICE on valid code at -Os with "-fschedule-insns -fno-thread-jumps -fno-dce" on x86_64-linux-gnu: in extract_insn, at recog.cc:2869
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116343 --- Comment #4 from Richard Sandiford --- (In reply to Andrew Pinski from comment #3) > Note I think late_combine 1 depends on DCE later on to delete the > noop_move_p instructions but since -fno-dce was passed on the command line, > it is not deleted and that causes the ICE. I can look tomorrow, but: if the insn code is set to NOOP_MOVE_INSN_CODE, possibly_queue_changes is supposed to ensure that perform_pending_updates deletes it later.
[Bug target/116312] Use LDP instead of LD2 on for Advanced SIMD when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116312 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #3 from Richard Sandiford --- FWIW, see the comment in aarch64_sve_adjust_stmt_cost for some of the problems with costing LDP and STP correctly: /* Advanced SIMD can load and store pairs of registers using LDP and STP, but there are no equivalent instructions for SVE. This means that (all other things being equal) 128-bit SVE needs twice as many load and store instructions as Advanced SIMD in order to process vector pairs. Also, scalar code can often use LDP and STP to access pairs of values, so it is too simplistic to say that one SVE load or store replaces VF scalar loads and stores. Ideally we would account for this in the scalar and Advanced SIMD costs by making suitable load/store pairs as cheap as a single load/store. However, that would be a very invasive change and in practice it tends to stress other parts of the cost model too much. E.g. stores of scalar constants currently count just a store, whereas stores of vector constants count a store and a vec_init. This is an artificial distinction for AArch64, where stores of nonzero scalar constants need the same kind of register invariant as vector stores. An alternative would be to double the cost of any SVE loads and stores that could be paired in Advanced SIMD (and possibly also paired in scalar code). But this tends to stress other parts of the cost model in the same way. It also means that we can fall back to Advanced SIMD even if full-loop predication would have been useful. Here we go for a more conservative version: double the costs of SVE loads and stores if one iteration of the scalar loop processes enough elements for it to use a whole number of Advanced SIMD LDP or STP instructions. This makes it very likely that the VF would be 1 for Advanced SIMD, and so no epilogue should be needed. */
[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #13 from Richard Sandiford --- (In reply to Michael Matz from comment #12) > _Here_ we have a case where the target isn't even in the position to reject > an invalid address (or one that definitely _will_ become invalid), because > IRA/LRA don't make use of the strict predicates, and even if it were I don't > see any code dealing with the fall-out of the target saying that > '(plus (zero_extend(a0)) (d0))' would be invalid but > '(plus (zero_extend(d0)) (a0))' would be valid. > (my hack rectifies this a bit in forcing LRA to just reload the whole > invalid address to another reg, but it doesn't fix all failures of this mode > in the testsuite) > > That's why I struggle a bit, I lack the bigger picture, how LRA is > envisioned to deal with these situations given how targets are supposed (and > still documented) > to use the strict/nonstrict distinction of testing for legal operands. The generic address decomposition logic should realise that, in: (plus (zero_extend R1) R2) R2 must be the base and R1 must be the index. Therefore, R2 must be constrained to BASE_REG_CLASS (or its variants) and R1 must be constrained to INDEX_REG_CLASS (or its variants). If that isn't happening then I agree it's an LRA bug. (FWIW, I think it's correct that LRA never passes strict_p==true. The strict_p distinction should go away once reload does.)
[Bug other/30920] Incorrect splaying that fails to assure the caching property
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30920 --- Comment #8 from Richard Sandiford --- The patch in comment 7 is just one step.
[Bug target/116145] SVE constant pool loads not hoisted outside loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145 Richard Sandiford changed: What|Removed |Added Resolution|FIXED |--- Status|RESOLVED|REOPENED --- Comment #12 from Richard Sandiford --- I had to revert the patch due to PR116200.
[Bug rtl-optimization/116200] [15 regression] ICE in stage 2 since r15-2696-gba730fd10934e4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116200 Richard Sandiford changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #8 from Richard Sandiford --- Sorry for the breakage. I've reverted the patch.
[Bug other/30920] Incorrect splaying that fails to assure the caching property
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30920 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Last reconfirmed||2024-08-02 --- Comment #6 from Richard Sandiford --- When I added splay-tree-utils.h, Jeff asked me to look at converting typed_splay_utils to use it, so that we didn't have two competing C++ implementations. I forgot to go back and do that. I wasn't aware of this PR though. So I'll assign myself for the part about moving typed_splay_utils over to splay-tree-utils.h. Looks like the main missing piece is foreach, for which I'll try to use iterators.
[Bug target/116145] SVE constant pool loads not hoisted outside loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #11 from Richard Sandiford --- Fixed.
[Bug target/116145] Suboptimal SVE immediate synthesis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #7 from Richard Sandiford --- I'll test a cleaned-up version of the change in comment 6. Kyrill, is it OK to use your godbolt testcase in the testsuite?
[Bug target/116145] Suboptimal SVE immediate synthesis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #6 from Richard Sandiford --- The problem seems to be that may_trap_or_fault_p returns true for the memory, which stops loop2_invariant from hoisting the load. The code there is: if (/* MEM_NOTRAP_P only relates to the actual position of the memory reference; moving it out of context such as when moving code when optimizing, might cause its address to become invalid. */ code_changed || !MEM_NOTRAP_P (x)) { poly_int64 size = MEM_SIZE_KNOWN_P (x) ? MEM_SIZE (x) : -1; return rtx_addr_can_trap_p_1 (XEXP (x, 0), 0, size, GET_MODE (x), code_changed); } (where code_changed is true). Since the SVE load doesn't support the full LO_SUM range, there is no hint that the address refers to invariant memory. I'm not sure the comment in the code is correct for MEM_READONLY_P, which only refers to _statically allocated_ readonly memory. (At least, the comment doesn't apply to constant memory created by force_const_mem, which creates things that cannot trap regardless of context.) Adding: if (MEM_READONLY_P (x) && MEM_NOTRAP_P (x)) return false; makes the test work for me.
[Bug target/115881] [15 Regression] ICE: RTL check: expected code 'reg', have 'subreg' in rhs_regno, at rtl.h:1934 with -O2 -mx32 -maddress-mode=long
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115881 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #9 from Richard Sandiford --- Fixed.
[Bug tree-optimization/116109] Missed optimisation: unnecessary register dependency on reduction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116109 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #4 from Richard Sandiford --- This isn't useful given that it isn't your target, but FWIW, we do take reduction latencies into account when deciding the vectorisation/unrolling factor for AArch64. E.g. the testcase produces the main loop: .L5: ldp q25, q24, [x1] ldp q30, q31, [x1, 32] add x1, x1, 64 fmlav26.2d, v25.2d, v25.2d fmlav27.2d, v24.2d, v24.2d fmlav28.2d, v30.2d, v30.2d fmlav29.2d, v31.2d, v31.2d cmp x3, x1 bne .L5 when compiled with -Ofast -mcpu=neoverse-v2. The decision is made by aarch64_vector_costs::determine_suggested_unroll_factor, which uses issue rates and reduction latencies recorded by the main costing code.
[Bug rtl-optimization/116136] [15 Regression] ext-dce exposes latent subreg simplification bug (big-endian) on m68k
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116136 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #2 from Richard Sandiford --- I was curious how far back this went, but it seems to be present in the original combine.c commit: /* Changing mode twice with SUBREG => just change it once, or not at all if changing back to starting mode. */ if (GET_CODE (SUBREG_REG (x)) == SUBREG) { if (mode == GET_MODE (SUBREG_REG (SUBREG_REG (x))) && SUBREG_WORD (x) == 0 && SUBREG_WORD (SUBREG_REG (x)) == 0) return SUBREG_REG (SUBREG_REG (x)); Kind-of amazed that this went undetected so long, especially since multiword big-endian values were much more common back in the day.
[Bug tree-optimization/116100] (VEC_COND (UNCOND_EXPR)) -> COND_FN conversion should happen in isel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116100 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #2 from Richard Sandiford --- Also, the COND_FNs are generated directly by the vectoriser for things that cannot be accurately represented any other way (for FP ops). So I think it's relatively more important that other passes handle them well.
[Bug target/116044] [15 Regression] GCN vs. rtl-ssa: Avoid using a stale splay tree root [PR116009]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116044 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Richard Sandiford --- Fixed.
[Bug middle-end/116058] [15 Regression] sh4-linux-gnu fails to bootstrap, late combine issue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116058 Richard Sandiford changed: What|Removed |Added CC|richard.sandiford at arm dot com |rsandifo at gcc dot gnu.org --- Comment #10 from Richard Sandiford --- The patch in comment 8 looks like the correct fix to me FWIW. It's ok for trunk from my POV, if Jeff agrees.
[Bug target/116044] [15 Regression] GCN vs. rtl-ssa: Avoid using a stale splay tree root [PR116009]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116044 Richard Sandiford changed: What|Removed |Added Last reconfirmed||2024-07-23 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Ever confirmed|0 |1 Status|UNCONFIRMED |ASSIGNED --- Comment #2 from Richard Sandiford --- Gah, this part of the code is causing more than it's fair share of trouble…
[Bug rtl-optimization/116009] [15 regression] ICE when building cython-3.0.10 on arm64 for Python 3.12 (insert_def_after, at rtl-ssa/accesses.cc:622)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116009 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #10 from Richard Sandiford --- Fixed.
[Bug target/115969] [15 regression] ICE when building clang-16.0.6 on arm64 (output_operand: invalid expression as operand)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115969 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #10 from Richard Sandiford --- Fixed. The patch could be backported to release branches if the same error shows up in other circumstances, but as Andrew says, the problem seems to have been latent since the port was added.
[Bug rtl-optimization/116009] [15 regression] ICE when building cython-3.0.10 on arm64 for Python 3.12 (insert_def_after, at rtl-ssa/accesses.cc:622)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116009 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #6 from Richard Sandiford --- Testing a patch.
[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #10 from Richard Sandiford --- Fixed. Although the problem is latent on branches, I think we should only consider a backport after a few months, to give plenty of time for fallout to be detected.
[Bug tree-optimization/115991] [15 Regression] ICE on linux-6.10 in ix86_print_operand_address_as() since r15-1945-g9d20529d94b232
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115991 --- Comment #9 from Richard Sandiford --- (In reply to Andrew Pinski from comment #7) > (In reply to Sam James from comment #6) > > The ICE bisects to r15-1945-g9d20529d94b232. The ivopts issue needs to be > > done still. > > Note this is basically the same issue as PR 115881 except it could be fixed > earlier on with an IV-OPTs fix I think. Yeah, I suggest keeping this PR for the ivopts part only and using PR115881 to track the subreg-of-sp thing.
[Bug target/115969] [15 regression] ICE when building clang-16.0.6 on arm64 (output_operand: invalid expression as operand)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115969 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #8 from Richard Sandiford --- Yeah, agree that's where the problem is.
[Bug target/115950] Missed SVE fold to INCP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115950 --- Comment #5 from Richard Sandiford --- Alternatively, I suppose we could say that svcntp_bN(svptrue_bN(), pg) etc. should be folded to svcntp_bN(pg, pg) in gimple/expand, and then make the INCP/DECP wrappers only match the equal-predicate version.
[Bug target/115950] Missed SVE fold to INCP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115950 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #4 from Richard Sandiford --- I'd rather not double up the patterns for this equivalence. Seems like one of those cases where being able to define target-specific unspec simplification rules would help.
[Bug rtl-optimization/115929] [15 regression] ICE on valid code at -O{2,3} with "-fschedule-insns" on x86_64-linux-gnu: Segmentation fault
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115929 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Richard Sandiford --- Fixed.
[Bug rtl-optimization/115928] [15 regression] ICE on valid code at -O2 with "-fgcse-sm" on x86_64-linux-gnu: in merge_clobber_groups, at rtl-ssa/accesses.cc:757
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115928 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #2 from Richard Sandiford --- Fixed.
[Bug rtl-optimization/115929] [15 regression] ICE on valid code at -O{2,3} with "-fschedule-insns" on x86_64-linux-gnu: Segmentation fault
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115929 --- Comment #3 from Richard Sandiford --- As it turned out, the two tests exposed different bugs. I've submitted a patch for the other one and will close once that's resolved.
[Bug rtl-optimization/115901] [15 regression] ICE when building coreutils-9.5 on arm64 with -O3 -flto -fno-vect-cost-model -ftrivial-auto-var-init=zero
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115901 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #12 from Richard Sandiford --- Fixed.
[Bug target/115891] [15 regression] libgcrypt tests segfault in crc32_less_than_16 with LTO with late-combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115891 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Richard Sandiford --- Fixed.
[Bug rtl-optimization/115929] [15 regression] ICE on valid code at -O{2,3} with "-fschedule-insns" on x86_64-linux-gnu: Segmentation fault
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115929 Richard Sandiford changed: What|Removed |Added Last reconfirmed||2024-07-15 Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org
[Bug rtl-optimization/115928] [15 regression] ICE on valid code at -O2 with "-fgcse-sm" on x86_64-linux-gnu: in merge_clobber_groups, at rtl-ssa/accesses.cc:757
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115928 Richard Sandiford changed: What|Removed |Added Last reconfirmed||2024-07-15 Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org
[Bug rtl-optimization/115901] [15 regression] ICE when building coreutils-9.5 on arm64 with -O3 -flto -fno-vect-cost-model -ftrivial-auto-var-init=zero
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115901 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #10 from Richard Sandiford --- Gah. I'll restrict r15-1972 to non-paradoxical subregs (unless the replacement value is constant).
[Bug target/115891] [15 regression] libgcrypt tests segfault in crc32_less_than_16 with LTO with late-combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115891 Richard Sandiford changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2024-07-12 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|UNCONFIRMED |ASSIGNED --- Comment #3 from Richard Sandiford --- Comes from: trying to combine definition of r5 in: 1058: di:SI=ax:SI into: 299: {[di:SI]=asm_operands;clobber ax:SI;clobber flags:CC;} successfully matched this instruction: (parallel [ (set (mem:SI (reg:SI 0 ax [520]) [2 *_3+0 S4 A32]) (asm_operands/v:SI ("movd %%xmm0, %%eax bswapl %%eax movl %%eax, %0 ") ("=m") 0 [] [] [] ../cipher/crc-intel-pclmul.c:781)) (clobber (reg:SI 0 ax)) (clobber (reg:CC 17 flags)) ]) original cost = 4 + 4 (weighted: 0.090837), replacement cost = 4 (weighted: 0.045419); keeping replacement which ought to be prevented by the clobber, but for some reason isn't.
[Bug rtl-optimization/115785] [15 regression] ICE when building embree-4.3.1 on amd64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115785 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Richard Sandiford --- Fixed. Thanks for the report.
[Bug target/115881] [15 Regression] ICE: RTL check: expected code 'reg', have 'subreg' in rhs_regno, at rtl.h:1934 with -O2 -mx32 -maddress-mode=long
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115881 Richard Sandiford changed: What|Removed |Added Assignee|rsandifo at gcc dot gnu.org|unassigned at gcc dot gnu.org Status|ASSIGNED|NEW --- Comment #4 from Richard Sandiford --- As expected, the results weren't pretty. Things like: (define_split [(set (match_operand:SWI48 0 "register_operand") (and:SWI48 (match_dup 0) (const_int -65536))) (clobber (reg:CC FLAGS_REG))] "(TARGET_FAST_PREFIX && !TARGET_PARTIAL_REG_STALL) || optimize_function_for_size_p (cfun)" [(set (strict_low_part (match_dup 1)) (const_int 0))] "operands[1] = gen_lowpart (HImode, operands[0]);") expect to be able to generate (strict_low_part (subreg:HI (reg:DI SP) 0)) after reload, so we can't ban it outright. The fallback was going to be to make register_operand reject subregs after reload, but patterns like: (define_insn_and_split "*neg_1_slp" [(set (strict_low_part (match_operand:SWI12 0 "register_operand" "+,&")\ ) (neg:SWI12 (match_operand:SWI12 1 "register_operand" "0,!"))) (clobber (reg:CC FLAGS_REG))] use register_operand for cases like the above. Not really sure what to do here.
[Bug target/115881] [15 Regression] ICE: RTL check: expected code 'reg', have 'subreg' in rhs_regno, at rtl.h:1934 with -O2 -mx32 -maddress-mode=long
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115881 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #3 from Richard Sandiford --- Hmm, yeah. This is due to an inconsistency between simplify_subreg_regno (which rightly refuses to generate (reg:SI sp)) and validate_subreg, which instead uses subreg_offset_representable_p + some preparatory tests. Those preparatory tests are similar, but not quite identical to, the first tests in simplify_subre_regno. And they don't include the stack and frame pointer tests. In principle, it feels like the validate_subreg test should just be: /* For hard registers, check that the subreg can be folded to a REG. */ if (reg && REG_P (reg) && HARD_REGISTER_P (reg)) return simplify_subreg_regno (REGNO (reg), imode, offset, omode) >= 0; but I've no idea how much that will break. Will try it and see.
[Bug rtl-optimization/115782] [15 Regression] ICE on valid code at -O{2,3} with "-fno-guess-branch-probability -fgcse-sm -fno-expensive-optimizations -fno-gcse" on x86_64-linux-gnu: in possibly_queue_
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115782 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #4 from Richard Sandiford --- Fixed. It also fixes the second example, but I didn't see that in time to include it in the commit.
[Bug rtl-optimization/115785] ICE when building embree-4.3.1 on amd64 with -flate-combine-instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115785 --- Comment #2 from Richard Sandiford --- The patch in #c1 is just part 1. Part 2 will fix the bug.
[Bug ipa/114531] Feature proposal for an `-finline-functions-aggressive` compiler option
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114531 --- Comment #20 from Richard Sandiford --- (In reply to Jan Hubicka from comment #18) > I am trying to understand how useful this is. I am basically worried > about two things > 1) we have other optimization passes that behave differently at -O2 and > -O3 (vectorizer, unrolling etc.) and I think we may want to have > more. We also have -Os and -O1. > > So perhaps we want kind of more systmatic solution. We already have > -fvect-cost-model that is kind of vectorizer version of the proposed > inliner option. Yeah, like you say, -fvect-cost-model solves this problem for the vectoriser. There we have an intermediate setting (cheap) that isn't the default for any -O. But the proposal isn't to do something like that for inlining. >From a quick grep, it looked like there are three places that distinguish directly between -O2 and -O3: (1) a couple of expansions in builtins.cc (2) cunrolli, which uses it to decide whether unrolling can increase size (3) number_of_iterations_exit_assumptions, where it decides whether outer evolutions should be taken into account. (Was surprised by this one.) If someone does want to control those things separately in future, I think the defined way of doing that would be to add a -f flag. So it feels like this patch is moving in the same direction that we'd expect for the rest. > 2) inliner is already quite painful to tune. Especially since > one really needs to benchmark packages significantly bigger than > SPECs which tends to be bit hard to set up and benchmark > meaningfully. I usually do at least Firefox and clang where the > first is always quite some work to get working well with latest > GCC. We SUSE's LNT we also run "C++ behchmarks" which were > initially collected as kind of inliner tests with higher > abstraction penalty (tramp3d etc.). > > For many years I benchmarked primarily -O3 and -O3 + profile > feedbcak on x86-64 only with ocassional look at -O2 and -Os > behaviour which were generally more stable. > I also tested other targets (poer and aarch64) but just > sporadically, which is not good. > > After GCC5 I doubled testing to include both lto/non-lto variant. > Since GCC10 -O2 started to envolve and needed re-testing too > (lto/nonlto). One metric I know I ought to tune is -O2 -flto and > FDO which used to be essentially -O3 before the optimization level > --params were introduced, but now -O2 + FDO inlining is more > conservative which hurts, for example, profiledbootstrapped GCC. > > So naturally I am bit worried to introduce even more combinations > that needs testing and maintenance. If we add user friendly way to > tweak this, we also make a promise to keep it sane. Yeah, I agree we don't want to increase the number of supported settings. My understanding of the new option is that it should, by definition, do exactly what -O3 does, now and in the future. So there would be no extra tuning burden. Whatever is right for -O3 in future is also right for the new option.
[Bug rtl-optimization/115785] ICE when building embree-4.3.1 on amd64 with -flate-combine-instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115785 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Ever confirmed|0 |1 Last reconfirmed||2024-07-05
[Bug rtl-optimization/115782] ICE on valid code at -O{2,3} with "-fno-guess-branch-probability -fgcse-sm -fno-expensive-optimizations -fno-gcse" on x86_64-linux-gnu: in possibly_queue_changes, at rtl-
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115782 Richard Sandiford changed: What|Removed |Added Last reconfirmed||2024-07-05 Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #1 from Richard Sandiford --- I'll take a look.
[Bug tree-optimization/115629] Inefficient if-convert of masked conditionals
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115629 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #8 from Richard Sandiford --- Perhaps I'm missing the point, but I think one of the issues here is that we (still) don't model that MASK_LOAD sets inactive elements to zero. Inactive elements are currently undefined instead. (I think Robin mentioned that assuming zero is problematic for RVV, so we might need an explicit MASK_LOAD argument for inactive elements, like for COND_ADD etc.) So quoting the IL in comment 4: # loop_mask_63 = PHI vect__4.10_64 = .MASK_LOAD (vectp_a.8_53, 32B, loop_mask_63); mask__31.11_66 = vect__4.10_64 != { 0, ... }; mask__56.12_67 = ~mask__31.11_66; vec_mask_and_70 = mask__56.12_67 & loop_mask_63; vect__7.15_71 = .MASK_LOAD (vectp_c.13_68, 32B, vec_mask_and_70); mask__22.16_73 = vect__7.15_71 == { 0, ... }; mask__34.17_75 = vec_mask_and_70 & mask__22.16_73; I think this and... vect_iftmp.20_78 = .MASK_LOAD (vectp_d.18_76, 32B, mask__34.17_75); vect__61.21_79 = vect__4.10_64 | vect__7.15_71; mask__35.22_81 = vect__61.21_79 != { 0, ... }; vec_mask_and_84 = mask__35.22_81 & loop_mask_63; ...this have to be kept until we model inactive elements. vect_iftmp.25_85 = .MASK_LOAD (vectp_b.23_82, 32B, vec_mask_and_84); _86 = mask__34.17_75 & loop_mask_63; This one is really curious though :) Why does the code think that the loop mask is needed here? Does the code think the mask is needed for correctness, or is the scalar_cond_masked_set optimisation misfiring? vect_iftmp.26_87 = VEC_COND_EXPR <_86, vect_iftmp.20_78, vect_iftmp.25_85>; .MASK_STORE (vectp_res.27_88, 32B, loop_mask_63, vect_iftmp.26_87);
[Bug rtl-optimization/115677] ICE when building argon2 with -flate-combine-instructions on amd64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115677 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #7 from Richard Sandiford --- Fixed.
[Bug rtl-optimization/115677] ICE when building argon2 with -flate-combine-instructions on amd64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115677 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Ever confirmed|0 |1 Last reconfirmed||2024-06-27 Status|UNCONFIRMED |ASSIGNED --- Comment #4 from Richard Sandiford --- Guess we need the usual optimize > 0 guard too.
[Bug target/115634] [15 regression] s390 bootstrap failure since r15-1579-g792f97b44ffc5e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115634 --- Comment #3 from Richard Sandiford --- Yeah, I agree that sounds like the right fix. Specifically, I assume s390_decompose_addrstyle_without_index, when doing: if (op && GET_CODE (op) != REG) return false; should check whether the register is a pseudo register or is a member of ADDR_REGS. The check should probably be limited to reload_completed, so that the RAs have a chance to reload other registers. The reason this works during register allocation is that the RAs check REGNO_OK_FOR_BASE_P and REGNO_OK_FOR_INDEX_P as part of the constraints process. But those macros are not checked automatically by address constraints themselves (perhaps somewhat surprisingly). It has to be done explicitly in target code. I disagree that the predicates & constraints are icky though. Given the way that the architecture treats shift amounts as addresses, it seems like a neat approach.
[Bug ipa/114531] Feature proposal for an `-finline-functions-aggressive` compiler option
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114531 --- Comment #17 from Richard Sandiford --- I can see that it's useful to ask whether the current -O2 & -O3 inlining heuristics are making the right trade-off. But I think that's really a different issue from the one that is raised in the PR. (Unless we think that -O2 and -O3 should always have the same inlining heuristics henceforward, but that seems unlikely.) At the moment, -O3 is essentially -O2 + some -f options + some --param options. Users who want to pick & chose some of the -f options can do so, and can add them to stable build systems. Normally, obsolete -f options are turned into no-ops rather than removed. But users can't pick & choose the --params, and add them to stable build systems, because we reserve the right to remove --params without warning. So IMO, we should have an -f option that represents “the inlining parameters enabled by -O3”, whatever they happen to be for a given release. It's OK if the set is empty. For such a change, it doesn't really matter whether the current --params are the right ones. It just matters that the --params are the ones that we currently use. If the --params are changed later, the -f option and -O3 will automatically stay in sync.
[Bug target/115631] [15 Regression] GCN: [-PASS:-]{+FAIL:+} c-c++-common/torture/builtin-arith-overflow-6.c -O2 execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115631 --- Comment #4 from Richard Sandiford --- Created attachment 58513 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58513&action=edit A patch for a bug seen on arm*-*-* Also, could you check whether the attached patch makes any difference? It fixes a problem seen on arm*-*-*, and I notice GCN also defines cannot_copy_insn_p.
[Bug target/115631] [15 Regression] GCN: [-PASS:-]{+FAIL:+} c-c++-common/torture/builtin-arith-overflow-6.c -O2 execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115631 --- Comment #3 from Richard Sandiford --- I've now pushed a debug counter for late_combine. Sorry to ask, but could you bisect on N in -fdbg-cnt=late_combine:N to see which transformation is causing the problem?
[Bug target/115633] [15 Regression] powerpc64le: "relocation truncated to fit: R_PPC64_TOC16 against `.rodata.cst4'" with (default) '-flate-combine-instructions'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115633 --- Comment #2 from Richard Sandiford --- -flate-combine-instructions is supposed to be disabled by default for all powerpc targets. Could you look at why that isn't the case for you?
[Bug target/115631] [15 Regression] GCN: [-PASS:-]{+FAIL:+} c-c++-common/torture/builtin-arith-overflow-6.c -O2 execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115631 --- Comment #2 from Richard Sandiford --- I suppose for issues like this, it would be useful to have a debug counter to bisect on. I'll post a patch for doing that today, but I'm afraid I'll be relying on someone with gcn access to actually do the bisection.
[Bug other/115622] gcc.dg/ipa/iinline-attr.c fails after r15-1579-g792f97b44ffc5e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115622 --- Comment #1 from Richard Sandiford --- Are you sure about the bisection? late-combine only affects RTL, and in any case is disabled by default for powerpc.
[Bug target/115478] [15 Regression] gcc.target/aarch64/bitint-args.c fails since r15-1120-g2277f987979445
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115478 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #5 from Richard Sandiford --- How about adding a new match_operator predicate to common.md for this kind of situation? It would be nice if it could automatically detect when the two operands have no nonzero bits in common, but doing that would need some refactoring of the nonzero_bits code, to ensure that the predicate gives a consistent result (and does that without polluting the current nonzero_bits cache). In the meantime, it might be enough to say that the insn must enforce the non-overlapping bits check itself.
[Bug target/115613] New: xtensa: splits dependent on can_create_pseudo_p
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115613 Bug ID: 115613 Summary: xtensa: splits dependent on can_create_pseudo_p Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: ice-on-valid-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org CC: jcmvbkbc at gcc dot gnu.org Target Milestone: --- Target: xtensa-*-* The late-combine pass is disabled by default for xtensa: /* One of the late-combine passes runs after register allocation and can match define_insn_and_splits that were previously used only before register allocation. Some of those define_insn_and_splits require the split to take place, but have a split condition of can_create_pseudo_p, and so matching after RA will give an unsplittable instruction. Disable late-combine by default until the define_insn_and_splits are fixed. */ if (!OPTION_SET_P (flag_late_combine_instructions)) flag_late_combine_instructions = 0; For example, compiling gcc.c-torture/compile/bx.c with -Os -flate-combine gives: (insn 53 21 27 4 (set (reg/i:SI 2 a2) (and:SI (not:SI (reg:SI 8 a8 [orig:44 _10 ] [44])) (reg:SI 2 a2 [55]))) "/home/ricsan01/gcc/git/gcc/gcc/testsuite/gcc.c-torture/compile/bx.c":5:1 36 {*andsi3_bitcmpl} (expr_list:REG_DEAD (reg:SI 8 a8 [orig:44 _10 ] [44]) (nil))) during RTL pass: final .../gcc.c-torture/compile/bx.c:5:1: internal compiler error: in final_scan_insn_1, at final.cc:2807 0xe331b7 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*) .././src/gcc/rtl-error.cc:108 0x9e5667 final_scan_insn_1 .././src/gcc/final.cc:2807 0x9e5877 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*) .././src/gcc/final.cc:2886 0x9e659f final_1 .././src/gcc/final.cc:1977 0x9e689f rest_of_handle_final .././src/gcc/final.cc:4239 0x9e689f execute .././src/gcc/final.cc:4317 The associated define_insn_and_split is: (define_insn_and_split "*andsi3_bitcmpl" [(set (match_operand:SI 0 "register_operand" "=a") (and:SI (not:SI (match_operand:SI 1 "register_operand" "r")) (match_operand:SI 2 "register_operand" "r")))] "" "#" "&& can_create_pseudo_p ()" [(set (match_dup 3) (and:SI (match_dup 1) (match_dup 2))) (set (match_dup 0) (xor:SI (match_dup 3) (match_dup 2)))] { operands[3] = gen_reg_rtx (SImode); } [(set_attr "type" "arith") (set_attr "mode" "SI") (set_attr "length" "6")]) The define_insn can be matched before or after register allocation, but the "&& can_create_pseudo_p ()" condition means that the split can only happen before RA. And the "#" means that the split must happen: there is no assembly fallback. Matching the define_insn after register allocation therefore leads to an ICE. See: https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655446.html for more discussion.
[Bug target/115612] New: powerpc: define_insn_and_splits calling gen_reg_rtx unconditionally
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115612 Bug ID: 115612 Summary: powerpc: define_insn_and_splits calling gen_reg_rtx unconditionally Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: ice-on-valid-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org CC: dje at gcc dot gnu.org, linkw at gcc dot gnu.org, segher at gcc dot gnu.org Target Milestone: --- Target: powerpc*-*-* The late-combine pass is disabled by default for rs6000: /* One of the late-combine passes runs after register allocation and can match define_insn_and_splits that were previously used only before register allocation. Some of those define_insn_and_splits use gen_reg_rtx unconditionally. Disable late-combine by default until the define_insn_and_splits are fixed. */ if (!OPTION_SET_P (flag_late_combine_instructions)) flag_late_combine_instructions = 0; For example, compiling gcc.c-torture/compile/20021001-1.c with -Os -flate-combine-instructions results in: 0x9c34cb gen_reg_rtx(machine_mode) .././src/gcc/emit-rtl.cc:1177 0x1b248a7 gen_split_452(rtx_insn*, rtx_def**) .././src/gcc/config/rs6000/rs6000.md:13257 0x1c1c783 split_17 .././src/gcc/config/rs6000/rs6000.md:13254 0x1c1c783 split_insns(rtx_def*, rtx_insn*) .././src/gcc/config/rs6000/rs6000.md:12707 0x9c8847 try_split(rtx_def*, rtx_insn*, int) .././src/gcc/emit-rtl.cc:3941 0xe58963 split_insn .././src/gcc/recog.cc:3409 0xe5e3ef split_all_insns() .././src/gcc/recog.cc:3513 0xe5e5cb execute .././src/gcc/recog.cc:4482 due to: (define_insn_and_split "*_cc" [(set (match_operand:GPR 0 "gpc_reg_operand" "=r") (fp_rev:GPR (match_operand:CCFP 1 "cc_reg_operand" "y") (const_int 0)))] "!flag_finite_math_only" "#" "&& 1" [(pc)] { rtx_code revcode = reverse_condition_maybe_unordered (); rtx eq = gen_rtx_fmt_ee (revcode, mode, operands[1], const0_rtx); rtx tmp = gen_reg_rtx (mode); emit_move_insn (tmp, eq); emit_insn (gen_xor3 (operands[0], tmp, const1_rtx)); DONE; } [(set_attr "length" "12")]) The instruction can be matched before or after RA, but the split only works before RA. It looked from a quick scan like there were a few other instances of this (although the vast majority of define_insn_and_splits are written to work both before and after RA).
[Bug target/115610] New: -flate-combine disabled by default for x86 port
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115610 Bug ID: 115610 Summary: -flate-combine disabled by default for x86 port Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org CC: crazylht at gmail dot com, hubicka at gcc dot gnu.org, ubizjak at gmail dot com Target Milestone: --- Target: i?86-*-* x86_64-*-* The late-combine pass is disabled by default for x86: /* Late combine tends to undo some of the effects of STV and RPAD, by combining instructions back to their original form. */ if (!OPTION_SET_P (flag_late_combine_instructions)) flag_late_combine_instructions = 0; To give more details, from an earlier version of the pass: For example, gcc.target/i386/minmax-6.c tests whether the code compiles without any spilling. The RTL created by STV contains: (insn 33 31 3 2 (set (subreg:V4SI (reg:SI 120) 0) (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 116)) (const_vector:V4SI [ (const_int 0 [0]) repeated x4 ]) (const_int 1 [0x1]))) -1 (nil)) (insn 3 33 34 2 (set (subreg:V4SI (reg:SI 118) 0) (subreg:V4SI (reg:SI 120) 0)) {movv4si_internal} (expr_list:REG_DEAD (reg:SI 120) (nil))) (insn 34 3 32 2 (set (reg/v:SI 108 [ y ]) (reg:SI 118)) -1 (nil)) and it's crucial for the test that reg 108 is kept, rather than propagated into uses. As things stand, 118 can be allocated a vector register and 108 a scalar register. If 108 is propagated, there will be scalar and vector uses of 118, and so it will be spilled to memory. and it's crucial for the test that reg 108 is kept, rather than propagated into uses. As things stand, 118 can be allocated a vector register and 108 a scalar register. If 108 is propagated, there will be scalar and vector uses of 118, and so it will be spilled to memory. That one could be solved by running STV2 later. But RPAD is a bigger problem. In gcc.target/i386/pr87007-5.c, RPAD converts: (insn 27 26 28 6 (set (reg:DF 100 [ _15 ]) (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2") {*sqrtdf2_sse} (nil)) into: (insn 45 26 44 6 (set (reg:V4SF 108) (const_vector:V4SF [ (const_double:SF 0.0 [0x0.0p+0]) repeated x4 ])) -1 (nil)) (insn 44 45 27 6 (set (reg:V2DF 109) (vec_merge:V2DF (vec_duplicate:V2DF (sqrt:DF (mem/c:DF (symbol_ref:DI (\ "d2") (subreg:V2DF (reg:V4SF 108) 0) (const_int 1 [0x1]))) -1 (nil)) (insn 27 44 28 6 (set (reg:DF 100 [ _15 ]) (subreg:DF (reg:V2DF 109) 0)) {*movdf_internal} (nil)) But both the pre-RA and post-RA passes are able to combine these instructions back to the original form.
[Bug rtl-optimization/114515] [15 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 Richard Sandiford changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #14 from Richard Sandiford --- Fixed.
[Bug rtl-optimization/106594] [13/14/15 Regression] sign-extensions no longer merged into addressing mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106594 Richard Sandiford changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #30 from Richard Sandiford --- Fixed on trunk.
[Bug rtl-optimization/114996] [15 Regression] [RISC-V] 2->2 combination no longer occurring
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114996 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #6 from Richard Sandiford --- FWIW, late-combine also fixes this. I'm in the process of getting the submission ready (still going through multi-target testing).
[Bug target/115518] New: aarch64: Poor codegen for arm_neon_sve_bridge.h
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115518 Bug ID: 115518 Summary: aarch64: Poor codegen for arm_neon_sve_bridge.h Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: aarch64-sve Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* With PR115464 fixed, the following testcase: #include #include #include svuint16_t convolve4_4_x (uint16x8x2_t permute_tbl, svuint16_t a) { return svset_neonq_u16 (a, permute_tbl.val[1]); } generates: mov v0.16b, v1.16b ptrue p3.h, vl8 sel z0.h, p3, z0.h, z2.h ret The move is redundant: we should be able to use z1.h as input to the sel instead.
[Bug target/115464] [14 Backport] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464 --- Comment #11 from Richard Sandiford --- Yeah, like I mentioned in the commit message, I'm in the process of rolling this fix out to more places. Was just testing the waters with the minimal fix for comment 4. But yeah, maybe more of it will need to be backported than I'd hoped.
[Bug target/115464] [14 Backport] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464 Richard Sandiford changed: What|Removed |Added Known to work||15.0 Known to fail||14.1.0 Summary|ICE when building libaom on |[14 Backport] ICE when |arm64 (neon sve bridge |building libaom on arm64 |usage with tbl/perm)|(neon sve bridge usage with ||tbl/perm) --- Comment #9 from Richard Sandiford --- Fixed on trunk. Will backport to GCC 14 if there is no fallout.