[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-10-02 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #10 from Richard Sandiford  ---
I have a proof-of-concept hack (far from submission quality).  But it looks
like some cases will also require us to extend aarch64_evpc_reencode to handle
SVE modes, which is also worthwhile for its own sake.  Have a hack for that
too.

[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-10-02 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #9 from Richard Sandiford  ---
Mine.  Lets see if I can remember how this “vectoriser” thing works…

[Bug target/116629] [14/15 Regression] Building openvino with -flto ICEs in aarch64_sve::gimple_folder::redirect_pred_x

2024-10-01 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116629

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #13 from Richard Sandiford  ---
Mine then.

[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-09-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #7 from Richard Sandiford  ---
...actually, they probably don't need to bijective.  I suppose
[0, 0] for two-lane SLP is handled too.

[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-09-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #6 from Richard Sandiford  ---
Sorry for the slow response (here and in general).  Been having to
spend my time on other things recently :(

I agree that this case is regular enough to handle for VLA, but it
seems to me like a separate case from the one that the current "repeating"
code handles.  The current repeating code is really for bijective
1-vector-to-1-vector permutes, rather than for packing operations.

[Bug rtl-optimization/116326] [lra] internal compiler error: in get_reload_reg, at lra-constraints.cc:755

2024-08-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116326

--- Comment #8 from Richard Sandiford  ---
(In reply to Georg-Johann Lay from comment #7)
> What about the following line in reload1.h:
> 
> // Used during roload -> LRA transition because ELIMINABLE_REGS may depend
> // on command line options.  Used in avr.h for example.
> #define IN_RELOAD1_CC
> 
> and the use #ifdef IN_RELOAD1_CC in avr.h?
> 
> It's still a permanent change, but a no-op (except for avr.h) because the
> macro is unused currently.
Ah, yeah, and it keeps the change local to the code that is slated to
disappear.  That sounds good to me FWIW.

[Bug rtl-optimization/116326] [lra] internal compiler error: in get_reload_reg, at lra-constraints.cc:755

2024-08-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116326

--- Comment #6 from Richard Sandiford  ---
(In reply to Georg-Johann Lay from comment #5)
> > But either way, I think we should start with the assumption that the entry
> > should be removed and make everything else work to that.  Unfortunately that
> > probably excludes reload though, if the original problem that motivated the
> > entry was there.
> 
> With the curret state of affairs, switching avr to LRA is not a good idea;
> you can't even build neither libgcc nor libc.
Sure, but I meant: remove it locally while testing/working on LRA support.

> Couldn't the elimination code in IRA / LRA just skip the high part? Or let
> reload handle multi-reg FPs correctly, even though reload will come to an
> end?
> 
> Or maybe make eliminable regs an array that can be accessed in the backend,
> so that avr.cc could adjust the entries in adjust_reg_alloc_order or
> init_expanders to something like:
> 
> eliminables[3] = eliminables[2]
> 
> having a duplicate entry shouldn't hurt?
I don't think we should make any permanent changes to support this kind of
manipulation, since it's only needed during the transition.

[Bug rtl-optimization/116326] [lra] internal compiler error: in get_reload_reg, at lra-constraints.cc:755

2024-08-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116326

--- Comment #4 from Richard Sandiford  ---
(In reply to Georg-Johann Lay from comment #3)
> It was due to problems with multi-reg frame-pointer. (AFAIR, using a
> hard-frame-poiner besides frame-poiner didn't resolve the issues.)
> 
> My problem now is that avr.h has:
> 
> #define ELIMINABLE_REGS { \
> { ARG_POINTER_REGNUM, STACK_POINTER_REGNUM },   \
> { ARG_POINTER_REGNUM, FRAME_POINTER_REGNUM },   \
> { FRAME_POINTER_REGNUM, STACK_POINTER_REGNUM }, \
> { FRAME_POINTER_REGNUM + 1, STACK_POINTER_REGNUM + 1 } }
> 
> but ira.cc uses that to initialize a static const array, hence there's no
> way to depend that on options I guess?
Yeah.  And I don't think it's worth adding a way to make it depend on an
option, since it will only be a temporary state until LRA is fully supported. 
Could you "just" remove the final entry locally while testing -mlra?

> One solution would be to skip upper parts of multi-part pointers in
> elimination. (And maybe other places: TARGET_HARD_REGNO_SCRATCH_OK and
> HARD_REGNO_RENAME_OK assume special handling for the high part of multi-reg
> FP, too).
In backend code, or in middle-end code?

But either way, I think we should start with the assumption that the entry
should be removed and make everything else work to that.  Unfortunately that
probably excludes reload though, if the original problem that motivated the
entry was there.

[Bug rtl-optimization/116326] [lra] internal compiler error: in get_reload_reg, at lra-constraints.cc:755

2024-08-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116326

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #2 from Richard Sandiford  ---
This is caused by the final entry in ELIMINABLE_REGS:

  { FRAME_POINTER_REGNUM + 1, STACK_POINTER_REGNUM + 1 }

I guess this was added to work around a reload bug involving multi-register
pointers, but it shouldn't be necessary for LRA (and as this PR shows, it's
actively harmful for LRA).

LRA treats each entry in ELIMINABLE_REGS as a pointer of mode Pmode.  This
means that (as far as LRA is concerned) the final entry above occupies
FRAME_POINTER_REGNUM + 1 and FRAME_POINTER_REGNUM + 2.  But
FRAME_POINTER_REGNUM + 2 is the Z register, so this takes the Z register out of
action, leading to a reload loop.

I'm sure there are lingering bugs around multi-register pointers :)  But I
think we should remove the entry above and try to fix the bugs at source.

[Bug rtl-optimization/116516] [15 Regression] [lra] ICE in decompose_normal_address, at rtlanal.cc:6712 by r15-3213-g708ee71808ea61

2024-08-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116516

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Richard Sandiford  ---
Hopefully fixed.  Thanks for the report.

[Bug tree-optimization/116481] [12/13/14/15 Regression] `arrays of functions are not meaningful` error message happens with -W -Wall -O2 even though there are no arrays of function types used

2024-08-28 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116481

--- Comment #10 from Richard Sandiford  ---
(In reply to Richard Sandiford from comment #9)
> If the tag bit is dropped by going
> out of bounds, that's a feature, not a bug, and would happen equally for
> void*/char* arithmetic as for (u)intptr_t arithmetic.
“out of bounds” here meaning “out of the representable bounds” of a capability
(as distinct from “out of the object's bounds”, which is checked at
dereference).

[Bug tree-optimization/116481] [12/13/14/15 Regression] `arrays of functions are not meaningful` error message happens with -W -Wall -O2 even though there are no arrays of function types used

2024-08-28 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116481

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #9 from Richard Sandiford  ---
(In reply to Bruno Haible from comment #7)
> Thanks for the attempted advice. But, no, converting a pointer to uintptr_t
> and later back to a pointer is bad for portability: It does not work on the
> CHERI architecture and its implementation on the ARM Morello chip
> https://developer.arm.com/Architectures/Morello, because it loses the tag
> bits that are essential for accessing the pointer. See
> https://www.gnu.org/software/gnulib/manual/html_node/C99-features-avoided.
> html and https://lists.gnu.org/archive/html/bug-gnulib/2023-12/msg00021.html
> .
I think Richard's point still stands though.  Conversion to and from
(u)intptr_t does not in itself drop the tag bit.  CHERI/Morello (u)intptr_t can
hold exactly what a void* can hold.  If the tag bit is dropped by going out of
bounds, that's a feature, not a bug, and would happen equally for void*/char*
arithmetic as for (u)intptr_t arithmetic.

[Bug rtl-optimization/116516] [15 Regression] [lra] ICE in decompose_normal_address, at rtlanal.cc:6712 by r15-3213-g708ee71808ea61

2024-08-28 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116516

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #3 from Richard Sandiford  ---
Gah, mine then.

[Bug fortran/116254] new test case gfortran.dg/class_transformational_2.f90 from r15-2739-g4cb07a38233aad fails

2024-08-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116254

Richard Sandiford  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2024-08-21

--- Comment #10 from Richard Sandiford  ---
A bit more info: valgrind succeeds for -O0.  But with optimisation enabled (-O
is enough), it flags:

==12989== Conditional jump or move depends on uninitialised value(s)
==12989==at 0x49B47A0: _gfortran_spread (spread_generic.c:278)
==12989==by 0x401227: check_spread (class_transformational_2.f90:56)
==12989==by 0x401227: MAIN__ (class_transformational_2.f90:20)
==12989==by 0x4035C3: main (class_transformational_2.f90:24)

It seems that the _data._desc field of the spread results are being copied from
uninitialised memory.  .gimple has:

__attribute__((fn spec (". ")))
void check_spread ()
{
  …
  {
…
struct array02_character(kind=1) atmp.98;
struct __class_MAIN___T_2_0a ctmp.99;
…
try
  {
…
ctmp.99 = b;
ctmp.99._data = atmp.98;
ctmp.99._data.span = D.3687;
ctmp.99._data.data = 0B;
ctmp.99._data.offset = 0;
_gfortran_spread (&ctmp.99._data, D.3682, D.3684, D.3686);

where nothing has initialised atmp.98 before the copy.

[Bug fortran/116254] new test case gfortran.dg/class_transformational_2.f90 from r15-2739-g4cb07a38233aad fails

2024-08-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116254

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #9 from Richard Sandiford  ---
FWIW, the failure seems intermittent for me.  If I run the test individually
multiple times for the same toolchain, it sometimes passes, and sometimes fails
with different options.  This is on aarch64 Ubuntu 22.04 (so the same as
mentioned in #c8).

[Bug rtl-optimization/116238] [12/13/14 Regression] ICE building 526.blender_r on aarch64 SVE after r15-1619-g3b9b8d6cfdf593

2024-08-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116238

Richard Sandiford  changed:

   What|Removed |Added

  Known to fail|15.0|14.2.1
Summary|[12/13/14/15 Regression]|[12/13/14 Regression] ICE
   |ICE building 526.blender_r  |building 526.blender_r on
   |on aarch64 SVE after|aarch64 SVE after
   |r15-1619-g3b9b8d6cfdf593|r15-1619-g3b9b8d6cfdf593
  Known to work||15.0

--- Comment #8 from Richard Sandiford  ---
Fixed on trunk.  Will backport after a while if there is no fallout.

[Bug target/116413] [LRA] [M68K] ICE: unrecognized insn in lra_set_insn_recog_data, at lra.cc:1036

2024-08-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116413

--- Comment #17 from Richard Sandiford  ---
Hmm, but if the ICE is coming from vregs then it doesn't sound like it's
related to LRA.  vregs is the first RTL pass to run.

[Bug target/116413] [LRA] [M68K] ICE: unrecognized insn in lra_set_insn_recog_data, at lra.cc:1036

2024-08-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116413

--- Comment #14 from Richard Sandiford  ---
Created attachment 58967
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58967&action=edit
Patch for the decompse_mem_address ICE

Thanks.  Can you try the attached patch?

[Bug target/116413] [LRA] [M68K] ICE: unrecognized insn in lra_set_insn_recog_data, at lra.cc:1036

2024-08-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116413

--- Comment #11 from Richard Sandiford  ---
I can't reproduce this with m68k-elf.  Do you have any local changes beyond
making TARGET_LRA_P return true?

[Bug rtl-optimization/116321] [lra][avr] internal compiler error: in avr_out_lpm_no_lpmx, at config/avr/avr.cc:4572

2024-08-20 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116321

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #5 from Richard Sandiford  ---
Created attachment 58964
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58964&action=edit
Proof of concept patch

The problem seems to be a bad interaction between spilling and register
elimination.  I think it would only occur on targets that use a real target
register as FRAME_POINTER_REGNUM, rather than having a separate
FRAME_POINTER_REGNUM and HARD_FRAME_POINTER_REGNUM.

IRA sees:

(insn 47 46 48 2 (set (reg:QI 24 r24 [+6 ])
(mem:QI (reg/f:HI 58) [1 *p_2(D)+6 S1 A8 AS1])) "pr116321.c":6:1 86
{movqi_insn_split}
 (expr_list:REG_DEAD (reg/f:HI 58)
(nil)))
(insn 48 47 49 2 (set (reg:QI 25 r25 [+7 ])
(mem:QI (reg/f:HI 60) [1 *p_2(D)+7 S1 A8 AS1])) "pr116321.c":6:1 86
{movqi_insn_split}
 (expr_list:REG_DEAD (reg/f:HI 60)
(nil)))

where both instructions need an r30 base.  As it happens, IRA assigns r30 to
r60 .  However, when it comes to reloading earlier insns, we have to use r30
for the base even though r60 is live, so we need to spill r60:

** Assignment #1: **

 Assigning to 91 (cl=POINTER_Z_REGS, orig=58, freq=2000, tfirst=91,
tfreq=2000)...
 Trying 30: spill 60(freq=2000)  Now best 30(cost=0, bad_spills=0,
insn_pseudos=0)

 Trying 31: spill 60(freq=2000)
  Spill r60(hr=30, freq=2000) for r91
   Assign 30 to reload r91 (freq=2000)
 Assigning to 85 (cl=LD_REGS, orig=50, freq=3000, tfirst=85,
tfreq=3000)...
   Assign 30 to reload r85 (freq=3000)
 Assigning to 86 (cl=LD_REGS, orig=52, freq=3000, tfirst=86,
tfreq=3000)...
   Assign 30 to reload r86 (freq=3000)
 Assigning to 87 (cl=LD_REGS, orig=54, freq=3000, tfirst=87,
tfreq=3000)...
   Assign 30 to reload r87 (freq=3000)
 Assigning to 88 (cl=LD_REGS, orig=56, freq=3000, tfirst=88,
tfreq=3000)...
   Assign 30 to reload r88 (freq=3000)
 Assigning to 89 (cl=LD_REGS, orig=58, freq=3000, tfirst=89,
tfreq=3000)...
   Assign 26 to reload r89 (freq=3000)
 Assigning to 90 (cl=LD_REGS, orig=77, freq=3000, tfirst=90,
tfreq=3000)...
   Assign 24 to reload r90 (freq=3000)
  Reassigning non-reload pseudos
   Assign 28 to r60 (freq=2000)

where r28 is the eliminated FRAME_POINTER_REGNUM.  So far so good (AFAICT). 
But when trying to reload insn 48 after the spill, we use get_reg_class to work
out what class the base register (r60) has.  This is structured as:

  if (! HARD_REGISTER_NUM_P (hard_regno = regno))
hard_regno = lra_get_regno_hard_regno (regno);
  if (hard_regno >= 0)
{
  hard_regno = lra_get_elimination_hard_regno (hard_regno);
  return REGNO_REG_CLASS (hard_regno);
}

So we replace r60 with r28 thanks to the assignment, but then apply
eliminations to r28 to get r32 (the stack pointer).  Things go downhill from
there.

This patch is a naive proof-of-concept fix, on the basis that eliminations
should only apply to registers that aren't subject to allocation.  But I
imagine something somewhere else will break.

[Bug rtl-optimization/116238] [15 Regression] ICE building 526.blender_r on aarch64 SVE after r15-1619-g3b9b8d6cfdf593

2024-08-20 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116238

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #5 from Richard Sandiford  ---
Yeah, seems to be a latent bug in aarch64_hard_regno_caller_save_mode.  A
brute-force reproducer is:

void foo();
typedef unsigned char v2qi __attribute__((vector_size(2)));
void f(v2qi *ptr)
{
  v2qi x = *ptr;
  asm volatile ("" :: "w" (x));
  asm volatile ("" ::: "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15");
  foo();
  asm volatile ("" :: "w" (x));
  *ptr = x;
}

[Bug target/116413] [LRA] [M68K] ICE: unrecognized insn in lra_set_insn_recog_data, at lra.cc:1036

2024-08-20 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116413

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #9 from Richard Sandiford  ---
(In reply to Michael Matz from comment #8)
> Yeah, reload doesn't expect ASHIFTs within operands here.  The below would
> fix
> that:
> 
> diff --git a/gcc/final.cc b/gcc/final.cc
> index eb9e065d9f0..5d911586de5 100644
> --- a/gcc/final.cc
> +++ b/gcc/final.cc
> @@ -3146,6 +3146,7 @@ walk_alter_subreg (rtx *xp, bool *changed)
>  case PLUS:
>  case MULT:
>  case AND:
> +case ASHIFT:
>XEXP (x, 0) = walk_alter_subreg (&XEXP (x, 0), changed);
>XEXP (x, 1) = walk_alter_subreg (&XEXP (x, 1), changed);
>break;
LGTM FWIW.

The different canonicalisation rules around ASHIFT/MULT in memories vs outside
memories is an unfortunate wart, but what LRA is doing for lea is the correct
behaviour.

[Bug target/115683] [15 Regression] SSE2 regressions after obselete of vcond{,u,eq}.

2024-08-20 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115683

--- Comment #8 from Richard Sandiford  ---
(In reply to Uroš Bizjak from comment #7)
> Richi, maybe tree optimizers can perform their optimizations with
> vec_cmp{,u} and vcond_mask, and at the end provide the true coditional
> vector move (that calls "vcond{,u}") as a compound operation of these two
> operations?
Yeah, if a target can only handle a comparison by inverting the result, then
IMO the long-term outcome should be that the target doesn't provide that
comparison, and that target-independent code instead does the obvious
adjustment.  At the moment (or at least, before the removal of the optabs),
every target did this internally, and the true picture was hidden from gimple.

[Bug target/114603] aarch64: Invalid SVE cnot optimisation

2024-08-16 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114603

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED
  Known to work||12.4.1, 13.3.1, 15.0

--- Comment #6 from Richard Sandiford  ---
Fixed on all active branches.

[Bug target/114607] aarch64: Incorrect expansion of svsudot

2024-08-16 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607

Richard Sandiford  changed:

   What|Removed |Added

  Known to work||12.4.1, 13.3.1, 14.1.0,
   ||15.0
 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Richard Sandiford  ---
Fixed on all active branches.

[Bug target/115464] [14 Backport] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)

2024-08-15 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #18 from Richard Sandiford  ---
The patches I pushed should be enough to fix the PR on GCC 14.  (Oops, I forgot
that I said I would wait until Monday for comments. :( )

[Bug target/113939] Switch m68k to LRA

2024-08-15 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113939
Bug 113939 depends on bug 116236, which changed state.

Bug 116236 Summary: [LRA] [M68K] ICE insn does not satisfy its constraints
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints

2024-08-15 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #24 from Richard Sandiford  ---
Hopefully fixed, but please let me know if this code causes similar problems
elsewhere.

[Bug target/116343] [15 regression] ICE on valid code at -Os with "-fschedule-insns -fno-thread-jumps -fno-dce" on x86_64-linux-gnu: in extract_insn, at recog.cc:2869

2024-08-15 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116343

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Richard Sandiford  ---
Fixed.  Thanks for the report.

[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints

2024-08-15 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236

--- Comment #22 from Richard Sandiford  ---
(In reply to Michael Matz from comment #21)
> (In reply to Richard Sandiford from comment #17)
> > > But if LRA needs to be extended for correctness, then, ... meh.
> > But this is how it's always worked.  The corresponding reload code is in
> > find_reloads_address_1, which similarly tries to tear apart an address, work
> > out whether something is a base or an index, and use that to calculate the
> > appropriate register class.  ISTM this PR is just about an inconsistency
> > between the base/index identification in LRA vs reload.
> > 
> > The reg_renumber indirection performed by GO_IF_LEGITIMATE_ADDRESS for
> > reload only handled the initial register allocation.  It didn't help for
> > pseudos that were spilled, or were allocated to the wrong class.  The
> > BASE_REG_CLASS/INDEX_REG_CLASS mechanism was always required to work for
> > correctness, not just optimisation.
> 
> Yes, and I can't say that I ever liked it very much in reload either (for
> basically the same reasons I'm whining now here) :-)  Perhaps I have just
> hoped
> LRA would come up with some other solution.
Yeah, I definitely agree that a different solution would nicer.  Part of the
problem with the current approach is obvious from the way we had
BASE_REG_CLASS, then (when that wasn't enough) MODE_BASE_REG_CLASS, then
MODE_BASE_REG_REG_CLASS, and finally(?) MODE_CODE_BASE_REG_CLASS. :)

So I agree that it would be better for the target to say what parts of an
address need reloading, and how, without any of the current guesswork.

[Bug target/116371] The SME2 svpext intrinsics are missing a _lane suffix

2024-08-15 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116371

--- Comment #2 from Richard Sandiford  ---
Fixed on trunk.  I'll wait a bit before backporting.

[Bug target/116371] The SME2 svpext intrinsics are missing a _lane suffix

2024-08-14 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116371

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2024-08-14
 Status|UNCONFIRMED |ASSIGNED

[Bug target/116371] New: The SME2 svpext intrinsics are missing a _lane suffix

2024-08-14 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116371

Bug ID: 116371
   Summary: The SME2 svpext intrinsics are missing a _lane suffix
   Product: gcc
   Version: 14.1.0
Status: UNCONFIRMED
  Keywords: rejects-valid
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

The intrinsics for PEXT are supposed to be called svpext_lane_cN and
svpext_lane_cN_x2, but GCC is missing the _lane suffix.

[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints

2024-08-13 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236

--- Comment #19 from Richard Sandiford  ---
Of course, immediately after posting I realise it should be address_mode
instead of pointer_mode, but that shouldn't affect m68k.

[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints

2024-08-13 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236

--- Comment #18 from Richard Sandiford  ---
Created attachment 58927
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58927&action=edit
Candidate patch

Could you try the attached patch?  It seems to fix the reduced testcase for me.

[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints

2024-08-13 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #17 from Richard Sandiford  ---
(In reply to Michael Matz from comment #16)
> (In reply to Richard Sandiford from comment #15)
> > > Yes, I considered adding this handling of (zero_extend Rx) to LRA.  I'm
> > > confident
> > > it would fix this instance of the problem, if done correctly (it 
> > > essentially
> > > already has code to deal with shortening subregs, which is somewhat 
> > > similar
> > > in structure).
> > It should already be present, via the decompose_*_address mechanism.
> 
> As get_index_term doesn't accept any _EXTEND as index outer code I don't see
> how (currently).  It basically gets into the tie-breaker of leaving the
> operands
> after the loop in decompose_normal_address for baseness() to decide which of
> course gets it "wrong" on m68k (see my comment #5 where I checked :) ) ...
> 
> > FWIW, aarch64 also supports (mem (plus (zero_extend R1) R2)) and works with
> > LRA, so I don't think there's a fundamental limitation.
> 
> ... but obviously gets it right by luck on aarch64.
Hmm, ok.  I think that's probably my bug then, so... taking.

> > > invalid, do something".  Alternatively (and better) it needs to have a way
> > > to say "this
> > > address, while structurally valid, will need these regsets in those reg
> > > operands", generally (i.e. it should be possible to have targets with 
> > > e.g. 4
> > > register operands, or such).
> > > 
> > > If that's not possible then the design of the LRA-target interface is not
> > > yet complete IMHO.
> > Yeah, the current approach is the latter one (which I agree is better). 
> >  legitimate_address_p answers the question “is this address structurally
> > valid?” while BASE_REG_CLASS and INDEX_REG_CLASS specify the regsets that
> > should be used to reload registers in structurally valid addresses.
> 
> Well, that then rules out targets that allow three registers in addresses.
> Or ones that have more complicated validity rules than "op1 here, op2 there".
> E.g. "when op1 is an even-numbered register, then op2 needs to be
> odd-numbered,
> and vice versa".  Contrived, sure, but ISA might want to save a bit here or
> there in encoding.
> 
> > Like you say, in practice it has to be done by using regsets, since the RA
> > needs to know “what do I need to do to make this valid?”.  It shouldn't have
> > to use trial and error (trying particular hard registers to see if they're
> > valid).
> 
> Agreed.  I just think that BASE/INDEX_REG_CLASS as the only way of
> communicating that from the target to LRA is quite constraining.  It means
> (like
> here) to always having to adjust LRA whenever a new target with other forms
> comes along.  Basically: when the target has to look at the structure of
> an address to check validity (reasonably so), then it irks me that also
> LRA needs to get at the inner structure of those addresses for correctness.
> The latter part, the correctness, is what triggers me.  If it's necessary for
> optimality: sure.  But if LRA needs to be extended for correctness, then, ...
> meh.
But this is how it's always worked.  The corresponding reload code is in
find_reloads_address_1, which similarly tries to tear apart an address, work
out whether something is a base or an index, and use that to calculate the
appropriate register class.  ISTM this PR is just about an inconsistency
between the base/index identification in LRA vs reload.

The reg_renumber indirection performed by GO_IF_LEGITIMATE_ADDRESS for reload
only handled the initial register allocation.  It didn't help for pseudos that
were spilled, or were allocated to the wrong class.  The
BASE_REG_CLASS/INDEX_REG_CLASS mechanism was always required to work for
correctness, not just optimisation.

[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints

2024-08-13 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236

--- Comment #15 from Richard Sandiford  ---
(In reply to Michael Matz from comment #14)
> (In reply to Richard Sandiford from comment #13)
> > (In reply to Michael Matz from comment #12)
> > > That's why I struggle a bit, I lack the bigger picture, how LRA is
> > > envisioned to deal with these situations given how targets are supposed 
> > > (and
> > > still documented)
> > > to use the strict/nonstrict distinction of testing for legal operands.
> > The generic address decomposition logic should realise that, in:
> > 
> >   (plus (zero_extend R1) R2)
> > 
> > R2 must be the base and R1 must be the index.  Therefore, R2 must be
> > constrained to BASE_REG_CLASS (or its variants) and R1 must be constrained
> > to INDEX_REG_CLASS (or its variants).
> 
> Yes, I considered adding this handling of (zero_extend Rx) to LRA.  I'm
> confident
> it would fix this instance of the problem, if done correctly (it essentially
> already has code to deal with shortening subregs, which is somewhat similar
> in structure).
It should already be present, via the decompose_*_address mechanism.

FWIW, aarch64 also supports (mem (plus (zero_extend R1) R2)) and works with
LRA, so I don't think there's a fundamental limitation.

> But then I asked myself "what if another target has still
> different representations of address?": Should LRA be continuously extended
> to deal with these and those representations?  Sure, for optimality that
> might
> be necessary.  But in the grand scheme of things, the target should have the
> opportunity, without any changes to LRA code, to be made completely working
> with whatever representations it's using internally.  It needs to have a way
> to say "this address, if written with the assigned hardregs, will be
> invalid, do something".  Alternatively (and better) it needs to have a way
> to say "this
> address, while structurally valid, will need these regsets in those reg
> operands", generally (i.e. it should be possible to have targets with e.g. 4
> register operands, or such).
> 
> If that's not possible then the design of the LRA-target interface is not
> yet complete IMHO.
Yeah, the current approach is the latter one (which I agree is better). 
 legitimate_address_p answers the question “is this address structurally
valid?” while BASE_REG_CLASS and INDEX_REG_CLASS specify the regsets that
should be used to reload registers in structurally valid addresses.

Like you say, in practice it has to be done by using regsets, since the RA
needs to know “what do I need to do to make this valid?”.  It shouldn't have to
use trial and error (trying particular hard registers to see if they're valid).

> > (FWIW, I think it's correct that LRA never passes strict_p==true.
> > The strict_p distinction should go away once reload does.)
> 
> But how could it?  The various target hooks are called from LRA when the
> pseudos are _not_ yet replaced with their MEMs or their hardregs.  How would
> the target know to do this substitution internally for correctness checking?
> How would it know that one pseudo is equivalent to a MEM, and another to a
> hardreg?  Looking at lra_in_progress as some targets are currently doing (and
> my hack does)?  That doesn't sound like a good design, it's simply a hidden
> parameter for the target hook in a global variable.
> 
> So, how else?  Substituting before the hooks are asked is just busy work,
> not asking the target at all means continuously adding stuff to LRA whenever
> a
> target wants something more.
BASE_REG_CLASS and INDEX_REG_CLASS (and their variants) specify what needs to
happen for an address that doesn't currently use hard registers.  And
legitimate_address_p should test whether addresses that *do* currently use hard
registers are using the right registers.

There should be no need for the hooks to do their own pseudo-to-hard-register
lookup.  The idea that one pseudo register maps to one hard register is very
simplistic and doesn't take region-based allocation or inheritance into
account.  IMO it's a concept that should go away when reload does.

And it doesn't make sense IMO to accept (say) %aN index registers or %dN base
registers at any stage, even before allocation.

[Bug target/116343] [15 regression] ICE on valid code at -Os with "-fschedule-insns -fno-thread-jumps -fno-dce" on x86_64-linux-gnu: in extract_insn, at recog.cc:2869

2024-08-13 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116343

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #5 from Richard Sandiford  ---
Argh, substituting into the REG_EQUAL note is causing INSN_CODE to be reset. 
Testing a patch…

[Bug target/116343] [15 regression] ICE on valid code at -Os with "-fschedule-insns -fno-thread-jumps -fno-dce" on x86_64-linux-gnu: in extract_insn, at recog.cc:2869

2024-08-12 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116343

--- Comment #4 from Richard Sandiford  ---
(In reply to Andrew Pinski from comment #3)
> Note I think late_combine 1 depends on DCE later on to delete the
> noop_move_p instructions but since -fno-dce was passed on the command line,
> it is not deleted and that causes the ICE.
I can look tomorrow, but: if the insn code is set to NOOP_MOVE_INSN_CODE,
possibly_queue_changes is supposed to ensure that perform_pending_updates
deletes it later.

[Bug target/116312] Use LDP instead of LD2 on for Advanced SIMD when possible

2024-08-12 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116312

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #3 from Richard Sandiford  ---
FWIW, see the comment in aarch64_sve_adjust_stmt_cost for some of the problems
with costing LDP and STP correctly:

  /* Advanced SIMD can load and store pairs of registers using LDP and STP,
 but there are no equivalent instructions for SVE.  This means that
 (all other things being equal) 128-bit SVE needs twice as many load
 and store instructions as Advanced SIMD in order to process vector pairs.

 Also, scalar code can often use LDP and STP to access pairs of values,
 so it is too simplistic to say that one SVE load or store replaces
 VF scalar loads and stores.

 Ideally we would account for this in the scalar and Advanced SIMD
 costs by making suitable load/store pairs as cheap as a single
 load/store.  However, that would be a very invasive change and in
 practice it tends to stress other parts of the cost model too much.
 E.g. stores of scalar constants currently count just a store,
 whereas stores of vector constants count a store and a vec_init.
 This is an artificial distinction for AArch64, where stores of
 nonzero scalar constants need the same kind of register invariant
 as vector stores.

 An alternative would be to double the cost of any SVE loads and stores
 that could be paired in Advanced SIMD (and possibly also paired in
 scalar code).  But this tends to stress other parts of the cost model
 in the same way.  It also means that we can fall back to Advanced SIMD
 even if full-loop predication would have been useful.

 Here we go for a more conservative version: double the costs of SVE
 loads and stores if one iteration of the scalar loop processes enough
 elements for it to use a whole number of Advanced SIMD LDP or STP
 instructions.  This makes it very likely that the VF would be 1 for
 Advanced SIMD, and so no epilogue should be needed.  */

[Bug target/116236] [LRA] [M68K] ICE insn does not satisfy its constraints

2024-08-12 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116236

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #13 from Richard Sandiford  ---
(In reply to Michael Matz from comment #12)
> _Here_ we have a case where the target isn't even in the position to reject
> an invalid address (or one that definitely _will_ become invalid), because
> IRA/LRA don't make use of the strict predicates, and even if it were I don't
> see any code dealing with the fall-out of the target saying that
> '(plus (zero_extend(a0)) (d0))' would be invalid but 
> '(plus (zero_extend(d0)) (a0))' would be valid.
> (my hack rectifies this a bit in forcing LRA to just reload the whole
> invalid address to another reg, but it doesn't fix all failures of this mode
> in the testsuite)
> 
> That's why I struggle a bit, I lack the bigger picture, how LRA is
> envisioned to deal with these situations given how targets are supposed (and
> still documented)
> to use the strict/nonstrict distinction of testing for legal operands.
The generic address decomposition logic should realise that, in:

  (plus (zero_extend R1) R2)

R2 must be the base and R1 must be the index.  Therefore, R2 must be
constrained to BASE_REG_CLASS (or its variants) and R1 must be constrained
to INDEX_REG_CLASS (or its variants).  If that isn't happening then I agree
it's an LRA bug.

(FWIW, I think it's correct that LRA never passes strict_p==true.
The strict_p distinction should go away once reload does.)

[Bug other/30920] Incorrect splaying that fails to assure the caching property

2024-08-12 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30920

--- Comment #8 from Richard Sandiford  ---
The patch in comment 7 is just one step.

[Bug target/116145] SVE constant pool loads not hoisted outside loops

2024-08-05 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|FIXED   |---
 Status|RESOLVED|REOPENED

--- Comment #12 from Richard Sandiford  ---
I had to revert the patch due to PR116200.

[Bug rtl-optimization/116200] [15 regression] ICE in stage 2 since r15-2696-gba730fd10934e4

2024-08-05 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116200

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Richard Sandiford  ---
Sorry for the breakage.  I've reverted the patch.

[Bug other/30920] Incorrect splaying that fails to assure the caching property

2024-08-02 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30920

Richard Sandiford  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Last reconfirmed||2024-08-02

--- Comment #6 from Richard Sandiford  ---
When I added splay-tree-utils.h, Jeff asked me to look at converting
typed_splay_utils to use it, so that we didn't have two competing C++
implementations.  I forgot to go back and do that.  I wasn't aware of this PR
though.

So I'll assign myself for the part about moving typed_splay_utils over to
splay-tree-utils.h.  Looks like the main missing piece is foreach, for which
I'll try to use iterators.

[Bug target/116145] SVE constant pool loads not hoisted outside loops

2024-08-02 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #11 from Richard Sandiford  ---
Fixed.

[Bug target/116145] Suboptimal SVE immediate synthesis

2024-07-31 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #7 from Richard Sandiford  ---
I'll test a cleaned-up version of the change in comment 6.  Kyrill, is it OK to
use your godbolt testcase in the testsuite?

[Bug target/116145] Suboptimal SVE immediate synthesis

2024-07-31 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #6 from Richard Sandiford  ---
The problem seems to be that may_trap_or_fault_p returns true for the memory,
which stops loop2_invariant from hoisting the load.  The code there is:

  if (/* MEM_NOTRAP_P only relates to the actual position of the memory
 reference; moving it out of context such as when moving code
 when optimizing, might cause its address to become invalid.  */
  code_changed
  || !MEM_NOTRAP_P (x))
{
  poly_int64 size = MEM_SIZE_KNOWN_P (x) ? MEM_SIZE (x) : -1;
  return rtx_addr_can_trap_p_1 (XEXP (x, 0), 0, size,
GET_MODE (x), code_changed);
}

(where code_changed is true).  Since the SVE load doesn't support the full
LO_SUM range, there is no hint that the address refers to invariant memory.

I'm not sure the comment in the code is correct for MEM_READONLY_P, which only
refers to _statically allocated_ readonly memory.  (At least, the comment
doesn't apply to constant memory created by force_const_mem, which creates
things that cannot trap regardless of context.)  Adding:

  if (MEM_READONLY_P (x) && MEM_NOTRAP_P (x))
return false;

makes the test work for me.

[Bug target/115881] [15 Regression] ICE: RTL check: expected code 'reg', have 'subreg' in rhs_regno, at rtl.h:1934 with -O2 -mx32 -maddress-mode=long

2024-07-31 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115881

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #9 from Richard Sandiford  ---
Fixed.

[Bug tree-optimization/116109] Missed optimisation: unnecessary register dependency on reduction

2024-07-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116109

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #4 from Richard Sandiford  ---
This isn't useful given that it isn't your target, but FWIW, we do take
reduction latencies into account when deciding the vectorisation/unrolling
factor for AArch64.  E.g. the testcase produces the main loop:

.L5:
ldp q25, q24, [x1]
ldp q30, q31, [x1, 32]
add x1, x1, 64
fmlav26.2d, v25.2d, v25.2d
fmlav27.2d, v24.2d, v24.2d
fmlav28.2d, v30.2d, v30.2d
fmlav29.2d, v31.2d, v31.2d
cmp x3, x1
bne .L5

when compiled with -Ofast -mcpu=neoverse-v2.  The decision is made by
aarch64_vector_costs::determine_suggested_unroll_factor, which uses issue rates
and reduction latencies recorded by the main costing code.

[Bug rtl-optimization/116136] [15 Regression] ext-dce exposes latent subreg simplification bug (big-endian) on m68k

2024-07-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116136

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #2 from Richard Sandiford  ---
I was curious how far back this went, but it seems to be present in the
original combine.c commit:

  /* Changing mode twice with SUBREG => just change it once,
 or not at all if changing back to starting mode.  */
  if (GET_CODE (SUBREG_REG (x)) == SUBREG)
   {
 if (mode == GET_MODE (SUBREG_REG (SUBREG_REG (x)))
 && SUBREG_WORD (x) == 0 && SUBREG_WORD (SUBREG_REG (x)) == 0)
   return SUBREG_REG (SUBREG_REG (x));

Kind-of amazed that this went undetected so long, especially since multiword
big-endian values were much more common back in the day.

[Bug tree-optimization/116100] (VEC_COND (UNCOND_EXPR)) -> COND_FN conversion should happen in isel

2024-07-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116100

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #2 from Richard Sandiford  ---
Also, the COND_FNs are generated directly by the vectoriser for things that
cannot be accurately represented any other way (for FP ops).  So I think it's
relatively more important that other passes handle them well.

[Bug target/116044] [15 Regression] GCN vs. rtl-ssa: Avoid using a stale splay tree root [PR116009]

2024-07-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116044

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Richard Sandiford  ---
Fixed.

[Bug middle-end/116058] [15 Regression] sh4-linux-gnu fails to bootstrap, late combine issue

2024-07-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116058

Richard Sandiford  changed:

   What|Removed |Added

 CC|richard.sandiford at arm dot com   |rsandifo at gcc dot 
gnu.org

--- Comment #10 from Richard Sandiford  ---
The patch in comment 8 looks like the correct fix to me FWIW.  It's ok for
trunk from my POV, if Jeff agrees.

[Bug target/116044] [15 Regression] GCN vs. rtl-ssa: Avoid using a stale splay tree root [PR116009]

2024-07-23 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116044

Richard Sandiford  changed:

   What|Removed |Added

   Last reconfirmed||2024-07-23
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED

--- Comment #2 from Richard Sandiford  ---
Gah, this part of the code is causing more than it's fair share of trouble…

[Bug rtl-optimization/116009] [15 regression] ICE when building cython-3.0.10 on arm64 for Python 3.12 (insert_def_after, at rtl-ssa/accesses.cc:622)

2024-07-22 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116009

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #10 from Richard Sandiford  ---
Fixed.

[Bug target/115969] [15 regression] ICE when building clang-16.0.6 on arm64 (output_operand: invalid expression as operand)

2024-07-22 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115969

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from Richard Sandiford  ---
Fixed.  The patch could be backported to release branches if the same error
shows up in other circumstances, but as Andrew says, the problem seems to have
been latent since the port was added.

[Bug rtl-optimization/116009] [15 regression] ICE when building cython-3.0.10 on arm64 for Python 3.12 (insert_def_after, at rtl-ssa/accesses.cc:622)

2024-07-22 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116009

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #6 from Richard Sandiford  ---
Testing a patch.

[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c

2024-07-19 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #10 from Richard Sandiford  ---
Fixed.  Although the problem is latent on branches, I think we should only
consider a backport after a few months, to give plenty of time for fallout to
be detected.

[Bug tree-optimization/115991] [15 Regression] ICE on linux-6.10 in ix86_print_operand_address_as() since r15-1945-g9d20529d94b232

2024-07-19 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115991

--- Comment #9 from Richard Sandiford  ---
(In reply to Andrew Pinski from comment #7)
> (In reply to Sam James from comment #6)
> > The ICE bisects to r15-1945-g9d20529d94b232. The ivopts issue needs to be
> > done still.
> 
> Note this is basically the same issue as PR 115881 except it could be fixed
> earlier on with an IV-OPTs fix I think.
Yeah, I suggest keeping this PR for the ivopts part only and using PR115881 to
track the subreg-of-sp thing.

[Bug target/115969] [15 regression] ICE when building clang-16.0.6 on arm64 (output_operand: invalid expression as operand)

2024-07-18 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115969

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #8 from Richard Sandiford  ---
Yeah, agree that's where the problem is.

[Bug target/115950] Missed SVE fold to INCP

2024-07-18 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115950

--- Comment #5 from Richard Sandiford  ---
Alternatively, I suppose we could say that svcntp_bN(svptrue_bN(), pg) etc.
should be folded to svcntp_bN(pg, pg) in gimple/expand, and then make the
INCP/DECP wrappers only match the equal-predicate version.

[Bug target/115950] Missed SVE fold to INCP

2024-07-18 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115950

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #4 from Richard Sandiford  ---
I'd rather not double up the patterns for this equivalence.  Seems like one of
those cases where being able to define target-specific unspec simplification
rules would help.

[Bug rtl-optimization/115929] [15 regression] ICE on valid code at -O{2,3} with "-fschedule-insns" on x86_64-linux-gnu: Segmentation fault

2024-07-17 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115929

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Richard Sandiford  ---
Fixed.

[Bug rtl-optimization/115928] [15 regression] ICE on valid code at -O2 with "-fgcse-sm" on x86_64-linux-gnu: in merge_clobber_groups, at rtl-ssa/accesses.cc:757

2024-07-17 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115928

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Richard Sandiford  ---
Fixed.

[Bug rtl-optimization/115929] [15 regression] ICE on valid code at -O{2,3} with "-fschedule-insns" on x86_64-linux-gnu: Segmentation fault

2024-07-16 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115929

--- Comment #3 from Richard Sandiford  ---
As it turned out, the two tests exposed different bugs.  I've submitted a patch
for the other one and will close once that's resolved.

[Bug rtl-optimization/115901] [15 regression] ICE when building coreutils-9.5 on arm64 with -O3 -flto -fno-vect-cost-model -ftrivial-auto-var-init=zero

2024-07-16 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115901

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Richard Sandiford  ---
Fixed.

[Bug target/115891] [15 regression] libgcrypt tests segfault in crc32_less_than_16 with LTO with late-combine

2024-07-16 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115891

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Richard Sandiford  ---
Fixed.

[Bug rtl-optimization/115929] [15 regression] ICE on valid code at -O{2,3} with "-fschedule-insns" on x86_64-linux-gnu: Segmentation fault

2024-07-15 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115929

Richard Sandiford  changed:

   What|Removed |Added

   Last reconfirmed||2024-07-15
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

[Bug rtl-optimization/115928] [15 regression] ICE on valid code at -O2 with "-fgcse-sm" on x86_64-linux-gnu: in merge_clobber_groups, at rtl-ssa/accesses.cc:757

2024-07-15 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115928

Richard Sandiford  changed:

   What|Removed |Added

   Last reconfirmed||2024-07-15
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

[Bug rtl-optimization/115901] [15 regression] ICE when building coreutils-9.5 on arm64 with -O3 -flto -fno-vect-cost-model -ftrivial-auto-var-init=zero

2024-07-13 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115901

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #10 from Richard Sandiford  ---
Gah.  I'll restrict r15-1972 to non-paradoxical subregs (unless the replacement
value is constant).

[Bug target/115891] [15 regression] libgcrypt tests segfault in crc32_less_than_16 with LTO with late-combine

2024-07-12 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115891

Richard Sandiford  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2024-07-12
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Status|UNCONFIRMED |ASSIGNED

--- Comment #3 from Richard Sandiford  ---
Comes from:

trying to combine definition of r5 in:
 1058: di:SI=ax:SI
into:
  299: {[di:SI]=asm_operands;clobber ax:SI;clobber flags:CC;}
successfully matched this instruction:
(parallel [
(set (mem:SI (reg:SI 0 ax [520]) [2 *_3+0 S4 A32])
(asm_operands/v:SI ("movd %%xmm0, %%eax
bswapl %%eax
movl %%eax, %0
") ("=m") 0 []
 []
 [] ../cipher/crc-intel-pclmul.c:781))
(clobber (reg:SI 0 ax))
(clobber (reg:CC 17 flags))
])
original cost = 4 + 4 (weighted: 0.090837), replacement cost = 4 (weighted:
0.045419); keeping replacement

which ought to be prevented by the clobber, but for some reason isn't.

[Bug rtl-optimization/115785] [15 regression] ICE when building embree-4.3.1 on amd64

2024-07-12 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115785

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Richard Sandiford  ---
Fixed.  Thanks for the report.

[Bug target/115881] [15 Regression] ICE: RTL check: expected code 'reg', have 'subreg' in rhs_regno, at rtl.h:1934 with -O2 -mx32 -maddress-mode=long

2024-07-12 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115881

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|rsandifo at gcc dot gnu.org|unassigned at gcc dot 
gnu.org
 Status|ASSIGNED|NEW

--- Comment #4 from Richard Sandiford  ---
As expected, the results weren't pretty.  Things like:

(define_split
  [(set (match_operand:SWI48 0 "register_operand")
(and:SWI48 (match_dup 0)
   (const_int -65536)))
   (clobber (reg:CC FLAGS_REG))]
  "(TARGET_FAST_PREFIX && !TARGET_PARTIAL_REG_STALL)
|| optimize_function_for_size_p (cfun)"
  [(set (strict_low_part (match_dup 1)) (const_int 0))]
  "operands[1] = gen_lowpart (HImode, operands[0]);")

expect to be able to generate (strict_low_part (subreg:HI (reg:DI SP) 0))
after reload, so we can't ban it outright.  The fallback was going to be
to make register_operand reject subregs after reload, but patterns like:

(define_insn_and_split "*neg_1_slp"
  [(set (strict_low_part (match_operand:SWI12 0 "register_operand"
"+,&")\
)
(neg:SWI12 (match_operand:SWI12 1 "register_operand" "0,!")))
   (clobber (reg:CC FLAGS_REG))]

use register_operand for cases like the above.  Not really sure what
to do here.

[Bug target/115881] [15 Regression] ICE: RTL check: expected code 'reg', have 'subreg' in rhs_regno, at rtl.h:1934 with -O2 -mx32 -maddress-mode=long

2024-07-11 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115881

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #3 from Richard Sandiford  ---
Hmm, yeah.  This is due to an inconsistency between simplify_subreg_regno
(which rightly refuses to generate (reg:SI sp)) and validate_subreg,
which instead uses subreg_offset_representable_p + some preparatory tests.
Those preparatory tests are similar, but not quite identical to, the first
tests in simplify_subre_regno.  And they don't include the stack and frame
pointer tests.

In principle, it feels like the validate_subreg test should just be:

  /* For hard registers, check that the subreg can be folded to a REG.  */
  if (reg && REG_P (reg) && HARD_REGISTER_P (reg))
return simplify_subreg_regno (REGNO (reg), imode, offset, omode) >= 0;

but I've no idea how much that will break.  Will try it and see.

[Bug rtl-optimization/115782] [15 Regression] ICE on valid code at -O{2,3} with "-fno-guess-branch-probability -fgcse-sm -fno-expensive-optimizations -fno-gcse" on x86_64-linux-gnu: in possibly_queue_

2024-07-11 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115782

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #4 from Richard Sandiford  ---
Fixed.  It also fixes the second example, but I didn't see that in time to
include it in the commit.

[Bug rtl-optimization/115785] ICE when building embree-4.3.1 on amd64 with -flate-combine-instructions

2024-07-10 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115785

--- Comment #2 from Richard Sandiford  ---
The patch in #c1 is just part 1.  Part 2 will fix the bug.

[Bug ipa/114531] Feature proposal for an `-finline-functions-aggressive` compiler option

2024-07-08 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114531

--- Comment #20 from Richard Sandiford  ---
(In reply to Jan Hubicka from comment #18)
> I am trying to understand how useful this is.  I am basically worried
> about two things
>  1) we have other optimization passes that behave differently at -O2 and
> -O3 (vectorizer, unrolling etc.) and I think we may want to have
> more. We also have -Os and -O1.
> 
> So perhaps we want kind of more systmatic solution. We already have
> -fvect-cost-model that is kind of vectorizer version of the proposed
> inliner option.
Yeah, like you say, -fvect-cost-model solves this problem for the vectoriser.
There we have an intermediate setting (cheap) that isn't the default for any
-O.  But the proposal isn't to do something like that for inlining.

>From a quick grep, it looked like there are three places that distinguish
directly between -O2 and -O3:

(1) a couple of expansions in builtins.cc

(2) cunrolli, which uses it to decide whether unrolling can increase size

(3) number_of_iterations_exit_assumptions, where it decides whether outer
evolutions should be taken into account.  (Was surprised by this one.)

If someone does want to control those things separately in future, I think
the defined way of doing that would be to add a -f flag.  So it feels like
this patch is moving in the same direction that we'd expect for the rest.

>  2) inliner is already quite painful to tune. Especially since 
>  one really needs to benchmark packages significantly bigger than
>  SPECs which tends to be bit hard to set up and benchmark
>  meaningfully. I usually do at least Firefox and clang where the
>  first is always quite some work to get working well with latest
>  GCC. We SUSE's LNT we also run "C++ behchmarks" which were
>  initially collected as kind of inliner tests with higher
>  abstraction penalty (tramp3d etc.).
> 
>  For many years I benchmarked primarily -O3 and -O3 + profile
>  feedbcak on x86-64 only with ocassional look at -O2 and -Os
>  behaviour which were generally more stable.
>  I also tested other targets (poer and aarch64) but just
>  sporadically, which is not good.
> 
>  After GCC5 I doubled testing to include both lto/non-lto variant.
>  Since GCC10 -O2 started to envolve and needed re-testing too
>  (lto/nonlto). One metric I know I ought to tune is -O2 -flto and
>  FDO which used to be essentially -O3 before the optimization level
>  --params were introduced, but now -O2 + FDO inlining is more
>  conservative which hurts, for example, profiledbootstrapped GCC.
> 
>  So naturally I am bit worried to introduce even more combinations
>  that needs testing and maintenance.  If we add user friendly way to
>  tweak this, we also make a promise to keep it sane.
Yeah, I agree we don't want to increase the number of supported settings.
My understanding of the new option is that it should, by definition,
do exactly what -O3 does, now and in the future.  So there would be
no extra tuning burden.  Whatever is right for -O3 in future is also
right for the new option.

[Bug rtl-optimization/115785] ICE when building embree-4.3.1 on amd64 with -flate-combine-instructions

2024-07-05 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115785

Richard Sandiford  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2024-07-05

[Bug rtl-optimization/115782] ICE on valid code at -O{2,3} with "-fno-guess-branch-probability -fgcse-sm -fno-expensive-optimizations -fno-gcse" on x86_64-linux-gnu: in possibly_queue_changes, at rtl-

2024-07-05 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115782

Richard Sandiford  changed:

   What|Removed |Added

   Last reconfirmed||2024-07-05
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org

--- Comment #1 from Richard Sandiford  ---
I'll take a look.

[Bug tree-optimization/115629] Inefficient if-convert of masked conditionals

2024-07-02 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115629

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #8 from Richard Sandiford  ---
Perhaps I'm missing the point, but I think one of the issues here is that we
(still) don't model that MASK_LOAD sets inactive elements to zero.  Inactive
elements are currently undefined instead.  (I think Robin mentioned that
assuming zero is problematic for RVV, so we might need an explicit MASK_LOAD
argument for inactive elements, like for COND_ADD etc.)

So quoting the IL in comment 4:

  # loop_mask_63 = PHI 
  vect__4.10_64 = .MASK_LOAD (vectp_a.8_53, 32B, loop_mask_63);
  mask__31.11_66 = vect__4.10_64 != { 0, ... };
  mask__56.12_67 = ~mask__31.11_66;
  vec_mask_and_70 = mask__56.12_67 & loop_mask_63;
  vect__7.15_71 = .MASK_LOAD (vectp_c.13_68, 32B, vec_mask_and_70);
  mask__22.16_73 = vect__7.15_71 == { 0, ... };
  mask__34.17_75 = vec_mask_and_70 & mask__22.16_73;

I think this and...

  vect_iftmp.20_78 = .MASK_LOAD (vectp_d.18_76, 32B, mask__34.17_75);
  vect__61.21_79 = vect__4.10_64 | vect__7.15_71;
  mask__35.22_81 = vect__61.21_79 != { 0, ... };
  vec_mask_and_84 = mask__35.22_81 & loop_mask_63;

...this have to be kept until we model inactive elements.

  vect_iftmp.25_85 = .MASK_LOAD (vectp_b.23_82, 32B, vec_mask_and_84);
  _86 = mask__34.17_75 & loop_mask_63;

This one is really curious though :)  Why does the code think that the loop
mask is needed here?  Does the code think the mask is needed for correctness,
or is the scalar_cond_masked_set optimisation misfiring?

  vect_iftmp.26_87 = VEC_COND_EXPR <_86, vect_iftmp.20_78, vect_iftmp.25_85>;
  .MASK_STORE (vectp_res.27_88, 32B, loop_mask_63, vect_iftmp.26_87);

[Bug rtl-optimization/115677] ICE when building argon2 with -flate-combine-instructions on amd64

2024-06-27 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115677

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Richard Sandiford  ---
Fixed.

[Bug rtl-optimization/115677] ICE when building argon2 with -flate-combine-instructions on amd64

2024-06-27 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115677

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot 
gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2024-06-27
 Status|UNCONFIRMED |ASSIGNED

--- Comment #4 from Richard Sandiford  ---
Guess we need the usual optimize > 0 guard too.

[Bug target/115634] [15 regression] s390 bootstrap failure since r15-1579-g792f97b44ffc5e

2024-06-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115634

--- Comment #3 from Richard Sandiford  ---
Yeah, I agree that sounds like the right fix.  Specifically, I assume
s390_decompose_addrstyle_without_index, when doing:

  if (op && GET_CODE (op) != REG)
return false;

should check whether the register is a pseudo register or is a member of
ADDR_REGS.  The check should probably be limited to reload_completed, so that
the RAs have a chance to reload other registers.

The reason this works during register allocation is that the RAs check
REGNO_OK_FOR_BASE_P and REGNO_OK_FOR_INDEX_P as part of the constraints
process.  But those macros are not checked automatically by address constraints
themselves (perhaps somewhat surprisingly).  It has to be done explicitly in
target code.

I disagree that the predicates & constraints are icky though.  Given the way
that the architecture treats shift amounts as addresses, it seems like a neat
approach.

[Bug ipa/114531] Feature proposal for an `-finline-functions-aggressive` compiler option

2024-06-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114531

--- Comment #17 from Richard Sandiford  ---
I can see that it's useful to ask whether the current -O2 & -O3 inlining
heuristics are making the right trade-off.  But I think that's really a
different issue from the one that is raised in the PR.  (Unless we think that
-O2 and -O3 should always have the same inlining heuristics henceforward, but
that seems unlikely.)

At the moment, -O3 is essentially -O2 + some -f options + some --param options.
 Users who want to pick & chose some of the -f options can do so, and can add
them to stable build systems.  Normally, obsolete -f options are turned into
no-ops rather than removed.  But users can't pick & choose the --params, and
add them to stable build systems, because we reserve the right to remove
--params without warning.

So IMO, we should have an -f option that represents “the inlining parameters
enabled by -O3”, whatever they happen to be for a given release.  It's OK if
the set is empty.

For such a change, it doesn't really matter whether the current --params are
the right ones.  It just matters that the --params are the ones that we
currently use.  If the --params are changed later, the -f option and -O3 will
automatically stay in sync.

[Bug target/115631] [15 Regression] GCN: [-PASS:-]{+FAIL:+} c-c++-common/torture/builtin-arith-overflow-6.c -O2 execution test

2024-06-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115631

--- Comment #4 from Richard Sandiford  ---
Created attachment 58513
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58513&action=edit
A patch for a bug seen on arm*-*-*

Also, could you check whether the attached patch makes any difference?  It
fixes a problem seen on arm*-*-*, and I notice GCN also defines
cannot_copy_insn_p.

[Bug target/115631] [15 Regression] GCN: [-PASS:-]{+FAIL:+} c-c++-common/torture/builtin-arith-overflow-6.c -O2 execution test

2024-06-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115631

--- Comment #3 from Richard Sandiford  ---
I've now pushed a debug counter for late_combine.  Sorry to ask, but could you
bisect on N in -fdbg-cnt=late_combine:N to see which transformation is causing
the problem?

[Bug target/115633] [15 Regression] powerpc64le: "relocation truncated to fit: R_PPC64_TOC16 against `.rodata.cst4'" with (default) '-flate-combine-instructions'

2024-06-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115633

--- Comment #2 from Richard Sandiford  ---
-flate-combine-instructions is supposed to be disabled by default for all
powerpc targets.  Could you look at why that isn't the case for you?

[Bug target/115631] [15 Regression] GCN: [-PASS:-]{+FAIL:+} c-c++-common/torture/builtin-arith-overflow-6.c -O2 execution test

2024-06-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115631

--- Comment #2 from Richard Sandiford  ---
I suppose for issues like this, it would be useful to have a debug counter to
bisect on.  I'll post a patch for doing that today, but I'm afraid I'll be
relying on someone with gcn access to actually do the bisection.

[Bug other/115622] gcc.dg/ipa/iinline-attr.c fails after r15-1579-g792f97b44ffc5e

2024-06-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115622

--- Comment #1 from Richard Sandiford  ---
Are you sure about the bisection?  late-combine only affects RTL, and in any
case is disabled by default for powerpc.

[Bug target/115478] [15 Regression] gcc.target/aarch64/bitint-args.c fails since r15-1120-g2277f987979445

2024-06-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115478

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #5 from Richard Sandiford  ---
How about adding a new match_operator predicate to common.md for this kind of
situation?  It would be nice if it could automatically detect when the two
operands have no nonzero bits in common, but doing that would need some
refactoring of the nonzero_bits code, to ensure that the predicate gives a
consistent result (and does that without polluting the current nonzero_bits
cache).

In the meantime, it might be enough to say that the insn must enforce the
non-overlapping bits check itself.

[Bug target/115613] New: xtensa: splits dependent on can_create_pseudo_p

2024-06-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115613

Bug ID: 115613
   Summary: xtensa: splits dependent on can_create_pseudo_p
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
CC: jcmvbkbc at gcc dot gnu.org
  Target Milestone: ---
Target: xtensa-*-*

The late-combine pass is disabled by default for xtensa:

  /* One of the late-combine passes runs after register allocation
 and can match define_insn_and_splits that were previously used
 only before register allocation.  Some of those define_insn_and_splits
 require the split to take place, but have a split condition of
 can_create_pseudo_p, and so matching after RA will give an
 unsplittable instruction.  Disable late-combine by default until
 the define_insn_and_splits are fixed.  */
  if (!OPTION_SET_P (flag_late_combine_instructions))
flag_late_combine_instructions = 0;

For example, compiling gcc.c-torture/compile/bx.c with -Os -flate-combine
gives:

(insn 53 21 27 4 (set (reg/i:SI 2 a2)
(and:SI (not:SI (reg:SI 8 a8 [orig:44 _10 ] [44]))
(reg:SI 2 a2 [55])))
"/home/ricsan01/gcc/git/gcc/gcc/testsuite/gcc.c-torture/compile/bx.c":5:1 36
{*andsi3_bitcmpl}
 (expr_list:REG_DEAD (reg:SI 8 a8 [orig:44 _10 ] [44])
(nil)))
during RTL pass: final
.../gcc.c-torture/compile/bx.c:5:1: internal compiler error: in
final_scan_insn_1, at final.cc:2807
0xe331b7 _fatal_insn(char const*, rtx_def const*, char const*, int, char
const*)
   .././src/gcc/rtl-error.cc:108
0x9e5667 final_scan_insn_1
   .././src/gcc/final.cc:2807
0x9e5877 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*)
   .././src/gcc/final.cc:2886
0x9e659f final_1
   .././src/gcc/final.cc:1977
0x9e689f rest_of_handle_final
   .././src/gcc/final.cc:4239
0x9e689f execute
   .././src/gcc/final.cc:4317

The associated define_insn_and_split is:

(define_insn_and_split "*andsi3_bitcmpl"
  [(set (match_operand:SI 0 "register_operand" "=a")
(and:SI (not:SI (match_operand:SI 1 "register_operand" "r"))
(match_operand:SI 2 "register_operand" "r")))]
  ""
  "#"
  "&& can_create_pseudo_p ()"
  [(set (match_dup 3)
(and:SI (match_dup 1)
(match_dup 2)))
   (set (match_dup 0)
(xor:SI (match_dup 3)
(match_dup 2)))]
{
  operands[3] = gen_reg_rtx (SImode);
}
  [(set_attr "type" "arith")
   (set_attr "mode" "SI")
   (set_attr "length"   "6")])

The define_insn can be matched before or after register allocation, but the "&&
can_create_pseudo_p ()" condition means that the split can only happen before
RA.  And the "#" means that the split must happen: there is no assembly
fallback.  Matching the define_insn after register allocation therefore leads
to an ICE.

See: https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655446.html for more
discussion.

[Bug target/115612] New: powerpc: define_insn_and_splits calling gen_reg_rtx unconditionally

2024-06-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115612

Bug ID: 115612
   Summary: powerpc: define_insn_and_splits calling gen_reg_rtx
unconditionally
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
CC: dje at gcc dot gnu.org, linkw at gcc dot gnu.org, segher at 
gcc dot gnu.org
  Target Milestone: ---
Target: powerpc*-*-*

The late-combine pass is disabled by default for rs6000:

  /* One of the late-combine passes runs after register allocation
 and can match define_insn_and_splits that were previously used
 only before register allocation.  Some of those define_insn_and_splits
 use gen_reg_rtx unconditionally.  Disable late-combine by default
 until the define_insn_and_splits are fixed.  */
  if (!OPTION_SET_P (flag_late_combine_instructions))
flag_late_combine_instructions = 0;

For example, compiling gcc.c-torture/compile/20021001-1.c with -Os
-flate-combine-instructions results in:

0x9c34cb gen_reg_rtx(machine_mode)
   .././src/gcc/emit-rtl.cc:1177
0x1b248a7 gen_split_452(rtx_insn*, rtx_def**)
   .././src/gcc/config/rs6000/rs6000.md:13257
0x1c1c783 split_17
   .././src/gcc/config/rs6000/rs6000.md:13254
0x1c1c783 split_insns(rtx_def*, rtx_insn*)
   .././src/gcc/config/rs6000/rs6000.md:12707
0x9c8847 try_split(rtx_def*, rtx_insn*, int)
   .././src/gcc/emit-rtl.cc:3941
0xe58963 split_insn
   .././src/gcc/recog.cc:3409
0xe5e3ef split_all_insns()
   .././src/gcc/recog.cc:3513
0xe5e5cb execute
   .././src/gcc/recog.cc:4482

due to:

(define_insn_and_split "*_cc"
  [(set (match_operand:GPR 0 "gpc_reg_operand" "=r")
(fp_rev:GPR (match_operand:CCFP 1 "cc_reg_operand" "y")
(const_int 0)))]
  "!flag_finite_math_only"
  "#"
  "&& 1"
  [(pc)]
{
  rtx_code revcode = reverse_condition_maybe_unordered ();
  rtx eq = gen_rtx_fmt_ee (revcode, mode, operands[1], const0_rtx);
  rtx tmp = gen_reg_rtx (mode);
  emit_move_insn (tmp, eq);
  emit_insn (gen_xor3 (operands[0], tmp, const1_rtx));
  DONE;
}
  [(set_attr "length" "12")])

The instruction can be matched before or after RA, but the split only works
before RA.

It looked from a quick scan like there were a few other instances of this
(although the vast majority of define_insn_and_splits are written to work both
before and after RA).

[Bug target/115610] New: -flate-combine disabled by default for x86 port

2024-06-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115610

Bug ID: 115610
   Summary: -flate-combine disabled by default for x86 port
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: enhancement
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
CC: crazylht at gmail dot com, hubicka at gcc dot gnu.org,
ubizjak at gmail dot com
  Target Milestone: ---
Target: i?86-*-* x86_64-*-*

The late-combine pass is disabled by default for x86:

  /* Late combine tends to undo some of the effects of STV and RPAD,
 by combining instructions back to their original form.  */
  if (!OPTION_SET_P (flag_late_combine_instructions))
flag_late_combine_instructions = 0;

To give more details, from an earlier version of the pass:


For example, gcc.target/i386/minmax-6.c tests whether the code
compiles without any spilling.  The RTL created by STV contains:

(insn 33 31 3 2 (set (subreg:V4SI (reg:SI 120) 0)
(vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 116))
(const_vector:V4SI [
(const_int 0 [0]) repeated x4
])
(const_int 1 [0x1]))) -1
 (nil))
(insn 3 33 34 2 (set (subreg:V4SI (reg:SI 118) 0)
(subreg:V4SI (reg:SI 120) 0)) {movv4si_internal}
 (expr_list:REG_DEAD (reg:SI 120)
(nil)))
(insn 34 3 32 2 (set (reg/v:SI 108 [ y ])
(reg:SI 118)) -1
 (nil))

and it's crucial for the test that reg 108 is kept, rather than
propagated into uses.  As things stand, 118 can be allocated
a vector register and 108 a scalar register.  If 108 is propagated,
there will be scalar and vector uses of 118, and so it will be
spilled to memory.

and it's crucial for the test that reg 108 is kept, rather than
propagated into uses.  As things stand, 118 can be allocated
a vector register and 108 a scalar register.  If 108 is propagated,
there will be scalar and vector uses of 118, and so it will be
spilled to memory.

That one could be solved by running STV2 later.  But RPAD is
a bigger problem.  In gcc.target/i386/pr87007-5.c, RPAD converts:

(insn 27 26 28 6 (set (reg:DF 100 [ _15 ])
(sqrt:DF (mem/c:DF (symbol_ref:DI ("d2") {*sqrtdf2_sse}
 (nil))

into:

(insn 45 26 44 6 (set (reg:V4SF 108)
(const_vector:V4SF [
(const_double:SF 0.0 [0x0.0p+0]) repeated x4
])) -1
 (nil))
(insn 44 45 27 6 (set (reg:V2DF 109)
(vec_merge:V2DF (vec_duplicate:V2DF (sqrt:DF (mem/c:DF (symbol_ref:DI
(\
"d2")
(subreg:V2DF (reg:V4SF 108) 0)
(const_int 1 [0x1]))) -1
 (nil))
(insn 27 44 28 6 (set (reg:DF 100 [ _15 ])
(subreg:DF (reg:V2DF 109) 0)) {*movdf_internal}
 (nil))

But both the pre-RA and post-RA passes are able to combine these
instructions back to the original form.


[Bug rtl-optimization/114515] [15 Regression] Failure to use aarch64 lane forms after PR101523

2024-06-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #14 from Richard Sandiford  ---
Fixed.

[Bug rtl-optimization/106594] [13/14/15 Regression] sign-extensions no longer merged into addressing mode

2024-06-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106594

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #30 from Richard Sandiford  ---
Fixed on trunk.

[Bug rtl-optimization/114996] [15 Regression] [RISC-V] 2->2 combination no longer occurring

2024-06-18 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114996

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #6 from Richard Sandiford  ---
FWIW, late-combine also fixes this.  I'm in the process of getting the
submission ready (still going through multi-target testing).

[Bug target/115518] New: aarch64: Poor codegen for arm_neon_sve_bridge.h

2024-06-17 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115518

Bug ID: 115518
   Summary: aarch64: Poor codegen for arm_neon_sve_bridge.h
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: aarch64-sve
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

With PR115464 fixed, the following testcase:

#include 
#include 
#include 

svuint16_t
convolve4_4_x (uint16x8x2_t permute_tbl, svuint16_t a)
{
return svset_neonq_u16 (a, permute_tbl.val[1]);
}

generates:

mov v0.16b, v1.16b
ptrue   p3.h, vl8
sel z0.h, p3, z0.h, z2.h
ret

The move is redundant: we should be able to use z1.h as input to the sel
instead.

[Bug target/115464] [14 Backport] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)

2024-06-13 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464

--- Comment #11 from Richard Sandiford  ---
Yeah, like I mentioned in the commit message, I'm in the process of rolling
this fix out to more places.  Was just testing the waters with the minimal fix
for comment 4.

But yeah, maybe more of it will need to be backported than I'd hoped.

[Bug target/115464] [14 Backport] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)

2024-06-13 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464

Richard Sandiford  changed:

   What|Removed |Added

  Known to work||15.0
  Known to fail||14.1.0
Summary|ICE when building libaom on |[14 Backport] ICE when
   |arm64 (neon sve bridge  |building libaom on arm64
   |usage with tbl/perm)|(neon sve bridge usage with
   ||tbl/perm)

--- Comment #9 from Richard Sandiford  ---
Fixed on trunk.  Will backport to GCC 14 if there is no fallout.

  1   2   3   4   5   6   7   8   9   10   >