[Bug rtl-optimization/110587] [14 regression] 96% pr28071.c compile time regression since r14-2337-g37a231cc7594d1

2023-08-09 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #22 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:b66e613a1a8d5b8fc9d8b03f7b60260700acf833

commit r14-3095-gb66e613a1a8d5b8fc9d8b03f7b60260700acf833
Author: Richard Biener 
Date:   Tue Jul 25 15:36:30 2023 +0200

rtl-optimization/110587 - speedup find_hard_regno_for_1

The following applies a micro-optimization to find_hard_regno_for_1,
re-ordering the check so we can easily jump-thread by using an else.
This reduces the time spent in this function by 15% for the testcase
in the PR.

PR rtl-optimization/110587
* lra-assigns.cc (find_hard_regno_for_1): Re-order checks.

2023-08-02 Thread rguenth at gcc dot gnu.org via Gcc-bugs

--- Comment #21 from Richard Biener  ---
(In reply to Uroš Bizjak from comment #20)
> Can we revert the Comment #13 kludge now?

When we revert it we get

 integrated RA              :   0.42 ( 17%)   0.00 (  0%)   0.43 ( 17%)    19M ( 16%)
 LRA non-specific           :   0.39 ( 16%)   0.00 (  0%)   0.39 ( 15%)  6304k (  5%)
 LRA virtuals elimination   :   0.03 (  1%)   0.00 (  0%)   0.02 (  1%)  3729k (  3%)
 LRA reload inheritance     :   0.17 (  7%)   0.01 ( 10%)   0.18 (  7%)  5109k (  4%)
 LRA create live ranges     :   0.27 ( 11%)   0.00 (  0%)   0.28 ( 11%)   984k (  1%)
 LRA hard reg assignment    :   0.72 ( 30%)   0.01 ( 10%)   0.74 ( 29%)     0  (  0%)
 TOTAL                      :   2.43          0.10          2.54          123M

so the regression is back and also code size increases significantly.

2023-08-02 Thread ubizjak at gmail dot com via Gcc-bugs

--- Comment #20 from Uroš Bizjak  ---
Can we revert the Comment #13 kludge now?

2023-08-02 Thread rguenth at gcc dot gnu.org via Gcc-bugs

Richard Biener  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #19 from Richard Biener  ---
The tester shows the issue is fixed now (we're faster than before the
regression).  At -O0 compile-time is still dominated by RA
(r14-2920-g07b7cd70399d22, release checking):

 integrated RA  :   0.29 ( 32%)
 LRA non-specific   :   0.15 ( 16%)
 TOTAL  :   0.91

Samples: 3K of event 'cycles:u', Event count (approx.): 5038659855
Overhead  Samples  Command  Shared Object  Symbol
   6.15%      233  cc1      cc1            [.] process_alt_operands
   4.29%      163  cc1      cc1            [.] process_bb_node_lives
   3.72%      142  cc1      cc1            [.] record_reg_classes
   3.01%      114  cc1      cc1            [.] mark_ref_dead
   2.87%      109  cc1      cc1            [.] constrain_operands
   2.71%      114  cc1      cc1            [.] df_ref_create_structure
   2.47%       94  cc1      cc1            [.] ira_setup_alts

2023-08-02 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs

--- Comment #18 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:07b7cd70399d22c113ad8bb1eff5cc2d12973d33

commit r14-2920-g07b7cd70399d22c113ad8bb1eff5cc2d12973d33
Author: Richard Biener 
Date:   Tue Jul 25 15:32:11 2023 +0200

rtl-optimization/110587 - remove quadratic regno_in_use_p

The following removes the code checking whether a noop copy
is between something involved in the return sequence composed
of a SET and USE.  Instead of checking for this special-case
the following makes us only ever remove noop copies between
pseudos - which is the case that is necessary for IRA/LRA
interfacing to function according to the comment.  That makes
looking for the return reg special case unnecessary, reducing
the compile-time in LRA non-specific to zero for the testcase.

PR rtl-optimization/110587
* lra-spills.cc (return_regno_p): Remove.
(regno_in_use_p): Likewise.
(lra_final_code_change): Do not remove noop moves
between hard registers.
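
The rule described above (a same-register copy is only dropped when the register is a pseudo) can be sketched in a toy model.  This is not the code from lra-spills.cc: `toy_move`, `toy_is_pseudo`, `toy_removable_noop_move` and the boundary value `TOY_FIRST_PSEUDO` are hypothetical names chosen for illustration; in GCC the boundary is the target-specific FIRST_PSEUDO_REGISTER.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical boundary for this sketch: register numbers below it
   are hard registers, the rest are pseudos.  */
#define TOY_FIRST_PSEUDO 64

/* Toy register-to-register move; not GCC's RTL SET.  */
struct toy_move
{
  int dest;
  int src;
};

/* True if REGNO names a pseudo register in this toy numbering.  */
bool
toy_is_pseudo (int regno)
{
  assert (regno >= 0);
  return regno >= TOY_FIRST_PSEUDO;
}

/* A noop move is removable only when it copies a pseudo to itself.
   Noop moves between hard registers are kept, so no scan of the
   following insns is needed to decide.  */
bool
toy_removable_noop_move (struct toy_move m)
{
  return m.src == m.dest && toy_is_pseudo (m.src);
}
```

The decision here depends only on the move itself, which is why the per-insn lookahead becomes unnecessary in this model.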

2023-07-28 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs

--- Comment #17 from CVS Commits  ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:095eb138f736d94dabf9a07a6671bd351be0e66a

commit r14-2851-g095eb138f736d94dabf9a07a6671bd351be0e66a
Author: Roger Sayle 
Date:   Fri Jul 28 09:39:46 2023 +0100

PR rtl-optimization/110587: Reduce useless moves in compile-time hog.

This patch is one of a series of fixes for PR rtl-optimization/110587,
a compile-time regression with -O0, that attempts to address the underlying
cause.  As noted previously, the pathological test case pr28071.c contains
a large number of useless register-to-register moves that can produce
quadratic behaviour (in LRA).  These moves are generated during RTL
expansion in emit_group_load_1, where the middle-end attempts to simplify
the source before calling extract_bit_field.  This is reasonable if the
source is a complex expression (from before the tree-ssa optimizers), or
a SUBREG, or a hard register, but it's not particularly useful to copy
a pseudo register into a new pseudo register.  This patch eliminates that
redundancy.

The -fdump-tree-expand for pr28071.c compiled with -O0 currently contains
777K lines, with this patch it contains 717K lines, i.e. saving about 60K
lines (admittedly of debugging text output, but it makes the point).

2023-07-28  Roger Sayle  
Richard Biener  

gcc/ChangeLog
PR middle-end/28071
PR rtl-optimization/110587
* expr.cc (emit_group_load_1): Simplify logic for calling
force_reg on ORIG_SRC, to avoid making a copy if the source
is already in a pseudo register.

2023-07-27 Thread roger at nextmovesoftware dot com via Gcc-bugs

Roger Sayle  changed:

   What|Removed |Added

   Assignee|roger at nextmovesoftware dot com  |unassigned at gcc dot gnu.org

--- Comment #16 from Roger Sayle  ---
My patch (in comment #15) is obsoleted by Richard Biener's much better
solution(s):
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625416.html
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625417.html

2023-07-25 Thread roger at nextmovesoftware dot com via Gcc-bugs

--- Comment #15 from Roger Sayle  ---
Hi Richard,
There's another patch awaiting review at
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625282.html
and I've another follow-up after that currently regression testing...

2023-07-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs

--- Comment #14 from Richard Biener  ---
Compile time is back to the first jump caused by r14-2337-g37a231cc7594d1,
thanks Roger.  We still have

 LRA non-specific   :   3.53 ( 75%)

at -O0 here, which Roger's follow-up patch will improve (but it will not
solve the issue in general).

At -O1 combine dominates, at -O2 we see other parts of RA being slow:

 integrated RA  :   7.10 ( 23%) 
 LRA non-specific   :   1.56 (  5%)
 LRA virtuals elimination   :   0.07 (  0%)
 LRA reload inheritance :   1.02 (  3%)
 LRA create live ranges :   0.88 (  3%)
 LRA hard reg assignment:   8.22 ( 27%)
 LRA coalesce pseudo regs   :   0.00 (  0%)
 LRA rematerialization  :   0.18 (  1%)

Samples: 124K of event 'cycles:u', Event count (approx.): 164730867020
Overhead  Samples  Command  Shared Object  Symbol
  16.60%    20660  cc1      cc1            [.] find_hard_regno_for_1
  11.90%    14742  cc1      cc1            [.] bitmap_set_bit
   6.47%     7973  cc1      cc1            [.] color_allocnos
   3.31%     4023  cc1      cc1            [.] bitmap_bit_p
   3.07%     3791  cc1      cc1            [.] remove_allocno_from_bucket_and_push
   2.77%     3435  cc1      cc1            [.] assign_hard_reg
   2.54%     3138  cc1      cc1            [.] ira_build_conflicts

In find_hard_regno_for_1 the loop over live ranges is what's costly,
especially because the conditionals in the loop seem to depend on (indirect)
memory accesses whose data no longer fits nicely into the caches.

Maybe regno_allocno_class_array can be shrunk from 'enum reg_class'
(unsigned int) to something smaller.  It looks like this array is a
memory optimization since reg_allocno_class would perform a much sparser
access.
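
As a rough illustration of the footprint argument, the sketch below compares a per-register table stored as full enums with one narrowed to a single byte per entry.  `toy_reg_class` and the helper names are hypothetical stand-ins, not GCC's definitions; the real 'enum reg_class' is target-specific.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for GCC's 'enum reg_class'.  */
enum toy_reg_class
{
  TOY_NO_REGS,
  TOY_GENERAL_REGS,
  TOY_SSE_REGS,
  TOY_ALL_REGS
};

/* Bytes needed for a table of N_REGS entries stored as full enums.  */
size_t
wide_table_bytes (size_t n_regs)
{
  return n_regs * sizeof (enum toy_reg_class);
}

/* Bytes needed when each entry is narrowed to one byte.  */
size_t
narrow_table_bytes (size_t n_regs)
{
  return n_regs * sizeof (unsigned char);
}

/* Narrow a class to one byte; valid only while every enumerator
   fits in an unsigned char.  */
unsigned char
narrow_class (enum toy_reg_class c)
{
  assert ((unsigned int) c <= UCHAR_MAX);
  return (unsigned char) c;
}
```

On targets where the enum is as wide as an unsigned int, the narrow table is a fraction of the size, so more of it stays in cache during the live-range loop; whether that pays off in find_hard_regno_for_1 would need measuring.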

2023-07-22 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs

--- Comment #13 from CVS Commits  ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:8125b12f846b41f26e58c0fe3b218d654f65d1c8

commit r14-2730-g8125b12f846b41f26e58c0fe3b218d654f65d1c8
Author: Roger Sayle 
Date:   Sat Jul 22 21:52:55 2023 +0100

i386: Don't use insvti_{high,low}part with -O0 (for compile-time).

This patch attempts to help with PR rtl-optimization/110587, a regression
of -O0 compile time for the pathological pr28071.c.  My recent patch helps
a bit, but hasn't returned -O0 compile-time to where it was before my
ix86_expand_move changes.  The obvious solution/workaround is to guard
these new TImode parameter passing optimizations with "&& optimize", so
they don't trigger when compiling with -O0.  The very minor complication
is that "&& optimize" alone leads to the regression of pr110533.c, where
our improved TImode parameter passing fixes a wrong-code issue with naked
functions, importantly, when compiling with -O0.  This should explain
the one line fix below "&& (optimize || ix86_function_naked (cfun))".

I've an additional fix/tweak or two for this compile-time issue, but
this change eliminates the part of the regression that I've caused.

2023-07-22  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_move): Disable the
64-bit insertions into TImode optimizations with -O0, unless
the function has the "naked" attribute (for PR target/110533).

2023-07-18 Thread rguenth at gcc dot gnu.org via Gcc-bugs

--- Comment #12 from Richard Biener  ---
This code block has a rich history with many fixes for many issues :/  (I
thought of just scrapping it ...).  Still, regno_in_use_p is badly engineered
in this context.  Of course we're quite unlucky that the return REG is in use
that much for this large BB.

In the end the reason why this code exists and also some of the fallout
observed in the history point at issues that might be worth fixing elsewhere as
well.

2023-07-17 Thread roger at nextmovesoftware dot com via Gcc-bugs

Roger Sayle  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |roger at nextmovesoftware dot com

--- Comment #11 from Roger Sayle  ---
My (upcoming) patch for PR88873 dramatically reduces the compile-time (with
-O0) for this test case (by reducing the number of pseudos and reducing the
number of reloads).  But don't let that stop anyone from speeding up
lra_final_code_change.

2023-07-17 Thread rguenth at gcc dot gnu.org via Gcc-bugs

--- Comment #10 from Richard Biener  ---
I wonder what the following does anyway.  We delete the noop move
only when either the reg isn't used for return or it isn't in
use in later insns between 'insn' and the next set of it.
That seems to detect the hardreg = X; USE (hardreg); return sequence
and wants to protect that despite X being the same as 'hardreg'.

  /* IRA can generate move insns involving pseudos.  It is
     better remove them earlier to speed up compiler a bit.
     It is also better to do it here as they might not pass
     final RTL check in LRA, (e.g. insn moving a control
     register into itself).  So remove an useless move insn
     unless next insn is USE marking the return reg (we should
     save this as some subsequent optimizations assume that
     such original insns are saved).  */
  if (NONJUMP_INSN_P (insn) && GET_CODE (pat) == SET
      && REG_P (SET_SRC (pat)) && REG_P (SET_DEST (pat))
      && REGNO (SET_SRC (pat)) == REGNO (SET_DEST (pat))
      && (! return_regno_p (REGNO (SET_SRC (pat)))
          || ! regno_in_use_p (insn, REGNO (SET_SRC (pat)))))

What's odd is of course that return_regno_p returns true so much for this
testcase.

The return sequence to protect should be easily discoverable by walking
from the function exit and thus could be marked instead of trying to
match it to each insn like above.

But I don't understand why we want to preserve this noop copy anyway ...

2023-07-17 Thread roger at nextmovesoftware dot com via Gcc-bugs

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot com
   See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873

--- Comment #9 from Roger Sayle  ---
I'll check whether turning off the insvti_{low,high}part transformations during
lra_in_progress helps compile-time.  I believe every time reload encounters a
TI<->SSE SUBREG, the spill/reload generates two or three additional
instructions.  I'm thinking that perhaps this should ideally be an UNSPEC, that
we can split after reload. As shown in PR 88873, we'd like SSE->TI->SSE to
avoid going via memory [where currently this happens twice]. It looks like
"interval" in pr28071.c suffers from the same x86 ABI issues [i.e. is placed in
scalar TImode, where ideally we'd like V2DI].

2023-07-17 Thread rguenth at gcc dot gnu.org via Gcc-bugs

Richard Biener  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #8 from Richard Biener  ---
Btw, with GCC 13.1 this is already a LRA hog:

 LRA non-specific           :   3.31 ( 73%)   0.01 (  9%)   3.33 ( 72%)  3876k (  3%)
 TOTAL                      :   4.53          0.11          4.65          126M

GCC 8 and before were worse.  On trunk:

 LRA non-specific           :   6.22 ( 69%)   0.02 ( 20%)   6.22 ( 69%)  8922k (  6%)
 LRA hard reg assignment    :   1.00 ( 11%)   0.02 ( 20%)   1.02 ( 11%)     0  (  0%)
 TOTAL                      :   8.97          0.10          9.08          149M

The above is with just -O0.

Profile:

Samples: 37K of event 'cycles:u', Event count (approx.): 49984847870
Overhead  Samples  Command  Shared Object  Symbol
  51.58%    19087  cc1      cc1            [.] lra_final_code_change
  11.10%     4106  cc1      cc1            [.] next_nondebug_insn
   7.61%     2879  cc1      cc1            [.] bitmap_set_bit
   6.42%     2425  cc1      cc1            [.] find_hard_regno_for_1
   2.28%      842  cc1      cc1            [.] bitmap_bit_p
   0.99%      365  cc1      cc1            [.] lra_create_live_ranges_1

It possibly means we now spill more, at -O0 at least.  We have a 10%
regression in assembly line count between 13 and trunk.

The main hog in lra_final_code_change is calls to regno_in_use_p and
the loop within that.  The BB in this function is _huge_ so the whole
process quickly becomes quadratic.  Maybe the whole thing should work
backwards on a BB and this info collected on-the-fly as some "liveness"
problem?
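
To make the quadratic-versus-backward idea concrete, here is a toy sketch; it is not GCC code.  `toy_insn`, `in_use_scan` and `in_use_backward` are hypothetical names, and each toy insn sets and uses at most one register.  The forward scan mimics the spirit of regno_in_use_p (look ahead until the register is used or set again), while the backward pass answers the same question for every insn of the block in a single walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NUM_REGS 16

/* Toy instruction: sets at most one register and uses at most one;
   -1 means "none".  This is not GCC's RTL representation.  */
struct toy_insn
{
  int set_reg;
  int use_reg;
};

/* Forward scan: O(n) per query, hence O(n^2) when asked for every
   insn of one huge block.  */
bool
in_use_scan (const struct toy_insn *bb, size_t n, size_t i, int reg)
{
  for (size_t j = i + 1; j < n; j++)
    {
      if (bb[j].use_reg == reg)
        return true;
      if (bb[j].set_reg == reg)
        return false;
    }
  return false;
}

/* Single backward pass: on return, out[i] tells whether the register
   set by insn i is used later in the block before being set again.
   Nothing is considered live at the end of the block, matching the
   forward scan above.  */
void
in_use_backward (const struct toy_insn *bb, size_t n, bool *out)
{
  bool live[TOY_NUM_REGS];
  memset (live, 0, sizeof live);
  for (size_t i = n; i-- > 0;)
    {
      int s = bb[i].set_reg;
      int u = bb[i].use_reg;
      assert (s < TOY_NUM_REGS && u < TOY_NUM_REGS);
      out[i] = s >= 0 && live[s];
      if (s >= 0)
        live[s] = false;   /* A set kills liveness above this insn.  */
      if (u >= 0)
        live[u] = true;    /* A use makes the register live above.  */
    }
}
```

Both functions give the same answers in this model; the backward pass just amortizes the lookahead across the whole block.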

2023-07-17 Thread jamborm at gcc dot gnu.org via Gcc-bugs

Martin Jambor  changed:

   What|Removed |Added

 CC||sayle at gcc dot gnu.org

--- Comment #7 from Martin Jambor  ---
Oops sorry, indeed, the much bigger regression is because of:

commit 8911879415d6c2a7baad88235554a912887a1c5c
Author: Roger Sayle 
Date:   Fri Jul 14 18:10:05 2023 +0100

i386: Improved insv of DImode/DFmode {high,low}parts into TImode.

This is the next piece towards a fix for (the x86_64 ABI issues affecting)
PR 88873.  This patch generalizes the recent tweak to ix86_expand_move
for setting the highpart of a TImode reg from a DImode source using
*insvti_highpart_1, to handle both DImode and DFmode sources, and also
use the recently added *insvti_lowpart_1 for setting the lowpart.

Although this is another intermediate step (not yet a fix), towards
enabling *insvti and *concat* patterns to be candidates for TImode STV
(by using V2DI/V2DF instructions), it already improves things a little.
[...]

2023-07-17 Thread rguenth at gcc dot gnu.org via Gcc-bugs

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #6 from Richard Biener  ---
That doesn't seem to be the larger jump at Jul 16/17?  Can we bisect that as
well?

2023-07-17 Thread jamborm at gcc dot gnu.org via Gcc-bugs

--- Comment #5 from Martin Jambor  ---
(In reply to Hongtao.liu from comment #3)
> I can't find pr28071.c in the GCC testsuite, but I found an attached source
> file in the PR #c1; is that the pr28071.c you mean?

Yes.


(In reply to Hongtao.liu from comment #4)
> (In reply to Jan Hubicka from comment #0)
> > Seen here:
> > https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=288.597.8
> > https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=468.597.8
> > https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=172.597.8
> 
> Also, does O0_g mean the compile flags are -O0 -g?

That is what I used to bisect, although I *think* that -g is not
necessary.

2023-07-17 Thread crazylht at gmail dot com via Gcc-bugs

--- Comment #4 from Hongtao.liu  ---
(In reply to Jan Hubicka from comment #0)
> Seen here:
> https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=288.597.8
> https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=468.597.8
> https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=172.597.8

Also, does O0_g mean the compile flags are -O0 -g?

2023-07-17 Thread crazylht at gmail dot com via Gcc-bugs

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #3 from Hongtao.liu  ---
I can't find pr28071.c in the GCC testsuite, but I found an attached source
file in the PR #c1; is that the pr28071.c you mean?

2023-07-15 Thread pinskia at gcc dot gnu.org via Gcc-bugs

Andrew Pinski  changed:

   What|Removed |Added

 Target||x86_64-linux-gnu

--- Comment #2 from Andrew Pinski  ---
It would be interesting to see whether it is the register allocator, and where
(in which function) in GCC the compile-time slowdown happens.