[Bug rtl-optimization/117467] [15/16 Regression] 521.wrf_r again explodes memory/compile-time wise

2025-03-09 Thread law at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #19 from Jeffrey A. Law  ---
Nuts. Busted most of the optimizations for rv64 with the change to the use side
handling.  I guess that's what I get for trying to generalize a pattern I was
seeing -- I'd tested the ad-hoc variant on rv64, but not the more general one. 
I'll deal with it tomorrow, it's too late tonight.

[Bug rtl-optimization/117467] [15/16 Regression] 521.wrf_r again explodes memory/compile-time wise

2025-03-09 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #18 from GCC Commits  ---
The master branch has been updated by Jeff Law :

https://gcc.gnu.org/g:7d3aec2a832ef47be547d9426187562e4548bae6

commit r15-7916-g7d3aec2a832ef47be547d9426187562e4548bae6
Author: Jeff Law 
Date:   Sun Mar 9 14:25:37 2025 -0600

[rtl-optimization/117467] Mark FP destinations as dead

The next step in improving ext-dce is to clean up a minor wart in the
set/clobber handling code.

In that code the safe thing to do is to not process a destination at all.
That will leave bits set in the live bitmaps for objects that may no longer
be live.  Of course with extraneous bits set we use more memory and do more
work managing the bitmaps, but it's safe from a code correctness standpoint.

One case that is slipping through that we need to fix is scalar fp
destinations.  Essentially the code never tried to handle those and as a
result would leave those entities live and bubble them up through the CFG.
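
To illustrate why skipping a destination is safe but costly, here is a minimal
sketch of a backward liveness walk over one block.  It is not the code in
ext-dce.cc: ToyInsn, live_at_entry and the handle_fp_dests flag are
hypothetical names invented for this example, and liveness is tracked per
whole register rather than per bit group.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified instruction: one destination register, some
// source registers, and a flag saying whether the destination is a scalar
// FP value.  None of these names come from ext-dce.cc.
struct ToyInsn {
  int dest;
  std::vector<int> srcs;
  bool dest_is_fp;
};

// Walk a block backwards and return the registers live at its entry.
// When handle_fp_dests is false, FP destinations are not processed at all,
// which is safe but leaves them in the live set.
std::set<int> live_at_entry(const std::vector<ToyInsn> &block,
                            std::set<int> live_out,
                            bool handle_fp_dests) {
  std::set<int> live = std::move(live_out);
  for (auto it = block.rbegin(); it != block.rend(); ++it) {
    if (!it->dest_is_fp || handle_fp_dests)
      live.erase(it->dest);  // the set kills its destination
    for (int r : it->srcs)
      live.insert(r);        // the sources are used, hence live
  }
  return live;
}
```

For a block where r1 is an FP value set from r0 and r2 is set from r1, with r2
live on exit, the walk yields {r0} at entry when the FP destination is handled
and {r0, r1} when it is skipped.  The second set is a superset of the first,
which is why the old behavior was correct but bloated the bitmaps.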

In the testcase at hand this takes us from ~10k live objects at entry to ~4k
live objects at entry.  Time spent in ext-dce goes from 2.14s to .64s.

Bootstrapped and regression tested on x86_64.

PR rtl-optimization/117467
gcc/
* ext-dce.cc (ext_dce_process_sets): Handle FP destinations better.

[Bug rtl-optimization/117467] [15/16 Regression] 521.wrf_r again explodes memory/compile-time wise

2025-03-09 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #17 from GCC Commits  ---
The master branch has been updated by Jeff Law :

https://gcc.gnu.org/g:4ed07a11ee2845c2085a3cd5cff043209a452441

commit r15-7915-g4ed07a11ee2845c2085a3cd5cff043209a452441
Author: Jeff Law 
Date:   Sun Mar 9 13:28:10 2025 -0600

[rtl-optimization/117467] Avoid unnecessarily marking things live in ext-dce

This is the first of what I expect to be a few patches to improve memory
consumption and performance of ext-dce.

While I haven't been able to reproduce the insane memory usage that Richi saw,
I can certainly see how we might get there.  I instrumented ext-dce to dump
the size of liveness sets, removed the memory allocation limiter, then
compiled the appropriate file from specfp on rv64.

In my test I saw the liveness sets growing to absurd sizes as we worked from
the last block back to the first.  Think 125k entries by the time we got back
to the entry block which would mean ~30k live registers.  Simply no way that's
correct.
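
The step from 125k entries to ~30k registers can be sketched as below.  This
assumes each register occupies four entries in the liveness set (one per
tracked bit group), which is what the ratio quoted above implies.  The names
kGroupsPerReg, live_index and fully_live_regs are hypothetical and do not come
from ext-dce.cc.

```cpp
#include <cassert>

// Assumption: four liveness entries per register, one per bit group.
constexpr int kGroupsPerReg = 4;

// Hypothetical index of one bit group of one register in the liveness set.
constexpr int live_index(int regno, int group) {
  return regno * kGroupsPerReg + group;
}

// Number of registers represented if every one of them is fully live,
// that is, all of its groups are present in the set.
constexpr int fully_live_regs(int entries) {
  return entries / kGroupsPerReg;
}
```

Under that assumption 125000 entries correspond to 31250 fully live registers,
which matches the "~30k" estimate.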

The use handling is the primary source of problems and the code that I most
want to rewrite for gcc-16.  It's just a fugly mess.  I'm not terribly
inclined to do that rewrite for gcc-15 though.  So these will be spot
adjustments.

The most important thing to know about use processing is it sets up an
iterator and walks that.  When a SET is encountered we actually manually
dive into the SRC/DEST and ideally terminate the iterator.

If during that SET processing we encounter something unexpected we let the
iterator continue normally, which causes iteration down into the SET_DEST
object.  That's safe behavior, though it can lead to too many objects being
marked live.

We can refine that behavior by trivially realizing that we need not process
the SET_DEST if it is a naked REG (and probably for other cases too, but
they're not expected to be terribly important).  So once we see the SET with a
simple REG destination, we can bump the iterator to avoid having it dive into
the SET_DEST if something unexpected is seen on the SET_SRC side.
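
The effect of stepping over a naked REG destination can be sketched with a toy
expression tree.  This is only an illustration under simplifying assumptions,
not the iterator used in ext-dce.cc: Node, Kind, reg, op and collect_live are
hypothetical names, and the walk models only the conservative fallback in
which every REG reached is marked live.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

enum class Kind { Set, Reg, Mem, Plus, Unknown };

// Hypothetical stand-in for an RTL expression.
struct Node {
  Kind kind;
  int regno;               // meaningful only for Kind::Reg
  std::vector<Node> kids;  // for Kind::Set: kids[0] = dest, kids[1] = src
};

Node reg(int n) { return {Kind::Reg, n, {}}; }
Node op(Kind k, std::vector<Node> kids) { return {k, -1, std::move(kids)}; }

// Conservative fallback walk: every REG reached is marked live.  With
// skip_reg_dest set, a destination that is a plain REG is stepped over,
// because it is written rather than read.
std::set<int> collect_live(const Node &root, bool skip_reg_dest) {
  std::set<int> live;
  std::vector<const Node *> work{&root};
  while (!work.empty()) {
    const Node *n = work.back();
    work.pop_back();
    if (n->kind == Kind::Reg) {
      live.insert(n->regno);
      continue;
    }
    for (std::size_t i = 0; i < n->kids.size(); ++i) {
      bool naked_reg_dest = n->kind == Kind::Set && i == 0
                            && n->kids[i].kind == Kind::Reg;
      if (skip_reg_dest && naked_reg_dest)
        continue;
      work.push_back(&n->kids[i]);
    }
  }
  return live;
}
```

For (set (reg 1) (unknown (reg 2) (reg 3))) the plain walk marks registers 1,
2 and 3 live, while the refined walk marks only 2 and 3.  A MEM destination is
still walked in both modes, since registers in its address are real uses.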

Fixing this alone takes us from 125k live objects to 10k live objects at the
entry block.  Time in ext-dce for rv64 on the testcase goes from 10.81s to
2.14s.

Given this reduces the things considered live, this could easily result in
finding more cases for ext-dce to improve.  In fact a missed optimization
issue for rv64 I've been poking at needs this patch as a prerequisite.

Bootstrapped and regression tested on x86_64.

Pushing to the trunk.

PR rtl-optimization/117467
gcc
* ext-dce.cc (ext_dce_process_uses): When trivially possible advance
the iterator over the destination of a SET.

[Bug rtl-optimization/117467] [15/16 Regression] 521.wrf_r again explodes memory/compile-time wise

2025-03-07 Thread law at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #16 from Jeffrey A. Law  ---
OK.  Funny I'd just been looking at this problem in a different context.

When the use handling encounters an RTX it does not know how to handle, it
will, in effect, continue normal iteration through the sub-rtxs, marking any
REG seen as completely live.  Seems sensible.

What goes wrong (and what I was looking at earlier this week) is that
processing will dive into the destination.  We certainly need to look at the
destination; consider it could be a MEM and there may be REGs in the address.

But it just blindly goes through the entire RTL object.  So even a simple LHS
such as (set (reg) ...) will mark the reg as *live*.  It's safe, but highly
suboptimal.  I'd planned to tackle this in gcc-16 in an attempt to clean up the
control flow in the use handling.  But I think I need to accelerate at least an
investigation of cleaning up the use handling to avoid this problem.

To answer another question from this BZ.  No, ext-dce doesn't really duplicate
the functionality found elsewhere such as combine, ree, cse.  Combine comes the
closest as combine does some nonzero & sign bit tracking.  But it doesn't do a
liveness analysis on a global basis and use the results to simplify RTXs.  REE
is concerned with finding multiple extension rtxs and eliminating one,
similarly for CSE.

[Bug rtl-optimization/117467] [15/16 Regression] 521.wrf_r again explodes memory/compile-time wise

2025-03-07 Thread law at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #15 from Jeffrey A. Law  ---
So what's weird here is on that file for riscv64, after removing the memory
based limiter, I see ext-dce at .94s out of 295s of cpu time and I never see a
major memory spike -- I don't ever see it get much above 1G, certainly nowhere
near the 25G reported in c#2.

While I don't doubt the results you saw, I do wonder if either checking or the
x86 target tickled some quirk.  I'll test those next.