[Bug target/101523] Huge number of combine attempts

2024-03-27 Thread sarah.kriesch at opensuse dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #49 from Sarah Julia Kriesch  ---
(In reply to Sam James from comment #44)
> I'm really curious as to whether there are other test cases that could be
> shared, as Andreas mentioned distributions were even complaining about this.
> That's unlikely if it's a single degenerate case.
> 
> Even listing some example package names could help.

Sorry for the late response! I am a volunteer and went through all the
constraints files from the last few years (I had added constraints to multiple
packages). Most memory-related issues have already been resolved.
But I found some Easter eggs for you today:

1) nodejs21 with 11,5GB on s390x, 2,5GB on x86, 3,7GB on PPCle, 2,5GB on
aarch64 and 2,4GB on armv7:
https://build.opensuse.org/package/show/devel:languages:nodejs/nodejs21

2) PDAL with 9,7GB on s390x, 2,2GB on x86 and 2,2GB on aarch64:
https://build.opensuse.org/package/show/openSUSE:Factory:zSystems/PDAL

3) python-numpy with 15,2GB on s390x, 8,6GB on PPCle, 9,3GB on x86, 1,9GB on
armv7, 9,3GB on aarch64:
https://build.opensuse.org/package/show/devel:languages:python:numeric/python-numpy

I wish you a happy Easter!

[Bug target/101523] Huge number of combine attempts

2024-03-22 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #48 from Richard Biener  ---
So another "simple" way is to keep the redundant insn walking ("it's O(1)") but
remember processsed insns and only re-process those we mark as such.

There might be a free "visited" bit on rtx_insn, who knows; the following uses
a bitmap to track this.  Likely, where we set/update added_links_insn, we
should mark insns for re-processing.

A worklist, if it were to be processed in instruction order, would need to
be kept ordered and DF docs say DF_INSN_LUID isn't to be trusted after
adding/removing insns.
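
Purely as an illustration of that worklist idea (this is not part of the patch
below; the names and placement are hypothetical): keying the worklist by
INSN_UID in a bitmap and draining it with extra walks over the basic block
gives processing in insn-chain order without ever consulting DF_INSN_LUID.

  /* Hypothetical sketch only.  Collect insns that need re-processing in a
     bitmap and make additional passes over the block until it drains; the
     insn-chain walk provides the ordering.  */
  static bitmap combine_worklist;

  /* ... at the end of the per-BB loop in combine_instructions ... */
  while (!bitmap_empty_p (combine_worklist))
    for (rtx_insn *insn = BB_HEAD (this_basic_block);
         insn != NEXT_INSN (BB_END (this_basic_block));
         insn = NEXT_INSN (insn))
      if (NONDEBUG_INSN_P (insn)
          && bitmap_clear_bit (combine_worklist, INSN_UID (insn)))
        {
          /* Re-run the try_combine attempts for INSN here.  */
        }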

diff --git a/gcc/combine.cc b/gcc/combine.cc
index a4479f8d836..c2f04e6b86e 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -1106,6 +1106,8 @@ insn_a_feeds_b (rtx_insn *a, rtx_insn *b)
   return false;
 }
 ^L
+static bitmap processed;
+
 /* Main entry point for combiner.  F is the first insn of the function.
NREGS is the first unused pseudo-reg number.

@@ -1211,6 +1213,8 @@ combine_instructions (rtx_insn *f, unsigned int nregs)
   setup_incoming_promotions (first);
   last_bb = ENTRY_BLOCK_PTR_FOR_FN (cfun);
   int max_combine = param_max_combine_insns;
+  processed = BITMAP_ALLOC (NULL);
+  bitmap_tree_view (processed);

   FOR_EACH_BB_FN (this_basic_block, cfun)
 {
@@ -1231,6 +1235,7 @@ combine_instructions (rtx_insn *f, unsigned int nregs)
label_tick_ebb_start = label_tick;
   last_bb = this_basic_block;

+  bitmap_clear (processed);
   rtl_profile_for_bb (this_basic_block);
   for (insn = BB_HEAD (this_basic_block);
   insn != NEXT_INSN (BB_END (this_basic_block));
@@ -1240,6 +1245,9 @@ combine_instructions (rtx_insn *f, unsigned int nregs)
  if (!NONDEBUG_INSN_P (insn))
continue;

+ if (!bitmap_set_bit (processed, INSN_UID (insn)))
+   continue;
+
  while (last_combined_insn
 && (!NONDEBUG_INSN_P (last_combined_insn)
 || last_combined_insn->deleted ()))
@@ -1427,6 +1435,7 @@ retry:
  ;
}
 }
+  BITMAP_FREE (processed);

   default_rtl_profile ();
   clear_bb_flags ();
@@ -4758,6 +4767,14 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   if (added_notes_insn && DF_INSN_LUID (added_notes_insn) < DF_INSN_LUID (ret))
     ret = added_notes_insn;

+  bitmap_clear_bit (processed, INSN_UID (i3));
+  if (newi2pat)
+    bitmap_clear_bit (processed, INSN_UID (newi2pat));
+  if (added_links_insn)
+    bitmap_clear_bit (processed, INSN_UID (added_links_insn));
+  if (added_notes_insn)
+    bitmap_clear_bit (processed, INSN_UID (added_notes_insn));
+
   return ret;
 }
 ^L

[Bug target/101523] Huge number of combine attempts

2024-03-22 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #47 from Richard Biener  ---
The rtx_equal_p change gets us only a 50% improvement; it's necessary to also
disable the added_{links,notes}_insn extra re-processing to get us all the
way to -O1 speed.  We'd need the worklist to avoid combine regressions there
(though for the actual testcase it doesn't make a difference).

[Bug target/101523] Huge number of combine attempts

2024-03-22 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #46 from Richard Biener  ---
Maybe combine already knows that it just "keeps i2" rather than replacing it?
When !newi2pat we seem to delete i2.  Anyway, somebody more familiar with
combine should produce a good(TM) patch.
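
(For context, the relevant shape in try_combine is roughly the following;
paraphrased from memory, not an exact quote of the source:)

  if (newi2pat)
    {
      /* ... */
      INSN_CODE (i2) = i2_code_number;
      PATTERN (i2) = newi2pat;   /* i2 is kept, possibly textually unchanged */
    }
  else
    SET_INSN_DELETED (i2);       /* !newi2pat: i2 goes away */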

[Bug target/101523] Huge number of combine attempts

2024-03-22 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #45 from Sam James  ---
(ah, not Andreas, but Sarah)

[Bug target/101523] Huge number of combine attempts

2024-03-22 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #44 from Sam James  ---
I'm really curious as to whether there are other test cases that could be
shared, as Andreas mentioned distributions were even complaining about this.
That's unlikely if it's a single degenerate case.

Even listing some example package names could help.

[Bug target/101523] Huge number of combine attempts

2024-03-22 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #43 from Richard Biener  ---
The interesting bit is that there are only 12026 loglinks overall; the
number of combine attempts is way higher than that would suggest, especially
given the few successful combinations.  So something is odd here.

There's an interesting mechanism in try_combine: through added_links_insn
and added_notes_insn we can end up re-processing a large swath of insns
(even though we should only need to re-process the link target insns, not
all insns in between).  There might be an opportunity, for the "reprocessing",
to use a worklist instead of resetting the insn walk.

I added statistics to record the "distance" we travel there, taking
DF_INSN_LUID (ret) - DF_INSN_LUID (added_{notes,links}_insn) as that distance.
This shows 48 such jumps, some with seemingly large distances:

305 combine "restart earlier == 143" 3
305 combine "restart earlier == 254" 1
305 combine "restart earlier == 684" 1
305 combine "restart earlier == 726" 1
305 combine "restart earlier == 777" 1
305 combine "restart earlier == 1158" 1
305 combine "restart earlier == 1421" 1
305 combine "restart earlier == 2073" 1
305 combine "restart earlier == 2130" 1
...
305 combine "restart earlier == 49717" 1
305 combine "restart earlier == 49763" 1
305 combine "restart earlier == 49866" 1
305 combine "restart earlier == 50010" 1
305 combine "restart earlier == 50286" 1
305 combine "restart earlier == 50754" 1
305 combine "restart earlier == 50809" 1

Killing this feature doesn't improve things to -O1 levels though, so it's
more likely due to the fact that we also do

  rtx_insn *ret = newi2pat ? i2 : i3;

and thus re-start at i2 whenever we altered i2.  We re-start through this 6910
times.  Always re-starting at i3 helps a lot and gets us -O1 performance
back.  Comment #1 suggests that newi2pat might in fact be equal
to the old one, so I tried to count how many times this happens with a stupid

diff --git a/gcc/combine.cc b/gcc/combine.cc
index a4479f8d836..acd176d3185 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -4435,6 +4435,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
  propagate_for_debug (i2, last_combined_insn, i2dest, i2src,
   this_basic_block);
INSN_CODE (i2) = i2_code_number;
+   if (rtx_equal_p (PATTERN (i2), newi2pat))
+ statistics_counter_event (cfun, "equal newi2pat", 1);
PATTERN (i2) = newi2pat;
   }
 else

and indeed this shows this to be the case 9211 times.

The following improves compile time to 20s and 460MB memory use.  In general
the algorithmic deficiency with the "restarting" remains, and a proper fix
is to use a worklist of insns to re-process that you'd drain before advancing
in the instruction chain (so there isn't a single 'ret' insn to reprocess;
instead, insns are added to the worklist).

I'm not sure whether identifying an unchanged "new" i2 can be done better.
I'll leave it all to Segher of course - he'll be fastest to produce something
he likes.

diff --git a/gcc/combine.cc b/gcc/combine.cc
index a4479f8d836..0c61dcedaa1 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -4276,6 +4276,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
}
 }

+  bool newi2pat_not_new = false;
   {
 rtx i3notes, i2notes, i1notes = 0, i0notes = 0;
 struct insn_link *i3links, *i2links, *i1links = 0, *i0links = 0;
@@ -4435,6 +4436,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
  propagate_for_debug (i2, last_combined_insn, i2dest, i2src,
   this_basic_block);
INSN_CODE (i2) = i2_code_number;
+   if (rtx_equal_p (PATTERN (i2), newi2pat))
+ newi2pat_not_new = true;
PATTERN (i2) = newi2pat;
   }
 else
@@ -4752,7 +4755,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   combine_successes++;
   undo_commit ();

-  rtx_insn *ret = newi2pat ? i2 : i3;
+  rtx_insn *ret = newi2pat && !newi2pat_not_new ? i2 : i3;
   if (added_links_insn && DF_INSN_LUID (added_links_insn) < DF_INSN_LUID (ret))
     ret = added_links_insn;
   if (added_notes_insn && DF_INSN_LUID (added_notes_insn) < DF_INSN_LUID (ret))

[Bug target/101523] Huge number of combine attempts

2024-03-22 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

Richard Biener  changed:

   What|Removed |Added

  Known to fail||14.0
 Ever confirmed|0   |1
   Last reconfirmed||2024-03-22
 Status|UNCONFIRMED |NEW

--- Comment #42 from Richard Biener  ---
I checked with a cross compiler btw, and with -O1 we use 10s and 900MB of
memory for the testcase from comment #22.  With -O2 it's 160s and 11GB as
reported.

It might of course be that with -O1 we simply do not confront combine with
the opportunity to blow up.

So IMHO this is a non-issue and the reporter should use -O1 for such a TU.

Still confirmed as an s390x-specific problem, and confirmed on trunk.

Statistics with the -O2 combines:

305 combine "successes" 9425 
305 combine "three-insn combine" 1
305 combine "four-insn combine" 1
305 combine "merges" 40418007 
305 combine "extras" 9110287
305 combine "two-insn combine" 9423
305 combine "attempts" 40440004

With -O1:

305 combine "successes" 1861
305 combine "three-insn combine" 1732
305 combine "merges" 191051
305 combine "extras" 9274
305 combine "two-insn combine" 129
305 combine "attempts" 192434

[Bug target/101523] Huge number of combine attempts

2024-03-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #41 from Richard Biener  ---
(In reply to Segher Boessenkool from comment #38)
> (In reply to Richard Biener from comment #36)
[...]
> But linear is linear, and stays linear, for way too big code it is just as
> acceptable as for "normal" code.  Just slow.  If you don't want the compiler
> to
> take a long time compiling your way too big code, use -O0, or preferably do
> not
> write insane code in the first place :-)

;)  We promise to try to behave reasonably with insane code, but
technically we tell people to use at most -O1 for that.  That will
at least avoid trying three- and four-insn combinations.

[...]

> Ideally we'll not do *any* artificial limitations.

I agree.  And we should try hard to fix actual algorithmic problems if
they exist before resorting to limits.

>  But GCC just throws its hat
> in the ring in other cases as well, say, too big RA problems.  You do get not
> as high quality code as wanted, but at least you get something compiled in
> an acceptable timeframe :-)

Yep.  See above for my comment about -O1.  I think it's fine to take
time (and memory) to optimize high-quality code at -O2.  And if you
throw insane code at GCC, then expect an insane amount of time and memory ;)

So I do wonder whether with -O1 the issue is gone anyway already?

If not, then for the sake of -O1 and insane code we want such a limit.  It can
be more crude, aka just count all attempts and stop altogether, or, like
PRE, simply skip the pass when the number of pseudos/blocks crosses a magic
barrier.  I just thought combine is a bit too core a part of our instruction
selection, so disabling it completely (after some point) would be too bad even
for insane code ...
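
(A crude illustration of that wholesale gating, with a made-up param name,
just to make the idea concrete - not a proposed patch:)

  /* Hypothetical: bail out of combine for functions that look insanely
     large, similar to how PRE gives up past a magic barrier.  */
  if (max_reg_num () > param_combine_max_pseudos)
    return 0;   /* skip combining in this function */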

Andreas - can you try --param max-combine-insns=2 please?  That is, I think,
what -O1 uses, and it then only does two-insn combinations.
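
(That is, something like the following on the problematic TU; max-combine-insns
is the knob already read into param_max_combine_insns in combine.cc:)

  gcc -O2 --param max-combine-insns=2 -c testcase.c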

[Bug target/101523] Huge number of combine attempts

2024-03-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #40 from Richard Biener  ---
(In reply to Segher Boessenkool from comment #39)
> (In reply to Richard Biener from comment #37)
> > Created attachment 57753 [details]
> > quick attempt at a limit
> > 
> > So like this?
> 
> Hrm.  It should be possible to not have the same test 28 times.  Just at one
> spot!

Not sure.  We loop over log-links multiple times.  You could argue we
should only call try_combine once ;)  But yeah, it's not very pretty,
agreed.  We could pass the counter to try_combine and have a special
return value, (void *)-1 for the failed-and-exhausted case, handling
that in retry: like

retry:
  if (next == NULL || next == (void *)-1)
attempts = 0;

then only the last case where we mangle 'set' and have to restore it on
failure would need special casing (and of course try_combine itself).
But in a way that's also ugly, so I guess I'll stay with my proposal.

At least until somebody has actually tried whether and how much it helps (and
for which values of the --param).

[Bug target/101523] Huge number of combine attempts

2024-03-21 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #39 from Segher Boessenkool  ---
(In reply to Richard Biener from comment #37)
> Created attachment 57753 [details]
> quick attempt at a limit
> 
> So like this?

Hrm.  It should be possible to not have the same test 28 times.  Just at one
spot!

[Bug target/101523] Huge number of combine attempts

2024-03-21 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #38 from Segher Boessenkool  ---
(In reply to Richard Biener from comment #36)
> > No, it definitely should be done.  As I showed back then, it costs less than
> > 1%
> > extra compile time on *any platform* on average, and it reduced code size by
> > 1%-2%
> > everywhere.
> > 
> > It also cannot get stuck, any combination is attempted only once, any
> > combination
> > that succeeds eats up a loglink.  It is finite, (almost) linear in fact.
> 
> So the slowness for the testcase comes from failed attempts.

Of course.  Most attempts do not succeed; there aren't instructions for most
"random" combinations of instructions feeding each other.  But combine blindly
tries everything, that is its strength!  It ends up finding many more things
than any recognition automaton does.

> > Something that is the real complaint here: it seems we do not GC often
> > enough,
> > only after processing a BB (or EBB)?  That adds up for artificial code like
> > this, sure.
> 
> For memory use if you know combine doesn't have "dangling" links to GC memory
> you can call ggc_collect at any point you like.  Or, when you create
> throw-away RTL, ggc_free it explicitly (yeah, that only frees the
> "toplevel").

A lot of it *is* toplevel (well, completely disconnected RTX), just
temporaries,
things we can just throw away.  At every try_combine call even, kinda.  There
might be some more RTX that needs some protection.  We'll see.

> > And the "param to give an upper limit to how many combination attempts are
> > done
> > (per BB)" offer is on the table still, too.  I don't think it would ever be
> > useful (if you want your code to compile faster just write better code!),
> > but :-)
> 
> Well, while you say the number of successful combinations is linear, the
> number of combine attempts apparently isn't

It is, and that is pretty easy to show even.  With retries it stays linear, but
with a hefty constant.  And on some targets (with more than three inputs for
some instructions, say) it can be a big constant anyway.

But linear is linear, and stays linear, for way too big code it is just as
acceptable as for "normal" code.  Just slow.  If you don't want the compiler to
take a long time compiling your way too big code, use -O0, or preferably do not
write insane code in the first place :-)


> (well, of course, if we ever
> combine from multi-use defs).  So yeah, a param might be useful here but
> instead of some constant limit on the number of combine attempts per
> function or per BB it might make sense to instead limit it on the number
> of DEFs?

We still use loglinks in combine.  These are nice to prove that things stay
linear, even (every time combine succeeds a loglink is used up).

The number of loglinks and insns (insns combine can do anything with) differs
by a small constant factor.

> I understand we work on the uses

We work on the loglinks, a def-use pair if you want.

> so it'll be a bit hard to
> apply this in a way to, say, combine a DEF only with the N nearest uses
> (but not any ones farther out),

There is only a loglink from a def to the very next use.  If that combines, the
insn that does the def is retained as well, if there is any other use.  But
there never is a combination of a def with a later use tried, if the earliest
use does not combine.
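
(A tiny made-up example in the dump notation used elsewhere in this report:
if insns 10 and 11 combine, insn 10 is kept because insn 12 also uses r100;
if 10 and 11 do not combine, combining 10 into 12 is never attempted.)

   10: r100:DI=r200:DI+r300:DI
   11: r101:DI=r101:DI*r100:DI     <- the only loglink for r100 is 10 -> 11
   12: r102:DI=r102:DI-r100:DI     <- later use of r100, no loglink from 10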

> and maintaining such a count per DEF would
> cost.  So more practical might be to limit the number of attempts to combine
> into an (unchanged?) insn?
> 
> Basically I would hope with a hard limit in place we'd not stop after the
> first half of a BB leaving trivial combinations in the second half
> unhandled but instead somehow throttle the "expensive" cases?

Ideally we'll not do *any* artificial limitations.  But GCC just throws its hat
in the ring in other cases as well, say, too big RA problems.  You do get not
as high quality code as wanted, but at least you get something compiled in
an acceptable timeframe :-)

[Bug target/101523] Huge number of combine attempts

2024-03-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #37 from Richard Biener  ---
Created attachment 57753
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57753&action=edit
quick attempt at a limit

So like this?
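
(The attachment itself is not reproduced in this thread; judging from comments
#39 and #40 it checks an attempt counter against a new --param at each
try_combine call site, roughly along these lines - a guess, not the actual
patch:)

  /* Guessed shape only.  */
  if (++attempts > param_max_combine_attempts)
    break;   /* budget for this BB exhausted, stop trying */
  if ((next = try_combine (insn, links->insn, NULL, NULL,
                           &new_direct_jump_p, last_combined_insn)) != 0)
    goto retry;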

[Bug target/101523] Huge number of combine attempts

2024-03-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #36 from Richard Biener  ---
(In reply to Segher Boessenkool from comment #35)
> (In reply to Richard Biener from comment #34)
> > The change itself looks reasonable given costs, though maybe 2 -> 2
> > combinations should not trigger when the cost remains the same?  In
> > this case it definitely doesn't look profitable, does it?  Well,
> > in theory it might hide latency and the 2nd instruction can issue
> > at the same time as the first.
> 
> No, it definitely should be done.  As I showed back then, it costs less than
> 1%
> extra compile time on *any platform* on average, and it reduced code size by
> 1%-2%
> everywhere.
> 
> It also cannot get stuck, any combination is attempted only once, any
> combination
> that succeeds eats up a loglink.  It is finite, (almost) linear in fact.

So the slowness for the testcase comes from failed attempts.

> Any backend is free to say certain insns shouldn't combine at all.  This will
> lead to reduced performance though.
> 
> - ~ - ~ -
> 
> Something that is the real complaint here: it seems we do not GC often
> enough,
> only after processing a BB (or EBB)?  That adds up for artificial code like
> this, sure.

For memory use if you know combine doesn't have "dangling" links to GC memory
you can call ggc_collect at any point you like.  Or, when you create
throw-away RTL, ggc_free it explicitly (yeah, that only frees the "toplevel").

> And the "param to give an upper limit to how many combination attempts are
> done
> (per BB)" offer is on the table still, too.  I don't think it would ever be
> useful (if you want your code to compile faster just write better code!),
> but :-)

Well, while you say the number of successful combinations is linear, the
number of combine attempts apparently isn't (well, of course, if we ever
combine from multi-use defs).  So yeah, a param might be useful here but
instead of some constant limit on the number of combine attempts per
function or per BB it might make sense to instead limit it on the number
of DEFs?  I understand we work on the uses so it'll be a bit hard to
apply this in a way to, say, combine a DEF only with the N nearest uses
(but not any ones farther out), and maintaining such a count per DEF would
cost.  So more practical might be to limit the number of attempts to combine
into an (unchanged?) insn?

Basically I would hope with a hard limit in place we'd not stop after the
first half of a BB leaving trivial combinations in the second half
unhandled but instead somehow throttle the "expensive" cases?
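
(To make that last idea concrete, a hypothetical sketch - the param and the
array are made up, this is not a proposed patch:)

  /* Count try_combine attempts per destination insn (indexed by INSN_UID,
     array sized to get_max_uid ()) and give up on an i3 once a budget is
     exceeded.  */
  static int *attempts_into_insn;

  if (++attempts_into_insn[INSN_UID (i3)] > param_max_combine_attempts_per_insn)
    return 0;   /* stop considering this i3 */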

[Bug target/101523] Huge number of combine attempts

2024-03-21 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #35 from Segher Boessenkool  ---
(In reply to Richard Biener from comment #34)
> The change itself looks reasonable given costs, though maybe 2 -> 2
> combinations should not trigger when the cost remains the same?  In
> this case it definitely doesn't look profitable, does it?  Well,
> in theory it might hide latency and the 2nd instruction can issue
> at the same time as the first.

No, it definitely should be done.  As I showed back then, it costs less than 1%
extra compile time on *any platform* on average, and it reduced code size by
1%-2%
everywhere.

It also cannot get stuck, any combination is attempted only once, any
combination
that succeeds eats up a loglink.  It is finite, (almost) linear in fact.

Any backend is free to say certain insns shouldn't combine at all.  This will
lead to reduced performance though.

- ~ - ~ -

Something that is the real complaint here: it seems we do not GC often enough,
only after processing a BB (or EBB)?  That adds up for artificial code like
this, sure.

And the "param to give an upper limit to how many combination attempts are done
(per BB)" offer is on the table still, too.  I don't think it would ever be
useful (if you want your code to compile faster just write better code!), but
:-)

[Bug target/101523] Huge number of combine attempts

2024-03-20 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #34 from Richard Biener  ---
(In reply to Andreas Krebbel from comment #1)
> This appears to be triggered by try_combine unnecessarily setting back the
> position by returning the i2 insn.
> 
> When 866 is inserted into 973, 866 still needs to be kept around for other
> users. So try_combine first merges the two sets into a parallel and
> immediately notices that this can't be recognized. Because none of the sets
> is a trivial move, it is split again into two separate insns. Although the
> new i2 pattern exactly matches the input i2, combine considers this to be a
> new insn, triggers all the scanning and log link creation, and eventually
> returns it, which lets combine start all over at 866.
> 
> Due to that, combine tries many of the substitutions more than 400x.
> 
> Trying 866 -> 973:
>   866: r22393:DI=r22391:DI+r22392:DI
>   973: r22499:DF=r22498:DF*[r22393:DI]
>   REG_DEAD r22498:DF
> Failed to match this instruction:
> (parallel [
> (set (reg:DF 22499)
> (mult:DF (reg:DF 22498)
> (mem:DF (plus:DI (reg/f:DI 22391 [ _85085 ])
> (reg:DI 22392 [ _85086 ])) [17 *_85087+0 S8 A64])))
> (set (reg/f:DI 22393 [ _85087 ])
> (plus:DI (reg/f:DI 22391 [ _85085 ])
> (reg:DI 22392 [ _85086 ])))
> ])
> Failed to match this instruction:
> (parallel [
> (set (reg:DF 22499)
> (mult:DF (reg:DF 22498)
> (mem:DF (plus:DI (reg/f:DI 22391 [ _85085 ])
> (reg:DI 22392 [ _85086 ])) [17 *_85087+0 S8 A64])))
> (set (reg/f:DI 22393 [ _85087 ])
> (plus:DI (reg/f:DI 22391 [ _85085 ])
> (reg:DI 22392 [ _85086 ])))
> ])
> Successfully matched this instruction:
> (set (reg/f:DI 22393 [ _85087 ])
> (plus:DI (reg/f:DI 22391 [ _85085 ])
> (reg:DI 22392 [ _85086 ])))

So this is "unchanged", do we re-combine into it?

> Successfully matched this instruction:
> (set (reg:DF 22499)
> (mult:DF (reg:DF 22498)
> (mem:DF (plus:DI (reg/f:DI 22391 [ _85085 ])
> (reg:DI 22392 [ _85086 ])) [17 *_85087+0 S8 A64])))

This one is changed.

> allowing combination of insns 866 and 973
> original costs 4 + 4 = 8
> replacement costs 4 + 4 = 8
> modifying insn i2   866: r22393:DI=r22391:DI+r22392:DI
> deferring rescan insn with uid = 866.
> modifying insn i3   973: r22499:DF=r22498:DF*[r22391:DI+r22392:DI]
>   REG_DEAD r22498:DF
> deferring rescan insn with uid = 973.

The change itself looks reasonable given costs, though maybe 2 -> 2
combinations should not trigger when the cost remains the same?  In
this case it definitely doesn't look profitable, does it?  Well,
in theory it might hide latency and the 2nd instruction can issue
at the same time as the first.
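
(To make the "2 -> 2 combinations should perhaps require a strict win" idea
concrete, a purely illustrative sketch; the flag and variable names are made
up and this is not a proposed patch:)

  /* In combine's cost validation, demand a strict cost improvement when two
     insns are merely rewritten into two other insns.  */
  bool two_to_two = (i1 == NULL && i0 == NULL && newi2pat != NULL);
  if (two_to_two && reject_equal_cost_2_2 && new_cost >= old_cost)
    reject = true;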