> From: Segher Boessenkool <seg...@kernel.crashing.org>
> Date: Tue, 7 Jul 2020 01:42:47 +0200

TL;DR: recognize a parallel with a clobber of TARGET_FLAGS_REGNUM?

> Hi!
> 
> On Mon, Jul 06, 2020 at 03:11:17AM +0200, Hans-Peter Nilsson wrote:
> > TL;DR: fixing a misdetection of what is a "simple move".
> 
> That is not a very correct characterisation of what this does :-)

That's apparently where we completely disagree. :-)

> > Looking into performace degradation after de-cc0 for CRIS, I
> > noticed combine behaving badly; it changed a move and a
> > right-shift into two right-shifts, where the "combined" move was
> > not eliminated in later passes, and where the deficiency caused
> > an extra insn in a hot loop: crcu16 (and crcu32) in coremark.
> > 
> > Before de-cc0, the insns input to combine looked like:
> >    33: r58:SI=r56:SI 0>>r48:SI
> >       REG_DEAD r56:SI
> >    35: r37:HI=r58:SI#0
> > and after:
> >    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
> >       REG_DEAD r56:SI
> >       REG_UNUSED dccr:CC
> >    35: {r37:HI=r58:SI#0;clobber dccr:CC;}
> >       REG_UNUSED dccr:CC
> 
> So a shift like this is at most as expensive as a move, on your target
> (or, in your backend, anyway ;-) )

On CRIS, the backend *and* the target, yes; one cycle, one short
instruction.

> > That is, there's always a parallel with a clobber of the
> > condition-codes register.  Being a parallel, it's not an
> > is_just_move, but e.g. a single_set.
> > 
> > For the de-cc0:ed "combination", it ended up as
> >    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
> >       REG_UNUSED dccr:CC
> >    35: {r37:HI#0=r56:SI 0>>r48:SI;clobber dccr:CC;}
> >       REG_DEAD r56:SI
> >       REG_UNUSED dccr:CC
> > That is, a move and a shift turned into two shifts; the extra
> > shift is not eliminated by later passes, while the move was
> > (with cc0, and "will be again") leading to redundant
> > instructions.
> 
> Which was the whole point of the is_just_move() thing, yes.  Combine
> doesn't know most moves will be eliminated by RA (but many are useful to
> do have before RA, because it gives RA much more freedom).  If a move is
> the same cost as a simple insn, doing two (say shift, like here) insns
> in parallel is cheaper on most machines than having a shift and a move
> sequentially.

Most parallel machines you mean, but why bring up them when
there's no means for combine to tell the difference?  Here, the
end result is that it *added* an instruction to the hot loop.
It's a deficiency and it caused a performance regression, can't
argue with that.

>  (2-2 combinations are helpful on single-scalar and even
> in-order machines as well, btw).

I certainly don't contest that the move can be eliminated, and
that most cost-effective 2-2 eliminations are helpful.  (See my
other post about combine being eager with allowing same-cost
combinations.)

> > At first I thought this was due to parallel-ignorant old code
> > but the "guilty" change is actually pretty recent.  Regarding a
> > parallel with a clobber not being "just" a move, there's only
> > the two adjacent callers seen in the patch (obviously with the
> > rename), and their use exactly matches to check that the
> > argument is a single_set which is a move.  It's always applied
> > to an rtx_insn, so I changed the type and name to avoid the
> > "just" issue.  I had to adjust the type when calling single_set.
> 
> But it is *not* supposed to be the same as single_set.

(I'm not saying that a single_set is a move.  But that's
obvious.)  I guess you meant to say that a single_set with a
general_operand as a source (as in the patch) is not supposed to
be the same as a move.  This is the only place in combine where
that distinction would be important.  Why?

I think you're just overconcious about clobbers here.  The
combine pass throws and adds them as it pleases in other places,
so I'm not sure what your worries are in *this* place.  (It
collects them and adds them to the combinations.)

> > I checked the original commit, c4c5ad1d6d1e1e a.k.a r263067 and
> > it seems parallels-as-sets were just overlooked and that this
> 
> They were not.  It causes regressions.

Do you have some pointers to PR:s or something else backing that
statement, or is it your work-in-progress hinted below?

>  That is why it has a different
> name, not something with "single set".  "just_move" isn't a very good
> name, I couldn't come up with something better, it is a pretty
> complicated concept :-/
> 
> "i2_should_not_be_2_2_combined"?
> 
> I'll rerun some testing to show this.  It'll take a while.

Care to fill in some details on what kind of testing?

> > patch appears to agree with the intent and the comment at the
> > use of i2_was_move and i3_was_move, which has a clause saying
> > "Also do this if we started with two insns neither of which was
> > a simple move".
> 
> That says that we combine to 2 insns also if we started with only 2,
> but neither of those was just a move.

That's what I'm saying too.

> > With this correction in place, the performance degradation
> > related to de-cc0 of the CRIS port as measured by coremark is
> > gone and turned into a small win.  N.B.: there certainly is a
> > code difference in other hot functions, and the swing between
> > different functions is larger than this difference; to be dealt
> > with separately.
> > 
> > Tested cris-elf, x86_64-linux, powerpc64le-linux, 2/3 through
> > aarch64-linux (unexpectedly slow).
> > 
> > Ok to commit?
> 
> No, sorry.

Sigh.  I'm very interested in what your investigation will show.

> One thing that could work is allowing (unused) clobbers of fixed
> registers (like you have here), or maybe some hook is needed to say this
> register is like a flags register, or similar.  That should work for you,
> and not regress other targets, maybe even help a little?  We'll see.

Still, there is already TARGET_FLAGS_REGNUM (a "hooked"
constant), so I take it you would be happy if we recognize a
clobber of *just that*, in a parallel?  (I'll take care of
updating tm.texi of course.)

brgds, H-P

Reply via email to