Quoting Richard Sandiford <rdsandif...@googlemail.com>:

> But what I'm trying to get at is: why can't the backend tell
> shorten_branches about the amount of alignment/misalignment
> that the target wants, and where?  Via an attribute or a hook,
> I don't mind which.  But it should be declarative, rather than
> a part of the shorten_branches algorithm.

So, if we want to go with a declarative approach, each instruction can have
a plurality of variants, each described by:
- a size
- an instruction fetch alignment base
- an instruction fetch alignment offset set
- a cache line base
- a cache line offset set
- a cost as a branch destination (cached)
- a cost as a branch destination (uncached)
- a fall-through predicted cost
- a fall-through mispredicted cost
- a branch-taken predicted cost
- a branch-taken mispredicted cost
Or, if we want to get into more detail, we should actually have scheduler
instruction reservation descriptions instead of singular costs.
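
To make that concrete, such a variant description could look roughly like
the struct below.  This is only an illustrative sketch - the type and field
names are invented here, not an existing GCC interface:

/* Illustrative sketch only; the names are invented.  One entry describes
   one variant of an instruction.  */
struct insn_variant
{
  int size;                        /* variant size in bytes */
  int fetch_align_base;            /* instruction fetch alignment base, e.g. 4 */
  unsigned int fetch_offset_set;   /* bitmask of acceptable fetch offsets */
  int cache_line_base;             /* cache line size the next set refers to */
  unsigned int cache_offset_set;   /* bitmask of acceptable cache line offsets */
  int cost_dest_cached;            /* cost as branch destination, line cached */
  int cost_dest_uncached;          /* cost as branch destination, line uncached */
  int cost_fallthru_predicted;     /* fall-through, correctly predicted */
  int cost_fallthru_mispredicted;  /* fall-through, mispredicted */
  int cost_taken_predicted;        /* branch taken, correctly predicted */
  int cost_taken_mispredicted;     /* branch taken, mispredicted */
};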

A variant's availability and costs will depend on branch offsets,
the presence, kind and size of delay slot instructions, and further
length-sensitive context (like the set of paths from the function entry,
with aggregate instruction fetch sizes and probabilities), so you have
to call a target hook each iteration to get the currently available set
of variants.
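
The hook could have a shape roughly like the following - again purely
illustrative, this is not an existing target hook, and the parameter list
is just a guess at the context it would need:

/* Hypothetical hook: called in each shorten_branches iteration so the
   target can report which variants of INSN are currently usable, given
   the insn's current address, the branch offset and the delay slot
   contents.  Returns the number of entries written to VARIANTS (at most
   MAX_VARIANTS).  */
int
target_available_variants (rtx insn, int insn_address, int branch_offset,
                           int delay_slot_size,
                           struct insn_variant *variants, int max_variants);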

Then you have to see this in the context of relative execution frequencies
(for cached/uncached branch destinations), branch probabilities and
the reliability of the probability information, and to-be-added annotations
about branch predictability, to arrive at a near-optimal solution in
reasonable time.
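
Spelled out, the objective per branch would be something like the expected
cost below (using the insn_variant sketch from above; this is just to make
explicit what "near-optimal" would be measured against, not code from any
port):

/* Expected dynamic cost of picking variant V for a conditional branch,
   weighting the per-variant costs by block frequency, branch probability
   and an assumed prediction accuracy.  Illustrative only.  */
static double
variant_expected_cost (const struct insn_variant *v, double block_freq,
                       double taken_prob, double predict_accuracy)
{
  double taken
    = taken_prob * (predict_accuracy * v->cost_taken_predicted
                    + (1.0 - predict_accuracy) * v->cost_taken_mispredicted);
  double fallthru
    = (1.0 - taken_prob) * (predict_accuracy * v->cost_fallthru_predicted
                            + (1.0 - predict_accuracy)
                              * v->cost_fallthru_mispredicted);
  return block_freq * (taken + fallthru);
}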

Do you have any suggestions about what algorithm to use?

FWIW, the current ARC description with the adjust_insn_length hook patch
uses instruction fetch offsets 0 and 2 to base 4, branch-taken/non-taken
costs with implicit machine-specific heuristics about predictability
(which are not as good as I'd like them to be, but better than nothing),
length-sensitive context information, branch probabilities and their
reliability, and delay slot instruction presence and size.  There are
some canned rules about how different requirements interact with each
other to get at a converging solution quickly and with minimal compromises.
All of the other mentioned features would in principle be interesting, but
so far not enough to justify implementing them in the previous
length/lock_length attribute framework (which was prone to cycles if you
were not very careful when adding new rules).
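
To illustrate the fetch-offset part of that (schematically - this is not
the actual ARC hook, just the kind of decision it makes): a short insn in
front of a performance-critical branch target can be upsized so that the
target ends up at offset 0 rather than 2 within a 4-byte fetch block:

/* Schematic only, not the ARC implementation.  Return the length to use
   for a 2-byte insn that precedes a label we want fetch-aligned.  */
static int
example_adjust_insn_length (int insn_address, int unadjusted_length,
                            int next_label_wants_alignment)
{
  if (unadjusted_length == 2
      && next_label_wants_alignment
      && (insn_address + unadjusted_length) % 4 == 2)
    return 4;                /* use the long encoding as padding */
  return unadjusted_length;  /* keep the short encoding */
}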

I believe other ports would have different priorities, possibly within the
set I enumerated, maybe going beyond it.

> ...it looks like arc_reorg already runs shorten_branches in a loop.
> Is that right?  So in a sense we already have iteration involving
> shorten_branches and the backend.

Actually, that is mostly for instruction selection/splitting/combination
for compare-and-branch / test-bit-and-branch instructions, which have
very stringent offset requirements; these decisions need to be made
before delayed branch scheduling and final conditional execution decisions
(mostly; there is code to handle rare out-of-range offsets later, but it
makes code quality worse).
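
The shape of that iteration is roughly the following (simplified - the
real arc_reorg does considerably more, and the rewriting helper here is
only a hypothetical stand-in):

/* Stand-in for the target-specific rewriting: re-select compare-and-branch
   / test-bit-and-branch forms based on the offsets just computed; return
   nonzero if any insn changed.  */
static int example_reselect_branch_forms (void);

static void
example_machine_reorg (void)
{
  int changed;
  do
    {
      /* Recompute insn addresses and lengths for the current selection.  */
      shorten_branches (get_insns ());
      changed = example_reselect_branch_forms ();
    }
  while (changed);
}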

But the fine-tuning of instruction sizes and alignments happens later in
the original branch shortening pass.

> I realise we want to cut down the number of iterations where possible,
> but at the same time, it seems like you're trying to do conditional
> execution conversion on the fly during shorten_branches, such that the
> rtl doesn't actually represent the instructions we output.  And the
> performance problem you're seeing is that those behind-the-scenes
> changes are too confusing for the current shorten_branches code to
> handle well.  E.g. sometimes you decide to "delete" branches that stay
> in the rtl stream as branches, and that got a length in a previous
> iteration.

No, not really.  Within each branch shortening pass, these decisions
are static; they may change when the branches are modified after
early branch shortening.  They may also change after delay slot scheduling.

If we want to get really precise with rtl, I would have to alter it in every
branch shortening iteration whenever the decision changes whether a
compare-and-branch / test-bit-and-branch is within range or not.
If it's a single instruction, it is transparent to previous flag settings;
if it isn't, we might be able to use the flag setting result for a
successive branch or conditional execution.
So then we'd need a way to describe this rtl modification depending on
branch offset ranges.
But that still won't get rid of the ARC requirement of running an extra
branch shortening pass, unless you want to roll the delay slot scheduling
into branch shortening as well.
Others claim that delayed branch scheduling should be integrated with the
normal scheduler, while there are also good arguments to integrate the
scheduler with the register allocator.  Just roll it all into one big
pile of hair.

Going back to pragmatic issues...

The problem I had with the original GCC 4.4 framework was that branch
shortening would not terminate for some functions when there is non-linear
positive and negative feedback.
The problem I have with the framework in the current trunk is that it
applies length locking even to instructions that it should not apply to -
or at least, not so quickly - namely the short insns that are upsized
for alignment purposes.  When the input alignment changes, all previous
decisions are wrong.
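
A minimal illustration of that failure mode (schematic, not the trunk
code): if the generic pass enforces a "lengths never shrink" rule, as in

/* Keep a branch long once an offset has ever forced the long form.  */
static int
locked_length (int previous_length, int recomputed_length)
{
  return recomputed_length > previous_length
         ? recomputed_length : previous_length;
}

then an insn that was upsized from 2 to 4 bytes purely to align the
following branch target stays at 4 bytes even after its own address has
changed and the padding no longer helps; its length should instead be
recomputed from the current alignment input each iteration, e.g.

/* Length of a short insn used as alignment padding for the next label
   (schematic, using the 0/2-within-4 fetch offsets mentioned above).  */
static int
alignment_padding_length (int insn_address, int short_length)
{
  return (insn_address + short_length) % 4 == 2 ? 4 : short_length;
}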

> At some point, as you say, you have to commit to using a particular
> form (conditionally executed or not) in order to avoid cycles.  But that
> seems like a conditional execution decision rather than a shorten_branches
> decision.  (It'd also be good to represent it in the rtl if possible.)

Well, I could do this with additional machine-dependent rtl passes, but
this is really a (costly, in terms of development time and maintenance)
solution looking for a problem.

The ARC ccfsm machinery was originally only run at instruction output
time, and I modified it so I can run it earlier to tell what changes
it would make to the instructions and their lengths to get better quality
length predictions (most importantly, when optimizing for speed, it can
at times increase code size slightly, which can be catastrophic when we
have committed to using a short branch at the edge of its offset range).
Running the same code to make or to predict decisions avoids the
out-of-sync issues I would have with a separate, rtl-modifying copy of it.
By not actually modifying rtl, I don't have to allocate new uids during branch
shortening (which would require branch shortening to restart from scratch or
reallocate arrays and somehow map different instructions and uids), nor do
I have to convert rtl back as the input changes, with the hazard of the
rtl growing each iteration.
