Joern Rennecke <joern.renne...@embecosm.com> writes:
> Quoting Richard Sandiford <rdsandif...@googlemail.com>:
>
>> But what I'm trying to get at is: why can't the backend tell
>> shorten_branches about the amount of alignment/misalignment
>> that the target wants, and where?  Via an attribute or a hook,
>> I don't mind which.  But it should be declarative, rather than
>> a part of the shorten_branches algorithm.
>
> So, if we want to go with a declarative approach, each instruction can have
> a plurality of variants, each described by a size, an instruction fetch
> alignment base, an instruction fetch alignment offset set, a cache line base,
> a cache line offset set, a cost as a branch destination - cached,
> a cost as a branch destination - uncached, a fall-through predicted cost,
> a fall-through mispredicted cost, a branch-taken-predicted cost,
> a branch-taken-mispredicted cost.
> Or, if we want to get into more detail, we should actually have scheduler
> instruction reservation descriptions instead of singular costs.
>
> A variant's availability and costs will depend on branch offsets,
> the presence, kind and size of delay slot instructions and further
> length-sensitive context (like the set of paths from the function entry,
> with aggregate instruction fetch sizes and probabilities), so you have
> to call a target hook each iteration to get the currently available set
> of variants.
>
> Then you have to see this in the context of relative execution frequencies
> (for cached/uncached branch destinations), and branch probabilities and
> the reliability of the probability information, and to-be-added annotations
> about branch predictability, to arrive at a near-optimal solution in
> reasonable time.

Hey, I wasn't suggesting anything like that :-)  I was just asking
about alignment.  It seemed to me that part of the problem is that
ARC wants to increase the length of an instruction in order to align
a following instruction (which I agree is a useful thing to support).
But as things stand, shorten_branches has no way of distinguishing that from
an instruction that was forced to grow because of offsets being out of range.

If instead shorten_branches knows that an instruction prefers a particular
alignment/misalignment, it can cooperate with the backend to extend earlier
instructions to make that possible.  If we do it that way, shorten_branches
knows what's going on.
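
Just to make that concrete, here's a rough sketch of what a declarative
interface could look like.  Everything below is invented for illustration;
neither the hook nor the structure exists, and the ARC condition is made up:

/* Purely illustrative -- none of these names exist in GCC.  The backend
   *describes* its placement preference and shorten_branches decides how
   (and whether) to honour it, e.g. by growing an earlier instruction.  */

struct insn_placement_pref
{
  int align;    /* power-of-two alignment the insn cares about, in bytes  */
  int offset;   /* preferred byte offset from that alignment boundary     */
  int benefit;  /* rough gain (e.g. cycles) if the preference is honoured */
};

/* Hypothetical hook: return true and fill *PREF if INSN has a
   placement preference.  */
static bool
arc_insn_placement_pref (rtx insn, struct insn_placement_pref *pref)
{
  /* Made-up condition: say a short loop-back branch would like to start
     2 bytes into a 4-byte fetch block.  */
  if (JUMP_P (insn) && get_attr_length (insn) == 2)
    {
      pref->align = 4;
      pref->offset = 2;
      pref->benefit = 1;
      return true;
    }
  return false;
}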

You quote lots of complications above; that's also my concern.  Directly
hooking into ADJUST_INSN_LENGTH restricts us to a purely peephole optimisation
with no view wider than the current instruction.  E.g. for:

      insn1
      insn2
      insn3 -- wants to be misaligned by 2 bytes

hooking directly into ADJUST_INSN_LENGTH doesn't make it easy to consider
using insn1 rather than insn2 to provide the misalignment.
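
With a declarative preference the picture is different: something with a
view over the whole block could scan back for the cheapest instruction to
grow.  A very rough sketch, where can_use_longer_encoding_p and
set_extra_length are hypothetical backend queries, not existing functions:

/* Illustrative only: satisfy insn3's misalignment request by growing
   whichever earlier insn can cheaply use a longer encoding.  */
static void
provide_misalignment (rtx insn3, int extra_bytes)
{
  rtx prev;
  for (prev = PREV_INSN (insn3); prev; prev = PREV_INSN (prev))
    {
      if (LABEL_P (prev) || JUMP_P (prev))
        break;                  /* don't pad across a label or branch */
      if (!NONJUMP_INSN_P (prev))
        continue;
      if (can_use_longer_encoding_p (prev, extra_bytes))
        {
          set_extra_length (prev, extra_bytes);
          return;
        }
    }
}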

>> ...it looks like arc_reorg already runs shorten_branches in a loop.
>> Is that right?  So in a sense we already have iteration involving
>> shorten_branches and the backend.
>
> Actually, that is mostly for instruction selection/splitting/combination
> for compare-and-branch/test-bit-and-branch instructions, which have
> very stringent offset requirements; these decisions need to be made
> (mostly; there is code to handle rare out-of-range offsets later, but it
>   makes code quality worse)
> before delayed branch scheduling and final conditional execution decisions.
>
> But the fine-tuning of instruction sizes and alignments happens later in
> the original branch shortening pass.

OK, but the same kind of iteration seems acceptable after delayed-branch
scheduling too.

>> I realise we want to cut down the number of iterations where possible,
>> but at the same time, it seems like you're trying to do conditional
>> execution conversion on the fly during shorten_branches, such that the
>> rtl doesn't actually represent the instructions we output.  And the
>> performance problem you're seeing is that those behind-the-scenes
>> changes are too confusing for the current shorten_branches code to
>> handle well.  E.g. sometimes you decide to "delete" branches that stay
>> in the rtl stream as branches, and that got a length in a previous
>> iteration.
>
> No, not really.  Within each branch shortening pass, these decisions
> are static; they may change when the branches are modified after
> early branch shortening.  They may also change after delay slot scheduling.
>
> If we want to get really precise with rtl, I have to alter it in every
> branch shortening iteration whenever the decision about whether a
> compare-and-branch / test-bit-and-branch is within range changes.

I was suggesting that once you've committed to something (which I think
in your proposal corresponds to the iteration count becoming too high)
that commitment should be reflected in the rtl.  That also stops the
decision from being accidentally reversed.

>> At some point, as you say, you have to commit to using a particular
>> form (conditionally executed or not) in order to avoid cycles.  But that
>> seems like a conditional execution decision rather than a shorten_branches
>> decision.  (It'd also be good to represent it in the rtl if possible.)
>
> Well, I could do this with additional machine-dependent rtl passes, but
> this is really a (costly, in terms of development time and maintenance)
> solution looking for a problem.
>
> The ARC ccfsm machinery was originally only run at instruction output
> time, and I modified it so I can run it earlier to tell what changes
> it would make to the instructions and their lengths to get better quality
> length predictions (most importantly, when optimizing for speed, it can
> at times increase code size slightly, which can be catastrophic when we
> committed to using a short branch at the edge of its offset range).
> Running the same code to make or to predict decisions avoids the
> out-of-sync issues I would have with a modified, rtl-modifying copy.
> By not actually modifying rtl, I don't have to allocate new uids during
> branch shortening (which would require branch shortening to restart from
> scratch or reallocate arrays and somehow map different instructions and
> uids), nor do I have to convert rtl back as the input changes, with the
> hazard of growing the rtl each iteration.

But I think extending the FSM from just being an on-the-fly
transformation during final (as for ARM) to something that happens
earlier is a pretty major step in terms of the interface between the
backend and the rest of GCC.  There are then undocumented restrictions
on what hooks can be called when, which if broken could lead to the FSM
producing different results during final than it did earlier.

E.g., personally, I don't think it's acceptable for ADJUST_INSN_LENGTH
to hard-code an assumption that all instructions are processed in
assembly order, which the ARC one seems to (irrespective of this patch).
That is, whenever ARC's ADJUST_INSN_LENGTH is called, it seems to
"simulate" the instructions between the current instruction and the
one previously passed to ADJUST_INSN_LENGTH, so it will break if we
ever try to take lengths in a different order.
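
To caricature the problem (this is simplified pseudo-code, not the actual
ARC implementation): a hook that remembers the last insn it was asked
about and replays a state machine forward from there is only correct if
every caller walks the insns in assembly order:

/* simulate_state_step and state_extra_length stand in for the ccfsm
   updates; both are hypothetical.  */
static rtx last_queried;

static int
fragile_adjust_insn_length (rtx insn, int len)
{
  rtx p;
  /* Replay state over everything since the previous query.  If a caller
     ever asks about an earlier insn, or skips around, this state is
     silently wrong.  */
  for (p = last_queried ? NEXT_INSN (last_queried) : get_insns ();
       p && p != insn; p = NEXT_INSN (p))
    simulate_state_step (p);
  last_queried = insn;
  return len + state_extra_length (insn);
}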

I talked earlier about doing predication on the fly, but the ARC hook also
does alignment optimisations on the fly.  AIUI, the alignment optimisations
can increase offsets, and increasing offsets interferes with the predication.
As mentioned above, my main problem with this is that you seem to be saying
"there are lots of complications, so we have to do everything at once",
but at the same time you're suggesting something that forces both the
target-independent and ARC code to process things in assembly order and
look at each instruction in isolation.  It makes it very difficult to do
any of the three optimisations (branch shortening, predication, and alignment)
in a less peephole way in future, such as by taking execution frequency
into account.

OTOH, I realise I'm probably not being helpful here.  I've said my piece,
so I'd better bow out.

Richard
