On Thu, May 15, 2014 at 9:26 AM, bin.cheng wrote:
> Hi,
> Targets like ARM and AARCH64 support double-word load store instructions,
> and these instructions are generally faster than the corresponding two
> load/stores.  GCC currently uses peephole2 to merge paired load/store into
> one single instruction which has a disadvantage.  It can only handle simple
> cases like the two instructions actually appear sequentially in instruction
> stream, and is too weak to handle cases in which the two load/store are
> intervened by other irrelevant instructions.
>
> Here comes up with a new GCC pass looking through each basic block and
> merging paired load store even they are not adjacent to each other.  The
> algorithm is pretty simple:
> 1) In initialization pass iterating over instruction stream it collects
> relevant memory access information for each instruction.
> 2) It iterates over each basic block, tries to find possible paired
> instruction for each memory access instruction.  During this work, it checks
> dependencies between the two possible instructions and also records the
> information indicating how to pair the two instructions.  To avoid quadratic
> behavior of the algorithm, It introduces new parameter
> max-merge-paired-loadstore-distance and set the default value to 4, which is
> large enough to catch major part of opportunities on ARM/cortex-a15.
> 3) For each candidate pair, it calls back-end's hook to do target dependent
> check and merge the two instructions if possible.
>
> Though the parameter is set to 4, for miscellaneous benchmarks, this pass
> can merge numerous opportunities except ones already merged by peephole2
> (same level numbers of opportunities comparing to peepholed ones).  GCC
> bootstrap can also confirm this finding.
>
> Yet there is an open issue about when we should run this new pass.  Though
> register renaming is disabled by default now, I put this pass after it,
> because renaming can resolve some false dependencies thus benefit this pass.
> Another finding is, it can capture a lot more opportunities if it's after
> sched2, but I am not sure whether it will mess up with scheduling results in
> this way.
>
> So, any comments about this?

First off: Why does this need a target hook? We're getting more and
more of them -- too many IMHO. There should be really good reasons for
adding even more new ones.

Does this pass have to run after scheduling? The way you describe it,
this sounds like a scheduling problem, where you don't need regrename
to resolve false dependencies. Your sched2 pass should be able to
prioritize mergeable loads/stores to schedule them adjacent. Of if you
must do this before scheduling, then at least do it before sched2. Now
you're really ruining the order of the scheduled instructions, which
seems bad.

I don't see how regrename will help resolve [base+offset] false
dependencies. Can you explain? I'd expect effects from
hardreg-copyprop "commoning" a base register.

ChangeLog is missing the entry for arm.c.

Your pass should make those peephole2's redundant, so you should
remove the relevant define_peephole2's.


>  +   generated by spilling during reigster allocation.  To catch all

s/reigster/register/


> +   whose address expression is in the form of "[base_offset]".  It

s/[base_offset]/[base+offset]/


> +   only guarantee that the two consecutive memory access instructions

s/guarantee/guarantees/


> +   has no data dependency issues.  After a pair is found it calls to

s/has/have/


> +/* Maximal number of memory references we support in single isntruction.  */

s/Maximal/Maximum/
s/isntruction/instruction/


> +/* Check instruction recorded in PINSN_INFO to see if it is the
> +   first instruction of load store pair.  If yes, record the
> +   information in PPAIR_INFO and return TRUE.  */
> +
> +static bool
> +find_pair_x_insn (struct pair_info_t *ppair_info,

The function description doesn't seem to match the function
implementation. There is nothing in find_pair_x_insn that looks for a
load/store pair.


> +  /* Don't bother if base is a register and we are modifying it.  */
> +  if (REG_P (base) && reg_modified_p (base, insn))
> +    return false;

You can shortcut this if you're looking at a load.


> +  for (; *ref_rec; ref_rec++)

Nit: "while (*ref_rec)"


> +  /* Can't be paired if memory refs have different mode.  */
> +  if (GET_MODE (mem) != mode)
> +    return false;

Not even if same GET_MODE_SIZE?


> +  /* Clear PAIR_SINK if the memory unit of the first instruction
> +     is clobbered between.  */
> +  if ((pinfo->pair_kind & PAIR_SINK) != 0
> +      && MEM_CLOBBERED_P (pinfo->mem_clob, MEM_CLOB_SELF))
> +    pinfo->pair_kind &= (~PAIR_SINK);

This has the smell of architecture-specific behavior. Maybe you can't
even pair instructions if there is a MEM_CLOB_SELF involved. Likewise
for some of the checks in valid_insn_pair_p following the one quoted
above.


> +      HOST_WIDE_INT offset_t;

We generally reserve "_t" for types, so I'd recommend a different name
for this variable.



> +find_pair_y_insn

You're using the fixed distance here. Can't you instead walk the
entire basic block once and build dependence chains?

A distance of 4 seems rather arbitrary. Have you tried other maximum
distances to see how many extra cases you catch/miss?


> +  FOR_EACH_BB_FN (bb, cfun)
> +    {
> +      if (bb->index < NUM_FIXED_BLOCKS)
> + continue;

Redundant check. You only run into ENTRY/EXIT if you use
FOR_ALL_BB_FN. With FOR_EACH_BB you only visit blocks between
ENTRY/EXIT.


> +      for (; insn != NULL_RTX && insn != BB_END (bb); insn = next)

Off-by-one: If you stop at BB_END like this, then you're overlooking
the last instruction in the basic block for pair matching. Stop at
NEXT_INSN (BB_END (bb)), like FOR_BB_INSNS.
(In a perfect world, we'd have proper insn iterators...)

That's all for now! ;-)

Ciao!
Steven

Reply via email to