On Thu, May 15, 2014 at 12:26 AM, bin.cheng <bin.ch...@arm.com> wrote:
> Hi,
> Targets like ARM and AArch64 support double-word load/store instructions,
> and these instructions are generally faster than the corresponding pairs of
> single loads/stores.  GCC currently uses peephole2 to merge a load/store
> pair into a single instruction, which has a disadvantage: it only handles
> the simple case in which the two instructions appear adjacent in the
> instruction stream, and it is too weak to handle cases in which the two
> loads/stores are separated by other, unrelated instructions.
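>
> For example, peephole2 misses a case like the following (the ARM code in
> the comments is only illustrative, not actual compiler output):
>
>   struct pair { int a, b; };
>
>   int
>   sum (struct pair *p, int k)
>   {
>     int x, y;
>     x = p->a;          /* ldr   r2, [r0]                                  */
>     k = k + 1;         /* add   r1, r1, #1   (unrelated insn in between)  */
>     y = p->b;          /* ldr   r3, [r0, #4]                              */
>     return x + y + k;  /* peephole2 only matches the two ldr's when they
>                           are adjacent, so it cannot form
>                           ldrd r2, r3, [r0] here.                         */
>   }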
>
> This patch adds a new GCC pass that looks through each basic block and
> merges paired loads/stores even when they are not adjacent to each other.
> The algorithm is pretty simple (a rough sketch follows the list):
> 1) In an initialization pass over the instruction stream, it collects the
> relevant memory access information for each instruction.
> 2) It iterates over each basic block and tries to find a possible partner
> instruction for each memory access instruction.  While doing so, it checks
> the dependencies between the two candidate instructions and records the
> information describing how to pair them.  To avoid quadratic behavior, it
> introduces a new parameter, max-merge-paired-loadstore-distance, with a
> default value of 4, which is large enough to catch the major part of the
> opportunities on ARM/Cortex-A15.
> 3) For each candidate pair, it calls a back-end hook to do the
> target-dependent check and merge the two instructions if possible.
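>
> Roughly, the pairing search looks like the sketch below.  It is
> deliberately simplified, and the names (mem_insn_info, mergeable_pair_p,
> MAX_DISTANCE, find_pairs) are not the actual identifiers in the patch:
>
>   #include <stdbool.h>
>   #include <stddef.h>
>
>   #define MAX_DISTANCE 4   /* max-merge-paired-loadstore-distance */
>
>   /* Per-insn memory access info collected in step 1.  */
>   struct mem_insn_info
>   {
>     bool is_mem;      /* insn is a simple load or store  */
>     bool is_load;     /* load vs. store                  */
>     int base_reg;     /* base register of the address    */
>     int offset;       /* constant offset from the base   */
>     int size;         /* access size in bytes            */
>   };
>
>   /* True if A and B access adjacent slots off the same base register,
>      so they are candidates for merging (step 2).  The real pass must
>      also check data dependencies between the two instructions.  */
>   static bool
>   mergeable_pair_p (const struct mem_insn_info *a,
>                     const struct mem_insn_info *b)
>   {
>     return a->is_mem && b->is_mem
>            && a->is_load == b->is_load
>            && a->base_reg == b->base_reg
>            && a->size == b->size
>            && (b->offset - a->offset == a->size
>                || a->offset - b->offset == b->size);
>   }
>
>   /* For each memory access in a block, look at most MAX_DISTANCE insns
>      ahead for a partner; the target hook then does the final check and
>      the actual merge (step 3).  */
>   static void
>   find_pairs (struct mem_insn_info *insns, size_t n)
>   {
>     size_t i, j;
>     for (i = 0; i < n; i++)
>       {
>         if (!insns[i].is_mem)
>           continue;
>         for (j = i + 1; j < n && j - i <= MAX_DISTANCE; j++)
>           if (mergeable_pair_p (&insns[i], &insns[j]))
>             {
>               /* Record (i, j) as a candidate pair here.  */
>               break;
>             }
>       }
>   }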
>
> Even though the parameter is set to only 4, on miscellaneous benchmarks
> this pass merges numerous opportunities beyond the ones already merged by
> peephole2 (roughly the same number of opportunities as the peepholed ones).
> A GCC bootstrap also confirms this finding.
>
> There is still an open question about where this new pass should run.
> Although register renaming is disabled by default now, I put this pass
> after it, because renaming can resolve some false dependencies and thus
> benefit this pass.  Another finding is that it captures many more
> opportunities when run after sched2, but I am not sure whether it would
> mess up the scheduling results in that case.
>
> So, any comments about this?
>
> Thanks,
> bin
>
>
> 2014-05-15  Bin Cheng  <bin.ch...@arm.com>
>         * common.opt (flag_merge_paired_loadstore): New option.
>         * merge-paired-loadstore.c: New file.
>         * Makefile.in: Support new file.
>         * config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro.
>         (load_latency_expanded_p, arm_merge_paired_loadstore): New functions.
>         * params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param.
>         * doc/invoke.texi (-fmerge-paired-loadstore): New.
>         (max-merge-paired-loadstore-distance): New.
>         * doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New.
>         * doc/tm.texi: Regenerated.
>         * target.def (merge_paired_loadstore): New.
>         * tree-pass.h (make_pass_merge_paired_loadstore): New decl.
>         * passes.def (pass_merge_paired_loadstore): New pass.
>         * timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var.
>
> gcc/testsuite/ChangeLog
> 2014-05-15  Bin Cheng  <bin.ch...@arm.com>
>
>         * gcc.target/arm/merge-paired-loadstore.c: New test.
>

Here is a testcase on x86-64:

---
struct Foo
{
  Foo (double x0, double x1, double x2)
    {
      data[0] = x0;
      data[1] = x1;
      data[2] = x2;
    }
  double data[3];
};

const Foo f1 (0.0, 0.0, 1.0);
const Foo f2 (1.0, 0.0, 0.0);

struct Bar
{
  Bar (float x0, float x1, float x2, float x3, float x4)
    {
      data[0] = x0;
      data[1] = x1;
      data[2] = x2;
      data[3] = x3;
      data[4] = x4;
    }
  float data[5];
};

const Bar b1 (0.0, 0.0, 0.0, 0.0, 1.0);
const Bar b2 (1.0, 0.0, 0.0, 0.0, 0.0);
---

We generate

xorpd %xmm0, %xmm0
movsd .LC1(%rip), %xmm1
movsd %xmm0, _ZL2f1(%rip)
movsd %xmm0, _ZL2f1+8(%rip)
movsd %xmm0, _ZL2f2+8(%rip)
movsd %xmm0, _ZL2f2+16(%rip)
xorps %xmm0, %xmm0
movsd %xmm1, _ZL2f1+16(%rip)
movsd %xmm1, _ZL2f2(%rip)
movss .LC3(%rip), %xmm1
movss %xmm0, _ZL2b1(%rip)
movss %xmm0, _ZL2b1+4(%rip)
movss %xmm0, _ZL2b1+8(%rip)
movss %xmm0, _ZL2b1+12(%rip)
movss %xmm1, _ZL2b1+16(%rip)
movss %xmm1, _ZL2b2(%rip)
movss %xmm0, _ZL2b2+4(%rip)
movss %xmm0, _ZL2b2+8(%rip)
movss %xmm0, _ZL2b2+12(%rip)
movss %xmm0, _ZL2b2+16(%rip)

There are pairs of movsd stores and groups of four movss stores here.  We
should be able to handle more than 2 load/store insns.
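
As a rough, hand-written illustration (assuming SSE2; this is not output of
the proposed pass), one unaligned 16-byte store can cover a pair of the
movsd's or a group of four of the movss's above:

  #include <emmintrin.h>

  void
  zero_zero_one (double *d)        /* like Foo (0.0, 0.0, 1.0)        */
  {
    _mm_storeu_pd (d, _mm_setzero_pd ());  /* replaces two movsd stores  */
    d[2] = 1.0;                            /* remaining scalar store     */
  }

  void
  zero_four_floats (float *f)      /* like Bar's four zero elements   */
  {
    _mm_storeu_ps (f, _mm_setzero_ps ());  /* replaces four movss stores */
  }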

-- 
H.J.
