On Thu, May 15, 2014 at 12:26 AM, bin.cheng <bin.ch...@arm.com> wrote:
> Hi,
> Targets like ARM and AARCH64 support double-word load store instructions,
> and these instructions are generally faster than the corresponding two
> load/stores.  GCC currently uses peephole2 to merge paired load/store into
> one single instruction which has a disadvantage.  It can only handle simple
> cases like the two instructions actually appear sequentially in instruction
> stream, and is too weak to handle cases in which the two load/store are
> intervened by other irrelevant instructions.
>
> Here comes up with a new GCC pass looking through each basic block and
> merging paired load store even they are not adjacent to each other.  The
> algorithm is pretty simple:
> 1) In initialization pass iterating over instruction stream it collects
> relevant memory access information for each instruction.
> 2) It iterates over each basic block, tries to find possible paired
> instruction for each memory access instruction.  During this work, it checks
> dependencies between the two possible instructions and also records the
> information indicating how to pair the two instructions.  To avoid quadratic
> behavior of the algorithm, It introduces new parameter
> max-merge-paired-loadstore-distance and set the default value to 4, which is
> large enough to catch major part of opportunities on ARM/cortex-a15.
> 3) For each candidate pair, it calls back-end's hook to do target dependent
> check and merge the two instructions if possible.
>
> Though the parameter is set to 4, for miscellaneous benchmarks, this pass
> can merge numerous opportunities except ones already merged by peephole2
> (same level numbers of opportunities comparing to peepholed ones).  GCC
> bootstrap can also confirm this finding.
>
> Yet there is an open issue about when we should run this new pass.  Though
> register renaming is disabled by default now, I put this pass after it,
> because renaming can resolve some false dependencies thus benefit this pass.
> Another finding is, it can capture a lot more opportunities if it's after
> sched2, but I am not sure whether it will mess up with scheduling results in
> this way.
>
> So, any comments about this?
>
> Thanks,
> bin
>
>
> 2014-05-15  Bin Cheng  <bin.ch...@arm.com>
>     * common.opt (flag_merge_paired_loadstore): New option.
>     * merge-paired-loadstore.c: New file.
>     * Makefile.in: Support new file.
>     * config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro.
>     (load_latency_expanded_p, arm_merge_paired_loadstore): New function.
>     * params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param.
>     * doc/invoke.texi (-fmerge-paired-loadstore): New.
>     (max-merge-paired-loadstore-distance): New.
>     * doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New.
>     * doc/tm.texi: Regenerated.
>     * target.def (merge_paired_loadstore): New.
>     * tree-pass.h (make_pass_merge_paired_loadstore): New decl.
>     * passes.def (pass_merge_paired_loadstore): New pass.
>     * timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var.
>
> gcc/testsuite/ChangeLog
> 2014-05-15  Bin Cheng  <bin.ch...@arm.com>
>
>     * gcc.target/arm/merge-paired-loadstore.c: New test.
>
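For readers following the thread, the quoted steps 1) to 3) can be sketched roughly as below.  This is only an illustration in C++ and not the code from the patch: the Insn struct and the functions adjacent, may_conflict and find_pairs are hypothetical names, register dependencies are ignored, accesses with different bases are assumed to alias, and the back-end hook of step 3 is left out.  The max_distance argument plays the role of max-merge-paired-loadstore-distance.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model of one instruction in a basic block.
struct Insn {
  bool is_mem;       // true for a load/store, false for anything else
  bool is_store;     // meaningful only when is_mem is true
  std::string base;  // base register or symbol of the address
  long offset;       // byte offset from the base
  int size;          // access size in bytes
};

// Candidate pair: same kind, same base, same size, adjacent in memory.
bool adjacent(const Insn &a, const Insn &b) {
  return a.is_mem && b.is_mem && a.is_store == b.is_store &&
         a.base == b.base && a.size == b.size &&
         (a.offset + a.size == b.offset || b.offset + b.size == a.offset);
}

// Conservative memory dependence test (assumption: different bases may alias).
bool may_conflict(const Insn &a, const Insn &b) {
  if (!a.is_mem || !b.is_mem) return false;
  if (!a.is_store && !b.is_store) return false;  // two loads never conflict
  if (a.base != b.base) return true;             // unknown aliasing
  return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

// For each memory access, look at most max_distance insns ahead for a
// partner.  The search stops early when the first access could not be
// moved down past an intervening, conflicting memory access.
std::vector<std::pair<std::size_t, std::size_t>>
find_pairs(const std::vector<Insn> &bb, std::size_t max_distance) {
  assert(max_distance > 0);
  std::vector<std::pair<std::size_t, std::size_t>> pairs;
  std::vector<bool> used(bb.size(), false);
  for (std::size_t i = 0; i < bb.size(); ++i) {
    if (!bb[i].is_mem || used[i]) continue;
    for (std::size_t j = i + 1; j < bb.size() && j - i <= max_distance; ++j) {
      if (!used[j] && adjacent(bb[i], bb[j])) {
        pairs.emplace_back(i, j);  // a real pass would now ask the target hook
        used[i] = used[j] = true;
        break;
      }
      if (may_conflict(bb[i], bb[j])) break;
    }
  }
  return pairs;
}
```

With a store to sp+0, two unrelated insns and then a store to sp+4, find_pairs with a distance of 4 reports the pair (0, 3), while a distance of 2 finds nothing.  The window is what keeps the search linear in the block size instead of quadratic.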
Here is a testcase on x86-64:

---
struct Foo
{
  Foo (double x0, double x1, double x2)
  {
    data[0] = x0;
    data[1] = x1;
    data[2] = x2;
  }
  double data[3];
};

const Foo f1 (0.0, 0.0, 1.0);
const Foo f2 (1.0, 0.0, 0.0);

struct Bar
{
  Bar (float x0, float x1, float x2, float x3, float x4)
  {
    data[0] = x0;
    data[1] = x1;
    data[2] = x2;
    data[3] = x3;
    data[4] = x4;
  }
  float data[5];
};

const Bar b1 (0.0, 0.0, 0.0, 0.0, 1.0);
const Bar b2 (1.0, 0.0, 0.0, 0.0, 0.0);
---

We generate

        xorpd   %xmm0, %xmm0
        movsd   .LC1(%rip), %xmm1
        movsd   %xmm0, _ZL2f1(%rip)
        movsd   %xmm0, _ZL2f1+8(%rip)
        movsd   %xmm0, _ZL2f2+8(%rip)
        movsd   %xmm0, _ZL2f2+16(%rip)
        xorps   %xmm0, %xmm0
        movsd   %xmm1, _ZL2f1+16(%rip)
        movsd   %xmm1, _ZL2f2(%rip)
        movss   .LC3(%rip), %xmm1
        movss   %xmm0, _ZL2b1(%rip)
        movss   %xmm0, _ZL2b1+4(%rip)
        movss   %xmm0, _ZL2b1+8(%rip)
        movss   %xmm0, _ZL2b1+12(%rip)
        movss   %xmm1, _ZL2b1+16(%rip)
        movss   %xmm1, _ZL2b2(%rip)
        movss   %xmm0, _ZL2b2+4(%rip)
        movss   %xmm0, _ZL2b2+8(%rip)
        movss   %xmm0, _ZL2b2+12(%rip)
        movss   %xmm0, _ZL2b2+16(%rip)

There are pairs of movsd and sets of 4 movss.  We should be able to
handle more than 2 load/store insns.

-- 
H.J.
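To make the observation above concrete, here is a rough sketch that groups the stores of this testcase into maximal contiguous runs with the same width and value.  It is only an illustration: Store, Run, group_runs and testcase_stores are hypothetical names, the offsets and values are transcribed from the assembly above, and sorting by address ignores every dependence and ordering check a real pass would need.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical model of one scalar store: `size` bytes holding the
// constant `value`, written to `symbol + offset`.
struct Store {
  std::string symbol;
  long offset;
  int size;
  double value;
};

// A maximal run of `count` contiguous stores sharing symbol, size and value.
struct Run {
  std::string symbol;
  long offset;  // offset of the first store in the run
  int size;     // size of each individual store
  double value;
  std::size_t count;
};

std::vector<Run> group_runs(std::vector<Store> stores) {
  // Order by address so that contiguous stores become neighbours.
  std::sort(stores.begin(), stores.end(), [](const Store &a, const Store &b) {
    return std::tie(a.symbol, a.offset) < std::tie(b.symbol, b.offset);
  });
  std::vector<Run> runs;
  for (const Store &s : stores) {
    assert(s.size > 0);
    if (!runs.empty()) {
      Run &r = runs.back();
      long next = r.offset + static_cast<long>(r.count) * r.size;
      if (r.symbol == s.symbol && r.size == s.size && r.value == s.value &&
          next == s.offset) {
        ++r.count;  // extend the current run instead of stopping at two
        continue;
      }
    }
    runs.push_back(Run{s.symbol, s.offset, s.size, s.value, 1});
  }
  return runs;
}

// The 20 stores of the generated code, in program order.
std::vector<Store> testcase_stores() {
  return {
      {"_ZL2f1", 0, 8, 0.0},  {"_ZL2f1", 8, 8, 0.0},  {"_ZL2f2", 8, 8, 0.0},
      {"_ZL2f2", 16, 8, 0.0}, {"_ZL2f1", 16, 8, 1.0}, {"_ZL2f2", 0, 8, 1.0},
      {"_ZL2b1", 0, 4, 0.0},  {"_ZL2b1", 4, 4, 0.0},  {"_ZL2b1", 8, 4, 0.0},
      {"_ZL2b1", 12, 4, 0.0}, {"_ZL2b1", 16, 4, 1.0}, {"_ZL2b2", 0, 4, 1.0},
      {"_ZL2b2", 4, 4, 0.0},  {"_ZL2b2", 8, 4, 0.0},  {"_ZL2b2", 12, 4, 0.0},
      {"_ZL2b2", 16, 4, 0.0},
  };
}
```

On this input the grouping yields two runs of two 8-byte stores and two runs of four 4-byte stores, plus four single stores, which matches the pairs of movsd and sets of 4 movss noted above.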