On Fri, May 16, 2014 at 12:57 AM, Steven Bosscher <stevenb....@gmail.com> wrote:
> On Thu, May 15, 2014 at 9:26 AM, bin.cheng wrote:
>> Hi,
>> Targets like ARM and AArch64 support double-word load/store
>> instructions, and these instructions are generally faster than the
>> corresponding two load/stores.  GCC currently uses peephole2 to merge
>> a paired load/store into one single instruction, which has a
>> disadvantage: it can only handle simple cases in which the two
>> instructions actually appear sequentially in the instruction stream,
>> and it is too weak to handle cases in which the two load/stores are
>> separated by other, irrelevant instructions.
>>
>> This patch introduces a new GCC pass that looks through each basic
>> block and merges paired load/stores even when they are not adjacent
>> to each other.  The algorithm is pretty simple:
>> 1) In an initialization pass iterating over the instruction stream,
>> it collects relevant memory access information for each instruction.
>> 2) It iterates over each basic block and tries to find a possible
>> paired instruction for each memory access instruction.  During this
>> work, it checks dependencies between the two candidate instructions
>> and records information indicating how to pair them.  To avoid
>> quadratic behavior, it introduces a new parameter,
>> max-merge-paired-loadstore-distance, with a default value of 4, which
>> is large enough to catch the major part of the opportunities on
>> ARM/cortex-a15.
>> 3) For each candidate pair, it calls the back-end's hook to do the
>> target-dependent check and merge the two instructions if possible.
>>
>> Though the parameter is set to 4, on miscellaneous benchmarks this
>> pass can merge numerous opportunities beyond the ones already merged
>> by peephole2 (a comparable number of opportunities to the peepholed
>> ones).  GCC bootstrap also confirms this finding.
>>
>> Yet there is an open issue about when we should run this new pass.
>> Though register renaming is disabled by default now, I put this pass
>> after it, because renaming can resolve some false dependencies and
>> thus benefit this pass.  Another finding is that the pass can capture
>> many more opportunities if it runs after sched2, but I am not sure
>> whether it would mess up the scheduling results that way.
>>
>> So, any comments about this?
Hi Steven,
Thanks for reviewing this.  Here are some answers to the general
questions.

> First off: Why does this need a target hook? We're getting more and
> more of them -- too many IMHO. There should be really good reasons for
> adding even more new ones.

Yes, I think this one does have a good reason.  The target-independent
pass just makes sure that the two consecutive memory access
instructions are free of data dependencies on each other, then feeds
them to the back-end hook.  It's the back-end's responsibility to
generate the correct instruction.  It's not about modifying an existing
insn and then recognizing it; it's about creating a new instruction
sometimes.  For example, we can generate a simple move insn in ARM
mode, while we have to generate a parallel instruction in Thumb mode.
The target-independent part has no idea how to generate the expected
insn.  Moreover, the back-end may check some special conditions too.

> Does this pass have to run after scheduling? The way you describe it,

No.  I just meant there are more opportunities after regrename, and
even more opportunities after sched2.  I haven't investigated the
reason for the latter yet, but this pass doesn't depend on sched2 to
work.

> this sounds like a scheduling problem, where you don't need regrename
> to resolve false dependencies. Your sched2 pass should be able to
> prioritize mergeable loads/stores to schedule them adjacent. Or if you
> must do this before scheduling, then at least do it before sched2. Now
> you're really ruining the order of the scheduled instructions, which
> seems bad.

Yes, I agree it should NOT disturb the scheduling results; that's why I
put it before sched2 and after register renaming right now.

> I don't see how regrename will help resolve [base+offset] false
> dependencies. Can you explain? I'd expect effects from
> hardreg-copyprop "commoning" a base register.

It's the register operand's false dependency, rather than the base's.
Consider the simple case below:

    mov r1, #const1
    store r1, [base+offset1]
    mov r1, #const2
    store r1, [base+offset2]

It should be renamed into:

    mov r1, #const1
    store r1, [base+offset1]
    mov r2, #const2
    store r2, [base+offset2]

and then caught by the patch.

I will leave the other comments for a moment, pending more discussion
here.

Thanks,
bin

> ChangeLog is missing the entry for arm.c.
>
> Your pass should make those peephole2's redundant, so you should
> remove the relevant define_peephole2's.
>
>> + generated by spilling during reigster allocation. To catch all
>
> s/reigster/register/
>
>> + whose address expression is in the form of "[base_offset]". It
>
> s/[base_offset]/[base+offset]/
>
>> + only guarantee that the two consecutive memory access instructions
>
> s/guarantee/guarantees/
>
>> + has no data dependency issues. After a pair is found it calls to
>
> s/has/have/
>
>> +/* Maximal number of memory references we support in single isntruction. */
>
> s/Maximal/Maximum/
> s/isntruction/instruction/
>
>> +/* Check instruction recorded in PINSN_INFO to see if it is the
>> +   first instruction of load store pair.  If yes, record the
>> +   information in PPAIR_INFO and return TRUE.  */
>> +
>> +static bool
>> +find_pair_x_insn (struct pair_info_t *ppair_info,
>
> The function description doesn't seem to match the function
> implementation. There is nothing in find_pair_x_insn that looks for a
> load/store pair.
>
>> +  /* Don't bother if base is a register and we are modifying it.  */
>> +  if (REG_P (base) && reg_modified_p (base, insn))
>> +    return false;
>
> You can shortcut this if you're looking at a load.
>
>> +  for (; *ref_rec; ref_rec++)
>
> Nit: "while (*ref_rec)"
>
>> +  /* Can't be paired if memory refs have different mode.  */
>> +  if (GET_MODE (mem) != mode)
>> +    return false;
>
> Not even if same GET_MODE_SIZE?
>
>> +  /* Clear PAIR_SINK if the memory unit of the first instruction
>> +     is clobbered between.
>> */
>> +  if ((pinfo->pair_kind & PAIR_SINK) != 0
>> +      && MEM_CLOBBERED_P (pinfo->mem_clob, MEM_CLOB_SELF))
>> +    pinfo->pair_kind &= (~PAIR_SINK);
>
> This has the smell of architecture-specific behavior. Maybe you can't
> even pair instructions if there is a MEM_CLOB_SELF involved. Likewise
> for some of the checks in valid_insn_pair_p following the one quoted
> above.
>
>> +  HOST_WIDE_INT offset_t;
>
> We generally reserve "_t" for types, so I'd recommend a different name
> for this variable.
>
>> +find_pair_y_insn
>
> You're using the fixed distance here. Can't you instead walk the
> entire basic block once and build dependence chains?
>
> A distance of 4 seems rather arbitrary. Have you tried other maximum
> distances to see how many extra cases you catch/miss?
>
>> +  FOR_EACH_BB_FN (bb, cfun)
>> +    {
>> +      if (bb->index < NUM_FIXED_BLOCKS)
>> +        continue;
>
> Redundant check. You only run into ENTRY/EXIT if you use
> FOR_ALL_BB_FN. With FOR_EACH_BB you only visit blocks between
> ENTRY/EXIT.
>
>> +  for (; insn != NULL_RTX && insn != BB_END (bb); insn = next)
>
> Off-by-one: If you stop at BB_END like this, then you're overlooking
> the last instruction in the basic block for pair matching. Stop at
> NEXT_INSN (BB_END (bb)), like FOR_BB_INSNS.
> (In a perfect world, we'd have proper insn iterators...)
>
> That's all for now! ;-)
>
> Ciao!
> Steven

--
Best Regards.