Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Tue, May 20, 2014 at 1:30 AM, Jeff Law l...@redhat.com wrote: On 05/19/14 00:38, Bin.Cheng wrote: On Sat, May 17, 2014 at 12:32 AM, Jeff Law l...@redhat.com wrote: On 05/16/14 04:07, Bin.Cheng wrote: But can't you go through movXX to generate either the simple insn on the ARM or the PARALLEL on the thumb? Yes, I think it's more than upsizing the mode. There is another example from one of x86's candidate peephole patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html The patch wants to do below transformation, which I think is very target dependent. Presumably there's no way to go through an expander here? I don't know very much with respect to this case, maybe the patch author can help here. The idea being that common cases where a pair moves can be turned into a single wider move without having to write target code to make that happen much of the time. ie 2xQI-HI, 2xHI-SI, 2xSI-DI 2xSF-DF. For things outside those simple cases, fall back to a target hook or a target expander. This is practicable, but the question is why we would bother with general cases if the hook interface is needed anyway. Is it because target calls are generally more expensive? Thanks, bin -- Best Regards.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Tue, May 20, 2014 at 5:02 AM, Mike Stump mikest...@comcast.net wrote: On May 19, 2014, at 10:30 AM, Jeff Law l...@redhat.com wrote: Yes, I think it's more than upsizing the mode. There is another example from one of x86's candidate peephole patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html The patch wants to do below transformation, which I think is very target dependent. Presumably there's no way to go through an expander here? The idea being that common cases where a pair moves can be turned into a single wider move without having to write target code to make that happen much of the time. ie 2xQI-HI, 2xHI-SI, 2xSI-DI 2xSF-DF. For things outside those simple cases, fall back to a target hook or a target expander. For completeness, load_multiple can handle quite a bit in a generic way. For example, mine is wired to support 3 DI. One problem is load/store_multiple only supports consecutive registers, while targets like arm and aarch64 supports load/store of non-consecutive registers. Thanks, bin -- Best Regards.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On 05/20/14 01:13, Bin.Cheng wrote: The idea being that common cases where a pair moves can be turned into a single wider move without having to write target code to make that happen much of the time. ie 2xQI-HI, 2xHI-SI, 2xSI-DI 2xSF-DF. For things outside those simple cases, fall back to a target hook or a target expander. This is practicable, but the question is why we would bother with general cases if the hook interface is needed anyway. Is it because target calls are generally more expensive? No, it's because most (all?) targets would benefit without having to write target dependent code. jeff
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Tue, May 20, 2014 at 12:13 AM, Bin.Cheng amker.ch...@gmail.com wrote: On Tue, May 20, 2014 at 1:30 AM, Jeff Law l...@redhat.com wrote: On 05/19/14 00:38, Bin.Cheng wrote: On Sat, May 17, 2014 at 12:32 AM, Jeff Law l...@redhat.com wrote: On 05/16/14 04:07, Bin.Cheng wrote: But can't you go through movXX to generate either the simple insn on the ARM or the PARALLEL on the thumb? Yes, I think it's more than upsizing the mode. There is another example from one of x86's candidate peephole patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html The patch wants to do below transformation, which I think is very target dependent. Presumably there's no way to go through an expander here? I don't know very much with respect to this case, maybe the patch author can help here. I just checked the expand result for my case. TER could make expand see the two intrinsics at the same time so in theory it is possible to make the merge happen in expand. I havn't looked into detail. Thanks, Wei.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On 05/20/14 11:14, Wei Mi wrote: On Tue, May 20, 2014 at 12:13 AM, Bin.Cheng amker.ch...@gmail.com wrote: On Tue, May 20, 2014 at 1:30 AM, Jeff Law l...@redhat.com wrote: On 05/19/14 00:38, Bin.Cheng wrote: On Sat, May 17, 2014 at 12:32 AM, Jeff Law l...@redhat.com wrote: On 05/16/14 04:07, Bin.Cheng wrote: But can't you go through movXX to generate either the simple insn on the ARM or the PARALLEL on the thumb? Yes, I think it's more than upsizing the mode. There is another example from one of x86's candidate peephole patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html The patch wants to do below transformation, which I think is very target dependent. Presumably there's no way to go through an expander here? I don't know very much with respect to this case, maybe the patch author can help here. I just checked the expand result for my case. TER could make expand see the two intrinsics at the same time so in theory it is possible to make the merge happen in expand. I havn't looked into detail. I'm not referring to in the gimple-rtl expansion. I'm referring to using a define_expand to generate the load/store multiple instructions. Given define_expand in the backend, the target independent code can use HAVE_XXX and gen_XXX to test for the existence of the expander and to call the expander to generate insns. Jeff
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Sat, May 17, 2014 at 12:52 AM, Mike Stump mikest...@comcast.net wrote: On May 16, 2014, at 3:07 AM, Bin.Cheng amker.ch...@gmail.com wrote: I don't see how regrename will help resolve [base+offset] false dependencies. Can you explain? I'd expect effects from hardreg-copyprop commoning a base register. It's the register operand's false dependency, rather than the base's one. Considering below simple case: mov r1, #const1 store r1, [base+offset1] mov r1, #const2 store r1, [base_offset2] It should be renamed into: mov r1, #const1 store r1, [base+offset1] mov r2, #const2 store r2, [base_offset2] Ah, but, what did this look like right before pass_web? I don't think this would be a problem for pre-RA, generally GCC won't try to reuse pseudo register in this way, right? Thanks, bin -- Best Regards.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Sat, May 17, 2014 at 12:32 AM, Jeff Law l...@redhat.com wrote: On 05/16/14 04:07, Bin.Cheng wrote: Yes, I think this one does have a good reason. The target independent pass just makes sure that two consecutive memory access instructions are free of data-dependency with each other, then feeds it to back-end hook. It's back-end's responsibility to generate correct instruction. But given these two memory access insns, there's only a couple ways they're likely to combine into a single insn. We could just as easily have the target independent code construct a new insn then try to recognize it. If it's not recognized, then try the other way. Or is it the case that we're doing something beyond upsizing the mode? It's not about modifying an existing insn then recognize it, it's about creating new instruction sometimes. For example, we can generate a simple move insn in Arm mode, while have to generate a parallel instruction in Thumb mode. Target independent part has no idea how to generate an expected insn. Moreover, back-end may check some special conditions too. But can't you go through movXX to generate either the simple insn on the ARM or the PARALLEL on the thumb? Yes, I think it's more than upsizing the mode. There is another example from one of x86's candidate peephole patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html The patch wants to do below transformation, which I think is very target dependent. +(define_peephole2 + [(set (match_operand:DF 0 register_operand) + (match_operand:DF 1 memory_operand)) + (set (match_operand:V2DF 2 register_operand) + (vec_concat:V2DF (match_dup 0) +(match_operand:DF 3 memory_operand)))] + TARGET_SSE_UNALIGNED_LOAD_OPTIMAL +REGNO (operands[0]) == REGNO (operands[2]) +adjacent_mem_locations (operands[1], operands[3]) + [(set (match_dup 2) + (unspec:V2DF [(match_dup 4)] UNSPEC_LOADU))] + +;; merge movsd/movhpd to movupd when TARGET_SSE_UNALIGNED_STORE_OPTIMAL +;; is true. +(define_peephole2 + [(set (match_operand:DF 0 memory_operand) +(vec_select:DF (match_operand:V2DF 1 register_operand) + (parallel [(const_int 0)]))) + (set (match_operand:DF 2 memory_operand) +(vec_select:DF (match_dup 1) + (parallel [(const_int 1)])))] + TARGET_SSE_UNALIGNED_STORE_OPTIMAL +adjacent_mem_locations (operands[0], operands[2]) + [(set (match_dup 3) +(unspec:V2DF [(match_dup 1)] UNSPEC_STOREU))] Thanks, bin -- Best Regards.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Sat, May 17, 2014 at 12:18 AM, Jeff Law l...@redhat.com wrote: On 05/16/14 04:07, Bin.Cheng wrote: On Fri, May 16, 2014 at 1:13 AM, Jeff Law l...@redhat.com wrote: On 05/15/14 10:51, Mike Stump wrote: On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote: Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. So I have a target that has load and store multiple support that supports large a number of registers (2-n registers), and I added a sched0 pass that is a light copy of the regular scheduling pass that uses a different cost function which arranges all loads first, then all stores then everything else. Within a group of loads or stores the secondary key is the base register, the next key is the offset. The net result, all loads off the same register are sorted in increasing order. Glad to see someone else stumble on (ab)using the scheduler to do this. Emm, If it's (ab)using, should we still do it then? I think it'd still be fine. There's even been a comment about doing this kind of thing in the scheduler that's been around since the early 90s... The scheduler is a bit interesting in that it has a wealth of dependency information and the ability to reorganize the insn stream in relatively arbitrary ways. That seems to make it a natural place to think about transformations of this nature. We just haven't had a good infrastructure for doing that. In theory we're a lot closer now to being able to plug in different costing/sorting models and let the scheduler do its thing. Those models might rewrite for register pressure, or encourage certain independent insns to issue back-to-back to encourage combining, or to build candidate insns for delay slot scheduling, etc. As Mike stated, merging of consecutive memory accesses is all about the base register and the offset. I am thinking another method collecting all memory accesses with same base register then doing the merge work. In this way, we should be able to merge more than 2 instructions, also it would be possible to remove redundant load instructions in one pass. My question is how many these redundant loads could be? Is there any rtl pass responsible for this now? I suspect it's a lot less important now than it used to be. But there's probably some cases where it'd be useful. Combining sub-word accesses into full-word accesses come immediately to mind. I'm not aware of any pass which does these kind of changes in a general form. Some passes (caller-save) do a fair amount of work to track when they can generate multi-object loads/stores (and it was a huge win back on the old sparc processors). Glad this RFC has attracted some attentions and thanks for all the comments. Here I can see four major concerns as below: 1) Should we do it in a separated pass, or just along with scheduler? 2) When should we run the new pass, before or after RA? There are both advantages and disadvantages and very depends on the target for which we are compiling. I have no simple answer to this. Maybe we can run the pass twice or follow Oleg's suggestion. I think it's a new strategy for GCC to let backend decide when to run a pass. 3) Do we need a new target hook interface? I answered this in other messages and I still think it's target dependent. 4) The optimization should be able to handle cases with more than 2 consecutive load/store instructions. The current implementation can't handle such cases and need further extension. The 3) and 4) are just implementation questions, while I am not sure about 1) and 2), so any more comments that we could make some decisions to carry on this optimization? Thanks, bin -- Best Regards.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On 05/19/14 00:38, Bin.Cheng wrote: On Sat, May 17, 2014 at 12:52 AM, Mike Stump mikest...@comcast.net wrote: On May 16, 2014, at 3:07 AM, Bin.Cheng amker.ch...@gmail.com wrote: I don't see how regrename will help resolve [base+offset] false dependencies. Can you explain? I'd expect effects from hardreg-copyprop commoning a base register. It's the register operand's false dependency, rather than the base's one. Considering below simple case: mov r1, #const1 store r1, [base+offset1] mov r1, #const2 store r1, [base_offset2] It should be renamed into: mov r1, #const1 store r1, [base+offset1] mov r2, #const2 store r2, [base_offset2] Ah, but, what did this look like right before pass_web? I don't think this would be a problem for pre-RA, generally GCC won't try to reuse pseudo register in this way, right? There's nothing which would prevent reuse of a pseudo in this manner; for example, a backend might do something ugly like this in an expander and we have to handle it properly. jeff
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On 05/19/14 00:38, Bin.Cheng wrote: 1) Should we do it in a separated pass, or just along with scheduler? ISTM that when we're able to combine insns that can impact the schedule we'd like to generate, possibly in significant ways. That argues for a separate pass that runs before the scheduler. So I'd probably look at splitting out the code which builds the dependency information, running that first, then the pass to merge these memory ops (incrementally updating the dependency info), then the scheduler. 2) When should we run the new pass, before or after RA? There are both advantages and disadvantages and very depends on the target for which we are compiling. I have no simple answer to this. Maybe we can run the pass twice or follow Oleg's suggestion. I think it's a new strategy for GCC to let backend decide when to run a pass. If we only run the dependency stuff once and share the data between the new pass and the sceduler, then I think running the new pass twice just before scheduling would be fine. 3) Do we need a new target hook interface? I answered this in other messages and I still think it's target dependent. Still pondering that one :-) 4) The optimization should be able to handle cases with more than 2 consecutive load/store instructions. The current implementation can't handle such cases and need further extension. Yea, probably. I think if we build the infrastructure right extension for 2 load/stores shouldn't be terribly bad to implement. Jeff
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On 05/19/14 00:38, Bin.Cheng wrote: On Sat, May 17, 2014 at 12:32 AM, Jeff Law l...@redhat.com wrote: On 05/16/14 04:07, Bin.Cheng wrote: Yes, I think this one does have a good reason. The target independent pass just makes sure that two consecutive memory access instructions are free of data-dependency with each other, then feeds it to back-end hook. It's back-end's responsibility to generate correct instruction. But given these two memory access insns, there's only a couple ways they're likely to combine into a single insn. We could just as easily have the target independent code construct a new insn then try to recognize it. If it's not recognized, then try the other way. Or is it the case that we're doing something beyond upsizing the mode? It's not about modifying an existing insn then recognize it, it's about creating new instruction sometimes. For example, we can generate a simple move insn in Arm mode, while have to generate a parallel instruction in Thumb mode. Target independent part has no idea how to generate an expected insn. Moreover, back-end may check some special conditions too. But can't you go through movXX to generate either the simple insn on the ARM or the PARALLEL on the thumb? Yes, I think it's more than upsizing the mode. There is another example from one of x86's candidate peephole patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html The patch wants to do below transformation, which I think is very target dependent. Presumably there's no way to go through an expander here? The idea being that common cases where a pair moves can be turned into a single wider move without having to write target code to make that happen much of the time. ie 2xQI-HI, 2xHI-SI, 2xSI-DI 2xSF-DF. For things outside those simple cases, fall back to a target hook or a target expander. jeff
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On May 19, 2014, at 10:30 AM, Jeff Law l...@redhat.com wrote: Yes, I think it's more than upsizing the mode. There is another example from one of x86's candidate peephole patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html The patch wants to do below transformation, which I think is very target dependent. Presumably there's no way to go through an expander here? The idea being that common cases where a pair moves can be turned into a single wider move without having to write target code to make that happen much of the time. ie 2xQI-HI, 2xHI-SI, 2xSI-DI 2xSF-DF. For things outside those simple cases, fall back to a target hook or a target expander. For completeness, load_multiple can handle quite a bit in a generic way. For example, mine is wired to support 3 DI.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Fri, May 16, 2014 at 12:57 AM, Steven Bosscher stevenb@gmail.com wrote: On Thu, May 15, 2014 at 9:26 AM, bin.cheng wrote: Hi, Targets like ARM and AARCH64 support double-word load store instructions, and these instructions are generally faster than the corresponding two load/stores. GCC currently uses peephole2 to merge paired load/store into one single instruction which has a disadvantage. It can only handle simple cases like the two instructions actually appear sequentially in instruction stream, and is too weak to handle cases in which the two load/store are intervened by other irrelevant instructions. Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. The algorithm is pretty simple: 1) In initialization pass iterating over instruction stream it collects relevant memory access information for each instruction. 2) It iterates over each basic block, tries to find possible paired instruction for each memory access instruction. During this work, it checks dependencies between the two possible instructions and also records the information indicating how to pair the two instructions. To avoid quadratic behavior of the algorithm, It introduces new parameter max-merge-paired-loadstore-distance and set the default value to 4, which is large enough to catch major part of opportunities on ARM/cortex-a15. 3) For each candidate pair, it calls back-end's hook to do target dependent check and merge the two instructions if possible. Though the parameter is set to 4, for miscellaneous benchmarks, this pass can merge numerous opportunities except ones already merged by peephole2 (same level numbers of opportunities comparing to peepholed ones). GCC bootstrap can also confirm this finding. Yet there is an open issue about when we should run this new pass. Though register renaming is disabled by default now, I put this pass after it, because renaming can resolve some false dependencies thus benefit this pass. Another finding is, it can capture a lot more opportunities if it's after sched2, but I am not sure whether it will mess up with scheduling results in this way. So, any comments about this? Hi Steven, Thanks for reviewing this. Here are some answers to the general questions. First off: Why does this need a target hook? We're getting more and more of them -- too many IMHO. There should be really good reasons for adding even more new ones. Yes, I think this one does have a good reason. The target independent pass just makes sure that two consecutive memory access instructions are free of data-dependency with each other, then feeds it to back-end hook. It's back-end's responsibility to generate correct instruction. It's not about modifying an existing insn then recognize it, it's about creating new instruction sometimes. For example, we can generate a simple move insn in Arm mode, while have to generate a parallel instruction in Thumb mode. Target independent part has no idea how to generate an expected insn. Moreover, back-end may check some special conditions too. Does this pass have to run after scheduling? The way you describe it, No, I just meant there is more opportunities after regrenaming, and even more opportunities after sched2, I haven't investigated reason for the latter one yet, but this pass doesn't depend on sched2 to work. this sounds like a scheduling problem, where you don't need regrename to resolve false dependencies. Your sched2 pass should be able to prioritize mergeable loads/stores to schedule them adjacent. Of if you must do this before scheduling, then at least do it before sched2. Now you're really ruining the order of the scheduled instructions, which seems bad. Yes, I agree it should NOT disturb scheduling results, that's why I put it before sched2 and after register renaming right now. I don't see how regrename will help resolve [base+offset] false dependencies. Can you explain? I'd expect effects from hardreg-copyprop commoning a base register. It's the register operand's false dependency, rather than the base's one. Considering below simple case: mov r1, #const1 store r1, [base+offset1] mov r1, #const2 store r1, [base_offset2] It should be renamed into: mov r1, #const1 store r1, [base+offset1] mov r2, #const2 store r2, [base_offset2] Then caught by the patch. I will leave other comments for a moment after more discussion here. Thanks, bin ChangeLog is missing the entry for arm.c. Your pass should make those peephole2's redundant, so you should remove the relevant define_peephole2's. + generated by spilling during reigster allocation. To catch all s/reigster/register/ + whose address expression is in the form of [base_offset]. It s/[base_offset]/[base+offset]/ + only guarantee that the two consecutive memory access instructions s/guarantee/guarantees/ +
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Fri, May 16, 2014 at 1:13 AM, Jeff Law l...@redhat.com wrote: On 05/15/14 10:51, Mike Stump wrote: On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote: Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. So I have a target that has load and store multiple support that supports large a number of registers (2-n registers), and I added a sched0 pass that is a light copy of the regular scheduling pass that uses a different cost function which arranges all loads first, then all stores then everything else. Within a group of loads or stores the secondary key is the base register, the next key is the offset. The net result, all loads off the same register are sorted in increasing order. Glad to see someone else stumble on (ab)using the scheduler to do this. Emm, If it's (ab)using, should we still do it then? I've poked at the scheduler several times to do similar stuff, but was never really satisfied with the results and never tried to polish those prototypes into something worth submitting. One example I've poked at was discovery of stores which then feed into a load from the same location. Which obviously we'd prefer to turn into a store + copy (subject to mess of constraints). There's a handful of these kind of transformations that seem to naturally drop out of this kind of work. As Mike stated, merging of consecutive memory accesses is all about the base register and the offset. I am thinking another method collecting all memory accesses with same base register then doing the merge work. In this way, we should be able to merge more than 2 instructions, also it would be possible to remove redundant load instructions in one pass. My question is how many these redundant loads could be? Is there any rtl pass responsible for this now? Thanks, bin Similarly a post-reload pass could be used to promote single word loads/stores to double-word operations. If anyone cared about PA 1.1 code generation, it'd be a much cleaner way to support the non-fused fmpyadd fmpysub insns. Anyway, if you want to move forward with the idea, I'd certainly support doing so. jeff -- Best Regards.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Fri, May 16, 2014 at 12:51 AM, Mike Stump mikest...@comcast.net wrote: On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote: Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. So I have a target that has load and store multiple support that supports large a number of registers (2-n registers), and I added a sched0 pass that is a light copy of the regular scheduling pass that uses a different cost function which arranges all loads first, then all stores then everything else. Within a group of loads or stores the secondary key is the base register, the next key is the offset. The net result, all loads off the same register are sorted in increasing order. We then can use some define_insns and some peephole to patterns to take the seemingly unrelated instructions, which are now adjacent to knock them down into single instructions, instead of the mass of instructions they were before. And then a peephole pass that runs early to allow the patterns to do the heavy lifting. This scheme can handle unlimited complexity on the addressing forms just by ensuring the cost function for the new scheduling pass looks at all relevant bits (target dependent, if the simple machine independent form reg + n is not enough). The sched0 and the peephole pass run early: + NEXT_PASS (pass_sched0); + NEXT_PASS (pass_peephole1); NEXT_PASS (pass_web); NEXT_PASS (pass_rtl_cprop); NEXT_PASS (pass_cse2); (before register allocation) so, it can arrange to have things in adjacent registers for the load and store multiple instructions. The register allocator can then arrange all the code to use those registers directly. So, having done all this work, I think it would be nicer if there were a pass that managed it (so that one doesn't have to write any of the peephole or the define_insns (you need like 3*n patterns, and the patterns of O(n), so, you need around n*4/2 lines of code, which is annoying for large n. A pass could use the existing load store multiple patterns directly, so, no additional port work. In my work, since I extend life times of values into registers, pretty much without limit, this could be a bad thing. The code is naturally limited to only extending the lives of things where load and store multiple are used, as if they aren't used, the regular scheduling pass would undo all the sched0 motion. I choose a light copy of sched, as I don't care about compile time, and I wanted a very small patch that was easy to maintain. If this pass when into trunk, we'd run the new passes _only_ if a port asked for them. 99% of the ports likely don't want either, though, peephole before register allocation might be interesting for others to solve other issues. I wanted this to run before register allocation as my load and store multiple instructions only take consecutive register ranges (n-m), and I need the register allocator to manage to make it true. I do reg to reg motion to move things around as needed, but almost all I expect the allocator to get rid of. Very complex cases might wind up with a few extra moves, but I have nice bubbles that I can fit these extra moves into. In my scheme, no new options, no documentation for new options, no new param options, no silly constants, no hard to write/maintain pass, no new weird targets interface, no limit on just pairs, works on stores as well, runs earlier, 430 lines instead of 1086 lines, conceptually much simpler, added benefit of peephole before register allocation that can be used for other things by the port... So, my question is, does my scheme on your target find more or fewer things? Would your scheme pair pairs (so that 4 registers would go into 1 instruction)? Hi Mike, Thanks for bringing up this new method. I had to admit that I was not very much into this method at the first glance especially it requires another pass in cooperation with scheduler. For the first question, unfortunately, I can't apply the patch to svn trunk, could you please update it thus I can do some experiment with it? From my experience, lots of opportunities on ARM are generated by RA, putting it before RA would miss many cases. Another possible issue is about interaction with other optimizations. Putting it at early stage of RTL, we may disturb other optimizations like /gcse/-fgcse-lm/-fgcse-sm/dse. Of course, it has advantages too, for example, fwprop_addr sometimes corrupt load/store pair opportunities on ARM, it won't be a problem for your patch. For the second question, the answer is no for current implementation. For ARM the most important opportunity is the paired one, so I just started with pair of two instructions, which is much simpler. I do have a draft idea how to merge more than two instructions, but haven't worked on it yet. Thanks, bin
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Thu, May 15, 2014 at 6:31 PM, Oleg Endo oleg.e...@t-online.de wrote: Hi, On 15 May 2014, at 09:26, bin.cheng bin.ch...@arm.com wrote: Hi, Targets like ARM and AARCH64 support double-word load store instructions, and these instructions are generally faster than the corresponding two load/stores. GCC currently uses peephole2 to merge paired load/store into one single instruction which has a disadvantage. It can only handle simple cases like the two instructions actually appear sequentially in instruction stream, and is too weak to handle cases in which the two load/store are intervened by other irrelevant instructions. Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. The algorithm is pretty simple: 1) In initialization pass iterating over instruction stream it collects relevant memory access information for each instruction. 2) It iterates over each basic block, tries to find possible paired instruction for each memory access instruction. During this work, it checks dependencies between the two possible instructions and also records the information indicating how to pair the two instructions. To avoid quadratic behavior of the algorithm, It introduces new parameter max-merge-paired-loadstore-distance and set the default value to 4, which is large enough to catch major part of opportunities on ARM/cortex-a15. 3) For each candidate pair, it calls back-end's hook to do target dependent check and merge the two instructions if possible. Though the parameter is set to 4, for miscellaneous benchmarks, this pass can merge numerous opportunities except ones already merged by peephole2 (same level numbers of opportunities comparing to peepholed ones). GCC bootstrap can also confirm this finding. This is interesting. E.g. on SH there are insns to load/store SFmode pairs. However, these insns require a mode switch and have some constraints on register usage. So in the SH case the load/store pairing would need to be done before reg alloc and before mode switching. Yet there is an open issue about when we should run this new pass. Though register renaming is disabled by default now, I put this pass after it, because renaming can resolve some false dependencies thus benefit this pass. Another finding is, it can capture a lot more opportunities if it's after sched2, but I am not sure whether it will mess up with scheduling results in this way. How about the following. Instead of adding new hooks and inserting the pass to the general pass list, make the new pass class take the necessary callback functions directly. Then targets can just instantiate the pass, passing their impl of the callbacks, and insert the pass object into the pass list at a place that fits best for the target. Oh, I don't know we can do this in GCC. But yes, a target may want to run it at some place that fits best for the target. Thanks, bin So, any comments about this? Thanks, bin 2014-05-15 Bin Cheng bin.ch...@arm.com * common.opt (flag_merge_paired_loadstore): New option. * merge-paired-loadstore.c: New file. * Makefile.in: Support new file. * config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro. (load_latency_expanded_p, arm_merge_paired_loadstore): New function. * params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param. * doc/invoke.texi (-fmerge-paired-loadstore): New. (max-merge-paired-loadstore-distance): New. * doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New. * doc/tm.texi: Regenerated. * target.def (merge_paired_loadstore): New. * tree-pass.h (make_pass_merge_paired_loadstore): New decl. * passes.def (pass_merge_paired_loadstore): New pass. * timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var. gcc/testsuite/ChangeLog 2014-05-15 Bin Cheng bin.ch...@arm.com * gcc.target/arm/merge-paired-loadstore.c: New test. merge-paired-loadstore-20140515.txt -- Best Regards.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Fri, May 16, 2014 at 12:10 PM, Bin.Cheng amker.ch...@gmail.com wrote: On Thu, May 15, 2014 at 6:31 PM, Oleg Endo oleg.e...@t-online.de wrote: Hi, On 15 May 2014, at 09:26, bin.cheng bin.ch...@arm.com wrote: Hi, Targets like ARM and AARCH64 support double-word load store instructions, and these instructions are generally faster than the corresponding two load/stores. GCC currently uses peephole2 to merge paired load/store into one single instruction which has a disadvantage. It can only handle simple cases like the two instructions actually appear sequentially in instruction stream, and is too weak to handle cases in which the two load/store are intervened by other irrelevant instructions. Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. The algorithm is pretty simple: 1) In initialization pass iterating over instruction stream it collects relevant memory access information for each instruction. 2) It iterates over each basic block, tries to find possible paired instruction for each memory access instruction. During this work, it checks dependencies between the two possible instructions and also records the information indicating how to pair the two instructions. To avoid quadratic behavior of the algorithm, It introduces new parameter max-merge-paired-loadstore-distance and set the default value to 4, which is large enough to catch major part of opportunities on ARM/cortex-a15. 3) For each candidate pair, it calls back-end's hook to do target dependent check and merge the two instructions if possible. Though the parameter is set to 4, for miscellaneous benchmarks, this pass can merge numerous opportunities except ones already merged by peephole2 (same level numbers of opportunities comparing to peepholed ones). GCC bootstrap can also confirm this finding. This is interesting. E.g. on SH there are insns to load/store SFmode pairs. However, these insns require a mode switch and have some constraints on register usage. So in the SH case the load/store pairing would need to be done before reg alloc and before mode switching. Yet there is an open issue about when we should run this new pass. Though register renaming is disabled by default now, I put this pass after it, because renaming can resolve some false dependencies thus benefit this pass. Another finding is, it can capture a lot more opportunities if it's after sched2, but I am not sure whether it will mess up with scheduling results in this way. How about the following. Instead of adding new hooks and inserting the pass to the general pass list, make the new pass class take the necessary callback functions directly. Then targets can just instantiate the pass, passing their impl of the callbacks, and insert the pass object into the pass list at a place that fits best for the target. Oh, I don't know we can do this in GCC. But yes, a target may want to run it at some place that fits best for the target. Btw, the bswap pass enhancements that are currently in review may also be an opportunity to catch these. They can merge adjacent loads that are used composed (but not yet composed by storing into adjacent memory). The basic-block vectorizer should also handle this (if the composition happens to be by storing into adjacent memory) - of course it needs vector modes available and it has to be enabled. Richard. Thanks, bin So, any comments about this? Thanks, bin 2014-05-15 Bin Cheng bin.ch...@arm.com * common.opt (flag_merge_paired_loadstore): New option. * merge-paired-loadstore.c: New file. * Makefile.in: Support new file. * config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro. (load_latency_expanded_p, arm_merge_paired_loadstore): New function. * params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param. * doc/invoke.texi (-fmerge-paired-loadstore): New. (max-merge-paired-loadstore-distance): New. * doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New. * doc/tm.texi: Regenerated. * target.def (merge_paired_loadstore): New. * tree-pass.h (make_pass_merge_paired_loadstore): New decl. * passes.def (pass_merge_paired_loadstore): New pass. * timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var. gcc/testsuite/ChangeLog 2014-05-15 Bin Cheng bin.ch...@arm.com * gcc.target/arm/merge-paired-loadstore.c: New test. merge-paired-loadstore-20140515.txt -- Best Regards.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Fri, May 16, 2014 at 12:51 PM, Richard Biener wrote: Btw, the bswap pass enhancements that are currently in review may also be an opportunity to catch these. They can merge adjacent loads that are used composed (but not yet composed by storing into adjacent memory). The basic-block vectorizer should also handle this (if the composition happens to be by storing into adjacent memory) - of course it needs vector modes available and it has to be enabled. That won't work for cleaning up spill code load/reloads, which is one of the motivations for the new pass. Ciao! Steven
Re: [GCC RFC]A new and simple pass merging paired load store instructions
Btw, the bswap pass enhancements that are currently in review may also be an opportunity to catch these. They can merge adjacent loads that are used composed (but not yet composed by storing into adjacent memory). The basic-block vectorizer should also handle this (if the composition happens to be by storing into adjacent memory) - of course it needs vector modes available and it has to be enabled. Should we really do it there ? If we start merging multiple loads and stores into the vector register set, on some architectures or microarchitectures the cost of moving to and from the vector register set to the general purpose register set might be too expensive for some operations and then we get into all sorts of issues. I think there is merit in an RTL pass. Ramana Richard. Thanks, bin So, any comments about this? Thanks, bin 2014-05-15 Bin Cheng bin.ch...@arm.com * common.opt (flag_merge_paired_loadstore): New option. * merge-paired-loadstore.c: New file. * Makefile.in: Support new file. * config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro. (load_latency_expanded_p, arm_merge_paired_loadstore): New function. * params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param. * doc/invoke.texi (-fmerge-paired-loadstore): New. (max-merge-paired-loadstore-distance): New. * doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New. * doc/tm.texi: Regenerated. * target.def (merge_paired_loadstore): New. * tree-pass.h (make_pass_merge_paired_loadstore): New decl. * passes.def (pass_merge_paired_loadstore): New pass. * timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var. gcc/testsuite/ChangeLog 2014-05-15 Bin Cheng bin.ch...@arm.com * gcc.target/arm/merge-paired-loadstore.c: New test. merge-paired-loadstore-20140515.txt -- Best Regards.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On 05/16/14 04:07, Bin.Cheng wrote: On Fri, May 16, 2014 at 1:13 AM, Jeff Law l...@redhat.com wrote: On 05/15/14 10:51, Mike Stump wrote: On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote: Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. So I have a target that has load and store multiple support that supports large a number of registers (2-n registers), and I added a sched0 pass that is a light copy of the regular scheduling pass that uses a different cost function which arranges all loads first, then all stores then everything else. Within a group of loads or stores the secondary key is the base register, the next key is the offset. The net result, all loads off the same register are sorted in increasing order. Glad to see someone else stumble on (ab)using the scheduler to do this. Emm, If it's (ab)using, should we still do it then? I think it'd still be fine. There's even been a comment about doing this kind of thing in the scheduler that's been around since the early 90s... The scheduler is a bit interesting in that it has a wealth of dependency information and the ability to reorganize the insn stream in relatively arbitrary ways. That seems to make it a natural place to think about transformations of this nature. We just haven't had a good infrastructure for doing that. In theory we're a lot closer now to being able to plug in different costing/sorting models and let the scheduler do its thing. Those models might rewrite for register pressure, or encourage certain independent insns to issue back-to-back to encourage combining, or to build candidate insns for delay slot scheduling, etc. As Mike stated, merging of consecutive memory accesses is all about the base register and the offset. I am thinking another method collecting all memory accesses with same base register then doing the merge work. In this way, we should be able to merge more than 2 instructions, also it would be possible to remove redundant load instructions in one pass. My question is how many these redundant loads could be? Is there any rtl pass responsible for this now? I suspect it's a lot less important now than it used to be. But there's probably some cases where it'd be useful. Combining sub-word accesses into full-word accesses come immediately to mind. I'm not aware of any pass which does these kind of changes in a general form. Some passes (caller-save) do a fair amount of work to track when they can generate multi-object loads/stores (and it was a huge win back on the old sparc processors). jeff
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On 05/16/14 04:07, Bin.Cheng wrote: Yes, I think this one does have a good reason. The target independent pass just makes sure that two consecutive memory access instructions are free of data-dependency with each other, then feeds it to back-end hook. It's back-end's responsibility to generate correct instruction. But given these two memory access insns, there's only a couple ways they're likely to combine into a single insn. We could just as easily have the target independent code construct a new insn then try to recognize it. If it's not recognized, then try the other way. Or is it the case that we're doing something beyond upsizing the mode? It's not about modifying an existing insn then recognize it, it's about creating new instruction sometimes. For example, we can generate a simple move insn in Arm mode, while have to generate a parallel instruction in Thumb mode. Target independent part has no idea how to generate an expected insn. Moreover, back-end may check some special conditions too. But can't you go through movXX to generate either the simple insn on the ARM or the PARALLEL on the thumb? Jeff
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On May 16, 2014, at 3:07 AM, Bin.Cheng amker.ch...@gmail.com wrote: I don't see how regrename will help resolve [base+offset] false dependencies. Can you explain? I'd expect effects from hardreg-copyprop commoning a base register. It's the register operand's false dependency, rather than the base's one. Considering below simple case: mov r1, #const1 store r1, [base+offset1] mov r1, #const2 store r1, [base_offset2] It should be renamed into: mov r1, #const1 store r1, [base+offset1] mov r2, #const2 store r2, [base_offset2] Ah, but, what did this look like right before pass_web?
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Fri, 2014-05-16 at 18:10 +0800, Bin.Cheng wrote: On Thu, May 15, 2014 at 6:31 PM, Oleg Endo oleg.e...@t-online.de wrote: How about the following. Instead of adding new hooks and inserting the pass to the general pass list, make the new pass class take the necessary callback functions directly. Then targets can just instantiate the pass, passing their impl of the callbacks, and insert the pass object into the pass list at a place that fits best for the target. Oh, I don't know we can do this in GCC. But yes, a target may want to run it at some place that fits best for the target. I think it's better than trying to come up with a scheme that so-so fits all. My idea would look like: // merge_paired_loadstore.h class merge_paired_loadstore : public rtl_opt_pass { public: struct delegate { virtual bool merge_paired_loadstore (rtx x, rtx y, ...) = 0; ... }; merge_paired_loadstore (gcc::context* ctx, const char* name, delegate* d); ... }; // target.cc #include merge_paired_loadstore.h static struct target_merge_loadstore_delegate : merge_paired_loadstore::delegate { virtual bool merge_paired_loadstore (...) { // code as if this was a freestanding target hook function }; } g_merge_loadstore_delegate; static void target_register_passes (void) { register_pass ( new merge_paired_loadstore (g, merge_ls, g_merge_loadstore_delegate), PASS_POS_INSERT_AFTER, other pass, 1); } Then, later, maybe sometime in the future, if there's something like a class target, it'd look like: class my_target : public target, merge_paired_loadstore::delegate { ... virtual bool merge_paired_loadstore (...); }; Maybe it's a bit far fetched at the moment, but it would be a start. Cheers, Oleg
Re: [GCC RFC]A new and simple pass merging paired load store instructions
Hi, On 15 May 2014, at 09:26, bin.cheng bin.ch...@arm.com wrote: Hi, Targets like ARM and AARCH64 support double-word load store instructions, and these instructions are generally faster than the corresponding two load/stores. GCC currently uses peephole2 to merge paired load/store into one single instruction which has a disadvantage. It can only handle simple cases like the two instructions actually appear sequentially in instruction stream, and is too weak to handle cases in which the two load/store are intervened by other irrelevant instructions. Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. The algorithm is pretty simple: 1) In initialization pass iterating over instruction stream it collects relevant memory access information for each instruction. 2) It iterates over each basic block, tries to find possible paired instruction for each memory access instruction. During this work, it checks dependencies between the two possible instructions and also records the information indicating how to pair the two instructions. To avoid quadratic behavior of the algorithm, It introduces new parameter max-merge-paired-loadstore-distance and set the default value to 4, which is large enough to catch major part of opportunities on ARM/cortex-a15. 3) For each candidate pair, it calls back-end's hook to do target dependent check and merge the two instructions if possible. Though the parameter is set to 4, for miscellaneous benchmarks, this pass can merge numerous opportunities except ones already merged by peephole2 (same level numbers of opportunities comparing to peepholed ones). GCC bootstrap can also confirm this finding. This is interesting. E.g. on SH there are insns to load/store SFmode pairs. However, these insns require a mode switch and have some constraints on register usage. So in the SH case the load/store pairing would need to be done before reg alloc and before mode switching. Yet there is an open issue about when we should run this new pass. Though register renaming is disabled by default now, I put this pass after it, because renaming can resolve some false dependencies thus benefit this pass. Another finding is, it can capture a lot more opportunities if it's after sched2, but I am not sure whether it will mess up with scheduling results in this way. How about the following. Instead of adding new hooks and inserting the pass to the general pass list, make the new pass class take the necessary callback functions directly. Then targets can just instantiate the pass, passing their impl of the callbacks, and insert the pass object into the pass list at a place that fits best for the target. So, any comments about this? Thanks, bin 2014-05-15 Bin Cheng bin.ch...@arm.com * common.opt (flag_merge_paired_loadstore): New option. * merge-paired-loadstore.c: New file. * Makefile.in: Support new file. * config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro. (load_latency_expanded_p, arm_merge_paired_loadstore): New function. * params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param. * doc/invoke.texi (-fmerge-paired-loadstore): New. (max-merge-paired-loadstore-distance): New. * doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New. * doc/tm.texi: Regenerated. * target.def (merge_paired_loadstore): New. * tree-pass.h (make_pass_merge_paired_loadstore): New decl. * passes.def (pass_merge_paired_loadstore): New pass. * timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var. gcc/testsuite/ChangeLog 2014-05-15 Bin Cheng bin.ch...@arm.com * gcc.target/arm/merge-paired-loadstore.c: New test. merge-paired-loadstore-20140515.txt
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote: Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. So I have a target that has load and store multiple support that supports large a number of registers (2-n registers), and I added a sched0 pass that is a light copy of the regular scheduling pass that uses a different cost function which arranges all loads first, then all stores then everything else. Within a group of loads or stores the secondary key is the base register, the next key is the offset. The net result, all loads off the same register are sorted in increasing order. We then can use some define_insns and some peephole to patterns to take the seemingly unrelated instructions, which are now adjacent to knock them down into single instructions, instead of the mass of instructions they were before. And then a peephole pass that runs early to allow the patterns to do the heavy lifting. This scheme can handle unlimited complexity on the addressing forms just by ensuring the cost function for the new scheduling pass looks at all relevant bits (target dependent, if the simple machine independent form reg + n is not enough). The sched0 and the peephole pass run early: + NEXT_PASS (pass_sched0); + NEXT_PASS (pass_peephole1); NEXT_PASS (pass_web); NEXT_PASS (pass_rtl_cprop); NEXT_PASS (pass_cse2); (before register allocation) so, it can arrange to have things in adjacent registers for the load and store multiple instructions. The register allocator can then arrange all the code to use those registers directly. So, having done all this work, I think it would be nicer if there were a pass that managed it (so that one doesn’t have to write any of the peephole or the define_insns (you need like 3*n patterns, and the patterns of O(n), so, you need around n*4/2 lines of code, which is annoying for large n. A pass could use the existing load store multiple patterns directly, so, no additional port work. In my work, since I extend life times of values into registers, pretty much without limit, this could be a bad thing. The code is naturally limited to only extending the lives of things where load and store multiple are used, as if they aren’t used, the regular scheduling pass would undo all the sched0 motion. I choose a light copy of sched, as I don’t care about compile time, and I wanted a very small patch that was easy to maintain. If this pass when into trunk, we’d run the new passes _only_ if a port asked for them. 99% of the ports likely don’t want either, though, peephole before register allocation might be interesting for others to solve other issues. I wanted this to run before register allocation as my load and store multiple instructions only take consecutive register ranges (n-m), and I need the register allocator to manage to make it true. I do reg to reg motion to move things around as needed, but almost all I expect the allocator to get rid of. Very complex cases might wind up with a few extra moves, but I have nice bubbles that I can fit these extra moves into. In my scheme, no new options, no documentation for new options, no new param options, no silly constants, no hard to write/maintain pass, no new weird targets interface, no limit on just pairs, works on stores as well, runs earlier, 430 lines instead of 1086 lines, conceptually much simpler, added benefit of peephole before register allocation that can be used for other things by the port… So, my question is, does my scheme on your target find more or fewer things? Would your scheme pair pairs (so that 4 registers would go into 1 instruction)? diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c index 4317c9f..9702ebe 100644 --- a/gcc/haifa-sched.c +++ b/gcc/haifa-sched.c @@ -1361,7 +1361,10 @@ insn_cost (rtx insn) if (recog_memoized (insn) 0) return 0; - cost = insn_default_latency (insn); + if (sched_load) + cost = 0; + else + cost = insn_default_latency (insn); if (cost 0) cost = 0; @@ -1383,7 +1386,10 @@ insn_cost (rtx insn) } else { - cost = insn_default_latency (insn); + if (sched_load) + cost = 0; + else + cost = insn_default_latency (insn); if (cost 0) cost = 0; @@ -1568,6 +1574,27 @@ dep_list_size (rtx insn, sd_list_types_def list) return nodbgcount; } +static int saved_priority; +bool sched_load; + + +static int +pri (bool wasload, int regno, HOST_WIDE_INT offp) { + int p; + int off = (int)offp; + int m = /* ((114)-1) */ 0x3fff; + off = off + (113); + off = off m; + /* loads go first, then stores */ + p = wasload*(INT_MAX/4+1) + (INT_MAX/4+1); + /* then we sort upon the base register, increasing regnos. */ + p -= (regno 0xfff) 14; + /* then we sort on the offset,
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Thu, May 15, 2014 at 9:26 AM, bin.cheng wrote: Hi, Targets like ARM and AARCH64 support double-word load store instructions, and these instructions are generally faster than the corresponding two load/stores. GCC currently uses peephole2 to merge paired load/store into one single instruction which has a disadvantage. It can only handle simple cases like the two instructions actually appear sequentially in instruction stream, and is too weak to handle cases in which the two load/store are intervened by other irrelevant instructions. Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. The algorithm is pretty simple: 1) In initialization pass iterating over instruction stream it collects relevant memory access information for each instruction. 2) It iterates over each basic block, tries to find possible paired instruction for each memory access instruction. During this work, it checks dependencies between the two possible instructions and also records the information indicating how to pair the two instructions. To avoid quadratic behavior of the algorithm, It introduces new parameter max-merge-paired-loadstore-distance and set the default value to 4, which is large enough to catch major part of opportunities on ARM/cortex-a15. 3) For each candidate pair, it calls back-end's hook to do target dependent check and merge the two instructions if possible. Though the parameter is set to 4, for miscellaneous benchmarks, this pass can merge numerous opportunities except ones already merged by peephole2 (same level numbers of opportunities comparing to peepholed ones). GCC bootstrap can also confirm this finding. Yet there is an open issue about when we should run this new pass. Though register renaming is disabled by default now, I put this pass after it, because renaming can resolve some false dependencies thus benefit this pass. Another finding is, it can capture a lot more opportunities if it's after sched2, but I am not sure whether it will mess up with scheduling results in this way. So, any comments about this? First off: Why does this need a target hook? We're getting more and more of them -- too many IMHO. There should be really good reasons for adding even more new ones. Does this pass have to run after scheduling? The way you describe it, this sounds like a scheduling problem, where you don't need regrename to resolve false dependencies. Your sched2 pass should be able to prioritize mergeable loads/stores to schedule them adjacent. Of if you must do this before scheduling, then at least do it before sched2. Now you're really ruining the order of the scheduled instructions, which seems bad. I don't see how regrename will help resolve [base+offset] false dependencies. Can you explain? I'd expect effects from hardreg-copyprop commoning a base register. ChangeLog is missing the entry for arm.c. Your pass should make those peephole2's redundant, so you should remove the relevant define_peephole2's. + generated by spilling during reigster allocation. To catch all s/reigster/register/ + whose address expression is in the form of [base_offset]. It s/[base_offset]/[base+offset]/ + only guarantee that the two consecutive memory access instructions s/guarantee/guarantees/ + has no data dependency issues. After a pair is found it calls to s/has/have/ +/* Maximal number of memory references we support in single isntruction. */ s/Maximal/Maximum/ s/isntruction/instruction/ +/* Check instruction recorded in PINSN_INFO to see if it is the + first instruction of load store pair. If yes, record the + information in PPAIR_INFO and return TRUE. */ + +static bool +find_pair_x_insn (struct pair_info_t *ppair_info, The function description doesn't seem to match the function implementation. There is nothing in find_pair_x_insn that looks for a load/store pair. + /* Don't bother if base is a register and we are modifying it. */ + if (REG_P (base) reg_modified_p (base, insn)) +return false; You can shortcut this if you're looking at a load. + for (; *ref_rec; ref_rec++) Nit: while (*ref_rec) + /* Can't be paired if memory refs have different mode. */ + if (GET_MODE (mem) != mode) +return false; Not even if same GET_MODE_SIZE? + /* Clear PAIR_SINK if the memory unit of the first instruction + is clobbered between. */ + if ((pinfo-pair_kind PAIR_SINK) != 0 + MEM_CLOBBERED_P (pinfo-mem_clob, MEM_CLOB_SELF)) +pinfo-pair_kind = (~PAIR_SINK); This has the smell of architecture-specific behavior. Maybe you can't even pair instructions if there is a MEM_CLOB_SELF involved. Likewise for some of the checks in valid_insn_pair_p following the one quoted above. + HOST_WIDE_INT offset_t; We generally reserve _t for types, so I'd recommend a different name for this variable. +find_pair_y_insn
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On 05/15/14 10:51, Mike Stump wrote: On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote: Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. So I have a target that has load and store multiple support that supports large a number of registers (2-n registers), and I added a sched0 pass that is a light copy of the regular scheduling pass that uses a different cost function which arranges all loads first, then all stores then everything else. Within a group of loads or stores the secondary key is the base register, the next key is the offset. The net result, all loads off the same register are sorted in increasing order. Glad to see someone else stumble on (ab)using the scheduler to do this. I've poked at the scheduler several times to do similar stuff, but was never really satisfied with the results and never tried to polish those prototypes into something worth submitting. One example I've poked at was discovery of stores which then feed into a load from the same location. Which obviously we'd prefer to turn into a store + copy (subject to mess of constraints). There's a handful of these kind of transformations that seem to naturally drop out of this kind of work. Similarly a post-reload pass could be used to promote single word loads/stores to double-word operations. If anyone cared about PA 1.1 code generation, it'd be a much cleaner way to support the non-fused fmpyadd fmpysub insns. Anyway, if you want to move forward with the idea, I'd certainly support doing so. jeff
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On Thu, May 15, 2014 at 12:26 AM, bin.cheng bin.ch...@arm.com wrote: Hi, Targets like ARM and AARCH64 support double-word load store instructions, and these instructions are generally faster than the corresponding two load/stores. GCC currently uses peephole2 to merge paired load/store into one single instruction which has a disadvantage. It can only handle simple cases like the two instructions actually appear sequentially in instruction stream, and is too weak to handle cases in which the two load/store are intervened by other irrelevant instructions. Here comes up with a new GCC pass looking through each basic block and merging paired load store even they are not adjacent to each other. The algorithm is pretty simple: 1) In initialization pass iterating over instruction stream it collects relevant memory access information for each instruction. 2) It iterates over each basic block, tries to find possible paired instruction for each memory access instruction. During this work, it checks dependencies between the two possible instructions and also records the information indicating how to pair the two instructions. To avoid quadratic behavior of the algorithm, It introduces new parameter max-merge-paired-loadstore-distance and set the default value to 4, which is large enough to catch major part of opportunities on ARM/cortex-a15. 3) For each candidate pair, it calls back-end's hook to do target dependent check and merge the two instructions if possible. Though the parameter is set to 4, for miscellaneous benchmarks, this pass can merge numerous opportunities except ones already merged by peephole2 (same level numbers of opportunities comparing to peepholed ones). GCC bootstrap can also confirm this finding. Yet there is an open issue about when we should run this new pass. Though register renaming is disabled by default now, I put this pass after it, because renaming can resolve some false dependencies thus benefit this pass. Another finding is, it can capture a lot more opportunities if it's after sched2, but I am not sure whether it will mess up with scheduling results in this way. So, any comments about this? Thanks, bin 2014-05-15 Bin Cheng bin.ch...@arm.com * common.opt (flag_merge_paired_loadstore): New option. * merge-paired-loadstore.c: New file. * Makefile.in: Support new file. * config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro. (load_latency_expanded_p, arm_merge_paired_loadstore): New function. * params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param. * doc/invoke.texi (-fmerge-paired-loadstore): New. (max-merge-paired-loadstore-distance): New. * doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New. * doc/tm.texi: Regenerated. * target.def (merge_paired_loadstore): New. * tree-pass.h (make_pass_merge_paired_loadstore): New decl. * passes.def (pass_merge_paired_loadstore): New pass. * timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var. gcc/testsuite/ChangeLog 2014-05-15 Bin Cheng bin.ch...@arm.com * gcc.target/arm/merge-paired-loadstore.c: New test. Here is a testcase on x86-64: --- struct Foo { Foo (double x0, double x1, double x2) { data[0] = x0; data[1] = x1; data[2] = x2; } double data[3]; }; const Foo f1 (0.0, 0.0, 1.0); const Foo f2 (1.0, 0.0, 0.0); struct Bar { Bar (float x0, float x1, float x2, float x3, float x4) { data[0] = x0; data[1] = x1; data[2] = x2; data[3] = x3; data[4] = x4; } float data[5]; }; const Bar b1 (0.0, 0.0, 0.0, 0.0, 1.0); const Bar b2 (1.0, 0.0, 0.0, 0.0, 0.0); --- We generate xorpd %xmm0, %xmm0 movsd .LC1(%rip), %xmm1 movsd %xmm0, _ZL2f1(%rip) movsd %xmm0, _ZL2f1+8(%rip) movsd %xmm0, _ZL2f2+8(%rip) movsd %xmm0, _ZL2f2+16(%rip) xorps %xmm0, %xmm0 movsd %xmm1, _ZL2f1+16(%rip) movsd %xmm1, _ZL2f2(%rip) movss .LC3(%rip), %xmm1 movss %xmm0, _ZL2b1(%rip) movss %xmm0, _ZL2b1+4(%rip) movss %xmm0, _ZL2b1+8(%rip) movss %xmm0, _ZL2b1+12(%rip) movss %xmm1, _ZL2b1+16(%rip) movss %xmm1, _ZL2b2(%rip) movss %xmm0, _ZL2b2+4(%rip) movss %xmm0, _ZL2b2+8(%rip) movss %xmm0, _ZL2b2+12(%rip) movss %xmm0, _ZL2b2+16(%rip) There are pairs of movsd and sets of 4 movss. We should be able to handle more than 2 load/store insns. -- H.J.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On May 15, 2014, at 10:13 AM, Jeff Law l...@redhat.com wrote: I've poked at the scheduler several times to do similar stuff, but was never really satisfied with the results and never tried to polish those prototypes into something worth submitting. What was lacking? The cleanliness of the patch or the, it didn’t win me more than 0.01% code improvement so I never submitted it? One example I've poked at was discovery of stores which then feed into a load from the same location. Trivial to do with my patch, just change the sort key to arrange that to happen, however, if you want multiple support and this, then, you need to recognize store_m / load from a part of that, which would be kinda annoying. Anyway, if you want to move forward with the idea, I'd certainly support doing so. I’d be happy to clean up my patch as necessary and contribute it, however, people would have to tell me what they would like to see done to it. In my mind, the arm, more, nds32, rs6000 and s390 ports would be the biggest in tree winners. To be the most useful, my patch would benefit from an optimization person writing MI code to coalesce adjacent memory operations into either larger mode load and stores or into gen_load_multiple, gen_store_multiplethen, then all the port work (mine is around 16*n*n lines) would be zero.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On 05/15/14 12:41, Mike Stump wrote: On May 15, 2014, at 10:13 AM, Jeff Law l...@redhat.com wrote: I've poked at the scheduler several times to do similar stuff, but was never really satisfied with the results and never tried to polish those prototypes into something worth submitting. What was lacking? The cleanliness of the patch or the, it didn’t win me more than 0.01% code improvement so I never submitted it? Cleanliness and applicability of my implementation. For fmpyadd/fmpysub on the PA, my recollection was that the right way to go was to detect when both were in the ready queue at the same time, then resort the queue to have them issue consecutively. We didn't have the necessary hooks to have target dependent code examine and rearrange the queues back when I looked at this. By the time we had those capabilities, the PA8000 class processors had been released and those instructions were considered bad so I never came back to this. For the memory optimizations, IIRC, the dependencies keep them from getting into the ready queue at the same time. Thus it's significantly harder to get them to issue consecutively when you've got an issue rate 1. Trivial to do with my patch, just change the sort key to arrange that to happen, however, if you want multiple support and this, then, you need to recognize store_m / load from a part of that, which would be kinda annoying. But if you've got an issue rate 1, then it's a lot less likely you'll get the store/load pairing up the way you want. Arguably one could throttle down the issue rate for these scheduler-like passes which makes that problem go away. JEff
Re: [GCC RFC]A new and simple pass merging paired load store instructions
On May 15, 2014, at 1:01 PM, Jeff Law l...@redhat.com wrote: For the memory optimizations, IIRC, the dependencies keep them from getting into the ready queue at the same time. Thus it's significantly harder to get them to issue consecutively when you've got an issue rate 1. But if you've got an issue rate 1, then it's a lot less likely you'll get the store/load pairing up the way you want. Ah, yeah, on second thought, that won’t work if they are not ready together… I sort the ready queue: first ready set: load store everything else next ready set: load store everything else Now, it might work, if there are no other instructions and no loads in the next set, and so on, but that would be an accident. The utility is only in instructions that are ready together, though, tick boundaries and issue rates don’t matter any, as I don’t care about ticks or issue rates. The only consideration on order is my ordering operator and once ordered, the target is free to combine as it sees fit. So, for store load optimization, my patch is rather useless.