Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-20 Thread Bin.Cheng
On Tue, May 20, 2014 at 1:30 AM, Jeff Law l...@redhat.com wrote:
 On 05/19/14 00:38, Bin.Cheng wrote:

 On Sat, May 17, 2014 at 12:32 AM, Jeff Law l...@redhat.com wrote:

 On 05/16/14 04:07, Bin.Cheng wrote:



 But can't you go through movXX to generate either the simple insn on the
 ARM
 or the PARALLEL on the thumb?

 Yes, I think it's more than upsizing the mode.  There is another
 example from one of x86's candidate peephole patch at
 https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html

 The patch wants to do below transformation, which I think is very
 target dependent.

 Presumably there's no way to go through an expander here?
I don't know very much with respect to this case, maybe the patch
author can help here.


 The idea being that common cases where a pair moves can be turned into a
 single wider move without having to write target code to make that happen
 much of the time.  ie 2xQI-HI, 2xHI-SI, 2xSI-DI 2xSF-DF.  For things
 outside those simple cases, fall back to a target hook or a target expander.
This is practicable, but the question is why we would bother with
general cases if the hook interface is needed anyway.  Is it because
target calls are generally more expensive?

Thanks,
bin


-- 
Best Regards.


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-20 Thread Bin.Cheng
On Tue, May 20, 2014 at 5:02 AM, Mike Stump mikest...@comcast.net wrote:
 On May 19, 2014, at 10:30 AM, Jeff Law l...@redhat.com wrote:
 Yes, I think it's more than upsizing the mode.  There is another
 example from one of x86's candidate peephole patch at
 https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html

 The patch wants to do below transformation, which I think is very
 target dependent.
 Presumably there's no way to go through an expander here?

 The idea being that common cases where a pair moves can be turned into a 
 single wider move without having to write target code to make that happen 
 much of the time.  ie 2xQI-HI, 2xHI-SI, 2xSI-DI 2xSF-DF.  For things 
 outside those simple cases, fall back to a target hook or a target expander.

 For completeness, load_multiple can handle quite a bit in a generic way.  For 
 example, mine is wired to support 3 DI.
One problem is load/store_multiple only supports consecutive
registers, while targets like arm and aarch64 supports load/store of
non-consecutive registers.

Thanks,
bin


-- 
Best Regards.


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-20 Thread Jeff Law

On 05/20/14 01:13, Bin.Cheng wrote:


The idea being that common cases where a pair moves can be turned into a
single wider move without having to write target code to make that happen
much of the time.  ie 2xQI-HI, 2xHI-SI, 2xSI-DI 2xSF-DF.  For things
outside those simple cases, fall back to a target hook or a target expander.

This is practicable, but the question is why we would bother with
general cases if the hook interface is needed anyway.  Is it because
target calls are generally more expensive?
No, it's because most (all?) targets would benefit without having to 
write target dependent code.


jeff


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-20 Thread Wei Mi
On Tue, May 20, 2014 at 12:13 AM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Tue, May 20, 2014 at 1:30 AM, Jeff Law l...@redhat.com wrote:
 On 05/19/14 00:38, Bin.Cheng wrote:

 On Sat, May 17, 2014 at 12:32 AM, Jeff Law l...@redhat.com wrote:

 On 05/16/14 04:07, Bin.Cheng wrote:



 But can't you go through movXX to generate either the simple insn on the
 ARM
 or the PARALLEL on the thumb?

 Yes, I think it's more than upsizing the mode.  There is another
 example from one of x86's candidate peephole patch at
 https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html

 The patch wants to do below transformation, which I think is very
 target dependent.

 Presumably there's no way to go through an expander here?
 I don't know very much with respect to this case, maybe the patch
 author can help here.


I just checked the expand result for my case. TER could make expand
see the two intrinsics at the same time so in theory it is possible to
make the merge happen in expand. I havn't looked into detail.

Thanks,
Wei.


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-20 Thread Jeff Law

On 05/20/14 11:14, Wei Mi wrote:

On Tue, May 20, 2014 at 12:13 AM, Bin.Cheng amker.ch...@gmail.com wrote:

On Tue, May 20, 2014 at 1:30 AM, Jeff Law l...@redhat.com wrote:

On 05/19/14 00:38, Bin.Cheng wrote:


On Sat, May 17, 2014 at 12:32 AM, Jeff Law l...@redhat.com wrote:


On 05/16/14 04:07, Bin.Cheng wrote:



But can't you go through movXX to generate either the simple insn on the
ARM
or the PARALLEL on the thumb?


Yes, I think it's more than upsizing the mode.  There is another
example from one of x86's candidate peephole patch at
https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html

The patch wants to do below transformation, which I think is very
target dependent.


Presumably there's no way to go through an expander here?

I don't know very much with respect to this case, maybe the patch
author can help here.



I just checked the expand result for my case. TER could make expand
see the two intrinsics at the same time so in theory it is possible to
make the merge happen in expand. I havn't looked into detail.

I'm not referring to in the gimple-rtl expansion.

I'm referring to using a define_expand to generate the load/store 
multiple instructions.  Given define_expand in the backend, the target 
independent code can use HAVE_XXX and gen_XXX to test for the existence 
of the expander and to call the expander to generate insns.


Jeff


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-19 Thread Bin.Cheng
On Sat, May 17, 2014 at 12:52 AM, Mike Stump mikest...@comcast.net wrote:
 On May 16, 2014, at 3:07 AM, Bin.Cheng amker.ch...@gmail.com wrote:

 I don't see how regrename will help resolve [base+offset] false
 dependencies. Can you explain? I'd expect effects from
 hardreg-copyprop commoning a base register.
 It's the register operand's false dependency, rather than the base's
 one.  Considering below simple case:
mov r1,  #const1
store r1, [base+offset1]
mov r1, #const2
store r1, [base_offset2]
 It should be renamed into:
mov r1,  #const1
store r1, [base+offset1]
mov r2, #const2
store r2, [base_offset2]

 Ah, but, what did this look like right before pass_web?
I don't think this would be a problem for pre-RA, generally GCC won't
try to reuse pseudo register in this way, right?

Thanks,
bin


-- 
Best Regards.


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-19 Thread Bin.Cheng
On Sat, May 17, 2014 at 12:32 AM, Jeff Law l...@redhat.com wrote:
 On 05/16/14 04:07, Bin.Cheng wrote:

 Yes, I think this one does have a good reason.  The target independent
 pass just makes sure that two consecutive memory access instructions
 are free of data-dependency with each other, then feeds it to back-end
 hook.  It's back-end's responsibility to generate correct instruction.

 But given these two memory access insns, there's only a couple ways they're
 likely to combine into a single insn.  We could just as easily have the
 target independent code construct a new insn then try to recognize it.  If
 it's not recognized, then try the other way.

 Or is it the case that we're doing something beyond upsizing the mode?



   It's not about modifying an existing insn then recognize it, it's
 about creating new instruction sometimes.  For example, we can
 generate a simple move insn in Arm mode, while have to generate a
 parallel instruction in Thumb mode.  Target independent part has no
 idea how to generate an expected insn.  Moreover, back-end may check
 some special conditions too.

 But can't you go through movXX to generate either the simple insn on the ARM
 or the PARALLEL on the thumb?

Yes, I think it's more than upsizing the mode.  There is another
example from one of x86's candidate peephole patch at
https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html

The patch wants to do below transformation, which I think is very
target dependent.

+(define_peephole2
+  [(set (match_operand:DF 0 register_operand)
+   (match_operand:DF 1 memory_operand))
+   (set (match_operand:V2DF 2 register_operand)
+   (vec_concat:V2DF (match_dup 0)
+(match_operand:DF 3 memory_operand)))]
+  TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+REGNO (operands[0]) == REGNO (operands[2])
+adjacent_mem_locations (operands[1], operands[3])
+  [(set (match_dup 2)
+   (unspec:V2DF [(match_dup 4)] UNSPEC_LOADU))]
+
+;; merge movsd/movhpd to movupd when TARGET_SSE_UNALIGNED_STORE_OPTIMAL
+;; is true.
+(define_peephole2
+  [(set (match_operand:DF 0 memory_operand)
+(vec_select:DF (match_operand:V2DF 1 register_operand)
+  (parallel [(const_int 0)])))
+   (set (match_operand:DF 2 memory_operand)
+(vec_select:DF (match_dup 1)
+   (parallel [(const_int 1)])))]
+  TARGET_SSE_UNALIGNED_STORE_OPTIMAL
+adjacent_mem_locations (operands[0], operands[2])
+  [(set (match_dup 3)
+(unspec:V2DF [(match_dup 1)] UNSPEC_STOREU))]

Thanks,
bin


-- 
Best Regards.


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-19 Thread Bin.Cheng
On Sat, May 17, 2014 at 12:18 AM, Jeff Law l...@redhat.com wrote:
 On 05/16/14 04:07, Bin.Cheng wrote:

 On Fri, May 16, 2014 at 1:13 AM, Jeff Law l...@redhat.com wrote:

 On 05/15/14 10:51, Mike Stump wrote:


 On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote:


 Here comes up with a new GCC pass looking through each basic block
 and merging paired load store even they are not adjacent to each
 other.



 So I have a target that has load and store multiple support that
 supports large a number of registers (2-n registers), and I added a
 sched0 pass that is a light copy of the regular scheduling pass that
 uses a different cost function which arranges all loads first, then
 all stores then everything else.  Within a group of loads or stores
 the secondary key is the base register, the next key is the offset.
 The net result, all loads off the same register are sorted in
 increasing order.


 Glad to see someone else stumble on (ab)using the scheduler to do this.

 Emm, If it's (ab)using, should we still do it then?

 I think it'd still be fine.  There's even been a comment about doing this
 kind of thing in the scheduler that's been around since the early 90s...

 The scheduler is a bit interesting in that it has a wealth of dependency
 information and the ability to reorganize the insn stream in relatively
 arbitrary ways.  That seems to make it a natural place to think about
 transformations of this nature.  We just haven't had a good infrastructure
 for doing that.

 In theory we're a lot closer now to being able to plug in different
 costing/sorting models and let the scheduler do its thing.  Those models
 might rewrite for register pressure, or encourage certain independent insns
 to issue back-to-back to encourage combining, or to build candidate insns
 for delay slot scheduling, etc.


 As Mike stated, merging of consecutive memory accesses is all about
 the base register and the offset. I am thinking another method
 collecting all memory accesses with same base register then doing the
 merge work.  In this way, we should be able to merge more than 2
 instructions, also it would be possible to remove redundant load
 instructions in one pass.

 My question is how many these redundant loads could be?  Is there any
 rtl pass responsible for this now?

 I suspect it's a lot less important now than it used to be.  But there's
 probably some cases where it'd be useful.  Combining sub-word accesses into
 full-word accesses come immediately to mind.

 I'm not aware of any pass which does these kind of changes in a general
 form.  Some passes (caller-save) do a fair amount of work to track when they
 can generate multi-object loads/stores (and it was a huge win back on the
 old sparc processors).

Glad this RFC has attracted some attentions and thanks for all the
comments.  Here I can see four major concerns as below:
1) Should we do it in a separated pass, or just along with scheduler?

2) When should we run the new pass, before or after RA?  There are
both advantages and disadvantages and very depends on the target for
which we are compiling.
I have no simple answer to this.  Maybe we can run the pass twice or
follow Oleg's suggestion.  I think it's a new strategy for GCC to let
backend decide when to run a pass.

3) Do we need a new target hook interface?
I answered this in other messages and I still think it's target dependent.

4) The optimization should be able to handle cases with more than 2
consecutive load/store instructions.
The current implementation can't handle such cases and need further extension.

The 3) and 4) are just implementation questions, while I am not sure
about 1) and 2), so any more comments that we could make some
decisions to carry on this optimization?

Thanks,
bin

-- 
Best Regards.


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-19 Thread Jeff Law

On 05/19/14 00:38, Bin.Cheng wrote:

On Sat, May 17, 2014 at 12:52 AM, Mike Stump mikest...@comcast.net wrote:

On May 16, 2014, at 3:07 AM, Bin.Cheng amker.ch...@gmail.com wrote:



I don't see how regrename will help resolve [base+offset] false
dependencies. Can you explain? I'd expect effects from
hardreg-copyprop commoning a base register.

It's the register operand's false dependency, rather than the base's
one.  Considering below simple case:
mov r1,  #const1
store r1, [base+offset1]
mov r1, #const2
store r1, [base_offset2]
It should be renamed into:
mov r1,  #const1
store r1, [base+offset1]
mov r2, #const2
store r2, [base_offset2]


Ah, but, what did this look like right before pass_web?

I don't think this would be a problem for pre-RA, generally GCC won't
try to reuse pseudo register in this way, right?
There's nothing which would prevent reuse of a pseudo in this manner; 
for example, a backend might do something ugly like this in an expander 
and we have to handle it properly.


jeff


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-19 Thread Jeff Law

On 05/19/14 00:38, Bin.Cheng wrote:

1) Should we do it in a separated pass, or just along with scheduler?
ISTM that when we're able to combine insns that can impact the schedule 
we'd like to generate, possibly in significant ways.  That argues for a 
separate pass that runs before the scheduler.


So I'd probably look at splitting out the code which builds the 
dependency information, running that first, then the pass to merge these 
memory ops (incrementally updating the dependency info), then the scheduler.





2) When should we run the new pass, before or after RA?  There are
both advantages and disadvantages and very depends on the target for
which we are compiling.
I have no simple answer to this.  Maybe we can run the pass twice or
follow Oleg's suggestion.  I think it's a new strategy for GCC to let
backend decide when to run a pass.
If we only run the dependency stuff once and share the data between the 
new pass and the sceduler, then I think running the new pass twice just 
before scheduling would be fine.




3) Do we need a new target hook interface?
I answered this in other messages and I still think it's target dependent.

Still pondering that one :-)



4) The optimization should be able to handle cases with more than 2
consecutive load/store instructions.
The current implementation can't handle such cases and need further extension.
Yea, probably.  I think if we build the infrastructure right extension 
for  2 load/stores shouldn't be terribly bad to implement.


Jeff


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-19 Thread Jeff Law

On 05/19/14 00:38, Bin.Cheng wrote:

On Sat, May 17, 2014 at 12:32 AM, Jeff Law l...@redhat.com wrote:

On 05/16/14 04:07, Bin.Cheng wrote:


Yes, I think this one does have a good reason.  The target independent
pass just makes sure that two consecutive memory access instructions
are free of data-dependency with each other, then feeds it to back-end
hook.  It's back-end's responsibility to generate correct instruction.


But given these two memory access insns, there's only a couple ways they're
likely to combine into a single insn.  We could just as easily have the
target independent code construct a new insn then try to recognize it.  If
it's not recognized, then try the other way.

Or is it the case that we're doing something beyond upsizing the mode?




   It's not about modifying an existing insn then recognize it, it's
about creating new instruction sometimes.  For example, we can
generate a simple move insn in Arm mode, while have to generate a
parallel instruction in Thumb mode.  Target independent part has no
idea how to generate an expected insn.  Moreover, back-end may check
some special conditions too.


But can't you go through movXX to generate either the simple insn on the ARM
or the PARALLEL on the thumb?


Yes, I think it's more than upsizing the mode.  There is another
example from one of x86's candidate peephole patch at
https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html

The patch wants to do below transformation, which I think is very
target dependent.

Presumably there's no way to go through an expander here?

The idea being that common cases where a pair moves can be turned into a 
single wider move without having to write target code to make that 
happen much of the time.  ie 2xQI-HI, 2xHI-SI, 2xSI-DI 2xSF-DF.  For 
things outside those simple cases, fall back to a target hook or a 
target expander.


jeff



Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-19 Thread Mike Stump
On May 19, 2014, at 10:30 AM, Jeff Law l...@redhat.com wrote:
 Yes, I think it's more than upsizing the mode.  There is another
 example from one of x86's candidate peephole patch at
 https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html
 
 The patch wants to do below transformation, which I think is very
 target dependent.
 Presumably there's no way to go through an expander here?

 The idea being that common cases where a pair moves can be turned into a 
 single wider move without having to write target code to make that happen 
 much of the time.  ie 2xQI-HI, 2xHI-SI, 2xSI-DI 2xSF-DF.  For things 
 outside those simple cases, fall back to a target hook or a target expander.

For completeness, load_multiple can handle quite a bit in a generic way.  For 
example, mine is wired to support 3 DI.

Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-16 Thread Bin.Cheng
On Fri, May 16, 2014 at 12:57 AM, Steven Bosscher stevenb@gmail.com wrote:
 On Thu, May 15, 2014 at 9:26 AM, bin.cheng wrote:
 Hi,
 Targets like ARM and AARCH64 support double-word load store instructions,
 and these instructions are generally faster than the corresponding two
 load/stores.  GCC currently uses peephole2 to merge paired load/store into
 one single instruction which has a disadvantage.  It can only handle simple
 cases like the two instructions actually appear sequentially in instruction
 stream, and is too weak to handle cases in which the two load/store are
 intervened by other irrelevant instructions.

 Here comes up with a new GCC pass looking through each basic block and
 merging paired load store even they are not adjacent to each other.  The
 algorithm is pretty simple:
 1) In initialization pass iterating over instruction stream it collects
 relevant memory access information for each instruction.
 2) It iterates over each basic block, tries to find possible paired
 instruction for each memory access instruction.  During this work, it checks
 dependencies between the two possible instructions and also records the
 information indicating how to pair the two instructions.  To avoid quadratic
 behavior of the algorithm, It introduces new parameter
 max-merge-paired-loadstore-distance and set the default value to 4, which is
 large enough to catch major part of opportunities on ARM/cortex-a15.
 3) For each candidate pair, it calls back-end's hook to do target dependent
 check and merge the two instructions if possible.

 Though the parameter is set to 4, for miscellaneous benchmarks, this pass
 can merge numerous opportunities except ones already merged by peephole2
 (same level numbers of opportunities comparing to peepholed ones).  GCC
 bootstrap can also confirm this finding.

 Yet there is an open issue about when we should run this new pass.  Though
 register renaming is disabled by default now, I put this pass after it,
 because renaming can resolve some false dependencies thus benefit this pass.
 Another finding is, it can capture a lot more opportunities if it's after
 sched2, but I am not sure whether it will mess up with scheduling results in
 this way.

 So, any comments about this?

Hi Steven,
Thanks for reviewing this.  Here are some answers to the general questions.


 First off: Why does this need a target hook? We're getting more and
 more of them -- too many IMHO. There should be really good reasons for
 adding even more new ones.

Yes, I think this one does have a good reason.  The target independent
pass just makes sure that two consecutive memory access instructions
are free of data-dependency with each other, then feeds it to back-end
hook.  It's back-end's responsibility to generate correct instruction.
 It's not about modifying an existing insn then recognize it, it's
about creating new instruction sometimes.  For example, we can
generate a simple move insn in Arm mode, while have to generate a
parallel instruction in Thumb mode.  Target independent part has no
idea how to generate an expected insn.  Moreover, back-end may check
some special conditions too.


 Does this pass have to run after scheduling? The way you describe it,
No, I just meant there is more opportunities after regrenaming, and
even more opportunities after sched2, I haven't investigated reason
for the latter one yet,  but this pass doesn't depend on sched2 to
work.

 this sounds like a scheduling problem, where you don't need regrename
 to resolve false dependencies. Your sched2 pass should be able to
 prioritize mergeable loads/stores to schedule them adjacent. Of if you
 must do this before scheduling, then at least do it before sched2. Now
 you're really ruining the order of the scheduled instructions, which
 seems bad.
Yes, I agree it should NOT disturb scheduling results, that's why I
put it before sched2 and after register renaming right now.


 I don't see how regrename will help resolve [base+offset] false
 dependencies. Can you explain? I'd expect effects from
 hardreg-copyprop commoning a base register.
It's the register operand's false dependency, rather than the base's
one.  Considering below simple case:
mov r1,  #const1
store r1, [base+offset1]
mov r1, #const2
store r1, [base_offset2]
It should be renamed into:
mov r1,  #const1
store r1, [base+offset1]
mov r2, #const2
store r2, [base_offset2]
Then caught by the patch.

I will leave other comments for a moment after more discussion here.

Thanks,
bin

 ChangeLog is missing the entry for arm.c.

 Your pass should make those peephole2's redundant, so you should
 remove the relevant define_peephole2's.


  +   generated by spilling during reigster allocation.  To catch all

 s/reigster/register/


 +   whose address expression is in the form of [base_offset].  It

 s/[base_offset]/[base+offset]/


 +   only guarantee that the two consecutive memory access instructions

 s/guarantee/guarantees/


 +   

Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-16 Thread Bin.Cheng
On Fri, May 16, 2014 at 1:13 AM, Jeff Law l...@redhat.com wrote:
 On 05/15/14 10:51, Mike Stump wrote:

 On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote:

 Here comes up with a new GCC pass looking through each basic block
 and merging paired load store even they are not adjacent to each
 other.


 So I have a target that has load and store multiple support that
 supports large a number of registers (2-n registers), and I added a
 sched0 pass that is a light copy of the regular scheduling pass that
 uses a different cost function which arranges all loads first, then
 all stores then everything else.  Within a group of loads or stores
 the secondary key is the base register, the next key is the offset.
 The net result, all loads off the same register are sorted in
 increasing order.

 Glad to see someone else stumble on (ab)using the scheduler to do this.
Emm, If it's (ab)using, should we still do it then?


 I've poked at the scheduler several times to do similar stuff, but was never
 really satisfied with the results and never tried to polish those prototypes
 into something worth submitting.

 One example I've poked at was discovery of stores which then feed into a
 load from the same location.  Which obviously we'd prefer to turn into a
 store + copy (subject to mess of constraints).  There's a handful of these
 kind of transformations that seem to naturally drop out of this kind of
 work.
As Mike stated, merging of consecutive memory accesses is all about
the base register and the offset. I am thinking another method
collecting all memory accesses with same base register then doing the
merge work.  In this way, we should be able to merge more than 2
instructions, also it would be possible to remove redundant load
instructions in one pass.

My question is how many these redundant loads could be?  Is there any
rtl pass responsible for this now?

Thanks,
bin

 Similarly a post-reload pass could be used to promote single word
 loads/stores to double-word operations.

 If anyone cared about PA 1.1 code generation, it'd be a much cleaner way to
 support the non-fused fmpyadd fmpysub insns.

 Anyway, if you want to move forward with the idea, I'd certainly support
 doing so.

 jeff



-- 
Best Regards.


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-16 Thread Bin.Cheng
On Fri, May 16, 2014 at 12:51 AM, Mike Stump mikest...@comcast.net wrote:
 On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote:
 Here comes up with a new GCC pass looking through each basic block and
 merging paired load store even they are not adjacent to each other.

 So I have a target that has load and store multiple support that supports 
 large a number of registers (2-n registers), and I added a sched0 pass that 
 is a light copy of the regular scheduling pass that uses a different cost 
 function which arranges all loads first, then all stores then everything 
 else.  Within a group of loads or stores the secondary key is the base 
 register, the next key is the offset.  The net result, all loads off the same 
 register are sorted in increasing order.  We then can use some define_insns 
 and some peephole to patterns to take the seemingly unrelated instructions, 
 which are now adjacent to knock them down into single instructions, instead 
 of the mass of instructions they were before.  And then a peephole pass that 
 runs early to allow the patterns to do the heavy lifting.  This scheme can 
 handle unlimited complexity on the addressing forms just by ensuring the cost 
 function for the new scheduling pass looks at all relevant bits (target 
 dependent, if the simple machine independent form reg + n is not enough).  
 The sched0 and the peephole pass run early:

 +  NEXT_PASS (pass_sched0);
 +  NEXT_PASS (pass_peephole1);
NEXT_PASS (pass_web);
NEXT_PASS (pass_rtl_cprop);
NEXT_PASS (pass_cse2);

 (before register allocation) so, it can arrange to have things in adjacent 
 registers for the load and store multiple instructions.  The register 
 allocator can then arrange all the code to use those registers directly.

 So, having done all this work, I think it would be nicer if there were a pass 
 that managed it (so that one doesn't have to write any of the peephole or the 
 define_insns (you need like 3*n patterns, and the patterns of O(n), so, you 
 need around n*4/2 lines of code, which is annoying for large n.  A pass could 
 use the existing load store multiple patterns directly, so, no additional 
 port work.  In my work, since I extend life times of values into registers, 
 pretty much without limit, this could be a bad thing.  The code is naturally 
 limited to only extending the lives of things where load and store multiple 
 are used, as if they aren't used, the regular scheduling pass would undo all 
 the sched0 motion.  I choose a light copy of sched, as I don't care about 
 compile time, and I wanted a very small patch that was easy to maintain.  If 
 this pass when into trunk, we'd run the new passes _only_ if a port asked for 
 them.  99% of the ports likely don't want either, though, peephole before 
 register allocation might be interesting for others to solve other issues.

 I wanted this to run before register allocation as my load and store multiple 
 instructions only take consecutive register ranges (n-m), and I need the 
 register allocator to manage to make it true.  I do reg to reg motion to move 
 things around as needed, but almost all I expect the allocator to get rid of. 
  Very complex cases might wind up with a few extra moves, but I have nice 
 bubbles that I can fit these extra moves into.

 In my scheme, no new options, no documentation for new options, no new param 
 options, no silly constants, no hard to write/maintain pass, no new weird 
 targets interface, no limit on just pairs, works on stores as well, runs 
 earlier, 430 lines instead of 1086 lines, conceptually much simpler, added 
 benefit of peephole before register allocation that can be used for other 
 things by the port...

 So, my question is, does my scheme on your target find more or fewer things?  
 Would your scheme pair pairs (so that 4 registers would go into 1 
 instruction)?

Hi Mike,
Thanks for bringing up this new method.  I had to admit that I was not
very much into this method at the first glance especially it requires
another pass in cooperation with scheduler.

For the first question, unfortunately, I can't apply the patch to svn
trunk, could you please update it thus I can do some experiment with
it?  From my experience, lots of opportunities on ARM are generated by
RA, putting it before RA would miss many cases.  Another possible
issue is about interaction with other optimizations.  Putting it at
early stage of RTL, we may disturb other optimizations like
/gcse/-fgcse-lm/-fgcse-sm/dse.  Of course, it has advantages too, for
example, fwprop_addr sometimes corrupt load/store pair opportunities
on ARM, it won't be a problem for your patch.

For the second question, the answer is no for current implementation.
For ARM the most important opportunity is the paired one, so I just
started with pair of two instructions, which is much simpler.  I do
have a draft idea how to merge more than two instructions, but haven't
worked on it yet.

Thanks,
bin


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-16 Thread Bin.Cheng
On Thu, May 15, 2014 at 6:31 PM, Oleg Endo oleg.e...@t-online.de wrote:
 Hi,

 On 15 May 2014, at 09:26, bin.cheng bin.ch...@arm.com wrote:

 Hi,
 Targets like ARM and AARCH64 support double-word load store instructions,
 and these instructions are generally faster than the corresponding two
 load/stores.  GCC currently uses peephole2 to merge paired load/store into
 one single instruction which has a disadvantage.  It can only handle simple
 cases like the two instructions actually appear sequentially in instruction
 stream, and is too weak to handle cases in which the two load/store are
 intervened by other irrelevant instructions.

 Here comes up with a new GCC pass looking through each basic block and
 merging paired load store even they are not adjacent to each other.  The
 algorithm is pretty simple:
 1) In initialization pass iterating over instruction stream it collects
 relevant memory access information for each instruction.
 2) It iterates over each basic block, tries to find possible paired
 instruction for each memory access instruction.  During this work, it checks
 dependencies between the two possible instructions and also records the
 information indicating how to pair the two instructions.  To avoid quadratic
 behavior of the algorithm, It introduces new parameter
 max-merge-paired-loadstore-distance and set the default value to 4, which is
 large enough to catch major part of opportunities on ARM/cortex-a15.
 3) For each candidate pair, it calls back-end's hook to do target dependent
 check and merge the two instructions if possible.

 Though the parameter is set to 4, for miscellaneous benchmarks, this pass
 can merge numerous opportunities except ones already merged by peephole2
 (same level numbers of opportunities comparing to peepholed ones).  GCC
 bootstrap can also confirm this finding.

 This is interesting.  E.g. on SH there are insns to load/store SFmode pairs.  
 However, these insns require a mode switch and have some constraints on 
 register usage.  So in the SH case the load/store pairing would need to be 
 done before reg alloc and before mode switching.


 Yet there is an open issue about when we should run this new pass.  Though
 register renaming is disabled by default now, I put this pass after it,
 because renaming can resolve some false dependencies thus benefit this pass.
 Another finding is, it can capture a lot more opportunities if it's after
 sched2, but I am not sure whether it will mess up with scheduling results in
 this way.

 How about the following.
 Instead of adding new hooks and inserting the pass to the general pass list, 
 make the new
 pass class take the necessary callback functions directly.  Then targets can 
 just instantiate
 the pass, passing their impl of the callbacks, and insert the pass object 
 into the pass list at
 a place that fits best for the target.
Oh, I don't know we can do this in GCC.  But yes, a target may want to
run it at some place that fits best for the target.

Thanks,
bin



 So, any comments about this?

 Thanks,
 bin


 2014-05-15  Bin Cheng  bin.ch...@arm.com
* common.opt (flag_merge_paired_loadstore): New option.
* merge-paired-loadstore.c: New file.
* Makefile.in: Support new file.
* config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro.
(load_latency_expanded_p, arm_merge_paired_loadstore): New function.
* params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param.
* doc/invoke.texi (-fmerge-paired-loadstore): New.
(max-merge-paired-loadstore-distance): New.
* doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New.
* doc/tm.texi: Regenerated.
* target.def (merge_paired_loadstore): New.
* tree-pass.h (make_pass_merge_paired_loadstore): New decl.
* passes.def (pass_merge_paired_loadstore): New pass.
* timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var.

 gcc/testsuite/ChangeLog
 2014-05-15  Bin Cheng  bin.ch...@arm.com

* gcc.target/arm/merge-paired-loadstore.c: New test.

 merge-paired-loadstore-20140515.txt



-- 
Best Regards.


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-16 Thread Richard Biener
On Fri, May 16, 2014 at 12:10 PM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Thu, May 15, 2014 at 6:31 PM, Oleg Endo oleg.e...@t-online.de wrote:
 Hi,

 On 15 May 2014, at 09:26, bin.cheng bin.ch...@arm.com wrote:

 Hi,
 Targets like ARM and AARCH64 support double-word load store instructions,
 and these instructions are generally faster than the corresponding two
 load/stores.  GCC currently uses peephole2 to merge paired load/store into
 one single instruction which has a disadvantage.  It can only handle simple
 cases like the two instructions actually appear sequentially in instruction
 stream, and is too weak to handle cases in which the two load/store are
 intervened by other irrelevant instructions.

 Here comes up with a new GCC pass looking through each basic block and
 merging paired load store even they are not adjacent to each other.  The
 algorithm is pretty simple:
 1) In initialization pass iterating over instruction stream it collects
 relevant memory access information for each instruction.
 2) It iterates over each basic block, tries to find possible paired
 instruction for each memory access instruction.  During this work, it checks
 dependencies between the two possible instructions and also records the
 information indicating how to pair the two instructions.  To avoid quadratic
 behavior of the algorithm, It introduces new parameter
 max-merge-paired-loadstore-distance and set the default value to 4, which is
 large enough to catch major part of opportunities on ARM/cortex-a15.
 3) For each candidate pair, it calls back-end's hook to do target dependent
 check and merge the two instructions if possible.

 Though the parameter is set to 4, for miscellaneous benchmarks, this pass
 can merge numerous opportunities except ones already merged by peephole2
 (same level numbers of opportunities comparing to peepholed ones).  GCC
 bootstrap can also confirm this finding.

 This is interesting.  E.g. on SH there are insns to load/store SFmode pairs. 
  However, these insns require a mode switch and have some constraints on 
 register usage.  So in the SH case the load/store pairing would need to be 
 done before reg alloc and before mode switching.


 Yet there is an open issue about when we should run this new pass.  Though
 register renaming is disabled by default now, I put this pass after it,
 because renaming can resolve some false dependencies thus benefit this pass.
 Another finding is, it can capture a lot more opportunities if it's after
 sched2, but I am not sure whether it will mess up with scheduling results in
 this way.

 How about the following.
 Instead of adding new hooks and inserting the pass to the general pass list, 
 make the new
 pass class take the necessary callback functions directly.  Then targets can 
 just instantiate
 the pass, passing their impl of the callbacks, and insert the pass object 
 into the pass list at
 a place that fits best for the target.
 Oh, I don't know we can do this in GCC.  But yes, a target may want to
 run it at some place that fits best for the target.

Btw, the bswap pass enhancements that are currently in review may
also be an opportunity to catch these.  They can merge adjacent
loads that are used composed (but not yet composed by storing
into adjacent memory).  The basic-block vectorizer should also
handle this (if the composition happens to be by storing into
adjacent memory) - of course it needs vector modes available and
it has to be enabled.

Richard.

 Thanks,
 bin



 So, any comments about this?

 Thanks,
 bin


 2014-05-15  Bin Cheng  bin.ch...@arm.com
* common.opt (flag_merge_paired_loadstore): New option.
* merge-paired-loadstore.c: New file.
* Makefile.in: Support new file.
* config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro.
(load_latency_expanded_p, arm_merge_paired_loadstore): New function.
* params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param.
* doc/invoke.texi (-fmerge-paired-loadstore): New.
(max-merge-paired-loadstore-distance): New.
* doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New.
* doc/tm.texi: Regenerated.
* target.def (merge_paired_loadstore): New.
* tree-pass.h (make_pass_merge_paired_loadstore): New decl.
* passes.def (pass_merge_paired_loadstore): New pass.
* timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var.

 gcc/testsuite/ChangeLog
 2014-05-15  Bin Cheng  bin.ch...@arm.com

* gcc.target/arm/merge-paired-loadstore.c: New test.

 merge-paired-loadstore-20140515.txt



 --
 Best Regards.


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-16 Thread Steven Bosscher
On Fri, May 16, 2014 at 12:51 PM, Richard Biener wrote:
 Btw, the bswap pass enhancements that are currently in review may
 also be an opportunity to catch these.  They can merge adjacent
 loads that are used composed (but not yet composed by storing
 into adjacent memory).  The basic-block vectorizer should also
 handle this (if the composition happens to be by storing into
 adjacent memory) - of course it needs vector modes available and
 it has to be enabled.

That won't work for cleaning up spill code load/reloads, which is one
of the motivations for the new pass.

Ciao!
Steven


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-16 Thread Ramana Radhakrishnan

 Btw, the bswap pass enhancements that are currently in review may
 also be an opportunity to catch these.  They can merge adjacent
 loads that are used composed (but not yet composed by storing
 into adjacent memory).  The basic-block vectorizer should also
 handle this (if the composition happens to be by storing into
 adjacent memory) - of course it needs vector modes available and
 it has to be enabled.

Should we really do it there ?  If we start merging multiple loads and
stores into the vector register set, on some architectures or
microarchitectures the cost of moving to and from the vector register
set to the general purpose register set might be too expensive for
some operations and then we get into all sorts of issues.

I think there is merit in an RTL pass.

Ramana


 Richard.

 Thanks,
 bin



 So, any comments about this?

 Thanks,
 bin


 2014-05-15  Bin Cheng  bin.ch...@arm.com
* common.opt (flag_merge_paired_loadstore): New option.
* merge-paired-loadstore.c: New file.
* Makefile.in: Support new file.
* config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro.
(load_latency_expanded_p, arm_merge_paired_loadstore): New function.
* params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param.
* doc/invoke.texi (-fmerge-paired-loadstore): New.
(max-merge-paired-loadstore-distance): New.
* doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New.
* doc/tm.texi: Regenerated.
* target.def (merge_paired_loadstore): New.
* tree-pass.h (make_pass_merge_paired_loadstore): New decl.
* passes.def (pass_merge_paired_loadstore): New pass.
* timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var.

 gcc/testsuite/ChangeLog
 2014-05-15  Bin Cheng  bin.ch...@arm.com

* gcc.target/arm/merge-paired-loadstore.c: New test.

 merge-paired-loadstore-20140515.txt



 --
 Best Regards.


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-16 Thread Jeff Law

On 05/16/14 04:07, Bin.Cheng wrote:

On Fri, May 16, 2014 at 1:13 AM, Jeff Law l...@redhat.com wrote:

On 05/15/14 10:51, Mike Stump wrote:


On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote:


Here comes up with a new GCC pass looking through each basic block
and merging paired load store even they are not adjacent to each
other.



So I have a target that has load and store multiple support that
supports large a number of registers (2-n registers), and I added a
sched0 pass that is a light copy of the regular scheduling pass that
uses a different cost function which arranges all loads first, then
all stores then everything else.  Within a group of loads or stores
the secondary key is the base register, the next key is the offset.
The net result, all loads off the same register are sorted in
increasing order.


Glad to see someone else stumble on (ab)using the scheduler to do this.

Emm, If it's (ab)using, should we still do it then?
I think it'd still be fine.  There's even been a comment about doing 
this kind of thing in the scheduler that's been around since the early 
90s...


The scheduler is a bit interesting in that it has a wealth of dependency 
information and the ability to reorganize the insn stream in relatively 
arbitrary ways.  That seems to make it a natural place to think about 
transformations of this nature.  We just haven't had a good 
infrastructure for doing that.


In theory we're a lot closer now to being able to plug in different 
costing/sorting models and let the scheduler do its thing.  Those models 
might rewrite for register pressure, or encourage certain independent 
insns to issue back-to-back to encourage combining, or to build 
candidate insns for delay slot scheduling, etc.



As Mike stated, merging of consecutive memory accesses is all about
the base register and the offset. I am thinking another method
collecting all memory accesses with same base register then doing the
merge work.  In this way, we should be able to merge more than 2
instructions, also it would be possible to remove redundant load
instructions in one pass.

My question is how many these redundant loads could be?  Is there any
rtl pass responsible for this now?
I suspect it's a lot less important now than it used to be.  But there's 
probably some cases where it'd be useful.  Combining sub-word accesses 
into full-word accesses come immediately to mind.


I'm not aware of any pass which does these kind of changes in a general 
form.  Some passes (caller-save) do a fair amount of work to track when 
they can generate multi-object loads/stores (and it was a huge win back 
on the old sparc processors).



jeff


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-16 Thread Jeff Law

On 05/16/14 04:07, Bin.Cheng wrote:


Yes, I think this one does have a good reason.  The target independent
pass just makes sure that two consecutive memory access instructions
are free of data-dependency with each other, then feeds it to back-end
hook.  It's back-end's responsibility to generate correct instruction.
But given these two memory access insns, there's only a couple ways 
they're likely to combine into a single insn.  We could just as easily 
have the target independent code construct a new insn then try to 
recognize it.  If it's not recognized, then try the other way.


Or is it the case that we're doing something beyond upsizing the mode?



  It's not about modifying an existing insn then recognize it, it's
about creating new instruction sometimes.  For example, we can
generate a simple move insn in Arm mode, while have to generate a
parallel instruction in Thumb mode.  Target independent part has no
idea how to generate an expected insn.  Moreover, back-end may check
some special conditions too.
But can't you go through movXX to generate either the simple insn on the 
ARM or the PARALLEL on the thumb?


Jeff


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-16 Thread Mike Stump
On May 16, 2014, at 3:07 AM, Bin.Cheng amker.ch...@gmail.com wrote:
 
 I don't see how regrename will help resolve [base+offset] false
 dependencies. Can you explain? I'd expect effects from
 hardreg-copyprop commoning a base register.
 It's the register operand's false dependency, rather than the base's
 one.  Considering below simple case:
mov r1,  #const1
store r1, [base+offset1]
mov r1, #const2
store r1, [base_offset2]
 It should be renamed into:
mov r1,  #const1
store r1, [base+offset1]
mov r2, #const2
store r2, [base_offset2]

Ah, but, what did this look like right before pass_web?


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-16 Thread Oleg Endo
On Fri, 2014-05-16 at 18:10 +0800, Bin.Cheng wrote:
 On Thu, May 15, 2014 at 6:31 PM, Oleg Endo oleg.e...@t-online.de wrote:
 
  How about the following.
  Instead of adding new hooks and inserting the pass to the general pass 
  list, make the new
  pass class take the necessary callback functions directly.  Then targets 
  can just instantiate
  the pass, passing their impl of the callbacks, and insert the pass object 
  into the pass list at
  a place that fits best for the target.

 Oh, I don't know we can do this in GCC.  But yes, a target may want to
 run it at some place that fits best for the target.
 

I think it's better than trying to come up with a scheme that so-so fits
all.  My idea would look like:

// merge_paired_loadstore.h
class merge_paired_loadstore : public rtl_opt_pass
{
public:
  struct delegate
  {
virtual bool merge_paired_loadstore (rtx x, rtx y, ...) = 0;
...
  };

  merge_paired_loadstore (gcc::context* ctx, const char* name,
  delegate* d);
  ...
};

// target.cc

#include merge_paired_loadstore.h

static struct target_merge_loadstore_delegate :
merge_paired_loadstore::delegate
{
  virtual bool merge_paired_loadstore (...)
  {
 // code as if this was a freestanding target hook function
  };
} g_merge_loadstore_delegate;


static void target_register_passes (void)
{
  register_pass (
new merge_paired_loadstore (g, merge_ls,  
g_merge_loadstore_delegate),
PASS_POS_INSERT_AFTER, other pass, 1);
}


Then, later, maybe sometime in the future, if there's something like a
class target, it'd look like:

class my_target : public target,
  merge_paired_loadstore::delegate
{
   ... 

   virtual bool merge_paired_loadstore (...);
};

Maybe it's a bit far fetched at the moment, but it would be a start.

Cheers,
Oleg



Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-15 Thread Oleg Endo
Hi,

On 15 May 2014, at 09:26, bin.cheng bin.ch...@arm.com wrote:

 Hi,
 Targets like ARM and AARCH64 support double-word load store instructions,
 and these instructions are generally faster than the corresponding two
 load/stores.  GCC currently uses peephole2 to merge paired load/store into
 one single instruction which has a disadvantage.  It can only handle simple
 cases like the two instructions actually appear sequentially in instruction
 stream, and is too weak to handle cases in which the two load/store are
 intervened by other irrelevant instructions.
 
 Here comes up with a new GCC pass looking through each basic block and
 merging paired load store even they are not adjacent to each other.  The
 algorithm is pretty simple:
 1) In initialization pass iterating over instruction stream it collects
 relevant memory access information for each instruction.
 2) It iterates over each basic block, tries to find possible paired
 instruction for each memory access instruction.  During this work, it checks
 dependencies between the two possible instructions and also records the
 information indicating how to pair the two instructions.  To avoid quadratic
 behavior of the algorithm, It introduces new parameter
 max-merge-paired-loadstore-distance and set the default value to 4, which is
 large enough to catch major part of opportunities on ARM/cortex-a15.
 3) For each candidate pair, it calls back-end's hook to do target dependent
 check and merge the two instructions if possible.
 
 Though the parameter is set to 4, for miscellaneous benchmarks, this pass
 can merge numerous opportunities except ones already merged by peephole2
 (same level numbers of opportunities comparing to peepholed ones).  GCC
 bootstrap can also confirm this finding.

This is interesting.  E.g. on SH there are insns to load/store SFmode pairs.  
However, these insns require a mode switch and have some constraints on 
register usage.  So in the SH case the load/store pairing would need to be done 
before reg alloc and before mode switching.

 
 Yet there is an open issue about when we should run this new pass.  Though
 register renaming is disabled by default now, I put this pass after it,
 because renaming can resolve some false dependencies thus benefit this pass.
 Another finding is, it can capture a lot more opportunities if it's after
 sched2, but I am not sure whether it will mess up with scheduling results in
 this way.

How about the following.
Instead of adding new hooks and inserting the pass to the general pass list, 
make the new pass class take the necessary callback functions directly.  Then 
targets can just instantiate the pass, passing their impl of the callbacks, and 
insert the pass object into the pass list at a place that fits best for the 
target.


 
 So, any comments about this?
 
 Thanks,
 bin
 
 
 2014-05-15  Bin Cheng  bin.ch...@arm.com
* common.opt (flag_merge_paired_loadstore): New option.
* merge-paired-loadstore.c: New file.
* Makefile.in: Support new file.
* config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro.
(load_latency_expanded_p, arm_merge_paired_loadstore): New function.
* params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param.
* doc/invoke.texi (-fmerge-paired-loadstore): New.
(max-merge-paired-loadstore-distance): New.
* doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New.
* doc/tm.texi: Regenerated.
* target.def (merge_paired_loadstore): New.
* tree-pass.h (make_pass_merge_paired_loadstore): New decl.
* passes.def (pass_merge_paired_loadstore): New pass.
* timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var.
 
 gcc/testsuite/ChangeLog
 2014-05-15  Bin Cheng  bin.ch...@arm.com
 
* gcc.target/arm/merge-paired-loadstore.c: New test.
 
 merge-paired-loadstore-20140515.txt


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-15 Thread Mike Stump
On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote:
 Here comes up with a new GCC pass looking through each basic block and
 merging paired load store even they are not adjacent to each other.

So I have a target that has load and store multiple support that supports large 
a number of registers (2-n registers), and I added a sched0 pass that is a 
light copy of the regular scheduling pass that uses a different cost function 
which arranges all loads first, then all stores then everything else.  Within a 
group of loads or stores the secondary key is the base register, the next key 
is the offset.  The net result, all loads off the same register are sorted in 
increasing order.  We then can use some define_insns and some peephole to 
patterns to take the seemingly unrelated instructions, which are now adjacent 
to knock them down into single instructions, instead of the mass of 
instructions they were before.  And then a peephole pass that runs early to 
allow the patterns to do the heavy lifting.  This scheme can handle unlimited 
complexity on the addressing forms just by ensuring the cost function for the 
new scheduling pass looks at all relevant bits (target dependent, if the simple 
machine independent form reg + n is not enough).  The sched0 and the peephole 
pass run early:

+  NEXT_PASS (pass_sched0);
+  NEXT_PASS (pass_peephole1);
   NEXT_PASS (pass_web);
   NEXT_PASS (pass_rtl_cprop);
   NEXT_PASS (pass_cse2);

(before register allocation) so, it can arrange to have things in adjacent 
registers for the load and store multiple instructions.  The register allocator 
can then arrange all the code to use those registers directly.

So, having done all this work, I think it would be nicer if there were a pass 
that managed it (so that one doesn’t have to write any of the peephole or the 
define_insns (you need like 3*n patterns, and the patterns of O(n), so, you 
need around n*4/2 lines of code, which is annoying for large n.  A pass could 
use the existing load store multiple patterns directly, so, no additional port 
work.  In my work, since I extend life times of values into registers, pretty 
much without limit, this could be a bad thing.  The code is naturally limited 
to only extending the lives of things where load and store multiple are used, 
as if they aren’t used, the regular scheduling pass would undo all the sched0 
motion.  I choose a light copy of sched, as I don’t care about compile time, 
and I wanted a very small patch that was easy to maintain.  If this pass when 
into trunk, we’d run the new passes _only_ if a port asked for them.  99% of 
the ports likely don’t want either, though, peephole before register allocation 
might be interesting for others to solve other issues.

I wanted this to run before register allocation as my load and store multiple 
instructions only take consecutive register ranges (n-m), and I need the 
register allocator to manage to make it true.  I do reg to reg motion to move 
things around as needed, but almost all I expect the allocator to get rid of.  
Very complex cases might wind up with a few extra moves, but I have nice 
bubbles that I can fit these extra moves into.

In my scheme, no new options, no documentation for new options, no new param 
options, no silly constants, no hard to write/maintain pass, no new weird 
targets interface, no limit on just pairs, works on stores as well, runs 
earlier, 430 lines instead of 1086 lines, conceptually much simpler, added 
benefit of peephole before register allocation that can be used for other 
things by the port…

So, my question is, does my scheme on your target find more or fewer things?  
Would your scheme pair pairs (so that 4 registers would go into 1 instruction)?

diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 4317c9f..9702ebe 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -1361,7 +1361,10 @@ insn_cost (rtx insn)
   if (recog_memoized (insn)  0)
return 0;
 
-  cost = insn_default_latency (insn);
+  if (sched_load)
+   cost = 0;
+  else
+   cost = insn_default_latency (insn);
   if (cost  0)
cost = 0;
 
@@ -1383,7 +1386,10 @@ insn_cost (rtx insn)
}
   else
{
- cost = insn_default_latency (insn);
+ if (sched_load)
+   cost = 0;
+ else
+   cost = insn_default_latency (insn);
  if (cost  0)
cost = 0;
 
@@ -1568,6 +1574,27 @@ dep_list_size (rtx insn, sd_list_types_def list)
   return nodbgcount;
 }
 
+static int saved_priority;
+bool sched_load;
+
+
+static int
+pri (bool wasload, int regno, HOST_WIDE_INT offp) {
+  int p;
+  int off = (int)offp;
+  int m = /* ((114)-1) */ 0x3fff;
+  off = off + (113);
+  off = off  m;
+  /* loads go first, then stores */
+  p = wasload*(INT_MAX/4+1) + (INT_MAX/4+1);
+  /* then we sort upon the base register, increasing regnos.  */
+  p -= (regno  0xfff) 14;
+  /* then we sort on the offset, 

Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-15 Thread Steven Bosscher
On Thu, May 15, 2014 at 9:26 AM, bin.cheng wrote:
 Hi,
 Targets like ARM and AARCH64 support double-word load store instructions,
 and these instructions are generally faster than the corresponding two
 load/stores.  GCC currently uses peephole2 to merge paired load/store into
 one single instruction which has a disadvantage.  It can only handle simple
 cases like the two instructions actually appear sequentially in instruction
 stream, and is too weak to handle cases in which the two load/store are
 intervened by other irrelevant instructions.

 Here comes up with a new GCC pass looking through each basic block and
 merging paired load store even they are not adjacent to each other.  The
 algorithm is pretty simple:
 1) In initialization pass iterating over instruction stream it collects
 relevant memory access information for each instruction.
 2) It iterates over each basic block, tries to find possible paired
 instruction for each memory access instruction.  During this work, it checks
 dependencies between the two possible instructions and also records the
 information indicating how to pair the two instructions.  To avoid quadratic
 behavior of the algorithm, It introduces new parameter
 max-merge-paired-loadstore-distance and set the default value to 4, which is
 large enough to catch major part of opportunities on ARM/cortex-a15.
 3) For each candidate pair, it calls back-end's hook to do target dependent
 check and merge the two instructions if possible.

 Though the parameter is set to 4, for miscellaneous benchmarks, this pass
 can merge numerous opportunities except ones already merged by peephole2
 (same level numbers of opportunities comparing to peepholed ones).  GCC
 bootstrap can also confirm this finding.

 Yet there is an open issue about when we should run this new pass.  Though
 register renaming is disabled by default now, I put this pass after it,
 because renaming can resolve some false dependencies thus benefit this pass.
 Another finding is, it can capture a lot more opportunities if it's after
 sched2, but I am not sure whether it will mess up with scheduling results in
 this way.

 So, any comments about this?

First off: Why does this need a target hook? We're getting more and
more of them -- too many IMHO. There should be really good reasons for
adding even more new ones.

Does this pass have to run after scheduling? The way you describe it,
this sounds like a scheduling problem, where you don't need regrename
to resolve false dependencies. Your sched2 pass should be able to
prioritize mergeable loads/stores to schedule them adjacent. Of if you
must do this before scheduling, then at least do it before sched2. Now
you're really ruining the order of the scheduled instructions, which
seems bad.

I don't see how regrename will help resolve [base+offset] false
dependencies. Can you explain? I'd expect effects from
hardreg-copyprop commoning a base register.

ChangeLog is missing the entry for arm.c.

Your pass should make those peephole2's redundant, so you should
remove the relevant define_peephole2's.


  +   generated by spilling during reigster allocation.  To catch all

s/reigster/register/


 +   whose address expression is in the form of [base_offset].  It

s/[base_offset]/[base+offset]/


 +   only guarantee that the two consecutive memory access instructions

s/guarantee/guarantees/


 +   has no data dependency issues.  After a pair is found it calls to

s/has/have/


 +/* Maximal number of memory references we support in single isntruction.  */

s/Maximal/Maximum/
s/isntruction/instruction/


 +/* Check instruction recorded in PINSN_INFO to see if it is the
 +   first instruction of load store pair.  If yes, record the
 +   information in PPAIR_INFO and return TRUE.  */
 +
 +static bool
 +find_pair_x_insn (struct pair_info_t *ppair_info,

The function description doesn't seem to match the function
implementation. There is nothing in find_pair_x_insn that looks for a
load/store pair.


 +  /* Don't bother if base is a register and we are modifying it.  */
 +  if (REG_P (base)  reg_modified_p (base, insn))
 +return false;

You can shortcut this if you're looking at a load.


 +  for (; *ref_rec; ref_rec++)

Nit: while (*ref_rec)


 +  /* Can't be paired if memory refs have different mode.  */
 +  if (GET_MODE (mem) != mode)
 +return false;

Not even if same GET_MODE_SIZE?


 +  /* Clear PAIR_SINK if the memory unit of the first instruction
 + is clobbered between.  */
 +  if ((pinfo-pair_kind  PAIR_SINK) != 0
 +   MEM_CLOBBERED_P (pinfo-mem_clob, MEM_CLOB_SELF))
 +pinfo-pair_kind = (~PAIR_SINK);

This has the smell of architecture-specific behavior. Maybe you can't
even pair instructions if there is a MEM_CLOB_SELF involved. Likewise
for some of the checks in valid_insn_pair_p following the one quoted
above.


 +  HOST_WIDE_INT offset_t;

We generally reserve _t for types, so I'd recommend a different name
for this variable.



 +find_pair_y_insn


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-15 Thread Jeff Law

On 05/15/14 10:51, Mike Stump wrote:

On May 15, 2014, at 12:26 AM, bin.cheng bin.ch...@arm.com wrote:

Here comes up with a new GCC pass looking through each basic block
and merging paired load store even they are not adjacent to each
other.


So I have a target that has load and store multiple support that
supports large a number of registers (2-n registers), and I added a
sched0 pass that is a light copy of the regular scheduling pass that
uses a different cost function which arranges all loads first, then
all stores then everything else.  Within a group of loads or stores
the secondary key is the base register, the next key is the offset.
The net result, all loads off the same register are sorted in
increasing order.

Glad to see someone else stumble on (ab)using the scheduler to do this.

I've poked at the scheduler several times to do similar stuff, but was 
never really satisfied with the results and never tried to polish those 
prototypes into something worth submitting.


One example I've poked at was discovery of stores which then feed into a 
load from the same location.  Which obviously we'd prefer to turn into a 
store + copy (subject to mess of constraints).  There's a handful of 
these kind of transformations that seem to naturally drop out of this 
kind of work.


Similarly a post-reload pass could be used to promote single word 
loads/stores to double-word operations.


If anyone cared about PA 1.1 code generation, it'd be a much cleaner way 
to support the non-fused fmpyadd fmpysub insns.


Anyway, if you want to move forward with the idea, I'd certainly support 
doing so.


jeff


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-15 Thread H.J. Lu
On Thu, May 15, 2014 at 12:26 AM, bin.cheng bin.ch...@arm.com wrote:
 Hi,
 Targets like ARM and AARCH64 support double-word load store instructions,
 and these instructions are generally faster than the corresponding two
 load/stores.  GCC currently uses peephole2 to merge paired load/store into
 one single instruction which has a disadvantage.  It can only handle simple
 cases like the two instructions actually appear sequentially in instruction
 stream, and is too weak to handle cases in which the two load/store are
 intervened by other irrelevant instructions.

 Here comes up with a new GCC pass looking through each basic block and
 merging paired load store even they are not adjacent to each other.  The
 algorithm is pretty simple:
 1) In initialization pass iterating over instruction stream it collects
 relevant memory access information for each instruction.
 2) It iterates over each basic block, tries to find possible paired
 instruction for each memory access instruction.  During this work, it checks
 dependencies between the two possible instructions and also records the
 information indicating how to pair the two instructions.  To avoid quadratic
 behavior of the algorithm, It introduces new parameter
 max-merge-paired-loadstore-distance and set the default value to 4, which is
 large enough to catch major part of opportunities on ARM/cortex-a15.
 3) For each candidate pair, it calls back-end's hook to do target dependent
 check and merge the two instructions if possible.

 Though the parameter is set to 4, for miscellaneous benchmarks, this pass
 can merge numerous opportunities except ones already merged by peephole2
 (same level numbers of opportunities comparing to peepholed ones).  GCC
 bootstrap can also confirm this finding.

 Yet there is an open issue about when we should run this new pass.  Though
 register renaming is disabled by default now, I put this pass after it,
 because renaming can resolve some false dependencies thus benefit this pass.
 Another finding is, it can capture a lot more opportunities if it's after
 sched2, but I am not sure whether it will mess up with scheduling results in
 this way.

 So, any comments about this?

 Thanks,
 bin


 2014-05-15  Bin Cheng  bin.ch...@arm.com
 * common.opt (flag_merge_paired_loadstore): New option.
 * merge-paired-loadstore.c: New file.
 * Makefile.in: Support new file.
 * config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro.
 (load_latency_expanded_p, arm_merge_paired_loadstore): New function.
 * params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param.
 * doc/invoke.texi (-fmerge-paired-loadstore): New.
 (max-merge-paired-loadstore-distance): New.
 * doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New.
 * doc/tm.texi: Regenerated.
 * target.def (merge_paired_loadstore): New.
 * tree-pass.h (make_pass_merge_paired_loadstore): New decl.
 * passes.def (pass_merge_paired_loadstore): New pass.
 * timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var.

 gcc/testsuite/ChangeLog
 2014-05-15  Bin Cheng  bin.ch...@arm.com

 * gcc.target/arm/merge-paired-loadstore.c: New test.


Here is a testcase on x86-64:

---
struct Foo
{
  Foo (double x0, double x1, double x2)
{
  data[0] = x0;
  data[1] = x1;
  data[2] = x2;
}
  double data[3];
};

const Foo f1 (0.0, 0.0, 1.0);
const Foo f2 (1.0, 0.0, 0.0);

struct Bar
{
  Bar (float x0, float x1, float x2, float x3, float x4)
{
  data[0] = x0;
  data[1] = x1;
  data[2] = x2;
  data[3] = x3;
  data[4] = x4;
}
  float data[5];
};

const Bar b1 (0.0, 0.0, 0.0, 0.0, 1.0);
const Bar b2 (1.0, 0.0, 0.0, 0.0, 0.0);
---

We generate

xorpd %xmm0, %xmm0
movsd .LC1(%rip), %xmm1
movsd %xmm0, _ZL2f1(%rip)
movsd %xmm0, _ZL2f1+8(%rip)
movsd %xmm0, _ZL2f2+8(%rip)
movsd %xmm0, _ZL2f2+16(%rip)
xorps %xmm0, %xmm0
movsd %xmm1, _ZL2f1+16(%rip)
movsd %xmm1, _ZL2f2(%rip)
movss .LC3(%rip), %xmm1
movss %xmm0, _ZL2b1(%rip)
movss %xmm0, _ZL2b1+4(%rip)
movss %xmm0, _ZL2b1+8(%rip)
movss %xmm0, _ZL2b1+12(%rip)
movss %xmm1, _ZL2b1+16(%rip)
movss %xmm1, _ZL2b2(%rip)
movss %xmm0, _ZL2b2+4(%rip)
movss %xmm0, _ZL2b2+8(%rip)
movss %xmm0, _ZL2b2+12(%rip)
movss %xmm0, _ZL2b2+16(%rip)

There are pairs of movsd and sets of 4 movss.  We should
be able to handle more than 2 load/store insns.

-- 
H.J.


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-15 Thread Mike Stump
On May 15, 2014, at 10:13 AM, Jeff Law l...@redhat.com wrote:
 I've poked at the scheduler several times to do similar stuff, but was never 
 really satisfied with the results and never tried to polish those prototypes 
 into something worth submitting.

What was lacking?  The cleanliness of the patch or the, it didn’t win me more 
than 0.01% code improvement so I never submitted it?

 One example I've poked at was discovery of stores which then feed into a load 
 from the same location.

Trivial to do with my patch, just change the sort key to arrange that to 
happen, however, if you want multiple support and this, then, you need to 
recognize store_m / load from a part of that, which would be kinda annoying.

 Anyway, if you want to move forward with the idea, I'd certainly support 
 doing so.

I’d be happy to clean up my patch as necessary and contribute it, however, 
people would have to tell me what they would like to see done to it.  In my 
mind, the arm, more, nds32, rs6000 and s390 ports would be the biggest in tree 
winners.

To be the most useful, my patch would benefit from an optimization person 
writing MI code to coalesce adjacent memory operations into either larger mode 
load and stores or into gen_load_multiple, gen_store_multiplethen, then all the 
port work (mine is around 16*n*n lines) would be zero.

Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-15 Thread Jeff Law

On 05/15/14 12:41, Mike Stump wrote:

On May 15, 2014, at 10:13 AM, Jeff Law l...@redhat.com wrote:

I've poked at the scheduler several times to do similar stuff, but
was never really satisfied with the results and never tried to
polish those prototypes into something worth submitting.


What was lacking?  The cleanliness of the patch or the, it didn’t win
me more than 0.01% code improvement so I never submitted it?

Cleanliness and applicability of my implementation.

For fmpyadd/fmpysub on the PA, my recollection was that the right way to 
go was to detect when both were in the ready queue at the same time, 
then resort the queue to have them issue consecutively.  We didn't have 
the necessary hooks to have target dependent code examine and rearrange 
the queues back when I looked at this.  By the time we had those 
capabilities, the PA8000 class processors had been released and those 
instructions were considered bad so I never came back to this.


For the memory optimizations, IIRC, the dependencies keep them from 
getting into the ready queue at the same time.  Thus it's significantly 
harder to get them to issue consecutively when you've got an issue rate  1.





Trivial to do with my patch, just change the sort key to arrange that
to happen, however, if you want multiple support and this, then, you
need to recognize store_m / load from a part of that, which would be
kinda annoying.
But if you've got an issue rate  1, then it's a lot less likely you'll 
get the store/load pairing up the way you want.


Arguably one could throttle down the issue rate for these scheduler-like 
passes which makes that problem go away.


JEff


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-15 Thread Mike Stump
On May 15, 2014, at 1:01 PM, Jeff Law l...@redhat.com wrote:
 For the memory optimizations, IIRC, the dependencies keep them from getting 
 into the ready queue at the same time.  Thus it's significantly harder to get 
 them to issue consecutively when you've got an issue rate  1.

 But if you've got an issue rate  1, then it's a lot less likely you'll get 
 the store/load pairing up the way you want.

Ah, yeah, on second thought, that won’t work if they are not ready together…

I sort the ready queue:

first ready set:
load
store
everything else

next ready set:
load store
everything else

Now, it might work, if there are no other instructions and no loads in the next 
set, and so on, but that would be an accident.  The utility is only in 
instructions that are ready together, though, tick boundaries and issue rates 
don’t matter any, as I don’t care about ticks or issue rates.  The only 
consideration on order is my ordering operator and once ordered, the target is 
free to combine as it sees fit.  So, for store load optimization, my patch is 
rather useless.