DFA Scheduler - unable to pipeline loads

2007-08-31 Thread Matt Lee
Hi,

I am working with GCC 4.1.1 on a simple five-pipeline-stage scalar
RISC processor, with the following description for loads and stores:

(define_insn_reservation "integer" 1
  (eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
  "issue,iu,wb")

(define_insn_reservation "load" 3
  (eq_attr "type" "load")
  "issue,iu,wb")

(define_insn_reservation "store" 1
  (eq_attr "type" "store")
  "issue,iu,wb")

I am seeing poor scheduling in Dhrystone where a memcpy call is
expanded inline.

memcpy (&dst, &src, 16) ==>

load  1, rA + 4
store 1, rB + 4
load  2, rA + 8
store 2, rB + 8
...

Basically, instead of pipelining the loads, the current schedule
stalls the processor for two cycles on every dependent store. Here is
a dump from the .35.sched1 file.

;;   ==========================================================
;;   -- basic block 0 from 6 to 36 -- before reload
;;   ==========================================================

;;    0-->     6  r84=r5             :issue,iu,wb
;;    1-->    13  r86=[`Ptr_Glob']   :issue,iu,wb
;;    2-->    25  r92=0x5            :issue,iu,wb
;;    3-->    12  r85=[r84]          :issue,iu,wb
;;    4-->    14  r87=[r86]          :issue,iu,wb
;;    7-->    15  [r85]=r87          :issue,iu,wb
;;    8-->    16  r88=[r86+0x4]      :issue,iu,wb
;;   11-->    17  [r85+0x4]=r88      :issue,iu,wb
;;   12-->    18  r89=[r86+0x8]      :issue,iu,wb
;;   15-->    19  [r85+0x8]=r89      :issue,iu,wb
;;   16-->    20  r90=[r86+0xc]      :issue,iu,wb
;;   19-->    21  [r85+0xc]=r90      :issue,iu,wb
;;   20-->    22  r91=[r86+0x10]     :issue,iu,wb
;;   23-->    23  [r85+0x10]=r91     :issue,iu,wb
;;   24-->    26  [r84+0xc]=r92      :issue,iu,wb
;;   25-->    31  clobber r3         :nothing
;;   25-->    36  use r3             :nothing
;;  Ready list (final):
;;   total time = 25
;;   new head = 7
;;   new tail = 36

There is an obviously better schedule to be had. Here is one such
(hand-modified) schedule, which pipelines just two of the loads to
shorten the critical path through the whole function (the function
has only bb 0):

;;    0-->     6  r84=r5             :issue,iu,wb
;;    1-->    13  r86=[`Ptr_Glob']   :issue,iu,wb
;;    2-->    25  r92=0x5            :issue,iu,wb
;;    3-->    12  r85=[r84]          :issue,iu,wb
;;    4-->    14  r87=[r86]          :issue,iu,wb
;;    7-->    15  [r85]=r87          :issue,iu,wb
;;    8-->    16  r88=[r86+0x4]      :issue,iu,wb
;;    9-->    18  r89=[r86+0x8]      :issue,iu,wb
;;   10-->    20  r90=[r86+0xc]      :issue,iu,wb
;;   11-->    17  [r85+0x4]=r88      :issue,iu,wb
;;   12-->    19  [r85+0x8]=r89      :issue,iu,wb
;;   13-->    21  [r85+0xc]=r90      :issue,iu,wb
;;   14-->    22  r91=[r86+0x10]     :issue,iu,wb
;;   17-->    23  [r85+0x10]=r91     :issue,iu,wb
;;   18-->    26  [r84+0xc]=r92      :issue,iu,mb_wb
;;   19-->    31  clobber r3         :nothing
;;   20-->    36  use r3             :nothing
;;  Ready list (final):
;;   total time = 20
;;   new head = 7
;;   new tail = 36

This schedule is 5 cycles faster.

I have read and re-read the material surrounding the DFA scheduler. I
understand that the heuristics optimize critical path length and not
stalls or other metrics. But in this case it is precisely the critical
path length that is shortened by the better schedule. I have been
examining the various hooks available, and for a while it seemed like
TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD had to be set to a
larger window to consider more candidates from the ready list. For
instance, this discussion seems to say so:
http://gcc.gnu.org/ml/gcc/2002-05/msg01132.html

But a post that follows soon after seems to imply otherwise.
http://gcc.gnu.org/ml/gcc/2002-05/msg01388.html

Both posts are from Vladimir. In any case, the final conclusion seems
to be that the lookahead is useful only for multi-issue processors.
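
(For completeness, here is roughly how one wires up that hook in a
port. This is a minimal sketch with a made-up function name and an
arbitrary window of 4; given the conclusion above, I don't claim it
helps here:)

/* Sketch only: the function name and the value 4 are assumptions.  */
static int
mytarget_sched_dfa_lookahead (void)
{
  /* Number of ready insns to consider when choosing what to issue
     on the first cycle.  */
  return 4;
}

#undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
#define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
  mytarget_sched_dfa_lookahead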

So what is wrong with my description? How do I get GCC to produce the
optimal schedule?

I appreciate any help, as I have run into similar scheduling
situations with other architectures in the past (GCC seems to choose
not to pipeline things as much as it could), and this time I would
like to understand it a bit more.

Compile flags used are basically -O3.
-- 
thanks,
Matt


Re: DFA Scheduler - unable to pipeline loads

2007-08-31 Thread Adam Nemet
"Matt Lee" <[EMAIL PROTECTED]> writes:

> I am seeing poor scheduling in Dhrystone where a memcpy call is
> expanded inline.
>
> memcpy (&dst, &src, 16) ==>
>
> load  1, rA + 4
> store 1, rB + 4
> load  2, rA + 8
> store 2, rB + 8
> ...

Are you sure that there are no dependencies due to aliasing here?  The
only similar thing that Dhrystone has to what you quote is between a
pointer and a global variable, and in fact there is an aliasing
conflict there.

If that is the case, you can define a movmem pattern where you first
load everything in one chunk and then store it later.  See MIPS's
movmemsi pattern and the function mips_block_move_straight.
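
Roughly along these lines -- a sketch only, with my_expand_block_move
a made-up helper and the usual operand layout assumed (operand 2 is
the length, operand 3 the alignment):

(define_expand "movmemsi"
  [(parallel [(set (match_operand:BLK 0 "general_operand" "")
                   (match_operand:BLK 1 "general_operand" ""))
              (use (match_operand:SI 2 "const_int_operand" ""))
              (use (match_operand:SI 3 "const_int_operand" ""))])]
  ""
{
  /* Emit all the loads before all the stores, as in
     mips_block_move_straight; FAIL falls back to a libcall.  */
  if (my_expand_block_move (operands[0], operands[1], operands[2]))
    DONE;
  FAIL;
})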

Adam


RE: DFA Scheduler - unable to pipeline loads

2007-09-03 Thread Ye, Joey
Matt,

I just started working on pipeline descriptions and I'm confused about one
thing in yours.

For "integer", your CPU has a 1-cycle latency but three unit stages,
"issue,iu,wb". What does that mean? My understanding is that the number of
units separated by "," should equal the latency. Am I right?

Thanks - Joey


Re: DFA Scheduler - unable to pipeline loads

2007-09-03 Thread Maxim Kuvyrkov

Matt Lee wrote:
> I am seeing poor scheduling in Dhrystone where a memcpy call is
> expanded inline.
> [...]


I agree with Adam that this is most probably an aliasing issue.  Take a
look at the dependency map (you can get it with the -fsched-verbose=6
flag on your command line).



--
Maxim


Re: DFA Scheduler - unable to pipeline loads

2007-09-04 Thread Matt Lee
Joey,

My understanding was that the latency is used to figure out data
delays, that the reservation is used to figure out resource (unit)
conflicts, and that the two are applied independently. The post below
seems to confirm this:

http://gcc.gnu.org/ml/gcc/2006-08/msg00497.html


I have latency 1 because there are forwarding paths within the
pipeline that make integer results available sooner.
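
(As an aside: as I understand it, forwarding can also be modeled
explicitly with define_bypass instead of folding it into the insn
latency. For instance, if stores could pick up a load result one cycle
early through a forwarding path, a purely illustrative sketch would be:)

;; Hypothetical, not part of my description: dependent stores see the
;; load result after 2 cycles instead of the nominal 3.
(define_bypass 2 "load" "store")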

thanks,
Matt

On 9/3/07, Ye, Joey <[EMAIL PROTECTED]> wrote:
> Matt,
>
> I just started working on pipeline descriptions and I'm confused about one
> thing in yours.
>
> For "integer", your CPU has a 1-cycle latency but three unit stages,
> "issue,iu,wb". What does that mean? My understanding is that the number of
> units separated by "," should equal the latency. Am I right?
>
> Thanks - Joey

Re: DFA Scheduler - unable to pipeline loads

2007-09-04 Thread Matt Lee
On 8/31/07, Adam Nemet <[EMAIL PROTECTED]> wrote:
> "Matt Lee" <[EMAIL PROTECTED]> writes:
>
> > I am seeing poor scheduling in Dhrystone where a memcpy call is
> > expanded inline.
> >
> > memcpy (&dst, &src, 16) ==>
> >
> > load  1, rA + 4
> > store 1, rB + 4
> > load  2, rA + 8
> > store 2, rB + 8
> > ...
>
> Are you sure that there are no dependencies due to aliasing here?  The
> only similar thing that Dhrystone has to what you quote is between a
> pointer and a global variable, and in fact there is an aliasing
> conflict there.
>
> If that is the case, you can define a movmem pattern where you first
> load everything in one chunk and then store it later.  See MIPS's
> movmemsi pattern and the function mips_block_move_straight.
>
> Adam
>


Adam, you are right. There is an aliasing conflict in my test case.
However, I get the same effect when I use the restrict keyword on the
pointers. Here is an even more reduced test case that shows the same
problem.

#include <string.h>

struct foo {
  int a[4];
};

void func (struct foo * restrict p, struct foo * restrict q)
{
  memcpy (p, q, sizeof (struct foo));
}

Perhaps restrict doesn't work. In any case, I am trying to optimize
the case where there is clearly no aliasing. Your suggestion regarding
movmemsi is interesting. I have not used this pattern before and
assumed that it was required only when something special must be done
to do block moves. On my architecture, a block move is not special and
is equivalent to a series of loads and stores. Why do I need this
pattern, and why/how does the aliasing conflict become a non-issue
when defining it? I apologize if I am missing something basic here,
but the GCC documentation on this pattern doesn't tell me much about
why it is required.

-- 
thanks,
Matt


Re: DFA Scheduler - unable to pipeline loads

2007-09-04 Thread Adam Nemet
Matt Lee writes:
> In any case, I am trying to optimize the case where there is clearly no
> aliasing. Your suggestion regarding movmemsi is interesting. I have not used
> this pattern before and assumed that it was required only when something
> special must be done to do block moves. On my architecture, a block move is
> not special and is equivalent to a series of loads and stores. Why do I need
> this pattern, and why/how does the aliasing conflict become a non-issue when
> defining it?

When expanding this pattern, you can take advantage of the fact that the
source and destination cannot overlap.  Thus you can expand the loads and
stores more freely (see mips_block_move_straight).  The type-based aliasing
code does not retain this information, so as soon as you have finished
expanding RTL you have practically frozen the order of these instructions.
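
In code, the idea looks roughly like this (a compressed sketch loosely
following mips_block_move_straight; the function name, the fixed-size
array, and word-sized, word-aligned operands are simplifying
assumptions):

static void
my_block_move_straight (rtx dest, rtx src, unsigned HOST_WIDE_INT length)
{
  unsigned int i, count = length / UNITS_PER_WORD;
  rtx regs[8];  /* Assumes the unrolled copy is at most 8 words.  */

  /* Emit all the loads into fresh pseudos first...  */
  for (i = 0; i < count; i++)
    {
      regs[i] = gen_reg_rtx (word_mode);
      emit_move_insn (regs[i],
                      adjust_address (src, word_mode, i * UNITS_PER_WORD));
    }

  /* ...and only then the stores.  movmemsi guarantees that source and
     destination do not overlap, so this order is safe even though the
     alias information no longer proves it.  */
  for (i = 0; i < count; i++)
    emit_move_insn (adjust_address (dest, word_mode, i * UNITS_PER_WORD),
                    regs[i]);
}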

I agree with you that you should not be required to provide this pattern,
but I don't think there is a better way currently.  We would need to subset
the original alias set, similarly to restrict, but since memcpy switches the
alias set of its operands to zero, that approach would not work.

Adam