Joey,

My understanding was that the latency is used to figure out data
delays and the reservation to figure out reservation delays and that
these two are used independently. The post below seems to confirm
this,

http://gcc.gnu.org/ml/gcc/2006-08/msg00497.html


I have latency 1 because there are forwarding paths within the
pipeline that make integer results available sooner.

thanks,
Matt

On 9/3/07, Ye, Joey <[EMAIL PROTECTED]> wrote:
> Matt,
>
> I just started working on pipeline description and I'm confused one thing in 
> your description.
>
> For "integer", your cpu have a 1-cycle latency, but with 3 units stages 
> "issue,iu,wb". What does that mean? My understanding is that the number of 
> units seperated by "," should be equal to latency. Am I right?
>
> Thanks - Joey
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Matt Lee
> Sent: 2007年9月1日 5:58
> To: gcc@gcc.gnu.org
> Subject: DFA Scheduler - unable to pipeline loads
>
> Hi,
>
> I am working with GCC-4.1.1 on a simple 5-pipe stage simple scalar
> RISC processors with the following description for loads and stores,
>
> (define_insn_reservation "integer" 1
>   (eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
>   "issue,iu,wb")
>
> (define_insn_reservation "load" 3
>   (eq_attr "type" "load")
>   "issue,iu,wb")
>
> (define_insn_reservation "store" 1
>   (eq_attr "type" "store")
>   "issue,iu,wb")
>
> I am seeing poor scheduling in Dhrystone where a memcpy call is
> expanded inline.
>
> memcpy (&dst, &src, 16) ==>
>
> load  1, rA + 4
> store 1, rB + 4
> load  2, rA + 8
> store 2, rB + 8
> ...
>
> Basically, instead of pipelining the loads, the current schedule
> stalls the processor for two cycles on every dependent store. Here is
> a dump from the .35.sched1 file.
>
> ;;   ======================================================
> ;;   -- basic block 0 from 6 to 36 -- before reload
> ;;   ======================================================
>
> ;;        0--> 6    r84=r5                             :issue,iu,wb
> ;;        1--> 13   r86=[`Ptr_Glob']                   :issue,iu,wb
> ;;        2--> 25   r92=0x5                            :issue,iu,wb
> ;;        3--> 12   r85=[r84]                          :issue,iu,wb
> ;;        4--> 14   r87=[r86]                          :issue,iu,wb
> ;;        7--> 15   [r85]=r87                          :issue,iu,wb
> ;;        8--> 16   r88=[r86+0x4]                      :issue,iu,wb
> ;;       11--> 17   [r85+0x4]=r88                      :issue,iu,wb
> ;;       12--> 18   r89=[r86+0x8]                      :issue,iu,wb
> ;;       15--> 19   [r85+0x8]=r89                      :issue,iu,wb
> ;;       16--> 20   r90=[r86+0xc]                      :issue,iu,wb
> ;;       19--> 21   [r85+0xc]=r90                      :issue,iu,wb
> ;;       20--> 22   r91=[r86+0x10]                     :issue,iu,wb
> ;;       23--> 23   [r85+0x10]=r91                     :issue,iu,wb
> ;;       24--> 26   [r84+0xc]=r92                      :issue,iu,wb
> ;;       25--> 31   clobber r3                         :nothing
> ;;       25--> 36   use r3                             :nothing
> ;;      Ready list (final):
> ;;   total time = 25
> ;;   new head = 7
> ;;   new tail = 36
>
> There is an obvious better schedule to be obtained. Here is one such
> (hand-modified) schedule which just pipelines two of the loads to
> obtain a shorter critical path length to the whole function (function
> has only bb 0)
>
> ;;        0--> 6    r84=r5                             :issue,iu,wb
> ;;        1--> 13   r86=[`Ptr_Glob']                   :issue,iu,wb
> ;;        2--> 25   r92=0x5                            :issue,iu,wb
> ;;        3--> 12   r85=[r84]                          :issue,iu,wb
> ;;        4--> 14   r87=[r86]                          :issue,iu,wb
> ;;        7--> 15   [r85]=r87                          :issue,iu,wb
> ;;        8--> 16   r88=[r86+0x4]                      :issue,iu,wb
> ;;        9--> 18   r89=[r86+0x8]                      :issue,iu,wb
> ;;       10--> 20   r90=[r86+0xc]                      :issue,iu,wb
> ;;       11--> 17   [r85+0x4]=r88                      :issue,iu,wb
> ;;       12--> 19   [r85+0x8]=r89                      :issue,iu,wb
> ;;       13--> 21   [r85+0xc]=r90                      :issue,iu,wb
> ;;       14--> 22   r91=[r86+0x10]                     :issue,iu,wb
> ;;       17--> 23   [r85+0x10]=r91                     :issue,iu,wb
> ;;       18--> 26   [r84+0xc]=r92                      :issue,iu,mb_wb
> ;;       19--> 31   clobber r3                         :nothing
> ;;       20--> 36   use r3                             :nothing
> ;;      Ready list (final):
> ;;   total time = 20
> ;;   new head = 7
> ;;   new tail = 36
>
> This schedule is 5 cycles faster.
>
> I have read and re-read the material surrounding the DFA scheduler. I
> understand that the heuristics optimize critical path length and not
> stalls or other metrics. But in this case it is precisely the critical
> path length that is shortened by the better schedule. I have been
> examining various hooks available and for a while it seemed like
> TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD must be set to a
> larger window to look for better candidates to schedule into the ready
> queue. For instance, this discussion seems to say so.
> http://gcc.gnu.org/ml/gcc/2002-05/msg01132.html
>
> But a post that follows soon after seems to imply otherwise.
> http://gcc.gnu.org/ml/gcc/2002-05/msg01388.html
>
> Both posts are from Vladimir. In any case the final conclusion seems
> to be that the lookahead is useful only for multi-issue processors.
>
> So what is wrong in my description? How do I get GCC to produce the
> optimal schedule?
>
> I appreciate any help as I have run into a similiar scheduling
> situation with other architectures in the past (GCC seems to choose
> not to pipeline things as much as it can) and this time I would like
> to understand it a bit more.
>
> Compile flags used are basically, -O3.
> --
> thanks,
> Matt
>


-- 
thanks,
Matt

Reply via email to