Re: DFA Scheduler - unable to pipeline loads
Matt Lee writes:

> In any case, I am trying to optimize the case where there is clearly no
> aliasing. Your suggestion regarding movmemsi is interesting. I have not
> used this pattern before and assumed that it was required only when
> something special must be done to do block moves. In my architecture, a
> block move is not special and is equivalent to a series of loads and
> stores. Why do I need this pattern, and why/how does the aliasing
> conflict become a non-issue when defining this pattern?

When expanding this pattern you can take advantage of the fact that
source and destination cannot overlap. Thus, you can expand the loads
and stores more freely (see mips_block_move_straight). The type-based
aliasing code does not retain this information, so as soon as you are
finished expanding RTL you have practically frozen the order of these
instructions.

I agree with you that you should not be required to provide this
pattern, but I don't think there is a better way currently. We would
need to subset the original alias set, similarly to restrict, but since
memcpy switches the alias set of its operands to zero this approach
would not work.

Adam
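For illustration, here is a minimal sketch of the kind of expander Adam
describes, loosely modeled on MIPS's movmemsi pattern and
mips_block_move_straight. It is not code from any real port: the 8-word
cutoff, the SImode word size, and the predicates are assumptions, and a
real port would also handle misaligned and larger copies.

(define_expand "movmemsi"
  [(parallel [(set (match_operand:BLK 0 "general_operand")
                   (match_operand:BLK 1 "general_operand"))
              (use (match_operand:SI 2 ""))
              (use (match_operand:SI 3 "const_int_operand"))])]
  ""
{
  /* Handle only small, constant-length, word-aligned copies here;
     FAIL makes the middle end fall back to a library call.  */
  if (GET_CODE (operands[2]) == CONST_INT
      && INTVAL (operands[2]) % 4 == 0
      && INTVAL (operands[2]) / 4 <= 8
      && INTVAL (operands[3]) >= 4)
    {
      int words = INTVAL (operands[2]) / 4;
      rtx regs[8];
      int i;

      /* Emit all the loads first ...  */
      for (i = 0; i < words; i++)
        {
          regs[i] = gen_reg_rtx (SImode);
          emit_move_insn (regs[i],
                          adjust_address (operands[1], SImode, i * 4));
        }

      /* ... and only then the stores.  movmemsi guarantees that the
         source and destination blocks do not overlap, so this order is
         always safe, and the scheduler is now free to pipeline the
         loads.  */
      for (i = 0; i < words; i++)
        emit_move_insn (adjust_address (operands[0], SImode, i * 4),
                        regs[i]);
      DONE;
    }
  FAIL;
})

Because the loads are emitted before any store, no alias analysis is
needed later to justify the order: it is already frozen in the shape the
scheduler wants.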
Re: DFA Scheduler - unable to pipeline loads
On 8/31/07, Adam Nemet <[EMAIL PROTECTED]> wrote:
> "Matt Lee" <[EMAIL PROTECTED]> writes:
>
> > I am seeing poor scheduling in Dhrystone where a memcpy call is
> > expanded inline.
> >
> > memcpy (&dst, &src, 16) ==>
> >
> > load 1, rA + 4
> > store 1, rB + 4
> > load 2, rA + 8
> > store 2, rB + 8
> > ...
>
> Are you sure that there are no dependencies due to aliasing here? The
> only similar thing that Dhrystone has to what you quote is between a
> pointer and a global variable, and in fact there is an aliasing
> conflict there.
>
> If that is the case you can define a movmem pattern where you first
> load everything in one chunk and then store it later. See MIPS's
> movmemsi pattern and the function mips_block_move_straight.
>
> Adam

Adam, you are right. There is an aliasing conflict in my test case.
However, I get the same effect when I use the restrict keyword on the
pointers. Here is an even more reduced test case that shows the same
problem.

#include <string.h>

struct foo {
  int a[4];
};

void func (struct foo * restrict p, struct foo * restrict q)
{
  memcpy (p, q, sizeof (struct foo));
}

Perhaps restrict doesn't work. In any case, I am trying to optimize the
case where there is clearly no aliasing. Your suggestion regarding
movmemsi is interesting. I have not used this pattern before and assumed
that it was required only when something special must be done to do
block moves. In my architecture, a block move is not special and is
equivalent to a series of loads and stores. Why do I need this pattern,
and why/how does the aliasing conflict become a non-issue when defining
this pattern? I apologize if I am missing something basic here, but the
GCC documentation regarding this pattern doesn't tell me much about why
it is required.

--
thanks,
Matt
Re: DFA Scheduler - unable to pipeline loads
Joey,

My understanding was that the latency is used to figure out data delays
and the reservation to figure out reservation delays, and that these two
are used independently. The post below seems to confirm this:

http://gcc.gnu.org/ml/gcc/2006-08/msg00497.html

I have latency 1 because there are forwarding paths within the pipeline
that make integer results available sooner.

thanks,
Matt

On 9/3/07, Ye, Joey <[EMAIL PROTECTED]> wrote:
> Matt,
>
> I just started working on pipeline description and I'm confused about
> one thing in your description.
>
> For "integer", your cpu has a 1-cycle latency, but with 3 unit stages
> "issue,iu,wb". What does that mean? My understanding is that the number
> of units separated by "," should be equal to the latency. Am I right?
>
> Thanks - Joey
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Matt Lee
> Sent: 2007年9月1日 5:58
> To: gcc@gcc.gnu.org
> Subject: DFA Scheduler - unable to pipeline loads
>
> Hi,
>
> I am working with GCC-4.1.1 on a simple 5-pipe-stage scalar RISC
> processor with the following description for loads and stores:
>
> (define_insn_reservation "integer" 1
>   (eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
>   "issue,iu,wb")
>
> (define_insn_reservation "load" 3
>   (eq_attr "type" "load")
>   "issue,iu,wb")
>
> (define_insn_reservation "store" 1
>   (eq_attr "type" "store")
>   "issue,iu,wb")
>
> I am seeing poor scheduling in Dhrystone where a memcpy call is
> expanded inline.
>
> memcpy (&dst, &src, 16) ==>
>
> load 1, rA + 4
> store 1, rB + 4
> load 2, rA + 8
> store 2, rB + 8
> ...
>
> Basically, instead of pipelining the loads, the current schedule
> stalls the processor for two cycles on every dependent store. Here is
> a dump from the .35.sched1 file.
>
> ;; ======================================================
> ;; -- basic block 0 from 6 to 36 -- before reload
> ;; ======================================================
>
> ;;    0-->  6  r84=r5            :issue,iu,wb
> ;;    1--> 13  r86=[`Ptr_Glob']  :issue,iu,wb
> ;;    2--> 25  r92=0x5           :issue,iu,wb
> ;;    3--> 12  r85=[r84]         :issue,iu,wb
> ;;    4--> 14  r87=[r86]         :issue,iu,wb
> ;;    7--> 15  [r85]=r87         :issue,iu,wb
> ;;    8--> 16  r88=[r86+0x4]     :issue,iu,wb
> ;;   11--> 17  [r85+0x4]=r88     :issue,iu,wb
> ;;   12--> 18  r89=[r86+0x8]     :issue,iu,wb
> ;;   15--> 19  [r85+0x8]=r89     :issue,iu,wb
> ;;   16--> 20  r90=[r86+0xc]     :issue,iu,wb
> ;;   19--> 21  [r85+0xc]=r90     :issue,iu,wb
> ;;   20--> 22  r91=[r86+0x10]    :issue,iu,wb
> ;;   23--> 23  [r85+0x10]=r91    :issue,iu,wb
> ;;   24--> 26  [r84+0xc]=r92     :issue,iu,wb
> ;;   25--> 31  clobber r3        :nothing
> ;;   25--> 36  use r3            :nothing
> ;; Ready list (final):
> ;; total time = 25
> ;; new head = 7
> ;; new tail = 36
>
> There is an obvious better schedule to be obtained. Here is one such
> (hand-modified) schedule, which just pipelines two of the loads to
> obtain a shorter critical path length for the whole function (the
> function has only bb 0):
>
> ;;    0-->  6  r84=r5            :issue,iu,wb
> ;;    1--> 13  r86=[`Ptr_Glob']  :issue,iu,wb
> ;;    2--> 25  r92=0x5           :issue,iu,wb
> ;;    3--> 12  r85=[r84]         :issue,iu,wb
> ;;    4--> 14  r87=[r86]         :issue,iu,wb
> ;;    7--> 15  [r85]=r87         :issue,iu,wb
> ;;    8--> 16  r88=[r86+0x4]     :issue,iu,wb
> ;;    9--> 18  r89=[r86+0x8]     :issue,iu,wb
> ;;   10--> 20  r90=[r86+0xc]     :issue,iu,wb
> ;;   11--> 17  [r85+0x4]=r88     :issue,iu,wb
> ;;   12--> 19  [r85+0x8]=r89     :issue,iu,wb
> ;;   13--> 21  [r85+0xc]=r90     :issue,iu,wb
> ;;   14--> 22  r91=[r86+0x10]    :issue,iu,wb
> ;;   17--> 23  [r85+0x10]=r91    :issue,iu,wb
> ;;   18--> 26  [r84+0xc]=r92     :issue,iu,mb_wb
> ;;   19--> 31  clobber r3        :nothing
> ;;   20--> 36  use r3            :nothing
> ;; Ready list (final):
> ;; total time = 20
> ;; new head = 7
> ;; new tail = 36
>
> This schedule is 5 cycles faster.
>
> I have read and re-read the material surrounding the DFA scheduler. I
> understand that the heuristics optimize critical path length and not
> stalls or other metrics. But in this case it is precisely the critical
> path length that is shortened by the better schedule.
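To make the latency/reservation distinction Matt describes concrete,
here is a hypothetical reservation, not taken from his port (the "imul"
type and the "mul" unit are invented for the example):

;; Hypothetical example.  The leading "3" is the data latency: a
;; dependent insn may read the result 3 cycles after issue.  The unit
;; string only says which functional units are busy on which cycle:
;; issue on cycle 0, a non-pipelined "mul" unit on cycles 1-2, wb on
;; cycle 3.  The number of reserved unit-cycles and the latency are
;; independent of each other.
(define_insn_reservation "imul" 3
  (eq_attr "type" "imul")
  "issue,mul*2,wb")

This is why "integer" can legitimately have latency 1: forwarding makes
the result available to a dependent insn on the next cycle, even though
the insn itself still flows through issue, iu, and wb on three
successive cycles.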
Re: DFA Scheduler - unable to pipeline loads
Matt Lee wrote:
> Hi,
>
> I am working with GCC-4.1.1 on a simple 5-pipe-stage scalar RISC
> processor with the following description for loads and stores:
>
> (define_insn_reservation "integer" 1
>   (eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
>   "issue,iu,wb")
>
> (define_insn_reservation "load" 3
>   (eq_attr "type" "load")
>   "issue,iu,wb")
>
> (define_insn_reservation "store" 1
>   (eq_attr "type" "store")
>   "issue,iu,wb")
>
> I am seeing poor scheduling in Dhrystone where a memcpy call is
> expanded inline.
>
> memcpy (&dst, &src, 16) ==>
>
> load 1, rA + 4
> store 1, rB + 4
> load 2, rA + 8
> store 2, rB + 8
> ...

I agree with Adam that this is most probably an aliasing issue. Take a
look at the dependency map (you can get it with the -fsched-verbose=6
flag on your command line).

--
Maxim
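For reference, a plausible invocation for Maxim's suggestion (the exact
flag combination is an assumption, not quoted from him) would be:

  gcc -O2 -S -fsched-verbose=6 dhry.c

At this verbosity the scheduler prints the dependency lists it computed
for each insn, so dependencies introduced by aliasing can be told apart
from genuine scheduling mistakes.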
RE: DFA Scheduler - unable to pipeline loads
Matt,

I just started working on pipeline description and I'm confused about
one thing in your description.

For "integer", your cpu has a 1-cycle latency, but with 3 unit stages
"issue,iu,wb". What does that mean? My understanding is that the number
of units separated by "," should be equal to the latency. Am I right?

Thanks - Joey

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Matt Lee
Sent: 2007年9月1日 5:58
To: gcc@gcc.gnu.org
Subject: DFA Scheduler - unable to pipeline loads

Hi,

I am working with GCC-4.1.1 on a simple 5-pipe-stage scalar RISC
processor with the following description for loads and stores:

(define_insn_reservation "integer" 1
  (eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
  "issue,iu,wb")

(define_insn_reservation "load" 3
  (eq_attr "type" "load")
  "issue,iu,wb")

(define_insn_reservation "store" 1
  (eq_attr "type" "store")
  "issue,iu,wb")

I am seeing poor scheduling in Dhrystone where a memcpy call is
expanded inline.

memcpy (&dst, &src, 16) ==>

load 1, rA + 4
store 1, rB + 4
load 2, rA + 8
store 2, rB + 8
...

Basically, instead of pipelining the loads, the current schedule
stalls the processor for two cycles on every dependent store. Here is
a dump from the .35.sched1 file.

;; ======================================================
;; -- basic block 0 from 6 to 36 -- before reload
;; ======================================================

;;    0-->  6  r84=r5            :issue,iu,wb
;;    1--> 13  r86=[`Ptr_Glob']  :issue,iu,wb
;;    2--> 25  r92=0x5           :issue,iu,wb
;;    3--> 12  r85=[r84]         :issue,iu,wb
;;    4--> 14  r87=[r86]         :issue,iu,wb
;;    7--> 15  [r85]=r87         :issue,iu,wb
;;    8--> 16  r88=[r86+0x4]     :issue,iu,wb
;;   11--> 17  [r85+0x4]=r88     :issue,iu,wb
;;   12--> 18  r89=[r86+0x8]     :issue,iu,wb
;;   15--> 19  [r85+0x8]=r89     :issue,iu,wb
;;   16--> 20  r90=[r86+0xc]     :issue,iu,wb
;;   19--> 21  [r85+0xc]=r90     :issue,iu,wb
;;   20--> 22  r91=[r86+0x10]    :issue,iu,wb
;;   23--> 23  [r85+0x10]=r91    :issue,iu,wb
;;   24--> 26  [r84+0xc]=r92     :issue,iu,wb
;;   25--> 31  clobber r3        :nothing
;;   25--> 36  use r3            :nothing
;; Ready list (final):
;; total time = 25
;; new head = 7
;; new tail = 36

There is an obvious better schedule to be obtained. Here is one such
(hand-modified) schedule, which just pipelines two of the loads to
obtain a shorter critical path length for the whole function (the
function has only bb 0):

;;    0-->  6  r84=r5            :issue,iu,wb
;;    1--> 13  r86=[`Ptr_Glob']  :issue,iu,wb
;;    2--> 25  r92=0x5           :issue,iu,wb
;;    3--> 12  r85=[r84]         :issue,iu,wb
;;    4--> 14  r87=[r86]         :issue,iu,wb
;;    7--> 15  [r85]=r87         :issue,iu,wb
;;    8--> 16  r88=[r86+0x4]     :issue,iu,wb
;;    9--> 18  r89=[r86+0x8]     :issue,iu,wb
;;   10--> 20  r90=[r86+0xc]     :issue,iu,wb
;;   11--> 17  [r85+0x4]=r88     :issue,iu,wb
;;   12--> 19  [r85+0x8]=r89     :issue,iu,wb
;;   13--> 21  [r85+0xc]=r90     :issue,iu,wb
;;   14--> 22  r91=[r86+0x10]    :issue,iu,wb
;;   17--> 23  [r85+0x10]=r91    :issue,iu,wb
;;   18--> 26  [r84+0xc]=r92     :issue,iu,mb_wb
;;   19--> 31  clobber r3        :nothing
;;   20--> 36  use r3            :nothing
;; Ready list (final):
;; total time = 20
;; new head = 7
;; new tail = 36

This schedule is 5 cycles faster.

I have read and re-read the material surrounding the DFA scheduler. I
understand that the heuristics optimize critical path length and not
stalls or other metrics. But in this case it is precisely the critical
path length that is shortened by the better schedule.

I have been examining various hooks available, and for a while it seemed
like TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD must be set to a
larger window to look for better candidates to schedule into the ready
queue. For instance, this discussion seems to say so:

http://gcc.gnu.org/ml/gcc/2002-05/msg01132.html

But a post that follows soon after seems to imply otherwise:

http://gcc.gnu.org/ml/gcc/2002-05/msg01388.html

Both posts are from Vladimir. In any case the final conclusion seems to
be that the lookahead is useful only for multi-
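For reference, the hook Matt mentions is a target macro backed by a
small function in the port's .c file. A minimal sketch follows; the
function name and the window size of 4 are arbitrary assumptions, not
values from any port:

/* Sketch: tell the first-cycle multipass scheduling how many insns on
   the ready list to consider each cycle.  The value 4 is hypothetical.  */
static int
my_sched_lookahead (void)
{
  return 4;
}

#undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
#define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD my_sched_lookahead

Whether this helps here depends on the point debated in the two linked
posts: the lookahead only changes which ready insn is picked, not the
dependencies that keep the loads from becoming ready in the first place.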
Re: DFA Scheduler - unable to pipeline loads
"Matt Lee" <[EMAIL PROTECTED]> writes: > I am seeing poor scheduling in Dhrystone where a memcpy call is > expanded inline. > > memcpy (&dst, &src, 16) ==> > > load 1, rA + 4 > store 1, rB + 4 > load 2, rA + 8 > store 2, rB + 8 > ... Are you sure that there are no dependencies due to aliasing here. The only similar thing that Dhrystone has to what you quote is between a pointer and a global variable and in fact there is an aliasing conflict there. If that is the case you can define a movmem pattern where you first load everthing in one chunk and then store it later. See MIPS's movmemsi pattern and the function mips_block_move_straight. Adam