Re: Register allocation issues

2007-09-06 Thread Matt Lee
On 9/6/07, David Edelsohn [EMAIL PROTECTED] wrote:
  Matt Lee writes:

 Matt The problem is that, although the loads could be pipelined, the
 Matt register allocator has created a dependency by using only r3 and
 Matt r4 instead of the other volatiles.

 GCC's register allocator currently is designed to minimize the
 number of hard registers.  As Ian mentioned, -frename-registers tries to
 perform register renaming with available registers after register
 allocation.  As they say, Your Mileage May Vary.


There is no point trying to minimize usage of volatile hard registers,
is there? They are precisely there to be used up as much as needed.
The function is a leaf procedure as well, so there are no other
considerations. Lastly, architectures like PPC do make use of more
registers (without -frename-registers), so there must be something in
the PPC back-end that allows the liberal use, or something in mine that
prevents it.

-- 
thanks,
Matt


Re: Register allocation issues

2007-09-06 Thread Matt Lee
On 9/6/07, Dave Korn [EMAIL PROTECTED] wrote:
 On 05 September 2007 23:47, Matt Lee wrote:

  Registers r3 to r12 are volatiles. However, for the C code below,
 
  struct foo {
  int a[4];
  } ;
 
  struct foo p, q;
 
  void func ()
  {
  memcpy (&p, &q, sizeof (struct foo));
  }
 
  I am getting an instruction sequence for func() such as,
 
  load r3, q + 0
  load r4, q + 4
  store r3, p + 0
  store r4, p + 4
  load r3, q + 4
  load r4, q + 8
  store r3, p + 4
  store r4, p + 8
 
  The problem is that, although the loads could be pipelined, the
  register allocator has created a dependency by using only r3 and r4
  instead of the other volatiles.

   Does your backend define a movdi pattern?



Yes it does. But in this case, these are word-by-word moves from
memory to memory and make use of only movsi.
-- 
thanks,
Matt


Re: Register allocation issues

2007-09-06 Thread Matt Lee
On 9/6/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
  Matt Lee writes:
 Date: Thu, 06 Sep 2007 15:02:52 -0400
 From: David Edelsohn [EMAIL PROTECTED]

 Matt There is no point trying to minimize usage of volatile hard registers,
 Matt is there? They are precisely there to be used up as much as needed.
 Matt The function is a leaf procedure as well, so there are no other
 Matt considerations. Lastly, architectures like PPC do make use of more
 Matt registers (without -frename-registers), so there must be something in
 Matt the PPC back-end that allows the liberal use, or something in mine
 Matt that prevents it.

 GCC's RA is mostly tuned for IA-32, which has very few registers.

 The rs6000 port defines the movmemsi pattern calling
 expand_block_move() which generates many intermediate pseudos.


So does mine. The generated RTL looks right, with a unique pseudo for
each load/store pair. But as the .lreg dump in my original email shows,
the problem is that these pseudos get bound to only 2 of the 10
candidate volatile registers.

-- 
thanks,
Matt


Re: Register allocation issues

2007-09-06 Thread Matt Lee
On 9/6/07, Segher Boessenkool [EMAIL PROTECTED] wrote:
  load r3, q + 0
  load r4, q + 4
  store r3, p + 0
  store r4, p + 4
  load r3, q + 4
  load r4, q + 8
  store r3, p + 4
  store r4, p + 8

 These last four lines should be

 load r3, q + 8
 load r4, q + 12
 store r3, p + 8
 store r4, p + 12


 Did you just typo it or do you have a bigger problem?  The
 problems might even be connected, who knows :-)


Sorry, that was a typo.


Register allocation issues

2007-09-05 Thread Matt Lee
Hello,

On my simple RISC architecture I am seeing suboptimal instruction
scheduling with GCC-4.1.1 caused by the way registers are getting
allocated. I am looking for suggestions on what could be wrong in my
description to cause the poor allocation.

More details --

Registers r3 to r12 are volatiles. However, for the C code below,

struct foo {
int a[4];
} ;

struct foo p, q;

void func ()
{
memcpy (&p, &q, sizeof (struct foo));
}

I am getting an instruction sequence for func() such as,

load r3, q + 0
load r4, q + 4
store r3, p + 0
store r4, p + 4
load r3, q + 4
load r4, q + 8
store r3, p + 4
store r4, p + 8

The problem is that, although the loads could be pipelined, the
register allocator has created a dependency by using only r3 and r4
instead of the other volatiles.

Strangely, modifying func to the following helps,

void func (struct foo *p, struct foo *q)
{
memcpy (p, q, sizeof (struct foo));
}

Now the register allocator uses more volatile registers and there are
no false dependencies.

I have attached the header information from the .lreg file for the
poor register allocation case. Pseudos 87 and 88 get assigned again to
r3, r4 instead of new volatiles.

Is this a problem with some poor cost model in my backend? How do I
get more information  about why the register allocator didn't pick the
other volatiles?

-- 
thanks,
Matt


func.lreg
Description: Binary data


Re: DFA Scheduler - unable to pipeline loads

2007-09-04 Thread Matt Lee
Joey,

My understanding was that the latency is used to figure out data
delays, the reservation is used to figure out resource conflicts, and
that the two are used independently. The post below seems to confirm
this:

http://gcc.gnu.org/ml/gcc/2006-08/msg00497.html


I have latency 1 because there are forwarding paths within the
pipeline that make integer results available sooner.

thanks,
Matt

On 9/3/07, Ye, Joey [EMAIL PROTECTED] wrote:
 Matt,

 I just started working on pipeline description and I'm confused about one
 thing in your description.

 For integer, your CPU has a 1-cycle latency but a reservation with 3 unit
 stages, "issue,iu,wb". What does that mean? My understanding is that the
 number of units separated by "," should be equal to the latency. Am I right?

 Thanks - Joey

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Matt Lee
 Sent: September 1, 2007 5:58
 To: gcc@gcc.gnu.org
 Subject: DFA Scheduler - unable to pipeline loads

 Hi,

 I am working with GCC-4.1.1 on a simple 5-pipeline-stage scalar RISC
 processor with the following description for loads and stores,

 (define_insn_reservation "integer" 1
   (eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
   "issue,iu,wb")

 (define_insn_reservation "load" 3
   (eq_attr "type" "load")
   "issue,iu,wb")

 (define_insn_reservation "store" 1
   (eq_attr "type" "store")
   "issue,iu,wb")

 I am seeing poor scheduling in Dhrystone where a memcpy call is
 expanded inline.

 memcpy (dst, src, 16) ==>

 load  1, rA + 4
 store 1, rB + 4
 load  2, rA + 8
 store 2, rB + 8
 ...

 Basically, instead of pipelining the loads, the current schedule
 stalls the processor for two cycles on every dependent store. Here is
 a dump from the .35.sched1 file.

 ;;   ======================================================
 ;;   -- basic block 0 from 6 to 36 -- before reload
 ;;   ======================================================

 ;;    0 --> 6    r84=r5             :issue,iu,wb
 ;;    1 --> 13   r86=[`Ptr_Glob']   :issue,iu,wb
 ;;    2 --> 25   r92=0x5            :issue,iu,wb
 ;;    3 --> 12   r85=[r84]          :issue,iu,wb
 ;;    4 --> 14   r87=[r86]          :issue,iu,wb
 ;;    7 --> 15   [r85]=r87          :issue,iu,wb
 ;;    8 --> 16   r88=[r86+0x4]      :issue,iu,wb
 ;;   11 --> 17   [r85+0x4]=r88      :issue,iu,wb
 ;;   12 --> 18   r89=[r86+0x8]      :issue,iu,wb
 ;;   15 --> 19   [r85+0x8]=r89      :issue,iu,wb
 ;;   16 --> 20   r90=[r86+0xc]      :issue,iu,wb
 ;;   19 --> 21   [r85+0xc]=r90      :issue,iu,wb
 ;;   20 --> 22   r91=[r86+0x10]     :issue,iu,wb
 ;;   23 --> 23   [r85+0x10]=r91     :issue,iu,wb
 ;;   24 --> 26   [r84+0xc]=r92      :issue,iu,wb
 ;;   25 --> 31   clobber r3         :nothing
 ;;   25 --> 36   use r3             :nothing
 ;;  Ready list (final):
 ;;   total time = 25
 ;;   new head = 7
 ;;   new tail = 36

 There is an obviously better schedule to be obtained. Here is one such
 (hand-modified) schedule which just pipelines two of the loads to
 obtain a shorter critical path length to the whole function (function
 has only bb 0)

 ;;    0 --> 6    r84=r5             :issue,iu,wb
 ;;    1 --> 13   r86=[`Ptr_Glob']   :issue,iu,wb
 ;;    2 --> 25   r92=0x5            :issue,iu,wb
 ;;    3 --> 12   r85=[r84]          :issue,iu,wb
 ;;    4 --> 14   r87=[r86]          :issue,iu,wb
 ;;    7 --> 15   [r85]=r87          :issue,iu,wb
 ;;    8 --> 16   r88=[r86+0x4]      :issue,iu,wb
 ;;    9 --> 18   r89=[r86+0x8]      :issue,iu,wb
 ;;   10 --> 20   r90=[r86+0xc]      :issue,iu,wb
 ;;   11 --> 17   [r85+0x4]=r88      :issue,iu,wb
 ;;   12 --> 19   [r85+0x8]=r89      :issue,iu,wb
 ;;   13 --> 21   [r85+0xc]=r90      :issue,iu,wb
 ;;   14 --> 22   r91=[r86+0x10]     :issue,iu,wb
 ;;   17 --> 23   [r85+0x10]=r91     :issue,iu,wb
 ;;   18 --> 26   [r84+0xc]=r92      :issue,iu,mb_wb
 ;;   19 --> 31   clobber r3         :nothing
 ;;   20 --> 36   use r3             :nothing
 ;;  Ready list (final):
 ;;   total time = 20
 ;;   new head = 7
 ;;   new tail = 36

 This schedule is 5 cycles faster.

 I have read and re-read the material surrounding the DFA scheduler. I
 understand that the heuristics optimize critical path length and not
 stalls or other metrics. But in this case it is precisely the critical
 path length that is shortened by the better schedule. I have been
 examining various hooks available and for a while it seemed like

Re: DFA Scheduler - unable to pipeline loads

2007-09-04 Thread Matt Lee
On 8/31/07, Adam Nemet [EMAIL PROTECTED] wrote:
 Matt Lee [EMAIL PROTECTED] writes:

  I am seeing poor scheduling in Dhrystone where a memcpy call is
  expanded inline.
 
  memcpy (dst, src, 16) ==>
 
  load  1, rA + 4
  store 1, rB + 4
  load  2, rA + 8
  store 2, rB + 8
  ...

 Are you sure that there are no dependencies due to aliasing here?  The
 only similar thing that Dhrystone has to what you quote is between a
 pointer and a global variable, and in fact there is an aliasing
 conflict there.

 If that is the case you can define a movmem pattern where you first
 load everything in one chunk and then store it later.  See MIPS's
 movmemsi pattern and the function mips_block_move_straight.

 Adam



Adam, you are right. There is an aliasing conflict in my test case.
However, I get the same effect when I use the restrict keyword on the
pointers. Here is an even more reduced test case that shows the same
problem.

#include <string.h>

struct foo {
int a[4];
} ;

void func (struct foo * restrict p, struct foo * restrict q)
{
memcpy (p, q, sizeof (struct foo));
}

Perhaps restrict doesn't work. In any case, I am trying to optimize
the case where there is clearly no aliasing. Your suggestion regarding
movmemsi is interesting. I have not used this pattern before and
assumed that it was required only when something special must be done
to do block moves. In my architecture, a block move is not special and
is equivalent to a series of loads and stores. Why do I need this
pattern, and why/how does the aliasing conflict become a non-issue when
defining this pattern? I apologize if I am missing something basic
here, but the GCC documentation regarding this pattern doesn't tell me
much about why it is required.
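
From skimming the MIPS code, my reading of the approach is something
like the sketch below (the names are mine, not real GCC code; I am
assuming SImode word moves, a small constant length, and the usual
backend includes such as rtl.h and expr.h): every word is loaded into
its own fresh pseudo before any store is emitted.

/* Sketch only -- not real GCC code.  Copies LENGTH bytes from SRC to
   DEST (both BLKmode MEMs) a word at a time, loading every word into
   a fresh pseudo before emitting any store, so no load ends up
   ordered after a store it might alias.  Assumes LENGTH is a small
   multiple of UNITS_PER_WORD.  */

static void
my_block_move_straight (rtx dest, rtx src, HOST_WIDE_INT length)
{
  rtx regs[16];                 /* small fixed limit, sketch only */
  HOST_WIDE_INT offset;
  int i, nwords = length / UNITS_PER_WORD;

  /* Load phase: one new pseudo per word of the source.  */
  for (i = 0, offset = 0; i < nwords; i++, offset += UNITS_PER_WORD)
    {
      regs[i] = gen_reg_rtx (SImode);
      emit_move_insn (regs[i], adjust_address (src, SImode, offset));
    }

  /* Store phase: all loads are already emitted, so the scheduler and
     the allocator see independent pseudos rather than an interleaved
     load/store chain.  */
  for (i = 0, offset = 0; i < nwords; i++, offset += UNITS_PER_WORD)
    emit_move_insn (adjust_address (dest, SImode, offset), regs[i]);
}

If that is roughly right, I can see how it sidesteps the aliasing
question: the interleaved load/store order is simply never generated.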

-- 
thanks,
Matt


DFA Scheduler - unable to pipeline loads

2007-08-31 Thread Matt Lee
Hi,

I am working with GCC-4.1.1 on a simple 5-pipeline-stage scalar RISC
processor with the following description for loads and stores,

(define_insn_reservation "integer" 1
  (eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
  "issue,iu,wb")

(define_insn_reservation "load" 3
  (eq_attr "type" "load")
  "issue,iu,wb")

(define_insn_reservation "store" 1
  (eq_attr "type" "store")
  "issue,iu,wb")

I am seeing poor scheduling in Dhrystone where a memcpy call is
expanded inline.

memcpy (dst, src, 16) ==>

load  1, rA + 4
store 1, rB + 4
load  2, rA + 8
store 2, rB + 8
...

Basically, instead of pipelining the loads, the current schedule
stalls the processor for two cycles on every dependent store. Here is
a dump from the .35.sched1 file.

;;   ======================================================
;;   -- basic block 0 from 6 to 36 -- before reload
;;   ======================================================

;;    0 --> 6    r84=r5             :issue,iu,wb
;;    1 --> 13   r86=[`Ptr_Glob']   :issue,iu,wb
;;    2 --> 25   r92=0x5            :issue,iu,wb
;;    3 --> 12   r85=[r84]          :issue,iu,wb
;;    4 --> 14   r87=[r86]          :issue,iu,wb
;;    7 --> 15   [r85]=r87          :issue,iu,wb
;;    8 --> 16   r88=[r86+0x4]      :issue,iu,wb
;;   11 --> 17   [r85+0x4]=r88      :issue,iu,wb
;;   12 --> 18   r89=[r86+0x8]      :issue,iu,wb
;;   15 --> 19   [r85+0x8]=r89      :issue,iu,wb
;;   16 --> 20   r90=[r86+0xc]      :issue,iu,wb
;;   19 --> 21   [r85+0xc]=r90      :issue,iu,wb
;;   20 --> 22   r91=[r86+0x10]     :issue,iu,wb
;;   23 --> 23   [r85+0x10]=r91     :issue,iu,wb
;;   24 --> 26   [r84+0xc]=r92      :issue,iu,wb
;;   25 --> 31   clobber r3         :nothing
;;   25 --> 36   use r3             :nothing
;;  Ready list (final):
;;   total time = 25
;;   new head = 7
;;   new tail = 36

There is an obviously better schedule to be obtained. Here is one such
(hand-modified) schedule which just pipelines two of the loads to
obtain a shorter critical path length to the whole function (function
has only bb 0)

;;    0 --> 6    r84=r5             :issue,iu,wb
;;    1 --> 13   r86=[`Ptr_Glob']   :issue,iu,wb
;;    2 --> 25   r92=0x5            :issue,iu,wb
;;    3 --> 12   r85=[r84]          :issue,iu,wb
;;    4 --> 14   r87=[r86]          :issue,iu,wb
;;    7 --> 15   [r85]=r87          :issue,iu,wb
;;    8 --> 16   r88=[r86+0x4]      :issue,iu,wb
;;    9 --> 18   r89=[r86+0x8]      :issue,iu,wb
;;   10 --> 20   r90=[r86+0xc]      :issue,iu,wb
;;   11 --> 17   [r85+0x4]=r88      :issue,iu,wb
;;   12 --> 19   [r85+0x8]=r89      :issue,iu,wb
;;   13 --> 21   [r85+0xc]=r90      :issue,iu,wb
;;   14 --> 22   r91=[r86+0x10]     :issue,iu,wb
;;   17 --> 23   [r85+0x10]=r91     :issue,iu,wb
;;   18 --> 26   [r84+0xc]=r92      :issue,iu,mb_wb
;;   19 --> 31   clobber r3         :nothing
;;   20 --> 36   use r3             :nothing
;;  Ready list (final):
;;   total time = 20
;;   new head = 7
;;   new tail = 36

This schedule is 5 cycles faster.

I have read and re-read the material surrounding the DFA scheduler. I
understand that the heuristics optimize critical path length and not
stalls or other metrics. But in this case it is precisely the critical
path length that is shortened by the better schedule. I have been
examining various hooks available and for a while it seemed like
TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD must be set to a
larger window to look for better candidates to schedule into the ready
queue. For instance, this discussion seems to say so.
http://gcc.gnu.org/ml/gcc/2002-05/msg01132.html

But a post that follows soon after seems to imply otherwise.
http://gcc.gnu.org/ml/gcc/2002-05/msg01388.html

Both posts are from Vladimir. In any case the final conclusion seems
to be that the lookahead is useful only for multi-issue processors.
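
For reference, setting that hook in a port's .c file looks roughly
like the sketch below (the function name and the window size of 4 are
just placeholders I picked; the usual target-def.h include and the
targetm/TARGET_INITIALIZER boilerplate are assumed):

/* Sketch only: fragment of a hypothetical backend .c file.  */
static int
my_sched_first_cycle_multipass_dfa_lookahead (void)
{
  /* How many insns on the ready list the first-cycle multipass
     lookahead may consider; purely a guess for experiments.  */
  return 4;
}

#undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
#define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
  my_sched_first_cycle_multipass_dfa_lookahead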

So what is wrong in my description? How do I get GCC to produce the
optimal schedule?

I appreciate any help, as I have run into a similar scheduling
situation with other architectures in the past (GCC seems to choose
not to pipeline things as much as it can) and this time I would like
to understand it a bit more.

Compile flags used are basically -O3.
-- 
thanks,
Matt


Addressing modes question

2006-07-06 Thread Matt Lee

Hi,

Is it possible for me to write a backend that accepts certain
addressing modes for loads, while rejecting them for stores? I am not
sure how to do this with the GO_IF_LEGITIMATE_ADDRESS macro. I know
that this is not very sane, but the situation has arisen nevertheless.
I want to allow only indexed addressing on stores, while allowing
immediate forms of addressing with loads.
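
For context, a GO_IF_LEGITIMATE_ADDRESS definition typically looks
something like the sketch below (my_legitimate_address_p is just a
placeholder for a port-specific helper). As far as I can tell, the
macro is only handed the mode and the address rtx, never the insn
containing it, which is why I don't see how to distinguish loads from
stores here:

/* Sketch of the usual shape of the macro; the helper name is a
   placeholder.  Only the mode and the address rtx are available, so
   by itself this cannot tell a load from a store.  */
#define GO_IF_LEGITIMATE_ADDRESS(MODE, X, LABEL)        \
  do                                                    \
    {                                                   \
      if (my_legitimate_address_p ((MODE), (X)))        \
        goto LABEL;                                     \
    }                                                   \
  while (0)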

Any help is much appreciated.

thanks,
Matt