On Wed, Aug 26, 2009 at 03:02:44PM -0400, Bradley Lucier wrote:
> On Wed, 2009-08-26 at 20:38 +0200, Paolo Bonzini wrote:
> > 
> > > When I worked at AMD, I was starting to suspect that it may be more 
> > > beneficial
> > > to re-enable the first schedule insns pass if you were compiling in 64-bit
> > > mode, since you have more registers available, and the new registers do 
> > > not
> > > have hard wired uses, which in the past always meant a lot of spills 
> > > (also, the
> > > default floating point unit is SSE instead of the x87 stack).  I never got
> > > around to testing this before AMD and I parted company.
> > 
> > Unfortunately, hardwired use of %ecx for shifts is still enough to kill 
> > -fschedule-insns on AMD64.
> 
> The AMD64 Architecture manual I found said that various combinations of
> the RSI, RDI, and RCX registers are used implicitly by ten instructions
> or prefixes, and RBX is used by XLAT, XLATB.  So it appears that there
> are 12 general-purpose registers available for allocation.

XLATB is essentially useless (well, it may have had some uses back in the
16-bit days, when only a few registers could be used for addressing) and is
never generated by GCC.

However %ebx is used for PIC addressing in 32 bit mode so it is not 
always free either (I don't know about PIE code).

In 64-bit mode, PIC/PIE code uses PC-relative addressing, so you actually
get 9 more free registers than in 32-bit PIC code (the 8 new ones plus %ebx).

However, for some reason you glossed over the case of integer division,
which always uses %edx and %eax. This is true even when dividing by a
constant (not a power of 2), in which case gcc will often use a widening
multiply instead, whose result is in %edx:%eax, so it's almost a wash
in terms of fixed register usage (not exactly: the division uses %edx:%eax
as the dividend and needs the divisor somewhere else, while the widening
multiply uses %eax as one input but %edx can be used for the other).
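
To make the constant-divisor case concrete, here is a minimal C sketch of
what the compiler effectively does for an unsigned 32-bit division by 10.
The function names are made up for illustration, and the constant
0xCCCCCCCD with a shift of 35 is the usual reciprocal for this particular
divisor and width, not something copied from GCC's sources. In 32-bit code
the 64-bit product is what ends up in %edx:%eax.

```c
#include <stdint.h>

/* Plain division: with a run-time divisor this is typically compiled to
   DIV, which takes its dividend in %edx:%eax and leaves the quotient
   in %eax. */
uint32_t div_plain(uint32_t n, uint32_t d)
{
    return n / d;
}

/* Division by the constant 10 written as a widening multiply.
   0xCCCCCCCD is ceil(2^35 / 10), so (n * 0xCCCCCCCD) >> 35 == n / 10
   for every 32-bit unsigned n. */
uint32_t div10_mul(uint32_t n)
{
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu; /* 32x32 -> 64 multiply */
    return (uint32_t)(product >> 35);             /* high half, then 3 more bits */
}
```

Either way %eax and %edx are tied up; the only difference is which of the
two may hold an input.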

(As a side note, %edx and %eax are also special with regard to I/O port
accesses but this is only of interest in device drivers).

> Are 12 registers not enough, in principle, to do scheduling before
> register allocation? 

I don't know, but I would say that you have about 14 registers
for address computations/indexing since you seem to be interested
in FP code. I would think that it is sufficient for many inner
loops (but not all, it really depends on the number of arrays
that you access and the number of independent indexes that
you have to keep).
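
As a rough illustration of that count, consider a hypothetical inner loop
(the function and its parameters are invented here, not taken from your
code) that reads two arrays with independent strides and writes a third.
The comment tallies the integer values that stay live across iterations,
assuming the compiler does nothing clever to merge or eliminate them.

```c
#include <stddef.h>

/* z[i] = a * x[i*sx] + y[i*sy]
   Integer values live in the loop: three base pointers (x, y, z),
   two running indexes (ix, iy), two strides (sx, sy), the counter i
   and the bound n: nine in all.  The scalar a and the temporaries
   live in SSE registers and do not compete with them. */
void scaled_add(size_t n, double a,
                const double *x, size_t sx,
                const double *y, size_t sy,
                double *z)
{
    size_t i, ix = 0, iy = 0;
    for (i = 0; i < n; i++) {
        z[i] = a * x[ix] + y[iy];
        ix += sx;   /* independent index into x */
        iy += sy;   /* independent index into y */
    }
}
```

Nine fits comfortably; add a few more arrays with their own strides and
it no longer does.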

> I was getting a 15% speedup on some numerical
> codes, as pre-scheduling spaced out the vector loads among the
> floating-point computations.

Well vector loads and floating point computations do not have anything 
to do with integer register choices. The 16 FP registers are 
nicely orthogonal (compared to the real nightmare that the x87 stack was).
In practice you schedule on 16 FP registers and 14 (15 if you omit
the frame pointer) addressing/indexing/counting registers.

In this type of code there are typically very few instructions with
fixed register constraints, and the least likely of those are the string
instructions. Shifts by a variable amount and integer divides
are still possible, but unlikely.
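
For completeness, the two remaining cases look like this in C. This is
only a sketch with invented names, and it assumes the loops are compiled
to scalar code: a shift by a run-time amount needs its count in %cl, and
a divide by a run-time value goes through %edx:%eax, so either one pins
specific registers when it shows up in a loop body.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift every element left by a run-time amount s (0..31).  The count
   is loop invariant, so it would sit in %ecx for the whole loop. */
void shift_all(uint32_t *v, size_t n, unsigned s)
{
    size_t i;
    for (i = 0; i < n; i++)
        v[i] <<= s;
}

/* Reduce every element modulo a run-time value m (m must be nonzero).
   Each DIV ties up both %eax and %edx. */
void mod_all(uint32_t *v, size_t n, uint32_t m)
{
    size_t i;
    for (i = 0; i < n; i++)
        v[i] %= m;
}
```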

        Gabriel
