On Wed, Aug 26, 2009 at 03:02:44PM -0400, Bradley Lucier wrote:
> On Wed, 2009-08-26 at 20:38 +0200, Paolo Bonzini wrote:
> > > When I worked at AMD, I was starting to suspect that it may be more
> > > beneficial to re-enable the first schedule insns pass if you were
> > > compiling in 64-bit mode, since you have more registers available,
> > > and the new registers do not have hard wired uses, which in the past
> > > always meant a lot of spills (also, the default floating point unit
> > > is SSE instead of the x87 stack). I never got around to testing this
> > > before AMD and I parted company.
> >
> > Unfortunately, hardwired use of %ecx for shifts is still enough to kill
> > -fschedule-insns on AMD64.
>
> The AMD64 Architecture manual I found said that various combinations of
> the RSI, RDI, and RCX registers are used implicitly by ten instructions
> or prefixes, and RBX is used by XLAT, XLATB. So it appears that there
> are 12 general-purpose registers available for allocation.

XLATB is essentially useless (well, maybe it had some uses back in the
16-bit days, when only a few registers could be used for addressing) and
is never generated by GCC. However, %ebx is used for PIC addressing in
32-bit mode, so it is not always free either (I don't know about PIE
code). In 64-bit mode, PIC/PIE use PC-relative addressing, so this
actually gives you 9 more free registers than in 32-bit mode.

However, for some reason you glossed over the case of integer division,
which always uses %edx and %eax. This is true even when dividing by a
constant (non power of 2), in which case gcc will often use a widening
multiply instead, whose result is in %edx:%eax, so it's almost a wash in
terms of fixed register usage (not exactly: the division uses %edx:%eax
as the dividend and needs the divisor somewhere else, while the widening
multiply uses %eax as one input but %edx can be used for the other).

(As a side note, %edx and %eax are also special with regard to I/O port
accesses, but this is only of interest in device drivers.)

> Are 12 registers not enough, in principle, to do scheduling before
> register allocation?

I don't know, but I would say that you have about 14 registers for
address computations/indexing, since you seem to be interested in FP
code. I would think that it is sufficient for many inner loops (but not
all; it really depends on the number of arrays that you access and the
number of independent indexes that you have to keep).

> I was getting a 15% speedup on some numerical
> codes, as pre-scheduling spaced out the vector loads among the
> floating-point computations.

Well, vector loads and floating point computations do not have anything
to do with integer register choices. The 16 FP registers are nicely
orthogonal (compared to the real nightmare that the x87 stack was). In
practice you schedule on 16 FP registers and 14 (15 if you omit the
frame pointer) addressing/indexing/counting registers.
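
To make the point above about division by a constant concrete, here is a
small sketch in C. The function names are made up for illustration, and
div10_mul only spells out by hand the kind of widening multiply that a
compiler may substitute for x / 10. Which instructions actually come out
depends on the compiler version, the flags and the target, so treat this
as an assumption to check rather than a description of what gcc emits.

```c
#include <stdint.h>

/* Division by a constant that is not a power of 2. The compiler is free
   to pick the instruction sequence; it often avoids a real divide. */
uint32_t div10_plain(uint32_t x)
{
    return x / 10;
}

/* The same quotient written as a widening 32x32->64 multiply followed by
   a shift. 0xCCCCCCCD is ceil(2^35 / 10); the identity holds for every
   32-bit unsigned x. On 32-bit x86 the 64-bit product of such a multiply
   lands in %edx:%eax. */
uint32_t div10_mul(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* Division by a run-time value: on x86 this needs a divide instruction,
   with the dividend in %edx:%eax and the divisor somewhere else. */
uint32_t div_var(uint32_t x, uint32_t d)
{
    return x / d;
}
```

Compiling this with something like gcc -O2 -S and comparing the output
for the three functions shows which of them end up tied to %edx and
%eax on a given target.
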
In this type of code there are typically very few instructions with
fixed register constraints, and the least likely of them are the string
instructions. Shifts by a variable amount and integer divides are still
possible, but unlikely.

	Gabriel