Hi all!
I want to adapt the current register allocation scheme so that it is
usable for targets with virtual registers. I believe that, if done
correctly, this may even result in some benefit to physical register
targets, too.
** Some background **
I am working on a wasm port for GCC. wasm is a virtual register target,
meaning that a compiler is supposed to declare an arbitrary set of
registers that it is willing to use for a function, and use this set of
registers without ever needing to spill values onto the stack.
Currently, the only other virtual register target is nvptx. To have
virtual registers, it skips over IRA and LRA entirely. This works
relatively well for them, since due to specifics of the target, there is
a one-to-one correspondence between register's mode and would-be
register class, so instructions can be selected based purely on the
instruction's RTL shape, without need for constraints. nvptx also
doesn't need to worry about size, so it can declare as many registers
as it likes, without trying to compress the set of declared registers.
wasm unfortunately isn't like that: being a target primarily aimed at
use on websites, size is usually very important for page loading speed.
Proper register allocation would still be useful here to compress the
set of declared registers for a function. This compression would also be
useful because wasm addresses registers via leb128 indices, so popular
registers can also have lower indices, which reduces size of register
access instructions. (benefit form register allocation here is that
as there are less registers in total, a bigger proportion of them can
get short indices)
Unlike nvptx, wasm lacks virtual registers for 8- and 16-bit values. As
constraints are meaningless without register allocation, such values
still have to be stored as 32-bit values, and pretty complicated
workarounds are required to ensure that a register of incorrect mode
does not end up in an operation that does not work on physical registers
of such size (e.g. via a subreg expr).
There are also optimization opportunities that are missed if only mode
determines which register value gets placed in: if a register of small
mode is primarily used via subregs of large mode, it would be beneficial
to place such value in a large register in the first place, instead of
trying to actually cast the value to a large mode first every time.
Similar situation can occur in reverse: A large value that is used as a
small one several times in a row, would benefit from being temporarily
reloaded to a small register class, to coalesce the casts.
** wasm-specific considerations **
- wasm also has special types that can only exist in registers, and for
which no instructions to load, store or cast to/from an integer is
available. This would mean that for these types the register allocator
cannot ever pick NO_REGS as a fallback (which it shouldn't ever do on a
virtual register target regardless, but it's one more thing to keep in
mind).
- wasm's to-be register classes are also disjoint, meaning that a value
that resides in a 64-bit register cannot be accessed as-if it resides in
a 32-bit register. There is also no register class that can hold any
value, so GENERAL_REGS is also not meaningful, and ALL_REGS is not a
register class any target virtual register can actually be assigned.
** General considerations **
- instructions for virtual register targets may actually accept an
arbitrary amount of arguments (e.g. wasm's call family of insns, or asm
blocks), each of which would need a constraint, so register allocation
infrastructure needs to be adapted to support that. It probably would be
useful even for targets with a fixed amount of registers, as that will
potentially remove the current 30 operand limit on asm_operands.
- register asm decls would no longer be able to specify a specific
register name to store the value in, they would need to specify a class
instead. This may be of use for fixed register targets too, e.g. First
of all, porting GCC to one more target is not a simple task. In
Cygnus (a company which made a lot of GCC ports), it usually took half
a year for a developer with some experience to compile a "hello,
world" program. Of course, there are people who could make a port much
faster, e.g. Mike Meissner who made more than 10 ports. But probably
for a novice like you, I'd plan on 6 months of full work at least. You
will learn a lot about GCC during this work.
To use GCC optimizations fully (e.g. generate madd insns by the GCC
combiner pass), I believe you should get rid of stack insns: all
operands should be explicit. And generation of stack insns should be
the very last pass of GCC. You could adapt the reg-stack.cc pass for
this or make your own machine-dependent pass (it is easier for
implementation review and approval too). The machine-dependent pass
can do some optimization too if it is necessary, e.g. renumbering
local vars to make frequently used vars have smaller numbers (but I am
not sure it will be necessary after IRA).
Local variables should be pseudos starting with a very big number to
avoid spilling. And hard registers (numbered before the first
pseudo-register) should be final local variable numbers. I think
FIRST_PSEUDO_REGISTER=10000 is enough to avoid spilling. The most
demanding benchmarking program I know, fppp, has register pressure of
several hundred only. You can divide 10000 hard registers by mode
(integer, FP, vector hard regs).
That is the high-level wasm port design I think is worth doing.
I think although wasm is unusual, it is not complicated (real hardware
targets are less regular than wasm). Probably RTL insns for wasm will
have only one alternative, no secondary memory, no secondary reload,
etc. When you run into a problem, look at how other ports solve the
problem — the simpler and more regular the port, the better (I'd
recommend a classic RISC architecture for this).
Probably, you will need to switch off some code in LRA,
e.g. inheritance (as you will have no spilling), hard reg splitting
(because you always have enough regs), or maybe register
elimination. Also, when working on RA for wasm, you could switch off
other LRA optimizations and subpasses first (like rematerialization,
equivalence substitution), and switch them on later (if they give some
benefits). It may be that after switching off all LRA optimizations,
LRA will only need to change pseudos to hard regs and check that the
insn constraints are satisfied.
You can switch off regional RA in IRA (I don't think it is necessary
for wasm as it has an unconstrained number of locals, and regional RA
can only help to generate smaller numbers for local vars in a
region/loop). You could also start with IRA fast_allocation, which is
much simpler than coloring, but I am not sure it saves you time.
to place a value in any register as long as it has the required class.
** The request **
I would like to adapt the current setup of IRA+LRA to work under these
constraints. Unfortunately, after attempting to do that by myself my
changes kept behaving unpredictably, and I realized that I lack the
understanding necessary, and would like some guidance.
- Which parts should I touch first, to get something testable? Every
time I tried to start this, I felt like there were too many moving
parts, between machine constraints, and various parts of ira and lra.
- What would a good design for this look like? E.g. my first attempt
used backend hooks to mark out additional operands based on
circumstantial evidence from the RTL shape itself. Would it be better
to create another rtx like "use with constraint" instead?
- Is adapting the current infrastructure even the right call, or would
it be simpler to write a separate register allocator? It may not even
be possible to make the current scheme never reload to memory, but
that's a hard requirement for wasm.