On Fri, Feb 01, 2002 at 01:32:13AM +0000, Nicholas Clark wrote:
> This just about implements a jit for ARM. It doesn't actually do any ops in
> assembler yet, except for end. It's named on the basis that it's for v3 or

This is where I give up on the current format.
Others are welcome to carry on either based on what I did, or starting
afresh. And if we get a fresh format, I'm interested.
What I've written will call parrot ops.

> Problems that I remember that I encountered. (Comments in the code may
> indicate more). Part of these were understanding things - it doesn't mean
> that the current way is wrong, just that it wasn't obvious to me :-(
> 
> 1: '}' is a necessary character in ARM assembler syntax, so jit2h.pl needs
>    to be a bit smarter about deciding when to chop the end of a function
> 
> 2: There is no terse way to load arbitrary 32 bit constants into a register
>    with ARM instructions. There are 2 usual methods
>    1: Put the constant in a constant pool within +- 4092 or so bytes of the
>       PC, and load it with an offset from the PC.
>    2: Make it with 1, 2 or 3 instructions. I believe that currently it is
>       conjectured that it is possible to make any 32 bit value with 3 ARM
>       instructions, and so far no-one has found any value that they couldn't
>       make, but no-one has proved it possible and thereby made an algorithm
>       that lets a program generate instructions to build a constant
> 
>    Either way, I found I was fighting the current jit which expects (at worst)
>    to be able to split a 32 bit constant into 2 (possibly unequal) halves
>    stored in two machine instructions. To be more flexible jit would need to
>    know what some CPU registers contain (ie things like the current
>    interpreter pointer), and be able to choose whether to get a value or
>    pointer by arithmetic from a CPU register, by dereferencing a CPU register
>    (possibly with offset) or by giving up and loading a constant
> 
>    This will make more sense to anyone who gets hold of an ARM machine and
>    then tries to write ops :-)
> 
> 3: I wanted to put the pointer to the current interpreter in r7. This made
>    the default precompiled "call" function have its branch somewhere wonky.
>    It seems to me that Parrot::Jit->call should be returning a 2 item list:
>    the bytecode, and the offset of the branching instruction in there.

Actually, I'd like to do arbitrary call like this:

        mov     r1, r7                  ; say arg 2 is *interpreter
        adr     r14,  .L1               ; pseudocode for pc relative calc.
        ldmia   r14!,  {r0, r2, r3, pc} ; register list built by jit
.L1:    r0 data
        r2 data
        r3 data
        <where ever>            ; address of function.
.L2:                            ; next instruction - return point from func.

Which to me doesn't look much like the way the current system expects to
prime the registers in order.

ARM SPECIFIC BIT:

I'm taking advantage of the way that a branch to subroutine instruction
(bl) stores the return address in r14 (lr, the link register), and loads
pc (r15) with the subroutine address.
The above (untested) code takes advantage of r14 (and r15) being regular
registers: it folds "load all the registers with function parameters" and
"call the function" into 1 (yay!) instruction, which (effectively) treats
r14 as a stack pointer.

1 instruction to prime r14 with the address of label .L1 (1 clock cycle)
1 instruction to:

 load all the registers with parameters
 load the program counter with the subroutine address (so branch into it)
 write back new pointer value to r14 (which will be pointing at .L2)
   which has effectively set the return address for the function.

Admittedly that load takes >1 clock cycle. But it just seems a cool way to do
it.


LESS ARM SPECIFIC BIT:

However, building the ldmia instruction means setting the bitmap of registers
to load based on which registers have values in the hitlist between .L1 and
.L2. And if
some are already in CPU registers, or are actually to be loaded from Parrot
registers, then they don't need to be pulled from the hitlist, because they
are being evaluated some other way.
(eg I think it is a good idea to keep the current interpreter pointer in a
CPU register (eg r7), hence if that is needed as a function parameter it's
a mov, rather than a memory load)

And if arguments need dereferencing first, then I need to load the pointer, then
dereference, and hence they don't want to be in the hitlist.
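
To make the bitmap building concrete, here is a rough sketch in Python of
how a jit might assemble the call sequence above. The function name
build_call and its dict-of-registers interface are invented for
illustration, little-endian output is assumed, and like the assembler above
it has not been run on real hardware. Only parameters that come from the
hitlist are passed in; anything set up by a mov or a dereference is left out
of the register list.

```python
import struct

PC = 15  # r15

def build_call(params, func_addr):
    """Hypothetical helper: params maps register number (0-3) to the
    32-bit value that should come from the inline hitlist."""
    reglist = 1 << PC                      # pc is always loaded last
    for reg in params:
        if not 0 <= reg <= 3:
            raise ValueError("only r0-r3 carry hitlist parameters here")
        reglist |= 1 << reg
    adr = 0xE28FE000                       # add r14, pc, #0 -> r14 = .L1
    ldmia = 0xE8BE0000 | reglist           # ldmia r14!, {<reglist>}
    # LDM fills the lowest-numbered register from the lowest address,
    # so the hitlist has to be laid out in ascending register order.
    hitlist = [params[r] & 0xFFFFFFFF for r in sorted(params)]
    words = [adr, ldmia] + hitlist + [func_addr & 0xFFFFFFFF]
    return struct.pack("<%dI" % len(words), *words)

code = build_call({0: 0x11, 2: 0x22, 3: 0x33}, 0x8000)
```

The adr only works out to "add r14, pc, #0" because .L1 sits exactly 8 bytes
after it, which is where the ARM pc reads as pointing.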

CONCLUSION:

So I start needing to build simple, parameterisable code, but more complex
than the current system allows.

> 4: I think in a RISC way, so expect the offset to be of the start of the
>    instruction that needs butchering, not the byte within it. (How the sparc
>    position was expressed confused me for a while).

To be very undiplomatic:

5: The current way the jit is done turns into madness on ARM.

To be specific

The current system seems to be well suited to how x86 wants to work.
It's really cool to have x86 going much faster.

On ARM I think the best way to get the contents of parrot registers into/
out of CPU registers is to put the address of I1 into an ARM register, and
load parrot registers into/out of CPU registers with memory load/store offset,
which seems to be radically different from how x86 is working. This appears
to be how Sparc is working.

I also guess I need to have a global register for integer constants, as they
can't go inline. Actually, a global register for a merged constant pool is
a better idea. This appears to differ from Sparc.

So for set_i_i what I actually want the jit to translate that to is

        ldr     ip, [r4, #8]
        str     ip, [r4, #4]

if I have the address of I1 in r4, and I'm doing set_i I2, I3
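
As a sketch of the arithmetic involved (Python, with made-up helper names,
and assuming parrot integer registers are 4 bytes each and laid out
contiguously from I1), the offset and the instruction word can be computed
like this:

```python
def int_reg_offset(n):
    """Byte offset of parrot register In from I1 (assumes 4-byte INTVALs)."""
    return (n - 1) * 4

def encode_ldr_str(load, rd, rn, offset):
    """Encode ldr/str rd, [rn, #offset] with a positive immediate offset."""
    if not 0 <= offset <= 0xFFF:
        raise ValueError("offset must fit in 12 bits")
    word = 0xE5800000          # cond=always, pre-indexed, add offset, word
    if load:
        word |= 1 << 20        # L bit: 1 = ldr, 0 = str
    return word | (rn << 16) | (rd << 12) | offset

IP, R4 = 12, 4
ldr = encode_ldr_str(True, IP, R4, int_reg_offset(3))   # ldr ip, [r4, #8]
str_ = encode_ldr_str(False, IP, R4, int_reg_offset(2)) # str ip, [r4, #4]
```

The 12 bit offset field is the "special bit" that would need patching at jit
time for each register number.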


As far as I can tell, currently I have to write this in core.ops:


Parrot_set_i_i {
    ldr ip, &INT_REG[2]
    str ip, &INT_REG[1]
}

The current jit2h and module code

1: reads that
2: mangles &INT_REG[1] to something that is syntactically legal
3: calls out to as
4: calls objdump to disassemble it
5: looks for a pattern to spot where the special bit is.
   AARGH. The "special bit" is the last 12 bits of the instruction.
   If I have to convert the disassembled instruction back to binary, and
   then match /^....010....(.)....................$/ to find out if it's
   LDR or STR (with $1 determining which) I feel I might as well write my
   own ARM assembler in perl

and then

6: I need to write more specialised C code in jit.c to mangle the LDR or STR
   instruction at jit building time, and that too needs to be taught the
   instruction format.
7: I need to teach jit.c that on this architecture it is doing INT_REG
   loads in this way. (There is a forest of #ifdef rapidly growing there, as
   it seems every architecture is not as "simple" as x86)


And HELL. I'm going to need to do the same hoop jumping for NUM_REG, string
REG, INT_CONST, NUM_CONST, current opcode, aargh.

This is why I feel it's getting futile.

I'd like to be able to write a subroutine that the jit calls. Parameters are
the parrot opcode and parameters, the address to assemble at (so I know what
the program counter will be) and probably some other stuff about what the
CPU registers contain. Output is the section of assembler code.

(Like C signal handlers I may be able to merge several opcodes into 1
generator function, hence I'd like the opcode number (or name?) as a parameter)

So for set_i_i I'd be passed something like (0xdeadbeef, "set_i_i", 2, 3)
and I'd return 8 bytes of ARM assembler that do it.

And all the knowledge about how to "do" ARM instructions is in exactly one
place. **Not mixed across core.ops, arm*Generic and jit.c**


This would actually let me micro-code the parrot ops. I could implement
set_i_i as set_cpu_i, set_i_cpu, and in turn call 2 functions to generate
code to load a parrot reg into a CPU register, and store the CPU register
back to a parrot reg.
Whilst that seems a lot of effort for a 2 instruction job such as set_i_i,
the load from RAM to CPU is going to be needed at (or near) the front of
every parrot op, and the store from CPU to RAM at the end, so on a RISC CPU
being able to subdivide the parrot ops seems to make sense.
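
A minimal sketch of what such a generator interface could look like, in
Python for brevity. Everything here is hypothetical: the names generate,
GENERATORS, load_int_reg and store_int_reg are invented, r4 is assumed to
hold the address of I1, and output is assumed little-endian. The point is
only that the instruction format lives in one place and ops are built from
the two micro-steps.

```python
import struct

IP = 12        # scratch CPU register
REG_BASE = 4   # assumption: r4 holds the address of parrot register I1

def load_int_reg(cpu_reg, n):
    """ldr cpu_reg, [r4, #offset of In]"""
    return 0xE5900000 | (REG_BASE << 16) | (cpu_reg << 12) | ((n - 1) * 4)

def store_int_reg(cpu_reg, n):
    """str cpu_reg, [r4, #offset of In]"""
    return 0xE5800000 | (REG_BASE << 16) | (cpu_reg << 12) | ((n - 1) * 4)

def gen_set_i_i(pc, dest, src):
    # pc is unused here, but pc-relative ops would need it
    return [load_int_reg(IP, src), store_int_reg(IP, dest)]

# one generator function may serve several opcodes, keyed by name
GENERATORS = {"set_i_i": gen_set_i_i}

def generate(pc, opname, *args):
    """Return the machine code bytes for one parrot op."""
    words = GENERATORS[opname](pc, *args)
    return struct.pack("<%dI" % len(words), *words)

code = generate(0xdeadbeef, "set_i_i", 2, 3)
```

Here code comes back as the 8 bytes for the ldr/str pair, and an op such as
an integer add could reuse load_int_reg and store_int_reg unchanged.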

It would also mean that something like set I2, 0 doesn't need to use the
arbitrary 32 bit constant pool, as I could make my generator encode that as

    mov ip, #0
    str ip, [r4, #4]

Hmm. Going to need to pass state about constant pools in and out of generators.
Also, knowing which CPU registers contain fixed values such as 0 could be
useful. Maybe that's getting to optimiser stage.
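
For deciding when the constant pool can be skipped, a generator could test
whether the value fits an ARM data-processing immediate (an 8 bit value
rotated right by an even amount). A rough Python sketch, with invented
function names and the same r4-points-at-I1 assumption as before:

```python
def arm_immediate(value):
    """Return the 12-bit operand field (rot << 8 | imm8) if value can be
    encoded as an ARM immediate, else None."""
    value &= 0xFFFFFFFF
    for rot in range(16):
        shift = 2 * rot
        # rotating left by `shift` undoes the CPU's rotate right
        imm8 = ((value << shift) | (value >> (32 - shift))) & 0xFFFFFFFF
        if imm8 < 0x100:
            return (rot << 8) | imm8
    return None

def gen_set_i_const(dest, const, cpu_reg=12, reg_base=4):
    """set I<dest>, const: return [mov, str] words, or None if the
    constant needs the pool (or a multi-instruction build) instead."""
    field = arm_immediate(const)
    if field is None:
        return None
    mov = 0xE3A00000 | (cpu_reg << 12) | field   # mov cpu_reg, #const
    store = 0xE5800000 | (reg_base << 16) | (cpu_reg << 12) | ((dest - 1) * 4)
    return [mov, store]

words = gen_set_i_const(2, 0)    # mov ip, #0 ; str ip, [r4, #4]
```

A None result is the cue to fall back to whatever constant pool state gets
passed into the generator.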

Nicholas Clark
-- 
EMCFT http://www.ccl4.org/~nick/CV.html
