On 7/22/25 10:42, Joelle van Dyne wrote:
> The following patch is from work by Katherine Temkin to add a JITless backend
> for aarch64. The aarch64-tcti target for TCG uses pre-compiled "gadgets",
> snippets of code for every combination of TCG op and operands, which are
> "threaded" together at runtime with a jump at the end of each gadget. This
> results in a significant speedup over TCI but is still much slower than JIT.
> 
> This backend is mainly useful for iOS, which does not allow JIT in distributed
> apps. We ported the feature from v5.2 to v10.0 but will rebase it to master if
> there is interest. Would this backend be useful for mainline QEMU?

That's certainly interesting.

(1) The generated gadget code is excessively large: 75MiB of code, 18MiB of 
tables.

$ ld -r -o z.o *gadgets.c.o && size z.o

    text      data
74556216  18358784

Have you done experiments with the number of available registers to see at what point you get diminishing returns? The combinatorics on 16 registers is not in your favor.

I would expect you could make do with e.g. 6 writable registers, plus SP and ENV. Note that SP and ENV may be read in controlled ways, but never written: you would never generate a gadget that writes to either, and they need not be usable in other positions, such as the source operand of a multiply.
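
For a rough, illustrative sense of scale (assuming one gadget per opcode per
distinct register tuple):

    3-operand op, 16 registers:  16^3 = 4096 gadgets per opcode per width
    3-operand op,  6 registers:   6^3 =  216 gadgets per opcode per width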

Have you done experiments with 2-operand matching constraints?
I.e. "d*=m" instead of "d=n*m".
That will also dramatically affect combinatorics.
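
For illustration (hypothetical gadget shapes): a non-matching gadget of the
form "mul_d_n_m: d = n * m" needs N^3 register variants, while a matching
form "mul_d_m: d *= m" needs only N^2. With 6 writable registers that is 36
variants per opcode and width instead of 216.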

Have you experimented with, for the most part, generating *only* 64-bit code? For many operations, the 32-bit operation can be implemented by the 64-bit instruction simply by ignoring the high 32 bits of the output. E.g. add, sub, mul, logicals, left-shift. That would avoid repeating some gadgets for _i32 and _i64.
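
A minimal C sketch of why this is safe for those ops (truncation commutes
with them modulo 2^32; the values here are arbitrary):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint64_t a = 0xdeadbeef8badf00dull;
    uint64_t b = 0x0123456789abcdefull;

    /* The low 32 bits of the 64-bit operation equal the 32-bit result. */
    assert((uint32_t)(a + b)  == (uint32_t)((uint32_t)a + (uint32_t)b));
    assert((uint32_t)(a - b)  == (uint32_t)((uint32_t)a - (uint32_t)b));
    assert((uint32_t)(a * b)  == (uint32_t)((uint32_t)a * (uint32_t)b));
    assert((uint32_t)(a ^ b)  == (uint32_t)((uint32_t)a ^ (uint32_t)b));
    assert((uint32_t)(a << 7) == (uint32_t)((uint32_t)a << 7));
    return 0;
}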

(2) The use of __attribute__((naked)) means this is clang-specific.

I'm really not sure why you're generating C and using naked, as opposed to simply generating assembly directly. By generating assembly, you could also emit the correct unwind info for the gadgets. Or, one set of unwind info for *all* of the gadgets.

I'll also note that it takes up to 5 minutes for clang to compile some of these files, so using the assembler instead of the compiler would help with that as well.

(3) The tables of gadgets are writable, placed in .data.

With C, it's trivial to place the "const" correctly so that these tables land in mostly-read-only memory (.data.rel.ro; constant after runtime relocation).

With assembly, it's trivial to generate 32-bit pc-relative offsets instead of raw pointers, which would allow truly read-only memory (.rodata). This would require trivial changes to the read of the tables within tcg_out_*gadget.
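
A sketch of the C-side reader under that scheme, assuming each table entry is
emitted as a 32-bit offset relative to the entry itself (e.g. ".int gadget - .");
the helper name here is made up for illustration:

#include <stddef.h>
#include <stdint.h>

/* Recover a gadget address from a read-only table of entry-relative offsets. */
static inline const void *gadget_addr(const int32_t *table, size_t i)
{
    return (const char *)&table[i] + table[i];
}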

(4) Some of the other changes are unacceptable, such as the ifdefs for movcond.

I'm surprised about that one, really. I suppose you were worried about combinatorics on a 5-operand operation. But there's no reason you need to keep comparisons next to uses.

You are currently generating C*N^3 versions of setcond_i64, setcond_i32, negsetcond_i64, negsetcond_i32, brcond_i64, and brcond_i32.

But instead you can generate N^2 versions of compare_i32 and compare_i64, and then chain that to C versions of brcond, C*N versions of setcond and negsetcond, and C*N^2 versions of movcond. The cpu flags retain the comparison state across the two gadgets.
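
As an illustrative count (taking, say, N = 6 writable registers and C = 10
conditions, per operand width): setcond alone drops from C*N^3 = 2160 gadgets
today to N^2 = 36 shared compare gadgets plus C*N = 60 setcond gadgets.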

(5) docs/devel/tcg-tcti.rst generates a build error.

This is caused by the new file not being linked into the documentation index.

(6) Do you actually see much speed-up from the vector code?

(7) To be properly reviewed, this would have to be split into many, many patches.

(8) True interest in this is probably limited to iOS, and thus an aarch64 host.

I guess I'm ok with that. I would insist that the code remain buildable on any aarch64 host so that we can test it in CI.

I have thought about using gcc's computed goto instead of assembly gadgets, which would allow this general scheme to work on any host. We would be at the mercy of the quality of code generation though, and the size of the generated function might break the compiler -- this has happened before for machine-generated interpreters. But it would avoid two interpreter implementations, which would be a big plus.

Here's an incomplete example:

uintptr_t tcg_qemu_tb_exec(CPUArchState *env, const void *v_tb_ptr)
{
    static void * const table[20]
        __asm__("tci_table")  __attribute__((used)) = {
        &&mov_0_1,
        &&mov_0_2,
        &&mov_0_3,
        &&mov_1_0,
        &&mov_1_2,
        &&mov_1_3,
        &&mov_2_0,
        &&mov_2_1,
        &&mov_2_3,
        &&mov_3_0,
        &&mov_3_1,
        &&mov_3_2,
        &&stq_0_sp,
        &&stq_1_sp,
        &&stq_2_sp,
        &&stq_3_sp,
        &&ldq_0_sp,
        &&ldq_1_sp,
        &&ldq_2_sp,
        &&ldq_3_sp,
    };

    const intptr_t *tb_ptr = v_tb_ptr;
    tcg_target_ulong r0 = 0, r1 = 0, r2 = 0, r3 = 0;
    uint64_t stack[(TCG_STATIC_CALL_ARGS_SIZE + TCG_STATIC_FRAME_SIZE)
                   / sizeof(uint64_t)];
    uintptr_t sp = (uintptr_t)stack;

#define NEXT  goto *tb_ptr++

    NEXT;

    mov_0_1: r0 = r1; NEXT;
    mov_0_2: r0 = r2; NEXT;
    mov_0_3: r0 = r3; NEXT;
    mov_1_0: r1 = r0; NEXT;
    mov_1_2: r1 = r2; NEXT;
    mov_1_3: r1 = r3; NEXT;
    mov_2_0: r2 = r0; NEXT;
    mov_2_1: r2 = r1; NEXT;
    mov_2_3: r2 = r3; NEXT;
    mov_3_0: r3 = r0; NEXT;
    mov_3_1: r3 = r1; NEXT;
    mov_3_2: r3 = r2; NEXT;
    stq_0_sp: *(uint64_t *)(sp + *tb_ptr++) = r0; NEXT;
    stq_1_sp: *(uint64_t *)(sp + *tb_ptr++) = r1; NEXT;
    stq_2_sp: *(uint64_t *)(sp + *tb_ptr++) = r2; NEXT;
    stq_3_sp: *(uint64_t *)(sp + *tb_ptr++) = r3; NEXT;
    ldq_0_sp: r0 = *(uint64_t *)(sp + *tb_ptr++); NEXT;
    ldq_1_sp: r1 = *(uint64_t *)(sp + *tb_ptr++); NEXT;
    ldq_2_sp: r2 = *(uint64_t *)(sp + *tb_ptr++); NEXT;
    ldq_3_sp: r3 = *(uint64_t *)(sp + *tb_ptr++); NEXT;

#undef NEXT

    return 0;
}

This compiles to fragments like

  // ldq_x_sp
  3c:   f9400420        ldr     x0, [x1, #8]
  40:   f8410425        ldr     x5, [x1], #16
  44:   f86668a5        ldr     x5, [x5, x6]
  48:   d61f0000        br      x0

  // stq_x_sp
  7c:   aa0103e7        mov     x7, x1
  80:   f84104e0        ldr     x0, [x7], #16
  84:   f8266805        str     x5, [x0, x6]
  88:   f9400420        ldr     x0, [x1, #8]
  8c:   aa0703e1        mov     x1, x7
  90:   d61f0000        br      x0

  // mov_x_y
 154:   f8408420        ldr     x0, [x1], #8
 158:   aa0403e2        mov     x2, x4
 15c:   d61f0000        br      x0

While stq has two unnecessary moves, the other two are optimal.

Obviously, the function would not be hand-written in a complete
implementation.


r~
