On 7/22/25 10:42, Joelle van Dyne wrote:
> The following patch is from work by Katherine Temkin to add a JITless backend
> for aarch64. The aarch64-tcti target for TCG uses pre-compiled "gadgets",
> snippets of code for every TCG op x operand combination, and at runtime
> "threads" the gadgets together with a jump after each gadget. This results in
> a significant speedup over TCI but is still much slower than JIT.
> This backend is mainly useful for iOS, which does not allow JIT in distributed
> apps. We ported the feature from v5.2 to v10.0 but will rebase it onto master
> if there is interest. Would this backend be useful for mainline QEMU?
That's certainly interesting.
(1) The generated gadget code is excessively large: ~75MB of code, ~18MB of
tables.
$ ld -r -o z.o *gadgets.c.o && size z.o
text data
74556216 18358784
Have you done experiments with the number of available registers to see at what point you
get diminishing returns? The combinatorics on 16 registers is not in your favor.
I would expect you could make do with e.g. 6 writable registers, plus SP and ENV. Note
that SP and ENV may be read in controlled ways, but not written. You would never generate
a gadget that writes to either. They need not be accessed in other ways, such as the
source of a multiply.
Have you done experiments with 2-operand matching constraints?
I.e. "d*=m" instead of "d=n*m".
That will also dramatically affect combinatorics.
Have you experimented with, for the most part, generating *only* 64-bit code? For many
operations, the 32-bit operation can be implemented by the 64-bit instruction simply by
ignoring the high 32 bits of the output. E.g. add, sub, mul, logicals, left-shift. That
would avoid repeating some gadgets for _i32 and _i64.
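To put rough numbers on the combinatorics, for one 3-operand op (with SP and ENV
kept out of the general operand set and confined to dedicated load/store
gadgets, per the above):

    16 regs, d = n op m          : 16 * 16 * 16 = 4096 gadgets
     6 regs, d = n op m          :  6 *  6 *  6 =  216 gadgets
     6 regs, d op= m  (matching) :  6 *  6      =   36 gadgets

and folding _i32 into _i64 where the high bits don't matter removes the
separate 32-bit copy of each of those sets on top of that.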
(2) The use of __attribute__((naked)) means this is clang-specific.
I'm really not sure why you're generating C and using naked, as opposed to simply
generating assembly directly. By generating assembly, you could also emit the correct
unwind info for the gadgets. Or, one set of unwind info for *all* of the gadgets.
I'll also note that it takes up to 5 minutes for clang to compile some of these files, so
using the assembler instead of the compiler would help with that as well.
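As a rough sketch of the kind of generator I mean (the helper, the per-gadget
CFI, and the register convention with x28 as the bytecode pointer and x27 as
scratch are all illustrative, not taken from your patch):

#include <stdio.h>

/* Emit one gadget as assembly text: a label, the op-specific body,
 * then the common "fetch next gadget and branch" tail, bracketed by
 * CFI directives so the unwinder can describe it. */
static void emit_gadget(FILE *out, const char *name, const char *body)
{
    fprintf(out,
            "\t.text\n"
            "%s:\n"
            "\t.cfi_startproc\n"
            "%s"
            "\tldr\tx27, [x28], #8\n"
            "\tbr\tx27\n"
            "\t.cfi_endproc\n",
            name, body);
}

int main(void)
{
    emit_gadget(stdout, "gadget_add_r1_r2_r3", "\tadd\tx1, x2, x3\n");
    return 0;
}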
(3) The tables of gadgets are writable, placed in .data.
With C, it's trivial to place the "const" correctly so that these tables land in
mostly-read-only memory (.data.rel.ro; constant after runtime relocation).
With assembly, it's trivial to generate 32-bit pc-relative offsets instead of raw
pointers, which would allow truly read-only memory (.rodata). That would require
only trivial changes to how the tables are read within tcg_out_*gadget.
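Something along these lines is what I have in mind; the table and helper names
are made up for illustration:

#include <stdint.h>

/* C variant: with the const placed correctly, a table of gadget
 * pointers lands in .data.rel.ro, i.e. written only while relocations
 * are applied and read-only afterwards. */
extern void * const gadget_mov[16][16];

/* Assembly variant: 32-bit pc-relative offsets, each entry emitted as
 * "gadget - .", need no relocations at all and can live in .rodata. */
extern const int32_t gadget_mov_ofs[16][16];

/* The matching change on the read side in tcg_out_*gadget: rebuild the
 * absolute address from the entry's own location plus its value. */
static inline const void *gadget_from_offset(const int32_t *entry)
{
    return (const char *)entry + *entry;
}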
(4) Some of the other changes are unacceptable, such as the ifdefs for movcond.
I'm surprised about that one, really. I suppose you were worried about combinatorics on a
5-operand operation. But there's no reason you need to keep comparisons next to uses.
You are currently generating C*N^3 versions of setcond_i64, setcond_i32, negsetcond_i64,
negsetcond_i32, brcond_i64, and brcond_i32.
But instead you can generate N^2 versions of compare_i32 and compare_i64, and then chain
that to C versions of brcond, C*N versions of setcond and negsetcond, and C*N^2 versions
of movcond. The cpu flags retain the comparison state across the two gadgets.
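To make the flags handoff concrete, here is a minimal sketch of the two halves
in the same naked-gadget style; the register roles (x28 as the bytecode
pointer, x27 as scratch) and the gadget names are invented for illustration:

/* Compare gadget: leave the result in NZCV and chain on.  Neither ldr
 * nor br modifies the flags. */
__attribute__((naked)) void gadget_cmp_i64_r1_r2(void)
{
    asm volatile(
        "cmp   x1, x2\n\t"
        "ldr   x27, [x28], #8\n\t"
        "br    x27");
}

/* brcond gadget for one condition: consume the flags left behind by
 * the compare gadget; the branch-target bytecode address is the next
 * operand in the stream. */
__attribute__((naked)) void gadget_brcond_eq(void)
{
    asm volatile(
        "ldr   x27, [x28], #8\n\t"      /* load branch target */
        "csel  x28, x27, x28, eq\n\t"   /* taken: redirect bytecode ptr */
        "ldr   x27, [x28], #8\n\t"      /* fetch next gadget */
        "br    x27");
}

Only the compare gadgets pay the N^2 cost; each condition then needs a single
brcond gadget, plus the C*N and C*N^2 sets for setcond and movcond.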
(5) docs/devel/tcg-tcti.rst generates a build error,
because the file is not linked from the index.
(6) Do you actually see much speed-up from the vector code?
(7) To be properly reviewed, this would have to be split into many many patches.
(8) True interest in this is probably limited to iOS, and thus an aarch64 host.
I guess I'm ok with that. I would insist that the code remain buildable on any aarch64
host so that we can test it in CI.
I have thought about using gcc's computed goto instead of assembly gadgets, which would
allow this general scheme to work on any host. We would be at the mercy of the quality of
code generation though, and the size of the generated function might break the compiler --
this has happened before for machine-generated interpreters. But it would avoid two
interpreter implementations, which would be a big plus.
Here's an incomplete example:
uintptr_t tcg_qemu_tb_exec(CPUArchState *env, const void *v_tb_ptr)
{
    static void * const table[20]
        __asm__("tci_table") __attribute__((used)) = {
        &&mov_0_1,
        &&mov_0_2,
        &&mov_0_3,
        &&mov_1_0,
        &&mov_1_2,
        &&mov_1_3,
        &&mov_2_0,
        &&mov_2_1,
        &&mov_2_3,
        &&mov_3_0,
        &&mov_3_1,
        &&mov_3_2,
        &&stq_0_sp,
        &&stq_1_sp,
        &&stq_2_sp,
        &&stq_3_sp,
        &&ldq_0_sp,
        &&ldq_1_sp,
        &&ldq_2_sp,
        &&ldq_3_sp,
    };
    const intptr_t *tb_ptr = v_tb_ptr;
    tcg_target_ulong r0 = 0, r1 = 0, r2 = 0, r3 = 0;
    uint64_t stack[(TCG_STATIC_CALL_ARGS_SIZE + TCG_STATIC_FRAME_SIZE)
                   / sizeof(uint64_t)];
    uintptr_t sp = (uintptr_t)stack;
#define NEXT goto *(void *)*tb_ptr++    /* fetch the next label and jump */
    NEXT;

mov_0_1: r0 = r1; NEXT;
mov_0_2: r0 = r2; NEXT;
mov_0_3: r0 = r3; NEXT;
mov_1_0: r1 = r0; NEXT;
mov_1_2: r1 = r2; NEXT;
mov_1_3: r1 = r3; NEXT;
mov_2_0: r2 = r0; NEXT;
mov_2_1: r2 = r1; NEXT;
mov_2_3: r2 = r3; NEXT;
mov_3_0: r3 = r0; NEXT;
mov_3_1: r3 = r1; NEXT;
mov_3_2: r3 = r2; NEXT;
stq_0_sp: *(uint64_t *)(sp + *tb_ptr++) = r0; NEXT;
stq_1_sp: *(uint64_t *)(sp + *tb_ptr++) = r1; NEXT;
stq_2_sp: *(uint64_t *)(sp + *tb_ptr++) = r2; NEXT;
stq_3_sp: *(uint64_t *)(sp + *tb_ptr++) = r3; NEXT;
ldq_0_sp: r0 = *(uint64_t *)(sp + *tb_ptr++); NEXT;
ldq_1_sp: r1 = *(uint64_t *)(sp + *tb_ptr++); NEXT;
ldq_2_sp: r2 = *(uint64_t *)(sp + *tb_ptr++); NEXT;
ldq_3_sp: r3 = *(uint64_t *)(sp + *tb_ptr++); NEXT;
#undef NEXT
    return 0;
}
This compiles to fragments like
// ldq_x_sp
3c: f9400420 ldr x0, [x1, #8]
40: f8410425 ldr x5, [x1], #16
44: f86668a5 ldr x5, [x5, x6]
48: d61f0000 br x0
// stq_x_sp
7c: aa0103e7 mov x7, x1
80: f84104e0 ldr x0, [x7], #16
84: f8266805 str x5, [x0, x6]
88: f9400420 ldr x0, [x1, #8]
8c: aa0703e1 mov x1, x7
90: d61f0000 br x0
// mov_x_y
154: f8408420 ldr x0, [x1], #8
158: aa0403e2 mov x2, x4
15c: d61f0000 br x0
While stq has two unnecessary moves, the other two are optimal.
Obviously, the function would not be hand-written in a complete
implementation.
r~