RE: [RFC] Reducing NEED_CPU_H usage
> -Original Message- > From: Richard Henderson > Sent: Saturday, January 14, 2023 6:53 PM > To: Alessandro Di Federico > Cc: qemu-devel@nongnu.org; Taylor Simpson ; > Alex Bennée > Subject: Re: [RFC] Reducing NEED_CPU_H usage > > On 1/12/23 05:28, Alessandro Di Federico wrote: > > fpu/softfloat.c > > Something I happened to notice while doing other triage: > > https://gitlab.com/qemu-project/qemu/-/issues/1375 > > This is an x86 problem that currently has no solution, but ought to be trivial > with the changes to softfloat required for this project. One other thing we'll need to deal with is command-line options. In a heterogeneous system, options like -cpu and -d will need a way to direct different settings to each target. Thanks, Taylor
Re: [RFC] Reducing NEED_CPU_H usage
On 1/12/23 05:28, Alessandro Di Federico wrote: fpu/softfloat.c Something I happened to notice while doing other triage: https://gitlab.com/qemu-project/qemu/-/issues/1375 This is an x86 problem that currently has no solution, but ought to be trivial with the changes to softfloat required for this project. r~
Re: [RFC] Reducing NEED_CPU_H usage
On 1/12/23 07:28, Alessandro Di Federico wrote: On Tue, 10 Jan 2023 11:56:50 -0800 Richard Henderson wrote: However, at some point we do want to keep some target addresses in the proper size. For instance within the softmmu tlb, where CPUTLBEntry is either 16 or 32 bytes, depending. So that would be an optimization if `HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32`, correct? At present, not an 'optimization' but 'natural fallout of type sizes'. But, yeah. ## `abi_ulong` Similar to `target_ulong`, but with alignment info. Pardon? There's no alignment info in abi_ulong. From include/exec/user/abitypes.h: typedef uint32_t abi_ulong __attribute__((aligned(ABI_LONG_ALIGNMENT))); typedef target_ulong abi_ulong __attribute__((aligned(ABI_LONG_ALIGNMENT))); I thought that was the difference. Thanks for the clarification. Ah, I see what you mean. I *believe* that use of target_ulong could actually be uint64_t, since all TARGET_LONG_BITS == 32 ought to have TARGET_ABI32 set too (which brings us to that first definition with uint32_t.) There is a target alignment difference, for the benefit of e.g. m68k, which has sizeof(long) == 4 and __alignof(long) == 2, which may differ by the host. In any case, all of "exec/abitypes.h" is (or should be) user-only related, so out of scope for this project. This one requires some work within tcg/ to handle two target address sizes simultaneously. It should not be technically difficult to solve, but it does involve adding a few TCG opcodes and adjusting all tcg backends. I'm a bit confused by this, do backends for some reason have expectations about the address size? Wouldn't this be enough? Yes, this affects the tcg backend as well: static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64) { ... datalo = *args++; datahi = (TCG_TARGET_REG_BITS == 32 && is64 ? *args++ : 0); addrlo = *args++; addrhi = (TARGET_LONG_BITS > TCG_TARGET_REG_BITS ? *args++ : 0); oi = *args++; and further code generation changes especially for 64-bit guest pointer comparisons on 32-bit hosts. (There's that nasty combination again.) There's also code generation changes for 32-bit guest pointer comparisons on 64-bit hosts, e.g. static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi, int mem_index, MemOp opc, tcg_insn_unit **label_ptr, int which) { ... TCGType tlbtype = TCG_TYPE_I32; int trexw = 0, hrexw = 0, tlbrexw = 0; ... if (TCG_TARGET_REG_BITS == 64) { if (TARGET_LONG_BITS == 64) { ttype = TCG_TYPE_I64; trexw = P_REXW; } if (TCG_TYPE_PTR == TCG_TYPE_I64) { hrexw = P_REXW; if (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32) { tlbtype = TCG_TYPE_I64; tlbrexw = P_REXW; } } } ... /* cmp 0(r0), r1 */ tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, which); which generates either 'cmpl' or 'cmpq' depending on the guest address size. Anyway, there's no need for this. So, if I got it right, we could just make TCGv become a new opaque type, propagate it down until the spot where we actually need to know its size and then just have some `TCGTemp *tcgv_temp(TCGv v)` function to inspect the actual size? No, leave TCGv as a macro, but we need changes to "tcg/tcg-op*.h", so that all of the above tcg backend stuff *also* gets disconnected from TARGET_LONG_BITS. Makes sense! Before CPUNegativeOffsetState, we had all of those variables within CPUState, and even comments that they should remain at the end of the struct. But those comments were ignored and one day icount_decr was too far away from env for easy addressing on arm host. Luckily there was an assert that fired, but it didn't get caught right away. How comes it wasn't caught immediately? We could have something like: QEMU_BUILD_BUG_MSG(&ArchCPU.env - &ArchCPU.neg.tlb < DESIRED_THRESHOLD) We probably should. But it didn't affect x86, and cross-build testing wasn't sufficient, so we didn't find out later when someone did runtime testing on the affected hosts. Our current goal is to get the following compilation unit to build without NEED_CPU_H: trace/control-target.c gdbstub/gdbstub.c cpu.c disas.c page-vary.c tcg/optimize.c tcg/region.c tcg/tcg.c tcg/tcg-common.c tcg/tcg-op.c tcg/tcg-op-gvec.c tcg/tcg-op-vec.c fpu/softfloat.c accel/accel-common.c accel/tcg/tcg-all.c accel/tcg/cpu-exec-common.c accel/tcg/cpu-exec.c accel/tcg/tb-maint.c accel/tcg/tcg-runtime-gvec.c accel/tcg/tcg-runtime.c accel/tcg/translate-all.c accel/tcg/translator.c accel/tcg/user-exec.c accel/tcg/user-exec-stub.c accel/tcg/plugin-gen.c plugins/loader.c plugins/core.c
Re: [RFC] Reducing NEED_CPU_H usage
On Tue, 10 Jan 2023 11:56:50 -0800 Richard Henderson wrote: > However, at some point we do want to keep some target addresses in > the proper size. For instance within the softmmu tlb, where > CPUTLBEntry is either 16 or 32 bytes, depending. So that would be an optimization if `HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32`, correct? I've heard about dropping 32 bit hosts multiple times here and there. Maybe we could start with dropping oversized guests, which AFAIU are the real offenders for most (all?) of these situations. > > ## `abi_ulong` > > > > Similar to `target_ulong`, but with alignment info. > > Pardon? There's no alignment info in abi_ulong. From include/exec/user/abitypes.h: typedef uint32_t abi_ulong __attribute__((aligned(ABI_LONG_ALIGNMENT))); typedef target_ulong abi_ulong __attribute__((aligned(ABI_LONG_ALIGNMENT))); I thought that was the difference. Thanks for the clarification. > This one requires some work within tcg/ to handle two target address > sizes simultaneously. It should not be technically difficult to > solve, but it does involve adding a few TCG opcodes and adjusting all > tcg backends. I'm a bit confused by this, do backends for some reason have expectations about the address size? Wouldn't this be enough? diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c index 019fab00ccb..175162b8fef 100644 --- a/tcg/tcg-op.c +++ b/tcg/tcg-op.c @@ -29,0 +29,0 @@ @@ -2827,17 +2843,17 @@ static inline MemOp tcg_canonicalize_memop(MemOp op, bool is64, bool st) return op; } -static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv addr, +static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv_dyn addr, MemOp memop, TCGArg idx) { MemOpIdx oi = make_memop_idx(memop, idx); -#if TARGET_LONG_BITS == 32 -tcg_gen_op3i_i32(opc, val, addr, oi); -#else -if (TCG_TARGET_REG_BITS == 32) { -tcg_gen_op4i_i32(opc, val, TCGV_LOW(addr), TCGV_HIGH(addr), oi); -} else { -tcg_gen_op3(opc, tcgv_i32_arg(val), tcgv_i64_arg(addr), oi); -} -#endif +if (addr.size == 32) { +tcg_gen_op3i_i32(opc, val, addr.i32, oi); +} else { +if (TCG_TARGET_REG_BITS == 32) { +tcg_gen_op4i_i32(opc, val, TCGV_LOW(addr.i64), TCGV_HIGH(addr.i64), oi); +} else { +tcg_gen_op3(opc, tcgv_i32_arg(val), tcgv_i64_arg(addr.i64), oi); +} +} } > This forgets that both TCGv_i32 and TCGv_i64 are represented by > TCGTemp, https://i.imgflip.com/777wax.jpg > which contains 'TCGType type' to discriminate. This is not > exposed to target/, but it's there. > > Anyway, there's no need for this. So, if I got it right, we could just make TCGv become a new opaque type, propagate it down until the spot where we actually need to know its size and then just have some `TCGTemp *tcgv_temp(TCGv v)` function to inspect the actual size? Makes sense! > Before CPUNegativeOffsetState, we had all of those variables within > CPUState, and even comments that they should remain at the end of the > struct. But those comments were ignored and one day icount_decr was > too far away from env for easy addressing on arm host. Luckily there > was an assert that fired, but it didn't get caught right away. How comes it wasn't caught immediately? We could have something like: QEMU_BUILD_BUG_MSG(&ArchCPU.env - &ArchCPU.neg.tlb < DESIRED_THRESHOLD) > > # Current status > > > > We currently have a branch where we can build (but not link nor > > run) a `x86_64-linux-user` configuration where `NEED_CPU_H` is > > defined only for translation units in `target/` and `linux-user/`. > > This effort should be exclusive to system mode -- don't consider > *-user at all. Each linux-user target bakes in not just target > architecture parameters, which are complicated enough, but also C > ABI, and kernel API. Moreover, the most common use case for > linux-user is a statically linked binary that operates within a > chroot. Our current goal is to get the following compilation unit to build without NEED_CPU_H: trace/control-target.c gdbstub/gdbstub.c cpu.c disas.c page-vary.c tcg/optimize.c tcg/region.c tcg/tcg.c tcg/tcg-common.c tcg/tcg-op.c tcg/tcg-op-gvec.c tcg/tcg-op-vec.c fpu/softfloat.c accel/accel-common.c accel/tcg/tcg-all.c accel/tcg/cpu-exec-common.c accel/tcg/cpu-exec.c accel/tcg/tb-maint.c accel/tcg/tcg-runtime-gvec.c accel/tcg/tcg-runtime.c accel/tcg/translate-all.c accel/tcg/translator.c accel/tcg/user-exec.c accel/tcg/user-exec-stub.c accel/tcg/plugin-gen.c plugins/loader.c plugins/core.c plugins/api.c They are subset of `arch_srcs` from `meson.build`. Making them target agnostic for *-user too should be easy and could save some build time. But yeah, we'll now focus on system-mode. We'll now try to sort out things from the easiest to the most complex and start send
Re: [RFC] Reducing NEED_CPU_H usage
On 12/28/22 08:16, Alessandro Di Federico wrote: ## `target_ulong` `target_ulong` is `uint32_t` in 32-bit targets and `uint64_t` in 64-bit targets. Problem: This is used in many many places to represent addresses in code that could become target-independent. Proposed solution: we can convert it to: typedef uint64_t target_address; We have other typedefs that are better for this, e.g. vaddr. However, at some point we do want to keep some target addresses in the proper size. For instance within the softmmu tlb, where CPUTLBEntry is either 16 or 32 bytes, depending. (On the other hand, if we drop support for 32-bit hosts, as we keep threatening to do, then CPUTLB is always 32 bytes, and we might as well use vaddr there too. But not until 32-bit hosts are gone.) The problem with this is that, if arithmetic operations are performed on it, we might get undesired results: // Was: char load_data(target_ulong address) char load_data(target_address address) { char *base_address = get_base_address(); // On a 32-bits target this would overflow, it doesn't with // uint64_t target_address real_address = address + 1; return *(base_address + real_address); } Doesn't, or shouldn't matter, because we should never do anything like this in generic code. Note that vaddr ptr = ...; cpu_ldl_le_data(env, ptr + offset) does not have the problem you describe, because any overflow is truncated within the load function. ## `abi_ulong` Similar to `target_ulong`, but with alignment info. Pardon? There's no alignment info in abi_ulong. The difference is that 'target_ulong' is the size of the target register, and 'abi_ulong' is the 'unsigned long' in the target's C ABI. Consider e.g. x32 (x86_64 with ilp32 abi), for which target_ulong is 64-bit but abi_ulong is 32-bit. This only applies to user-only, and should not matter for this project. There *is* an 'abi_ptr' type, which is shared between softmmu and user-only, which might be able to be replaced by 'vaddr'. Or 'typedef vaddr abi_ptr' in softmmu mode. I haven't done a survey on that to be certain. ## `TCGv` `TCGv` is a macro for `TCGv_i32` for 32-bit targets and `TCGv_i64` for 64-bit targets. The idea is that this macro should only be visible to target-specific code, and the macro provides the swizzling/encoding to the concrete type functions. Problem: it makes `tcg-op.h` This is fine. and, more importantly, `tcg-op.c` This one requires some work within tcg/ to handle two target address sizes simultaneously. It should not be technically difficult to solve, but it does involve adding a few TCG opcodes and adjusting all tcg backends. Solution: transform current functions using them into target-specific wrappers that dispatch to target-agnostic functions that accept `TCGv_dyn` instead of `TCGv`: typedef struct { union { TCGv_i32 i32; TCGv_i64 i64; }; bool is_64; } TCGv_dyn; This forgets that both TCGv_i32 and TCGv_i64 are represented by TCGTemp, which contains 'TCGType type' to discriminate. This is not exposed to target/, but it's there. Anyway, there's no need for this. ## `TARGET_` macros These are macros that provide target-specific information. Problem: they need to be abandoned in translation units that need to become target agnostic. Solution: promote them to fields of a `struct`. Current ideas: TARGET_TB_PCREL -> TranslationBlock.pc_rel I'd been thinking a bit on the cpu, but a CF_* bit works well. It gets initialized for each TB from CPUState.tcg_cflags. TBD where we'd initialize the new bit for each cpu... TARGET_PAGE_BITS -> TranslationBlock.page_bits TARGET_PAGE_MASK -> TranslationBlock.page_mask You need to look at how TARGET_PAGE_BITS_VARY works. The memory subsystem needs rewriting if we were to support multiple page sizes. What we can support now is one single global page size, selected at startup. TARGET_PAGE_ALIGN -> CPUArchState.page_align -> DisasContextBase.page_align This remains a trivial macro based on the variable TARGET_PAGE_MASK. TARGET_LONG_BITS -> TCGContext.long_bits Yes. I've been considering how to generalize this to arbitrary address widths, in order to better support ARM top-byte-ignore and RISC-V J extension (pointer masking). But in the short term I'm happy with this number being exactly 64 or 32. TARGET_PAGE_SIZE -> ??? Remains a trivial macro based on the variable TARGET_PAGE_MASK. TCG_OVERSIZED_GUEST -> ??? Goes away if we drop support for 32-bit hosts, or restrict 32-bit hosts to 32-bit guests. I have no other good ideas. TARGET_FMT_lx -> ??? VADDR_PRIx, mostly. May need resolving on a case-by-case basis. CPU_RESOLVING_TYPE -> ??? Would need to be part of the per-target shared library interface. ## `CPUState` `CPUState` is a t
RE: [RFC] Reducing NEED_CPU_H usage
> -Original Message- > From: Alessandro Di Federico > Sent: Wednesday, December 28, 2022 10:16 AM > To: qemu-devel@nongnu.org > Cc: Taylor Simpson ; Philippe Mathieu-Daudé > ; Richard Henderson ; > Alex Bennée > Subject: [RFC] Reducing NEED_CPU_H usage > > Hello everyone, this is a proposal in the libtcg direction, i.e., enabling > building > a single QEMU executable featuring multiple *targets* (i.e., TCG frontends). > This follows what we discussed in recent KVM calls and originally at the KVM > Forum Emulation BoF. > > Note that our commitment to this project is possible thanks to our > collaboration with the Qualcomm Innovation Center! > Thanks for sharing. My $.02 is to change the subject line to reflect the total initiative, not just NEED_CPU_H. That would make more folks notice it in their inbox. > # What's left out? > > Quite a few minor things but.. you tell me! > There's probably *a lot* more stuff when we dig into system mode > emulation, but I'd postpone investigating those issues until we have user > mode in a decent shape. One thing I can think of is the TARGET_ macros . Some examples are in fpu/softfloat-specialize.c.inc and linux-user/user-internals.h. Thanks, Taylor
[RFC] Reducing NEED_CPU_H usage
Hello everyone, this is a proposal in the libtcg direction, i.e., enabling building a single QEMU executable featuring multiple *targets* (i.e., TCG frontends). This follows what we discussed in recent KVM calls and originally at the KVM Forum Emulation BoF. Note that our commitment to this project is possible thanks to our collaboration with the Qualcomm Innovation Center! # Grand goal The following is the core point where translation of input code to host code takes place: TranslationBlock *tb_gen_code(CPUState *cpu, target_ulong pc, target_ulong cs_base, uint32_t flags, int cflags) { // ... gen_intermediate_code(cpu, tb, max_insns, pc, host_pc); // ... gen_code_size = tcg_gen_code(tcg_ctx, tb, pc); // ... } https://github.com/qemu/qemu/blob/4e4fa6c/accel/tcg/translate-all.c#L816 We want: * `gen_intermediate_code` to be target-specific (provided by a shared library) * `tb_gen_code` and `tcg_gen_code` to be target-agnostic Result: the frontend just emits instructions into the `TranslationBlock`, not much more. But first we need to make sure that QEMU is neatly divided into components that can make compile-time assumptions about the target and components that don't, which is what this proposal is about. This will be a large chunk of work affecting many parts of the code and requiring some infrastructural changes. # Strategy The strategy to make this happen is primarily to have an out-of-tree branch where the development of libtcg takes place. The main goal is to be able to limit usage of `NEED_CPU_H` and some other macros only to certain places (mainly `target/*` and `{bsd,linux}-user/*`). Initially we will focus on getting a `--target-list=x86_64-linux-user` configuration to build with reduced usage of `NEED_CPU_H`. This is just for simplicity, we intend to tackle all other configurations step-by-step. At first, there will be lots of hacks to make it build and QEMU won't be operational, but building it will enable us to identify all the core issues. Once all the core issues have been identified, we intend to prepare fixes for each one of them and frequently send them upstream in a form palatable for master. For instance, we expect to have a patch to move from `TARGET_TB_PCREL` to `TranslationBlock.pc_rel`. This way, we hope to be able to tackle the project in small bites and, at a certain point, send a patch upstream where `NEED_CPU_H` is actually disabled in many many places. Given the amount of work, doing everything in a single shot does not seem feasible or beneficial. # Core issues identifed so far In the following we report some of the core issues we identified, why they are problematic and what we intend to do to tackle them. Feedback on each and every point is very welcome! :) ## `NEED_CPU_H` `NEED_CPU_H` is a macro to decide whether a certain translation unit needs `target/$ARCH/cpu.h` or not. In several headers, this macro is used to guard target-specific code. Problem: we need to build translation units out of the `target/` and `*-user/` directories *without* this macro being defined. This also means that many files will be compiled only once and not once per target! Solution: Fix all the issues caused by undefining it (main ones are reported below). Sometimes a solution is to take headers that are currently target-specific headers and introduce `#ifdef NEED_CPU_H` in certain parts so that they become target-agnostic. At a certain point it might make sense to count how many lines are guarded by such macro and consider splitting headers into target-agnostic and target-specific ones. Note: currently, we plan to get rid of all usages of `TARGET_*` macros in code that we want to become target agnostic but *not* of `CONFIG_USER_ONLY` and `CONFIG_SOFTMMU`. At least initially, our goal is just to decouple translation units out of `target/`/`*-user/` from target-specific assumptions. ## `target_ulong` `target_ulong` is `uint32_t` in 32-bit targets and `uint64_t` in 64-bit targets. Problem: This is used in many many places to represent addresses in code that could become target-independent. Proposed solution: we can convert it to: typedef uint64_t target_address; The problem with this is that, if arithmetic operations are performed on it, we might get undesired results: // Was: char load_data(target_ulong address) char load_data(target_address address) { char *base_address = get_base_address(); // On a 32-bits target this would overflow, it doesn't with // uint64_t target_address real_address = address + 1; return *(base_address + real_address); } load_data(UINT32_MAX); This is might be a source of subtle bugs. We're using a rough `clang-query` invocation to see who performs arithmetic on `target_ulong`: $ cd build $ cat compile_commands.json \ | jq -r '.[] | .[