RE: [RFC] Reducing NEED_CPU_H usage

2023-03-02 Thread Taylor Simpson


> -Original Message-
> From: Richard Henderson 
> Sent: Saturday, January 14, 2023 6:53 PM
> To: Alessandro Di Federico 
> Cc: qemu-devel@nongnu.org; Taylor Simpson ;
> Alex Bennée 
> Subject: Re: [RFC] Reducing NEED_CPU_H usage
> 
> On 1/12/23 05:28, Alessandro Di Federico wrote:
> >  fpu/softfloat.c
> 
> Something I happened to notice while doing other triage:
> 
>  https://gitlab.com/qemu-project/qemu/-/issues/1375
> 
> This is an x86 problem that currently has no solution, but ought to be trivial
> with the changes to softfloat required for this project.

One other thing we'll need to deal with is command-line options.  In a 
heterogeneous system, options like -cpu and -d will need a way to direct 
different settings to each target.

Thanks,
Taylor


Re: [RFC] Reducing NEED_CPU_H usage

2023-01-14 Thread Richard Henderson

On 1/12/23 05:28, Alessandro Di Federico wrote:

 fpu/softfloat.c


Something I happened to notice while doing other triage:

https://gitlab.com/qemu-project/qemu/-/issues/1375

This is an x86 problem that currently has no solution, but ought to be trivial with the 
changes to softfloat required for this project.



r~



Re: [RFC] Reducing NEED_CPU_H usage

2023-01-12 Thread Richard Henderson

On 1/12/23 07:28, Alessandro Di Federico wrote:

On Tue, 10 Jan 2023 11:56:50 -0800
Richard Henderson  wrote:


However, at some point we do want to keep some target addresses in
the proper size.  For instance within the softmmu tlb, where
CPUTLBEntry is either 16 or 32 bytes, depending.


So that would be an optimization if `HOST_LONG_BITS == 32 &&
TARGET_LONG_BITS == 32`, correct?


At present, not an 'optimization' but 'natural fallout of type sizes'.
But, yeah.


## `abi_ulong`

Similar to `target_ulong`, but with alignment info.


Pardon?  There's no alignment info in abi_ulong.


 From include/exec/user/abitypes.h:

 typedef uint32_t abi_ulong __attribute__((aligned(ABI_LONG_ALIGNMENT)));
 typedef target_ulong abi_ulong 
__attribute__((aligned(ABI_LONG_ALIGNMENT)));

I thought that was the difference. Thanks for the clarification.


Ah, I see what you mean.  I *believe* that use of target_ulong could actually be uint64_t, 
since all TARGET_LONG_BITS == 32 ought to have TARGET_ABI32 set too (which brings us to 
that first definition with uint32_t.)


There is a target alignment difference, for the benefit of e.g. m68k, which has 
sizeof(long) == 4 and __alignof(long) == 2, which may differ by the host.


In any case, all of "exec/abitypes.h" is (or should be) user-only related, so out of scope 
for this project.



This one requires some work within tcg/ to handle two target address
sizes simultaneously. It should not be technically difficult to
solve, but it does involve adding a few TCG opcodes and adjusting all
tcg backends.


I'm a bit confused by this, do backends for some reason have
expectations about the address size?
Wouldn't this be enough?


Yes, this affects the tcg backend as well:

static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
{
...
datalo = *args++;
datahi = (TCG_TARGET_REG_BITS == 32 && is64 ? *args++ : 0);
addrlo = *args++;
addrhi = (TARGET_LONG_BITS > TCG_TARGET_REG_BITS ? *args++ : 0);
oi = *args++;

and further code generation changes especially for 64-bit guest pointer comparisons on 
32-bit hosts.  (There's that nasty combination again.)


There's also code generation changes for 32-bit guest pointer comparisons on 64-bit hosts, 
e.g.


static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
int mem_index, MemOp opc,
tcg_insn_unit **label_ptr, int which)
{
...
TCGType tlbtype = TCG_TYPE_I32;
int trexw = 0, hrexw = 0, tlbrexw = 0;
...
if (TCG_TARGET_REG_BITS == 64) {
if (TARGET_LONG_BITS == 64) {
ttype = TCG_TYPE_I64;
trexw = P_REXW;
}
if (TCG_TYPE_PTR == TCG_TYPE_I64) {
hrexw = P_REXW;
if (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32) {
tlbtype = TCG_TYPE_I64;
tlbrexw = P_REXW;
}
}
}
...
/* cmp 0(r0), r1 */
tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, which);

which generates either 'cmpl' or 'cmpq' depending on the guest address size.


Anyway, there's no need for this.


So, if I got it right, we could just make TCGv become a new opaque
type, propagate it down until the spot where we actually need to know
its size and then just have some `TCGTemp *tcgv_temp(TCGv v)` function
to inspect the actual size?


No, leave TCGv as a macro, but we need changes to "tcg/tcg-op*.h", so that all of the 
above tcg backend stuff *also* gets disconnected from TARGET_LONG_BITS.



Makes sense!


Before CPUNegativeOffsetState, we had all of those variables within
CPUState, and even comments that they should remain at the end of the
struct.  But those comments were ignored and one day icount_decr was
too far away from env for easy addressing on arm host. Luckily there
was an assert that fired, but it didn't get caught right away.


How comes it wasn't caught immediately?
We could have something like:

 QEMU_BUILD_BUG_MSG(&ArchCPU.env - &ArchCPU.neg.tlb
< DESIRED_THRESHOLD)


We probably should.

But it didn't affect x86, and cross-build testing wasn't sufficient, so we didn't find out 
later when someone did runtime testing on the affected hosts.



Our current goal is to get the following compilation unit to build
without NEED_CPU_H:

 trace/control-target.c
 gdbstub/gdbstub.c
 cpu.c
 disas.c
 page-vary.c
 tcg/optimize.c
 tcg/region.c
 tcg/tcg.c
 tcg/tcg-common.c
 tcg/tcg-op.c
 tcg/tcg-op-gvec.c
 tcg/tcg-op-vec.c
 fpu/softfloat.c
 accel/accel-common.c
 accel/tcg/tcg-all.c
 accel/tcg/cpu-exec-common.c
 accel/tcg/cpu-exec.c
 accel/tcg/tb-maint.c
 accel/tcg/tcg-runtime-gvec.c
 accel/tcg/tcg-runtime.c
 accel/tcg/translate-all.c
 accel/tcg/translator.c
 accel/tcg/user-exec.c
 accel/tcg/user-exec-stub.c
 accel/tcg/plugin-gen.c
 plugins/loader.c
 plugins/core.c
   

Re: [RFC] Reducing NEED_CPU_H usage

2023-01-12 Thread Alessandro Di Federico via
On Tue, 10 Jan 2023 11:56:50 -0800
Richard Henderson  wrote:

> However, at some point we do want to keep some target addresses in
> the proper size.  For instance within the softmmu tlb, where
> CPUTLBEntry is either 16 or 32 bytes, depending.

So that would be an optimization if `HOST_LONG_BITS == 32 &&
TARGET_LONG_BITS == 32`, correct?

I've heard about dropping 32 bit hosts multiple times here and there.
Maybe we could start with dropping oversized guests, which AFAIU are
the real offenders for most (all?) of these situations.

> > ## `abi_ulong`
> > 
> > Similar to `target_ulong`, but with alignment info.  
> 
> Pardon?  There's no alignment info in abi_ulong.

From include/exec/user/abitypes.h:

typedef uint32_t abi_ulong __attribute__((aligned(ABI_LONG_ALIGNMENT)));
typedef target_ulong abi_ulong __attribute__((aligned(ABI_LONG_ALIGNMENT)));

I thought that was the difference. Thanks for the clarification.

> This one requires some work within tcg/ to handle two target address
> sizes simultaneously. It should not be technically difficult to
> solve, but it does involve adding a few TCG opcodes and adjusting all
> tcg backends.

I'm a bit confused by this, do backends for some reason have
expectations about the address size?
Wouldn't this be enough?

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 019fab00ccb..175162b8fef 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -29,0 +29,0 @@
@@ -2827,17 +2843,17 @@ static inline MemOp tcg_canonicalize_memop(MemOp op, 
bool is64, bool st)
 return op;
 }
 
-static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv addr,
+static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv_dyn addr,
  MemOp memop, TCGArg idx)
 {
 MemOpIdx oi = make_memop_idx(memop, idx);
-#if TARGET_LONG_BITS == 32
-tcg_gen_op3i_i32(opc, val, addr, oi);
-#else
-if (TCG_TARGET_REG_BITS == 32) {
-tcg_gen_op4i_i32(opc, val, TCGV_LOW(addr), TCGV_HIGH(addr), oi);
-} else {
-tcg_gen_op3(opc, tcgv_i32_arg(val), tcgv_i64_arg(addr), oi);
-}
-#endif
+if (addr.size == 32) {
+tcg_gen_op3i_i32(opc, val, addr.i32, oi);
+} else {
+if (TCG_TARGET_REG_BITS == 32) {
+tcg_gen_op4i_i32(opc, val, TCGV_LOW(addr.i64), 
TCGV_HIGH(addr.i64), oi);
+} else {
+tcg_gen_op3(opc, tcgv_i32_arg(val), tcgv_i64_arg(addr.i64), oi);
+}
+}
 }


> This forgets that both TCGv_i32 and TCGv_i64 are represented by
> TCGTemp,

https://i.imgflip.com/777wax.jpg

> which contains 'TCGType type' to discriminate.  This is not
> exposed to target/, but it's there.
> 
> Anyway, there's no need for this.

So, if I got it right, we could just make TCGv become a new opaque
type, propagate it down until the spot where we actually need to know
its size and then just have some `TCGTemp *tcgv_temp(TCGv v)` function
to inspect the actual size?

Makes sense!

> Before CPUNegativeOffsetState, we had all of those variables within
> CPUState, and even comments that they should remain at the end of the
> struct.  But those comments were ignored and one day icount_decr was
> too far away from env for easy addressing on arm host. Luckily there
> was an assert that fired, but it didn't get caught right away.

How comes it wasn't caught immediately?
We could have something like:

QEMU_BUILD_BUG_MSG(&ArchCPU.env - &ArchCPU.neg.tlb
   < DESIRED_THRESHOLD)

> > # Current status
> > 
> > We currently have a branch where we can build (but not link nor
> > run) a `x86_64-linux-user` configuration where `NEED_CPU_H` is
> > defined only for translation units in `target/` and `linux-user/`.  
> 
> This effort should be exclusive to system mode -- don't consider
> *-user at all. Each linux-user target bakes in not just target
> architecture parameters, which are complicated enough, but also C
> ABI, and kernel API.  Moreover, the most common use case for
> linux-user is a statically linked binary that operates within a
> chroot.

Our current goal is to get the following compilation unit to build
without NEED_CPU_H:

trace/control-target.c
gdbstub/gdbstub.c
cpu.c
disas.c
page-vary.c
tcg/optimize.c
tcg/region.c
tcg/tcg.c
tcg/tcg-common.c
tcg/tcg-op.c
tcg/tcg-op-gvec.c
tcg/tcg-op-vec.c
fpu/softfloat.c
accel/accel-common.c
accel/tcg/tcg-all.c
accel/tcg/cpu-exec-common.c
accel/tcg/cpu-exec.c
accel/tcg/tb-maint.c
accel/tcg/tcg-runtime-gvec.c
accel/tcg/tcg-runtime.c
accel/tcg/translate-all.c
accel/tcg/translator.c
accel/tcg/user-exec.c
accel/tcg/user-exec-stub.c
accel/tcg/plugin-gen.c
plugins/loader.c
plugins/core.c
plugins/api.c

They are subset of `arch_srcs` from `meson.build`.

Making them target agnostic for *-user too should be easy and could save
some build time.
But yeah, we'll now focus on system-mode.

We'll now try to sort out things from the easiest to the most complex
and start send

Re: [RFC] Reducing NEED_CPU_H usage

2023-01-10 Thread Richard Henderson

On 12/28/22 08:16, Alessandro Di Federico wrote:

## `target_ulong`

`target_ulong` is `uint32_t` in 32-bit targets and `uint64_t` in 64-bit
targets.

Problem: This is used in many many places to represent addresses in
code that could become target-independent.

Proposed solution: we can convert it to:

 typedef uint64_t target_address;


We have other typedefs that are better for this, e.g. vaddr.

However, at some point we do want to keep some target addresses in the proper size.  For 
instance within the softmmu tlb, where CPUTLBEntry is either 16 or 32 bytes, depending.


(On the other hand, if we drop support for 32-bit hosts, as we keep threatening to do, 
then CPUTLB is always 32 bytes, and we might as well use vaddr there too.  But not until 
32-bit hosts are gone.)




The problem with this is that, if arithmetic operations are performed
on it, we might get undesired results:

 // Was: char load_data(target_ulong address)
 char load_data(target_address address) {
   char *base_address = get_base_address();
   // On a 32-bits target this would overflow, it doesn't with
   // uint64_t
   target_address real_address = address + 1;
   return *(base_address + real_address);
 }


Doesn't, or shouldn't matter, because we should never do anything like this in generic 
code.  Note that


vaddr ptr = ...;
cpu_ldl_le_data(env, ptr + offset)

does not have the problem you describe, because any overflow is truncated within the load 
function.




## `abi_ulong`

Similar to `target_ulong`, but with alignment info.


Pardon?  There's no alignment info in abi_ulong.

The difference is that 'target_ulong' is the size of the target register, and 'abi_ulong' 
is the 'unsigned long' in the target's C ABI.  Consider e.g. x32 (x86_64 with ilp32 abi), 
for which target_ulong is 64-bit but abi_ulong is 32-bit.


This only applies to user-only, and should not matter for this project.

There *is* an 'abi_ptr' type, which is shared between softmmu and user-only, which might 
be able to be replaced by 'vaddr'.  Or 'typedef vaddr abi_ptr' in softmmu mode.  I haven't 
done a survey on that to be certain.



## `TCGv`

`TCGv` is a macro for `TCGv_i32` for 32-bit targets and `TCGv_i64`
for 64-bit targets.


The idea is that this macro should only be visible to target-specific code, and the macro 
provides the swizzling/encoding to the concrete type functions.


Problem: it makes `tcg-op.h` 


This is fine.


and, more importantly, `tcg-op.c`


This one requires some work within tcg/ to handle two target address sizes simultaneously. 
 It should not be technically difficult to solve, but it does involve adding a few TCG 
opcodes and adjusting all tcg backends.




Solution: transform current functions using them into target-specific
wrappers that dispatch to target-agnostic functions that accept
`TCGv_dyn` instead of `TCGv`:

 typedef struct {
 union {
 TCGv_i32 i32;
 TCGv_i64 i64;
 };
 bool is_64;
 } TCGv_dyn;


This forgets that both TCGv_i32 and TCGv_i64 are represented by TCGTemp, which contains 
'TCGType type' to discriminate.  This is not exposed to target/, but it's there.


Anyway, there's no need for this.


## `TARGET_` macros

These are macros that provide target-specific information.

Problem: they need to be abandoned in translation units that need to
become target agnostic.

Solution: promote them to fields of a `struct`.
Current ideas:

 TARGET_TB_PCREL -> TranslationBlock.pc_rel


I'd been thinking a bit on the cpu, but a CF_* bit works well.
It gets initialized for each TB from CPUState.tcg_cflags.
TBD where we'd initialize the new bit for each cpu...



 TARGET_PAGE_BITS -> TranslationBlock.page_bits
 TARGET_PAGE_MASK -> TranslationBlock.page_mask


You need to look at how TARGET_PAGE_BITS_VARY works.  The memory subsystem needs rewriting 
if we were to support multiple page sizes.  What we can support now is one single global 
page size, selected at startup.




 TARGET_PAGE_ALIGN -> CPUArchState.page_align
   -> DisasContextBase.page_align


This remains a trivial macro based on the variable TARGET_PAGE_MASK.



 TARGET_LONG_BITS -> TCGContext.long_bits


Yes.

I've been considering how to generalize this to arbitrary address widths, in order to 
better support ARM top-byte-ignore and RISC-V J extension (pointer masking).  But in the 
short term I'm happy with this number being exactly 64 or 32.



 TARGET_PAGE_SIZE -> ???


Remains a trivial macro based on the variable TARGET_PAGE_MASK.


 TCG_OVERSIZED_GUEST -> ???


Goes away if we drop support for 32-bit hosts, or restrict 32-bit hosts to 32-bit guests. 
I have no other good ideas.




 TARGET_FMT_lx -> ???


VADDR_PRIx, mostly.  May need resolving on a case-by-case basis.


 CPU_RESOLVING_TYPE -> ???


Would need to be part of the per-target shared library interface.


## `CPUState`

`CPUState` is a t

RE: [RFC] Reducing NEED_CPU_H usage

2022-12-29 Thread Taylor Simpson



> -Original Message-
> From: Alessandro Di Federico 
> Sent: Wednesday, December 28, 2022 10:16 AM
> To: qemu-devel@nongnu.org
> Cc: Taylor Simpson ; Philippe Mathieu-Daudé
> ; Richard Henderson ;
> Alex Bennée 
> Subject: [RFC] Reducing NEED_CPU_H usage
> 
> Hello everyone, this is a proposal in the libtcg direction, i.e., enabling 
> building
> a single QEMU executable featuring multiple *targets* (i.e., TCG frontends).
> This follows what we discussed in recent KVM calls and originally at the KVM
> Forum Emulation BoF.
> 
> Note that our commitment to this project is possible thanks to our
> collaboration with the Qualcomm Innovation Center!
> 

Thanks for sharing.  My $.02 is to change the subject line to reflect the total 
initiative, not just NEED_CPU_H.  That would make more folks notice it in their 
inbox.

> # What's left out?
> 
> Quite a few minor things but.. you tell me!
> There's probably *a lot* more stuff when we dig into system mode
> emulation, but I'd postpone investigating those issues until we have user
> mode in a decent shape.

One thing I can think of is the TARGET_ macros .  Some examples are in 
fpu/softfloat-specialize.c.inc and linux-user/user-internals.h.

Thanks,
Taylor




[RFC] Reducing NEED_CPU_H usage

2022-12-28 Thread Alessandro Di Federico via
Hello everyone, this is a proposal in the libtcg direction, i.e.,
enabling building a single QEMU executable featuring multiple *targets*
(i.e., TCG frontends).
This follows what we discussed in recent KVM calls and originally at
the KVM Forum Emulation BoF.

Note that our commitment to this project is possible thanks to our
collaboration with the Qualcomm Innovation Center!

# Grand goal

The following is the core point where translation of input code to host
code takes place:

TranslationBlock *tb_gen_code(CPUState *cpu,
  target_ulong pc, target_ulong cs_base,
  uint32_t flags, int cflags)
{
// ...

gen_intermediate_code(cpu, tb, max_insns, pc, host_pc);

// ...

gen_code_size = tcg_gen_code(tcg_ctx, tb, pc);

// ...
}

https://github.com/qemu/qemu/blob/4e4fa6c/accel/tcg/translate-all.c#L816

We want:

* `gen_intermediate_code` to be target-specific (provided by a shared
  library)
* `tb_gen_code` and `tcg_gen_code` to be target-agnostic

Result: the frontend just emits instructions into the
`TranslationBlock`, not much more.

But first we need to make sure that QEMU is neatly divided into
components that can make compile-time assumptions about the target and
components that don't, which is what this proposal is about.

This will be a large chunk of work affecting many parts of the code and
requiring some infrastructural changes.

# Strategy

The strategy to make this happen is primarily to have an out-of-tree
branch where the development of libtcg takes place. The main goal is to
be able to limit usage of `NEED_CPU_H` and some other macros only to
certain places (mainly `target/*` and `{bsd,linux}-user/*`).

Initially we will focus on getting a `--target-list=x86_64-linux-user`
configuration to build with reduced usage of `NEED_CPU_H`. This is just
for simplicity, we intend to tackle all other configurations
step-by-step.

At first, there will be lots of hacks to make it build and QEMU won't
be operational, but building it will enable us to identify all the core
issues.

Once all the core issues have been identified, we intend to prepare
fixes for each one of them and frequently send them upstream in a form
palatable for master. For instance, we expect to have a patch to move
from `TARGET_TB_PCREL` to `TranslationBlock.pc_rel`.

This way, we hope to be able to tackle the project in small bites and,
at a certain point, send a patch upstream where `NEED_CPU_H` is
actually disabled in many many places. Given the amount of work, doing
everything in a single shot does not seem feasible or beneficial.

# Core issues identifed so far

In the following we report some of the core issues we identified, why
they are problematic and what we intend to do to tackle them. Feedback
on each and every point is very welcome! :)

## `NEED_CPU_H`

`NEED_CPU_H` is a macro to decide whether a certain translation unit
needs `target/$ARCH/cpu.h` or not. In several headers, this macro is
used to guard target-specific code.

Problem: we need to build translation units out of the `target/`
and `*-user/` directories *without* this macro being defined.
This also means that many files will be compiled only once and not once
per target!

Solution: Fix all the issues caused by undefining it (main ones
are reported below). Sometimes a solution is to take headers that are
currently target-specific headers and introduce `#ifdef NEED_CPU_H` in
certain parts so that they become target-agnostic.

At a certain point it might make sense to count how many lines are
guarded by such macro and consider splitting headers into
target-agnostic and target-specific ones.

Note: currently, we plan to get rid of all usages of `TARGET_*`
macros in code that we want to become target agnostic but *not* of
`CONFIG_USER_ONLY` and `CONFIG_SOFTMMU`. At least initially, our goal
is just to decouple translation units out of `target/`/`*-user/` from
target-specific assumptions.

## `target_ulong`

`target_ulong` is `uint32_t` in 32-bit targets and `uint64_t` in 64-bit
targets.

Problem: This is used in many many places to represent addresses in
code that could become target-independent.

Proposed solution: we can convert it to:

typedef uint64_t target_address;

The problem with this is that, if arithmetic operations are performed
on it, we might get undesired results:

// Was: char load_data(target_ulong address)
char load_data(target_address address) {
  char *base_address = get_base_address();
  // On a 32-bits target this would overflow, it doesn't with
  // uint64_t
  target_address real_address = address + 1;
  return *(base_address + real_address);
}

load_data(UINT32_MAX);

This is might be a source of subtle bugs.
We're using a rough `clang-query` invocation to see who performs
arithmetic on `target_ulong`:

$ cd build
$ cat compile_commands.json \
| jq -r '.[] | .[