Re: [RFC PATCH 0/8] Convert avocado tests to normal Python unittests

2024-07-17 Thread Paolo Bonzini
On Wed, Jul 17, 2024 at 9:32 AM Thomas Huth  wrote:
> > There is the pycotap dependency to produce TAP from pytest, but that's
> > probably something small enough to be vendored.
>
> The next version depends only on pycotap now. I'm installing it into the
> venv that we also set up when running the old avocado tests. Not sure
> whether that's the best solution, though. Would it be OK to have it in
> python/wheels/ instead?

Yes, and you can probably move it to the same group as meson; it's
ridiculously small (5k) and it is indeed used _with_ meson. Then you
don't need any change in "configure".

Paolo




[PULL 04/20] disas: Fix build against Capstone v6

2024-07-16 Thread Paolo Bonzini
From: Gustavo Romero 

Capstone v6 made major changes, such as renaming for AArch64, which
broke programs using the old headers, like QEMU. However, Capstone v6
provides the CAPSTONE_AARCH64_COMPAT_HEADER compatibility definition,
which allows building against v6 with the old definitions, so use it
to fix the QEMU build.

We can lift that definition and switch to the new naming once our
supported distros have Capstone v6 in place.
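For reference, here is a sketch of what the compatibility define restores
(illustrative usage, not QEMU's actual disassembler code; the pre-v6
CS_ARCH_ARM64 spelling is what v6 renamed to CS_ARCH_AARCH64):

    /* Illustrative only: with the compat define in place before the
     * include, v6 headers keep accepting the pre-v6 identifiers. */
    #define CAPSTONE_AARCH64_COMPAT_HEADER
    #include <capstone.h>

    static int open_aarch64_disas(csh *handle)
    {
        /* CS_ARCH_ARM64 is the old name; v6 calls it CS_ARCH_AARCH64 */
        return cs_open(CS_ARCH_ARM64, CS_MODE_LITTLE_ENDIAN, handle);
    }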

Signed-off-by: Gustavo Romero 
Suggested-by: Peter Maydell 
Reviewed-by: Richard Henderson 
Link: 
https://lore.kernel.org/r/20240715213943.1210355-1-gustavo.rom...@linaro.org
Signed-off-by: Paolo Bonzini 
---
 include/disas/capstone.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/disas/capstone.h b/include/disas/capstone.h
index e29068dd977..a11985151d3 100644
--- a/include/disas/capstone.h
+++ b/include/disas/capstone.h
@@ -3,6 +3,7 @@
 
 #ifdef CONFIG_CAPSTONE
 
+#define CAPSTONE_AARCH64_COMPAT_HEADER
 #include 
 
 #else
-- 
2.45.2




[PULL 08/20] docs: Update description of 'user=username' for '-run-with'

2024-07-16 Thread Paolo Bonzini
From: Boqiao Fu 

The description of '-runas' and '-run-with' didn't explain that QEMU
will use setuid/setgid to implement the option, so the user might get
confused if using 'elevateprivileges=deny' as well.

Since '-runas' is going to be deprecated and replaced by '-run-with'
in the upcoming QEMU 9.1, add the note there.
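For readers unfamiliar with the mechanism, a minimal sketch of the usual
POSIX privilege-drop sequence (conceptual, not QEMU's actual
implementation); the ordering matters because setuid() gives up the
right to change groups afterwards:

    /* sketch: what 'user=username' conceptually does; needs <pwd.h>,
     * <grp.h>, <unistd.h>, <stdlib.h>. The 'user=uid:gid' form skips
     * initgroups() and uses only the single gid. */
    struct passwd *pw = getpwnam(username);
    if (pw == NULL ||
        initgroups(pw->pw_name, pw->pw_gid) < 0 || /* supplementary groups */
        setgid(pw->pw_gid) < 0 ||                  /* primary group */
        setuid(pw->pw_uid) < 0) {                  /* finally drop the uid */
        exit(EXIT_FAILURE);
    }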

Signed-off-by: Boqiao Fu 
Link: 
https://lore.kernel.org/r/cafrhj6j9umk+hmzl+w+ke1yorcolpgbpuvvdku55sdxyigx...@mail.gmail.com
Signed-off-by: Paolo Bonzini 
---
 qemu-options.hx | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/qemu-options.hx b/qemu-options.hx
index ad6521ef5e7..694fa37f284 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -5024,8 +5024,11 @@ SRST
 in combination with -runas.
 
 ``user=username`` or ``user=uid:gid`` can be used to drop root privileges
-by switching to the specified user (via username) or user and group
-(via uid:gid) immediately before starting guest execution.
+before starting guest execution. QEMU will use the ``setuid`` and ``setgid``
+system calls to switch to the specified identity.  Note that the
+``user=username`` syntax will also apply the full set of supplementary
+groups for the user, whereas the ``user=uid:gid`` will use only the
+``gid`` group.
 ERST
 #endif
 
-- 
2.45.2




[PULL 02/20] Revert "qemu-char: do not operate on sources from finalize callbacks"

2024-07-16 Thread Paolo Bonzini
From: Sergey Dyasli 

This reverts commit 2b316774f60291f57ca9ecb6a9f0712c532cae34.

After 038b4217884c ("Revert "chardev: use a child source for qio input
source"") we've been observing the "iwp->src == NULL" assertion
triggering periodically during the initial capabilities querying by
libvirtd. One of the possible backtraces:

Thread 1 (Thread 0x7f16cd4f0700 (LWP 43858)):
0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
1  0x7f16c6c21e65 in __GI_abort () at abort.c:79
2  0x7f16c6c21d39 in __assert_fail_base  at assert.c:92
3  0x7f16c6c46e86 in __GI___assert_fail 
(assertion=assertion@entry=0x562e9bcdaadd "iwp->src == NULL", 
file=file@entry=0x562e9bcdaac8 "../chardev/char-io.c", line=line@entry=99, 
function=function@entry=0x562e9bcdab10 <__PRETTY_FUNCTION__.20549> 
"io_watch_poll_finalize") at assert.c:101
4  0x562e9ba20c2c in io_watch_poll_finalize (source=<optimized out>) at 
../chardev/char-io.c:99
5  io_watch_poll_finalize (source=<optimized out>) at ../chardev/char-io.c:88
6  0x7f16c904aae0 in g_source_unref_internal () from /lib64/libglib-2.0.so.0
7  0x7f16c904baf9 in g_source_destroy_internal () from 
/lib64/libglib-2.0.so.0
8  0x562e9ba20db0 in io_remove_watch_poll (source=0x562e9d6720b0) at 
../chardev/char-io.c:147
9  remove_fd_in_watch (chr=chr@entry=0x562e9d5f3800) at ../chardev/char-io.c:153
10 0x562e9ba23ffb in update_ioc_handlers (s=0x562e9d5f3800) at 
../chardev/char-socket.c:592
11 0x562e9ba2072f in qemu_chr_fe_set_handlers_full at 
../chardev/char-fe.c:279
12 0x562e9ba207a9 in qemu_chr_fe_set_handlers at ../chardev/char-fe.c:304
13 0x562e9ba2ca75 in monitor_qmp_setup_handlers_bh (opaque=0x562e9d4c2c60) 
at ../monitor/qmp.c:509
14 0x562e9bb6222e in aio_bh_poll (ctx=ctx@entry=0x562e9d4c2f20) at 
../util/async.c:216
15 0x562e9bb4de0a in aio_poll (ctx=0x562e9d4c2f20, 
blocking=blocking@entry=true) at ../util/aio-posix.c:722
16 0x562e9b99dfaa in iothread_run (opaque=0x562e9d4c26f0) at 
../iothread.c:63
17 0x562e9bb505a4 in qemu_thread_start (args=0x562e9d4c7ea0) at 
../util/qemu-thread-posix.c:543
18 0x7f16c70081ca in start_thread (arg=<optimized out>) at 
pthread_create.c:479
19 0x7f16c6c398d3 in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

io_remove_watch_poll(), which makes sure that iwp->src is NULL, calls
g_source_destroy() which finds that iwp->src is not NULL in the finalize
callback. This can only happen if another thread has managed to trigger
io_watch_poll_prepare() callback in the meantime.

Move iwp->src destruction back to the finalize callback to prevent the
described race, and also remove the stale comment. The deadlock glib bug
was fixed back in 2010 by b35820285668 ("gmain: move finalization of
GSource outside of context lock").

Suggested-by: Paolo Bonzini 
Signed-off-by: Sergey Dyasli 
Link: 
https://lore.kernel.org/r/20240712092659.216206-1-sergey.dya...@nutanix.com
Signed-off-by: Paolo Bonzini 
---
 chardev/char-io.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/chardev/char-io.c b/chardev/char-io.c
index dab77b112e3..3be17b51ca5 100644
--- a/chardev/char-io.c
+++ b/chardev/char-io.c
@@ -87,16 +87,12 @@ static gboolean io_watch_poll_dispatch(GSource *source, 
GSourceFunc callback,
 
 static void io_watch_poll_finalize(GSource *source)
 {
-/*
- * Due to a glib bug, removing the last reference to a source
- * inside a finalize callback causes recursive locking (and a
- * deadlock).  This is not a problem inside other callbacks,
- * including dispatch callbacks, so we call io_remove_watch_poll
- * to remove this source.  At this point, iwp->src must
- * be NULL, or we would leak it.
- */
 IOWatchPoll *iwp = io_watch_poll_from_source(source);
-assert(iwp->src == NULL);
+if (iwp->src) {
+g_source_destroy(iwp->src);
+g_source_unref(iwp->src);
+iwp->src = NULL;
+}
 }
 
 static GSourceFuncs io_watch_poll_funcs = {
@@ -139,11 +135,6 @@ static void io_remove_watch_poll(GSource *source)
 IOWatchPoll *iwp;
 
 iwp = io_watch_poll_from_source(source);
-if (iwp->src) {
-g_source_destroy(iwp->src);
-g_source_unref(iwp->src);
-iwp->src = NULL;
-}
 g_source_destroy(&iwp->parent);
 }
 
-- 
2.45.2




[PULL 16/20] target/i386/tcg: Introduce x86_mmu_index_{kernel_,}pl

2024-07-16 Thread Paolo Bonzini
From: Richard Henderson 

Disconnect mmu index computation from the current pl
as stored in env->hflags.

Signed-off-by: Richard Henderson 
Link: 
https://lore.kernel.org/r/20240617161210.4639-2-richard.hender...@linaro.org
Signed-off-by: Paolo Bonzini 
---
 target/i386/cpu.h | 11 ++-
 target/i386/cpu.c | 27 ---
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index c43ac01c794..1e121acef54 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2445,15 +2445,8 @@ static inline bool is_mmu_index_32(int mmu_index)
 return mmu_index & 1;
 }
 
-static inline int cpu_mmu_index_kernel(CPUX86State *env)
-{
-int mmu_index_32 = (env->hflags & HF_LMA_MASK) ? 0 : 1;
-int mmu_index_base =
-!(env->hflags & HF_SMAP_MASK) ? MMU_KNOSMAP64_IDX :
-((env->hflags & HF_CPL_MASK) < 3 && (env->eflags & AC_MASK)) ? MMU_KNOSMAP64_IDX : MMU_KSMAP64_IDX;
-
-return mmu_index_base + mmu_index_32;
-}
+int x86_mmu_index_pl(CPUX86State *env, unsigned pl);
+int cpu_mmu_index_kernel(CPUX86State *env);
 
 #define CC_DST  (env->cc_dst)
 #define CC_SRC  (env->cc_src)
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index c05765eeafc..4688d140c2d 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -8122,18 +8122,39 @@ static bool x86_cpu_has_work(CPUState *cs)
 return x86_cpu_pending_interrupt(cs, cs->interrupt_request) != 0;
 }
 
-static int x86_cpu_mmu_index(CPUState *cs, bool ifetch)
+int x86_mmu_index_pl(CPUX86State *env, unsigned pl)
 {
-CPUX86State *env = cpu_env(cs);
 int mmu_index_32 = (env->hflags & HF_CS64_MASK) ? 0 : 1;
 int mmu_index_base =
-(env->hflags & HF_CPL_MASK) == 3 ? MMU_USER64_IDX :
+pl == 3 ? MMU_USER64_IDX :
 !(env->hflags & HF_SMAP_MASK) ? MMU_KNOSMAP64_IDX :
 (env->eflags & AC_MASK) ? MMU_KNOSMAP64_IDX : MMU_KSMAP64_IDX;
 
 return mmu_index_base + mmu_index_32;
 }
 
+static int x86_cpu_mmu_index(CPUState *cs, bool ifetch)
+{
+CPUX86State *env = cpu_env(cs);
+return x86_mmu_index_pl(env, env->hflags & HF_CPL_MASK);
+}
+
+static int x86_mmu_index_kernel_pl(CPUX86State *env, unsigned pl)
+{
+int mmu_index_32 = (env->hflags & HF_LMA_MASK) ? 0 : 1;
+int mmu_index_base =
+!(env->hflags & HF_SMAP_MASK) ? MMU_KNOSMAP64_IDX :
+(pl < 3 && (env->eflags & AC_MASK)
+ ? MMU_KNOSMAP64_IDX : MMU_KSMAP64_IDX);
+
+return mmu_index_base + mmu_index_32;
+}
+
+int cpu_mmu_index_kernel(CPUX86State *env)
+{
+return x86_mmu_index_kernel_pl(env, env->hflags & HF_CPL_MASK);
+}
+
 static void x86_disas_set_info(CPUState *cs, disassemble_info *info)
 {
 X86CPU *cpu = X86_CPU(cs);
-- 
2.45.2




[PULL 10/20] hpet: fix HPET_TN_SETVAL for high 32-bits of the comparator

2024-07-16 Thread Paolo Bonzini
Commit 3787324101b ("hpet: Fix emulation of HPET_TN_SETVAL (Jan Kiszka)",
2009-04-17) applied the fix only to the low 32-bits of the comparator, but
it should be done for the high bits as well.  Otherwise, the high 32-bits
of the comparator cannot be written and they remain fixed to 0xffffffff.
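To make the two halves concrete, here is a sketch of how a 64-bit
comparator value is assembled from two 32-bit register writes (variable
names are illustrative; the masks match the patch below):

    /* sketch: a 64-bit comparator updated by two 32-bit MMIO writes */
    uint64_t cmp = timer_cmp;                                  /* current value */
    cmp = (cmp & 0xffffffff00000000ULL) | low_val;             /* offset +0 */
    cmp = (cmp & 0x00000000ffffffffULL) | (uint64_t)high_val << 32; /* +4 */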

Co-developed-by: TaiseiIto 
Signed-off-by: TaiseiIto 
Signed-off-by: Paolo Bonzini 
---
 hw/timer/hpet.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index ad881448bf3..4cb5393c0b5 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -554,6 +554,10 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
 timer->period =
 (timer->period & 0xffffffff00000000ULL) | new_val;
 }
+/*
+ * FIXME: on a 64-bit write, HPET_TN_SETVAL should apply to the
+ * high bits part as well.
+ */
 timer->config &= ~HPET_TN_SETVAL;
 if (hpet_enabled(s)) {
 hpet_set_timer(timer);
@@ -564,7 +568,8 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
 if (!timer_is_periodic(timer)
 || (timer->config & HPET_TN_SETVAL)) {
 timer->cmp = (timer->cmp & 0xffffffffULL) | new_val << 32;
-} else {
+}
+if (timer_is_periodic(timer)) {
 /*
  * FIXME: Clamp period to reasonable min value?
  * Clamp period to reasonable max value
@@ -572,12 +577,12 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
 new_val = MIN(new_val, ~0u >> 1);
 timer->period =
 (timer->period & 0xffffffffULL) | new_val << 32;
-}
-timer->config &= ~HPET_TN_SETVAL;
-if (hpet_enabled(s)) {
-hpet_set_timer(timer);
-}
-break;
+}
+timer->config &= ~HPET_TN_SETVAL;
+if (hpet_enabled(s)) {
+hpet_set_timer(timer);
+}
+break;
 case HPET_TN_ROUTE:
 timer->fsb = (timer->fsb & 0xffffffff00000000ULL) | new_val;
 break;
-- 
2.45.2




[PULL 15/20] target/i386/tcg: Reorg push/pop within seg_helper.c

2024-07-16 Thread Paolo Bonzini
From: Richard Henderson 

Interrupts and call gates should use accesses with the DPL as
the privilege level.  While computing the applicable MMU index
is easy, the harder thing is how to plumb it in the code.

One possibility could be to add a single argument to the PUSH* macros
for the privilege level, but this is repetitive and risks confusion
between the involved privilege levels.

Another possibility is to pass both CPL and DPL, and adjusting both
PUSH* and POP* to use specific privilege levels (instead of using
cpu_{ld,st}*_data). This makes the code more symmetric.

However, a more complicated but much nicer approach is to use a structure
to contain the stack parameters, env, unwind return address, and rewrite
the macros into functions.  The struct provides an easy home for the MMU
index as well.
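To illustrate the pattern, a caller fills the struct once and every
push/pop takes it by pointer (a sketch mirroring the diff below;
error_code stands in for whatever value is being pushed):

    StackAccess sa;

    sa.env = env;
    sa.ra = GETPC();
    sa.ss_base = env->segs[R_SS].base;
    sa.sp = env->regs[R_ESP];
    sa.sp_mask = get_sp_mask(env->segs[R_SS].flags);

    pushl(&sa, error_code);         /* decrements sa.sp internally */
    SET_ESP(sa.sp, sa.sp_mask);     /* commit the new stack pointer */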

Signed-off-by: Richard Henderson 
Link: 
https://lore.kernel.org/r/20240617161210.4639-4-richard.hender...@linaro.org
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 481 +++
 1 file changed, 259 insertions(+), 222 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index b985382d704..b6902ca3fba 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -28,6 +28,68 @@
 #include "helper-tcg.h"
 #include "seg_helper.h"
 
+#ifdef TARGET_X86_64
+#define SET_ESP(val, sp_mask)   \
+do {\
+if ((sp_mask) == 0xffff) {  \
+env->regs[R_ESP] = (env->regs[R_ESP] & ~0xffff) |   \
+((val) & 0xffff);   \
+} else if ((sp_mask) == 0xffffffffLL) { \
+env->regs[R_ESP] = (uint32_t)(val); \
+} else {\
+env->regs[R_ESP] = (val);   \
+}   \
+} while (0)
+#else
+#define SET_ESP(val, sp_mask)   \
+do {\
+env->regs[R_ESP] = (env->regs[R_ESP] & ~(sp_mask)) |\
+((val) & (sp_mask));\
+} while (0)
+#endif
+
+/* XXX: use mmu_index to have proper DPL support */
+typedef struct StackAccess
+{
+CPUX86State *env;
+uintptr_t ra;
+target_ulong ss_base;
+target_ulong sp;
+target_ulong sp_mask;
+} StackAccess;
+
+static void pushw(StackAccess *sa, uint16_t val)
+{
+sa->sp -= 2;
+cpu_stw_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->ra);
+}
+
+static void pushl(StackAccess *sa, uint32_t val)
+{
+sa->sp -= 4;
+cpu_stl_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->ra);
+}
+
+static uint16_t popw(StackAccess *sa)
+{
+uint16_t ret = cpu_lduw_data_ra(sa->env,
+sa->ss_base + (sa->sp & sa->sp_mask),
+sa->ra);
+sa->sp += 2;
+return ret;
+}
+
+static uint32_t popl(StackAccess *sa)
+{
+uint32_t ret = cpu_ldl_data_ra(sa->env,
+   sa->ss_base + (sa->sp & sa->sp_mask),
+   sa->ra);
+sa->sp += 4;
+return ret;
+}
+
 int get_pg_mode(CPUX86State *env)
 {
 int pg_mode = 0;
@@ -559,68 +621,19 @@ int exception_has_error_code(int intno)
 return 0;
 }
 
-#ifdef TARGET_X86_64
-#define SET_ESP(val, sp_mask)   \
-do {\
-if ((sp_mask) == 0xffff) {  \
-env->regs[R_ESP] = (env->regs[R_ESP] & ~0xffff) |   \
-((val) & 0xffff);   \
-} else if ((sp_mask) == 0xffffffffLL) { \
-env->regs[R_ESP] = (uint32_t)(val); \
-} else {\
-env->regs[R_ESP] = (val);   \
-}   \
-} while (0)
-#else
-#define SET_ESP(val, sp_mask)   \
-do {\
-env->regs[R_ESP] = (env->regs[R_ESP] & ~(sp_mask)) |\
-((val) & (sp_mask));\
-} while (0)
-#endif
-
-/* XXX: add a is_user flag to have proper security support */
-#define PUSHW_RA(ssp, sp, sp_mask, val, ra)  \
-{\
-sp -= 2

[PULL 03/20] cpu: Free queued CPU work

2024-07-16 Thread Paolo Bonzini
From: Akihiko Odaki 

Running qemu-system-aarch64 -M virt -nographic and terminating it will
result in a LeakSanitizer error due to remaining queued CPU work, so
free it.

Signed-off-by: Akihiko Odaki 
Link: https://lore.kernel.org/r/20240714-cpu-v1-1-19c2f8de2...@daynix.com
Signed-off-by: Paolo Bonzini 
---
 include/hw/core/cpu.h |  6 ++
 cpu-common.c  | 11 +++
 hw/core/cpu-common.c  |  1 +
 3 files changed, 18 insertions(+)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index a2c8536943f..8e6466c1dda 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -1000,6 +1000,12 @@ void cpu_resume(CPUState *cpu);
  */
 void cpu_remove_sync(CPUState *cpu);
 
+/**
+ * free_queued_cpu_work() - free all items on CPU work queue
+ * @cpu: The CPU which work queue to free.
+ */
+void free_queued_cpu_work(CPUState *cpu);
+
 /**
  * process_queued_cpu_work() - process all items on CPU work queue
  * @cpu: The CPU which work queue to process.
diff --git a/cpu-common.c b/cpu-common.c
index ce78273af59..7ae136f98ca 100644
--- a/cpu-common.c
+++ b/cpu-common.c
@@ -331,6 +331,17 @@ void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func 
func,
 queue_work_on_cpu(cpu, wi);
 }
 
+void free_queued_cpu_work(CPUState *cpu)
+{
+while (!QSIMPLEQ_EMPTY(&cpu->work_list)) {
+struct qemu_work_item *wi = QSIMPLEQ_FIRST(&cpu->work_list);
+QSIMPLEQ_REMOVE_HEAD(&cpu->work_list, node);
+if (wi->free) {
+g_free(wi);
+}
+}
+}
+
 void process_queued_cpu_work(CPUState *cpu)
 {
 struct qemu_work_item *wi;
diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index b19e1fdacf2..d2e3e4570ab 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -281,6 +281,7 @@ static void cpu_common_finalize(Object *obj)
 g_free(cpu->plugin_state);
 }
 #endif
+free_queued_cpu_work(cpu);
 g_array_free(cpu->gdb_regs, TRUE);
 qemu_lockcnt_destroy(&cpu->in_ioctl_lock);
 qemu_mutex_destroy(&cpu->work_mutex);
-- 
2.45.2




[PULL 13/20] target/i386/tcg: Allow IRET from user mode to user mode with SMAP

2024-07-16 Thread Paolo Bonzini
This fixes a bug wherein i386/tcg assumed an interrupt return using
the IRET instruction was always returning from kernel mode to either
kernel mode or user mode. This assumption is violated when IRET is used
as a clever way to restore thread state, as for example in the dotnet
runtime. There, IRET returns from user mode to user mode.

The fix is that stack accesses from IRET and RETF, as well as accesses
to the parameters in a call gate, are now normal data accesses using the
current CPL.  The bug manifested itself as a page fault in the guest Linux
kernel due to SMAP preventing the access.

This bug appears to have been in QEMU since the beginning.

Analyzed-by: Robert R. Henry 
Co-developed-by: Robert R. Henry 
Signed-off-by: Robert R. Henry 
Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 19d6b41a589..224e73e9ed0 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -594,13 +594,13 @@ int exception_has_error_code(int intno)
 
 #define POPW_RA(ssp, sp, sp_mask, val, ra)   \
 {\
-val = cpu_lduw_kernel_ra(env, (ssp) + (sp & (sp_mask)), ra); \
+val = cpu_lduw_data_ra(env, (ssp) + (sp & (sp_mask)), ra); \
 sp += 2; \
 }
 
 #define POPL_RA(ssp, sp, sp_mask, val, ra)  \
 {   \
-val = (uint32_t)cpu_ldl_kernel_ra(env, (ssp) + (sp & (sp_mask)), ra); \
+val = (uint32_t)cpu_ldl_data_ra(env, (ssp) + (sp & (sp_mask)), ra); \
 sp += 4;\
 }
 
@@ -847,7 +847,7 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 
 #define POPQ_RA(sp, val, ra)\
 {   \
-val = cpu_ldq_kernel_ra(env, sp, ra);   \
+val = cpu_ldq_data_ra(env, sp, ra);   \
 sp += 8;\
 }
 
@@ -1797,18 +1797,18 @@ void helper_lcall_protected(CPUX86State *env, int 
new_cs, target_ulong new_eip,
 PUSHL_RA(ssp, sp, sp_mask, env->segs[R_SS].selector, GETPC());
 PUSHL_RA(ssp, sp, sp_mask, env->regs[R_ESP], GETPC());
 for (i = param_count - 1; i >= 0; i--) {
-val = cpu_ldl_kernel_ra(env, old_ssp +
-((env->regs[R_ESP] + i * 4) &
- old_sp_mask), GETPC());
+val = cpu_ldl_data_ra(env,
+  old_ssp + ((env->regs[R_ESP] + i * 4) & old_sp_mask),
+  GETPC());
 PUSHL_RA(ssp, sp, sp_mask, val, GETPC());
 }
 } else {
 PUSHW_RA(ssp, sp, sp_mask, env->segs[R_SS].selector, GETPC());
 PUSHW_RA(ssp, sp, sp_mask, env->regs[R_ESP], GETPC());
 for (i = param_count - 1; i >= 0; i--) {
-val = cpu_lduw_kernel_ra(env, old_ssp +
- ((env->regs[R_ESP] + i * 2) &
-  old_sp_mask), GETPC());
+val = cpu_lduw_data_ra(env,
+   old_ssp + ((env->regs[R_ESP] + i * 2) & old_sp_mask),
+   GETPC());
 PUSHW_RA(ssp, sp, sp_mask, val, GETPC());
 }
 }
-- 
2.45.2




[PULL 07/20] qemu/timer: Add host ticks function for LoongArch

2024-07-16 Thread Paolo Bonzini
From: Song Gao 

Signed-off-by: Song Gao 
Link: https://lore.kernel.org/r/20240716031500.4193498-1-gaos...@loongson.cn
Signed-off-by: Paolo Bonzini 
---
 include/qemu/timer.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/qemu/timer.h b/include/qemu/timer.h
index 5ce83c79112..fa56ec9481d 100644
--- a/include/qemu/timer.h
+++ b/include/qemu/timer.h
@@ -1016,6 +1016,15 @@ static inline int64_t cpu_get_host_ticks(void)
 return val;
 }
 
+#elif defined(__loongarch64)
+static inline int64_t cpu_get_host_ticks(void)
+{
+uint64_t val;
+
+asm volatile("rdtime.d %0, $zero" : "=r"(val));
+return val;
+}
+
 #else
 /* The host CPU doesn't have an easily accessible cycle counter.
Just return a monotonically increasing value.  This will be
-- 
2.45.2




[PULL 11/20] target/i386/tcg: fix POP to memory in long mode

2024-07-16 Thread Paolo Bonzini
In long mode, POP to memory will write a full 64-bit value.  However,
the call to gen_writeback() in gen_POP will use MO_32 because the
decoding table is incorrect.

The bug was latent until commit aea49fbb01a ("target/i386: use gen_writeback()
within gen_POP()", 2024-06-08), and then became visible because gen_op_st_v
now receives op->ot instead of the "ot" returned by gen_pop_T0.
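Concretely, the affected encoding is POP with a memory operand (opcode
0x8F /0, the group1A entry changed below); a sketch of the
guest-visible symptom:

    /* In 64-bit mode,
     *     pop QWORD PTR [rbx]
     * pops 8 bytes off the stack and must store all 64 bits; with the
     * wrong table entry the writeback used MO_32, so only the low 32
     * bits reached memory.
     */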

Analyzed-by: Clément Chigot 
Fixes: 5e9e21bcc4d ("target/i386: move 60-BF opcodes to new decoder", 
2024-05-07)
Tested-by: Clément Chigot 
Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/decode-new.c.inc | 2 +-
 target/i386/tcg/emit.c.inc   | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 0d846c32c22..d2da1d396d5 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -1717,7 +1717,7 @@ static const X86OpEntry opcodes_root[256] = {
 [0x8C] = X86_OP_ENTRYwr(MOV, E,v, S,w, op0_Mw),
 [0x8D] = X86_OP_ENTRYwr(LEA, G,v, M,v, nolea),
 [0x8E] = X86_OP_ENTRYwr(MOV, S,w, E,w),
-[0x8F] = X86_OP_GROUPw(group1A, E,v),
+[0x8F] = X86_OP_GROUPw(group1A, E,d64),
 
 [0x98] = X86_OP_ENTRY1(CBW,0,v), /* rAX */
 [0x99] = X86_OP_ENTRYwr(CWD,   2,v, 0,v), /* rDX, rAX */
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index fc7477833bc..016dce81464 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -2788,6 +2788,7 @@ static void gen_POP(DisasContext *s, X86DecodedInsn 
*decode)
 X86DecodedOp *op = &decode->op[0];
 MemOp ot = gen_pop_T0(s);
 
+assert(ot >= op->ot);
 if (op->has_ea || op->unit == X86_OP_SEG) {
 /* NOTE: order is important for MMU exceptions */
 gen_writeback(s, decode, 0, s->T0);
-- 
2.45.2




[PULL 12/20] target/i386/tcg: Remove SEG_ADDL

2024-07-16 Thread Paolo Bonzini
From: Richard Henderson 

This truncation is now handled by MMU_*32_IDX.  The introduction of
MMU_*32_IDX in fact applied correct 32-bit wraparound to 16-bit accesses
with a high segment base (e.g.  big real mode or vm86 mode), which did
not use SEG_ADDL.

Signed-off-by: Richard Henderson 
Link: 
https://lore.kernel.org/r/20240617161210.4639-3-richard.hender...@linaro.org
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index aee3d19f29b..19d6b41a589 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -579,10 +579,6 @@ int exception_has_error_code(int intno)
 } while (0)
 #endif
 
-/* in 64-bit machines, this can overflow. So this segment addition macro
- * can be used to trim the value to 32-bit whenever needed */
-#define SEG_ADDL(ssp, sp, sp_mask) ((uint32_t)((ssp) + (sp & (sp_mask))))
-
 /* XXX: add a is_user flag to have proper security support */
 #define PUSHW_RA(ssp, sp, sp_mask, val, ra)  \
 {\
@@ -593,7 +589,7 @@ int exception_has_error_code(int intno)
 #define PUSHL_RA(ssp, sp, sp_mask, val, ra) \
 {   \
 sp -= 4;\
-cpu_stl_kernel_ra(env, SEG_ADDL(ssp, sp, sp_mask), (uint32_t)(val), ra); \
+cpu_stl_kernel_ra(env, (ssp) + (sp & (sp_mask)), (val), ra); \
 }
 
 #define POPW_RA(ssp, sp, sp_mask, val, ra)   \
@@ -604,7 +600,7 @@ int exception_has_error_code(int intno)
 
 #define POPL_RA(ssp, sp, sp_mask, val, ra)  \
 {   \
-val = (uint32_t)cpu_ldl_kernel_ra(env, SEG_ADDL(ssp, sp, sp_mask), ra); \
+val = (uint32_t)cpu_ldl_kernel_ra(env, (ssp) + (sp & (sp_mask)), ra); \
 sp += 4;\
 }
 
-- 
2.45.2




[PULL 18/20] target/i386/tcg: check for correct busy state before switching to a new task

2024-07-16 Thread Paolo Bonzini
This step is listed in the Intel manual: "Checks that the new task is available
(call, jump, exception, or interrupt) or busy (IRET return)".

The AMD manual lists the same operation under the "Preventing recursion"
paragraph of "12.3.4 Nesting Tasks", though it is not clear if the processor
checks the busy bit in the IRET case.
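For reference, the busy state lives in the descriptor type field; a
sketch of the encoding (values from the Intel SDM; DESC_TSS_BUSY_MASK
is the QEMU name for the bit tested in the hunk below):

    /* System-segment descriptor types for a TSS (Intel SDM, vol. 3):
     *   16-bit TSS: 0x1 = available, 0x3 = busy
     *   32-bit TSS: 0x9 = available, 0xB = busy
     * i.e. the busy flag is bit 1 of the type field, which the new
     * check reads straight out of e2 via DESC_TSS_BUSY_MASK.
     */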

Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 8a6d92b3583..a5d5ce61f59 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -369,6 +369,11 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
 old_tss_limit_max = 43;
 }
 
+/* new TSS must be busy iff the source is an IRET instruction  */
+if (!!(e2 & DESC_TSS_BUSY_MASK) != (source == SWITCH_TSS_IRET)) {
+raise_exception_err_ra(env, EXCP0A_TSS, tss_selector & 0xfffc, retaddr);
+}
+
 /* read all the registers from the new TSS */
 if (type & 8) {
 /* 32 bit */
-- 
2.45.2




[PULL 20/20] target/i386/tcg: save current task state before loading new one

2024-07-16 Thread Paolo Bonzini
This is how the steps are ordered in the manual.  EFLAGS.NT is
overwritten after the fact in the saved image.

Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 85 +++-
 1 file changed, 45 insertions(+), 40 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 36d2f089cae..aac092a356b 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -389,6 +389,42 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
 access_prepare_mmu(&new, env, tss_base, tss_limit,
MMU_DATA_LOAD, mmu_index, retaddr);
 
+/* save the current state in the old TSS */
+old_eflags = cpu_compute_eflags(env);
+if (old_type & 8) {
+/* 32 bit */
+access_stl(&old, env->tr.base + 0x20, next_eip);
+access_stl(&old, env->tr.base + 0x24, old_eflags);
+access_stl(&old, env->tr.base + (0x28 + 0 * 4), env->regs[R_EAX]);
+access_stl(&old, env->tr.base + (0x28 + 1 * 4), env->regs[R_ECX]);
+access_stl(&old, env->tr.base + (0x28 + 2 * 4), env->regs[R_EDX]);
+access_stl(&old, env->tr.base + (0x28 + 3 * 4), env->regs[R_EBX]);
+access_stl(&old, env->tr.base + (0x28 + 4 * 4), env->regs[R_ESP]);
+access_stl(&old, env->tr.base + (0x28 + 5 * 4), env->regs[R_EBP]);
+access_stl(&old, env->tr.base + (0x28 + 6 * 4), env->regs[R_ESI]);
+access_stl(&old, env->tr.base + (0x28 + 7 * 4), env->regs[R_EDI]);
+for (i = 0; i < 6; i++) {
+access_stw(&old, env->tr.base + (0x48 + i * 4),
+   env->segs[i].selector);
+}
+} else {
+/* 16 bit */
+access_stw(&old, env->tr.base + 0x0e, next_eip);
+access_stw(&old, env->tr.base + 0x10, old_eflags);
+access_stw(&old, env->tr.base + (0x12 + 0 * 2), env->regs[R_EAX]);
+access_stw(&old, env->tr.base + (0x12 + 1 * 2), env->regs[R_ECX]);
+access_stw(&old, env->tr.base + (0x12 + 2 * 2), env->regs[R_EDX]);
+access_stw(&old, env->tr.base + (0x12 + 3 * 2), env->regs[R_EBX]);
+access_stw(&old, env->tr.base + (0x12 + 4 * 2), env->regs[R_ESP]);
+access_stw(&old, env->tr.base + (0x12 + 5 * 2), env->regs[R_EBP]);
+access_stw(&old, env->tr.base + (0x12 + 6 * 2), env->regs[R_ESI]);
+access_stw(&old, env->tr.base + (0x12 + 7 * 2), env->regs[R_EDI]);
+for (i = 0; i < 4; i++) {
+access_stw(&old, env->tr.base + (0x22 + i * 2),
+   env->segs[i].selector);
+}
+}
+
 /* read all the registers from the new TSS */
 if (type & 8) {
 /* 32 bit */
@@ -428,49 +464,16 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
 if (source == SWITCH_TSS_JMP || source == SWITCH_TSS_IRET) {
 tss_set_busy(env, env->tr.selector, 0, retaddr);
 }
-old_eflags = cpu_compute_eflags(env);
+
 if (source == SWITCH_TSS_IRET) {
 old_eflags &= ~NT_MASK;
+if (old_type & 8) {
+access_stl(&old, env->tr.base + 0x24, old_eflags);
+} else {
+access_stw(&old, env->tr.base + 0x10, old_eflags);
+}
 }
 
-/* save the current state in the old TSS */
-if (old_type & 8) {
-/* 32 bit */
-access_stl(&old, env->tr.base + 0x20, next_eip);
-access_stl(&old, env->tr.base + 0x24, old_eflags);
-access_stl(&old, env->tr.base + (0x28 + 0 * 4), env->regs[R_EAX]);
-access_stl(&old, env->tr.base + (0x28 + 1 * 4), env->regs[R_ECX]);
-access_stl(&old, env->tr.base + (0x28 + 2 * 4), env->regs[R_EDX]);
-access_stl(&old, env->tr.base + (0x28 + 3 * 4), env->regs[R_EBX]);
-access_stl(&old, env->tr.base + (0x28 + 4 * 4), env->regs[R_ESP]);
-access_stl(&old, env->tr.base + (0x28 + 5 * 4), env->regs[R_EBP]);
-access_stl(&old, env->tr.base + (0x28 + 6 * 4), env->regs[R_ESI]);
-access_stl(&old, env->tr.base + (0x28 + 7 * 4), env->regs[R_EDI]);
-for (i = 0; i < 6; i++) {
-access_stw(&old, env->tr.base + (0x48 + i * 4),
-   env->segs[i].selector);
-}
-} else {
-/* 16 bit */
-access_stw(&old, env->tr.base + 0x0e, next_eip);
-access_stw(&old, env->tr.base + 0x10, old_eflags);
-access_stw(&old, env->tr.base + (0x12 + 0 * 2), env->regs[R_EAX]);
-access_stw(&old, env->tr.base + (0x12 + 1 * 2), env->regs[R_ECX]);
-access_stw(&old, env->tr.base + (0x12 + 2 * 2), env->regs[R_EDX]);
-access_stw(&old, env->tr.base + (0x12 + 3 * 2), env->regs[R_EBX]);
-access_stw(&old, env->tr.base + (0x12 + 4 * 2), env->regs[R_ESP]);
-access_stw(&old, env->tr.base + (0x12 + 5 * 2), env->regs[R_EBP]);
-access_stw(&old, env->tr.base + (0x12 + 6 * 2), env->regs[R_ESI]);

[PULL 01/20] i386/sev: Don't allow automatic fallback to legacy KVM_SEV*_INIT

2024-07-16 Thread Paolo Bonzini
From: Michael Roth 

Currently if the 'legacy-vm-type' property of the sev-guest object is
'on', QEMU will attempt to use the newer KVM_SEV_INIT2 kernel
interface in conjunction with the newer KVM_X86_SEV_VM and
KVM_X86_SEV_ES_VM KVM VM types.

This can lead to measurement changes if, for instance, an SEV guest was
created on a host that originally had an older kernel that didn't
support KVM_SEV_INIT2, but is booted on the same host later on after the
host kernel was upgraded.

Instead, if legacy-vm-type is 'off', QEMU should fail if the
KVM_SEV_INIT2 interface is not provided by the current host kernel.
Modify the fallback handling accordingly.

In the future, VMSA features and other flags might be added to QEMU
which will require legacy-vm-type to be 'off' because they will rely
on the newer KVM_SEV_INIT2 interface. It may be difficult to convey to
users what values of legacy-vm-type are compatible with which
features/options, so as part of this rework, switch legacy-vm-type to a
tri-state OnOffAuto option. 'auto' in this case will automatically
switch to using the newer KVM_SEV_INIT2, but only if it is required to
make use of new VMSA features or other options only available via
KVM_SEV_INIT2.

Defining 'auto' in this way avoids inadvertently breaking
compatibility with older kernels, since it would only be used in cases
where users opt into newer features that are only available via
KVM_SEV_INIT2 and newer kernels. It also provides better default
behavior than the previous legacy-vm-type=off handling, so make it
the default for 9.1+ machine types.

Cc: Daniel P. Berrangé 
Cc: Paolo Bonzini 
cc: k...@vger.kernel.org
Signed-off-by: Michael Roth 
Reviewed-by: Daniel P. Berrangé 
Link: https://lore.kernel.org/r/20240710041005.83720-1-michael.r...@amd.com
Signed-off-by: Paolo Bonzini 
---
 qapi/qom.json | 18 ++
 hw/i386/pc.c  |  2 +-
 target/i386/sev.c | 87 +++
 3 files changed, 84 insertions(+), 23 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 8e75a419c30..7eccd2e14e2 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -924,12 +924,16 @@
 # @handle: SEV firmware handle (default: 0)
 #
 # @legacy-vm-type: Use legacy KVM_SEV_INIT KVM interface for creating the VM.
-#  The newer KVM_SEV_INIT2 interface syncs additional vCPU
-#  state when initializing the VMSA structures, which will
-#  result in a different guest measurement. Set this to
-#  maintain compatibility with older QEMU or kernel versions
-#  that rely on legacy KVM_SEV_INIT behavior.
-#  (default: false) (since 9.1)
+#  The newer KVM_SEV_INIT2 interface, from Linux >= 6.10, syncs
+#  additional vCPU state when initializing the VMSA structures,
+#  which will result in a different guest measurement. Set
+#  this to 'on' to force compatibility with older QEMU or kernel
+#  versions that rely on legacy KVM_SEV_INIT behavior. 'auto'
+#  will behave identically to 'on', but will automatically
+#  switch to using KVM_SEV_INIT2 if the user specifies any
+#  additional options that require it. If set to 'off', QEMU
+#  will require KVM_SEV_INIT2 unconditionally.
+#  (default: off) (since 9.1)
 #
 # Since: 2.12
 ##
@@ -939,7 +943,7 @@
 '*session-file': 'str',
 '*policy': 'uint32',
 '*handle': 'uint32',
-'*legacy-vm-type': 'bool' } }
+'*legacy-vm-type': 'OnOffAuto' } }
 
 ##
 # @SevSnpGuestProperties:
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 4fbc5774708..c74931d577a 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -83,7 +83,7 @@ GlobalProperty pc_compat_9_0[] = {
 { TYPE_X86_CPU, "x-amd-topoext-features-only", "false" },
 { TYPE_X86_CPU, "x-l1-cache-per-thread", "false" },
 { TYPE_X86_CPU, "guest-phys-bits", "0" },
-{ "sev-guest", "legacy-vm-type", "true" },
+{ "sev-guest", "legacy-vm-type", "on" },
 { TYPE_X86_CPU, "legacy-multi-node", "on" },
 };
 const size_t pc_compat_9_0_len = G_N_ELEMENTS(pc_compat_9_0);
diff --git a/target/i386/sev.c b/target/i386/sev.c
index 2ba5f517228..a1157c0ede6 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -144,7 +144,7 @@ struct SevGuestState {
 uint32_t policy;
 char *dh_cert_file;
 char *session_file;
-bool legacy_vm_type;
+OnOffAuto legacy_vm_type;
 };
 
 struct SevSnpGuestState {
@@ -1369,6 +1369,17 @@ sev_vm_state_change(void *opaque, bool running, RunState 
state)
 }
 }
 
+/*
+ * This helper is to examine sev-guest properties and determine if any options
+ * have been set which rely on the newer KVM_SEV_INIT2 

[PULL 17/20] target/i386/tcg: Compute MMU index once

2024-07-16 Thread Paolo Bonzini
Add the MMU index to the StackAccess struct, so that it can be cached
or (in the next patch) computed from information that is not in
CPUX86State.

Co-developed-by: Richard Henderson 
Signed-off-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 35 ++-
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index b6902ca3fba..8a6d92b3583 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -56,36 +56,37 @@ typedef struct StackAccess
 target_ulong ss_base;
 target_ulong sp;
 target_ulong sp_mask;
+int mmu_index;
 } StackAccess;
 
 static void pushw(StackAccess *sa, uint16_t val)
 {
 sa->sp -= 2;
-cpu_stw_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
-  val, sa->ra);
+cpu_stw_mmuidx_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->mmu_index, sa->ra);
 }
 
 static void pushl(StackAccess *sa, uint32_t val)
 {
 sa->sp -= 4;
-cpu_stl_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
-  val, sa->ra);
+cpu_stl_mmuidx_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->mmu_index, sa->ra);
 }
 
 static uint16_t popw(StackAccess *sa)
 {
-uint16_t ret = cpu_lduw_data_ra(sa->env,
-sa->ss_base + (sa->sp & sa->sp_mask),
-sa->ra);
+uint16_t ret = cpu_lduw_mmuidx_ra(sa->env,
+  sa->ss_base + (sa->sp & sa->sp_mask),
+  sa->mmu_index, sa->ra);
 sa->sp += 2;
 return ret;
 }
 
 static uint32_t popl(StackAccess *sa)
 {
-uint32_t ret = cpu_ldl_data_ra(sa->env,
-   sa->ss_base + (sa->sp & sa->sp_mask),
-   sa->ra);
+uint32_t ret = cpu_ldl_mmuidx_ra(sa->env,
+ sa->ss_base + (sa->sp & sa->sp_mask),
+ sa->mmu_index, sa->ra);
 sa->sp += 4;
 return ret;
 }
@@ -677,6 +678,7 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 
 sa.env = env;
 sa.ra = 0;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (type == 5) {
 /* task gate */
@@ -858,12 +860,12 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 static void pushq(StackAccess *sa, uint64_t val)
 {
 sa->sp -= 8;
-cpu_stq_kernel_ra(sa->env, sa->sp, val, sa->ra);
+cpu_stq_mmuidx_ra(sa->env, sa->sp, val, sa->mmu_index, sa->ra);
 }
 
 static uint64_t popq(StackAccess *sa)
 {
-uint64_t ret = cpu_ldq_data_ra(sa->env, sa->sp, sa->ra);
+uint64_t ret = cpu_ldq_mmuidx_ra(sa->env, sa->sp, sa->mmu_index, sa->ra);
 sa->sp += 8;
 return ret;
 }
@@ -982,6 +984,7 @@ static void do_interrupt64(CPUX86State *env, int intno, int 
is_int,
 
 sa.env = env;
 sa.ra = 0;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 sa.sp_mask = -1;
 sa.ss_base = 0;
 if (dpl < cpl || ist != 0) {
@@ -1116,6 +1119,7 @@ static void do_interrupt_real(CPUX86State *env, int 
intno, int is_int,
 sa.sp = env->regs[R_ESP];
 sa.sp_mask = 0xffff;
 sa.ss_base = env->segs[R_SS].base;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (is_int) {
 old_eip = next_eip;
@@ -1579,6 +1583,7 @@ void helper_lcall_real(CPUX86State *env, uint32_t new_cs, 
uint32_t new_eip,
 sa.sp = env->regs[R_ESP];
 sa.sp_mask = get_sp_mask(env->segs[R_SS].flags);
 sa.ss_base = env->segs[R_SS].base;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (shift) {
 pushl(&sa, env->segs[R_CS].selector);
@@ -1618,6 +1623,7 @@ void helper_lcall_protected(CPUX86State *env, int new_cs, 
target_ulong new_eip,
 
 sa.env = env;
 sa.ra = GETPC();
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (e2 & DESC_S_MASK) {
 if (!(e2 & DESC_CS_MASK)) {
@@ -1905,6 +1911,7 @@ void helper_iret_real(CPUX86State *env, int shift)
 
 sa.env = env;
 sa.ra = GETPC();
+sa.mmu_index = x86_mmu_index_pl(env, 0);
 sa.sp_mask = 0xffff; /* XXX: use SS segment size? */
 sa.sp = env->regs[R_ESP];
 sa.ss_base = env->segs[R_SS].base;
@@ -1976,8 +1983,11 @@ static inline void helper_ret_protected(CPUX86State 
*env, int shift,
 target_ulong new_eip, new_esp;
 StackAccess sa;
 
+cpl = env->hflags & HF_CPL_MASK;
+
 sa.env = env;
 sa.ra = retaddr;
+sa.mmu_index = x86_mmu_index_pl(env, cpl);
 
 #ifdef TARGET_X86_64
 if (shift == 2) {
@@ -2032,7 +2042,6 @@

[PULL 06/20] scsi: fix regression and honor bootindex again for legacy drives

2024-07-16 Thread Paolo Bonzini
From: Fiona Ebner 

Commit 3089637461 ("scsi: Don't ignore most usb-storage properties")
removed the call to object_property_set_int() and thus the 'set'
method for the bootindex property was also not called anymore. Here
that method is device_set_bootindex() (as configured by
scsi_dev_instance_init() -> device_add_bootindex_property()) which as
a side effect registers the device via add_boot_device_path().

As reported by a downstream user [0], the bootindex property did not
have the desired effect anymore for legacy drives. Fix the regression
by explicitly calling the add_boot_device_path() function after
checking that the bootindex is not yet used (to avoid
add_boot_device_path() calling exit()).

[0]: https://forum.proxmox.com/threads/149772/post-679433

Cc: qemu-sta...@nongnu.org
Fixes: 3089637461 ("scsi: Don't ignore most usb-storage properties")
Suggested-by: Kevin Wolf 
Signed-off-by: Fiona Ebner 
Link: https://lore.kernel.org/r/20240710152529.1737407-1-f.eb...@proxmox.com
Signed-off-by: Paolo Bonzini 
---
 hw/scsi/scsi-bus.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 9e40b0c920b..53eff5dd3d6 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -384,6 +384,7 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, 
BlockBackend *blk,
 DeviceState *dev;
 SCSIDevice *s;
 DriveInfo *dinfo;
+Error *local_err = NULL;
 
 if (blk_is_sg(blk)) {
 driver = "scsi-generic";
@@ -403,6 +404,14 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, 
BlockBackend *blk,
 s = SCSI_DEVICE(dev);
 s->conf = *conf;
 
+check_boot_index(conf->bootindex, &local_err);
+if (local_err) {
+object_unparent(OBJECT(dev));
+error_propagate(errp, local_err);
+return NULL;
+}
+add_boot_device_path(conf->bootindex, dev, NULL);
+
 qdev_prop_set_uint32(dev, "scsi-id", unit);
 if (object_property_find(OBJECT(dev), "removable")) {
 qdev_prop_set_bit(dev, "removable", removable);
-- 
2.45.2




[PULL 14/20] target/i386/tcg: use PUSHL/PUSHW for error code

2024-07-16 Thread Paolo Bonzini
Do not pre-decrement esp, let the macros subtract the appropriate
operand size.

Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 224e73e9ed0..b985382d704 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -670,22 +670,20 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 }
 shift = switch_tss(env, intno * 8, e1, e2, SWITCH_TSS_CALL, old_eip);
 if (has_error_code) {
-uint32_t mask;
-
 /* push the error code */
 if (env->segs[R_SS].flags & DESC_B_MASK) {
-mask = 0xffffffff;
+sp_mask = 0xffffffff;
 } else {
-mask = 0xffff;
+sp_mask = 0xffff;
 }
-esp = (env->regs[R_ESP] - (2 << shift)) & mask;
-ssp = env->segs[R_SS].base + esp;
+esp = env->regs[R_ESP];
+ssp = env->segs[R_SS].base;
 if (shift) {
-cpu_stl_kernel(env, ssp, error_code);
+PUSHL(ssp, esp, sp_mask, error_code);
 } else {
-cpu_stw_kernel(env, ssp, error_code);
+PUSHW(ssp, esp, sp_mask, error_code);
 }
-SET_ESP(esp, mask);
+SET_ESP(esp, sp_mask);
 }
 return;
 }
-- 
2.45.2




[PULL 09/20] hpet: fix clamping of period

2024-07-16 Thread Paolo Bonzini
When writing a new period, the clamping should use a maximum value
rather than a bit mask.  Also, when writing the high bits new_val
is shifted right by 32, so the maximum allowed period should also
be shifted right.
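A small worked example of why clamping and masking differ (MIN as used
in the patch; values illustrative):

    uint32_t limit = ~0u >> 1;        /* 0x7fffffff, the maximum period */
    uint32_t v = 0x90000000;          /* an out-of-range value */
    uint32_t masked  = v & limit;     /* 0x10000000: silently wraps low */
    uint32_t clamped = MIN(v, limit); /* 0x7fffffff: saturates, as intended */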

Signed-off-by: Paolo Bonzini 
---
 hw/timer/hpet.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index 01efe4885db..ad881448bf3 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -548,7 +548,9 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
  * FIXME: Clamp period to reasonable min value?
  * Clamp period to reasonable max value
  */
-new_val &= (timer->config & HPET_TN_32BIT ? ~0u : ~0ull) >> 1;
+if (timer->config & HPET_TN_32BIT) {
+new_val = MIN(new_val, ~0u >> 1);
+}
 timer->period =
 (timer->period & 0xULL) | new_val;
 }
@@ -567,7 +569,7 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
  * FIXME: Clamp period to reasonable min value?
  * Clamp period to reasonable max value
  */
-new_val &= (timer->config & HPET_TN_32BIT ? ~0u : ~0ull) >> 1;
+new_val = MIN(new_val, ~0u >> 1);
 timer->period =
 (timer->period & 0xULL) | new_val << 32;
 }
-- 
2.45.2




[PULL 19/20] target/i386/tcg: use X86Access for TSS access

2024-07-16 Thread Paolo Bonzini
This takes care of probing the vaddr range in advance, and is also faster
because it avoids repeated TLB lookups.  It also matches the Intel manual
better, as it says "Checks that the current (old) TSS, new TSS, and all
segment descriptors used in the task switch are paged into system memory";
note however that it's not clear how the processor checks for segment
descriptors, and this check is not included in the AMD manual.

Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 110 ++-
 1 file changed, 58 insertions(+), 52 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index a5d5ce61f59..36d2f089cae 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -27,6 +27,7 @@
 #include "exec/log.h"
 #include "helper-tcg.h"
 #include "seg_helper.h"
+#include "access.h"
 
 #ifdef TARGET_X86_64
 #define SET_ESP(val, sp_mask)   \
@@ -313,14 +314,15 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
  uint32_t e1, uint32_t e2, int source,
  uint32_t next_eip, uintptr_t retaddr)
 {
-int tss_limit, tss_limit_max, type, old_tss_limit_max, old_type, v1, v2, i;
+int tss_limit, tss_limit_max, type, old_tss_limit_max, old_type, i;
 target_ulong tss_base;
 uint32_t new_regs[8], new_segs[6];
 uint32_t new_eflags, new_eip, new_cr3, new_ldt, new_trap;
 uint32_t old_eflags, eflags_mask;
 SegmentCache *dt;
-int index;
+int mmu_index, index;
 target_ulong ptr;
+X86Access old, new;
 
 type = (e2 >> DESC_TYPE_SHIFT) & 0xf;
 LOG_PCALL("switch_tss: sel=0x%04x type=%d src=%d\n", tss_selector, type,
@@ -374,35 +376,45 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
 raise_exception_err_ra(env, EXCP0A_TSS, tss_selector & 0xfffc, 
retaddr);
 }
 
+/* X86Access avoids memory exceptions during the task switch */
+mmu_index = cpu_mmu_index_kernel(env);
+access_prepare_mmu(&old, env, env->tr.base, old_tss_limit_max,
+   MMU_DATA_STORE, mmu_index, retaddr);
+
+if (source == SWITCH_TSS_CALL) {
+/* Probe for future write of parent task */
+probe_access(env, tss_base, 2, MMU_DATA_STORE,
+ mmu_index, retaddr);
+}
+access_prepare_mmu(&new, env, tss_base, tss_limit,
+   MMU_DATA_LOAD, mmu_index, retaddr);
+
 /* read all the registers from the new TSS */
 if (type & 8) {
 /* 32 bit */
-new_cr3 = cpu_ldl_kernel_ra(env, tss_base + 0x1c, retaddr);
-new_eip = cpu_ldl_kernel_ra(env, tss_base + 0x20, retaddr);
-new_eflags = cpu_ldl_kernel_ra(env, tss_base + 0x24, retaddr);
+new_cr3 = access_ldl(&new, tss_base + 0x1c);
+new_eip = access_ldl(&new, tss_base + 0x20);
+new_eflags = access_ldl(&new, tss_base + 0x24);
 for (i = 0; i < 8; i++) {
-new_regs[i] = cpu_ldl_kernel_ra(env, tss_base + (0x28 + i * 4),
-retaddr);
+new_regs[i] = access_ldl(&new, tss_base + (0x28 + i * 4));
 }
 for (i = 0; i < 6; i++) {
-new_segs[i] = cpu_lduw_kernel_ra(env, tss_base + (0x48 + i * 4),
- retaddr);
+new_segs[i] = access_ldw(&new, tss_base + (0x48 + i * 4));
 }
-new_ldt = cpu_lduw_kernel_ra(env, tss_base + 0x60, retaddr);
-new_trap = cpu_ldl_kernel_ra(env, tss_base + 0x64, retaddr);
+new_ldt = access_ldw(&new, tss_base + 0x60);
+new_trap = access_ldl(&new, tss_base + 0x64);
 } else {
 /* 16 bit */
 new_cr3 = 0;
-new_eip = cpu_lduw_kernel_ra(env, tss_base + 0x0e, retaddr);
-new_eflags = cpu_lduw_kernel_ra(env, tss_base + 0x10, retaddr);
+new_eip = access_ldw(&new, tss_base + 0x0e);
+new_eflags = access_ldw(&new, tss_base + 0x10);
 for (i = 0; i < 8; i++) {
-new_regs[i] = cpu_lduw_kernel_ra(env, tss_base + (0x12 + i * 2), retaddr);
+new_regs[i] = access_ldw(&new, tss_base + (0x12 + i * 2));
 }
 for (i = 0; i < 4; i++) {
-new_segs[i] = cpu_lduw_kernel_ra(env, tss_base + (0x22 + i * 2),
- retaddr);
+new_segs[i] = access_ldw(&new, tss_base + (0x22 + i * 2));
 }
-new_ldt = cpu_lduw_kernel_ra(env, tss_base + 0x2a, retaddr);
+new_ldt = access_ldw(&new, tss_base + 0x2a);
 new_segs[R_FS] = 0;
 new_segs[R_GS] = 0;
 new_trap = 0;
@@ -412,16 +424,6 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
  chapters 12.2.5 and 13.2.4 on how to implement TSS Trap bit */
 (void)new_trap;
 
-/* NOTE: we must avoid memory exceptions during the task switch,

[PULL 00/20] i386, bugfix changes for QEMU 9.1 soft freeze

2024-07-16 Thread Paolo Bonzini
The following changes since commit 959269e910944c03bc13f300d65bf08b060d5d0f:

  Merge tag 'python-pull-request' of https://gitlab.com/jsnow/qemu into staging 
(2024-07-16 06:45:23 +1000)

are available in the Git repository at:

  https://gitlab.com/bonzini/qemu.git tags/for-upstream

for you to fetch changes up to 6a079f2e68e1832ebca0e7d64bc31ffebde9b2dd:

  target/i386/tcg: save current task state before loading new one (2024-07-16 
18:18:25 +0200)


* target/i386/tcg: fixes for seg_helper.c
* SEV: Don't allow automatic fallback to legacy KVM_SEV_INIT,
  but also don't use it by default
* scsi: honor bootindex again for legacy drives
* hpet, utils, scsi, build, cpu: miscellaneous bugfixes


Akihiko Odaki (1):
  cpu: Free queued CPU work

Boqiao Fu (1):
  docs: Update description of 'user=username' for '-run-with'

Fiona Ebner (2):
  hw/scsi/lsi53c895a: bump instruction limit in scripts processing to fix 
regression
  scsi: fix regression and honor bootindex again for legacy drives

Gustavo Romero (1):
  disas: Fix build against Capstone v6

Michael Roth (1):
  i386/sev: Don't allow automatic fallback to legacy KVM_SEV*_INIT

Paolo Bonzini (9):
  hpet: fix clamping of period
  hpet: fix HPET_TN_SETVAL for high 32-bits of the comparator
  target/i386/tcg: fix POP to memory in long mode
  target/i386/tcg: Allow IRET from user mode to user mode with SMAP
  target/i386/tcg: use PUSHL/PUSHW for error code
  target/i386/tcg: Compute MMU index once
  target/i386/tcg: check for correct busy state before switching to a new 
task
  target/i386/tcg: use X86Access for TSS access
  target/i386/tcg: save current task state before loading new one

Richard Henderson (3):
  target/i386/tcg: Remove SEG_ADDL
  target/i386/tcg: Reorg push/pop within seg_helper.c
  target/i386/tcg: Introduce x86_mmu_index_{kernel_,}pl

Sergey Dyasli (1):
  Revert "qemu-char: do not operate on sources from finalize callbacks"

Song Gao (1):
  qemu/timer: Add host ticks function for LoongArch

 qapi/qom.json|  18 +-
 include/disas/capstone.h |   1 +
 include/hw/core/cpu.h|   6 +
 include/qemu/timer.h |   9 +
 target/i386/cpu.h|  11 +-
 chardev/char-io.c|  19 +-
 cpu-common.c |  11 +
 hw/core/cpu-common.c |   1 +
 hw/i386/pc.c |   2 +-
 hw/scsi/lsi53c895a.c |   2 +-
 hw/scsi/scsi-bus.c   |   9 +
 hw/timer/hpet.c  |  25 +-
 target/i386/cpu.c|  27 +-
 target/i386/sev.c|  87 -
 target/i386/tcg/seg_helper.c | 662 +--
 target/i386/tcg/decode-new.c.inc |   2 +-
 target/i386/tcg/emit.c.inc   |   1 +
 qemu-options.hx  |   7 +-
 18 files changed, 535 insertions(+), 365 deletions(-)
-- 
2.45.2




[PULL 05/20] hw/scsi/lsi53c895a: bump instruction limit in scripts processing to fix regression

2024-07-16 Thread Paolo Bonzini
From: Fiona Ebner 

Commit 9876359990 ("hw/scsi/lsi53c895a: add timer to scripts
processing") reduced the maximum allowed instruction count by
a factor of 100 all the way down to 100.

This causes the "Check Point R81.20 Gaia" appliance [0] to fail to
boot after fully finishing the installation via the appliance's web
interface (there is already one reboot before that).

With a limit of 150, the appliance still fails to boot, while with a
limit of 200, it works. Bump to 500 to fix the regression and be on
the safe side.

Originally reported in the Proxmox community forum[1].

[0]: https://support.checkpoint.com/results/download/124397
[1]: https://forum.proxmox.com/threads/149772/post-683459

Cc: qemu-sta...@nongnu.org
Fixes: 9876359990 ("hw/scsi/lsi53c895a: add timer to scripts processing")
Signed-off-by: Fiona Ebner 
Acked-by: Sven Schnelle 
Link: https://lore.kernel.org/r/20240715131403.223239-1-f.eb...@proxmox.com
Signed-off-by: Paolo Bonzini 
---
 hw/scsi/lsi53c895a.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/scsi/lsi53c895a.c b/hw/scsi/lsi53c895a.c
index eb9828dd5ef..f1935e53280 100644
--- a/hw/scsi/lsi53c895a.c
+++ b/hw/scsi/lsi53c895a.c
@@ -188,7 +188,7 @@ static const char *names[] = {
 #define LSI_TAG_VALID (1 << 16)
 
 /* Maximum instructions to process. */
-#define LSI_MAX_INSN100
+#define LSI_MAX_INSN500
 
 typedef struct lsi_request {
 SCSIRequest *req;
-- 
2.45.2




Re: [RFC PATCH 0/8] Convert avocado tests to normal Python unittests

2024-07-16 Thread Paolo Bonzini
On Tue, Jul 16, 2024 at 20:10, Daniel P. Berrangé wrote:

> On Tue, Jul 16, 2024 at 08:03:54PM +0200, Paolo Bonzini wrote:
> > On Tue, Jul 16, 2024 at 18:45, John Snow wrote:
> >
> > > My only ask is that we keep the tests running in the custom venv
> > > environment we set up at build time
> > >
> >
> > Yes, they do, however pytest should also be added to pythondeps.toml if
> we
> > go this way.
>
> Done in this patch:
>
>   https://lists.nongnu.org/archive/html/qemu-devel/2024-07/msg03596.html


That adds pycotap, not pytest.

> Yep, that's the part that I am a bit more doubtful about.
>
> Pulling & caching VM images isn't much more than a URL download to
> a local file, not very complex in python. Assuming that's what you
> are refering to, then it is already done in this patch:
>
>   https://lists.nongnu.org/archive/html/qemu-devel/2024-07/msg03598.html


I think there are also compressed assets that have to be passed through
gzip/xz/zstd. I am worried that Thomas's patches do 90% of the job but
that is not a good estimation of what's left.

Paolo


> With regards,
> Daniel
> --
> |: https://berrange.com  -o-
> https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-
> https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-
> https://www.instagram.com/dberrange :|
>
>


Re: [RFC PATCH 0/8] Convert avocado tests to normal Python unittests

2024-07-16 Thread Paolo Bonzini
On Tue, Jul 16, 2024 at 18:45, John Snow wrote:

> My only ask is that we keep the tests running in the custom venv
> environment we set up at build time
>

Yes, they do, however pytest should also be added to pythondeps.toml if we
go this way.

If we move to pytest, it's possible we can eliminate that funkiness, which
> would be a win.
>

There is the pycotap dependency to produce TAP from pytest, but that's
probably something small enough to be vendored. And also it depends on what
the dependencies would be for the assets framework.

I'm also not so sure about recreating all of the framework that pulls vm
> images on demand, that sounds like it'd be a lot of work, but maybe I'm
> wrong about that.
>

Yep, that's the part that I am a bit more doubtful about.

Paolo

Tacit ACK from me on this project in general, provided we are still using
> the configure venv.
>
>
>>  Thomas
>>
>>
>> Ani Sinha (1):
>>   tests/pytest: add pytest to the meson build system
>>
>> Thomas Huth (7):
>>   tests/pytest: Add base classes for the upcoming pytest-based tests
>>   tests/pytest: Convert some simple avocado tests into pytests
>>   tests/pytest: Convert info_usernet and version test with small
>> adjustments
>>   tests_pytest: Implement fetch_asset() method for downloading assets
>>   tests/pytest: Convert some tests that download files via fetch_asset()
>>   tests/pytest: Add a function for extracting files from an archive
>>   tests/pytest: Convert avocado test that needed avocado.utils.archive
>>
>>  tests/Makefile.include|   4 +-
>>  tests/meson.build |   1 +
>>  tests/pytest/meson.build  |  74 
>>  tests/pytest/qemu_pytest/__init__.py  | 362 ++
>>  tests/pytest/qemu_pytest/utils.py |  21 +
>>  .../test_arm_canona1100.py}   |  16 +-
>>  .../test_cpu_queries.py}  |   2 +-
>>  .../test_empty_cpu_model.py}  |   2 +-
>>  .../test_info_usernet.py} |   6 +-
>>  .../test_machine_arm_n8x0.py} |  20 +-
>>  .../test_machine_avr6.py} |   7 +-
>>  .../test_machine_loongarch.py}|  11 +-
>>  .../test_machine_mips_loongson3v.py}  |  19 +-
>>  .../test_mem_addr_space.py}   |   3 +-
>>  .../test_ppc_bamboo.py}   |  18 +-
>>  .../version.py => pytest/test_version.py} |   8 +-
>>  .../test_virtio_version.py}   |   2 +-
>>  17 files changed, 502 insertions(+), 74 deletions(-)
>>  create mode 100644 tests/pytest/meson.build
>>  create mode 100644 tests/pytest/qemu_pytest/__init__.py
>>  create mode 100644 tests/pytest/qemu_pytest/utils.py
>>  rename tests/{avocado/machine_arm_canona1100.py =>
>> pytest/test_arm_canona1100.py} (74%)
>>  rename tests/{avocado/cpu_queries.py => pytest/test_cpu_queries.py} (96%)
>>  rename tests/{avocado/empty_cpu_model.py =>
>> pytest/test_empty_cpu_model.py} (94%)
>>  rename tests/{avocado/info_usernet.py => pytest/test_info_usernet.py}
>> (91%)
>>  rename tests/{avocado/machine_arm_n8x0.py =>
>> pytest/test_machine_arm_n8x0.py} (71%)
>>  rename tests/{avocado/machine_avr6.py => pytest/test_machine_avr6.py}
>> (91%)
>>  rename tests/{avocado/machine_loongarch.py =>
>> pytest/test_machine_loongarch.py} (89%)
>>  rename tests/{avocado/machine_mips_loongson3v.py =>
>> pytest/test_machine_mips_loongson3v.py} (59%)
>>  rename tests/{avocado/mem-addr-space-check.py =>
>> pytest/test_mem_addr_space.py} (99%)
>>  rename tests/{avocado/ppc_bamboo.py => pytest/test_ppc_bamboo.py} (75%)
>>  rename tests/{avocado/version.py => pytest/test_version.py} (82%)
>>  rename tests/{avocado/virtio_version.py =>
>> pytest/test_virtio_version.py} (99%)
>>
>> --
>> 2.45.2
>>
>>
>>


Re: [PATCH v1 00/11] Convert avocado tests to normal Python unittests

2024-07-16 Thread Paolo Bonzini
On Tue, Jul 16, 2024 at 13:26, Thomas Huth wrote:

> The Avocado v88 that we use in QEMU is already on a life support
> system: It is not supported by upstream anymore, and with the latest
> versions of Python, it won't work anymore since it depends on the
> "imp" module that has been removed in Python 3.12.
>
> There have been several attempts to update the test suite in QEMU
> to a newer version of Avocado, but so far no attempt has successfully
> been merged yet.
>

I think we should take another look at that. Avocado 92 should work, though
I am not sure whether it also depends on "imp"; if I recall correctly,
the problem with more recent versions was more that they conflicted with
distros that didn't have them packaged. That should also no longer be a
problem with the pythondeps.toml mechanism.

> Additionally, the whole "make check" test suite in QEMU is using the
> meson test runner nowadays, so running the python-based tests via the
> Avocado test runner looks and feels quite like an oddball, requiring
> the users to deal with the knowledge of multiple test runners in
> parallel (e.g. the timeout settings work completely differently).
>

This is true.

Paolo


> So instead of trying to update the python-based test suite in QEMU
> to a newer version of Avocado, we should maybe try to better integrate
> it with the meson test runner instead. Indeed most tests work quite
> nicely without the Avocado framework already, as you can see with
> this patch series - it does not convert all tests, just a subset so
> far, but this already proves that many tests only need small modifi-
> cations to work without Avocado.
>
> Only tests that use the LinuxTest / LinuxDistro and LinuxSSHMixIn
> classes (e.g. based on cloud-init images or using SSH) really depend
> on the Avocado framework, so we'd need a solution for those if we
> want to continue using them. One solution might be to simply use the
> required functions from avocado.utils for these tests, and still run
> them via the meson test runner instead, but that needs some further
> investigation that will be done later.
>
>
> Now if you want to try out these patches: Apply the patches, then
> recompile and then run:
>
>  make check-functional
>
> You can also run single targets e.g. with:
>
>  make check-functional-ppc
>
> You can also run the tests without any test runner now by
> setting the PYTHONPATH environment variable to the "python" folder
> of your source tree, and by specifying the build directory via
> QEMU_BUILD_ROOT (if autodetection fails) and by specifying the
> QEMU binary via QEMU_TEST_QEMU_BINARY. For example:
>
>  export PYTHONPATH=$HOME/qemu/python
>  export QEMU_TEST_QEMU_BINARY=qemu-system-x86_64
>  export QEMU_BUILD_ROOT=$HOME/qemu/build
>  ~/qemu/tests/functional/test_virtio_version.py
>
> The logs of the tests can be found in the build directory under
> tests/functional/ - console log and general logs will
> be put in separate files there.
>
> Still to be done: Update the documentation for this new test framework.
>
> RFC -> v1:
> - Now using pycotap for running the tests instead of "pytest"
> - Change the name from "tests/pytest" to "tests/functional" accordingly
> - Make it possible to run the tests directly
> - Use Python's urllib instead of wget for downloading
> - Lots of makefile / meson integration improvements
> - Converted more tests
> - Update MAINTAINERS file accordingly
> - Added a patch to run check-functional in the gitlab-CI
> - ... lots of other changes I forgot about ... in fact, I changed so
>   many things that I also did not dare to pick up the Reviewed-bys
>   from the RFC
>
> Thomas Huth (11):
>   tests/functional: Add base classes for the upcoming pytest-based tests
>   tests/functional: Convert simple avocado tests into standalone python
> tests
>   tests/functional: Convert avocado tests that just need a small
> adjustment
>   tests/functional: Add python-based tests to the meson build system
>   tests/functional: Implement fetch_asset() method for downloading
> assets
>   tests/functional: Convert some tests that download files via
> fetch_asset()
>   tests/functional: Add a function for extracting files from an archive
>   tests/functional: Convert some avocado tests that needed
> avocado.utils.archive
>   tests/functional: Set up logging
>   tests/functional: Convert the s390x avocado tests into standalone
> tests
>   gitlab-ci: Add "check-functional" to the build tests
>
>  MAINTAINERS   |  22 +-
>  .gitlab-ci.d/buildtest-template.yml   |   3 +-
>  .gitlab-ci.d/buildtest.yml|  60 +--
>  pythondeps.toml   |   3 +-
>  tests/Makefile.include|  18 +-
>  tests/functional/meson.build  | 112 +
>  tests/functional/qemu_test/__init__.py| 384 ++
>  tests/functional/qemu_test/utils.py   |  28 ++
>  .../test_arm_canona1100.py}  

[PATCH] target/i386: do not crash if microvm guest uses SGX CPUID leaves

2024-07-16 Thread Paolo Bonzini
sgx_epc_get_section assumes a PC platform is in use:

bool sgx_epc_get_section(int section_nr, uint64_t *addr, uint64_t *size)
{
PCMachineState *pcms = PC_MACHINE(qdev_get_machine());

However, sgx_epc_get_section is called by CPUID regardless of whether
SGX state has been initialized or which platform is in use.  Check
whether the machine has the right QOM class and, if not, behave as if
there are no EPC sections.

Fixes: 1dec2e1f19f ("i386: Update SGX CPUID info according to hardware/KVM/user 
input", 2021-09-30)
Cc: qemu-sta...@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2142
Signed-off-by: Paolo Bonzini 
---
 hw/i386/sgx.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/i386/sgx.c b/hw/i386/sgx.c
index de76397bcfb..25b2055d653 100644
--- a/hw/i386/sgx.c
+++ b/hw/i386/sgx.c
@@ -266,10 +266,12 @@ void hmp_info_sgx(Monitor *mon, const QDict *qdict)
 
 bool sgx_epc_get_section(int section_nr, uint64_t *addr, uint64_t *size)
 {
-PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
+PCMachineState *pcms =
+(PCMachineState *)object_dynamic_cast(qdev_get_machine(),
+  TYPE_PC_MACHINE);
 SGXEPCDevice *epc;
 
-if (pcms->sgx_epc.size == 0 || pcms->sgx_epc.nr_sections <= section_nr) {
+if (!pcms || pcms->sgx_epc.size == 0 || pcms->sgx_epc.nr_sections <= 
section_nr) {
 return true;
 }
 
-- 
2.45.2




Re: [PATCH v4 0/7] util: Introduce qemu_get_runtime_dir()

2024-07-16 Thread Paolo Bonzini
On Tue, Jul 16, 2024 at 2:46 PM Akihiko Odaki  wrote:
>
> On 2024/07/16 19:43, Paolo Bonzini wrote:
> > On Tue, Jul 16, 2024 at 11:56 AM Daniel P. Berrangé  
> > wrote:
> >>
> >> On Tue, Jul 16, 2024 at 11:06:57AM +0300, Michael Tokarev wrote:
> >>> 16.07.2024 10:27, Akihiko Odaki wrote:
> >>>> qemu_get_runtime_dir() returns a dynamically allocated directory path
> >>>> that is appropriate for storing runtime files. It corresponds to "run"
> >>>> directory in Unix.
> >>>
> >>> Since runtime dir is always used with a filename within, how about
> >>>
> >>>char *qemu_get_runtime_path(const char *filename)
> >>>
> >>> which return RUNTIME_DIR/filename instead of just RUNTIME_DIR ?
> >>
> >> Yeah, I agree, every single caller of the function goes on to call
> >> g_build_filename with the result. The helper should just be building
> >> the filename itself.
> >
> > That would mean using variable arguments and g_build_filename_valist().
>
> We can't prepend an element to va_list.

You could do it in two steps, with g_build_filename(runtime_dir,
first) followed by g_build_filename_valist(result, ap); doing these
steps only if first != NULL, of course.

But I agree that leaving the concatenation in the caller is not
particularly worse, and makes qemu_get_runtime_dir() more readable.
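
For illustration, a minimal sketch of the two-step variant (the
qemu_get_runtime_path() name and its NULL-terminated varargs contract
are assumptions for this example, not an existing API):

    static char *qemu_get_runtime_path(const char *first, ...)
    {
        g_autofree char *dir = qemu_get_runtime_dir();
        g_autofree char *head = NULL;
        char *ret;
        va_list ap;

        if (!first) {
            return g_steal_pointer(&dir);
        }

        head = g_build_filename(dir, first, NULL);
        va_start(ap, first);
        /* remaining path elements, NULL-terminated by the caller */
        ret = g_build_filename_valist(head, &ap);
        va_end(ap);
        return ret;
    }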

Paolo




Re: [PATCH v2] hw/timer/hpet: Fix wrong HPET interrupts

2024-07-16 Thread Paolo Bonzini

On 7/13/24 13:54, TaiseiIto wrote:

Before this commit, there are 3 problems with HPET timer interrupts. First,
HPET periodic timers cause a too-early interrupt, before the HPET main counter
value reaches the value written to their comparator value register. Second,
disabled HPET timers whose comparator value register is not
0x cause wrong interrupts. Third, enabled HPET timers whose
comparator value register is 0x don't cause any interrupts.
About the first one, for example, an HPET driver writes 0x
to an HPET periodic timer's comparator value register. As a result, the
register becomes 0x because writing to the higher 32 bits of
the register has no effect in periodic mode (see "case HPET_TN_CMP + 4"
in the "hpet_ram_write" function). And "timer->period",
which means the interrupt period in periodic mode, becomes 0x. Next, the
HPET driver sets the HPET_CFG_ENABLE flag to start the main counter. The
comparator value register (0x) indicates the next interrupt
time. The period (0x) is added to the comparator value register in
the "hpet_timer" function because the "hpet_time_after64" function returns
true when the main counter is small. So, the first interrupt is planned when
the main counter is 0x5554, but the first interrupt should occur when
the main counter is 0x. To solve this problem, I fix
"case HPET_TN_CMP + 4" in the "hpet_ram_write" function to ensure that writes
to the higher 32 bits of a comparator value register take effect even in
periodic mode. About the other two problems, whether each timer was enabled
used to be decided by its comparator value, but it should be decided by the
"timer_enabled" function, which checks the "HPET_TN_ENABLE" flag. To solve
these problems, I fix the code to decide correctly whether each timer is
enabled.
After this commit, the 3 problems are solved. First, HPET periodic timers
cause the first interrupt when the main counter value reaches the value
written to their comparator value register. Second, disabled HPET timers
never cause any interrupt. Third, enabled HPET timers cause interrupts
correctly even if an HPET driver writes 0x to its comparator
value register.

Signed-off-by: TaiseiIto 
---

Changes in v2:
- Reflect writings to higher 32 bits of a comparator value register rather
   than clearing these bits.
- Fix wrong indents.
- Link to v1: 
https://lore.kernel.org/qemu-devel/ty0pr0101mb4285838139bc56dec3d1ccfda4...@ty0pr0101mb4285.apcprd01.prod.exchangelabs.com/

  hw/timer/hpet.c | 24 +++-
  1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index 01efe4885d..4b6352e257 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -552,6 +552,10 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
  timer->period =
(timer->period & 0xffffffff00000000ULL) | new_val;
  }
+/*
+ * FIXME: on a 64-bit write, HPET_TN_SETVAL should apply to the
+ * high bits part as well.
+ */
  timer->config &= ~HPET_TN_SETVAL;
  if (hpet_enabled(s)) {
  hpet_set_timer(timer);
@@ -562,20 +566,22 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
  if (!timer_is_periodic(timer)
  || (timer->config & HPET_TN_SETVAL)) {
timer->cmp = (timer->cmp & 0xffffffffULL) | new_val << 32;
-} else {
+}
+if (timer_is_periodic(timer)) {
  /*
   * FIXME: Clamp period to reasonable min value?
   * Clamp period to reasonable max value
   */
-new_val &= (timer->config & HPET_TN_32BIT ? ~0u : ~0ull) >> 1;
+new_val = MIN(new_val, ~0u >> 1);
+timer->cmp = (timer->cmp & 0xffffffffULL) | new_val << 32;


This seems wrong to me.  The comparator must be reset using 
HPET_TN_SETVAL, otherwise it just keeps running.  From the specification:


"If software wants to change the periodic rate, it should write a new 
value to the comparator value register.  At the point when the timer’s 
comparator indicates a match, this new value will be added to derive the 
next matching point. So as to avoid race conditions where the new value 
is written just as a match occurs, either the main counter should be 
halted or the comparator disabled when the new periodic rate is written".


The sentence "at the point when the timer’s comparator indicates a 
match" indicates that the comparator register is not written.


I suspect you're hitting the other issue, and HPET_TN_SETVAL is cleared 
incorrectly by a 64-bit write.  I'll try sending out a patch to fix that.



-if ((&s->timer[i])->cmp != ~0ULL) {
+if (hpet_enabled(s)) {
 hpet_set_timer(&s->timer[i]);
 }

Re: [PATCH v4 0/7] util: Introduce qemu_get_runtime_dir()

2024-07-16 Thread Paolo Bonzini
On Tue, Jul 16, 2024 at 11:56 AM Daniel P. Berrangé  wrote:
>
> On Tue, Jul 16, 2024 at 11:06:57AM +0300, Michael Tokarev wrote:
> > 16.07.2024 10:27, Akihiko Odaki wrote:
> > > qemu_get_runtime_dir() returns a dynamically allocated directory path
> > > that is appropriate for storing runtime files. It corresponds to "run"
> > > directory in Unix.
> >
> > Since runtime dir is always used with a filename within, how about
> >
> >   char *qemu_get_runtime_path(const char *filename)
> >
> > which return RUNTIME_DIR/filename instead of just RUNTIME_DIR ?
>
> Yeah, I agree, every single caller of the function goes on to call
> g_build_filename with the result. The helper should just be building
> the filename itself.

That would mean using variable arguments and g_build_filename_valist().

Paolo




[PATCH 2/2] hpet: fix HPET_TN_SETVAL for high 32-bits of the comparator

2024-07-16 Thread Paolo Bonzini
Commit 3787324101b ("hpet: Fix emulation of HPET_TN_SETVAL (Jan Kiszka)",
2009-04-17) applied the fix only to the low 32-bits of the comparator, but
it should be done for the high bits as well.  Otherwise, the high 32-bits
of the comparator cannot be written and they remain fixed to 0x.

Co-developed-by: TaiseiIto 
Signed-off-by: TaiseiIto 
Signed-off-by: Paolo Bonzini 
---
 hw/timer/hpet.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index 16be1278d09..85fb2c07ae3 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -554,6 +554,10 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
 timer->period =
(timer->period & 0xffffffff00000000ULL) | new_val;
 }
+/*
+ * FIXME: on a 64-bit write, HPET_TN_SETVAL should apply to the
+ * high bits part as well.
+ */
 timer->config &= ~HPET_TN_SETVAL;
 if (hpet_enabled(s)) {
 hpet_set_timer(timer);
@@ -564,7 +568,8 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
 if (!timer_is_periodic(timer)
 || (timer->config & HPET_TN_SETVAL)) {
timer->cmp = (timer->cmp & 0xffffffffULL) | new_val << 32;
-} else {
+}
+if (timer_is_periodic(timer)) {
 /*
  * FIXME: Clamp period to reasonable min value?
  * Clamp period to reasonable max value
@@ -572,12 +577,12 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
 new_val = MIN(new_val, ~0u >> 1);
 timer->period =
(timer->period & 0xffffffffULL) | new_val << 32;
-}
-timer->config &= ~HPET_TN_SETVAL;
-if (hpet_enabled(s)) {
-hpet_set_timer(timer);
-}
-break;
+}
+timer->config &= ~HPET_TN_SETVAL;
+if (hpet_enabled(s)) {
+hpet_set_timer(timer);
+}
+break;
 case HPET_TN_ROUTE:
timer->fsb = (timer->fsb & 0xffffffff00000000ULL) | new_val;
 break;
-- 
2.45.2




[PATCH 0/2] first batch of hpet fixes

2024-07-16 Thread Paolo Bonzini
Extracted from the patch that TaiseiIto  tested.
While not sufficient to fix their problems, this is a step in the
right direction.

Paolo Bonzini (2):
  hpet: fix clamping of period
  hpet: fix HPET_TN_SETVAL for high 32-bits of the comparator

 hw/timer/hpet.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

-- 
2.45.2




[PATCH 1/2] hpet: fix clamping of period

2024-07-16 Thread Paolo Bonzini
When writing a new period, the clamping should use a maximum value
rather than a bit mask.  Also, when writing the high bits, new_val is
the period shifted right by 32, so the maximum allowed period should
also be shifted right.

Signed-off-by: Paolo Bonzini 
---
 hw/timer/hpet.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index 01efe4885db..16be1278d09 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -548,7 +548,9 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
  * FIXME: Clamp period to reasonable min value?
  * Clamp period to reasonable max value
  */
-new_val &= (timer->config & HPET_TN_32BIT ? ~0u : ~0ull) >> 1;
+if (timer->config & HPET_TN_32BIT) {
+new_val = MIN(new_val, ~0u >> 1);
+}
 timer->period =
(timer->period & 0xffffffff00000000ULL) | new_val;
 }
@@ -567,7 +569,7 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
  * FIXME: Clamp period to reasonable min value?
  * Clamp period to reasonable max value
  */
-new_val &= (timer->config & HPET_TN_32BIT ? ~0u : ~0ull) >> 1;
+new_val = MIN(new_val, ~0u >> 1);
 timer->period =
(timer->period & 0xffffffffULL) | new_val << 32;
 }
-- 
2.45.2




Re: [PATCH] Manpage: Update description of 'user=username' for '-run-with'

2024-07-16 Thread Paolo Bonzini
> Manpage: the description of '-runas' didn't show that this parameter will
> use setuid, so the customer might get confused when 'elevateprivileges=deny'
> is used. Since '-runas' is going to be deprecated and replaced by this
> parameter in the coming qemu9.1, add the message here.

Queued, thanks.  I modified the patch a bit to explain how setgid and
setgroups are used in addition to setuid:

diff --git a/qemu-options.hx b/qemu-options.hx
index ad6521ef5e7..694fa37f284 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -5024,8 +5024,11 @@ SRST
 in combination with -runas.

 ``user=username`` or ``user=uid:gid`` can be used to drop root privileges
-by switching to the specified user (via username) or user and group
-(via uid:gid) immediately before starting guest execution.
+before starting guest execution. QEMU will use the ``setuid`` and 
``setgid``
+system calls to switch to the specified identity.  Note that the
+``user=username`` syntax will also apply the full set of supplementary
+groups for the user, whereas the ``user=uid:gid`` will use only the
+``gid`` group.
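
For reference, a rough sketch of the privilege-drop sequence implied by
the text above (illustrative only, not the actual os-posix.c code, with
error handling reduced to exit()):

    if (username) {
        struct passwd *pw = getpwnam(username);
        if (!pw ||
            initgroups(pw->pw_name, pw->pw_gid) < 0 || /* supplementary groups */
            setgid(pw->pw_gid) < 0 ||
            setuid(pw->pw_uid) < 0) {
            exit(1);
        }
    } else {
        /* uid:gid form: only the primary gid, no supplementary groups */
        if (setgid(gid) < 0 || setuid(uid) < 0) {
            exit(1);
        }
    }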

Paolo




Re: [PATCH] qemu/timer: Add host ticks function for LoongArch

2024-07-16 Thread Paolo Bonzini
Queued, thanks.

Paolo




Re: [PATCH] scsi: fix regression and honor bootindex again for legacy drives

2024-07-16 Thread Paolo Bonzini
Queued, thanks.

Paolo




Re: [PATCH] hw/scsi/lsi53c895a: bump instruction limit in scripts processing to fix regression

2024-07-16 Thread Paolo Bonzini
Queued, thanks.

Paolo




Re: [PATCH] disas: Fix build against Capstone v6

2024-07-16 Thread Paolo Bonzini
Queued, thanks.

Paolo




Re: [PATCH] meson: Use -fno-sanitize=function when available

2024-07-16 Thread Paolo Bonzini
Queued, thanks.

Paolo




Re: [PATCH] cpu: Free queued CPU work

2024-07-16 Thread Paolo Bonzini
Queued, thanks.

Paolo




Re: [PATCH v4 0/7] util: Introduce qemu_get_runtime_dir()

2024-07-16 Thread Paolo Bonzini
Queued, thanks.

Paolo




[PULL 02/13] target/i386/tcg: Remove SEG_ADDL

2024-07-14 Thread Paolo Bonzini
From: Richard Henderson 

This truncation is now handled by MMU_*32_IDX.  The introduction of
MMU_*32_IDX in fact applied correct 32-bit wraparound to 16-bit accesses
with a high segment base (e.g.  big real mode or vm86 mode), which did
not use SEG_ADDL.
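
For example (hedged, with illustrative values): with a 16-bit stack and
a segment base close to 4 GiB, the sum can carry beyond 32 bits; the
MMU_*32_IDX modes now truncate the linear address in the MMU layer,
which is what SEG_ADDL used to do explicitly:

    /* old behavior, now implied by MMU_*32_IDX */
    uint32_t linear = (uint32_t)(ssp + (sp & sp_mask));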

Signed-off-by: Richard Henderson 
Link: 
https://lore.kernel.org/r/20240617161210.4639-3-richard.hender...@linaro.org
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index aee3d19f29b..19d6b41a589 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -579,10 +579,6 @@ int exception_has_error_code(int intno)
 } while (0)
 #endif
 
-/* in 64-bit machines, this can overflow. So this segment addition macro
- * can be used to trim the value to 32-bit whenever needed */
-#define SEG_ADDL(ssp, sp, sp_mask) ((uint32_t)((ssp) + (sp & (sp_mask))))
-
 /* XXX: add a is_user flag to have proper security support */
 #define PUSHW_RA(ssp, sp, sp_mask, val, ra)  \
 {\
@@ -593,7 +589,7 @@ int exception_has_error_code(int intno)
 #define PUSHL_RA(ssp, sp, sp_mask, val, ra) \
 {   \
 sp -= 4;\
-cpu_stl_kernel_ra(env, SEG_ADDL(ssp, sp, sp_mask), (uint32_t)(val), 
ra); \
+cpu_stl_kernel_ra(env, (ssp) + (sp & (sp_mask)), (val), ra); \
 }
 
 #define POPW_RA(ssp, sp, sp_mask, val, ra)   \
@@ -604,7 +600,7 @@ int exception_has_error_code(int intno)
 
 #define POPL_RA(ssp, sp, sp_mask, val, ra)  \
 {   \
-val = (uint32_t)cpu_ldl_kernel_ra(env, SEG_ADDL(ssp, sp, sp_mask), 
ra); \
+val = (uint32_t)cpu_ldl_kernel_ra(env, (ssp) + (sp & (sp_mask)), ra); \
 sp += 4;\
 }
 
-- 
2.45.2




[PULL 03/13] target/i386/tcg: Allow IRET from user mode to user mode with SMAP

2024-07-14 Thread Paolo Bonzini
This fixes a bug wherein i386/tcg assumed an interrupt return using
the IRET instruction was always returning from kernel mode to either
kernel mode or user mode. This assumption is violated when IRET is used
as a clever way to restore thread state, as for example in the dotnet
runtime. There, IRET returns from user mode to user mode.
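
For context, the pattern in question looks roughly like this on x86-64
(a hedged illustration of the idea, not code from the dotnet runtime):

    /* Restore a user-mode thread context: push SS, RSP, RFLAGS, CS and
     * RIP, then let IRETQ pop them atomically.  Old and new CS both
     * have CPL=3 here. */
    __attribute__((noreturn))
    static void restore_context(uint64_t rip, uint64_t cs, uint64_t rflags,
                                uint64_t rsp, uint64_t ss)
    {
        asm volatile("pushq %0; pushq %1; pushq %2; pushq %3; pushq %4; iretq"
                     : : "r"(ss), "r"(rsp), "r"(rflags), "r"(cs), "r"(rip));
        __builtin_unreachable();
    }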

The fix is that stack accesses from IRET and RETF, as well as accesses
to the parameters in a call gate, should be normal data accesses using the
current CPL.  The bug manifested itself as a page fault in the guest Linux
kernel due to SMAP preventing the access.

This bug appears to have been in QEMU since the beginning.

Analyzed-by: Robert R. Henry 
Co-developed-by: Robert R. Henry 
Signed-off-by: Robert R. Henry 
Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 19d6b41a589..224e73e9ed0 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -594,13 +594,13 @@ int exception_has_error_code(int intno)
 
 #define POPW_RA(ssp, sp, sp_mask, val, ra)   \
 {\
-val = cpu_lduw_kernel_ra(env, (ssp) + (sp & (sp_mask)), ra); \
+val = cpu_lduw_data_ra(env, (ssp) + (sp & (sp_mask)), ra); \
 sp += 2; \
 }
 
 #define POPL_RA(ssp, sp, sp_mask, val, ra)  \
 {   \
-val = (uint32_t)cpu_ldl_kernel_ra(env, (ssp) + (sp & (sp_mask)), ra); \
+val = (uint32_t)cpu_ldl_data_ra(env, (ssp) + (sp & (sp_mask)), ra); \
 sp += 4;\
 }
 
@@ -847,7 +847,7 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 
 #define POPQ_RA(sp, val, ra)\
 {   \
-val = cpu_ldq_kernel_ra(env, sp, ra);   \
+val = cpu_ldq_data_ra(env, sp, ra);   \
 sp += 8;\
 }
 
@@ -1797,18 +1797,18 @@ void helper_lcall_protected(CPUX86State *env, int 
new_cs, target_ulong new_eip,
 PUSHL_RA(ssp, sp, sp_mask, env->segs[R_SS].selector, GETPC());
 PUSHL_RA(ssp, sp, sp_mask, env->regs[R_ESP], GETPC());
 for (i = param_count - 1; i >= 0; i--) {
-val = cpu_ldl_kernel_ra(env, old_ssp +
-((env->regs[R_ESP] + i * 4) &
- old_sp_mask), GETPC());
+val = cpu_ldl_data_ra(env,
+  old_ssp + ((env->regs[R_ESP] + i * 
4) & old_sp_mask),
+  GETPC());
 PUSHL_RA(ssp, sp, sp_mask, val, GETPC());
 }
 } else {
 PUSHW_RA(ssp, sp, sp_mask, env->segs[R_SS].selector, GETPC());
 PUSHW_RA(ssp, sp, sp_mask, env->regs[R_ESP], GETPC());
 for (i = param_count - 1; i >= 0; i--) {
-val = cpu_lduw_kernel_ra(env, old_ssp +
- ((env->regs[R_ESP] + i * 2) &
-  old_sp_mask), GETPC());
+val = cpu_lduw_data_ra(env,
+   old_ssp + ((env->regs[R_ESP] + i * 
2) & old_sp_mask),
+   GETPC());
 PUSHW_RA(ssp, sp, sp_mask, val, GETPC());
 }
 }
-- 
2.45.2




[PULL 10/13] target/i386/tcg: use X86Access for TSS access

2024-07-14 Thread Paolo Bonzini
This takes care of probing the vaddr range in advance, and is also faster
because it avoids repeated TLB lookups.  It also matches the Intel manual
better, as it says "Checks that the current (old) TSS, new TSS, and all
segment descriptors used in the task switch are paged into system memory";
note however that it's not clear how the processor checks for segment
descriptors, and this check is not included in the AMD manual.
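
The resulting pattern is: probe once, then perform accesses that can no
longer fault (a sketch with illustrative variables; the function names
match the diff below):

    X86Access acc;

    access_prepare_mmu(&acc, env, base, limit, MMU_DATA_LOAD,
                       mmu_index, retaddr);
    /* after the probe above, these loads cannot raise a page fault */
    uint32_t lo = access_ldl(&acc, base);
    uint32_t hi = access_ldl(&acc, base + 4);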

Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 110 ++-
 1 file changed, 58 insertions(+), 52 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 0242f9d8b58..fea08a2ba0f 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -27,6 +27,7 @@
 #include "exec/log.h"
 #include "helper-tcg.h"
 #include "seg_helper.h"
+#include "access.h"
 
 #ifdef TARGET_X86_64
 #define SET_ESP(val, sp_mask)   \
@@ -313,14 +314,15 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
  uint32_t e1, uint32_t e2, int source,
  uint32_t next_eip, uintptr_t retaddr)
 {
-int tss_limit, tss_limit_max, type, old_tss_limit_max, old_type, v1, v2, i;
+int tss_limit, tss_limit_max, type, old_tss_limit_max, old_type, i;
 target_ulong tss_base;
 uint32_t new_regs[8], new_segs[6];
 uint32_t new_eflags, new_eip, new_cr3, new_ldt, new_trap;
 uint32_t old_eflags, eflags_mask;
 SegmentCache *dt;
-int index;
+int mmu_index, index;
 target_ulong ptr;
+X86Access old, new;
 
 type = (e2 >> DESC_TYPE_SHIFT) & 0xf;
 LOG_PCALL("switch_tss: sel=0x%04x type=%d src=%d\n", tss_selector, type,
@@ -374,35 +376,45 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
 raise_exception_err_ra(env, EXCP0A_TSS, tss_selector & 0xfffc, 
retaddr);
 }
 
+/* X86Access avoids memory exceptions during the task switch */
+mmu_index = cpu_mmu_index_kernel(env);
+access_prepare_mmu(&old, env, env->tr.base, old_tss_limit_max,
+   MMU_DATA_STORE, mmu_index, retaddr);
+
+if (source == SWITCH_TSS_CALL) {
+/* Probe for future write of parent task */
+probe_access(env, tss_base, 2, MMU_DATA_STORE,
+ mmu_index, retaddr);
+}
+access_prepare_mmu(&new, env, tss_base, tss_limit,
+   MMU_DATA_LOAD, mmu_index, retaddr);
+
 /* read all the registers from the new TSS */
 if (type & 8) {
 /* 32 bit */
-new_cr3 = cpu_ldl_kernel_ra(env, tss_base + 0x1c, retaddr);
-new_eip = cpu_ldl_kernel_ra(env, tss_base + 0x20, retaddr);
-new_eflags = cpu_ldl_kernel_ra(env, tss_base + 0x24, retaddr);
+new_cr3 = access_ldl(&new, tss_base + 0x1c);
+new_eip = access_ldl(&new, tss_base + 0x20);
+new_eflags = access_ldl(&new, tss_base + 0x24);
 for (i = 0; i < 8; i++) {
-new_regs[i] = cpu_ldl_kernel_ra(env, tss_base + (0x28 + i * 4),
-retaddr);
+new_regs[i] = access_ldl(&new, tss_base + (0x28 + i * 4));
 }
 for (i = 0; i < 6; i++) {
-new_segs[i] = cpu_lduw_kernel_ra(env, tss_base + (0x48 + i * 4),
- retaddr);
+new_segs[i] = access_ldw(&new, tss_base + (0x48 + i * 4));
 }
-new_ldt = cpu_lduw_kernel_ra(env, tss_base + 0x60, retaddr);
-new_trap = cpu_ldl_kernel_ra(env, tss_base + 0x64, retaddr);
+new_ldt = access_ldw(&new, tss_base + 0x60);
+new_trap = access_ldl(&new, tss_base + 0x64);
 } else {
 /* 16 bit */
 new_cr3 = 0;
-new_eip = cpu_lduw_kernel_ra(env, tss_base + 0x0e, retaddr);
-new_eflags = cpu_lduw_kernel_ra(env, tss_base + 0x10, retaddr);
+new_eip = access_ldw(&new, tss_base + 0x0e);
+new_eflags = access_ldw(&new, tss_base + 0x10);
 for (i = 0; i < 8; i++) {
-new_regs[i] = cpu_lduw_kernel_ra(env, tss_base + (0x12 + i * 2), 
retaddr);
+new_regs[i] = access_ldw(&new, tss_base + (0x12 + i * 2));
 }
 for (i = 0; i < 4; i++) {
-new_segs[i] = cpu_lduw_kernel_ra(env, tss_base + (0x22 + i * 2),
- retaddr);
+new_segs[i] = access_ldw(&new, tss_base + (0x22 + i * 2));
 }
-new_ldt = cpu_lduw_kernel_ra(env, tss_base + 0x2a, retaddr);
+new_ldt = access_ldw(&new, tss_base + 0x2a);
 new_segs[R_FS] = 0;
 new_segs[R_GS] = 0;
 new_trap = 0;
@@ -412,16 +424,6 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
  chapters 12.2.5 and 13.2.4 on how to implement TSS Trap bit */
 (void)new_trap;
 
-/* NOTE: we must avoid memory exceptions during the task switch,

[PULL 12/13] i386/sev: Don't allow automatic fallback to legacy KVM_SEV*_INIT

2024-07-14 Thread Paolo Bonzini
From: Michael Roth 

Currently, if the 'legacy-vm-type' property of the sev-guest object is
'off', QEMU will attempt to use the newer KVM_SEV_INIT2 kernel
interface in conjunction with the newer KVM_X86_SEV_VM and
KVM_X86_SEV_ES_VM KVM VM types.

This can lead to measurement changes if, for instance, an SEV guest was
created on a host that originally had an older kernel that didn't
support KVM_SEV_INIT2, but is booted on the same host later on after the
host kernel was upgraded.

Instead, if legacy-vm-type is 'off', QEMU should fail if the
KVM_SEV_INIT2 interface is not provided by the current host kernel.
Modify the fallback handling accordingly.

In the future, VMSA features and other flags might be added to QEMU
which will require legacy-vm-type to be 'off' because they will rely
on the newer KVM_SEV_INIT2 interface. It may be difficult to convey to
users what values of legacy-vm-type are compatible with which
features/options, so as part of this rework, switch legacy-vm-type to a
tri-state OnOffAuto option. 'auto' in this case will automatically
switch to using the newer KVM_SEV_INIT2, but only if it is required to
make use of new VMSA features or other options only available via
KVM_SEV_INIT2.

Defining 'auto' in this way would avoid inadvertently breaking
compatibility with older kernels since it would only be used in cases
where users opt into newer features that are only available via
KVM_SEV_INIT2 and newer kernels, and provide better default behavior
than the legacy-vm-type=off behavior that was previously in place, so
make it the default for 9.1+ machine types.
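
The intended decision logic for the tri-state option can be sketched as
follows (illustrative only; the function and variable names do not match
target/i386/sev.c):

    static bool sev_use_init2(OnOffAuto legacy_vm_type, bool new_opts_used)
    {
        switch (legacy_vm_type) {
        case ON_OFF_AUTO_OFF:
            return true;           /* require KVM_SEV_INIT2 */
        case ON_OFF_AUTO_ON:
            return false;          /* force legacy KVM_SEV_INIT */
        case ON_OFF_AUTO_AUTO:
        default:
            return new_opts_used;  /* INIT2 only if newer options need it */
        }
    }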

Cc: Daniel P. Berrangé 
Cc: Paolo Bonzini 
cc: k...@vger.kernel.org
Signed-off-by: Michael Roth 
Reviewed-by: Daniel P. Berrangé 
Link: https://lore.kernel.org/r/20240710041005.83720-1-michael.r...@amd.com
Signed-off-by: Paolo Bonzini 
---
 qapi/qom.json | 18 ++
 hw/i386/pc.c  |  2 +-
 target/i386/sev.c | 87 +++
 3 files changed, 84 insertions(+), 23 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 8e75a419c30..7eccd2e14e2 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -924,12 +924,16 @@
 # @handle: SEV firmware handle (default: 0)
 #
 # @legacy-vm-type: Use legacy KVM_SEV_INIT KVM interface for creating the VM.
-#  The newer KVM_SEV_INIT2 interface syncs additional vCPU
-#  state when initializing the VMSA structures, which will
-#  result in a different guest measurement. Set this to
-#  maintain compatibility with older QEMU or kernel versions
-#  that rely on legacy KVM_SEV_INIT behavior.
-#  (default: false) (since 9.1)
+#  The newer KVM_SEV_INIT2 interface, from Linux >= 6.10, syncs
+#  additional vCPU state when initializing the VMSA structures,
+#  which will result in a different guest measurement. Set
+#  this to 'on' to force compatibility with older QEMU or 
kernel
+#  versions that rely on legacy KVM_SEV_INIT behavior. 'auto'
+#  will behave identically to 'on', but will automatically
+#  switch to using KVM_SEV_INIT2 if the user specifies any
+#  additional options that require it. If set to 'off', QEMU
+#  will require KVM_SEV_INIT2 unconditionally.
+#  (default: off) (since 9.1)
 #
 # Since: 2.12
 ##
@@ -939,7 +943,7 @@
 '*session-file': 'str',
 '*policy': 'uint32',
 '*handle': 'uint32',
-'*legacy-vm-type': 'bool' } }
+'*legacy-vm-type': 'OnOffAuto' } }
 
 ##
 # @SevSnpGuestProperties:
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 4fbc5774708..c74931d577a 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -83,7 +83,7 @@ GlobalProperty pc_compat_9_0[] = {
 { TYPE_X86_CPU, "x-amd-topoext-features-only", "false" },
 { TYPE_X86_CPU, "x-l1-cache-per-thread", "false" },
 { TYPE_X86_CPU, "guest-phys-bits", "0" },
-{ "sev-guest", "legacy-vm-type", "true" },
+{ "sev-guest", "legacy-vm-type", "on" },
 { TYPE_X86_CPU, "legacy-multi-node", "on" },
 };
 const size_t pc_compat_9_0_len = G_N_ELEMENTS(pc_compat_9_0);
diff --git a/target/i386/sev.c b/target/i386/sev.c
index 2ba5f517228..a1157c0ede6 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -144,7 +144,7 @@ struct SevGuestState {
 uint32_t policy;
 char *dh_cert_file;
 char *session_file;
-bool legacy_vm_type;
+OnOffAuto legacy_vm_type;
 };
 
 struct SevSnpGuestState {
@@ -1369,6 +1369,17 @@ sev_vm_state_change(void *opaque, bool running, RunState 
state)
 }
 }
 
+/*
+ * This helper is to examine sev-guest properties and determine if any options
+ * have been set which rely on the newer KVM_SEV_INIT2 

[PULL 11/13] target/i386/tcg: save current task state before loading new one

2024-07-14 Thread Paolo Bonzini
This is how the steps are ordered in the manual.  EFLAGS.NT is
overwritten after the fact in the saved image.

Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 85 +++-
 1 file changed, 45 insertions(+), 40 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index fea08a2ba0f..c0641a79c70 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -389,6 +389,42 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
access_prepare_mmu(&new, env, tss_base, tss_limit,
MMU_DATA_LOAD, mmu_index, retaddr);
 
+/* save the current state in the old TSS */
+old_eflags = cpu_compute_eflags(env);
+if (old_type & 8) {
+/* 32 bit */
+access_stl(&old, env->tr.base + 0x20, next_eip);
+access_stl(&old, env->tr.base + 0x24, old_eflags);
+access_stl(&old, env->tr.base + (0x28 + 0 * 4), env->regs[R_EAX]);
+access_stl(&old, env->tr.base + (0x28 + 1 * 4), env->regs[R_ECX]);
+access_stl(&old, env->tr.base + (0x28 + 2 * 4), env->regs[R_EDX]);
+access_stl(&old, env->tr.base + (0x28 + 3 * 4), env->regs[R_EBX]);
+access_stl(&old, env->tr.base + (0x28 + 4 * 4), env->regs[R_ESP]);
+access_stl(&old, env->tr.base + (0x28 + 5 * 4), env->regs[R_EBP]);
+access_stl(&old, env->tr.base + (0x28 + 6 * 4), env->regs[R_ESI]);
+access_stl(&old, env->tr.base + (0x28 + 7 * 4), env->regs[R_EDI]);
+for (i = 0; i < 6; i++) {
+access_stw(&old, env->tr.base + (0x48 + i * 4),
+   env->segs[i].selector);
+}
+} else {
+/* 16 bit */
+access_stw(&old, env->tr.base + 0x0e, next_eip);
+access_stw(&old, env->tr.base + 0x10, old_eflags);
+access_stw(&old, env->tr.base + (0x12 + 0 * 2), env->regs[R_EAX]);
+access_stw(&old, env->tr.base + (0x12 + 1 * 2), env->regs[R_ECX]);
+access_stw(&old, env->tr.base + (0x12 + 2 * 2), env->regs[R_EDX]);
+access_stw(&old, env->tr.base + (0x12 + 3 * 2), env->regs[R_EBX]);
+access_stw(&old, env->tr.base + (0x12 + 4 * 2), env->regs[R_ESP]);
+access_stw(&old, env->tr.base + (0x12 + 5 * 2), env->regs[R_EBP]);
+access_stw(&old, env->tr.base + (0x12 + 6 * 2), env->regs[R_ESI]);
+access_stw(&old, env->tr.base + (0x12 + 7 * 2), env->regs[R_EDI]);
+for (i = 0; i < 4; i++) {
+access_stw(&old, env->tr.base + (0x22 + i * 2),
+   env->segs[i].selector);
+}
+}
+
 /* read all the registers from the new TSS */
 if (type & 8) {
 /* 32 bit */
@@ -428,49 +464,16 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
 if (source == SWITCH_TSS_JMP || source == SWITCH_TSS_IRET) {
 tss_set_busy(env, env->tr.selector, 0, retaddr);
 }
-old_eflags = cpu_compute_eflags(env);
+
 if (source == SWITCH_TSS_IRET) {
 old_eflags &= ~NT_MASK;
+if (old_type & 8) {
+access_stl(&old, env->tr.base + 0x24, old_eflags);
+} else {
+access_stw(&old, env->tr.base + 0x10, old_eflags);
+}
 }
 
-/* save the current state in the old TSS */
-if (old_type & 8) {
-/* 32 bit */
-access_stl(&old, env->tr.base + 0x20, next_eip);
-access_stl(&old, env->tr.base + 0x24, old_eflags);
-access_stl(&old, env->tr.base + (0x28 + 0 * 4), env->regs[R_EAX]);
-access_stl(&old, env->tr.base + (0x28 + 1 * 4), env->regs[R_ECX]);
-access_stl(&old, env->tr.base + (0x28 + 2 * 4), env->regs[R_EDX]);
-access_stl(&old, env->tr.base + (0x28 + 3 * 4), env->regs[R_EBX]);
-access_stl(&old, env->tr.base + (0x28 + 4 * 4), env->regs[R_ESP]);
-access_stl(&old, env->tr.base + (0x28 + 5 * 4), env->regs[R_EBP]);
-access_stl(&old, env->tr.base + (0x28 + 6 * 4), env->regs[R_ESI]);
-access_stl(&old, env->tr.base + (0x28 + 7 * 4), env->regs[R_EDI]);
-for (i = 0; i < 6; i++) {
-access_stw(&old, env->tr.base + (0x48 + i * 4),
-   env->segs[i].selector);
-}
-} else {
-/* 16 bit */
-access_stw(&old, env->tr.base + 0x0e, next_eip);
-access_stw(&old, env->tr.base + 0x10, old_eflags);
-access_stw(&old, env->tr.base + (0x12 + 0 * 2), env->regs[R_EAX]);
-access_stw(&old, env->tr.base + (0x12 + 1 * 2), env->regs[R_ECX]);
-access_stw(&old, env->tr.base + (0x12 + 2 * 2), env->regs[R_EDX]);
-access_stw(&old, env->tr.base + (0x12 + 3 * 2), env->regs[R_EBX]);
-access_stw(&old, env->tr.base + (0x12 + 4 * 2), env->regs[R_ESP]);
-access_stw(&old, env->tr.base + (0x12 + 5 * 2), env->regs[R_EBP]);
-access_stw(&old, env->tr.base + (0x12 + 6 * 2), env->regs[R_ESI]);

[PULL 00/13] target/i386 changes for 2024-07-12

2024-07-14 Thread Paolo Bonzini
The following changes since commit 23901b2b721c0576007ab7580da8aa855d6042a9:

  Merge tag 'pull-target-arm-20240711' of 
https://git.linaro.org/people/pmaydell/qemu-arm into staging (2024-07-11 
12:00:00 -0700)

are available in the Git repository at:

  https://gitlab.com/bonzini/qemu.git tags/for-upstream-i386

for you to fetch changes up to cdcadf9ee9efef96323e0b88fccff589f06fc0ee:

  i386/sev: Don't allow automatic fallback to legacy KVM_SEV*_INIT (2024-07-12 
15:35:54 +0200)


* target/i386/tcg: fixes for seg_helper.c
* SEV: Don't allow automatic fallback to legacy KVM_SEV_INIT,
  but also don't use it by default


Michael Roth (1):
  i386/sev: Don't allow automatic fallback to legacy KVM_SEV*_INIT

Paolo Bonzini (8):
  target/i386/tcg: fix POP to memory in long mode
  target/i386/tcg: Allow IRET from user mode to user mode with SMAP
  target/i386/tcg: use PUSHL/PUSHW for error code
  target/i386/tcg: Compute MMU index once
  target/i386/tcg: Use DPL-level accesses for interrupts and call gates
  target/i386/tcg: check for correct busy state before switching to a new 
task
  target/i386/tcg: use X86Access for TSS access
  target/i386/tcg: save current task state before loading new one

Richard Henderson (3):
  target/i386/tcg: Remove SEG_ADDL
  target/i386/tcg: Reorg push/pop within seg_helper.c
  target/i386/tcg: Introduce x86_mmu_index_{kernel_,}pl

 qapi/qom.json|  18 +-
 target/i386/cpu.h|  11 +-
 hw/i386/pc.c |   2 +-
 target/i386/cpu.c|  27 +-
 target/i386/sev.c|  87 -
 target/i386/tcg/seg_helper.c | 665 +--
 target/i386/tcg/decode-new.c.inc |   2 +-
 target/i386/tcg/emit.c.inc   |   1 +
 8 files changed, 472 insertions(+), 341 deletions(-)
-- 
2.45.2




[PULL 06/13] target/i386/tcg: Introduce x86_mmu_index_{kernel_,}pl

2024-07-14 Thread Paolo Bonzini
From: Richard Henderson 

Disconnect mmu index computation from the current pl
as stored in env->hflags.

Signed-off-by: Richard Henderson 
Link: 
https://lore.kernel.org/r/20240617161210.4639-2-richard.hender...@linaro.org
Signed-off-by: Paolo Bonzini 
---
 target/i386/cpu.h | 11 ++-
 target/i386/cpu.c | 27 ---
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index c43ac01c794..1e121acef54 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2445,15 +2445,8 @@ static inline bool is_mmu_index_32(int mmu_index)
 return mmu_index & 1;
 }
 
-static inline int cpu_mmu_index_kernel(CPUX86State *env)
-{
-int mmu_index_32 = (env->hflags & HF_LMA_MASK) ? 0 : 1;
-int mmu_index_base =
-!(env->hflags & HF_SMAP_MASK) ? MMU_KNOSMAP64_IDX :
-((env->hflags & HF_CPL_MASK) < 3 && (env->eflags & AC_MASK)) ? 
MMU_KNOSMAP64_IDX : MMU_KSMAP64_IDX;
-
-return mmu_index_base + mmu_index_32;
-}
+int x86_mmu_index_pl(CPUX86State *env, unsigned pl);
+int cpu_mmu_index_kernel(CPUX86State *env);
 
 #define CC_DST  (env->cc_dst)
 #define CC_SRC  (env->cc_src)
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index c05765eeafc..4688d140c2d 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -8122,18 +8122,39 @@ static bool x86_cpu_has_work(CPUState *cs)
 return x86_cpu_pending_interrupt(cs, cs->interrupt_request) != 0;
 }
 
-static int x86_cpu_mmu_index(CPUState *cs, bool ifetch)
+int x86_mmu_index_pl(CPUX86State *env, unsigned pl)
 {
-CPUX86State *env = cpu_env(cs);
 int mmu_index_32 = (env->hflags & HF_CS64_MASK) ? 0 : 1;
 int mmu_index_base =
-(env->hflags & HF_CPL_MASK) == 3 ? MMU_USER64_IDX :
+pl == 3 ? MMU_USER64_IDX :
 !(env->hflags & HF_SMAP_MASK) ? MMU_KNOSMAP64_IDX :
 (env->eflags & AC_MASK) ? MMU_KNOSMAP64_IDX : MMU_KSMAP64_IDX;
 
 return mmu_index_base + mmu_index_32;
 }
 
+static int x86_cpu_mmu_index(CPUState *cs, bool ifetch)
+{
+CPUX86State *env = cpu_env(cs);
+return x86_mmu_index_pl(env, env->hflags & HF_CPL_MASK);
+}
+
+static int x86_mmu_index_kernel_pl(CPUX86State *env, unsigned pl)
+{
+int mmu_index_32 = (env->hflags & HF_LMA_MASK) ? 0 : 1;
+int mmu_index_base =
+!(env->hflags & HF_SMAP_MASK) ? MMU_KNOSMAP64_IDX :
+(pl < 3 && (env->eflags & AC_MASK)
+ ? MMU_KNOSMAP64_IDX : MMU_KSMAP64_IDX);
+
+return mmu_index_base + mmu_index_32;
+}
+
+int cpu_mmu_index_kernel(CPUX86State *env)
+{
+return x86_mmu_index_kernel_pl(env, env->hflags & HF_CPL_MASK);
+}
+
 static void x86_disas_set_info(CPUState *cs, disassemble_info *info)
 {
 X86CPU *cpu = X86_CPU(cs);
-- 
2.45.2




[PULL 05/13] target/i386/tcg: Reorg push/pop within seg_helper.c

2024-07-14 Thread Paolo Bonzini
From: Richard Henderson 

Interrupts and call gates should use accesses with the DPL as
the privilege level.  While computing the applicable MMU index
is easy, the harder thing is how to plumb it in the code.

One possibility could be to add a single argument to the PUSH* macros
for the privilege level, but this is repetitive and risks confusion
between the involved privilege levels.

Another possibility is to pass both CPL and DPL, and adjusting both
PUSH* and POP* to use specific privilege levels (instead of using
cpu_{ld,st}*_data). This makes the code more symmetric.

However, a more complicated but much nicer approach is to use a structure
to contain the stack parameters, env, unwind return address, and rewrite
the macros into functions.  The struct provides an easy home for the MMU
index as well.
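
A caller then fills the struct once and every push/pop picks the
parameters up from there (a sketch in the style of the helpers added
below; the error-code push is just an example):

    StackAccess sa;

    sa.env = env;
    sa.ra = GETPC();
    sa.ss_base = env->segs[R_SS].base;
    sa.sp = env->regs[R_ESP];
    sa.sp_mask = get_sp_mask(env->segs[R_SS].flags);

    pushl(&sa, error_code);
    SET_ESP(sa.sp, sa.sp_mask);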

Signed-off-by: Richard Henderson 
Link: 
https://lore.kernel.org/r/20240617161210.4639-4-richard.hender...@linaro.org
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 481 +++
 1 file changed, 259 insertions(+), 222 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index b985382d704..b6902ca3fba 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -28,6 +28,68 @@
 #include "helper-tcg.h"
 #include "seg_helper.h"
 
+#ifdef TARGET_X86_64
+#define SET_ESP(val, sp_mask)   \
+do {\
+if ((sp_mask) == 0xffff) {  \
+env->regs[R_ESP] = (env->regs[R_ESP] & ~0xffff) |   \
+((val) & 0xffff);   \
+} else if ((sp_mask) == 0xffffffffLL) { \
+env->regs[R_ESP] = (uint32_t)(val); \
+} else {\
+env->regs[R_ESP] = (val);   \
+}   \
+} while (0)
+#else
+#define SET_ESP(val, sp_mask)   \
+do {\
+env->regs[R_ESP] = (env->regs[R_ESP] & ~(sp_mask)) |\
+((val) & (sp_mask));\
+} while (0)
+#endif
+
+/* XXX: use mmu_index to have proper DPL support */
+typedef struct StackAccess
+{
+CPUX86State *env;
+uintptr_t ra;
+target_ulong ss_base;
+target_ulong sp;
+target_ulong sp_mask;
+} StackAccess;
+
+static void pushw(StackAccess *sa, uint16_t val)
+{
+sa->sp -= 2;
+cpu_stw_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->ra);
+}
+
+static void pushl(StackAccess *sa, uint32_t val)
+{
+sa->sp -= 4;
+cpu_stl_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->ra);
+}
+
+static uint16_t popw(StackAccess *sa)
+{
+uint16_t ret = cpu_lduw_data_ra(sa->env,
+sa->ss_base + (sa->sp & sa->sp_mask),
+sa->ra);
+sa->sp += 2;
+return ret;
+}
+
+static uint32_t popl(StackAccess *sa)
+{
+uint32_t ret = cpu_ldl_data_ra(sa->env,
+   sa->ss_base + (sa->sp & sa->sp_mask),
+   sa->ra);
+sa->sp += 4;
+return ret;
+}
+
 int get_pg_mode(CPUX86State *env)
 {
 int pg_mode = 0;
@@ -559,68 +621,19 @@ int exception_has_error_code(int intno)
 return 0;
 }
 
-#ifdef TARGET_X86_64
-#define SET_ESP(val, sp_mask)   \
-do {\
-if ((sp_mask) == 0x) {  \
-env->regs[R_ESP] = (env->regs[R_ESP] & ~0x) |   \
-((val) & 0x);   \
-} else if ((sp_mask) == 0xLL) { \
-env->regs[R_ESP] = (uint32_t)(val); \
-} else {\
-env->regs[R_ESP] = (val);   \
-}   \
-} while (0)
-#else
-#define SET_ESP(val, sp_mask)   \
-do {\
-env->regs[R_ESP] = (env->regs[R_ESP] & ~(sp_mask)) |\
-((val) & (sp_mask));\
-} while (0)
-#endif
-
-/* XXX: add a is_user flag to have proper security support */
-#define PUSHW_RA(ssp, sp, sp_mask, val, ra)  \
-{\
-sp -= 2

[PULL 07/13] target/i386/tcg: Compute MMU index once

2024-07-14 Thread Paolo Bonzini
Add the MMU index to the StackAccess struct, so that it can be cached
or (in the next patch) computed from information that is not in
CPUX86State.

Co-developed-by: Richard Henderson 
Signed-off-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 35 ++-
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index b6902ca3fba..8a6d92b3583 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -56,36 +56,37 @@ typedef struct StackAccess
 target_ulong ss_base;
 target_ulong sp;
 target_ulong sp_mask;
+int mmu_index;
 } StackAccess;
 
 static void pushw(StackAccess *sa, uint16_t val)
 {
 sa->sp -= 2;
-cpu_stw_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
-  val, sa->ra);
+cpu_stw_mmuidx_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->mmu_index, sa->ra);
 }
 
 static void pushl(StackAccess *sa, uint32_t val)
 {
 sa->sp -= 4;
-cpu_stl_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
-  val, sa->ra);
+cpu_stl_mmuidx_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->mmu_index, sa->ra);
 }
 
 static uint16_t popw(StackAccess *sa)
 {
-uint16_t ret = cpu_lduw_data_ra(sa->env,
-sa->ss_base + (sa->sp & sa->sp_mask),
-sa->ra);
+uint16_t ret = cpu_lduw_mmuidx_ra(sa->env,
+  sa->ss_base + (sa->sp & sa->sp_mask),
+  sa->mmu_index, sa->ra);
 sa->sp += 2;
 return ret;
 }
 
 static uint32_t popl(StackAccess *sa)
 {
-uint32_t ret = cpu_ldl_data_ra(sa->env,
-   sa->ss_base + (sa->sp & sa->sp_mask),
-   sa->ra);
+uint32_t ret = cpu_ldl_mmuidx_ra(sa->env,
+ sa->ss_base + (sa->sp & sa->sp_mask),
+ sa->mmu_index, sa->ra);
 sa->sp += 4;
 return ret;
 }
@@ -677,6 +678,7 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 
 sa.env = env;
 sa.ra = 0;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (type == 5) {
 /* task gate */
@@ -858,12 +860,12 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 static void pushq(StackAccess *sa, uint64_t val)
 {
 sa->sp -= 8;
-cpu_stq_kernel_ra(sa->env, sa->sp, val, sa->ra);
+cpu_stq_mmuidx_ra(sa->env, sa->sp, val, sa->mmu_index, sa->ra);
 }
 
 static uint64_t popq(StackAccess *sa)
 {
-uint64_t ret = cpu_ldq_data_ra(sa->env, sa->sp, sa->ra);
+uint64_t ret = cpu_ldq_mmuidx_ra(sa->env, sa->sp, sa->mmu_index, sa->ra);
 sa->sp += 8;
 return ret;
 }
@@ -982,6 +984,7 @@ static void do_interrupt64(CPUX86State *env, int intno, int 
is_int,
 
 sa.env = env;
 sa.ra = 0;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 sa.sp_mask = -1;
 sa.ss_base = 0;
 if (dpl < cpl || ist != 0) {
@@ -1116,6 +1119,7 @@ static void do_interrupt_real(CPUX86State *env, int 
intno, int is_int,
 sa.sp = env->regs[R_ESP];
sa.sp_mask = 0xffff;
 sa.ss_base = env->segs[R_SS].base;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (is_int) {
 old_eip = next_eip;
@@ -1579,6 +1583,7 @@ void helper_lcall_real(CPUX86State *env, uint32_t new_cs, 
uint32_t new_eip,
 sa.sp = env->regs[R_ESP];
 sa.sp_mask = get_sp_mask(env->segs[R_SS].flags);
 sa.ss_base = env->segs[R_SS].base;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (shift) {
pushl(&sa, env->segs[R_CS].selector);
@@ -1618,6 +1623,7 @@ void helper_lcall_protected(CPUX86State *env, int new_cs, 
target_ulong new_eip,
 
 sa.env = env;
 sa.ra = GETPC();
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (e2 & DESC_S_MASK) {
 if (!(e2 & DESC_CS_MASK)) {
@@ -1905,6 +1911,7 @@ void helper_iret_real(CPUX86State *env, int shift)
 
 sa.env = env;
 sa.ra = GETPC();
+sa.mmu_index = x86_mmu_index_pl(env, 0);
+sa.sp_mask = 0xffff; /* XXX: use SS segment size? */
 sa.sp = env->regs[R_ESP];
 sa.ss_base = env->segs[R_SS].base;
@@ -1976,8 +1983,11 @@ static inline void helper_ret_protected(CPUX86State 
*env, int shift,
 target_ulong new_eip, new_esp;
 StackAccess sa;
 
+cpl = env->hflags & HF_CPL_MASK;
+
 sa.env = env;
 sa.ra = retaddr;
+sa.mmu_index = x86_mmu_index_pl(env, cpl);
 
 #ifdef TARGET_X86_64
 if (shift == 2) {
@@ -2032,7 +2042,6 @@

[PULL 13/13] Revert "qemu-char: do not operate on sources from finalize callbacks"

2024-07-14 Thread Paolo Bonzini
From: Sergey Dyasli 

This reverts commit 2b316774f60291f57ca9ecb6a9f0712c532cae34.

After 038b4217884c ("Revert "chardev: use a child source for qio input
source"") we've been observing the "iwp->src == NULL" assertion
triggering periodically during the initial capabilities querying by
libvirtd. One of possible backtraces:

Thread 1 (Thread 0x7f16cd4f0700 (LWP 43858)):
0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
1  0x7f16c6c21e65 in __GI_abort () at abort.c:79
2  0x7f16c6c21d39 in __assert_fail_base  at assert.c:92
3  0x7f16c6c46e86 in __GI___assert_fail 
(assertion=assertion@entry=0x562e9bcdaadd "iwp->src == NULL", 
file=file@entry=0x562e9bcdaac8 "../chardev/char-io.c", line=line@entry=99, 
function=function@entry=0x562e9bcdab10 <__PRETTY_FUNCTION__.20549> 
"io_watch_poll_finalize") at assert.c:101
4  0x562e9ba20c2c in io_watch_poll_finalize (source=) at 
../chardev/char-io.c:99
5  io_watch_poll_finalize (source=) at ../chardev/char-io.c:88
6  0x7f16c904aae0 in g_source_unref_internal () from /lib64/libglib-2.0.so.0
7  0x7f16c904baf9 in g_source_destroy_internal () from 
/lib64/libglib-2.0.so.0
8  0x562e9ba20db0 in io_remove_watch_poll (source=0x562e9d6720b0) at 
../chardev/char-io.c:147
9  remove_fd_in_watch (chr=chr@entry=0x562e9d5f3800) at ../chardev/char-io.c:153
10 0x562e9ba23ffb in update_ioc_handlers (s=0x562e9d5f3800) at 
../chardev/char-socket.c:592
11 0x562e9ba2072f in qemu_chr_fe_set_handlers_full at 
../chardev/char-fe.c:279
12 0x562e9ba207a9 in qemu_chr_fe_set_handlers at ../chardev/char-fe.c:304
13 0x562e9ba2ca75 in monitor_qmp_setup_handlers_bh (opaque=0x562e9d4c2c60) 
at ../monitor/qmp.c:509
14 0x562e9bb6222e in aio_bh_poll (ctx=ctx@entry=0x562e9d4c2f20) at 
../util/async.c:216
15 0x562e9bb4de0a in aio_poll (ctx=0x562e9d4c2f20, 
blocking=blocking@entry=true) at ../util/aio-posix.c:722
16 0x562e9b99dfaa in iothread_run (opaque=0x562e9d4c26f0) at 
../iothread.c:63
17 0x562e9bb505a4 in qemu_thread_start (args=0x562e9d4c7ea0) at 
../util/qemu-thread-posix.c:543
18 0x7f16c70081ca in start_thread (arg=) at 
pthread_create.c:479
19 0x7f16c6c398d3 in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

io_remove_watch_poll(), which makes sure that iwp->src is NULL, calls
g_source_destroy() which finds that iwp->src is not NULL in the finalize
callback. This can only happen if another thread has managed to trigger
io_watch_poll_prepare() callback in the meantime.
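
The interleaving looks roughly like this (an illustrative timeline of
the race with the old code, where io_remove_watch_poll() cleared
iwp->src itself before destroying the parent source):

    /* iothread                         main-context thread
     * --------                         -------------------
     * io_remove_watch_poll()
     *   g_source_destroy(iwp->src)
     *   iwp->src = NULL
     *                                  io_watch_poll_prepare()
     *                                    iwp->src = qio_channel_create_watch(...)
     * g_source_destroy(&iwp->parent)
     *   -> io_watch_poll_finalize()
     *        assert(iwp->src == NULL)  <-- fires
     */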

Move iwp->src destruction back to the finalize callback to prevent the
described race, and also remove the stale comment. The deadlock glib bug
was fixed back in 2010 by b35820285668 ("gmain: move finalization of
GSource outside of context lock").

Suggested-by: Paolo Bonzini 
Signed-off-by: Sergey Dyasli 
Link: 
https://lore.kernel.org/r/20240712092659.216206-1-sergey.dya...@nutanix.com
Signed-off-by: Paolo Bonzini 
---
 chardev/char-io.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/chardev/char-io.c b/chardev/char-io.c
index dab77b112e3..3be17b51ca5 100644
--- a/chardev/char-io.c
+++ b/chardev/char-io.c
@@ -87,16 +87,12 @@ static gboolean io_watch_poll_dispatch(GSource *source, 
GSourceFunc callback,
 
 static void io_watch_poll_finalize(GSource *source)
 {
-/*
- * Due to a glib bug, removing the last reference to a source
- * inside a finalize callback causes recursive locking (and a
- * deadlock).  This is not a problem inside other callbacks,
- * including dispatch callbacks, so we call io_remove_watch_poll
- * to remove this source.  At this point, iwp->src must
- * be NULL, or we would leak it.
- */
 IOWatchPoll *iwp = io_watch_poll_from_source(source);
-assert(iwp->src == NULL);
+if (iwp->src) {
+g_source_destroy(iwp->src);
+g_source_unref(iwp->src);
+iwp->src = NULL;
+}
 }
 
 static GSourceFuncs io_watch_poll_funcs = {
@@ -139,11 +135,6 @@ static void io_remove_watch_poll(GSource *source)
 IOWatchPoll *iwp;
 
 iwp = io_watch_poll_from_source(source);
-if (iwp->src) {
-g_source_destroy(iwp->src);
-g_source_unref(iwp->src);
-iwp->src = NULL;
-}
g_source_destroy(&iwp->parent);
 }
 
-- 
2.45.2




[PULL 08/13] target/i386/tcg: Use DPL-level accesses for interrupts and call gates

2024-07-14 Thread Paolo Bonzini
This fixes a bug wherein i386/tcg assumed a control transfer using
the CALL or JMP instructions was always going from kernel or user mode to
kernel mode when using a call gate. This assumption is violated if
the call gate has a DPL that is greater than 0.

In addition, the stack accesses should count as explicit, not implicit
("kernel" in QEMU code), so that SMAP is not applied if DPL=3.

Analyzed-by: Robert R. Henry 
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/249
Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 8a6d92b3583..809ee3d9833 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -678,7 +678,7 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 
 sa.env = env;
 sa.ra = 0;
-sa.mmu_index = cpu_mmu_index_kernel(env);
+sa.mmu_index = x86_mmu_index_pl(env, dpl);
 
 if (type == 5) {
 /* task gate */
@@ -984,7 +984,7 @@ static void do_interrupt64(CPUX86State *env, int intno, int 
is_int,
 
 sa.env = env;
 sa.ra = 0;
-sa.mmu_index = cpu_mmu_index_kernel(env);
+sa.mmu_index = x86_mmu_index_pl(env, dpl);
 sa.sp_mask = -1;
 sa.ss_base = 0;
 if (dpl < cpl || ist != 0) {
@@ -1119,7 +1119,7 @@ static void do_interrupt_real(CPUX86State *env, int 
intno, int is_int,
 sa.sp = env->regs[R_ESP];
sa.sp_mask = 0xffff;
 sa.ss_base = env->segs[R_SS].base;
-sa.mmu_index = cpu_mmu_index_kernel(env);
+sa.mmu_index = x86_mmu_index_pl(env, 0);
 
 if (is_int) {
 old_eip = next_eip;
@@ -1583,7 +1583,7 @@ void helper_lcall_real(CPUX86State *env, uint32_t new_cs, 
uint32_t new_eip,
 sa.sp = env->regs[R_ESP];
 sa.sp_mask = get_sp_mask(env->segs[R_SS].flags);
 sa.ss_base = env->segs[R_SS].base;
-sa.mmu_index = cpu_mmu_index_kernel(env);
+sa.mmu_index = x86_mmu_index_pl(env, 0);
 
 if (shift) {
pushl(&sa, env->segs[R_CS].selector);
@@ -1619,17 +1619,17 @@ void helper_lcall_protected(CPUX86State *env, int 
new_cs, target_ulong new_eip,
 raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, GETPC());
 }
 cpl = env->hflags & HF_CPL_MASK;
+dpl = (e2 >> DESC_DPL_SHIFT) & 3;
 LOG_PCALL("desc=%08x:%08x\n", e1, e2);
 
 sa.env = env;
 sa.ra = GETPC();
-sa.mmu_index = cpu_mmu_index_kernel(env);
+sa.mmu_index = x86_mmu_index_pl(env, dpl);
 
 if (e2 & DESC_S_MASK) {
 if (!(e2 & DESC_CS_MASK)) {
 raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, GETPC());
 }
-dpl = (e2 >> DESC_DPL_SHIFT) & 3;
 if (e2 & DESC_C_MASK) {
 /* conforming code segment */
 if (dpl > cpl) {
@@ -1691,7 +1691,6 @@ void helper_lcall_protected(CPUX86State *env, int new_cs, 
target_ulong new_eip,
 } else {
 /* check gate type */
 type = (e2 >> DESC_TYPE_SHIFT) & 0x1f;
-dpl = (e2 >> DESC_DPL_SHIFT) & 3;
 rpl = new_cs & 3;
 
 #ifdef TARGET_X86_64
-- 
2.45.2




[PULL 09/13] target/i386/tcg: check for correct busy state before switching to a new task

2024-07-14 Thread Paolo Bonzini
This step is listed in the Intel manual: "Checks that the new task is available
(call, jump, exception, or interrupt) or busy (IRET return)".

The AMD manual lists the same operation under the "Preventing recursion"
paragraph of "12.3.4 Nesting Tasks", though it is not clear if the processor
checks the busy bit in the IRET case.

Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 809ee3d9833..0242f9d8b58 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -369,6 +369,11 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
 old_tss_limit_max = 43;
 }
 
+/* new TSS must be busy iff the source is an IRET instruction  */
+if (!!(e2 & DESC_TSS_BUSY_MASK) != (source == SWITCH_TSS_IRET)) {
+raise_exception_err_ra(env, EXCP0A_TSS, tss_selector & 0xfffc, 
retaddr);
+}
+
 /* read all the registers from the new TSS */
 if (type & 8) {
 /* 32 bit */
-- 
2.45.2




[PULL 01/13] target/i386/tcg: fix POP to memory in long mode

2024-07-14 Thread Paolo Bonzini
In long mode, POP to memory will write a full 64-bit value.  However,
the call to gen_writeback() in gen_POP will use MO_32 because the
decoding table is incorrect.

The bug was latent until commit aea49fbb01a ("target/i386: use gen_writeback()
within gen_POP()", 2024-06-08), and then became visible because gen_op_st_v
now receives op->ot instead of the "ot" returned by gen_pop_T0.

Analyzed-by: Clément Chigot 
Fixes: 5e9e21bcc4d ("target/i386: move 60-BF opcodes to new decoder", 
2024-05-07)
Tested-by: Clément Chigot 
Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/decode-new.c.inc | 2 +-
 target/i386/tcg/emit.c.inc   | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 0d846c32c22..d2da1d396d5 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -1717,7 +1717,7 @@ static const X86OpEntry opcodes_root[256] = {
 [0x8C] = X86_OP_ENTRYwr(MOV, E,v, S,w, op0_Mw),
 [0x8D] = X86_OP_ENTRYwr(LEA, G,v, M,v, nolea),
 [0x8E] = X86_OP_ENTRYwr(MOV, S,w, E,w),
-[0x8F] = X86_OP_GROUPw(group1A, E,v),
+[0x8F] = X86_OP_GROUPw(group1A, E,d64),
 
 [0x98] = X86_OP_ENTRY1(CBW,0,v), /* rAX */
 [0x99] = X86_OP_ENTRYwr(CWD,   2,v, 0,v), /* rDX, rAX */
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index fc7477833bc..016dce81464 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -2788,6 +2788,7 @@ static void gen_POP(DisasContext *s, X86DecodedInsn 
*decode)
 X86DecodedOp *op = &decode->op[0];
 MemOp ot = gen_pop_T0(s);
 
+assert(ot >= op->ot);
 if (op->has_ea || op->unit == X86_OP_SEG) {
 /* NOTE: order is important for MMU exceptions */
 gen_writeback(s, decode, 0, s->T0);
-- 
2.45.2




[PULL 04/13] target/i386/tcg: use PUSHL/PUSHW for error code

2024-07-14 Thread Paolo Bonzini
Do not pre-decrement esp, let the macros subtract the appropriate
operand size.
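For reference, the push macros at this point already tie the decrement to
the width of the adjacent store; a condensed sketch of their shape (the
full definitions, with explicit return addresses, are in seg_helper.c):

#define PUSHW(ssp, sp, sp_mask, val)                              \
    {                                                             \
        sp -= 2;  /* the macro subtracts its own operand size */  \
        cpu_stw_kernel(env, (ssp) + ((sp) & (sp_mask)), (val));   \
    }

#define PUSHL(ssp, sp, sp_mask, val)                              \
    {                                                             \
        sp -= 4;                                                  \
        cpu_stl_kernel(env, (ssp) + ((sp) & (sp_mask)), (val));   \
    }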

Reviewed-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 224e73e9ed0..b985382d704 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -670,22 +670,20 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 }
 shift = switch_tss(env, intno * 8, e1, e2, SWITCH_TSS_CALL, old_eip);
 if (has_error_code) {
-uint32_t mask;
-
 /* push the error code */
 if (env->segs[R_SS].flags & DESC_B_MASK) {
-mask = 0xffffffff;
+sp_mask = 0xffffffff;
 } else {
-mask = 0xffff;
+sp_mask = 0xffff;
 }
-esp = (env->regs[R_ESP] - (2 << shift)) & mask;
-ssp = env->segs[R_SS].base + esp;
+esp = env->regs[R_ESP];
+ssp = env->segs[R_SS].base;
 if (shift) {
-cpu_stl_kernel(env, ssp, error_code);
+PUSHL(ssp, esp, sp_mask, error_code);
 } else {
-cpu_stw_kernel(env, ssp, error_code);
+PUSHW(ssp, esp, sp_mask, error_code);
 }
-SET_ESP(esp, mask);
+SET_ESP(esp, sp_mask);
 }
 return;
 }
-- 
2.45.2




Re: [PATCH] chardev: add a mutex to protect IOWatchPoll::src

2024-07-11 Thread Paolo Bonzini

On 7/11/24 11:51, Sergey Dyasli wrote:

After 038b4217884c ("Revert "chardev: use a child source for qio input
source"") we've been observing the "iwp->src == NULL" assertion
triggering periodically during the initial capabilities querying by
libvirtd. One of possible backtraces:


Hi Sergey,

thanks for the analysis!

I noticed however that this comment is really old; it was added from 
commit 2b316774f60 ("qemu-char: do not operate on sources from finalize 
callbacks", 2013-04-22):


/* Due to a glib bug, removing the last reference to a source
 * inside a finalize callback causes recursive locking (and a
 * deadlock).  This is not a problem inside other callbacks,
 * including dispatch callbacks, so we call io_remove_watch_poll
 * to remove this source.  At this point, iwp->src must
 * be NULL, or we would leak it.
 *
 * This would be solved much more elegantly by child sources,
 * but we support older glib versions that do not have them.
 */

and the original mailing list message points to a problem on RHEL6 and 
Wheezy, which were both relatively old in 2013.  And in fact the issue 
with finalize had been fixed in glib in 2010:


commit b358202856682e5cdefb0b4b8aaed3a45d9a85fa
Author: Dan Winship 
Date:   Sat Nov 6 09:35:25 2010 -0400

gmain: move finalization of GSource outside of context lock

This avoids ugly deadlock situations such as in
https://bugzilla.gnome.org/show_bug.cgi?id=586432

https://bugzilla.gnome.org/show_bug.cgi?id=626702

https://bugzilla.gnome.org/show_bug.cgi?id=634239

diff --git a/glib/gmain.c b/glib/gmain.c
index b182c6607..301adb0a7 100644
--- a/glib/gmain.c
+++ b/glib/gmain.c
@@ -1520,7 +1520,13 @@ g_source_unref_internal (GSource  *source,
g_source_list_remove (source, context);

   if (source->source_funcs->finalize)
-   source->source_funcs->finalize (source);
+   {
+ if (context)
+   UNLOCK_CONTEXT (context);
+ source->source_funcs->finalize (source);
+ if (context)
+   LOCK_CONTEXT (context);
+   }

   g_free (source->name);
   source->name = NULL;

So I think we should just revert commit 2b316774f60, which is not hard 
to do (if it works) even if the code has since moved from qemu-char.c to 
chardev/char-io.c.
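With the glib fix above, a finalize callback may drop the last reference
to another source without deadlocking, so the revert can let finalize
clean up iwp->src directly.  A sketch of the resulting shape (illustrative
only, not the literal char-io.c hunk; io_watch_poll_from_source() is the
existing container-of helper):

static void io_watch_poll_finalize(GSource *source)
{
    IOWatchPoll *iwp = io_watch_poll_from_source(source);
    if (iwp->src) {
        /* Safe since the 2010 glib fix: finalize runs outside the
         * context lock, so destroying the child source here cannot
         * recurse into the lock. */
        g_source_destroy(iwp->src);
        g_source_unref(iwp->src);
        iwp->src = NULL;
    }
}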


Thanks,

Paolo


Thread 1 (Thread 0x7f16cd4f0700 (LWP 43858)):
0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
1  0x7f16c6c21e65 in __GI_abort () at abort.c:79
2  0x7f16c6c21d39 in __assert_fail_base  at assert.c:92
3  0x7f16c6c46e86 in __GI___assert_fail (assertion=assertion@entry=0x562e9bcdaadd "iwp->src == NULL", 
file=file@entry=0x562e9bcdaac8 "../chardev/char-io.c", line=line@entry=99, 
function=function@entry=0x562e9bcdab10 <__PRETTY_FUNCTION__.20549> "io_watch_poll_finalize") at 
assert.c:101
4  0x562e9ba20c2c in io_watch_poll_finalize (source=) at 
../chardev/char-io.c:99
5  io_watch_poll_finalize (source=) at ../chardev/char-io.c:88
6  0x7f16c904aae0 in g_source_unref_internal () from /lib64/libglib-2.0.so.0
7  0x7f16c904baf9 in g_source_destroy_internal () from 
/lib64/libglib-2.0.so.0
8  0x562e9ba20db0 in io_remove_watch_poll (source=0x562e9d6720b0) at 
../chardev/char-io.c:147
9  remove_fd_in_watch (chr=chr@entry=0x562e9d5f3800) at ../chardev/char-io.c:153
10 0x562e9ba23ffb in update_ioc_handlers (s=0x562e9d5f3800) at 
../chardev/char-socket.c:592
11 0x562e9ba2072f in qemu_chr_fe_set_handlers_full at 
../chardev/char-fe.c:279
12 0x562e9ba207a9 in qemu_chr_fe_set_handlers at ../chardev/char-fe.c:304
13 0x562e9ba2ca75 in monitor_qmp_setup_handlers_bh (opaque=0x562e9d4c2c60) 
at ../monitor/qmp.c:509
14 0x562e9bb6222e in aio_bh_poll (ctx=ctx@entry=0x562e9d4c2f20) at 
../util/async.c:216
15 0x562e9bb4de0a in aio_poll (ctx=0x562e9d4c2f20, 
blocking=blocking@entry=true) at ../util/aio-posix.c:722
16 0x562e9b99dfaa in iothread_run (opaque=0x562e9d4c26f0) at 
../iothread.c:63
17 0x562e9bb505a4 in qemu_thread_start (args=0x562e9d4c7ea0) at 
../util/qemu-thread-posix.c:543
18 0x7f16c70081ca in start_thread (arg=) at 
pthread_create.c:479
19 0x7f16c6c398d3 in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

io_remove_watch_poll(), which makes sure that iwp->src is NULL, calls
g_source_destroy() which finds that iwp->src is not NULL in the finalize
callback. This can only happen if another thread has managed to trigger
io_watch_poll_prepare() callback in the meantime.

Introduce a mutex and a boolean variable to prevent other threads
creating a watch in io_watch_poll_prepare() in case that the IOWatchPoll
itself is about to get destroyed.

Signed-off-by: Sergey Dyasli 
---
  chardev/char-io.c | 21 +
  1 file changed, 21 insertions(+)

diff --git a/chardev/char-io.c b/chardev/char-io.c
index dab77b112e35..b1edccf0cc85 100644
--- a/chardev/char-io.c
+++ b/chardev/char-io.c
@@ -34,6 +34,9 @@ 

Re: [PATCH 09/10] target/i386/tcg: use X86Access for TSS access

2024-07-11 Thread Paolo Bonzini

On 7/10/24 20:40, Paolo Bonzini wrote:



On Wed, Jul 10, 2024 at 18:47 Richard Henderson <richard.hender...@linaro.org> wrote:


On 7/9/24 23:29, Paolo Bonzini wrote:
 > This takes care of probing the vaddr range in advance, and is
also faster
 > because it avoids repeated TLB lookups.  It also matches the
Intel manual
 > better, as it says "Checks that the current (old) TSS, new TSS,
and all
 > segment descriptors used in the task switch are paged into system
memory";
 > note however that it's not clear how the processor checks for segment
 > descriptors, and this check is not included in the AMD manual.
 >
 > Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
 > ---
 >   target/i386/tcg/seg_helper.c | 101
++-
 >   1 file changed, 51 insertions(+), 50 deletions(-)
 >
 > diff --git a/target/i386/tcg/seg_helper.c
b/target/i386/tcg/seg_helper.c
 > index 25af9d4a4ec..77f2c65c3cf 100644
 > --- a/target/i386/tcg/seg_helper.c
 > +++ b/target/i386/tcg/seg_helper.c
 > @@ -311,35 +313,44 @@ static int switch_tss_ra(CPUX86State *env,
int tss_selector,
 >           raise_exception_err_ra(env, EXCP0A_TSS, tss_selector &
0xfffc, retaddr);
 >       }
 >
 > +    /* X86Access avoids memory exceptions during the task switch */
 > +    access_prepare_mmu(&old, env, env->tr.base, old_tss_limit_max,
 > +                    MMU_DATA_STORE, cpu_mmu_index_kernel(env),
retaddr);
 > +
 > +    if (source == SWITCH_TSS_CALL) {
 > +        /* Probe for future write of parent task */
 > +        probe_access(env, tss_base, 2, MMU_DATA_STORE,
 > +                  cpu_mmu_index_kernel(env), retaddr);
 > +    }
 > +    access_prepare_mmu(&new, env, tss_base, tss_limit,
 > +                    MMU_DATA_LOAD, cpu_mmu_index_kernel(env),
retaddr);

You're computing cpu_mmu_index_kernel 3 times.


Squashing this in (easier to review than the whole thing):

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 4123ff1245e..4edfd26135f 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -321,7 +321,7 @@ static void switch_tss_ra(CPUX86State *env, int 
tss_selector,
 uint32_t new_eflags, new_eip, new_cr3, new_ldt, new_trap;
 uint32_t old_eflags, eflags_mask;
 SegmentCache *dt;
-int index;
+int mmu_index, index;
 target_ulong ptr;
 X86Access old, new;
 
@@ -378,16 +378,17 @@ static void switch_tss_ra(CPUX86State *env, int tss_selector,

 }
 
 /* X86Access avoids memory exceptions during the task switch */

+mmu_index = cpu_mmu_index_kernel(env);
 access_prepare_mmu(&old, env, env->tr.base, old_tss_limit_max,
-  MMU_DATA_STORE, cpu_mmu_index_kernel(env), retaddr);
+  MMU_DATA_STORE, mmu_index, retaddr);
 
 if (source == SWITCH_TSS_CALL) {

 /* Probe for future write of parent task */
 probe_access(env, tss_base, 2, MMU_DATA_STORE,
-cpu_mmu_index_kernel(env), retaddr);
+mmu_index, retaddr);
 }
 access_prepare_mmu(&new, env, tss_base, tss_limit,
-  MMU_DATA_LOAD, cpu_mmu_index_kernel(env), retaddr);
+  MMU_DATA_LOAD, mmu_index, retaddr);
 
 /* read all the registers from the new TSS */

 if (type & 8) {
@@ -468,7 +469,11 @@ static void switch_tss_ra(CPUX86State *env, int 
tss_selector,
context */
 
 if (source == SWITCH_TSS_CALL) {

-cpu_stw_kernel_ra(env, tss_base, env->tr.selector, retaddr);
+/*
+ * Thanks to the probe_access above, we know the first two
+ * bytes addressed by &new are writable too.
+ */
+access_stw(&new, tss_base, env->tr.selector);
 new_eflags |= NT_MASK;
 }
 
Paolo





Re: Disassembler location

2024-07-10 Thread Paolo Bonzini

On 7/10/24 20:02, Michael Morrell wrote:

I'm working on a port to a new architecture and was noticing a
discrepancy in where the disassembler code lives.  There is a file
"target//disas.c" for 4 architectures (avr, loongarch,
openrisc, and rx), but a file "disas/.c" for 14 architectures
(if I counted right).  It seems the 4 architectures using
"target//disas.c" are more recently added so I was wondering if
that is now the preferred location.  I couldn't find information on
this, but I wasn't sure where to look.


loongarch puts it in target/<arch>/ because it reuses some code between 
disassembler and translator.


The others are not hosts, only targets.  By putting the file in 
target/<arch>/, they do not need to add it to the "disassemblers" 
variable in meson.build---but they add it anyway. :)


All in all, if you're not in the loongarch situation I'd put it in disas/.

Paolo




Re: [PATCH 00/10] target/i386/tcg: fixes for seg_helper.c

2024-07-10 Thread Paolo Bonzini
On Wed, Jul 10, 2024 at 23:01 Robert Henry  wrote:

> I have only skimmed the diffs.  Your knowledge of the deep semantics,
> gained by close differential reading of intel and amd docs, is truly
> amazing.  Many thanks for pushing this through!
>

Thanks for bringing this to our attention too, apart from the practical bug
hopefully it will help future readers to have a more precise implementation.

I tried to acknowledge your contribution in the commit messages.

I have 2 nits, perhaps stylistic only.
>
> For code like "sp -= 2" or "sp += 2" followed or preceded by a write to
> the stack pointer of a uint16_t variable 'x',  would it be better/more
> robust to rewrite as: "sp -= sizeof(x)"  ?
>

I think that's intentional because the value subtracted is related to the
"stw" or "stl" in the store (likewise for incrementing after a load) more
than to the size of x.
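Compare the helper after the StackAccess rework elsewhere in this series;
the "2" is glued to the 16-bit store next to it:

static void pushw(StackAccess *sa, uint16_t val)
{
    sa->sp -= 2;   /* matches the stw below, not sizeof() of anything */
    cpu_stw_mmuidx_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
                      val, sa->mmu_index, sa->ra);
}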

There are a lot of masks constructed using -1.  I think it would be clearer
> to use 0x (for 32-bit masks) as that reminds the reader that this
> is a bit mask.  But it seems that using -1 is how the original code was
> written.
>

-1 is used for 64-bit masks only. They get unwieldy quickly. :)
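For example, the 64-bit interrupt path sets up the flat stack as:

    sa.sp_mask = -1;   /* all 64 bits set: no 16/32-bit wraparound */
    sa.ss_base = 0;

Spelling the constant out as 0xffffffffffffffff would add noise without
information.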

Paolo


> On Tue, Jul 9, 2024 at 11:29 PM Paolo Bonzini  wrote:
>
>> This includes bugfixes:
>> - allowing IRET from user mode to user mode with SMAP (do not use implicit
>>   kernel accesses, which break if the stack is in userspace)
>>
>> - use DPL-level accesses for interrupts and call gates
>>
>> - various fixes for task switching
>>
>> And two related cleanups: computing MMU index once for far calls and
>> returns
>> (including task switches), and using X86Access for TSS access.
>>
>> Tested with a really ugly patch to kvm-unit-tests, included after
>> signature.
>>
>> Paolo Bonzini (7):
>>   target/i386/tcg: Allow IRET from user mode to user mode with SMAP
>>   target/i386/tcg: use PUSHL/PUSHW for error code
>>   target/i386/tcg: Compute MMU index once
>>   target/i386/tcg: Use DPL-level accesses for interrupts and call gates
>>   target/i386/tcg: check for correct busy state before switching to a
>> new task
>>   target/i386/tcg: use X86Access for TSS access
>>   target/i386/tcg: save current task state before loading new one
>>
>> Richard Henderson (3):
>>   target/i386/tcg: Remove SEG_ADDL
>>   target/i386/tcg: Reorg push/pop within seg_helper.c
>>   target/i386/tcg: Introduce x86_mmu_index_{kernel_,}pl
>>
>>  target/i386/cpu.h|  11 +-
>>  target/i386/cpu.c|  27 +-
>>  target/i386/tcg/seg_helper.c | 606 +++
>>  3 files changed, 354 insertions(+), 290 deletions(-)
>>
>> --
>> 2.45.2
>>
>> diff --git a/lib/x86/usermode.c b/lib/x86/usermode.c
>> index c3ec0ad7..0bf40c6d 100644
>> --- a/lib/x86/usermode.c
>> +++ b/lib/x86/usermode.c
>> @@ -5,13 +5,15 @@
>>  #include "x86/desc.h"
>>  #include "x86/isr.h"
>>  #include "alloc.h"
>> +#include "alloc_page.h"
>>  #include "setjmp.h"
>>  #include "usermode.h"
>>
>>  #include "libcflat.h"
>>  #include 
>>
>> -#define USERMODE_STACK_SIZE0x2000
>> +#define USERMODE_STACK_ORDER   1 /* 8k */
>> +#define USERMODE_STACK_SIZE(1 << (12 + USERMODE_STACK_ORDER))
>>  #define RET_TO_KERNEL_IRQ  0x20
>>
>>  static jmp_buf jmpbuf;
>> @@ -37,9 +39,14 @@ uint64_t run_in_user(usermode_func func, unsigned int
>> fault_vector,
>>  {
>> extern char ret_to_kernel;
>> volatile uint64_t rax = 0;
>> -   static unsigned char user_stack[USERMODE_STACK_SIZE];
>> +   static unsigned char *user_stack;
>> handler old_ex;
>>
>> +   if (!user_stack) {
>> +   user_stack = alloc_pages(USERMODE_STACK_ORDER);
>> +   printf("%p\n", user_stack);
>> +   }
>> +
>> *raised_vector = 0;
>> set_idt_entry(RET_TO_KERNEL_IRQ, &ret_to_kernel, 3);
>> old_ex = handle_exception(fault_vector,
>> @@ -51,6 +58,8 @@ uint64_t run_in_user(usermode_func func, unsigned int
>> fault_vector,
>> return 0;
>> }
>>
>> +   memcpy(user_stack + USERMODE_STACK_SIZE - 8, &func, 8);
>> +
>> asm volatile (
>> /* Prepare kernel SP for exception handlers */
>> "mov %%rsp, %[rsp0]\n\t"
>> @@ -63,12 +72,13 @@ uint64_t run_in_user(usermode_fu

Re: [PATCH 09/10] target/i386/tcg: use X86Access for TSS access

2024-07-10 Thread Paolo Bonzini
On Wed, Jul 10, 2024 at 18:47 Richard Henderson 
wrote:

> On 7/9/24 23:29, Paolo Bonzini wrote:
> > This takes care of probing the vaddr range in advance, and is also faster
> > because it avoids repeated TLB lookups.  It also matches the Intel manual
> > better, as it says "Checks that the current (old) TSS, new TSS, and all
> > segment descriptors used in the task switch are paged into system
> memory";
> > note however that it's not clear how the processor checks for segment
> > descriptors, and this check is not included in the AMD manual.
> >
> > Signed-off-by: Paolo Bonzini 
> > ---
> >   target/i386/tcg/seg_helper.c | 101 ++-
> >   1 file changed, 51 insertions(+), 50 deletions(-)
> >
> > diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
> > index 25af9d4a4ec..77f2c65c3cf 100644
> > --- a/target/i386/tcg/seg_helper.c
> > +++ b/target/i386/tcg/seg_helper.c
> > @@ -311,35 +313,44 @@ static int switch_tss_ra(CPUX86State *env, int
> tss_selector,
> >   raise_exception_err_ra(env, EXCP0A_TSS, tss_selector & 0xfffc,
> retaddr);
> >   }
> >
> > +/* X86Access avoids memory exceptions during the task switch */
> > +access_prepare_mmu(&old, env, env->tr.base, old_tss_limit_max,
> > +MMU_DATA_STORE, cpu_mmu_index_kernel(env), retaddr);
> > +
> > +if (source == SWITCH_TSS_CALL) {
> > +/* Probe for future write of parent task */
> > +probe_access(env, tss_base, 2, MMU_DATA_STORE,
> > +  cpu_mmu_index_kernel(env), retaddr);
> > +}
> > +access_prepare_mmu(&new, env, tss_base, tss_limit,
> > +MMU_DATA_LOAD, cpu_mmu_index_kernel(env), retaddr);
>
> You're computing cpu_mmu_index_kernel 3 times.
>

Oh, indeed. Better than 30. :)

This appears to be conservative in that you're requiring only 2 bytes (a
> minimum) of 0x68
> to be writable.  Is it legal to place the TSS at offset 0xffe of page 0,
> with the balance
> on page 1, with page 0 writable and page 1 read-only?


Yes, paging is totally optional here. The only field that is written is the
link.
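(The link is the 16-bit previous-task field at offset 0 of the new TSS,
which is why the probe covers exactly 2 bytes.  In the squashed version
posted elsewhere in this thread the two lines pair up as:

    probe_access(env, tss_base, 2, MMU_DATA_STORE, mmu_index, retaddr);
    ...
    access_stw(&new, tss_base, env->tr.selector);   /* store backlink */

so the later store is guaranteed not to fault.)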

Otherwise I would think you could
> just check the entire TSS for writability.
>
> Anyway, after the MMU_DATA_STORE probe, you have proved that 'X86Access
> new' contains an
> address range that may be stored.  So you can change the SWITCH_TSS_CALL
> store below to
> access_stw() too.
>

Nice.

> -/* NOTE: we must avoid memory exceptions during the task switch,
> > -   so we make dummy accesses before */
> > -/* XXX: it can still fail in some cases, so a bigger hack is
> > -   necessary to valid the TLB after having done the accesses */
> > -
> > -v1 = cpu_ldub_kernel_ra(env, env->tr.base, retaddr);
> > -v2 = cpu_ldub_kernel_ra(env, env->tr.base + old_tss_limit_max,
> retaddr);
> > -cpu_stb_kernel_ra(env, env->tr.base, v1, retaddr);
> > -cpu_stb_kernel_ra(env, env->tr.base + old_tss_limit_max, v2,
> retaddr);
>
> OMG.
>

Haha, yeah X86Access is perfect here.

Paolo

Looks like a fantastic cleanup overall.
>
>
> r~
>
>


Re: [RFC PATCH v4 2/7] rust: add bindgen step as a meson dependency

2024-07-10 Thread Paolo Bonzini
On Wed, Jul 10, 2024 at 4:48 PM Zhao Liu  wrote:
>
> On Tue, Jul 09, 2024 at 02:28:38PM +0200, Paolo Bonzini wrote:
> >
> > Here are the stopping points that I found over the last couple weeks:
> >
> > 1.56.0: 2021 edition
> > 1.59.0: const CStr::from_bytes_with_nul_unchecked (needed by cstr
> > crate, see below)
> > 1.64.0: std::ffi::c_char
> > 1.65.0: Generic Associated Types
> > 1.74.0: Clippy can be configured in Cargo.toml
> > 1.77.0: C string literals, offset_of!
> >
> > I think 1.59.0 is pretty much the lower bound. Not having offset_of!
> > will be a bit painful, but it can be worked around (and pretty much
> > has to be, because 1.77.0 is really new).
> >
>
> An additional question: does our minimum rust version requirement
> indicate that users with this rust version can compile other
> dependencies that satisfy QEMU requirements, such as bindgen?

Yes (though in the case of bindgen, like cargo and rustc, we'll use it
from the distro; Debian bookworm has 0.60.1 so that's what we'll have
to stick with).

Paolo




[PATCH] target/i386/tcg: fix POP to memory in long mode

2024-07-10 Thread Paolo Bonzini
In long mode, POP to memory will write a full 64-bit value.  However,
the call to gen_writeback() in gen_POP will use MO_32 because the
decoding table is incorrect.

The bug was latent until commit aea49fbb01a ("target/i386: use gen_writeback()
within gen_POP()", 2024-06-08), and then became visible because gen_op_st_v
now receives op->ot instead of the "ot" returned by gen_pop_T0.

Analyzed-by: Clément Chigot 
Fixes: 5e9e21bcc4d ("target/i386: move 60-BF opcodes to new decoder", 
2024-05-07)
Tested-by: Clément Chigot 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/decode-new.c.inc | 2 +-
 target/i386/tcg/emit.c.inc   | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 0d846c32c22..d2da1d396d5 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -1717,7 +1717,7 @@ static const X86OpEntry opcodes_root[256] = {
 [0x8C] = X86_OP_ENTRYwr(MOV, E,v, S,w, op0_Mw),
 [0x8D] = X86_OP_ENTRYwr(LEA, G,v, M,v, nolea),
 [0x8E] = X86_OP_ENTRYwr(MOV, S,w, E,w),
-[0x8F] = X86_OP_GROUPw(group1A, E,v),
+[0x8F] = X86_OP_GROUPw(group1A, E,d64),
 
 [0x98] = X86_OP_ENTRY1(CBW,0,v), /* rAX */
 [0x99] = X86_OP_ENTRYwr(CWD,   2,v, 0,v), /* rDX, rAX */
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index fc7477833bc..c6c2c7257b9 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -2788,6 +2788,8 @@ static void gen_POP(DisasContext *s, X86DecodedInsn 
*decode)
 X86DecodedOp *op = &decode->op[0];
 MemOp ot = gen_pop_T0(s);
 
+/* Only 16/32-bit access in 32-bit mode, 16/64-bit access in long mode.  */
+assert(ot == op->ot);
 if (op->has_ea || op->unit == X86_OP_SEG) {
 /* NOTE: order is important for MMU exceptions */
 gen_writeback(s, decode, 0, s->T0);
-- 
2.45.2




Re: [PULL 13/42] target/i386: use gen_writeback() within gen_POP()

2024-07-10 Thread Paolo Bonzini

On 7/10/24 11:42, Clément Chigot wrote:

Hi Mark,

This patch introduces regressions in our x86_64 VxWorks kernels
running over qemu. Some page faults are triggered randomly.

Earlier to this patch, the MemOp `ot` passed to `gen_op_st_v` was the
`gen_pop_T0` created a few lines above.
Now, this is `op->ot` which comes from elsewhere.

Adding `op->ot = ot` just before calling `gen_writeback` fixes my
regressions. But I'm wondering if there could be some unexpected
fallbacks, `op->ot` possibly being used afterwards.


Mark's patch is correct and the (previously latent) mistake is in
the decoding table.

The manual correctly says that POP has "default 64-bit" operand;
that is, it is 64-bit even without a REX.W prefix.  It must be
marked as "d64" rather than "v".

Can you test this patch?

diff --git a/target/i386/tcg/decode-new.c.inc b/target/i386/tcg/decode-new.c.inc
index 0d846c32c22..d2da1d396d5 100644
--- a/target/i386/tcg/decode-new.c.inc
+++ b/target/i386/tcg/decode-new.c.inc
@@ -1717,7 +1717,7 @@ static const X86OpEntry opcodes_root[256] = {
 [0x8C] = X86_OP_ENTRYwr(MOV, E,v, S,w, op0_Mw),
 [0x8D] = X86_OP_ENTRYwr(LEA, G,v, M,v, nolea),
 [0x8E] = X86_OP_ENTRYwr(MOV, S,w, E,w),
-[0x8F] = X86_OP_GROUPw(group1A, E,v),
+[0x8F] = X86_OP_GROUPw(group1A, E,d64),
 
 [0x98] = X86_OP_ENTRY1(CBW,0,v), /* rAX */

 [0x99] = X86_OP_ENTRYwr(CWD,   2,v, 0,v), /* rDX, rAX */
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index fc7477833bc..36bb308e449 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -2788,6 +2788,8 @@ static void gen_POP(DisasContext *s, X86DecodedInsn 
*decode)
 X86DecodedOp *op = &decode->op[0];
 MemOp ot = gen_pop_T0(s);
 
+/* Only 16/32-bit access in 32-bit mode, 16/64-bit access in long mode.  */

+assert(ot == op->ot);
 if (op->has_ea || op->unit == X86_OP_SEG) {
 /* NOTE: order is important for MMU exceptions */
 gen_writeback(s, decode, 0, s->T0);

Thanks (and it's reassuring that everything else has worked fine
for you!),

Paolo


Thanks,
Clément

On Sat, Jun 8, 2024 at 10:36 AM Paolo Bonzini  wrote:


From: Mark Cave-Ayland 

Instead of directly implementing the writeback using gen_op_st_v(), use the
existing gen_writeback() function.

Suggested-by: Paolo Bonzini 
Signed-off-by: Mark Cave-Ayland 
Message-ID: <20240606095319.229650-3-mark.cave-ayl...@ilande.co.uk>
Signed-off-by: Paolo Bonzini 
---
  target/i386/tcg/emit.c.inc | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index ca78504b6e4..6123235c000 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -2580,9 +2580,9 @@ static void gen_POP(DisasContext *s, CPUX86State *env, 
X86DecodedInsn *decode)

  if (op->has_ea) {
  /* NOTE: order is important for MMU exceptions */
-gen_op_st_v(s, ot, s->T0, s->A0);
-op->unit = X86_OP_SKIP;
+gen_writeback(s, decode, 0, s->T0);
  }
+
  /* NOTE: writing back registers after update is important for pop %sp */
  gen_pop_update(s, ot);
  }
--
2.45.1










Re: [PATCH] hw/timer/hpet: Fix wrong HPET interrupts

2024-07-10 Thread Paolo Bonzini

Hello!  Thanks for looking after the HPET, which is not a very well
maintained device.

I am not sure your patch needs to mask the comparator with

timer->cmp &= 0xULL;

I think that's a bug in the "case HPET_TN_CMP + 4" part of
hpet_ram_write.  The logic was changed in "hpet: Fix emulation of
HPET_TN_SETVAL" but the commit forgot to apply it to the high bits of
the comparator.

The whole handling of writes to the comparator is very messy.  I can
see two more bugs:
* I don't think it's correct that HPET_TN_CMP+4 does not clear
HPET_TN_SETVAL
* the clamping should take into account that new_val is shifted by 32

Can you check that this also fixes the problem:

diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index 01efe4885db..11df272fe87 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -552,6 +552,10 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
 timer->period =
 (timer->period & 0xffffffff00000000ULL) | new_val;
 }
+/*
+ * FIXME: on a 64-bit write, HPET_TN_SETVAL should apply to the
+ * high bits part as well.
+ */
 timer->config &= ~HPET_TN_SETVAL;
 if (hpet_enabled(s)) {
 hpet_set_timer(timer);
@@ -562,20 +566,21 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
 if (!timer_is_periodic(timer)
 || (timer->config & HPET_TN_SETVAL)) {
 timer->cmp = (timer->cmp & 0xffffffffULL) | new_val << 32;
-} else {
+}
+if (timer_is_periodic(timer)) {
 /*
  * FIXME: Clamp period to reasonable min value?
  * Clamp period to reasonable max value
  */
-new_val &= (timer->config & HPET_TN_32BIT ? ~0u : ~0ull) >> 1;
+new_val = MIN(new_val, ~0u >> 1);
 timer->period =
 (timer->period & 0xffffffffULL) | new_val << 32;
-}
-timer->config &= ~HPET_TN_SETVAL;
-if (hpet_enabled(s)) {
-hpet_set_timer(timer);
-}
-break;
+}
+timer->config &= ~HPET_TN_SETVAL;
+if (hpet_enabled(s)) {
+hpet_set_timer(timer);
+}
+break;
 case HPET_TN_ROUTE:
 timer->fsb = (timer->fsb & 0xffffffff00000000ULL) | new_val;
 break;
@@ -605,7 +605,7 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
 s->hpet_offset =
 ticks_to_ns(s->hpet_counter) - 
qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
 for (i = 0; i < s->num_timers; i++) {
-if ((&s->timer[i])->cmp != ~0ULL) {
+if (timer_enabled(timer)) {
 hpet_set_timer(&s->timer[i]);
 }
 }

(the final part is taken from your patch)?

Thanks,

Paolo

On 6/18/24 15:10, TaiseiIto wrote:

Before this commit, there are 3 problems about HPET timer interrupts. First,
HPET periodic timers cause a too early interrupt before HPET main counter
value reaches a value written its comparator value register. Second,
disabled HPET timers whose comparator value register is not
0x cause wrong interrupts. Third, enabled HPET timers whose
comparator value register is 0x don't cause any interrupts.



About the first one, for example, an HPET driver writes 0x
to an HPET periodic timer comparator value register. As a result, the
register becomes 0x because writing to the higher 32 bits
of the register doesn't affect itself in periodic mode. (see
"case HPET_TN_CMP + 4" of "hpet_ram_write" function.) And "timer->period"
which means interrupt period in periodic mode becomes 0x. Next, the
HPET driver sets the HPET_CFG_ENABLE flag to start the main counter. The
comparator value register (0x) indicate the next interrupt
time. The period (0x) is added to the comparator value register at
"hpet_timer" function because "hpet_time_after64" function returns true when
the main counter is small. So, the first interrupt is planned when the main
counter is 0x5554, but the first interrupt should occur when the
main counter is 0x. To solve this problem, I fix the code to
clear the higher 32 bits of comparator value registers of periodic mode

Re: [PATCH v2] i386/sev: Don't allow automatic fallback to legacy KVM_SEV*_INIT

2024-07-10 Thread Paolo Bonzini
Queued, thanks.

Paolo




[PATCH 05/10] target/i386/tcg: Introduce x86_mmu_index_{kernel_,}pl

2024-07-10 Thread Paolo Bonzini
From: Richard Henderson 

Disconnect mmu index computation from the current pl
as stored in env->hflags.

Signed-off-by: Richard Henderson 
Link: 
https://lore.kernel.org/r/20240617161210.4639-2-richard.hender...@linaro.org
Signed-off-by: Paolo Bonzini 
---
 target/i386/cpu.h | 11 ++-
 target/i386/cpu.c | 27 ---
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index c43ac01c794..1e121acef54 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2445,15 +2445,8 @@ static inline bool is_mmu_index_32(int mmu_index)
 return mmu_index & 1;
 }
 
-static inline int cpu_mmu_index_kernel(CPUX86State *env)
-{
-int mmu_index_32 = (env->hflags & HF_LMA_MASK) ? 0 : 1;
-int mmu_index_base =
-!(env->hflags & HF_SMAP_MASK) ? MMU_KNOSMAP64_IDX :
-((env->hflags & HF_CPL_MASK) < 3 && (env->eflags & AC_MASK)) ? 
MMU_KNOSMAP64_IDX : MMU_KSMAP64_IDX;
-
-return mmu_index_base + mmu_index_32;
-}
+int x86_mmu_index_pl(CPUX86State *env, unsigned pl);
+int cpu_mmu_index_kernel(CPUX86State *env);
 
 #define CC_DST  (env->cc_dst)
 #define CC_SRC  (env->cc_src)
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index c05765eeafc..4688d140c2d 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -8122,18 +8122,39 @@ static bool x86_cpu_has_work(CPUState *cs)
 return x86_cpu_pending_interrupt(cs, cs->interrupt_request) != 0;
 }
 
-static int x86_cpu_mmu_index(CPUState *cs, bool ifetch)
+int x86_mmu_index_pl(CPUX86State *env, unsigned pl)
 {
-CPUX86State *env = cpu_env(cs);
 int mmu_index_32 = (env->hflags & HF_CS64_MASK) ? 0 : 1;
 int mmu_index_base =
-(env->hflags & HF_CPL_MASK) == 3 ? MMU_USER64_IDX :
+pl == 3 ? MMU_USER64_IDX :
 !(env->hflags & HF_SMAP_MASK) ? MMU_KNOSMAP64_IDX :
 (env->eflags & AC_MASK) ? MMU_KNOSMAP64_IDX : MMU_KSMAP64_IDX;
 
 return mmu_index_base + mmu_index_32;
 }
 
+static int x86_cpu_mmu_index(CPUState *cs, bool ifetch)
+{
+CPUX86State *env = cpu_env(cs);
+return x86_mmu_index_pl(env, env->hflags & HF_CPL_MASK);
+}
+
+static int x86_mmu_index_kernel_pl(CPUX86State *env, unsigned pl)
+{
+int mmu_index_32 = (env->hflags & HF_LMA_MASK) ? 0 : 1;
+int mmu_index_base =
+!(env->hflags & HF_SMAP_MASK) ? MMU_KNOSMAP64_IDX :
+(pl < 3 && (env->eflags & AC_MASK)
+ ? MMU_KNOSMAP64_IDX : MMU_KSMAP64_IDX);
+
+return mmu_index_base + mmu_index_32;
+}
+
+int cpu_mmu_index_kernel(CPUX86State *env)
+{
+return x86_mmu_index_kernel_pl(env, env->hflags & HF_CPL_MASK);
+}
+
 static void x86_disas_set_info(CPUState *cs, disassemble_info *info)
 {
 X86CPU *cpu = X86_CPU(cs);
-- 
2.45.2




[PATCH 06/10] target/i386/tcg: Compute MMU index once

2024-07-10 Thread Paolo Bonzini
Add the MMU index to the StackAccess struct, so that it can be cached
or (in the next patch) computed from information that is not in
CPUX86State.

Co-developed-by: Richard Henderson 
Signed-off-by: Richard Henderson 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 35 ++-
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 6b3de7a2be4..07e3667639a 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -587,36 +587,37 @@ typedef struct StackAccess
 target_ulong ss_base;
 target_ulong sp;
 target_ulong sp_mask;
+int mmu_index;
 } StackAccess;
 
 static void pushw(StackAccess *sa, uint16_t val)
 {
 sa->sp -= 2;
-cpu_stw_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
-  val, sa->ra);
+cpu_stw_mmuidx_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->mmu_index, sa->ra);
 }
 
 static void pushl(StackAccess *sa, uint32_t val)
 {
 sa->sp -= 4;
-cpu_stl_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
-  val, sa->ra);
+cpu_stl_mmuidx_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->mmu_index, sa->ra);
 }
 
 static uint16_t popw(StackAccess *sa)
 {
-uint16_t ret = cpu_lduw_data_ra(sa->env,
-sa->ss_base + (sa->sp & sa->sp_mask),
-sa->ra);
+uint16_t ret = cpu_lduw_mmuidx_ra(sa->env,
+  sa->ss_base + (sa->sp & sa->sp_mask),
+  sa->mmu_index, sa->ra);
 sa->sp += 2;
 return ret;
 }
 
 static uint32_t popl(StackAccess *sa)
 {
-uint32_t ret = cpu_ldl_data_ra(sa->env,
-   sa->ss_base + (sa->sp & sa->sp_mask),
-   sa->ra);
+uint32_t ret = cpu_ldl_mmuidx_ra(sa->env,
+ sa->ss_base + (sa->sp & sa->sp_mask),
+ sa->mmu_index, sa->ra);
 sa->sp += 4;
 return ret;
 }
@@ -677,6 +678,7 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 
 sa.env = env;
 sa.ra = 0;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (type == 5) {
 /* task gate */
@@ -858,12 +860,12 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 static void pushq(StackAccess *sa, uint64_t val)
 {
 sa->sp -= 8;
-cpu_stq_kernel_ra(sa->env, sa->sp, val, sa->ra);
+cpu_stq_mmuidx_ra(sa->env, sa->sp, val, sa->mmu_index, sa->ra);
 }
 
 static uint64_t popq(StackAccess *sa)
 {
-uint64_t ret = cpu_ldq_data_ra(sa->env, sa->sp, sa->ra);
+uint64_t ret = cpu_ldq_mmuidx_ra(sa->env, sa->sp, sa->mmu_index, sa->ra);
 sa->sp += 8;
 return ret;
 }
@@ -982,6 +984,7 @@ static void do_interrupt64(CPUX86State *env, int intno, int 
is_int,
 
 sa.env = env;
 sa.ra = 0;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 sa.sp_mask = -1;
 sa.ss_base = 0;
 if (dpl < cpl || ist != 0) {
@@ -1116,6 +1119,7 @@ static void do_interrupt_real(CPUX86State *env, int 
intno, int is_int,
 sa.sp = env->regs[R_ESP];
 sa.sp_mask = 0xffff;
 sa.ss_base = env->segs[R_SS].base;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (is_int) {
 old_eip = next_eip;
@@ -1579,6 +1583,7 @@ void helper_lcall_real(CPUX86State *env, uint32_t new_cs, 
uint32_t new_eip,
 sa.sp = env->regs[R_ESP];
 sa.sp_mask = get_sp_mask(env->segs[R_SS].flags);
 sa.ss_base = env->segs[R_SS].base;
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (shift) {
 pushl(&sa, env->segs[R_CS].selector);
@@ -1618,6 +1623,7 @@ void helper_lcall_protected(CPUX86State *env, int new_cs, 
target_ulong new_eip,
 
 sa.env = env;
 sa.ra = GETPC();
+sa.mmu_index = cpu_mmu_index_kernel(env);
 
 if (e2 & DESC_S_MASK) {
 if (!(e2 & DESC_CS_MASK)) {
@@ -1905,6 +1911,7 @@ void helper_iret_real(CPUX86State *env, int shift)
 
 sa.env = env;
 sa.ra = GETPC();
+sa.mmu_index = x86_mmu_index_pl(env, 0);
 sa.sp_mask = 0xffff; /* XXXX: use SS segment size? */
 sa.sp = env->regs[R_ESP];
 sa.ss_base = env->segs[R_SS].base;
@@ -1976,8 +1983,11 @@ static inline void helper_ret_protected(CPUX86State 
*env, int shift,
 target_ulong new_eip, new_esp;
 StackAccess sa;
 
+cpl = env->hflags & HF_CPL_MASK;
+
 sa.env = env;
 sa.ra = retaddr;
+sa.mmu_index = x86_mmu_index_pl(env, cpl);
 
 #ifdef TARGET_X86_64
 if (shift == 2) {
@@ -2032,7 +2042,6 @@

[PATCH 00/10] target/i386/tcg: fixes for seg_helper.c

2024-07-10 Thread Paolo Bonzini
This includes bugfixes:
- allowing IRET from user mode to user mode with SMAP (do not use implicit
  kernel accesses, which break if the stack is in userspace)

- use DPL-level accesses for interrupts and call gates

- various fixes for task switching

And two related cleanups: computing MMU index once for far calls and returns
(including task switches), and using X86Access for TSS access.

Tested with a really ugly patch to kvm-unit-tests, included after signature.

Paolo Bonzini (7):
  target/i386/tcg: Allow IRET from user mode to user mode with SMAP
  target/i386/tcg: use PUSHL/PUSHW for error code
  target/i386/tcg: Compute MMU index once
  target/i386/tcg: Use DPL-level accesses for interrupts and call gates
  target/i386/tcg: check for correct busy state before switching to a
new task
  target/i386/tcg: use X86Access for TSS access
  target/i386/tcg: save current task state before loading new one

Richard Henderson (3):
  target/i386/tcg: Remove SEG_ADDL
  target/i386/tcg: Reorg push/pop within seg_helper.c
  target/i386/tcg: Introduce x86_mmu_index_{kernel_,}pl

 target/i386/cpu.h|  11 +-
 target/i386/cpu.c|  27 +-
 target/i386/tcg/seg_helper.c | 606 +++
 3 files changed, 354 insertions(+), 290 deletions(-)

-- 
2.45.2

diff --git a/lib/x86/usermode.c b/lib/x86/usermode.c
index c3ec0ad7..0bf40c6d 100644
--- a/lib/x86/usermode.c
+++ b/lib/x86/usermode.c
@@ -5,13 +5,15 @@
 #include "x86/desc.h"
 #include "x86/isr.h"
 #include "alloc.h"
+#include "alloc_page.h"
 #include "setjmp.h"
 #include "usermode.h"
 
 #include "libcflat.h"
 #include 
 
-#define USERMODE_STACK_SIZE0x2000
+#define USERMODE_STACK_ORDER   1 /* 8k */
+#define USERMODE_STACK_SIZE(1 << (12 + USERMODE_STACK_ORDER))
 #define RET_TO_KERNEL_IRQ  0x20
 
 static jmp_buf jmpbuf;
@@ -37,9 +39,14 @@ uint64_t run_in_user(usermode_func func, unsigned int 
fault_vector,
 {
extern char ret_to_kernel;
volatile uint64_t rax = 0;
-   static unsigned char user_stack[USERMODE_STACK_SIZE];
+   static unsigned char *user_stack;
handler old_ex;
 
+   if (!user_stack) {
+   user_stack = alloc_pages(USERMODE_STACK_ORDER);
+   printf("%p\n", user_stack);
+   }
+
*raised_vector = 0;
	set_idt_entry(RET_TO_KERNEL_IRQ, &ret_to_kernel, 3);
old_ex = handle_exception(fault_vector,
@@ -51,6 +58,8 @@ uint64_t run_in_user(usermode_func func, unsigned int 
fault_vector,
return 0;
}
 
+   memcpy(user_stack + USERMODE_STACK_SIZE - 8, &func, 8);
+
asm volatile (
/* Prepare kernel SP for exception handlers */
"mov %%rsp, %[rsp0]\n\t"
@@ -63,12 +72,13 @@ uint64_t run_in_user(usermode_func func, unsigned int 
fault_vector,
"pushq %[user_stack_top]\n\t"
"pushfq\n\t"
"pushq %[user_cs]\n\t"
-   "lea user_mode(%%rip), %%rax\n\t"
+   "lea user_mode+0x80(%%rip), %%rax\n\t" // smap.flat 
places usermode addresses at 8MB-16MB
"pushq %%rax\n\t"
"iretq\n"
 
"user_mode:\n\t"
/* Back up volatile registers before invoking func */
+   "pop %%rax\n\t"
"push %%rcx\n\t"
"push %%rdx\n\t"
"push %%rdi\n\t"
@@ -78,11 +88,12 @@ uint64_t run_in_user(usermode_func func, unsigned int 
fault_vector,
"push %%r10\n\t"
"push %%r11\n\t"
/* Call user mode function */
+   "add $0x80,%%rbp\n\t"
"mov %[arg1], %%rdi\n\t"
"mov %[arg2], %%rsi\n\t"
"mov %[arg3], %%rdx\n\t"
"mov %[arg4], %%rcx\n\t"
-   "call *%[func]\n\t"
+   "call *%%rax\n\t"
/* Restore registers */
"pop %%r11\n\t"
"pop %%r10\n\t"
@@ -112,12 +123,11 @@ uint64_t run_in_user(usermode_func func, unsigned int 
fault_vector,
[arg2]"m"(arg2),
[arg3]"m"(arg3),
[arg4]"m"(arg4),
-   [func]"m"(func),
[user_ds]"i"(USER_DS),
[user_cs]"i"(USER_CS),
[kernel_ds]"rm"(KERNEL_DS),
   

[PATCH 02/10] target/i386/tcg: Allow IRET from user mode to user mode with SMAP

2024-07-10 Thread Paolo Bonzini
This fixes a bug wherein i386/tcg assumed an interrupt return using
the IRET instruction was always returning from kernel mode to either
kernel mode or user mode. This assumption is violated when IRET is used
as a clever way to restore thread state, as for example in the dotnet
runtime. There, IRET returns from user mode to user mode.

The fix is that stack accesses from IRET and RETF, as well as accesses
to the parameters in a call gate, are normal data accesses using the
current CPL.  This manifested itself as a page fault in the guest Linux
kernel due to SMAP preventing the access.

This bug appears to have been in QEMU since the beginning.

Analyzed-by: Robert R. Henry 
Co-developed-by: Robert R. Henry 
Signed-off-by: Robert R. Henry 
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 19d6b41a589..4977a5f5b3a 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -594,13 +594,13 @@ int exception_has_error_code(int intno)
 
 #define POPW_RA(ssp, sp, sp_mask, val, ra)   \
 {\
-val = cpu_lduw_kernel_ra(env, (ssp) + (sp & (sp_mask)), ra); \
+val = cpu_lduw_data_ra(env, (ssp) + (sp & (sp_mask)), ra); \
 sp += 2; \
 }
 
 #define POPL_RA(ssp, sp, sp_mask, val, ra)  \
 {   \
-val = (uint32_t)cpu_ldl_kernel_ra(env, (ssp) + (sp & (sp_mask)), ra); \
+val = (uint32_t)cpu_ldl_data_ra(env, (ssp) + (sp & (sp_mask)), ra); \
 sp += 4;\
 }
 
@@ -847,7 +847,7 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 
 #define POPQ_RA(sp, val, ra)\
 {   \
-val = cpu_ldq_kernel_ra(env, sp, ra);   \
+val = cpu_ldq_data_ra(env, sp, ra);   \
 sp += 8;\
 }
 
@@ -1797,18 +1797,18 @@ void helper_lcall_protected(CPUX86State *env, int 
new_cs, target_ulong new_eip,
 PUSHL_RA(ssp, sp, sp_mask, env->segs[R_SS].selector, GETPC());
 PUSHL_RA(ssp, sp, sp_mask, env->regs[R_ESP], GETPC());
 for (i = param_count - 1; i >= 0; i--) {
-val = cpu_ldl_kernel_ra(env, old_ssp +
-((env->regs[R_ESP] + i * 4) &
- old_sp_mask), GETPC());
+val = cpu_ldl_data_ra(env,
+ old_ssp + ((env->regs[R_ESP] + i * 4) 
& old_sp_mask),
+ GETPC());
 PUSHL_RA(ssp, sp, sp_mask, val, GETPC());
 }
 } else {
 PUSHW_RA(ssp, sp, sp_mask, env->segs[R_SS].selector, GETPC());
 PUSHW_RA(ssp, sp, sp_mask, env->regs[R_ESP], GETPC());
 for (i = param_count - 1; i >= 0; i--) {
-val = cpu_lduw_kernel_ra(env, old_ssp +
- ((env->regs[R_ESP] + i * 2) &
-  old_sp_mask), GETPC());
+val = cpu_lduw_data_ra(env,
+  old_ssp + ((env->regs[R_ESP] + i * 
2) & old_sp_mask),
+  GETPC());
 PUSHW_RA(ssp, sp, sp_mask, val, GETPC());
 }
 }
-- 
2.45.2




[PATCH 01/10] target/i386/tcg: Remove SEG_ADDL

2024-07-10 Thread Paolo Bonzini
From: Richard Henderson 

This truncation is now handled by MMU_*32_IDX.  The introduction of
MMU_*32_IDX in fact applied correct 32-bit wraparound to 16-bit accesses
with a high segment base (e.g.  big real mode or vm86 mode), which did
not use SEG_ADDL.
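The removed macro made the truncation explicit at every use site:

#define SEG_ADDL(ssp, sp, sp_mask) ((uint32_t)((ssp) + (sp & (sp_mask))))

whereas with MMU_*32_IDX the linear address of any access made through a
32-bit MMU index is truncated in the translation path itself, so the same
wraparound now applies uniformly, including to the 16-bit cases mentioned
above.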

Signed-off-by: Richard Henderson 
Link: 
https://lore.kernel.org/r/20240617161210.4639-3-richard.hender...@linaro.org
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index aee3d19f29b..19d6b41a589 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -579,10 +579,6 @@ int exception_has_error_code(int intno)
 } while (0)
 #endif
 
-/* in 64-bit machines, this can overflow. So this segment addition macro
- * can be used to trim the value to 32-bit whenever needed */
-#define SEG_ADDL(ssp, sp, sp_mask) ((uint32_t)((ssp) + (sp & (sp_mask
-
 /* XXX: add a is_user flag to have proper security support */
 #define PUSHW_RA(ssp, sp, sp_mask, val, ra)  \
 {\
@@ -593,7 +589,7 @@ int exception_has_error_code(int intno)
 #define PUSHL_RA(ssp, sp, sp_mask, val, ra) \
 {   \
 sp -= 4;\
-cpu_stl_kernel_ra(env, SEG_ADDL(ssp, sp, sp_mask), (uint32_t)(val), 
ra); \
+cpu_stl_kernel_ra(env, (ssp) + (sp & (sp_mask)), (val), ra); \
 }
 
 #define POPW_RA(ssp, sp, sp_mask, val, ra)   \
@@ -604,7 +600,7 @@ int exception_has_error_code(int intno)
 
 #define POPL_RA(ssp, sp, sp_mask, val, ra)  \
 {   \
-val = (uint32_t)cpu_ldl_kernel_ra(env, SEG_ADDL(ssp, sp, sp_mask), 
ra); \
+val = (uint32_t)cpu_ldl_kernel_ra(env, (ssp) + (sp & (sp_mask)), ra); \
 sp += 4;\
 }
 
-- 
2.45.2




[PATCH 10/10] target/i386/tcg: save current task state before loading new one

2024-07-10 Thread Paolo Bonzini
This is how the steps are ordered in the manual.  EFLAGS.NT is
overwritten after the fact in the saved image.

Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 85 +++-
 1 file changed, 45 insertions(+), 40 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 77f2c65c3cf..7663e653e84 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -325,6 +325,42 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
 access_prepare_mmu(&new, env, tss_base, tss_limit,
   MMU_DATA_LOAD, cpu_mmu_index_kernel(env), retaddr);
 
+/* save the current state in the old TSS */
+old_eflags = cpu_compute_eflags(env);
+if (old_type & 8) {
+/* 32 bit */
+access_stl(&old, env->tr.base + 0x20, next_eip);
+access_stl(&old, env->tr.base + 0x24, old_eflags);
+access_stl(&old, env->tr.base + (0x28 + 0 * 4), env->regs[R_EAX]);
+access_stl(&old, env->tr.base + (0x28 + 1 * 4), env->regs[R_ECX]);
+access_stl(&old, env->tr.base + (0x28 + 2 * 4), env->regs[R_EDX]);
+access_stl(&old, env->tr.base + (0x28 + 3 * 4), env->regs[R_EBX]);
+access_stl(&old, env->tr.base + (0x28 + 4 * 4), env->regs[R_ESP]);
+access_stl(&old, env->tr.base + (0x28 + 5 * 4), env->regs[R_EBP]);
+access_stl(&old, env->tr.base + (0x28 + 6 * 4), env->regs[R_ESI]);
+access_stl(&old, env->tr.base + (0x28 + 7 * 4), env->regs[R_EDI]);
+for (i = 0; i < 6; i++) {
+access_stw(&old, env->tr.base + (0x48 + i * 4),
+   env->segs[i].selector);
+}
+} else {
+/* 16 bit */
+access_stw(&old, env->tr.base + 0x0e, next_eip);
+access_stw(&old, env->tr.base + 0x10, old_eflags);
+access_stw(&old, env->tr.base + (0x12 + 0 * 2), env->regs[R_EAX]);
+access_stw(&old, env->tr.base + (0x12 + 1 * 2), env->regs[R_ECX]);
+access_stw(&old, env->tr.base + (0x12 + 2 * 2), env->regs[R_EDX]);
+access_stw(&old, env->tr.base + (0x12 + 3 * 2), env->regs[R_EBX]);
+access_stw(&old, env->tr.base + (0x12 + 4 * 2), env->regs[R_ESP]);
+access_stw(&old, env->tr.base + (0x12 + 5 * 2), env->regs[R_EBP]);
+access_stw(&old, env->tr.base + (0x12 + 6 * 2), env->regs[R_ESI]);
+access_stw(&old, env->tr.base + (0x12 + 7 * 2), env->regs[R_EDI]);
+for (i = 0; i < 4; i++) {
+access_stw(&old, env->tr.base + (0x22 + i * 2),
+   env->segs[i].selector);
+}
+}
+
 /* read all the registers from the new TSS */
 if (type & 8) {
 /* 32 bit */
@@ -364,49 +400,16 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
 if (source == SWITCH_TSS_JMP || source == SWITCH_TSS_IRET) {
 tss_set_busy(env, env->tr.selector, 0, retaddr);
 }
-old_eflags = cpu_compute_eflags(env);
+
 if (source == SWITCH_TSS_IRET) {
 old_eflags &= ~NT_MASK;
+if (old_type & 8) {
+access_stl(&old, env->tr.base + 0x24, old_eflags);
+} else {
+access_stw(&old, env->tr.base + 0x10, old_eflags);
+   }
 }
 
-/* save the current state in the old TSS */
-if (old_type & 8) {
-/* 32 bit */
-access_stl(&old, env->tr.base + 0x20, next_eip);
-access_stl(&old, env->tr.base + 0x24, old_eflags);
-access_stl(&old, env->tr.base + (0x28 + 0 * 4), env->regs[R_EAX]);
-access_stl(&old, env->tr.base + (0x28 + 1 * 4), env->regs[R_ECX]);
-access_stl(&old, env->tr.base + (0x28 + 2 * 4), env->regs[R_EDX]);
-access_stl(&old, env->tr.base + (0x28 + 3 * 4), env->regs[R_EBX]);
-access_stl(&old, env->tr.base + (0x28 + 4 * 4), env->regs[R_ESP]);
-access_stl(&old, env->tr.base + (0x28 + 5 * 4), env->regs[R_EBP]);
-access_stl(&old, env->tr.base + (0x28 + 6 * 4), env->regs[R_ESI]);
-access_stl(&old, env->tr.base + (0x28 + 7 * 4), env->regs[R_EDI]);
-for (i = 0; i < 6; i++) {
-access_stw(&old, env->tr.base + (0x48 + i * 4),
-   env->segs[i].selector);
-}
-} else {
-/* 16 bit */
-access_stw(&old, env->tr.base + 0x0e, next_eip);
-access_stw(&old, env->tr.base + 0x10, old_eflags);
-access_stw(&old, env->tr.base + (0x12 + 0 * 2), env->regs[R_EAX]);
-access_stw(&old, env->tr.base + (0x12 + 1 * 2), env->regs[R_ECX]);
-access_stw(&old, env->tr.base + (0x12 + 2 * 2), env->regs[R_EDX]);
-access_stw(&old, env->tr.base + (0x12 + 3 * 2), env->regs[R_EBX]);
-access_stw(&old, env->tr.base + (0x12 + 4 * 2), env->regs[R_ESP]);
-access_stw(&old, env->tr.base + (0x12 + 5 * 2), env->regs[R_EBP]);
-access_stw(&old, env->tr.base + (0x12 + 6 * 2), env->regs[R_ESI]);
-access_

[PATCH 04/10] target/i386/tcg: Reorg push/pop within seg_helper.c

2024-07-10 Thread Paolo Bonzini
From: Richard Henderson 

Interrupts and call gates should use accesses with the DPL as
the privilege level.  While computing the applicable MMU index
is easy, the harder thing is how to plumb it in the code.

One possibility could be to add a single argument to the PUSH* macros
for the privilege level, but this is repetitive and risks confusion
between the involved privilege levels.

Another possibility is to pass both CPL and DPL, and adjusting both
PUSH* and POP* to use specific privilege levels (instead of using
cpu_{ld,st}*_data). This makes the code more symmetric.

However, a more complicated but much nicer approach is to use a structure
to contain the stack parameters, env, unwind return address, and rewrite
the macros into functions.  The struct provides an easy home for the MMU
index as well.
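In practice a caller fills the struct once and every push/pop inherits the
same parameters; roughly (mirroring helper_lcall_real once the whole
series is applied; the mmu_index line arrives only in a later patch):

    StackAccess sa;

    sa.env = env;
    sa.ra = GETPC();
    sa.sp = env->regs[R_ESP];
    sa.sp_mask = get_sp_mask(env->segs[R_SS].flags);
    sa.ss_base = env->segs[R_SS].base;
    sa.mmu_index = x86_mmu_index_pl(env, 0);

    pushl(&sa, env->segs[R_CS].selector);
    pushl(&sa, next_eip);
    SET_ESP(sa.sp, sa.sp_mask);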

Signed-off-by: Richard Henderson 
Link: 
https://lore.kernel.org/r/20240617161210.4639-4-richard.hender...@linaro.org
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 439 +++
 1 file changed, 238 insertions(+), 201 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 0653bc10936..6b3de7a2be4 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -579,35 +579,47 @@ int exception_has_error_code(int intno)
 } while (0)
 #endif
 
-/* XXX: add a is_user flag to have proper security support */
-#define PUSHW_RA(ssp, sp, sp_mask, val, ra)  \
-{\
-sp -= 2; \
-cpu_stw_kernel_ra(env, (ssp) + (sp & (sp_mask)), (val), ra); \
-}
+/* XXX: use mmu_index to have proper DPL support */
+typedef struct StackAccess
+{
+CPUX86State *env;
+uintptr_t ra;
+target_ulong ss_base;
+target_ulong sp;
+target_ulong sp_mask;
+} StackAccess;
 
-#define PUSHL_RA(ssp, sp, sp_mask, val, ra) \
-{   \
-sp -= 4;\
-cpu_stl_kernel_ra(env, (ssp) + (sp & (sp_mask)), (val), ra); \
-}
+static void pushw(StackAccess *sa, uint16_t val)
+{
+sa->sp -= 2;
+cpu_stw_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->ra);
+}
 
-#define POPW_RA(ssp, sp, sp_mask, val, ra)   \
-{\
-val = cpu_lduw_data_ra(env, (ssp) + (sp & (sp_mask)), ra); \
-sp += 2; \
-}
+static void pushl(StackAccess *sa, uint32_t val)
+{
+sa->sp -= 4;
+cpu_stl_kernel_ra(sa->env, sa->ss_base + (sa->sp & sa->sp_mask),
+  val, sa->ra);
+}
 
-#define POPL_RA(ssp, sp, sp_mask, val, ra)  \
-{   \
-val = (uint32_t)cpu_ldl_data_ra(env, (ssp) + (sp & (sp_mask)), ra); \
-sp += 4;\
-}
+static uint16_t popw(StackAccess *sa)
+{
+uint16_t ret = cpu_lduw_data_ra(sa->env,
+sa->ss_base + (sa->sp & sa->sp_mask),
+sa->ra);
+sa->sp += 2;
+return ret;
+}
 
-#define PUSHW(ssp, sp, sp_mask, val) PUSHW_RA(ssp, sp, sp_mask, val, 0)
-#define PUSHL(ssp, sp, sp_mask, val) PUSHL_RA(ssp, sp, sp_mask, val, 0)
-#define POPW(ssp, sp, sp_mask, val) POPW_RA(ssp, sp, sp_mask, val, 0)
-#define POPL(ssp, sp, sp_mask, val) POPL_RA(ssp, sp, sp_mask, val, 0)
+static uint32_t popl(StackAccess *sa)
+{
+uint32_t ret = cpu_ldl_data_ra(sa->env,
+   sa->ss_base + (sa->sp & sa->sp_mask),
+   sa->ra);
+sa->sp += 4;
+return ret;
+}
 
 /* protected mode interrupt */
 static void do_interrupt_protected(CPUX86State *env, int intno, int is_int,
@@ -615,12 +627,13 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
int is_hw)
 {
 SegmentCache *dt;
-target_ulong ptr, ssp;
+target_ulong ptr;
 int type, dpl, selector, ss_dpl, cpl;
 int has_error_code, new_stack, shift;
-uint32_t e1, e2, offset, ss = 0, esp, ss_e1 = 0, ss_e2 = 0;
-uint32_t old_eip, sp_mask, eflags;
+uint32_t e1, e2, offset, ss = 0, ss_e1 = 0, ss_e2 = 0;
+uint32_t old_eip, eflags;
 int vm86 = env->eflags & VM_MASK;
+StackAccess sa;
 bool set_rf;
 
 has_error_code = 0;
@@ -662,6 +675,9 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 raise_exception_err(env, EXCP0D_GPF, intno * 8 + 2);
 }
 
+sa.env = env;

[PATCH 09/10] target/i386/tcg: use X86Access for TSS access

2024-07-10 Thread Paolo Bonzini
This takes care of probing the vaddr range in advance, and is also faster
because it avoids repeated TLB lookups.  It also matches the Intel manual
better, as it says "Checks that the current (old) TSS, new TSS, and all
segment descriptors used in the task switch are paged into system memory";
note however that it's not clear how the processor checks for segment
descriptors, and this check is not included in the AMD manual.
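The pattern below is: translate each contiguous range once up front, then
do plain accesses through the handle.  In outline:

    X86Access old, new;

    /* Any fault is raised here, before CPU state is modified... */
    access_prepare_mmu(&old, env, env->tr.base, old_tss_limit_max,
                       MMU_DATA_STORE, cpu_mmu_index_kernel(env), retaddr);
    access_prepare_mmu(&new, env, tss_base, tss_limit,
                       MMU_DATA_LOAD, cpu_mmu_index_kernel(env), retaddr);

    /* ...so these cannot fault halfway through the task switch. */
    new_eip = access_ldl(&new, tss_base + 0x20);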

Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 101 ++-
 1 file changed, 51 insertions(+), 50 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 25af9d4a4ec..77f2c65c3cf 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -27,6 +27,7 @@
 #include "exec/log.h"
 #include "helper-tcg.h"
 #include "seg_helper.h"
+#include "access.h"
 
 int get_pg_mode(CPUX86State *env)
 {
@@ -250,7 +251,7 @@ static int switch_tss_ra(CPUX86State *env, int tss_selector,
  uint32_t e1, uint32_t e2, int source,
  uint32_t next_eip, uintptr_t retaddr)
 {
-int tss_limit, tss_limit_max, type, old_tss_limit_max, old_type, v1, v2, i;
+int tss_limit, tss_limit_max, type, old_tss_limit_max, old_type, i;
 target_ulong tss_base;
 uint32_t new_regs[8], new_segs[6];
 uint32_t new_eflags, new_eip, new_cr3, new_ldt, new_trap;
@@ -258,6 +259,7 @@ static int switch_tss_ra(CPUX86State *env, int tss_selector,
 SegmentCache *dt;
 int index;
 target_ulong ptr;
+X86Access old, new;
 
 type = (e2 >> DESC_TYPE_SHIFT) & 0xf;
 LOG_PCALL("switch_tss: sel=0x%04x type=%d src=%d\n", tss_selector, type,
@@ -311,35 +313,44 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
 raise_exception_err_ra(env, EXCP0A_TSS, tss_selector & 0xfffc, 
retaddr);
 }
 
+/* X86Access avoids memory exceptions during the task switch */
+access_prepare_mmu(&old, env, env->tr.base, old_tss_limit_max,
+  MMU_DATA_STORE, cpu_mmu_index_kernel(env), retaddr);
+
+if (source == SWITCH_TSS_CALL) {
+/* Probe for future write of parent task */
+probe_access(env, tss_base, 2, MMU_DATA_STORE,
+cpu_mmu_index_kernel(env), retaddr);
+}
+access_prepare_mmu(&new, env, tss_base, tss_limit,
+  MMU_DATA_LOAD, cpu_mmu_index_kernel(env), retaddr);
+
 /* read all the registers from the new TSS */
 if (type & 8) {
 /* 32 bit */
-new_cr3 = cpu_ldl_kernel_ra(env, tss_base + 0x1c, retaddr);
-new_eip = cpu_ldl_kernel_ra(env, tss_base + 0x20, retaddr);
-new_eflags = cpu_ldl_kernel_ra(env, tss_base + 0x24, retaddr);
+new_cr3 = access_ldl(&new, tss_base + 0x1c);
+new_eip = access_ldl(&new, tss_base + 0x20);
+new_eflags = access_ldl(&new, tss_base + 0x24);
 for (i = 0; i < 8; i++) {
-new_regs[i] = cpu_ldl_kernel_ra(env, tss_base + (0x28 + i * 4),
-retaddr);
+new_regs[i] = access_ldl(&new, tss_base + (0x28 + i * 4));
 }
 for (i = 0; i < 6; i++) {
-new_segs[i] = cpu_lduw_kernel_ra(env, tss_base + (0x48 + i * 4),
- retaddr);
+new_segs[i] = access_ldw(&new, tss_base + (0x48 + i * 4));
 }
-new_ldt = cpu_lduw_kernel_ra(env, tss_base + 0x60, retaddr);
-new_trap = cpu_ldl_kernel_ra(env, tss_base + 0x64, retaddr);
+new_ldt = access_ldw(&new, tss_base + 0x60);
+new_trap = access_ldl(&new, tss_base + 0x64);
 } else {
 /* 16 bit */
 new_cr3 = 0;
-new_eip = cpu_lduw_kernel_ra(env, tss_base + 0x0e, retaddr);
-new_eflags = cpu_lduw_kernel_ra(env, tss_base + 0x10, retaddr);
+new_eip = access_ldw(&new, tss_base + 0x0e);
+new_eflags = access_ldw(&new, tss_base + 0x10);
 for (i = 0; i < 8; i++) {
-new_regs[i] = cpu_lduw_kernel_ra(env, tss_base + (0x12 + i * 2), 
retaddr);
+new_regs[i] = access_ldw(&new, tss_base + (0x12 + i * 2));
 }
 for (i = 0; i < 4; i++) {
-new_segs[i] = cpu_lduw_kernel_ra(env, tss_base + (0x22 + i * 2),
- retaddr);
+new_segs[i] = access_ldw(&new, tss_base + (0x22 + i * 2));
 }
-new_ldt = cpu_lduw_kernel_ra(env, tss_base + 0x2a, retaddr);
+new_ldt = access_ldw(&new, tss_base + 0x2a);
 new_segs[R_FS] = 0;
 new_segs[R_GS] = 0;
 new_trap = 0;
@@ -349,16 +360,6 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
  chapters 12.2.5 and 13.2.4 on how to implement TSS Trap bit */
 (void)new_trap;
 
-/* NOTE: we must avoid memory exceptions during the task switch,
-   so we make dummy accesses before */
-/* XXX: it can still

[PATCH 03/10] target/i386/tcg: use PUSHL/PUSHW for error code

2024-07-10 Thread Paolo Bonzini
Do not pre-decrement esp, let the macros subtract the appropriate
operand size.
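
For reference, the push helpers this switches to look roughly like the
following (a simplified sketch of the PUSHW/PUSHL macros in
target/i386/tcg/seg_helper.c):

    #define PUSHW(ssp, sp, sp_mask, val)                              \
        {                                                             \
            sp -= 2;                                                  \
            cpu_stw_kernel(env, (ssp) + (sp & (sp_mask)), (val));     \
        }

    #define PUSHL(ssp, sp, sp_mask, val)                              \
        {                                                             \
            sp -= 4;                                                  \
            cpu_stl_kernel(env, (ssp) + (sp & (sp_mask)), (val));     \
        }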

Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 4977a5f5b3a..0653bc10936 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -670,22 +670,20 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 }
 shift = switch_tss(env, intno * 8, e1, e2, SWITCH_TSS_CALL, old_eip);
 if (has_error_code) {
-uint32_t mask;
-
 /* push the error code */
 if (env->segs[R_SS].flags & DESC_B_MASK) {
-mask = 0xffffffff;
+sp_mask = 0xffffffff;
 } else {
-mask = 0xffff;
+sp_mask = 0xffff;
 }
-esp = (env->regs[R_ESP] - (2 << shift)) & mask;
-ssp = env->segs[R_SS].base + esp;
+esp = env->regs[R_ESP];
+ssp = env->segs[R_SS].base;
 if (shift) {
-cpu_stl_kernel(env, ssp, error_code);
+PUSHL(ssp, esp, sp_mask, error_code);
 } else {
-cpu_stw_kernel(env, ssp, error_code);
+PUSHW(ssp, esp, sp_mask, error_code);
 }
-SET_ESP(esp, mask);
+SET_ESP(esp, sp_mask);
 }
 return;
 }
-- 
2.45.2




[PATCH 07/10] target/i386/tcg: Use DPL-level accesses for interrupts and call gates

2024-07-10 Thread Paolo Bonzini
This fixes a bug wherein i386/tcg assumed an interrupt return using
the CALL or JMP instructions was always going from kernel or user mode to
kernel mode, when using a call gate. This assumption is violated if
the call gate has a DPL that is greater than 0.

In addition, the stack accesses should count as explicit, not implicit
("kernel" in QEMU code), so that SMAP is not applied if DPL=3.

Analyzed-by: Robert R. Henry 
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/249
Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 07e3667639a..1430f477c43 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -678,7 +678,7 @@ static void do_interrupt_protected(CPUX86State *env, int 
intno, int is_int,
 
 sa.env = env;
 sa.ra = 0;
-sa.mmu_index = cpu_mmu_index_kernel(env);
+sa.mmu_index = x86_mmu_index_pl(env, dpl);
 
 if (type == 5) {
 /* task gate */
@@ -984,7 +984,7 @@ static void do_interrupt64(CPUX86State *env, int intno, int 
is_int,
 
 sa.env = env;
 sa.ra = 0;
-sa.mmu_index = cpu_mmu_index_kernel(env);
+sa.mmu_index = x86_mmu_index_pl(env, dpl);
 sa.sp_mask = -1;
 sa.ss_base = 0;
 if (dpl < cpl || ist != 0) {
@@ -1119,7 +1119,7 @@ static void do_interrupt_real(CPUX86State *env, int 
intno, int is_int,
 sa.sp = env->regs[R_ESP];
 sa.sp_mask = 0xffff;
 sa.ss_base = env->segs[R_SS].base;
-sa.mmu_index = cpu_mmu_index_kernel(env);
+sa.mmu_index = x86_mmu_index_pl(env, 0);
 
 if (is_int) {
 old_eip = next_eip;
@@ -1583,7 +1583,7 @@ void helper_lcall_real(CPUX86State *env, uint32_t new_cs, 
uint32_t new_eip,
 sa.sp = env->regs[R_ESP];
 sa.sp_mask = get_sp_mask(env->segs[R_SS].flags);
 sa.ss_base = env->segs[R_SS].base;
-sa.mmu_index = cpu_mmu_index_kernel(env);
+sa.mmu_index = x86_mmu_index_pl(env, 0);
 
 if (shift) {
 pushl(&sa, env->segs[R_CS].selector);
@@ -1619,17 +1619,17 @@ void helper_lcall_protected(CPUX86State *env, int 
new_cs, target_ulong new_eip,
 raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, GETPC());
 }
 cpl = env->hflags & HF_CPL_MASK;
+dpl = (e2 >> DESC_DPL_SHIFT) & 3;
 LOG_PCALL("desc=%08x:%08x\n", e1, e2);
 
 sa.env = env;
 sa.ra = GETPC();
-sa.mmu_index = cpu_mmu_index_kernel(env);
+sa.mmu_index = x86_mmu_index_pl(env, dpl);
 
 if (e2 & DESC_S_MASK) {
 if (!(e2 & DESC_CS_MASK)) {
 raise_exception_err_ra(env, EXCP0D_GPF, new_cs & 0xfffc, GETPC());
 }
-dpl = (e2 >> DESC_DPL_SHIFT) & 3;
 if (e2 & DESC_C_MASK) {
 /* conforming code segment */
 if (dpl > cpl) {
@@ -1691,7 +1691,6 @@ void helper_lcall_protected(CPUX86State *env, int new_cs, 
target_ulong new_eip,
 } else {
 /* check gate type */
 type = (e2 >> DESC_TYPE_SHIFT) & 0x1f;
-dpl = (e2 >> DESC_DPL_SHIFT) & 3;
 rpl = new_cs & 3;
 
 #ifdef TARGET_X86_64
-- 
2.45.2




[PATCH 08/10] target/i386/tcg: check for correct busy state before switching to a new task

2024-07-10 Thread Paolo Bonzini
This step is listed in the Intel manual: "Checks that the new task is available
(call, jump, exception, or interrupt) or busy (IRET return)".

The AMD manual lists the same operation under the "Preventing recursion"
paragraph of "12.3.4 Nesting Tasks", though it is not clear if the processor
checks the busy bit in the IRET case.
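
Spelled out as a truth table, the new check rejects these combinations
of the busy bit in the new TSS descriptor vs. the task switch source:

    source                  busy=0    busy=1
    CALL/JMP/exception/int  ok        #TS
    IRET                    #TS       ok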

Signed-off-by: Paolo Bonzini 
---
 target/i386/tcg/seg_helper.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/i386/tcg/seg_helper.c b/target/i386/tcg/seg_helper.c
index 1430f477c43..25af9d4a4ec 100644
--- a/target/i386/tcg/seg_helper.c
+++ b/target/i386/tcg/seg_helper.c
@@ -306,6 +306,11 @@ static int switch_tss_ra(CPUX86State *env, int 
tss_selector,
 old_tss_limit_max = 43;
 }
 
+/* new TSS must be busy iff the source is an IRET instruction  */
+if (!!(e2 & DESC_TSS_BUSY_MASK) != (source == SWITCH_TSS_IRET)) {
+raise_exception_err_ra(env, EXCP0A_TSS, tss_selector & 0xfffc, 
retaddr);
+}
+
 /* read all the registers from the new TSS */
 if (type & 8) {
 /* 32 bit */
-- 
2.45.2




Re: [RFC PATCH v4 0/7] Add Rust support, implement ARM PL011

2024-07-09 Thread Paolo Bonzini
On Tue, Jul 9, 2024 at 2:18 PM Daniel P. Berrangé  wrote:
> My thought is that the initial merge focuses only on the build system
> integration. So that's basically patches 1 + 2 in this series.
>
> Patch 3, the high level APIs is where I see most of the work and
> collaboration being needed, but that doesn't need to be rushed into
> the first merge. We would have a "rust" subsystem + maintainer who
> would presumably have a staging tree, etc in the normal way we work
> and collaborate

It's complicated. A "perfect" build system integration would include
integration with Kconfig, but it's not simple work and it may not be
Manos's preference for what to work on (or maybe it is), and it's also
not a blocker for further work on patches 3-4.

On the other hand, patches 3 and 4 are _almost_ ready except for
requiring a very new Rust - we know how to tackle that, but again it
may take some time and it's completely separate work from better build
system integration.

In other words, improving build system integration is harder until
merge, but merge is blocked by independent work on lowering the
minimum supported Rust version. This is why I liked the idea of having
a development tree to allow a merge into early 9.2.

On the other hand, given the exceptional scope (completely new code
that can be disabled at will) and circumstances, even a very early
merge into 9.1 (disabled by default) might be better to provide
developers with the easiest base for experimenting. The requirements
for merging, here, would basically amount to a good roadmap and some
established good habits.

A merge into early 9.2 would be a bit harder for experimenting, while
merging it now would sacrifice CI integration in the initial stages of
the work but make cooperation easier.

However, it's difficult to say what's best without knowing Manos's
rationale for preferring not to have a development tree yet.

Paolo




Re: [RFC PATCH v4 2/7] rust: add bindgen step as a meson dependency

2024-07-09 Thread Paolo Bonzini
On Tue, Jul 9, 2024 at 2:09 PM Peter Maydell  wrote:
>  * what is the actual baseline requirement? We definitely want
>to support "using rustup on an older system" (should be no
>problem) and "current distro building QEMU using the distro's
>rust", I assume. It would certainly be nice to have "building
>QEMU on the older-but-still-in-our-support-list distro releases
>with that distro's rust", but this probably implies not just
>a minimum rust version but also a limited set of crates.

I don't think limiting ourselves to the set of crates in the distro is
feasible. It's not the way the language works; for example, I tried
checking whether the "cstr" crate exists in Debian and didn't find it.

We can define a known-good set of versions:
* either via Cargo.lock (i.e. recording the versions and downloading
them) or by including sources in the tree
* at any time in qemu.git, at branching time (three times a year), or
on every release

(That's six possibilities overall).

>I think (but forget the details) that some of what we've done
>with Python where we accept that the older-but-still-supported
>distro will end up taking the "download at build time" path
>in the build system might apply also for Rust.

Yes, and --without-download may be changed to the "--frozen" flag of cargo.

>  * what, on the Rust side, is the version requirement? Presumably
>there's some features that are pretty much "we really need
>this and can't do without it" and some which are "this would
>be pretty awkward not to have but if we have to we can implement
>some kind of fallback/alternative with a TODO note to come
>back and clean up when our baseline moves forwards".

Here are the stopping points that I found over the last couple weeks:

1.56.0: 2021 edition
1.59.0: const CStr::from_bytes_with_nul_unchecked (needed by cstr
crate, see below)
1.64.0: std::ffi::c_char
1.65.0: Generic Associated Types
1.74.0: Clippy can be configured in Cargo.toml
1.77.0: C string literals, offset_of!

I think 1.59.0 is pretty much the lower bound. Not having offset_of!
will be a bit painful, but it can be worked around (and pretty much
has to be, because 1.77.0 is really new).
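
For the record, the usual pre-1.77 fallback for offset_of! is a
declarative macro along these lines (similar to what the memoffset
crate does; just a sketch, not necessarily what we would merge):

    macro_rules! offset_of {
        ($ty:ty, $field:ident) => {{
            let uninit = core::mem::MaybeUninit::<$ty>::uninit();
            let base: *const $ty = uninit.as_ptr();
            // SAFETY: addr_of! computes the field address without
            // materializing a reference to the uninitialized value.
            let field = unsafe { core::ptr::addr_of!((*base).$field) };
            (field as usize) - (base as usize)
        }};
    }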

As far as I understand, Debian bullseye has 1.63.0 these days and it's
the oldest platform we have.

Paolo

> At that point we have more information to figure out what
> if any tradeoff we want to make.
>
> thanks
> -- PMM
>




Re: [RFC PATCH v4 0/7] Add Rust support, implement ARM PL011

2024-07-09 Thread Paolo Bonzini
On Tue, Jul 9, 2024 at 9:38 AM Manos Pitsidianakis
 wrote:
> Ah, alright. That wasn't obvious because that e-mail was not directed
> to me nor did it mention my name :)

Oh, ok. Sorry about that. Generally when I say "we" I include as large
a part of the community as applicable.

> I do not want to do that, in any case. I do not think it's the right approach.

No problem with that (and in fact I agree, as I'd prefer a speedy
merge and doing the work on the QEMU master branch); however, we need
to reach an agreement on that and everybody (including Daniel) needs
to explain the reason for their position.

Daniel's proposed criteria for merging include:
- CI integration
- CI passing for all supported targets (thus lowering the MSRV to 1.63.0)
- plus any the code changes that were or will be requested during review

That seems to be a pretty high amount of work, and until it's done
everyone else is unable to contribute, not even in directions
orthogonal to the above (cross compilation support, less unsafe code,
porting more devices). So something has to give: either we decide for
an early merge, where the code is marked as experimental and disabled
by default. Personally I think it's fine, the contingency plan is
simply to "git rm -rf rust/". Or we can keep the above stringent
requirements for merging, but then I don't see it as a one-person job.

If I can say so, developing on a branch would also be a useful warm-up
for you in the maintainer role, if we expect that there will be
significant community contributions to Rust.

Paolo




Re: [RFC PATCH v4 0/7] Add Rust support, implement ARM PL011

2024-07-08 Thread Paolo Bonzini
On Mon, Jul 8, 2024 at 20:39 Manos Pitsidianakis wrote:

>
>
> On Mon, 8 Jul 2024, 21:34 Paolo Bonzini,  wrote:
>
>>
>>
>> On Mon, Jul 8, 2024 at 19:12 Daniel P. Berrangé wrote:
>>
>>> That's exactly why I suggest its a pre-requisite for merging
>>> this. Unless we're able to demonstrate that we can enable
>>> Rust on all our CI platforms, the benefits of Rust will
>>> not be realized in QEMU, and we'll have never ending debates
>>> about whether each given feature needs to be in C or Rust.
>>>
>>
>> In that case we should develop it on a branch, so that more than one
>> person can contribute (unlike if we keep iterating on this RFC).
>>
>
> If you do, I'd really appreciate it if you did not use any part of my
> patches.
>

"We" means that you would accept patches, review them and apply them until
any agreed-upon conditions for merging are satisfied, and then send either
a final series or a pull request for inclusion in QEMU.

Paolo


Re: [RFC PATCH v4 0/7] Add Rust support, implement ARM PL011

2024-07-08 Thread Paolo Bonzini
On Mon, Jul 8, 2024 at 19:12 Daniel P. Berrangé wrote:

> That's exactly why I suggest its a pre-requisite for merging
> this. Unless we're able to demonstrate that we can enable
> Rust on all our CI platforms, the benefits of Rust will
> not be realized in QEMU, and we'll have never ending debates
> about whether each given feature needs to be in C or Rust.
>

In that case we should develop it on a branch, so that more than one person
can contribute (unlike if we keep iterating on this RFC).

Paolo


>
> > I also believe we should default to enabling rust toolchain by
> > > default in configure, and require and explicit --without-rust
> > > to disable it, *despite* it not technically being a mandatory
> > > featureyet.
> > >
> >
> > I guess the detection could be done, but actually enabling the build part
> > needs to wait until the minimum supported version is low enough.
>
> With regards,
> Daniel
> --
> |: https://berrange.com  -o-
> https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-
> https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-
> https://www.instagram.com/dberrange :|
>
>


Re: [RFC PATCH v4 0/7] Add Rust support, implement ARM PL011

2024-07-08 Thread Paolo Bonzini
On Mon, Jul 8, 2024 at 18:33 Daniel P. Berrangé wrote:

> This series is still missing changes to enable build on all targets
> during CI, including cross-compiles, to prove that we're doing the
> correct thing on all our targetted platforms. That's a must have
> before considering it suitable for merge.
>

But we're not—in particular it's still using several features not in all
supported distros.

I also believe we should default to enabling rust toolchain by
> default in configure, and require and explicit --without-rust
> to disable it, *despite* it not technically being a mandatory
> featureyet.
>

I guess the detection could be done, but actually enabling the build part
needs to wait until the minimum supported version is low enough.

Paolo


> This is to give users a clear message that Rust is likely to
> become a fundamental part of QEMU, so they need to give feedback
> if they hit any problems / have use cases we've not anticipated
> that are problematic wrt Rust.
>
> With regards,
> Daniel
> --
> |: https://berrange.com  -o-
> https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-
> https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-
> https://www.instagram.com/dberrange :|
>
>


Re: [RFC PATCH v4 0/7] Add Rust support, implement ARM PL011

2024-07-08 Thread Paolo Bonzini

On 7/4/24 14:15, Manos Pitsidianakis wrote:

Changes from v3->v4:
- Add rust-specific files to .gitattributes
- Added help text to scripts/cargo_wrapper.py arguments (thanks Stephan)
- Split bindings separate crate
- Add declarative macros for symbols exported to QEMU to said crate
- Lowered MSRV to 1.77.2
- Removed auto-download and install of bindgen-cli
- Fixed re-compilation of Rust objects in case they are missing from
   filesystem
- Fixed optimized builds by adding #[used] (thanks Pierrick for the help
   debugging this)


I think the largest issue is that I'd rather have a single cargo build 
using a virtual manifest, because my hunch is that it'd be the easiest 
path towards Kconfig integration.  But it's better to do this after 
merge, as the changes are pretty large.  It's also independent from any 
other changes targeted at removing unsafe code, so no need to hold back 
on merging.
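
For reference, the virtual manifest itself would be tiny, something
like this in rust/Cargo.toml (member names assumed from this series):

    [workspace]
    members = ["qemu-api", "pl011"]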


Other comments I made that should however be addressed before merging, 
from most to least important:


- TODO comments when the code is doing potential undefined behavior

- module structure should IMO resemble the C part of the tree

- only generate bindings.rs.inc once

- a couple abstractions that I'd like to have now: a trait to store the
CStr corresponding to the structs, and one to generate all-zero structs
without having to type "unsafe { MaybeUninit::zeroed().assume_init() }"
(see the sketch after this list)


- I pointed out a couple lints that are too broad and should be enabled 
per-file, even if right now it's basically all files that include them.


- add support for --cargo and CARGO environment variables (if my patch 
works without too much hassle)


- I'd like to use ctor instead of non-portable linker magic, and the 
cstr crate instead of CStr statics or c""


- please check if -Wl,--whole-archive can be replaced with link_whole:

- probably, until Rust is enabled by default we should treat 
dependencies as a moving target and not commit Cargo.lock files.  In the 
meanwhile we can discuss how to handle them.
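
Here is the sketch for the two abstractions mentioned in the list
above (all names are mine and only illustrative):

    use std::ffi::CStr;
    use std::mem::MaybeUninit;

    /// Associates a type with the C string naming its QOM type.
    pub trait TypeName {
        const TYPE_NAME: &'static CStr;
    }

    /// Implementors assert that an all-zero bit pattern is a valid
    /// value of Self, so zeroed() needs no unsafe at the call site.
    pub unsafe trait Zeroable: Sized {
        fn zeroed() -> Self {
            // SAFETY: guaranteed by whoever writes "unsafe impl Zeroable".
            unsafe { MaybeUninit::zeroed().assume_init() }
        }
    }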


And a few aesthetic changes on top of this.

With respect to lints, marking entire groups as "deny" is problematic. 
Before merge, I'd rather have the groups as just "warn", and add a long 
list of denied lints[1].  After merge we can also add a non-fatal CI job 
that runs clippy with nightly rust and with groups marked as "deny". 
This matches the check-python-tox job in python/.


[1] https://github.com/bonzini/rust-qemu/commit/95b25f7c5f4e

Thanks,

Paolo




Re: [RFC PATCH v4 4/7] rust: add PL011 device model

2024-07-08 Thread Paolo Bonzini
On Thu, Jul 4, 2024 at 2:16 PM Manos Pitsidianakis
 wrote:
> +ARM PL011 Rust device
> +M: Manos Pitsidianakis 
> +S: Maintained
> +F: rust/pl011/

No need for this, since it's covered by rust/. If (when) it replaces
the main one, the PL011-specific stanza will be assigned to ARM
maintainers (while you keep covering it via rust/).


> +if with_rust
> +  subdir('rust')
> +endif

Should be in patch 3.

> +subdir('pl011')

As I said before it should be handled via Kconfig, but let's do that
after the initial merge. However...

> +correctness = { level = "deny", priority = -1 }
> +suspicious = { level = "deny", priority = -1 }
> +complexity = { level = "deny", priority = -1 }
> +perf = { level = "deny", priority = -1 }
> +cargo = { level = "deny", priority = -1 }
> +nursery = { level = "deny", priority = -1 }
> +style = { level = "deny", priority = -1 }
> +# restriction group
> +dbg_macro = "deny"
> +rc_buffer = "deny"
> +as_underscore = "deny"

... repeated lints really suggest that you should use a workspace and
a single cargo invocation to build both the rust-qapi and pl011
crates, which I think is doable if you can run bindgen just once.

> +use core::{mem::MaybeUninit, ptr::NonNull};

Let's remove at least this unsafety.

> +#[used]
> +pub static VMSTATE_PL011: VMStateDescription = VMStateDescription {
> +name: PL011_ARM_INFO.name,
> +unmigratable: true,
> +..unsafe { MaybeUninit::zeroed().assume_init() }
> +};
> +
> +#[no_mangle]
> +pub unsafe extern "C" fn pl011_init(obj: *mut Object) {
> +assert!(!obj.is_null());
> +let mut state = NonNull::new_unchecked(obj.cast::<PL011State>());
> +state.as_mut().init();

This is fine for now, but please add a

// TODO: this assumes that "all zeroes" is a valid state for all fields of
// PL011State. This is not necessarily true of any #[repr(Rust)] structs,
// including bilge-generated types. It should instead use MaybeUninit.

> +}
> +
> +qemu_api::module_init! {
> +qom: register_type => {
> +type_register_static(&PL011_ARM_INFO);
> +}

Can you make the macro look like

   MODULE_INIT_QOM: fn register_type() {
 ...
   }

so that it's clear what "register_type" is, and so that it's easier to
extend it to more values?
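
Untested sketch of a macro arm that would accept this syntax (the
generated constructor is registered via .init_array here only for
illustration; the real body would call QEMU's module registration):

    #[macro_export]
    macro_rules! module_init {
        (MODULE_INIT_QOM: fn $func:ident() $body:block) => {
            extern "C" fn $func() $body

            #[used]
            #[link_section = ".init_array"]
            static MODULE_CTOR: extern "C" fn() = $func;
        };
        // further MODULE_INIT_* phases become new macro arms
    }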

> +#[doc(alias = "clk")]
> +pub clock: NonNull<Clock>,

It's null when init() runs, so please use *mut Clock.

> +#[doc(alias = "migrate_clk")]
> +pub migrate_clock: bool,

Please put all properties together in the struct for readability.

> +}
> +
> +#[used]
> +pub static CLK_NAME: &CStr = c"clk";
> +
> +impl PL011State {
> +pub fn init(&mut self) {
> +unsafe {
> +memory_region_init_io(
> +addr_of_mut!(self.iomem),
> +addr_of_mut!(*self).cast::<Object>(),
> +&PL011_OPS,
> +addr_of_mut!(*self).cast::<c_void>(),
> +PL011_ARM_INFO.name,
> +0x1000,
> +);
> +let sbd = addr_of_mut!(*self).cast::<SysBusDevice>();
> +let dev = addr_of_mut!(*self).cast::<DeviceState>();
> +sysbus_init_mmio(sbd, addr_of_mut!(self.iomem));
> +for irq in self.interrupts.iter_mut() {
> +sysbus_init_irq(sbd, irq);
> +}
> +self.clock = NonNull::new(qdev_init_clock_in(
> +dev,
> +CLK_NAME.as_ptr(),
> +None, /* pl011_clock_update */
> +addr_of_mut!(*self).cast::<c_void>(),
> +ClockEvent_ClockUpdate,
> +))
> +.unwrap();
> +}
> +}
> +
> +pub fn read(&mut self, offset: hwaddr, _size: core::ffi::c_uint) -> u64 {
> +use RegisterOffset::*;
> +
> +match RegisterOffset::try_from(offset) {
> +Err(v) if (0x3f8..0x400).contains(&v) => {
> +u64::from(PL011_ID_ARM[((offset - 0xfe0) >> 2) as usize])
> +}
> +Err(_) => {
> +// qemu_log_mask(LOG_GUEST_ERROR, "pl011_read: Bad offset 
> 0x%x\n", (int)offset);
> +0
> +}
> +Ok(DR) => {
> +// s->flags &= ~PL011_FLAG_RXFF;
> +self.flags.set_receive_fifo_full(false);
> +let c = self.read_fifo[self.read_pos];
> +if self.read_count > 0 {
> +self.read_count -= 1;
> +self.read_pos = (self.read_pos + 1) & (self.fifo_depth() 
> - 1);
> +}
> +if self.read_count == 0 {
> +// self.flags |= PL011_FLAG_RXFE;
> +self.flags.set_receive_fifo_empty(true);
> +}
> +if self.read_count + 1 == self.read_trigger {
> +//self.int_level &= ~ INT_RX;
> +self.int_level &= !registers::INT_RX;
> +}
> +// Update error bits.
> +self.receive_status_error_clear = c.to_be_bytes()[3].into();
> +self.update();
> +unsafe { qemu_chr_fe_accept_input( 

Re: [RFC PATCH v4 3/7] rust: add crate to expose bindings and interfaces

2024-07-08 Thread Paolo Bonzini

On 7/4/24 14:15, Manos Pitsidianakis wrote:

Add rust/qemu-api, which exposes rust-bindgen generated FFI bindings and
provides some declaration macros for symbols visible to the rest of
QEMU.

Signed-off-by: Manos Pitsidianakis 
---
  MAINTAINERS   |   7 ++
  rust/.cargo/config.toml   |   2 +
  rust/qemu-api/.gitignore  |   2 +
  rust/qemu-api/Cargo.lock  |   7 ++
  rust/qemu-api/Cargo.toml  |  59 ++
  rust/qemu-api/README.md   |  17 
  rust/qemu-api/build.rs|  48 +++
  rust/qemu-api/deny.toml   |  57 +
  rust/qemu-api/meson.build |   0
  rust/qemu-api/rustfmt.toml|   1 +
  rust/qemu-api/src/bindings.rs |   8 ++
  rust/qemu-api/src/definitions.rs  | 112 +
  rust/qemu-api/src/device_class.rs | 131 ++


I'd rather put the various definitions in directories that roughly 
correspond to the C files, so rust/qemu-api/src/{util,qom,hw/core/}.

+correctness = { level = "deny", priority = -1 }
+suspicious = { level = "deny", priority = -1 }
+complexity = { level = "deny", priority = -1 }
+perf = { level = "deny", priority = -1 }
+cargo = { level = "deny", priority = -1 }
+nursery = { level = "deny", priority = -1 }
+style = { level = "deny", priority = -1 }


Groups are problematic because they can (and will) cause breakage when 
new failures are added.  I think we should have a non-fatal job to test 
with groups, but by default we can add a large number of lints and 
"allow" the groups.



+cast_ptr_alignment = "allow"


Why?


+missing_safety_doc = "allow"


Please add it to individual files at the top.


diff --git a/rust/qemu-api/README.md b/rust/qemu-api/README.md
new file mode 100644
index 00..f16a1a929d
--- /dev/null
+++ b/rust/qemu-api/README.md
@@ -0,0 +1,17 @@
+# QEMU bindings and API wrappers
+
+This library exports helper Rust types, Rust macros and C FFI bindings for 
internal QEMU APIs.
+
+The C bindings can be generated with `bindgen`, using this build target:
+
+```console
+$ ninja bindings-aarch64-softmmu.rs
+```
+
+## Generate Rust documentation
+
+To generate docs for this crate, including private items:
+
+```sh
+cargo doc --no-deps --document-private-items
+```
diff --git a/rust/qemu-api/build.rs b/rust/qemu-api/build.rs
new file mode 100644
index 00..13164f8371
--- /dev/null
+++ b/rust/qemu-api/build.rs
@@ -0,0 +1,48 @@
+// Copyright 2024 Manos Pitsidianakis 
+// SPDX-License-Identifier: GPL-2.0 OR GPL-3.0-or-later
+
+use std::{env, path::Path};
+
+fn main() {
+println!("cargo:rerun-if-env-changed=MESON_BUILD_ROOT");
+println!("cargo:rerun-if-changed=src/bindings.rs.inc");
+
+let out_dir = env::var_os("OUT_DIR").unwrap();
+
+if let Some(build_dir) = std::env::var_os("MESON_BUILD_ROOT") {
+let mut build_dir = Path::new(&build_dir).to_path_buf();
+let mut out_dir = Path::new(&out_dir).to_path_buf();
+assert!(
+build_dir.exists(),
+"MESON_BUILD_ROOT value does not exist on filesystem: {}",
+build_dir.display()
+);
+assert!(
+build_dir.is_dir(),
+"MESON_BUILD_ROOT value is not actually a directory: {}",
+build_dir.display()
+);
+// TODO: add logic for other guest target architectures.
+build_dir.push("bindings-aarch64-softmmu.rs");
+let bindings_rs = build_dir;
+assert!(
+bindings_rs.exists(),
+"MESON_BUILD_ROOT/bindings-aarch64-softmmu.rs does not exist on 
filesystem: {}",
+bindings_rs.display()
+);
+assert!(
+bindings_rs.is_file(),
+"MESON_BUILD_ROOT/bindings-aarch64-softmmu.rs is not a file: {}",
+bindings_rs.display()
+);
+out_dir.push("bindings.rs");
+std::fs::copy(bindings_rs, out_dir).unwrap();
+println!("cargo:rustc-cfg=MESON_BINDINGS_RS");
+} else if !Path::new("src/bindings.rs.inc").exists() {
+panic!(
+"No generated C bindings found! Either build them manually with 
bindgen or with meson \
+ (`ninja bindings-aarch64-softmmu.rs`) and copy them to 
src/bindings.rs.inc, or build \
+ through meson."
+);
+}
+}
diff --git a/rust/qemu-api/deny.toml b/rust/qemu-api/deny.toml
new file mode 100644
index 00..3992380509
--- /dev/null
+++ b/rust/qemu-api/deny.toml
@@ -0,0 +1,57 @@
+# cargo-deny configuration file
+
+[graph]
+targets = [
+"aarch64-unknown-linux-gnu",
+"x86_64-unknown-linux-gnu",
+"x86_64-apple-darwin",
+"aarch64-apple-darwin",
+"x86_64-pc-windows-gnu",
+"aarch64-pc-windows-gnullvm",


Do we actually support LLVM Windows targets?


+unsafe impl Sync for TypeInfo {}
+unsafe impl Sync for VMStateDescription {}
+
+#[macro_export]
+macro_rules! module_init {
+($func:expr, $type:expr) => {
+#[used]
+

Re: [RFC PATCH v4 2/7] rust: add bindgen step as a meson dependency

2024-07-08 Thread Paolo Bonzini
On Thu, Jul 4, 2024 at 2:16 PM Manos Pitsidianakis
 wrote:
>
> Add mechanism to generate rust hw targets that depend on a custom
> bindgen target for rust bindings to C.
>
> This way bindings will be created before the rust crate is compiled.
>
> The bindings will end up in BUILDDIR/{target}-generated.rs and have the same 
> name
> as a target:
>
> ninja aarch64-softmmu-generated.rs

Is there anything target-dependent in these files? You can generate it
just once I think.


> +  if with_rust and target_type == 'system'
> +   # FIXME: meson outputs the following warnings, which should be 
> resolved
> +   # before merging:
> +   # > WARNING: Project specifies a minimum meson_version '>=0.63.0' but
> +   # > uses features which were added in newer versions:
> +   # > * 0.64.0: {'fs.copyfile'}

you can use configure_file() instead.

> +   # > * 1.0.0: {'dependencies arg in rust.bindgen', 'module rust as 
> stable module'}

You can use

if meson.version().version_compare('>=1.0.0')
  ...
else
  error('Rust support requires Meson version 1.0.0')
endif

> +  if target in rust_targets
> +rust_hw = ss.source_set()
> +foreach t: rust_targets[target]
> +  rust_device_cargo_build = custom_target(t['name'],
> +   output: t['output'],
> +   depends: [bindings_rs],
> +   build_always_stale: true,
> +   command: t['command'])

Also "console: true".

> +  rust_dep = declare_dependency(link_args: [
> +  '-Wl,--whole-archive',
> +  t['output'],
> +  '-Wl,--no-whole-archive'
> +  ],

This is not portable, but I think you can use just "link_whole:
rust_device_cargo_build".

> +msrv = {
> +  'rustc': '1.77.2',
> +  'cargo': '1.77.2',

I think rustc and cargo are always matching, so no need to check both of them?

> +foreach rust_hw_target, rust_hws: rust_hw_target_list
> +  foreach rust_hw_dev: rust_hws

Finding the available devices should be done using Kconfig symbols,
probably by passing the path to the config-devices.mak file to
build.rs via an environment variable.
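
For example, something along these lines in build.rs (environment
variable name and cfg scheme invented for illustration):

    use std::{env, fs};

    fn main() {
        println!("cargo:rerun-if-env-changed=CONFIG_DEVICES_MAK");
        if let Ok(path) = env::var("CONFIG_DEVICES_MAK") {
            let config = fs::read_to_string(&path).expect("config-devices.mak");
            // lines look like "CONFIG_PL011=y"
            for line in config.lines() {
                if let Some(symbol) = line.strip_suffix("=y") {
                    println!("cargo:rustc-cfg={}", symbol.to_lowercase());
                }
            }
        }
    }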

> +#include "qemu/osdep.h"
> +#include "qemu/module.h"
> +#include "qemu-io.h"
> +#include "sysemu/sysemu.h"
> +#include "hw/sysbus.h"
> +#include "exec/memory.h"
> +#include "chardev/char-fe.h"
> +#include "hw/clock.h"
> +#include "hw/qdev-clock.h"
> +#include "hw/qdev-properties.h"
> +#include "hw/qdev-properties-system.h"
> +#include "hw/irq.h"
> +#include "qapi/error.h"
> +#include "migration/vmstate.h"
> +#include "chardev/char-serial.h"

I think all of these should be target-independent?

> diff --git a/scripts/cargo_wrapper.py b/scripts/cargo_wrapper.py
> index d2c7265461..e7d9238c16 100644
> --- a/scripts/cargo_wrapper.py
> +++ b/scripts/cargo_wrapper.py
> @@ -111,6 +111,8 @@ def get_cargo_rustc(args: argparse.Namespace) -> 
> tuple[Dict[str, Any], List[str]
>
>  env = os.environ
>  env["CARGO_ENCODED_RUSTFLAGS"] = cfg
> +env["MESON_BUILD_DIR"] = str(target_dir)
> +env["MESON_BUILD_ROOT"] = str(args.meson_build_root)
>
>  return (env, cargo_cmd)
>
> @@ -231,19 +233,11 @@ def main() -> None:
>  default=[],
>  )
>  parser.add_argument(
> -"--meson-build-dir",
> -metavar="BUILD_DIR",
> -help="meson.current_build_dir()",
> +"--meson-build-root",
> +metavar="BUILD_ROOT",
> +help="meson.project_build_root(): the root build directory. Example: 
> '/path/to/qemu/build'",
>  type=Path,
> -dest="meson_build_dir",
> -required=True,
> -)
> -parser.add_argument(
> -"--meson-source-dir",
> -metavar="SOURCE_DIR",
> -help="meson.current_source_dir()",
> -type=Path,
> -dest="meson_build_dir",
> +dest="meson_build_root",
>  required=True,
>  )
>  parser.add_argument(

Please squash in the previous patch.

Paolo




Re: [RFC PATCH v4 1/7] build-sys: Add rust feature option

2024-07-08 Thread Paolo Bonzini
On Thu, Jul 4, 2024 at 2:16 PM Manos Pitsidianakis
 wrote:
>
> Add options for Rust in meson_options.txt, meson.build, configure to
> prepare for adding Rust code in the followup commits.
>
> `rust` is a reserved meson name, so we have to use an alternative.
> `with_rust` was chosen.

Did you find any problem with the other approach that I sent, to
support --cargo and the CARGO environment variable in configure?

Paolo




Re: [PATCH v3 4/6] target/i386: add support for VMX FRED controls

2024-07-06 Thread Paolo Bonzini
On Sat, Jul 6, 2024 at 17:57 Li, Xin3 wrote:

> >> The bits in the secondary vmexit controls are not supported, and in
> >> general the same is true for the secondary vmexit case.  I think it's
> >> better to not include the vmx-entry-load-fred bit either, and only do
> >> the vmxcap changes.
>
> > Right, we don't need it at all.
>
> Hi Paolo,
>
> We actually do need the following change for nested FRED guests to boot:
>
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 227ee1c759..dcf914a7ec 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -1285,7 +1285,7 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
>  NULL, "vmx-entry-ia32e-mode", NULL, NULL,
>  NULL, "vmx-entry-load-perf-global-ctrl",
> "vmx-entry-load-pat", "vmx-entry-load-efer",
>  "vmx-entry-load-bndcfgs", NULL, "vmx-entry-load-rtit-ctl",
> NULL,
> -NULL, NULL, "vmx-entry-load-pkrs", NULL,
> +NULL, NULL, "vmx-entry-load-pkrs", "vmx-entry-load-fred",
>  NULL, NULL, NULL, NULL,
>  NULL, NULL, NULL, NULL,
>  },
>
> Or do you think it's not the root cause?
>

The patch is correct but is FRED supported in nested VMX? Or is it with not
yet merged patches?

Paolo



> Thanks!
> Xin
>
>


Re: [PATCH 00/14] rust: example of bindings code for Rust in QEMU

2024-07-05 Thread Paolo Bonzini
Hi, first of all I want to clarify the raison d'être for this posting,
which I have also explained to Manos. Nothing you see here is code
that will be certainly included in QEMU; it's (mostly) throwaway by
design. I don't have much attachment to any of the code except perhaps
the casting and reference counting stuff (which is in, like, the third
rewrite... I got nerd-sniped there :)). But at the same time I think
it's plausibly not too different (in look and complexity) from what
the actual code will look like.

It's also not an attempt to bypass/leapfrog other people in the
design; it's more a "let's be ready for this to happen" than a design
document. The first step and the more immediate focus remains the
build system integration.

But you have good questions and observations so I'll put something
below about the design as well.

On Fri, Jul 5, 2024 at 8:52 PM Pierrick Bouvier
 wrote:
> To give an example of questions we could have:
>
> Why should we have distinct structs for a device, the object
> represented, and its state?

If you refer to the conf and state, a bit because of the different
traits required (ConstDefault is needed for properties but not the
rest) and especially because we can enforce immutability of
configuration vs. interior mutability of state.

In particular, from Manos's prototype I noticed that you need to
access the Chardev while the state is *not* borrowed; methods on the
Chardev call back into the device and the callback needs mutable
access to the state. So some kind of separation seems to be necessary,
for better or worse, unless interior mutability is achieved with a
dozen Cells (not great).

> Having to implement ObjectImpl and DeviceImpl looks like an
> implementation detail, that could be derived automatically from a
> device. What's wrong with calling a realize that does nothing in the end?

The problem is not calling a realize that does nothing, it's that you
are forced to override the superclass implementation (which might be
in C) if you don't go through Option<>. class_init methods update a
few superclass methods but not all of them.

The vtable in ObjectImpl/DeviceImpl could be generated via macros; for
now I did it by hand to avoid getting carried away too much with
syntactic sugar.

> Could device state be serialized with something like serde?

Probably not, there's extra complications for versioning. A long time
ago there was an attempt to have some kind of IDL embedded in C code
(not unlike Rust attributes) to automatically generate properties and
vmstate, but it was too ambitious.
(https://www.linux-kvm.org/images/b/b5/2012-forum-qidl-talk.pdf,
slides 1-9).

Generating property and vmstate declarations could be done with
"normal" macros just like in C, or with attribute macros (needs a lot
more code and experience, wouldn't do it right away). However, serde
for QAPI and/or visitors may be a possibility, after all JSON is
serde's bread and butter.

> This is the kind of things that could be discussed, on a reduced
> example, without specially looking at how to implement that concretely,
> in a first time.

There are some things that Rust really hates that you do, and they
aren't always obvious. Therefore in this exercise I tried to let
intuition guide me, and see how much the type system fought that
intuition (not much actually!). I started from the more technical and
less artistic part to see if I was able to get somewhere.

But yes, it's a great exercise to do this experiment from the opposite
end. Then the actual glue code will "meet in the middle", applying
lessons learnt from both experiments, with individual pieces of the
real interface implemented and applied to the PL011 sample at the same
time. The main missing thing in PL011 is DMA, otherwise it's a nice
playground.

> usage of magic macros as syntactic sugar should indeed be thought twice.
> Return a collection of QDevProperty could be better than having a magic
> @properties syntactic sugar.

No hard opinions there, sure. I went for the magic syntax because the
properties cannot be a const and I wanted to hide the ugly "static
mut", but certainly there could be better ways to do it.

> Another thing that could be discussed is: do we want to have the whole
> inheritance mechanism for Rust devices? Is it really needed?

Eh, this time unfortunately I think it is. Stuff like children and
properties, which are methods of Object, need to be available to
subclasses in instance_init and/or realize. Reset is in Device and we
have the whole Device/SysBusDevice/PCIDevice set of classes, so you
need to access superclass methods from subclasses. It's not a lot of
code though.

> Between having a clean and simple Device definition, with a bit more of
> magic glue (hidden internally), and requires devices to do some magic
> initialization to satisfy existing architecture, what would be the best?
> Even though it takes 1000 more lines, I would be in favor to have
> Devices that are as clean and simple as possible. 

Re: [PATCH 00/14] rust: example of bindings code for Rust in QEMU

2024-07-05 Thread Paolo Bonzini
On Thu, Jul 4, 2024 at 9:26 PM Pierrick Bouvier
 wrote:
> > Patches 9-10 deal with how to define new subclasses in Rust.  They are
> > a lot less polished and less ready.  There is probably a lot of polish
> > that could be applied to make the code look nicer, but I guess there is
> > always time to make the code look nicer until the details are sorted out.
> >
> > The things that I considered here are:
> >
> > - splitting configuration from runtime state
> > - automatic generation of instance_init and instance_finalize
> > - generation of C vtables from safe code that is written in Rust
> > - generation of qdev property tables
>
> Shouldn't we focus more on how we want to write a QOM device in Rust
> instead, by making abstraction of existing C implementation first?
> Once we have satisfying idea, we could evaluate how it maps to existing
> implementation, and which compromise should be made for efficiency.
> It seems that the current approach is to stick to existing model, and
> derive Rust bindings from this.

I think I tried to do a little bit of that. For example, the split of
configuration from runtime state is done to isolate the interior
mutability, and the automatic generation of instance_init and
instance_finalize relies on Rust traits like Default and Drop; the
realize implementation returns a qemu::Result<()>; and so on.

But there are many steps towards converting the PL011 device to Rust:

- declarative definitions of bitfields and registers (done)
- using RefCell for mutable state
- using wrappers for class and property definition
- defining and using wrappers for Chardev/CharBackend
- defining and using wrappers for MemoryRegion callbacks
- defining and using wrappers for SysBusDevice functionality
- defining and using wrappers for VMState functionality

At each step we need to design "how we want to do that in Rust" and
all the things you mention above. This prototype covers the second and
third steps, and it's already big enough! :)

I expect that code like this one would be merged together with the
corresponding changes to PL011: the glue code is added to the qemu/
crate and used in the hw/core/pl011/ directory. This way, both authors
and reviewers will have a clue of what becomes more readable/idiomatic
in the resulting code.

> I agree this glue can be a source of technical debt, but on the other
> side, it should be easy to refactor it, if we decided first what the
> "clean and idiomatic" Rust API should look like.

That's true, especially if we want to add more macro magic on top. We
can decide later where to draw the line between complexity of glue
code and cleanliness of Rust code, and also move the line as we see
fit.

On the other hand, if we believe that _this_ code is already too much,
that's also a data point. Again: I don't think it is, but I want us to
make an informed decision.

> A compromise where you have to manually translate some structs, or copy
> memory to traverse languages borders at some point, could be worth the
> safety and expressiveness.

Yep, Error is an example of this. It's not the common case, so it's
totally fine to convert to and from C Error* (which also includes
copying the inner error string, from a String to malloc-ed char*) to
keep the APIs nicer.

> > We should have an idea of what this glue code looks like, in order to make
> > an informed choice.  If we think we're not comfortable with reviewing it,
> > well, we should be ready to say so and stick with C until we are.
>
> While it is important that this glue code is maintainable and looks
> great, I don't think it should be the main reason to accept usage of a
> new language.

Not the main reason, but an important factor in judging if we are
*able* to accept usage of a new language. Maybe it's just a formal
step, but it feels safer to have _some_ code to show and to read, even
if it's just a prototype that barely compiles.

> We could have a specific trait with functions to handle various kind of
> events. And the glue code could be responsible to translate this to
> callbacks for the C side (and calling Rust code accordingly, eventually
> serializing this on a single thread to avoid any race issues).

That's a possibility, yes. Something like (totally random):

impl CharBackend {
pub fn start(&mut self, obj: &Obj) {
qemu_chr_fe_set_handlers(self.0,
 Some(obj::can_receive),
Some(obj::receive), Some(obj::event),
 Some(obj::be_change), obj as *const _ as
*const c_void,
 ptr::null(), true);
}
pub fn stop(&mut self) {
qemu_chr_fe_set_handlers(self.0, None, None,
 None, ptr::null(), ptr::null(), true);
}
}

but I left that for later because I focused just on the lowest-level code
where you could have "one" design. For example, for memory regions
some devices are going to have more than one, so there could be
reasons to do callbacks in different ways. By the way, all of this

[PULL 15/16] target/i386: add support for masking CPUID features in confidential guests

2024-07-04 Thread Paolo Bonzini
Some CPUID features may be provided by KVM for some guests, independent of
processor support, for example TSC deadline or TSC adjust.  If these are
not supported by the confidential computing firmware, however, the guest
will fail to start.  Add support for removing unsupported features from
"-cpu host".

Signed-off-by: Paolo Bonzini 
---
 target/i386/confidential-guest.h | 24 
 target/i386/kvm/kvm.c|  5 +
 2 files changed, 29 insertions(+)

diff --git a/target/i386/confidential-guest.h b/target/i386/confidential-guest.h
index 532e172a60b..7342d2843aa 100644
--- a/target/i386/confidential-guest.h
+++ b/target/i386/confidential-guest.h
@@ -39,6 +39,8 @@ struct X86ConfidentialGuestClass {
 
 /*  */
 int (*kvm_type)(X86ConfidentialGuest *cg);
+uint32_t (*mask_cpuid_features)(X86ConfidentialGuest *cg, uint32_t 
feature, uint32_t index,
+int reg, uint32_t value);
 };
 
 /**
@@ -56,4 +58,26 @@ static inline int 
x86_confidential_guest_kvm_type(X86ConfidentialGuest *cg)
 return 0;
 }
 }
+
+/**
+ * x86_confidential_guest_mask_cpuid_features:
+ *
+ * Removes unsupported features from a confidential guest's CPUID values, 
returns
+ * the value with the bits removed.  The bits removed should be those that KVM
+ * provides independent of host-supported CPUID features, but are not 
supported by
+ * the confidential computing firmware.
+ */
+static inline int 
x86_confidential_guest_mask_cpuid_features(X86ConfidentialGuest *cg,
+ uint32_t feature, 
uint32_t index,
+ int reg, uint32_t 
value)
+{
+X86ConfidentialGuestClass *klass = X86_CONFIDENTIAL_GUEST_GET_CLASS(cg);
+
+if (klass->mask_cpuid_features) {
+return klass->mask_cpuid_features(cg, feature, index, reg, value);
+} else {
+return value;
+}
+}
+
 #endif
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index dd8b0f33136..056d117cd11 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -548,6 +548,11 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, 
uint32_t function,
 ret |= 1U << KVM_HINTS_REALTIME;
 }
 
+if (current_machine->cgs) {
+ret = x86_confidential_guest_mask_cpuid_features(
+X86_CONFIDENTIAL_GUEST(current_machine->cgs),
+function, index, reg, ret);
+}
 return ret;
 }
 
-- 
2.45.2




[PULL 00/16] meson, i386 changes for 2024-07-04

2024-07-04 Thread Paolo Bonzini
The following changes since commit 1a2d52c7fcaeaaf4f2fe8d4d5183dccaeab67768:

  Merge tag 'pull-request-2024-07-02' of https://gitlab.com/thuth/qemu into 
staging (2024-07-02 15:49:08 -0700)

are available in the Git repository at:

  https://gitlab.com/bonzini/qemu.git tags/for-upstream

for you to fetch changes up to 188569c10d5dc6996bde90ce25645083e9661ecb:

  target/i386/SEV: implement mask_cpuid_features (2024-07-04 11:56:20 +0200)


* meson: Pass objects and dependencies to declare_dependency(), not 
static_library()
* meson: Drop the .fa library suffix
* target/i386: drop AMD machine check bits from Intel CPUID
* target/i386: add avx-vnni-int16 feature
* target/i386: SEV bugfixes
* target/i386: SEV-SNP -cpu host support
* char: fix exit issues


Akihiko Odaki (2):
  meson: Pass objects and dependencies to declare_dependency()
  Revert "meson: Propagate gnutls dependency"

Maxim Mikityanskiy (1):
  char-stdio: Restore blocking mode of stdout on exit

Michal Privoznik (2):
  i386/sev: Fix error message in sev_get_capabilities()
  i386/sev: Fallback to the default SEV device if none provided in 
sev_get_capabilities()

Paolo Bonzini (11):
  meson: move shared_module() calls where modules are already walked
  meson: move block.syms dependency out of libblock
  meson: merge plugin_ldflags into emulator_link_args
  meson: Drop the .fa library suffix
  target/i386: pass X86CPU to x86_cpu_get_supported_feature_word
  target/i386: drop AMD machine check bits from Intel CPUID
  target/i386: SEV: fix formatting of CPUID mismatch message
  target/i386: do not include undefined bits in the AMD topoext leaf
  target/i386: add avx-vnni-int16 feature
  target/i386: add support for masking CPUID features in confidential guests
  target/i386/SEV: implement mask_cpuid_features

 docs/devel/build-system.rst |   8 +--
 meson.build | 100 ++--
 target/i386/confidential-guest.h|  24 +
 target/i386/cpu.h   |  10 +++-
 chardev/char-stdio.c|   4 ++
 hw/i386/pc.c|   1 +
 stubs/blk-exp-close-all.c   |   2 +-
 target/i386/cpu.c   |  40 +++
 target/i386/kvm/kvm-cpu.c   |   2 +-
 target/i386/kvm/kvm.c   |   5 ++
 target/i386/sev.c   |  51 ++
 .gitlab-ci.d/buildtest-template.yml |   2 -
 .gitlab-ci.d/buildtest.yml  |   2 -
 block/meson.build   |   2 +-
 gdbstub/meson.build |   6 +--
 io/meson.build  |   2 +-
 plugins/meson.build |   7 ++-
 pythondeps.toml |   2 +-
 storage-daemon/meson.build  |   3 +-
 tcg/meson.build |   8 +--
 tests/Makefile.include  |   2 +-
 tests/qtest/libqos/meson.build  |   3 +-
 ui/meson.build  |   2 +-
 23 files changed, 183 insertions(+), 105 deletions(-)
-- 
2.45.2




[PULL 09/16] target/i386: SEV: fix formatting of CPUID mismatch message

2024-07-04 Thread Paolo Bonzini
Fixes: 70943ad8e4d ("i386/sev: Add support for SNP CPUID validation", 
2024-06-05)
Signed-off-by: Paolo Bonzini 
---
 target/i386/sev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index 3ab8b3c28b7..2a0f94d390d 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -841,7 +841,7 @@ sev_snp_cpuid_report_mismatches(SnpCpuidInfo *old,
 size_t i;
 
 if (old->count != new->count) {
-error_report("SEV-SNP: CPUID validation failed due to count mismatch,"
+error_report("SEV-SNP: CPUID validation failed due to count mismatch, "
  "provided: %d, expected: %d", old->count, new->count);
 return;
 }
@@ -853,8 +853,8 @@ sev_snp_cpuid_report_mismatches(SnpCpuidInfo *old,
 new_func = &new->entries[i];
 
 if (memcmp(old_func, new_func, sizeof(SnpCpuidFunc))) {
-error_report("SEV-SNP: CPUID validation failed for function 0x%x, 
index: 0x%x"
- "provided: eax:0x%08x, ebx: 0x%08x, ecx: 0x%08x, edx: 
0x%08x"
+error_report("SEV-SNP: CPUID validation failed for function 0x%x, 
index: 0x%x, "
+ "provided: eax:0x%08x, ebx: 0x%08x, ecx: 0x%08x, edx: 
0x%08x, "
  "expected: eax:0x%08x, ebx: 0x%08x, ecx: 0x%08x, edx: 
0x%08x",
  old_func->eax_in, old_func->ecx_in,
  old_func->eax, old_func->ebx, old_func->ecx, 
old_func->edx,
-- 
2.45.2




[PULL 07/16] target/i386: pass X86CPU to x86_cpu_get_supported_feature_word

2024-07-04 Thread Paolo Bonzini
This allows modifying the bits in "-cpu max"/"-cpu host" depending on
the guest CPU vendor (which, at least by default, is the host vendor in
the case of KVM).

For example, machine check architecture differs between Intel and AMD,
and bits from AMD should be dropped when configuring the guest for
an Intel model.

Cc: Xiaoyao Li 
Cc: John Allen 
Signed-off-by: Paolo Bonzini 
---
 target/i386/cpu.h |  3 +--
 target/i386/cpu.c | 11 +--
 target/i386/kvm/kvm-cpu.c |  2 +-
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 29daf370485..9bea7142bf4 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -666,8 +666,7 @@ typedef enum FeatureWord {
 } FeatureWord;
 
 typedef uint64_t FeatureWordArray[FEATURE_WORDS];
-uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
-bool migratable_only);
+uint64_t x86_cpu_get_supported_feature_word(X86CPU *cpu, FeatureWord w);
 
 /* cpuid_features bits */
 #define CPUID_FP87 (1U << 0)
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 4c2e6f3a71e..4364cb0f8e3 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6035,8 +6035,7 @@ CpuDefinitionInfoList *qmp_query_cpu_definitions(Error 
**errp)
 
 #endif /* !CONFIG_USER_ONLY */
 
-uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
-bool migratable_only)
+uint64_t x86_cpu_get_supported_feature_word(X86CPU *cpu, FeatureWord w)
 {
 FeatureWordInfo *wi = _word_info[w];
 uint64_t r = 0;
@@ -6078,7 +6077,7 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 r &= ~unavail;
 }
 #endif
-if (migratable_only) {
+if (cpu && cpu->migratable) {
 r &= x86_cpu_get_migratable_flags(w);
 }
 return r;
@@ -7371,7 +7370,7 @@ void x86_cpu_expand_features(X86CPU *cpu, Error **errp)
  * by the user.
  */
 env->features[w] |=
-x86_cpu_get_supported_feature_word(w, cpu->migratable) &
+x86_cpu_get_supported_feature_word(cpu, w) &
 ~env->user_features[w] &
 ~feature_word_info[w].no_autoenable_flags;
 }
@@ -7497,7 +7496,7 @@ static void x86_cpu_filter_features(X86CPU *cpu, bool 
verbose)
 
 for (w = 0; w < FEATURE_WORDS; w++) {
 uint64_t host_feat =
-x86_cpu_get_supported_feature_word(w, false);
+x86_cpu_get_supported_feature_word(NULL, w);
 uint64_t requested_features = env->features[w];
 uint64_t unavailable_features = requested_features & ~host_feat;
 mark_unavailable_features(cpu, w, unavailable_features, prefix);
@@ -7617,7 +7616,7 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
 env->features[FEAT_PERF_CAPABILITIES] & PERF_CAP_LBR_FMT;
 if (requested_lbr_fmt && kvm_enabled()) {
 uint64_t host_perf_cap =
-x86_cpu_get_supported_feature_word(FEAT_PERF_CAPABILITIES, false);
+x86_cpu_get_supported_feature_word(NULL, FEAT_PERF_CAPABILITIES);
 unsigned host_lbr_fmt = host_perf_cap & PERF_CAP_LBR_FMT;
 
 if (!cpu->enable_pmu) {
diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
index d57a68a301e..6bf8dcfc607 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -143,7 +143,7 @@ static void kvm_cpu_xsave_init(void)
 if (!esa->size) {
 continue;
 }
-if ((x86_cpu_get_supported_feature_word(esa->feature, false) & 
esa->bits)
+if ((x86_cpu_get_supported_feature_word(NULL, esa->feature) & 
esa->bits)
 != esa->bits) {
 continue;
 }
-- 
2.45.2




[PULL 13/16] target/i386: add avx-vnni-int16 feature

2024-07-04 Thread Paolo Bonzini
AVX-VNNI-INT16 (CPUID[EAX=7,ECX=1].EDX[10]) is supported by Clearwater
Forest processor, add it to QEMU as it does not need any specific
enablement.

Signed-off-by: Paolo Bonzini 
---
 target/i386/cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index c40551d9bfb..c05765eeafc 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -1131,7 +1131,7 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 .feat_names = {
 NULL, NULL, NULL, NULL,
 "avx-vnni-int8", "avx-ne-convert", NULL, NULL,
-"amx-complex", NULL, NULL, NULL,
+"amx-complex", NULL, "avx-vnni-int16", NULL,
 NULL, NULL, "prefetchiti", NULL,
 NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
-- 
2.45.2




[PULL 14/16] char-stdio: Restore blocking mode of stdout on exit

2024-07-04 Thread Paolo Bonzini
From: Maxim Mikityanskiy 

qemu_chr_open_fd() sets stdout into non-blocking mode. Restore the old
fd flags on exit to avoid breaking unsuspecting applications that run on
the same terminal after qemu and don't expect to get EAGAIN.

While at it, also ensure term_exit is called once (at the moment it's
called both from char_stdio_finalize() and as the atexit() hook).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2423
Signed-off-by: Maxim Mikityanskiy 
Link: https://lore.kernel.org/r/20240703190812.3459514-1-maxtra...@gmail.com
Signed-off-by: Paolo Bonzini 
---
 chardev/char-stdio.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/chardev/char-stdio.c b/chardev/char-stdio.c
index 3c648678ab1..b960ddd4e4c 100644
--- a/chardev/char-stdio.c
+++ b/chardev/char-stdio.c
@@ -41,6 +41,7 @@
 /* init terminal so that we can grab keys */
 static struct termios oldtty;
 static int old_fd0_flags;
+static int old_fd1_flags;
 static bool stdio_in_use;
 static bool stdio_allow_signal;
 static bool stdio_echo_state;
@@ -50,6 +51,8 @@ static void term_exit(void)
 if (stdio_in_use) {
 tcsetattr(0, TCSANOW, );
 fcntl(0, F_SETFL, old_fd0_flags);
+fcntl(1, F_SETFL, old_fd1_flags);
+stdio_in_use = false;
 }
 }
 
@@ -102,6 +105,7 @@ static void qemu_chr_open_stdio(Chardev *chr,
 
 stdio_in_use = true;
 old_fd0_flags = fcntl(0, F_GETFL);
+old_fd1_flags = fcntl(1, F_GETFL);
 tcgetattr(0, );
 if (!g_unix_set_fd_nonblocking(0, true, NULL)) {
 error_setg_errno(errp, errno, "Failed to set FD nonblocking");
-- 
2.45.2




[PULL 08/16] target/i386: drop AMD machine check bits from Intel CPUID

2024-07-04 Thread Paolo Bonzini
The recent addition of the SUCCOR bit to kvm_arch_get_supported_cpuid()
causes the bit to be visible when "-cpu host" VMs are started on Intel
processors.

While this should in principle be harmless, it's not tidy and we don't
even know for sure that it doesn't cause any guest OS to take unexpected
paths.  Since x86_cpu_get_supported_feature_word() can return different
different values depending on the guest, adjust it to hide the SUCCOR
bit if the guest has non-AMD vendor.

Suggested-by: Xiaoyao Li 
Cc: John Allen 
Signed-off-by: Paolo Bonzini 
---
 target/i386/cpu.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 4364cb0f8e3..5e5bf71702c 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6039,6 +6039,7 @@ uint64_t x86_cpu_get_supported_feature_word(X86CPU *cpu, 
FeatureWord w)
 {
 FeatureWordInfo *wi = &feature_word_info[w];
 uint64_t r = 0;
+uint32_t unavail = 0;
 
 if (kvm_enabled()) {
 switch (wi->type) {
@@ -6064,19 +6065,33 @@ uint64_t x86_cpu_get_supported_feature_word(X86CPU 
*cpu, FeatureWord w)
 } else {
 return ~0;
 }
+
+switch (w) {
 #ifndef TARGET_X86_64
-if (w == FEAT_8000_0001_EDX) {
+case FEAT_8000_0001_EDX:
 /*
  * 32-bit TCG can emulate 64-bit compatibility mode.  If there is no
  * way for userspace to get out of its 32-bit jail, we can leave
  * the LM bit set.
  */
-uint32_t unavail = tcg_enabled()
+unavail = tcg_enabled()
 ? CPUID_EXT2_LM & ~CPUID_EXT2_KERNEL_FEATURES
 : CPUID_EXT2_LM;
-r &= ~unavail;
-}
+break;
 #endif
+
+case FEAT_8000_0007_EBX:
+if (cpu && !IS_AMD_CPU(&cpu->env)) {
+/* Disable AMD machine check architecture for Intel CPU.  */
+unavail = ~0;
+}
+break;
+
+default:
+break;
+}
+
+r &= ~unavail;
 if (cpu && cpu->migratable) {
 r &= x86_cpu_get_migratable_flags(w);
 }
-- 
2.45.2




[PULL 10/16] target/i386: do not include undefined bits in the AMD topoext leaf

2024-07-04 Thread Paolo Bonzini
Commit d7c72735f61 ("target/i386: Add new EPYC CPU versions with updated
cache_info", 2023-05-08) ensured that AMD-defined CPU models did not
have the 'complex_indexing' bit set, but left it set in "-cpu host"
which uses the default ("legacy") cache information.

Reimplement that commit using a CPU feature, so that it can be applied
to all guests using a new machine type, independent of the CPU model.

Signed-off-by: Paolo Bonzini 
---
 target/i386/cpu.h | 3 +++
 hw/i386/pc.c      | 1 +
 target/i386/cpu.c | 4 ++++
 3 files changed, 8 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 9bea7142bf4..0d5624355e4 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2104,6 +2104,9 @@ struct ArchCPU {
 /* Only advertise CPUID leaves defined by the vendor */
 bool vendor_cpuid_only;
 
+/* Only advertise TOPOEXT features that AMD defines */
+bool amd_topoext_features_only;
+
 /* Enable auto level-increase for Intel Processor Trace leave */
 bool intel_pt_auto_level;
 
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 77415064c62..5dff91422ff 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -80,6 +80,7 @@
 { "athlon-" TYPE_X86_CPU, "model-id", "QEMU Virtual CPU version " v, },
 
 GlobalProperty pc_compat_9_0[] = {
+{ TYPE_X86_CPU, "x-amd-topoext-features-only", "false" },
 { TYPE_X86_CPU, "x-l1-cache-per-thread", "false" },
 { TYPE_X86_CPU, "guest-phys-bits", "0" },
 { "sev-guest", "legacy-vm-type", "true" },
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 5e5bf71702c..c40551d9bfb 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6982,6 +6982,9 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
 *eax = *ebx = *ecx = *edx = 0;
 break;
 }
+if (cpu->amd_topoext_features_only) {
+*edx &= CACHE_NO_INVD_SHARING | CACHE_INCLUSIVE;
+}
 break;
case 0x8000001E:
 if (cpu->core_id <= 255) {
@@ -8293,6 +8296,7 @@ static Property x86_cpu_properties[] = {
 DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor),
 DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
 DEFINE_PROP_BOOL("x-vendor-cpuid-only", X86CPU, vendor_cpuid_only, true),
+DEFINE_PROP_BOOL("x-amd-topoext-features-only", X86CPU, 
amd_topoext_features_only, true),
 DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
 DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, true),
 DEFINE_PROP_BOOL("kvm-pv-enforce-cpuid", X86CPU, kvm_pv_enforce_cpuid,
-- 
2.45.2




[PULL 16/16] target/i386/SEV: implement mask_cpuid_features

2024-07-04 Thread Paolo Bonzini
Drop features that are listed as "BitMask" in the PPR and currently
not supported by AMD processors.  The only ones that may become useful
in the future are the TSC deadline timer and x2APIC; everything else is
either not needed for SEV-SNP guests (e.g. VIRT_SSBD) or would require
processor support (e.g. TSC_ADJUST).

This allows running SEV-SNP guests with "-cpu host".

Signed-off-by: Paolo Bonzini 
---
 target/i386/cpu.h |  4 ++++
 target/i386/sev.c | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 0d5624355e4..c43ac01c794 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -812,6 +812,8 @@ uint64_t x86_cpu_get_supported_feature_word(X86CPU *cpu, FeatureWord w);
 
 /* Support RDFSBASE/RDGSBASE/WRFSBASE/WRGSBASE */
 #define CPUID_7_0_EBX_FSGSBASE  (1U << 0)
+/* Support TSC adjust MSR */
+#define CPUID_7_0_EBX_TSC_ADJUST (1U << 1)
 /* Support SGX */
 #define CPUID_7_0_EBX_SGX   (1U << 2)
 /* 1st Group of Advanced Bit Manipulation Extensions */
@@ -1002,6 +1004,8 @@ uint64_t x86_cpu_get_supported_feature_word(X86CPU *cpu, FeatureWord w);
#define CPUID_8000_0008_EBX_STIBP_ALWAYS_ON (1U << 17)
/* Speculative Store Bypass Disable */
#define CPUID_8000_0008_EBX_AMD_SSBD (1U << 24)
+/* Paravirtualized Speculative Store Bypass Disable MSR */
+#define CPUID_8000_0008_EBX_VIRT_SSBD   (1U << 25)
/* Predictive Store Forwarding Disable */
#define CPUID_8000_0008_EBX_AMD_PSFD (1U << 28)
 
diff --git a/target/i386/sev.c b/target/i386/sev.c
index 2f3dbe289f4..2ba5f517228 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -945,6 +945,38 @@ out:
 return ret;
 }
 
+static uint32_t
+sev_snp_mask_cpuid_features(X86ConfidentialGuest *cg, uint32_t feature, uint32_t index,
+int reg, uint32_t value)
+{
+switch (feature) {
+case 1:
+if (reg == R_ECX) {
+return value & ~CPUID_EXT_TSC_DEADLINE_TIMER;
+}
+break;
+case 7:
+if (index == 0 && reg == R_EBX) {
+return value & ~CPUID_7_0_EBX_TSC_ADJUST;
+}
+if (index == 0 && reg == R_EDX) {
+return value & ~(CPUID_7_0_EDX_SPEC_CTRL |
+ CPUID_7_0_EDX_STIBP |
+ CPUID_7_0_EDX_FLUSH_L1D |
+ CPUID_7_0_EDX_ARCH_CAPABILITIES |
+ CPUID_7_0_EDX_CORE_CAPABILITY |
+ CPUID_7_0_EDX_SPEC_CTRL_SSBD);
+}
+break;
+case 0x80000008:
+if (reg == R_EBX) {
+return value & ~CPUID_8000_0008_EBX_VIRT_SSBD;
+}
+break;
+}
+return value;
+}
+
 static int
 sev_launch_update_data(SevCommonState *sev_common, hwaddr gpa,
uint8_t *addr, size_t len)
@@ -2315,6 +2347,7 @@ sev_snp_guest_class_init(ObjectClass *oc, void *data)
 klass->launch_finish = sev_snp_launch_finish;
 klass->launch_update_data = sev_snp_launch_update_data;
 klass->kvm_init = sev_snp_kvm_init;
+x86_klass->mask_cpuid_features = sev_snp_mask_cpuid_features;
 x86_klass->kvm_type = sev_snp_kvm_type;
 
 object_class_property_add(oc, "policy", "uint64",
-- 
2.45.2




[PULL 04/16] meson: Pass objects and dependencies to declare_dependency()

2024-07-04 Thread Paolo Bonzini
From: Akihiko Odaki 

We used to ask declare_dependency() to link_whole static libraries.
If a static library is a thin archive, GNU ld keeps all object files
referenced by the archive open, and sometimes exceeds the open file limit.

Another problem with link_whole is its suboptimal handling of nested
dependencies.

link_whole by itself does not propagate dependencies. In particular,
gnutls, a dependency of crypto, is not propagated to its users, and we
currently workaround the issue by declaring gnutls as a dependency for
each crypto user.  On the other hand, if you write something like

  libfoo = static_library('foo', 'foo.c', dependencies: gnutls)
  foo = declare_dependency(link_whole: libfoo)

  libbar = static_library('bar', 'bar.c', dependencies: foo)
  bar = declare_dependency(link_whole: libbar, dependencies: foo)
  executable('prog', sources: files('prog.c'), dependencies: [foo, bar])

hoping to propagate the gnutls dependency into bar.c, you'll see a
linking failure for "prog", because the foo.c.o object file is included in
libbar.a and therefore it is linked twice into "prog": once from libfoo.a
and once from libbar.a.  Here Meson does not see the duplication, it
just asks the linker to link all of libfoo.a and libbar.a into "prog".

Instead of using link_whole, extract the objects included in static
libraries and pass them to declare_dependency(); the dependencies can
then be added as well so that they are propagated, because object files
on the linker command line are always deduplicated.

This requires Meson 1.1.0 or later.
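
To make this concrete, the libfoo/libbar example above would be written
roughly as follows after this change (a sketch that reuses the made-up
names from the example, not actual QEMU build rules):

  gnutls = dependency('gnutls')

  libfoo = static_library('foo', 'foo.c', dependencies: gnutls)
  # Propagate both foo.c.o and the gnutls dependency to users of "foo".
  foo = declare_dependency(objects: libfoo.extract_all_objects(recursive: false),
                           dependencies: gnutls)

  libbar = static_library('bar', 'bar.c', dependencies: foo)
  bar = declare_dependency(objects: libbar.extract_all_objects(recursive: false),
                           dependencies: foo)

  # foo.c.o is now passed as an object file instead of being hidden
  # inside libbar.a, so it appears only once on the link line of "prog".
  executable('prog', sources: files('prog.c'), dependencies: [foo, bar])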

Signed-off-by: Akihiko Odaki 
Message-ID: <20240524-objects-v1-1-07cbbe961...@daynix.com>
Signed-off-by: Paolo Bonzini 
---
 docs/devel/build-system.rst    |  3 ++-
 meson.build                    | 44 +++---
 gdbstub/meson.build            |  4 ++--
 pythondeps.toml                |  2 +-
 tcg/meson.build                |  6 +++--
 tests/qtest/libqos/meson.build |  2 +-
 6 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/docs/devel/build-system.rst b/docs/devel/build-system.rst
index f4fd76117df..39a1934c63f 100644
--- a/docs/devel/build-system.rst
+++ b/docs/devel/build-system.rst
@@ -238,7 +238,8 @@ Subsystem sourcesets:
 name_suffix: 'fa',
 build_by_default: false)
 
-chardev = declare_dependency(link_whole: libchardev)
+chardev = declare_dependency(objects: libchardev.extract_all_objects(recursive: false),
+ dependencies: chardev_ss.dependencies())
 
   As of Meson 0.55.1, the special ``.fa`` suffix should be used for everything
   that is used with ``link_whole``, to ensure that the link flags are placed
diff --git a/meson.build b/meson.build
index df9a64302f0..0c314ae5701 100644
--- a/meson.build
+++ b/meson.build
@@ -1,4 +1,4 @@
-project('qemu', ['c'], meson_version: '>=0.63.0',
+project('qemu', ['c'], meson_version: '>=1.1.0',
 default_options: ['warning_level=1', 'c_std=gnu11', 'cpp_std=gnu++11', 'b_colorout=auto',
   'b_staticpic=false', 'stdsplit=false', 'optimization=2', 'b_pie=true'],
 version: files('VERSION'))
@@ -3461,7 +3461,7 @@ endif
 
 if enable_modules
  libmodulecommon = static_library('module-common', files('module-common.c') + genh, pic: true, c_args: '-DBUILD_DSO')
-  modulecommon = declare_dependency(link_whole: libmodulecommon, compile_args: '-DBUILD_DSO')
+  modulecommon = declare_dependency(objects: libmodulecommon.extract_all_objects(recursive: false), compile_args: '-DBUILD_DSO')
 endif
 
 qom_ss = qom_ss.apply({})
@@ -3469,14 +3469,15 @@ libqom = static_library('qom', qom_ss.sources() + genh,
 dependencies: [qom_ss.dependencies()],
 name_suffix: 'fa',
 build_by_default: false)
-qom = declare_dependency(link_whole: libqom)
+qom = declare_dependency(objects: libqom.extract_all_objects(recursive: false),
+ dependencies: qom_ss.dependencies())
 
 event_loop_base = files('event-loop-base.c')
 event_loop_base = static_library('event-loop-base',
  sources: event_loop_base + genh,
  name_suffix: 'fa',
  build_by_default: false)
-event_loop_base = declare_dependency(link_whole: event_loop_base,
+event_loop_base = declare_dependency(objects: event_loop_base.extract_all_objects(recursive: false),
  dependencies: [qom])
 
 stub_ss = stub_ss.apply({})
@@ -3621,7 +3622,8 @@ foreach d, list : modules
   endif
   emulator_modules += shared_module(sl.name(),
 name_prefix: '',
-link_whole: sl,
+objects: sl.extract_all_objects(recursive: false),
+dependencies: module_ss.dependencies(),

[PULL 02/16] meson: move block.syms dependency out of libblock

2024-07-04 Thread Paolo Bonzini
In order to define libqemuutil symbols that are requested by block modules,
QEMU currently uses a combination of the "link_depends" argument of
libraries (which is propagated into dependencies, but not available in
dependencies) and the "link_args" argument of declare_dependency()
(which _is_ available in static_library, and probably goes unused there
only for historical reasons).

Unfortunately the link_depends will not be propagated into the
"block" dependency if it is defined using
declare_dependency(objects: ...); and it is not possible to
add it directly to the dependency because the keyword argument
simply is not available.

The only solution, in order to switch to defining the dependency
without using "link_whole" (which has problems of its own, see
https://github.com/mesonbuild/meson/pull/8151#issuecomment-754796420),
is unfortunately to add the link_args and link_depends to the
executables directly; fortunately there are just four of them.

It is possible (and I will look into it) to add "link_depends"
to declare_dependency(), but it probably will be a while before
QEMU can use it.
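
In the meantime, the resulting pattern looks like the following sketch
("qemu-foo" is a made-up name; block_syms is the existing custom target
that generates the symbol file):

  # '@block.syms' is a linker response file: the link command reads
  # extra arguments from the generated file, and link_depends makes
  # sure the file is (re)built before the executable is linked.
  qemu_foo = executable('qemu-foo', files('qemu-foo.c'),
                        link_args: '@block.syms', link_depends: block_syms,
                        dependencies: [block, qemuutil], install: true)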

Signed-off-by: Paolo Bonzini 
---
 meson.build                | 5 +++--
 storage-daemon/meson.build | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/meson.build b/meson.build
index 8909f8c87d9..df9a64302f0 100644
--- a/meson.build
+++ b/meson.build
@@ -3759,12 +3759,10 @@ system_ss.add(migration)
 block_ss = block_ss.apply({})
 libblock = static_library('block', block_ss.sources() + genh,
   dependencies: block_ss.dependencies(),
-  link_depends: block_syms,
   name_suffix: 'fa',
   build_by_default: false)
 
 block = declare_dependency(link_whole: [libblock],
-   link_args: '@block.syms',
dependencies: [crypto, io])
 
 blockdev_ss = blockdev_ss.apply({})
@@ -4033,10 +4031,13 @@ endif
 
 if have_tools
   qemu_img = executable('qemu-img', [files('qemu-img.c'), hxdep],
+ link_args: '@block.syms', link_depends: block_syms,
 dependencies: [authz, block, crypto, io, qom, qemuutil], install: true)
   qemu_io = executable('qemu-io', files('qemu-io.c'),
+ link_args: '@block.syms', link_depends: block_syms,
  dependencies: [block, qemuutil], install: true)
   qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'),
+   link_args: '@block.syms', link_depends: block_syms,
dependencies: [blockdev, qemuutil, gnutls, selinux],
install: true)
 
diff --git a/storage-daemon/meson.build b/storage-daemon/meson.build
index 46267b63e72..fd5e32f4b28 100644
--- a/storage-daemon/meson.build
+++ b/storage-daemon/meson.build
@@ -8,6 +8,7 @@ if have_tools
   qsd_ss = qsd_ss.apply({})
   qsd = executable('qemu-storage-daemon',
qsd_ss.sources(),
+   link_args: '@block.syms', link_depends: block_syms,
dependencies: qsd_ss.dependencies(),
install: true)
 endif
-- 
2.45.2



