Re: [PATCH v3] target/i386: Change wrong XFRM value

2023-04-09 Thread Yang Zhong
On Sun, Apr 09, 2023 at 04:40:50PM +0300, Michael Tokarev wrote:
> 06.04.2023 09:40, Yang Zhong wrote:
> > The previous patch wrongly replaced FEAT_XSAVE_XCR0_{LO|HI} with
> > FEAT_XSAVE_XSS_{LO|HI} in CPUID(EAX=12,ECX=1):{ECX,EDX}, which made
> > the SGX enclave support only the SSE and x87 features (xfrm=0x3).
> > 
> > Fixes: 301e90675c3f ("target/i386: Enable support for XSAVES based features")
> 
> This seems to be -stable material, no?
>
  
  I checked QEMU stable-7.2; the 301e90675c3f patch was included in that
release, so this fix needs to be merged into the stable release as well.
Thanks!
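
  For reference, the usual way to route a fix like this to -stable is a
"Cc: qemu-stable@nongnu.org" trailer in the commit message, so that the
stable maintainers pick it up automatically.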

  Regards,
  Yang

> /mjt



[PATCH 0/2] target/riscv: Separate implicitly-enabled and explicitly-enabled extensions

2023-04-09 Thread Weiwei Li
This series separates the multi-letter extensions that may be implicitly
enabled by misa.EXT from the explicitly-enabled cases, so that misa.EXT can
truly be disabled by write_misa().
With this separation, the implicitly-enabled zve64d/f and zve32f extensions
will stop working if we clear misa.V, while clearing misa.V will have no
effect on the explicitly-enabled zve64d/f and zve32f extensions. A sketch of
the resulting semantics follows.
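
Editorial sketch of the two flags per extension (see patch 2/2 for the real
code; zve32f is just one example):

    /* set only explicitly, e.g. via -cpu ...,zve32f=on */
    cpu->cfg.ext_zve32f
    /* effective state: the explicit flag OR-ed with what misa.V
     * implies; translation-time checks use this one, and it is
     * recomputed whenever write_misa() changes misa.EXT */
    cpu->cfg.ext_zve32f_enabled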

Weiwei Li (2):
  target/riscv: Add set_implicit_extensions_from_ext() function
  target/riscv: Add ext_z*_enabled for implicitly enabled extensions

 target/riscv/cpu.c  | 73 +++--
 target/riscv/cpu.h  |  8 +++
 target/riscv/cpu_helper.c   |  2 +-
 target/riscv/csr.c  |  2 +-
 target/riscv/insn_trans/trans_rvd.c.inc |  2 +-
 target/riscv/insn_trans/trans_rvf.c.inc |  2 +-
 target/riscv/insn_trans/trans_rvi.c.inc |  5 +-
 target/riscv/insn_trans/trans_rvv.c.inc | 16 +++---
 target/riscv/translate.c|  4 +-
 9 files changed, 68 insertions(+), 46 deletions(-)

-- 
2.25.1




[PATCH 2/2] target/riscv: Add ext_z*_enabled for implicitly enabled extensions

2023-04-09 Thread Weiwei Li
For extensions that may be implicitly enabled from misa.EXT:
- ext_* is true only if the extension is explicitly enabled.
- ext_*_enabled is true whether the extension is implicitly or explicitly
enabled, and the checks on the extension are changed to use it.

Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
---
 target/riscv/cpu.c  | 29 +++--
 target/riscv/cpu.h  |  8 +++
 target/riscv/cpu_helper.c   |  2 +-
 target/riscv/csr.c  |  2 +-
 target/riscv/insn_trans/trans_rvd.c.inc |  2 +-
 target/riscv/insn_trans/trans_rvf.c.inc |  2 +-
 target/riscv/insn_trans/trans_rvi.c.inc |  5 +++--
 target/riscv/insn_trans/trans_rvv.c.inc | 16 +++---
 target/riscv/translate.c|  4 ++--
 9 files changed, 43 insertions(+), 27 deletions(-)

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index abb65d41b1..815dd19f36 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -811,29 +811,35 @@ static void riscv_cpu_disas_set_info(CPUState *s, disassemble_info *info)
 
 static void set_implicit_extensions_from_ext(RISCVCPU *cpu)
 {
+cpu->cfg.ext_zve64d_enabled = cpu->cfg.ext_zve64d;
+cpu->cfg.ext_zve64f_enabled = cpu->cfg.ext_zve64f;
+cpu->cfg.ext_zve32f_enabled = cpu->cfg.ext_zve32f;
+cpu->cfg.ext_zca_enabled = cpu->cfg.ext_zca;
+cpu->cfg.ext_zcf_enabled = cpu->cfg.ext_zcf;
+cpu->cfg.ext_zcd_enabled = cpu->cfg.ext_zcd;
 
 /* The V vector extension depends on the Zve64d extension */
 if (cpu->cfg.ext_v) {
-cpu->cfg.ext_zve64d = true;
+cpu->cfg.ext_zve64d_enabled = true;
 }
 
 /* The Zve64d extension depends on the Zve64f extension */
 if (cpu->cfg.ext_zve64d) {
-cpu->cfg.ext_zve64f = true;
+cpu->cfg.ext_zve64f_enabled = true;
 }
 
 /* The Zve64f extension depends on the Zve32f extension */
 if (cpu->cfg.ext_zve64f) {
-cpu->cfg.ext_zve32f = true;
+cpu->cfg.ext_zve32f_enabled = true;
 }
 
 if (cpu->cfg.ext_c) {
-cpu->cfg.ext_zca = true;
+cpu->cfg.ext_zca_enabled = true;
 if (cpu->cfg.ext_f && cpu->env.misa_mxl_max == MXL_RV32) {
-cpu->cfg.ext_zcf = true;
+cpu->cfg.ext_zcf_enabled = true;
 }
 if (cpu->cfg.ext_d) {
-cpu->cfg.ext_zcd = true;
+cpu->cfg.ext_zcd_enabled = true;
 }
 }
 }
@@ -917,12 +923,12 @@ static void riscv_cpu_validate_set_extensions(RISCVCPU *cpu, Error **errp)
 return;
 }
 
-if (cpu->cfg.ext_zve64d && !cpu->cfg.ext_d) {
+if (cpu->cfg.ext_zve64d_enabled && !cpu->cfg.ext_d) {
 error_setg(errp, "Zve64d/V extensions require D extension");
 return;
 }
 
-if (cpu->cfg.ext_zve32f && !cpu->cfg.ext_f) {
+if (cpu->cfg.ext_zve32f_enabled && !cpu->cfg.ext_f) {
 error_setg(errp, "Zve32f/Zve64f extensions require F extension");
 return;
 }
@@ -931,7 +937,7 @@ static void riscv_cpu_validate_set_extensions(RISCVCPU *cpu, Error **errp)
 cpu->cfg.ext_zvfhmin = true;
 }
 
-if (cpu->cfg.ext_zvfhmin && !cpu->cfg.ext_zve32f) {
+if (cpu->cfg.ext_zvfhmin && !cpu->cfg.ext_zve32f_enabled) {
 error_setg(errp, "Zvfh/Zvfhmin extensions require Zve32f extension");
 return;
 }
@@ -988,13 +994,14 @@ static void riscv_cpu_validate_set_extensions(RISCVCPU *cpu, Error **errp)
 }
 
 if ((cpu->cfg.ext_zcf || cpu->cfg.ext_zcd || cpu->cfg.ext_zcb ||
- cpu->cfg.ext_zcmp || cpu->cfg.ext_zcmt) && !cpu->cfg.ext_zca) {
+ cpu->cfg.ext_zcmp || cpu->cfg.ext_zcmt) &&
+!cpu->cfg.ext_zca_enabled) {
 error_setg(errp, "Zcf/Zcd/Zcb/Zcmp/Zcmt extensions require Zca "
  "extension");
 return;
 }
 
-if (cpu->cfg.ext_zcd && (cpu->cfg.ext_zcmp || cpu->cfg.ext_zcmt)) {
+if (cpu->cfg.ext_zcd_enabled && (cpu->cfg.ext_zcmp || cpu->cfg.ext_zcmt)) {
 error_setg(errp, "Zcmp/Zcmt extensions are incompatible with "
  "Zcd extension");
 return;
diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index 4af8ebc558..c59f099dd3 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -523,6 +523,14 @@ struct RISCVCPUConfig {
 bool debug;
 bool misa_w;
 
+/* Extensions that may be implicitly enabled by misa.EXT */
+bool ext_zca_enabled;
+bool ext_zcf_enabled;
+bool ext_zcd_enabled;
+bool ext_zve32f_enabled;
+bool ext_zve64f_enabled;
+bool ext_zve64d_enabled;
+
 bool short_isa_string;
 
 #ifndef CONFIG_USER_ONLY
diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
index 433ea529b0..c9638ae905 100644
--- a/target/riscv/cpu_helper.c
+++ b/target/riscv/cpu_helper.c
@@ -51,7 +51,7 @@ void cpu_get_tb_cpu_state(CPURISCVState *env, target_ulong *pc,
 *pc = env->xl == MXL_RV32 ? env->pc & UINT32_MAX : env->pc;
 *cs_base = 0;
 
-if (cpu->cfg.ext_zve32f) {
+if 

[PATCH 1/2] target/riscv: Add set_implicit_extensions_from_ext() function

2023-04-09 Thread Weiwei Li
Move the multi-letter extensions that may be implicitly enabled from
misa.EXT into their own function, to prepare for the following separation
of implicitly-enabled and explicitly-enabled extensions.

Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
---
 target/riscv/cpu.c | 56 +-
 1 file changed, 31 insertions(+), 25 deletions(-)

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index cb68916fce..abb65d41b1 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -809,6 +809,35 @@ static void riscv_cpu_disas_set_info(CPUState *s, disassemble_info *info)
 }
 }
 
+static void set_implicit_extensions_from_ext(RISCVCPU *cpu)
+{
+
+/* The V vector extension depends on the Zve64d extension */
+if (cpu->cfg.ext_v) {
+cpu->cfg.ext_zve64d = true;
+}
+
+/* The Zve64d extension depends on the Zve64f extension */
+if (cpu->cfg.ext_zve64d) {
+cpu->cfg.ext_zve64f = true;
+}
+
+/* The Zve64f extension depends on the Zve32f extension */
+if (cpu->cfg.ext_zve64f) {
+cpu->cfg.ext_zve32f = true;
+}
+
+if (cpu->cfg.ext_c) {
+cpu->cfg.ext_zca = true;
+if (cpu->cfg.ext_f && cpu->env.misa_mxl_max == MXL_RV32) {
+cpu->cfg.ext_zcf = true;
+}
+if (cpu->cfg.ext_d) {
+cpu->cfg.ext_zcd = true;
+}
+}
+}
+
 /*
  * Check consistency between chosen extensions while setting
  * cpu->cfg accordingly, doing a set_misa() in the end.
@@ -833,6 +862,8 @@ static void riscv_cpu_validate_set_extensions(RISCVCPU *cpu, Error **errp)
 cpu->cfg.ext_ifencei = true;
 }
 
+set_implicit_extensions_from_ext(cpu);
+
 if (cpu->cfg.ext_i && cpu->cfg.ext_e) {
 error_setg(errp,
"I and E extensions are incompatible");
@@ -886,21 +917,6 @@ static void riscv_cpu_validate_set_extensions(RISCVCPU *cpu, Error **errp)
 return;
 }
 
-/* The V vector extension depends on the Zve64d extension */
-if (cpu->cfg.ext_v) {
-cpu->cfg.ext_zve64d = true;
-}
-
-/* The Zve64d extension depends on the Zve64f extension */
-if (cpu->cfg.ext_zve64d) {
-cpu->cfg.ext_zve64f = true;
-}
-
-/* The Zve64f extension depends on the Zve32f extension */
-if (cpu->cfg.ext_zve64f) {
-cpu->cfg.ext_zve32f = true;
-}
-
 if (cpu->cfg.ext_zve64d && !cpu->cfg.ext_d) {
 error_setg(errp, "Zve64d/V extensions require D extension");
 return;
@@ -956,16 +972,6 @@ static void riscv_cpu_validate_set_extensions(RISCVCPU *cpu, Error **errp)
 }
 }
 
-if (cpu->cfg.ext_c) {
-cpu->cfg.ext_zca = true;
-if (cpu->cfg.ext_f && env->misa_mxl_max == MXL_RV32) {
-cpu->cfg.ext_zcf = true;
-}
-if (cpu->cfg.ext_d) {
-cpu->cfg.ext_zcd = true;
-}
-}
-
 if (env->misa_mxl_max != MXL_RV32 && cpu->cfg.ext_zcf) {
 error_setg(errp, "Zcf extension is only relevant to RV32");
 return;
-- 
2.25.1




[PATCH for 8.1] intel_iommu: refine iotlb hash calculation

2023-04-09 Thread Jason Wang
Commit 1b2b12376c8 ("intel-iommu: PASID support") takes PASID into
account when calculating iotlb hash like:

static guint vtd_iotlb_hash(gconstpointer v)
{
const struct vtd_iotlb_key *key = v;

return key->gfn | ((key->sid) << VTD_IOTLB_SID_SHIFT) |
   (key->level) << VTD_IOTLB_LVL_SHIFT |
   (key->pasid) << VTD_IOTLB_PASID_SHIFT;
}

This turns out to be problematic since:

- the shifts lose bits, since the operands are not converted to uint64_t
  (see the sketch after this list)
- level should be offset by one in order to fit into 2 bits
- VTD_IOTLB_PASID_SHIFT is 30, but PASID is 20 bits, which wastes
  some bits
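
An editorial illustration of the first point (a standalone sketch, not
part of the patch; the sid value is made up):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint16_t sid = 0x8010;               /* bits 4 and 15 set */
    /* sid promotes to (signed) int, so the 32-bit shift drops
     * everything above bit 31 (strictly, signed overflow is UB). */
    uint64_t bad  = sid << 20;
    uint64_t good = (uint64_t)sid << 20; /* keeps all 16 bits */
    /* prints bad=0x0000000001000000 good=0x0000000801000000 */
    printf("bad=0x%016" PRIx64 " good=0x%016" PRIx64 "\n", bad, good);
    return 0;
}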

So this patch fixes them by

- converting the keys into uint64_t before doing the shifts
- offsetting level by one to make it fit into two bits
- changing the sid, lvl and pasid shifts to 26, 42 and 44 in order to
  use the full width of uint64_t where possible

Fixes: Coverity CID 1508100
Fixes: 1b2b12376c8 ("intel-iommu: PASID support")
Signed-off-by: Jason Wang 
---
 hw/i386/intel_iommu.c  | 8 
 hw/i386/intel_iommu_internal.h | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a62896759c..d394976e9b 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -64,8 +64,8 @@ struct vtd_as_key {
 struct vtd_iotlb_key {
 uint64_t gfn;
 uint32_t pasid;
-uint32_t level;
 uint16_t sid;
+uint8_t level;
 };
 
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
@@ -222,9 +222,9 @@ static guint vtd_iotlb_hash(gconstpointer v)
 {
 const struct vtd_iotlb_key *key = v;
 
-return key->gfn | ((key->sid) << VTD_IOTLB_SID_SHIFT) |
-   (key->level) << VTD_IOTLB_LVL_SHIFT |
-   (key->pasid) << VTD_IOTLB_PASID_SHIFT;
+return key->gfn | ((uint64_t)(key->sid) << VTD_IOTLB_SID_SHIFT) |
+(uint64_t)(key->level - 1) << VTD_IOTLB_LVL_SHIFT |
+(uint64_t)(key->pasid) << VTD_IOTLB_PASID_SHIFT;
 }
 
 static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index f090e61e11..2e61eec2f5 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -114,9 +114,9 @@
  VTD_INTERRUPT_ADDR_FIRST + 1)
 
 /* The shift of source_id in the key of IOTLB hash table */
-#define VTD_IOTLB_SID_SHIFT 20
-#define VTD_IOTLB_LVL_SHIFT 28
-#define VTD_IOTLB_PASID_SHIFT   30
+#define VTD_IOTLB_SID_SHIFT 26
+#define VTD_IOTLB_LVL_SHIFT 42
+#define VTD_IOTLB_PASID_SHIFT   44
 #define VTD_IOTLB_MAX_SIZE  1024/* Max size of the hash table */
 
 /* IOTLB_REG */
-- 
2.25.1




Re: Reducing vdpa migration downtime because of memory pin / maps

2023-04-09 Thread Jason Wang
On Mon, Apr 10, 2023 at 11:17 AM Longpeng (Mike, Cloud Infrastructure
Service Product Dept.)  wrote:
>
>
>
> > On 2023/4/10 10:14, Jason Wang wrote:
> > On Wed, Apr 5, 2023 at 7:38 PM Eugenio Perez Martin  
> > wrote:
> >>
> >> Hi!
> >>
> >> As mentioned in the last upstream virtio-networking meeting, one of
> >> the factors that adds more downtime to migration is the handling of
> >> the guest memory (pin, map, etc). At this moment this handling is
> >> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> >> destination device waits until all the guest memory / state is
> >> migrated to start pinning all the memory.
> >>
> >> The proposal is to bind it to the char device life cycle (open vs
> >> close), so all the guest memory can be pinned for all the guest / qemu
> >> lifecycle.
> >>
> >> This has two main problems:
> >> * At this moment the reset semantics forces the vdpa device to unmap
> >> all the memory. So this change needs a vhost vdpa feature flag.
> >
> > Is this true? I didn't find any codes to unmap the memory in
> > vhost_vdpa_set_status().
> >
>
> It could depend on the vendor driver, for example, the vdpasim would do
> something like that.
>
> vhost_vdpa_set_status->vdpa_reset->vdpasim_reset->vdpasim_do_reset->vhost_iotlb_reset

This looks like a bug. But I wonder if any user space depends on this
behaviour; if yes, we really need a new flag then.

Thanks

>
> > Thanks
> >
> >> * This may increase the initialization time. Maybe we can delay it if
> >> qemu is not the destination of a LM. Anyway I think this should be
> >> done as an optimization on top.
> >>
> >> Any ideas or comments in this regard?
> >>
> >> Thanks!
> >>
> >
> > .
>




Re: Reducing vdpa migration downtime because of memory pin / maps

2023-04-09 Thread longpeng2--- via




On 2023/4/10 10:14, Jason Wang wrote:

On Wed, Apr 5, 2023 at 7:38 PM Eugenio Perez Martin  wrote:


Hi!

As mentioned in the last upstream virtio-networking meeting, one of
the factors that adds more downtime to migration is the handling of
the guest memory (pin, map, etc). At this moment this handling is
bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
destination device waits until all the guest memory / state is
migrated to start pinning all the memory.

The proposal is to bind it to the char device life cycle (open vs
close), so all the guest memory can be pinned for all the guest / qemu
lifecycle.

This has two main problems:
* At this moment the reset semantics forces the vdpa device to unmap
all the memory. So this change needs a vhost vdpa feature flag.


Is this true? I didn't find any codes to unmap the memory in
vhost_vdpa_set_status().



It could depend on the vendor driver, for example, the vdpasim would do 
something like that.


vhost_vdpa_set_status->vdpa_reset->vdpasim_reset->vdpasim_do_reset->vhost_iotlb_reset


Thanks


* This may increase the initialization time. Maybe we can delay it if
qemu is not the destination of a LM. Anyway I think this should be
done as an optimization on top.

Any ideas or comments in this regard?

Thanks!



.




Re: [PULL v4 74/83] intel-iommu: PASID support

2023-04-09 Thread Jason Wang
On Fri, Apr 7, 2023 at 12:22 AM Peter Maydell  wrote:
>
> On Mon, 7 Nov 2022 at 22:53, Michael S. Tsirkin  wrote:
> >
> > From: Jason Wang 
> >
> > This patch introduce ECAP_PASID via "x-pasid-mode".
>
> Hi; Coverity points out an issue with this code (CID 1508100):
>
> > -static guint vtd_uint64_hash(gconstpointer v)
> > +static guint vtd_iotlb_hash(gconstpointer v)
> >  {
> > -return (guint)*(const uint64_t *)v;
> > +const struct vtd_iotlb_key *key = v;
> > +
> > +return key->gfn | ((key->sid) << VTD_IOTLB_SID_SHIFT) |
>
> key->sid is a uint16_t, and VTD_IOTLB_SID_SHIFT is 20. That
> means that the shift will be done as a signed 32 bit operation,
> losing the top 4 bits of key->sid; then it will get sign
> extended to 64 bits, so if bit 11 of key->sid is 1 then
> we will end up with 1 bits in 63..32 of the output hash value.
> This seems unlikely to be what was intended.

Right.

>
> > +   (key->level) << VTD_IOTLB_LVL_SHIFT |
> > +   (key->pasid) << VTD_IOTLB_PASID_SHIFT;
> >  }
>
> Also, VTD_IOTLB_LVL_SHIFT is only 28, so either the
> shift values are wrong or the type of key->sid is wrong:
> can there be 8 bits here, or 16 ?

It should be 16.

>
> Since PASID_SHIFT is 30, if key->pasid can be more than
> 2 bits wide we'll lose most of it.
>
> If key->level will fit into 2 bits as the SHIFT values
> suggest, vtd_iotlb_key could probably use a uint8_t for it,

Right.

> which would let that struct fit into 16 bytes rather than 18.

OK, so to summarize:

1) key->gfn could be a full 64 bits
2) key->sid is at most 16 bits
3) key->level is at most 2 bits
4) key->pasid is at most 20 bits

So we will lose some bits for sure.

Since the current VTD_IOTLB_PASID_SHIFT is 30, we would waste 14 bits
there, so I tend to change (resulting layout sketched below)

VTD_IOTLB_SID_SHIFT to 26 (42-16)
VTD_IOTLB_LVL_SHIFT to 42 (44-2)
VTD_IOTLB_PASID_SHIFT to 44 (64-20)
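
The resulting key layout would then be (editorial sketch derived from the
shifts above; the high gfn bits still overlap the other fields, hence
"where possible"):

    bits  0..25 : gfn (OR'ed in unshifted)
    bits 26..41 : sid (16 bits)
    bits 42..43 : level - 1 (2 bits)
    bits 44..63 : pasid (20 bits)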

>
> > @@ -302,13 +321,6 @@ static void vtd_reset_caches(IntelIOMMUState *s)
> >  vtd_iommu_unlock(s);
> >  }
> >
> > -static uint64_t vtd_get_iotlb_key(uint64_t gfn, uint16_t source_id,
> > -  uint32_t level)
> > -{
> > -return gfn | ((uint64_t)(source_id) << VTD_IOTLB_SID_SHIFT) |
> > -   ((uint64_t)(level) << VTD_IOTLB_LVL_SHIFT);
> > -}
>
> In the old code you can see that we did casts to uint64_t in order
> to ensure that all the arithmetic was done as unsigned 64 bits.

Right, I will post a fix.

Thanks

>
> thanks
> -- PMM
>




Re: Reducing vdpa migration downtime because of memory pin / maps

2023-04-09 Thread Jason Wang
On Wed, Apr 5, 2023 at 7:38 PM Eugenio Perez Martin  wrote:
>
> Hi!
>
> As mentioned in the last upstream virtio-networking meeting, one of
> the factors that adds more downtime to migration is the handling of
> the guest memory (pin, map, etc). At this moment this handling is
> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> destination device waits until all the guest memory / state is
> migrated to start pinning all the memory.
>
> The proposal is to bind it to the char device life cycle (open vs
> close), so all the guest memory can be pinned for all the guest / qemu
> lifecycle.
>
> This has two main problems:
> * At this moment the reset semantics forces the vdpa device to unmap
> all the memory. So this change needs a vhost vdpa feature flag.

Is this true? I didn't find any codes to unmap the memory in
vhost_vdpa_set_status().

Thanks

> * This may increase the initialization time. Maybe we can delay it if
> qemu is not the destination of a LM. Anyway I think this should be
> done as an optimization on top.
>
> Any ideas or comments in this regard?
>
> Thanks!
>




[PULL 1/1] target/ppc: Fix temp usage in gen_op_arith_modw

2023-04-09 Thread Cédric Le Goater
From: Richard Henderson 

Fix a crash writing to 't3', which is now a constant.
Instead, write the result of the remu to 't0'.
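
A minimal sketch of the failure mode (editorial, not part of the patch):

    /* tcg_constant_i32() returns a read-only, interned temporary;
     * it must never be used as the output of an op. */
    TCGv_i32 t3 = tcg_constant_i32(0);
    tcg_gen_remu_i32(t3, t0, t1);   /* wrong: writes to a constant */
    tcg_gen_remu_i32(t0, t0, t1);   /* fix: write a normal temp */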

Fixes: 7058ff5231a ("target/ppc: Avoid tcg_const_* in translate.c")
Reported-by: Nicholas Piggin 
Reviewed-by: Anton Johansson 
Signed-off-by: Richard Henderson 
[ clg: amend commit log s/t1/t0/ ]
Signed-off-by: Cédric Le Goater 
---
 target/ppc/translate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index 9d05357d03..f603f1a939 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -1807,8 +1807,8 @@ static inline void gen_op_arith_modw(DisasContext *ctx, TCGv ret, TCGv arg1,
 TCGv_i32 t2 = tcg_constant_i32(1);
 TCGv_i32 t3 = tcg_constant_i32(0);
 tcg_gen_movcond_i32(TCG_COND_EQ, t1, t1, t3, t2, t1);
-tcg_gen_remu_i32(t3, t0, t1);
-tcg_gen_extu_i32_tl(ret, t3);
+tcg_gen_remu_i32(t0, t0, t1);
+tcg_gen_extu_i32_tl(ret, t0);
 }
 }
 
-- 
2.39.2




[PULL 0/1] ppc queue

2023-04-09 Thread Cédric Le Goater
The following changes since commit c6f3cbca32bde9ee94d9949aa63e8a7ef2d7bc5b:

  Update version for v8.0.0-rc3 release (2023-04-05 17:26:14 +0100)

are available in the Git repository at:

  https://github.com/legoater/qemu/ tags/pull-ppc-20230409

for you to fetch changes up to a253231fbede6e69bf287afd90f67347a7383aab:

  target/ppc: Fix temp usage in gen_op_arith_modw (2023-04-09 19:21:27 +0200)


ppc queue:

* Fix regresion with prefix instructions and pcrel addressing


Richard Henderson (1):
  target/ppc: Fix temp usage in gen_op_arith_modw

 target/ppc/translate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)



[PATCH for-8.0] docs/cxl: Fix sentence

2023-04-09 Thread Stefan Weil via
Signed-off-by: Stefan Weil 
---

If my change is okay, I suggest applying the patch for 8.0,
because it fixes documentation.

Regards,
Stefan W.

 docs/system/devices/cxl.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/system/devices/cxl.rst b/docs/system/devices/cxl.rst
index f25783a4ec..4c38223069 100644
--- a/docs/system/devices/cxl.rst
+++ b/docs/system/devices/cxl.rst
@@ -111,7 +111,7 @@ Interfaces provided include:
 
 CXL Root Ports (CXL RP)
 ~~~
-A CXL Root Port servers te same purpose as a PCIe Root Port.
+A CXL Root Port serves the same purpose as a PCIe Root Port.
 There are a number of CXL specific Designated Vendor Specific
 Extended Capabilities (DVSEC) in PCIe Configuration Space
 and associated component register access via PCI bars.
-- 
2.39.2




[PATCH for-8.0] docs: Fix typo (wphx => whpx)

2023-04-09 Thread Stefan Weil via
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1529
Signed-off-by: Stefan Weil 
---

I suggest applying the patch for 8.0 because it fixes documentation.

Regards
Stefan W.

 docs/system/introduction.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/system/introduction.rst b/docs/system/introduction.rst
index c8a9fe6c1d..3e256f8326 100644
--- a/docs/system/introduction.rst
+++ b/docs/system/introduction.rst
@@ -27,7 +27,7 @@ Tiny Code Generator (TCG) capable of emulating many CPUs.
   * - Hypervisor Framework (hvf)
 - MacOS
 - x86 (64 bit only), Arm (64 bit only)
-  * - Windows Hypervisor Platform (wphx)
+  * - Windows Hypervisor Platform (whpx)
 - Windows
 - x86
   * - NetBSD Virtual Machine Monitor (nvmm)
-- 
2.39.2




[PATCH] hw/arm: Fix some typos in comments (most found by codespell)

2023-04-09 Thread Stefan Weil via
Signed-off-by: Stefan Weil 
---

The patch does not change code and could also be applied for 8.0.

 hw/arm/Kconfig| 2 +-
 hw/arm/exynos4210.c   | 4 ++--
 hw/arm/musicpal.c | 2 +-
 hw/arm/omap1.c| 2 +-
 hw/arm/omap2.c| 2 +-
 hw/arm/virt-acpi-build.c  | 2 +-
 hw/arm/virt.c | 2 +-
 hw/arm/xlnx-versal-virt.c | 2 +-
 8 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index b5aed4aff5..db1105c717 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -126,7 +126,7 @@ config OLIMEX_STM32_H405
 config NSERIES
 bool
 select OMAP
-select TMP105   # tempature sensor
+select TMP105   # temperature sensor
 select BLIZZARD # LCD/TV controller
 select ONENAND
 select TSC210X  # touchscreen/sensors/audio
diff --git a/hw/arm/exynos4210.c b/hw/arm/exynos4210.c
index 6f2dda13f6..de39fb0ece 100644
--- a/hw/arm/exynos4210.c
+++ b/hw/arm/exynos4210.c
@@ -326,7 +326,7 @@ static int mapline_size(const int *mapline)
 
 /*
  * Initialize board IRQs.
- * These IRQs contain splitted Int/External Combiner and External Gic IRQs.
+ * These IRQs contain split Int/External Combiner and External Gic IRQs.
  */
 static void exynos4210_init_board_irqs(Exynos4210State *s)
 {
@@ -744,7 +744,7 @@ static void exynos4210_realize(DeviceState *socdev, Error **errp)
  * - SDMA
  * - ADMA2
  *
- * As this part of the Exynos4210 is not publically available,
+ * As this part of the Exynos4210 is not publicly available,
  * we used the "HS-MMC Controller S3C2416X RISC Microprocessor"
  * public datasheet which is very similar (implementing
  * MMC Specification Version 4.0 being the only difference noted)
diff --git a/hw/arm/musicpal.c b/hw/arm/musicpal.c
index c9010b2ffb..58f3d30c9b 100644
--- a/hw/arm/musicpal.c
+++ b/hw/arm/musicpal.c
@@ -100,7 +100,7 @@
 #define MP_LCD_SPI_CMD  0x00104011
 #define MP_LCD_SPI_INVALID  0x
 
-/* Commmands */
+/* Commands */
 #define MP_LCD_INST_SETPAGE00xB0
 /* ... */
 #define MP_LCD_INST_SETPAGE70xB7
diff --git a/hw/arm/omap1.c b/hw/arm/omap1.c
index 559c066ce9..d5438156ee 100644
--- a/hw/arm/omap1.c
+++ b/hw/arm/omap1.c
@@ -4057,7 +4057,7 @@ struct omap_mpu_state_s *omap310_mpu_init(MemoryRegion *dram,
 s->led[1] = omap_lpg_init(system_memory,
   0xfffbd800, omap_findclk(s, "clk32-kHz"));
 
-/* Register mappings not currenlty implemented:
+/* Register mappings not currently implemented:
  * MCSI2 Comm  fffb2000 - fffb27ff (not mapped on OMAP310)
  * MCSI1 Bluetooth fffb2800 - fffb2fff (not mapped on OMAP310)
  * USB W2FCfffb4000 - fffb47ff
diff --git a/hw/arm/omap2.c b/hw/arm/omap2.c
index 366d6af1b6..d5a2ae7af6 100644
--- a/hw/arm/omap2.c
+++ b/hw/arm/omap2.c
@@ -2523,7 +2523,7 @@ struct omap_mpu_state_s *omap2420_mpu_init(MemoryRegion *sdram,
 omap_findclk(s, "func_96m_clk"),
 omap_findclk(s, "core_l4_iclk"));
 
-/* All register mappings (includin those not currenlty implemented):
+/* All register mappings (including those not currently implemented):
  * SystemControlMod4800 - 48000fff
  * SystemControlL4 48001000 - 48001fff
  * 32kHz Timer Mod 48004000 - 48004fff
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 4156111d49..4af0de8b24 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -694,7 +694,7 @@ static void build_append_gicr(GArray *table_data, uint64_t base, uint32_t size)
 build_append_int_noprefix(table_data, 0xE, 1);  /* Type */
 build_append_int_noprefix(table_data, 16, 1);   /* Length */
 build_append_int_noprefix(table_data, 0, 2);/* Reserved */
-/* Discovery Range Base Addres */
+/* Discovery Range Base Address */
 build_append_int_noprefix(table_data, base, 8);
 build_append_int_noprefix(table_data, size, 4); /* Discovery Range Length */
 }
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ac626b3bef..4983f5fc93 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2052,7 +2052,7 @@ static void machvirt_init(MachineState *machine)
 int pa_bits;
 
 /*
- * Instanciate a temporary CPU object to find out about what
+ * Instantiate a temporary CPU object to find out about what
  * we are about to deal with. Once this is done, get rid of
  * the object.
  */
diff --git a/hw/arm/xlnx-versal-virt.c b/hw/arm/xlnx-versal-virt.c
index 37fc9b919c..668a9d65a4 100644
--- a/hw/arm/xlnx-versal-virt.c
+++ b/hw/arm/xlnx-versal-virt.c
@@ -659,7 +659,7 @@ static void versal_virt_init(MachineState *machine)
 fdt_add_clk_node(s, "/clk25", 2500, s->phandle.clk_25Mhz);
 
 /* Make the APU cpu address space visible to virtio and other
- * modules unaware of muliple address-spaces.  */
+ * modules unaware of multiple 

Re: [PATCH for-8.0 v2] target/ppc: Fix temp usage in gen_op_arith_modw

2023-04-09 Thread Richard Henderson

On 4/9/23 10:29, Cédric Le Goater wrote:

On 4/9/23 18:21, Richard Henderson wrote:

On 4/8/23 14:24, Cédric Le Goater wrote:

On 4/8/23 09:05, Richard Henderson wrote:

Fix a crash writing to 't3', which is now a constant.
Instead, write the result of the remu to 't1'.

Fixes: 7058ff5231a ("target/ppc: Avoid tcg_const_* in translate.c")
Reported-by: Nicholas Piggin 
Reviewed-by: Anton Johansson 
Signed-off-by: Richard Henderson 


Looks good:

   https://gitlab.com/legoater/qemu/-/pipelines/831847446

I have a PR ready for this same branch. If you want me to send it,
just tell me.


Yes, please.  Also, the comment above needs s/t1/t0/.  :-P


sure :)

Are you taking care of :

   
https://lore.kernel.org/r/20230408154316.3812709-1-richard.hender...@linaro.org


Yes, I'll send that with two other tcg fixes.


r~




Re: [PATCH for-8.0 v2] target/ppc: Fix temp usage in gen_op_arith_modw

2023-04-09 Thread Cédric Le Goater

On 4/9/23 18:21, Richard Henderson wrote:

On 4/8/23 14:24, Cédric Le Goater wrote:

On 4/8/23 09:05, Richard Henderson wrote:

Fix a crash writing to 't3', which is now a constant.
Instead, write the result of the remu to 't1'.

Fixes: 7058ff5231a ("target/ppc: Avoid tcg_const_* in translate.c")
Reported-by: Nicholas Piggin 
Reviewed-by: Anton Johansson 
Signed-off-by: Richard Henderson 


Looks good:

   https://gitlab.com/legoater/qemu/-/pipelines/831847446

I have a PR ready for this same branch. If you want me to send it,
just tell me.


Yes, please.  Also, the comment above needs s/t1/t0/.  :-P


sure :)

Are you taking care of :

  
https://lore.kernel.org/r/20230408154316.3812709-1-richard.hender...@linaro.org

C.





r~



I don't think we have tcg tests for the prefix or mma instructions
introduced in P10. That would be good to have.

C.








Re: qemu-user: avoid allocations to convert stuff when not necessary

2023-04-09 Thread Richard Henderson

On 4/9/23 01:52, Michael Tokarev wrote:

Hi!

In the qemu-user case, we allocate various structures and arrays
for conversion of data between host and guest byte orders and sizes.
But it is actually not necessary to do such allocation when the
*size* is the same, and only byte order is different, because the
conversion can be done in-place.  Does it make any sense to avoid
allocations in such cases?


Alignment can also change.  This is especially visible with an m68k guest,
where even 'int' is 2-byte aligned.


So, no.  It's best to just allocate and convert always.


r~



Re: [PATCH for-8.0 v2] target/ppc: Fix temp usage in gen_op_arith_modw

2023-04-09 Thread Richard Henderson

On 4/8/23 14:24, Cédric Le Goater wrote:

On 4/8/23 09:05, Richard Henderson wrote:

Fix a crash writing to 't3', which is now a constant.
Instead, write the result of the remu to 't1'.

Fixes: 7058ff5231a ("target/ppc: Avoid tcg_const_* in translate.c")
Reported-by: Nicholas Piggin 
Reviewed-by: Anton Johansson 
Signed-off-by: Richard Henderson 


Looks good:

   https://gitlab.com/legoater/qemu/-/pipelines/831847446

I have a PR ready for this same branch. If you want me to send it,
just tell me.


Yes, please.  Also, the comment above needs s/t1/t0/.  :-P


r~



I don't think we have tcg tests for the prefix or mma instructions
introduced in P10. That would be good to have.

C.






Re: qemu-user: avoid allocations to convert stuff when not necessary

2023-04-09 Thread Michael Tokarev

09.04.2023 17:04, Warner Losh пишет:

On Sun, Apr 9, 2023 at 2:53 AM Michael Tokarev <m...@tls.msk.ru> wrote:

Hi!

In the qemu-user case, we allocate various structures and arrays
for conversion of data between host and guest byte orders and sizes.
But it is actually not necessary to do such allocation when the
*size* is the same, and only byte order is different, because the
conversion can be done in-place.  Does it make any sense to avoid
allocations in such cases?

There are two issues with this, though. First, in some cases the
data being converted is const, and we may end up writing to data
that resides in a read-only segment; is that ever possible?  And
second, it is not entirely clear what to do in case the syscall
returns an error.


I don't think you can reliably do it in place. What if another thread in the
guest reads the data after you've converted it? It will get the wrong answer.
I think you have to copy when endian mismatches, just like when alignment,
data size or layout differences are present. You'd need to convert it back
after the system call as well, which can cause problems, especially
if the system call needs multiple steps to emulate for whatever reason.


Hmm. I thought more about syscalls which accept in-out parameters,
such as poll(pfd, n, timeout), which uses the pfd array for both input
and output. Or just for output, like getgroups(): for this one, if
sizeof(gid_t) == sizeof(target_gid_t), there's no need to allocate
anything, the supplied array can be used, and we just need a loop
after the syscall returns to convert each gid_t to a target_gid_t.
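
An editorial sketch of the in-place idea, assuming sizeof(gid_t) ==
sizeof(target_gid_t) == 4 (the variable names gidsetsize and
target_grouplist are hypothetical; tswap32() is QEMU's byte-swap helper,
which is a no-op when host and guest endianness match):

    abi_long ret = get_errno(getgroups(gidsetsize,
                                       (gid_t *)target_grouplist));
    if (!is_error(ret)) {
        for (int i = 0; i < ret; i++) {
            /* convert each host gid_t to guest byte order, in place */
            target_grouplist[i] = tswap32(target_grouplist[i]);
        }
    }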

For pure-input syscalls, like setgroups(), it is better to perform the
allocation IMHO, if only to avoid the case where a read-only segment is
used.

/mjt



[PATCH v3 4/5] intel_iommu: allow Extended Interrupt Mode when using userspace APIC

2023-04-09 Thread Bui Quang Minh
As the userspace APIC now supports x2APIC, the Intel interrupt remapping
hardware can be set to EIM mode when the userspace local APIC is used.
Signed-off-by: Bui Quang Minh 
---
 hw/i386/intel_iommu.c | 11 ---
 1 file changed, 11 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a62896759c..fd7c16b852 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4045,17 +4045,6 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
   && x86_iommu_ir_supported(x86_iommu) ?
   ON_OFF_AUTO_ON : ON_OFF_AUTO_OFF;
 }
-if (s->intr_eim == ON_OFF_AUTO_ON && !s->buggy_eim) {
-if (!kvm_irqchip_is_split()) {
-error_setg(errp, "eim=on requires accel=kvm,kernel-irqchip=split");
-return false;
-}
-if (!kvm_enable_x2apic()) {
-error_setg(errp, "eim=on requires support on the KVM side"
- "(X2APIC_API, first shipped in v4.7)");
-return false;
-}
-}
 
 /* Currently only address widths supported are 39 and 48 bits */
 if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
-- 
2.25.1




[PATCH v3 3/5] apic, i386/tcg: add x2apic transitions

2023-04-09 Thread Bui Quang Minh
This commit adds support for x2APIC transitions when writing to the
MSR_IA32_APICBASE register, and finally adds CPUID_EXT_X2APIC to
TCG_EXT_FEATURES. The enforced transition rules are summarized below.
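
Editorial summary of the (EN=MSR_IA32_APICBASE_ENABLE,
EXTD=MSR_IA32_APICBASE_EXTD) transitions enforced by the checks below,
matching the Intel SDM:

    (0,0) disabled -> (1,0) xAPIC    : allowed
    (1,0) xAPIC    -> (1,1) x2APIC   : allowed
    (1,1) x2APIC   -> (0,0) disabled : allowed
    (1,1) x2APIC   -> (1,0) xAPIC    : #GP
    (0,0) disabled -> (1,1) x2APIC   : #GP
    any state      -> (0,1)          : #GP (invalid combination)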

Signed-off-by: Bui Quang Minh 
---
 hw/intc/apic.c   | 50 
 hw/intc/apic_common.c|  7 ++--
 target/i386/cpu-sysemu.c | 10 ++
 target/i386/cpu.c|  5 +--
 target/i386/cpu.h|  6 
 target/i386/tcg/sysemu/misc_helper.c |  4 +++
 6 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/hw/intc/apic.c b/hw/intc/apic.c
index 0f2f29e679..716fe865d7 100644
--- a/hw/intc/apic.c
+++ b/hw/intc/apic.c
@@ -309,8 +309,41 @@ bool is_x2apic_mode(DeviceState *dev)
 return s->apicbase & MSR_IA32_APICBASE_EXTD;
 }
 
+static void apic_set_base_check(APICCommonState *s, uint64_t val)
+{
+/* Enable x2apic when x2apic is not supported by CPU */
+if (!cpu_has_x2apic_feature(&s->cpu->env) &&
+val & MSR_IA32_APICBASE_EXTD)
+raise_exception_ra(&s->cpu->env, EXCP0D_GPF, GETPC());
+
+/*
+ * Transition into invalid state
+ * (s->apicbase & MSR_IA32_APICBASE_ENABLE == 0) &&
+ * (s->apicbase & MSR_IA32_APICBASE_EXTD) == 1
+ */
+if (!(val & MSR_IA32_APICBASE_ENABLE) &&
+(val & MSR_IA32_APICBASE_EXTD))
+raise_exception_ra(&s->cpu->env, EXCP0D_GPF, GETPC());
+
+/* Invalid transition from disabled mode to x2APIC */
+if (!(s->apicbase & MSR_IA32_APICBASE_ENABLE) &&
+!(s->apicbase & MSR_IA32_APICBASE_EXTD) &&
+(val & MSR_IA32_APICBASE_ENABLE) &&
+(val & MSR_IA32_APICBASE_EXTD))
+raise_exception_ra(&s->cpu->env, EXCP0D_GPF, GETPC());
+
+/* Invalid transition from x2APIC to xAPIC */
+if ((s->apicbase & MSR_IA32_APICBASE_ENABLE) &&
+(s->apicbase & MSR_IA32_APICBASE_EXTD) &&
+(val & MSR_IA32_APICBASE_ENABLE) &&
+!(val & MSR_IA32_APICBASE_EXTD))
+raise_exception_ra(&s->cpu->env, EXCP0D_GPF, GETPC());
+}
+
 static void apic_set_base(APICCommonState *s, uint64_t val)
 {
+apic_set_base_check(s, val);
+
 s->apicbase = (val & 0xfffff000) |
 (s->apicbase & (MSR_IA32_APICBASE_BSP | MSR_IA32_APICBASE_ENABLE));
 /* if disabled, cannot be enabled again */
@@ -319,6 +352,23 @@ static void apic_set_base(APICCommonState *s, uint64_t val)
 cpu_clear_apic_feature(>cpu->env);
 s->spurious_vec &= ~APIC_SV_ENABLE;
 }
+
+/* Transition from disabled mode to xAPIC */
+if (!(s->apicbase & MSR_IA32_APICBASE_ENABLE) &&
+(val & MSR_IA32_APICBASE_ENABLE)) {
+s->apicbase |= MSR_IA32_APICBASE_ENABLE;
+cpu_set_apic_feature(>cpu->env);
+}
+
+/* Transition from xAPIC to x2APIC */
+if (cpu_has_x2apic_feature(&s->cpu->env) &&
+!(s->apicbase & MSR_IA32_APICBASE_EXTD) &&
+(val & MSR_IA32_APICBASE_EXTD)) {
+s->apicbase |= MSR_IA32_APICBASE_EXTD;
+
+s->log_dest = ((s->initial_apic_id & 0xffff0) << 16) |
+  (1 << (s->initial_apic_id & 0xf));
+}
 }
 
 static void apic_set_tpr(APICCommonState *s, uint8_t val)
diff --git a/hw/intc/apic_common.c b/hw/intc/apic_common.c
index d95914066e..396f828be8 100644
--- a/hw/intc/apic_common.c
+++ b/hw/intc/apic_common.c
@@ -43,11 +43,8 @@ void cpu_set_apic_base(DeviceState *dev, uint64_t val)
 if (dev) {
 APICCommonState *s = APIC_COMMON(dev);
 APICCommonClass *info = APIC_COMMON_GET_CLASS(s);
-/* switching to x2APIC, reset possibly modified xAPIC ID */
-if (!(s->apicbase & MSR_IA32_APICBASE_EXTD) &&
-(val & MSR_IA32_APICBASE_EXTD)) {
-s->id = s->initial_apic_id;
-}
+/* Reset possibly modified xAPIC ID */
+s->id = s->initial_apic_id;
 info->set_base(s, val);
 }
 }
diff --git a/target/i386/cpu-sysemu.c b/target/i386/cpu-sysemu.c
index a9ff10c517..f6bbe33372 100644
--- a/target/i386/cpu-sysemu.c
+++ b/target/i386/cpu-sysemu.c
@@ -235,6 +235,16 @@ void cpu_clear_apic_feature(CPUX86State *env)
 env->features[FEAT_1_EDX] &= ~CPUID_APIC;
 }
 
+void cpu_set_apic_feature(CPUX86State *env)
+{
+env->features[FEAT_1_EDX] |= CPUID_APIC;
+}
+
+bool cpu_has_x2apic_feature(CPUX86State *env)
+{
+return env->features[FEAT_1_ECX] & CPUID_EXT_X2APIC;
+}
+
 bool cpu_is_bsp(X86CPU *cpu)
 {
 return cpu_get_apic_base(cpu->apic_state) & MSR_IA32_APICBASE_BSP;
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 6576287e5b..6847b2ae02 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -627,12 +627,13 @@ void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1,
   CPUID_EXT_XSAVE | /* CPUID_EXT_OSXSAVE is dynamic */   \
   CPUID_EXT_MOVBE | CPUID_EXT_AES | CPUID_EXT_HYPERVISOR | \
   CPUID_EXT_RDRAND | CPUID_EXT_AVX | CPUID_EXT_F16C | \
-  CPUID_EXT_FMA)
+  CPUID_EXT_FMA | CPUID_EXT_X2APIC)
   /* missing:
   CPUID_EXT_DTES64, CPUID_EXT_DSCPL, CPUID_EXT_VMX, 

[PATCH] accel/kvm: Don't use KVM maximum support number to alloc user memslots

2023-04-09 Thread Robert Hoo
This patch corrects QEMU to properly use what KVM_CAP_NR_MEMSLOTS means,
i.e. the maximum number of user memslots KVM supports.
1. Rename KVMState::nr_slots --> max_slots.
2. Remember nr_slots in each KML. This also decouples the KMLs, e.g. x86's
   two KMLs no longer need the same size of slots[].
3. Change the initial slots[] size back to 32 and increase it dynamically
   (exponentially) up to the maximum number KVM supports. 32 should suit
   almost all normal guests (see the illustration below).
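
(Editorial illustration: with KVM reporting 32764, the slots[] growth
sequence would be 32, 64, 128, ..., 16384, then MIN(32768, 32764) = 32764;
a guest that never needs more than 32 slots never reallocates.)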

Background:
Since KVM commit 4fc096a99e01d ("KVM: Raise the maximum number of user
memslots"), KVM_CAP_NR_MEMSLOTS returns 32764 (SHRT_MAX - 3), which is a
huge increase from the previous 509 (x86). That change is based on the fact
that KVM allocates memslots dynamically. But QEMU allocates that huge number
of user memslots statically, which is unwanted on both sides. It causes:

1. Memory waste. We allocate (SHRT_MAX - 3) * sizeof(KVMSlot), while a
typical VM needs far less, e.g. the highest memory slot my VM ever used
was the 9th. x86, moreover, has two KMLs.
2. Time waste. Several KML slot functions walk the whole KML::slots[];
(SHRT_MAX - 3) entries make that far longer than necessary, e.g.
kvm_lookup_matching_slot(), kvm_physical_memory_addr_from_host(),
kvm_physical_log_clear(), kvm_log_sync_global().

Test:
Temporarily set KVM_DEF_NR_SLOTS = 8 to exercise the dynamic slots[]
growth; the VM launched and works well.

Signed-off-by: Robert Hoo 
---
 accel/kvm/kvm-all.c  | 57 +---
 include/sysemu/kvm_int.h |  4 ++-
 2 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index cf3a88d90e..708170139c 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -178,22 +178,50 @@ int kvm_get_max_memslots(void)
 {
 KVMState *s = KVM_STATE(current_accel());
 
-return s->nr_slots;
+return s->max_slots;
 }
 
-/* Called with KVMMemoryListener.slots_lock held */
+/* Called with kvm_slots_lock()'ed */
 static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
 {
 KVMState *s = kvm_state;
+KVMSlot *new_slots;
 int i;
+int new_nr, old_nr;
 
-for (i = 0; i < s->nr_slots; i++) {
+for (i = 0; i < kml->nr_slots; i++) {
 if (kml->slots[i].memory_size == 0) {
 return &kml->slots[i];
 }
 }
 
-return NULL;
+/* Already reached maximum, no more can expand */
+if (kml->nr_slots >= s->max_slots) {
+return NULL;
+}
+
+new_nr = 2 * kml->nr_slots;
+new_nr = MIN(new_nr, s->max_slots);
+/* It might overflow */
+if (new_nr < 0 || new_nr <= kml->nr_slots) {
+return NULL;
+}
+
+new_slots = g_try_new0(KVMSlot, new_nr);
+if (!new_slots) {
+return NULL;
+}
+
+memcpy(new_slots, kml->slots, kml->nr_slots * sizeof(KVMSlot));
+old_nr = kml->nr_slots;
+kml->nr_slots = new_nr;
+g_free(kml->slots);
+kml->slots = new_slots;
+for (i = old_nr; i < kml->nr_slots; i++) {
+kml->slots[i].slot = i;
+}
+
+return &kml->slots[old_nr];
 }
 
 bool kvm_has_free_slot(MachineState *ms)
@@ -226,10 +254,9 @@ static KVMSlot *kvm_lookup_matching_slot(KVMMemoryListener *kml,
  hwaddr start_addr,
  hwaddr size)
 {
-KVMState *s = kvm_state;
 int i;
 
-for (i = 0; i < s->nr_slots; i++) {
+for (i = 0; i < kml->nr_slots; i++) {
 KVMSlot *mem = &kml->slots[i];
 
 if (start_addr == mem->start_addr && size == mem->memory_size) {
@@ -271,7 +298,7 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
 int i, ret = 0;
 
 kvm_slots_lock();
-for (i = 0; i < s->nr_slots; i++) {
+for (i = 0; i < kml->nr_slots; i++) {
 KVMSlot *mem = &kml->slots[i];
 
 if (ram >= mem->ram && ram < mem->ram + mem->memory_size) {
@@ -1002,7 +1029,7 @@ static int kvm_physical_log_clear(KVMMemoryListener *kml,
 
 kvm_slots_lock();
 
-for (i = 0; i < s->nr_slots; i++) {
+for (i = 0; i < kml->nr_slots; i++) {
 mem = &kml->slots[i];
 /* Discard slots that are empty or do not overlap the section */
 if (!mem->memory_size ||
@@ -1566,7 +1593,6 @@ static void kvm_log_sync(MemoryListener *listener,
 static void kvm_log_sync_global(MemoryListener *l)
 {
 KVMMemoryListener *kml = container_of(l, KVMMemoryListener, listener);
-KVMState *s = kvm_state;
 KVMSlot *mem;
 int i;
 
@@ -1578,7 +1604,7 @@ static void kvm_log_sync_global(MemoryListener *l)
  * only a few used slots (small VMs).
  */
 kvm_slots_lock();
-for (i = 0; i < s->nr_slots; i++) {
+for (i = 0; i < kml->nr_slots; i++) {
 mem = &kml->slots[i];
 if (mem->memory_size && mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
 kvm_slot_sync_dirty_pages(mem);
@@ -1687,10 +1713,11 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
 {
 int i;
 
-kml->slots = g_new0(KVMSlot, s->nr_slots);
+kml->slots 

[PATCH v3 5/5] amd_iommu: report x2APIC support to the operating system

2023-04-09 Thread Bui Quang Minh
This commit adds an XTSup configuration option to let the user choose
whether to enable this feature. When XTSup is enabled, additional bytes in
the IRTE (with guest virtual APIC enabled) are used to support a 32-bit
destination ID.

Additionally, this commit switches to IVHD type 0x11 in the ACPI table for
reporting features to the operating system. This is because Linux does not
use XTSup in the IOMMU Feature Reporting field of IVHD type 0x10, but only
the XTSup bit in the EFR Register Image of IVHD type 0x11, to indicate
x2APIC support (see init_iommu_one in linux/drivers/iommu/amd/init.c).
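
Usage would presumably look like the following (editorial sketch, assuming
the property added below is named "xtsup" like the field and defaults to
off):

    qemu-system-x86_64 -machine q35 \
        -device amd-iommu,intremap=on,xtsup=on ...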

Signed-off-by: Bui Quang Minh 
---
 hw/i386/acpi-build.c | 28 ++--
 hw/i386/amd_iommu.c  | 21 +++--
 hw/i386/amd_iommu.h  | 16 +++-
 3 files changed, 44 insertions(+), 21 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index ec857a117e..72d6bb2892 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2339,7 +2339,7 @@ static void
 build_amd_iommu(GArray *table_data, BIOSLinker *linker, const char *oem_id,
 const char *oem_table_id)
 {
-int ivhd_table_len = 24;
+int ivhd_table_len = 40;
 AMDVIState *s = AMD_IOMMU_DEVICE(x86_iommu_get_default());
 GArray *ivhd_blob = g_array_new(false, true, 1);
 AcpiTable table = { .sig = "IVRS", .rev = 1, .oem_id = oem_id,
@@ -2349,18 +2349,19 @@ build_amd_iommu(GArray *table_data, BIOSLinker *linker, const char *oem_id,
 /* IVinfo - IO virtualization information common to all
  * IOMMU units in a system
  */
-build_append_int_noprefix(table_data, 40UL << 8/* PASize */, 4);
+build_append_int_noprefix(table_data,
+ (1UL << 0) | /* EFRSup */
+ (40UL << 8), /* PASize */
+ 4);
 /* reserved */
 build_append_int_noprefix(table_data, 0, 8);
 
-/* IVHD definition - type 10h */
-build_append_int_noprefix(table_data, 0x10, 1);
+/* IVHD definition - type 11h */
+build_append_int_noprefix(table_data, 0x11, 1);
 /* virtualization flags */
 build_append_int_noprefix(table_data,
  (1UL << 0) | /* HtTunEn  */
- (1UL << 4) | /* iotblSup */
- (1UL << 6) | /* PrefSup  */
- (1UL << 7),  /* PPRSup   */
+ (1UL << 4),  /* iotblSup */
  1);
 
 /*
@@ -2404,13 +2405,12 @@ build_amd_iommu(GArray *table_data, BIOSLinker *linker, const char *oem_id,
 build_append_int_noprefix(table_data, 0, 2);
 /* IOMMU info */
 build_append_int_noprefix(table_data, 0, 2);
-/* IOMMU Feature Reporting */
-build_append_int_noprefix(table_data,
- (48UL << 30) | /* HATS   */
- (48UL << 28) | /* GATS   */
- (1UL << 2)   | /* GTSup  */
- (1UL << 6),/* GASup  */
- 4);
+/* IOMMU Attributes */
+build_append_int_noprefix(table_data, 0, 4);
+/* EFR Register Image */
+build_append_int_noprefix(table_data, s->efr_reg, 8);
+/* EFR Register Image 2 */
+build_append_int_noprefix(table_data, 0, 8);
 
 /* IVHD entries as found above */
 g_array_append_vals(table_data, ivhd_blob->data, ivhd_blob->len);
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index bcd016f5c5..5dfa93d945 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -31,6 +31,7 @@
 #include "hw/i386/apic_internal.h"
 #include "trace.h"
 #include "hw/i386/apic-msidef.h"
+#include "hw/qdev-properties.h"
 
 /* used AMD-Vi MMIO registers */
 const char *amdvi_mmio_low[] = {
@@ -1155,7 +1156,12 @@ static int amdvi_int_remap_ga(AMDVIState *iommu,
 irq->vector = irte.hi.fields.vector;
 irq->dest_mode = irte.lo.fields_remap.dm;
 irq->redir_hint = irte.lo.fields_remap.rq_eoi;
-irq->dest = irte.lo.fields_remap.destination;
+if (!iommu->xtsup) {
+irq->dest = irte.lo.fields_remap.destination & 0xff;
+} else {
+irq->dest = irte.lo.fields_remap.destination |
+(irte.hi.fields.destination_hi << 24);
+}
 
 return 0;
 }
@@ -1503,10 +1509,15 @@ static void amdvi_init(AMDVIState *s)
 s->enabled = false;
 s->ats_enabled = false;
 s->cmdbuf_enabled = false;
+s->efr_reg = AMDVI_DEFAULT_EXT_FEATURES;
+
+if (s->xtsup) {
+s->efr_reg |= AMDVI_FEATURE_XT;
+}
 
 /* reset MMIO */
 memset(s->mmior, 0, AMDVI_MMIO_SIZE);
-amdvi_set_quad(s, AMDVI_MMIO_EXT_FEATURES, AMDVI_EXT_FEATURES,
+amdvi_set_quad(s, AMDVI_MMIO_EXT_FEATURES, s->efr_reg,
 0xffffffffffffffef, 0);
 amdvi_set_quad(s, AMDVI_MMIO_STATUS, 0, 0x98, 0x67);
 
@@ -1586,6 +1597,11 @@ static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
 amdvi_init(s);
 }
 
+static Property amdvi_properties[] = {
+ 

[PATCH v3 1/5] i386/tcg: implement x2APIC registers MSR access

2023-04-09 Thread Bui Quang Minh
This commit refactors apic_mem_read/write to support both MMIO access in
xAPIC and MSR access in x2APIC.
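
(Editorial reference, per the Intel SDM: in x2APIC mode the register at
xAPIC MMIO offset off is reached via RDMSR/WRMSR at 0x800 + (off >> 4);
e.g. the TPR at offset 0x80 becomes MSR 0x808, and the two 32-bit ICR
halves at 0x300/0x310 merge into the single 64-bit MSR 0x830. Hence the
shared, index-based apic_register_read/write helpers below.)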

Signed-off-by: Bui Quang Minh 
---
 hw/intc/apic.c   | 79 ++--
 hw/intc/trace-events |  4 +-
 include/hw/i386/apic.h   |  3 ++
 target/i386/cpu.h|  3 ++
 target/i386/tcg/sysemu/misc_helper.c | 27 ++
 5 files changed, 86 insertions(+), 30 deletions(-)

diff --git a/hw/intc/apic.c b/hw/intc/apic.c
index 20b5a94073..61b494b20a 100644
--- a/hw/intc/apic.c
+++ b/hw/intc/apic.c
@@ -288,6 +288,13 @@ void apic_deliver_irq(uint8_t dest, uint8_t dest_mode, uint8_t delivery_mode,
 apic_bus_deliver(deliver_bitmask, delivery_mode, vector_num, trigger_mode);
 }
 
+bool is_x2apic_mode(DeviceState *dev)
+{
+APICCommonState *s = APIC(dev);
+
+return s->apicbase & MSR_IA32_APICBASE_EXTD;
+}
+
 static void apic_set_base(APICCommonState *s, uint64_t val)
 {
 s->apicbase = (val & 0xfffff000) |
@@ -636,16 +643,11 @@ static void apic_timer(void *opaque)
 apic_timer_update(s, s->next_time);
 }
 
-static uint64_t apic_mem_read(void *opaque, hwaddr addr, unsigned size)
+uint64_t apic_register_read(int index)
 {
 DeviceState *dev;
 APICCommonState *s;
-uint32_t val;
-int index;
-
-if (size < 4) {
-return 0;
-}
+uint64_t val;
 
 dev = cpu_get_current_apic();
 if (!dev) {
@@ -653,7 +655,6 @@ static uint64_t apic_mem_read(void *opaque, hwaddr addr, unsigned size)
 }
 s = APIC(dev);
 
-index = (addr >> 4) & 0xff;
 switch(index) {
 case 0x02: /* id */
 val = s->id << 24;
@@ -720,7 +721,23 @@ static uint64_t apic_mem_read(void *opaque, hwaddr addr, unsigned size)
 val = 0;
 break;
 }
-trace_apic_mem_readl(addr, val);
+
+trace_apic_register_read(index, val);
+return val;
+}
+
+static uint64_t apic_mem_read(void *opaque, hwaddr addr, unsigned size)
+{
+uint32_t val;
+int index;
+
+if (size < 4) {
+return 0;
+}
+
+index = (addr >> 4) & 0xff;
+val = (uint32_t)apic_register_read(index);
+
 return val;
 }
 
@@ -737,27 +754,10 @@ static void apic_send_msi(MSIMessage *msi)
 apic_deliver_irq(dest, dest_mode, delivery, vector, trigger_mode);
 }
 
-static void apic_mem_write(void *opaque, hwaddr addr, uint64_t val,
-   unsigned size)
+void apic_register_write(int index, uint64_t val)
 {
 DeviceState *dev;
 APICCommonState *s;
-int index = (addr >> 4) & 0xff;
-
-if (size < 4) {
-return;
-}
-
-if (addr > 0xfff || !index) {
-/* MSI and MMIO APIC are at the same memory location,
- * but actually not on the global bus: MSI is on PCI bus
- * APIC is connected directly to the CPU.
- * Mapping them on the global bus happens to work because
- * MSI registers are reserved in APIC MMIO and vice versa. */
-MSIMessage msi = { .address = addr, .data = val };
-apic_send_msi(&msi);
-return;
-}
 
 dev = cpu_get_current_apic();
 if (!dev) {
@@ -765,7 +765,7 @@ static void apic_mem_write(void *opaque, hwaddr addr, uint64_t val,
 }
 s = APIC(dev);
 
-trace_apic_mem_writel(addr, val);
+trace_apic_register_write(index, val);
 
 switch(index) {
 case 0x02:
@@ -843,6 +843,29 @@ static void apic_mem_write(void *opaque, hwaddr addr, uint64_t val,
 }
 }
 
+static void apic_mem_write(void *opaque, hwaddr addr, uint64_t val,
+   unsigned size)
+{
+int index = (addr >> 4) & 0xff;
+
+if (size < 4) {
+return;
+}
+
+if (addr > 0xfff || !index) {
+/* MSI and MMIO APIC are at the same memory location,
+ * but actually not on the global bus: MSI is on PCI bus
+ * APIC is connected directly to the CPU.
+ * Mapping them on the global bus happens to work because
+ * MSI registers are reserved in APIC MMIO and vice versa. */
+MSIMessage msi = { .address = addr, .data = val };
+apic_send_msi(&msi);
+return;
+}
+
+apic_register_write(index, val);
+}
+
 static void apic_pre_save(APICCommonState *s)
 {
 apic_sync_vapic(s, SYNC_FROM_VAPIC);
diff --git a/hw/intc/trace-events b/hw/intc/trace-events
index 50cadfb996..9d4e7a67be 100644
--- a/hw/intc/trace-events
+++ b/hw/intc/trace-events
@@ -14,8 +14,8 @@ cpu_get_apic_base(uint64_t val) "0x%016"PRIx64
 # apic.c
 apic_local_deliver(int vector, uint32_t lvt) "vector %d delivery mode %d"
 apic_deliver_irq(uint8_t dest, uint8_t dest_mode, uint8_t delivery_mode, 
uint8_t vector_num, uint8_t trigger_mode) "dest %d dest_mode %d delivery_mode 
%d vector %d trigger_mode %d"
-apic_mem_readl(uint64_t addr, uint32_t val)  "0x%"PRIx64" = 0x%08x"
-apic_mem_writel(uint64_t addr, uint32_t val) "0x%"PRIx64" = 0x%08x"
+apic_register_read(uint8_t reg, uint64_t val) "register 0x%02x = 0x%"PRIx64
+apic_register_write(uint8_t 

[PATCH v3 2/5] apic: add support for x2APIC mode

2023-04-09 Thread Bui Quang Minh
This commit extends the APIC ID to 32 bits and removes the 255 max APIC
ID limit in the userspace APIC. The array that manages local APICs is now
dynamically allocated based on the max APIC ID of the created x86 machine.
Also, the new x2APIC IPI destination determination scheme, self IPI, and
x2APIC mode register access are supported.
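
(Editorial reference, per the Intel SDM: in x2APIC mode the logical
destination register is derived from the 32-bit APIC ID as

    ldr = ((id >> 4) << 16) | (1 << (id & 0xf));

i.e. bits 31..16 hold the cluster ID and bits 15..0 a one-hot position
within the 16-CPU cluster.)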

Signed-off-by: Bui Quang Minh 
---
Version 3 changes:
- Allow APIC ID > 255 only when x2APIC feature is supported on CPU
- Make a physical destination mode IPI with destination ID 0xffffffff
a broadcast to xAPIC CPUs
- Make cluster address 0xf in cluster model of xAPIC logical destination
mode a broadcast to all clusters
- Create new extended_log_dest to store APIC_LDR information in x2APIC
instead of extending log_dest for backward compatibility in vmstate

 hw/i386/x86.c   |   8 +-
 hw/intc/apic.c  | 266 
 hw/intc/apic_common.c   |   9 ++
 include/hw/i386/apic.h  |   3 +-
 include/hw/i386/apic_internal.h |   7 +-
 target/i386/cpu-sysemu.c|   8 +-
 6 files changed, 231 insertions(+), 70 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index a88a126123..8b70f0a6ea 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -132,11 +132,11 @@ void x86_cpus_init(X86MachineState *x86ms, int default_cpu_version)
  * Can we support APIC ID 255 or higher?
  *
  * Under Xen: yes.
- * With userspace emulated lapic: no
+ * With userspace emulated lapic: checked later in apic_common_set_id.
  * With KVM's in-kernel lapic: only if X2APIC API is enabled.
  */
 if (x86ms->apic_id_limit > 255 && !xen_enabled() &&
-(!kvm_irqchip_in_kernel() || !kvm_enable_x2apic())) {
+kvm_irqchip_in_kernel() && !kvm_enable_x2apic()) {
 error_report("current -smp configuration requires kernel "
  "irqchip and X2APIC API support.");
 exit(EXIT_FAILURE);
@@ -146,6 +146,10 @@ void x86_cpus_init(X86MachineState *x86ms, int default_cpu_version)
 kvm_set_max_apic_id(x86ms->apic_id_limit);
 }
 
+if (!kvm_irqchip_in_kernel()) {
+apic_set_max_apic_id(x86ms->apic_id_limit);
+}
+
 possible_cpus = mc->possible_cpu_arch_ids(ms);
 for (i = 0; i < ms->smp.cpus; i++) {
 x86_cpu_new(x86ms, possible_cpus->cpus[i].arch_id, &error_fatal);
diff --git a/hw/intc/apic.c b/hw/intc/apic.c
index 61b494b20a..0f2f29e679 100644
--- a/hw/intc/apic.c
+++ b/hw/intc/apic.c
@@ -31,15 +31,15 @@
 #include "hw/i386/apic-msidef.h"
 #include "qapi/error.h"
 #include "qom/object.h"
-
-#define MAX_APICS 255
-#define MAX_APIC_WORDS 8
+#include "tcg/helper-tcg.h"
 
 #define SYNC_FROM_VAPIC 0x1
 #define SYNC_TO_VAPIC   0x2
 #define SYNC_ISR_IRR_TO_VAPIC   0x4
 
-static APICCommonState *local_apics[MAX_APICS + 1];
+static APICCommonState **local_apics;
+static uint32_t max_apics;
+static uint32_t max_apic_words;
 
 #define TYPE_APIC "apic"
 /*This is reusing the APICCommonState typedef from APIC_COMMON */
@@ -49,7 +49,19 @@ DECLARE_INSTANCE_CHECKER(APICCommonState, APIC,
 static void apic_set_irq(APICCommonState *s, int vector_num, int trigger_mode);
 static void apic_update_irq(APICCommonState *s);
 static void apic_get_delivery_bitmask(uint32_t *deliver_bitmask,
-  uint8_t dest, uint8_t dest_mode);
+  uint32_t dest, uint8_t dest_mode);
+
+void apic_set_max_apic_id(uint32_t max_apic_id)
+{
+int word_size = 32;
+
+/* round up the max apic id to next multiple of words */
+max_apics = (max_apic_id + word_size - 1) & ~(word_size - 1);
+
+local_apics = g_malloc0(sizeof(*local_apics) * max_apics);
+max_apic_words = max_apics >> 5;
+}
+
 
 /* Find first bit starting from msb */
 static int apic_fls_bit(uint32_t value)
@@ -199,7 +211,7 @@ static void apic_external_nmi(APICCommonState *s)
 #define foreach_apic(apic, deliver_bitmask, code) \
 {\
 int __i, __j;\
-for(__i = 0; __i < MAX_APIC_WORDS; __i++) {\
+for(__i = 0; __i < max_apic_words; __i++) {\
 uint32_t __mask = deliver_bitmask[__i];\
 if (__mask) {\
 for(__j = 0; __j < 32; __j++) {\
@@ -226,7 +238,7 @@ static void apic_bus_deliver(const uint32_t *deliver_bitmask,
 {
 int i, d;
 d = -1;
-for(i = 0; i < MAX_APIC_WORDS; i++) {
+for(i = 0; i < max_apic_words; i++) {
 if (deliver_bitmask[i]) {
 d = i * 32 + apic_ffs_bit(deliver_bitmask[i]);
 break;
@@ -276,16 +288,18 @@ static void apic_bus_deliver(const uint32_t *deliver_bitmask,
  apic_set_irq(apic_iter, vector_num, trigger_mode) );
 }
 
-void apic_deliver_irq(uint8_t dest, uint8_t dest_mode, uint8_t delivery_mode,
-  uint8_t vector_num, uint8_t trigger_mode)
+static void apic_deliver_irq(uint32_t dest, 

[PATCH v3 0/5] Support x2APIC mode with TCG accelerator

2023-04-09 Thread Bui Quang Minh
Hi everyone,

This series implements x2APIC mode in the userspace local APIC and the
RDMSR/WRMSR helpers to access x2APIC registers in x2APIC mode. The Intel
IOMMU and AMD IOMMU are adjusted to support x2APIC interrupt remapping.
With this series, we can now boot a Linux kernel into x2APIC mode with the
TCG accelerator using either the Intel or the AMD IOMMU.
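
A hedged example invocation for trying this out (editorial; exact flags
may vary with your setup):

    qemu-system-x86_64 -accel tcg -cpu qemu64,+x2apic \
        -machine q35 -device intel-iommu,intremap=on,eim=on ...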

To test the emulated userspace APIC with kvm-unit-tests, disable the test
device with this patch:
diff --git a/lib/x86/fwcfg.c b/lib/x86/fwcfg.c
index 1734afb..f56fe1c 100644
--- a/lib/x86/fwcfg.c
+++ b/lib/x86/fwcfg.c
@@ -27,6 +27,7 @@ static void read_cfg_override(void)
 
if ((str = getenv("TEST_DEVICE")))
no_test_device = !atol(str);
+   no_test_device = true;
 
if ((str = getenv("MEMLIMIT")))
fw_override[FW_CFG_MAX_RAM] = atol(str) * 1024 * 1024;

~ env QEMU=/home/minh/Desktop/oss/qemu/build/qemu-system-x86_64 ACCEL=tcg \
./run_tests.sh -v -g apic 

TESTNAME=apic-split TIMEOUT=90s ACCEL=tcg ./x86/run x86/apic.flat -smp 2 -cpu qemu64,+x2apic,+tsc-deadline -machine kernel_irqchip=split
FAIL apic-split (54 tests, 8 unexpected failures, 1 skipped)
TESTNAME=ioapic-split TIMEOUT=90s ACCEL=tcg ./x86/run x86/ioapic.flat -smp 1 -cpu qemu64 -machine kernel_irqchip=split
PASS ioapic-split (19 tests)
TESTNAME=x2apic TIMEOUT=30 ACCEL=tcg ./x86/run x86/apic.flat -smp 2 -cpu qemu64,+x2apic,+tsc-deadline
FAIL x2apic (54 tests, 8 unexpected failures, 1 skipped)
TESTNAME=xapic TIMEOUT=60 ACCEL=tcg ./x86/run x86/apic.flat -smp 2 -cpu qemu64,-x2apic,+tsc-deadline -machine pit=off
FAIL xapic (43 tests, 6 unexpected failures, 2 skipped)

  FAIL: apic_disable: *0xfee00030: 50014
  FAIL: apic_disable: *0xfee00080: f0
  FAIL: apic_disable: *0xfee00030: 50014
  FAIL: apic_disable: *0xfee00080: f0 
  FAIL: apicbase: relocate apic

These errors occur because we don't disable the MMIO region when switching
to x2APIC and don't support relocating the MMIO region yet. This is a
problem because the MMIO region is the same for all CPUs; to support these
cases we need to figure out how to allocate and manage a different MMIO
region for each CPU. This can be an improvement in the future.

  FAIL: nmi-after-sti
  FAIL: multiple nmi

These errors come from the way we handle CPU_INTERRUPT_NMI in core TCG.

  FAIL: TMCCT should stay at zero

This error is related to the APIC timer and should be addressed in a
separate patch.

Version 3 changes,
- Patch 2:
  + Allow APIC ID > 255 only when x2APIC feature is supported on CPU
  + Make a physical destination mode IPI with destination id 0xffffffff
  a broadcast to xAPIC CPUs
  + Make cluster address 0xf in cluster model of xAPIC logical destination
  mode a broadcast to all clusters
  + Create new extended_log_dest to store APIC_LDR information in x2APIC
  instead of extending log_dest for backward compatibility in vmstate

Version 2 changes,
- Add support for APIC ID larger than 255
- Adjust AMD iommu for x2APIC support
- Reorganize and split patch 1,2 into patch 1,2,3 in version 2

Thanks,
Quang Minh.

Bui Quang Minh (5):
  i386/tcg: implement x2APIC registers MSR access
  apic: add support for x2APIC mode
  apic, i386/tcg: add x2apic transitions
  intel_iommu: allow Extended Interrupt Mode when using userspace APIC
  amd_iommu: report x2APIC support to the operating system

 hw/i386/acpi-build.c |  28 +-
 hw/i386/amd_iommu.c  |  21 +-
 hw/i386/amd_iommu.h  |  16 +-
 hw/i386/intel_iommu.c|  11 -
 hw/i386/x86.c|   8 +-
 hw/intc/apic.c   | 395 +--
 hw/intc/apic_common.c|  16 +-
 hw/intc/trace-events |   4 +-
 include/hw/i386/apic.h   |   6 +-
 include/hw/i386/apic_internal.h  |   7 +-
 target/i386/cpu-sysemu.c |  18 +-
 target/i386/cpu.c|   5 +-
 target/i386/cpu.h|   9 +
 target/i386/tcg/sysemu/misc_helper.c |  31 +++
 14 files changed, 436 insertions(+), 139 deletions(-)

-- 
2.25.1




Re: [PATCH v2 2/5] apic: add support for x2APIC mode

2023-04-09 Thread Bui Quang Minh

On 4/3/23 23:38, Bui Quang Minh wrote:

On 4/3/23 17:27, David Woodhouse wrote:

On Wed, 2023-03-29 at 22:30 +0700, Bui Quang Minh wrote:




I did some more testing on my hardware; your point is correct: when dest
== 0xffffffff, the interrupt is delivered to all APICs regardless of
their mode.


To be precise, it only broadcasts to CPUs in xAPIC mode if the IPI
destination mode is physical. If the destination mode is logical, the
flat model/cluster model rules apply to determine whether the xAPIC CPUs
accept the IPI. Wow, this is so complicated :)


So even if you send to *all* of the first 8 CPUs in a cluster (e.g.
cluster 0x0001, giving a destination 0x000100FF), a CPU in xAPIC mode
doesn't see that as a broadcast because it's logical mode?


I mean that if the destination is 0xffffffff, the xAPIC CPU will see
the destination as 0xff. 0xff is a broadcast in physical destination
mode only; in logical destination mode it may not be a broadcast,
depending on whether the flat model or the cluster model is in use.


In the flat model, 8 bits are used as a bitmask, so in theory this model
can only support 8 CPUs; each CPU reserves its own bit by setting the
upper 8 bits of the APIC LDR register. The Intel SDM says that 0xff can
be interpreted as a broadcast, and this is true in the normal case, but
I think that if a CPU sets its APIC LDR to 0, the IPI should not be
delivered to that CPU. This also matches the existing flat-model logical
destination implementation of the userspace APIC in QEMU before my
series. However, this seems like a corner case, and I didn't test it on
real hardware.


With the cluster model, when writing the above paragraphs I thought that
0xff would be delivered to all APICs (mask = 0xf) of cluster 15 (0xf).
However, reading the SDM more carefully, I see that there are only 15
clusters, with addresses from 0 to 14, so there is no cluster with
address 15. 0xff is interpreted as a broadcast to all APICs in all
clusters too.


In conclusion, an IPI with destination 0xffffffff can be a broadcast to
all xAPIC CPUs too, if we just ignore the corner case in the flat model
of logical destination mode (we may need to test more).


I did some tests on my machine with a custom Linux kernel module (I can't
use kvm-unit-tests because the display on my laptop is on an embedded
DisplayPort, not a serial port, so I don't know how to get the output).
I found that:
- In the flat model, if the CPU intentionally sets its 8-bit address to 0
in APIC_LDR, the 0xff logical IPI is not delivered to this CPU.
- In the cluster model, the 4 higher bits are used as the cluster
address; if these 4 bits equal 0xf, the IPI is broadcast to all clusters.
The 4 lower bits are used to address APICs within the cluster. If the CPU
intentionally sets these 4 lower bits to 0 in APIC_LDR, then, as in the
flat model, this CPU does not receive the logical IPI. For example, if a
CPU sets its address to 0x20 (cluster 2, address 0 in the cluster), a
logical IPI with destination == 0xff is delivered to cluster 2 but not
to this CPU.


So I chose to stick with the current implementation: a destination of
0xffffffff in an IPI is seen as a 0xff IPI by xAPIC CPUs. This IPI is a
broadcast in physical mode but not necessarily in logical mode.
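
To summarize the acceptance rules in code form, here is a minimal C
sketch of the xAPIC logical-destination matching described above. It is
illustrative only - the helper name is made up, and this is not QEMU's
actual implementation:

#include <stdbool.h>
#include <stdint.h>

/* mda = message destination address (the low 8 bits an xAPIC CPU sees),
 * ldr = the logical APIC ID from this CPU's APIC_LDR register. */
static bool xapic_logical_match(uint8_t mda, uint8_t ldr, bool flat_model)
{
    if (flat_model) {
        /* Flat: plain 8-bit bitmask match; with LDR == 0 nothing
         * matches, so even mda == 0xff is not delivered. */
        return (mda & ldr) != 0;
    }
    /* Cluster: high nibble is the cluster address, 0xf addresses all
     * clusters; low nibble is a bitmask of APICs within the cluster. */
    if ((mda >> 4) != 0xf && (mda >> 4) != (ldr >> 4)) {
        return false;
    }
    return (mda & ldr & 0xf) != 0;
}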




Re: qemu-user: avoid allocations to convert stuff when not necessary

2023-04-09 Thread Warner Losh
On Sun, Apr 9, 2023 at 2:53 AM Michael Tokarev  wrote:

> Hi!
>
> In the qemu-user case, we allocate various structures and arrays
> for conversion of data between host and guest byte orders and sizes.
> But it is actually not necessary to do such allocation when the
> *size* is the same, and only byte order is different, because the
> conversion can be done in-place.  Does it make any sense to avoid
> allocations in such cases?
>
> There are 2 issues with this though. First is that in some cases,
> the data being converted is const, and we may end up writing to
> data that resides in a read-only segment - is that ever possible?  And
> second - it is not entirely clear what to do in case the syscall
> returns an error.
>

I don't think you can reliably do it in place. What if another thread in the
guest reads the data after you've converted it? It will get the wrong
answer.
I think you have to copy when the endianness mismatches, just like when alignment,
data size or layout differences are present. You'd need to convert it back
after the system call as well, which can cause problems, especially
if the system call needs multiple steps to emulate for whatever reason.
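
To make the concern concrete, a sketch of the copy-and-convert approach
implied here: build a host-order bounce buffer instead of swapping the
guest array in place, so other guest threads never observe half-converted
data. The names are illustrative, not actual qemu-user helpers:

#include <stdint.h>
#include <stdlib.h>

static inline uint32_t bswap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Assumes guest and host byte orders differ; the caller frees the copy
 * (and writes back converted results) after the syscall completes. */
static uint32_t *copy_swap_u32(const uint32_t *guest, size_t n)
{
    uint32_t *host = malloc(n * sizeof(*host));
    if (host) {
        for (size_t i = 0; i < n; i++) {
            host[i] = bswap32(guest[i]);
        }
    }
    return host;
}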

Warner


Re: [PATCH v3] target/i386: Change wrong XFRM value

2023-04-09 Thread Michael Tokarev

06.04.2023 09:40, Yang Zhong wrote:

The previous patch wrongly replaced FEAT_XSAVE_XCR0_{LO|HI} with
FEAT_XSAVE_XSS_{LO|HI} in CPUID(EAX=12,ECX=1):{ECX,EDX}, which made
the SGX enclave support only the SSE and x87 features (xfrm=0x3).

Fixes: 301e90675c3f ("target/i386: Enable support for XSAVES based features")


This seems to be -stable material, no?

/mjt



Re: [PATCH] tracetool: use relative paths for '#line' preprocessor directives

2023-04-09 Thread Stefan Hajnoczi
On Thu, 6 Apr 2023 at 09:22, Thomas De Schampheleire
 wrote:
>
> The event filename is an absolute path. Convert it to a relative path when
> writing '#line' directives, to preserve reproducibility of the generated
> output when different base paths are used.
>
> Signed-off-by: Thomas De Schampheleire 
> ---
>  scripts/tracetool/backend/ftrace.py | 4 +++-
>  scripts/tracetool/backend/log.py| 4 +++-
>  scripts/tracetool/backend/syslog.py | 4 +++-
>  3 files changed, 9 insertions(+), 3 deletions(-)

Thanks, applied to my block-next tree:
https://gitlab.com/stefanha/qemu/-/tree/block-next

qemu.git/master is currently frozen for the 8.0 release. The latest
release candidate is -rc3 and the final release is imminent. I don't
want to risk breaking and delaying the release at this stage, so I
have queued up this patch for the 8.1 release cycle.

The timing is unlucky, I would have liked to still include this in the
8.0 release. Thank you for the patch!

Stefan



Re: [PATCH] linux-user: ppoll: eliminate large alloca

2023-04-09 Thread Michael Tokarev

(Replying to an old(ish) email... original thread:
https://patchwork.ozlabs.org/project/qemu-devel/patch/20221216192220.2881898-1-...@msgid.tls.msk.ru/
 )

16.12.2022 23:44, Richard Henderson wrote:

On 12/16/22 11:22, Michael Tokarev wrote:

do_ppoll() in linux-user/syscall.c uses alloca() to
allocate an array of struct pollfds on the stack.
The only upper bound on the number of entries in this
array is that the whole thing fits in INT_MAX. But this
is definitely too much for a stack allocation.

Use heap allocation when a large number of entries
is requested (currently 128, arbitrary), and continue
to use alloca() for smaller allocations, to optimize
the common small-size case.


I think it would be cleaner to always use heap allocation, and use g_autofree 
for the pointer.


Yes, it is cleaner to always use the same type of allocation.
But is it really unnecessary to try to avoid heap allocations
for small things? It costs not that much, but might speed
some things up. Dunno how much it saves though.  Maybe it
is from the "premature optimization" field :)

Speaking of g_autofree, we already have to call unlock_user anyway
(which we forgot to do), so there is little difference between
marking the buffer g_autofree and freeing it explicitly.
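
For illustration, a sketch of the hybrid allocation discussed above -
stack storage for small counts, a g_autofree heap buffer otherwise. The
threshold and names are made up; this is not the actual linux-user code:

#include <glib.h>
#include <poll.h>
#include <string.h>

#define SMALL_NFDS 128  /* arbitrary cut-off for stack allocation */

static int ppoll_hybrid_sketch(const struct pollfd *src, unsigned nfds,
                               int timeout_ms)
{
    struct pollfd small[SMALL_NFDS];
    g_autofree struct pollfd *big = NULL;
    struct pollfd *pfd = small;

    if (nfds > SMALL_NFDS) {
        big = g_try_new(struct pollfd, nfds);
        if (!big) {
            return -1;  /* would be -TARGET_ENOMEM in linux-user */
        }
        pfd = big;
    }
    memcpy(pfd, src, nfds * sizeof(*pfd));  /* stand-in for guest copy-in */
    return poll(pfd, nfds, timeout_ms);
}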

Thanks,

/mjt



[PATCH 7/7] target/riscv: Remove pc_succ_insn from DisasContext

2023-04-09 Thread Weiwei Li
pc_succ_insn is no longer useful after the introduction of cur_insn_len,
now that all pc-related values use a diff instead of an absolute value.

Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
---
 target/riscv/translate.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index 632e1cef59..d8899fcc4b 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -1150,7 +1150,6 @@ static void decode_opc(CPURISCVState *env, DisasContext *ctx, uint16_t opcode)
 /* Check for compressed insn */
 if (ctx->cur_insn_len == 2) {
 ctx->opcode = opcode;
-ctx->pc_succ_insn = ctx->base.pc_next + 2;
 if (has_ext(ctx, RVC) && decode_insn16(ctx, opcode)) {
 return;
 }
@@ -1160,7 +1159,6 @@ static void decode_opc(CPURISCVState *env, DisasContext *ctx, uint16_t opcode)
 translator_lduw(env, &ctx->base,
  ctx->base.pc_next + 2));
 ctx->opcode = opcode32;
-ctx->pc_succ_insn = ctx->base.pc_next + 4;
 
 for (size_t i = 0; i < ARRAY_SIZE(decoders); ++i) {
 if (decoders[i].guard_func(ctx) &&
@@ -1181,7 +1179,6 @@ static void riscv_tr_init_disas_context(DisasContextBase *dcbase, CPUState *cs)
 uint32_t tb_flags = ctx->base.tb->flags;
 
 ctx->pc_save = ctx->base.pc_first;
-ctx->pc_succ_insn = ctx->base.pc_first;
 ctx->mem_idx = FIELD_EX32(tb_flags, TB_FLAGS, MEM_IDX);
 ctx->mstatus_fs = tb_flags & TB_FLAGS_MSTATUS_FS;
 ctx->mstatus_vs = tb_flags & TB_FLAGS_MSTATUS_VS;
@@ -1244,7 +1241,7 @@ static void riscv_tr_translate_insn(DisasContextBase *dcbase, CPUState *cpu)
 
 ctx->ol = ctx->xl;
 decode_opc(env, ctx, opcode16);
-ctx->base.pc_next = ctx->pc_succ_insn;
+ctx->base.pc_next += ctx->cur_insn_len;
 
 /* Only the first insn within a TB is allowed to cross a page boundary. */
 if (ctx->base.is_jmp == DISAS_NEXT) {
-- 
2.25.1




[PATCH 1/7] target/riscv: Fix target address to update badaddr

2023-04-09 Thread Weiwei Li
Compute the target address before storing it into badaddr
when the misaligned-fetch exception is triggered.
Use a target_pc temp to hold the target address, avoiding the
confusing sequence of updating the target address into cpu_pc
before the misalignment check, copying it into badaddr, and
restoring cpu_pc to the current pc if the exception is triggered.

Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
Reviewed-by: Richard Henderson 
Reviewed-by: Alistair Francis 
---
 target/riscv/insn_trans/trans_rvi.c.inc | 23 ---
 target/riscv/translate.c| 21 ++---
 2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/target/riscv/insn_trans/trans_rvi.c.inc b/target/riscv/insn_trans/trans_rvi.c.inc
index 4ad54e8a49..cc72864d32 100644
--- a/target/riscv/insn_trans/trans_rvi.c.inc
+++ b/target/riscv/insn_trans/trans_rvi.c.inc
@@ -51,25 +51,30 @@ static bool trans_jal(DisasContext *ctx, arg_jal *a)
 static bool trans_jalr(DisasContext *ctx, arg_jalr *a)
 {
 TCGLabel *misaligned = NULL;
+TCGv target_pc = tcg_temp_new();
 
-tcg_gen_addi_tl(cpu_pc, get_gpr(ctx, a->rs1, EXT_NONE), a->imm);
-tcg_gen_andi_tl(cpu_pc, cpu_pc, (target_ulong)-2);
+tcg_gen_addi_tl(target_pc, get_gpr(ctx, a->rs1, EXT_NONE), a->imm);
+tcg_gen_andi_tl(target_pc, target_pc, (target_ulong)-2);
+
+if (get_xl(ctx) == MXL_RV32) {
+tcg_gen_ext32s_tl(target_pc, target_pc);
+}
 
-gen_set_pc(ctx, cpu_pc);
 if (!has_ext(ctx, RVC)) {
 TCGv t0 = tcg_temp_new();
 
 misaligned = gen_new_label();
-tcg_gen_andi_tl(t0, cpu_pc, 0x2);
+tcg_gen_andi_tl(t0, target_pc, 0x2);
 tcg_gen_brcondi_tl(TCG_COND_NE, t0, 0x0, misaligned);
 }
 
 gen_set_gpri(ctx, a->rd, ctx->pc_succ_insn);
+tcg_gen_mov_tl(cpu_pc, target_pc);
 lookup_and_goto_ptr(ctx);
 
 if (misaligned) {
 gen_set_label(misaligned);
-gen_exception_inst_addr_mis(ctx);
+gen_exception_inst_addr_mis(ctx, target_pc);
 }
 ctx->base.is_jmp = DISAS_NORETURN;
 
@@ -153,6 +158,7 @@ static bool gen_branch(DisasContext *ctx, arg_b *a, TCGCond cond)
 TCGLabel *l = gen_new_label();
 TCGv src1 = get_gpr(ctx, a->rs1, EXT_SIGN);
 TCGv src2 = get_gpr(ctx, a->rs2, EXT_SIGN);
+target_ulong next_pc;
 
 if (get_xl(ctx) == MXL_RV128) {
 TCGv src1h = get_gprh(ctx, a->rs1);
@@ -169,9 +175,12 @@ static bool gen_branch(DisasContext *ctx, arg_b *a, TCGCond cond)
 
 gen_set_label(l); /* branch taken */
 
-if (!has_ext(ctx, RVC) && ((ctx->base.pc_next + a->imm) & 0x3)) {
+next_pc = ctx->base.pc_next + a->imm;
+if (!has_ext(ctx, RVC) && (next_pc & 0x3)) {
 /* misaligned */
-gen_exception_inst_addr_mis(ctx);
+TCGv target_pc = tcg_temp_new();
+gen_pc_plus_diff(target_pc, ctx, next_pc);
+gen_exception_inst_addr_mis(ctx, target_pc);
 } else {
 gen_goto_tb(ctx, 0, ctx->base.pc_next + a->imm);
 }
diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index 0ee8ee147d..1c8eae86c5 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -222,21 +222,18 @@ static void decode_save_opc(DisasContext *ctx)
 ctx->insn_start = NULL;
 }
 
-static void gen_set_pc_imm(DisasContext *ctx, target_ulong dest)
+static void gen_pc_plus_diff(TCGv target, DisasContext *ctx,
+ target_ulong dest)
 {
 if (get_xl(ctx) == MXL_RV32) {
 dest = (int32_t)dest;
 }
-tcg_gen_movi_tl(cpu_pc, dest);
+tcg_gen_movi_tl(target, dest);
 }
 
-static void gen_set_pc(DisasContext *ctx, TCGv dest)
+static void gen_set_pc_imm(DisasContext *ctx, target_ulong dest)
 {
-if (get_xl(ctx) == MXL_RV32) {
-tcg_gen_ext32s_tl(cpu_pc, dest);
-} else {
-tcg_gen_mov_tl(cpu_pc, dest);
-}
+gen_pc_plus_diff(cpu_pc, ctx, dest);
 }
 
 static void generate_exception(DisasContext *ctx, int excp)
@@ -257,9 +254,9 @@ static void gen_exception_illegal(DisasContext *ctx)
 }
 }
 
-static void gen_exception_inst_addr_mis(DisasContext *ctx)
+static void gen_exception_inst_addr_mis(DisasContext *ctx, TCGv target)
 {
-tcg_gen_st_tl(cpu_pc, cpu_env, offsetof(CPURISCVState, badaddr));
+tcg_gen_st_tl(target, cpu_env, offsetof(CPURISCVState, badaddr));
 generate_exception(ctx, RISCV_EXCP_INST_ADDR_MIS);
 }
 
@@ -551,7 +548,9 @@ static void gen_jal(DisasContext *ctx, int rd, target_ulong imm)
 next_pc = ctx->base.pc_next + imm;
 if (!has_ext(ctx, RVC)) {
 if ((next_pc & 0x3) != 0) {
-gen_exception_inst_addr_mis(ctx);
+TCGv target_pc = tcg_temp_new();
+gen_pc_plus_diff(target_pc, ctx, next_pc);
+gen_exception_inst_addr_mis(ctx, target_pc);
 return;
 }
 }
-- 
2.25.1




Re: [PATCH v3] linux-user: fix getgroups/setgroups allocations

2023-04-09 Thread Michael Tokarev

09.04.2023 13:48, Michael Tokarev пишет:

..
v3:
  - fix a bug in getgroups(). In initial implementation I checked
for ret>0 in order to convert returned list of groups to target
byte order. But this clashes with unusual corner case for this
syscall: getgroups(0,NULL) return current number of groups in
the set, so this resulted in writing to *NULL. The right condition
here is gidsetsize>0:
-if (!is_error(ret) && ret > 0) {
+if (!is_error(ret) && gidsetsize > 0) {


The same fix is needed for getgroups32. v4 sent.
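
A tiny standalone demo of that corner case - getgroups(0, NULL)
legitimately returns a positive count while writing nothing, which is
why the conversion loop has to be guarded by gidsetsize > 0 rather than
ret > 0:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int n = getgroups(0, NULL);   /* query only, no buffer is written */
    printf("current supplementary group set size: %d\n", n);
    return n < 0;
}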

/mjt



[PATCH 2/7] target/riscv: Introduce cur_insn_len into DisasContext

2023-04-09 Thread Weiwei Li
Use cur_insn_len to store the length of the current instruction to
prepare for PC-relative translation.

Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
---
 target/riscv/translate.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index 1c8eae86c5..eee13b1225 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -59,6 +59,7 @@ typedef struct DisasContext {
 DisasContextBase base;
 /* pc_succ_insn points to the instruction following base.pc_next */
 target_ulong pc_succ_insn;
+target_ulong cur_insn_len;
 target_ulong priv_ver;
 RISCVMXL misa_mxl_max;
 RISCVMXL xl;
@@ -1117,8 +1118,9 @@ static void decode_opc(CPURISCVState *env, DisasContext *ctx, uint16_t opcode)
 };
 
 ctx->virt_inst_excp = false;
+ctx->cur_insn_len = insn_len(opcode);
 /* Check for compressed insn */
-if (insn_len(opcode) == 2) {
+if (ctx->cur_insn_len == 2) {
 ctx->opcode = opcode;
 ctx->pc_succ_insn = ctx->base.pc_next + 2;
 if (has_ext(ctx, RVC) && decode_insn16(ctx, opcode)) {
-- 
2.25.1




[PATCH 4/7] target/riscv: Change gen_set_pc_imm to gen_update_pc

2023-04-09 Thread Weiwei Li
Reduce reliance on absolute values (by passing the pc difference) to
prepare for PC-relative translation.

Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
---
 target/riscv/insn_trans/trans_privileged.c.inc |  2 +-
 target/riscv/insn_trans/trans_rvi.c.inc|  6 +++---
 target/riscv/insn_trans/trans_rvv.c.inc|  4 ++--
 target/riscv/insn_trans/trans_rvzawrs.c.inc|  2 +-
 target/riscv/insn_trans/trans_xthead.c.inc |  2 +-
 target/riscv/translate.c   | 10 +-
 6 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/target/riscv/insn_trans/trans_privileged.c.inc b/target/riscv/insn_trans/trans_privileged.c.inc
index 59501b2780..f45859ba1e 100644
--- a/target/riscv/insn_trans/trans_privileged.c.inc
+++ b/target/riscv/insn_trans/trans_privileged.c.inc
@@ -106,7 +106,7 @@ static bool trans_wfi(DisasContext *ctx, arg_wfi *a)
 {
 #ifndef CONFIG_USER_ONLY
 decode_save_opc(ctx);
-gen_set_pc_imm(ctx, ctx->pc_succ_insn);
+gen_update_pc(ctx, ctx->cur_insn_len);
 gen_helper_wfi(cpu_env);
 return true;
 #else
diff --git a/target/riscv/insn_trans/trans_rvi.c.inc b/target/riscv/insn_trans/trans_rvi.c.inc
index f9a2464287..012534c883 100644
--- a/target/riscv/insn_trans/trans_rvi.c.inc
+++ b/target/riscv/insn_trans/trans_rvi.c.inc
@@ -776,7 +776,7 @@ static bool trans_pause(DisasContext *ctx, arg_pause *a)
  * PAUSE is a no-op in QEMU,
  * end the TB and return to main loop
  */
-gen_set_pc_imm(ctx, ctx->pc_succ_insn);
+gen_update_pc(ctx, ctx->cur_insn_len);
 exit_tb(ctx);
 ctx->base.is_jmp = DISAS_NORETURN;
 
@@ -800,7 +800,7 @@ static bool trans_fence_i(DisasContext *ctx, arg_fence_i *a)
  * FENCE_I is a no-op in QEMU,
  * however we need to end the translation block
  */
-gen_set_pc_imm(ctx, ctx->pc_succ_insn);
+gen_update_pc(ctx, ctx->cur_insn_len);
 exit_tb(ctx);
 ctx->base.is_jmp = DISAS_NORETURN;
 return true;
@@ -811,7 +811,7 @@ static bool do_csr_post(DisasContext *ctx)
 /* The helper may raise ILLEGAL_INSN -- record binv for unwind. */
 decode_save_opc(ctx);
 /* We may have changed important cpu state -- exit to main loop. */
-gen_set_pc_imm(ctx, ctx->pc_succ_insn);
+gen_update_pc(ctx, ctx->cur_insn_len);
 exit_tb(ctx);
 ctx->base.is_jmp = DISAS_NORETURN;
 return true;
diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index f2e3d38515..fc666c113a 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -169,7 +169,7 @@ static bool do_vsetvl(DisasContext *s, int rd, int rs1, TCGv s2)
 gen_set_gpr(s, rd, dst);
 mark_vs_dirty(s);
 
-gen_set_pc_imm(s, s->pc_succ_insn);
+gen_update_pc(s, s->cur_insn_len);
 lookup_and_goto_ptr(s);
 s->base.is_jmp = DISAS_NORETURN;
 return true;
@@ -188,7 +188,7 @@ static bool do_vsetivli(DisasContext *s, int rd, TCGv s1, TCGv s2)
 gen_helper_vsetvl(dst, cpu_env, s1, s2);
 gen_set_gpr(s, rd, dst);
 mark_vs_dirty(s);
-gen_set_pc_imm(s, s->pc_succ_insn);
+gen_update_pc(s, s->cur_insn_len);
 lookup_and_goto_ptr(s);
 s->base.is_jmp = DISAS_NORETURN;
 
diff --git a/target/riscv/insn_trans/trans_rvzawrs.c.inc b/target/riscv/insn_trans/trans_rvzawrs.c.inc
index 8254e7dfe2..32efbff4d5 100644
--- a/target/riscv/insn_trans/trans_rvzawrs.c.inc
+++ b/target/riscv/insn_trans/trans_rvzawrs.c.inc
@@ -33,7 +33,7 @@ static bool trans_wrs(DisasContext *ctx)
 /* Clear the load reservation  (if any).  */
 tcg_gen_movi_tl(load_res, -1);
 
-gen_set_pc_imm(ctx, ctx->pc_succ_insn);
+gen_update_pc(ctx, ctx->cur_insn_len);
 tcg_gen_exit_tb(NULL, 0);
 ctx->base.is_jmp = DISAS_NORETURN;
 
diff --git a/target/riscv/insn_trans/trans_xthead.c.inc b/target/riscv/insn_trans/trans_xthead.c.inc
index df504c3f2c..16b9a4b806 100644
--- a/target/riscv/insn_trans/trans_xthead.c.inc
+++ b/target/riscv/insn_trans/trans_xthead.c.inc
@@ -1011,7 +1011,7 @@ static void gen_th_sync_local(DisasContext *ctx)
  * Emulate out-of-order barriers with pipeline flush
  * by exiting the translation block.
  */
-gen_set_pc_imm(ctx, ctx->pc_succ_insn);
+gen_update_pc(ctx, ctx->cur_insn_len);
 tcg_gen_exit_tb(NULL, 0);
 ctx->base.is_jmp = DISAS_NORETURN;
 }
diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index 8c157d947e..db061064a6 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -232,14 +232,14 @@ static void gen_pc_plus_diff(TCGv target, DisasContext *ctx,
 tcg_gen_movi_tl(target, dest);
 }
 
-static void gen_set_pc_imm(DisasContext *ctx, target_ulong dest)
+static void gen_update_pc(DisasContext *ctx, target_long diff)
 {
-gen_pc_plus_diff(cpu_pc, ctx, dest);
+gen_pc_plus_diff(cpu_pc, ctx, ctx->base.pc_next + diff);
 }
 
 static void generate_exception(DisasContext *ctx, int excp)
 {
-gen_set_pc_imm(ctx, 

[PATCH v4] linux-user: fix getgroups/setgroups allocations

2023-04-09 Thread Michael Tokarev
linux-user getgroups(), setgroups(), getgroups32() and setgroups32()
used alloca() to allocate grouplist arrays, with an unchecked gidsetsize
coming from the "guest".  With NGROUPS_MAX being 65536 (on linux, and it
is common for an application to allocate NGROUPS_MAX for getgroups()),
this means a typical allocation is half a megabyte on the stack, which
just overflows the stack and leads to an immediate SIGSEGV in the actual
system getgroups() implementation.

An example of such issue is aptitude, eg
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=811087#72

Cap gidsetsize to NGROUPS_MAX (return EINVAL if it is larger than that),
and use heap allocation for grouplist instead of alloca().  While at it,
fix coding style and make all 4 implementations identical.

Try to not impose random limits - for example, allow gidsetsize to be
negative for getgroups() - just do not allocate a negative-sized grouplist
in this case but still do the actual getgroups() call.  But do not allow
a negative gidsetsize for setgroups() since its argument is unsigned.

Capping at NGROUPS_MAX seems a bit arbitrary - we could allow more; it is
not an error if the set size is NGROUPS_MAX+1. But we should not allow
integer overflow for the array being allocated. Maybe it is enough to
just call g_try_new() and return ENOMEM if it fails.

Maybe there's also no need to convert setgroups() since this one is
usually smaller and known beforehand (KERN_NGROUPS_MAX is actually 63 -
apparently a kernel-imposed limit on the runtime group set).

The patch fixes aptitude segfault mentioned above.

Signed-off-by: Michael Tokarev 
---
v4:
 - the same ret-vs-gidsetsize fix in getgroups32.
v3:
 - fix a bug in getgroups(). In initial implementation I checked
   for ret>0 in order to convert returned list of groups to target
   byte order. But this clashes with unusual corner case for this
   syscall: getgroups(0,NULL) return current number of groups in
   the set, so this resulted in writing to *NULL. The right condition
   here is gidsetsize>0:
   -if (!is_error(ret) && ret > 0) {
   +if (!is_error(ret) && gidsetsize > 0) {
v2:
 - remove g_free, use g_autofree annotations instead,
 - a bit more coding style changes, makes checkpatch.pl happy

 linux-user/syscall.c | 99 ++--
 1 file changed, 68 insertions(+), 31 deletions(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 24b25759be..c532ee92c1 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -11433,39 +11433,58 @@ static abi_long do_syscall1(CPUArchState *cpu_env, int num, abi_long arg1,
 {
 int gidsetsize = arg1;
 target_id *target_grouplist;
-gid_t *grouplist;
+g_autofree gid_t *grouplist = NULL;
 int i;
 
-grouplist = alloca(gidsetsize * sizeof(gid_t));
+if (gidsetsize > NGROUPS_MAX) {
+return -TARGET_EINVAL;
+}
+if (gidsetsize > 0) {
+grouplist = g_try_new(gid_t, gidsetsize);
+if (!grouplist) {
+return -TARGET_ENOMEM;
+}
+}
 ret = get_errno(getgroups(gidsetsize, grouplist));
-if (gidsetsize == 0)
-return ret;
-if (!is_error(ret)) {
-target_grouplist = lock_user(VERIFY_WRITE, arg2, gidsetsize * sizeof(target_id), 0);
-if (!target_grouplist)
+if (!is_error(ret) && gidsetsize > 0) {
+target_grouplist = lock_user(VERIFY_WRITE, arg2,
+ gidsetsize * sizeof(target_id), 0);
+if (!target_grouplist) {
 return -TARGET_EFAULT;
-for(i = 0;i < ret; i++)
+}
+for (i = 0; i < ret; i++) {
 target_grouplist[i] = tswapid(high2lowgid(grouplist[i]));
-unlock_user(target_grouplist, arg2, gidsetsize * sizeof(target_id));
+}
+unlock_user(target_grouplist, arg2,
+gidsetsize * sizeof(target_id));
 }
+return ret;
 }
-return ret;
 case TARGET_NR_setgroups:
 {
 int gidsetsize = arg1;
 target_id *target_grouplist;
-gid_t *grouplist = NULL;
+g_autofree gid_t *grouplist = NULL;
 int i;
-if (gidsetsize) {
-grouplist = alloca(gidsetsize * sizeof(gid_t));
-target_grouplist = lock_user(VERIFY_READ, arg2, gidsetsize * sizeof(target_id), 1);
+
+if (gidsetsize > NGROUPS_MAX || gidsetsize < 0) {
+return -TARGET_EINVAL;
+}
+if (gidsetsize > 0) {
+grouplist = g_try_new(gid_t, gidsetsize);
+if (!grouplist) {
+return -TARGET_ENOMEM;
+}
+

[PATCH 5/7] target/riscv: Use true diff for gen_pc_plus_diff

2023-04-09 Thread Weiwei Li
Reduce reliance on absolute values by using true pc difference for
gen_pc_plus_diff() to prepare for PC-relative translation.

Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
---
 target/riscv/insn_trans/trans_rvi.c.inc |  6 ++
 target/riscv/translate.c| 13 ++---
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/target/riscv/insn_trans/trans_rvi.c.inc b/target/riscv/insn_trans/trans_rvi.c.inc
index 012534c883..b77e6c4fb6 100644
--- a/target/riscv/insn_trans/trans_rvi.c.inc
+++ b/target/riscv/insn_trans/trans_rvi.c.inc
@@ -158,7 +158,6 @@ static bool gen_branch(DisasContext *ctx, arg_b *a, TCGCond cond)
 TCGLabel *l = gen_new_label();
 TCGv src1 = get_gpr(ctx, a->rs1, EXT_SIGN);
 TCGv src2 = get_gpr(ctx, a->rs2, EXT_SIGN);
-target_ulong next_pc;
 
 if (get_xl(ctx) == MXL_RV128) {
 TCGv src1h = get_gprh(ctx, a->rs1);
@@ -175,11 +174,10 @@ static bool gen_branch(DisasContext *ctx, arg_b *a, TCGCond cond)
 
 gen_set_label(l); /* branch taken */
 
-next_pc = ctx->base.pc_next + a->imm;
-if (!has_ext(ctx, RVC) && (next_pc & 0x3)) {
+if (!has_ext(ctx, RVC) && (a->imm & 0x3)) {
 /* misaligned */
 TCGv target_pc = tcg_temp_new();
-gen_pc_plus_diff(target_pc, ctx, next_pc);
+gen_pc_plus_diff(target_pc, ctx, a->imm);
 gen_exception_inst_addr_mis(ctx, target_pc);
 } else {
 gen_goto_tb(ctx, 0, a->imm);
diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index db061064a6..50a87d7367 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -224,8 +224,10 @@ static void decode_save_opc(DisasContext *ctx)
 }
 
 static void gen_pc_plus_diff(TCGv target, DisasContext *ctx,
- target_ulong dest)
+ target_long diff)
 {
+target_ulong dest = ctx->base.pc_next + diff;
+
 if (get_xl(ctx) == MXL_RV32) {
 dest = (int32_t)dest;
 }
@@ -234,7 +236,7 @@ static void gen_pc_plus_diff(TCGv target, DisasContext *ctx,
 
 static void gen_update_pc(DisasContext *ctx, target_long diff)
 {
-gen_pc_plus_diff(cpu_pc, ctx, ctx->base.pc_next + diff);
+gen_pc_plus_diff(cpu_pc, ctx, diff);
 }
 
 static void generate_exception(DisasContext *ctx, int excp)
@@ -545,14 +547,11 @@ static void gen_set_fpr_d(DisasContext *ctx, int reg_num, TCGv_i64 t)
 
 static void gen_jal(DisasContext *ctx, int rd, target_ulong imm)
 {
-target_ulong next_pc;
-
 /* check misaligned: */
-next_pc = ctx->base.pc_next + imm;
 if (!has_ext(ctx, RVC)) {
-if ((next_pc & 0x3) != 0) {
+if ((imm & 0x3) != 0) {
 TCGv target_pc = tcg_temp_new();
-gen_pc_plus_diff(target_pc, ctx, next_pc);
+gen_pc_plus_diff(target_pc, ctx, imm);
 gen_exception_inst_addr_mis(ctx, target_pc);
 return;
 }
-- 
2.25.1




[PATCH 3/7] target/riscv: Change gen_goto_tb to work on displacements

2023-04-09 Thread Weiwei Li
Reduce reliance on absolute values to prepare for PC-relative translation.

Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
---
 target/riscv/insn_trans/trans_rvi.c.inc | 4 ++--
 target/riscv/translate.c| 8 +---
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/target/riscv/insn_trans/trans_rvi.c.inc b/target/riscv/insn_trans/trans_rvi.c.inc
index cc72864d32..f9a2464287 100644
--- a/target/riscv/insn_trans/trans_rvi.c.inc
+++ b/target/riscv/insn_trans/trans_rvi.c.inc
@@ -171,7 +171,7 @@ static bool gen_branch(DisasContext *ctx, arg_b *a, TCGCond cond)
 } else {
 tcg_gen_brcond_tl(cond, src1, src2, l);
 }
-gen_goto_tb(ctx, 1, ctx->pc_succ_insn);
+gen_goto_tb(ctx, 1, ctx->cur_insn_len);
 
 gen_set_label(l); /* branch taken */
 
@@ -182,7 +182,7 @@ static bool gen_branch(DisasContext *ctx, arg_b *a, TCGCond cond)
 gen_pc_plus_diff(target_pc, ctx, next_pc);
 gen_exception_inst_addr_mis(ctx, target_pc);
 } else {
-gen_goto_tb(ctx, 0, ctx->base.pc_next + a->imm);
+gen_goto_tb(ctx, 0, a->imm);
 }
 ctx->base.is_jmp = DISAS_NORETURN;
 
diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index eee13b1225..8c157d947e 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -281,8 +281,10 @@ static void exit_tb(DisasContext *ctx)
 tcg_gen_exit_tb(NULL, 0);
 }
 
-static void gen_goto_tb(DisasContext *ctx, int n, target_ulong dest)
+static void gen_goto_tb(DisasContext *ctx, int n, target_long diff)
 {
+target_ulong dest = ctx->base.pc_next + diff;
+
  /*
   * Under itrigger, instruction executes one by one like singlestep,
   * direct block chain benefits will be small.
@@ -557,7 +559,7 @@ static void gen_jal(DisasContext *ctx, int rd, target_ulong imm)
 }
 
 gen_set_gpri(ctx, rd, ctx->pc_succ_insn);
-gen_goto_tb(ctx, 0, ctx->base.pc_next + imm); /* must use this for safety */
+gen_goto_tb(ctx, 0, imm); /* must use this for safety */
 ctx->base.is_jmp = DISAS_NORETURN;
 }
 
@@ -1237,7 +1239,7 @@ static void riscv_tr_tb_stop(DisasContextBase *dcbase, CPUState *cpu)
 
 switch (ctx->base.is_jmp) {
 case DISAS_TOO_MANY:
-gen_goto_tb(ctx, 0, ctx->base.pc_next);
+gen_goto_tb(ctx, 0, 0);
 break;
 case DISAS_NORETURN:
 break;
-- 
2.25.1




[PATCH 6/7] target/riscv: Enable PC-relative translation

2023-04-09 Thread Weiwei Li
Add a base pc_save for PC-relative translation (CF_PCREL).
Disable directly syncing pc from the tb in riscv_cpu_synchronize_from_tb.
Use gen_pc_plus_diff to get the pc-relative address.
Enable CF_PCREL in system mode.

Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
---
 target/riscv/cpu.c  | 31 ++--
 target/riscv/insn_trans/trans_rvi.c.inc | 12 +--
 target/riscv/translate.c| 47 +
 3 files changed, 71 insertions(+), 19 deletions(-)

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 1e97473af2..3b562d5d9f 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -658,16 +658,18 @@ static vaddr riscv_cpu_get_pc(CPUState *cs)
 static void riscv_cpu_synchronize_from_tb(CPUState *cs,
   const TranslationBlock *tb)
 {
-RISCVCPU *cpu = RISCV_CPU(cs);
-CPURISCVState *env = &cpu->env;
-RISCVMXL xl = FIELD_EX32(tb->flags, TB_FLAGS, XL);
+if (!(tb_cflags(tb) & CF_PCREL)) {
+RISCVCPU *cpu = RISCV_CPU(cs);
+CPURISCVState *env = &cpu->env;
+RISCVMXL xl = FIELD_EX32(tb->flags, TB_FLAGS, XL);
 
-tcg_debug_assert(!(cs->tcg_cflags & CF_PCREL));
+tcg_debug_assert(!(cs->tcg_cflags & CF_PCREL));
 
-if (xl == MXL_RV32) {
-env->pc = (int32_t) tb->pc;
-} else {
-env->pc = tb->pc;
+if (xl == MXL_RV32) {
+env->pc = (int32_t) tb->pc;
+} else {
+env->pc = tb->pc;
+}
 }
 }
 
@@ -693,11 +695,18 @@ static void riscv_restore_state_to_opc(CPUState *cs,
 RISCVCPU *cpu = RISCV_CPU(cs);
CPURISCVState *env = &cpu->env;
 RISCVMXL xl = FIELD_EX32(tb->flags, TB_FLAGS, XL);
+target_ulong pc;
+
+if (tb_cflags(tb) & CF_PCREL) {
+pc = (env->pc & TARGET_PAGE_MASK) | data[0];
+} else {
+pc = data[0];
+}
 
 if (xl == MXL_RV32) {
-env->pc = (int32_t)data[0];
+env->pc = (int32_t)pc;
 } else {
-env->pc = data[0];
+env->pc = pc;
 }
 env->bins = data[1];
 }
@@ -1184,6 +1193,8 @@ static void riscv_cpu_realize(DeviceState *dev, Error **errp)
 
 
 #ifndef CONFIG_USER_ONLY
+cs->tcg_cflags |= CF_PCREL;
+
 if (cpu->cfg.ext_sstc) {
 riscv_timer_init(cpu);
 }
diff --git a/target/riscv/insn_trans/trans_rvi.c.inc b/target/riscv/insn_trans/trans_rvi.c.inc
index b77e6c4fb6..0afba787ef 100644
--- a/target/riscv/insn_trans/trans_rvi.c.inc
+++ b/target/riscv/insn_trans/trans_rvi.c.inc
@@ -38,7 +38,9 @@ static bool trans_lui(DisasContext *ctx, arg_lui *a)
 
 static bool trans_auipc(DisasContext *ctx, arg_auipc *a)
 {
-gen_set_gpri(ctx, a->rd, a->imm + ctx->base.pc_next);
+TCGv target_pc = dest_gpr(ctx, a->rd);
+gen_pc_plus_diff(target_pc, ctx, a->imm);
+gen_set_gpr(ctx, a->rd, target_pc);
 return true;
 }
 
@@ -52,6 +54,7 @@ static bool trans_jalr(DisasContext *ctx, arg_jalr *a)
 {
 TCGLabel *misaligned = NULL;
 TCGv target_pc = tcg_temp_new();
+TCGv succ_pc = dest_gpr(ctx, a->rd);
 
 tcg_gen_addi_tl(target_pc, get_gpr(ctx, a->rs1, EXT_NONE), a->imm);
 tcg_gen_andi_tl(target_pc, target_pc, (target_ulong)-2);
@@ -68,7 +71,9 @@ static bool trans_jalr(DisasContext *ctx, arg_jalr *a)
 tcg_gen_brcondi_tl(TCG_COND_NE, t0, 0x0, misaligned);
 }
 
-gen_set_gpri(ctx, a->rd, ctx->pc_succ_insn);
+gen_pc_plus_diff(succ_pc, ctx, ctx->cur_insn_len);
+gen_set_gpr(ctx, a->rd, succ_pc);
+
 tcg_gen_mov_tl(cpu_pc, target_pc);
 lookup_and_goto_ptr(ctx);
 
@@ -158,6 +163,7 @@ static bool gen_branch(DisasContext *ctx, arg_b *a, TCGCond cond)
 TCGLabel *l = gen_new_label();
 TCGv src1 = get_gpr(ctx, a->rs1, EXT_SIGN);
 TCGv src2 = get_gpr(ctx, a->rs2, EXT_SIGN);
+target_ulong orig_pc_save = ctx->pc_save;
 
 if (get_xl(ctx) == MXL_RV128) {
 TCGv src1h = get_gprh(ctx, a->rs1);
@@ -171,6 +177,7 @@ static bool gen_branch(DisasContext *ctx, arg_b *a, TCGCond cond)
 tcg_gen_brcond_tl(cond, src1, src2, l);
 }
 gen_goto_tb(ctx, 1, ctx->cur_insn_len);
+ctx->pc_save = orig_pc_save;
 
 gen_set_label(l); /* branch taken */
 
@@ -182,6 +189,7 @@ static bool gen_branch(DisasContext *ctx, arg_b *a, TCGCond cond)
 } else {
 gen_goto_tb(ctx, 0, a->imm);
 }
+ctx->pc_save = -1;
 ctx->base.is_jmp = DISAS_NORETURN;
 
 return true;
diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index 50a87d7367..632e1cef59 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -60,6 +60,7 @@ typedef struct DisasContext {
 /* pc_succ_insn points to the instruction following base.pc_next */
 target_ulong pc_succ_insn;
 target_ulong cur_insn_len;
+target_ulong pc_save;
 target_ulong priv_ver;
 RISCVMXL misa_mxl_max;
 RISCVMXL xl;
@@ -228,15 +229,24 @@ static void gen_pc_plus_diff(TCGv target, DisasContext *ctx,
 {
 target_ulong dest = ctx->base.pc_next + diff;
 
-
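
The hunk above is cut short in the archive. For context, a minimal
sketch of what the CF_PCREL-aware helper plausibly becomes - inferred
from the other patches in this series rather than copied from the
missing hunk. Under CF_PCREL the target is computed relative to the
run-time cpu_pc (last known at pc_save) instead of being materialized
as an absolute value:

static void gen_pc_plus_diff(TCGv target, DisasContext *ctx,
                             target_long diff)
{
    target_ulong dest = ctx->base.pc_next + diff;

    if (tb_cflags(ctx->base.tb) & CF_PCREL) {
        /* pc_save holds the pc value last written to cpu_pc */
        tcg_gen_addi_tl(target, cpu_pc, dest - ctx->pc_save);
        if (get_xl(ctx) == MXL_RV32) {
            tcg_gen_ext32s_tl(target, target);
        }
    } else {
        if (get_xl(ctx) == MXL_RV32) {
            dest = (int32_t)dest;
        }
        tcg_gen_movi_tl(target, dest);
    }
}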

[PATCH 0/7] target/riscv: Add support for PC-relative translation

2023-04-09 Thread Weiwei Li
This patchset tries to add support for PC-relative translation.

The existence of CF_PCREL can improve performance with the guest
kernel's address space randomization.  Each guest process maps libc.so
(et al) at a different virtual address, and this allows those
translations to be shared.

And support for PC-relative translation is a precondition for supporting
pointer masking for instructions.

The port is available here:
https://github.com/plctlab/plct-qemu/tree/plct-pcrel-upstream

Weiwei Li (7):
  target/riscv: Fix target address to update badaddr
  target/riscv: Introduce cur_insn_len into DisasContext
  target/riscv: Change gen_goto_tb to work on displacements
  target/riscv: Change gen_set_pc_imm to gen_update_pc
  target/riscv: Use true diff for gen_pc_plus_diff
  target/riscv: Enable PC-relative translation
  target/riscv: Remove pc_succ_insn from DisasContext

 target/riscv/cpu.c| 31 +--
 .../riscv/insn_trans/trans_privileged.c.inc   |  2 +-
 target/riscv/insn_trans/trans_rvi.c.inc   | 43 ++---
 target/riscv/insn_trans/trans_rvv.c.inc   |  4 +-
 target/riscv/insn_trans/trans_rvzawrs.c.inc   |  2 +-
 target/riscv/insn_trans/trans_xthead.c.inc|  2 +-
 target/riscv/translate.c  | 92 +--
 7 files changed, 117 insertions(+), 59 deletions(-)

-- 
2.25.1




[PATCH v3] linux-user: fix getgroups/setgroups allocations

2023-04-09 Thread Michael Tokarev
linux-user getgroups(), setgroups(), getgroups32() and setgroups32()
used alloca() to allocate grouplist arrays, with an unchecked gidsetsize
coming from the "guest".  With NGROUPS_MAX being 65536 (on linux, and it
is common for an application to allocate NGROUPS_MAX for getgroups()),
this means a typical allocation is half a megabyte on the stack, which
just overflows the stack and leads to an immediate SIGSEGV in the actual
system getgroups() implementation.

An example of such issue is aptitude, eg
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=811087#72

Cap gidsetsize to NGROUPS_MAX (return EINVAL if it is larger than that),
and use heap allocation for grouplist instead of alloca().  While at it,
fix coding style and make all 4 implementations identical.

Try to not impose random limits - for example, allow gidsetsize to be
negative for getgroups() - just do not allocate a negative-sized grouplist
in this case but still do the actual getgroups() call.  But do not allow
a negative gidsetsize for setgroups() since its argument is unsigned.

Capping at NGROUPS_MAX seems a bit arbitrary - we could allow more; it is
not an error if the set size is NGROUPS_MAX+1. But we should not allow
integer overflow for the array being allocated. Maybe it is enough to
just call g_try_new() and return ENOMEM if it fails.

Maybe there's also no need to convert setgroups() since this one is
usually smaller and known beforehand (KERN_NGROUPS_MAX is actually 63 -
apparently a kernel-imposed limit on the runtime group set).

The patch fixes aptitude segfault mentioned above.

Signed-off-by: Michael Tokarev 
---
v3:
 - fix a bug in getgroups(). In initial implementation I checked
   for ret>0 in order to convert returned list of groups to target
   byte order. But this clashes with unusual corner case for this
   syscall: getgroups(0,NULL) return current number of groups in
   the set, so this resulted in writing to *NULL. The right condition
   here is gidsetsize>0:
   -if (!is_error(ret) && ret > 0) {
   +if (!is_error(ret) && gidsetsize > 0) {
v2:
 - remove g_free, use g_autofree annotations instead,
 - a bit more coding style changes, makes checkpatch.pl happy

 linux-user/syscall.c | 99 ++--
 1 file changed, 68 insertions(+), 31 deletions(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 24b25759be..52020c5aed 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -11433,39 +11433,58 @@ static abi_long do_syscall1(CPUArchState *cpu_env, int num, abi_long arg1,
 {
 int gidsetsize = arg1;
 target_id *target_grouplist;
-gid_t *grouplist;
+g_autofree gid_t *grouplist = NULL;
 int i;
 
-grouplist = alloca(gidsetsize * sizeof(gid_t));
+if (gidsetsize > NGROUPS_MAX) {
+return -TARGET_EINVAL;
+}
+if (gidsetsize > 0) {
+grouplist = g_try_new(gid_t, gidsetsize);
+if (!grouplist) {
+return -TARGET_ENOMEM;
+}
+}
 ret = get_errno(getgroups(gidsetsize, grouplist));
-if (gidsetsize == 0)
-return ret;
-if (!is_error(ret)) {
-target_grouplist = lock_user(VERIFY_WRITE, arg2, gidsetsize * sizeof(target_id), 0);
-if (!target_grouplist)
+if (!is_error(ret) && gidsetsize > 0) {
+target_grouplist = lock_user(VERIFY_WRITE, arg2,
+ gidsetsize * sizeof(target_id), 0);
+if (!target_grouplist) {
 return -TARGET_EFAULT;
-for(i = 0;i < ret; i++)
+}
+for (i = 0; i < ret; i++) {
 target_grouplist[i] = tswapid(high2lowgid(grouplist[i]));
-unlock_user(target_grouplist, arg2, gidsetsize * sizeof(target_id));
+}
+unlock_user(target_grouplist, arg2,
+gidsetsize * sizeof(target_id));
 }
+return ret;
 }
-return ret;
 case TARGET_NR_setgroups:
 {
 int gidsetsize = arg1;
 target_id *target_grouplist;
-gid_t *grouplist = NULL;
+g_autofree gid_t *grouplist = NULL;
 int i;
-if (gidsetsize) {
-grouplist = alloca(gidsetsize * sizeof(gid_t));
-target_grouplist = lock_user(VERIFY_READ, arg2, gidsetsize * sizeof(target_id), 1);
+
+if (gidsetsize > NGROUPS_MAX || gidsetsize < 0) {
+return -TARGET_EINVAL;
+}
+if (gidsetsize > 0) {
+grouplist = g_try_new(gid_t, gidsetsize);
+if (!grouplist) {
+return -TARGET_ENOMEM;
+}
+target_grouplist = lock_user(VERIFY_READ, arg2,
+ 

Re: [PATCH v2] linux-user: fix getgroups/setgroups allocations

2023-04-09 Thread Michael Tokarev

09.04.2023 12:06, Michael Tokarev пишет:
..


Laurent, can you describe what you're doing there in a bit more detail please?
I'd love to fix this one.  Do you know which test is/was failing?


Oh, n/m, I found it: it's getgroups01 test in 
testcases/kernel/syscalls/getgroups/.

/mjt



Re: [PATCH v2] linux-user: fix getgroups/setgroups allocations

2023-04-09 Thread Michael Tokarev

04.02.2023 01:23, Laurent Vivier wrote:

Le 03/02/2023 à 22:57, Richard Henderson a écrit :

On 2/3/23 11:49, Laurent Vivier wrote:


..

I'm going to remove this patch from my branch because it breaks something.

When I execute LTP test suite (20200930), I have:

getgroups01    1  TPASS  :  getgroups failed as expected with EINVAL
**
ERROR:../../../Projects/qemu/accel/tcg/cpu-exec.c:998:cpu_exec_setjmp: 
assertion failed: (cpu == current_cpu)
Bail out! 
ERROR:../../../Projects/qemu/accel/tcg/cpu-exec.c:998:cpu_exec_setjmp: 
assertion failed: (cpu == current_cpu)


Which host+guest?


host is x86_64 + Fedora 37 (QEMU needs to be patched to detect missing libdw 
static library)
guest is all, replicated with m68k/sid

Note that the error is generated with the test case that expects EINVAL.


Hm. I missed this one at the time.  The patch has been in Debian qemu for
quite some time without any issues so far.

Laurent, can you describe what you're doing there in a bit more detail please?
I'd love to fix this one.  Do you know which test is/was failing?

Thanks,

/mjt



qemu-user: avoid allocations to convert stuff when not necessary

2023-04-09 Thread Michael Tokarev

Hi!

In the qemu-user case, we allocate various structures and arrays
for conversion of data between host and guest byte orders and sizes.
But it is actually not necessary to do such allocation when the
*size* is the same and only the byte order is different, because the
conversion can be done in-place.  Does it make any sense to avoid
allocations in such cases?

There are 2 issues with this though. First, in some cases
the data being converted is const, and we may end up writing to
data that resides in a read-only segment - is that ever possible?  And
second - it is not entirely clear what to do in case the syscall
returns an error.

Thanks,

/mjt