[PATCH] i386/cpu: Drop the check of phys_bits in host_cpu_realizefn()

2024-07-04 Thread Xiaoyao Li
The check of cpu->phys_bits to be in range between
[32, TARGET_PHYS_ADDR_SPACE_BITS] in host_cpu_realizefn()
is duplicated with check in x86_cpu_realizefn().

Since the ckeck in x86_cpu_realizefn() is called later and can cover all
teh x86 case. Remove the one in host_cpu_realizefn().

Signed-off-by: Xiaoyao Li 
---
 target/i386/host-cpu.c | 12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/target/i386/host-cpu.c b/target/i386/host-cpu.c
index 8b8bf5afeccf..b109c1a2221f 100644
--- a/target/i386/host-cpu.c
+++ b/target/i386/host-cpu.c
@@ -75,17 +75,7 @@ bool host_cpu_realizefn(CPUState *cs, Error **errp)
 CPUX86State *env = >env;
 
 if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
-uint32_t phys_bits = host_cpu_adjust_phys_bits(cpu);
-
-if (phys_bits &&
-(phys_bits > TARGET_PHYS_ADDR_SPACE_BITS ||
- phys_bits < 32)) {
-error_setg(errp, "phys-bits should be between 32 and %u "
-   " (but is %u)",
-   TARGET_PHYS_ADDR_SPACE_BITS, phys_bits);
-return false;
-}
-cpu->phys_bits = phys_bits;
+cpu->phys_bits = host_cpu_adjust_phys_bits(cpu);
 }
 return true;
 }
-- 
2.34.1




Re: [PATCH v4 20/31] i386/sev: Add support for SNP CPUID validation

2024-07-03 Thread Xiaoyao Li

On 7/4/2024 8:34 AM, Michael Roth wrote:

On Tue, Jul 02, 2024 at 11:07:18AM +0800, Xiaoyao Li wrote:

On 5/30/2024 7:16 PM, Pankaj Gupta wrote:

From: Michael Roth 

SEV-SNP firmware allows a special guest page to be populated with a
table of guest CPUID values so that they can be validated through
firmware before being loaded into encrypted guest memory where they can
be used in place of hypervisor-provided values[1].

As part of SEV-SNP guest initialization, use this interface to validate
the CPUID entries reported by KVM_GET_CPUID2 prior to initial guest
start and populate the CPUID page reserved by OVMF with the resulting
encrypted data.


How is KVM CPUIDs (leaf 0x4001) validated?

I suppose not all KVM_FEATURE_XXX are supported for SNP guest. And SNP
firmware doesn't validate such CPUID range. So how does them get validated?


This rules for CPUID enforcement are documented in the PPR for each AMD
CPU model in Chapter 2, section "CPUID Policy Enforcement". For the
situation you mentioned, it's stated there that:

   The PSP enforces the following policy:
   - If the CPUID function is not in the standard range (Fn through
 Fn) or the extended range
 (Fn8000_ through Fn8000_), the function output check is
 UnChecked.
   - If the CPUID function is in the standard or extended range and the
 function is not listed in SEV-SNP CPUID
 Policy table, then the output check is Strict and required to be 0. Note
 that if the CPUID function does not depend
 on ECX and/or XCR0, then the PSP policy ignores those inputs,
 respectively.
   - Otherwise, the check is defined according to the values listed in
 SEV-SNP CPUID Policy table.

So there are specific ranges that are checked, mainly ones where there
is potential for guests to misbehave if they are being lied to. But
hypervisor-ranges are paravirtual in a sense so there's no assumptions
being made about what the underlying hardware is doing, so the checks
are needed as much in those cases.


I'm a little confused. Per your reference above, hypervisor-ranges is 
unchecked because it's not in the standard range nor the extended range.


And your last sentence said "so the checks are needed as much in those 
cases". So how does hypervisor-ranges get checked?



-Mike




[1] SEV SNP Firmware ABI Specification, Rev. 0.8, 8.13.2.6








Re: [PATCH 4/4] target/i386: Update CMPLegacy handling for Zhaoxin and VIA CPUs

2024-07-03 Thread Xiaoyao Li

On 7/4/2024 11:14 AM, Ewan Hai wrote:

On 7/3/24 10:49, Xiaoyao Li wrote:


On 6/25/2024 5:19 PM, EwanHai wrote:

Zhaoxin and VIA CPUs handle the CMPLegacy bit in the same way
as Intel CPUs. This patch simplifies the existing logic by
using the IS_XXX_CPU macro and includes checks for Zhaoxin
and VIA vendors to align their behavior with Intel.

Signed-off-by: EwanHai 
---
  target/i386/cpu.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 50edff077e..0836416617 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6945,9 +6945,9 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t 
index, uint32_t count,

   * So don't set it here for Intel to make Linux guests happy.
   */
  if (threads_per_pkg > 1) {
-    if (env->cpuid_vendor1 != CPUID_VENDOR_INTEL_1 ||
-    env->cpuid_vendor2 != CPUID_VENDOR_INTEL_2 ||
-    env->cpuid_vendor3 != CPUID_VENDOR_INTEL_3) {
+    if (!IS_INTEL_CPU(env) &&
+    !IS_ZHAOXIN_CPU(env) &&
+    !IS_VIA_CPU(env)) {


it seems you added ! by mistake.


  *ecx |= 1 << 1; /* CmpLegacy bit */
  }
  }


For CPUID leaf 0x8001 ECX bit 1, Intel defines it as "Bits 04-01: 
Reserved,"
whereas AMD defines it as "CmpLegacy, Core multi-processing legacy 
mode." For Intel
CPUs and those following Intel's behavior, this bit should not be set to 
1. Therefore,

I believe the "!" here is correct.



Sorry, I misread the original code.

I think maybe we can just use is_AMD_CPU(). But I'm not sure if any 
magic use case with customized VENDOR ID relies on it. So you code looks 
good to me.




Re: [PATCH 4/4] target/i386: Update CMPLegacy handling for Zhaoxin and VIA CPUs

2024-07-03 Thread Xiaoyao Li

On 6/25/2024 5:19 PM, EwanHai wrote:

Zhaoxin and VIA CPUs handle the CMPLegacy bit in the same way
as Intel CPUs. This patch simplifies the existing logic by
using the IS_XXX_CPU macro and includes checks for Zhaoxin
and VIA vendors to align their behavior with Intel.

Signed-off-by: EwanHai 
---
  target/i386/cpu.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 50edff077e..0836416617 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6945,9 +6945,9 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
   * So don't set it here for Intel to make Linux guests happy.
   */
  if (threads_per_pkg > 1) {
-if (env->cpuid_vendor1 != CPUID_VENDOR_INTEL_1 ||
-env->cpuid_vendor2 != CPUID_VENDOR_INTEL_2 ||
-env->cpuid_vendor3 != CPUID_VENDOR_INTEL_3) {
+if (!IS_INTEL_CPU(env) &&
+!IS_ZHAOXIN_CPU(env) &&
+!IS_VIA_CPU(env)) {


it seems you added ! by mistake.


  *ecx |= 1 << 1;/* CmpLegacy bit */
  }
  }





Re: [PATCH v4 20/31] i386/sev: Add support for SNP CPUID validation

2024-07-01 Thread Xiaoyao Li

On 5/30/2024 7:16 PM, Pankaj Gupta wrote:

From: Michael Roth 

SEV-SNP firmware allows a special guest page to be populated with a
table of guest CPUID values so that they can be validated through
firmware before being loaded into encrypted guest memory where they can
be used in place of hypervisor-provided values[1].

As part of SEV-SNP guest initialization, use this interface to validate
the CPUID entries reported by KVM_GET_CPUID2 prior to initial guest
start and populate the CPUID page reserved by OVMF with the resulting
encrypted data.


How is KVM CPUIDs (leaf 0x4001) validated?

I suppose not all KVM_FEATURE_XXX are supported for SNP guest. And SNP 
firmware doesn't validate such CPUID range. So how does them get validated?



[1] SEV SNP Firmware ABI Specification, Rev. 0.8, 8.13.2.6






Re: [PATCH 2/2] target/i386: drop AMD machine check bits from Intel CPUID

2024-06-28 Thread Xiaoyao Li

On 6/27/2024 10:06 PM, Paolo Bonzini wrote:

The recent addition of the SUCCOR bit to kvm_arch_get_supported_cpuid()
causes the bit to be visible when "-cpu host" VMs are started on Intel
processors.

While this should in principle be harmless, it's not tidy and we don't
even know for sure that it doesn't cause any guest OS to take unexpected
paths.  Since x86_cpu_get_supported_feature_word() can return different
different values depending on the guest, adjust it to hide the SUCCOR


superfluous different


bit if the guest has non-AMD vendor.


It seems to adjust it based on vendor in kvm_arch_get_supported_cpuid() 
is better than in x86_cpu_get_supported_feature_word(). Otherwise 
kvm_arch_get_supported_cpuid() still returns "risky" value for Intel VMs.




Suggested-by: Xiaoyao Li 
Cc: John Allen 
Signed-off-by: Paolo Bonzini 
---
  target/i386/cpu.c | 16 +++-
  1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index deb58670651..f3e9b543682 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6064,8 +6064,10 @@ uint64_t x86_cpu_get_supported_feature_word(X86CPU *cpu, 
FeatureWord w)
  } else {
  return ~0;
  }
+
+switch (w) {
  #ifndef TARGET_X86_64
-if (w == FEAT_8000_0001_EDX) {
+case FEAT_8000_0001_EDX:
  /*
   * 32-bit TCG can emulate 64-bit compatibility mode.  If there is no
   * way for userspace to get out of its 32-bit jail, we can leave
@@ -6077,6 +6079,18 @@ uint64_t x86_cpu_get_supported_feature_word(X86CPU *cpu, 
FeatureWord w)
  r &= ~unavail;
  break;
  #endif
+
+case FEAT_8000_0007_EBX:
+if (cpu && !IS_AMD_CPU(>env)) {
+/* Disable AMD machine check architecture for Intel CPU.  */
+r = 0;
+}
+break;
+
+default:
+break;
+}
+
  if (cpu && cpu->migratable) {
  r &= x86_cpu_get_migratable_flags(w);
  }





Re: [PATCH 1/2] target/i386: pass X86CPU to x86_cpu_get_supported_feature_word

2024-06-28 Thread Xiaoyao Li

On 6/27/2024 10:06 PM, Paolo Bonzini wrote:

This allows modifying the bits in "-cpu max"/"-cpu host" depending on
the guest CPU vendor (which, at least by default, is the host vendor in
the case of KVM).

For example, machine check architecture differs between Intel and AMD,
and bits from AMD should be dropped when configuring the guest for
an Intel model.

Cc: Xiaoyao Li 
Cc: John Allen 
Signed-off-by: Paolo Bonzini 
---
  target/i386/cpu.h |  3 +--
  target/i386/cpu.c | 13 ++---
  target/i386/kvm/kvm-cpu.c |  2 +-
  3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 29daf370485..9bea7142bf4 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -666,8 +666,7 @@ typedef enum FeatureWord {
  } FeatureWord;
  
  typedef uint64_t FeatureWordArray[FEATURE_WORDS];

-uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
-bool migratable_only);
+uint64_t x86_cpu_get_supported_feature_word(X86CPU *cpu, FeatureWord w);
  
  /* cpuid_features bits */

  #define CPUID_FP87 (1U << 0)
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 4c2e6f3a71e..deb58670651 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6035,8 +6035,7 @@ CpuDefinitionInfoList *qmp_query_cpu_definitions(Error 
**errp)
  
  #endif /* !CONFIG_USER_ONLY */
  
-uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,

-bool migratable_only)
+uint64_t x86_cpu_get_supported_feature_word(X86CPU *cpu, FeatureWord w)
  {
  FeatureWordInfo *wi = _word_info[w];
  uint64_t r = 0;
@@ -6076,9 +6075,9 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
  ? CPUID_EXT2_LM & ~CPUID_EXT2_KERNEL_FEATURES
  : CPUID_EXT2_LM;
  r &= ~unavail;
-}
+break;


It seems some leftover when spliting from next patch. Need to remove it.

Other than it,

Reviewed-by: Xiaoyao Li 


  #endif
-if (migratable_only) {
+if (cpu && cpu->migratable) {
  r &= x86_cpu_get_migratable_flags(w);
  }
  return r;
@@ -7371,7 +7370,7 @@ void x86_cpu_expand_features(X86CPU *cpu, Error **errp)
   * by the user.
   */
  env->features[w] |=
-x86_cpu_get_supported_feature_word(w, cpu->migratable) &
+x86_cpu_get_supported_feature_word(cpu, w) &
  ~env->user_features[w] &
  ~feature_word_info[w].no_autoenable_flags;
  }
@@ -7497,7 +7496,7 @@ static void x86_cpu_filter_features(X86CPU *cpu, bool 
verbose)
  
  for (w = 0; w < FEATURE_WORDS; w++) {

  uint64_t host_feat =
-x86_cpu_get_supported_feature_word(w, false);
+x86_cpu_get_supported_feature_word(NULL, w);
  uint64_t requested_features = env->features[w];
  uint64_t unavailable_features = requested_features & ~host_feat;
  mark_unavailable_features(cpu, w, unavailable_features, prefix);
@@ -7617,7 +7616,7 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
  env->features[FEAT_PERF_CAPABILITIES] & PERF_CAP_LBR_FMT;
  if (requested_lbr_fmt && kvm_enabled()) {
  uint64_t host_perf_cap =
-x86_cpu_get_supported_feature_word(FEAT_PERF_CAPABILITIES, false);
+x86_cpu_get_supported_feature_word(NULL, FEAT_PERF_CAPABILITIES);
  unsigned host_lbr_fmt = host_perf_cap & PERF_CAP_LBR_FMT;
  
  if (!cpu->enable_pmu) {

diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
index f9b99b5f50d..b2735d6efee 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -136,7 +136,7 @@ static void kvm_cpu_xsave_init(void)
  if (!esa->size) {
  continue;
  }
-if ((x86_cpu_get_supported_feature_word(esa->feature, false) & 
esa->bits)
+if ((x86_cpu_get_supported_feature_word(NULL, esa->feature) & 
esa->bits)
  != esa->bits) {
  continue;
  }





Re: [PATCH v5 25/65] i386/tdx: Add property sept-ve-disable for tdx-guest object

2024-06-26 Thread Xiaoyao Li

On 6/24/2024 11:01 PM, Daniel P. Berrangé wrote:

On Fri, Jun 14, 2024 at 08:49:57AM +0100, Daniel P. Berrangé wrote:

On Fri, Jun 14, 2024 at 09:04:33AM +0800, Xiaoyao Li wrote:

On 6/13/2024 4:35 PM, Duan, Zhenzhong wrote:




-Original Message-
From: Li, Xiaoyao 
Subject: Re: [PATCH v5 25/65] i386/tdx: Add property sept-ve-disable for
tdx-guest object

On 6/6/2024 6:45 PM, Daniel P. Berrangé wrote:

Copying  Zhenzhong Duan as my point relates to the proposed libvirt
TDX patches.

On Thu, Feb 29, 2024 at 01:36:46AM -0500, Xiaoyao Li wrote:

Bit 28 of TD attribute, named SEPT_VE_DISABLE. When set to 1, it

disables

EPT violation conversion to #VE on guest TD access of PENDING pages.

Some guest OS (e.g., Linux TD guest) may require this bit as 1.
Otherwise refuse to boot.

Add sept-ve-disable property for tdx-guest object, for user to configure
this bit.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
Acked-by: Markus Armbruster 
---
Changes in v4:
- collect Acked-by from Markus

Changes in v3:
- update the comment of property @sept-ve-disable to make it more
 descriptive and use new format. (Daniel and Markus)
---
qapi/qom.json |  7 ++-
target/i386/kvm/tdx.c | 24 
2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 220cc6c98d4b..89ed89b9b46e 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -900,10 +900,15 @@
#
# Properties for tdx-guest objects.
#
+# @sept-ve-disable: toggle bit 28 of TD attributes to control disabling
+# of EPT violation conversion to #VE on guest TD access of PENDING
+# pages.  Some guest OS (e.g., Linux TD guest) may require this to
+# be set, otherwise they refuse to boot.
+#
# Since: 9.0
##
{ 'struct': 'TdxGuestProperties',
-  'data': { }}
+  'data': { '*sept-ve-disable': 'bool' } }


So this exposes a single boolean property that gets mapped into one
specific bit in the TD attributes:


+
+static void tdx_guest_set_sept_ve_disable(Object *obj, bool value, Error

**errp)

+{
+TdxGuest *tdx = TDX_GUEST(obj);
+
+if (value) {
+tdx->attributes |= TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
+} else {
+tdx->attributes &= ~TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
+}
+}


If I look at the documentation for TD attributes

 https://download.01.org/intel-sgx/latest/dcap-

latest/linux/docs/Intel_TDX_DCAP_Quoting_Library_API.pdf


Section "A.3.4. TD Attributes"

I see "TD attributes" is a 64-bit int, with 5 bits currently
defined "DEBUG", "SEPT_VE_DISABLE", "PKS", "PL", "PERFMON",
and the rest currently reserved for future use. This makes me
wonder about our modelling approach into the future ?

For the AMD SEV equivalent we've just directly exposed the whole
field as an int:

'policy' : 'uint32',

For the proposed SEV-SNP patches, the same has been done again

https://lists.nongnu.org/archive/html/qemu-devel/2024-

06/msg00536.html


'*policy': 'uint64',


The advantage of exposing individual booleans is that it is
self-documenting at the QAPI level, but the disadvantage is
that every time we want to expose ability to control a new
bit in the policy we have to modify QEMU, libvirt, the mgmt
app above libvirt, and whatever tools the end user has to
talk to the mgmt app.

If we expose a policy int, then newly defined bits only require
a change in QEMU, and everything above QEMU will already be
capable of setting it.

In fact if I look at the proposed libvirt patches, they have
proposed just exposing a policy "int" field in the XML, which
then has to be unpacked to set the individual QAPI booleans



https://lists.libvirt.org/archives/list/de...@lists.libvirt.org/message/WXWX
EESYUA77DP7YIBP55T2OPSVKV5QW/


On balance, I think it would be better if QEMU just exposed
the raw TD attributes policy as an uint64 at QAPI, instead
of trying to unpack it to discrete bool fields. This gives
consistency with SEV and SEV-SNP, and with what's proposed
at the libvirt level, and minimizes future changes when
more policy bits are defined.


The reasons why introducing individual bit of sept-ve-disable instead of
a raw TD attribute as a whole are that

1. other bits like perfmon, PKS, KL are associated with cpu properties,
e.g.,

perfmon -> pmu,
pks -> pks,
kl -> keylokcer feature that QEMU currently doesn't support

If allowing configuring attribute directly, we need to deal with the
inconsistence between attribute vs cpu property.


What about defining those bits associated with cpu properties reserved
But other bits work as Daniel suggested way.


I don't understand. Do you mean we provide the interface to configure raw 64
bit attributes while some bits of it are reserved?


Just have a mask of what bits are permitted to be set, and report an
error if the user sets non-permitted bits.


Looking at the I

Re: [PATCH v4 28/31] hw/i386: Add support for loading BIOS using guest_memfd

2024-06-14 Thread Xiaoyao Li

On 6/14/2024 4:48 PM, Gupta, Pankaj wrote:

On 6/14/2024 10:34 AM, Xiaoyao Li wrote:

On 5/30/2024 7:16 PM, Pankaj Gupta wrote:

From: Michael Roth 

When guest_memfd is enabled, the BIOS is generally part of the initial
encrypted guest image and will be accessed as private guest memory. Add
the necessary changes to set up the associated RAM region with a
guest_memfd backend to allow for this.

Current support centers around using -bios to load the BIOS data.
Support for loading the BIOS via pflash requires additional enablement
since those interfaces rely on the use of ROM memory regions which make
use of the KVM_MEM_READONLY memslot flag, which is not supported for
guest_memfd-backed memslots.

Signed-off-by: Michael Roth 
Signed-off-by: Pankaj Gupta 
---
  hw/i386/x86-common.c | 22 --
  1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/hw/i386/x86-common.c b/hw/i386/x86-common.c
index f41cb0a6a8..059de65f36 100644
--- a/hw/i386/x86-common.c
+++ b/hw/i386/x86-common.c
@@ -999,10 +999,18 @@ void x86_bios_rom_init(X86MachineState *x86ms, 
const char *default_firmware,

  }
  if (bios_size <= 0 ||
  (bios_size % 65536) != 0) {
-    goto bios_error;
+    if (!machine_require_guest_memfd(MACHINE(x86ms))) {
+    g_warning("%s: Unaligned BIOS size %d", __func__, 
bios_size);

+    goto bios_error;
+    }
+    }
+    if (machine_require_guest_memfd(MACHINE(x86ms))) {
+    memory_region_init_ram_guest_memfd(>bios, NULL, 
"pc.bios",

+   bios_size, _fatal);
+    } else {
+    memory_region_init_ram(>bios, NULL, "pc.bios",
+   bios_size, _fatal);
  }
-    memory_region_init_ram(>bios, NULL, "pc.bios", bios_size,
-   _fatal);
  if (sev_enabled()) {
  /*
   * The concept of a "reset" simply doesn't exist for
@@ -1023,9 +1031,11 @@ void x86_bios_rom_init(X86MachineState *x86ms, 
const char *default_firmware,

  }
  g_free(filename);
-    /* map the last 128KB of the BIOS in ISA space */
-    x86_isa_bios_init(>isa_bios, rom_memory, >bios,
-  !isapc_ram_fw);
+    if (!machine_require_guest_memfd(MACHINE(x86ms))) {
+    /* map the last 128KB of the BIOS in ISA space */
+    x86_isa_bios_init(>isa_bios, rom_memory, >bios,
+  !isapc_ram_fw);
+    }


Could anyone explain to me why above change is related to this patch 
and why need it?


because inside x86_isa_bios_init(), the alias isa_bios is set to 
read_only while guest_memfd doesn't support readonly?


I could not understand your comment entirely. This condition is for non 
guest_memfd case? You expect something else?


I'm asking why x86_isa_bios_init() cannot be called when machine 
requires guest memfd.


Please see my comment[1] to previous patch, patch 27, the two patches 
conflict with each other. The two patches did lack the clarification on 
the changes it made. sigh...


[1] 
https://lore.kernel.org/qemu-devel/ce895ad3-7a84-4af1-8927-6e85f60ef...@intel.com/



Thanks,
Pankaj



  /* map all the bios at the top of memory */
  memory_region_add_subregion(rom_memory,









Re: [PATCH v4 27/31] hw/i386/sev: Use guest_memfd for legacy ROMs

2024-06-14 Thread Xiaoyao Li

On 5/30/2024 7:16 PM, Pankaj Gupta wrote:

From: Michael Roth 

Current SNP guest kernels will attempt to access these regions with
with C-bit set, so guest_memfd is needed to handle that. Otherwise,
kvm_convert_memory() will fail when the guest kernel tries to access it
and QEMU attempts to call KVM_SET_MEMORY_ATTRIBUTES to set these ranges
to private.

Whether guests should actually try to access ROM regions in this way (or
need to deal with legacy ROM regions at all), is a separate issue to be
addressed on kernel side, but current SNP guest kernels will exhibit
this behavior and so this handling is needed to allow QEMU to continue
running existing SNP guest kernels.

Signed-off-by: Michael Roth 
[pankaj: Added sev_snp_enabled() check]
Signed-off-by: Pankaj Gupta 
---
  hw/i386/pc.c   | 14 ++
  hw/i386/pc_sysfw.c | 13 ++---
  2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 7b638da7aa..62c25ea1e9 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -62,6 +62,7 @@
  #include "hw/mem/memory-device.h"
  #include "e820_memory_layout.h"
  #include "trace.h"
+#include "sev.h"
  #include CONFIG_DEVICES
  
  #ifdef CONFIG_XEN_EMU

@@ -1022,10 +1023,15 @@ void pc_memory_init(PCMachineState *pcms,
  pc_system_firmware_init(pcms, rom_memory);
  
  option_rom_mr = g_malloc(sizeof(*option_rom_mr));

-memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
-   _fatal);
-if (pcmc->pci_enabled) {
-memory_region_set_readonly(option_rom_mr, true);
+if (sev_snp_enabled()) {
+memory_region_init_ram_guest_memfd(option_rom_mr, NULL, "pc.rom",
+   PC_ROM_SIZE, _fatal);
+} else {
+memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
+   _fatal);
+if (pcmc->pci_enabled) {
+memory_region_set_readonly(option_rom_mr, true);
+}
  }
  memory_region_add_subregion_overlap(rom_memory,
  PC_ROM_MIN_VGA,
diff --git a/hw/i386/pc_sysfw.c b/hw/i386/pc_sysfw.c
index 00464afcb4..def77a442d 100644
--- a/hw/i386/pc_sysfw.c
+++ b/hw/i386/pc_sysfw.c
@@ -51,8 +51,13 @@ static void pc_isa_bios_init(MemoryRegion *isa_bios, 
MemoryRegion *rom_memory,
  
  /* map the last 128KB of the BIOS in ISA space */

  isa_bios_size = MIN(flash_size, 128 * KiB);
-memory_region_init_ram(isa_bios, NULL, "isa-bios", isa_bios_size,
-   _fatal);
+if (sev_snp_enabled()) {
+memory_region_init_ram_guest_memfd(isa_bios, NULL, "isa-bios",
+   isa_bios_size, _fatal);
+} else {
+memory_region_init_ram(isa_bios, NULL, "isa-bios", isa_bios_size,
+   _fatal);
+}
  memory_region_add_subregion_overlap(rom_memory,
  0x10 - isa_bios_size,
  isa_bios,
@@ -65,7 +70,9 @@ static void pc_isa_bios_init(MemoryRegion *isa_bios, 
MemoryRegion *rom_memory,
 ((uint8_t*)flash_ptr) + (flash_size - isa_bios_size),
 isa_bios_size);
  
-memory_region_set_readonly(isa_bios, true);

+if (!machine_require_guest_memfd(current_machine)) {
+memory_region_set_readonly(isa_bios, true);
+}


This patch takes different approach than next patch that this patch 
chooses to not set readonly when require guest memfd while next patch 
skips the whole isa_bios setup for -bios case. Why make different 
handling for the two cases?


More importantly, with commit a44ea3fa7f2a,
pcmc->isa_bios_alias is default true for all new machine after 9.0, then 
pc_isa_bios_init() would be hit even on plash case. It will call 
x86_isa_bios_init() in pc_system_flash_map().


So with -bios case, the call site is

  pc_system_firmware_init()
 -> x86_bios_rom_init()
-> x86_isa_bios_init()

  because require_guest_memfd is true for snp, x86_isa_bios_init() is 
not called.


However, with pflash case, the call site is

  pc_system_firmware_init()
 -> pc_system_flash_map()
 -> if (pcmc->isa_bios_alias) {
x86_isa_bios_init();
} else {
pc_isa_bios_init();
}

As I said above, pcmc->isa_bios_alias is true for machine after 9.0, so 
it will goes x86_isa_bios_init().


So please anyone explain to me why x86_isa_bios_init() is ok for pflash 
case but not for -bios case?



  }
  
  static PFlashCFI01 *pc_pflash_create(PCMachineState *pcms,





Re: [PATCH v4 28/31] hw/i386: Add support for loading BIOS using guest_memfd

2024-06-14 Thread Xiaoyao Li

On 5/30/2024 7:16 PM, Pankaj Gupta wrote:

From: Michael Roth 

When guest_memfd is enabled, the BIOS is generally part of the initial
encrypted guest image and will be accessed as private guest memory. Add
the necessary changes to set up the associated RAM region with a
guest_memfd backend to allow for this.

Current support centers around using -bios to load the BIOS data.
Support for loading the BIOS via pflash requires additional enablement
since those interfaces rely on the use of ROM memory regions which make
use of the KVM_MEM_READONLY memslot flag, which is not supported for
guest_memfd-backed memslots.

Signed-off-by: Michael Roth 
Signed-off-by: Pankaj Gupta 
---
  hw/i386/x86-common.c | 22 --
  1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/hw/i386/x86-common.c b/hw/i386/x86-common.c
index f41cb0a6a8..059de65f36 100644
--- a/hw/i386/x86-common.c
+++ b/hw/i386/x86-common.c
@@ -999,10 +999,18 @@ void x86_bios_rom_init(X86MachineState *x86ms, const char 
*default_firmware,
  }
  if (bios_size <= 0 ||
  (bios_size % 65536) != 0) {
-goto bios_error;
+if (!machine_require_guest_memfd(MACHINE(x86ms))) {
+g_warning("%s: Unaligned BIOS size %d", __func__, bios_size);
+goto bios_error;
+}
+}
+if (machine_require_guest_memfd(MACHINE(x86ms))) {
+memory_region_init_ram_guest_memfd(>bios, NULL, "pc.bios",
+   bios_size, _fatal);
+} else {
+memory_region_init_ram(>bios, NULL, "pc.bios",
+   bios_size, _fatal);
  }
-memory_region_init_ram(>bios, NULL, "pc.bios", bios_size,
-   _fatal);
  if (sev_enabled()) {
  /*
   * The concept of a "reset" simply doesn't exist for
@@ -1023,9 +1031,11 @@ void x86_bios_rom_init(X86MachineState *x86ms, const 
char *default_firmware,
  }
  g_free(filename);
  
-/* map the last 128KB of the BIOS in ISA space */

-x86_isa_bios_init(>isa_bios, rom_memory, >bios,
-  !isapc_ram_fw);
+if (!machine_require_guest_memfd(MACHINE(x86ms))) {
+/* map the last 128KB of the BIOS in ISA space */
+x86_isa_bios_init(>isa_bios, rom_memory, >bios,
+  !isapc_ram_fw);
+}


Could anyone explain to me why above change is related to this patch and 
why need it?


because inside x86_isa_bios_init(), the alias isa_bios is set to 
read_only while guest_memfd doesn't support readonly?



  /* map all the bios at the top of memory */
  memory_region_add_subregion(rom_memory,





Re: [PATCH v5 25/65] i386/tdx: Add property sept-ve-disable for tdx-guest object

2024-06-13 Thread Xiaoyao Li

On 6/13/2024 4:35 PM, Duan, Zhenzhong wrote:




-Original Message-
From: Li, Xiaoyao 
Subject: Re: [PATCH v5 25/65] i386/tdx: Add property sept-ve-disable for
tdx-guest object

On 6/6/2024 6:45 PM, Daniel P. Berrangé wrote:

Copying  Zhenzhong Duan as my point relates to the proposed libvirt
TDX patches.

On Thu, Feb 29, 2024 at 01:36:46AM -0500, Xiaoyao Li wrote:

Bit 28 of TD attribute, named SEPT_VE_DISABLE. When set to 1, it

disables

EPT violation conversion to #VE on guest TD access of PENDING pages.

Some guest OS (e.g., Linux TD guest) may require this bit as 1.
Otherwise refuse to boot.

Add sept-ve-disable property for tdx-guest object, for user to configure
this bit.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
Acked-by: Markus Armbruster 
---
Changes in v4:
- collect Acked-by from Markus

Changes in v3:
- update the comment of property @sept-ve-disable to make it more
descriptive and use new format. (Daniel and Markus)
---
   qapi/qom.json |  7 ++-
   target/i386/kvm/tdx.c | 24 
   2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 220cc6c98d4b..89ed89b9b46e 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -900,10 +900,15 @@
   #
   # Properties for tdx-guest objects.
   #
+# @sept-ve-disable: toggle bit 28 of TD attributes to control disabling
+# of EPT violation conversion to #VE on guest TD access of PENDING
+# pages.  Some guest OS (e.g., Linux TD guest) may require this to
+# be set, otherwise they refuse to boot.
+#
   # Since: 9.0
   ##
   { 'struct': 'TdxGuestProperties',
-  'data': { }}
+  'data': { '*sept-ve-disable': 'bool' } }


So this exposes a single boolean property that gets mapped into one
specific bit in the TD attributes:


+
+static void tdx_guest_set_sept_ve_disable(Object *obj, bool value, Error

**errp)

+{
+TdxGuest *tdx = TDX_GUEST(obj);
+
+if (value) {
+tdx->attributes |= TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
+} else {
+tdx->attributes &= ~TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
+}
+}


If I look at the documentation for TD attributes

https://download.01.org/intel-sgx/latest/dcap-

latest/linux/docs/Intel_TDX_DCAP_Quoting_Library_API.pdf


Section "A.3.4. TD Attributes"

I see "TD attributes" is a 64-bit int, with 5 bits currently
defined "DEBUG", "SEPT_VE_DISABLE", "PKS", "PL", "PERFMON",
and the rest currently reserved for future use. This makes me
wonder about our modelling approach into the future ?

For the AMD SEV equivalent we've just directly exposed the whole
field as an int:

   'policy' : 'uint32',

For the proposed SEV-SNP patches, the same has been done again

https://lists.nongnu.org/archive/html/qemu-devel/2024-

06/msg00536.html


   '*policy': 'uint64',


The advantage of exposing individual booleans is that it is
self-documenting at the QAPI level, but the disadvantage is
that every time we want to expose ability to control a new
bit in the policy we have to modify QEMU, libvirt, the mgmt
app above libvirt, and whatever tools the end user has to
talk to the mgmt app.

If we expose a policy int, then newly defined bits only require
a change in QEMU, and everything above QEMU will already be
capable of setting it.

In fact if I look at the proposed libvirt patches, they have
proposed just exposing a policy "int" field in the XML, which
then has to be unpacked to set the individual QAPI booleans



https://lists.libvirt.org/archives/list/de...@lists.libvirt.org/message/WXWX
EESYUA77DP7YIBP55T2OPSVKV5QW/


On balance, I think it would be better if QEMU just exposed
the raw TD attributes policy as an uint64 at QAPI, instead
of trying to unpack it to discrete bool fields. This gives
consistency with SEV and SEV-SNP, and with what's proposed
at the libvirt level, and minimizes future changes when
more policy bits are defined.


The reasons why introducing individual bit of sept-ve-disable instead of
a raw TD attribute as a whole are that

1. other bits like perfmon, PKS, KL are associated with cpu properties,
e.g.,

perfmon -> pmu,
pks -> pks,
kl -> keylokcer feature that QEMU currently doesn't support

If allowing configuring attribute directly, we need to deal with the
inconsistence between attribute vs cpu property.


What about defining those bits associated with cpu properties reserved
But other bits work as Daniel suggested way.


I don't understand. Do you mean we provide the interface to configure 
raw 64 bit attributes while some bits of it are reserved?



Thanks
Zhenzhong



2. people need to know the exact bit position of each attribute. I don't
think it is a user-friendly interface to require user to be aware of
such details.

For example, if user wants to create a Debug TD, user just needs to set
'debug=on' for tdx-guest object. I

Re: [PATCH v5 17/65] i386/tdx: Adjust the supported CPUID based on TDX restrictions

2024-06-13 Thread Xiaoyao Li

On 6/13/2024 4:26 PM, Duan, Zhenzhong wrote:

+ *
+ * It also has side effect to enable unsupported bits, e.g., the
+ * bits of "fixed0" type while present natively. It's safe because
+ * the unsupported bits will be masked off by .fixed0 later.
+ */
+    *ret |= host_cpuid_reg(function, index, reg);

Looks KVM capabilities are merged with native bits, is this intentional?

yes, if we change the order, it would be more clear for you I guess.

host_cpuid_reg() | kvm_capabilities

The base is host's native value, while any bit that absent from native
but KVM can emulate is also added to base.

Imagine there is a 'type native' bit that's absent from native but KVM emulated,
With above code we pass 1 to tdx module but it wants native 0, is it an issue?


yes, it will have issue but it's not "we pass 1 to tdx_module".

"Native" bit is not configurable in the view of TDX module, and QEMU/KVM 
cannot configure it. But it does causes mismatch in above case that QEMU 
sees the bit is supported while in the TD the bit is not supported.


This is one of the reason why we are going to drop the solution that 
QEMU maintains the CPUID configurability in this series.




Re: [PULL 39/42] i386: Add support for SUCCOR feature

2024-06-13 Thread Xiaoyao Li

On 6/8/2024 4:34 PM, Paolo Bonzini wrote:

From: John Allen 

Add cpuid bit definition for the SUCCOR feature. This cpuid bit is required to
be exposed to guests to allow them to handle machine check exceptions on AMD
hosts.


v2:
   - Add "succor" feature word.
   - Add case to kvm_arch_get_supported_cpuid for the SUCCOR feature.

Reported-by: William Roche 
Reviewed-by: Joao Martins 
Signed-off-by: John Allen 
Message-ID: <20240603193622.47156-3-john.al...@amd.com>
Signed-off-by: Paolo Bonzini 


[snip]
...


diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 55a9e8a70cf..56d8e2a89ec 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -532,6 +532,8 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t 
function,
   */
  cpuid_1_edx = kvm_arch_get_supported_cpuid(s, 1, 0, R_EDX);
  ret |= cpuid_1_edx & CPUID_EXT2_AMD_ALIASES;
+} else if (function == 0x8007 && reg == R_EBX) {
+ret |= CPUID_8000_0007_EBX_SUCCOR;


IMO, this is incorrect.

Why make it supported unconditionally? Shouldn't there be a KVM patch to 
report it in KVM_GET_SUPPORTED_CPUID?


If make it supported unconditionally, all VMs boot with "-cpu host/max" 
will see the CPUID bits, even if it is Intel VMs.




  } else if (function == KVM_CPUID_FEATURES && reg == R_EAX) {
  /* kvm_pv_unhalt is reported by GET_SUPPORTED_CPUID, but it can't
   * be enabled without the in-kernel irqchip





Re: [PATCH v5 18/65] i386/tdx: Make Intel-PT unsupported for TD guest

2024-06-12 Thread Xiaoyao Li

On 5/31/2024 5:27 PM, Duan, Zhenzhong wrote:


On 2/29/2024 2:36 PM, Xiaoyao Li wrote:

Due to the fact that Intel-PT virtualization support has been broken in
QEMU since Sapphire Rapids generation[1], below warning is triggered when
luanching TD guest:

   warning: host doesn't support requested feature: 
CPUID.07H:EBX.intel-pt [bit 25]


Before Intel-pt is fixed in QEMU, just make Intel-PT unsupported for TD
guest, to avoid the confusing warning.


I guess normal guest has same issue.


yeah, just the bug referenced by [1]


Thanks

Zhenzhong



[1] 
https://lore.kernel.org/qemu-devel/20230531084311.3807277-1-xiaoyao...@intel.com/


Signed-off-by: Xiaoyao Li 
---
Changes in v4:
  - newly added patch;
---
  target/i386/kvm/tdx.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 85d96140b450..239170142e4f 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -292,6 +292,11 @@ void tdx_get_supported_cpuid(uint32_t function, 
uint32_t index, int reg,

  if (function == 1 && reg == R_ECX && !enable_cpu_pm) {
  *ret &= ~CPUID_EXT_MONITOR;
  }
+
+    /* QEMU Intel-pt support is broken, don't advertise Intel-PT */
+    if (function == 7 && reg == R_EBX) {
+    *ret &= ~CPUID_7_0_EBX_INTEL_PT;
+    }
  }
  enum tdx_ioctl_level{





Re: [PATCH v5 17/65] i386/tdx: Adjust the supported CPUID based on TDX restrictions

2024-06-12 Thread Xiaoyao Li

On 5/31/2024 4:47 PM, Duan, Zhenzhong wrote:


On 2/29/2024 2:36 PM, Xiaoyao Li wrote:

According to Chapter "CPUID Virtualization" in TDX module spec, CPUID
bits of TD can be classified into 6 types:


1 | As configured | configurable by VMM, independent of native value;

2 | As configured | configurable by VMM if the bit is supported natively
 (if native)   | Otherwise it equals as native(0).

3 | Fixed | fixed to 0/1

4 | Native    | reflect the native value

5 | Calculated    | calculated by TDX module.

6 | Inducing #VE  | get #VE exception


Note:
1. All the configurable XFAM related features and TD attributes related
    features fall into type #2. And fixed0/1 bits of XFAM and TD
    attributes fall into type #3.

2. For CPUID leaves not listed in "CPUID virtualization Overview" table
    in TDX module spec, TDX module injects #VE to TDs when those are
    queried. For this case, TDs can request CPUID emulation from VMM via
    TDVMCALL and the values are fully controlled by VMM.

Due to TDX module has its own virtualization policy on CPUID bits, it 
leads

to what reported via KVM_GET_SUPPORTED_CPUID diverges from the supported
CPUID bits for TDs. In order to keep a consistent CPUID configuration
between VMM and TDs. Adjust supported CPUID for TDs based on TDX
restrictions.

Currently only focus on the CPUID leaves recognized by QEMU's
feature_word_info[] that are indexed by a FeatureWord.

Introduce a TDX CPUID lookup table, which maintains 1 entry for each
FeatureWord. Each entry has below fields:

  - tdx_fixed0/1: The bits that are fixed as 0/1;

  - depends_on_vmm_cap: The bits that are configurable from the view of
   TDX module. But they requires emulation of VMM
   when configured as enabled. For those, they are
   not supported if VMM doesn't report them as
   supported. So they need be fixed up by checking
   if VMM supports them.


Previously I thought bits configurable for TDX module are emulated by 
TDX module,


it looks not. Just curious why doesn't those bits belong to "Inducing 
#VE" type?


Because when TD guest queries this type of CPUID leaf, it doesn't get #VE.

The bits in this category are free to be configured as on/off by VMM and 
passed to TDX module via TD_PARAM. Once they get configured, they are 
queried directly by TD guest without getting #VE.


The problem is whether VMM allows them to be configured freely. E.g., 
for features TME and PCONFIG, they are configurable. However when VMM 
configures them to 1, VMM needs to provide the support of related MSRs 
of them. If VMM cannot provide such support, VMM should be allow user to 
configured to 1.


That's why we have this kind of type.

BTW, we are going to abondan the solution that let QEMU to maintain the 
CPUID configurability table in this series. Next version we will come up 
with



Then guest can get KVM reported capabilities with tdvmcall directly.



  - inducing_ve: TD gets #VE when querying this CPUID leaf. The result is
 totally configurable by VMM.

  - supported_on_ve: It's valid only when @inducing_ve is true. It 
represents

    the maximum feature set supported that be emulated
    for TDs.
This is never used together with depends_on_vmm_cap, maybe one variable 
is enough?


By applying TDX CPUID lookup table and TDX capabilities reported from
TDX module, the supported CPUID for TDs can be obtained from following
steps:

- get the base of VMM supported feature set;

- if the leaf is not a FeatureWord just return VMM's value without
   modification;

- if the leaf is an inducing_ve type, applying supported_on_ve mask and
   return;

- include all native bits, it covers type #2, #4, and parts of type #1.
   (it also includes some unsupported bits. The following step will
    correct it.)

- apply fixed0/1 to it (it covers #3, and rectifies the previous step);

- add configurable bits (it covers the other part of type #1);

- fix the ones in vmm_fixup;

(Calculated type is ignored since it's determined at runtime).

Co-developed-by: Chenyi Qiang 
Signed-off-by: Chenyi Qiang 
Signed-off-by: Xiaoyao Li 
---
  target/i386/cpu.h |  16 +++
  target/i386/kvm/kvm.c |   4 +
  target/i386/kvm/tdx.c | 263 ++
  target/i386/kvm/tdx.h |   3 +
  4 files changed, 286 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952

Re: [PATCH v5 25/65] i386/tdx: Add property sept-ve-disable for tdx-guest object

2024-06-12 Thread Xiaoyao Li

On 6/6/2024 6:45 PM, Daniel P. Berrangé wrote:

Copying  Zhenzhong Duan as my point relates to the proposed libvirt
TDX patches.

On Thu, Feb 29, 2024 at 01:36:46AM -0500, Xiaoyao Li wrote:

Bit 28 of TD attribute, named SEPT_VE_DISABLE. When set to 1, it disables
EPT violation conversion to #VE on guest TD access of PENDING pages.

Some guest OS (e.g., Linux TD guest) may require this bit as 1.
Otherwise refuse to boot.

Add sept-ve-disable property for tdx-guest object, for user to configure
this bit.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
Acked-by: Markus Armbruster 
---
Changes in v4:
- collect Acked-by from Markus

Changes in v3:
- update the comment of property @sept-ve-disable to make it more
   descriptive and use new format. (Daniel and Markus)
---
  qapi/qom.json |  7 ++-
  target/i386/kvm/tdx.c | 24 
  2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 220cc6c98d4b..89ed89b9b46e 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -900,10 +900,15 @@
  #
  # Properties for tdx-guest objects.
  #
+# @sept-ve-disable: toggle bit 28 of TD attributes to control disabling
+# of EPT violation conversion to #VE on guest TD access of PENDING
+# pages.  Some guest OS (e.g., Linux TD guest) may require this to
+# be set, otherwise they refuse to boot.
+#
  # Since: 9.0
  ##
  { 'struct': 'TdxGuestProperties',
-  'data': { }}
+  'data': { '*sept-ve-disable': 'bool' } }


So this exposes a single boolean property that gets mapped into one
specific bit in the TD attributes:


+
+static void tdx_guest_set_sept_ve_disable(Object *obj, bool value, Error 
**errp)
+{
+TdxGuest *tdx = TDX_GUEST(obj);
+
+if (value) {
+tdx->attributes |= TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
+} else {
+tdx->attributes &= ~TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
+}
+}


If I look at the documentation for TD attributes

   
https://download.01.org/intel-sgx/latest/dcap-latest/linux/docs/Intel_TDX_DCAP_Quoting_Library_API.pdf

Section "A.3.4. TD Attributes"

I see "TD attributes" is a 64-bit int, with 5 bits currently
defined "DEBUG", "SEPT_VE_DISABLE", "PKS", "PL", "PERFMON",
and the rest currently reserved for future use. This makes me
wonder about our modelling approach into the future ?

For the AMD SEV equivalent we've just directly exposed the whole
field as an int:

  'policy' : 'uint32',

For the proposed SEV-SNP patches, the same has been done again

https://lists.nongnu.org/archive/html/qemu-devel/2024-06/msg00536.html

  '*policy': 'uint64',


The advantage of exposing individual booleans is that it is
self-documenting at the QAPI level, but the disadvantage is
that every time we want to expose ability to control a new
bit in the policy we have to modify QEMU, libvirt, the mgmt
app above libvirt, and whatever tools the end user has to
talk to the mgmt app.

If we expose a policy int, then newly defined bits only require
a change in QEMU, and everything above QEMU will already be
capable of setting it.

In fact if I look at the proposed libvirt patches, they have
proposed just exposing a policy "int" field in the XML, which
then has to be unpacked to set the individual QAPI booleans

   
https://lists.libvirt.org/archives/list/de...@lists.libvirt.org/message/WXWXEESYUA77DP7YIBP55T2OPSVKV5QW/

On balance, I think it would be better if QEMU just exposed
the raw TD attributes policy as an uint64 at QAPI, instead
of trying to unpack it to discrete bool fields. This gives
consistency with SEV and SEV-SNP, and with what's proposed
at the libvirt level, and minimizes future changes when
more policy bits are defined.


The reasons why introducing individual bit of sept-ve-disable instead of 
a raw TD attribute as a whole are that


1. other bits like perfmon, PKS, KL are associated with cpu properties, 
e.g.,


perfmon -> pmu,
pks -> pks,
kl -> keylokcer feature that QEMU currently doesn't support

If allowing configuring attribute directly, we need to deal with the 
inconsistence between attribute vs cpu property.


2. people need to know the exact bit position of each attribute. I don't 
think it is a user-friendly interface to require user to be aware of 
such details.


For example, if user wants to create a Debug TD, user just needs to set 
'debug=on' for tdx-guest object. It's much more friendly than that user 
needs to set the bit 0 of the attribute.




+
  /* tdx guest */
  OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest,
 tdx_guest,
@@ -529,6 +549,10 @@ static void tdx_guest_init(Object *obj)
  qemu_mutex_init(>lock);
  
  tdx->attributes = 0;

+
+object_property_add_bool(obj, "sept-ve-disable",
+ tdx_guest_get_sept_ve_disable,
+ tdx_guest_set_sept_ve_disable);
  }
  
  static void tdx_guest_finalize(Object *obj)

--
2.34.1



With regards,
Daniel





Re: [PATCH] i386/apic: Add hint on boot failure because of disabling x2APIC

2024-06-07 Thread Xiaoyao Li

On 6/7/2024 3:46 PM, Zhao Liu wrote:

Hi Philippe,

On Fri, Jun 07, 2024 at 08:17:36AM +0200, Philippe Mathieu-Daudé wrote:

Date: Fri, 7 Jun 2024 08:17:36 +0200
From: Philippe Mathieu-Daudé 
Subject: Re: [PATCH] i386/apic: Add hint on boot failure because of
  disabling x2APIC

On 6/6/24 16:08, Zhao Liu wrote:

Currently, the Q35 supports up to 4096 vCPUs (since v9.0), but for TCG
cases, if x2APIC is not actively enabled to boot more than 255 vCPUs (


why bother mentioning TCG case?

you below example is not for TCG. And the issue is not caused by TCG but 
default cpu model doesn't have X2APIC enabled.



e.g., qemu-system-i386 -M pc-q35-9.0 -smp 666), the following error is
reported:

Unexpected error in apic_common_set_id() at ../hw/intc/apic_common.c:449:
qemu-system-i386: APIC ID 255 requires x2APIC feature in CPU
Aborted (core dumped)

This error can be resolved by setting x2apic=on in -cpu. In order to
better help users deal with this scenario, add the error hint to
instruct users on how to enable the x2apic feature.


Why not automatically set x2apic=on in this case instead?


The default CPU model is qemu64 without x2APIC. I think it might be
necessary to update the version to add x2APIC in qemu64, and I'd like to
look into it again for any other potential issues.

In addition to adding x2APIC directly to the qemu64, this hint can also
help some users who want a large number of vCPUs based on other older
CPU models. Though, it's not very common but this hint would be helpful.


Then, the error
report becomes the following:

Unexpected error in apic_common_set_id() at ../hw/intc/apic_common.c:448:
qemu-system-i386: APIC ID 255 requires x2APIC feature in CPU
Try x2apic=on in -cpu.
Aborted (core dumped)

Note since @errp is _abort, error_append_hint() can't be applied
on @errp. And in order to separate the exact error message from the
(perhaps effectively) hint, adding a hint via error_append_hint() is
also necessary. Therefore, introduce @local_error in
apic_common_set_id() to handle both the error message and the error
hint.

Suggested-by: Philippe Mathieu-Daudé 
Signed-off-by: Zhao Liu 
---
   hw/intc/apic_common.c | 7 ++-
   1 file changed, 6 insertions(+), 1 deletion(-)


Reviewed-by: Philippe Mathieu-Daudé 


Thanks!







Re: [PATCH] target/i386: SEV: do not assume machine->cgs is SEV

2024-06-06 Thread Xiaoyao Li

On 6/6/2024 6:44 AM, Paolo Bonzini wrote:

There can be other confidential computing classes that are not derived
from sev-common.  Avoid aborting when encountering them.


I hit it today when rebasing TDX patches to latest QEMU master, which 
has the SEV-SNP series merged. (I didn't get time to review it between 
it gets merged.)



Signed-off-by: Paolo Bonzini 
---
  target/i386/sev.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index 004c667ac14..97e15f8b7a9 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -1710,7 +1710,9 @@ void sev_es_set_reset_vector(CPUState *cpu)


my approach is to guard with sev_enabled() when calling 
sev_es_set_reset_vector() in kvm_arch_reset_vcpu(), because calling sev* 
specific function in generic kvm code doesn't look reasonable to me.



  {
  X86CPU *x86;
  CPUX86State *env;
-SevCommonState *sev_common = SEV_COMMON(MACHINE(qdev_get_machine())->cgs);
+ConfidentialGuestSupport *cgs = MACHINE(qdev_get_machine())->cgs;
+SevCommonState *sev_common = SEV_COMMON(
+object_dynamic_cast(OBJECT(cgs), TYPE_SEV_COMMON));
  
  /* Only update if we have valid reset information */

  if (!sev_common || !sev_common->reset_data_valid) {





Re: [PATCH V3 2/2] target/i386: Advertise MWAIT iff host supports

2024-06-04 Thread Xiaoyao Li

On 6/4/2024 8:02 AM, Zide Chen wrote:

host_cpu_realizefn() sets CPUID_EXT_MONITOR without consulting host/KVM
capabilities. This may cause problems:

- If MWAIT/MONITOR is not available on the host, advertising this
   feature to the guest and executing MWAIT/MONITOR from the guest
   triggers #UD and the guest doesn't boot.  This is because typically
   #UD takes priority over VM-Exit interception checks and KVM doesn't
   emulate MONITOR/MWAIT on #UD.

- If KVM doesn't support KVM_X86_DISABLE_EXITS_MWAIT, MWAIT/MONITOR
   from the guest are intercepted by KVM, which is not what cpu-pm=on
   intends to do.

In these cases, MWAIT/MONITOR should not be exposed to the guest.

The logic in kvm_arch_get_supported_cpuid() to handle CPUID_EXT_MONITOR
is correct and sufficient, and we can't set CPUID_EXT_MONITOR after
x86_cpu_filter_features().

This was not an issue before commit 662175b91ff ("i386: reorder call to
cpu_exec_realizefn") because the feature added in the accel-specific
realizefn could be checked against host availability and filtered out.

Additionally, it seems not a good idea to handle guest CPUID leaves in
host_cpu_realizefn(), and this patch merges host_cpu_enable_cpu_pm()
into kvm_cpu_realizefn().

Fixes: f5cc5a5c1686 ("i386: split cpu accelerators from cpu.c, using 
AccelCPUClass")
Fixes: 662175b91ff2 ("i386: reorder call to cpu_exec_realizefn")
Signed-off-by: Zide Chen 


Reviewed-by: Xiaoyao Li 


---

V3:
- don't set CPUID_EXT_MONITOR in kvm_cpu_realizefn().
- Change title to reflect the main purpose of this patch.

  target/i386/host-cpu.c| 12 
  target/i386/kvm/kvm-cpu.c | 11 +--
  2 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/target/i386/host-cpu.c b/target/i386/host-cpu.c
index 280e427c017c..8b8bf5afeccf 100644
--- a/target/i386/host-cpu.c
+++ b/target/i386/host-cpu.c
@@ -42,15 +42,6 @@ static uint32_t host_cpu_phys_bits(void)
  return host_phys_bits;
  }
  
-static void host_cpu_enable_cpu_pm(X86CPU *cpu)

-{
-CPUX86State *env = >env;
-
-host_cpuid(5, 0, >mwait.eax, >mwait.ebx,
-   >mwait.ecx, >mwait.edx);
-env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
-}
-
  static uint32_t host_cpu_adjust_phys_bits(X86CPU *cpu)
  {
  uint32_t host_phys_bits = host_cpu_phys_bits();
@@ -83,9 +74,6 @@ bool host_cpu_realizefn(CPUState *cs, Error **errp)
  X86CPU *cpu = X86_CPU(cs);
  CPUX86State *env = >env;
  
-if (cpu->max_features && enable_cpu_pm) {

-host_cpu_enable_cpu_pm(cpu);
-}
  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
  uint32_t phys_bits = host_cpu_adjust_phys_bits(cpu);
  
diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c

index f76972e47e61..148d10ce3711 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -65,8 +65,15 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
   *   cpu_common_realizefn() (via xcc->parent_realize)
   */
  if (cpu->max_features) {
-if (enable_cpu_pm && kvm_has_waitpkg()) {
-env->features[FEAT_7_0_ECX] |= CPUID_7_0_ECX_WAITPKG;
+if (enable_cpu_pm) {
+if (kvm_has_waitpkg()) {
+env->features[FEAT_7_0_ECX] |= CPUID_7_0_ECX_WAITPKG;
+}
+
+if (env->features[FEAT_1_ECX] & CPUID_EXT_MONITOR) {
+host_cpuid(5, 0, >mwait.eax, >mwait.ebx,
+   >mwait.ecx, >mwait.edx);
+   }
  }
  if (cpu->ucode_rev == 0) {
  cpu->ucode_rev =





Re: [PATCH v2] i386/cpu: fixup number of addressable IDs for processor cores in the physical package

2024-06-04 Thread Xiaoyao Li

On 6/4/2024 5:43 PM, Zhao Liu wrote:

Hi Chuang,

On Mon, Jun 03, 2024 at 04:36:41PM +0800, Chuang Xu wrote:

Date: Mon,  3 Jun 2024 16:36:41 +0800
From: Chuang Xu 
Subject: [PATCH v2] i386/cpu: fixup number of addressable IDs for processor
  cores in the physical package
X-Mailer: git-send-email 2.24.3 (Apple Git-128)

When QEMU is started with:
-cpu host,host-cache-info=on,l3-cache=off \
-smp 2,sockets=1,dies=1,cores=1,threads=2
Guest can't acquire maximum number of addressable IDs for processor cores in
the physical package from CPUID[04H].

When testing Intel TDX, guest attempts to acquire extended topology from 
CPUID[0bH],
but because the TDX module doesn't provide the emulation of CPUID[0bH],
guest will instead acquire extended topology from CPUID[04H]. However,
due to QEMU's inaccurate emulation of CPUID[04H], one of the vcpus in 2c TDX
guest would be offline.


I guess this case is based on downstream's TDX patches... Since TDX
hasn't landed in QEMU yet, it's a bit ahead of the curve to elaborate on
TDX-specific case.

Because normal VM will also face the such cache topology error, I think
it could be stated a bit more generically like:


yes. it's not TDX specific though it's found by TDX case.

I think We can reproduce it by limiting the "min-level" to less than 0xb 
and disable "full-cpuid-auto-level". With it, CPUID leaves 0xb are not 
exposed to guest and guest will use leaves 0x4 to enumerate the CPU 
topology.



When creating a CPU topology of 1 core per package, host-cache-info only
uses the Host's addressable core IDs field (CPUID.04H.EAX[bits 31-26]),
resulting in a conflict (on the multicore Host) between the Guest core
topology information in this field and the Guest's actual cores number.

Fix it by removing the unnecessary condition to cover 1 core per package
case. This is safe because cores_per_pkg will not be 0 and will be at
least 1.


Fix it by removing the unnecessary condition.

Fixes: d7caf13b5fcf742e5680c1d3448ba070fc811644 ("x86: cpu: fixup number of 
addressable IDs for logical processors sharing cache")


yeah. This is the exact commit that introduced the issue. Because it moved

*eax &= ~0xFC00;

into the condition of

cs->nr_cores > 1


12 characters (d7caf13b5fcf) is enough. No blank line. ;-)


Signed-off-by: Guixiong Wei 
Signed-off-by: Yipeng Yin 
Signed-off-by: Chuang Xu 
---
  target/i386/cpu.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index bc2dceb647..b68f7460db 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6426,10 +6426,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  if (*eax & 31) {
  int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
  
-if (cores_per_pkg > 1) {

-*eax &= ~0xFC00;
-*eax |= max_core_ids_in_package(_info) << 26;
-}
+*eax &= ~0xFC00;
+*eax |= max_core_ids_in_package(_info) << 26;
  if (host_vcpus_per_cache > threads_per_pkg) {
  *eax &= ~0x3FFC000;
  
--

2.20.1








Re: [PATCH 2/2] tests: add testing of parameter=1 for SMP topology

2024-05-15 Thread Xiaoyao Li

On 5/13/2024 8:33 PM, Daniel P. Berrangé wrote:

Validate that it is possible to pass 'parameter=1' for any SMP topology
parameter, since unsupported parameters are implicitly considered to
always have a value of 1.

Signed-off-by: Daniel P. Berrangé 
---
  tests/unit/test-smp-parse.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/tests/unit/test-smp-parse.c b/tests/unit/test-smp-parse.c
index 56165e6644..56ce5128f1 100644
--- a/tests/unit/test-smp-parse.c
+++ b/tests/unit/test-smp-parse.c
@@ -330,6 +330,14 @@ static const struct SMPTestData data_generic_valid[] = {
  .config = SMP_CONFIG_GENERIC(T, 8, T, 2, T, 4, T, 2, T, 16),
  .expect_prefer_sockets = CPU_TOPOLOGY_GENERIC(8, 2, 4, 2, 16),
  .expect_prefer_cores   = CPU_TOPOLOGY_GENERIC(8, 2, 4, 2, 16),
+}, {
+/*
+ * Unsupported parameters are always allowed to be set to '1'
+ * config: -smp 
8,books=1,drawers=1,sockets=2,modules=1,dies=1,cores=4,threads=2,maxcpus=8


cores=2 not 4.


+ * expect: cpus=8,sockets=2,cores=2,threads=2,maxcpus=8 */
+.config = SMP_CONFIG_WITH_FULL_TOPO(8, 1, 1, 2, 1, 1, 2, 2, 8),
+.expect_prefer_sockets = CPU_TOPOLOGY_GENERIC(8, 2, 2, 2, 8),
+.expect_prefer_cores   = CPU_TOPOLOGY_GENERIC(8, 2, 2, 2, 8),
  },
  };
  





Re: [PATCH 6/6] target/i386/confidential-guest: Fix comment of x86_confidential_guest_kvm_type()

2024-04-26 Thread Xiaoyao Li

On 4/26/2024 6:07 PM, Zhao Liu wrote:

Update the comment to match the X86ConfidentialGuestClass
implementation.

Suggested-by: Xiaoyao Li 


I think it should be "Reported-by"


Signed-off-by: Zhao Liu 
---
  target/i386/confidential-guest.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/confidential-guest.h b/target/i386/confidential-guest.h
index 532e172a60b6..06d54a120227 100644
--- a/target/i386/confidential-guest.h
+++ b/target/i386/confidential-guest.h
@@ -44,7 +44,7 @@ struct X86ConfidentialGuestClass {
  /**
   * x86_confidential_guest_kvm_type:
   *
- * Calls #X86ConfidentialGuestClass.unplug callback of @plug_handler.
+ * Calls #X86ConfidentialGuestClass.kvm_type() callback.
   */
  static inline int x86_confidential_guest_kvm_type(X86ConfidentialGuest *cg)
  {





Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name

2024-04-25 Thread Xiaoyao Li

On 4/25/2024 6:29 PM, Zhao Liu wrote:

On Thu, Apr 25, 2024 at 04:40:10PM +0800, Xiaoyao Li wrote:

Date: Thu, 25 Apr 2024 16:40:10 +0800
From: Xiaoyao Li 
Subject: Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock
  feature name

On 4/25/2024 3:17 PM, Zhao Liu wrote:

Hi Xiaoyao,

On Wed, Apr 24, 2024 at 11:57:11PM +0800, Xiaoyao Li wrote:

Date: Wed, 24 Apr 2024 23:57:11 +0800
From: Xiaoyao Li 
Subject: Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock
   feature name

On 3/29/2024 6:19 PM, Zhao Liu wrote:

From: Zhao Liu 

Hi list,

This series is based on Paolo's guest_phys_bits patchset [1].

Currently, the old and new kvmclocks have the same feature name
"kvmclock" in FeatureWordInfo[FEAT_KVM].

When I tried to dig into the history of this unusual naming and fix it,
I realized that Tim was already trying to rename it, so I picked up his
renaming patch [2] (with a new commit message and other minor changes).

13 years age, the same name was introduced in [3], and its main purpose
is to make it easy for users to enable/disable 2 kvmclocks. Then, in
2012, Don tried to rename the new kvmclock, but the follow-up did not
address Igor and Eduardo's comments about compatibility.

Tim [2], not long ago, and I just now, were both puzzled by the naming
one after the other.


The commit message of [3] said the reason clearly:

When we tweak flags involving this value - specially when we use "-",
we have to act on both.

So you are trying to change it to "when people want to disable kvmclock,
they need to use '-kvmclock,-kvmclock2' instead of '-kvmclock'"

IMHO, I prefer existing code and I don't see much value of differentiating
them. If the current code puzzles you, then we can add comment to explain.


It's enough to just enable kvmclock2 for Guest; kvmclock (old) is
redundant in the presence of kvmclock2.

So operating both feature bits at the same time is not a reasonable
choice, we should only keep kvmclock2 for Guest. It's possible because
the oldest linux (v4.5) which QEMU i386 supports has kvmclock2.


who said the oldest Linux QEMU supports is from 4.5? what about 2.x kernel?


For Host (docs/system/target-i386.rst).


Besides, not only the Linux guest, whatever guest OS is, it will be broken
if the guest is using kvmclock and QEMU starts to drop support of kvmclock.


I'm not aware of any minimum version requirements for Guest supported
by KVM, but there are no commitment.


the common commitment is at least keeping backwards compatibility.


So, again, hard NAK to drop the support of kvmclock. It breaks existing
guests that use kvmclock. You cannot force people to upgrade their existing
VMs to use kvmclock2 instead of kvmclock.


I agree, legacy kvmclock can be left out, if the old kernel does not
support kvmclock2 and strongly requires kvmclock, it can be enabled
using 9.1 and earlier machines or legacy-kvmclock, as long as Host still
supports it.

What's the gap in handling it this way? especially considering that
kvmclock2 was introduced in v2.6.35, and earlier kernels are no longer
maintained. The availability of the PV feature requires compatibility
for both Host and Guest.

Anyway, the above discussion here is about future plans, and this series
does not prevent any Guest from ignoring kvmclock2 in favor of kvmclock.

What this series is doing, i.e. separating the current two kvmclock and
ensuring CPUID compatibility via legacy-kvmclock, could balance the
compatibility requirements of the ancient (unmaintained kernel) with
the need for future feature changes.


You introduce a user-visible change that people need to use 
"-kvmclock,-kvmclock2" to totally disable kvmclock.


The only difference between kvmclock and kvmclock2 is the MSR index. And 
from users' perspective, they don't care this difference. The existing 
usage is simple. When I want kvmclock, use "+kvmclock". When I don't 
want it, use "-kvmclock".


You are complicating things and I don't see a strong reason that we have 
to do it.



Thanks,
Zhao






Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name

2024-04-25 Thread Xiaoyao Li

On 4/25/2024 3:17 PM, Zhao Liu wrote:

Hi Xiaoyao,

On Wed, Apr 24, 2024 at 11:57:11PM +0800, Xiaoyao Li wrote:

Date: Wed, 24 Apr 2024 23:57:11 +0800
From: Xiaoyao Li 
Subject: Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock
  feature name

On 3/29/2024 6:19 PM, Zhao Liu wrote:

From: Zhao Liu 

Hi list,

This series is based on Paolo's guest_phys_bits patchset [1].

Currently, the old and new kvmclocks have the same feature name
"kvmclock" in FeatureWordInfo[FEAT_KVM].

When I tried to dig into the history of this unusual naming and fix it,
I realized that Tim was already trying to rename it, so I picked up his
renaming patch [2] (with a new commit message and other minor changes).

13 years age, the same name was introduced in [3], and its main purpose
is to make it easy for users to enable/disable 2 kvmclocks. Then, in
2012, Don tried to rename the new kvmclock, but the follow-up did not
address Igor and Eduardo's comments about compatibility.

Tim [2], not long ago, and I just now, were both puzzled by the naming
one after the other.


The commit message of [3] said the reason clearly:

   When we tweak flags involving this value - specially when we use "-",
   we have to act on both.

So you are trying to change it to "when people want to disable kvmclock,
they need to use '-kvmclock,-kvmclock2' instead of '-kvmclock'"

IMHO, I prefer existing code and I don't see much value of differentiating
them. If the current code puzzles you, then we can add comment to explain.


It's enough to just enable kvmclock2 for Guest; kvmclock (old) is
redundant in the presence of kvmclock2.

So operating both feature bits at the same time is not a reasonable
choice, we should only keep kvmclock2 for Guest. It's possible because
the oldest linux (v4.5) which QEMU i386 supports has kvmclock2.


who said the oldest Linux QEMU supports is from 4.5? what about 2.x kernel?

Besides, not only the Linux guest, whatever guest OS is, it will be 
broken if the guest is using kvmclock and QEMU starts to drop support of 
kvmclock.


So, again, hard NAK to drop the support of kvmclock. It breaks existing 
guests that use kvmclock. You cannot force people to upgrade their 
existing VMs to use kvmclock2 instead of kvmclock.



Pls see:
https://elixir.bootlin.com/linux/v4.5/source/arch/x86/include/uapi/asm/kvm_para.h#L22

With this in mind, I'm trying to implement a silky smooth transition to
"only kcmclock2 support".

This series is now separating kvmclock and kvmclock2, and then I plan to
go to the KVM side and deprecate kvmclock2, and then finally remove
kvmclock (old) in QEMU. Separating the two features makes the process
clearer.

Continuing to use the same name equally would have silently caused the
CPUID to change on the Guest side, which would have hurt compatibility
as well.


So, this series is to push for renaming the new kvmclock feature to
"kvmclock2" and adding compatibility support for older machines (PC 9.0
and older).

Finally, let's put an end to decades of doubt about this name.


Next Step
=

This series just separates the two kvmclocks from the naming, and in
subsequent patches I plan to stop setting kvmclock(old kcmclock) by
default as long as KVM supports kvmclock2 (new kvmclock).


No. It will break existing guests that use KVM_FEATURE_CLOCKSOURCE.


Please refer to my elaboration on v4.5 above, where kvmclock2 is
available, it should be used in priority.


Also, try to deprecate the old kvmclock in KVM side.

[1]: 
https://lore.kernel.org/qemu-devel/20240325141422.1380087-1-pbonz...@redhat.com/
[2]: 
https://lore.kernel.org/qemu-devel/20230908124534.25027-4-twied...@redhat.com/
[3]: 
https://lore.kernel.org/qemu-devel/1300401727-5235-3-git-send-email-glom...@redhat.com/
[4]: 
https://lore.kernel.org/qemu-devel/1348171412-23669-3-git-send-email-...@cloudswitch.com/


Thanks,
Zhao







Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name

2024-04-24 Thread Xiaoyao Li

On 3/29/2024 6:19 PM, Zhao Liu wrote:

From: Zhao Liu 

Hi list,

This series is based on Paolo's guest_phys_bits patchset [1].

Currently, the old and new kvmclocks have the same feature name
"kvmclock" in FeatureWordInfo[FEAT_KVM].

When I tried to dig into the history of this unusual naming and fix it,
I realized that Tim was already trying to rename it, so I picked up his
renaming patch [2] (with a new commit message and other minor changes).

13 years age, the same name was introduced in [3], and its main purpose
is to make it easy for users to enable/disable 2 kvmclocks. Then, in
2012, Don tried to rename the new kvmclock, but the follow-up did not
address Igor and Eduardo's comments about compatibility.

Tim [2], not long ago, and I just now, were both puzzled by the naming
one after the other.


The commit message of [3] said the reason clearly:

  When we tweak flags involving this value - specially when we use "-",
  we have to act on both.

So you are trying to change it to "when people want to disable kvmclock, 
they need to use '-kvmclock,-kvmclock2' instead of '-kvmclock'"


IMHO, I prefer existing code and I don't see much value of 
differentiating them. If the current code puzzles you, then we can add 
comment to explain.



So, this series is to push for renaming the new kvmclock feature to
"kvmclock2" and adding compatibility support for older machines (PC 9.0
and older).

Finally, let's put an end to decades of doubt about this name.


Next Step
=

This series just separates the two kvmclocks from the naming, and in
subsequent patches I plan to stop setting kvmclock(old kcmclock) by
default as long as KVM supports kvmclock2 (new kvmclock).


No. It will break existing guests that use KVM_FEATURE_CLOCKSOURCE.


Also, try to deprecate the old kvmclock in KVM side.

[1]: 
https://lore.kernel.org/qemu-devel/20240325141422.1380087-1-pbonz...@redhat.com/
[2]: 
https://lore.kernel.org/qemu-devel/20230908124534.25027-4-twied...@redhat.com/
[3]: 
https://lore.kernel.org/qemu-devel/1300401727-5235-3-git-send-email-glom...@redhat.com/
[4]: 
https://lore.kernel.org/qemu-devel/1348171412-23669-3-git-send-email-...@cloudswitch.com/

Thanks and Best Regards,
Zhao

---
Tim Wiederhake (1):
   target/i386: Fix duplicated kvmclock name in FEAT_KVM

Zhao Liu (6):
   target/i386/kvm: Add feature bit definitions for KVM CPUID
   target/i386/kvm: Remove local MSR_KVM_WALL_CLOCK and
 MSR_KVM_SYSTEM_TIME definitions
   target/i386/kvm: Only Save/load kvmclock MSRs when kvmclock enabled
   target/i386/kvm: Save/load MSRs of new kvmclock
 (KVM_FEATURE_CLOCKSOURCE2)
   target/i386/kvm: Add legacy_kvmclock cpu property
   target/i386/kvm: Update comment in kvm_cpu_realizefn()

  hw/i386/kvm/clock.c   |  5 ++--
  hw/i386/pc.c  |  1 +
  target/i386/cpu.c |  3 +-
  target/i386/cpu.h | 32 +
  target/i386/kvm/kvm-cpu.c | 25 -
  target/i386/kvm/kvm.c | 59 +--
  6 files changed, 99 insertions(+), 26 deletions(-)






Re: [PATCH for-9.1 2/7] target/i386/kvm: Remove local MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME definitions

2024-04-24 Thread Xiaoyao Li

On 3/29/2024 6:19 PM, Zhao Liu wrote:

From: Zhao Liu 

These 2 MSRs have been already defined in the kvm_para header
(standard-headers/asm-x86/kvm_para.h).

Remove QEMU local definitions to avoid duplication.

Signed-off-by: Zhao Liu 


Reviewed-by: Xiaoyao Li 


---
  target/i386/kvm/kvm.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 2f3c8bc3a4ed..be88339fb8bd 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -78,9 +78,6 @@
  #define KVM_APIC_BUS_CYCLE_NS   1
  #define KVM_APIC_BUS_FREQUENCY  (10ULL / KVM_APIC_BUS_CYCLE_NS)
  
-#define MSR_KVM_WALL_CLOCK  0x11

-#define MSR_KVM_SYSTEM_TIME 0x12
-
  /* A 4096-byte buffer can hold the 8-byte kvm_msrs header, plus
   * 255 kvm_msr_entry structs */
  #define MSR_BUF_SIZE 4096





Re: [PATCH for-9.1 1/7] target/i386/kvm: Add feature bit definitions for KVM CPUID

2024-04-24 Thread Xiaoyao Li

On 3/29/2024 6:19 PM, Zhao Liu wrote:

From: Zhao Liu 

Add feature definiations for KVM_CPUID_FEATURES in CPUID (
CPUID[4000_0001].EAX and CPUID[4000_0001].EDX), to get rid of lots of
offset calculations.

Signed-off-by: Zhao Liu 
---
  hw/i386/kvm/clock.c   |  5 ++---
  target/i386/cpu.h | 23 +++
  target/i386/kvm/kvm.c | 28 ++--
  3 files changed, 39 insertions(+), 17 deletions(-)

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index 40aa9a32c32c..7c9752d5036f 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -27,7 +27,6 @@
  #include "qapi/error.h"
  
  #include 

-#include "standard-headers/asm-x86/kvm_para.h"
  #include "qom/object.h"
  
  #define TYPE_KVM_CLOCK "kvmclock"

@@ -334,8 +333,8 @@ void kvmclock_create(bool create_always)
  
  assert(kvm_enabled());

  if (create_always ||
-cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
-   (1ULL << KVM_FEATURE_CLOCKSOURCE2))) {
+cpu->env.features[FEAT_KVM] & (CPUID_FEAT_KVM_CLOCK |
+   CPUID_FEAT_KVM_CLOCK2)) {
  sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
  }
  }
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 83e473584517..b1b8d11cb0fe 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -27,6 +27,7 @@
  #include "qapi/qapi-types-common.h"
  #include "qemu/cpu-float.h"
  #include "qemu/timer.h"
+#include "standard-headers/asm-x86/kvm_para.h"
  
  #define XEN_NR_VIRQS 24
  
@@ -951,6 +952,28 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,

  /* Packets which contain IP payload have LIP values */
  #define CPUID_14_0_ECX_LIP  (1U << 31)
  
+/* (Old) KVM paravirtualized clocksource */

+#define CPUID_FEAT_KVM_CLOCK(1U << KVM_FEATURE_CLOCKSOURCE)


we can drop the _FEAT_, just name it as

CPUID_KVM_CLOCK


+/* (New) KVM specific paravirtualized clocksource */
+#define CPUID_FEAT_KVM_CLOCK2   (1U << KVM_FEATURE_CLOCKSOURCE2)
+/* KVM asynchronous page fault */
+#define CPUID_FEAT_KVM_ASYNCPF  (1U << KVM_FEATURE_ASYNC_PF)
+/* KVM stolen (when guest vCPU is not running) time accounting */
+#define CPUID_FEAT_KVM_STEAL_TIME   (1U << KVM_FEATURE_STEAL_TIME)
+/* KVM paravirtualized end-of-interrupt signaling */
+#define CPUID_FEAT_KVM_PV_EOI   (1U << KVM_FEATURE_PV_EOI)
+/* KVM paravirtualized spinlocks support */
+#define CPUID_FEAT_KVM_PV_UNHALT(1U << KVM_FEATURE_PV_UNHALT)
+/* KVM host-side polling on HLT control from the guest */
+#define CPUID_FEAT_KVM_POLL_CONTROL (1U << KVM_FEATURE_POLL_CONTROL)
+/* KVM interrupt based asynchronous page fault*/
+#define CPUID_FEAT_KVM_ASYNCPF_INT  (1U << KVM_FEATURE_ASYNC_PF_INT)
+/* KVM 'Extended Destination ID' support for external interrupts */
+#define CPUID_FEAT_KVM_MSI_EXT_DEST_ID  (1U << KVM_FEATURE_MSI_EXT_DEST_ID)
+
+/* Hint to KVM that vCPUs expect never preempted for an unlimited time */
+#define CPUID_FEAT_KVM_HINTS_REALTIME(1U << KVM_HINTS_REALTIME)
+
  /* CLZERO instruction */
  #define CPUID_8000_0008_EBX_CLZERO  (1U << 0)
  /* Always save/restore FP error pointers */
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index e68cbe929302..2f3c8bc3a4ed 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -481,13 +481,13 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, 
uint32_t function,
   * be enabled without the in-kernel irqchip
   */
  if (!kvm_irqchip_in_kernel()) {
-ret &= ~(1U << KVM_FEATURE_PV_UNHALT);
+ret &= ~CPUID_FEAT_KVM_PV_UNHALT;
  }
  if (kvm_irqchip_is_split()) {
-ret |= 1U << KVM_FEATURE_MSI_EXT_DEST_ID;
+ret |= CPUID_FEAT_KVM_MSI_EXT_DEST_ID;
  }
  } else if (function == KVM_CPUID_FEATURES && reg == R_EDX) {
-ret |= 1U << KVM_HINTS_REALTIME;
+ret |= CPUID_FEAT_KVM_HINTS_REALTIME;
  }
  
  return ret;

@@ -3324,20 +3324,20 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  kvm_msr_entry_add(cpu, MSR_IA32_TSC, env->tsc);
  kvm_msr_entry_add(cpu, MSR_KVM_SYSTEM_TIME, env->system_time_msr);
  kvm_msr_entry_add(cpu, MSR_KVM_WALL_CLOCK, env->wall_clock_msr);
-if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_ASYNC_PF_INT)) {
+if (env->features[FEAT_KVM] & CPUID_FEAT_KVM_ASYNCPF_INT) {
  kvm_msr_entry_add(cpu, MSR_KVM_ASYNC_PF_INT, 
env->async_pf_int_msr);
  }
-if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_ASYNC_PF)) {
+if (env->features[FEAT_KVM] & CPUID_FEAT_KVM_ASYNCPF) {
  kvm_msr_entry_add(cpu, MSR_KVM_ASYNC_PF_EN, env->async_pf_en_msr);
  }
-if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_PV_EOI)) {
+if (env->features[FEAT_KVM] & CPUID_FEAT_KVM_PV_EOI) {
  kvm_msr_entry_add(cpu, MSR_KVM_PV_EOI_EN, env->pv_eoi_en_msr);
   

Re: [PULL 43/63] target/i386: Implement mc->kvm_type() to get VM type

2024-04-24 Thread Xiaoyao Li

On 4/23/2024 11:09 PM, Paolo Bonzini wrote:

+
+/**
+ * x86_confidential_guest_kvm_type:
+ *
+ * Calls #X86ConfidentialGuestClass.unplug callback of @plug_handler.


the comment needs to be updated:

Calls #X86ConfidentialGuestClass.kvm_type() callback


+ */
+static inline int x86_confidential_guest_kvm_type(X86ConfidentialGuest *cg)
+{
+X86ConfidentialGuestClass *klass = X86_CONFIDENTIAL_GUEST_GET_CLASS(cg);
+
+if (klass->kvm_type) {
+return klass->kvm_type(cg);
+} else {
+return 0;
+}
+}





Re: [PULL 25/63] i386/kvm: Move architectural CPUID leaf generation to separate helper

2024-04-23 Thread Xiaoyao Li

On 4/23/2024 11:09 PM, Paolo Bonzini wrote:

From: Sean Christopherson 

Move the architectural (for lack of a better term) CPUID leaf generation
to a separate helper so that the generation code can be reused by TDX,
which needs to generate a canonical VM-scoped configuration.

For now this is just a cleanup, so keep the function static.

Signed-off-by: Sean Christopherson 
Signed-off-by: Xiaoyao Li 
Message-ID: <20240229063726.610065-23-xiaoyao...@intel.com>
Reviewed-by: Xiaoyao Li 
Signed-off-by: Paolo Bonzini 
---
  target/i386/kvm/kvm.c | 449 +-
  1 file changed, 227 insertions(+), 222 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index e68cbe92930..fcf9603d3e6 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1706,6 +1706,231 @@ static void kvm_init_nested_state(CPUX86State *env)
  }
  }
  
+static uint32_t kvm_x86_build_cpuid(CPUX86State *env,

+struct kvm_cpuid_entry2 *entries,
+uint32_t cpuid_i)
+{
+uint32_t limit, i, j;
+uint32_t unused;
+struct kvm_cpuid_entry2 *c;
+
+cpu_x86_cpuid(env, 0, 0, , , , );
+
+for (i = 0; i <= limit; i++) {
+j = 0;
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+switch (i) {
+case 2: {
+/* Keep reading function 2 till all the input is received */
+int times;
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC |
+   KVM_CPUID_FLAG_STATE_READ_NEXT;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax & 0xff;
+
+for (j = 1; j < times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+}
+break;
+}
+case 0x1f:
+if (env->nr_dies < 2) {
+cpuid_i--;
+break;
+}
+/* fallthrough */
+case 4:
+case 0xb:
+case 0xd:
+for (j = 0; ; j++) {
+if (i == 0xd && j == 64) {
+break;
+}
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (i == 4 && c->eax == 0) {
+break;
+}
+if (i == 0xb && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0x1f && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0xd && c->eax == 0) {
+continue;
+}
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+}
+break;
+case 0x12:
+for (j = 0; ; j++) {
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (j > 1 && (c->eax & 0xf) != 1) {
+break;
+}
+
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+}
+break;
+case 0x7:
+case 0x14:
+case 0x1d:
+case 0x1e: {
+uint32_t times;
+
+c->function = i;
+c->index = 0;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax;
+
+for (j = 1; j <= times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+c->function = i;
+c->index = j;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+}
+break;
+}
+default:
+c->function = i;
+c->flags = 0;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+if (!c->eax && !c->ebx && !c->ecx &a

Re: [PATCH v5 28/65] i386/tdx: Disable pmu for TD guest

2024-04-16 Thread Xiaoyao Li

On 4/16/2024 4:32 PM, Chenyi Qiang wrote:



On 2/29/2024 2:36 PM, Xiaoyao Li wrote:

Current KVM doesn't support PMU for TD guest. It returns error if TD is
created with PMU bit being set in attributes.

Disable PMU for TD guest on QEMU side.

Signed-off-by: Xiaoyao Li 
---
  target/i386/kvm/tdx.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 262e86fd2c67..1c12cda002b8 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -496,6 +496,8 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
  g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
  int r = 0;
  
+object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);


Is it necessary to output some prompt if the user wants to enable pmu by
"-cpu host,pmu=on"? As in patch 27, it mentions PMU is configured by
x86cpu->enable_pmu, but PMU is actually not support in this series and
will be disabled silently.


We do this in QEMU is mainly for KVM, because KVM will fail to init TD 
if ATTRIBUTE.PERFMON is set.


It's expected that KVM reports PERFMON in attributes.fixed0 when KVM 
cannot provide support of it. Then QEMU will print error message 
automatically when validate the attributes.


For QEMU part, next version is going to set the default value of "pmu" 
to false in kvm_cpu_max_instance_init(), so that "-cpu host" will not 
enable pmu for TDX VMs by default.


I suppose both the KVM and QEMU change will show up in the next version.


+
  QEMU_LOCK_GUARD(_guest->lock);
  if (tdx_guest->initialized) {
  return r;





Re: [PATCH v2] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled

2024-04-03 Thread Xiaoyao Li

On 4/3/2024 11:12 PM, Igor Mammedov wrote:

On Wed,  3 Apr 2024 10:59:53 -0400
Xiaoyao Li  wrote:


A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system
also has a PC-AT-compatible dual-8259 setup, i.e., the PIC.

When PIC is not enabled (pic=off) for x86 machine, the PCAT_COMPAT bit
needs to be cleared. Otherwise, the guest thinks there is a present PIC.


Can you add to commit message reproducer (aka qemu CLI and relevant
logs/symptoms observed on guest side)?


When booting a VM with "-machine xx,pic=off", there is supposed to be no 
PIC for the guest. When guest probes PIC, it should find nothing and log 
of below should be printed:


  [0.155970] Using NULL legacy PIC

However, the fact is that no such log printed in guest kernel, with the 
VM created with "pic=off". This is because guest think there is a 
present due to pcat_compat is reporte as 1 in MADT. See Linux code of 
probe_8259A() in arch/x86/kernel/i8259.c 





   
Signed-off-by: Xiaoyao Li   
---

changes in v2:
- Clarify more in commit message;
---
  hw/i386/acpi-common.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c
index 20f19269da40..0cc2919bb851 100644
--- a/hw/i386/acpi-common.c
+++ b/hw/i386/acpi-common.c
@@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
  acpi_table_begin(, table_data);
  /* Local APIC Address */
  build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4);
-build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */
+/* Flags. bit 0: PCAT_COMPAT */
+build_append_int_noprefix(table_data,
+  x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4);
  
  for (i = 0; i < apic_ids->len; i++) {

  pc_madt_cpu_entry(i, apic_ids, table_data, false);







[PATCH v2] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled

2024-04-03 Thread Xiaoyao Li
A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system
also has a PC-AT-compatible dual-8259 setup, i.e., the PIC.

When PIC is not enabled (pic=off) for x86 machine, the PCAT_COMPAT bit
needs to be cleared. Otherwise, the guest thinks there is a present PIC.

Signed-off-by: Xiaoyao Li 
---
changes in v2:
- Clarify more in commit message;
---
 hw/i386/acpi-common.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c
index 20f19269da40..0cc2919bb851 100644
--- a/hw/i386/acpi-common.c
+++ b/hw/i386/acpi-common.c
@@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
 acpi_table_begin(, table_data);
 /* Local APIC Address */
 build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4);
-build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */
+/* Flags. bit 0: PCAT_COMPAT */
+build_append_int_noprefix(table_data,
+  x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4);
 
 for (i = 0; i < apic_ids->len; i++) {
 pc_madt_cpu_entry(i, apic_ids, table_data, false);
-- 
2.34.1




Re: [PATCH] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled

2024-04-02 Thread Xiaoyao Li

On 4/2/2024 10:31 PM, Michael S. Tsirkin wrote:

On Tue, Apr 02, 2024 at 09:18:44PM +0800, Xiaoyao Li wrote:

On 4/2/2024 6:02 PM, Michael S. Tsirkin wrote:

On Tue, Apr 02, 2024 at 04:25:16AM -0400, Xiaoyao Li wrote:

Set MADT.FLAGS[bit 0].PCAT_COMPAT based on x86ms->pic.

Signed-off-by: Xiaoyao Li 


Please include more info in the commit log:
what is the behaviour you observe, why it is wrong,
how does the patch fix it, what is guest behaviour
before and after.


Sorry, I thought it was straightforward.

A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system
also has a PC-AT-compatible dual-8259 setup, i.e., the PIC.

When PIC is not enabled for x86 machine, the PCAT_COMPAT bit needs to be
cleared. Otherwise, the guest thinks there is a present PIC even it is
booted with pic=off on QEMU.

(I haven't seen real issue from Linux guest. The user of PIC inside guest
seems only the pit calibration. Whether pit calibration is triggered depends
on other things. But logically, current code is wrong, we need to fix it
anyway.

@Isaku, please share more info if you have)



+ Kirill,

It seems to have issue with legacy irqs with PCAT_COMPAT set 1 while no 
PIC on QEMU side. Kirill, could you elaborate it?




That's sufficient, thanks! Pls put this in commit log and resubmit.


The commit log and the subject should not repeat
what the diff already states.


---
   hw/i386/acpi-common.c | 4 +++-
   1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c
index 20f19269da40..0cc2919bb851 100644
--- a/hw/i386/acpi-common.c
+++ b/hw/i386/acpi-common.c
@@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
   acpi_table_begin(, table_data);
   /* Local APIC Address */
   build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4);
-build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */
+/* Flags. bit 0: PCAT_COMPAT */
+build_append_int_noprefix(table_data,
+  x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4);
   for (i = 0; i < apic_ids->len; i++) {
   pc_madt_cpu_entry(i, apic_ids, table_data, false);
--
2.34.1









Re: [PATCH] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled

2024-04-02 Thread Xiaoyao Li

On 4/2/2024 6:02 PM, Michael S. Tsirkin wrote:

On Tue, Apr 02, 2024 at 04:25:16AM -0400, Xiaoyao Li wrote:

Set MADT.FLAGS[bit 0].PCAT_COMPAT based on x86ms->pic.

Signed-off-by: Xiaoyao Li 


Please include more info in the commit log:
what is the behaviour you observe, why it is wrong,
how does the patch fix it, what is guest behaviour
before and after.


Sorry, I thought it was straightforward.

A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system 
also has a PC-AT-compatible dual-8259 setup, i.e., the PIC.


When PIC is not enabled for x86 machine, the PCAT_COMPAT bit needs to be 
cleared. Otherwise, the guest thinks there is a present PIC even it is 
booted with pic=off on QEMU.


(I haven't seen real issue from Linux guest. The user of PIC inside 
guest seems only the pit calibration. Whether pit calibration is 
triggered depends on other things. But logically, current code is wrong, 
we need to fix it anyway.


@Isaku, please share more info if you have)



The commit log and the subject should not repeat
what the diff already states.


---
  hw/i386/acpi-common.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c
index 20f19269da40..0cc2919bb851 100644
--- a/hw/i386/acpi-common.c
+++ b/hw/i386/acpi-common.c
@@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
  acpi_table_begin(, table_data);
  /* Local APIC Address */
  build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4);
-build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */
+/* Flags. bit 0: PCAT_COMPAT */
+build_append_int_noprefix(table_data,
+  x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4);
  
  for (i = 0; i < apic_ids->len; i++) {

  pc_madt_cpu_entry(i, apic_ids, table_data, false);
--
2.34.1







[PATCH] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled

2024-04-02 Thread Xiaoyao Li
Set MADT.FLAGS[bit 0].PCAT_COMPAT based on x86ms->pic.

Signed-off-by: Xiaoyao Li 
---
 hw/i386/acpi-common.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c
index 20f19269da40..0cc2919bb851 100644
--- a/hw/i386/acpi-common.c
+++ b/hw/i386/acpi-common.c
@@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
 acpi_table_begin(, table_data);
 /* Local APIC Address */
 build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4);
-build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */
+/* Flags. bit 0: PCAT_COMPAT */
+build_append_int_noprefix(table_data,
+  x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4);
 
 for (i = 0; i < apic_ids->len; i++) {
 pc_madt_cpu_entry(i, apic_ids, table_data, false);
-- 
2.34.1




Re: [PATCH 26/26] i386/kvm: Move architectural CPUID leaf generation to separate helper

2024-04-01 Thread Xiaoyao Li

On 3/23/2024 2:11 AM, Paolo Bonzini wrote:

From: Sean Christopherson 

Move the architectural (for lack of a better term) CPUID leaf generation
to a separate helper so that the generation code can be reused by TDX,
which needs to generate a canonical VM-scoped configuration.

For now this is just a cleanup, so keep the function static.

Signed-off-by: Sean Christopherson 
Signed-off-by: Xiaoyao Li 
Message-ID: <20240229063726.610065-23-xiaoyao...@intel.com>
[Unify error reporting, rename function. - Paolo]
Signed-off-by: Paolo Bonzini 
---
  target/i386/kvm/kvm.c | 446 +-
  1 file changed, 224 insertions(+), 222 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 2577e345502..eab6261e1f5 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1752,6 +1752,228 @@ static void kvm_init_nested_state(CPUX86State *env)
  }
  }
  
+static uint32_t kvm_x86_build_cpuid(CPUX86State *env,

+struct kvm_cpuid_entry2 *entries,
+uint32_t cpuid_i)
+{
+uint32_t limit, i, j;
+uint32_t unused;
+struct kvm_cpuid_entry2 *c;
+
+cpu_x86_cpuid(env, 0, 0, , , , );
+
+for (i = 0; i <= limit; i++) {
+j = 0;
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+switch (i) {
+case 2: {
+/* Keep reading function 2 till all the input is received */
+int times;
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC |
+   KVM_CPUID_FLAG_STATE_READ_NEXT;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax & 0xff;
+
+for (j = 1; j < times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+}
+break;
+}
+case 0x1f:
+if (env->nr_dies < 2) {
+cpuid_i--;
+break;
+}
+/* fallthrough */
+case 4:
+case 0xb:
+case 0xd:
+for (j = 0; ; j++) {
+if (i == 0xd && j == 64) {
+break;
+}
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (i == 4 && c->eax == 0) {
+break;
+}
+if (i == 0xb && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0x1f && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0xd && c->eax == 0) {
+continue;
+}
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+}
+break;
+case 0x12:
+for (j = 0; ; j++) {
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (j > 1 && (c->eax & 0xf) != 1) {
+break;
+}
+
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+}
+break;
+case 0x7:
+case 0x14:
+case 0x1d:
+case 0x1e: {
+uint32_t times;
+
+c->function = i;
+c->index = 0;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax;
+
+for (j = 1; j <= times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+c->function = i;
+c->index = j;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+}
+break;
+}
+default:
+c->function = i;
+c->flags = 0;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+if (!c->eax &&

Re: [PATCH v3 48/49] hw/i386/sev: Use guest_memfd for legacy ROMs

2024-03-27 Thread Xiaoyao Li

On 3/21/2024 2:12 AM, Isaku Yamahata wrote:

On Wed, Mar 20, 2024 at 03:39:44AM -0500,
Michael Roth  wrote:


TODO: make this SNP-specific if TDX disables legacy ROMs in general


TDX disables pc.rom, not disable isa-bios. IIRC, TDX doesn't need pc pflash.


Not TDX doesn't need pc pflash, but TDX cannot support pflash.

Can SNP support the behavior of pflash? That what's got changed will be 
synced back to OVMF file?



Xiaoyao can chime in.

Thanks,



Current SNP guest kernels will attempt to access these regions with
with C-bit set, so guest_memfd is needed to handle that. Otherwise,
kvm_convert_memory() will fail when the guest kernel tries to access it
and QEMU attempts to call KVM_SET_MEMORY_ATTRIBUTES to set these ranges
to private.

Whether guests should actually try to access ROM regions in this way (or
need to deal with legacy ROM regions at all), is a separate issue to be
addressed on kernel side, but current SNP guest kernels will exhibit
this behavior and so this handling is needed to allow QEMU to continue
running existing SNP guest kernels.

Signed-off-by: Michael Roth 
---
  hw/i386/pc.c   | 13 +
  hw/i386/pc_sysfw.c | 13 ++---
  2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index feb7a93083..5feaeb43ee 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1011,10 +1011,15 @@ void pc_memory_init(PCMachineState *pcms,
  pc_system_firmware_init(pcms, rom_memory);
  
  option_rom_mr = g_malloc(sizeof(*option_rom_mr));

-memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
-   _fatal);
-if (pcmc->pci_enabled) {
-memory_region_set_readonly(option_rom_mr, true);
+if (machine_require_guest_memfd(machine)) {
+memory_region_init_ram_guest_memfd(option_rom_mr, NULL, "pc.rom",
+   PC_ROM_SIZE, _fatal);
+} else {
+memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
+   _fatal);
+if (pcmc->pci_enabled) {
+memory_region_set_readonly(option_rom_mr, true);
+}
  }
  memory_region_add_subregion_overlap(rom_memory,
  PC_ROM_MIN_VGA,
diff --git a/hw/i386/pc_sysfw.c b/hw/i386/pc_sysfw.c
index 9dbb3f7337..850f86edd4 100644
--- a/hw/i386/pc_sysfw.c
+++ b/hw/i386/pc_sysfw.c
@@ -54,8 +54,13 @@ static void pc_isa_bios_init(MemoryRegion *rom_memory,
  /* map the last 128KB of the BIOS in ISA space */
  isa_bios_size = MIN(flash_size, 128 * KiB);
  isa_bios = g_malloc(sizeof(*isa_bios));
-memory_region_init_ram(isa_bios, NULL, "isa-bios", isa_bios_size,
-   _fatal);
+if (machine_require_guest_memfd(current_machine)) {
+memory_region_init_ram_guest_memfd(isa_bios, NULL, "isa-bios",
+   isa_bios_size, _fatal);
+} else {
+memory_region_init_ram(isa_bios, NULL, "isa-bios", isa_bios_size,
+   _fatal);
+}
  memory_region_add_subregion_overlap(rom_memory,
  0x10 - isa_bios_size,
  isa_bios,
@@ -68,7 +73,9 @@ static void pc_isa_bios_init(MemoryRegion *rom_memory,
 ((uint8_t*)flash_ptr) + (flash_size - isa_bios_size),
 isa_bios_size);
  
-memory_region_set_readonly(isa_bios, true);

+if (!machine_require_guest_memfd(current_machine)) {
+memory_region_set_readonly(isa_bios, true);
+}
  }
  
  static PFlashCFI01 *pc_pflash_create(PCMachineState *pcms,

--
2.25.1









Re: [PATCH for-9.1 v5 2/3] target/i386: add guest-phys-bits cpu property

2024-03-26 Thread Xiaoyao Li

On 3/25/2024 10:14 PM, Paolo Bonzini wrote:

From: Gerd Hoffmann 

Allows to set guest-phys-bits (cpuid leaf 8008, eax[23:16])
via -cpu $model,guest-phys-bits=$nr.

Signed-off-by: Gerd Hoffmann 
Message-ID: <20240318155336.156197-3-kra...@redhat.com>
Signed-off-by: Paolo Bonzini 


Reviewed-by: Xiaoyao Li 


---
v4->v5:
- move here all non-KVM parts
- add compat property and support for special value "-1" (accelerator
   defines value)

  target/i386/cpu.h |  1 +
  hw/i386/pc.c  |  4 +++-
  target/i386/cpu.c | 22 ++
  3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 6b057380791..83e47358451 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
  
  /* Number of physical address bits supported */

  uint32_t phys_bits;
+uint32_t guest_phys_bits;
  
  /* in order to simplify APIC support, we leave this pointer to the

 user */
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 461fcaa1b48..9c4b3969cc8 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -78,7 +78,9 @@
  { "qemu64-" TYPE_X86_CPU, "model-id", "QEMU Virtual CPU version " v, },\
  { "athlon-" TYPE_X86_CPU, "model-id", "QEMU Virtual CPU version " v, },
  
-GlobalProperty pc_compat_9_0[] = {};

+GlobalProperty pc_compat_9_0[] = {
+{ TYPE_X86_CPU, "guest-phys-bits", "0" },
+};
  const size_t pc_compat_9_0_len = G_N_ELEMENTS(pc_compat_9_0);
  
  GlobalProperty pc_compat_8_2[] = {};

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 33760a2ee16..eef3d08473e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
  /* 64 bit processor */
   *eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);
  }
  *ebx = env->features[FEAT_8000_0008_EBX];
  if (cs->nr_cores * cs->nr_threads > 1) {
@@ -7329,6 +7330,14 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
  goto out;
  }
  
+if (cpu->guest_phys_bits == -1) {

+/*
+ * If it was not set by the user, or by the accelerator via
+ * cpu_exec_realizefn, clear.
+ */
+cpu->guest_phys_bits = 0;
+}
+
  if (cpu->ucode_rev == 0) {
  /*
   * The default is the same as KVM's. Note that this check
@@ -7379,6 +7388,14 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
  if (cpu->phys_bits == 0) {
  cpu->phys_bits = TCG_PHYS_ADDR_BITS;
  }
+if (cpu->guest_phys_bits &&
+(cpu->guest_phys_bits > cpu->phys_bits ||
+cpu->guest_phys_bits < 32)) {
+error_setg(errp, "guest-phys-bits should be between 32 and %u "
+ " (but is %u)",
+ cpu->phys_bits, cpu->guest_phys_bits);
+return;
+}
  } else {
  /* For 32 bit systems don't use the user set value, but keep
   * phys_bits consistent with what we tell the guest.
@@ -7387,6 +7404,10 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
  error_setg(errp, "phys-bits is not user-configurable in 32 bit");
  return;
  }
+if (cpu->guest_phys_bits != 0) {
+error_setg(errp, "guest-phys-bits is not user-configurable in 32 
bit");
+return;
+}
  
  if (env->features[FEAT_1_EDX] & (CPUID_PSE36 | CPUID_PAE)) {

  cpu->phys_bits = 36;
@@ -7887,6 +7908,7 @@ static Property x86_cpu_properties[] = {
  DEFINE_PROP_BOOL("x-force-features", X86CPU, force_features, false),
  DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
  DEFINE_PROP_UINT32("phys-bits", X86CPU, phys_bits, 0),
+DEFINE_PROP_UINT32("guest-phys-bits", X86CPU, guest_phys_bits, -1),
  DEFINE_PROP_BOOL("host-phys-bits", X86CPU, host_phys_bits, false),
  DEFINE_PROP_UINT8("host-phys-bits-limit", X86CPU, host_phys_bits_limit, 
0),
  DEFINE_PROP_BOOL("fill-mtrr-mask", X86CPU, fill_mtrr_mask, true),





Re: [PATCH 12/26] KVM: track whether guest state is encrypted

2024-03-26 Thread Xiaoyao Li

On 3/23/2024 2:11 AM, Paolo Bonzini wrote:

So far, KVM has allowed KVM_GET/SET_* ioctls to execute even if the
guest state is encrypted, in which case they do nothing.  For the new
API using VM types, instead, the ioctls will fail which is a safer and
more robust approach.

The new API will be the only one available for SEV-SNP and TDX, but it
is also usable for SEV and SEV-ES.  In preparation for that, require
architecture-specific KVM code to communicate the point at which guest
state is protected (which must be after kvm_cpu_synchronize_post_init(),
though that might change in the future in order to suppor migration).
 From that point, skip reading registers so that cpu->vcpu_dirty is
never true: if it ever becomes true, kvm_arch_put_registers() will
fail miserably.

Signed-off-by: Paolo Bonzini 
---
  include/sysemu/kvm.h |  2 ++
  include/sysemu/kvm_int.h |  1 +
  accel/kvm/kvm-all.c  | 14 --
  target/i386/sev.c|  1 +
  4 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index fad9a7e8ff3..302e8f6f1e5 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -539,6 +539,8 @@ bool kvm_dirty_ring_enabled(void);
  
  uint32_t kvm_dirty_ring_size(void);
  
+void kvm_mark_guest_state_protected(void);

+
  /**
   * kvm_hwpoisoned_mem - indicate if there is any hwpoisoned page
   * reported for the VM.
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 882e37e12c5..3496be7997a 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -87,6 +87,7 @@ struct KVMState
  bool kernel_irqchip_required;
  OnOffAuto kernel_irqchip_split;
  bool sync_mmu;
+bool guest_state_protected;
  uint64_t manual_dirty_log_protect;
  /* The man page (and posix) say ioctl numbers are signed int, but
   * they're not.  Linux, glibc and *BSD all treat ioctl numbers as
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index a8cecd040eb..05fa3533c66 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2698,7 +2698,7 @@ bool kvm_cpu_check_are_resettable(void)
  
  static void do_kvm_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)

  {
-if (!cpu->vcpu_dirty) {
+if (!cpu->vcpu_dirty && !kvm_state->guest_state_protected) {
  int ret = kvm_arch_get_registers(cpu);
  if (ret) {
  error_report("Failed to get registers: %s", strerror(-ret));
@@ -2712,7 +2712,7 @@ static void do_kvm_cpu_synchronize_state(CPUState *cpu, 
run_on_cpu_data arg)
  
  void kvm_cpu_synchronize_state(CPUState *cpu)

  {
-if (!cpu->vcpu_dirty) {
+if (!cpu->vcpu_dirty && !kvm_state->guest_state_protected) {
  run_on_cpu(cpu, do_kvm_cpu_synchronize_state, RUN_ON_CPU_NULL);
  }
  }
@@ -2747,6 +2747,11 @@ static void do_kvm_cpu_synchronize_post_init(CPUState 
*cpu, run_on_cpu_data arg)
  
  void kvm_cpu_synchronize_post_init(CPUState *cpu)

  {
+/*
+ * This runs before the machine_init_done notifiers, and is the last
+ * opportunity to synchronize the state of confidential guests.
+ */ > +assert(!kvm_state->guest_state_protected);


So, this requires confidential guests to call 
kvm_mark_guest_state_protected() in its machine_init_done notifier callback?


But for TDX, the guest_state is protected at the beginning, not some 
time later when machine_init_done.



  run_on_cpu(cpu, do_kvm_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
  }
  
@@ -4094,3 +4099,8 @@ void query_stats_schemas_cb(StatsSchemaList **result, Error **errp)

  query_stats_schema_vcpu(first_cpu, _args);
  }
  }
+
+void kvm_mark_guest_state_protected(void)
+{
+kvm_state->guest_state_protected = true;
+}
diff --git a/target/i386/sev.c b/target/i386/sev.c
index b8f79d34d19..c49a8fd55eb 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -755,6 +755,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused)
  if (ret) {
  exit(1);
  }
+kvm_mark_guest_state_protected();
  }
  
  /* query the measurement blob length */





Re: [PATCH 21/26] kvm/memory: Make memory type private by default if it has guest memfd backend

2024-03-26 Thread Xiaoyao Li

On 3/23/2024 2:11 AM, Paolo Bonzini wrote:

From: Xiaoyao Li 

KVM side leaves the memory to shared by default, while may incur the


/s/while/which/

fix typo from myself.


overhead of paging conversion on the first visit of each page. Because
the expectation is that page is likely to private for the VMs that
require private memory (has guest memfd).

Explicitly set the memory to private when memory region has valid
guest memfd backend.

Signed-off-by: Xiaoyao Li 
Signed-off-by: Michael Roth 
Message-ID: <20240320083945.991426-16-michael.r...@amd.com>
Signed-off-by: Paolo Bonzini 
---
  accel/kvm/kvm-all.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 7fbaf31cbaf..56b17cbd8aa 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1430,6 +1430,16 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
  strerror(-err));
  abort();
  }
+
+if (memory_region_has_guest_memfd(mr)) {
+err = kvm_set_memory_attributes_private(start_addr, slot_size);
+if (err) {
+error_report("%s: failed to set memory attribute private: 
%s\n",
+ __func__, strerror(-err));
+exit(1);
+}
+}
+
  start_addr += slot_size;
  ram_start_offset += slot_size;
  ram += slot_size;





Re: [PATCH 25/26] kvm: handle KVM_EXIT_MEMORY_FAULT

2024-03-26 Thread Xiaoyao Li

On 3/23/2024 2:11 AM, Paolo Bonzini wrote:

From: Chao Peng 

When geeting KVM_EXIT_MEMORY_FAULT exit, it indicates userspace needs to


typo: /s/geeting/getting


do the memory conversion on the RAMBlock to turn the memory into desired
attribute, i.e., private/shared.

Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
KVM_EXIT_MEMORY_FAULT happens.

Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has
guest_memfd memory backend.

Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is
added.

When page is converted from shared to private, the original shared
memory can be discarded via ram_block_discard_range(). Note, shared
memory can be discarded only when it's not back'ed by hugetlb because
hugetlb is supposed to be pre-allocated and no need for discarding.

Signed-off-by: Chao Peng 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

Message-ID: <20240320083945.991426-13-michael.r...@amd.com>
Signed-off-by: Paolo Bonzini 
---
  include/sysemu/kvm.h   |  2 +
  accel/kvm/kvm-all.c| 99 +-
  accel/kvm/trace-events |  2 +
  3 files changed, 93 insertions(+), 10 deletions(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 2cb31925091..698f1640fe2 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -541,4 +541,6 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, 
Error **errp);
  
  int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);

  int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
+
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private);
  #endif
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 56b17cbd8aa..afd7f992e39 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2893,6 +2893,70 @@ static void kvm_eat_signals(CPUState *cpu)
  } while (sigismember(, SIG_IPI));
  }
  
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)

+{
+MemoryRegionSection section;
+ram_addr_t offset;
+MemoryRegion *mr;
+RAMBlock *rb;
+void *addr;
+int ret = -1;
+
+trace_kvm_convert_memory(start, size, to_private ? "shared_to_private" : 
"private_to_shared");
+
+if (!QEMU_PTR_IS_ALIGNED(start, qemu_real_host_page_size()) ||
+!QEMU_PTR_IS_ALIGNED(size, qemu_real_host_page_size())) {
+return -1;
+}
+
+if (!size) {
+return -1;
+}
+
+section = memory_region_find(get_system_memory(), start, size);
+mr = section.mr;
+if (!mr) {
+return -1;
+}
+
+if (!memory_region_has_guest_memfd(mr)) {
+error_report("Converting non guest_memfd backed memory region "
+ "(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",
+ start, size, to_private ? "private" : "shared");
+ret = -1;


No need for it. ret is initialized as -1 at the function start.


+goto out_unref;
+}
+
+if (to_private) {
+ret = kvm_set_memory_attributes_private(start, size);
+} else {
+ret = kvm_set_memory_attributes_shared(start, size);
+}
+if (ret) {
+goto out_unref;
+}
+
+addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
+rb = qemu_ram_block_from_host(addr, false, );
+
+if (to_private) {
+if (rb->page_size == qemu_real_host_page_size()) {
+/*
+* shared memory is back'ed by  hugetlb, which is supposed to be


Please fix the bad comment indentation for me, as well as the extra 
space before 'hugetlb'



+* pre-allocated and doesn't need to be discarded
+*/
+goto out_unref;
+}
+ret = ram_block_discard_range(rb, offset, size);
+} else {
+ret = ram_block_discard_guest_memfd_range(rb, offset, size);
+}
+
+out_unref:
+memory_region_unref(section.mr);
+return ret;
+}
+
  int kvm_cpu_exec(CPUState *cpu)
  {
  struct kvm_run *run = cpu->kvm_run;
@@ -2960,18 +3024,20 @@ int kvm_cpu_exec(CPUState *cpu)
  ret = EXCP_INTERRUPT;
  break;
  }
-fprintf(stderr, "error: kvm run failed %s\n",
-strerror(-run_ret));
+if (!(run_ret == -EFAULT && run->exit_reason == 
KVM_EXIT_MEMORY_FAULT)) {
+fprintf(stderr, "error: kvm run failed %s\n",
+strerror(-run_ret));
  #ifdef TARGET_PPC
-if (run_ret == -EBUSY) {
-fprintf(stderr,
-"This is probably because your SMT is enabled.\n"
-"VCPU can only run on primary threads with all "
-"secondary threads offline.\n");
-}
+if (run_ret == -EBUSY) {
+fprintf(stderr,
+

Re: [PATCH 3/7] KVM: track whether guest state is encrypted

2024-03-22 Thread Xiaoyao Li

On 3/19/2024 9:59 PM, Paolo Bonzini wrote:

So far, KVM has allowed KVM_GET/SET_* ioctls to execute even if the
guest state is encrypted, in which case they do nothing.  For the new
API using VM types, instead, the ioctls will fail which is a safer and
more robust approach.

The new API will be the only one available for SEV-SNP and TDX, but it
is also usable for SEV and SEV-ES.  In preparation for that, require
architecture-specific KVM code to communicate the point at which guest
state is protected (which must be after kvm_cpu_synchronize_post_init(),
though that might change in the future in order to suppor migration).
 From that point, skip reading registers so that cpu->vcpu_dirty is
never true: if it ever becomes true, kvm_arch_put_registers() will
fail miserably.

Signed-off-by: Paolo Bonzini 


Reviewed-by: Xiaoyao Li 




Re: [PATCH 4/7] KVM: remove kvm_arch_cpu_check_are_resettable

2024-03-22 Thread Xiaoyao Li

On 3/19/2024 9:59 PM, Paolo Bonzini wrote:

Board reset requires writing a fresh CPU state.  As far as KVM is
concerned, the only thing that blocks reset is that CPU state is
encrypted; therefore, kvm_cpus_are_resettable() can simply check
if that is the case.

Signed-off-by: Paolo Bonzini 


Reviewed-by: Xiaoyao Li 





Re: [PATCH 5/7] target/i386: introduce x86-confidential-guest

2024-03-22 Thread Xiaoyao Li

On 3/19/2024 9:59 PM, Paolo Bonzini wrote:

Introduce a common superclass for x86 confidential guest implementations.
It will extend ConfidentialGuestSupportClass with a method that provides
the VM type to be passed to KVM_CREATE_VM.

Signed-off-by: Paolo Bonzini 


Reviewed-by: Xiaoyao Li 





Re: [PATCH 6/7] target/i386: Implement mc->kvm_type() to get VM type

2024-03-22 Thread Xiaoyao Li

On 3/19/2024 9:59 PM, Paolo Bonzini wrote:

From: Xiaoyao Li 

KVM is introducing a new API to create confidential guests, which
will be used by TDX and SEV-SNP but is also available for SEV and
SEV-ES.  The API uses the VM type argument to KVM_CREATE_VM to
identify which confidential computing technology to use.

Since there are no other expected uses of VM types, delegate
mc->kvm_type() for x86 boards to the confidential-guest-support
object pointed to by ms->cgs.

For example, if a sev-guest object is specified to confidential-guest-support,
like,

   qemu -machine ...,confidential-guest-support=sev0 \
-object sev-guest,id=sev0,...

it will check if a VM type KVM_X86_SEV_VM or KVM_X86_SEV_ES_VM
is supported, and if so use them together with the KVM_SEV_INIT2
function of the KVM_MEMORY_ENCRYPT_OP ioctl. If not, it will fall back to
KVM_SEV_INIT and KVM_SEV_ES_INIT.

This is a preparatory work towards TDX and SEV-SNP support, but it
will also enable support for VMSA features such as DebugSwap, which
are only available via KVM_SEV_INIT2.

Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 
Signed-off-by: Paolo Bonzini 


Reviewed-by: Xiaoyao Li 

some nits below.


---
  target/i386/confidential-guest.h | 19 ++
  target/i386/kvm/kvm_i386.h   |  2 ++
  hw/i386/x86.c|  6 +
  target/i386/kvm/kvm.c| 44 
  4 files changed, 71 insertions(+)

diff --git a/target/i386/confidential-guest.h b/target/i386/confidential-guest.h
index ca12d5a8fba..532e172a60b 100644
--- a/target/i386/confidential-guest.h
+++ b/target/i386/confidential-guest.h
@@ -36,5 +36,24 @@ struct X86ConfidentialGuest {
  struct X86ConfidentialGuestClass {
  /*  */
  ConfidentialGuestSupportClass parent;
+
+/*  */
+int (*kvm_type)(X86ConfidentialGuest *cg);
  };
+
+/**
+ * x86_confidential_guest_kvm_type:
+ *
+ * Calls #X86ConfidentialGuestClass.unplug callback of @plug_handler.


ah, forgot to change the callback name after copy+paste.


+ */
+static inline int x86_confidential_guest_kvm_type(X86ConfidentialGuest *cg)
+{
+X86ConfidentialGuestClass *klass = X86_CONFIDENTIAL_GUEST_GET_CLASS(cg);
+
+if (klass->kvm_type) {
+return klass->kvm_type(cg);
+} else {
+return 0;
+}
+}
  #endif
diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h
index 30fedcffea3..02168122787 100644
--- a/target/i386/kvm/kvm_i386.h
+++ b/target/i386/kvm/kvm_i386.h
@@ -37,6 +37,7 @@ bool kvm_hv_vpindex_settable(void);
  bool kvm_enable_sgx_provisioning(KVMState *s);
  bool kvm_hyperv_expand_features(X86CPU *cpu, Error **errp);
  
+int kvm_get_vm_type(MachineState *ms, const char *vm_type);

  void kvm_arch_reset_vcpu(X86CPU *cs);
  void kvm_arch_after_reset_vcpu(X86CPU *cpu);
  void kvm_arch_do_init_vcpu(X86CPU *cs);
@@ -49,6 +50,7 @@ void kvm_request_xsave_components(X86CPU *cpu, uint64_t mask);
  
  #ifdef CONFIG_KVM
  
+bool kvm_is_vm_type_supported(int type);

  bool kvm_has_adjust_clock_stable(void);
  bool kvm_has_exception_payload(void);
  void kvm_synchronize_all_tsc(void);
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index ffbda48917f..2d4b148cd25 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1389,6 +1389,11 @@ static void machine_set_sgx_epc(Object *obj, Visitor *v, 
const char *name,
  qapi_free_SgxEPCList(list);
  }
  
+static int x86_kvm_type(MachineState *ms, const char *vm_type)

+{
+return kvm_enabled() ? kvm_get_vm_type(ms, vm_type) : 0;
+}
+
  static void x86_machine_initfn(Object *obj)
  {
  X86MachineState *x86ms = X86_MACHINE(obj);
@@ -1413,6 +1418,7 @@ static void x86_machine_class_init(ObjectClass *oc, void 
*data)
  mc->cpu_index_to_instance_props = x86_cpu_index_to_props;
  mc->get_default_cpu_node_id = x86_get_default_cpu_node_id;
  mc->possible_cpu_arch_ids = x86_possible_cpu_arch_ids;
+mc->kvm_type = x86_kvm_type;
  x86mc->save_tsc_khz = true;
  x86mc->fwcfg_dma_enabled = true;
  nc->nmi_monitor_handler = x86_nmi;
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 0ec69109a2b..e109648f260 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -31,6 +31,7 @@
  #include "sysemu/kvm_int.h"
  #include "sysemu/runstate.h"
  #include "kvm_i386.h"
+#include "../confidential-guest.h"
  #include "sev.h"
  #include "xen-emu.h"
  #include "hyperv.h"
@@ -161,6 +162,49 @@ static KVMMSRHandlers 
msr_handlers[KVM_MSR_FILTER_MAX_RANGES];
  static RateLimit bus_lock_ratelimit_ctrl;
  static int kvm_get_one_msr(X86CPU *cpu, int index, uint64_t *value);
  
+static const char *vm_type_name[] = {

+[KVM_X86_DEFAULT_VM] = "default",
+};
+
+bool kvm_is_vm_type_supported(int type)
+{
+uint32_t machine_types;


The name of machine_types confuses me a lot. why not supported_vm_types?


+
+/*
+ * old KVM doesn't support

Re: [PATCH RFC v3 00/49] Add AMD Secure Nested Paging (SEV-SNP) support

2024-03-20 Thread Xiaoyao Li

On 3/21/2024 1:08 AM, Paolo Bonzini wrote:

On Wed, Mar 20, 2024 at 10:59 AM Paolo Bonzini  wrote:

I will now focus on reviewing patches 6-20.  This way we can prepare a
common tree for SEV_INIT2/SNP/TDX, for both vendors to build upon.


Ok, the attachment is the delta that I have. The only major change is
requiring discard (thus effectively blocking VFIO support for
SEV-SNP/TDX, at least for now).

I will push it shortly to the same sevinit2 branch, and will post the
patches sometime soon.

Xiaoyao, you can use that branch too (it's on
https://gitlab.com/bonzini/qemu) as the basis for your TDX work.


Sure, it's really a good news for us.

BTW, there are some minor comments on guest_memfd patches of my v5 
post[*]. Could you please resolve them it your branch?


[*] 
https://lore.kernel.org/qemu-devel/20240229063726.610065-1-xiaoyao...@intel.com/



Paolo





Re: [PATCH v5 08/65] kvm: handle KVM_EXIT_MEMORY_FAULT

2024-03-20 Thread Xiaoyao Li

On 3/19/2024 10:14 AM, Wang, Lei wrote:

On 2/29/2024 14:36, Xiaoyao Li wrote:

From: Chao Peng 

When geeting KVM_EXIT_MEMORY_FAULT exit, it indicates userspace needs to
do the memory conversion on the RAMBlock to turn the memory into desired
attribute, i.e., private/shared.

Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
KVM_EXIT_MEMORY_FAULT happens.

Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has
guest_memfd memory backend.

Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is
added.

When page is converted from shared to private, the original shared
memory can be discarded via ram_block_discard_range(). Note, shared
memory can be discarded only when it's not back'ed by hugetlb because
hugetlb is supposed to be pre-allocated and no need for discarding.

Signed-off-by: Chao Peng 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

---
Changes in v4:
- open-coded ram_block_discard logic;
- change warn_report() to error_report(); (Daniel)
---
  accel/kvm/kvm-all.c | 94 -
  1 file changed, 84 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 70d482a2c936..87e4275932a7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2903,6 +2903,68 @@ static void kvm_eat_signals(CPUState *cpu)
  } while (sigismember(, SIG_IPI));
  }
  
+static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)

+{
+MemoryRegionSection section;
+ram_addr_t offset;
+MemoryRegion *mr;
+RAMBlock *rb;
+void *addr;
+int ret = -1;
+
+if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) ||
+!QEMU_PTR_IS_ALIGNED(size, qemu_host_page_size)) {
+return -1;
+}
+
+if (!size) {
+return -1;
+}
+
+section = memory_region_find(get_system_memory(), start, size);
+mr = section.mr;
+if (!mr) {
+return -1;
+}
+
+if (memory_region_has_guest_memfd(mr)) {
+if (to_private) {
+ret = kvm_set_memory_attributes_private(start, size);
+} else {
+ret = kvm_set_memory_attributes_shared(start, size);
+}
+
+if (ret) {
+memory_region_unref(section.mr);
+return ret;
+}
+
+addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
+rb = qemu_ram_block_from_host(addr, false, );
+
+if (to_private) {
+if (rb->page_size != qemu_host_page_size) {
+/*
+* shared memory is back'ed by  hugetlb, which is supposed to be
+* pre-allocated and doesn't need to be discarded
+*/


Nit: comment indentation is broken here.


+return 0;
+} else {
+ret = ram_block_discard_range(rb, offset, size);
+}
+} else {
+ret = ram_block_discard_guest_memfd_range(rb, offset, size);
+}
+} else {
+error_report("Convert non guest_memfd backed memory region "
+"(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",


Same as above.



Fixed.

thanks!




Re: [PATCH v3 13/49] [FIXUP] "kvm: handle KVM_EXIT_MEMORY_FAULT": drop qemu_host_page_size

2024-03-20 Thread Xiaoyao Li

On 3/20/2024 4:39 PM, Michael Roth wrote:

TODO: squash into "kvm: handle KVM_EXIT_MEMORY_FAULT"

qemu_host_page_size has been superseded by qemu_real_host_page_size()
in newer QEMU, so update the patch accordingly.


I found it today as well when rebase to qemu v9.0.0-rc0.

Fix it locally, will show up on my next post of TDX-QEMU patches. :)


Signed-off-by: Michael Roth 
---
  accel/kvm/kvm-all.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 2fdc07a472..a9c19ab9a1 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2912,8 +2912,8 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, 
bool to_private)
  void *addr;
  int ret = -1;
  
-if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) ||

-!QEMU_PTR_IS_ALIGNED(size, qemu_host_page_size)) {
+if (!QEMU_PTR_IS_ALIGNED(start, qemu_real_host_page_size()) ||
+!QEMU_PTR_IS_ALIGNED(size, qemu_real_host_page_size())) {
  return -1;
  }
  
@@ -2943,7 +2943,7 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)

  rb = qemu_ram_block_from_host(addr, false, );
  
  if (to_private) {

-if (rb->page_size != qemu_host_page_size) {
+if (rb->page_size != qemu_real_host_page_size()) {
  /*
  * shared memory is back'ed by  hugetlb, which is supposed to 
be
  * pre-allocated and doesn't need to be discarded





Re: [PATCH v5 06/65] kvm: Introduce support for memory_attributes

2024-03-20 Thread Xiaoyao Li

On 3/19/2024 10:03 AM, Wang, Lei wrote:

On 2/29/2024 14:36, Xiaoyao Li wrote:> Introduce the helper functions to set 
the attributes of a range of

memory to private or shared.

This is necessary to notify KVM the private/shared attribute of each gpa
range. KVM needs the information to decide the GPA needs to be mapped at
hva-based shared memory or guest_memfd based private memory.

Signed-off-by: Xiaoyao Li 
---
Changes in v4:
- move the check of kvm_supported_memory_attributes to the common
   kvm_set_memory_attributes(); (Wang Wei)
- change warn_report() to error_report() in kvm_set_memory_attributes()
   and drop the __func__; (Daniel)
---
  accel/kvm/kvm-all.c  | 44 
  include/sysemu/kvm.h |  3 +++
  2 files changed, 47 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index cd0aa7545a1f..70d482a2c936 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -92,6 +92,7 @@ static bool kvm_has_guest_debug;
  static int kvm_sstep_flags;
  static bool kvm_immediate_exit;
  static bool kvm_guest_memfd_supported;
+static uint64_t kvm_supported_memory_attributes;
  static hwaddr kvm_max_slot_size = ~0;
  
  static const KVMCapabilityInfo kvm_required_capabilites[] = {

@@ -1304,6 +1305,46 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
  kvm_max_slot_size = max_slot_size;
  }
  
+static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)

+{
+struct kvm_memory_attributes attrs;
+int r;
+
+if (kvm_supported_memory_attributes == 0) {
+error_report("No memory attribute supported by KVM\n");
+return -EINVAL;
+}
+
+if ((attr & kvm_supported_memory_attributes) != attr) {
+error_report("memory attribute 0x%lx not supported by KVM,"
+ " supported bits are 0x%lx\n",
+ attr, kvm_supported_memory_attributes);
+return -EINVAL;
+}
+
+attrs.attributes = attr;
+attrs.address = start;
+attrs.size = size;
+attrs.flags = 0;
+
+r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, );
+if (r) {
+error_report("failed to set memory (0x%lx+%#zx) with attr 0x%lx error 
'%s'",
+ start, size, attr, strerror(errno));
+}
+return r;
+}
+
+int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
+{
+return kvm_set_memory_attributes(start, size, 
KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
+{
+return kvm_set_memory_attributes(start, size, 0);
+}
+
  /* Called with KVMMemoryListener.slots_lock held */
  static void kvm_set_phys_mem(KVMMemoryListener *kml,
   MemoryRegionSection *section, bool add)
@@ -2439,6 +2480,9 @@ static int kvm_init(MachineState *ms)
  
  kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
  
+ret = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES);

+kvm_supported_memory_attributes = ret > 0 ? ret : 0;


kvm_check_extension() only returns non-negative value, so we can just

kvm_supported_memory_attributes = kvm_check_extension(s, 
KVM_CAP_MEMORY_ATTRIBUTES);


Good catch, I will update it in next version.


+
  if (object_property_find(OBJECT(current_machine), "kvm-type")) {
  g_autofree char *kvm_type = 
object_property_get_str(OBJECT(current_machine),
  "kvm-type",
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 6cdf82de8372..8e83adfbbd19 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -546,4 +546,7 @@ uint32_t kvm_dirty_ring_size(void);
  bool kvm_hwpoisoned_mem(void);
  
  int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);

+
+int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);
+int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
  #endif





Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()

2024-03-20 Thread Xiaoyao Li

On 3/20/2024 5:37 PM, David Hildenbrand wrote:

On 20.03.24 09:39, Michael Roth wrote:

From: Xiaoyao Li 

When memory page is converted from private to shared, the original
private memory is back'ed by guest_memfd. Introduce
ram_block_discard_guest_memfd_range() for discarding memory in
guest_memfd.

Originally-from: Isaku Yamahata 
Codeveloped-by: Xiaoyao Li 


"Co-developed-by"


Michael is using the patch from my TDX-QEMU v5 series[1]. I need to fix it.


Signed-off-by: Xiaoyao Li 
Reviewed-by: David Hildenbrand 


Your SOB should go here.


---
Changes in v5:
- Collect Reviewed-by from David;

Changes in in v4:
- Drop ram_block_convert_range() and open code its implementation in the
   next Patch.

Signed-off-by: Michael Roth 


I only received 3 patches from this series, and now I am confused: 
changelog talks about v5 and this is "PATCH v3"


As above, because the guest_memfd patches in my TDX-QEMU v5[1] were 
directly picked for this series, so the change history says v5. They are 
needed by SEV-SNP as well.


I want to raise the question, how do we want to proceed with the guest 
memfd patches (patch 2 to 10 in [1])? Can they be merged separately 
before TDX/SNP patches?


Please make sure to send at least the cover letter along (I might not 
need the other 46 patches :D ).





[1] 
https://lore.kernel.org/qemu-devel/20240229063726.610065-1-xiaoyao...@intel.com/




Re: [PATCH v4 1/2] kvm: add support for guest physical bits

2024-03-19 Thread Xiaoyao Li

On 3/18/2024 11:53 PM, Gerd Hoffmann wrote:

Query kvm for supported guest physical address bits, in cpuid
function 8008, eax[23:16].  Usually this is identical to host
physical address bits.  With NPT or EPT being used this might be
restricted to 48 (max 4-level paging address space size) even if
the host cpu supports more physical address bits.

When set pass this to the guest, using cpuid too.  Guest firmware
can use this to figure how big the usable guest physical address
space is, so PCI bar mapping are actually reachable.

Signed-off-by: Gerd Hoffmann 
---
  target/i386/cpu.h |  1 +
  target/i386/cpu.c |  1 +
  target/i386/kvm/kvm-cpu.c | 31 ++-
  3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..d427218827f6 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
  
  /* Number of physical address bits supported */

  uint32_t phys_bits;
+uint32_t guest_phys_bits;
  
  /* in order to simplify APIC support, we leave this pointer to the

 user */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 9a210d8d9290..c88c895a5b3e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
  /* 64 bit processor */
   *eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);
  }
  *ebx = env->features[FEAT_8000_0008_EBX];
  if (cs->nr_cores * cs->nr_threads > 1) {
diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
index 9c791b7b0520..5132bb96abd5 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -18,10 +18,33 @@
  #include "kvm_i386.h"
  #include "hw/core/accel-cpu.h"
  
+static void kvm_set_guest_phys_bits(CPUState *cs)

+{
+X86CPU *cpu = X86_CPU(cs);
+uint32_t eax, guest_phys_bits;
+
+eax = kvm_arch_get_supported_cpuid(cs->kvm_state, 0x8008, 0, R_EAX);
+guest_phys_bits = (eax >> 16) & 0xff;
+if (!guest_phys_bits) {
+return;
+}
+
+if (cpu->guest_phys_bits == 0 ||
+cpu->guest_phys_bits > guest_phys_bits) {
+cpu->guest_phys_bits = guest_phys_bits;
+}
+
+if (cpu->host_phys_bits_limit &&
+cpu->guest_phys_bits > cpu->host_phys_bits_limit) {
+cpu->guest_phys_bits = cpu->host_phys_bits_limit;


host_phys_bits_limit takes effect only when cpu->host_phys_bits is set.

If users pass configuration like "-cpu 
qemu64,phys-bits=52,host-phys-bits-limit=45", the cpu->guest_phys_bits 
will be set to 45. I think this is not what we want, though the usage 
seems insane.


We can guard it as

 if (cpu->host_phys_bits && cpu->host_phys_bits_limit &&
 cpu->guest_phys_bits > cpu->host_phys_bits_limt)
{
}

Simpler, we can guard with cpu->phys_bits like below, because 
cpu->host_phys_bits_limit is used to guard cpu->phys_bits in 
host_cpu_realizefn()


 if (cpu->guest_phys_bits > cpu->phys_bits) {
cpu->guest_phys_bits = cpu->phys_bits;
}


with this resolved,

Reviewed-by: Xiaoyao Li 


+}
+}
+
  static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
  {
  X86CPU *cpu = X86_CPU(cs);
  CPUX86State *env = >env;
+bool ret;
  
  /*

   * The realize order is important, since x86_cpu_realize() checks if
@@ -50,7 +73,13 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
 MSR_IA32_UCODE_REV);
  }
  }
-return host_cpu_realizefn(cs, errp);
+ret = host_cpu_realizefn(cs, errp);
+if (!ret) {
+return ret;
+}
+
+kvm_set_guest_phys_bits(cs);
+return true;
  }
  
  static bool lmce_supported(void)





Re: [PATCH] target/i386: Export RFDS bit to guests

2024-03-19 Thread Xiaoyao Li

On 3/19/2024 11:08 PM, Pawan Gupta wrote:

On Tue, Mar 19, 2024 at 12:22:08PM +0800, Xiaoyao Li wrote:

On 3/13/2024 10:53 PM, Pawan Gupta wrote:

Register File Data Sampling (RFDS) is a CPU side-channel vulnerability
that may expose stale register value. CPUs that set RFDS_NO bit in MSR
IA32_ARCH_CAPABILITIES indicate that they are not vulnerable to RFDS.
Similarly, RFDS_CLEAR indicates that CPU is affected by RFDS, and has
the microcode to help mitigate RFDS.

Make RFDS_CLEAR and RFDS_NO bits available to guests.


What's the status of KVM part?


KVM part is already upstreamed and backported:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.8.1=50d33b98b1e23d1cd8743b3cac7a0ae5718b8b00


I see. It was not sent to kvm maillist and not merged through KVM tree.

With KVM part in palce,

Reviewed-by: Xiaoyao Li 



Re: [PATCH] target/i386: Export RFDS bit to guests

2024-03-18 Thread Xiaoyao Li

On 3/13/2024 10:53 PM, Pawan Gupta wrote:

Register File Data Sampling (RFDS) is a CPU side-channel vulnerability
that may expose stale register value. CPUs that set RFDS_NO bit in MSR
IA32_ARCH_CAPABILITIES indicate that they are not vulnerable to RFDS.
Similarly, RFDS_CLEAR indicates that CPU is affected by RFDS, and has
the microcode to help mitigate RFDS.

Make RFDS_CLEAR and RFDS_NO bits available to guests.


What's the status of KVM part?


Signed-off-by: Pawan Gupta 
---
  target/i386/cpu.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 9a210d8d9290..693a5e0fb2ce 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -1158,8 +1158,8 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
  NULL, "sbdr-ssdp-no", "fbsdp-no", "psdp-no",
  NULL, "fb-clear", NULL, NULL,
  NULL, NULL, NULL, NULL,
-"pbrsb-no", NULL, "gds-no", NULL,
-NULL, NULL, NULL, NULL,
+"pbrsb-no", NULL, "gds-no", "rfds-no",
+"rfds-clear", NULL, NULL, NULL,
  },
  .msr = {
  .index = MSR_IA32_ARCH_CAPABILITIES,

base-commit: a1932d7cd6507d4d9db2044a54731fff3e749bac





Re: [PATCH 2/4] i386/sev: Switch to use confidential_guest_kvm_init()

2024-03-18 Thread Xiaoyao Li

On 3/19/2024 5:51 AM, Paolo Bonzini wrote:

On Thu, Feb 29, 2024 at 7:01 AM Xiaoyao Li  wrote:


Use confidential_guest_kvm_init() instead of calling SEV specific
sev_kvm_init(). As a bouns, it fits to future TDX when TDX implements
its own confidential_guest_support and .kvm_init().

Move the "TypeInfo sev_guest_info" definition and related functions to
the end of the file, to avoid declaring the sev_kvm_init() ahead.

Delete the sve-stub.c since it's not needed anymore.

Signed-off-by: Xiaoyao Li 
---
Changes from rfc v1:
- check ms->cgs not NULL before calling confidential_guest_kvm_init();
- delete the sev-stub.c;


Queued, with just one small simplification that can be done on top:


thank you, Paolo!


diff --git a/target/i386/sev.c b/target/i386/sev.c
index e89d64fa52..b8f79d34d1 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -851,18 +851,13 @@ sev_vm_state_change(void *opaque, bool running,
RunState state)

  static int sev_kvm_init(ConfidentialGuestSupport *cgs, Error **errp)
  {
-SevGuestState *sev
-= (SevGuestState *)object_dynamic_cast(OBJECT(cgs), TYPE_SEV_GUEST);
+SevGuestState *sev = SEV_GUEST(cgs);
  char *devname;
  int ret, fw_error, cmd;
  uint32_t ebx;
  uint32_t host_cbitpos;
  struct sev_user_data_status status = {};

-if (!sev) {
-return 0;
-}
-
  ret = ram_block_discard_disable(true);
  if (ret) {
  error_report("%s: cannot disable RAM discard", __func__);


It looks good.


Thanks!

Paolo





Re: [PATCH v3 2/3] kvm: add support for guest physical bits

2024-03-17 Thread Xiaoyao Li

On 3/13/2024 9:27 PM, Gerd Hoffmann wrote:

Query kvm for supported guest physical address bits, in cpuid
function 8008, eax[23:16].  Usually this is identical to host
physical address bits.  With NPT or EPT being used this might be
restricted to 48 (max 4-level paging address space size) even if
the host cpu supports more physical address bits.

When set pass this to the guest, using cpuid too.  Guest firmware
can use this to figure how big the usable guest physical address
space is, so PCI bar mapping are actually reachable.

Signed-off-by: Gerd Hoffmann 
---
  target/i386/cpu.h |  1 +
  target/i386/cpu.c |  1 +
  target/i386/kvm/kvm-cpu.c | 32 +++-
  3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..d427218827f6 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
  
  /* Number of physical address bits supported */

  uint32_t phys_bits;
+uint32_t guest_phys_bits;
  
  /* in order to simplify APIC support, we leave this pointer to the

 user */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 9a210d8d9290..c88c895a5b3e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
  /* 64 bit processor */
   *eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);
  }
  *ebx = env->features[FEAT_8000_0008_EBX];
  if (cs->nr_cores * cs->nr_threads > 1) {
diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
index 9c791b7b0520..a2b7bfaeadf8 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -18,10 +18,36 @@
  #include "kvm_i386.h"
  #include "hw/core/accel-cpu.h"
  
+static void kvm_set_guest_phys_bits(CPUState *cs)

+{
+X86CPU *cpu = X86_CPU(cs);
+uint32_t eax, guest_phys_bits;
+
+if (!cpu->host_phys_bits) {
+return;
+}


This needs explanation of why. What if users set the phys-bits to 
exactly host's value, via "-cpu xxx,phys-bits=host's value"?




+eax = kvm_arch_get_supported_cpuid(cs->kvm_state, 0x8008, 0, R_EAX);
+guest_phys_bits = (eax >> 16) & 0xff;
+if (!guest_phys_bits) {
+return;
+}
+
+if (cpu->guest_phys_bits == 0 ||
+cpu->guest_phys_bits > guest_phys_bits) {
+cpu->guest_phys_bits = guest_phys_bits;
+}
+
+if (cpu->guest_phys_bits > cpu->host_phys_bits_limit) {
+cpu->guest_phys_bits = cpu->host_phys_bits_limit;
+}
+}
+
  static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
  {
  X86CPU *cpu = X86_CPU(cs);
  CPUX86State *env = >env;
+bool ret;
  
  /*

   * The realize order is important, since x86_cpu_realize() checks if
@@ -50,7 +76,11 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
 MSR_IA32_UCODE_REV);
  }
  }
-return host_cpu_realizefn(cs, errp);
+ret = host_cpu_realizefn(cs, errp);


We need to check ret and return if !ret;


+kvm_set_guest_phys_bits(cs);
+
+return ret;
  }
  
  static bool lmce_supported(void)





Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL

2024-03-15 Thread Xiaoyao Li

On 3/13/2024 11:31 PM, Daniel P. Berrangé wrote:

On Tue, Mar 12, 2024 at 03:44:32PM +0800, Xiaoyao Li wrote:

On 3/11/2024 5:27 PM, Daniel P. Berrangé wrote:

On Thu, Feb 29, 2024 at 01:37:10AM -0500, Xiaoyao Li wrote:

From: Isaku Yamahata 

Add property "quote-generation-socket" to tdx-guest, which is a property
of type SocketAddress to specify Quote Generation Service(QGS).

On request of GetQuote, it connects to the QGS socket, read request
data from shared guest memory, send the request data to the QGS,
and store the response into shared guest memory, at last notify
TD guest by interrupt.

command line example:
qemu-system-x86_64 \
  -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", 
"cid":"1","port":"1234"}}' \


Can you illustrate this with 'unix' sockets, not 'vsock'.


Are you suggesting only updating the commit message to an example of unix
socket? Or you want the code to test with some unix socket QGS?

(It seems the QGS I got for testing, only supports vsock socket. Because at
the time when it got developed, it was supposed to communicate with drivers
inside TD guest directly not via VMM (KVM+QEMU). Anyway, I will talk to
internal folks to see if any plan to support unix socket.)


The QGS provided as part of DCAP supports running with both
UNIX sockets and VSOCK, and I would expect QEMU to be made
to work with this, since its is Intel's OSS reference impl.


After synced with internal folks, yes, the QGS I used does support unix 
socket. I tested it and it worked.


-object 
'{"qom-type":"tdx-guest","id":"tdx","quote-generation-socket":{"type":"unix", 
"path":"/var/run/tdx-qgs/qgs.socket"}}'



Exposing QGS to the guest when we only intend for it to be
used by the host QEMU is needlessly expanding the attack
surface.

With regards,
Daniel





Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL

2024-03-12 Thread Xiaoyao Li

On 3/11/2024 5:27 PM, Daniel P. Berrangé wrote:

On Thu, Feb 29, 2024 at 01:37:10AM -0500, Xiaoyao Li wrote:

From: Isaku Yamahata 

Add property "quote-generation-socket" to tdx-guest, which is a property
of type SocketAddress to specify Quote Generation Service(QGS).

On request of GetQuote, it connects to the QGS socket, read request
data from shared guest memory, send the request data to the QGS,
and store the response into shared guest memory, at last notify
TD guest by interrupt.

command line example:
   qemu-system-x86_64 \
 -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", 
"cid":"1","port":"1234"}}' \


Can you illustrate this with 'unix' sockets, not 'vsock'.


Are you suggesting only updating the commit message to an example of 
unix socket? Or you want the code to test with some unix socket QGS?


(It seems the QGS I got for testing, only supports vsock socket. Because 
at the time when it got developed, it was supposed to communicate with 
drivers inside TD guest directly not via VMM (KVM+QEMU). Anyway, I will 
talk to internal folks to see if any plan to support unix socket.)



It makes no conceptual sense to be using vsock for two
processes on the host to be using vsock to talk to
each other. vsock is only needed for the guest to talk
to the host.


 -machine confidential-guest-support=tdx0

Note, above example uses vsock type socket because the QGS we used
implements the vsock socket. It can be other types, like UNIX socket,
which depends on the implementation of QGS.

To avoid no response from QGS server, setup a timer for the transaction.
If timeout, make it an error and interrupt guest. Define the threshold of
time to 30s at present, maybe change to other value if not appropriate.

Signed-off-by: Isaku Yamahata 
Codeveloped-by: Chenyi Qiang 
Signed-off-by: Chenyi Qiang 
Codeveloped-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- add more decription of quote-generation-socket property;

Changes in v4:
- merge next patch "i386/tdx: setup a timer for the qio channel";

Changes in v3:
- rename property "quote-generation-service" to "quote-generation-socket";
- change the type of "quote-generation-socket" from str to
   SocketAddress;
- squash next patch into this one;
---
  qapi/qom.json |   8 +-
  target/i386/kvm/meson.build   |   2 +-
  target/i386/kvm/tdx-quote-generator.c | 170 
  target/i386/kvm/tdx-quote-generator.h |  95 +++
  target/i386/kvm/tdx.c | 216 ++
  target/i386/kvm/tdx.h |   6 +
  6 files changed, 495 insertions(+), 2 deletions(-)
  create mode 100644 target/i386/kvm/tdx-quote-generator.c
  create mode 100644 target/i386/kvm/tdx-quote-generator.h



With regards,
Daniel





Re: [PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility

2024-03-12 Thread Xiaoyao Li

On 3/11/2024 3:29 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 3/7/2024 9:51 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:51 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility

Originated-from: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- mention additional error information in gpa when it presents;
- refine the documentation; (Markus)

Changes in v4:
- refine the documentation; (Markus)

Changes in v3:
- Add docmentation of new type and struct; (Daniel)
- refine the error message handling; (Daniel)
---
qapi/run-state.json   | 31 +--
system/runstate.c | 58 +++
target/i386/kvm/tdx.c | 24 +-
3 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/qapi/run-state.json b/qapi/run-state.json
index dd0770b379e5..b71dd1884eb6 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json


[...]


@@ -564,6 +567,30 @@
  'psw-addr': 'uint64',
  'reason': 'S390CrashReason'}}
+##
+# @GuestPanicInformationTdx:
+#
+# TDX Guest panic information specific to TDX, as specified in the
+# "Guest-Hypervisor Communication Interface (GHCI) Specification",
+# section TDG.VP.VMCALL.
+#
+# @error-code: TD-specific error code
+#
+# @message: Human-readable error message provided by the guest. Not
+# to be trusted.
+#
+# @gpa: guest-physical address of a page that contains more verbose
+# error information, as zero-terminated string.  Present when the
+# "GPA valid" bit (bit 63) is set in @error-code.


Uh, peeking at GHCI Spec section 3.4 TDG.VP.VMCALL, I
see operand R12 consists of

   bitsnamedescription
   31:0TD-specific error code  TD-specific error code
   Panic – 0x0.
   Values – 0x1 to 0x
   reserved.
   62:32   TD-specific extendedTD-specific extended error code.
   error code  TD software defined.
   63  GPA Valid   Set if the TD specified additional
   information in the GPA parameter
   (R13).
Is @error-code all of R12, or just bits 31:0?
If it's all of R12, description of @error-code as "TD-specific error
code" is misleading.


We pass all of R12 to @error_code.

Here it wants to use "error_code" as generic as the whole R12. Do you have any 
better description of it ?


Sadly, the spec is of no help: it doesn't name the entire thing, only
the three sub-fields TD-specific error code, TD-specific extended error
code, GPA valid.

We could take the hint, and provide the sub-fields instead:

* @error-code contains the TD-specific error code (bits 31:0)

* @extended-error-code contains the TD-specific extended error code
(bits 62:32)

* we don't need @gpa-valid, because it's the same as "@gpa is present"

If we decide to keep the single member, we do need another name for it.
@error-codes (plural) doesn't exactly feel wonderful, but it gives at
least a subtle hint that it's not just *the* error code.


The reason we only defined one single member, is that the
extended-error-code is not used now, and I believe it won't be used in
the near future.


Aha!  Then I recommend

* @error-code contains the TD-specific error code (bits 31:0)

* Omit bits 62:32 from the reply; if we later find an actual use for
   them, we can add a suitable member

* Omit bit 63, because it's the same as "@gpa is present"


If no objection from others, I will use @error-codes (plural) in the
next version.


I recommend to keep the @error-code name, but narrow its value to the
actual error code, i.e. bits 31:0.


It works for me. I will got this direction in the next version.


If it's just bits 31:0, then 'Present when the "GPA valid" bit (bit 63)
is set in @error-code' is wrong.  Could go with 'Only present when the
guest provides this information'.


[...]






Re: [PATCH v9 06/21] i386/cpu: Use APIC ID info to encode cache topo in CPUID[4]

2024-03-11 Thread Xiaoyao Li

On 3/10/2024 9:38 PM, Zhao Liu wrote:

Hi Xiaoyao,


   case 3: /* L3 cache info */
-die_offset = apicid_die_offset(_info);
   if (cpu->enable_l3_cache) {
+addressable_threads_width = apicid_die_offset(_info);


Please get rid of the local variable @addressable_threads_width.

It is truly confusing.


There're several reasons for this:

1. This commit is trying to use APIC ID topology layout to decode 2
cache topology fields in CPUID[4], CPUID.04H:EAX[bits 25:14] and
CPUID.04H:EAX[bits 31:26]. When there's a addressable_cores_width to map
to CPUID.04H:EAX[bits 31:26], it's more clear to also map
CPUID.04H:EAX[bits 25:14] to another variable.


I don't dislike using a variable. I dislike the name of that variable 
since it's misleading



2. All these 2 variables are temporary in this commit, and they will be
replaed by 2 helpers in follow-up cleanup of this series.


you mean patch 20?

I don't see how removing the local variable @addressable_threads_width 
conflicts with patch 20. As a con, it introduces code churn.



3. Similarly, to make it easier to clean up later with the helper and
for more people to review, it's neater to explicitly indicate the
CPUID.04H:EAX[bits 25:14] with a variable here.


If you do want keeping the variable. Please add a comment above it to 
explain the meaning.



4. I call this field as addressable_threads_width since it's "Maximum
number of addressable IDs for logical processors sharing this cache".

Its own name is long, but given the length, only individual words could
be selected as names.

TBH, "addressable" deserves more emphasis than "sharing". The former
emphasizes the fact that the number of threads chosen here is based on
the APIC ID layout and does not necessarily correspond to actual threads.

Therefore, this variable is needed here.

Thanks,
Zhao






Re: [PATCH v9 11/21] i386/cpu: Decouple CPUID[0x1F] subleaf with specific topology level

2024-03-11 Thread Xiaoyao Li

On 2/27/2024 6:32 PM, Zhao Liu wrote:

From: Zhao Liu 

At present, the subleaf 0x02 of CPUID[0x1F] is bound to the "die" level.

In fact, the specific topology level exposed in 0x1F depends on the
platform's support for extension levels (module, tile and die).

To help expose "module" level in 0x1F, decouple CPUID[0x1F] subleaf
with specific topology level.

Tested-by: Yongwei Ma 
Signed-off-by: Zhao Liu 


Reviewed-by: Xiaoyao Li 

Besides, some nits below.


---
Changes since v7:
  * Refactored the encode_topo_cpuid1f() to use traversal to search the
encoded level and avoid using static variables. (Xiaoyao)
- Since the total number of levels in the bitmap is not too large,
  the overhead of traversing is supposed to be acceptable.
  * Renamed the variable num_cpus_next_level to num_threads_next_level.
(Xiaoyao)
  * Renamed the helper num_cpus_by_topo_level() to
num_threads_by_topo_level(). (Xiaoyao)
  * Dropped Michael/Babu's Acked/Tested tags since the code change.
  * Re-added Yongwei's Tested tag For his re-testing.

Changes since v3:
  * New patch to prepare to expose module level in 0x1F.
  * Moved the CPUTopoLevel enumeration definition from "i386: Add cache
topology info in CPUCacheInfo" to this patch. Note, to align with
topology types in SDM, revert the name of CPU_TOPO_LEVEL_UNKNOW to
CPU_TOPO_LEVEL_INVALID.
---
  target/i386/cpu.c | 138 +-
  1 file changed, 113 insertions(+), 25 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 88dffd2b52e3..b0f171c6a465 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -269,6 +269,118 @@ static void encode_cache_cpuid4(CPUCacheInfo *cache,
 (cache->complex_indexing ? CACHE_COMPLEX_IDX : 0);
  }
  
+static uint32_t num_threads_by_topo_level(X86CPUTopoInfo *topo_info,

+  enum CPUTopoLevel topo_level)
+{
+switch (topo_level) {
+case CPU_TOPO_LEVEL_SMT:
+return 1;
+case CPU_TOPO_LEVEL_CORE:
+return topo_info->threads_per_core;
+case CPU_TOPO_LEVEL_DIE:
+return topo_info->threads_per_core * topo_info->cores_per_die;
+case CPU_TOPO_LEVEL_PACKAGE:
+return topo_info->threads_per_core * topo_info->cores_per_die *
+   topo_info->dies_per_pkg;
+default:
+g_assert_not_reached();
+}
+return 0;
+}
+
+static uint32_t apicid_offset_by_topo_level(X86CPUTopoInfo *topo_info,
+enum CPUTopoLevel topo_level)
+{
+switch (topo_level) {
+case CPU_TOPO_LEVEL_SMT:
+return 0;
+case CPU_TOPO_LEVEL_CORE:
+return apicid_core_offset(topo_info);
+case CPU_TOPO_LEVEL_DIE:
+return apicid_die_offset(topo_info);
+case CPU_TOPO_LEVEL_PACKAGE:
+return apicid_pkg_offset(topo_info);
+default:
+g_assert_not_reached();
+}
+return 0;
+}
+
+static uint32_t cpuid1f_topo_type(enum CPUTopoLevel topo_level)
+{
+switch (topo_level) {
+case CPU_TOPO_LEVEL_INVALID:
+return CPUID_1F_ECX_TOPO_LEVEL_INVALID;
+case CPU_TOPO_LEVEL_SMT:
+return CPUID_1F_ECX_TOPO_LEVEL_SMT;
+case CPU_TOPO_LEVEL_CORE:
+return CPUID_1F_ECX_TOPO_LEVEL_CORE;
+case CPU_TOPO_LEVEL_DIE:
+return CPUID_1F_ECX_TOPO_LEVEL_DIE;
+default:
+/* Other types are not supported in QEMU. */
+g_assert_not_reached();
+}
+return 0;
+}
+
+static void encode_topo_cpuid1f(CPUX86State *env, uint32_t count,
+X86CPUTopoInfo *topo_info,
+uint32_t *eax, uint32_t *ebx,
+uint32_t *ecx, uint32_t *edx)
+{
+X86CPU *cpu = env_archcpu(env);
+unsigned long level;
+uint32_t num_threads_next_level, offset_next_level;
+
+assert(count + 1 < CPU_TOPO_LEVEL_MAX);
+
+/*
+ * Find the No.count topology levels in avail_cpu_topo bitmap.
+ * Start from bit 0 (CPU_TOPO_LEVEL_INVALID).


AFAICS, it starts from bit 1 (CPU_TOPO_LEVEL_SMT). Because the initial 
value of level is CPU_TOPO_LEVEL_INVALID, but the first round of the 
loop is to find the valid bit starting from (level + 1).



+ */
+level = CPU_TOPO_LEVEL_INVALID;
+for (int i = 0; i <= count; i++) {
+level = find_next_bit(env->avail_cpu_topo,
+  CPU_TOPO_LEVEL_PACKAGE,
+  level + 1);
+
+/*
+ * CPUID[0x1f] doesn't explicitly encode the package level,
+ * and it just encode the invalid level (all fields are 0)
+ * into the last subleaf of 0x1f.
+ */


QEMU will never set bit CPU_TOPO_LEVEL_PACKAGE in env->avail_cpu_topo.

So I think we should assert() it instead of fixing it silently.


+if (level == CPU_TOPO_LEVEL_PACKAGE) {
+level = CPU_TOPO_LEVEL_INVALID;
+brea

Re: [PATCH v9 09/21] i386/cpu: Introduce bitmap to cache available CPU topology levels

2024-03-11 Thread Xiaoyao Li

On 2/27/2024 6:32 PM, Zhao Liu wrote:

From: Zhao Liu 

Currently, QEMU checks the specify number of topology domains to detect
if there's extended topology levels (e.g., checking nr_dies).

With this bitmap, the extended CPU topology (the levels other than SMT,
core and package) could be easier to detect without touching the
topology details.

This is also in preparation for the follow-up to decouple CPUID[0x1F]
subleaf with specific topology level.

Tested-by: Yongwei Ma 
Signed-off-by: Zhao Liu 



Reviewed-by: Xiaoyao Li 



Re: [PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility

2024-03-10 Thread Xiaoyao Li

On 3/7/2024 9:51 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:51 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility

Originated-from: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- mention additional error information in gpa when it presents;
- refine the documentation; (Markus)

Changes in v4:
- refine the documentation; (Markus)

Changes in v3:
- Add docmentation of new type and struct; (Daniel)
- refine the error message handling; (Daniel)
---
   qapi/run-state.json   | 31 +--
   system/runstate.c | 58 +++
   target/i386/kvm/tdx.c | 24 +-
   3 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/qapi/run-state.json b/qapi/run-state.json
index dd0770b379e5..b71dd1884eb6 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -483,10 +483,12 @@
  #
  # @s390: s390 guest panic information type (Since: 2.12)
  #
+# @tdx: tdx guest panic information type (Since: 9.0)
+#
  # Since: 2.9
  ##
  { 'enum': 'GuestPanicInformationType',
-  'data': [ 'hyper-v', 's390' ] }
+  'data': [ 'hyper-v', 's390', 'tdx' ] }
 ##
# @GuestPanicInformation:
@@ -501,7 +503,8 @@
'base': {'type': 'GuestPanicInformationType'},
'discriminator': 'type',
'data': {'hyper-v': 'GuestPanicInformationHyperV',
-  's390': 'GuestPanicInformationS390'}}
+  's390': 'GuestPanicInformationS390',
+  'tdx' : 'GuestPanicInformationTdx'}}
  ##
  # @GuestPanicInformationHyperV:
@@ -564,6 +567,30 @@
 'psw-addr': 'uint64',
 'reason': 'S390CrashReason'}}
+##
+# @GuestPanicInformationTdx:
+#
+# TDX Guest panic information specific to TDX, as specified in the
+# "Guest-Hypervisor Communication Interface (GHCI) Specification",
+# section TDG.VP.VMCALL.
+#
+# @error-code: TD-specific error code
+#
+# @message: Human-readable error message provided by the guest. Not
+# to be trusted.
+#
+# @gpa: guest-physical address of a page that contains more verbose
+# error information, as zero-terminated string.  Present when the
+# "GPA valid" bit (bit 63) is set in @error-code.


Uh, peeking at GHCI Spec section 3.4 TDG.VP.VMCALL, I
see operand R12 consists of

  bitsnamedescription
  31:0TD-specific error code  TD-specific error code
  Panic – 0x0.
  Values – 0x1 to 0x
  reserved.
  62:32   TD-specific extendedTD-specific extended error code.
  error code  TD software defined.
  63  GPA Valid   Set if the TD specified additional
  information in the GPA parameter
  (R13).
Is @error-code all of R12, or just bits 31:0?
If it's all of R12, description of @error-code as "TD-specific error
code" is misleading.


We pass all of R12 to @error_code.

Here it wants to use "error_code" as generic as the whole R12. Do you have any 
better description of it ?


Sadly, the spec is of no help: it doesn't name the entire thing, only
the three sub-fields TD-specific error code, TD-specific extended error
code, GPA valid.

We could take the hint, and provide the sub-fields instead:

* @error-code contains the TD-specific error code (bits 31:0)

* @extended-error-code contains the TD-specific extended error code
   (bits 62:32)

* we don't need @gpa-valid, because it's the same as "@gpa is present"

If we decide to keep the single member, we do need another name for it.
@error-codes (plural) doesn't exactly feel wonderful, but it gives at
least a subtle hint that it's not just *the* error code.


The reason we only defined one single member, is that the 
extended-error-code is not used now, and I believe it won't be used in 
the near future.


If no objection from others, I will use @error-codes (plural) in the 
next version.



If it's just bits 31:0, then 'Present when the "GPA valid" bit (bit 63)
is set in @error-code' is wrong.  Could go with 'Only present when the
guest provides this information'.


+#
+#


Drop one of these two lines, please.


+# Since: 9.0
+##
+{'struct': 'GuestPanicInformationTdx',
+ 'data': {'error-code': 'uint64',
+  'message': 'str',
+  '*gpa': 'uint64'}}
+
   ##
   # @MEMORY_FAILURE:
   #









Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig

2024-03-10 Thread Xiaoyao Li

On 3/7/2024 9:56 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 3/7/2024 4:39 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 9:25 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:37 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
can be provided for TDX attestation. Detailed meaning of them can be
found: 
https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/

Allow user to specify those values via property mrconfigid, mrowner and
mrownerconfig. They are all in base64 format.

example
-object tdx-guest, \
  
mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
  mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
  
mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

[...]


diff --git a/qapi/qom.json b/qapi/qom.json
index 89ed89b9b46e..cac875349a3a 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -905,10 +905,25 @@
# pages.  Some guest OS (e.g., Linux TD guest) may require this to
# be set, otherwise they refuse to boot.
#
+# @mrconfigid: ID for non-owner-defined configuration of the guest TD,
+# e.g., run-time or OS configuration (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).


Suggest to drop the parenthesis in the last sentence.

@mrconfigid is a string, so the default value can't be 0.  Actually,
it's not just any string, but a base64 encoded SHA384 digest, which
means it must be exactly 96 hex digits.  So it can't be "0", either.  It
could be
"".


I thought value 0 of SHA384 just means it.

That's my fault and my poor english.


"Fault" is too harsh :)  It's not as precise as I want our interface
documentation to be.  We work together to get there.


More on this below.


+#
+# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrownerconfig: ID for owner-defined configuration of the guest TD,
+# e.g., specific to the workload rather than the run-time or OS
+# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
+# used when absent).
+#
# Since: 9.0
##
{ 'struct': 'TdxGuestProperties',
-  'data': { '*sept-ve-disable': 'bool' } }
+  'data': { '*sept-ve-disable': 'bool',
+'*mrconfigid': 'str',
+'*mrowner': 'str',
+'*mrownerconfig': 'str' } }
##
# @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d0ad4f57b5d0..4ce2f1d082ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -13,6 +13,7 @@
#include "qemu/osdep.h"
#include "qemu/error-report.h"
+#include "qemu/base64.h"
#include "qapi/error.h"
#include "qom/object_interfaces.h"
#include "standard-headers/asm-x86/kvm_para.h"
@@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
X86CPU *x86cpu = X86_CPU(cpu);
CPUX86State *env = >env;
g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
+size_t data_len;
int r = 0;
object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);
@@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
sizeof(struct kvm_cpuid_entry2) * 
KVM_MAX_CPUID_ENTRIES);
+#define SHA384_DIGEST_SIZE  48
+
+if (tdx_guest->mrconfigid) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid,
+  strlen(tdx_guest->mrconfigid), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrconfigid");
+return -1;
+}
+memcpy(init_vm->mrconfigid, data, data_len);
+}


When @mrconfigid is absent, the property remains null, and this
conditional is not executed.  init_vm->mrconfigid[], an array of 6
__u64, remains all zero.  How does the kernel treat that?


A all-zero SHA384 value is still a valid value, isn't it?

KVM treats it with no difference.


Can you point me to the spot in the kernel where mrconfigid is used?


https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2322

KVM just copy what QEMU provides into its own data structure @td_params. The 
format @of td_params is defined by TDX spec, and @td_params needs to be passed 
to TDX module when initialize the context of TD via SEAMCALL(TDH.MNG.INIT): 
https://github

Re: [PATCH v9 08/21] i386/cpu: Consolidate the use of topo_info in cpu_x86_cpuid()

2024-03-09 Thread Xiaoyao Li

On 2/27/2024 6:32 PM, Zhao Liu wrote:

From: Zhao Liu

In cpu_x86_cpuid(), there are many variables in representing the cpu
topology, e.g., topo_info, cs->nr_cores and cs->nr_threads.

Since the names of cs->nr_cores/cs->nr_threads does not accurately


Again as in v7, please changes to "cs->nr_cores and cs->nr_threads",

"cs->nr_cores/cs->nr_threads" looks like a single variable of division


represent its meaning, the use of cs->nr_cores/cs->nr_threads is prone
to confusion and mistakes.

And the structure X86CPUTopoInfo names its members clearly, thus the
variable "topo_info" should be preferred.

In addition, in cpu_x86_cpuid(), to uniformly use the topology variable,
replace env->dies with topo_info.dies_per_pkg as well.





Re: [PATCH v9 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]

2024-03-09 Thread Xiaoyao Li

On 2/27/2024 6:32 PM, Zhao Liu wrote:

From: Zhao Liu 

The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
for cpuid 0x801D") adds the cache topology for AMD CPU by encoding
the number of sharing threads directly.

 From AMD's APM, NumSharingCache (CPUID[0x801D].EAX[bits 25:14])
means [1]:

The number of logical processors sharing this cache is the value of
this field incremented by 1. To determine which logical processors are
sharing a cache, determine a Share Id for each processor as follows:

ShareId = LocalApicId >> log2(NumSharingCache+1)

Logical processors with the same ShareId then share a cache. If
NumSharingCache+1 is not a power of two, round it up to the next power
of two.

 From the description above, the calculation of this field should be same
as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
APIC ID to calculate this field.

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
  Information

Cc: Babu Moger 
Tested-by: Yongwei Ma 
Signed-off-by: Zhao Liu 


Reviewed-by: Xiaoyao Li 


---
Changes since v7:
  * Moved this patch after CPUID[4]'s similar change ("i386/cpu: Use APIC
ID offset to encode cache topo in CPUID[4]"). (Xiaoyao)
  * Dropped Michael/Babu's Acked/Reviewed/Tested tags since the code
change due to the rebase.
  * Re-added Yongwei's Tested tag For his re-testing (compilation on
Intel platforms).

Changes since v3:
  * Rewrote the subject. (Babu)
  * Deleted the original "comment/help" expression, as this behavior is
confirmed for AMD CPUs. (Babu)
  * Renamed "num_apic_ids" (v3) to "num_sharing_cache" to match spec
definition. (Babu)

Changes since v1:
  * Renamed "l3_threads" to "num_apic_ids" in
encode_cache_cpuid801d(). (Yanan)
  * Added the description of the original commit and add Cc.
---
  target/i386/cpu.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index c77bcbc44d59..df56c7a449c8 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -331,7 +331,7 @@ static void encode_cache_cpuid801d(CPUCacheInfo *cache,
 uint32_t *eax, uint32_t *ebx,
 uint32_t *ecx, uint32_t *edx)
  {
-uint32_t l3_threads;
+uint32_t num_sharing_cache;
  assert(cache->size == cache->line_size * cache->associativity *
cache->partitions * cache->sets);
  
@@ -340,11 +340,11 @@ static void encode_cache_cpuid801d(CPUCacheInfo *cache,
  
  /* L3 is shared among multiple cores */

  if (cache->level == 3) {
-l3_threads = topo_info->cores_per_die * topo_info->threads_per_core;
-*eax |= (l3_threads - 1) << 14;
+num_sharing_cache = 1 << apicid_die_offset(topo_info);
  } else {
-*eax |= ((topo_info->threads_per_core - 1) << 14);
+num_sharing_cache = 1 << apicid_core_offset(topo_info);
  }
+*eax |= (num_sharing_cache - 1) << 14;
  
  assert(cache->line_size > 0);

  assert(cache->partitions > 0);





Re: [PATCH v9 06/21] i386/cpu: Use APIC ID info to encode cache topo in CPUID[4]

2024-03-09 Thread Xiaoyao Li
picid_pkg_offset(_info);
+
  *eax &= ~0x3FFC000;
-*eax |= (pow2ceil(vcpus_per_socket) - 1) << 14;
+*eax |= ((1 << addressable_threads_width) - 1) << 14;
  }
  }
  } else if (cpu->vendor_cpuid_only && IS_AMD_CPU(env)) {
  *eax = *ebx = *ecx = *edx = 0;
  } else {
  *eax = 0;
+addressable_cores_width = apicid_pkg_offset(_info) -
+  apicid_core_offset(_info);
+
  switch (count) {
  case 0: /* L1 dcache info */
+addressable_threads_width = apicid_core_offset(_info);
  encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache,
-cs->nr_threads, cs->nr_cores,
+(1 << addressable_threads_width),
+(1 << addressable_cores_width),
  eax, ebx, ecx, edx);
  break;
  case 1: /* L1 icache info */
+addressable_threads_width = apicid_core_offset(_info);
  encode_cache_cpuid4(env->cache_info_cpuid4.l1i_cache,
-cs->nr_threads, cs->nr_cores,
+(1 << addressable_threads_width),
+(1 << addressable_cores_width),
  eax, ebx, ecx, edx);
  break;
  case 2: /* L2 cache info */
+addressable_threads_width = apicid_core_offset(_info);
  encode_cache_cpuid4(env->cache_info_cpuid4.l2_cache,
-cs->nr_threads, cs->nr_cores,
+(1 << addressable_threads_width),
+(1 << addressable_cores_width),
  eax, ebx, ecx, edx);
  break;
  case 3: /* L3 cache info */
-die_offset = apicid_die_offset(_info);
  if (cpu->enable_l3_cache) {
+addressable_threads_width = apicid_die_offset(_info);


Please get rid of the local variable @addressable_threads_width.

It is truly confusing. In this patch, it is assigned to
 - apicid_pkg_offset(_info);
 - apicid_core_offset(_info);
 - apicid_die_offset(_info);

in different cases.

I only suggested the name of addressable_core_width in v7, which is 
apicid_pkg_offset(_info) - apicid_core_offset(_info); 

And it is straightforward that it means the number of bits in x2APICID 
to encode different addressable cores.


But it is not similar to addressable_threads_width, the semantic changes 
per different cache level. In fact, you want something like 
bit_width_of_addressable_threads_sharing_this_level_of_cache.


So I suggest stop using the variable of "address_therads_width". Instead 
just apicid_*_offset() directly.



With above fixed, feel free to add my

Reviewed-by: Xiaoyao Li 


  encode_cache_cpuid4(env->cache_info_cpuid4.l3_cache,
-(1 << die_offset), cs->nr_cores,
+(1 << addressable_threads_width),
+(1 << addressable_cores_width),
  eax, ebx, ecx, edx);
  break;
  }
@@ -6141,6 +6159,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  }
  }
  break;
+}
  case 5:
  /* MONITOR/MWAIT Leaf */
  *eax = cpu->mwait.eax; /* Smallest monitor-line size in bytes */





Re: [PATCH v2 2/2] kvm: add support for guest physical bits

2024-03-07 Thread Xiaoyao Li

On 3/5/2024 6:52 PM, Gerd Hoffmann wrote:

Query kvm for supported guest physical address bits, in cpuid
function 8008, eax[23:16].  Usually this is identical to host
physical address bits.  With NPT or EPT being used this might be
restricted to 48 (max 4-level paging address space size) even if
the host cpu supports more physical address bits.

When set pass this to the guest, using cpuid too.  Guest firmware
can use this to figure how big the usable guest physical address
space is, so PCI bar mapping are actually reachable.

Signed-off-by: Gerd Hoffmann 
---
  target/i386/cpu.h |  1 +
  target/i386/cpu.c |  1 +
  target/i386/kvm/kvm.c | 17 +
  3 files changed, 19 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..d427218827f6 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
  
  /* Number of physical address bits supported */

  uint32_t phys_bits;
+uint32_t guest_phys_bits;
  
  /* in order to simplify APIC support, we leave this pointer to the

 user */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 2666ef380891..1a6cfc75951e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
  /* 64 bit processor */
   *eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);
  }
  *ebx = env->features[FEAT_8000_0008_EBX];
  if (cs->nr_cores * cs->nr_threads > 1) {
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 7298822cb511..ce22dfcaa661 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -238,6 +238,15 @@ static int kvm_get_tsc(CPUState *cs)
  return 0;
  }
  
+/* return cpuid fn 8000_0008 eax[23:16] aka GuestPhysBits */

+static int kvm_get_guest_phys_bits(KVMState *s)
+{
+uint32_t eax;
+
+eax = kvm_arch_get_supported_cpuid(s, 0x8008, 0, R_EAX);
+return (eax >> 16) & 0xff;
+}
+
  static inline void do_kvm_synchronize_tsc(CPUState *cpu, run_on_cpu_data arg)
  {
  kvm_get_tsc(cpu);
@@ -1730,6 +1739,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
  X86CPU *cpu = X86_CPU(cs);
  CPUX86State *env = >env;
  uint32_t limit, i, j, cpuid_i;
+uint32_t guest_phys_bits;
  uint32_t unused;
  struct kvm_cpuid_entry2 *c;
  uint32_t signature[3];
@@ -1765,6 +1775,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
  
  env->apic_bus_freq = KVM_APIC_BUS_FREQUENCY;
  
+guest_phys_bits = kvm_get_guest_phys_bits(cs->kvm_state);

+if (guest_phys_bits &&
+(cpu->guest_phys_bits == 0 ||
+ cpu->guest_phys_bits > guest_phys_bits)) {
+cpu->guest_phys_bits = guest_phys_bits;
+}


suggest to move this added code block to kvm_cpu_realizefn().

Main code in kvm_arch_init_vcpu() is to consume the data in cpu or 
env->features to configure KVM specific configuration via KVM ioctls.


The preparation work of initializing the required data usually not 
happen in kvm_arch_init_vcpu();



  /*
   * kvm_hyperv_expand_features() is called here for the second time in case
   * KVM_CAP_SYS_HYPERV_CPUID is not supported. While we can't possibly 
handle





Re: [PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility

2024-03-07 Thread Xiaoyao Li

On 2/29/2024 4:51 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility

Originated-from: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- mention additional error information in gpa when it presents;
- refine the documentation; (Markus)

Changes in v4:
- refine the documentation; (Markus)

Changes in v3:
- Add docmentation of new type and struct; (Daniel)
- refine the error message handling; (Daniel)
---
  qapi/run-state.json   | 31 +--
  system/runstate.c | 58 +++
  target/i386/kvm/tdx.c | 24 +-
  3 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/qapi/run-state.json b/qapi/run-state.json
index dd0770b379e5..b71dd1884eb6 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -483,10 +483,12 @@
  #
  # @s390: s390 guest panic information type (Since: 2.12)
  #
+# @tdx: tdx guest panic information type (Since: 9.0)
+#
  # Since: 2.9
  ##
  { 'enum': 'GuestPanicInformationType',
-  'data': [ 'hyper-v', 's390' ] }
+  'data': [ 'hyper-v', 's390', 'tdx' ] }
  
  ##

  # @GuestPanicInformation:
@@ -501,7 +503,8 @@
   'base': {'type': 'GuestPanicInformationType'},
   'discriminator': 'type',
   'data': {'hyper-v': 'GuestPanicInformationHyperV',
-  's390': 'GuestPanicInformationS390'}}
+  's390': 'GuestPanicInformationS390',
+  'tdx' : 'GuestPanicInformationTdx'}}
  
  ##

  # @GuestPanicInformationHyperV:
@@ -564,6 +567,30 @@
'psw-addr': 'uint64',
'reason': 'S390CrashReason'}}
  
+##

+# @GuestPanicInformationTdx:
+#
+# TDX Guest panic information specific to TDX, as specified in the
+# "Guest-Hypervisor Communication Interface (GHCI) Specification",
+# section TDG.VP.VMCALL.
+#
+# @error-code: TD-specific error code
+#
+# @message: Human-readable error message provided by the guest. Not
+# to be trusted.
+#
+# @gpa: guest-physical address of a page that contains more verbose
+# error information, as zero-terminated string.  Present when the
+# "GPA valid" bit (bit 63) is set in @error-code.


Uh, peeking at GHCI Spec section 3.4 TDG.VP.VMCALL, I
see operand R12 consists of

 bitsnamedescription
 31:0TD-specific error code  TD-specific error code
 Panic – 0x0.
 Values – 0x1 to 0x
 reserved.
 62:32   TD-specific extendedTD-specific extended error code.
 error code  TD software defined.
 63  GPA Valid   Set if the TD specified additional
 information in the GPA parameter
 (R13).

Is @error-code all of R12, or just bits 31:0?

If it's all of R12, description of @error-code as "TD-specific error
code" is misleading.


We pass all of R12 to @error_code.

Here it wants to use "error_code" as generic as the whole R12. Do you 
have any better description of it ?



If it's just bits 31:0, then 'Present when the "GPA valid" bit (bit 63)
is set in @error-code' is wrong.  Could go with 'Only present when the
guest provides this information'.


+#
+#


Drop one of these two lines, please.


+# Since: 9.0
+##
+{'struct': 'GuestPanicInformationTdx',
+ 'data': {'error-code': 'uint64',
+  'message': 'str',
+  '*gpa': 'uint64'}}
+
  ##
  # @MEMORY_FAILURE:
  #







Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL

2024-03-07 Thread Xiaoyao Li

On 2/29/2024 9:28 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:40 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Add property "quote-generation-socket" to tdx-guest, which is a property
of type SocketAddress to specify Quote Generation Service(QGS).

On request of GetQuote, it connects to the QGS socket, read request
data from shared guest memory, send the request data to the QGS,
and store the response into shared guest memory, at last notify
TD guest by interrupt.

command line example:
qemu-system-x86_64 \
  -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", 
"cid":"1","port":"1234"}}' \
  -machine confidential-guest-support=tdx0

Note, above example uses vsock type socket because the QGS we used
implements the vsock socket. It can be other types, like UNIX socket,
which depends on the implementation of QGS.

To avoid no response from QGS server, setup a timer for the transaction.
If timeout, make it an error and interrupt guest. Define the threshold of
time to 30s at present, maybe change to other value if not appropriate.

Signed-off-by: Isaku Yamahata 
Codeveloped-by: Chenyi Qiang 
Signed-off-by: Chenyi Qiang 
Codeveloped-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 


[...]


diff --git a/qapi/qom.json b/qapi/qom.json
index cac875349a3a..7b26b0a0d3aa 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -917,13 +917,19 @@
  # (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
  # used when absent).
  #
+# @quote-generation-socket: socket address for Quote Generation
+# Service (QGS).  QGS is a daemon running on the host.  User in
+# TD guest cannot get TD quoting for attestation if QGS is not
+# provided.  So admin should always provide it.


This makes me wonder why it's optional.  Can you describe a use case for
*not* specifying @quote-generation-socket?


Maybe at last when all the TDX support lands on all the components, attestation 
will become a must for a TD guest to be usable.

However, at least for today, booting and running a TD guest don't require 
attestation. So not provide it, doesn't affect anything excepting cannot get a 
Quote.


Maybe

   # @quote-generation-socket: Socket address for Quote Generation
   # Service (QGS).  QGS is a daemon running on the host.  Without
   # it, the guest will not be able to get a TD quote for
   # attestation.


Thanks! will update to it.


[...]






Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig

2024-03-07 Thread Xiaoyao Li

On 3/7/2024 4:39 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 9:25 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:37 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
can be provided for TDX attestation. Detailed meaning of them can be
found: 
https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/

Allow user to specify those values via property mrconfigid, mrowner and
mrownerconfig. They are all in base64 format.

example
-object tdx-guest, \
 
mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
 mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
 
mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

[...]


diff --git a/qapi/qom.json b/qapi/qom.json
index 89ed89b9b46e..cac875349a3a 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -905,10 +905,25 @@
   # pages.  Some guest OS (e.g., Linux TD guest) may require this to
   # be set, otherwise they refuse to boot.
   #
+# @mrconfigid: ID for non-owner-defined configuration of the guest TD,
+# e.g., run-time or OS configuration (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).


Suggest to drop the parenthesis in the last sentence.

@mrconfigid is a string, so the default value can't be 0.  Actually,
it's not just any string, but a base64 encoded SHA384 digest, which
means it must be exactly 96 hex digits.  So it can't be "0", either.  It
could be
"".


I thought value 0 of SHA384 just means it.

That's my fault and my poor english.


"Fault" is too harsh :)  It's not as precise as I want our interface
documentation to be.  We work together to get there.


More on this below.


+#
+# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrownerconfig: ID for owner-defined configuration of the guest TD,
+# e.g., specific to the workload rather than the run-time or OS
+# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
+# used when absent).
+#
   # Since: 9.0
   ##
   { 'struct': 'TdxGuestProperties',
-  'data': { '*sept-ve-disable': 'bool' } }
+  'data': { '*sept-ve-disable': 'bool',
+'*mrconfigid': 'str',
+'*mrowner': 'str',
+'*mrownerconfig': 'str' } }
   ##
   # @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d0ad4f57b5d0..4ce2f1d082ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -13,6 +13,7 @@
   #include "qemu/osdep.h"
   #include "qemu/error-report.h"
+#include "qemu/base64.h"
   #include "qapi/error.h"
   #include "qom/object_interfaces.h"
   #include "standard-headers/asm-x86/kvm_para.h"
@@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
   X86CPU *x86cpu = X86_CPU(cpu);
   CPUX86State *env = >env;
   g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
+size_t data_len;
   int r = 0;
   object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);
@@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
   init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
   sizeof(struct kvm_cpuid_entry2) * 
KVM_MAX_CPUID_ENTRIES);
+#define SHA384_DIGEST_SIZE  48
+
+if (tdx_guest->mrconfigid) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid,
+  strlen(tdx_guest->mrconfigid), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrconfigid");
+return -1;
+}
+memcpy(init_vm->mrconfigid, data, data_len);
+}


When @mrconfigid is absent, the property remains null, and this
conditional is not executed.  init_vm->mrconfigid[], an array of 6
__u64, remains all zero.  How does the kernel treat that?


A all-zero SHA384 value is still a valid value, isn't it?

KVM treats it with no difference.


Can you point me to the spot in the kernel where mrconfigid is used?


https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2322

KVM just copy what QEMU provides into its own data structure @td_params. The 
format @of td_params is defined by TDX spec, and @td_params needs to be passed 
to TDX module when initialize the context of TD via SEAMCALL(TDH.MNG.INIT): 
https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2450


In fact, a

Re: [PATCH v5 08/65] kvm: handle KVM_EXIT_MEMORY_FAULT

2024-03-06 Thread Xiaoyao Li

On 3/5/2024 5:10 PM, Isaku Yamahata wrote:

On Thu, Feb 29, 2024 at 01:36:29AM -0500,
Xiaoyao Li  wrote:


From: Chao Peng 

When geeting KVM_EXIT_MEMORY_FAULT exit, it indicates userspace needs to
do the memory conversion on the RAMBlock to turn the memory into desired
attribute, i.e., private/shared.

Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
KVM_EXIT_MEMORY_FAULT happens.

Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has
guest_memfd memory backend.

Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is
added.

When page is converted from shared to private, the original shared
memory can be discarded via ram_block_discard_range(). Note, shared
memory can be discarded only when it's not back'ed by hugetlb because
hugetlb is supposed to be pre-allocated and no need for discarding.

Signed-off-by: Chao Peng 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

---
Changes in v4:
- open-coded ram_block_discard logic;
- change warn_report() to error_report(); (Daniel)
---
  accel/kvm/kvm-all.c | 94 -
  1 file changed, 84 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 70d482a2c936..87e4275932a7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2903,6 +2903,68 @@ static void kvm_eat_signals(CPUState *cpu)
  } while (sigismember(, SIG_IPI));
  }
  
+static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)

+{
+MemoryRegionSection section;
+ram_addr_t offset;
+MemoryRegion *mr;
+RAMBlock *rb;
+void *addr;
+int ret = -1;
+
+if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) ||
+!QEMU_PTR_IS_ALIGNED(size, qemu_host_page_size)) {
+return -1;
+}
+
+if (!size) {
+return -1;
+}
+
+section = memory_region_find(get_system_memory(), start, size);
+mr = section.mr;
+if (!mr) {
+return -1;
+}
+
+if (memory_region_has_guest_memfd(mr)) {
+if (to_private) {
+ret = kvm_set_memory_attributes_private(start, size);
+} else {
+ret = kvm_set_memory_attributes_shared(start, size);
+}
+
+if (ret) {
+memory_region_unref(section.mr);
+return ret;
+}
+
+addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
+rb = qemu_ram_block_from_host(addr, false, );
+
+if (to_private) {
+if (rb->page_size != qemu_host_page_size) {
+/*
+* shared memory is back'ed by  hugetlb, which is supposed to be
+* pre-allocated and doesn't need to be discarded
+*/
+return 0;


The reference count leaks. Add memory_region_unref() is needed.


thanks for catching it. Will fix it in next version.


Otherwise looks good to me.
Reviewed-by: Isaku Yamahata 





Re: [PATCH 1/1] kvm: add support for guest physical bits

2024-03-04 Thread Xiaoyao Li

On 3/4/2024 10:58 PM, Gerd Hoffmann wrote:

On Mon, Mar 04, 2024 at 09:54:40AM +0800, Xiaoyao Li wrote:

On 3/1/2024 6:17 PM, Gerd Hoffmann wrote:

query kvm for supported guest physical address bits using
KVM_CAP_VM_GPA_BITS.  Expose the value to the guest via cpuid
(leaf 0x8008, eax, bits 16-23).

Signed-off-by: Gerd Hoffmann 
---
   target/i386/cpu.h | 1 +
   target/i386/cpu.c | 1 +
   target/i386/kvm/kvm.c | 8 
   3 files changed, 10 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..d427218827f6 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
   /* Number of physical address bits supported */
   uint32_t phys_bits;
+uint32_t guest_phys_bits;
   /* in order to simplify APIC support, we leave this pointer to the
  user */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 2666ef380891..1a6cfc75951e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
   if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
   /* 64 bit processor */
*eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);


I think you misunderstand this field.

If you expose this field to guest, it's the information for nested guest.
i.e., the guest itself runs as a hypervisor will know its nested guest can
have guest_phys_bits for physical addr.


I think those limits (l1 + l2 guest phys-bits) are identical, no?


Sorry, I didn't know this patch was based on the off-list proposal made 
by Paolo that changing the definition of CPUID.0x8008:EAX[23:16] to 
advertise the "maximum usable physical address bits".


If you call out this in the change log, it can avoid the misunderstanding.

As I replied to KVM series, I think the info is better to setup by KVM 
and reported by GET_SUPPORTED_CPUID.



The problem this tries to solve is that making the guest phys-bits
smaller than the host phys-bits is problematic (which why we have
allow_smaller_maxphyaddr), but nevertheless there are cases where
the usable guest physical address space is smaller than the host
physical address space.  One case is intel processors with phys-bits
larger than 48 and 4-level EPT.  Another case is amd processors with
phys-bits larger than 48 and the l0 hypervisor using 4-level paging.

The guest needs to know that limit, specifically the guest firmware
so it knows where it can map PCI bars.

take care,
   Gerd






Re: [PATCH 1/1] kvm: add support for guest physical bits

2024-03-03 Thread Xiaoyao Li

On 3/1/2024 6:17 PM, Gerd Hoffmann wrote:

query kvm for supported guest physical address bits using
KVM_CAP_VM_GPA_BITS.  Expose the value to the guest via cpuid
(leaf 0x8008, eax, bits 16-23).

Signed-off-by: Gerd Hoffmann 
---
  target/i386/cpu.h | 1 +
  target/i386/cpu.c | 1 +
  target/i386/kvm/kvm.c | 8 
  3 files changed, 10 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..d427218827f6 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
  
  /* Number of physical address bits supported */

  uint32_t phys_bits;
+uint32_t guest_phys_bits;
  
  /* in order to simplify APIC support, we leave this pointer to the

 user */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 2666ef380891..1a6cfc75951e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
  /* 64 bit processor */
   *eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);


I think you misunderstand this field.

If you expose this field to guest, it's the information for nested 
guest. i.e., the guest itself runs as a hypervisor will know its nested 
guest can have guest_phys_bits for physical addr.



  }
  *ebx = env->features[FEAT_8000_0008_EBX];
  if (cs->nr_cores * cs->nr_threads > 1) {
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 42970ab046fa..e06c9d66bb01 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1716,6 +1716,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
  X86CPU *cpu = X86_CPU(cs);
  CPUX86State *env = >env;
  uint32_t limit, i, j, cpuid_i;
+uint32_t guest_phys_bits;
  uint32_t unused;
  struct kvm_cpuid_entry2 *c;
  uint32_t signature[3];
@@ -1751,6 +1752,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
  
  env->apic_bus_freq = KVM_APIC_BUS_FREQUENCY;
  
+guest_phys_bits = kvm_check_extension(cs->kvm_state, KVM_CAP_VM_GPA_BITS);

+if (guest_phys_bits &&
+(cpu->guest_phys_bits == 0 ||
+ cpu->guest_phys_bits > guest_phys_bits)) {
+cpu->guest_phys_bits = guest_phys_bits;
+}
+
  /*
   * kvm_hyperv_expand_features() is called here for the second time in case
   * KVM_CAP_SYS_HYPERV_CPUID is not supported. While we can't possibly 
handle





Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig

2024-02-29 Thread Xiaoyao Li

On 2/29/2024 9:25 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:37 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
can be provided for TDX attestation. Detailed meaning of them can be
found: 
https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/

Allow user to specify those values via property mrconfigid, mrowner and
mrownerconfig. They are all in base64 format.

example
-object tdx-guest, \

mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\

mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 


[...]


diff --git a/qapi/qom.json b/qapi/qom.json
index 89ed89b9b46e..cac875349a3a 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -905,10 +905,25 @@
  # pages.  Some guest OS (e.g., Linux TD guest) may require this to
  # be set, otherwise they refuse to boot.
  #
+# @mrconfigid: ID for non-owner-defined configuration of the guest TD,
+# e.g., run-time or OS configuration (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).


Suggest to drop the parenthesis in the last sentence.

@mrconfigid is a string, so the default value can't be 0.  Actually,
it's not just any string, but a base64 encoded SHA384 digest, which
means it must be exactly 96 hex digits.  So it can't be "0", either.  It
could be
"".


I thought value 0 of SHA384 just means it.

That's my fault and my poor english.


"Fault" is too harsh :)  It's not as precise as I want our interface
documentation to be.  We work together to get there.


More on this below.


+#
+# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrownerconfig: ID for owner-defined configuration of the guest TD,
+# e.g., specific to the workload rather than the run-time or OS
+# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
+# used when absent).
+#
  # Since: 9.0
  ##
  { 'struct': 'TdxGuestProperties',
-  'data': { '*sept-ve-disable': 'bool' } }
+  'data': { '*sept-ve-disable': 'bool',
+'*mrconfigid': 'str',
+'*mrowner': 'str',
+'*mrownerconfig': 'str' } }
  ##
  # @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d0ad4f57b5d0..4ce2f1d082ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -13,6 +13,7 @@
  #include "qemu/osdep.h"
  #include "qemu/error-report.h"
+#include "qemu/base64.h"
  #include "qapi/error.h"
  #include "qom/object_interfaces.h"
  #include "standard-headers/asm-x86/kvm_para.h"
@@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
  X86CPU *x86cpu = X86_CPU(cpu);
  CPUX86State *env = >env;
  g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
+size_t data_len;
  int r = 0;
  object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);
@@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
  init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
  sizeof(struct kvm_cpuid_entry2) * 
KVM_MAX_CPUID_ENTRIES);
+#define SHA384_DIGEST_SIZE  48
+
+if (tdx_guest->mrconfigid) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid,
+  strlen(tdx_guest->mrconfigid), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrconfigid");
+return -1;
+}
+memcpy(init_vm->mrconfigid, data, data_len);
+}


When @mrconfigid is absent, the property remains null, and this
conditional is not executed.  init_vm->mrconfigid[], an array of 6
__u64, remains all zero.  How does the kernel treat that?


A all-zero SHA384 value is still a valid value, isn't it?

KVM treats it with no difference.


Can you point me to the spot in the kernel where mrconfigid is used?



https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2322

KVM just copy what QEMU provides into its own data structure @td_params. 
The format @of td_params is defined by TDX spec, and @td_params needs to 
be passed to TDX module when initialize the context of TD via 
SEAMCALL(TDH.MNG.INIT): 
https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2450



In fact, all the three SHA384 fields, will be hashed together with some 
other fields (in td_params and other content of TD) to compromise the 
initial measurement of TD.


TDX module doesn't care the value of td_params->mrconfigid.




Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL

2024-02-29 Thread Xiaoyao Li

On 2/29/2024 4:40 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Add property "quote-generation-socket" to tdx-guest, which is a property
of type SocketAddress to specify Quote Generation Service(QGS).

On request of GetQuote, it connects to the QGS socket, read request
data from shared guest memory, send the request data to the QGS,
and store the response into shared guest memory, at last notify
TD guest by interrupt.

command line example:
   qemu-system-x86_64 \
 -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", 
"cid":"1","port":"1234"}}' \
 -machine confidential-guest-support=tdx0

Note, above example uses vsock type socket because the QGS we used
implements the vsock socket. It can be other types, like UNIX socket,
which depends on the implementation of QGS.

To avoid no response from QGS server, setup a timer for the transaction.
If timeout, make it an error and interrupt guest. Define the threshold of
time to 30s at present, maybe change to other value if not appropriate.

Signed-off-by: Isaku Yamahata 
Codeveloped-by: Chenyi Qiang 
Signed-off-by: Chenyi Qiang 
Codeveloped-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 


[...]


diff --git a/qapi/qom.json b/qapi/qom.json
index cac875349a3a..7b26b0a0d3aa 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -917,13 +917,19 @@
  # (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
  # used when absent).
  #
+# @quote-generation-socket: socket address for Quote Generation
+# Service (QGS).  QGS is a daemon running on the host.  User in
+# TD guest cannot get TD quoting for attestation if QGS is not
+# provided.  So admin should always provide it.


This makes me wonder why it's optional.  Can you describe a use case for
*not* specifying @quote-generation-socket?


Maybe at last when all the TDX support lands on all the components, 
attestation will become a must for a TD guest to be usable.


However, at least for today, booting and running a TD guest don't 
require attestation. So not provide it, doesn't affect anything 
excepting cannot get a Quote.



+#
  # Since: 9.0
  ##
  { 'struct': 'TdxGuestProperties',
'data': { '*sept-ve-disable': 'bool',
  '*mrconfigid': 'str',
  '*mrowner': 'str',
-'*mrownerconfig': 'str' } }
+'*mrownerconfig': 'str',
+'*quote-generation-socket': 'SocketAddress' } }
  
  ##

  # @ThreadContextProperties:


[...]






Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig

2024-02-29 Thread Xiaoyao Li

On 2/29/2024 4:37 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
can be provided for TDX attestation. Detailed meaning of them can be
found: 
https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/

Allow user to specify those values via property mrconfigid, mrowner and
mrownerconfig. They are all in base64 format.

example
-object tdx-guest, \
   mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
   mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
   
mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

---
Changes in v5:
  - refine the description of QAPI properties and add description of
default value when not specified;

Changes in v4:
  - describe more of there fields in qom.json
  - free the old value before set new value to avoid memory leak in
_setter(); (Daniel)

Changes in v3:
  - use base64 encoding instread of hex-string;
---
  qapi/qom.json | 17 -
  target/i386/kvm/tdx.c | 87 +++
  target/i386/kvm/tdx.h |  3 ++
  3 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 89ed89b9b46e..cac875349a3a 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -905,10 +905,25 @@
  # pages.  Some guest OS (e.g., Linux TD guest) may require this to
  # be set, otherwise they refuse to boot.
  #
+# @mrconfigid: ID for non-owner-defined configuration of the guest TD,
+# e.g., run-time or OS configuration (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).


Suggest to drop the parenthesis in the last sentence.

@mrconfigid is a string, so the default value can't be 0.  Actually,
it's not just any string, but a base64 encoded SHA384 digest, which
means it must be exactly 96 hex digits.  So it can't be "0", either.  It
could be
"".


I thought value 0 of SHA384 just means it.

That's my fault and my poor english.


More on this below.


+#
+# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrownerconfig: ID for owner-defined configuration of the guest TD,
+# e.g., specific to the workload rather than the run-time or OS
+# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
+# used when absent).
+#
  # Since: 9.0
  ##
  { 'struct': 'TdxGuestProperties',
-  'data': { '*sept-ve-disable': 'bool' } }
+  'data': { '*sept-ve-disable': 'bool',
+'*mrconfigid': 'str',
+'*mrowner': 'str',
+'*mrownerconfig': 'str' } }
  
  ##

  # @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d0ad4f57b5d0..4ce2f1d082ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -13,6 +13,7 @@
  
  #include "qemu/osdep.h"

  #include "qemu/error-report.h"
+#include "qemu/base64.h"
  #include "qapi/error.h"
  #include "qom/object_interfaces.h"
  #include "standard-headers/asm-x86/kvm_para.h"
@@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
  X86CPU *x86cpu = X86_CPU(cpu);
  CPUX86State *env = >env;
  g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
+size_t data_len;
  int r = 0;
  
  object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);

@@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
  init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
  sizeof(struct kvm_cpuid_entry2) * 
KVM_MAX_CPUID_ENTRIES);
  
+#define SHA384_DIGEST_SIZE  48

+
+if (tdx_guest->mrconfigid) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid,
+  strlen(tdx_guest->mrconfigid), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrconfigid");
+return -1;
+}
+memcpy(init_vm->mrconfigid, data, data_len);
+}


When @mrconfigid is absent, the property remains null, and this
conditional is not executed.  init_vm->mrconfigid[], an array of 6
__u64, remains all zero.  How does the kernel treat that?


A all-zero SHA384 value is still a valid value, isn't it?

KVM treats it with no difference.


+
+if (tdx_guest->mrowner) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrowner,
+  strlen(tdx_guest->mrowner), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_s

[PATCH v5 39/65] i386/tdx: Skip BIOS shadowing setup

2024-02-28 Thread Xiaoyao Li
TDX doesn't support map different GPAs to same private memory. Thus,
aliasing top 128KB of BIOS as isa-bios is not supported.

On the other hand, TDX guest cannot go to real mode, it can work fine
without isa-bios.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
Changes in v1:
 - update commit message and comment to clarify
---
 hw/i386/x86.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 5a0cadc88c4f..61c45dfc14dd 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1190,17 +1190,20 @@ void x86_bios_rom_init(MachineState *ms, const char 
*default_firmware,
 }
 g_free(filename);
 
-/* map the last 128KB of the BIOS in ISA space */
-isa_bios_size = MIN(bios_size, 128 * KiB);
-isa_bios = g_malloc(sizeof(*isa_bios));
-memory_region_init_alias(isa_bios, NULL, "isa-bios", bios,
- bios_size - isa_bios_size, isa_bios_size);
-memory_region_add_subregion_overlap(rom_memory,
-0x10 - isa_bios_size,
-isa_bios,
-1);
-if (!isapc_ram_fw) {
-memory_region_set_readonly(isa_bios, true);
+/* For TDX, alias different GPAs to same private memory is not supported */
+if (!is_tdx_vm()) {
+/* map the last 128KB of the BIOS in ISA space */
+isa_bios_size = MIN(bios_size, 128 * KiB);
+isa_bios = g_malloc(sizeof(*isa_bios));
+memory_region_init_alias(isa_bios, NULL, "isa-bios", bios,
+bios_size - isa_bios_size, isa_bios_size);
+memory_region_add_subregion_overlap(rom_memory,
+0x10 - isa_bios_size,
+isa_bios,
+1);
+if (!isapc_ram_fw) {
+memory_region_set_readonly(isa_bios, true);
+}
 }
 
 /* map all the bios at the top of memory */
-- 
2.34.1




[PATCH v5 40/65] i386/tdx: Don't initialize pc.rom for TDX VMs

2024-02-28 Thread Xiaoyao Li
For TDX, the address below 1MB are entirely general RAM. No need to
initialize pc.rom memory region for TDs.

Signed-off-by: Xiaoyao Li 
---
This is more as a workaround of the issue that for q35 machine type, the
real memslot update (which requires memslot deletion )for pc.rom happens
after tdx_init_memory_region. It leads to the private memory ADD'ed
before get lost. I haven't work out a good solution to resolve the
order issue. So just skip the pc.rom setup to avoid memslot deletion.
---
 hw/i386/pc.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index f5ff970acfa0..3f8dd218eb08 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -43,6 +43,7 @@
 #include "sysemu/xen.h"
 #include "sysemu/reset.h"
 #include "kvm/kvm_i386.h"
+#include "kvm/tdx.h"
 #include "hw/xen/xen.h"
 #include "qapi/qmp/qlist.h"
 #include "qemu/error-report.h"
@@ -1028,16 +1029,18 @@ void pc_memory_init(PCMachineState *pcms,
 /* Initialize PC system firmware */
 pc_system_firmware_init(pcms, rom_memory);
 
-option_rom_mr = g_malloc(sizeof(*option_rom_mr));
-memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
-   _fatal);
-if (pcmc->pci_enabled) {
-memory_region_set_readonly(option_rom_mr, true);
+if (!is_tdx_vm()) {
+option_rom_mr = g_malloc(sizeof(*option_rom_mr));
+memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
+_fatal);
+if (pcmc->pci_enabled) {
+memory_region_set_readonly(option_rom_mr, true);
+}
+memory_region_add_subregion_overlap(rom_memory,
+PC_ROM_MIN_VGA,
+option_rom_mr,
+1);
 }
-memory_region_add_subregion_overlap(rom_memory,
-PC_ROM_MIN_VGA,
-option_rom_mr,
-1);
 
 fw_cfg = fw_cfg_arch_create(machine,
 x86ms->boot_cpus, x86ms->apic_id_limit);
-- 
2.34.1




[PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility

2024-02-28 Thread Xiaoyao Li
Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility

Originated-from: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- mention additional error information in gpa when it presents;
- refine the documentation; (Markus)

Changes in v4:
- refine the documentation; (Markus)

Changes in v3:
- Add docmentation of new type and struct; (Daniel)
- refine the error message handling; (Daniel)
---
 qapi/run-state.json   | 31 +--
 system/runstate.c | 58 +++
 target/i386/kvm/tdx.c | 24 +-
 3 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/qapi/run-state.json b/qapi/run-state.json
index dd0770b379e5..b71dd1884eb6 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -483,10 +483,12 @@
 #
 # @s390: s390 guest panic information type (Since: 2.12)
 #
+# @tdx: tdx guest panic information type (Since: 9.0)
+#
 # Since: 2.9
 ##
 { 'enum': 'GuestPanicInformationType',
-  'data': [ 'hyper-v', 's390' ] }
+  'data': [ 'hyper-v', 's390', 'tdx' ] }
 
 ##
 # @GuestPanicInformation:
@@ -501,7 +503,8 @@
  'base': {'type': 'GuestPanicInformationType'},
  'discriminator': 'type',
  'data': {'hyper-v': 'GuestPanicInformationHyperV',
-  's390': 'GuestPanicInformationS390'}}
+  's390': 'GuestPanicInformationS390',
+  'tdx' : 'GuestPanicInformationTdx'}}
 
 ##
 # @GuestPanicInformationHyperV:
@@ -564,6 +567,30 @@
   'psw-addr': 'uint64',
   'reason': 'S390CrashReason'}}
 
+##
+# @GuestPanicInformationTdx:
+#
+# TDX Guest panic information specific to TDX, as specified in the
+# "Guest-Hypervisor Communication Interface (GHCI) Specification",
+# section TDG.VP.VMCALL.
+#
+# @error-code: TD-specific error code
+#
+# @message: Human-readable error message provided by the guest. Not
+# to be trusted.
+#
+# @gpa: guest-physical address of a page that contains more verbose
+# error information, as zero-terminated string.  Present when the
+# "GPA valid" bit (bit 63) is set in @error-code.
+#
+#
+# Since: 9.0
+##
+{'struct': 'GuestPanicInformationTdx',
+ 'data': {'error-code': 'uint64',
+  'message': 'str',
+  '*gpa': 'uint64'}}
+
 ##
 # @MEMORY_FAILURE:
 #
diff --git a/system/runstate.c b/system/runstate.c
index d6ab860ecaa7..3b628c38739d 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -519,6 +519,52 @@ static void qemu_system_wakeup(void)
 }
 }
 
+static char* tdx_parse_panic_message(char *message)
+{
+bool printable = false;
+char *buf = NULL;
+int len = 0, i;
+
+/*
+ * Although message is defined as a json string, we shouldn't
+ * unconditionally treat it as is because the guest generated it and
+ * it's not necessarily trustable.
+ */
+if (message) {
+/* The caller guarantees the NUL-terminated string. */
+len = strlen(message);
+
+printable = len > 0;
+for (i = 0; i < len; i++) {
+if (!(0x20 <= message[i] && message[i] <= 0x7e)) {
+printable = false;
+break;
+}
+}
+}
+
+if (!printable && len) {
+/* 3 = length of "%02x " */
+buf = g_malloc(len * 3);
+for (i = 0; i < len; i++) {
+if (message[i] == '\0') {
+break;
+} else {
+sprintf(buf + 3 * i, "%02x ", message[i]);
+}
+}
+if (i > 0)
+/* replace the last ' '(space) to NUL */
+buf[i * 3 - 1] = '\0';
+else
+buf[0] = '\0';
+
+return buf;
+}
+
+return message;
+}
+
 void qemu_system_guest_panicked(GuestPanicInformation *info)
 {
 qemu_log_mask(LOG_GUEST_ERROR, "Guest crashed");
@@ -560,7 +606,19 @@ void qemu_system_guest_panicked(GuestPanicInformation 
*info)
   S390CrashReason_str(info->u.s390.reason),
   info->u.s390.psw_mask,
   info->u.s390.psw_addr);
+} else if (info->type == GUEST_PANIC_INFORMATION_TYPE_TDX) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  " TDX guest reports fatal error:"
+  " error code: 0x%#" PRIx64 " error message:\"%s\"\n",
+  info->u.tdx.error_code,
+  tdx_parse_panic_message(info->u.tdx.message));
+if (info->u.tdx.error_code & (1ull << 63)) {
+qemu_log_mask(LOG_GUEST_ERROR, "Additional error information "
+  "can be found at gpa page: 0x%#" PRIx64 "\n",
+  info->u.tdx.gpa);
+}
 }
+
 qapi_free_GuestPanicInformation(info);
 }
 }
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm

[PATCH v5 32/65] i386/tdx: Set kvm_readonly_mem_enabled to false for TDX VM

2024-02-28 Thread Xiaoyao Li
TDX only supports readonly for shared memory but not for private memory.

In the view of QEMU, it has no idea whether a memslot is used as shared
memory of private. Thus just mark kvm_readonly_mem_enabled to false to
TDX VM for simplicity.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/tdx.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 42dbb5ce5c15..13f069171db7 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -480,6 +480,15 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
 
 update_tdx_cpuid_lookup_by_tdx_caps();
 
+/*
+ * Set kvm_readonly_mem_allowed to false, because TDX only supports 
readonly
+ * memory for shared memory but not for private memory. Besides, whether a
+ * memslot is private or shared is not determined by QEMU.
+ *
+ * Thus, just mark readonly memory not supported for simplicity.
+ */
+kvm_readonly_mem_allowed = false;
+
 tdx_guest = tdx;
 return 0;
 }
-- 
2.34.1




[PATCH v5 65/65] docs: Add TDX documentation

2024-02-28 Thread Xiaoyao Li
Add docs/system/i386/tdx.rst for TDX support, and add tdx in
confidential-guest-support.rst

Signed-off-by: Xiaoyao Li 
---
Changes in v5:
 - Add TD attestation section and update the QEMU parameter;

Changes since v1:
 - Add prerequisite of private gmem;
 - update example command to launch TD;

Changes since RFC v4:
 - add the restriction that kernel-irqchip must be split
---
 docs/system/confidential-guest-support.rst |   1 +
 docs/system/i386/tdx.rst   | 143 +
 docs/system/target-i386.rst|   1 +
 3 files changed, 145 insertions(+)
 create mode 100644 docs/system/i386/tdx.rst

diff --git a/docs/system/confidential-guest-support.rst 
b/docs/system/confidential-guest-support.rst
index 0c490dbda2b7..66129fbab64c 100644
--- a/docs/system/confidential-guest-support.rst
+++ b/docs/system/confidential-guest-support.rst
@@ -38,6 +38,7 @@ Supported mechanisms
 Currently supported confidential guest mechanisms are:
 
 * AMD Secure Encrypted Virtualization (SEV) (see 
:doc:`i386/amd-memory-encryption`)
+* Intel Trust Domain Extension (TDX) (see :doc:`i386/tdx`)
 * POWER Protected Execution Facility (PEF) (see 
:ref:`power-papr-protected-execution-facility-pef`)
 * s390x Protected Virtualization (PV) (see :doc:`s390x/protvirt`)
 
diff --git a/docs/system/i386/tdx.rst b/docs/system/i386/tdx.rst
new file mode 100644
index ..8491cdcfa163
--- /dev/null
+++ b/docs/system/i386/tdx.rst
@@ -0,0 +1,143 @@
+Intel Trusted Domain eXtension (TDX)
+
+
+Intel Trusted Domain eXtensions (TDX) refers to an Intel technology that 
extends
+Virtual Machine Extensions (VMX) and Multi-Key Total Memory Encryption (MKTME)
+with a new kind of virtual machine guest called a Trust Domain (TD). A TD runs
+in a CPU mode that is designed to protect the confidentiality of its memory
+contents and its CPU state from any other software, including the hosting
+Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
+
+Prerequisites
+-
+
+To run TD, the physical machine needs to have TDX module loaded and initialized
+while KVM hypervisor has TDX support and has TDX enabled. If those requirements
+are met, the ``KVM_CAP_VM_TYPES`` will report the support of 
``KVM_X86_TDX_VM``.
+
+Trust Domain Virtual Firmware (TDVF)
+
+
+Trust Domain Virtual Firmware (TDVF) is required to provide TD services to boot
+TD Guest OS. TDVF needs to be copied to guest private memory and measured 
before
+the TD boots.
+
+KVM vcpu ioctl ``KVM_MEMORY_MAPPING`` can be used to populates the TDVF content
+into its private memory.
+
+Since TDX doesn't support readonly memslot, TDVF cannot be mapped as pflash
+device and it actually works as RAM. "-bios" option is chosen to load TDVF.
+
+OVMF is the opensource firmware that implements the TDVF support. Thus the
+command line to specify and load TDVF is ``-bios OVMF.fd``
+
+KVM private memory
+
+
+TD's memory (RAM) needs to be able to be transformed between private and 
shared.
+Its BIOS (OVMF/TDVF) needs to be mapped as private as well. Thus QEMU needs to
+allocate private guest memfd for them via KVM's IOCTL (KVM_CREATE_GUEST_MEMFD),
+which requires KVM is newer enough that reports KVM_CAP_GUEST_MEMFD.
+
+Feature Control
+---
+
+Unlike non-TDX VM, the CPU features (enumerated by CPU or MSR) of a TD is not
+under full control of VMM. VMM can only configure part of features of a TD on
+``KVM_TDX_INIT_VM`` command of VM scope ``MEMORY_ENCRYPT_OP`` ioctl.
+
+The configurable features have three types:
+
+- Attributes:
+  - PKS (bit 30) controls whether Supervisor Protection Keys is exposed to TD,
+  which determines related CPUID bit and CR4 bit;
+  - PERFMON (bit 63) controls whether PMU is exposed to TD.
+
+- XSAVE related features (XFAM):
+  XFAM is a 64b mask, which has the same format as XCR0 or IA32_XSS MSR. It
+  determines the set of extended features available for use by the guest TD.
+
+- CPUID features:
+  Only some bits of some CPUID leaves are directly configurable by VMM.
+
+What features can be configured is reported via TDX capabilities.
+
+TDX capabilities
+
+
+The VM scope ``MEMORY_ENCRYPT_OP`` ioctl provides command 
``KVM_TDX_CAPABILITIES``
+to get the TDX capabilities from KVM. It returns a data structure of
+``struct kvm_tdx_capabilites``, which tells the supported configuration of
+attributes, XFAM and CPUIDs.
+
+TD attestation
+--
+
+In TD guest, the attestation process is used to verify the TDX guest
+trustworthiness to other entities before provisioning secrets to the guest.
+
+TD attestation is initiated first by calling TDG.MR.REPORT inside TD to get the
+REPORT. Then the REPORT data needs to be converted into a remotely verifiable
+Quote by SGX Quoting Enclave (QE).
+
+A host daemon, Quote Generation Service (QGS), provides the functionality of
+SGX GE. I

[PATCH v5 28/65] i386/tdx: Disable pmu for TD guest

2024-02-28 Thread Xiaoyao Li
Current KVM doesn't support PMU for TD guest. It returns error if TD is
created with PMU bit being set in attributes.

Disable PMU for TD guest on QEMU side.

Signed-off-by: Xiaoyao Li 
---
 target/i386/kvm/tdx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 262e86fd2c67..1c12cda002b8 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -496,6 +496,8 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
 int r = 0;
 
+object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);
+
 QEMU_LOCK_GUARD(_guest->lock);
 if (tdx_guest->initialized) {
 return r;
-- 
2.34.1




[PATCH v5 34/65] kvm/tdx: Ignore memory conversion to shared of unassigned region

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

TDX requires vMMIO region to be shared.  For KVM, MMIO region is the region
which kvm memslot isn't assigned to (except in-kernel emulation).
qemu has the memory region for vMMIO at each device level.

While OVMF issues MapGPA(to-shared) conservatively on 32bit PCI MMIO
region, qemu doesn't find corresponding vMMIO region because it's before
PCI device allocation and memory_region_find() finds the device region, not
PCI bus region.  It's safe to ignore MapGPA(to-shared) because when guest
accesses those region they use GPA with shared bit set for vMMIO.  Ignore
memory conversion request of non-assigned region to shared and return
success.  Otherwise OVMF is confused and panics there.

Signed-off-by: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
 accel/kvm/kvm-all.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index d533e2611ad8..9dc17a1b5f43 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2953,6 +2953,18 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, 
bool to_private)
 section = memory_region_find(get_system_memory(), start, size);
 mr = section.mr;
 if (!mr) {
+/*
+ * Ignore converting non-assigned region to shared.
+ *
+ * TDX requires vMMIO region to be shared to inject #VE to guest.
+ * OVMF issues conservatively MapGPA(shared) on 32bit PCI MMIO region,
+ * and vIO-APIC 0xFEC0 4K page.
+ * OVMF assigns 32bit PCI MMIO region to
+ * [top of low memory: typically 2GB=0xC00,  0xFC0)
+ */
+if (!to_private) {
+return 0;
+}
 return -1;
 }
 
-- 
2.34.1




[PATCH v5 29/65] i386/tdx: Validate TD attributes

2024-02-28 Thread Xiaoyao Li
Validate TD attributes with tdx_caps that fixed-0 bits must be zero and
fixed-1 bits must be set.

Besides, sanity check the attribute bits that have not been supported by
QEMU yet. e.g., debug bit, it will be allowed in the future when debug
TD support lands in QEMU.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 

---
Changes in v3:
- using error_setg() for error report; (Daniel)
---
 target/i386/kvm/tdx.c | 29 +++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 1c12cda002b8..d0ad4f57b5d0 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -32,6 +32,7 @@
  (1U << KVM_FEATURE_PV_SCHED_YIELD) | \
  (1U << KVM_FEATURE_MSI_EXT_DEST_ID))
 
+#define TDX_TD_ATTRIBUTES_DEBUG BIT_ULL(0)
 #define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE   BIT_ULL(28)
 #define TDX_TD_ATTRIBUTES_PKS   BIT_ULL(30)
 #define TDX_TD_ATTRIBUTES_PERFMON   BIT_ULL(63)
@@ -479,13 +480,34 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
 return 0;
 }
 
-static void setup_td_guest_attributes(X86CPU *x86cpu)
+static int tdx_validate_attributes(TdxGuest *tdx, Error **errp)
+{
+if (((tdx->attributes & tdx_caps->attrs_fixed0) | tdx_caps->attrs_fixed1) 
!=
+tdx->attributes) {
+error_setg(errp, "Invalid attributes 0x%lx for TDX VM "
+   "(fixed0 0x%llx, fixed1 0x%llx)",
+   tdx->attributes, tdx_caps->attrs_fixed0,
+   tdx_caps->attrs_fixed1);
+return -1;
+}
+
+if (tdx->attributes & TDX_TD_ATTRIBUTES_DEBUG) {
+error_setg(errp, "Current QEMU doesn't support attributes.debug[bit 0] 
for TDX VM");
+return -1;
+}
+
+return 0;
+}
+
+static int setup_td_guest_attributes(X86CPU *x86cpu, Error **errp)
 {
 CPUX86State *env = >env;
 
 tdx_guest->attributes |= (env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_PKS) 
?
  TDX_TD_ATTRIBUTES_PKS : 0;
 tdx_guest->attributes |= x86cpu->enable_pmu ? TDX_TD_ATTRIBUTES_PERFMON : 
0;
+
+return tdx_validate_attributes(tdx_guest, errp);
 }
 
 int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
@@ -512,7 +534,10 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 return r;
 }
 
-setup_td_guest_attributes(x86cpu);
+r = setup_td_guest_attributes(x86cpu, errp);
+if (r) {
+return r;
+}
 
 init_vm->cpuid.nent = kvm_x86_arch_cpuid(env, init_vm->cpuid.entries, 0);
 
-- 
2.34.1




[PATCH v5 35/65] memory: Introduce memory_region_init_ram_guest_memfd()

2024-02-28 Thread Xiaoyao Li
Introduce memory_region_init_ram_guest_memfd() to allocate private
guset memfd on the MemoryRegion initialization. It's for the use case of
TDVF, which must be private on TDX case.

Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- drop memory_region_set_default_private() because this function is
  dropped in this v5 series;
---
 include/exec/memory.h |  6 ++
 system/memory.c   | 25 +
 2 files changed, 31 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 679a8476852e..1e351f6fc875 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1603,6 +1603,12 @@ bool memory_region_init_ram(MemoryRegion *mr,
 uint64_t size,
 Error **errp);
 
+bool memory_region_init_ram_guest_memfd(MemoryRegion *mr,
+Object *owner,
+const char *name,
+uint64_t size,
+Error **errp);
+
 /**
  * memory_region_init_rom: Initialize a ROM memory region.
  *
diff --git a/system/memory.c b/system/memory.c
index c756950c0c0f..85a22408e9a4 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -3606,6 +3606,31 @@ bool memory_region_init_ram(MemoryRegion *mr,
 return true;
 }
 
+bool memory_region_init_ram_guest_memfd(MemoryRegion *mr,
+Object *owner,
+const char *name,
+uint64_t size,
+Error **errp)
+{
+DeviceState *owner_dev;
+
+if (!memory_region_init_ram_flags_nomigrate(mr, owner, name, size,
+RAM_GUEST_MEMFD, errp)) {
+return false;
+}
+
+/* This will assert if owner is neither NULL nor a DeviceState.
+ * We only want the owner here for the purposes of defining a
+ * unique name for migration. TODO: Ideally we should implement
+ * a naming scheme for Objects which are not DeviceStates, in
+ * which case we can relax this restriction.
+ */
+owner_dev = DEVICE(owner);
+vmstate_register_ram(mr, owner_dev);
+
+return true;
+}
+
 bool memory_region_init_rom(MemoryRegion *mr,
 Object *owner,
 const char *name,
-- 
2.34.1




[PATCH v5 06/65] kvm: Introduce support for memory_attributes

2024-02-28 Thread Xiaoyao Li
Introduce the helper functions to set the attributes of a range of
memory to private or shared.

This is necessary to notify KVM the private/shared attribute of each gpa
range. KVM needs the information to decide the GPA needs to be mapped at
hva-based shared memory or guest_memfd based private memory.

Signed-off-by: Xiaoyao Li 
---
Changes in v4:
- move the check of kvm_supported_memory_attributes to the common
  kvm_set_memory_attributes(); (Wang Wei)
- change warn_report() to error_report() in kvm_set_memory_attributes()
  and drop the __func__; (Daniel)
---
 accel/kvm/kvm-all.c  | 44 
 include/sysemu/kvm.h |  3 +++
 2 files changed, 47 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index cd0aa7545a1f..70d482a2c936 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -92,6 +92,7 @@ static bool kvm_has_guest_debug;
 static int kvm_sstep_flags;
 static bool kvm_immediate_exit;
 static bool kvm_guest_memfd_supported;
+static uint64_t kvm_supported_memory_attributes;
 static hwaddr kvm_max_slot_size = ~0;
 
 static const KVMCapabilityInfo kvm_required_capabilites[] = {
@@ -1304,6 +1305,46 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
 kvm_max_slot_size = max_slot_size;
 }
 
+static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
+{
+struct kvm_memory_attributes attrs;
+int r;
+
+if (kvm_supported_memory_attributes == 0) {
+error_report("No memory attribute supported by KVM\n");
+return -EINVAL;
+}
+
+if ((attr & kvm_supported_memory_attributes) != attr) {
+error_report("memory attribute 0x%lx not supported by KVM,"
+ " supported bits are 0x%lx\n",
+ attr, kvm_supported_memory_attributes);
+return -EINVAL;
+}
+
+attrs.attributes = attr;
+attrs.address = start;
+attrs.size = size;
+attrs.flags = 0;
+
+r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, );
+if (r) {
+error_report("failed to set memory (0x%lx+%#zx) with attr 0x%lx error 
'%s'",
+ start, size, attr, strerror(errno));
+}
+return r;
+}
+
+int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
+{
+return kvm_set_memory_attributes(start, size, 
KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
+{
+return kvm_set_memory_attributes(start, size, 0);
+}
+
 /* Called with KVMMemoryListener.slots_lock held */
 static void kvm_set_phys_mem(KVMMemoryListener *kml,
  MemoryRegionSection *section, bool add)
@@ -2439,6 +2480,9 @@ static int kvm_init(MachineState *ms)
 
 kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
 
+ret = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES);
+kvm_supported_memory_attributes = ret > 0 ? ret : 0;
+
 if (object_property_find(OBJECT(current_machine), "kvm-type")) {
 g_autofree char *kvm_type = 
object_property_get_str(OBJECT(current_machine),
 "kvm-type",
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 6cdf82de8372..8e83adfbbd19 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -546,4 +546,7 @@ uint32_t kvm_dirty_ring_size(void);
 bool kvm_hwpoisoned_mem(void);
 
 int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
+
+int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);
+int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
 #endif
-- 
2.34.1




[PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
can be provided for TDX attestation. Detailed meaning of them can be
found: 
https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/

Allow user to specify those values via property mrconfigid, mrowner and
mrownerconfig. They are all in base64 format.

example
-object tdx-guest, \
  mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
  mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
  mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

---
Changes in v5:
 - refine the description of QAPI properties and add description of
   default value when not specified;

Changes in v4:
 - describe more of there fields in qom.json
 - free the old value before set new value to avoid memory leak in
   _setter(); (Daniel)

Changes in v3:
 - use base64 encoding instread of hex-string;
---
 qapi/qom.json | 17 -
 target/i386/kvm/tdx.c | 87 +++
 target/i386/kvm/tdx.h |  3 ++
 3 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 89ed89b9b46e..cac875349a3a 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -905,10 +905,25 @@
 # pages.  Some guest OS (e.g., Linux TD guest) may require this to
 # be set, otherwise they refuse to boot.
 #
+# @mrconfigid: ID for non-owner-defined configuration of the guest TD,
+# e.g., run-time or OS configuration (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrownerconfig: ID for owner-defined configuration of the guest TD,
+# e.g., specific to the workload rather than the run-time or OS
+# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
+# used when absent).
+#
 # Since: 9.0
 ##
 { 'struct': 'TdxGuestProperties',
-  'data': { '*sept-ve-disable': 'bool' } }
+  'data': { '*sept-ve-disable': 'bool',
+'*mrconfigid': 'str',
+'*mrowner': 'str',
+'*mrownerconfig': 'str' } }
 
 ##
 # @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d0ad4f57b5d0..4ce2f1d082ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -13,6 +13,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
+#include "qemu/base64.h"
 #include "qapi/error.h"
 #include "qom/object_interfaces.h"
 #include "standard-headers/asm-x86/kvm_para.h"
@@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 X86CPU *x86cpu = X86_CPU(cpu);
 CPUX86State *env = >env;
 g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
+size_t data_len;
 int r = 0;
 
 object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);
@@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
 sizeof(struct kvm_cpuid_entry2) * 
KVM_MAX_CPUID_ENTRIES);
 
+#define SHA384_DIGEST_SIZE  48
+
+if (tdx_guest->mrconfigid) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid,
+  strlen(tdx_guest->mrconfigid), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrconfigid");
+return -1;
+}
+memcpy(init_vm->mrconfigid, data, data_len);
+}
+
+if (tdx_guest->mrowner) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrowner,
+  strlen(tdx_guest->mrowner), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrowner");
+return -1;
+}
+memcpy(init_vm->mrowner, data, data_len);
+}
+
+if (tdx_guest->mrownerconfig) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrownerconfig,
+  strlen(tdx_guest->mrownerconfig), _len, 
errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrownerconfig");
+return -1;
+}
+memcpy(init_vm->mrownerconfig, data, data_len);
+}
+
 r = kvm_vm_enable_cap(kvm_state, KVM_CAP_MAX_VCPUS, 0, ms->smp.cpus);
 if (r < 0) {
 error_setg(errp, "Unable to set MAX VCPUS to %d", ms->smp.cpus);
@@ -574,6 +608,51 @@ static void tdx_guest_set_sept_ve_disable(Object *obj, 
bool value, Error **errp)
 }
 }
 
+static

[PATCH v5 51/65] i386/tdx: Handle TDG.VP.VMCALL

2024-02-28 Thread Xiaoyao Li
TD guest can use TDG.VP.VMCALL to request termination
with error message encoded in GPRs.

Parse and print the error message, and terminate the TD guest in the
handler.

Signed-off-by: Xiaoyao Li 
---
 target/i386/kvm/tdx.c | 39 +++
 target/i386/kvm/tdx.h |  1 +
 2 files changed, 40 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 88d245577594..6cd72a26b970 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -1092,6 +1092,43 @@ static int tdx_handle_get_quote(X86CPU *cpu, struct 
kvm_tdx_vmcall *vmcall)
 return 0;
 }
 
+static int tdx_handle_report_fatal_error(X86CPU *cpu,
+ struct kvm_tdx_vmcall *vmcall)
+{
+uint64_t error_code = vmcall->in_r12;
+char *message = NULL;
+
+if (error_code & 0x) {
+error_report("TDX: REPORT_FATAL_ERROR: invalid error code: "
+ "0x%lx\n", error_code);
+return -1;
+}
+
+/* it has optional message */
+if (vmcall->in_r14) {
+uint64_t * tmp;
+
+#define GUEST_PANIC_INFO_TDX_MESSAGE_MAX64
+message = g_malloc0(GUEST_PANIC_INFO_TDX_MESSAGE_MAX + 1);
+
+tmp = (uint64_t *)message;
+/* The order is defined in TDX GHCI spec */
+*(tmp++) = cpu_to_le64(vmcall->in_r14);
+*(tmp++) = cpu_to_le64(vmcall->in_r15);
+*(tmp++) = cpu_to_le64(vmcall->in_rbx);
+*(tmp++) = cpu_to_le64(vmcall->in_rdi);
+*(tmp++) = cpu_to_le64(vmcall->in_rsi);
+*(tmp++) = cpu_to_le64(vmcall->in_r8);
+*(tmp++) = cpu_to_le64(vmcall->in_r9);
+*(tmp++) = cpu_to_le64(vmcall->in_rdx);
+message[GUEST_PANIC_INFO_TDX_MESSAGE_MAX] = '\0';
+assert((char *)tmp == message + GUEST_PANIC_INFO_TDX_MESSAGE_MAX);
+}
+
+error_report("TD guest reports fatal error. %s\n", message ? : "");
+return -1;
+}
+
 static int tdx_handle_setup_event_notify_interrupt(X86CPU *cpu,
struct kvm_tdx_vmcall 
*vmcall)
 {
@@ -1126,6 +1163,8 @@ static int tdx_handle_vmcall(X86CPU *cpu, struct 
kvm_tdx_vmcall *vmcall)
 return tdx_handle_map_gpa(cpu, vmcall);
 case TDG_VP_VMCALL_GET_QUOTE:
 return tdx_handle_get_quote(cpu, vmcall);
+case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
+return  tdx_handle_report_fatal_error(cpu, vmcall);
 case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
 return tdx_handle_setup_event_notify_interrupt(cpu, vmcall);
 default:
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index b6b8742e79f7..86b1f80d6f49 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -20,6 +20,7 @@ typedef struct TdxGuestClass {
 
 #define TDG_VP_VMCALL_MAP_GPA   0x10001ULL
 #define TDG_VP_VMCALL_GET_QUOTE 0x10002ULL
+#define TDG_VP_VMCALL_REPORT_FATAL_ERROR0x10003ULL
 #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT  0x10004ULL
 
 #define TDG_VP_VMCALL_SUCCESS   0xULL
-- 
2.34.1




[PATCH v5 59/65] hw/i386: add eoi_intercept_unsupported member to X86MachineState

2024-02-28 Thread Xiaoyao Li
Add a new bool member, eoi_intercept_unsupported, to X86MachineState
with default value false. Set true for TDX VM.

Inability to intercept eoi causes impossibility to emulate level
triggered interrupt to be re-injected when level is still kept active.
which affects interrupt controller emulation.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 hw/i386/x86.c | 1 +
 include/hw/i386/x86.h | 1 +
 target/i386/kvm/tdx.c | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 61c45dfc14dd..6ff2475535bc 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1425,6 +1425,7 @@ static void x86_machine_initfn(Object *obj)
 x86ms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
 x86ms->bus_lock_ratelimit = 0;
 x86ms->above_4g_mem_start = 4 * GiB;
+x86ms->eoi_intercept_unsupported = false;
 }
 
 static void x86_machine_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
index d28e79cc484a..033f0a34891b 100644
--- a/include/hw/i386/x86.h
+++ b/include/hw/i386/x86.h
@@ -60,6 +60,7 @@ struct X86MachineState {
 uint64_t above_4g_mem_start;
 
 /* CPU and apic information: */
+bool eoi_intercept_unsupported;
 unsigned pci_irq_mask;
 unsigned apic_id_limit;
 uint16_t boot_cpus;
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 0225a9b79b36..b1fb326bd395 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -727,6 +727,8 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
 return -EINVAL;
 }
 
+x86ms->eoi_intercept_unsupported = true;
+
 if (!tdx_caps) {
 r = get_tdx_capabilities(errp);
 if (r) {
-- 
2.34.1




[PATCH v5 18/65] i386/tdx: Make Intel-PT unsupported for TD guest

2024-02-28 Thread Xiaoyao Li
Due to the fact that Intel-PT virtualization support has been broken in
QEMU since Sapphire Rapids generation[1], below warning is triggered when
luanching TD guest:

  warning: host doesn't support requested feature: CPUID.07H:EBX.intel-pt [bit 
25]

Before Intel-pt is fixed in QEMU, just make Intel-PT unsupported for TD
guest, to avoid the confusing warning.

[1] 
https://lore.kernel.org/qemu-devel/20230531084311.3807277-1-xiaoyao...@intel.com/

Signed-off-by: Xiaoyao Li 
---
Changes in v4:
 - newly added patch;
---
 target/i386/kvm/tdx.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 85d96140b450..239170142e4f 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -292,6 +292,11 @@ void tdx_get_supported_cpuid(uint32_t function, uint32_t 
index, int reg,
 if (function == 1 && reg == R_ECX && !enable_cpu_pm) {
 *ret &= ~CPUID_EXT_MONITOR;
 }
+
+/* QEMU Intel-pt support is broken, don't advertise Intel-PT */
+if (function == 7 && reg == R_EBX) {
+*ret &= ~CPUID_7_0_EBX_INTEL_PT;
+}
 }
 
 enum tdx_ioctl_level{
-- 
2.34.1




[PATCH v5 17/65] i386/tdx: Adjust the supported CPUID based on TDX restrictions

2024-02-28 Thread Xiaoyao Li
According to Chapter "CPUID Virtualization" in TDX module spec, CPUID
bits of TD can be classified into 6 types:


1 | As configured | configurable by VMM, independent of native value;

2 | As configured | configurable by VMM if the bit is supported natively
(if native)   | Otherwise it equals as native(0).

3 | Fixed | fixed to 0/1

4 | Native| reflect the native value

5 | Calculated| calculated by TDX module.

6 | Inducing #VE  | get #VE exception


Note:
1. All the configurable XFAM related features and TD attributes related
   features fall into type #2. And fixed0/1 bits of XFAM and TD
   attributes fall into type #3.

2. For CPUID leaves not listed in "CPUID virtualization Overview" table
   in TDX module spec, TDX module injects #VE to TDs when those are
   queried. For this case, TDs can request CPUID emulation from VMM via
   TDVMCALL and the values are fully controlled by VMM.

Due to TDX module has its own virtualization policy on CPUID bits, it leads
to what reported via KVM_GET_SUPPORTED_CPUID diverges from the supported
CPUID bits for TDs. In order to keep a consistent CPUID configuration
between VMM and TDs. Adjust supported CPUID for TDs based on TDX
restrictions.

Currently only focus on the CPUID leaves recognized by QEMU's
feature_word_info[] that are indexed by a FeatureWord.

Introduce a TDX CPUID lookup table, which maintains 1 entry for each
FeatureWord. Each entry has below fields:

 - tdx_fixed0/1: The bits that are fixed as 0/1;

 - depends_on_vmm_cap: The bits that are configurable from the view of
   TDX module. But they requires emulation of VMM
   when configured as enabled. For those, they are
   not supported if VMM doesn't report them as
   supported. So they need be fixed up by checking
   if VMM supports them.

 - inducing_ve: TD gets #VE when querying this CPUID leaf. The result is
totally configurable by VMM.

 - supported_on_ve: It's valid only when @inducing_ve is true. It represents
the maximum feature set supported that be emulated
for TDs.

By applying TDX CPUID lookup table and TDX capabilities reported from
TDX module, the supported CPUID for TDs can be obtained from following
steps:

- get the base of VMM supported feature set;

- if the leaf is not a FeatureWord just return VMM's value without
  modification;

- if the leaf is an inducing_ve type, applying supported_on_ve mask and
  return;

- include all native bits, it covers type #2, #4, and parts of type #1.
  (it also includes some unsupported bits. The following step will
   correct it.)

- apply fixed0/1 to it (it covers #3, and rectifies the previous step);

- add configurable bits (it covers the other part of type #1);

- fix the ones in vmm_fixup;

(Calculated type is ignored since it's determined at runtime).

Co-developed-by: Chenyi Qiang 
Signed-off-by: Chenyi Qiang 
Signed-off-by: Xiaoyao Li 
---
 target/i386/cpu.h |  16 +++
 target/i386/kvm/kvm.c |   4 +
 target/i386/kvm/tdx.c | 263 ++
 target/i386/kvm/tdx.h |   3 +
 4 files changed, 286 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..7bd604f802a1 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -787,6 +787,8 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 
 /* Support RDFSBASE/RDGSBASE/WRFSBASE/WRGSBASE */
 #define CPUID_7_0_EBX_FSGSBASE  (1U << 0)
+/* Support for TSC adjustment MSR 0x3B */
+#define CPUID_7_0_EBX_TSC_ADJUST(1U << 1)
 /* Support SGX */
 #define CPUID_7_0_EBX_SGX   (1U << 2)
 /* 1st Group of Advanced Bit Manipulation Extensions */
@@ -805,8 +807,12 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 #define CPUID_7_0_EBX_INVPCID   (1U << 10)
 /* Restricted Transactional Memory */
 #define CPUID_7_0_EBX_RTM   (1U << 11)
+/* Cache QoS Monitoring */
+#define CPUID_7_0_EBX_PQM   (1U << 12)
 /* Memory Protection Extension */
 #define CPUID_7_0_EBX_MPX   (1U << 14)
+/* Resource Director Technology Allocation */
+#define CPUID_7_0_EBX_RDT_A (1U << 15)
 /* AVX-512 Foundation */
 #define CPUID_7_0_EBX_AVX512F   (1U << 16)
 /* AVX-512 Doubleword & Quadword Instruction */
@@ -862,10 

[PATCH v5 22/65] i386/kvm: Move architectural CPUID leaf generation to separate helper

2024-02-28 Thread Xiaoyao Li
From: Sean Christopherson 

Move the architectural (for lack of a better term) CPUID leaf generation
to a separate helper so that the generation code can be reused by TDX,
which needs to generate a canonical VM-scoped configuration.

Signed-off-by: Sean Christopherson 
Signed-off-by: Xiaoyao Li 
---
 target/i386/kvm/kvm.c  | 459 +++--
 target/i386/kvm/kvm_i386.h |   3 +
 2 files changed, 240 insertions(+), 222 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 389b631c03dd..315998c8f7e5 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1731,6 +1731,241 @@ static void kvm_init_nested_state(CPUX86State *env)
 }
 }
 
+uint32_t kvm_x86_arch_cpuid(CPUX86State *env, struct kvm_cpuid_entry2 *entries,
+uint32_t cpuid_i)
+{
+uint32_t limit, i, j;
+uint32_t unused;
+struct kvm_cpuid_entry2 *c;
+
+if (cpuid_i >  KVM_MAX_CPUID_ENTRIES) {
+error_report("exceeded cpuid index (%d) for entries[]", cpuid_i);
+abort();
+}
+
+cpu_x86_cpuid(env, 0, 0, , , , );
+
+for (i = 0; i <= limit; i++) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+fprintf(stderr, "unsupported level value: 0x%x\n", limit);
+abort();
+}
+c = [cpuid_i++];
+
+switch (i) {
+case 2: {
+/* Keep reading function 2 till all the input is received */
+int times;
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC |
+   KVM_CPUID_FLAG_STATE_READ_NEXT;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax & 0xff;
+
+for (j = 1; j < times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+fprintf(stderr, "cpuid_data is full, no space for "
+"cpuid(eax:2):eax & 0xf = 0x%x\n", times);
+abort();
+}
+c = [cpuid_i++];
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+}
+break;
+}
+case 0x1f:
+if (env->nr_dies < 2) {
+cpuid_i--;
+break;
+}
+/* fallthrough */
+case 4:
+case 0xb:
+case 0xd:
+for (j = 0; ; j++) {
+if (i == 0xd && j == 64) {
+break;
+}
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (i == 4 && c->eax == 0) {
+break;
+}
+if (i == 0xb && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0x1f && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0xd && c->eax == 0) {
+continue;
+}
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+fprintf(stderr, "cpuid_data is full, no space for "
+"cpuid(eax:0x%x,ecx:0x%x)\n", i, j);
+abort();
+}
+c = [cpuid_i++];
+}
+break;
+case 0x12:
+for (j = 0; ; j++) {
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (j > 1 && (c->eax & 0xf) != 1) {
+break;
+}
+
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+fprintf(stderr, "cpuid_data is full, no space for "
+"cpuid(eax:0x12,ecx:0x%x)\n", j);
+abort();
+}
+c = [cpuid_i++];
+}
+break;
+case 0x7:
+case 0x14:
+case 0x1d:
+case 0x1e: {
+uint32_t times;
+
+c->function = i;
+c->index = 0;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax;
+
+for (j = 1; j <= times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+fprintf(stderr, "cpuid_data is full, no space for "
+"cpuid

[PATCH v5 53/65] pci-host/q35: Move PAM initialization above SMRAM initialization

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

In mch_realize(), process PAM initialization before SMRAM initialization so
that later patch can skill all the SMRAM related with a single check.

Signed-off-by: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
 hw/pci-host/q35.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 0d7d4e3f0860..98d4a7c253a6 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -568,6 +568,16 @@ static void mch_realize(PCIDevice *d, Error **errp)
 /* setup pci memory mapping */
 pc_pci_as_mapping_init(mch->system_memory, mch->pci_address_space);
 
+/* PAM */
+init_pam(>pam_regions[0], OBJECT(mch), mch->ram_memory,
+ mch->system_memory, mch->pci_address_space,
+ PAM_BIOS_BASE, PAM_BIOS_SIZE);
+for (i = 0; i < ARRAY_SIZE(mch->pam_regions) - 1; ++i) {
+init_pam(>pam_regions[i + 1], OBJECT(mch), mch->ram_memory,
+ mch->system_memory, mch->pci_address_space,
+ PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
+}
+
 /* if *disabled* show SMRAM to all CPUs */
 memory_region_init_alias(>smram_region, OBJECT(mch), "smram-region",
  mch->pci_address_space, 
MCH_HOST_BRIDGE_SMRAM_C_BASE,
@@ -634,15 +644,6 @@ static void mch_realize(PCIDevice *d, Error **errp)
 
 object_property_add_const_link(qdev_get_machine(), "smram",
OBJECT(>smram));
-
-init_pam(>pam_regions[0], OBJECT(mch), mch->ram_memory,
- mch->system_memory, mch->pci_address_space,
- PAM_BIOS_BASE, PAM_BIOS_SIZE);
-for (i = 0; i < ARRAY_SIZE(mch->pam_regions) - 1; ++i) {
-init_pam(>pam_regions[i + 1], OBJECT(mch), mch->ram_memory,
- mch->system_memory, mch->pci_address_space,
- PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
-}
 }
 
 uint64_t mch_mcfg_base(void)
-- 
2.34.1




[PATCH v5 50/65] i386/tdx: handle TDG.VP.VMCALL hypercall

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

MapGPA is a hypercall to convert GPA from/to private GPA to/from shared GPA.
As the conversion function is already implemented as kvm_convert_memory,
wire it to TDX hypercall exit.

Signed-off-by: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
 accel/kvm/kvm-all.c   |  2 +-
 include/sysemu/kvm.h  |  2 ++
 target/i386/kvm/tdx.c | 54 +++
 target/i386/kvm/tdx.h |  1 +
 4 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 9dc17a1b5f43..2c83b6d270f7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2930,7 +2930,7 @@ static void kvm_eat_signals(CPUState *cpu)
 } while (sigismember(, SIG_IPI));
 }
 
-static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
 {
 MemoryRegionSection section;
 ram_addr_t offset;
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 82b547848130..78fcf1c3a617 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -550,4 +550,6 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, 
Error **errp);
 
 int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);
 int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
+
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private);
 #endif
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 7dfda507cc8c..88d245577594 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -898,6 +898,58 @@ static hwaddr tdx_shared_bit(X86CPU *cpu)
 return (cpu->phys_bits > 48) ? BIT_ULL(51) : BIT_ULL(47);
 }
 
+/* 64MB at most in one call. What value is appropriate? */
+#define TDX_MAP_GPA_MAX_LEN (64 * 1024 * 1024)
+
+static int tdx_handle_map_gpa(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
+{
+hwaddr shared_bit = tdx_shared_bit(cpu);
+hwaddr gpa = vmcall->in_r12 & ~shared_bit;
+bool private = !(vmcall->in_r12 & shared_bit);
+hwaddr size = vmcall->in_r13;
+bool retry = false;
+int ret = 0;
+
+vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
+
+if (!QEMU_IS_ALIGNED(gpa, 4096) || !QEMU_IS_ALIGNED(size, 4096)) {
+vmcall->status_code = TDG_VP_VMCALL_ALIGN_ERROR;
+return 0;
+}
+
+/* Overflow case. */
+if (gpa + size < gpa) {
+return 0;
+}
+if (gpa >= (1ULL << cpu->phys_bits) ||
+gpa + size >= (1ULL << cpu->phys_bits)) {
+return 0;
+}
+
+if (size > TDX_MAP_GPA_MAX_LEN) {
+retry = true;
+size = TDX_MAP_GPA_MAX_LEN;
+}
+
+if (size > 0) {
+ret = kvm_convert_memory(gpa, size, private);
+}
+
+if (!ret) {
+if (retry) {
+vmcall->status_code = TDG_VP_VMCALL_RETRY;
+vmcall->out_r11 = gpa + size;
+if (!private) {
+vmcall->out_r11 |= shared_bit;
+}
+} else {
+vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
+}
+}
+
+return 0;
+}
+
 static void tdx_get_quote_completion(struct tdx_generate_quote_task *task)
 {
 int ret;
@@ -1070,6 +1122,8 @@ static int tdx_handle_vmcall(X86CPU *cpu, struct 
kvm_tdx_vmcall *vmcall)
 }
 
 switch (vmcall->subfunction) {
+case TDG_VP_VMCALL_MAP_GPA:
+return tdx_handle_map_gpa(cpu, vmcall);
 case TDG_VP_VMCALL_GET_QUOTE:
 return tdx_handle_get_quote(cpu, vmcall);
 case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index b7ce793d6037..b6b8742e79f7 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -18,6 +18,7 @@ typedef struct TdxGuestClass {
 ConfidentialGuestSupportClass parent_class;
 } TdxGuestClass;
 
+#define TDG_VP_VMCALL_MAP_GPA   0x10001ULL
 #define TDG_VP_VMCALL_GET_QUOTE 0x10002ULL
 #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT  0x10004ULL
 
-- 
2.34.1




[PATCH v5 55/65] i386/tdx: Disable SMM for TDX VMs

2024-02-28 Thread Xiaoyao Li
TDX doesn't support SMM and VMM cannot emulate SMM for TDX VMs because
VMM cannot manipulate TDX VM's memory.

Disable SMM for TDX VMs and error out if user requests to enable SMM.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/tdx.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 811a3b81af99..c3fadbc5c58e 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -707,11 +707,19 @@ static Notifier tdx_machine_done_notify = {
 static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp)
 {
 MachineState *ms = MACHINE(qdev_get_machine());
+X86MachineState *x86ms = X86_MACHINE(ms);
 TdxGuest *tdx = TDX_GUEST(cgs);
 int r = 0;
 
 ms->require_guest_memfd = true;
 
+if (x86ms->smm == ON_OFF_AUTO_AUTO) {
+x86ms->smm = ON_OFF_AUTO_OFF;
+} else if (x86ms->smm == ON_OFF_AUTO_ON) {
+error_setg(errp, "TDX VM doesn't support SMM");
+return -EINVAL;
+}
+
 if (!tdx_caps) {
 r = get_tdx_capabilities(errp);
 if (r) {
-- 
2.34.1




[PATCH v5 47/65] i386/tdx: Finalize TDX VM

2024-02-28 Thread Xiaoyao Li
Invoke KVM_TDX_FINALIZE_VM to finalize the TD's measurement and make
the TD vCPUs runnable once machine initialization is complete.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/tdx.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 4625f806920e..d445d4b70f77 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -685,6 +685,13 @@ static void tdx_finalize_vm(Notifier *notifier, void 
*unused)
  */
 ram_block = tdx_guest->tdvf_mr->ram_block;
 ram_block_discard_range(ram_block, 0, ram_block->max_length);
+
+r = tdx_vm_ioctl(KVM_TDX_FINALIZE_VM, 0, NULL);
+if (r < 0) {
+error_report("KVM_TDX_FINALIZE_VM failed %s", strerror(-r));
+exit(0);
+}
+tdx_guest->parent_obj.ready = true;
 }
 
 static Notifier tdx_machine_done_notify = {
-- 
2.34.1




[PATCH v5 37/65] i386/tdvf: Introduce function to parse TDVF metadata

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

TDX VM needs to boot with its specialized firmware, Trusted Domain
Virtual Firmware (TDVF). QEMU needs to parse TDVF and map it in TD
guest memory prior to running the TDX VM.

A TDVF Metadata in TDVF image describes the structure of firmware.
QEMU refers to it to setup memory for TDVF. Introduce function
tdvf_parse_metadata() to parse the metadata from TDVF image and store
the info of each TDVF section.

TDX metadata is located by a TDX metadata offset block, which is a
GUID-ed structure. The data portion of the GUID structure contains
only an 4-byte field that is the offset of TDX metadata to the end
of firmware file.

Select X86_FW_OVMF when TDX is enable to leverage existing functions
to parse and search OVMF's GUID-ed structures.

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 

---
Changes in v1:
 - rename tdvf_parse_section_entry() to
   tdvf_parse_and_check_section_entry()
Changes in RFC v4:
 - rename TDX_METADATA_GUID to TDX_METADATA_OFFSET_GUID
---
 hw/i386/Kconfig|   1 +
 hw/i386/meson.build|   1 +
 hw/i386/tdvf.c | 199 +
 include/hw/i386/tdvf.h |  51 +++
 4 files changed, 252 insertions(+)
 create mode 100644 hw/i386/tdvf.c
 create mode 100644 include/hw/i386/tdvf.h

diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index c0ccf50ac3ef..4e6c8905f077 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -12,6 +12,7 @@ config SGX
 
 config TDX
 bool
+select X86_FW_OVMF
 depends on KVM
 
 config PC
diff --git a/hw/i386/meson.build b/hw/i386/meson.build
index b9c1ca39cb05..f09441c1ea54 100644
--- a/hw/i386/meson.build
+++ b/hw/i386/meson.build
@@ -28,6 +28,7 @@ i386_ss.add(when: 'CONFIG_PC', if_true: files(
   'port92.c'))
 i386_ss.add(when: 'CONFIG_X86_FW_OVMF', if_true: files('pc_sysfw_ovmf.c'),
 if_false: 
files('pc_sysfw_ovmf-stubs.c'))
+i386_ss.add(when: 'CONFIG_TDX', if_true: files('tdvf.c'))
 
 subdir('kvm')
 subdir('xen')
diff --git a/hw/i386/tdvf.c b/hw/i386/tdvf.c
new file mode 100644
index ..ff51f40088f0
--- /dev/null
+++ b/hw/i386/tdvf.c
@@ -0,0 +1,199 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+
+ * Copyright (c) 2020 Intel Corporation
+ * Author: Isaku Yamahata 
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+
+#include "hw/i386/pc.h"
+#include "hw/i386/tdvf.h"
+#include "sysemu/kvm.h"
+
+#define TDX_METADATA_OFFSET_GUID"e47a6535-984a-4798-865e-4685a7bf8ec2"
+#define TDX_METADATA_VERSION1
+#define TDVF_SIGNATURE  0x46564454 /* TDVF as little endian */
+
+typedef struct {
+uint32_t DataOffset;
+uint32_t RawDataSize;
+uint64_t MemoryAddress;
+uint64_t MemoryDataSize;
+uint32_t Type;
+uint32_t Attributes;
+} TdvfSectionEntry;
+
+typedef struct {
+uint32_t Signature;
+uint32_t Length;
+uint32_t Version;
+uint32_t NumberOfSectionEntries;
+TdvfSectionEntry SectionEntries[];
+} TdvfMetadata;
+
+struct tdx_metadata_offset {
+uint32_t offset;
+};
+
+static TdvfMetadata *tdvf_get_metadata(void *flash_ptr, int size)
+{
+TdvfMetadata *metadata;
+uint32_t offset = 0;
+uint8_t *data;
+
+if ((uint32_t) size != size) {
+return NULL;
+}
+
+if (pc_system_ovmf_table_find(TDX_METADATA_OFFSET_GUID, , NULL)) {
+offset = size - le32_to_cpu(((struct tdx_metadata_offset 
*)data)->offset);
+
+if (offset + sizeof(*metadata) > size) {
+return NULL;
+}
+} else {
+error_report("Cannot find TDX_METADATA_OFFSET_GUID");
+return NULL;
+}
+
+metadata = flash_ptr + offset;
+
+/* Finally, verify the signature to determine if this is a TDVF image. */
+metadata->Signature = le32_to_cpu(metadata->Signature);
+if (metadata->Signature != TDVF_SIGNATURE) {
+error_report("Invalid TDVF signature in metadata!");
+return NULL;
+}
+
+/* Sanity check that the TDVF doesn't overlap its own metadata. */
+metadata->Length = le32_to_cpu(metadata->Length);
+if (offset + metadata->Length > size) {
+re

[PATCH v5 19/65] i386/tdx: Update tdx_cpuid_lookup[].tdx_fixed0/1 by tdx_caps.cpuid_config[]

2024-02-28 Thread Xiaoyao Li
tdx_cpuid_lookup[].tdx_fixed0/1 is QEMU maintained data which reflects
TDX restrictions regrading what bits are fixed by TDX module.

It's retrieved from TDX spec and static. However, TDX may evolve and
change some fixed fields to configurable in the future. Update
tdx_cpuid.lookup[].tdx_fixed0/1 fields by removing the bits that
reported from TDX module as configurable. This can adapt with the
updated TDX (module) automatically.

Signed-off-by: Xiaoyao Li 
---
 target/i386/kvm/tdx.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 239170142e4f..424c0f3c0fbb 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -377,6 +377,38 @@ static int get_tdx_capabilities(Error **errp)
 return 0;
 }
 
+static void update_tdx_cpuid_lookup_by_tdx_caps(void)
+{
+KvmTdxCpuidLookup *entry;
+FeatureWordInfo *fi;
+uint32_t config;
+FeatureWord w;
+
+for (w = 0; w < FEATURE_WORDS; w++) {
+fi = _word_info[w];
+entry = _cpuid_lookup[w];
+
+if (fi->type != CPUID_FEATURE_WORD) {
+continue;
+}
+
+config = tdx_cap_cpuid_config(fi->cpuid.eax,
+  fi->cpuid.needs_ecx ? fi->cpuid.ecx : 
~0u,
+  fi->cpuid.reg);
+
+if (!config) {
+continue;
+}
+
+/*
+ * Remove the configurable bits from tdx_fixed0/1 in case QEMU
+ * maintained fixed0/1 values is outdated to TDX module.
+ */
+entry->tdx_fixed0 &= ~config;
+entry->tdx_fixed1 &= ~config;
+}
+}
+
 static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp)
 {
 MachineState *ms = MACHINE(qdev_get_machine());
@@ -392,6 +424,8 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
 }
 }
 
+update_tdx_cpuid_lookup_by_tdx_caps();
+
 tdx_guest = tdx;
 return 0;
 }
-- 
2.34.1




[PATCH v5 63/65] i386/tdx: Skip kvm_put_apicbase() for TDs

2024-02-28 Thread Xiaoyao Li
KVM doesn't allow wirting to MSR_IA32_APICBASE for TDs.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/kvm.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index d23f94b77257..31aed1c9aae0 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -3052,6 +3052,11 @@ void kvm_put_apicbase(X86CPU *cpu, uint64_t value)
 {
 int ret;
 
+/* TODO: Allow accessing guest state for debug TDs. */
+if (is_tdx_vm()) {
+return;
+}
+
 ret = kvm_put_one_msr(cpu, MSR_IA32_APICBASE, value);
 assert(ret == 1);
 }
-- 
2.34.1




  1   2   3   4   5   6   7   8   9   >