Amit, Alex, please see my comments below. Avi, please have a look at the patches and let me know which parts you think can be done better.
On Fri, 2008-11-14 at 06:07 -0700, Amit Shah wrote: > * On Thursday 13 Nov 2008 19:08:14 Alexander Graf wrote: > > On 13.11.2008, at 05:35, Amit Shah wrote: > > > * On Wednesday 12 Nov 2008 22:49:16 Alexander Graf wrote: > > >> On 12.11.2008, at 17:52, Amit Shah wrote: > > >>> Hi Alex, > > >>> > > >>> * On Wednesday 12 Nov 2008 21:09:43 Alexander Graf wrote: > > >>>> Hi, > > >>>> > > >>>> I was thinking a bit about cross vendor migration recently and > > >>>> since > > >>>> we're doing open source development, I figured it might be a good > > >>>> idea > > >>>> to talk to everyone about this. > > >>>> > > >>>> So why are we having a problem? > > >>>> > > >>>> In normal operation we don't. If we're running a 32-bit kernel, we > > >>>> can > > >>>> use SYSENTER to jump from kernel<->userspace. If we're on a 64-bit > > >>>> kernel with 64-bit userspace, every CPU supports SYSCALL. At least > > >>>> Linux is being smart on this and does use exactly these two > > >>>> capabilities in these two cases. > > >>>> But if we're running in compat mode (64-bit kernel with 32-bit > > >>>> userspace), things differ. Intel supports only SYSENTER here, while > > >>>> AMD only supports SYSCALL. Both can still use int80. > > >>>> > > >>>> Operating systems detect usage of SYSCALL or SYSENTER pretty > > >>>> early on > > >>>> (Linux does this on vdso). So when we boot up on an Intel machine, > > >>>> Linux assumes that using SYSENTER in compat mode is fine. Migrating > > >>>> that machine to an AMD machine breaks this assumption though, since > > >>>> SYSENTER can't be used in compat mode. > > >>>> On Linux, this detection is based on the CPU vendor string. If > > >>>> Linux > > >>>> finds a "GenuineIntel", SYSENTER is used in compat mode, if it's > > >>>> "AuthenticAMD", SYSCALL is used and if none of these two is found, > > >>>> int80 is used.
> > >>>> > > >>>> I tried modifying the vendor string, removed the "overwrite the > > >>>> vendor > > >>>> string with the native string" hack and things look like they work > > >>>> just fine with Linux. > > >>>> > > >>>> Unfortunately right now I don't have a 64-bit Windows installation > > >>>> around to check if that approach works there too, but if it does > > >>>> and > > >>>> no known OS breaks due to the invalid vendor string, we can just > > >>>> create our own virtual CPU string, no? > > >>> > > >>> qemu has an option for that, -cpu qemu64 IIRC. As long as we expose > > >>> practically correct cpuids and MSRs, this should be fine. I've not > > >>> tested > > >>> qemu64 with winxp x64 though. Also, last I knew, winxp x64 > > >>> installation > > >>> didn't succeed with --no-kvm. qemu by default exposes an AMD CPU > > >>> type. > > >> > > >> I wasn't talking about CPUID features, but the vendor string. Qemu64 > > >> provides the AuthenticAMD string, so we don't run into any issues I'm > > >> presuming. > > > > > > Right -- the thing is, with the default AuthenticAMD string, winxp x64 > > > installation fails. That has to be because of some missing cpuids. > > > That's one > > > of the drawbacks of exposing a well-known CPU type. I was suggesting > > > we > > > should try out the -cpu qemu64 CPU type since it exposes a non- > > > standard CPU > > > to see if guests and most userspace programs work fine without any > > > further > > > tweaking -- see the 'cons' below for why this might be a problem. > > > > I still don't really understand what you're trying to say - qemu64 is > > the default in KVM right now. You mean winxp64 installation doesn't > No, the default for KVM is the host CPU type. Amit, Alex is correct: the default CPU for KVM is qemu64, not the host. I have sent the patches to add an option, -cpu host. Some of the patches have gone in, but not all of them are in yet. Also, my patches do not make the host option the default.
I have attached the remaining two patches. Alex, can you try these patches with the "-cpu host" option and see if you can get the host vendor string in the guest on an AMD box? I have already tested it on the latest Intel system. > > > work as is and we should fix it? This has nothing to do with the > > migration problems, right? > > Solutions shouldn't involve adding known regressions. If our default cpu type > changes to one that renders some of the OSes we support right now > nonfunctional, such changes won't be accepted. Of course, we can improve the > qemu64 cpu type to ensure the popular OS types work properly at the least. > > > >>> There are pros and cons to exposing a custom vendor ID: > > >>> > > >>> pros: > > >>> - We don't need to have all the cpuid features exposed which are > > >>> expected of a > > >>> physically available CPU in the market, for example, badly-coded > > >>> applications > > >>> might crash if we don't have SSSE3 on a Core2Duo. But badly-coded or > > >>> not, not > > >>> exposing what's actually available on every C2D out there is bad. > > >>> > > >>> cons: > > >>> - To expose the "correct" set of feature bits for a known processor, > > >>> we also > > >>> need to check the family/model/stepping to support the exact same > > >>> feature > > >>> bits that were present in the CPU. > > >>> - We might not get some optimizations that OSes might have based on > > >>> CPU type, > > >>> even if the host CPU qualifies for such optimizations > > >>> - Standard programs like benchmarking tools, etc., might fail if > > >>> they depend > > >>> on the vendor string for their functionality > > >>> > > >>> For 32-bit guests, I think exposing a pentium4 or Athlon CPU type > > >>> should be > > >>> fine. For 64-bit guests, the newer the better.
> > >> > > >> Well, we could create different CPU definitions: > > >> > > >> - migration safe (do what is safe for migration) > > > > > > There are multiple ways of approaching this: peg to a least-known > > > good CPU > > > type, all of whose instructions will work on processors from both > > > the major > > > vendors. However, you never know how the server pools change and > > > you'd want > > > to upgrade the CPU type once you know the CPUs that are installed in > > > servers. > > > This has to be dynamic and the management application has to take > > > care of > > > exposing a CPU that's of a "safe" type for the particular server > > > pool. We > > > have to provide ways to mask off CPUID bits as requested by the > > > management > > > application. (Each server sends its cpuid to the management > > > application, > > > which calculates the safest bits and then conveys this to each > > > server before > > > starting a VM.) > > > > IMHO we shouldn't really start to be smart here. There's only so much > > benefit in using the least common dominator between all CPUs in the > > datacenter vs. using the least common dominator between all possible > > CPUs. You'll basically end up enabling some newer SSE instructions. > > I'm just saying the management application will do it. So it'll be local to > the server pool the management app caters to. Not a common denominator for > all deployments. > > > So I don't think we need to go through the hassle of making this > > dynamic. If you want to migrate your machines - use the migrate > > preset. That won't give you the 150% speed boost on video encoding, > > but should not really be any slower on normal workloads. It does make > > things a lot more transparent to us and the admin of a network though, > > because you know what you'll end up with "-cpu migration". > > We hardly know what uses KVM will be put to. Server virtualisation, desktop > virtualisation, combination, what not. 
If we provide the flexibility to > the admin to tune as necessary, it's not a bad option at all. All the > userspace needs is one tool that can calculate the max. features supported by > the current CPU and send it over to the management app when asked for. The > management app does the rest. KVM is not involved at all. > > > >> - CPU specific (like a Core2Duo, necessary to run Mac OS X) > > > > > > This doesn't need any more work -- we already have the ability to > > > select CPU > > > types. If the management application has knowledge of the kind of OS > > > being > > > installed in a VM (which these days is true), exposing a Core2Duo > > > for a > > > Mac-based OS isn't difficult. > > > > There is no sysenter emulation for IA-32e on AMD yet, right? That's > > the only issue I see here and your emulation patch should address that. > > As I've mentioned before, I've not yet been able to test my patch because I've > not found the sysenter/sysexit calls being used at all. It's included at the > end of this mail for review; hopefully someone finds a use-case and we can > take it forward. > > > >> - host (fastest possible, but no migration) > > > > > > This should be the default. > > > > I'm not sure. Either host or migration should be the default. This > I'm suggesting that 'host' should be default. Where do we disagree? Well, I believe cross-vendor migration is not the common case, so the host should be the default, IMHO. But I will not press for it. > > > actually depends on the workload you have on KVM. For servers you'll > > probably want to have migration be the default. For desktop usage it's > > host. I can't think of a way we can be smart about that on the KVM > > level. > > For individual runs from the command line, I'd prefer the host to be the > default. For a wider deployment, the management app will set the defaults as > necessary (admin-chosen).
> > > >> I don't think we could find one definition that fits all, so the user > > >> would have to define what the usage pattern will be. > > >> > > >>>> I'd love to hear comments and suggestions on this and hope we'll > > >>>> end > > >>>> up in a fruitful discussion on how to improve the current > > >>>> situation. > > >>> > > >>> I have a patch ready for emulating sysenter/sysexit on AMD systems > > >>> (needs > > >>> testing). Patching the guest was an option that was discouraged; I > > >>> had a hack > > >>> ready but it was quickly shelved (again, untested). > > >> > > >> That sounds useful for misbehaving guests or cases I haven't thought > > >> of yet. Are you sure you're intercepting the SYSENTER MSRs on AMD, so > > >> you don't end up only getting 32 bits? > > > > > > Can you elaborate? > > > > When you write to MSR_IA32_SYSENTER_EIP on AMD, that MSR will be > > directly passed through to the hardware (search for that MSR in > > svm.c). This is because SVM automatically writes the SYSENTER MSRs to > > the SYSENTER fields in the VMCB. > > My patch just handles the case when a sysenter is attempted on a system which > doesn't have that instruction. So I just emulate it. Accessing the MSR and > setting values is done at boot-time by the OS, and any migrations at that > instant is a corner case and not too critical. > > Now, the patch. > > From e1b760d8e596811081c282484621b49c674f1c22 Mon Sep 17 00:00:00 2001 > From: Amit Shah <[EMAIL PROTECTED]> > Date: Wed, 12 Nov 2008 11:31:05 +0530 > Subject: [PATCH] KVM: SVM: Emulate SYSENTER/SYSEXIT on AMD processors > > This patch enables emulation of the sysenter/sysexit instructions in > AMD long mode. This will enable a guest started on an Intel machine to > be migrated to an AMD machine. 
> > Signed-off-by: Amit Shah <[EMAIL PROTECTED]> > --- > arch/x86/kvm/svm.c | 13 ++++ > arch/x86/kvm/x86_emulate.c | 137 > +++++++++++++++++++++++++++++++++++++++++++- > 2 files changed, 149 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > index f0ad4d4..4e6e1dc 100644 > --- a/arch/x86/kvm/svm.c > +++ b/arch/x86/kvm/svm.c > @@ -1155,6 +1155,19 @@ static int vmmcall_interception(struct vcpu_svm *svm, > struct kvm_run *kvm_run) > static int invalid_op_interception(struct vcpu_svm *svm, > struct kvm_run *kvm_run) > { > + /* > + * If we're running in long mode on x86_64, check if we can > + * emulate sysenter / sysexit > + */ > + if (!is_long_mode(&svm->vcpu)) > + goto out; > + > + if (emulate_instruction(&svm->vcpu, NULL, 0, 0, 0) == EMULATE_DONE) { > + /* We could emulate it. */ > + return 1; > + } > + > + out: > kvm_queue_exception(&svm->vcpu, UD_VECTOR); > return 1; > } > diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c > index 8f60ace..c8afd20 100644 > --- a/arch/x86/kvm/x86_emulate.c > +++ b/arch/x86/kvm/x86_emulate.c > @@ -205,7 +205,9 @@ static u16 twobyte_table[256] = { > ModRM | ImplicitOps, ModRM, ModRM | ImplicitOps, ModRM, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, > /* 0x30 - 0x3F */ > - ImplicitOps, 0, ImplicitOps, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > + ImplicitOps, 0, ImplicitOps, 0, > + ImplicitOps, ImplicitOps, 0, 0, > + 0, 0, 0, 0, 0, 0, 0, 0, > /* 0x40 - 0x47 */ > DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, > DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov, > @@ -305,8 +307,11 @@ static u16 group2_table[] = { > }; > > /* EFLAGS bit definitions. 
*/ > +#define EFLG_VM (1<<17) > +#define EFLG_RF (1<<16) > #define EFLG_OF (1<<11) > #define EFLG_DF (1<<10) > +#define EFLG_IF (1<<9) > #define EFLG_SF (1<<7) > #define EFLG_ZF (1<<6) > #define EFLG_AF (1<<4) > @@ -1959,6 +1964,136 @@ twobyte_insn: > rc = X86EMUL_CONTINUE; > c->dst.type = OP_NONE; > break; > + case 0x34: { /* sysenter */ > + /* Vol 2b */ > + unsigned long cr0 = ctxt->vcpu->arch.cr0; > + struct kvm_segment cs, ss; > + u64 data; > + > + if (cr0 & X86_CR0_PE) { > + kvm_inject_gp(ctxt->vcpu, 0); > + goto cannot_emulate; > + } > + > + kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_CS, &data); > + if (!(data & 0xFFFC)) { > + kvm_inject_gp(ctxt->vcpu, 0); > + goto cannot_emulate; > + } > + > + ctxt->eflags &= ~(EFLG_VM | EFLG_IF | EFLG_RF); > + > + kvm_x86_ops->get_segment(ctxt->vcpu, &cs, VCPU_SREG_CS); > + cs.selector = (__u16) data; > + cs.base = 0; > + cs.limit = 0xfffff; > + cs.g = 1; > + cs.s = 1; > + cs.type = 0x0b; > + cs.db = 1; > + cs.dpl = 0; > + cs.selector &= ~SELECTOR_RPL_MASK; > + cs.present = 1; > + /* The CPL should be set to 0 */ > + > + if (ctxt->mode == X86EMUL_MODE_PROT64) { > + cs.l = 1; > + cs.limit = 0xffffffff; > + } > + > + ss.selector = cs.selector + 8; > + ss.base = 0; > + ss.limit = 0xfffff; > + ss.g = 1; > + ss.s = 1; > + ss.type = 0x03; > + ss.db = 1; > + ss.dpl = 0; > + ss.selector &= ~SELECTOR_RPL_MASK; > + ss.present = 1; > + if (ctxt->mode == X86EMUL_MODE_PROT64) { > + ss.limit = 0xffffffff; > + } > + > + kvm_x86_ops->set_segment(ctxt->vcpu, &cs, VCPU_SREG_CS); > + kvm_x86_ops->set_segment(ctxt->vcpu, &ss, VCPU_SREG_SS); > + > + kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_EIP, > &data); > + c->eip = data; > + > + kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_ESP, > &data); > + c->regs[VCPU_REGS_RSP] = data; > + > + goto writeback; > + break; > + } > + case 0x35: { /* sysexit */ > + /* Vol 2b */ > + u64 data; > + unsigned long cr0 = ctxt->vcpu->arch.cr0; > + struct kvm_segment cs, ss; > + > + if (cr0 & 
X86_CR0_PE) { > + kvm_inject_gp(ctxt->vcpu, 0); > + goto cannot_emulate; > + } > + > + kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_CS, &data); > + if (!(data & 0xFFFC) || > + ((ctxt->mode == X86EMUL_MODE_PROT64) && !data)) { > + kvm_inject_gp(ctxt->vcpu, 0); > + goto cannot_emulate; > + } > + > + /* Check if CPL is 0. If not, inject_gp */ > + > + kvm_x86_ops->get_segment(ctxt->vcpu, &cs, VCPU_SREG_CS); > + cs.selector = (u16)(data + > + (ctxt->mode == X86EMUL_MODE_PROT64 ? 32 : > 16)); > + cs.base = 0; > + cs.limit = 0xfffff; > + cs.g = 1; > + cs.s = 1; > + cs.type = 0x0b; > + cs.db = 1; > + cs.dpl = 3; > + cs.selector |= SELECTOR_RPL_MASK; > + cs.present = 1; > + cs.l = 0; /* For return to compatibility mode */ > + /* The CPL should be set to 3 */ > + > + if (ctxt->mode == X86EMUL_MODE_PROT64) { > + cs.l = 1; > + /* The manual doesn't talk about CS limit */ > + } > + > + ss.selector = cs.selector + > + (ctxt->mode == X86EMUL_MODE_PROT64 ? 16 : 8); > + ss.base = 0; > + ss.limit = 0xfffff; > + ss.g = 1; > + ss.s = 1; > + ss.type = 0x03; > + ss.db = 1; > + ss.dpl = 3; > + ss.selector |= SELECTOR_RPL_MASK; > + ss.present = 1; > + if (ctxt->mode == X86EMUL_MODE_PROT64) { > + ss.base = 0; > + ss.limit = 0xffffffff; > + } > + > + kvm_x86_ops->set_segment(ctxt->vcpu, &cs, VCPU_SREG_CS); > + kvm_x86_ops->set_segment(ctxt->vcpu, &ss, VCPU_SREG_SS); > + > + c->eip = ctxt->vcpu->arch.regs[VCPU_REGS_RDX]; > + c->regs[VCPU_REGS_RSP] = c->regs[VCPU_REGS_RCX]; > + > + /* TODO: Check if rip and rsp are canonical. inject_gp() if > not */ > + > + goto writeback; > + break; > + } > case 0x40 ... 
0x4f: /* cmov */ > c->dst.val = c->dst.orig_val = c->src.val; > if (!test_cc(c->b, ctxt->eflags)) > -- > 1.5.4.3 > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Thanks & Regards, Nitin Open Source Technology Center, Intel Corporation ----------------------------------------------------------------- The mind is like a parachute; it works much better when it's open
commit 70e4e65bc591eb9cf25c1cbc0d16b2cbdb089a6f Author: Nitin A Kamble <[EMAIL PROTECTED]> Date: Wed Nov 5 16:17:46 2008 -0800 Change the ioctl KVM_GET_SUPPORTED_CPUID, such that it will return the number of entries in the list when the requested number of entries (nent) is 0. Also add another KVM_CHECK_EXTENSION, KVM_CAP_CPUID_SIZER, to determine if the running kernel supports the above changed ABI. Signed-off-by: Nitin A Kamble <[EMAIL PROTECTED]> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 09e6c56..e50db11 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -86,7 +86,7 @@ #define KVM_NUM_MMU_PAGES (1 << KVM_MMU_HASH_SHIFT) #define KVM_MIN_FREE_MMU_PAGES 5 #define KVM_REFILL_PAGES 25 -#define KVM_MAX_CPUID_ENTRIES 40 +#define KVM_MAX_CPUID_ENTRIES 100 #define KVM_NR_FIXED_MTRR_REGION 88 #define KVM_NR_VAR_MTRR 8 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index bf7461b..52e6207 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -969,6 +969,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_NOP_IO_DELAY: case KVM_CAP_MP_STATE: case KVM_CAP_SYNC_MMU: + case KVM_CAP_CPUID_SIZER: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -1303,10 +1304,14 @@ static int kvm_dev_ioctl_get_supported_cpuid(struct kvm_cpuid2 *cpuid, { struct kvm_cpuid_entry2 *cpuid_entries; int limit, nent = 0, r = -E2BIG; + int sizer = 0; u32 func; - if (cpuid->nent < 1) - goto out; + if (cpuid->nent == 0) { + sizer = 1; + cpuid->nent = KVM_MAX_CPUID_ENTRIES; + } + r = -ENOMEM; cpuid_entries = vmalloc(sizeof(struct kvm_cpuid_entry2) * cpuid->nent); if (!cpuid_entries) @@ -1327,9 +1332,11 @@ static int kvm_dev_ioctl_get_supported_cpuid(struct kvm_cpuid2 *cpuid, do_cpuid_ent(&cpuid_entries[nent], func, 0, &nent, cpuid->nent); r = -EFAULT; - if (copy_to_user(entries, cpuid_entries, + if (!sizer) { + if (copy_to_user(entries, cpuid_entries, nent * sizeof(struct kvm_cpuid_entry2))) - goto out_free; + goto out_free;
+ } cpuid->nent = nent; r = 0; diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 44fd7fa..d4cb8b1 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -392,6 +392,9 @@ struct kvm_trace_rec { #endif #define KVM_CAP_IOMMU 18 #define KVM_CAP_NMI 19 +#define KVM_CAP_CPUID_SIZER 20 /* return of 1 means the KVM_GET_SUPPORTED_CPUID */ + /* ioctl will return the size of list when input */ + /* list size (nent) is 0 */ /* * ioctls for VM fds
diff --git a/libkvm/libkvm-x86.c b/libkvm/libkvm-x86.c index a8cca15..7aafa20 100644 --- a/libkvm/libkvm-x86.c +++ b/libkvm/libkvm-x86.c @@ -379,6 +379,34 @@ int kvm_set_msrs(kvm_context_t kvm, int vcpu, struct kvm_msr_entry *msrs, return r; } +/* + * Returns available host cpuid entries. User must free. + */ +struct kvm_cpuid2 *kvm_get_host_cpuid_entries(kvm_context_t kvm) +{ + struct kvm_cpuid2 sizer, *cpuids; + int r, e; + + sizer.nent = 0; + r = ioctl(kvm->fd, KVM_GET_SUPPORTED_CPUID, &sizer); + if (r == -1 && errno != E2BIG) + return NULL; + cpuids = malloc(sizeof *cpuids + sizer.nent * sizeof *cpuids->entries); + if (!cpuids) { + errno = ENOMEM; + return NULL; + } + cpuids->nent = sizer.nent; + r = ioctl(kvm->fd, KVM_GET_SUPPORTED_CPUID, cpuids); + if (r == -1) { + e = errno; + free(cpuids); + errno = e; + return NULL; + } + return cpuids; +} + static void print_seg(FILE *file, const char *name, struct kvm_segment *seg) { fprintf(stderr, @@ -458,9 +486,9 @@ __u64 kvm_get_cr8(kvm_context_t kvm, int vcpu) } int kvm_setup_cpuid(kvm_context_t kvm, int vcpu, int nent, - struct kvm_cpuid_entry *entries) + struct kvm_cpuid_entry2 *entries) { - struct kvm_cpuid *cpuid; + struct kvm_cpuid2 *cpuid; int r; cpuid = malloc(sizeof(*cpuid) + nent * sizeof(*entries)); @@ -469,7 +497,7 @@ int kvm_setup_cpuid(kvm_context_t kvm, int vcpu, int nent, cpuid->nent = nent; memcpy(cpuid->entries, entries, nent * sizeof(*entries)); - r = ioctl(kvm->vcpu_fd[vcpu], KVM_SET_CPUID, cpuid); + r = ioctl(kvm->vcpu_fd[vcpu], KVM_SET_CPUID2, cpuid); free(cpuid); return r; diff --git a/libkvm/libkvm.h b/libkvm/libkvm.h index 423ce31..f84d524 100644 --- a/libkvm/libkvm.h +++ b/libkvm/libkvm.h @@ -27,6 +27,9 @@ typedef struct kvm_context *kvm_context_t; struct kvm_msr_list *kvm_get_msr_list(kvm_context_t); int kvm_get_msrs(kvm_context_t, int vcpu, struct kvm_msr_entry *msrs, int n); int kvm_set_msrs(kvm_context_t, int vcpu, struct kvm_msr_entry *msrs, int n); +struct kvm_cpuid2 
*kvm_get_host_cpuid_entries(kvm_context_t); +void get_host_cpuid_entry(uint32_t function, uint32_t index, + struct kvm_cpuid_entry2 * e); #endif /*! @@ -374,7 +377,7 @@ int kvm_guest_debug(kvm_context_t, int vcpu, struct kvm_debug_guest *dbg); * \return 0 on success, or -errno on error */ int kvm_setup_cpuid(kvm_context_t kvm, int vcpu, int nent, - struct kvm_cpuid_entry *entries); + struct kvm_cpuid_entry2 *entries); /*! * \brief Setting the number of shadow pages to be allocated to the vm diff --git a/qemu/qemu-kvm-x86.c b/qemu/qemu-kvm-x86.c index bf62e18..b776ebf 100644 --- a/qemu/qemu-kvm-x86.c +++ b/qemu/qemu-kvm-x86.c @@ -21,6 +21,7 @@ #define MSR_IA32_TSC 0x10 static struct kvm_msr_list *kvm_msr_list; +static struct kvm_cpuid2 *kvm_host_cpuid_entries; extern unsigned int kvm_shadow_memory; extern kvm_context_t kvm_context; static int kvm_has_msr_star; @@ -52,11 +53,17 @@ int kvm_arch_qemu_create_context(void) kvm_msr_list = kvm_get_msr_list(kvm_context); if (!kvm_msr_list) - return -1; + return -1; + for (i = 0; i < kvm_msr_list->nmsrs; ++i) - if (kvm_msr_list->indices[i] == MSR_STAR) - kvm_has_msr_star = 1; - return 0; + if (kvm_msr_list->indices[i] == MSR_STAR) + kvm_has_msr_star = 1; + + kvm_host_cpuid_entries = kvm_get_host_cpuid_entries(kvm_context); + if (!kvm_host_cpuid_entries) + return -1; + + return 0; } static void set_msr_entry(struct kvm_msr_entry *entry, uint32_t index, @@ -476,13 +483,61 @@ static void host_cpuid(uint32_t function, uint32_t *eax, uint32_t *ebx, *edx = vec[3]; } +void get_host_cpuid_entry(uint32_t function, uint32_t index, + struct kvm_cpuid_entry2 * e) +{ + int i; + struct kvm_cpuid_entry2 *entries; + + memset(e, 0, (sizeof *e)); + e->function = function; + e->index = index; + + if (!kvm_host_cpuid_entries) + return; + + entries = kvm_host_cpuid_entries->entries; + + for (i=0; i<kvm_host_cpuid_entries->nent; i++) { + struct kvm_cpuid_entry2 *ent = &entries[i]; + if (ent->function != function) + continue; + if ((ent->flags & 
KVM_CPUID_FLAG_SIGNIFCANT_INDEX) && + (ent->index != index)) + continue; + if ((ent->flags & KVM_CPUID_FLAG_STATEFUL_FUNC) && + !(ent->flags & KVM_CPUID_FLAG_STATE_READ_NEXT)) + continue; + + memcpy(e, ent, sizeof (*e)); + + if (ent->flags & KVM_CPUID_FLAG_STATEFUL_FUNC) { + int j; + ent->flags &= ~KVM_CPUID_FLAG_STATE_READ_NEXT; + for (j=i+1; ; j=(j+1)%(kvm_host_cpuid_entries->nent)) { + struct kvm_cpuid_entry2 *entj = &entries[j]; + if (entj->function == ent->function) { + entj->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT; + break; + } + } + } + break; + } +} -static void do_cpuid_ent(struct kvm_cpuid_entry *e, uint32_t function, - CPUState *env) +static void do_cpuid_ent(struct kvm_cpuid_entry2 *e, uint32_t function, + uint32_t index, CPUState *env) { + if (env->cpuid_host_cpu) { + get_host_cpuid_entry(function, index, e); + return; + } + e->function = function; + e->index = index; env->regs[R_EAX] = function; + env->regs[R_ECX] = index; qemu_kvm_cpuid_on_env(env); - e->function = function; e->eax = env->regs[R_EAX]; e->ebx = env->regs[R_EBX]; e->ecx = env->regs[R_ECX]; @@ -521,6 +576,11 @@ static void do_cpuid_ent(struct kvm_cpuid_entry *e, uint32_t function, if (function == 1) e->ecx |= (1u << 31); + if ((function == 4) || (function == 0xb)) + e->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; + else + e->flags = 0; + // 3dnow isn't properly emulated yet if (function == 0x80000001) e->edx &= ~0xc0000000; @@ -559,17 +619,30 @@ static int get_para_features(kvm_context_t kvm_context) int kvm_arch_qemu_init_env(CPUState *cenv) { - struct kvm_cpuid_entry cpuid_ent[100]; + struct kvm_cpuid_entry2 *cpuid_ent, entry, *e; + int cpuid_nent = 0, malloc_size = 0; + CPUState copy; + uint32_t i, limit; #ifdef KVM_CPUID_SIGNATURE - struct kvm_cpuid_entry *pv_ent; + struct kvm_cpuid_entry2 *pv_ent; uint32_t signature[3]; + + malloc_size += 2; #endif - int cpuid_nent = 0; - CPUState copy; - uint32_t i, limit; copy = *cenv; + if (copy.cpuid_host_cpu) { + if (!kvm_host_cpuid_entries) + 
return -EINVAL; + malloc_size += kvm_host_cpuid_entries->nent; + } else + malloc_size += 100; + + cpuid_ent = malloc(malloc_size * sizeof (struct kvm_cpuid_entry2)); + if (!cpuid_ent) + return -ENOMEM; + #ifdef KVM_CPUID_SIGNATURE /* Paravirtualization CPUIDs */ memcpy(signature, "KVMKVMKVM", 12); @@ -587,21 +660,48 @@ int kvm_arch_qemu_init_env(CPUState *cenv) pv_ent->eax = get_para_features(kvm_context); #endif - copy.regs[R_EAX] = 0; - qemu_kvm_cpuid_on_env(©); - limit = copy.regs[R_EAX]; + limit = copy.cpuid_level; + for (i=0; ((i<2) && (i<limit)) ; i++) { + e = &cpuid_ent[cpuid_nent++]; + do_cpuid_ent(e, i, 0, ©); + } + + if (limit >= 2) { /* get the multiple stateful leaf values */ + do_cpuid_ent(&entry, 2, 0, ©); + cpuid_ent[cpuid_nent++] = entry; + for (i = 1; i<(entry.eax & 0xff); i++) { + e = &cpuid_ent[cpuid_nent++]; + do_cpuid_ent(e, 2, 0, ©); + } + } - for (i = 0; i <= limit; ++i) - do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, ©); + for (i = 3; i <= limit; i++) { + e = &cpuid_ent[cpuid_nent++]; + do_cpuid_ent(e, i, 0, ©); + } - copy.regs[R_EAX] = 0x80000000; - qemu_kvm_cpuid_on_env(©); - limit = copy.regs[R_EAX]; + if (limit >= 4) { /* get the per index values */ + int i = 1; + do { + e = &cpuid_ent[cpuid_nent++]; + do_cpuid_ent(e, 4, i++, ©); + } while(e->eax & 0x1f); /* until the last index */ + } + + if (limit >= 0xb) { /* get the per index values */ + int i = 1; + do { + e = &cpuid_ent[cpuid_nent++]; + do_cpuid_ent(e, 0xb, i++, ©); + } while(e->ecx & 0xff00); /* until the last index */ + } + limit = copy.cpuid_xlevel; for (i = 0x80000000; i <= limit; ++i) - do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, ©); + do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, ©); kvm_setup_cpuid(kvm_context, cenv->cpu_index, cpuid_nent, cpuid_ent); + free(cpuid_ent); return 0; } diff --git a/qemu/target-i386/cpu.h b/qemu/target-i386/cpu.h index 11bc2c1..42d646a 100644 --- a/qemu/target-i386/cpu.h +++ b/qemu/target-i386/cpu.h @@ -612,6 +612,7 @@ typedef struct CPUX86State { 
uint32_t cpuid_ext2_features; uint32_t cpuid_ext3_features; uint32_t cpuid_apic_id; + uint32_t cpuid_host_cpu; #ifdef USE_KQEMU int kqemu_enabled; diff --git a/qemu/target-i386/helper.c b/qemu/target-i386/helper.c index 68efd4d..c23e16e 100644 --- a/qemu/target-i386/helper.c +++ b/qemu/target-i386/helper.c @@ -152,6 +152,9 @@ typedef struct x86_def_t { static x86_def_t x86_defs[] = { #ifdef TARGET_X86_64 { + .name = "host", + }, + { .name = "qemu64", .level = 2, .vendor1 = CPUID_VENDOR_AMD_1, @@ -405,10 +408,59 @@ void x86_cpu_list (FILE *f, int (*cpu_fprintf)(FILE *f, const char *fmt, ...)) (*cpu_fprintf)(f, "x86 %16s\n", x86_defs[i].name); } +int fill_x86_defs_for_host(CPUX86State *env, x86_def_t * def) +{ + struct kvm_cpuid_entry2 e; + + get_host_cpuid_entry(0, 0, &e); + env->cpuid_level = e.eax; + env->cpuid_vendor1 = e.ebx; + env->cpuid_vendor2 = e.ecx; + env->cpuid_vendor3 = e.edx; + + get_host_cpuid_entry(1, 0, &e); + env->cpuid_version = e.eax; + env->cpuid_features = e.edx; + env->cpuid_ext_features = e.ecx; + + get_host_cpuid_entry(0x80000000, 0, &e); + env->cpuid_xlevel = e.eax; + + get_host_cpuid_entry(0x80000001, 0, &e); + env->cpuid_ext3_features = e.ecx; + env->cpuid_ext2_features = e.edx; + + get_host_cpuid_entry(0x80000002, 0, &e); + env->cpuid_model[0] = e.eax; + env->cpuid_model[1] = e.ebx; + env->cpuid_model[2] = e.ecx; + env->cpuid_model[3] = e.edx; + + get_host_cpuid_entry(0x80000003, 0, &e); + env->cpuid_model[4] = e.eax; + env->cpuid_model[5] = e.ebx; + env->cpuid_model[6] = e.ecx; + env->cpuid_model[7] = e.edx; + + get_host_cpuid_entry(0x80000004, 0, &e); + env->cpuid_model[8] = e.eax; + env->cpuid_model[9] = e.ebx; + env->cpuid_model[10] = e.ecx; + env->cpuid_model[11] = e.edx; + + return 0; +} + static int cpu_x86_register (CPUX86State *env, const char *cpu_model) { x86_def_t def1, *def = &def1; + if (strcmp(cpu_model, "host") == 0) { + env->cpuid_host_cpu = 1; + fill_x86_defs_for_host(env, def); + return 0; + } /* else follow through */ 
+ env->cpuid_host_cpu = 0; if (cpu_x86_find_by_name(def, cpu_model) < 0) return -1; if (def->vendor1) {