Re: 3.13 hangs when I tried to start a KVM at a 32 bit stable Gentoo
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Il 23/01/2014 20:50, Toralf Förster ha scritto: | What makes the situation really annyoing - sometimes I just can | restart my wlan device it the system works normal, but sometimes | the whole system hangs and for those cases then sometimes not even | sysrq buttons do work. Can you reproduce it with the wlan driver disabled completely? Paolo -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.22 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQIcBAEBAgAGBQJS4l5cAAoJEBvWZb6bTYbyC0AP/jrHp8WUqDjgv/1kJYcYb0+0 cSM8jqcutYCcMAKXTsKGgQj4gWdw7q/bEo4lWAEX8fQqxOfFZjBZjhTuA66XF010 fi+nOCEwjafMUeflPZaSZr34KGSix1FTtmcdzz55Kzkh+A4hZJj2NgaAYr8KFsWW 0hCoW7+MHWTRsNLIZk1ms2H/lItz19gnO9CkNYWqSIxBtx1/le5v8LNi20OkHKCz mdemG6DSwj9sE+jfezDjZ7jV/a+RBv4Pi0v71YPEUEa7HCiyKxddVAgD0KrvMwzF 2A7H4+rTwsJLnWTKFyMdoeUUz6iC5Rntq+B4ltHVXKjN2O5bG0aqmG+cURbvXrFF y9EzChdaEikPlYJHp/ci2NOB8NJlB6oEb9uO3k2kRITdWsr3iK5gCZE3fu99ziz4 JmNPUpFGTy5dnQVI/edsuZDyafRYVq3U4CpP+drnWxbic4aH7nCTv2iR5Op4pnhd LQlmt0MkZAyF9+6XFNBAAfPnfV0YPBHydCxhC2UaOmqqNXiYILW65dCD5aQiXq1y lQi2dc6i7MYbVmc28I45LV/BFD4zn622c5ges0gKw2bshgzhQm12jWHJ+dMsLDp/ qpFFvaGoqWCNJZo9FCH6vaTiN5KNUyiUOZjvKVa3u959xC9BK8nh3vTWHxinK2Yt Do62ZMYWJbljdgTvMN7d =cfJ7 -END PGP SIGNATURE- -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: compiling with kvm-kmod
Thanks for the quick fix, I found that the modules build, but I still have a problem loading the module due to the fact that kvm_vfio_ops remains undefined in the kvm module. Below is the patch that I used to fix the issue if it is helpful. -Jonas diff --git a/x86/Kbuild b/x86/Kbuild index d75b756..637b3b1 100644 --- a/x86/Kbuild +++ b/x86/Kbuild @@ -1,7 +1,7 @@ obj-m := kvm.o kvm-intel.o kvm-amd.o kvm-objs := kvm_main.o x86.o mmu.o emulate.o irq.o i8259.o pmu.o \ lapic.o ioapic.o preempt.o i8254.o coalesced_mmio.o irq_comm.o \ -eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o \ +eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o vfio.o\ ../external-module-compat.o ifeq ($(CONFIG_IOMMU_API)$(CONFIG_PCI),yy) kvm-objs += assigned-dev.o iommu.o On 23.01.2014 19:19, Jan Kiszka wrote: On 2014-01-23 17:34, Jonas Pfoh wrote: Hello, I am currently working on a project involving KVM and have been making use Jan's kvm-kmod repository. I receive the below error when I attempt to compile with the most recent version. My question is simply if this is something anyone is aware of or has any suggestions for before I go poking around? I am compiling against a 3.13.0 mainline vanilla kernel and am using the master branch (3d923a3) of Jan's kvm-kmod repo which seems to be syncing from kvm commit 7650b68. Thanks and regards, Jonas Pfoh make -C /lib/modules/3.13.0/build M=`pwd` \ LINUXINCLUDE=-I`pwd`/include -I`pwd`/include/uapi -Iinclude \ -Iinclude/uapi -Iarch/x86/include -Iarch/x86/include/uapi \ -Iinclude/generated/uapi -Iarch/x86/include/generated \ -Iarch/x86/include/generated/uapi \ -I`pwd`/include-compat -I`pwd`/x86 \ -include include/generated/autoconf.h \ -include `pwd`/x86/external-module-compat.h \ $@ make[1]: Entering directory `/usr/src/linux-3.13' CC [M] /local/repos/kvm-kmod/x86/svm.o In file included from include/linux/device.h:29:0, from include/linux/node.h:17, from include/linux/cpu.h:16, from /local/repos/kvm-kmod/x86/../external-module-compat-comm.h:15, from /local/repos/kvm-kmod/x86/external-module-compat.h:45, from command-line:0: include/linux/gfp.h: In function ‘gfp_zonelist’: include/linux/gfp.h:272:2: error: implicit declaration of function ‘IS_ENABLED’ [-Werror=implicit-function-declaration] if (IS_ENABLED(CONFIG_NUMA) unlikely(flags __GFP_THISNODE)) ^ cc1: some warnings being treated as errors make[3]: *** [/local/repos/kvm-kmod/x86/svm.o] Error 1 make[2]: *** [/local/repos/kvm-kmod/x86] Error 2 make[1]: *** [_module_/local/repos/kvm-kmod] Error 2 make[1]: Leaving directory `/usr/src/linux-3.13' make: *** [all] Error 2 That's likely a kernel issue: not all required headers are pulled by gfp.h. It's worked around now with 2b06046. Thanks, Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: compiling with kvm-kmod
On 2014-01-24 13:39, Jonas Pfoh wrote: Thanks for the quick fix, I found that the modules build, but I still have a problem loading the module due to the fact that kvm_vfio_ops remains undefined in the kvm module. Below is the patch that I used to fix the issue if it is helpful. Yes, indeed - can you send it as proper patch (subject, brief reason, signed-off)? Then I could merge directly. Thanks, Jan -Jonas diff --git a/x86/Kbuild b/x86/Kbuild index d75b756..637b3b1 100644 --- a/x86/Kbuild +++ b/x86/Kbuild @@ -1,7 +1,7 @@ obj-m := kvm.o kvm-intel.o kvm-amd.o kvm-objs := kvm_main.o x86.o mmu.o emulate.o irq.o i8259.o pmu.o \ lapic.o ioapic.o preempt.o i8254.o coalesced_mmio.o irq_comm.o \ -eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o \ +eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o vfio.o\ ../external-module-compat.o ifeq ($(CONFIG_IOMMU_API)$(CONFIG_PCI),yy) kvm-objs += assigned-dev.o iommu.o On 23.01.2014 19:19, Jan Kiszka wrote: On 2014-01-23 17:34, Jonas Pfoh wrote: Hello, I am currently working on a project involving KVM and have been making use Jan's kvm-kmod repository. I receive the below error when I attempt to compile with the most recent version. My question is simply if this is something anyone is aware of or has any suggestions for before I go poking around? I am compiling against a 3.13.0 mainline vanilla kernel and am using the master branch (3d923a3) of Jan's kvm-kmod repo which seems to be syncing from kvm commit 7650b68. Thanks and regards, Jonas Pfoh make -C /lib/modules/3.13.0/build M=`pwd` \ LINUXINCLUDE=-I`pwd`/include -I`pwd`/include/uapi -Iinclude \ -Iinclude/uapi -Iarch/x86/include -Iarch/x86/include/uapi \ -Iinclude/generated/uapi -Iarch/x86/include/generated \ -Iarch/x86/include/generated/uapi \ -I`pwd`/include-compat -I`pwd`/x86 \ -include include/generated/autoconf.h \ -include `pwd`/x86/external-module-compat.h \ $@ make[1]: Entering directory `/usr/src/linux-3.13' CC [M] /local/repos/kvm-kmod/x86/svm.o In file included from include/linux/device.h:29:0, from include/linux/node.h:17, from include/linux/cpu.h:16, from /local/repos/kvm-kmod/x86/../external-module-compat-comm.h:15, from /local/repos/kvm-kmod/x86/external-module-compat.h:45, from command-line:0: include/linux/gfp.h: In function ‘gfp_zonelist’: include/linux/gfp.h:272:2: error: implicit declaration of function ‘IS_ENABLED’ [-Werror=implicit-function-declaration] if (IS_ENABLED(CONFIG_NUMA) unlikely(flags __GFP_THISNODE)) ^ cc1: some warnings being treated as errors make[3]: *** [/local/repos/kvm-kmod/x86/svm.o] Error 1 make[2]: *** [/local/repos/kvm-kmod/x86] Error 2 make[1]: *** [_module_/local/repos/kvm-kmod] Error 2 make[1]: Leaving directory `/usr/src/linux-3.13' make: *** [all] Error 2 That's likely a kernel issue: not all required headers are pulled by gfp.h. It's worked around now with 2b06046. Thanks, Jan -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
Il 24/01/2014 01:01, Peter Maydell ha scritto: +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host kernel should always be able to do: +type val = *((type *)mmio.data). I think this would be better phrased as The host userspace should always, since this documentation is supposed to be telling userspace what the kernel's contract with it is, not the kernel keeping notes for itself on its own implementation. (It also clarifies what the intention is for the obscure and maybe-we'll-never-implement-this case of an LE host kernel using a compatibility interface to run the host userspace (QEMU) as a BE process which sees the same ABI a BE kernel provides, without actually dragging that red herring explicitly into the documentation.) I agree, and also the first line should mention userspace. In PPC I think it's possible or even common to have BE host kernel and LE host userspace (or perhaps vice versa is the common one). Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote: Il 24/01/2014 01:01, Peter Maydell ha scritto: +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host kernel should always be able to do: +type val = *((type *)mmio.data). I think this would be better phrased as The host userspace should always, since this documentation is supposed to be telling userspace what the kernel's contract with it is, not the kernel keeping notes for itself on its own implementation. (It also clarifies what the intention is for the obscure and maybe-we'll-never-implement-this case of an LE host kernel using a compatibility interface to run the host userspace (QEMU) as a BE process which sees the same ABI a BE kernel provides, without actually dragging that red herring explicitly into the documentation.) I agree, and also the first line should mention userspace. In PPC I think it's possible or even common to have BE host kernel and LE host userspace (or perhaps vice versa is the common one). It was possible on 32bit, but I'm not sure anyone's actively using it :). The thing that was very common (not so much anymore for enterprise distros) is 32-bit user space with 64-bit kernels. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-kmod] fix undefined kvm_vfio_ops
vfio.c is being pulled in from the submodule, but it is not being compiled/linked by the build environment, causing a Unknown symbol in module error when loading the resulting module. Adding vfio.o to the kvm-objs variable in x86/Kbuild fixes this issue Signed-off-by: Jonas Pfoh p...@sec.in.tum.de --- diff --git a/x86/Kbuild b/x86/Kbuild index d75b756..637b3b1 100644 --- a/x86/Kbuild +++ b/x86/Kbuild @@ -1,7 +1,7 @@ obj-m := kvm.o kvm-intel.o kvm-amd.o kvm-objs := kvm_main.o x86.o mmu.o emulate.o irq.o i8259.o pmu.o \ lapic.o ioapic.o preempt.o i8254.o coalesced_mmio.o irq_comm.o \ -eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o \ +eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o vfio.o\ ../external-module-compat.o ifeq ($(CONFIG_IOMMU_API)$(CONFIG_PCI),yy) kvm-objs += assigned-dev.o iommu.o -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH kvm-kmod] fix undefined kvm_vfio_ops
On 2014-01-24 15:29, Jonas Pfoh wrote: vfio.c is being pulled in from the submodule, but it is not being compiled/linked by the build environment, causing a Unknown symbol in module error when loading the resulting module. Adding vfio.o to the kvm-objs variable in x86/Kbuild fixes this issue Signed-off-by: Jonas Pfoh p...@sec.in.tum.de --- diff --git a/x86/Kbuild b/x86/Kbuild index d75b756..637b3b1 100644 --- a/x86/Kbuild +++ b/x86/Kbuild @@ -1,7 +1,7 @@ obj-m := kvm.o kvm-intel.o kvm-amd.o kvm-objs := kvm_main.o x86.o mmu.o emulate.o irq.o i8259.o pmu.o \ lapic.o ioapic.o preempt.o i8254.o coalesced_mmio.o irq_comm.o \ -eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o \ +eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o vfio.o\ ../external-module-compat.o ifeq ($(CONFIG_IOMMU_API)$(CONFIG_PCI),yy) kvm-objs += assigned-dev.o iommu.o Format was almost perfect format - just your mailer mangled whitespaces. Make sure to either use git send-email then or to switch of line-wrapper etc. in the mail client when sending patches - to whichever project. Unfortunately it broke the build for 3.11 and older kernels. I merged a variant to next that passed build tests here. Thanks, Jan -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote: On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote: Il 24/01/2014 01:01, Peter Maydell ha scritto: +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. Also if you use ints on real bus as description, you may want to clarify restrictions on mmio.len. Basically on 32 bit platform (i.e like V7 ARM) one cannot have mmio.len=8, because one cannot have 64bit value on 32bit data bus. Without such clarification introduction of text like the value as it would go on the bus in real hardware is confusing for len=8 for emulated CPUs where real busses are 32bit. If ldrd/strd would be emulated on ARMV7 one would need to use mmio twice to pass required data in either direction using len=4 .. Thanks, Victor The host kernel should always be able to do: +type val = *((type *)mmio.data). I think this would be better phrased as The host userspace should always, since this documentation is supposed to be telling userspace what the kernel's contract with it is, not the kernel keeping notes for itself on its own implementation. (It also clarifies what the intention is for the obscure and maybe-we'll-never-implement-this case of an LE host kernel using a compatibility interface to run the host userspace (QEMU) as a BE process which sees the same ABI a BE kernel provides, without actually dragging that red herring explicitly into the documentation.) I agree, and also the first line should mention userspace. In PPC I think it's possible or even common to have BE host kernel and LE host userspace (or perhaps vice versa is the common one). It was possible on 32bit, but I'm not sure anyone's actively using it :). The thing that was very common (not so much anymore for enterprise distros) is 32-bit user space with 64-bit kernels. Alex ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
Il 24/01/2014 16:23, Victor Kamensky ha scritto: Also if you use ints on real bus as description, you may want to clarify restrictions on mmio.len. Basically on 32 bit platform (i.e like V7 ARM) one cannot have mmio.len=8, because one cannot have 64bit value on 32bit data bus. Without such clarification introduction of text like the value as it would go on the bus in real hardware is confusing for len=8 for emulated CPUs where real busses are 32bit. This is not necessarily true. On a 32-bit CPU you can have a 64-bit memory bus. Even x86 32-bit CPUs can do 64-bit MMIO via MMX or SSE or double-word compare-and-swap (CMPXCHG8B). Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 4/4] KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt
According to SDM 27.2.3, IDT vectoring information will not be valid on vmexits caused by external NMIs. So we have to avoid creating such scenarios by delaying EXIT_REASON_EXCEPTION_NMI injection as long as we have a pending interrupt because that one would be migrated to L1's IDT vectoring info on nested exit. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- arch/x86/kvm/vmx.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 081a15c..7ed0ecc 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8159,7 +8159,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) } if (vcpu-arch.nmi_pending nested_exit_on_nmi(vcpu)) { - if (vmx-nested.nested_run_pending) + if (vmx-nested.nested_run_pending || + vcpu-arch.interrupt.pending) return -EBUSY; nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, NMI_VECTOR | INTR_TYPE_NMI_INTR | -- 1.8.1.1.298.ge7eed54 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/4] KVM: x86: Validate guest writes to MSR_IA32_APICBASE
Check for invalid state transitions on guest-initiated updates of MSR_IA32_APICBASE. This address both enabling of the x2APIC when it is not supported and all invalid transitions as described in SDM section 10.12.5. It also checks that no reserved bit is set in APICBASE by the guest. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- arch/x86/kvm/cpuid.h | 16 arch/x86/kvm/lapic.h | 2 +- arch/x86/kvm/vmx.c | 9 + arch/x86/kvm/x86.c | 32 +--- 4 files changed, 47 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index f1e4895..b012ad2 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -72,4 +72,20 @@ static inline bool guest_cpuid_has_pcid(struct kvm_vcpu *vcpu) return best (best-ecx bit(X86_FEATURE_PCID)); } +static inline bool guest_cpuid_has_x2apic(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 1, 0); + return best (best-ecx bit(X86_FEATURE_X2APIC)); +} + +static inline unsigned int guest_cpuid_get_phys_bits(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 0x8008, 0); + return best ? best-eax 0xff : 36; +} + #endif diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index c8b0d0d..6a11845 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -65,7 +65,7 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src, struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map); u64 kvm_get_apic_base(struct kvm_vcpu *vcpu); -void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data); +int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info); void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s); int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 5c88791..a06f101 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4392,7 +4392,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) static void vmx_vcpu_reset(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); - u64 msr; + struct msr_data apic_base_msr; vmx-rmode.vm86_active = 0; @@ -4400,10 +4400,11 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu) vmx-vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val(); kvm_set_cr8(vmx-vcpu, 0); - msr = 0xfee0 | MSR_IA32_APICBASE_ENABLE; + apic_base_msr.data = 0xfee0 | MSR_IA32_APICBASE_ENABLE; if (kvm_vcpu_is_bsp(vmx-vcpu)) - msr |= MSR_IA32_APICBASE_BSP; - kvm_set_apic_base(vmx-vcpu, msr); + apic_base_msr.data |= MSR_IA32_APICBASE_BSP; + apic_base_msr.host_initiated = true; + kvm_set_apic_base(vmx-vcpu, apic_base_msr); vmx_segment_cache_clear(vmx); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0c76f7c..f4b0591 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -257,10 +257,26 @@ u64 kvm_get_apic_base(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_get_apic_base); -void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data) -{ - /* TODO: reserve bits check */ - kvm_lapic_set_base(vcpu, data); +int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info) +{ + u64 old_state = vcpu-arch.apic_base + (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE); + u64 new_state = msr_info-data + (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE); + u64 reserved_bits = ((~0ULL) guest_cpuid_get_phys_bits(vcpu)) | + 0x2ff | (guest_cpuid_has_x2apic(vcpu) ? 0 : X2APIC_ENABLE); + + if (!msr_info-host_initiated + ((msr_info-data reserved_bits) != 0 || +new_state == X2APIC_ENABLE || +(new_state == MSR_IA32_APICBASE_ENABLE + old_state == (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE)) || +(new_state == (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE) + old_state == 0))) + return 1; + + kvm_lapic_set_base(vcpu, msr_info-data); + return 0; } EXPORT_SYMBOL_GPL(kvm_set_apic_base); @@ -2006,8 +2022,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case 0x200 ... 0x2ff: return set_msr_mtrr(vcpu, msr, data); case MSR_IA32_APICBASE: - kvm_set_apic_base(vcpu, data); - break; + return kvm_set_apic_base(vcpu, msr_info); case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff: return kvm_x2apic_msr_write(vcpu, msr, data); case MSR_IA32_TSCDEADLINE: @@ -6409,6 +6424,7 @@ EXPORT_SYMBOL_GPL(kvm_task_switch); int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) { + struct msr_data apic_base_msr;
[PATCH v2 3/4] KVM: nVMX: Fully emulate preemption timer
We cannot rely on the hardware-provided preemption timer support because we are holding L2 in HLT outside non-root mode. Furthermore, emulating the preemption will resolve tick rate errata on older Intel CPUs. The emulation is based on hrtimer which is started on L2 entry, stopped on L2 exit and evaluated via the new check_nested_events hook. As we no longer rely on hardware features, we can enable both the preemption timer support and value saving unconditionally. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- arch/x86/kvm/vmx.c | 151 ++--- 1 file changed, 96 insertions(+), 55 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 089aa3c..081a15c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -31,6 +31,7 @@ #include linux/ftrace_event.h #include linux/slab.h #include linux/tboot.h +#include linux/hrtimer.h #include kvm_cache_regs.h #include x86.h @@ -110,6 +111,8 @@ module_param(nested, bool, S_IRUGO); #define RMODE_GUEST_OWNED_EFLAGS_BITS (~(X86_EFLAGS_IOPL | X86_EFLAGS_VM)) +#define VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE 5 + /* * These 2 parameters are used to config the controls for Pause-Loop Exiting: * ple_gap:upper bound on the amount of time between two successive @@ -374,6 +377,9 @@ struct nested_vmx { */ struct page *apic_access_page; u64 msr_ia32_feature_control; + + struct hrtimer preemption_timer; + bool preemption_timer_expired; }; #define POSTED_INTR_ON 0 @@ -1047,6 +1053,12 @@ static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12) return vmcs12-pin_based_vm_exec_control PIN_BASED_VIRTUAL_NMIS; } +static inline bool nested_cpu_has_preemption_timer(struct vmcs12 *vmcs12) +{ + return vmcs12-pin_based_vm_exec_control + PIN_BASED_VMX_PREEMPTION_TIMER; +} + static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12) { return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT); @@ -2248,9 +2260,9 @@ static __init void nested_vmx_setup_ctls_msrs(void) */ nested_vmx_pinbased_ctls_low |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR; nested_vmx_pinbased_ctls_high = PIN_BASED_EXT_INTR_MASK | - PIN_BASED_NMI_EXITING | PIN_BASED_VIRTUAL_NMIS | + PIN_BASED_NMI_EXITING | PIN_BASED_VIRTUAL_NMIS; + nested_vmx_pinbased_ctls_high |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | PIN_BASED_VMX_PREEMPTION_TIMER; - nested_vmx_pinbased_ctls_high |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR; /* * Exit controls @@ -2265,15 +2277,10 @@ static __init void nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; + nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | + VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER | VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; - if (!(nested_vmx_pinbased_ctls_high PIN_BASED_VMX_PREEMPTION_TIMER) || - !(nested_vmx_exit_ctls_high VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { - nested_vmx_exit_ctls_high = ~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; - nested_vmx_pinbased_ctls_high = ~PIN_BASED_VMX_PREEMPTION_TIMER; - } - nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | - VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER); /* entry controls */ rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2342,9 +2349,9 @@ static __init void nested_vmx_setup_ctls_msrs(void) /* miscellaneous data */ rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high); - nested_vmx_misc_low = VMX_MISC_PREEMPTION_TIMER_RATE_MASK | - VMX_MISC_SAVE_EFER_LMA; - nested_vmx_misc_low |= VMX_MISC_ACTIVITY_HLT; + nested_vmx_misc_low = VMX_MISC_SAVE_EFER_LMA; + nested_vmx_misc_low |= VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE | + VMX_MISC_ACTIVITY_HLT; nested_vmx_misc_high = 0; } @@ -5702,6 +5709,18 @@ static void nested_vmx_failValid(struct kvm_vcpu *vcpu, */ } +static enum hrtimer_restart vmx_preemption_timer_fn(struct hrtimer *timer) +{ + struct vcpu_vmx *vmx = + container_of(timer, struct vcpu_vmx, nested.preemption_timer); + + vmx-nested.preemption_timer_expired = true; + kvm_make_request(KVM_REQ_EVENT, vmx-vcpu); + kvm_vcpu_kick(vmx-vcpu); + + return HRTIMER_NORESTART; +} + /* * Emulate the VMXON instruction. * Currently, we just remember that VMX is active, and do not save or even @@ -5766,6 +5785,10 @@ static int handle_vmon(struct kvm_vcpu *vcpu) INIT_LIST_HEAD((vmx-nested.vmcs02_pool)); vmx-nested.vmcs02_num = 0; + hrtimer_init(vmx-nested.preemption_timer, CLOCK_MONOTONIC, +
[PATCH v2 2/4] KVM: nVMX: Rework interception of IRQs and NMIs
Move the check for leaving L2 on pending and intercepted IRQs or NMIs from the *_allowed handler into a dedicated callback. Invoke this callback at the relevant points before KVM checks if IRQs/NMIs can be injected. The callback has the task to switch from L2 to L1 if needed and inject the proper vmexit events. The rework fixes L2 wakeups from HLT and provides the foundation for preemption timer emulation. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/vmx.c | 67 +++-- arch/x86/kvm/x86.c | 15 +++-- 3 files changed, 53 insertions(+), 31 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index fdf83af..8d6cc7a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -765,6 +765,8 @@ struct kvm_x86_ops { struct x86_instruction_info *info, enum x86_intercept_stage stage); void (*handle_external_intr)(struct kvm_vcpu *vcpu); + + int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr); }; struct kvm_arch_async_pf { diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a06f101..089aa3c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4620,22 +4620,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked) static int vmx_nmi_allowed(struct kvm_vcpu *vcpu) { - if (is_guest_mode(vcpu)) { - if (to_vmx(vcpu)-nested.nested_run_pending) - return 0; - if (nested_exit_on_nmi(vcpu)) { - nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, - NMI_VECTOR | INTR_TYPE_NMI_INTR | - INTR_INFO_VALID_MASK, 0); - /* -* The NMI-triggered VM exit counts as injection: -* clear this one and block further NMIs. -*/ - vcpu-arch.nmi_pending = 0; - vmx_set_nmi_mask(vcpu, true); - return 0; - } - } + if (to_vmx(vcpu)-nested.nested_run_pending) + return 0; if (!cpu_has_virtual_nmis() to_vmx(vcpu)-soft_vnmi_blocked) return 0; @@ -4647,19 +4633,8 @@ static int vmx_nmi_allowed(struct kvm_vcpu *vcpu) static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu) { - if (is_guest_mode(vcpu)) { - if (to_vmx(vcpu)-nested.nested_run_pending) - return 0; - if (nested_exit_on_intr(vcpu)) { - nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, - 0, 0); - /* -* fall through to normal code, but now in L1, not L2 -*/ - } - } - - return (vmcs_readl(GUEST_RFLAGS) X86_EFLAGS_IF) + return (!to_vmx(vcpu)-nested.nested_run_pending + vmcs_readl(GUEST_RFLAGS) X86_EFLAGS_IF) !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS)); } @@ -8155,6 +8130,35 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu, } } +static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + + if (vcpu-arch.nmi_pending nested_exit_on_nmi(vcpu)) { + if (vmx-nested.nested_run_pending) + return -EBUSY; + nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, + NMI_VECTOR | INTR_TYPE_NMI_INTR | + INTR_INFO_VALID_MASK, 0); + /* +* The NMI-triggered VM exit counts as injection: +* clear this one and block further NMIs. +*/ + vcpu-arch.nmi_pending = 0; + vmx_set_nmi_mask(vcpu, true); + return 0; + } + + if ((kvm_cpu_has_interrupt(vcpu) || external_intr) + nested_exit_on_intr(vcpu)) { + if (vmx-nested.nested_run_pending) + return -EBUSY; + nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0); + } + + return 0; +} + /* * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), @@ -8495,6 +8499,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, nested_vmx_succeed(vcpu); if (enable_shadow_vmcs) vmx-nested.sync_shadow_vmcs = true; + + /* in case we halted in L2 */ + vcpu-arch.mp_state = KVM_MP_STATE_RUNNABLE; }
[PATCH v2 0/4] KVM: x86: Fixes for IA32_APIC_BASE and nVMX
This is the yet unmerged part of the previous round. Changes since v1: - rebase over next - switched APIC_BASE reserved bits check to guest's number of physical bits - addressed small review comment on Rework interception of IRQs and NMIs - added fix for improper EXCEPTION_NMI vmexit injection with valid IDT vectoring info Paolo, did you already look into nested event handling for SVM? I assume you will want to (re-)base it on top of this. Jan Jan Kiszka (4): KVM: x86: Validate guest writes to MSR_IA32_APICBASE KVM: nVMX: Rework interception of IRQs and NMIs KVM: nVMX: Fully emulate preemption timer KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/cpuid.h| 16 +++ arch/x86/kvm/lapic.h| 2 +- arch/x86/kvm/vmx.c | 228 arch/x86/kvm/x86.c | 47 +++-- 5 files changed, 197 insertions(+), 98 deletions(-) -- 1.8.1.1.298.ge7eed54 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/4] KVM: x86: Validate guest writes to MSR_IA32_APICBASE
-BEGIN PGP MESSAGE- Charset: ISO-8859-15 Version: GnuPG v2.0.22 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ hQIMA9Qj09KLn/EdARAAlKm0iclhSEbbzfzFlbJoqImUMiS+6Z/Ir1Lb1Z6xiwWp 3GXiVyMl8KXs0XzqaJF8lETKcurOFmCEdmaUDQT7iQa9k+KoXi4o2qR1K+HthiUK 4hMaHsBP94hsZjQBrTK6nWKVyChA849FG5IVudmp1HP4npWSZmVFvxCPUEdgvF68 VhG0WTN9u1IAPhpVww7EfR24WgVwWQbxSzzJfa9PKrtkKIB+oYK9wddLAPrFjW3t 1KPhkPwW98F/38eYwzHWoeaSPlLc+m84AAtkJNp1eurJCtXAGXPUf8XzU5NmJ5dp qOkt0s6GM5nm/krzsjCOcikca7JugC1u9Zyr9H3VV8mBkmOFLXBs9a3PbA7ZYtob qEAa1gbE44BnkMiOfrLgGf5Kgfg6ABcwQy1AHQi+LPPCrQBztAGmGyj8ZNQc6zYa qo5gtK+M90BBg/tVmrCe0KwsY2IBf1vxnh7YM/m/yayDF0VGWnAmK2FlHkHscDH9 PYcMI7SFFn9S4Y8zcvZPv7qG8i7jb8AaRclILGduj1KveFQEOrAdX/AW+r4W5tC+ UIxgFJLUlIciKQcbzEkLKhtZDtoXL026h82OhEwitJgOpw2BCs1owthVl0MrffCK ZlntWSDeOj3PhUpV8siOz3ACKiY5DkKM60az3N9ktNASdmJ0mv2kWIe+4Qtu+BqF Ag4DKb7ACrQJWQ0QB/9yyyYc+W+vymJ6pPsIL9SPCO1f/gPT/7r1ulYDVt6flff1 FrGkOFVvMMw8ftVgakBDdKt4y3EOqX930/HaYcUu5ZpfvfEW8qZkJSSckrqy3UYB StipXwcXZ+14lCGesibSzBAJW4Egv9AueKLMYHA+qaCdN02qrHoBm9Upb67ioKvX QhvJEGPdU5yEMMy5fKy+BJGJqU+7o4OKnvOQHzMqLrxxzy54C4sBvp5zNgA6KQXH B65CIjtvkfpeB1l/uwOUgziz7J58r58al3UqLqz5P30kBF99KABl06P7jKnAOAR8 qYLjKcRRGbIGZXCnK/xPHrs5iVlJmAUEnNjg2qbDB/9QOyI01DpOI2hoOTmQV27E fePWW6L415LEJ0iAm8Snmdi8lp66/bSAl4dhDoAz6wd3Vnq7LaCHsGn1Vwe+kcah hFJL3Mhvi3iag5+PhbYHGbvAQ/Xzm+PTpP7GyElgljKB/1ILk6uvGsDDH9uxCLiK tNq2MSIZAwtckcdXyEm5YL4+v+YNNx+Dwnct6WoYP8CnuFhz2b70BLNw5B0/QZMf GWou3nwNt6RdexwCgm4bsJJElQJAAR7u29fFlAcVmXHqNMvsebTUZnIwpqjWkgOc yrplYw+vLagiqCH8gsyQeFqWsWtZBLBJYkPXOamyD9UpHZwkhi/TqTvIdVUBz985 hQIMA1uMoGen9Cu3AQ/9HjsJgJQHJVCCHG3lCCGFdrTRGiiifRb08ik0ZEU9OKFX tB0Ka8LdrJ4gwY9YnnQKWel11BoSgToAc1cil0NJoCSY05/AK8LhnJnaGbZn8CsE PK6hCiFDXucpvO9EVF/dTF9Xqi5HMDhKRd8K1BRDo+ZSj6K1j3W5INMPPTm5nL7P RPNnaMuf4e0YCK2buuAKEagnJKQL4uC/Xi6lEUq680g19TL+vXubWLLEo0Ny5Iq6 oa5ZVCII9TsJl9OMJpYmo77zByBJ482nobX2X6+N9IQFv4VuCsJ1KKZC3fLykvQ+ aMT/t2+MY3ovFM+ra2cHyxWQjkLvcTZPe/o2rG5QAPuGtf+2X8T9Mph8mqBhT+NK /mr78zp+1JsgLM/DYWO0Zg+GRx3Sk2P+8qsJe2+tUUDl7vcqDvF54dNJ19AL9Xwr 5sWME/C4FhAbLtgMQv2SbdPx2CUBk6dBJQRPvcnmTr25TiGAO1yRhv4cF1S/9Hnk 6ace9PZBTKcivc1KKqtTGDM5zSEOk5DZLv2FncMITFABsk+v7eG7eQ2YG3v5Tiq6 gs+4cJyq+fYwKES2DiT54iRIlHsIB9T53ng+z6fx/wThIu9wcqdGRgeZvicViyZ6 GoTzU6HV6jgm4n1xc91tRzFCFgOu6oCSOUS1k1wJ9NhFWN74TiKvratXZFDVQNuF AgwDyn8FWYRSbREBEACnZNEkJqKMKuY3Mlbh4Pck6DM6CTinaoy0Dv+DkXI5riPQ kzXXv+M2LOlaJ3A5Sh/XDuB9CyUJObeHK12tLmWYg27WY59EAsEd0Vgu4zpaIKvK 2zREEq8PFpchi2hMfIiReSHrlw1a459mbeC0z6Gd44hBoOUFsFVCNt8khQAP0TKT m4BGdNkxF9Cl3VUqqP7LdyT6Xo3g12tmsGKYO3rs3Tm8QoiUsfYTzSDHPN9iupLZ wVsdzydgB8vQ1GXL1JNoFZXLXbYAraRs2wgwvGo6auLhE/zwURHP6ky7QmmCEEjb UvWbHg5xiZga/Gx5Y+Ghtzb5yMHBn9HrZa9wxjuk6SqrIBMCWzhJP9t38AhiomiG d4L3sQsmRKlH+YsC+TVdCVVGKo7d+Zox/pM25/lo2PkbDGQv6A5JfxT7K+O/cK6K G2HtkIhuVQoLwdGu5quMrty6uIVfvtrMgnNFxOwsC/qwfPj50+6sDtNYyxvl4+d9 7S2hBSieCJPohQbzdgdDyNvbj9D/913Evq5M/lPdUMvue/Oq2RGWF4ZzKfr+f6k6 79rhvm2cO5fGzgF6AVNjhzWbTfvdZfMQo7A6Cew/y70ukwwoX0vkYYbg4pWbrQ1t EyjdU6sSivEH60Jkev1DpkCVZpDizXv+pmDY+g5sxIhHbL0a3XNnyDi+ni1uidLr Aaq+G4yujcM50b8cjUjYSnd+1ISz22qoFK/V2dpaRzi1QKVkcHI95yoTWG35T3fb SabLhZ1yzExBTt1uZ73P7qHuJN8MvBbvjo8mEvpBFyGb555fdBlPlbZ8FdtJodBv OsJhHbc0/bUYbflVkGlbygoG2Ykk7DPwL7hCQmwSBLzhIyGp0Ak9lgqASByzWfDP num1/3zWQVq7dWMFEEzfnmMaWsWzBEK2h5dLP5RDNYjIUYVzA0EZJPNCZSO2Tin+ 77OQxaZw00yrghK8+NTutV1xzkSzw42cykUkqx2/FHmEPwthsuVb6rT7eu3xTqWU jg6cjsdjQkMy5S7WKCeGjQiY1629qzkVjuD8QHOgcYKutRB7mzXUfW9ABUJyQ9mS 4hsdS0tteH4ang8YVoli2pB9WGxQ437AaJ5nWya1Xkr/ztYNb8uQEpgjoaUPc6ca fdu4bZuuGTqAz+IokgBn8mZR2yPAuAmypqsGmriQ0Y5G10OIuX/ULfvd5DZD8ijM l0qMQUKaQhIHIInu/rv0SMD2WGECJmAxhXQWYwe0YCTDO/Xyet91P8xpwq+U87v+ jxx+copEuFHTvpA4IdMNIS1xDuvIQdXVHdQuCUgagHgpQZ6ZT0RrRFb7QTCZuR80 dR3fDp2YraZQufvLEnAxvti1uXzgnzhFbpjDlzOXgzwYXBH56XRhWSg8ok/dwI+x qIOuzIi5uy1h11STLfiS5fOLSI2oadl5jrR4H65YOY1CVXLjOoNUJVLZ6lY9GDER 9Mtae0sgXum4mFvOiZCDCDpDokIM/6nNGbuQruzVgUlbF4uL0/+sWpXrEpFFN4t9 xYVbT97HyGj7myRkTpIhtr/2xdMEeK9Sjd5QPzIzk11QuEPMQfMGYP3sGOjc5iY8 e/VheXlHgKGPeDsj1y6PvST4SXLxm/MgppO6yx342g2GVOBWDd7mjKnoT5b+n//V 8bircdvcZq18lAZsLFKpSxqtHgdAHfCDTCL+ZLdiaOl1Pa6ZQZdUgOA5Om9V38HF gfGoQStYUUC6cxPPSuTXl4294ZK8IZ2F2Jg2fmzyGacNETfNoA2y+/u2FyeUO1zv jZcpiOnnR/ciZ0vGAQzPtqdkTAiFhfCgFlVfBajDWDdRGKkOxliqXrI+5KTlEID7 yxAQZ3f7wDN2OKFZQcXGErPqs71D8yGPVJh6qhRlrV/v4thRynZx8tzyiXKtsPSC JWdRszPzL9PyyBtTQgwC/PzU35DjJqgUdZ3XxHcjpgRBLbwIu9xFAy4W5/cALAwF HlMM79cxHsHKwJ3aeql8Rm5ZeP6Rp2P1F918F/4LyiZyNEd3iIO0GUsVicWbSfBz YFnWxjdVR16A2enr/3DT0+yT5drep9huvV8gaLoPxi5XmLpb2tQebBeG4cvV04VP iaLjdcStoJPYSMx/A3s5JWmc0k/60WjacHmNnAVwE4uUyfwGeZdJqTWGORth1PU4 5vKr76tuw5Vd9UUQhF936gucEL+O+NkxywR+iBl8oKEecyXgHjzH2OFKuDsnBy/C y3uQ0qRsj8JXu2O1N0Cdb5REFH+ZFTZa3qlSpqsjNMqdm9msYICaGIBHHQ12QIQw 3wa8FikGOmAhLCIZscUoVAt5mE7114aNpsRpxfGY+0aPjAFmK822LmCuZ/r0hmAe Ma2nd3F3BgWXJsCxEaaf2tcmF1OM0uDe8MKvBS0sX6La+z7CJ+4WcA6iWAC8C18h LnVtiM2SI8BmkrPNyYI5QakBOFaUO+7MifjAcobvLxTkHr2lcRV8bTOyDQM4H2CV oYLbU5e+banLW7LD/icfwiKM1IrZ6azzM9sKLdQWq86tTpmrotZf7xdAJfPNZ46Y fnZhJCd9zTKtp3Hv4UFupj8z3/hKPDAT2HXN+VSQ4nlqBZcNdXNoSr3hJ03qhc2B
Re: [PATCH v2 1/4] KVM: x86: Validate guest writes to MSR_IA32_APICBASE
Il 24/01/2014 16:48, Jan Kiszka ha scritto: Check for invalid state transitions on guest-initiated updates of MSR_IA32_APICBASE. This address both enabling of the x2APIC when it is not supported and all invalid transitions as described in SDM section 10.12.5. It also checks that no reserved bit is set in APICBASE by the guest. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- arch/x86/kvm/cpuid.h | 16 arch/x86/kvm/lapic.h | 2 +- arch/x86/kvm/vmx.c | 9 + arch/x86/kvm/x86.c | 32 +--- 4 files changed, 47 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index f1e4895..b012ad2 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -72,4 +72,20 @@ static inline bool guest_cpuid_has_pcid(struct kvm_vcpu *vcpu) return best (best-ecx bit(X86_FEATURE_PCID)); } +static inline bool guest_cpuid_has_x2apic(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 1, 0); + return best (best-ecx bit(X86_FEATURE_X2APIC)); +} + +static inline unsigned int guest_cpuid_get_phys_bits(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 0x8008, 0); + return best ? best-eax 0xff : 36; +} [Resending after learning that Ctrl-Shift-C does other things beyond copying to clipboard] There's already cpuid_maxphyaddr for this. I can adjust it when committing. This is applied to kvm/queue. The other three will have to wait for after the merge window. Paolo #endif diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index c8b0d0d..6a11845 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -65,7 +65,7 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src, struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map); u64 kvm_get_apic_base(struct kvm_vcpu *vcpu); -void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data); +int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info); void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s); int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 5c88791..a06f101 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4392,7 +4392,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) static void vmx_vcpu_reset(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); - u64 msr; + struct msr_data apic_base_msr; vmx-rmode.vm86_active = 0; @@ -4400,10 +4400,11 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu) vmx-vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val(); kvm_set_cr8(vmx-vcpu, 0); - msr = 0xfee0 | MSR_IA32_APICBASE_ENABLE; + apic_base_msr.data = 0xfee0 | MSR_IA32_APICBASE_ENABLE; if (kvm_vcpu_is_bsp(vmx-vcpu)) - msr |= MSR_IA32_APICBASE_BSP; - kvm_set_apic_base(vmx-vcpu, msr); + apic_base_msr.data |= MSR_IA32_APICBASE_BSP; + apic_base_msr.host_initiated = true; + kvm_set_apic_base(vmx-vcpu, apic_base_msr); vmx_segment_cache_clear(vmx); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0c76f7c..f4b0591 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -257,10 +257,26 @@ u64 kvm_get_apic_base(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_get_apic_base); -void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data) -{ - /* TODO: reserve bits check */ - kvm_lapic_set_base(vcpu, data); +int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info) +{ + u64 old_state = vcpu-arch.apic_base + (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE); + u64 new_state = msr_info-data + (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE); + u64 reserved_bits = ((~0ULL) guest_cpuid_get_phys_bits(vcpu)) | + 0x2ff | (guest_cpuid_has_x2apic(vcpu) ? 0 : X2APIC_ENABLE); + + if (!msr_info-host_initiated + ((msr_info-data reserved_bits) != 0 || +new_state == X2APIC_ENABLE || +(new_state == MSR_IA32_APICBASE_ENABLE + old_state == (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE)) || +(new_state == (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE) + old_state == 0))) + return 1; + + kvm_lapic_set_base(vcpu, msr_info-data); + return 0; } EXPORT_SYMBOL_GPL(kvm_set_apic_base); @@ -2006,8 +2022,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case 0x200 ... 0x2ff: return set_msr_mtrr(vcpu, msr, data); case MSR_IA32_APICBASE: - kvm_set_apic_base(vcpu, data); - break; + return kvm_set_apic_base(vcpu, msr_info); case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff:
[PATCH uq/master 0/5] Hyper-V improvements and migratability
The first patch fixes the KVM leaves at 0x4100. Before, there is no leaf at 0x4101 (and the data of the highest Intel leaf is returned, e.g. 0xd on a Sandy Bridge). After this patch there is one. The second patch is extracted from Vadim's migration patches, which are patches 3-5. Review of the first two patches is particularly welcome. Paolo Bonzini (2): KVM: fix coexistence of KVM and Hyper-V leaves kvm: make availability of Hyper-V enlightenments dependent on KVM_CAP_HYPERV Vadim Rozenfeld (3): kvm: make hyperv hypercall and guest os id MSRs migratable. kvm: make hyperv vapic assist page migratable kvm: add support for hyper-v timers linux-headers/asm-x86/hyperv.h | 3 ++ linux-headers/linux/kvm.h | 1 + target-i386/cpu-qom.h | 1 + target-i386/cpu.c | 1 + target-i386/cpu.h | 4 ++ target-i386/kvm.c | 109 + target-i386/machine.c | 67 + 7 files changed, 155 insertions(+), 31 deletions(-) -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] kvm: add support for hyper-v timers
From: Vadim Rozenfeld vroze...@redhat.com http://msdn.microsoft.com/en-us/library/windows/hardware/ff541625%28v=vs.85%29.aspx This code is generic for activating reference time counter or virtual reference time stamp counter Signed-off-by: Vadim Rozenfeld vroze...@redhat.com Reviewed-by: Marcelo Tosatti mtosa...@redhat.com Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- linux-headers/asm-x86/hyperv.h | 3 +++ linux-headers/linux/kvm.h | 1 + target-i386/cpu-qom.h | 1 + target-i386/cpu.c | 1 + target-i386/cpu.h | 1 + target-i386/kvm.c | 20 +++- target-i386/machine.c | 22 ++ 7 files changed, 48 insertions(+), 1 deletion(-) diff --git a/linux-headers/asm-x86/hyperv.h b/linux-headers/asm-x86/hyperv.h index b8f1c01..3b400ee 100644 --- a/linux-headers/asm-x86/hyperv.h +++ b/linux-headers/asm-x86/hyperv.h @@ -149,6 +149,9 @@ /* MSR used to read the per-partition time reference counter */ #define HV_X64_MSR_TIME_REF_COUNT 0x4020 +/* A partition's reference time stamp counter (TSC) page */ +#define HV_X64_MSR_REFERENCE_TSC 0x4021 + /* MSR used to retrieve the TSC frequency */ #define HV_X64_MSR_TSC_FREQUENCY 0x4022 diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h index 5a49671..999fb13 100644 --- a/linux-headers/linux/kvm.h +++ b/linux-headers/linux/kvm.h @@ -674,6 +674,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_ARM_EL1_32BIT 93 #define KVM_CAP_SPAPR_MULTITCE 94 #define KVM_CAP_EXT_EMUL_CPUID 95 +#define KVM_CAP_HYPERV_TIME 96 #ifdef KVM_CAP_IRQ_ROUTING diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h index d1751a4..722f11a 100644 --- a/target-i386/cpu-qom.h +++ b/target-i386/cpu-qom.h @@ -69,6 +69,7 @@ typedef struct X86CPU { bool hyperv_vapic; bool hyperv_relaxed_timing; int hyperv_spinlock_attempts; +bool hyperv_time; bool check_cpuid; bool enforce_cpuid; diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 2e0be01..1f30efd 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -2702,6 +2702,7 @@ static Property x86_cpu_properties[] = { { .name = hv-spinlocks, .info = qdev_prop_spinlocks }, DEFINE_PROP_BOOL(hv-relaxed, X86CPU, hyperv_relaxed_timing, false), DEFINE_PROP_BOOL(hv-vapic, X86CPU, hyperv_vapic, false), +DEFINE_PROP_BOOL(hv-time, X86CPU, hyperv_time, false), DEFINE_PROP_BOOL(check, X86CPU, check_cpuid, false), DEFINE_PROP_BOOL(enforce, X86CPU, enforce_cpuid, false), DEFINE_PROP_END_OF_LIST() diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 45bd554..1b94f0f 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -865,6 +865,7 @@ typedef struct CPUX86State { uint64_t msr_hv_hypercall; uint64_t msr_hv_guest_os_id; uint64_t msr_hv_vapic; +uint64_t msr_hv_tsc; /* exception/interrupt handling */ int error_code; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index ddd437f..e555040 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -74,6 +74,7 @@ static bool has_msr_kvm_steal_time; static int lm_capable_kernel; static bool has_msr_hv_hypercall; static bool has_msr_hv_vapic; +static bool has_msr_hv_tsc; static bool has_msr_architectural_pmu; static uint32_t num_architectural_pmu_counters; @@ -442,6 +443,7 @@ static bool hyperv_enabled(X86CPU *cpu) CPUState *cs = CPU(cpu); return kvm_check_extension(cs-kvm_state, KVM_CAP_HYPERV) 0 (hyperv_hypercall_available(cpu) || +cpu-hyperv_time || cpu-hyperv_relaxed_timing); } @@ -499,7 +501,13 @@ int kvm_arch_init_vcpu(CPUState *cs) c-eax |= HV_X64_MSR_APIC_ACCESS_AVAILABLE; has_msr_hv_vapic = true; } - +if (cpu-hyperv_time +kvm_check_extension(cs-kvm_state, KVM_CAP_HYPERV_TIME) 0) { +c-eax |= HV_X64_MSR_HYPERCALL_AVAILABLE; +c-eax |= HV_X64_MSR_TIME_REF_COUNT_AVAILABLE; +c-eax |= 0x200; +has_msr_hv_tsc = true; +} c = cpuid_data.entries[cpuid_i++]; c-function = HYPERV_CPUID_ENLIGHTMENT_INFO; if (cpu-hyperv_relaxed_timing) { @@ -1239,6 +1247,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level) kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, env-msr_hv_vapic); } +if (has_msr_hv_tsc) { +kvm_msr_entry_set(msrs[n++], HV_X64_MSR_REFERENCE_TSC, + env-msr_hv_tsc); +} /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see * kvm_put_msr_feature_control. */ @@ -1530,6 +1542,9 @@ static int kvm_get_msrs(X86CPU *cpu) if (has_msr_hv_vapic) { msrs[n++].index = HV_X64_MSR_APIC_ASSIST_PAGE; } +if (has_msr_hv_tsc) { +msrs[n++].index = HV_X64_MSR_REFERENCE_TSC; +}
[PATCH 2/5] kvm: make availability of Hyper-V enlightenments dependent on KVM_CAP_HYPERV
The MS docs specify HV_X64_MSR_HYPERCALL as a mandatory interface, thus we must provide the MSRs even if the user only specified features that, like relaxed timing, in principle don't require them. And the MSRs are only there if the hypervisor has KVM_CAP_HYPERV. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- target-i386/kvm.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 5738911..08c47bb 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -72,6 +72,7 @@ static bool has_msr_misc_enable; static bool has_msr_bndcfgs; static bool has_msr_kvm_steal_time; static int lm_capable_kernel; +static bool has_msr_hv_hypercall; static bool has_msr_architectural_pmu; static uint32_t num_architectural_pmu_counters; @@ -437,8 +438,10 @@ static bool hyperv_hypercall_available(X86CPU *cpu) static bool hyperv_enabled(X86CPU *cpu) { -return hyperv_hypercall_available(cpu) || - cpu-hyperv_relaxed_timing; +CPUState *cs = CPU(cpu); +return kvm_check_extension(cs-kvm_state, KVM_CAP_HYPERV) 0 + (hyperv_hypercall_available(cpu) || +cpu-hyperv_relaxed_timing); } #define KVM_MAX_CPUID_ENTRIES 100 @@ -511,6 +514,7 @@ int kvm_arch_init_vcpu(CPUState *cs) c-ebx = 0x40; kvm_base = KVM_CPUID_SIGNATURE_NEXT; +has_msr_hv_hypercall = true; } memcpy(signature, KVMKVMKVM\0\0\0, 12); -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] kvm: make hyperv hypercall and guest os id MSRs migratable.
From: Vadim Rozenfeld vroze...@redhat.com Signed-off-by: Vadim Rozenfeld vroze...@redhat.com Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- target-i386/cpu.h | 2 ++ target-i386/kvm.c | 18 +++--- target-i386/machine.c | 23 +++ 3 files changed, 40 insertions(+), 3 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 1fcbc82..3dba5ef 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -862,6 +862,8 @@ typedef struct CPUX86State { uint64_t msr_fixed_counters[MAX_FIXED_COUNTERS]; uint64_t msr_gp_counters[MAX_GP_COUNTERS]; uint64_t msr_gp_evtsel[MAX_GP_COUNTERS]; +uint64_t msr_hv_hypercall; +uint64_t msr_hv_guest_os_id; /* exception/interrupt handling */ int error_code; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 08c47bb..8f2854a 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1227,9 +1227,11 @@ static int kvm_put_msrs(X86CPU *cpu, int level) kvm_msr_entry_set(msrs[n++], MSR_CORE_PERF_GLOBAL_CTRL, env-msr_global_ctrl); } -if (hyperv_hypercall_available(cpu)) { -kvm_msr_entry_set(msrs[n++], HV_X64_MSR_GUEST_OS_ID, 0); -kvm_msr_entry_set(msrs[n++], HV_X64_MSR_HYPERCALL, 0); +if (has_msr_hv_hypercall) { +kvm_msr_entry_set(msrs[n++], HV_X64_MSR_GUEST_OS_ID, + env-msr_hv_guest_os_id); +kvm_msr_entry_set(msrs[n++], HV_X64_MSR_HYPERCALL, + env-msr_hv_hypercall); } if (cpu-hyperv_vapic) { kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0); @@ -1518,6 +1520,10 @@ static int kvm_get_msrs(X86CPU *cpu) } } +if (has_msr_hv_hypercall) { +msrs[n++].index = HV_X64_MSR_HYPERCALL; +msrs[n++].index = HV_X64_MSR_GUEST_OS_ID; +} msr_data.info.nmsrs = n; ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_MSRS, msr_data); if (ret 0) { @@ -1625,6 +1631,12 @@ static int kvm_get_msrs(X86CPU *cpu) case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1: env-msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data; break; +case HV_X64_MSR_HYPERCALL: +env-msr_hv_hypercall = msrs[i].data; +break; +case HV_X64_MSR_GUEST_OS_ID: +env-msr_hv_guest_os_id = msrs[i].data; +break; } } diff --git a/target-i386/machine.c b/target-i386/machine.c index 2de1964..96fd045 100644 --- a/target-i386/machine.c +++ b/target-i386/machine.c @@ -554,6 +554,26 @@ static const VMStateDescription vmstate_mpx = { } }; +static bool hyperv_hypercall_enable_needed(void *opaque) +{ +X86CPU *cpu = opaque; +CPUX86State *env = cpu-env; + +return env-msr_hv_hypercall != 0 || env-msr_hv_guest_os_id != 0; +} + +static const VMStateDescription vmstate_msr_hypercall_hypercall = { +.name = cpu/msr_hyperv_hypercall, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT64(env.msr_hv_hypercall, X86CPU), +VMSTATE_UINT64(env.msr_hv_guest_os_id, X86CPU), +VMSTATE_END_OF_LIST() +} +}; + const VMStateDescription vmstate_x86_cpu = { .name = cpu, .version_id = 12, @@ -688,6 +708,9 @@ const VMStateDescription vmstate_x86_cpu = { } , { .vmsd = vmstate_mpx, .needed = mpx_needed, +}, { +.vmsd = vmstate_msr_hypercall_hypercall, +.needed = hyperv_hypercall_enable_needed, } , { /* empty */ } -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] KVM: fix coexistence of KVM and Hyper-V leaves
kvm_arch_init_vcpu's initialization of the KVM leaves at 0x4100 is broken, because KVM_CPUID_FEATURES is left at 0x4001. Move it to 0x4101 if Hyper-V is enabled. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- target-i386/kvm.c | 47 +-- 1 file changed, 25 insertions(+), 22 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 0a21c30..5738911 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -455,6 +455,7 @@ int kvm_arch_init_vcpu(CPUState *cs) uint32_t unused; struct kvm_cpuid_entry2 *c; uint32_t signature[3]; +int kvm_base = KVM_CPUID_SIGNATURE; int r; memset(cpuid_data, 0, sizeof(cpuid_data)); @@ -462,26 +463,22 @@ int kvm_arch_init_vcpu(CPUState *cs) cpuid_i = 0; /* Paravirtualization CPUIDs */ -c = cpuid_data.entries[cpuid_i++]; -c-function = KVM_CPUID_SIGNATURE; -if (!hyperv_enabled(cpu)) { -memcpy(signature, KVMKVMKVM\0\0\0, 12); -c-eax = 0; -} else { +if (hyperv_enabled(cpu)) { +c = cpuid_data.entries[cpuid_i++]; +c-function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS; memcpy(signature, Microsoft Hv, 12); c-eax = HYPERV_CPUID_MIN; -} -c-ebx = signature[0]; -c-ecx = signature[1]; -c-edx = signature[2]; - -c = cpuid_data.entries[cpuid_i++]; -c-function = KVM_CPUID_FEATURES; -c-eax = env-features[FEAT_KVM]; +c-ebx = signature[0]; +c-ecx = signature[1]; +c-edx = signature[2]; -if (hyperv_enabled(cpu)) { +c = cpuid_data.entries[cpuid_i++]; +c-function = HYPERV_CPUID_INTERFACE; memcpy(signature, Hv#1\0\0\0\0\0\0\0\0, 12); c-eax = signature[0]; +c-ebx = 0; +c-ecx = 0; +c-edx = 0; c = cpuid_data.entries[cpuid_i++]; c-function = HYPERV_CPUID_VERSION; @@ -513,15 +510,21 @@ int kvm_arch_init_vcpu(CPUState *cs) c-eax = 0x40; c-ebx = 0x40; -c = cpuid_data.entries[cpuid_i++]; -c-function = KVM_CPUID_SIGNATURE_NEXT; -memcpy(signature, KVMKVMKVM\0\0\0, 12); -c-eax = 0; -c-ebx = signature[0]; -c-ecx = signature[1]; -c-edx = signature[2]; +kvm_base = KVM_CPUID_SIGNATURE_NEXT; } +memcpy(signature, KVMKVMKVM\0\0\0, 12); +c = cpuid_data.entries[cpuid_i++]; +c-function = KVM_CPUID_SIGNATURE | kvm_base; +c-eax = 0; +c-ebx = signature[0]; +c-ecx = signature[1]; +c-edx = signature[2]; + +c = cpuid_data.entries[cpuid_i++]; +c-function = KVM_CPUID_FEATURES | kvm_base; +c-eax = env-features[FEAT_KVM]; + has_msr_async_pf_en = c-eax (1 KVM_FEATURE_ASYNC_PF); has_msr_pv_eoi_en = c-eax (1 KVM_FEATURE_PV_EOI); -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5] kvm: make hyperv vapic assist page migratable
From: Vadim Rozenfeld vroze...@redhat.com Signed-off-by: Vadim Rozenfeld vroze...@redhat.com Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- target-i386/cpu.h | 1 + target-i386/kvm.c | 16 +--- target-i386/machine.c | 22 ++ 3 files changed, 36 insertions(+), 3 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 3dba5ef..45bd554 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -864,6 +864,7 @@ typedef struct CPUX86State { uint64_t msr_gp_evtsel[MAX_GP_COUNTERS]; uint64_t msr_hv_hypercall; uint64_t msr_hv_guest_os_id; +uint64_t msr_hv_vapic; /* exception/interrupt handling */ int error_code; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 8f2854a..ddd437f 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -73,6 +73,7 @@ static bool has_msr_bndcfgs; static bool has_msr_kvm_steal_time; static int lm_capable_kernel; static bool has_msr_hv_hypercall; +static bool has_msr_hv_vapic; static bool has_msr_architectural_pmu; static uint32_t num_architectural_pmu_counters; @@ -496,6 +497,7 @@ int kvm_arch_init_vcpu(CPUState *cs) if (cpu-hyperv_vapic) { c-eax |= HV_X64_MSR_HYPERCALL_AVAILABLE; c-eax |= HV_X64_MSR_APIC_ACCESS_AVAILABLE; +has_msr_hv_vapic = true; } c = cpuid_data.entries[cpuid_i++]; @@ -503,7 +505,7 @@ int kvm_arch_init_vcpu(CPUState *cs) if (cpu-hyperv_relaxed_timing) { c-eax |= HV_X64_RELAXED_TIMING_RECOMMENDED; } -if (cpu-hyperv_vapic) { +if (has_msr_hv_vapic) { c-eax |= HV_X64_APIC_ACCESS_RECOMMENDED; } c-ebx = cpu-hyperv_spinlock_attempts; @@ -1233,8 +1235,9 @@ static int kvm_put_msrs(X86CPU *cpu, int level) kvm_msr_entry_set(msrs[n++], HV_X64_MSR_HYPERCALL, env-msr_hv_hypercall); } -if (cpu-hyperv_vapic) { -kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0); +if (has_msr_hv_vapic) { +kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, + env-msr_hv_vapic); } /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see @@ -1524,6 +1527,10 @@ static int kvm_get_msrs(X86CPU *cpu) msrs[n++].index = HV_X64_MSR_HYPERCALL; msrs[n++].index = HV_X64_MSR_GUEST_OS_ID; } +if (has_msr_hv_vapic) { +msrs[n++].index = HV_X64_MSR_APIC_ASSIST_PAGE; +} + msr_data.info.nmsrs = n; ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_MSRS, msr_data); if (ret 0) { @@ -1637,6 +1644,9 @@ static int kvm_get_msrs(X86CPU *cpu) case HV_X64_MSR_GUEST_OS_ID: env-msr_hv_guest_os_id = msrs[i].data; break; +case HV_X64_MSR_APIC_ASSIST_PAGE: +env-msr_hv_vapic = msrs[i].data; +break; } } diff --git a/target-i386/machine.c b/target-i386/machine.c index 96fd045..e72e270 100644 --- a/target-i386/machine.c +++ b/target-i386/machine.c @@ -574,6 +574,25 @@ static const VMStateDescription vmstate_msr_hypercall_hypercall = { } }; +static bool hyperv_vapic_enable_needed(void *opaque) +{ +X86CPU *cpu = opaque; +CPUX86State *env = cpu-env; + +return env-msr_hv_vapic != 0; +} + +static const VMStateDescription vmstate_msr_hyperv_vapic = { +.name = cpu/msr_hyperv_vapic, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT64(env.msr_hv_vapic, X86CPU), +VMSTATE_END_OF_LIST() +} +}; + const VMStateDescription vmstate_x86_cpu = { .name = cpu, .version_id = 12, @@ -711,6 +730,9 @@ const VMStateDescription vmstate_x86_cpu = { }, { .vmsd = vmstate_msr_hypercall_hypercall, .needed = hyperv_hypercall_enable_needed, +}, { +.vmsd = vmstate_msr_hyperv_vapic, +.needed = hyperv_vapic_enable_needed, } , { /* empty */ } -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 07:32, Paolo Bonzini pbonz...@redhat.com wrote: Il 24/01/2014 16:23, Victor Kamensky ha scritto: Also if you use ints on real bus as description, you may want to clarify restrictions on mmio.len. Basically on 32 bit platform (i.e like V7 ARM) one cannot have mmio.len=8, because one cannot have 64bit value on 32bit data bus. Without such clarification introduction of text like the value as it would go on the bus in real hardware is confusing for len=8 for emulated CPUs where real busses are 32bit. This is not necessarily true. On a 32-bit CPU you can have a 64-bit memory bus. Even x86 32-bit CPUs can do 64-bit MMIO via MMX or SSE or double-word compare-and-swap (CMPXCHG8B). Sure, that was part of my point :). But now text says about real hardware buses and in any given moment emulator specify particular type of CPU emulated, and I am quite sure that we can find one emulated ARMV7 CPU that would just have real 32bit data bus. Note there was no such problem and nobody cares what real data buses width are, with definition I argued for. That I think that was original intent of '__u8 data[8]' use - just bytes array description of how memory at given phys_addr looks before (mmiois_write=0) or would look after (mmio.is_write = 1) requested memory transaction (or several of them for that matter) are emulated by KVM_EXIT_MMIO. When if comes to devices memory read/write it is logical view of memory of course. Such definition works with any real data buses sizes, starting with byte. But, ooh, well ... nobody understands such definition .. I stand failed to explain it clear enough. Thanks, Victor Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 07:23, Victor Kamensky victor.kamen...@linaro.org wrote: On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote: On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote: Il 24/01/2014 01:01, Peter Maydell ha scritto: +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. Other thing with mmio.len need clarification: please specify what mmio.len values are allowed. It seems that it was implied that len could be only power of 2 in range between 1 and 8. I would rather see that it is spelled out. I think code on all sides of KVM_EXIT_MMIO already make such assumption. Now when you talk about integers values on bus in real hardware power of 2 sizes implied Also note for memory pieces with sizes other than 2, 4, 8 endianness is not defined. Also if you use ints on real bus as description, you may want to clarify restrictions on mmio.len. Basically on 32 bit platform (i.e like V7 ARM) one cannot have mmio.len=8, because one cannot have 64bit value on 32bit data bus. Without such clarification introduction of text like the value as it would go on the bus in real hardware is confusing for len=8 for emulated CPUs where real busses are 32bit. If ldrd/strd would be emulated on ARMV7 one would need to use mmio twice to pass required data in either direction using len=4 .. Thanks, Victor The host kernel should always be able to do: +type val = *((type *)mmio.data). would it be prudent to say that it is just integer type here. Thanks, Victor I think this would be better phrased as The host userspace should always, since this documentation is supposed to be telling userspace what the kernel's contract with it is, not the kernel keeping notes for itself on its own implementation. (It also clarifies what the intention is for the obscure and maybe-we'll-never-implement-this case of an LE host kernel using a compatibility interface to run the host userspace (QEMU) as a BE process which sees the same ABI a BE kernel provides, without actually dragging that red herring explicitly into the documentation.) I agree, and also the first line should mention userspace. In PPC I think it's possible or even common to have BE host kernel and LE host userspace (or perhaps vice versa is the common one). It was possible on 32bit, but I'm not sure anyone's actively using it :). The thing that was very common (not so much anymore for enterprise distros) is 32-bit user space with 64-bit kernels. Alex ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code
On Thu, Jan 16, 2014 at 03:13:44PM +0100, Julian Stecklina wrote: The paravirtualized clock used in KVM and Xen uses a version field to allow the guest to see when the shared data structure is inconsistent. The code reads the version field twice (before and after the data structure is copied) and checks whether they haven't changed and that the lowest bit is not set. As the second access is not synchronized, the compiler could generate code that accesses version more than two times and you end up with inconsistent data. Could you paste in the code that the 'bad' compiler generates vs the compiler that generate 'good' code please? An example using pvclock_get_time_values: host starts updating data, sets src-version to 1 guest reads src-version (1) and stores it into dst-version. guest copies inconsistent data guest reads src-version (1) and computes xor with dst-version. host finishes updating data and sets src-version to 2 guest reads src-version (2) and checks whether lower bit is not set. while loop exits with inconsistent data! AFAICS the compiler is allowed to optimize the given code this way. Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de --- arch/x86/kernel/pvclock.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 42eb330..f62b41c 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -55,6 +55,8 @@ static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, struct pvclock_vcpu_time_info *src) { + u32 nversion; + do { dst-version = src-version; rmb(); /* fetch version before data */ @@ -64,7 +66,8 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, dst-tsc_shift = src-tsc_shift; dst-flags = src-flags; rmb(); /* test version after fetching data */ - } while ((src-version 1) || (dst-version != src-version)); + nversion = ACCESS_ONCE(src-version); + } while ((nversion 1) || (dst-version != nversion)); return dst-version; } @@ -135,7 +138,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, struct pvclock_vcpu_time_info *vcpu_time, struct timespec *ts) { - u32 version; + u32 version, nversion; u64 delta; struct timespec now; @@ -146,7 +149,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, now.tv_sec = wall_clock-sec; now.tv_nsec = wall_clock-nsec; rmb(); /* fetch time before checking version */ - } while ((wall_clock-version 1) || (version != wall_clock-version)); + nversion = ACCESS_ONCE(wall_clock-version); + } while ((nversion 1) || (version != nversion)); delta = pvclock_clocksource_read(vcpu_time);/* time since system boot */ delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec; -- 1.8.4.2 ___ Xen-devel mailing list xen-de...@lists.xen.org http://lists.xen.org/xen-devel -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm virtio ethernet ring on guest side over high throughput (packet per second)
Well, its confirmed that ... because of the shape of our traffic, constant burst of many many small packages (1.5k / 3.5k) Nagle algorithm was in a beggining the root cause of our performance issues. So i will have this thread as solved. Thank you so much to everyone involved, specially people from RedHat. Thanks a lot! Alejandro Comisario #melicloud CloudBuilders Arias 3751, Piso 7 (C1430CRG) Ciudad de Buenos Aires - Argentina Cel: +549(11) 15-3770-1857 Tel : +54(11) 4640-8443 On Thu, Jan 23, 2014 at 4:25 PM, Alejandro Comisario alejandro.comisa...@mercadolibre.com wrote: Jason, Stefan ... thank you so much. At a glance, disabling Nagle algorithm made the hundred thousands 20ms delays to dissapear suddenly, tomorrow we are gonna made a whole day test again, and test client connectivity against NginX and Memcached to see if, because of the traffic we have (hundred thousands packages per minute) Nagle introduced this delay. I'll get back to you tomorrow with the tests. Thanks again. kindest regards. Alejandro Comisario #melicloud CloudBuilders Arias 3751, Piso 7 (C1430CRG) Ciudad de Buenos Aires - Argentina Cel: +549(11) 15-3770-1857 Tel : +54(11) 4640-8443 On Thu, Jan 23, 2014 at 12:14 AM, Jason Wang jasow...@redhat.com wrote: On 01/23/2014 05:32 AM, Alejandro Comisario wrote: Thank you so much Stefan for the help and cc'ing Michael Jason. Like you advised yesterday on IRC, today we are making some tests with the application setting TCP_NODELAY in the socket options. So we will try that and get back to you with further information. In the mean time, maybe showing what options the vms are using while running ! # -- /usr/bin/kvm -S -M pc-1.0 -cpu core2duo,+lahf_lm,+rdtscp,+pdpe1gb,+aes,+popcnt,+x2apic,+sse4.2,+sse4.1,+dca,+xtpr,+cx16,+tm2,+est,+vmx,+ds_cpl,+pbe,+tm,+ht,+ss,+acpi,+ds -enable-kvm -m 32768 -smp 8,sockets=1,cores=6,threads=2 -name instance-0254 -uuid d25b1b20-409e-4d7f-bd92-2ef4073c7c2b -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-0254.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -kernel /var/lib/nova/instances/instance-0254/kernel -initrd /var/lib/nova/instances/instance-0254/ramdisk -append root=/dev/vda console=ttyS0 -drive file=/var/lib/nova/instances/instance-0254/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=writethrough -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,fd=19,id=hostnet0 -device Better enable vhost as Stefan suggested. It may help a lot here. virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:27:d4:6d,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/instance-0254/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:4 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 # -- best regards Alejandro Comisario #melicloud CloudBuilders Arias 3751, Piso 7 (C1430CRG) Ciudad de Buenos Aires - Argentina Cel: +549(11) 15-3770-1857 Tel : +54(11) 4640-8443 On Wed, Jan 22, 2014 at 12:22 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Tue, Jan 21, 2014 at 04:06:05PM -0200, Alejandro Comisario wrote: CCed Michael Tsirkin and Jason Wang who work on KVM networking. Hi guys, we had in the past when using physical servers, several throughput issues regarding the throughput of our APIS, in our case we measure this with packets per seconds, since we dont have that much bandwidth (Mb/s) since our apis respond lots of packets very small ones (maximum response of 3.5k and avg response of 1.5k), when we where using this physical servers, when we reach throughput capacity (due to clients tiemouts) we touched the ethernet ring configuration and we made the problem dissapear. Today with kvm and over 10k virtual instances, when we want to increase the throughput of KVM instances, we bumped with the fact that when using virtio on guests, we have a max configuration of the ring of 256 TX/RX, and from the host side the atached vnet has a txqueuelen of 500. What i want to know is, how can i tune the guest to support more packets per seccond if i know that's my bottleneck? I suggest investigating performance in a systematic way. Set up a benchmark that saturates the network. Post the details of the benchmark and the results that you are seeing. Then, we can discuss how to investigate the root cause of the bottleneck. *
Re: [PATCH 1/5] KVM: fix coexistence of KVM and Hyper-V leaves
On Fri, Jan 24, 2014 at 05:17:52PM +0100, Paolo Bonzini wrote: kvm_arch_init_vcpu's initialization of the KVM leaves at 0x4100 is broken, because KVM_CPUID_FEATURES is left at 0x4001. Move it to 0x4101 if Hyper-V is enabled. Signed-off-by: Paolo Bonzini pbonz...@redhat.com arch/x86/include/asm/kvm_para.h static inline unsigned int kvm_arch_para_features(void) { return cpuid_eax(KVM_CPUID_FEATURES); } Shouldnt it be using kvm_cpuid_base ? --- target-i386/kvm.c | 47 +-- 1 file changed, 25 insertions(+), 22 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 0a21c30..5738911 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -455,6 +455,7 @@ int kvm_arch_init_vcpu(CPUState *cs) uint32_t unused; struct kvm_cpuid_entry2 *c; uint32_t signature[3]; +int kvm_base = KVM_CPUID_SIGNATURE; int r; memset(cpuid_data, 0, sizeof(cpuid_data)); @@ -462,26 +463,22 @@ int kvm_arch_init_vcpu(CPUState *cs) cpuid_i = 0; /* Paravirtualization CPUIDs */ -c = cpuid_data.entries[cpuid_i++]; -c-function = KVM_CPUID_SIGNATURE; -if (!hyperv_enabled(cpu)) { -memcpy(signature, KVMKVMKVM\0\0\0, 12); -c-eax = 0; -} else { +if (hyperv_enabled(cpu)) { +c = cpuid_data.entries[cpuid_i++]; +c-function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS; memcpy(signature, Microsoft Hv, 12); c-eax = HYPERV_CPUID_MIN; -} -c-ebx = signature[0]; -c-ecx = signature[1]; -c-edx = signature[2]; - -c = cpuid_data.entries[cpuid_i++]; -c-function = KVM_CPUID_FEATURES; -c-eax = env-features[FEAT_KVM]; +c-ebx = signature[0]; +c-ecx = signature[1]; +c-edx = signature[2]; -if (hyperv_enabled(cpu)) { +c = cpuid_data.entries[cpuid_i++]; +c-function = HYPERV_CPUID_INTERFACE; memcpy(signature, Hv#1\0\0\0\0\0\0\0\0, 12); c-eax = signature[0]; +c-ebx = 0; +c-ecx = 0; +c-edx = 0; c = cpuid_data.entries[cpuid_i++]; c-function = HYPERV_CPUID_VERSION; @@ -513,15 +510,21 @@ int kvm_arch_init_vcpu(CPUState *cs) c-eax = 0x40; c-ebx = 0x40; -c = cpuid_data.entries[cpuid_i++]; -c-function = KVM_CPUID_SIGNATURE_NEXT; -memcpy(signature, KVMKVMKVM\0\0\0, 12); -c-eax = 0; -c-ebx = signature[0]; -c-ecx = signature[1]; -c-edx = signature[2]; +kvm_base = KVM_CPUID_SIGNATURE_NEXT; } +memcpy(signature, KVMKVMKVM\0\0\0, 12); +c = cpuid_data.entries[cpuid_i++]; +c-function = KVM_CPUID_SIGNATURE | kvm_base; +c-eax = 0; +c-ebx = signature[0]; +c-ecx = signature[1]; +c-edx = signature[2]; + +c = cpuid_data.entries[cpuid_i++]; +c-function = KVM_CPUID_FEATURES | kvm_base; +c-eax = env-features[FEAT_KVM]; + has_msr_async_pf_en = c-eax (1 KVM_FEATURE_ASYNC_PF); has_msr_pv_eoi_en = c-eax (1 KVM_FEATURE_PV_EOI); -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote: On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote: Il 24/01/2014 01:01, Peter Maydell ha scritto: +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host kernel should always be able to do: +type val = *((type *)mmio.data). I think this would be better phrased as The host userspace should always, since this documentation is supposed to be telling userspace what the kernel's contract with it is, not the kernel keeping notes for itself on its own implementation. (It also clarifies what the intention is for the obscure and maybe-we'll-never-implement-this case of an LE host kernel using a compatibility interface to run the host userspace (QEMU) as a BE process which sees the same ABI a BE kernel provides, without actually dragging that red herring explicitly into the documentation.) I agree, and also the first line should mention userspace. In PPC I think it's possible or even common to have BE host kernel and LE host userspace (or perhaps vice versa is the common one). It was possible on 32bit, but I'm not sure anyone's actively using it :). The thing that was very common (not so much anymore for enterprise distros) is 32-bit user space with 64-bit kernels. Paolo, Alex, good point about BE kernel / LE user-land mix! How KVM kernel code that deals with KVM_MMIO_EXIT can find out what is user process endianity that handles this KVM_MMIO_EXIT? Do we have kernel function that can tell that. Hey, what is current user-land task endianity? :). And reverse case, how user-land code that wants to do KVM_MMIO_EXIT can find out what is endianity of kernel? Do we have such system call? Hey, kernel tell me your endianity :). Just a thought: should not we instead of trying to implicitly setup endianity by some other side properties like emulator or kernel endianity, let's just do it explicitly. Adding 'endianness' field into current mmio structure is not an option, but maybe there is outside mechanism like KVM features, special ioctl number, through which one can explicitly either set of learn endianity in mmio.data[] for given KVM session. At least, if we don't want to consider mixed BE/LE kernel/user-land, then document should clarify that BE/LE kernel/user-land mix is not supported and assumption here is that they always coincide. Thanks, Victor Alex ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 69361] Host call trace and guest hang after create guest.
https://bugzilla.kernel.org/show_bug.cgi?id=69361 Marcelo Tosatti mtosa...@redhat.com changed: What|Removed |Added CC||mtosa...@redhat.com --- Comment #4 from Marcelo Tosatti mtosa...@redhat.com --- Can you make the host kernel config file available? -- You are receiving this mail because: You are watching the assignee of the bug. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] KVM: fix coexistence of KVM and Hyper-V leaves
Il 24/01/2014 20:08, Marcelo Tosatti ha scritto: kvm_arch_init_vcpu's initialization of the KVM leaves at 0x4100 is broken, because KVM_CPUID_FEATURES is left at 0x4001. Move it to 0x4101 if Hyper-V is enabled. Signed-off-by: Paolo Bonzini pbonz...@redhat.com arch/x86/include/asm/kvm_para.h static inline unsigned int kvm_arch_para_features(void) { return cpuid_eax(KVM_CPUID_FEATURES); } Shouldnt it be using kvm_cpuid_base ? Yes. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] ARM: mm: Fix stage-2 device memory attributes
On Sat, Jan 04, 2014 at 08:27:23AM -0800, Christoffer Dall wrote: The stage-2 memory attributes are distinct from the Hyp memory attributes and the Stage-1 memory attributes. We were using the stage-1 memory attributes for stage-2 mappings causing device mappings to be mapped as normal memory. Add the S2 equivalent defines for memory attributes and fix the comments explaining the defines while at it. Add a prot_pte_s2 field to the mem_type struct and fill out the field for device mappings accordingly. Signed-off-by: Christoffer Dall christoffer.d...@linaro.org --- Changelog[v2]: - Guard the use of L_PTE_S2 defines with s2_policy to allow non-LPAE compiles. arch/arm/include/asm/pgtable-3level.h | 20 +--- arch/arm/mm/mm.h | 1 + arch/arm/mm/mmu.c | 15 ++- 3 files changed, 28 insertions(+), 8 deletions(-) diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h index 4f95039..d5e04d6 100644 --- a/arch/arm/include/asm/pgtable-3level.h +++ b/arch/arm/include/asm/pgtable-3level.h @@ -120,13 +120,19 @@ /* * 2nd stage PTE definitions for LPAE. */ -#define L_PTE_S2_MT_UNCACHED (_AT(pteval_t, 0x5) 2) /* MemAttr[3:0] */ -#define L_PTE_S2_MT_WRITETHROUGH (_AT(pteval_t, 0xa) 2) /* MemAttr[3:0] */ -#define L_PTE_S2_MT_WRITEBACK (_AT(pteval_t, 0xf) 2) /* MemAttr[3:0] */ -#define L_PTE_S2_RDONLY (_AT(pteval_t, 1) 6) /* HAP[1] */ -#define L_PTE_S2_RDWR (_AT(pteval_t, 3) 6) /* HAP[2:1] */ - -#define L_PMD_S2_RDWR (_AT(pmdval_t, 3) 6) /* HAP[2:1] */ +#define L_PTE_S2_MT_UNCACHED (_AT(pteval_t, 0x0) 2) /* strongly ordered */ +#define L_PTE_S2_MT_WRITETHROUGH (_AT(pteval_t, 0xa) 2) /* normal inner write-through */ +#define L_PTE_S2_MT_WRITEBACK(_AT(pteval_t, 0xf) 2) /* normal inner write-back */ +#define L_PTE_S2_MT_DEV_SHARED (_AT(pteval_t, 0x1) 2) /* device */ +#define L_PTE_S2_MT_DEV_NONSHARED(_AT(pteval_t, 0x1) 2) /* device */ +#define L_PTE_S2_MT_DEV_WC (_AT(pteval_t, 0x5) 2) /* normal non-cacheable */ +#define L_PTE_S2_MT_DEV_CACHED (_AT(pteval_t, 0xf) 2) /* normal inner write-back */ +#define L_PTE_S2_MT_MASK (_AT(pteval_t, 0xf) 2) + +#define L_PTE_S2_RDONLY (_AT(pteval_t, 1) 6) /* HAP[1] */ +#define L_PTE_S2_RDWR(_AT(pteval_t, 3) 6) /* HAP[2:1] */ + +#define L_PMD_S2_RDWR(_AT(pmdval_t, 3) 6) /* HAP[2:1] */ /* * Hyp-mode PL2 PTE definitions for LPAE. diff --git a/arch/arm/mm/mm.h b/arch/arm/mm/mm.h index d5a982d..7ea641b 100644 --- a/arch/arm/mm/mm.h +++ b/arch/arm/mm/mm.h @@ -38,6 +38,7 @@ static inline pmd_t *pmd_off_k(unsigned long virt) struct mem_type { pteval_t prot_pte; + pteval_t prot_pte_s2; pmdval_t prot_l1; pmdval_t prot_sect; unsigned int domain; diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c index 580ef2d..44d571f 100644 --- a/arch/arm/mm/mmu.c +++ b/arch/arm/mm/mmu.c @@ -231,36 +231,48 @@ __setup(noalign, noalign_setup); #endif /* ifdef CONFIG_CPU_CP15 / else */ #define PROT_PTE_DEVICE L_PTE_PRESENT|L_PTE_YOUNG|L_PTE_DIRTY|L_PTE_XN +#define PROT_PTE_S2_DEVICE PROT_PTE_DEVICE #define PROT_SECT_DEVICE PMD_TYPE_SECT|PMD_SECT_AP_WRITE static struct mem_type mem_types[] = { [MT_DEVICE] = { /* Strongly ordered / ARMv6 shared device */ .prot_pte = PROT_PTE_DEVICE | L_PTE_MT_DEV_SHARED | L_PTE_SHARED, + .prot_pte_s2= s2_policy(PROT_PTE_S2_DEVICE) | + s2_policy(L_PTE_S2_MT_DEV_SHARED) | + L_PTE_SHARED, .prot_l1= PMD_TYPE_TABLE, .prot_sect = PROT_SECT_DEVICE | PMD_SECT_S, .domain = DOMAIN_IO, }, [MT_DEVICE_NONSHARED] = { /* ARMv6 non-shared device */ .prot_pte = PROT_PTE_DEVICE | L_PTE_MT_DEV_NONSHARED, + .prot_pte_s2= s2_policy(PROT_PTE_S2_DEVICE) | + s2_policy(L_PTE_S2_MT_DEV_NONSHARED), .prot_l1= PMD_TYPE_TABLE, .prot_sect = PROT_SECT_DEVICE, .domain = DOMAIN_IO, }, [MT_DEVICE_CACHED] = {/* ioremap_cached */ .prot_pte = PROT_PTE_DEVICE | L_PTE_MT_DEV_CACHED, + .prot_pte_s2= s2_policy(PROT_PTE_S2_DEVICE) | + s2_policy(L_PTE_S2_MT_DEV_CACHED), .prot_l1= PMD_TYPE_TABLE, .prot_sect = PROT_SECT_DEVICE | PMD_SECT_WB, .domain = DOMAIN_IO, }, [MT_DEVICE_WC] = { /*
[PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
The KVM API documentation is not clear about the semantics of the data field on the mmio struct on the kvm_run struct. This has become problematic when supporting ARM guests on big-endian host systems with guests of both endianness types, because it is unclear how the data should be exported to user space. This should not break with existing implementations as all supported existing implementations of known user space applications (QEMU and kvmtools for virtio) only support default endianness of the architectures on the host side. Cc: Marc Zyngier marc.zyng...@arm.com Cc: Peter Maydell peter.mayd...@linaro.org Cc: Alexander Graf ag...@suse.de Signed-off-by: Christoffer Dall christoffer.d...@linaro.org --- Changes [v1 - v2]: - s/host kernel should/host user space should/ Documentation/virtual/kvm/api.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). + NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_DCR, KVM_EXIT_PAPR and KVM_EXIT_EPR the corresponding operations are complete (and guest state is consistent) only after userspace -- 1.8.5.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: The KVM API documentation is not clear about the semantics of the data field on the mmio struct on the kvm_run struct. This has become problematic when supporting ARM guests on big-endian host systems with guests of both endianness types, because it is unclear how the data should be exported to user space. This should not break with existing implementations as all supported existing implementations of known user space applications (QEMU and kvmtools for virtio) only support default endianness of the architectures on the host side. Cc: Marc Zyngier marc.zyng...@arm.com Cc: Peter Maydell peter.mayd...@linaro.org Cc: Alexander Graf ag...@suse.de Signed-off-by: Christoffer Dall christoffer.d...@linaro.org --- Changes [v1 - v2]: - s/host kernel should/host user space should/ Documentation/virtual/kvm/api.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. I'm not sure that host kernel native endianness is an accurate way of describing what currently happens. Regardless of host or guest endianness, the guest should be swapping the value as necessary to ensure that the value that goes on the (real or emulated) bus is the same. Plus, if it were really as it would go on the bus the value wouldn't necessarily be left justified within data[], depending on how the bus works. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 15:51, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: The KVM API documentation is not clear about the semantics of the data field on the mmio struct on the kvm_run struct. This has become problematic when supporting ARM guests on big-endian host systems with guests of both endianness types, because it is unclear how the data should be exported to user space. This should not break with existing implementations as all supported existing implementations of known user space applications (QEMU and kvmtools for virtio) only support default endianness of the architectures on the host side. Cc: Marc Zyngier marc.zyng...@arm.com Cc: Peter Maydell peter.mayd...@linaro.org Cc: Alexander Graf ag...@suse.de Signed-off-by: Christoffer Dall christoffer.d...@linaro.org --- Changes [v1 - v2]: - s/host kernel should/host user space should/ Documentation/virtual/kvm/api.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. I'm not sure that host kernel native endianness is an accurate way of describing what currently happens. Regardless of host or guest endianness, the guest should be swapping the value as necessary to ensure that the value that goes on the (real or emulated) bus is the same. Plus, if it were really as it would go on the bus the value wouldn't necessarily be left justified within data[], depending on how the bus works. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. Scott, Yes! Thank you so much! Completely agree. You managed elegantly to express in two lines, what I failed to communicate with hundreds lines of my emails in last few days on this and other email thread. - Victor -Scott ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: The KVM API documentation is not clear about the semantics of the data field on the mmio struct on the kvm_run struct. This has become problematic when supporting ARM guests on big-endian host systems with guests of both endianness types, because it is unclear how the data should be exported to user space. This should not break with existing implementations as all supported existing implementations of known user space applications (QEMU and kvmtools for virtio) only support default endianness of the architectures on the host side. Cc: Marc Zyngier marc.zyng...@arm.com Cc: Peter Maydell peter.mayd...@linaro.org Cc: Alexander Graf ag...@suse.de Signed-off-by: Christoffer Dall christoffer.d...@linaro.org --- Changes [v1 - v2]: - s/host kernel should/host user space should/ Documentation/virtual/kvm/api.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. With the result that it gets the value the CPU has sent out on the bus as the memory transaction. Obviously if what userspace is emulating is a bus which has a byteswapping bridge or if it's being helpful to device emulation by providing here's the value even though you think you're wired up backwards then it needs to byteswap. I'm not sure that host kernel native endianness is an accurate way of describing what currently happens. Regardless of host or guest endianness, the guest should be swapping the value as necessary to ensure that the value that goes on the (real or emulated) bus is the same. I don't know why you're bringing the guest in here. Whether the guest chooses to byteswap or not is IMHO not relevant. What KVM and userspace need to combine to achieve is that whatever the guest happens to have done causes the same result it would on the real hardware. Whether the guest sends out a write of 0x12345678 because it wrote 0x12345678 directly or because it started with 0x87654321 and issued a byte-reverse instruction doesn't matter. Plus, if it were really as it would go on the bus the value wouldn't necessarily be left justified within data[], depending on how the bus works. The point is that the value in data[] is not as it would go on the bus, but the value you get out by treating it as a host-native-endianness value of the relevant size left-justified within data[] is the value as it would go on the bus. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. I think this is confusing, because now userspace authors have to figure out how to get back to value X of size Y at address Z by interpreting this text... Can you write out the equivalent of Christoffer's text here's how you get the memory transaction value for what you want? (Also, value as it would appear to who?) I think your wording implies that the order of bytes in data[] depend on the guest CPU usual byte order, ie the order which the CPU does not do a byte-lane-swap for (LE for ARM, BE for PPC), and it would mean it would come out differently from my/Alex/Christoffer's proposal if the host kernel was the opposite endianness from that usual order. Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. thanks -- PMM -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On Sat, Jan 25, 2014 at 12:24:08AM +, Peter Maydell wrote: On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: The KVM API documentation is not clear about the semantics of the data field on the mmio struct on the kvm_run struct. This has become problematic when supporting ARM guests on big-endian host systems with guests of both endianness types, because it is unclear how the data should be exported to user space. This should not break with existing implementations as all supported existing implementations of known user space applications (QEMU and kvmtools for virtio) only support default endianness of the architectures on the host side. Cc: Marc Zyngier marc.zyng...@arm.com Cc: Peter Maydell peter.mayd...@linaro.org Cc: Alexander Graf ag...@suse.de Signed-off-by: Christoffer Dall christoffer.d...@linaro.org --- Changes [v1 - v2]: - s/host kernel should/host user space should/ Documentation/virtual/kvm/api.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. With the result that it gets the value the CPU has sent out on the bus as the memory transaction. Obviously if what userspace is emulating is a bus which has a byteswapping bridge or if it's being helpful to device emulation by providing here's the value even though you think you're wired up backwards then it needs to byteswap. Yes, because KVM emulates the CPU, it's interface should emulate the external interface of the CPU, not CPU-internal bits. The value is directly usable to emulate the bus to attached devices in user space. I'm not sure that host kernel native endianness is an accurate way of describing what currently happens. Regardless of host or guest endianness, the guest should be swapping the value as necessary to ensure that the value that goes on the (real or emulated) bus is the same. I don't know why you're bringing the guest in here. Whether the guest chooses to byteswap or not is IMHO not relevant. What KVM and userspace need to combine to achieve is that whatever the guest happens to have done causes the same result it would on the real hardware. Whether the guest sends out a write of 0x12345678 because it wrote 0x12345678 directly or because it started with 0x87654321 and issued a byte-reverse instruction doesn't matter. Plus, if it were really as it would go on the bus the value wouldn't necessarily be left justified within data[], depending on how the bus works. The point is that the value in data[] is not as it would go on the bus, but the value you get out by treating it as a host-native-endianness value of the relevant size left-justified within data[] is the value as it would go on the bus. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. I think this is confusing, because now userspace authors have to figure out how to get back to value X of size Y at address Z by interpreting this text... Can you write out the equivalent of Christoffer's text here's how you get the memory transaction value for what you want? (Also, value as it would appear to who?) I think your wording implies that the order of bytes in data[] depend on the guest CPU usual byte order, ie the order which the CPU does not do a byte-lane-swap for (LE for ARM, BE for PPC), and it would mean it would come out differently from my/Alex/Christoffer's proposal if the host kernel was the opposite endianness from that usual order. Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. The as if the guest had accessed memory, would imply that user space needs to look at the settings of the CPU (in the case of ARM the E-bit) to understand which value the CPU intended for the memory
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote: On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. With the result that it gets the value the CPU has sent out on the bus as the memory transaction. Doesn't that assume the host kernel endianness is the same as the bus (or rather, that the host CPU would not swap such an access before it hits the bus)? If you take the same hardware and boot a little endian host kernel one day, and a big endian host kernel the next, the bus doesn't change, and neither should the bytewise (assuming address invariance) contents of data[]. How data[] would look when read as a larger integer would of course change -- but that's due to how you're reading it. It's clear to say that a value in memory has been stored there in host endianness when the value is as you would want to see it in a CPU register, but it's less clear when you talk about it relative to values on a bus. It's harder to correlate that to something that is software visible. I don't think there's any actual technical difference between your wording and mine when each wording is properly interpreted, but I suspect my wording is less likely to be misinterpreted (I could be wrong). Obviously if what userspace is emulating is a bus which has a byteswapping bridge or if it's being helpful to device emulation by providing here's the value even though you think you're wired up backwards then it needs to byteswap. Whether the emulated bus has a byteswapping bridge doesn't sound like something that depends on the endianness that the host CPU is currently running in. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. I think this is confusing, because now userspace authors have to figure out how to get back to value X of size Y at address Z by interpreting this text... Can you write out the equivalent of Christoffer's text here's how you get the memory transaction value for what you want? Userspace swaps the value if and only if userspace's endianness differs from the endianness with which the device interprets the data (regardless of whether said interpretation is considered natural or swapped relative to the way the bus is documented). It's similar to how userspace would handle emulating DMA. KVM swaps the value if and only if the endianness of the guest access differs from that of the host, i.e. if it would have done swapping when emulating an ordinary memory access. (Also, value as it would appear to who?) As it would appear to anyone. It works because data[] actually is memory. Any difference in how data appears based on the reader's context would already be reflected when the reader performs the load. I think your wording implies that the order of bytes in data[] depend on the guest CPU usual byte order, ie the order which the CPU does not do a byte-lane-swap for (LE for ARM, BE for PPC), and it would mean it would come out differently from my/Alex/Christoffer's proposal if the host kernel was the opposite endianness from that usual order. It doesn't depend on usual anything. The only thing it implicitly says about guest byte order is that it's KVM's job to implement any swapping if the endianness of the guest access is different from the endianness of the host kernel access (whether it's due to the guest's mode, the way a page is mapped, the instruction used, etc). Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. That's sort of the point. It defines it in a way that is independent of the CPU, and thus independent of what endianness the CPU operates in. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On Fri, 2014-01-24 at 17:56 -0800, Christoffer Dall wrote: On Sat, Jan 25, 2014 at 12:24:08AM +, Peter Maydell wrote: Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. The as if the guest had accessed memory, would imply that user space needs to look at the settings of the CPU (in the case of ARM the E-bit) to understand which value the CPU intended for the memory transaction. No, it doesn't, just as userspace doesn't need to do that to interpret DMA descriptors. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On 25.01.2014, at 02:58, Scott Wood scottw...@freescale.com wrote: On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote: On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. With the result that it gets the value the CPU has sent out on the bus as the memory transaction. Doesn't that assume the host kernel endianness is the same as the bus (or rather, that the host CPU would not swap such an access before it hits the bus)? If you take the same hardware and boot a little endian host kernel one day, and a big endian host kernel the next, the bus doesn't change, and neither should the bytewise (assuming address invariance) contents of data[]. How data[] would look when read as a larger integer would of course change -- but that's due to how you're reading it. It's clear to say that a value in memory has been stored there in host endianness when the value is as you would want to see it in a CPU register, but it's less clear when you talk about it relative to values on a bus. It's harder to correlate that to something that is software visible. I don't think there's any actual technical difference between your wording and mine when each wording is properly interpreted, but I suspect my wording is less likely to be misinterpreted (I could be wrong). Obviously if what userspace is emulating is a bus which has a byteswapping bridge or if it's being helpful to device emulation by providing here's the value even though you think you're wired up backwards then it needs to byteswap. Whether the emulated bus has a byteswapping bridge doesn't sound like something that depends on the endianness that the host CPU is currently running in. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. I think this is confusing, because now userspace authors have to figure out how to get back to value X of size Y at address Z by interpreting this text... Can you write out the equivalent of Christoffer's text here's how you get the memory transaction value for what you want? Userspace swaps the value if and only if userspace's endianness differs from the endianness with which the device interprets the data (regardless of whether said interpretation is considered natural or swapped relative to the way the bus is documented). It's similar to how userspace would handle emulating DMA. KVM swaps the value if and only if the endianness of the guest access differs from that of the host, i.e. if it would have done swapping when emulating an ordinary memory access. (Also, value as it would appear to who?) As it would appear to anyone. It works because data[] actually is memory. Any difference in how data appears based on the reader's context would already be reflected when the reader performs the load. I think your wording implies that the order of bytes in data[] depend on the guest CPU usual byte order, ie the order which the CPU does not do a byte-lane-swap for (LE for ARM, BE for PPC), and it would mean it would come out differently from my/Alex/Christoffer's proposal if the host kernel was the opposite endianness from that usual order. It doesn't depend on usual anything. The only thing it implicitly says about guest byte order is that it's KVM's job to implement any swapping if the endianness of the guest access is different from the endianness of the host kernel access (whether it's due to the guest's mode, the way a page is mapped, the instruction used, etc). Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. That's sort of the point. It defines it in a way that is independent of the CPU, and thus independent of what endianness the CPU operates in. Ok, let's go through the combinations for a
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On 25.01.2014, at 03:04, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 17:56 -0800, Christoffer Dall wrote: On Sat, Jan 25, 2014 at 12:24:08AM +, Peter Maydell wrote: Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. The as if the guest had accessed memory, would imply that user space needs to look at the settings of the CPU (in the case of ARM the E-bit) to understand which value the CPU intended for the memory transaction. No, it doesn't, just as userspace doesn't need to do that to interpret DMA descriptors. DMA descriptors are handled by specific pieces of hardware that access memory. Guests usually push data to memory in a layout so that it's natural to the device they want to talk to. MMIO is more tricky. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On Sat, Jan 25, 2014 at 03:15:35AM +0100, Alexander Graf wrote: On 25.01.2014, at 02:58, Scott Wood scottw...@freescale.com wrote: On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote: On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. With the result that it gets the value the CPU has sent out on the bus as the memory transaction. Doesn't that assume the host kernel endianness is the same as the bus (or rather, that the host CPU would not swap such an access before it hits the bus)? If you take the same hardware and boot a little endian host kernel one day, and a big endian host kernel the next, the bus doesn't change, and neither should the bytewise (assuming address invariance) contents of data[]. How data[] would look when read as a larger integer would of course change -- but that's due to how you're reading it. It's clear to say that a value in memory has been stored there in host endianness when the value is as you would want to see it in a CPU register, but it's less clear when you talk about it relative to values on a bus. It's harder to correlate that to something that is software visible. I don't think there's any actual technical difference between your wording and mine when each wording is properly interpreted, but I suspect my wording is less likely to be misinterpreted (I could be wrong). Obviously if what userspace is emulating is a bus which has a byteswapping bridge or if it's being helpful to device emulation by providing here's the value even though you think you're wired up backwards then it needs to byteswap. Whether the emulated bus has a byteswapping bridge doesn't sound like something that depends on the endianness that the host CPU is currently running in. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. I think this is confusing, because now userspace authors have to figure out how to get back to value X of size Y at address Z by interpreting this text... Can you write out the equivalent of Christoffer's text here's how you get the memory transaction value for what you want? Userspace swaps the value if and only if userspace's endianness differs from the endianness with which the device interprets the data (regardless of whether said interpretation is considered natural or swapped relative to the way the bus is documented). It's similar to how userspace would handle emulating DMA. KVM swaps the value if and only if the endianness of the guest access differs from that of the host, i.e. if it would have done swapping when emulating an ordinary memory access. (Also, value as it would appear to who?) As it would appear to anyone. It works because data[] actually is memory. Any difference in how data appears based on the reader's context would already be reflected when the reader performs the load. I think your wording implies that the order of bytes in data[] depend on the guest CPU usual byte order, ie the order which the CPU does not do a byte-lane-swap for (LE for ARM, BE for PPC), and it would mean it would come out differently from my/Alex/Christoffer's proposal if the host kernel was the opposite endianness from that usual order. It doesn't depend on usual anything. The only thing it implicitly says about guest byte order is that it's KVM's job to implement any swapping if the endianness of the guest access is different from the endianness of the host kernel access (whether it's due to the guest's mode, the way a page is mapped, the instruction used, etc). Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. That's sort of the
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 18:15, Alexander Graf ag...@suse.de wrote: On 25.01.2014, at 02:58, Scott Wood scottw...@freescale.com wrote: On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote: On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. With the result that it gets the value the CPU has sent out on the bus as the memory transaction. Doesn't that assume the host kernel endianness is the same as the bus (or rather, that the host CPU would not swap such an access before it hits the bus)? If you take the same hardware and boot a little endian host kernel one day, and a big endian host kernel the next, the bus doesn't change, and neither should the bytewise (assuming address invariance) contents of data[]. How data[] would look when read as a larger integer would of course change -- but that's due to how you're reading it. It's clear to say that a value in memory has been stored there in host endianness when the value is as you would want to see it in a CPU register, but it's less clear when you talk about it relative to values on a bus. It's harder to correlate that to something that is software visible. I don't think there's any actual technical difference between your wording and mine when each wording is properly interpreted, but I suspect my wording is less likely to be misinterpreted (I could be wrong). Obviously if what userspace is emulating is a bus which has a byteswapping bridge or if it's being helpful to device emulation by providing here's the value even though you think you're wired up backwards then it needs to byteswap. Whether the emulated bus has a byteswapping bridge doesn't sound like something that depends on the endianness that the host CPU is currently running in. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. I think this is confusing, because now userspace authors have to figure out how to get back to value X of size Y at address Z by interpreting this text... Can you write out the equivalent of Christoffer's text here's how you get the memory transaction value for what you want? Userspace swaps the value if and only if userspace's endianness differs from the endianness with which the device interprets the data (regardless of whether said interpretation is considered natural or swapped relative to the way the bus is documented). It's similar to how userspace would handle emulating DMA. KVM swaps the value if and only if the endianness of the guest access differs from that of the host, i.e. if it would have done swapping when emulating an ordinary memory access. (Also, value as it would appear to who?) As it would appear to anyone. It works because data[] actually is memory. Any difference in how data appears based on the reader's context would already be reflected when the reader performs the load. I think your wording implies that the order of bytes in data[] depend on the guest CPU usual byte order, ie the order which the CPU does not do a byte-lane-swap for (LE for ARM, BE for PPC), and it would mean it would come out differently from my/Alex/Christoffer's proposal if the host kernel was the opposite endianness from that usual order. It doesn't depend on usual anything. The only thing it implicitly says about guest byte order is that it's KVM's job to implement any swapping if the endianness of the guest access is different from the endianness of the host kernel access (whether it's due to the guest's mode, the way a page is mapped, the instruction used, etc). Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. That's sort of the point. It defines it in a way that is independent of the CPU, and thus independent of what endianness the CPU operates in. Ok,
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
Il 24/01/2014 01:01, Peter Maydell ha scritto: +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host kernel should always be able to do: +type val = *((type *)mmio.data). I think this would be better phrased as The host userspace should always, since this documentation is supposed to be telling userspace what the kernel's contract with it is, not the kernel keeping notes for itself on its own implementation. (It also clarifies what the intention is for the obscure and maybe-we'll-never-implement-this case of an LE host kernel using a compatibility interface to run the host userspace (QEMU) as a BE process which sees the same ABI a BE kernel provides, without actually dragging that red herring explicitly into the documentation.) I agree, and also the first line should mention userspace. In PPC I think it's possible or even common to have BE host kernel and LE host userspace (or perhaps vice versa is the common one). Paolo -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote: Il 24/01/2014 01:01, Peter Maydell ha scritto: +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host kernel should always be able to do: +type val = *((type *)mmio.data). I think this would be better phrased as The host userspace should always, since this documentation is supposed to be telling userspace what the kernel's contract with it is, not the kernel keeping notes for itself on its own implementation. (It also clarifies what the intention is for the obscure and maybe-we'll-never-implement-this case of an LE host kernel using a compatibility interface to run the host userspace (QEMU) as a BE process which sees the same ABI a BE kernel provides, without actually dragging that red herring explicitly into the documentation.) I agree, and also the first line should mention userspace. In PPC I think it's possible or even common to have BE host kernel and LE host userspace (or perhaps vice versa is the common one). It was possible on 32bit, but I'm not sure anyone's actively using it :). The thing that was very common (not so much anymore for enterprise distros) is 32-bit user space with 64-bit kernels. Alex -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote: On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote: Il 24/01/2014 01:01, Peter Maydell ha scritto: +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. Also if you use ints on real bus as description, you may want to clarify restrictions on mmio.len. Basically on 32 bit platform (i.e like V7 ARM) one cannot have mmio.len=8, because one cannot have 64bit value on 32bit data bus. Without such clarification introduction of text like the value as it would go on the bus in real hardware is confusing for len=8 for emulated CPUs where real busses are 32bit. If ldrd/strd would be emulated on ARMV7 one would need to use mmio twice to pass required data in either direction using len=4 .. Thanks, Victor The host kernel should always be able to do: +type val = *((type *)mmio.data). I think this would be better phrased as The host userspace should always, since this documentation is supposed to be telling userspace what the kernel's contract with it is, not the kernel keeping notes for itself on its own implementation. (It also clarifies what the intention is for the obscure and maybe-we'll-never-implement-this case of an LE host kernel using a compatibility interface to run the host userspace (QEMU) as a BE process which sees the same ABI a BE kernel provides, without actually dragging that red herring explicitly into the documentation.) I agree, and also the first line should mention userspace. In PPC I think it's possible or even common to have BE host kernel and LE host userspace (or perhaps vice versa is the common one). It was possible on 32bit, but I'm not sure anyone's actively using it :). The thing that was very common (not so much anymore for enterprise distros) is 32-bit user space with 64-bit kernels. Alex ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
Il 24/01/2014 16:23, Victor Kamensky ha scritto: Also if you use ints on real bus as description, you may want to clarify restrictions on mmio.len. Basically on 32 bit platform (i.e like V7 ARM) one cannot have mmio.len=8, because one cannot have 64bit value on 32bit data bus. Without such clarification introduction of text like the value as it would go on the bus in real hardware is confusing for len=8 for emulated CPUs where real busses are 32bit. This is not necessarily true. On a 32-bit CPU you can have a 64-bit memory bus. Even x86 32-bit CPUs can do 64-bit MMIO via MMX or SSE or double-word compare-and-swap (CMPXCHG8B). Paolo -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 07:32, Paolo Bonzini pbonz...@redhat.com wrote: Il 24/01/2014 16:23, Victor Kamensky ha scritto: Also if you use ints on real bus as description, you may want to clarify restrictions on mmio.len. Basically on 32 bit platform (i.e like V7 ARM) one cannot have mmio.len=8, because one cannot have 64bit value on 32bit data bus. Without such clarification introduction of text like the value as it would go on the bus in real hardware is confusing for len=8 for emulated CPUs where real busses are 32bit. This is not necessarily true. On a 32-bit CPU you can have a 64-bit memory bus. Even x86 32-bit CPUs can do 64-bit MMIO via MMX or SSE or double-word compare-and-swap (CMPXCHG8B). Sure, that was part of my point :). But now text says about real hardware buses and in any given moment emulator specify particular type of CPU emulated, and I am quite sure that we can find one emulated ARMV7 CPU that would just have real 32bit data bus. Note there was no such problem and nobody cares what real data buses width are, with definition I argued for. That I think that was original intent of '__u8 data[8]' use - just bytes array description of how memory at given phys_addr looks before (mmiois_write=0) or would look after (mmio.is_write = 1) requested memory transaction (or several of them for that matter) are emulated by KVM_EXIT_MMIO. When if comes to devices memory read/write it is logical view of memory of course. Such definition works with any real data buses sizes, starting with byte. But, ooh, well ... nobody understands such definition .. I stand failed to explain it clear enough. Thanks, Victor Paolo -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 07:23, Victor Kamensky victor.kamen...@linaro.org wrote: On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote: On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote: Il 24/01/2014 01:01, Peter Maydell ha scritto: +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. Other thing with mmio.len need clarification: please specify what mmio.len values are allowed. It seems that it was implied that len could be only power of 2 in range between 1 and 8. I would rather see that it is spelled out. I think code on all sides of KVM_EXIT_MMIO already make such assumption. Now when you talk about integers values on bus in real hardware power of 2 sizes implied Also note for memory pieces with sizes other than 2, 4, 8 endianness is not defined. Also if you use ints on real bus as description, you may want to clarify restrictions on mmio.len. Basically on 32 bit platform (i.e like V7 ARM) one cannot have mmio.len=8, because one cannot have 64bit value on 32bit data bus. Without such clarification introduction of text like the value as it would go on the bus in real hardware is confusing for len=8 for emulated CPUs where real busses are 32bit. If ldrd/strd would be emulated on ARMV7 one would need to use mmio twice to pass required data in either direction using len=4 .. Thanks, Victor The host kernel should always be able to do: +type val = *((type *)mmio.data). would it be prudent to say that it is just integer type here. Thanks, Victor I think this would be better phrased as The host userspace should always, since this documentation is supposed to be telling userspace what the kernel's contract with it is, not the kernel keeping notes for itself on its own implementation. (It also clarifies what the intention is for the obscure and maybe-we'll-never-implement-this case of an LE host kernel using a compatibility interface to run the host userspace (QEMU) as a BE process which sees the same ABI a BE kernel provides, without actually dragging that red herring explicitly into the documentation.) I agree, and also the first line should mention userspace. In PPC I think it's possible or even common to have BE host kernel and LE host userspace (or perhaps vice versa is the common one). It was possible on 32bit, but I'm not sure anyone's actively using it :). The thing that was very common (not so much anymore for enterprise distros) is 32-bit user space with 64-bit kernels. Alex ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote: On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote: Il 24/01/2014 01:01, Peter Maydell ha scritto: +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host kernel should always be able to do: +type val = *((type *)mmio.data). I think this would be better phrased as The host userspace should always, since this documentation is supposed to be telling userspace what the kernel's contract with it is, not the kernel keeping notes for itself on its own implementation. (It also clarifies what the intention is for the obscure and maybe-we'll-never-implement-this case of an LE host kernel using a compatibility interface to run the host userspace (QEMU) as a BE process which sees the same ABI a BE kernel provides, without actually dragging that red herring explicitly into the documentation.) I agree, and also the first line should mention userspace. In PPC I think it's possible or even common to have BE host kernel and LE host userspace (or perhaps vice versa is the common one). It was possible on 32bit, but I'm not sure anyone's actively using it :). The thing that was very common (not so much anymore for enterprise distros) is 32-bit user space with 64-bit kernels. Paolo, Alex, good point about BE kernel / LE user-land mix! How KVM kernel code that deals with KVM_MMIO_EXIT can find out what is user process endianity that handles this KVM_MMIO_EXIT? Do we have kernel function that can tell that. Hey, what is current user-land task endianity? :). And reverse case, how user-land code that wants to do KVM_MMIO_EXIT can find out what is endianity of kernel? Do we have such system call? Hey, kernel tell me your endianity :). Just a thought: should not we instead of trying to implicitly setup endianity by some other side properties like emulator or kernel endianity, let's just do it explicitly. Adding 'endianness' field into current mmio structure is not an option, but maybe there is outside mechanism like KVM features, special ioctl number, through which one can explicitly either set of learn endianity in mmio.data[] for given KVM session. At least, if we don't want to consider mixed BE/LE kernel/user-land, then document should clarify that BE/LE kernel/user-land mix is not supported and assumption here is that they always coincide. Thanks, Victor Alex ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: The KVM API documentation is not clear about the semantics of the data field on the mmio struct on the kvm_run struct. This has become problematic when supporting ARM guests on big-endian host systems with guests of both endianness types, because it is unclear how the data should be exported to user space. This should not break with existing implementations as all supported existing implementations of known user space applications (QEMU and kvmtools for virtio) only support default endianness of the architectures on the host side. Cc: Marc Zyngier marc.zyng...@arm.com Cc: Peter Maydell peter.mayd...@linaro.org Cc: Alexander Graf ag...@suse.de Signed-off-by: Christoffer Dall christoffer.d...@linaro.org --- Changes [v1 - v2]: - s/host kernel should/host user space should/ Documentation/virtual/kvm/api.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. I'm not sure that host kernel native endianness is an accurate way of describing what currently happens. Regardless of host or guest endianness, the guest should be swapping the value as necessary to ensure that the value that goes on the (real or emulated) bus is the same. Plus, if it were really as it would go on the bus the value wouldn't necessarily be left justified within data[], depending on how the bus works. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 15:51, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: The KVM API documentation is not clear about the semantics of the data field on the mmio struct on the kvm_run struct. This has become problematic when supporting ARM guests on big-endian host systems with guests of both endianness types, because it is unclear how the data should be exported to user space. This should not break with existing implementations as all supported existing implementations of known user space applications (QEMU and kvmtools for virtio) only support default endianness of the architectures on the host side. Cc: Marc Zyngier marc.zyng...@arm.com Cc: Peter Maydell peter.mayd...@linaro.org Cc: Alexander Graf ag...@suse.de Signed-off-by: Christoffer Dall christoffer.d...@linaro.org --- Changes [v1 - v2]: - s/host kernel should/host user space should/ Documentation/virtual/kvm/api.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. I'm not sure that host kernel native endianness is an accurate way of describing what currently happens. Regardless of host or guest endianness, the guest should be swapping the value as necessary to ensure that the value that goes on the (real or emulated) bus is the same. Plus, if it were really as it would go on the bus the value wouldn't necessarily be left justified within data[], depending on how the bus works. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. Scott, Yes! Thank you so much! Completely agree. You managed elegantly to express in two lines, what I failed to communicate with hundreds lines of my emails in last few days on this and other email thread. - Victor -Scott ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: The KVM API documentation is not clear about the semantics of the data field on the mmio struct on the kvm_run struct. This has become problematic when supporting ARM guests on big-endian host systems with guests of both endianness types, because it is unclear how the data should be exported to user space. This should not break with existing implementations as all supported existing implementations of known user space applications (QEMU and kvmtools for virtio) only support default endianness of the architectures on the host side. Cc: Marc Zyngier marc.zyng...@arm.com Cc: Peter Maydell peter.mayd...@linaro.org Cc: Alexander Graf ag...@suse.de Signed-off-by: Christoffer Dall christoffer.d...@linaro.org --- Changes [v1 - v2]: - s/host kernel should/host user space should/ Documentation/virtual/kvm/api.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. With the result that it gets the value the CPU has sent out on the bus as the memory transaction. Obviously if what userspace is emulating is a bus which has a byteswapping bridge or if it's being helpful to device emulation by providing here's the value even though you think you're wired up backwards then it needs to byteswap. I'm not sure that host kernel native endianness is an accurate way of describing what currently happens. Regardless of host or guest endianness, the guest should be swapping the value as necessary to ensure that the value that goes on the (real or emulated) bus is the same. I don't know why you're bringing the guest in here. Whether the guest chooses to byteswap or not is IMHO not relevant. What KVM and userspace need to combine to achieve is that whatever the guest happens to have done causes the same result it would on the real hardware. Whether the guest sends out a write of 0x12345678 because it wrote 0x12345678 directly or because it started with 0x87654321 and issued a byte-reverse instruction doesn't matter. Plus, if it were really as it would go on the bus the value wouldn't necessarily be left justified within data[], depending on how the bus works. The point is that the value in data[] is not as it would go on the bus, but the value you get out by treating it as a host-native-endianness value of the relevant size left-justified within data[] is the value as it would go on the bus. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. I think this is confusing, because now userspace authors have to figure out how to get back to value X of size Y at address Z by interpreting this text... Can you write out the equivalent of Christoffer's text here's how you get the memory transaction value for what you want? (Also, value as it would appear to who?) I think your wording implies that the order of bytes in data[] depend on the guest CPU usual byte order, ie the order which the CPU does not do a byte-lane-swap for (LE for ARM, BE for PPC), and it would mean it would come out differently from my/Alex/Christoffer's proposal if the host kernel was the opposite endianness from that usual order. Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. thanks -- PMM -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote: On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. With the result that it gets the value the CPU has sent out on the bus as the memory transaction. Doesn't that assume the host kernel endianness is the same as the bus (or rather, that the host CPU would not swap such an access before it hits the bus)? If you take the same hardware and boot a little endian host kernel one day, and a big endian host kernel the next, the bus doesn't change, and neither should the bytewise (assuming address invariance) contents of data[]. How data[] would look when read as a larger integer would of course change -- but that's due to how you're reading it. It's clear to say that a value in memory has been stored there in host endianness when the value is as you would want to see it in a CPU register, but it's less clear when you talk about it relative to values on a bus. It's harder to correlate that to something that is software visible. I don't think there's any actual technical difference between your wording and mine when each wording is properly interpreted, but I suspect my wording is less likely to be misinterpreted (I could be wrong). Obviously if what userspace is emulating is a bus which has a byteswapping bridge or if it's being helpful to device emulation by providing here's the value even though you think you're wired up backwards then it needs to byteswap. Whether the emulated bus has a byteswapping bridge doesn't sound like something that depends on the endianness that the host CPU is currently running in. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. I think this is confusing, because now userspace authors have to figure out how to get back to value X of size Y at address Z by interpreting this text... Can you write out the equivalent of Christoffer's text here's how you get the memory transaction value for what you want? Userspace swaps the value if and only if userspace's endianness differs from the endianness with which the device interprets the data (regardless of whether said interpretation is considered natural or swapped relative to the way the bus is documented). It's similar to how userspace would handle emulating DMA. KVM swaps the value if and only if the endianness of the guest access differs from that of the host, i.e. if it would have done swapping when emulating an ordinary memory access. (Also, value as it would appear to who?) As it would appear to anyone. It works because data[] actually is memory. Any difference in how data appears based on the reader's context would already be reflected when the reader performs the load. I think your wording implies that the order of bytes in data[] depend on the guest CPU usual byte order, ie the order which the CPU does not do a byte-lane-swap for (LE for ARM, BE for PPC), and it would mean it would come out differently from my/Alex/Christoffer's proposal if the host kernel was the opposite endianness from that usual order. It doesn't depend on usual anything. The only thing it implicitly says about guest byte order is that it's KVM's job to implement any swapping if the endianness of the guest access is different from the endianness of the host kernel access (whether it's due to the guest's mode, the way a page is mapped, the instruction used, etc). Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. That's sort of the point. It defines it in a way that is independent of the CPU, and thus independent of what endianness the CPU operates in. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On 25.01.2014, at 03:04, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 17:56 -0800, Christoffer Dall wrote: On Sat, Jan 25, 2014 at 12:24:08AM +, Peter Maydell wrote: Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. The as if the guest had accessed memory, would imply that user space needs to look at the settings of the CPU (in the case of ARM the E-bit) to understand which value the CPU intended for the memory transaction. No, it doesn't, just as userspace doesn't need to do that to interpret DMA descriptors. DMA descriptors are handled by specific pieces of hardware that access memory. Guests usually push data to memory in a layout so that it's natural to the device they want to talk to. MMIO is more tricky. Alex -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO
On Sat, Jan 25, 2014 at 03:15:35AM +0100, Alexander Graf wrote: On 25.01.2014, at 02:58, Scott Wood scottw...@freescale.com wrote: On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote: On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote: On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote: diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 366bf4b..6dbd68c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not be satisfied by kvm. The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise. +The 'data' member byte order is host kernel native endianness, regardless of +the endianness of the guest, and represents the the value as it would go on the +bus in real hardware. The host user space should always be able to do: +type val = *((type *)mmio.data). Host userspace should be able to do that with what results? It would only produce a directly usable value if host endianness is the same as the emulated device's endianness. With the result that it gets the value the CPU has sent out on the bus as the memory transaction. Doesn't that assume the host kernel endianness is the same as the bus (or rather, that the host CPU would not swap such an access before it hits the bus)? If you take the same hardware and boot a little endian host kernel one day, and a big endian host kernel the next, the bus doesn't change, and neither should the bytewise (assuming address invariance) contents of data[]. How data[] would look when read as a larger integer would of course change -- but that's due to how you're reading it. It's clear to say that a value in memory has been stored there in host endianness when the value is as you would want to see it in a CPU register, but it's less clear when you talk about it relative to values on a bus. It's harder to correlate that to something that is software visible. I don't think there's any actual technical difference between your wording and mine when each wording is properly interpreted, but I suspect my wording is less likely to be misinterpreted (I could be wrong). Obviously if what userspace is emulating is a bus which has a byteswapping bridge or if it's being helpful to device emulation by providing here's the value even though you think you're wired up backwards then it needs to byteswap. Whether the emulated bus has a byteswapping bridge doesn't sound like something that depends on the endianness that the host CPU is currently running in. How about a wording like this: The 'data' member contains, in its first 'len' bytes, the value as it would appear if the guest had accessed memory rather than I/O. I think this is confusing, because now userspace authors have to figure out how to get back to value X of size Y at address Z by interpreting this text... Can you write out the equivalent of Christoffer's text here's how you get the memory transaction value for what you want? Userspace swaps the value if and only if userspace's endianness differs from the endianness with which the device interprets the data (regardless of whether said interpretation is considered natural or swapped relative to the way the bus is documented). It's similar to how userspace would handle emulating DMA. KVM swaps the value if and only if the endianness of the guest access differs from that of the host, i.e. if it would have done swapping when emulating an ordinary memory access. (Also, value as it would appear to who?) As it would appear to anyone. It works because data[] actually is memory. Any difference in how data appears based on the reader's context would already be reflected when the reader performs the load. I think your wording implies that the order of bytes in data[] depend on the guest CPU usual byte order, ie the order which the CPU does not do a byte-lane-swap for (LE for ARM, BE for PPC), and it would mean it would come out differently from my/Alex/Christoffer's proposal if the host kernel was the opposite endianness from that usual order. It doesn't depend on usual anything. The only thing it implicitly says about guest byte order is that it's KVM's job to implement any swapping if the endianness of the guest access is different from the endianness of the host kernel access (whether it's due to the guest's mode, the way a page is mapped, the instruction used, etc). Finally, I think it's a bit confusing in that as if the guest had accessed memory is assigning implicit semantics to memory in the emulated system, when memory is actually kind of outside KVM's purview because it's not part of the CPU. That's sort of the