Re: 3.13 hangs when I tried to start a KVM at a 32 bit stable Gentoo

2014-01-24 Thread Paolo Bonzini

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Il 23/01/2014 20:50, Toralf Förster ha scritto:
| What makes the situation really annyoing - sometimes I just can
| restart my wlan device it the system works normal, but sometimes
| the whole system hangs and for those cases then sometimes not even
| sysrq buttons do work.

Can you reproduce it with the wlan driver disabled completely?

Paolo
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.22 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJS4l5cAAoJEBvWZb6bTYbyC0AP/jrHp8WUqDjgv/1kJYcYb0+0
cSM8jqcutYCcMAKXTsKGgQj4gWdw7q/bEo4lWAEX8fQqxOfFZjBZjhTuA66XF010
fi+nOCEwjafMUeflPZaSZr34KGSix1FTtmcdzz55Kzkh+A4hZJj2NgaAYr8KFsWW
0hCoW7+MHWTRsNLIZk1ms2H/lItz19gnO9CkNYWqSIxBtx1/le5v8LNi20OkHKCz
mdemG6DSwj9sE+jfezDjZ7jV/a+RBv4Pi0v71YPEUEa7HCiyKxddVAgD0KrvMwzF
2A7H4+rTwsJLnWTKFyMdoeUUz6iC5Rntq+B4ltHVXKjN2O5bG0aqmG+cURbvXrFF
y9EzChdaEikPlYJHp/ci2NOB8NJlB6oEb9uO3k2kRITdWsr3iK5gCZE3fu99ziz4
JmNPUpFGTy5dnQVI/edsuZDyafRYVq3U4CpP+drnWxbic4aH7nCTv2iR5Op4pnhd
LQlmt0MkZAyF9+6XFNBAAfPnfV0YPBHydCxhC2UaOmqqNXiYILW65dCD5aQiXq1y
lQi2dc6i7MYbVmc28I45LV/BFD4zn622c5ges0gKw2bshgzhQm12jWHJ+dMsLDp/
qpFFvaGoqWCNJZo9FCH6vaTiN5KNUyiUOZjvKVa3u959xC9BK8nh3vTWHxinK2Yt
Do62ZMYWJbljdgTvMN7d
=cfJ7
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: compiling with kvm-kmod

2014-01-24 Thread Jonas Pfoh
Thanks for the quick fix, I found that the modules build, but I still have a 
problem loading the module due to the fact that kvm_vfio_ops remains undefined 
in the kvm module.  Below is the patch that I used to fix the issue if it is 
helpful.

-Jonas



diff --git a/x86/Kbuild b/x86/Kbuild
index d75b756..637b3b1 100644
--- a/x86/Kbuild
+++ b/x86/Kbuild
@@ -1,7 +1,7 @@
 obj-m := kvm.o kvm-intel.o kvm-amd.o
 kvm-objs := kvm_main.o x86.o mmu.o emulate.o irq.o i8259.o pmu.o \
 lapic.o ioapic.o preempt.o i8254.o coalesced_mmio.o irq_comm.o \
-eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o \
+eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o vfio.o\
 ../external-module-compat.o
 ifeq ($(CONFIG_IOMMU_API)$(CONFIG_PCI),yy)
 kvm-objs += assigned-dev.o iommu.o

On 23.01.2014 19:19, Jan Kiszka wrote:
 On 2014-01-23 17:34, Jonas Pfoh wrote:
 Hello,

 I am currently working on a project involving KVM and have been making use 
 Jan's kvm-kmod repository.  I receive the below error when I attempt to 
 compile with the most recent version.  My question is simply if this is 
 something anyone is aware of or has any suggestions for before I go poking 
 around?  

 I am compiling against a 3.13.0 mainline vanilla kernel and am using the 
 master branch (3d923a3) of Jan's kvm-kmod repo which seems to be syncing 
 from kvm commit 7650b68.

 Thanks and regards,
 Jonas Pfoh

 make -C /lib/modules/3.13.0/build M=`pwd` \
 LINUXINCLUDE=-I`pwd`/include -I`pwd`/include/uapi -Iinclude 
 \
  -Iinclude/uapi -Iarch/x86/include 
 -Iarch/x86/include/uapi \
 -Iinclude/generated/uapi 
 -Iarch/x86/include/generated \
 -Iarch/x86/include/generated/uapi \
 -I`pwd`/include-compat -I`pwd`/x86 \
 -include  include/generated/autoconf.h \
 -include `pwd`/x86/external-module-compat.h \
 $@
 make[1]: Entering directory `/usr/src/linux-3.13'
   CC [M]  /local/repos/kvm-kmod/x86/svm.o
 In file included from include/linux/device.h:29:0,
  from include/linux/node.h:17,
  from include/linux/cpu.h:16,
  from 
 /local/repos/kvm-kmod/x86/../external-module-compat-comm.h:15,
  from /local/repos/kvm-kmod/x86/external-module-compat.h:45,
  from command-line:0:
 include/linux/gfp.h: In function ‘gfp_zonelist’:
 include/linux/gfp.h:272:2: error: implicit declaration of function 
 ‘IS_ENABLED’ [-Werror=implicit-function-declaration]
   if (IS_ENABLED(CONFIG_NUMA)  unlikely(flags  __GFP_THISNODE))
   ^
 cc1: some warnings being treated as errors
 make[3]: *** [/local/repos/kvm-kmod/x86/svm.o] Error 1
 make[2]: *** [/local/repos/kvm-kmod/x86] Error 2
 make[1]: *** [_module_/local/repos/kvm-kmod] Error 2
 make[1]: Leaving directory `/usr/src/linux-3.13'
 make: *** [all] Error 2
 
 That's likely a kernel issue: not all required headers are pulled by gfp.h.
 
 It's worked around now with 2b06046.
 
 Thanks,
 Jan
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: compiling with kvm-kmod

2014-01-24 Thread Jan Kiszka
On 2014-01-24 13:39, Jonas Pfoh wrote:
 Thanks for the quick fix, I found that the modules build, but I still have a 
 problem loading the module due to the fact that kvm_vfio_ops remains 
 undefined in the kvm module.  Below is the patch that I used to fix the issue 
 if it is helpful.

Yes, indeed - can you send it as proper patch (subject, brief reason,
signed-off)? Then I could merge directly.

Thanks,
Jan

 
 -Jonas
 
 
 
 diff --git a/x86/Kbuild b/x86/Kbuild
 index d75b756..637b3b1 100644
 --- a/x86/Kbuild
 +++ b/x86/Kbuild
 @@ -1,7 +1,7 @@
  obj-m := kvm.o kvm-intel.o kvm-amd.o
  kvm-objs := kvm_main.o x86.o mmu.o emulate.o irq.o i8259.o pmu.o \
  lapic.o ioapic.o preempt.o i8254.o coalesced_mmio.o irq_comm.o \
 -eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o \
 +eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o vfio.o\
  ../external-module-compat.o
  ifeq ($(CONFIG_IOMMU_API)$(CONFIG_PCI),yy)
  kvm-objs += assigned-dev.o iommu.o
 
 On 23.01.2014 19:19, Jan Kiszka wrote:
 On 2014-01-23 17:34, Jonas Pfoh wrote:
 Hello,

 I am currently working on a project involving KVM and have been making use 
 Jan's kvm-kmod repository.  I receive the below error when I attempt to 
 compile with the most recent version.  My question is simply if this is 
 something anyone is aware of or has any suggestions for before I go poking 
 around?  

 I am compiling against a 3.13.0 mainline vanilla kernel and am using the 
 master branch (3d923a3) of Jan's kvm-kmod repo which seems to be syncing 
 from kvm commit 7650b68.

 Thanks and regards,
 Jonas Pfoh

 make -C /lib/modules/3.13.0/build M=`pwd` \
 LINUXINCLUDE=-I`pwd`/include -I`pwd`/include/uapi 
 -Iinclude \
  -Iinclude/uapi -Iarch/x86/include 
 -Iarch/x86/include/uapi \
 -Iinclude/generated/uapi 
 -Iarch/x86/include/generated \
 -Iarch/x86/include/generated/uapi \
 -I`pwd`/include-compat -I`pwd`/x86 \
 -include  include/generated/autoconf.h \
 -include `pwd`/x86/external-module-compat.h \
 $@
 make[1]: Entering directory `/usr/src/linux-3.13'
   CC [M]  /local/repos/kvm-kmod/x86/svm.o
 In file included from include/linux/device.h:29:0,
  from include/linux/node.h:17,
  from include/linux/cpu.h:16,
  from 
 /local/repos/kvm-kmod/x86/../external-module-compat-comm.h:15,
  from /local/repos/kvm-kmod/x86/external-module-compat.h:45,
  from command-line:0:
 include/linux/gfp.h: In function ‘gfp_zonelist’:
 include/linux/gfp.h:272:2: error: implicit declaration of function 
 ‘IS_ENABLED’ [-Werror=implicit-function-declaration]
   if (IS_ENABLED(CONFIG_NUMA)  unlikely(flags  __GFP_THISNODE))
   ^
 cc1: some warnings being treated as errors
 make[3]: *** [/local/repos/kvm-kmod/x86/svm.o] Error 1
 make[2]: *** [/local/repos/kvm-kmod/x86] Error 2
 make[1]: *** [_module_/local/repos/kvm-kmod] Error 2
 make[1]: Leaving directory `/usr/src/linux-3.13'
 make: *** [all] Error 2

 That's likely a kernel issue: not all required headers are pulled by gfp.h.

 It's worked around now with 2b06046.

 Thanks,
 Jan


-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Paolo Bonzini

Il 24/01/2014 01:01, Peter Maydell ha scritto:


 +The 'data' member byte order is host kernel native endianness, regardless of
 +the endianness of the guest, and represents the the value as it would go on 
the
 +bus in real hardware.  The host kernel should always be able to do:
 +type val = *((type *)mmio.data).

I think this would be better phrased as The host userspace should always,
since this documentation is supposed to be telling userspace what the
kernel's contract with it is, not the kernel keeping notes for itself on
its own implementation. (It also clarifies what the intention is for the
obscure and maybe-we'll-never-implement-this case of an LE host
kernel using a compatibility interface to run the host userspace (QEMU)
as a BE process which sees the same ABI a BE kernel provides,
without actually dragging that red herring explicitly into the documentation.)


I agree, and also the first line should mention userspace.

In PPC I think it's possible or even common to have BE host kernel and 
LE host userspace (or perhaps vice versa is the common one).


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Alexander Graf

On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 24/01/2014 01:01, Peter Maydell ha scritto:
 
  +The 'data' member byte order is host kernel native endianness, 
  regardless of
  +the endianness of the guest, and represents the the value as it would go 
  on the
  +bus in real hardware.  The host kernel should always be able to do:
  +type val = *((type *)mmio.data).
 I think this would be better phrased as The host userspace should always,
 since this documentation is supposed to be telling userspace what the
 kernel's contract with it is, not the kernel keeping notes for itself on
 its own implementation. (It also clarifies what the intention is for the
 obscure and maybe-we'll-never-implement-this case of an LE host
 kernel using a compatibility interface to run the host userspace (QEMU)
 as a BE process which sees the same ABI a BE kernel provides,
 without actually dragging that red herring explicitly into the 
 documentation.)
 
 I agree, and also the first line should mention userspace.
 
 In PPC I think it's possible or even common to have BE host kernel and LE 
 host userspace (or perhaps vice versa is the common one).

It was possible on 32bit, but I'm not sure anyone's actively using it :). The 
thing that was very common (not so much anymore for enterprise distros) is 
32-bit user space with 64-bit kernels.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-kmod] fix undefined kvm_vfio_ops

2014-01-24 Thread Jonas Pfoh
vfio.c is being pulled in from the submodule, but it is not being 
compiled/linked by the build environment, causing a Unknown symbol in module 
error when loading the resulting module.

Adding vfio.o to the kvm-objs variable in x86/Kbuild fixes this issue

Signed-off-by: Jonas Pfoh p...@sec.in.tum.de
---

diff --git a/x86/Kbuild b/x86/Kbuild
index d75b756..637b3b1 100644
--- a/x86/Kbuild
+++ b/x86/Kbuild
@@ -1,7 +1,7 @@
 obj-m := kvm.o kvm-intel.o kvm-amd.o
 kvm-objs := kvm_main.o x86.o mmu.o emulate.o irq.o i8259.o pmu.o \
 lapic.o ioapic.o preempt.o i8254.o coalesced_mmio.o irq_comm.o \
-eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o \
+eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o vfio.o\
 ../external-module-compat.o
 ifeq ($(CONFIG_IOMMU_API)$(CONFIG_PCI),yy)
 kvm-objs += assigned-dev.o iommu.o
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH kvm-kmod] fix undefined kvm_vfio_ops

2014-01-24 Thread Jan Kiszka
On 2014-01-24 15:29, Jonas Pfoh wrote:
 vfio.c is being pulled in from the submodule, but it is not being 
 compiled/linked by the build environment, causing a Unknown symbol in 
 module error when loading the resulting module.
 
 Adding vfio.o to the kvm-objs variable in x86/Kbuild fixes this issue
 
 Signed-off-by: Jonas Pfoh p...@sec.in.tum.de
 ---
 
 diff --git a/x86/Kbuild b/x86/Kbuild
 index d75b756..637b3b1 100644
 --- a/x86/Kbuild
 +++ b/x86/Kbuild
 @@ -1,7 +1,7 @@
  obj-m := kvm.o kvm-intel.o kvm-amd.o
  kvm-objs := kvm_main.o x86.o mmu.o emulate.o irq.o i8259.o pmu.o \
  lapic.o ioapic.o preempt.o i8254.o coalesced_mmio.o irq_comm.o \
 -eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o \
 +eventfd.o compat-x86.o async_pf.o cpuid.o irqchip.o vfio.o\
  ../external-module-compat.o
  ifeq ($(CONFIG_IOMMU_API)$(CONFIG_PCI),yy)
  kvm-objs += assigned-dev.o iommu.o
 

Format was almost perfect format - just your mailer mangled whitespaces.
Make sure to either use git send-email then or to switch of line-wrapper
etc. in the mail client when sending patches - to whichever project.

Unfortunately it broke the build for 3.11 and older kernels. I merged a
variant to next that passed build tests here.

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Victor Kamensky
On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote:

 On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 24/01/2014 01:01, Peter Maydell ha scritto:
 
  +The 'data' member byte order is host kernel native endianness, 
  regardless of
  +the endianness of the guest, and represents the the value as it would 
  go on the
  +bus in real hardware.

Also if you use ints on real bus as description, you may want to clarify
restrictions on mmio.len. Basically on 32 bit platform (i.e like V7
ARM) one cannot have mmio.len=8, because one cannot have 64bit
value on 32bit data bus. Without such clarification introduction of
text like the value as it would go on the bus in real hardware is
confusing for len=8 for emulated CPUs where real busses are
32bit.

If ldrd/strd would be emulated on ARMV7 one would need to use
mmio twice to pass required data in either direction using len=4 ..

Thanks,
Victor

 The host kernel should always be able to do:
  +type val = *((type *)mmio.data).
 I think this would be better phrased as The host userspace should always,
 since this documentation is supposed to be telling userspace what the
 kernel's contract with it is, not the kernel keeping notes for itself on
 its own implementation. (It also clarifies what the intention is for the
 obscure and maybe-we'll-never-implement-this case of an LE host
 kernel using a compatibility interface to run the host userspace (QEMU)
 as a BE process which sees the same ABI a BE kernel provides,
 without actually dragging that red herring explicitly into the 
 documentation.)

 I agree, and also the first line should mention userspace.

 In PPC I think it's possible or even common to have BE host kernel and LE 
 host userspace (or perhaps vice versa is the common one).

 It was possible on 32bit, but I'm not sure anyone's actively using it :). The 
 thing that was very common (not so much anymore for enterprise distros) is 
 32-bit user space with 64-bit kernels.


 Alex


 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Paolo Bonzini

Il 24/01/2014 16:23, Victor Kamensky ha scritto:

Also if you use ints on real bus as description, you may want to clarify
restrictions on mmio.len. Basically on 32 bit platform (i.e like V7
ARM) one cannot have mmio.len=8, because one cannot have 64bit
value on 32bit data bus. Without such clarification introduction of
text like the value as it would go on the bus in real hardware is
confusing for len=8 for emulated CPUs where real busses are
32bit.


This is not necessarily true.  On a 32-bit CPU you can have a 64-bit 
memory bus.  Even x86 32-bit CPUs can do 64-bit MMIO via MMX or SSE or 
double-word compare-and-swap (CMPXCHG8B).


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 4/4] KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt

2014-01-24 Thread Jan Kiszka
According to SDM 27.2.3, IDT vectoring information will not be valid on
vmexits caused by external NMIs. So we have to avoid creating such
scenarios by delaying EXIT_REASON_EXCEPTION_NMI injection as long as we
have a pending interrupt because that one would be migrated to L1's IDT
vectoring info on nested exit.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/kvm/vmx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 081a15c..7ed0ecc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8159,7 +8159,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, 
bool external_intr)
}
 
if (vcpu-arch.nmi_pending  nested_exit_on_nmi(vcpu)) {
-   if (vmx-nested.nested_run_pending)
+   if (vmx-nested.nested_run_pending ||
+   vcpu-arch.interrupt.pending)
return -EBUSY;
nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
  NMI_VECTOR | INTR_TYPE_NMI_INTR |
-- 
1.8.1.1.298.ge7eed54

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/4] KVM: x86: Validate guest writes to MSR_IA32_APICBASE

2014-01-24 Thread Jan Kiszka
Check for invalid state transitions on guest-initiated updates of
MSR_IA32_APICBASE. This address both enabling of the x2APIC when it is
not supported and all invalid transitions as described in SDM section
10.12.5. It also checks that no reserved bit is set in APICBASE by the
guest.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/kvm/cpuid.h | 16 
 arch/x86/kvm/lapic.h |  2 +-
 arch/x86/kvm/vmx.c   |  9 +
 arch/x86/kvm/x86.c   | 32 +---
 4 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index f1e4895..b012ad2 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -72,4 +72,20 @@ static inline bool guest_cpuid_has_pcid(struct kvm_vcpu 
*vcpu)
return best  (best-ecx  bit(X86_FEATURE_PCID));
 }
 
+static inline bool guest_cpuid_has_x2apic(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 1, 0);
+   return best  (best-ecx  bit(X86_FEATURE_X2APIC));
+}
+
+static inline unsigned int guest_cpuid_get_phys_bits(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 0x8008, 0);
+   return best ? best-eax  0xff : 36;
+}
+
 #endif
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index c8b0d0d..6a11845 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -65,7 +65,7 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct 
kvm_lapic *src,
struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map);
 
 u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
-void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
+int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
struct kvm_lapic_state *s);
 int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5c88791..a06f101 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4392,7 +4392,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
-   u64 msr;
+   struct msr_data apic_base_msr;
 
vmx-rmode.vm86_active = 0;
 
@@ -4400,10 +4400,11 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 
vmx-vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
kvm_set_cr8(vmx-vcpu, 0);
-   msr = 0xfee0 | MSR_IA32_APICBASE_ENABLE;
+   apic_base_msr.data = 0xfee0 | MSR_IA32_APICBASE_ENABLE;
if (kvm_vcpu_is_bsp(vmx-vcpu))
-   msr |= MSR_IA32_APICBASE_BSP;
-   kvm_set_apic_base(vmx-vcpu, msr);
+   apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
+   apic_base_msr.host_initiated = true;
+   kvm_set_apic_base(vmx-vcpu, apic_base_msr);
 
vmx_segment_cache_clear(vmx);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c76f7c..f4b0591 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -257,10 +257,26 @@ u64 kvm_get_apic_base(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_get_apic_base);
 
-void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data)
-{
-   /* TODO: reserve bits check */
-   kvm_lapic_set_base(vcpu, data);
+int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+   u64 old_state = vcpu-arch.apic_base 
+   (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
+   u64 new_state = msr_info-data 
+   (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
+   u64 reserved_bits = ((~0ULL)  guest_cpuid_get_phys_bits(vcpu)) |
+   0x2ff | (guest_cpuid_has_x2apic(vcpu) ? 0 : X2APIC_ENABLE);
+
+   if (!msr_info-host_initiated 
+   ((msr_info-data  reserved_bits) != 0 ||
+new_state == X2APIC_ENABLE ||
+(new_state == MSR_IA32_APICBASE_ENABLE 
+ old_state == (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE)) ||
+(new_state == (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE) 
+ old_state == 0)))
+   return 1;
+
+   kvm_lapic_set_base(vcpu, msr_info-data);
+   return 0;
 }
 EXPORT_SYMBOL_GPL(kvm_set_apic_base);
 
@@ -2006,8 +2022,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
case 0x200 ... 0x2ff:
return set_msr_mtrr(vcpu, msr, data);
case MSR_IA32_APICBASE:
-   kvm_set_apic_base(vcpu, data);
-   break;
+   return kvm_set_apic_base(vcpu, msr_info);
case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff:
return kvm_x2apic_msr_write(vcpu, msr, data);
case MSR_IA32_TSCDEADLINE:
@@ -6409,6 +6424,7 @@ EXPORT_SYMBOL_GPL(kvm_task_switch);
 int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
  struct kvm_sregs *sregs)
 {
+   struct msr_data apic_base_msr;
 

[PATCH v2 3/4] KVM: nVMX: Fully emulate preemption timer

2014-01-24 Thread Jan Kiszka
We cannot rely on the hardware-provided preemption timer support because
we are holding L2 in HLT outside non-root mode. Furthermore, emulating
the preemption will resolve tick rate errata on older Intel CPUs.

The emulation is based on hrtimer which is started on L2 entry, stopped
on L2 exit and evaluated via the new check_nested_events hook. As we no
longer rely on hardware features, we can enable both the preemption
timer support and value saving unconditionally.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/kvm/vmx.c | 151 ++---
 1 file changed, 96 insertions(+), 55 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 089aa3c..081a15c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -31,6 +31,7 @@
 #include linux/ftrace_event.h
 #include linux/slab.h
 #include linux/tboot.h
+#include linux/hrtimer.h
 #include kvm_cache_regs.h
 #include x86.h
 
@@ -110,6 +111,8 @@ module_param(nested, bool, S_IRUGO);
 
 #define RMODE_GUEST_OWNED_EFLAGS_BITS (~(X86_EFLAGS_IOPL | X86_EFLAGS_VM))
 
+#define VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE 5
+
 /*
  * These 2 parameters are used to config the controls for Pause-Loop Exiting:
  * ple_gap:upper bound on the amount of time between two successive
@@ -374,6 +377,9 @@ struct nested_vmx {
 */
struct page *apic_access_page;
u64 msr_ia32_feature_control;
+
+   struct hrtimer preemption_timer;
+   bool preemption_timer_expired;
 };
 
 #define POSTED_INTR_ON  0
@@ -1047,6 +1053,12 @@ static inline bool nested_cpu_has_virtual_nmis(struct 
vmcs12 *vmcs12)
return vmcs12-pin_based_vm_exec_control  PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline bool nested_cpu_has_preemption_timer(struct vmcs12 *vmcs12)
+{
+   return vmcs12-pin_based_vm_exec_control 
+   PIN_BASED_VMX_PREEMPTION_TIMER;
+}
+
 static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
 {
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
@@ -2248,9 +2260,9 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 */
nested_vmx_pinbased_ctls_low |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR;
nested_vmx_pinbased_ctls_high = PIN_BASED_EXT_INTR_MASK |
-   PIN_BASED_NMI_EXITING | PIN_BASED_VIRTUAL_NMIS |
+   PIN_BASED_NMI_EXITING | PIN_BASED_VIRTUAL_NMIS;
+   nested_vmx_pinbased_ctls_high |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR |
PIN_BASED_VMX_PREEMPTION_TIMER;
-   nested_vmx_pinbased_ctls_high |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR;
 
/*
 * Exit controls
@@ -2265,15 +2277,10 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #ifdef CONFIG_X86_64
VM_EXIT_HOST_ADDR_SPACE_SIZE |
 #endif
-   VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
+   VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
+   nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
+   VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
-   if (!(nested_vmx_pinbased_ctls_high  PIN_BASED_VMX_PREEMPTION_TIMER) ||
-   !(nested_vmx_exit_ctls_high  VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
-   nested_vmx_exit_ctls_high = ~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
-   nested_vmx_pinbased_ctls_high = 
~PIN_BASED_VMX_PREEMPTION_TIMER;
-   }
-   nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
-   VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER);
 
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -2342,9 +2349,9 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 
/* miscellaneous data */
rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high);
-   nested_vmx_misc_low = VMX_MISC_PREEMPTION_TIMER_RATE_MASK |
-   VMX_MISC_SAVE_EFER_LMA;
-   nested_vmx_misc_low |= VMX_MISC_ACTIVITY_HLT;
+   nested_vmx_misc_low = VMX_MISC_SAVE_EFER_LMA;
+   nested_vmx_misc_low |= VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE |
+   VMX_MISC_ACTIVITY_HLT;
nested_vmx_misc_high = 0;
 }
 
@@ -5702,6 +5709,18 @@ static void nested_vmx_failValid(struct kvm_vcpu *vcpu,
 */
 }
 
+static enum hrtimer_restart vmx_preemption_timer_fn(struct hrtimer *timer)
+{
+   struct vcpu_vmx *vmx =
+   container_of(timer, struct vcpu_vmx, nested.preemption_timer);
+
+   vmx-nested.preemption_timer_expired = true;
+   kvm_make_request(KVM_REQ_EVENT, vmx-vcpu);
+   kvm_vcpu_kick(vmx-vcpu);
+
+   return HRTIMER_NORESTART;
+}
+
 /*
  * Emulate the VMXON instruction.
  * Currently, we just remember that VMX is active, and do not save or even
@@ -5766,6 +5785,10 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
INIT_LIST_HEAD((vmx-nested.vmcs02_pool));
vmx-nested.vmcs02_num = 0;
 
+   hrtimer_init(vmx-nested.preemption_timer, CLOCK_MONOTONIC,
+ 

[PATCH v2 2/4] KVM: nVMX: Rework interception of IRQs and NMIs

2014-01-24 Thread Jan Kiszka
Move the check for leaving L2 on pending and intercepted IRQs or NMIs
from the *_allowed handler into a dedicated callback. Invoke this
callback at the relevant points before KVM checks if IRQs/NMIs can be
injected. The callback has the task to switch from L2 to L1 if needed
and inject the proper vmexit events.

The rework fixes L2 wakeups from HLT and provides the foundation for
preemption timer emulation.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/vmx.c  | 67 +++--
 arch/x86/kvm/x86.c  | 15 +++--
 3 files changed, 53 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fdf83af..8d6cc7a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -765,6 +765,8 @@ struct kvm_x86_ops {
   struct x86_instruction_info *info,
   enum x86_intercept_stage stage);
void (*handle_external_intr)(struct kvm_vcpu *vcpu);
+
+   int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a06f101..089aa3c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4620,22 +4620,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool 
masked)
 
 static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
 {
-   if (is_guest_mode(vcpu)) {
-   if (to_vmx(vcpu)-nested.nested_run_pending)
-   return 0;
-   if (nested_exit_on_nmi(vcpu)) {
-   nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
- NMI_VECTOR | INTR_TYPE_NMI_INTR |
- INTR_INFO_VALID_MASK, 0);
-   /*
-* The NMI-triggered VM exit counts as injection:
-* clear this one and block further NMIs.
-*/
-   vcpu-arch.nmi_pending = 0;
-   vmx_set_nmi_mask(vcpu, true);
-   return 0;
-   }
-   }
+   if (to_vmx(vcpu)-nested.nested_run_pending)
+   return 0;
 
if (!cpu_has_virtual_nmis()  to_vmx(vcpu)-soft_vnmi_blocked)
return 0;
@@ -4647,19 +4633,8 @@ static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
-   if (is_guest_mode(vcpu)) {
-   if (to_vmx(vcpu)-nested.nested_run_pending)
-   return 0;
-   if (nested_exit_on_intr(vcpu)) {
-   nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT,
- 0, 0);
-   /*
-* fall through to normal code, but now in L1, not L2
-*/
-   }
-   }
-
-   return (vmcs_readl(GUEST_RFLAGS)  X86_EFLAGS_IF) 
+   return (!to_vmx(vcpu)-nested.nested_run_pending 
+   vmcs_readl(GUEST_RFLAGS)  X86_EFLAGS_IF) 
!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) 
(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
 }
@@ -8155,6 +8130,35 @@ static void vmcs12_save_pending_event(struct kvm_vcpu 
*vcpu,
}
 }
 
+static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   if (vcpu-arch.nmi_pending  nested_exit_on_nmi(vcpu)) {
+   if (vmx-nested.nested_run_pending)
+   return -EBUSY;
+   nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
+ NMI_VECTOR | INTR_TYPE_NMI_INTR |
+ INTR_INFO_VALID_MASK, 0);
+   /*
+* The NMI-triggered VM exit counts as injection:
+* clear this one and block further NMIs.
+*/
+   vcpu-arch.nmi_pending = 0;
+   vmx_set_nmi_mask(vcpu, true);
+   return 0;
+   }
+
+   if ((kvm_cpu_has_interrupt(vcpu) || external_intr) 
+   nested_exit_on_intr(vcpu)) {
+   if (vmx-nested.nested_run_pending)
+   return -EBUSY;
+   nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
+   }
+
+   return 0;
+}
+
 /*
  * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits
  * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 
(vmcs12),
@@ -8495,6 +8499,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 
exit_reason,
nested_vmx_succeed(vcpu);
if (enable_shadow_vmcs)
vmx-nested.sync_shadow_vmcs = true;
+
+   /* in case we halted in L2 */
+   vcpu-arch.mp_state = KVM_MP_STATE_RUNNABLE;
 }

[PATCH v2 0/4] KVM: x86: Fixes for IA32_APIC_BASE and nVMX

2014-01-24 Thread Jan Kiszka
This is the yet unmerged part of the previous round. Changes since v1:
 - rebase over next
 - switched APIC_BASE reserved bits check to guest's number of physical
   bits
 - addressed small review comment on Rework interception of IRQs and
   NMIs
 - added fix for improper EXCEPTION_NMI vmexit injection with valid IDT
   vectoring info

Paolo, did you already look into nested event handling for SVM? I assume
you will want to (re-)base it on top of this.

Jan

Jan Kiszka (4):
  KVM: x86: Validate guest writes to MSR_IA32_APICBASE
  KVM: nVMX: Rework interception of IRQs and NMIs
  KVM: nVMX: Fully emulate preemption timer
  KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt

 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/cpuid.h|  16 +++
 arch/x86/kvm/lapic.h|   2 +-
 arch/x86/kvm/vmx.c  | 228 
 arch/x86/kvm/x86.c  |  47 +++--
 5 files changed, 197 insertions(+), 98 deletions(-)

-- 
1.8.1.1.298.ge7eed54

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/4] KVM: x86: Validate guest writes to MSR_IA32_APICBASE

2014-01-24 Thread Paolo Bonzini

-BEGIN PGP MESSAGE-
Charset: ISO-8859-15
Version: GnuPG v2.0.22 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

hQIMA9Qj09KLn/EdARAAlKm0iclhSEbbzfzFlbJoqImUMiS+6Z/Ir1Lb1Z6xiwWp
3GXiVyMl8KXs0XzqaJF8lETKcurOFmCEdmaUDQT7iQa9k+KoXi4o2qR1K+HthiUK
4hMaHsBP94hsZjQBrTK6nWKVyChA849FG5IVudmp1HP4npWSZmVFvxCPUEdgvF68
VhG0WTN9u1IAPhpVww7EfR24WgVwWQbxSzzJfa9PKrtkKIB+oYK9wddLAPrFjW3t
1KPhkPwW98F/38eYwzHWoeaSPlLc+m84AAtkJNp1eurJCtXAGXPUf8XzU5NmJ5dp
qOkt0s6GM5nm/krzsjCOcikca7JugC1u9Zyr9H3VV8mBkmOFLXBs9a3PbA7ZYtob
qEAa1gbE44BnkMiOfrLgGf5Kgfg6ABcwQy1AHQi+LPPCrQBztAGmGyj8ZNQc6zYa
qo5gtK+M90BBg/tVmrCe0KwsY2IBf1vxnh7YM/m/yayDF0VGWnAmK2FlHkHscDH9
PYcMI7SFFn9S4Y8zcvZPv7qG8i7jb8AaRclILGduj1KveFQEOrAdX/AW+r4W5tC+
UIxgFJLUlIciKQcbzEkLKhtZDtoXL026h82OhEwitJgOpw2BCs1owthVl0MrffCK
ZlntWSDeOj3PhUpV8siOz3ACKiY5DkKM60az3N9ktNASdmJ0mv2kWIe+4Qtu+BqF
Ag4DKb7ACrQJWQ0QB/9yyyYc+W+vymJ6pPsIL9SPCO1f/gPT/7r1ulYDVt6flff1
FrGkOFVvMMw8ftVgakBDdKt4y3EOqX930/HaYcUu5ZpfvfEW8qZkJSSckrqy3UYB
StipXwcXZ+14lCGesibSzBAJW4Egv9AueKLMYHA+qaCdN02qrHoBm9Upb67ioKvX
QhvJEGPdU5yEMMy5fKy+BJGJqU+7o4OKnvOQHzMqLrxxzy54C4sBvp5zNgA6KQXH
B65CIjtvkfpeB1l/uwOUgziz7J58r58al3UqLqz5P30kBF99KABl06P7jKnAOAR8
qYLjKcRRGbIGZXCnK/xPHrs5iVlJmAUEnNjg2qbDB/9QOyI01DpOI2hoOTmQV27E
fePWW6L415LEJ0iAm8Snmdi8lp66/bSAl4dhDoAz6wd3Vnq7LaCHsGn1Vwe+kcah
hFJL3Mhvi3iag5+PhbYHGbvAQ/Xzm+PTpP7GyElgljKB/1ILk6uvGsDDH9uxCLiK
tNq2MSIZAwtckcdXyEm5YL4+v+YNNx+Dwnct6WoYP8CnuFhz2b70BLNw5B0/QZMf
GWou3nwNt6RdexwCgm4bsJJElQJAAR7u29fFlAcVmXHqNMvsebTUZnIwpqjWkgOc
yrplYw+vLagiqCH8gsyQeFqWsWtZBLBJYkPXOamyD9UpHZwkhi/TqTvIdVUBz985
hQIMA1uMoGen9Cu3AQ/9HjsJgJQHJVCCHG3lCCGFdrTRGiiifRb08ik0ZEU9OKFX
tB0Ka8LdrJ4gwY9YnnQKWel11BoSgToAc1cil0NJoCSY05/AK8LhnJnaGbZn8CsE
PK6hCiFDXucpvO9EVF/dTF9Xqi5HMDhKRd8K1BRDo+ZSj6K1j3W5INMPPTm5nL7P
RPNnaMuf4e0YCK2buuAKEagnJKQL4uC/Xi6lEUq680g19TL+vXubWLLEo0Ny5Iq6
oa5ZVCII9TsJl9OMJpYmo77zByBJ482nobX2X6+N9IQFv4VuCsJ1KKZC3fLykvQ+
aMT/t2+MY3ovFM+ra2cHyxWQjkLvcTZPe/o2rG5QAPuGtf+2X8T9Mph8mqBhT+NK
/mr78zp+1JsgLM/DYWO0Zg+GRx3Sk2P+8qsJe2+tUUDl7vcqDvF54dNJ19AL9Xwr
5sWME/C4FhAbLtgMQv2SbdPx2CUBk6dBJQRPvcnmTr25TiGAO1yRhv4cF1S/9Hnk
6ace9PZBTKcivc1KKqtTGDM5zSEOk5DZLv2FncMITFABsk+v7eG7eQ2YG3v5Tiq6
gs+4cJyq+fYwKES2DiT54iRIlHsIB9T53ng+z6fx/wThIu9wcqdGRgeZvicViyZ6
GoTzU6HV6jgm4n1xc91tRzFCFgOu6oCSOUS1k1wJ9NhFWN74TiKvratXZFDVQNuF
AgwDyn8FWYRSbREBEACnZNEkJqKMKuY3Mlbh4Pck6DM6CTinaoy0Dv+DkXI5riPQ
kzXXv+M2LOlaJ3A5Sh/XDuB9CyUJObeHK12tLmWYg27WY59EAsEd0Vgu4zpaIKvK
2zREEq8PFpchi2hMfIiReSHrlw1a459mbeC0z6Gd44hBoOUFsFVCNt8khQAP0TKT
m4BGdNkxF9Cl3VUqqP7LdyT6Xo3g12tmsGKYO3rs3Tm8QoiUsfYTzSDHPN9iupLZ
wVsdzydgB8vQ1GXL1JNoFZXLXbYAraRs2wgwvGo6auLhE/zwURHP6ky7QmmCEEjb
UvWbHg5xiZga/Gx5Y+Ghtzb5yMHBn9HrZa9wxjuk6SqrIBMCWzhJP9t38AhiomiG
d4L3sQsmRKlH+YsC+TVdCVVGKo7d+Zox/pM25/lo2PkbDGQv6A5JfxT7K+O/cK6K
G2HtkIhuVQoLwdGu5quMrty6uIVfvtrMgnNFxOwsC/qwfPj50+6sDtNYyxvl4+d9
7S2hBSieCJPohQbzdgdDyNvbj9D/913Evq5M/lPdUMvue/Oq2RGWF4ZzKfr+f6k6
79rhvm2cO5fGzgF6AVNjhzWbTfvdZfMQo7A6Cew/y70ukwwoX0vkYYbg4pWbrQ1t
EyjdU6sSivEH60Jkev1DpkCVZpDizXv+pmDY+g5sxIhHbL0a3XNnyDi+ni1uidLr
Aaq+G4yujcM50b8cjUjYSnd+1ISz22qoFK/V2dpaRzi1QKVkcHI95yoTWG35T3fb
SabLhZ1yzExBTt1uZ73P7qHuJN8MvBbvjo8mEvpBFyGb555fdBlPlbZ8FdtJodBv
OsJhHbc0/bUYbflVkGlbygoG2Ykk7DPwL7hCQmwSBLzhIyGp0Ak9lgqASByzWfDP
num1/3zWQVq7dWMFEEzfnmMaWsWzBEK2h5dLP5RDNYjIUYVzA0EZJPNCZSO2Tin+
77OQxaZw00yrghK8+NTutV1xzkSzw42cykUkqx2/FHmEPwthsuVb6rT7eu3xTqWU
jg6cjsdjQkMy5S7WKCeGjQiY1629qzkVjuD8QHOgcYKutRB7mzXUfW9ABUJyQ9mS
4hsdS0tteH4ang8YVoli2pB9WGxQ437AaJ5nWya1Xkr/ztYNb8uQEpgjoaUPc6ca
fdu4bZuuGTqAz+IokgBn8mZR2yPAuAmypqsGmriQ0Y5G10OIuX/ULfvd5DZD8ijM
l0qMQUKaQhIHIInu/rv0SMD2WGECJmAxhXQWYwe0YCTDO/Xyet91P8xpwq+U87v+
jxx+copEuFHTvpA4IdMNIS1xDuvIQdXVHdQuCUgagHgpQZ6ZT0RrRFb7QTCZuR80
dR3fDp2YraZQufvLEnAxvti1uXzgnzhFbpjDlzOXgzwYXBH56XRhWSg8ok/dwI+x
qIOuzIi5uy1h11STLfiS5fOLSI2oadl5jrR4H65YOY1CVXLjOoNUJVLZ6lY9GDER
9Mtae0sgXum4mFvOiZCDCDpDokIM/6nNGbuQruzVgUlbF4uL0/+sWpXrEpFFN4t9
xYVbT97HyGj7myRkTpIhtr/2xdMEeK9Sjd5QPzIzk11QuEPMQfMGYP3sGOjc5iY8
e/VheXlHgKGPeDsj1y6PvST4SXLxm/MgppO6yx342g2GVOBWDd7mjKnoT5b+n//V
8bircdvcZq18lAZsLFKpSxqtHgdAHfCDTCL+ZLdiaOl1Pa6ZQZdUgOA5Om9V38HF
gfGoQStYUUC6cxPPSuTXl4294ZK8IZ2F2Jg2fmzyGacNETfNoA2y+/u2FyeUO1zv
jZcpiOnnR/ciZ0vGAQzPtqdkTAiFhfCgFlVfBajDWDdRGKkOxliqXrI+5KTlEID7
yxAQZ3f7wDN2OKFZQcXGErPqs71D8yGPVJh6qhRlrV/v4thRynZx8tzyiXKtsPSC
JWdRszPzL9PyyBtTQgwC/PzU35DjJqgUdZ3XxHcjpgRBLbwIu9xFAy4W5/cALAwF
HlMM79cxHsHKwJ3aeql8Rm5ZeP6Rp2P1F918F/4LyiZyNEd3iIO0GUsVicWbSfBz
YFnWxjdVR16A2enr/3DT0+yT5drep9huvV8gaLoPxi5XmLpb2tQebBeG4cvV04VP
iaLjdcStoJPYSMx/A3s5JWmc0k/60WjacHmNnAVwE4uUyfwGeZdJqTWGORth1PU4
5vKr76tuw5Vd9UUQhF936gucEL+O+NkxywR+iBl8oKEecyXgHjzH2OFKuDsnBy/C
y3uQ0qRsj8JXu2O1N0Cdb5REFH+ZFTZa3qlSpqsjNMqdm9msYICaGIBHHQ12QIQw
3wa8FikGOmAhLCIZscUoVAt5mE7114aNpsRpxfGY+0aPjAFmK822LmCuZ/r0hmAe
Ma2nd3F3BgWXJsCxEaaf2tcmF1OM0uDe8MKvBS0sX6La+z7CJ+4WcA6iWAC8C18h
LnVtiM2SI8BmkrPNyYI5QakBOFaUO+7MifjAcobvLxTkHr2lcRV8bTOyDQM4H2CV
oYLbU5e+banLW7LD/icfwiKM1IrZ6azzM9sKLdQWq86tTpmrotZf7xdAJfPNZ46Y
fnZhJCd9zTKtp3Hv4UFupj8z3/hKPDAT2HXN+VSQ4nlqBZcNdXNoSr3hJ03qhc2B

Re: [PATCH v2 1/4] KVM: x86: Validate guest writes to MSR_IA32_APICBASE

2014-01-24 Thread Paolo Bonzini

Il 24/01/2014 16:48, Jan Kiszka ha scritto:

Check for invalid state transitions on guest-initiated updates of
MSR_IA32_APICBASE. This address both enabling of the x2APIC when it is
not supported and all invalid transitions as described in SDM section
10.12.5. It also checks that no reserved bit is set in APICBASE by the
guest.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/kvm/cpuid.h | 16 
 arch/x86/kvm/lapic.h |  2 +-
 arch/x86/kvm/vmx.c   |  9 +
 arch/x86/kvm/x86.c   | 32 +---
 4 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index f1e4895..b012ad2 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -72,4 +72,20 @@ static inline bool guest_cpuid_has_pcid(struct kvm_vcpu 
*vcpu)
return best  (best-ecx  bit(X86_FEATURE_PCID));
 }

+static inline bool guest_cpuid_has_x2apic(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 1, 0);
+   return best  (best-ecx  bit(X86_FEATURE_X2APIC));
+}
+
+static inline unsigned int guest_cpuid_get_phys_bits(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 0x8008, 0);
+   return best ? best-eax  0xff : 36;
+}


[Resending after learning that Ctrl-Shift-C does other things beyond 
copying to clipboard]


There's already cpuid_maxphyaddr for this.  I can adjust it when committing.

This is applied to kvm/queue.  The other three will have to wait for
after the merge window.

Paolo


 #endif
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index c8b0d0d..6a11845 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -65,7 +65,7 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct 
kvm_lapic *src,
struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map);

 u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
-void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
+int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
struct kvm_lapic_state *s);
 int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5c88791..a06f101 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4392,7 +4392,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
-   u64 msr;
+   struct msr_data apic_base_msr;

vmx-rmode.vm86_active = 0;

@@ -4400,10 +4400,11 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)

vmx-vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
kvm_set_cr8(vmx-vcpu, 0);
-   msr = 0xfee0 | MSR_IA32_APICBASE_ENABLE;
+   apic_base_msr.data = 0xfee0 | MSR_IA32_APICBASE_ENABLE;
if (kvm_vcpu_is_bsp(vmx-vcpu))
-   msr |= MSR_IA32_APICBASE_BSP;
-   kvm_set_apic_base(vmx-vcpu, msr);
+   apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
+   apic_base_msr.host_initiated = true;
+   kvm_set_apic_base(vmx-vcpu, apic_base_msr);

vmx_segment_cache_clear(vmx);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c76f7c..f4b0591 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -257,10 +257,26 @@ u64 kvm_get_apic_base(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_get_apic_base);

-void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data)
-{
-   /* TODO: reserve bits check */
-   kvm_lapic_set_base(vcpu, data);
+int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+   u64 old_state = vcpu-arch.apic_base 
+   (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
+   u64 new_state = msr_info-data 
+   (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
+   u64 reserved_bits = ((~0ULL)  guest_cpuid_get_phys_bits(vcpu)) |
+   0x2ff | (guest_cpuid_has_x2apic(vcpu) ? 0 : X2APIC_ENABLE);
+
+   if (!msr_info-host_initiated 
+   ((msr_info-data  reserved_bits) != 0 ||
+new_state == X2APIC_ENABLE ||
+(new_state == MSR_IA32_APICBASE_ENABLE 
+ old_state == (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE)) ||
+(new_state == (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE) 
+ old_state == 0)))
+   return 1;
+
+   kvm_lapic_set_base(vcpu, msr_info-data);
+   return 0;
 }
 EXPORT_SYMBOL_GPL(kvm_set_apic_base);

@@ -2006,8 +2022,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
case 0x200 ... 0x2ff:
return set_msr_mtrr(vcpu, msr, data);
case MSR_IA32_APICBASE:
-   kvm_set_apic_base(vcpu, data);
-   break;
+   return kvm_set_apic_base(vcpu, msr_info);
case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff:

[PATCH uq/master 0/5] Hyper-V improvements and migratability

2014-01-24 Thread Paolo Bonzini
The first patch fixes the KVM leaves at 0x4100.  Before, there
is no leaf at 0x4101 (and the data of the highest Intel leaf
is returned, e.g. 0xd on a Sandy Bridge).  After this patch there is
one.

The second patch is extracted from Vadim's migration patches, which
are patches 3-5.

Review of the first two patches is particularly welcome.

Paolo Bonzini (2):
  KVM: fix coexistence of KVM and Hyper-V leaves
  kvm: make availability of Hyper-V enlightenments dependent on
KVM_CAP_HYPERV

Vadim Rozenfeld (3):
  kvm: make hyperv hypercall and guest os id MSRs migratable.
  kvm: make hyperv vapic assist page migratable
  kvm: add support for hyper-v timers

 linux-headers/asm-x86/hyperv.h |   3 ++
 linux-headers/linux/kvm.h  |   1 +
 target-i386/cpu-qom.h  |   1 +
 target-i386/cpu.c  |   1 +
 target-i386/cpu.h  |   4 ++
 target-i386/kvm.c  | 109 +
 target-i386/machine.c  |  67 +
 7 files changed, 155 insertions(+), 31 deletions(-)

-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] kvm: add support for hyper-v timers

2014-01-24 Thread Paolo Bonzini
From: Vadim Rozenfeld vroze...@redhat.com

http://msdn.microsoft.com/en-us/library/windows/hardware/ff541625%28v=vs.85%29.aspx

This code is generic for activating reference time counter or virtual reference 
time stamp counter

Signed-off-by: Vadim Rozenfeld vroze...@redhat.com
Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 linux-headers/asm-x86/hyperv.h |  3 +++
 linux-headers/linux/kvm.h  |  1 +
 target-i386/cpu-qom.h  |  1 +
 target-i386/cpu.c  |  1 +
 target-i386/cpu.h  |  1 +
 target-i386/kvm.c  | 20 +++-
 target-i386/machine.c  | 22 ++
 7 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/linux-headers/asm-x86/hyperv.h b/linux-headers/asm-x86/hyperv.h
index b8f1c01..3b400ee 100644
--- a/linux-headers/asm-x86/hyperv.h
+++ b/linux-headers/asm-x86/hyperv.h
@@ -149,6 +149,9 @@
 /* MSR used to read the per-partition time reference counter */
 #define HV_X64_MSR_TIME_REF_COUNT  0x4020
 
+/* A partition's reference time stamp counter (TSC) page */
+#define HV_X64_MSR_REFERENCE_TSC   0x4021
+
 /* MSR used to retrieve the TSC frequency */
 #define HV_X64_MSR_TSC_FREQUENCY   0x4022
 
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 5a49671..999fb13 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -674,6 +674,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_ARM_EL1_32BIT 93
 #define KVM_CAP_SPAPR_MULTITCE 94
 #define KVM_CAP_EXT_EMUL_CPUID 95
+#define KVM_CAP_HYPERV_TIME 96
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h
index d1751a4..722f11a 100644
--- a/target-i386/cpu-qom.h
+++ b/target-i386/cpu-qom.h
@@ -69,6 +69,7 @@ typedef struct X86CPU {
 bool hyperv_vapic;
 bool hyperv_relaxed_timing;
 int hyperv_spinlock_attempts;
+bool hyperv_time;
 bool check_cpuid;
 bool enforce_cpuid;
 
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 2e0be01..1f30efd 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2702,6 +2702,7 @@ static Property x86_cpu_properties[] = {
 { .name  = hv-spinlocks, .info  = qdev_prop_spinlocks },
 DEFINE_PROP_BOOL(hv-relaxed, X86CPU, hyperv_relaxed_timing, false),
 DEFINE_PROP_BOOL(hv-vapic, X86CPU, hyperv_vapic, false),
+DEFINE_PROP_BOOL(hv-time, X86CPU, hyperv_time, false),
 DEFINE_PROP_BOOL(check, X86CPU, check_cpuid, false),
 DEFINE_PROP_BOOL(enforce, X86CPU, enforce_cpuid, false),
 DEFINE_PROP_END_OF_LIST()
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 45bd554..1b94f0f 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -865,6 +865,7 @@ typedef struct CPUX86State {
 uint64_t msr_hv_hypercall;
 uint64_t msr_hv_guest_os_id;
 uint64_t msr_hv_vapic;
+uint64_t msr_hv_tsc;
 
 /* exception/interrupt handling */
 int error_code;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index ddd437f..e555040 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -74,6 +74,7 @@ static bool has_msr_kvm_steal_time;
 static int lm_capable_kernel;
 static bool has_msr_hv_hypercall;
 static bool has_msr_hv_vapic;
+static bool has_msr_hv_tsc;
 
 static bool has_msr_architectural_pmu;
 static uint32_t num_architectural_pmu_counters;
@@ -442,6 +443,7 @@ static bool hyperv_enabled(X86CPU *cpu)
 CPUState *cs = CPU(cpu);
 return kvm_check_extension(cs-kvm_state, KVM_CAP_HYPERV)  0 
(hyperv_hypercall_available(cpu) ||
+cpu-hyperv_time  ||
 cpu-hyperv_relaxed_timing);
 }
 
@@ -499,7 +501,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
 c-eax |= HV_X64_MSR_APIC_ACCESS_AVAILABLE;
 has_msr_hv_vapic = true;
 }
-
+if (cpu-hyperv_time 
+kvm_check_extension(cs-kvm_state, KVM_CAP_HYPERV_TIME)  0) {
+c-eax |= HV_X64_MSR_HYPERCALL_AVAILABLE;
+c-eax |= HV_X64_MSR_TIME_REF_COUNT_AVAILABLE;
+c-eax |= 0x200;
+has_msr_hv_tsc = true;
+}
 c = cpuid_data.entries[cpuid_i++];
 c-function = HYPERV_CPUID_ENLIGHTMENT_INFO;
 if (cpu-hyperv_relaxed_timing) {
@@ -1239,6 +1247,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE,
   env-msr_hv_vapic);
 }
+if (has_msr_hv_tsc) {
+kvm_msr_entry_set(msrs[n++], HV_X64_MSR_REFERENCE_TSC,
+  env-msr_hv_tsc);
+}
 
 /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see
  *   kvm_put_msr_feature_control. */
@@ -1530,6 +1542,9 @@ static int kvm_get_msrs(X86CPU *cpu)
 if (has_msr_hv_vapic) {
 msrs[n++].index = HV_X64_MSR_APIC_ASSIST_PAGE;
 }
+if (has_msr_hv_tsc) {
+msrs[n++].index = HV_X64_MSR_REFERENCE_TSC;
+}
 
   

[PATCH 2/5] kvm: make availability of Hyper-V enlightenments dependent on KVM_CAP_HYPERV

2014-01-24 Thread Paolo Bonzini
The MS docs specify HV_X64_MSR_HYPERCALL as a mandatory interface,
thus we must provide the MSRs even if the user only specified
features that, like relaxed timing, in principle don't require them.
And the MSRs are only there if the hypervisor has KVM_CAP_HYPERV.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 target-i386/kvm.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 5738911..08c47bb 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -72,6 +72,7 @@ static bool has_msr_misc_enable;
 static bool has_msr_bndcfgs;
 static bool has_msr_kvm_steal_time;
 static int lm_capable_kernel;
+static bool has_msr_hv_hypercall;
 
 static bool has_msr_architectural_pmu;
 static uint32_t num_architectural_pmu_counters;
@@ -437,8 +438,10 @@ static bool hyperv_hypercall_available(X86CPU *cpu)
 
 static bool hyperv_enabled(X86CPU *cpu)
 {
-return hyperv_hypercall_available(cpu) ||
-   cpu-hyperv_relaxed_timing;
+CPUState *cs = CPU(cpu);
+return kvm_check_extension(cs-kvm_state, KVM_CAP_HYPERV)  0 
+   (hyperv_hypercall_available(cpu) ||
+cpu-hyperv_relaxed_timing);
 }
 
 #define KVM_MAX_CPUID_ENTRIES  100
@@ -511,6 +514,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
 c-ebx = 0x40;
 
 kvm_base = KVM_CPUID_SIGNATURE_NEXT;
+has_msr_hv_hypercall = true;
 }
 
 memcpy(signature, KVMKVMKVM\0\0\0, 12);
-- 
1.8.3.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] kvm: make hyperv hypercall and guest os id MSRs migratable.

2014-01-24 Thread Paolo Bonzini
From: Vadim Rozenfeld vroze...@redhat.com

Signed-off-by: Vadim Rozenfeld vroze...@redhat.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 target-i386/cpu.h |  2 ++
 target-i386/kvm.c | 18 +++---
 target-i386/machine.c | 23 +++
 3 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 1fcbc82..3dba5ef 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -862,6 +862,8 @@ typedef struct CPUX86State {
 uint64_t msr_fixed_counters[MAX_FIXED_COUNTERS];
 uint64_t msr_gp_counters[MAX_GP_COUNTERS];
 uint64_t msr_gp_evtsel[MAX_GP_COUNTERS];
+uint64_t msr_hv_hypercall;
+uint64_t msr_hv_guest_os_id;
 
 /* exception/interrupt handling */
 int error_code;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 08c47bb..8f2854a 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1227,9 +1227,11 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 kvm_msr_entry_set(msrs[n++], MSR_CORE_PERF_GLOBAL_CTRL,
   env-msr_global_ctrl);
 }
-if (hyperv_hypercall_available(cpu)) {
-kvm_msr_entry_set(msrs[n++], HV_X64_MSR_GUEST_OS_ID, 0);
-kvm_msr_entry_set(msrs[n++], HV_X64_MSR_HYPERCALL, 0);
+if (has_msr_hv_hypercall) {
+kvm_msr_entry_set(msrs[n++], HV_X64_MSR_GUEST_OS_ID,
+  env-msr_hv_guest_os_id);
+kvm_msr_entry_set(msrs[n++], HV_X64_MSR_HYPERCALL,
+  env-msr_hv_hypercall);
 }
 if (cpu-hyperv_vapic) {
 kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0);
@@ -1518,6 +1520,10 @@ static int kvm_get_msrs(X86CPU *cpu)
 }
 }
 
+if (has_msr_hv_hypercall) {
+msrs[n++].index = HV_X64_MSR_HYPERCALL;
+msrs[n++].index = HV_X64_MSR_GUEST_OS_ID;
+}
 msr_data.info.nmsrs = n;
 ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_MSRS, msr_data);
 if (ret  0) {
@@ -1625,6 +1631,12 @@ static int kvm_get_msrs(X86CPU *cpu)
 case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
 env-msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
 break;
+case HV_X64_MSR_HYPERCALL:
+env-msr_hv_hypercall = msrs[i].data;
+break;
+case HV_X64_MSR_GUEST_OS_ID:
+env-msr_hv_guest_os_id = msrs[i].data;
+break;
 }
 }
 
diff --git a/target-i386/machine.c b/target-i386/machine.c
index 2de1964..96fd045 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -554,6 +554,26 @@ static const VMStateDescription vmstate_mpx = {
 }
 };
 
+static bool hyperv_hypercall_enable_needed(void *opaque)
+{
+X86CPU *cpu = opaque;
+CPUX86State *env = cpu-env;
+
+return env-msr_hv_hypercall != 0 || env-msr_hv_guest_os_id != 0;
+}
+
+static const VMStateDescription vmstate_msr_hypercall_hypercall = {
+.name = cpu/msr_hyperv_hypercall,
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(env.msr_hv_hypercall, X86CPU),
+VMSTATE_UINT64(env.msr_hv_guest_os_id, X86CPU),
+VMSTATE_END_OF_LIST()
+}
+};
+
 const VMStateDescription vmstate_x86_cpu = {
 .name = cpu,
 .version_id = 12,
@@ -688,6 +708,9 @@ const VMStateDescription vmstate_x86_cpu = {
 } , {
 .vmsd = vmstate_mpx,
 .needed = mpx_needed,
+}, {
+.vmsd = vmstate_msr_hypercall_hypercall,
+.needed = hyperv_hypercall_enable_needed,
 } , {
 /* empty */
 }
-- 
1.8.3.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] KVM: fix coexistence of KVM and Hyper-V leaves

2014-01-24 Thread Paolo Bonzini
kvm_arch_init_vcpu's initialization of the KVM leaves at 0x4100
is broken, because KVM_CPUID_FEATURES is left at 0x4001.  Move
it to 0x4101 if Hyper-V is enabled.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 target-i386/kvm.c | 47 +--
 1 file changed, 25 insertions(+), 22 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 0a21c30..5738911 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -455,6 +455,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
 uint32_t unused;
 struct kvm_cpuid_entry2 *c;
 uint32_t signature[3];
+int kvm_base = KVM_CPUID_SIGNATURE;
 int r;
 
 memset(cpuid_data, 0, sizeof(cpuid_data));
@@ -462,26 +463,22 @@ int kvm_arch_init_vcpu(CPUState *cs)
 cpuid_i = 0;
 
 /* Paravirtualization CPUIDs */
-c = cpuid_data.entries[cpuid_i++];
-c-function = KVM_CPUID_SIGNATURE;
-if (!hyperv_enabled(cpu)) {
-memcpy(signature, KVMKVMKVM\0\0\0, 12);
-c-eax = 0;
-} else {
+if (hyperv_enabled(cpu)) {
+c = cpuid_data.entries[cpuid_i++];
+c-function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
 memcpy(signature, Microsoft Hv, 12);
 c-eax = HYPERV_CPUID_MIN;
-}
-c-ebx = signature[0];
-c-ecx = signature[1];
-c-edx = signature[2];
-
-c = cpuid_data.entries[cpuid_i++];
-c-function = KVM_CPUID_FEATURES;
-c-eax = env-features[FEAT_KVM];
+c-ebx = signature[0];
+c-ecx = signature[1];
+c-edx = signature[2];
 
-if (hyperv_enabled(cpu)) {
+c = cpuid_data.entries[cpuid_i++];
+c-function = HYPERV_CPUID_INTERFACE;
 memcpy(signature, Hv#1\0\0\0\0\0\0\0\0, 12);
 c-eax = signature[0];
+c-ebx = 0;
+c-ecx = 0;
+c-edx = 0;
 
 c = cpuid_data.entries[cpuid_i++];
 c-function = HYPERV_CPUID_VERSION;
@@ -513,15 +510,21 @@ int kvm_arch_init_vcpu(CPUState *cs)
 c-eax = 0x40;
 c-ebx = 0x40;
 
-c = cpuid_data.entries[cpuid_i++];
-c-function = KVM_CPUID_SIGNATURE_NEXT;
-memcpy(signature, KVMKVMKVM\0\0\0, 12);
-c-eax = 0;
-c-ebx = signature[0];
-c-ecx = signature[1];
-c-edx = signature[2];
+kvm_base = KVM_CPUID_SIGNATURE_NEXT;
 }
 
+memcpy(signature, KVMKVMKVM\0\0\0, 12);
+c = cpuid_data.entries[cpuid_i++];
+c-function = KVM_CPUID_SIGNATURE | kvm_base;
+c-eax = 0;
+c-ebx = signature[0];
+c-ecx = signature[1];
+c-edx = signature[2];
+
+c = cpuid_data.entries[cpuid_i++];
+c-function = KVM_CPUID_FEATURES | kvm_base;
+c-eax = env-features[FEAT_KVM];
+
 has_msr_async_pf_en = c-eax  (1  KVM_FEATURE_ASYNC_PF);
 
 has_msr_pv_eoi_en = c-eax  (1  KVM_FEATURE_PV_EOI);
-- 
1.8.3.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/5] kvm: make hyperv vapic assist page migratable

2014-01-24 Thread Paolo Bonzini
From: Vadim Rozenfeld vroze...@redhat.com

Signed-off-by: Vadim Rozenfeld vroze...@redhat.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 target-i386/cpu.h |  1 +
 target-i386/kvm.c | 16 +---
 target-i386/machine.c | 22 ++
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 3dba5ef..45bd554 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -864,6 +864,7 @@ typedef struct CPUX86State {
 uint64_t msr_gp_evtsel[MAX_GP_COUNTERS];
 uint64_t msr_hv_hypercall;
 uint64_t msr_hv_guest_os_id;
+uint64_t msr_hv_vapic;
 
 /* exception/interrupt handling */
 int error_code;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 8f2854a..ddd437f 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -73,6 +73,7 @@ static bool has_msr_bndcfgs;
 static bool has_msr_kvm_steal_time;
 static int lm_capable_kernel;
 static bool has_msr_hv_hypercall;
+static bool has_msr_hv_vapic;
 
 static bool has_msr_architectural_pmu;
 static uint32_t num_architectural_pmu_counters;
@@ -496,6 +497,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
 if (cpu-hyperv_vapic) {
 c-eax |= HV_X64_MSR_HYPERCALL_AVAILABLE;
 c-eax |= HV_X64_MSR_APIC_ACCESS_AVAILABLE;
+has_msr_hv_vapic = true;
 }
 
 c = cpuid_data.entries[cpuid_i++];
@@ -503,7 +505,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
 if (cpu-hyperv_relaxed_timing) {
 c-eax |= HV_X64_RELAXED_TIMING_RECOMMENDED;
 }
-if (cpu-hyperv_vapic) {
+if (has_msr_hv_vapic) {
 c-eax |= HV_X64_APIC_ACCESS_RECOMMENDED;
 }
 c-ebx = cpu-hyperv_spinlock_attempts;
@@ -1233,8 +1235,9 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 kvm_msr_entry_set(msrs[n++], HV_X64_MSR_HYPERCALL,
   env-msr_hv_hypercall);
 }
-if (cpu-hyperv_vapic) {
-kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0);
+if (has_msr_hv_vapic) {
+kvm_msr_entry_set(msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE,
+  env-msr_hv_vapic);
 }
 
 /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see
@@ -1524,6 +1527,10 @@ static int kvm_get_msrs(X86CPU *cpu)
 msrs[n++].index = HV_X64_MSR_HYPERCALL;
 msrs[n++].index = HV_X64_MSR_GUEST_OS_ID;
 }
+if (has_msr_hv_vapic) {
+msrs[n++].index = HV_X64_MSR_APIC_ASSIST_PAGE;
+}
+
 msr_data.info.nmsrs = n;
 ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_MSRS, msr_data);
 if (ret  0) {
@@ -1637,6 +1644,9 @@ static int kvm_get_msrs(X86CPU *cpu)
 case HV_X64_MSR_GUEST_OS_ID:
 env-msr_hv_guest_os_id = msrs[i].data;
 break;
+case HV_X64_MSR_APIC_ASSIST_PAGE:
+env-msr_hv_vapic = msrs[i].data;
+break;
 }
 }
 
diff --git a/target-i386/machine.c b/target-i386/machine.c
index 96fd045..e72e270 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -574,6 +574,25 @@ static const VMStateDescription 
vmstate_msr_hypercall_hypercall = {
 }
 };
 
+static bool hyperv_vapic_enable_needed(void *opaque)
+{
+X86CPU *cpu = opaque;
+CPUX86State *env = cpu-env;
+
+return env-msr_hv_vapic != 0;
+}
+
+static const VMStateDescription vmstate_msr_hyperv_vapic = {
+.name = cpu/msr_hyperv_vapic,
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(env.msr_hv_vapic, X86CPU),
+VMSTATE_END_OF_LIST()
+}
+};
+
 const VMStateDescription vmstate_x86_cpu = {
 .name = cpu,
 .version_id = 12,
@@ -711,6 +730,9 @@ const VMStateDescription vmstate_x86_cpu = {
 }, {
 .vmsd = vmstate_msr_hypercall_hypercall,
 .needed = hyperv_hypercall_enable_needed,
+}, {
+.vmsd = vmstate_msr_hyperv_vapic,
+.needed = hyperv_vapic_enable_needed,
 } , {
 /* empty */
 }
-- 
1.8.3.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Victor Kamensky
On 24 January 2014 07:32, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 24/01/2014 16:23, Victor Kamensky ha scritto:

 Also if you use ints on real bus as description, you may want to clarify
 restrictions on mmio.len. Basically on 32 bit platform (i.e like V7
 ARM) one cannot have mmio.len=8, because one cannot have 64bit
 value on 32bit data bus. Without such clarification introduction of
 text like the value as it would go on the bus in real hardware is
 confusing for len=8 for emulated CPUs where real busses are
 32bit.


 This is not necessarily true.  On a 32-bit CPU you can have a 64-bit memory
 bus.  Even x86 32-bit CPUs can do 64-bit MMIO via MMX or SSE or double-word
 compare-and-swap (CMPXCHG8B).

Sure, that was part of my point :). But now text says about real
hardware buses and in any given moment emulator specify particular
type of CPU emulated, and I am quite sure that we can find one
emulated ARMV7 CPU that would just have real 32bit data bus.

Note there was no such problem and nobody cares what real
data buses width are, with definition I argued for. That I think
that was original intent of '__u8  data[8]' use - just bytes array
description of how memory at given phys_addr looks before
(mmiois_write=0) or would look after (mmio.is_write = 1)
requested memory transaction (or several of them for that
matter) are emulated by KVM_EXIT_MMIO. When if comes
to devices memory read/write it is logical view of memory of
course. Such definition works with any real data buses sizes,
starting with byte.

But, ooh, well ... nobody understands such definition .. I
stand failed to explain it clear enough.

Thanks,
Victor

 Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Victor Kamensky
On 24 January 2014 07:23, Victor Kamensky victor.kamen...@linaro.org wrote:
 On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote:

 On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 24/01/2014 01:01, Peter Maydell ha scritto:
 
  +The 'data' member byte order is host kernel native endianness, 
  regardless of
  +the endianness of the guest, and represents the the value as it would 
  go on the
  +bus in real hardware.

Other thing with mmio.len need clarification: please specify what
mmio.len values are allowed. It seems that it was implied that len
could be only power of 2 in range between 1 and 8. I would rather see
that it is spelled out. I think code on all sides of KVM_EXIT_MMIO
already make such assumption. Now when you talk about integers
values on bus in real hardware power of 2 sizes implied Also note
for memory pieces with sizes other than 2, 4, 8 endianness is not
defined.


 Also if you use ints on real bus as description, you may want to clarify
 restrictions on mmio.len. Basically on 32 bit platform (i.e like V7
 ARM) one cannot have mmio.len=8, because one cannot have 64bit
 value on 32bit data bus. Without such clarification introduction of
 text like the value as it would go on the bus in real hardware is
 confusing for len=8 for emulated CPUs where real busses are
 32bit.

 If ldrd/strd would be emulated on ARMV7 one would need to use
 mmio twice to pass required data in either direction using len=4 ..

 Thanks,
 Victor

 The host kernel should always be able to do:
  +type val = *((type *)mmio.data).

would it be prudent to say that it is just integer type here.

Thanks,
Victor

 I think this would be better phrased as The host userspace should always,
 since this documentation is supposed to be telling userspace what the
 kernel's contract with it is, not the kernel keeping notes for itself on
 its own implementation. (It also clarifies what the intention is for the
 obscure and maybe-we'll-never-implement-this case of an LE host
 kernel using a compatibility interface to run the host userspace (QEMU)
 as a BE process which sees the same ABI a BE kernel provides,
 without actually dragging that red herring explicitly into the 
 documentation.)

 I agree, and also the first line should mention userspace.

 In PPC I think it's possible or even common to have BE host kernel and LE 
 host userspace (or perhaps vice versa is the common one).

 It was possible on 32bit, but I'm not sure anyone's actively using it :). 
 The thing that was very common (not so much anymore for enterprise distros) 
 is 32-bit user space with 64-bit kernels.


 Alex


 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code

2014-01-24 Thread Konrad Rzeszutek Wilk
On Thu, Jan 16, 2014 at 03:13:44PM +0100, Julian Stecklina wrote:
 The paravirtualized clock used in KVM and Xen uses a version field to
 allow the guest to see when the shared data structure is inconsistent.
 The code reads the version field twice (before and after the data
 structure is copied) and checks whether they haven't
 changed and that the lowest bit is not set. As the second access is not
 synchronized, the compiler could generate code that accesses version
 more than two times and you end up with inconsistent data.

Could you paste in the code that the 'bad' compiler generates
vs the compiler that generate 'good' code please?

 
 An example using pvclock_get_time_values:
 
 host starts updating data, sets src-version to 1
 guest reads src-version (1) and stores it into dst-version.
 guest copies inconsistent data
 guest reads src-version (1) and computes xor with dst-version.
 host finishes updating data and sets src-version to 2
 guest reads src-version (2) and checks whether lower bit is not set.
 while loop exits with inconsistent data!
 
 AFAICS the compiler is allowed to optimize the given code this way.
 
 Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de
 ---
  arch/x86/kernel/pvclock.c | 10 +++---
  1 file changed, 7 insertions(+), 3 deletions(-)
 
 diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
 index 42eb330..f62b41c 100644
 --- a/arch/x86/kernel/pvclock.c
 +++ b/arch/x86/kernel/pvclock.c
 @@ -55,6 +55,8 @@ static u64 pvclock_get_nsec_offset(struct 
 pvclock_shadow_time *shadow)
  static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
   struct pvclock_vcpu_time_info *src)
  {
 + u32 nversion;
 +
   do {
   dst-version = src-version;
   rmb();  /* fetch version before data */
 @@ -64,7 +66,8 @@ static unsigned pvclock_get_time_values(struct 
 pvclock_shadow_time *dst,
   dst-tsc_shift = src-tsc_shift;
   dst-flags = src-flags;
   rmb();  /* test version after fetching data */
 - } while ((src-version  1) || (dst-version != src-version));
 + nversion = ACCESS_ONCE(src-version);
 + } while ((nversion  1) || (dst-version != nversion));
  
   return dst-version;
  }
 @@ -135,7 +138,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
 *wall_clock,
   struct pvclock_vcpu_time_info *vcpu_time,
   struct timespec *ts)
  {
 - u32 version;
 + u32 version, nversion;
   u64 delta;
   struct timespec now;
  
 @@ -146,7 +149,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
 *wall_clock,
   now.tv_sec  = wall_clock-sec;
   now.tv_nsec = wall_clock-nsec;
   rmb();  /* fetch time before checking version */
 - } while ((wall_clock-version  1) || (version != wall_clock-version));
 + nversion = ACCESS_ONCE(wall_clock-version);
 + } while ((nversion  1) || (version != nversion));
  
   delta = pvclock_clocksource_read(vcpu_time);/* time since system 
 boot */
   delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;
 -- 
 1.8.4.2
 
 
 ___
 Xen-devel mailing list
 xen-de...@lists.xen.org
 http://lists.xen.org/xen-devel
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm virtio ethernet ring on guest side over high throughput (packet per second)

2014-01-24 Thread Alejandro Comisario
Well, its confirmed that ... because of the shape of our traffic,
constant burst of many many small packages (1.5k / 3.5k) Nagle
algorithm was in a beggining the root cause of our performance issues.

So i will have this thread as solved.
Thank you so much to everyone involved, specially people from RedHat.
Thanks a lot!


Alejandro Comisario
#melicloud CloudBuilders
Arias 3751, Piso 7 (C1430CRG)
Ciudad de Buenos Aires - Argentina
Cel: +549(11) 15-3770-1857
Tel : +54(11) 4640-8443


On Thu, Jan 23, 2014 at 4:25 PM, Alejandro Comisario
alejandro.comisa...@mercadolibre.com wrote:
 Jason, Stefan ... thank you so much.
 At a glance, disabling Nagle algorithm made the hundred thousands
 20ms delays to dissapear suddenly, tomorrow we are gonna made a
 whole day test again, and test client connectivity against NginX
 and Memcached to see if, because of the traffic we have (hundred
 thousands packages per minute) Nagle introduced this delay.

 I'll get back to you tomorrow with the tests.
 Thanks again.

 kindest regards.


 Alejandro Comisario
 #melicloud CloudBuilders
 Arias 3751, Piso 7 (C1430CRG)
 Ciudad de Buenos Aires - Argentina
 Cel: +549(11) 15-3770-1857
 Tel : +54(11) 4640-8443


 On Thu, Jan 23, 2014 at 12:14 AM, Jason Wang jasow...@redhat.com wrote:
 On 01/23/2014 05:32 AM, Alejandro Comisario wrote:
 Thank you so much Stefan for the help and cc'ing Michael  Jason.
 Like you advised yesterday on IRC, today we are making some tests with
 the application setting TCP_NODELAY in the socket options.

 So we will try that and get back to you with further information.
 In the mean time, maybe showing what options the vms are using while 
 running !

 # 
 --
 /usr/bin/kvm -S -M pc-1.0 -cpu
 core2duo,+lahf_lm,+rdtscp,+pdpe1gb,+aes,+popcnt,+x2apic,+sse4.2,+sse4.1,+dca,+xtpr,+cx16,+tm2,+est,+vmx,+ds_cpl,+pbe,+tm,+ht,+ss,+acpi,+ds
 -enable-kvm -m 32768 -smp 8,sockets=1,cores=6,threads=2 -name
 instance-0254 -uuid d25b1b20-409e-4d7f-bd92-2ef4073c7c2b
 -nodefconfig -nodefaults -chardev
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-0254.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
 -no-shutdown -kernel /var/lib/nova/instances/instance-0254/kernel
 -initrd /var/lib/nova/instances/instance-0254/ramdisk -append
 root=/dev/vda console=ttyS0 -drive
 file=/var/lib/nova/instances/instance-0254/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=writethrough
 -device 
 virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
 -netdev tap,fd=19,id=hostnet0 -device

 Better enable vhost as Stefan suggested. It may help a lot here.
 virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:27:d4:6d,bus=pci.0,addr=0x3
 -chardev 
 file,id=charserial0,path=/var/lib/nova/instances/instance-0254/console.log
 -device isa-serial,chardev=charserial0,id=serial0 -chardev
 pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:4 -k en-us -vga cirrus
 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
 # 
 --

 best regards


 Alejandro Comisario
 #melicloud CloudBuilders
 Arias 3751, Piso 7 (C1430CRG)
 Ciudad de Buenos Aires - Argentina
 Cel: +549(11) 15-3770-1857
 Tel : +54(11) 4640-8443


 On Wed, Jan 22, 2014 at 12:22 PM, Stefan Hajnoczi stefa...@gmail.com 
 wrote:
 On Tue, Jan 21, 2014 at 04:06:05PM -0200, Alejandro Comisario wrote:

 CCed Michael Tsirkin and Jason Wang who work on KVM networking.

 Hi guys, we had in the past when using physical servers, several
 throughput issues regarding the throughput of our APIS, in our case we
 measure this with packets per seconds, since we dont have that much
 bandwidth (Mb/s) since our apis respond lots of packets very small
 ones (maximum response of 3.5k and avg response of 1.5k), when we
 where using this physical servers, when we reach throughput capacity
 (due to clients tiemouts) we touched the ethernet ring configuration
 and we made the problem dissapear.

 Today with kvm and over 10k virtual instances, when we want to
 increase the throughput of KVM instances, we bumped with the fact that
 when using virtio on guests, we have a max configuration of the ring
 of 256 TX/RX, and from the host side the atached vnet has a txqueuelen
 of 500.

 What i want to know is, how can i tune the guest to support more
 packets per seccond if i know that's my bottleneck?
 I suggest investigating performance in a systematic way.  Set up a
 benchmark that saturates the network.  Post the details of the benchmark
 and the results that you are seeing.

 Then, we can discuss how to investigate the root cause of the bottleneck.

 * 

Re: [PATCH 1/5] KVM: fix coexistence of KVM and Hyper-V leaves

2014-01-24 Thread Marcelo Tosatti
On Fri, Jan 24, 2014 at 05:17:52PM +0100, Paolo Bonzini wrote:
 kvm_arch_init_vcpu's initialization of the KVM leaves at 0x4100
 is broken, because KVM_CPUID_FEATURES is left at 0x4001.  Move
 it to 0x4101 if Hyper-V is enabled.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com

arch/x86/include/asm/kvm_para.h

static inline unsigned int kvm_arch_para_features(void)
{
return cpuid_eax(KVM_CPUID_FEATURES);
}

Shouldnt it be using kvm_cpuid_base ?

 ---
  target-i386/kvm.c | 47 +--
  1 file changed, 25 insertions(+), 22 deletions(-)
 
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 0a21c30..5738911 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -455,6 +455,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
  uint32_t unused;
  struct kvm_cpuid_entry2 *c;
  uint32_t signature[3];
 +int kvm_base = KVM_CPUID_SIGNATURE;
  int r;
  
  memset(cpuid_data, 0, sizeof(cpuid_data));
 @@ -462,26 +463,22 @@ int kvm_arch_init_vcpu(CPUState *cs)
  cpuid_i = 0;
  
  /* Paravirtualization CPUIDs */
 -c = cpuid_data.entries[cpuid_i++];
 -c-function = KVM_CPUID_SIGNATURE;
 -if (!hyperv_enabled(cpu)) {
 -memcpy(signature, KVMKVMKVM\0\0\0, 12);
 -c-eax = 0;
 -} else {
 +if (hyperv_enabled(cpu)) {
 +c = cpuid_data.entries[cpuid_i++];
 +c-function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
  memcpy(signature, Microsoft Hv, 12);
  c-eax = HYPERV_CPUID_MIN;
 -}
 -c-ebx = signature[0];
 -c-ecx = signature[1];
 -c-edx = signature[2];
 -
 -c = cpuid_data.entries[cpuid_i++];
 -c-function = KVM_CPUID_FEATURES;
 -c-eax = env-features[FEAT_KVM];
 +c-ebx = signature[0];
 +c-ecx = signature[1];
 +c-edx = signature[2];
  
 -if (hyperv_enabled(cpu)) {
 +c = cpuid_data.entries[cpuid_i++];
 +c-function = HYPERV_CPUID_INTERFACE;
  memcpy(signature, Hv#1\0\0\0\0\0\0\0\0, 12);
  c-eax = signature[0];
 +c-ebx = 0;
 +c-ecx = 0;
 +c-edx = 0;
  
  c = cpuid_data.entries[cpuid_i++];
  c-function = HYPERV_CPUID_VERSION;
 @@ -513,15 +510,21 @@ int kvm_arch_init_vcpu(CPUState *cs)
  c-eax = 0x40;
  c-ebx = 0x40;
  
 -c = cpuid_data.entries[cpuid_i++];
 -c-function = KVM_CPUID_SIGNATURE_NEXT;
 -memcpy(signature, KVMKVMKVM\0\0\0, 12);
 -c-eax = 0;
 -c-ebx = signature[0];
 -c-ecx = signature[1];
 -c-edx = signature[2];
 +kvm_base = KVM_CPUID_SIGNATURE_NEXT;
  }
  
 +memcpy(signature, KVMKVMKVM\0\0\0, 12);
 +c = cpuid_data.entries[cpuid_i++];
 +c-function = KVM_CPUID_SIGNATURE | kvm_base;
 +c-eax = 0;
 +c-ebx = signature[0];
 +c-ecx = signature[1];
 +c-edx = signature[2];
 +
 +c = cpuid_data.entries[cpuid_i++];
 +c-function = KVM_CPUID_FEATURES | kvm_base;
 +c-eax = env-features[FEAT_KVM];
 +
  has_msr_async_pf_en = c-eax  (1  KVM_FEATURE_ASYNC_PF);
  
  has_msr_pv_eoi_en = c-eax  (1  KVM_FEATURE_PV_EOI);
 -- 
 1.8.3.1
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Victor Kamensky
On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote:

 On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 24/01/2014 01:01, Peter Maydell ha scritto:
 
  +The 'data' member byte order is host kernel native endianness, 
  regardless of
  +the endianness of the guest, and represents the the value as it would 
  go on the
  +bus in real hardware.  The host kernel should always be able to do:
  +type val = *((type *)mmio.data).
 I think this would be better phrased as The host userspace should always,
 since this documentation is supposed to be telling userspace what the
 kernel's contract with it is, not the kernel keeping notes for itself on
 its own implementation. (It also clarifies what the intention is for the
 obscure and maybe-we'll-never-implement-this case of an LE host
 kernel using a compatibility interface to run the host userspace (QEMU)
 as a BE process which sees the same ABI a BE kernel provides,
 without actually dragging that red herring explicitly into the 
 documentation.)

 I agree, and also the first line should mention userspace.

 In PPC I think it's possible or even common to have BE host kernel and LE 
 host userspace (or perhaps vice versa is the common one).

 It was possible on 32bit, but I'm not sure anyone's actively using it :). The 
 thing that was very common (not so much anymore for enterprise distros) is 
 32-bit user space with 64-bit kernels.

Paolo, Alex, good point about BE kernel / LE user-land mix!

How KVM kernel code that deals with KVM_MMIO_EXIT can find
out what is user process endianity that handles this
KVM_MMIO_EXIT? Do we have kernel function that can tell
that. Hey, what is current user-land task endianity? :).
And reverse case, how user-land code that wants to do
KVM_MMIO_EXIT can find out what is endianity of kernel?
Do we have such system call? Hey, kernel tell me your
endianity :).

Just a thought: should not we instead of trying to implicitly
setup endianity by some other side properties like emulator or
kernel endianity, let's just do it explicitly. Adding 'endianness'
field into current mmio structure is not an option, but maybe
there is outside mechanism like KVM features, special ioctl
number, through which one can explicitly either set of learn
endianity in mmio.data[] for given KVM session.

At least, if we don't want to consider mixed BE/LE kernel/user-land,
then document should clarify that BE/LE kernel/user-land mix is not
supported and assumption here is that they always coincide.

Thanks,
Victor


 Alex


 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 69361] Host call trace and guest hang after create guest.

2014-01-24 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=69361

Marcelo Tosatti mtosa...@redhat.com changed:

   What|Removed |Added

 CC||mtosa...@redhat.com

--- Comment #4 from Marcelo Tosatti mtosa...@redhat.com ---
Can you make the host kernel config file available?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] KVM: fix coexistence of KVM and Hyper-V leaves

2014-01-24 Thread Paolo Bonzini

Il 24/01/2014 20:08, Marcelo Tosatti ha scritto:

 kvm_arch_init_vcpu's initialization of the KVM leaves at 0x4100
 is broken, because KVM_CPUID_FEATURES is left at 0x4001.  Move
 it to 0x4101 if Hyper-V is enabled.

 Signed-off-by: Paolo Bonzini pbonz...@redhat.com

arch/x86/include/asm/kvm_para.h

static inline unsigned int kvm_arch_para_features(void)
{
return cpuid_eax(KVM_CPUID_FEATURES);
}

Shouldnt it be using kvm_cpuid_base ?



Yes.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] ARM: mm: Fix stage-2 device memory attributes

2014-01-24 Thread Christoffer Dall
On Sat, Jan 04, 2014 at 08:27:23AM -0800, Christoffer Dall wrote:
 The stage-2 memory attributes are distinct from the Hyp memory
 attributes and the Stage-1 memory attributes.  We were using the stage-1
 memory attributes for stage-2 mappings causing device mappings to be
 mapped as normal memory.  Add the S2 equivalent defines for memory
 attributes and fix the comments explaining the defines while at it.
 
 Add a prot_pte_s2 field to the mem_type struct and fill out the field
 for device mappings accordingly.
 
 Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
 ---
 Changelog[v2]:
  - Guard the use of L_PTE_S2 defines with s2_policy to allow non-LPAE 
 compiles.
 
  arch/arm/include/asm/pgtable-3level.h | 20 +---
  arch/arm/mm/mm.h  |  1 +
  arch/arm/mm/mmu.c | 15 ++-
  3 files changed, 28 insertions(+), 8 deletions(-)
 
 diff --git a/arch/arm/include/asm/pgtable-3level.h 
 b/arch/arm/include/asm/pgtable-3level.h
 index 4f95039..d5e04d6 100644
 --- a/arch/arm/include/asm/pgtable-3level.h
 +++ b/arch/arm/include/asm/pgtable-3level.h
 @@ -120,13 +120,19 @@
  /*
   * 2nd stage PTE definitions for LPAE.
   */
 -#define L_PTE_S2_MT_UNCACHED  (_AT(pteval_t, 0x5)  2) /* MemAttr[3:0] */
 -#define L_PTE_S2_MT_WRITETHROUGH (_AT(pteval_t, 0xa)  2) /* MemAttr[3:0] */
 -#define L_PTE_S2_MT_WRITEBACK (_AT(pteval_t, 0xf)  2) /* 
 MemAttr[3:0] */
 -#define L_PTE_S2_RDONLY   (_AT(pteval_t, 1)  6)   /* HAP[1]   
 */
 -#define L_PTE_S2_RDWR (_AT(pteval_t, 3)  6)   /* HAP[2:1] 
 */
 -
 -#define L_PMD_S2_RDWR (_AT(pmdval_t, 3)  6)   /* HAP[2:1] 
 */
 +#define L_PTE_S2_MT_UNCACHED (_AT(pteval_t, 0x0)  2) /* strongly 
 ordered */
 +#define L_PTE_S2_MT_WRITETHROUGH (_AT(pteval_t, 0xa)  2) /* normal 
 inner write-through */
 +#define L_PTE_S2_MT_WRITEBACK(_AT(pteval_t, 0xf)  2) /* 
 normal inner write-back */
 +#define L_PTE_S2_MT_DEV_SHARED   (_AT(pteval_t, 0x1)  2) /* 
 device */
 +#define L_PTE_S2_MT_DEV_NONSHARED(_AT(pteval_t, 0x1)  2) /* device */
 +#define L_PTE_S2_MT_DEV_WC   (_AT(pteval_t, 0x5)  2) /* normal 
 non-cacheable */
 +#define L_PTE_S2_MT_DEV_CACHED   (_AT(pteval_t, 0xf)  2) /* 
 normal inner write-back */
 +#define L_PTE_S2_MT_MASK (_AT(pteval_t, 0xf)  2)
 +
 +#define L_PTE_S2_RDONLY  (_AT(pteval_t, 1)  6)   /* 
 HAP[1]   */
 +#define L_PTE_S2_RDWR(_AT(pteval_t, 3)  6)   /* 
 HAP[2:1] */
 +
 +#define L_PMD_S2_RDWR(_AT(pmdval_t, 3)  6)   /* 
 HAP[2:1] */
  
  /*
   * Hyp-mode PL2 PTE definitions for LPAE.
 diff --git a/arch/arm/mm/mm.h b/arch/arm/mm/mm.h
 index d5a982d..7ea641b 100644
 --- a/arch/arm/mm/mm.h
 +++ b/arch/arm/mm/mm.h
 @@ -38,6 +38,7 @@ static inline pmd_t *pmd_off_k(unsigned long virt)
  
  struct mem_type {
   pteval_t prot_pte;
 + pteval_t prot_pte_s2;
   pmdval_t prot_l1;
   pmdval_t prot_sect;
   unsigned int domain;
 diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
 index 580ef2d..44d571f 100644
 --- a/arch/arm/mm/mmu.c
 +++ b/arch/arm/mm/mmu.c
 @@ -231,36 +231,48 @@ __setup(noalign, noalign_setup);
  #endif /* ifdef CONFIG_CPU_CP15 / else */
  
  #define PROT_PTE_DEVICE  
 L_PTE_PRESENT|L_PTE_YOUNG|L_PTE_DIRTY|L_PTE_XN
 +#define PROT_PTE_S2_DEVICE   PROT_PTE_DEVICE
  #define PROT_SECT_DEVICE PMD_TYPE_SECT|PMD_SECT_AP_WRITE
  
  static struct mem_type mem_types[] = {
   [MT_DEVICE] = {   /* Strongly ordered / ARMv6 shared device */
   .prot_pte   = PROT_PTE_DEVICE | L_PTE_MT_DEV_SHARED |
 L_PTE_SHARED,
 + .prot_pte_s2= s2_policy(PROT_PTE_S2_DEVICE) |
 +   s2_policy(L_PTE_S2_MT_DEV_SHARED) |
 +   L_PTE_SHARED,
   .prot_l1= PMD_TYPE_TABLE,
   .prot_sect  = PROT_SECT_DEVICE | PMD_SECT_S,
   .domain = DOMAIN_IO,
   },
   [MT_DEVICE_NONSHARED] = { /* ARMv6 non-shared device */
   .prot_pte   = PROT_PTE_DEVICE | L_PTE_MT_DEV_NONSHARED,
 + .prot_pte_s2= s2_policy(PROT_PTE_S2_DEVICE) |
 +   s2_policy(L_PTE_S2_MT_DEV_NONSHARED),
   .prot_l1= PMD_TYPE_TABLE,
   .prot_sect  = PROT_SECT_DEVICE,
   .domain = DOMAIN_IO,
   },
   [MT_DEVICE_CACHED] = {/* ioremap_cached */
   .prot_pte   = PROT_PTE_DEVICE | L_PTE_MT_DEV_CACHED,
 + .prot_pte_s2= s2_policy(PROT_PTE_S2_DEVICE) |
 +   s2_policy(L_PTE_S2_MT_DEV_CACHED),
   .prot_l1= PMD_TYPE_TABLE,
   .prot_sect  = PROT_SECT_DEVICE | PMD_SECT_WB,
   .domain = DOMAIN_IO,
   },
   [MT_DEVICE_WC] = {  /* 

[PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Christoffer Dall
The KVM API documentation is not clear about the semantics of the data
field on the mmio struct on the kvm_run struct.

This has become problematic when supporting ARM guests on big-endian
host systems with guests of both endianness types, because it is unclear
how the data should be exported to user space.

This should not break with existing implementations as all supported
existing implementations of known user space applications (QEMU and
kvmtools for virtio) only support default endianness of the
architectures on the host side.

Cc: Marc Zyngier marc.zyng...@arm.com
Cc: Peter Maydell peter.mayd...@linaro.org
Cc: Alexander Graf ag...@suse.de
Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
---
Changes [v1 - v2]:
 - s/host kernel should/host user space should/

 Documentation/virtual/kvm/api.txt | 5 +
 1 file changed, 5 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 366bf4b..6dbd68c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could not 
be satisfied
 by kvm.  The 'data' member contains the written data if 'is_write' is
 true, and should be filled by application code otherwise.
 
+The 'data' member byte order is host kernel native endianness, regardless of
+the endianness of the guest, and represents the the value as it would go on the
+bus in real hardware.  The host user space should always be able to do:
+type val = *((type *)mmio.data).
+
 NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_DCR,
   KVM_EXIT_PAPR and KVM_EXIT_EPR the corresponding
 operations are complete (and guest state is consistent) only after userspace
-- 
1.8.5.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Scott Wood
On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
 The KVM API documentation is not clear about the semantics of the data
 field on the mmio struct on the kvm_run struct.
 
 This has become problematic when supporting ARM guests on big-endian
 host systems with guests of both endianness types, because it is unclear
 how the data should be exported to user space.
 
 This should not break with existing implementations as all supported
 existing implementations of known user space applications (QEMU and
 kvmtools for virtio) only support default endianness of the
 architectures on the host side.
 
 Cc: Marc Zyngier marc.zyng...@arm.com
 Cc: Peter Maydell peter.mayd...@linaro.org
 Cc: Alexander Graf ag...@suse.de
 Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
 ---
 Changes [v1 - v2]:
  - s/host kernel should/host user space should/
 
  Documentation/virtual/kvm/api.txt | 5 +
  1 file changed, 5 insertions(+)
 
 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 366bf4b..6dbd68c 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could 
 not be satisfied
  by kvm.  The 'data' member contains the written data if 'is_write' is
  true, and should be filled by application code otherwise.
  
 +The 'data' member byte order is host kernel native endianness, regardless of
 +the endianness of the guest, and represents the the value as it would go on 
 the
 +bus in real hardware.  The host user space should always be able to do:
 +type val = *((type *)mmio.data).

Host userspace should be able to do that with what results?  It would
only produce a directly usable value if host endianness is the same as
the emulated device's endianness.

I'm not sure that host kernel native endianness is an accurate way of
describing what currently happens.  Regardless of host or guest
endianness, the guest should be swapping the value as necessary to
ensure that the value that goes on the (real or emulated) bus is the
same.

Plus, if it were really as it would go on the bus the value wouldn't
necessarily be left justified within data[], depending on how the bus 
works.

How about a wording like this:

  The 'data' member contains, in its first 'len' bytes, the value as it
  would appear if the guest had accessed memory rather than I/O.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Victor Kamensky
On 24 January 2014 15:51, Scott Wood scottw...@freescale.com wrote:
 On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
 The KVM API documentation is not clear about the semantics of the data
 field on the mmio struct on the kvm_run struct.

 This has become problematic when supporting ARM guests on big-endian
 host systems with guests of both endianness types, because it is unclear
 how the data should be exported to user space.

 This should not break with existing implementations as all supported
 existing implementations of known user space applications (QEMU and
 kvmtools for virtio) only support default endianness of the
 architectures on the host side.

 Cc: Marc Zyngier marc.zyng...@arm.com
 Cc: Peter Maydell peter.mayd...@linaro.org
 Cc: Alexander Graf ag...@suse.de
 Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
 ---
 Changes [v1 - v2]:
  - s/host kernel should/host user space should/

  Documentation/virtual/kvm/api.txt | 5 +
  1 file changed, 5 insertions(+)

 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 366bf4b..6dbd68c 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could 
 not be satisfied
  by kvm.  The 'data' member contains the written data if 'is_write' is
  true, and should be filled by application code otherwise.

 +The 'data' member byte order is host kernel native endianness, regardless of
 +the endianness of the guest, and represents the the value as it would go on 
 the
 +bus in real hardware.  The host user space should always be able to do:
 +type val = *((type *)mmio.data).

 Host userspace should be able to do that with what results?  It would
 only produce a directly usable value if host endianness is the same as
 the emulated device's endianness.

 I'm not sure that host kernel native endianness is an accurate way of
 describing what currently happens.  Regardless of host or guest
 endianness, the guest should be swapping the value as necessary to
 ensure that the value that goes on the (real or emulated) bus is the
 same.

 Plus, if it were really as it would go on the bus the value wouldn't
 necessarily be left justified within data[], depending on how the bus
 works.

 How about a wording like this:

   The 'data' member contains, in its first 'len' bytes, the value as it
   would appear if the guest had accessed memory rather than I/O.

Scott, Yes! Thank you so much! Completely agree.
You managed elegantly to express in two lines,
what I failed to communicate with hundreds lines
of my emails in last few days on this and other
email thread.

- Victor

 -Scott


 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Peter Maydell
On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote:
 On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
 The KVM API documentation is not clear about the semantics of the data
 field on the mmio struct on the kvm_run struct.

 This has become problematic when supporting ARM guests on big-endian
 host systems with guests of both endianness types, because it is unclear
 how the data should be exported to user space.

 This should not break with existing implementations as all supported
 existing implementations of known user space applications (QEMU and
 kvmtools for virtio) only support default endianness of the
 architectures on the host side.

 Cc: Marc Zyngier marc.zyng...@arm.com
 Cc: Peter Maydell peter.mayd...@linaro.org
 Cc: Alexander Graf ag...@suse.de
 Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
 ---
 Changes [v1 - v2]:
  - s/host kernel should/host user space should/

  Documentation/virtual/kvm/api.txt | 5 +
  1 file changed, 5 insertions(+)

 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 366bf4b..6dbd68c 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could 
 not be satisfied
  by kvm.  The 'data' member contains the written data if 'is_write' is
  true, and should be filled by application code otherwise.

 +The 'data' member byte order is host kernel native endianness, regardless of
 +the endianness of the guest, and represents the the value as it would go on 
 the
 +bus in real hardware.  The host user space should always be able to do:
 +type val = *((type *)mmio.data).

 Host userspace should be able to do that with what results?  It would
 only produce a directly usable value if host endianness is the same as
 the emulated device's endianness.

With the result that it gets the value the CPU has sent out on
the bus as the memory transaction. Obviously if what userspace
is emulating is a bus which has a byteswapping bridge or if it's
being helpful to device emulation by providing here's the value
even though you think you're wired up backwards then it needs
to byteswap.

 I'm not sure that host kernel native endianness is an accurate way of
 describing what currently happens.  Regardless of host or guest
 endianness, the guest should be swapping the value as necessary to
 ensure that the value that goes on the (real or emulated) bus is the
 same.

I don't know why you're bringing the guest in here. Whether
the guest chooses to byteswap or not is IMHO not relevant.
What KVM and userspace need to combine to achieve is that
whatever the guest happens to have done causes the same result
it would on the real hardware. Whether the guest sends out a
write of 0x12345678 because it wrote 0x12345678 directly or
because it started with 0x87654321 and issued a byte-reverse
instruction doesn't matter.

 Plus, if it were really as it would go on the bus the value wouldn't
 necessarily be left justified within data[], depending on how the bus
 works.

The point is that the value in data[] is not as it would go on the bus,
but the value you get out by treating it as a host-native-endianness
value of the relevant size left-justified within data[] is the value as
it would go on the bus.

 How about a wording like this:

   The 'data' member contains, in its first 'len' bytes, the value as it
   would appear if the guest had accessed memory rather than I/O.

I think this is confusing, because now userspace authors have
to figure out how to get back to value X of size Y at address Z
by interpreting this text... Can you write out the equivalent of
Christoffer's text here's how you get the memory transaction
value for what you want?

(Also, value as it would appear to who?)

I think your wording implies that the order of bytes in data[] depend
on the guest CPU usual byte order, ie the order which the CPU
does not do a byte-lane-swap for (LE for ARM, BE for PPC),
and it would mean it would come out differently from
my/Alex/Christoffer's proposal if the host kernel was the opposite
endianness from that usual order.

Finally, I think it's a bit confusing in that as if the guest had
accessed memory is assigning implicit semantics to memory
in the emulated system, when memory is actually kind of outside
KVM's purview because it's not part of the CPU.

thanks
-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Christoffer Dall
On Sat, Jan 25, 2014 at 12:24:08AM +, Peter Maydell wrote:
 On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote:
  On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
  The KVM API documentation is not clear about the semantics of the data
  field on the mmio struct on the kvm_run struct.
 
  This has become problematic when supporting ARM guests on big-endian
  host systems with guests of both endianness types, because it is unclear
  how the data should be exported to user space.
 
  This should not break with existing implementations as all supported
  existing implementations of known user space applications (QEMU and
  kvmtools for virtio) only support default endianness of the
  architectures on the host side.
 
  Cc: Marc Zyngier marc.zyng...@arm.com
  Cc: Peter Maydell peter.mayd...@linaro.org
  Cc: Alexander Graf ag...@suse.de
  Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
  ---
  Changes [v1 - v2]:
   - s/host kernel should/host user space should/
 
   Documentation/virtual/kvm/api.txt | 5 +
   1 file changed, 5 insertions(+)
 
  diff --git a/Documentation/virtual/kvm/api.txt 
  b/Documentation/virtual/kvm/api.txt
  index 366bf4b..6dbd68c 100644
  --- a/Documentation/virtual/kvm/api.txt
  +++ b/Documentation/virtual/kvm/api.txt
  @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which 
  could not be satisfied
   by kvm.  The 'data' member contains the written data if 'is_write' is
   true, and should be filled by application code otherwise.
 
  +The 'data' member byte order is host kernel native endianness, regardless 
  of
  +the endianness of the guest, and represents the the value as it would go 
  on the
  +bus in real hardware.  The host user space should always be able to do:
  +type val = *((type *)mmio.data).
 
  Host userspace should be able to do that with what results?  It would
  only produce a directly usable value if host endianness is the same as
  the emulated device's endianness.
 
 With the result that it gets the value the CPU has sent out on
 the bus as the memory transaction. Obviously if what userspace
 is emulating is a bus which has a byteswapping bridge or if it's
 being helpful to device emulation by providing here's the value
 even though you think you're wired up backwards then it needs
 to byteswap.
 

Yes, because KVM emulates the CPU, it's interface should emulate the
external interface of the CPU, not CPU-internal bits.

The value is directly usable to emulate the bus to attached devices in
user space.

  I'm not sure that host kernel native endianness is an accurate way of
  describing what currently happens.  Regardless of host or guest
  endianness, the guest should be swapping the value as necessary to
  ensure that the value that goes on the (real or emulated) bus is the
  same.
 
 I don't know why you're bringing the guest in here. Whether
 the guest chooses to byteswap or not is IMHO not relevant.
 What KVM and userspace need to combine to achieve is that
 whatever the guest happens to have done causes the same result
 it would on the real hardware. Whether the guest sends out a
 write of 0x12345678 because it wrote 0x12345678 directly or
 because it started with 0x87654321 and issued a byte-reverse
 instruction doesn't matter.
 
  Plus, if it were really as it would go on the bus the value wouldn't
  necessarily be left justified within data[], depending on how the bus
  works.
 
 The point is that the value in data[] is not as it would go on the bus,
 but the value you get out by treating it as a host-native-endianness
 value of the relevant size left-justified within data[] is the value as
 it would go on the bus.
 
  How about a wording like this:
 
The 'data' member contains, in its first 'len' bytes, the value as it
would appear if the guest had accessed memory rather than I/O.
 
 I think this is confusing, because now userspace authors have
 to figure out how to get back to value X of size Y at address Z
 by interpreting this text... Can you write out the equivalent of
 Christoffer's text here's how you get the memory transaction
 value for what you want?
 
 (Also, value as it would appear to who?)
 
 I think your wording implies that the order of bytes in data[] depend
 on the guest CPU usual byte order, ie the order which the CPU
 does not do a byte-lane-swap for (LE for ARM, BE for PPC),
 and it would mean it would come out differently from
 my/Alex/Christoffer's proposal if the host kernel was the opposite
 endianness from that usual order.
 
 Finally, I think it's a bit confusing in that as if the guest had
 accessed memory is assigning implicit semantics to memory
 in the emulated system, when memory is actually kind of outside
 KVM's purview because it's not part of the CPU.
 

The as if the guest had accessed memory, would imply that user space
needs to look at the settings of the CPU (in the case of ARM the E-bit)
to understand which value the CPU intended for the memory 

Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Scott Wood
On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote:
 On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote:
  On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
  diff --git a/Documentation/virtual/kvm/api.txt 
  b/Documentation/virtual/kvm/api.txt
  index 366bf4b..6dbd68c 100644
  --- a/Documentation/virtual/kvm/api.txt
  +++ b/Documentation/virtual/kvm/api.txt
  @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which 
  could not be satisfied
   by kvm.  The 'data' member contains the written data if 'is_write' is
   true, and should be filled by application code otherwise.
 
  +The 'data' member byte order is host kernel native endianness, regardless 
  of
  +the endianness of the guest, and represents the the value as it would go 
  on the
  +bus in real hardware.  The host user space should always be able to do:
  +type val = *((type *)mmio.data).
 
  Host userspace should be able to do that with what results?  It would
  only produce a directly usable value if host endianness is the same as
  the emulated device's endianness.
 
 With the result that it gets the value the CPU has sent out on
 the bus as the memory transaction.

Doesn't that assume the host kernel endianness is the same as the bus
(or rather, that the host CPU would not swap such an access before it
hits the bus)?

If you take the same hardware and boot a little endian host kernel one
day, and a big endian host kernel the next, the bus doesn't change, and
neither should the bytewise (assuming address invariance) contents of
data[].  How data[] would look when read as a larger integer would of
course change -- but that's due to how you're reading it.

It's clear to say that a value in memory has been stored there in host
endianness when the value is as you would want to see it in a CPU
register, but it's less clear when you talk about it relative to values
on a bus.  It's harder to correlate that to something that is software
visible.

I don't think there's any actual technical difference between your
wording and mine when each wording is properly interpreted, but I
suspect my wording is less likely to be misinterpreted (I could be
wrong).

 Obviously if what userspace
 is emulating is a bus which has a byteswapping bridge or if it's
 being helpful to device emulation by providing here's the value
 even though you think you're wired up backwards then it needs
 to byteswap.

Whether the emulated bus has a byteswapping bridge doesn't sound like
something that depends on the endianness that the host CPU is currently
running in.

  How about a wording like this:
 
The 'data' member contains, in its first 'len' bytes, the value as it
would appear if the guest had accessed memory rather than I/O.
 
 I think this is confusing, because now userspace authors have
 to figure out how to get back to value X of size Y at address Z
 by interpreting this text... Can you write out the equivalent of
 Christoffer's text here's how you get the memory transaction
 value for what you want?

Userspace swaps the value if and only if userspace's endianness differs
from the endianness with which the device interprets the data
(regardless of whether said interpretation is considered natural or
swapped relative to the way the bus is documented).  It's similar to how
userspace would handle emulating DMA.

KVM swaps the value if and only if the endianness of the guest access
differs from that of the host, i.e. if it would have done swapping when
emulating an ordinary memory access.

 (Also, value as it would appear to who?)

As it would appear to anyone.  It works because data[] actually is
memory.  Any difference in how data appears based on the reader's
context would already be reflected when the reader performs the load.

 I think your wording implies that the order of bytes in data[] depend
 on the guest CPU usual byte order, ie the order which the CPU
 does not do a byte-lane-swap for (LE for ARM, BE for PPC),
 and it would mean it would come out differently from
 my/Alex/Christoffer's proposal if the host kernel was the opposite
 endianness from that usual order.

It doesn't depend on usual anything.  The only thing it implicitly
says about guest byte order is that it's KVM's job to implement any
swapping if the endianness of the guest access is different from the
endianness of the host kernel access (whether it's due to the guest's
mode, the way a page is mapped, the instruction used, etc).

 Finally, I think it's a bit confusing in that as if the guest had
 accessed memory is assigning implicit semantics to memory
 in the emulated system, when memory is actually kind of outside
 KVM's purview because it's not part of the CPU.

That's sort of the point.  It defines it in a way that is independent of
the CPU, and thus independent of what endianness the CPU operates in.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo 

Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Scott Wood
On Fri, 2014-01-24 at 17:56 -0800, Christoffer Dall wrote:
 On Sat, Jan 25, 2014 at 12:24:08AM +, Peter Maydell wrote:
  Finally, I think it's a bit confusing in that as if the guest had
  accessed memory is assigning implicit semantics to memory
  in the emulated system, when memory is actually kind of outside
  KVM's purview because it's not part of the CPU.
  
 
 The as if the guest had accessed memory, would imply that user space
 needs to look at the settings of the CPU (in the case of ARM the E-bit)
 to understand which value the CPU intended for the memory transaction.

No, it doesn't, just as userspace doesn't need to do that to interpret
DMA descriptors.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Alexander Graf

On 25.01.2014, at 02:58, Scott Wood scottw...@freescale.com wrote:

 On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote:
 On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote:
 On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 366bf4b..6dbd68c 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which 
 could not be satisfied
 by kvm.  The 'data' member contains the written data if 'is_write' is
 true, and should be filled by application code otherwise.
 
 +The 'data' member byte order is host kernel native endianness, regardless 
 of
 +the endianness of the guest, and represents the the value as it would go 
 on the
 +bus in real hardware.  The host user space should always be able to do:
 +type val = *((type *)mmio.data).
 
 Host userspace should be able to do that with what results?  It would
 only produce a directly usable value if host endianness is the same as
 the emulated device's endianness.
 
 With the result that it gets the value the CPU has sent out on
 the bus as the memory transaction.
 
 Doesn't that assume the host kernel endianness is the same as the bus
 (or rather, that the host CPU would not swap such an access before it
 hits the bus)?
 
 If you take the same hardware and boot a little endian host kernel one
 day, and a big endian host kernel the next, the bus doesn't change, and
 neither should the bytewise (assuming address invariance) contents of
 data[].  How data[] would look when read as a larger integer would of
 course change -- but that's due to how you're reading it.
 
 It's clear to say that a value in memory has been stored there in host
 endianness when the value is as you would want to see it in a CPU
 register, but it's less clear when you talk about it relative to values
 on a bus.  It's harder to correlate that to something that is software
 visible.
 
 I don't think there's any actual technical difference between your
 wording and mine when each wording is properly interpreted, but I
 suspect my wording is less likely to be misinterpreted (I could be
 wrong).
 
 Obviously if what userspace
 is emulating is a bus which has a byteswapping bridge or if it's
 being helpful to device emulation by providing here's the value
 even though you think you're wired up backwards then it needs
 to byteswap.
 
 Whether the emulated bus has a byteswapping bridge doesn't sound like
 something that depends on the endianness that the host CPU is currently
 running in.
 
 How about a wording like this:
 
  The 'data' member contains, in its first 'len' bytes, the value as it
  would appear if the guest had accessed memory rather than I/O.
 
 I think this is confusing, because now userspace authors have
 to figure out how to get back to value X of size Y at address Z
 by interpreting this text... Can you write out the equivalent of
 Christoffer's text here's how you get the memory transaction
 value for what you want?
 
 Userspace swaps the value if and only if userspace's endianness differs
 from the endianness with which the device interprets the data
 (regardless of whether said interpretation is considered natural or
 swapped relative to the way the bus is documented).  It's similar to how
 userspace would handle emulating DMA.
 
 KVM swaps the value if and only if the endianness of the guest access
 differs from that of the host, i.e. if it would have done swapping when
 emulating an ordinary memory access.
 
 (Also, value as it would appear to who?)
 
 As it would appear to anyone.  It works because data[] actually is
 memory.  Any difference in how data appears based on the reader's
 context would already be reflected when the reader performs the load.
 
 I think your wording implies that the order of bytes in data[] depend
 on the guest CPU usual byte order, ie the order which the CPU
 does not do a byte-lane-swap for (LE for ARM, BE for PPC),
 and it would mean it would come out differently from
 my/Alex/Christoffer's proposal if the host kernel was the opposite
 endianness from that usual order.
 
 It doesn't depend on usual anything.  The only thing it implicitly
 says about guest byte order is that it's KVM's job to implement any
 swapping if the endianness of the guest access is different from the
 endianness of the host kernel access (whether it's due to the guest's
 mode, the way a page is mapped, the instruction used, etc).
 
 Finally, I think it's a bit confusing in that as if the guest had
 accessed memory is assigning implicit semantics to memory
 in the emulated system, when memory is actually kind of outside
 KVM's purview because it's not part of the CPU.
 
 That's sort of the point.  It defines it in a way that is independent of
 the CPU, and thus independent of what endianness the CPU operates in.

Ok, let's go through the combinations for a 

Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Alexander Graf

On 25.01.2014, at 03:04, Scott Wood scottw...@freescale.com wrote:

 On Fri, 2014-01-24 at 17:56 -0800, Christoffer Dall wrote:
 On Sat, Jan 25, 2014 at 12:24:08AM +, Peter Maydell wrote:
 Finally, I think it's a bit confusing in that as if the guest had
 accessed memory is assigning implicit semantics to memory
 in the emulated system, when memory is actually kind of outside
 KVM's purview because it's not part of the CPU.
 
 
 The as if the guest had accessed memory, would imply that user space
 needs to look at the settings of the CPU (in the case of ARM the E-bit)
 to understand which value the CPU intended for the memory transaction.
 
 No, it doesn't, just as userspace doesn't need to do that to interpret
 DMA descriptors.

DMA descriptors are handled by specific pieces of hardware that access memory. 
Guests usually push data to memory in a layout so that it's natural to the 
device they want to talk to. MMIO is more tricky.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Christoffer Dall
On Sat, Jan 25, 2014 at 03:15:35AM +0100, Alexander Graf wrote:
 
 On 25.01.2014, at 02:58, Scott Wood scottw...@freescale.com wrote:
 
  On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote:
  On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote:
  On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
  diff --git a/Documentation/virtual/kvm/api.txt 
  b/Documentation/virtual/kvm/api.txt
  index 366bf4b..6dbd68c 100644
  --- a/Documentation/virtual/kvm/api.txt
  +++ b/Documentation/virtual/kvm/api.txt
  @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which 
  could not be satisfied
  by kvm.  The 'data' member contains the written data if 'is_write' is
  true, and should be filled by application code otherwise.
  
  +The 'data' member byte order is host kernel native endianness, 
  regardless of
  +the endianness of the guest, and represents the the value as it would 
  go on the
  +bus in real hardware.  The host user space should always be able to do:
  +type val = *((type *)mmio.data).
  
  Host userspace should be able to do that with what results?  It would
  only produce a directly usable value if host endianness is the same as
  the emulated device's endianness.
  
  With the result that it gets the value the CPU has sent out on
  the bus as the memory transaction.
  
  Doesn't that assume the host kernel endianness is the same as the bus
  (or rather, that the host CPU would not swap such an access before it
  hits the bus)?
  
  If you take the same hardware and boot a little endian host kernel one
  day, and a big endian host kernel the next, the bus doesn't change, and
  neither should the bytewise (assuming address invariance) contents of
  data[].  How data[] would look when read as a larger integer would of
  course change -- but that's due to how you're reading it.
  
  It's clear to say that a value in memory has been stored there in host
  endianness when the value is as you would want to see it in a CPU
  register, but it's less clear when you talk about it relative to values
  on a bus.  It's harder to correlate that to something that is software
  visible.
  
  I don't think there's any actual technical difference between your
  wording and mine when each wording is properly interpreted, but I
  suspect my wording is less likely to be misinterpreted (I could be
  wrong).
  
  Obviously if what userspace
  is emulating is a bus which has a byteswapping bridge or if it's
  being helpful to device emulation by providing here's the value
  even though you think you're wired up backwards then it needs
  to byteswap.
  
  Whether the emulated bus has a byteswapping bridge doesn't sound like
  something that depends on the endianness that the host CPU is currently
  running in.
  
  How about a wording like this:
  
   The 'data' member contains, in its first 'len' bytes, the value as it
   would appear if the guest had accessed memory rather than I/O.
  
  I think this is confusing, because now userspace authors have
  to figure out how to get back to value X of size Y at address Z
  by interpreting this text... Can you write out the equivalent of
  Christoffer's text here's how you get the memory transaction
  value for what you want?
  
  Userspace swaps the value if and only if userspace's endianness differs
  from the endianness with which the device interprets the data
  (regardless of whether said interpretation is considered natural or
  swapped relative to the way the bus is documented).  It's similar to how
  userspace would handle emulating DMA.
  
  KVM swaps the value if and only if the endianness of the guest access
  differs from that of the host, i.e. if it would have done swapping when
  emulating an ordinary memory access.
  
  (Also, value as it would appear to who?)
  
  As it would appear to anyone.  It works because data[] actually is
  memory.  Any difference in how data appears based on the reader's
  context would already be reflected when the reader performs the load.
  
  I think your wording implies that the order of bytes in data[] depend
  on the guest CPU usual byte order, ie the order which the CPU
  does not do a byte-lane-swap for (LE for ARM, BE for PPC),
  and it would mean it would come out differently from
  my/Alex/Christoffer's proposal if the host kernel was the opposite
  endianness from that usual order.
  
  It doesn't depend on usual anything.  The only thing it implicitly
  says about guest byte order is that it's KVM's job to implement any
  swapping if the endianness of the guest access is different from the
  endianness of the host kernel access (whether it's due to the guest's
  mode, the way a page is mapped, the instruction used, etc).
  
  Finally, I think it's a bit confusing in that as if the guest had
  accessed memory is assigning implicit semantics to memory
  in the emulated system, when memory is actually kind of outside
  KVM's purview because it's not part of the CPU.
  
  That's sort of the 

Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Victor Kamensky
On 24 January 2014 18:15, Alexander Graf ag...@suse.de wrote:

 On 25.01.2014, at 02:58, Scott Wood scottw...@freescale.com wrote:

 On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote:
 On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote:
 On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 366bf4b..6dbd68c 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which 
 could not be satisfied
 by kvm.  The 'data' member contains the written data if 'is_write' is
 true, and should be filled by application code otherwise.

 +The 'data' member byte order is host kernel native endianness, 
 regardless of
 +the endianness of the guest, and represents the the value as it would go 
 on the
 +bus in real hardware.  The host user space should always be able to do:
 +type val = *((type *)mmio.data).

 Host userspace should be able to do that with what results?  It would
 only produce a directly usable value if host endianness is the same as
 the emulated device's endianness.

 With the result that it gets the value the CPU has sent out on
 the bus as the memory transaction.

 Doesn't that assume the host kernel endianness is the same as the bus
 (or rather, that the host CPU would not swap such an access before it
 hits the bus)?

 If you take the same hardware and boot a little endian host kernel one
 day, and a big endian host kernel the next, the bus doesn't change, and
 neither should the bytewise (assuming address invariance) contents of
 data[].  How data[] would look when read as a larger integer would of
 course change -- but that's due to how you're reading it.

 It's clear to say that a value in memory has been stored there in host
 endianness when the value is as you would want to see it in a CPU
 register, but it's less clear when you talk about it relative to values
 on a bus.  It's harder to correlate that to something that is software
 visible.

 I don't think there's any actual technical difference between your
 wording and mine when each wording is properly interpreted, but I
 suspect my wording is less likely to be misinterpreted (I could be
 wrong).

 Obviously if what userspace
 is emulating is a bus which has a byteswapping bridge or if it's
 being helpful to device emulation by providing here's the value
 even though you think you're wired up backwards then it needs
 to byteswap.

 Whether the emulated bus has a byteswapping bridge doesn't sound like
 something that depends on the endianness that the host CPU is currently
 running in.

 How about a wording like this:

  The 'data' member contains, in its first 'len' bytes, the value as it
  would appear if the guest had accessed memory rather than I/O.

 I think this is confusing, because now userspace authors have
 to figure out how to get back to value X of size Y at address Z
 by interpreting this text... Can you write out the equivalent of
 Christoffer's text here's how you get the memory transaction
 value for what you want?

 Userspace swaps the value if and only if userspace's endianness differs
 from the endianness with which the device interprets the data
 (regardless of whether said interpretation is considered natural or
 swapped relative to the way the bus is documented).  It's similar to how
 userspace would handle emulating DMA.

 KVM swaps the value if and only if the endianness of the guest access
 differs from that of the host, i.e. if it would have done swapping when
 emulating an ordinary memory access.

 (Also, value as it would appear to who?)

 As it would appear to anyone.  It works because data[] actually is
 memory.  Any difference in how data appears based on the reader's
 context would already be reflected when the reader performs the load.

 I think your wording implies that the order of bytes in data[] depend
 on the guest CPU usual byte order, ie the order which the CPU
 does not do a byte-lane-swap for (LE for ARM, BE for PPC),
 and it would mean it would come out differently from
 my/Alex/Christoffer's proposal if the host kernel was the opposite
 endianness from that usual order.

 It doesn't depend on usual anything.  The only thing it implicitly
 says about guest byte order is that it's KVM's job to implement any
 swapping if the endianness of the guest access is different from the
 endianness of the host kernel access (whether it's due to the guest's
 mode, the way a page is mapped, the instruction used, etc).

 Finally, I think it's a bit confusing in that as if the guest had
 accessed memory is assigning implicit semantics to memory
 in the emulated system, when memory is actually kind of outside
 KVM's purview because it's not part of the CPU.

 That's sort of the point.  It defines it in a way that is independent of
 the CPU, and thus independent of what endianness the CPU operates in.

 Ok, 

Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Paolo Bonzini

Il 24/01/2014 01:01, Peter Maydell ha scritto:


 +The 'data' member byte order is host kernel native endianness, regardless of
 +the endianness of the guest, and represents the the value as it would go on 
the
 +bus in real hardware.  The host kernel should always be able to do:
 +type val = *((type *)mmio.data).

I think this would be better phrased as The host userspace should always,
since this documentation is supposed to be telling userspace what the
kernel's contract with it is, not the kernel keeping notes for itself on
its own implementation. (It also clarifies what the intention is for the
obscure and maybe-we'll-never-implement-this case of an LE host
kernel using a compatibility interface to run the host userspace (QEMU)
as a BE process which sees the same ABI a BE kernel provides,
without actually dragging that red herring explicitly into the documentation.)


I agree, and also the first line should mention userspace.

In PPC I think it's possible or even common to have BE host kernel and 
LE host userspace (or perhaps vice versa is the common one).


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Alexander Graf

On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 24/01/2014 01:01, Peter Maydell ha scritto:
 
  +The 'data' member byte order is host kernel native endianness, 
  regardless of
  +the endianness of the guest, and represents the the value as it would go 
  on the
  +bus in real hardware.  The host kernel should always be able to do:
  +type val = *((type *)mmio.data).
 I think this would be better phrased as The host userspace should always,
 since this documentation is supposed to be telling userspace what the
 kernel's contract with it is, not the kernel keeping notes for itself on
 its own implementation. (It also clarifies what the intention is for the
 obscure and maybe-we'll-never-implement-this case of an LE host
 kernel using a compatibility interface to run the host userspace (QEMU)
 as a BE process which sees the same ABI a BE kernel provides,
 without actually dragging that red herring explicitly into the 
 documentation.)
 
 I agree, and also the first line should mention userspace.
 
 In PPC I think it's possible or even common to have BE host kernel and LE 
 host userspace (or perhaps vice versa is the common one).

It was possible on 32bit, but I'm not sure anyone's actively using it :). The 
thing that was very common (not so much anymore for enterprise distros) is 
32-bit user space with 64-bit kernels.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Victor Kamensky
On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote:

 On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 24/01/2014 01:01, Peter Maydell ha scritto:
 
  +The 'data' member byte order is host kernel native endianness, 
  regardless of
  +the endianness of the guest, and represents the the value as it would 
  go on the
  +bus in real hardware.

Also if you use ints on real bus as description, you may want to clarify
restrictions on mmio.len. Basically on 32 bit platform (i.e like V7
ARM) one cannot have mmio.len=8, because one cannot have 64bit
value on 32bit data bus. Without such clarification introduction of
text like the value as it would go on the bus in real hardware is
confusing for len=8 for emulated CPUs where real busses are
32bit.

If ldrd/strd would be emulated on ARMV7 one would need to use
mmio twice to pass required data in either direction using len=4 ..

Thanks,
Victor

 The host kernel should always be able to do:
  +type val = *((type *)mmio.data).
 I think this would be better phrased as The host userspace should always,
 since this documentation is supposed to be telling userspace what the
 kernel's contract with it is, not the kernel keeping notes for itself on
 its own implementation. (It also clarifies what the intention is for the
 obscure and maybe-we'll-never-implement-this case of an LE host
 kernel using a compatibility interface to run the host userspace (QEMU)
 as a BE process which sees the same ABI a BE kernel provides,
 without actually dragging that red herring explicitly into the 
 documentation.)

 I agree, and also the first line should mention userspace.

 In PPC I think it's possible or even common to have BE host kernel and LE 
 host userspace (or perhaps vice versa is the common one).

 It was possible on 32bit, but I'm not sure anyone's actively using it :). The 
 thing that was very common (not so much anymore for enterprise distros) is 
 32-bit user space with 64-bit kernels.


 Alex


 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Paolo Bonzini

Il 24/01/2014 16:23, Victor Kamensky ha scritto:

Also if you use ints on real bus as description, you may want to clarify
restrictions on mmio.len. Basically on 32 bit platform (i.e like V7
ARM) one cannot have mmio.len=8, because one cannot have 64bit
value on 32bit data bus. Without such clarification introduction of
text like the value as it would go on the bus in real hardware is
confusing for len=8 for emulated CPUs where real busses are
32bit.


This is not necessarily true.  On a 32-bit CPU you can have a 64-bit 
memory bus.  Even x86 32-bit CPUs can do 64-bit MMIO via MMX or SSE or 
double-word compare-and-swap (CMPXCHG8B).


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Victor Kamensky
On 24 January 2014 07:32, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 24/01/2014 16:23, Victor Kamensky ha scritto:

 Also if you use ints on real bus as description, you may want to clarify
 restrictions on mmio.len. Basically on 32 bit platform (i.e like V7
 ARM) one cannot have mmio.len=8, because one cannot have 64bit
 value on 32bit data bus. Without such clarification introduction of
 text like the value as it would go on the bus in real hardware is
 confusing for len=8 for emulated CPUs where real busses are
 32bit.


 This is not necessarily true.  On a 32-bit CPU you can have a 64-bit memory
 bus.  Even x86 32-bit CPUs can do 64-bit MMIO via MMX or SSE or double-word
 compare-and-swap (CMPXCHG8B).

Sure, that was part of my point :). But now text says about real
hardware buses and in any given moment emulator specify particular
type of CPU emulated, and I am quite sure that we can find one
emulated ARMV7 CPU that would just have real 32bit data bus.

Note there was no such problem and nobody cares what real
data buses width are, with definition I argued for. That I think
that was original intent of '__u8  data[8]' use - just bytes array
description of how memory at given phys_addr looks before
(mmiois_write=0) or would look after (mmio.is_write = 1)
requested memory transaction (or several of them for that
matter) are emulated by KVM_EXIT_MMIO. When if comes
to devices memory read/write it is logical view of memory of
course. Such definition works with any real data buses sizes,
starting with byte.

But, ooh, well ... nobody understands such definition .. I
stand failed to explain it clear enough.

Thanks,
Victor

 Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Victor Kamensky
On 24 January 2014 07:23, Victor Kamensky victor.kamen...@linaro.org wrote:
 On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote:

 On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 24/01/2014 01:01, Peter Maydell ha scritto:
 
  +The 'data' member byte order is host kernel native endianness, 
  regardless of
  +the endianness of the guest, and represents the the value as it would 
  go on the
  +bus in real hardware.

Other thing with mmio.len need clarification: please specify what
mmio.len values are allowed. It seems that it was implied that len
could be only power of 2 in range between 1 and 8. I would rather see
that it is spelled out. I think code on all sides of KVM_EXIT_MMIO
already make such assumption. Now when you talk about integers
values on bus in real hardware power of 2 sizes implied Also note
for memory pieces with sizes other than 2, 4, 8 endianness is not
defined.


 Also if you use ints on real bus as description, you may want to clarify
 restrictions on mmio.len. Basically on 32 bit platform (i.e like V7
 ARM) one cannot have mmio.len=8, because one cannot have 64bit
 value on 32bit data bus. Without such clarification introduction of
 text like the value as it would go on the bus in real hardware is
 confusing for len=8 for emulated CPUs where real busses are
 32bit.

 If ldrd/strd would be emulated on ARMV7 one would need to use
 mmio twice to pass required data in either direction using len=4 ..

 Thanks,
 Victor

 The host kernel should always be able to do:
  +type val = *((type *)mmio.data).

would it be prudent to say that it is just integer type here.

Thanks,
Victor

 I think this would be better phrased as The host userspace should always,
 since this documentation is supposed to be telling userspace what the
 kernel's contract with it is, not the kernel keeping notes for itself on
 its own implementation. (It also clarifies what the intention is for the
 obscure and maybe-we'll-never-implement-this case of an LE host
 kernel using a compatibility interface to run the host userspace (QEMU)
 as a BE process which sees the same ABI a BE kernel provides,
 without actually dragging that red herring explicitly into the 
 documentation.)

 I agree, and also the first line should mention userspace.

 In PPC I think it's possible or even common to have BE host kernel and LE 
 host userspace (or perhaps vice versa is the common one).

 It was possible on 32bit, but I'm not sure anyone's actively using it :). 
 The thing that was very common (not so much anymore for enterprise distros) 
 is 32-bit user space with 64-bit kernels.


 Alex


 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Victor Kamensky
On 24 January 2014 05:13, Alexander Graf ag...@suse.de wrote:

 On 24.01.2014, at 14:09, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 24/01/2014 01:01, Peter Maydell ha scritto:
 
  +The 'data' member byte order is host kernel native endianness, 
  regardless of
  +the endianness of the guest, and represents the the value as it would 
  go on the
  +bus in real hardware.  The host kernel should always be able to do:
  +type val = *((type *)mmio.data).
 I think this would be better phrased as The host userspace should always,
 since this documentation is supposed to be telling userspace what the
 kernel's contract with it is, not the kernel keeping notes for itself on
 its own implementation. (It also clarifies what the intention is for the
 obscure and maybe-we'll-never-implement-this case of an LE host
 kernel using a compatibility interface to run the host userspace (QEMU)
 as a BE process which sees the same ABI a BE kernel provides,
 without actually dragging that red herring explicitly into the 
 documentation.)

 I agree, and also the first line should mention userspace.

 In PPC I think it's possible or even common to have BE host kernel and LE 
 host userspace (or perhaps vice versa is the common one).

 It was possible on 32bit, but I'm not sure anyone's actively using it :). The 
 thing that was very common (not so much anymore for enterprise distros) is 
 32-bit user space with 64-bit kernels.

Paolo, Alex, good point about BE kernel / LE user-land mix!

How KVM kernel code that deals with KVM_MMIO_EXIT can find
out what is user process endianity that handles this
KVM_MMIO_EXIT? Do we have kernel function that can tell
that. Hey, what is current user-land task endianity? :).
And reverse case, how user-land code that wants to do
KVM_MMIO_EXIT can find out what is endianity of kernel?
Do we have such system call? Hey, kernel tell me your
endianity :).

Just a thought: should not we instead of trying to implicitly
setup endianity by some other side properties like emulator or
kernel endianity, let's just do it explicitly. Adding 'endianness'
field into current mmio structure is not an option, but maybe
there is outside mechanism like KVM features, special ioctl
number, through which one can explicitly either set of learn
endianity in mmio.data[] for given KVM session.

At least, if we don't want to consider mixed BE/LE kernel/user-land,
then document should clarify that BE/LE kernel/user-land mix is not
supported and assumption here is that they always coincide.

Thanks,
Victor


 Alex


 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Scott Wood
On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
 The KVM API documentation is not clear about the semantics of the data
 field on the mmio struct on the kvm_run struct.
 
 This has become problematic when supporting ARM guests on big-endian
 host systems with guests of both endianness types, because it is unclear
 how the data should be exported to user space.
 
 This should not break with existing implementations as all supported
 existing implementations of known user space applications (QEMU and
 kvmtools for virtio) only support default endianness of the
 architectures on the host side.
 
 Cc: Marc Zyngier marc.zyng...@arm.com
 Cc: Peter Maydell peter.mayd...@linaro.org
 Cc: Alexander Graf ag...@suse.de
 Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
 ---
 Changes [v1 - v2]:
  - s/host kernel should/host user space should/
 
  Documentation/virtual/kvm/api.txt | 5 +
  1 file changed, 5 insertions(+)
 
 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 366bf4b..6dbd68c 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could 
 not be satisfied
  by kvm.  The 'data' member contains the written data if 'is_write' is
  true, and should be filled by application code otherwise.
  
 +The 'data' member byte order is host kernel native endianness, regardless of
 +the endianness of the guest, and represents the the value as it would go on 
 the
 +bus in real hardware.  The host user space should always be able to do:
 +type val = *((type *)mmio.data).

Host userspace should be able to do that with what results?  It would
only produce a directly usable value if host endianness is the same as
the emulated device's endianness.

I'm not sure that host kernel native endianness is an accurate way of
describing what currently happens.  Regardless of host or guest
endianness, the guest should be swapping the value as necessary to
ensure that the value that goes on the (real or emulated) bus is the
same.

Plus, if it were really as it would go on the bus the value wouldn't
necessarily be left justified within data[], depending on how the bus 
works.

How about a wording like this:

  The 'data' member contains, in its first 'len' bytes, the value as it
  would appear if the guest had accessed memory rather than I/O.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Victor Kamensky
On 24 January 2014 15:51, Scott Wood scottw...@freescale.com wrote:
 On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
 The KVM API documentation is not clear about the semantics of the data
 field on the mmio struct on the kvm_run struct.

 This has become problematic when supporting ARM guests on big-endian
 host systems with guests of both endianness types, because it is unclear
 how the data should be exported to user space.

 This should not break with existing implementations as all supported
 existing implementations of known user space applications (QEMU and
 kvmtools for virtio) only support default endianness of the
 architectures on the host side.

 Cc: Marc Zyngier marc.zyng...@arm.com
 Cc: Peter Maydell peter.mayd...@linaro.org
 Cc: Alexander Graf ag...@suse.de
 Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
 ---
 Changes [v1 - v2]:
  - s/host kernel should/host user space should/

  Documentation/virtual/kvm/api.txt | 5 +
  1 file changed, 5 insertions(+)

 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 366bf4b..6dbd68c 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could 
 not be satisfied
  by kvm.  The 'data' member contains the written data if 'is_write' is
  true, and should be filled by application code otherwise.

 +The 'data' member byte order is host kernel native endianness, regardless of
 +the endianness of the guest, and represents the the value as it would go on 
 the
 +bus in real hardware.  The host user space should always be able to do:
 +type val = *((type *)mmio.data).

 Host userspace should be able to do that with what results?  It would
 only produce a directly usable value if host endianness is the same as
 the emulated device's endianness.

 I'm not sure that host kernel native endianness is an accurate way of
 describing what currently happens.  Regardless of host or guest
 endianness, the guest should be swapping the value as necessary to
 ensure that the value that goes on the (real or emulated) bus is the
 same.

 Plus, if it were really as it would go on the bus the value wouldn't
 necessarily be left justified within data[], depending on how the bus
 works.

 How about a wording like this:

   The 'data' member contains, in its first 'len' bytes, the value as it
   would appear if the guest had accessed memory rather than I/O.

Scott, Yes! Thank you so much! Completely agree.
You managed elegantly to express in two lines,
what I failed to communicate with hundreds lines
of my emails in last few days on this and other
email thread.

- Victor

 -Scott


 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Peter Maydell
On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote:
 On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
 The KVM API documentation is not clear about the semantics of the data
 field on the mmio struct on the kvm_run struct.

 This has become problematic when supporting ARM guests on big-endian
 host systems with guests of both endianness types, because it is unclear
 how the data should be exported to user space.

 This should not break with existing implementations as all supported
 existing implementations of known user space applications (QEMU and
 kvmtools for virtio) only support default endianness of the
 architectures on the host side.

 Cc: Marc Zyngier marc.zyng...@arm.com
 Cc: Peter Maydell peter.mayd...@linaro.org
 Cc: Alexander Graf ag...@suse.de
 Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
 ---
 Changes [v1 - v2]:
  - s/host kernel should/host user space should/

  Documentation/virtual/kvm/api.txt | 5 +
  1 file changed, 5 insertions(+)

 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 366bf4b..6dbd68c 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which could 
 not be satisfied
  by kvm.  The 'data' member contains the written data if 'is_write' is
  true, and should be filled by application code otherwise.

 +The 'data' member byte order is host kernel native endianness, regardless of
 +the endianness of the guest, and represents the the value as it would go on 
 the
 +bus in real hardware.  The host user space should always be able to do:
 +type val = *((type *)mmio.data).

 Host userspace should be able to do that with what results?  It would
 only produce a directly usable value if host endianness is the same as
 the emulated device's endianness.

With the result that it gets the value the CPU has sent out on
the bus as the memory transaction. Obviously if what userspace
is emulating is a bus which has a byteswapping bridge or if it's
being helpful to device emulation by providing here's the value
even though you think you're wired up backwards then it needs
to byteswap.

 I'm not sure that host kernel native endianness is an accurate way of
 describing what currently happens.  Regardless of host or guest
 endianness, the guest should be swapping the value as necessary to
 ensure that the value that goes on the (real or emulated) bus is the
 same.

I don't know why you're bringing the guest in here. Whether
the guest chooses to byteswap or not is IMHO not relevant.
What KVM and userspace need to combine to achieve is that
whatever the guest happens to have done causes the same result
it would on the real hardware. Whether the guest sends out a
write of 0x12345678 because it wrote 0x12345678 directly or
because it started with 0x87654321 and issued a byte-reverse
instruction doesn't matter.

 Plus, if it were really as it would go on the bus the value wouldn't
 necessarily be left justified within data[], depending on how the bus
 works.

The point is that the value in data[] is not as it would go on the bus,
but the value you get out by treating it as a host-native-endianness
value of the relevant size left-justified within data[] is the value as
it would go on the bus.

 How about a wording like this:

   The 'data' member contains, in its first 'len' bytes, the value as it
   would appear if the guest had accessed memory rather than I/O.

I think this is confusing, because now userspace authors have
to figure out how to get back to value X of size Y at address Z
by interpreting this text... Can you write out the equivalent of
Christoffer's text here's how you get the memory transaction
value for what you want?

(Also, value as it would appear to who?)

I think your wording implies that the order of bytes in data[] depend
on the guest CPU usual byte order, ie the order which the CPU
does not do a byte-lane-swap for (LE for ARM, BE for PPC),
and it would mean it would come out differently from
my/Alex/Christoffer's proposal if the host kernel was the opposite
endianness from that usual order.

Finally, I think it's a bit confusing in that as if the guest had
accessed memory is assigning implicit semantics to memory
in the emulated system, when memory is actually kind of outside
KVM's purview because it's not part of the CPU.

thanks
-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Scott Wood
On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote:
 On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote:
  On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
  diff --git a/Documentation/virtual/kvm/api.txt 
  b/Documentation/virtual/kvm/api.txt
  index 366bf4b..6dbd68c 100644
  --- a/Documentation/virtual/kvm/api.txt
  +++ b/Documentation/virtual/kvm/api.txt
  @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which 
  could not be satisfied
   by kvm.  The 'data' member contains the written data if 'is_write' is
   true, and should be filled by application code otherwise.
 
  +The 'data' member byte order is host kernel native endianness, regardless 
  of
  +the endianness of the guest, and represents the the value as it would go 
  on the
  +bus in real hardware.  The host user space should always be able to do:
  +type val = *((type *)mmio.data).
 
  Host userspace should be able to do that with what results?  It would
  only produce a directly usable value if host endianness is the same as
  the emulated device's endianness.
 
 With the result that it gets the value the CPU has sent out on
 the bus as the memory transaction.

Doesn't that assume the host kernel endianness is the same as the bus
(or rather, that the host CPU would not swap such an access before it
hits the bus)?

If you take the same hardware and boot a little endian host kernel one
day, and a big endian host kernel the next, the bus doesn't change, and
neither should the bytewise (assuming address invariance) contents of
data[].  How data[] would look when read as a larger integer would of
course change -- but that's due to how you're reading it.

It's clear to say that a value in memory has been stored there in host
endianness when the value is as you would want to see it in a CPU
register, but it's less clear when you talk about it relative to values
on a bus.  It's harder to correlate that to something that is software
visible.

I don't think there's any actual technical difference between your
wording and mine when each wording is properly interpreted, but I
suspect my wording is less likely to be misinterpreted (I could be
wrong).

 Obviously if what userspace
 is emulating is a bus which has a byteswapping bridge or if it's
 being helpful to device emulation by providing here's the value
 even though you think you're wired up backwards then it needs
 to byteswap.

Whether the emulated bus has a byteswapping bridge doesn't sound like
something that depends on the endianness that the host CPU is currently
running in.

  How about a wording like this:
 
The 'data' member contains, in its first 'len' bytes, the value as it
would appear if the guest had accessed memory rather than I/O.
 
 I think this is confusing, because now userspace authors have
 to figure out how to get back to value X of size Y at address Z
 by interpreting this text... Can you write out the equivalent of
 Christoffer's text here's how you get the memory transaction
 value for what you want?

Userspace swaps the value if and only if userspace's endianness differs
from the endianness with which the device interprets the data
(regardless of whether said interpretation is considered natural or
swapped relative to the way the bus is documented).  It's similar to how
userspace would handle emulating DMA.

KVM swaps the value if and only if the endianness of the guest access
differs from that of the host, i.e. if it would have done swapping when
emulating an ordinary memory access.

 (Also, value as it would appear to who?)

As it would appear to anyone.  It works because data[] actually is
memory.  Any difference in how data appears based on the reader's
context would already be reflected when the reader performs the load.

 I think your wording implies that the order of bytes in data[] depend
 on the guest CPU usual byte order, ie the order which the CPU
 does not do a byte-lane-swap for (LE for ARM, BE for PPC),
 and it would mean it would come out differently from
 my/Alex/Christoffer's proposal if the host kernel was the opposite
 endianness from that usual order.

It doesn't depend on usual anything.  The only thing it implicitly
says about guest byte order is that it's KVM's job to implement any
swapping if the endianness of the guest access is different from the
endianness of the host kernel access (whether it's due to the guest's
mode, the way a page is mapped, the instruction used, etc).

 Finally, I think it's a bit confusing in that as if the guest had
 accessed memory is assigning implicit semantics to memory
 in the emulated system, when memory is actually kind of outside
 KVM's purview because it's not part of the CPU.

That's sort of the point.  It defines it in a way that is independent of
the CPU, and thus independent of what endianness the CPU operates in.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More 

Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Alexander Graf

On 25.01.2014, at 03:04, Scott Wood scottw...@freescale.com wrote:

 On Fri, 2014-01-24 at 17:56 -0800, Christoffer Dall wrote:
 On Sat, Jan 25, 2014 at 12:24:08AM +, Peter Maydell wrote:
 Finally, I think it's a bit confusing in that as if the guest had
 accessed memory is assigning implicit semantics to memory
 in the emulated system, when memory is actually kind of outside
 KVM's purview because it's not part of the CPU.
 
 
 The as if the guest had accessed memory, would imply that user space
 needs to look at the settings of the CPU (in the case of ARM the E-bit)
 to understand which value the CPU intended for the memory transaction.
 
 No, it doesn't, just as userspace doesn't need to do that to interpret
 DMA descriptors.

DMA descriptors are handled by specific pieces of hardware that access memory. 
Guests usually push data to memory in a layout so that it's natural to the 
device they want to talk to. MMIO is more tricky.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Specify byte order for KVM_EXIT_MMIO

2014-01-24 Thread Christoffer Dall
On Sat, Jan 25, 2014 at 03:15:35AM +0100, Alexander Graf wrote:
 
 On 25.01.2014, at 02:58, Scott Wood scottw...@freescale.com wrote:
 
  On Sat, 2014-01-25 at 00:24 +, Peter Maydell wrote:
  On 24 January 2014 23:51, Scott Wood scottw...@freescale.com wrote:
  On Fri, 2014-01-24 at 15:39 -0800, Christoffer Dall wrote:
  diff --git a/Documentation/virtual/kvm/api.txt 
  b/Documentation/virtual/kvm/api.txt
  index 366bf4b..6dbd68c 100644
  --- a/Documentation/virtual/kvm/api.txt
  +++ b/Documentation/virtual/kvm/api.txt
  @@ -2565,6 +2565,11 @@ executed a memory-mapped I/O instruction which 
  could not be satisfied
  by kvm.  The 'data' member contains the written data if 'is_write' is
  true, and should be filled by application code otherwise.
  
  +The 'data' member byte order is host kernel native endianness, 
  regardless of
  +the endianness of the guest, and represents the the value as it would 
  go on the
  +bus in real hardware.  The host user space should always be able to do:
  +type val = *((type *)mmio.data).
  
  Host userspace should be able to do that with what results?  It would
  only produce a directly usable value if host endianness is the same as
  the emulated device's endianness.
  
  With the result that it gets the value the CPU has sent out on
  the bus as the memory transaction.
  
  Doesn't that assume the host kernel endianness is the same as the bus
  (or rather, that the host CPU would not swap such an access before it
  hits the bus)?
  
  If you take the same hardware and boot a little endian host kernel one
  day, and a big endian host kernel the next, the bus doesn't change, and
  neither should the bytewise (assuming address invariance) contents of
  data[].  How data[] would look when read as a larger integer would of
  course change -- but that's due to how you're reading it.
  
  It's clear to say that a value in memory has been stored there in host
  endianness when the value is as you would want to see it in a CPU
  register, but it's less clear when you talk about it relative to values
  on a bus.  It's harder to correlate that to something that is software
  visible.
  
  I don't think there's any actual technical difference between your
  wording and mine when each wording is properly interpreted, but I
  suspect my wording is less likely to be misinterpreted (I could be
  wrong).
  
  Obviously if what userspace
  is emulating is a bus which has a byteswapping bridge or if it's
  being helpful to device emulation by providing here's the value
  even though you think you're wired up backwards then it needs
  to byteswap.
  
  Whether the emulated bus has a byteswapping bridge doesn't sound like
  something that depends on the endianness that the host CPU is currently
  running in.
  
  How about a wording like this:
  
   The 'data' member contains, in its first 'len' bytes, the value as it
   would appear if the guest had accessed memory rather than I/O.
  
  I think this is confusing, because now userspace authors have
  to figure out how to get back to value X of size Y at address Z
  by interpreting this text... Can you write out the equivalent of
  Christoffer's text here's how you get the memory transaction
  value for what you want?
  
  Userspace swaps the value if and only if userspace's endianness differs
  from the endianness with which the device interprets the data
  (regardless of whether said interpretation is considered natural or
  swapped relative to the way the bus is documented).  It's similar to how
  userspace would handle emulating DMA.
  
  KVM swaps the value if and only if the endianness of the guest access
  differs from that of the host, i.e. if it would have done swapping when
  emulating an ordinary memory access.
  
  (Also, value as it would appear to who?)
  
  As it would appear to anyone.  It works because data[] actually is
  memory.  Any difference in how data appears based on the reader's
  context would already be reflected when the reader performs the load.
  
  I think your wording implies that the order of bytes in data[] depend
  on the guest CPU usual byte order, ie the order which the CPU
  does not do a byte-lane-swap for (LE for ARM, BE for PPC),
  and it would mean it would come out differently from
  my/Alex/Christoffer's proposal if the host kernel was the opposite
  endianness from that usual order.
  
  It doesn't depend on usual anything.  The only thing it implicitly
  says about guest byte order is that it's KVM's job to implement any
  swapping if the endianness of the guest access is different from the
  endianness of the host kernel access (whether it's due to the guest's
  mode, the way a page is mapped, the instruction used, etc).
  
  Finally, I think it's a bit confusing in that as if the guest had
  accessed memory is assigning implicit semantics to memory
  in the emulated system, when memory is actually kind of outside
  KVM's purview because it's not part of the CPU.
  
  That's sort of the