[Bug 73331] New: Nested Virtualization, L2 cannot boot up on Ivybridge and Haswell

2014-04-01 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=73331

Bug ID: 73331
   Summary: Nested Virtualization, L2 cannot boot up on Ivybridge
and Haswell
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.14.0-rc3
  Hardware: x86-64
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
  Assignee: virtualization_...@kernel-bugs.osdl.org
  Reporter: robert...@intel.com
CC: jan.kis...@web.de
Regression: No

Environment:

Host OS (ia32/ia32e/IA64):ia32e
Guest OS (ia32/ia32e/IA64):ia32e
Guest OS Type (Linux/Windows):Linux
kvm.git Commit:7227fc006b0df2c0d2966a7f4859b01bdf74
qemu.git Commit: 8648fcd52a9bcc2aa415cbe87b7c636e545acb38
Host Kernel Version:3.14.0-rc3
Hardware:Ivytown_EP, Grantley-EP

Bug detailed description:
--
create L2 guest, L2 guest cannot boot up on Ivytown.


Reproduce steps:

1.create L1 guests:
qemu-system-x86_64 --enable-kvm -m 10240 -smp 8 -net
nic,macaddr=00:12:45:67:2B:1C -net tap,script=/etc/kvm/qemu-ifup
nested-kvm-rhel6u4.qcow -cpu host
2. create L2 guests:
qemu-system-x86_64 -enable-kvm -m 1024 -smp 2 -net none rhel6u4.qcow


Current result:

L2 guest cannot boot up fine

Expected result:

L2 guest boot up fine.

Basic root-causing log:
--
[root@rhel6u4 ~]# qemu-system-x86_64 -enable-kvm -m 1024 -smp 2 -net none
rhel6u4.qcow
VNC server running on `::1:5900'
KVM: entry failed, hardware error 0x0
EAX= EBX= ECX= EDX=0623
ESI= EDI= EBP= ESP=
EIP=fff0 EFL=0002 [---] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =   9300
CS =f000 000f  f300
SS =   f300
DS =   9300
FS =   9300
GS =   9300
LDT=   8200
TR =   8b00
GDT=  
IDT=  
CR0=6010 CR2= CR3= CR4=
DR0= DR1= DR2=
DR3=
DR6=0ff0 DR7=0400
EFER=
Code=ff 66 89 c8 66 5b 66 5e 66 5f 66 5d 66 c3 fb f4 fa fc 66 c3 ea 5b e0 00
f0 30 36 2f 32 33 2f 39 39 00 fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-04-01 Thread Peter Maydell
On 31 March 2014 20:57, Michael Roth mdr...@linux.vnet.ibm.com wrote:
 Agreed, though I feel a little weird about creating releases for tags that
 aren't in the official repo. Would that be acceptable from a community
 stand-point? I'm honestly not sure.

 Otherwise I think Anthony/Peter would probably still need to process a pull
 for stable-y.x branch in advance before we do the tarball/release. Would
 still help simplify things a bit though by keeping tasks compartmentalized.

 Anthony, Peter: in the past, prior to release, I just sent an email with
 a pointer to my github branch with the stable release tagged. Would a proper
 pull request (with a for-stable-x.y tag or somesuch) be preferable?

The simplest approach seems to me to be giving you commit access
to master so you can directly push release tags to it as part of
making a release.

thanks
-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4

2014-04-01 Thread Paolo Bonzini

Il 01/04/2014 03:17, Wu, Feng ha scritto:

A fault is considered
  +  * as a SMAP violation if all of the following
  +  * conditions are ture:
  +  *   - X86_CR4_SMAP is set in CR4
  +  *   - An user page is accessed
  +  *   - Page fault in kernel mode
  +  *   - !(CPL3  X86_EFLAGS_AC is set)


 - if CPL  3, EFLAGS.AC is clear

Should it be if CPL =3 or EFLAGS.AC is clear ?



What I meant is if CPL  3, EFLAGS.AC must also be clear, but your 
version is fine too.  You choose!


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2] ARM VM System Specification

2014-04-01 Thread Ian Campbell
On Mon, 2014-03-31 at 10:26 -0700, Christoffer Dall wrote:
 However, that may be a pain to update

Very much, and there would be nothing worse than having multiple
divergent copies of the spec.

  so the preferred method could be
 to host it in either its current form (clear-text) or publish some
 versioned PDF somewhere and simply point to it from the compliant
 software pieces.

I think that would be best.

I'm at a loss to suggest a suitable neutral home though I'm afraid.

Ian.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2] ARM VM System Specification

2014-04-01 Thread Michael Casadevall
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1



On 04/01/2014 05:49 AM, Ian Campbell wrote:
 On Mon, 2014-03-31 at 10:26 -0700, Christoffer Dall wrote:
 However, that may be a pain to update
 
 Very much, and there would be nothing worse than having multiple 
 divergent copies of the spec.
 
 so the preferred method could be to host it in either its current
 form (clear-text) or publish some versioned PDF somewhere and
 simply point to it from the compliant software pieces.
 
 I think that would be best.
 
 I'm at a loss to suggest a suitable neutral home though I'm
 afraid.
 

I agree with formal PDF (or TeX/plain text) releases + versioning;
something I can print out and hit people with always works great for
spec compliance.

On the topic of hosting, if Linaro itself isn't neutral enough, what
about under the Linux Foundation? They already host a bunch of
standards like the FHS.
Michael
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJTOo2JAAoJEGKVRgSEnX1Qu+gH/ja+aqOqc04gbpFvuarHtzeM
nLtJ1ie4JKOJTVqqQz+lgPhx+r3H9PrMFZnGvzs8H6GYSaU66i/SHKU150J1Ig5M
1w4DYqQj7DEClpoNmpPZVJJbevp70/wDH1wFSJZXJmxV9yOpMtrNZXwB60KANP2f
MJDF+iDLIkv8vRPYQJEmuV4PMcJQs88X91eEb67jChX1ntMiiDAFSM6t2cRyfVG0
la4Ujk/p/u+Kr43wbhQITOKXZpWCtI8gjp7wTuZc+EDIhmufd5uiQ2VyTZaMkPkV
xuX9871Y8VHGxCkAmhHPODbp3Wd4VUP2ewOYnTRYR2IIRpa4gNNLUa+MSdrL/sU=
=2ub7
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Public key for qemu 1.7.1

2014-04-01 Thread Alex Davis
Where is the public key for qemu-1.7.1? I can't find it on qemu.org, or 
anywhere else.

I code, therefore I am
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2] ARM VM System Specification

2014-04-01 Thread Grant Likely
Start with Linaro publishing it. It can always be moved to another
venue if someone dislikes Linaro having maintainership.

g.


On Tue, Apr 1, 2014 at 10:57 AM, Michael Casadevall
michael.casadev...@linaro.org wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1



 On 04/01/2014 05:49 AM, Ian Campbell wrote:
 On Mon, 2014-03-31 at 10:26 -0700, Christoffer Dall wrote:
 However, that may be a pain to update

 Very much, and there would be nothing worse than having multiple
 divergent copies of the spec.

 so the preferred method could be to host it in either its current
 form (clear-text) or publish some versioned PDF somewhere and
 simply point to it from the compliant software pieces.

 I think that would be best.

 I'm at a loss to suggest a suitable neutral home though I'm
 afraid.


 I agree with formal PDF (or TeX/plain text) releases + versioning;
 something I can print out and hit people with always works great for
 spec compliance.

 On the topic of hosting, if Linaro itself isn't neutral enough, what
 about under the Linux Foundation? They already host a bunch of
 standards like the FHS.
 Michael
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.14 (GNU/Linux)
 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

 iQEcBAEBAgAGBQJTOo2JAAoJEGKVRgSEnX1Qu+gH/ja+aqOqc04gbpFvuarHtzeM
 nLtJ1ie4JKOJTVqqQz+lgPhx+r3H9PrMFZnGvzs8H6GYSaU66i/SHKU150J1Ig5M
 1w4DYqQj7DEClpoNmpPZVJJbevp70/wDH1wFSJZXJmxV9yOpMtrNZXwB60KANP2f
 MJDF+iDLIkv8vRPYQJEmuV4PMcJQs88X91eEb67jChX1ntMiiDAFSM6t2cRyfVG0
 la4Ujk/p/u+Kr43wbhQITOKXZpWCtI8gjp7wTuZc+EDIhmufd5uiQ2VyTZaMkPkV
 xuX9871Y8VHGxCkAmhHPODbp3Wd4VUP2ewOYnTRYR2IIRpa4gNNLUa+MSdrL/sU=
 =2ub7
 -END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4

2014-04-01 Thread Wu, Feng


 -Original Message-
 From: Paolo Bonzini [mailto:pbonz...@redhat.com]
 Sent: Tuesday, April 01, 2014 5:27 PM
 To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org
 Subject: Re: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4
 
 Il 01/04/2014 03:17, Wu, Feng ha scritto:
  A fault is considered
+  * as a SMAP violation if all of the 
following
+  * conditions are ture:
+  *   - X86_CR4_SMAP is set in CR4
+  *   - An user page is accessed
+  *   - Page fault in kernel mode
+  *   - !(CPL3  X86_EFLAGS_AC is set)
  
   - if CPL  3, EFLAGS.AC is clear
  Should it be if CPL =3 or EFLAGS.AC is clear ?
 
 
 What I meant is if CPL  3, EFLAGS.AC must also be clear, but your
 version is fine too.  You choose!

Thanks for your comments! :)
BTW: I tested nested today, this patch set works well for nested virtualization!
 
 Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenfda for 2014-04-01

2014-04-01 Thread Juan Quintela
Juan Quintela quint...@redhat.com wrote:
 Hi

 Please, send any topic that you are interested in covering.

 Thanks, Juan.

 Call details:

 10:00 AM to 11:00 AM EDT
 Every two weeks

Time clarification.  This time was wrong, it is 1h early.

15:00 CEST
13:00 UTC
09:00 EDT

Sorry for the inconvenience (I copy  paste from and old email, and  did
it wrong.)

Later, Juan.


 If you need phone number details,  contact me privately.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-04-01 Thread Stefan Hajnoczi
On Mon, Mar 31, 2014 at 09:51:23PM -0300, Alejandro Comisario wrote:
 Again, thanks to everyone.

Did you reach a conclusion or is there still a problem that might be a
bug in KVM?

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-04-01 Thread Alejandro Comisario
The conclusion is that the backing file stored on NFS that is the same
for all 950 hosts / 10500 guests was deleted and immediatelly raised a
read-only filesystem on the guest, seems that there's no way to avoid
that.

We developed a script to recover from that scenario if the same happens.
Basically doing:

* virsh stop
* qemu-ndb connect
* fsck
* qemu-ndb disconnect
* virsh start

best regards.


Alejandro Comisario
MercadoLibre Cloud Services
Arias 3751, Piso 7 (C1430CRG)
Ciudad de Buenos Aires - Argentina
Cel: +549(11) 15-3770-1857
Tel : +54(11) 4640-8443


On Tue, Apr 1, 2014 at 10:52 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Mon, Mar 31, 2014 at 09:51:23PM -0300, Alejandro Comisario wrote:
 Again, thanks to everyone.

 Did you reach a conclusion or is there still a problem that might be a
 bug in KVM?

 Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device

2014-04-01 Thread Christian Borntraeger
From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com

Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains
configuration and control attributes of particular vm.
The device is created by KVM_CREATE_DEVICE ioctl.
The attributes may be retrieved and stored by calling
KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls.

Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 Documentation/virtual/kvm/devices/s390_config.txt |  13 +++
 arch/s390/include/asm/kvm_host.h  |   6 +
 arch/s390/include/uapi/asm/kvm.h  |  11 ++
 arch/s390/kvm/Makefile|   2 +-
 arch/s390/kvm/config.c| 133 ++
 arch/s390/kvm/kvm-s390.c  |   7 ++
 arch/s390/kvm/kvm-s390.h  |   4 +
 arch/s390/kvm/priv.c  |   6 +-
 include/linux/kvm_host.h  |   1 +
 include/uapi/linux/kvm.h  |   1 +
 virt/kvm/kvm_main.c   |   3 +
 11 files changed, 185 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/s390_config.txt
 create mode 100644 arch/s390/kvm/config.c

diff --git a/Documentation/virtual/kvm/devices/s390_config.txt 
b/Documentation/virtual/kvm/devices/s390_config.txt
new file mode 100644
index 000..f9bc251
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/s390_config.txt
@@ -0,0 +1,13 @@
+S390 Config Device
+==
+
+The s390 config device is used for storage and retrieval of various kinds
+of configuration related information which should be shared between kvm
+and userspace.
+
+Groups:
+KVM_DEV_CONFIG_GROUP
+
+Attributes for KVM_DEV_CONFIG_GROUP:
+KVM_DEV_CONFIG_NAME - The name for this vm instance. A string of at
+  most 128 characters.
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 154b600..da93b6e 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -277,6 +277,11 @@ struct s390_io_adapter {
 #define MAX_S390_IO_ADAPTERS ((MAX_ISC + 1) * 8)
 #define MAX_S390_ADAPTER_MAPS 256
 
+struct kvm_s390_config {
+   struct mutex lock;
+   struct kvm_s390_attr_name name;
+};
+
 struct kvm_arch{
struct sca_block *sca;
debug_info_t *dbf;
@@ -286,6 +291,7 @@ struct kvm_arch{
int css_support;
int use_irqchip;
struct s390_io_adapter *adapters[MAX_S390_IO_ADAPTERS];
+   struct kvm_s390_config *cfg;
 };
 
 #define KVM_HVA_ERR_BAD(-1UL)
diff --git a/arch/s390/include/uapi/asm/kvm.h b/arch/s390/include/uapi/asm/kvm.h
index c003c6a..d53dd45 100644
--- a/arch/s390/include/uapi/asm/kvm.h
+++ b/arch/s390/include/uapi/asm/kvm.h
@@ -79,6 +79,17 @@ struct kvm_debug_exit_arch {
 struct kvm_guest_debug_arch {
 };
 
+
+
+/* config device specific GROUPS */
+
+#define KVM_DEV_CONFIG_GROUP   0
+#define KVM_DEV_CONFIG_NAME0
+
+struct kvm_s390_attr_name {
+   char name[128];
+};
+
 #define KVM_SYNC_PREFIX (1UL  0)
 #define KVM_SYNC_GPRS   (1UL  1)
 #define KVM_SYNC_ACRS   (1UL  2)
diff --git a/arch/s390/kvm/Makefile b/arch/s390/kvm/Makefile
index d3adb37..79f130b 100644
--- a/arch/s390/kvm/Makefile
+++ b/arch/s390/kvm/Makefile
@@ -11,5 +11,5 @@ common-objs = $(KVM)/kvm_main.o $(KVM)/eventfd.o  
$(KVM)/async_pf.o $(KVM)/irqch
 
 ccflags-y := -Ivirt/kvm -Iarch/s390/kvm
 
-kvm-objs := $(common-objs) kvm-s390.o intercept.o interrupt.o priv.o sigp.o 
diag.o
+kvm-objs := $(common-objs) kvm-s390.o intercept.o interrupt.o priv.o sigp.o 
diag.o config.o
 obj-$(CONFIG_KVM) += kvm.o
diff --git a/arch/s390/kvm/config.c b/arch/s390/kvm/config.c
new file mode 100644
index 000..f887116
--- /dev/null
+++ b/arch/s390/kvm/config.c
@@ -0,0 +1,133 @@
+/*
+ * handling CONFIG kvm device
+ *
+ * Copyright IBM Corp. 2014
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License (version 2 only)
+ * as published by the Free Software Foundation.
+ *
+ *Author(s): Ekaterina Tumanova tuman...@linux.vnet.ibm.com
+ *   Christian Borntraeger borntrae...@de.ibm.com
+ */
+#include linux/kvm.h
+#include linux/kvm_host.h
+#include asm/uaccess.h
+#include kvm-s390.h
+
+static const struct kvm_s390_config config_defaults = {
+   .name = { KVMguest },
+};
+
+/*
+ * This is called from architecture code, since we access config even if
+ * userspace does not create a config device.
+ */
+struct kvm_s390_config *kvm_s390_config_init(void)
+{
+   struct kvm_s390_config *config;
+
+   config = kmalloc(sizeof(struct kvm_s390_config), GFP_KERNEL);
+   if (config) {
+   memcpy(config, config_defaults, sizeof(*config));
+   mutex_init(config-lock);
+   }
+
+   return config;
+}
+
+void kvm_s390_config_free(struct 

[PATCH/RFC] s390x/kvm: implement and use QEMU config device for s390

2014-04-01 Thread Christian Borntraeger
From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com

The following patch adds Qemu CONFIG device,
which interacts with kvm CONFIG device by calling KVM_CREATE_DEVICE
(to create the device in kernel), KVM_GET_DEVICE_ATTR and
KVM_SET_DEVICE_ATTR (to get and set the particular attributes of
KVM CONFIG device respectively).
This patch uses the Qemu CONFIG device to copy the
qemu_name (if specified) into the kvm CONFIG device.

Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Michael Mueller m...@linux.vnet.ibm.com
Signed-off-by: Jens Freimann jf...@linux.vnet.ibm.com
---
 default-configs/s390x-softmmu.mak |   1 +
 hw/s390x/Makefile.objs|   1 +
 hw/s390x/config.c | 113 ++
 hw/s390x/s390-virtio-ccw.c|   6 ++
 hw/s390x/s390-virtio.c|  24 
 hw/s390x/s390-virtio.h|   1 +
 include/hw/s390x/config.h |  51 +
 linux-headers/asm-s390/kvm.h  |  11 
 linux-headers/linux/kvm.h |   1 +
 trace-events  |   4 ++
 10 files changed, 213 insertions(+)
 create mode 100644 hw/s390x/config.c
 create mode 100644 include/hw/s390x/config.h

diff --git a/default-configs/s390x-softmmu.mak 
b/default-configs/s390x-softmmu.mak
index d843dc0..a07697d 100644
--- a/default-configs/s390x-softmmu.mak
+++ b/default-configs/s390x-softmmu.mak
@@ -1,3 +1,4 @@
 CONFIG_VIRTIO=y
 CONFIG_SCLPCONSOLE=y
 CONFIG_S390_FLIC=$(CONFIG_KVM)
+CONFIG_S390_CONFIG=$(CONFIG_KVM)
diff --git a/hw/s390x/Makefile.objs b/hw/s390x/Makefile.objs
index 1ba6c3a..dc03e1d 100644
--- a/hw/s390x/Makefile.objs
+++ b/hw/s390x/Makefile.objs
@@ -8,3 +8,4 @@ obj-y += ipl.o
 obj-y += css.o
 obj-y += s390-virtio-ccw.o
 obj-y += virtio-ccw.o
+obj-$(CONFIG_S390_CONFIG) += config.o
diff --git a/hw/s390x/config.c b/hw/s390x/config.c
new file mode 100644
index 000..f70e880
--- /dev/null
+++ b/hw/s390x/config.c
@@ -0,0 +1,113 @@
+/*
+ * S390 configuration and control device
+ *
+ * Copyright 2014 IBM Corp.
+ * Author(s): Ekaterina Tumanova tuman...@linux.vnet.ibm.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#include sys/ioctl.h
+#include hw/s390x/config.h
+#include trace.h
+
+S390ConfigState *get_config_state(void)
+{
+return S390_CONFIG(object_resolve_path(/machine/s390-config, NULL));
+}
+
+static int config_set(S390ConfigState *config, struct kvm_device_attr *attr)
+{
+int rc;
+
+rc = ioctl(config-fd, KVM_SET_DEVICE_ATTR, attr);
+return rc  0 ? -errno : rc;
+}
+
+/*
+static int config_get(S390ConfigState *config, struct kvm_device_attr *attr)
+{
+int rc;
+
+rc = ioctl(config-fd, KVM_GET_DEVICE_ATTR, attr);
+return rc  0 ? -errno : rc;
+}
+*/
+
+static int config_set_name(S390ConfigState *config, char *name)
+{
+struct kvm_device_attr attr = {
+.group = KVM_DEV_CONFIG_GROUP,
+.attr  = KVM_DEV_CONFIG_NAME,
+.addr  = (uint64_t)name,
+};
+
+if (!name ||
+(qemu_strnlen(name, S390_MACHINE_NAME_SIZE)
+ = S390_MACHINE_NAME_SIZE)) {
+return -EINVAL;
+}
+
+return config_set(config, attr);
+}
+
+static void s390_config_realize(DeviceState *dev, Error **errp)
+{
+S390ConfigState *cfg = S390_CONFIG(dev);
+struct kvm_create_device cd = {0};
+int ret;
+
+if (!kvm_check_extension(kvm_state, KVM_CAP_DEVICE_CTRL)) {
+trace_config_no_device_api(errno);
+return;
+}
+
+cd.type = KVM_DEV_TYPE_S390_CONFIG;
+ret = kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, cd);
+if (ret  0) {
+trace_config_create_device(errno);
+return;
+}
+cfg-fd = cd.fd;
+}
+
+void s390_config_dev_init(void)
+{
+DeviceState *dev;
+
+if (kvm_enabled()) {
+dev  = qdev_create(NULL, TYPE_S390_CONFIG);
+
+object_property_add_child(qdev_get_machine(),
+  TYPE_S390_CONFIG,
+  OBJECT(dev), NULL);
+qdev_init_nofail(dev);
+}
+}
+
+static void s390_config_class_init(ObjectClass *oc, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(oc);
+S390ConfigClass *scc = S390_CONFIG_CLASS(oc);
+
+dc-realize = s390_config_realize;
+scc-set_name = config_set_name;
+}
+
+static const TypeInfo s390_config_info = {
+.name  = TYPE_S390_CONFIG,
+.parent= TYPE_SYS_BUS_DEVICE,
+.instance_size = sizeof(S390ConfigState),
+.class_init= s390_config_class_init,
+.class_size= sizeof(S390ConfigClass),
+};
+
+
+static void s390_config_register_types(void)
+{
+type_register_static(s390_config_info);
+}
+
+type_init(s390_config_register_types)
diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index 0d4f6ae..1e4b9ad 100644
--- a/hw/s390x/s390-virtio-ccw.c

[PATCH/RFC] s390: Provide a configuration and control device

2014-04-01 Thread Christian Borntraeger
We want to configure several things in KVM that go beyond what
ENABLE_CAP (we need payload) or ONE_REG (we need it for the VM
and we need to do more complex actions) can provide. Instead of
adding several s390 specific ioctls, lets provide a configuration
and control device that encapsulates different commands into
groups of the same area (MEMORY, CPU, ..)

We also provide an initial nameless base group, with a simple first
user to set the guest name. We need that name in the kernel for
the emulation of STSI (which provides the guest name to the guest)
but we need to implement the emulation in supervisor mode, as it
also provides the underlying levels of hipervisors.

Currently we have the following GROUPS and ATTRs pending, which
configure some memory management related function or allow to set
the guest facilities, cpuids etc:

#define KVM_DEV_CONFIG_GROUP0
#define KVM_DEV_CONFIG_NAME 0

#define KVM_DEV_CONFIG_GROUP_MEM1
#define KVM_DEV_CONFIG_MEM_ENABLE_CMMA  0
#define KVM_DEV_CONFIG_MEM_CLR_CMMA 1
#define KVM_DEV_CONFIG_MEM_CLR_PAGES2

#define KVM_DEV_CONFIG_GROUP_CPU2
#define KVM_DEV_CONFIG_CPU_TYPE 0
#define KVM_DEV_CONFIG_CPU_FAC  1
#define KVM_DEV_CONFIG_CPU_FAC_MASK 2
#define KVM_DEV_CONFIG_CPU_IBC  3
#define KVM_DEV_CONFIG_CPU_IBC_RANGE4



In addition other groups like 
#define KVM_DEV_CONFIG_GROUP_CRYPTO 
are under consideration to configure crypto acceleration.

Unless there is a major concern, I will add this to the next
s390 PULL requests for KVM.

Christian

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device

2014-04-01 Thread Alexander Graf

On 04/01/2014 04:47 PM, Christian Borntraeger wrote:

From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com

Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains
configuration and control attributes of particular vm.
The device is created by KVM_CREATE_DEVICE ioctl.
The attributes may be retrieved and stored by calling
KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls.

Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com


I don't think a device is particularly the best fit. A device can 
usually be instantiated multiple times. The configuration device can 
only be created once. A device also gets created by user space which 
enables it to receive the fd to drive it. Your device has to be created 
during VM creation.


I think VM configuration is common enough to just make this a separate 
interface.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] s390: Provide a configuration and control device

2014-04-01 Thread Alexander Graf

On 04/01/2014 04:47 PM, Christian Borntraeger wrote:

We want to configure several things in KVM that go beyond what
ENABLE_CAP (we need payload) or ONE_REG (we need it for the VM
and we need to do more complex actions) can provide. Instead of
adding several s390 specific ioctls, lets provide a configuration
and control device that encapsulates different commands into
groups of the same area (MEMORY, CPU, ..)

We also provide an initial nameless base group, with a simple first
user to set the guest name. We need that name in the kernel for
the emulation of STSI (which provides the guest name to the guest)
but we need to implement the emulation in supervisor mode, as it
also provides the underlying levels of hipervisors.

Currently we have the following GROUPS and ATTRs pending, which
configure some memory management related function or allow to set
the guest facilities, cpuids etc:

#define KVM_DEV_CONFIG_GROUP0
#define KVM_DEV_CONFIG_NAME 0

#define KVM_DEV_CONFIG_GROUP_MEM1
#define KVM_DEV_CONFIG_MEM_ENABLE_CMMA  0
#define KVM_DEV_CONFIG_MEM_CLR_CMMA 1
#define KVM_DEV_CONFIG_MEM_CLR_PAGES2

#define KVM_DEV_CONFIG_GROUP_CPU2
#define KVM_DEV_CONFIG_CPU_TYPE 0
#define KVM_DEV_CONFIG_CPU_FAC  1
#define KVM_DEV_CONFIG_CPU_FAC_MASK 2
#define KVM_DEV_CONFIG_CPU_IBC  3
#define KVM_DEV_CONFIG_CPU_IBC_RANGE4


Why would CPU specific information be set in the VM?


Alex





In addition other groups like
#define KVM_DEV_CONFIG_GROUP_CRYPTO
are under consideration to configure crypto acceleration.

Unless there is a major concern, I will add this to the next
s390 PULL requests for KVM.

Christian



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device

2014-04-01 Thread Christian Borntraeger
On 01/04/14 16:58, Alexander Graf wrote:
 On 04/01/2014 04:47 PM, Christian Borntraeger wrote:
 From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com

 Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains
 configuration and control attributes of particular vm.
 The device is created by KVM_CREATE_DEVICE ioctl.
 The attributes may be retrieved and stored by calling
 KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls.

 Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
 Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
 
 I don't think a device is particularly the best fit. A device can usually be 
 instantiated multiple times. The configuration device can only be created 
 once. A device also gets created by user space which enables it to receive 
 the fd to drive it. Your device has to be created during VM creation.

I remember some discussion a year or 2 ago, and IIRC a config device
was actually your idea ;-) (The other idea that we had, was ONE_REG for the VM)

 
 I think VM configuration is common enough to just make this a separate 
 interface.

So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, instead?
Seems like an easy enough change. Would you reuse the kvm_attr structure for 
that?

 
 
 Alex
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device

2014-04-01 Thread Alexander Graf

On 04/01/2014 05:04 PM, Christian Borntraeger wrote:

On 01/04/14 16:58, Alexander Graf wrote:

On 04/01/2014 04:47 PM, Christian Borntraeger wrote:

From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com

Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains
configuration and control attributes of particular vm.
The device is created by KVM_CREATE_DEVICE ioctl.
The attributes may be retrieved and stored by calling
KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls.

Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com

I don't think a device is particularly the best fit. A device can usually be 
instantiated multiple times. The configuration device can only be created once. 
A device also gets created by user space which enables it to receive the fd to 
drive it. Your device has to be created during VM creation.

I remember some discussion a year or 2 ago, and IIRC a config device
was actually your idea ;-) (The other idea that we had, was ONE_REG for the VM)


Omg, really? :o

A device would make sense for a specific system information 
instruction trap that we handle in-kernel for whatever reason (usually 
because it's performance critical) and some mandatory say to make sure 
user space always creates it. And some checks to make sure it can't get 
created twice.


Speaking of which, why don't we just forward STSI to user space with an 
ENABLE_CAP and handle all of this there? It's not performance critical 
at all, right?



I think VM configuration is common enough to just make this a separate 
interface.

So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, instead?
Seems like an easy enough change. Would you reuse the kvm_attr structure for 
that?


Yeah, reuse whatever we can. Basically just remove the device 
boilerplate - I don't think it's impressively useful for a non-device.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6] KVM: export mark_page_dirty_in_slot

2014-04-01 Thread Paolo Bonzini
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c  | 12 +---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ba3180a5e6d3..56289b487ddd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -582,6 +582,7 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, 
gfn_t gfn);
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
 
 void kvm_vcpu_block(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index bf2fa29c8beb..98833e21d580 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -99,8 +99,6 @@ static void update_memslots(struct kvm_memslots *slots,
struct kvm_memory_slot *new, u64 last_generation);
 
 static void kvm_release_pfn_dirty(pfn_t pfn);
-static void mark_page_dirty_in_slot(struct kvm *kvm,
-   struct kvm_memory_slot *memslot, gfn_t gfn);
 
 bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -1583,7 +1581,7 @@ int kvm_write_guest_cached(struct kvm *kvm, struct 
gfn_to_hva_cache *ghc,
r = __copy_to_user((void __user *)ghc-hva, data, len);
if (r)
return -EFAULT;
-   mark_page_dirty_in_slot(kvm, ghc-memslot, ghc-gpa  PAGE_SHIFT);
+   mark_page_dirty_in_slot(ghc-memslot, ghc-gpa  PAGE_SHIFT);
 
return 0;
 }
@@ -1641,9 +1639,8 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned 
long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-static void mark_page_dirty_in_slot(struct kvm *kvm,
-   struct kvm_memory_slot *memslot,
-   gfn_t gfn)
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
+ gfn_t gfn)
 {
if (memslot  memslot-dirty_bitmap) {
unsigned long rel_gfn = gfn - memslot-base_gfn;
@@ -1651,13 +1648,14 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
set_bit_le(rel_gfn, memslot-dirty_bitmap);
}
 }
+EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);
 
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 {
struct kvm_memory_slot *memslot;
 
memslot = gfn_to_memslot(kvm, gfn);
-   mark_page_dirty_in_slot(kvm, memslot, gfn);
+   mark_page_dirty_in_slot(memslot, gfn);
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty);
 
-- 
1.8.3.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/6] KVM: emulate: introduce memory_prepare callback to speed up memory access

2014-04-01 Thread Paolo Bonzini
Emulating a RMW instruction currently walks the page tables
twice.  To avoid this, store the virtual address in the host
and access it directly.

In fact, it turns out that the optimizations we can do
actually benefit all memory accesses.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/include/asm/kvm_emulate.h | 26 +++
 arch/x86/kvm/x86.c | 67 ++
 include/linux/kvm_host.h   |  5 +++
 virt/kvm/kvm_main.c|  5 ---
 4 files changed, 98 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index f7b1e45eb753..4a580be2553e 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -167,6 +167,32 @@ struct x86_emulate_ops {
const void *new,
unsigned int bytes,
struct x86_exception *fault);
+
+   /*
+* memory_prepare: Prepare userspace access fastpath.
+*   @addr: [IN ] Linear address to access.
+*   @bytes:[IN ] Number of bytes to access.
+*   @write:[IN ] True if *p_hva will be written to.
+*   @p_opaque: [OUT] Value passed back to memory_finish.
+*   @p_hva:[OUT] Host virtual address for __copy_from/to_user.
+*/
+   int (*memory_prepare)(struct x86_emulate_ctxt *ctxt,
+ unsigned long addr,
+ unsigned int bytes,
+ struct x86_exception *exception,
+ bool write,
+ void **p_opaque,
+ unsigned long *p_hva);
+
+   /*
+* memory_finish: Complete userspace access fastpath.
+*   @opaque: [OUT] Value passed back from memory_prepare.
+*   @hva:[OUT] Host virtual address computed in memory_prepare.
+*/
+   void (*memory_finish)(struct x86_emulate_ctxt *ctxt,
+ void *p_opaque,
+ unsigned long p_hva);
+
void (*invlpg)(struct x86_emulate_ctxt *ctxt, ulong addr);
 
int (*pio_in_emulated)(struct x86_emulate_ctxt *ctxt,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e70a0e3227b1..64d0f171996b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4273,6 +4273,71 @@ static const struct read_write_emulator_ops 
write_emultor = {
.write = true,
 };
 
+static int emulator_memory_prepare(struct x86_emulate_ctxt *ctxt,
+  unsigned long addr,
+  unsigned int bytes,
+  struct x86_exception *exception,
+  bool write,
+  void **p_opaque,
+  unsigned long *p_hva)
+{
+   struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+   struct kvm_memory_slot *memslot;
+   int ret;
+   gpa_t gpa;
+   gfn_t gfn;
+   unsigned long hva;
+   
+   if (unlikely(((addr + bytes - 1) ^ addr)  PAGE_MASK))
+   goto no_hva;
+
+   ret = vcpu_mmio_gva_to_gpa(vcpu, addr, gpa, exception, true);
+   if (ret != 0) {
+   if (ret  0)
+   return X86EMUL_PROPAGATE_FAULT;
+   goto no_hva;
+   }
+
+   /* A (heavily) simplified version of kvm_gfn_to_hva_cache_init.  */
+   gfn = gpa  PAGE_SHIFT;
+   memslot = gfn_to_memslot(vcpu-kvm, gfn);
+   if (!memslot)
+   goto no_hva;
+
+   if (write) {
+   if (memslot_is_readonly(memslot))
+   goto no_hva;
+
+   *p_opaque = memslot-dirty_bitmap ? memslot : NULL;
+   }
+
+   hva = __gfn_to_hva_memslot(memslot, gfn);
+   if (kvm_is_error_hva(hva))
+   goto no_hva;
+
+   *p_hva = hva + offset_in_page(gpa);
+   return X86EMUL_CONTINUE;
+
+no_hva:
+   *p_hva = KVM_HVA_ERR_BAD;
+   return X86EMUL_CONTINUE;
+}
+
+static void emulator_memory_finish(struct x86_emulate_ctxt *ctxt,
+  void *opaque,
+  unsigned long hva)
+{
+   struct kvm_memory_slot *memslot;
+   gfn_t gfn;
+
+   if (!opaque)
+   return;
+
+   memslot = opaque;
+   gfn = hva_to_gfn_memslot(hva, memslot);
+   mark_page_dirty_in_slot(memslot, gfn);
+}
+
 static int emulator_read_write_onepage(unsigned long addr, void *val,
   unsigned int bytes,
   struct x86_exception *exception,
@@ -4816,6 +4881,8 @@ static const struct x86_emulate_ops emulate_ops = {
.read_std= kvm_read_guest_virt_system,
.write_std   = kvm_write_guest_virt_system,
.fetch   = kvm_fetch_guest_virt,
+   .memory_prepare   

[PATCH 2/6] KVM: emulate: abstract handling of memory operands

2014-04-01 Thread Paolo Bonzini
Abstract the pre-execution processing and writeback of memory operands
in new functions.  We will soon do some work before execution even for
move destination, so call the function in that case too; but not for
the memory operand of lea, invlpg etc.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/kvm/emulate.c | 43 ---
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index b42184eccbcc..c7ef72c1289e 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1545,6 +1545,29 @@ exception:
return X86EMUL_PROPAGATE_FAULT;
 }
 
+static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt,
+ struct operand *op)
+{
+   return segmented_read(ctxt, op-addr.mem, op-val, op-bytes);
+}
+
+static int cmpxchg_memory_operand(struct x86_emulate_ctxt *ctxt,
+ struct operand *op)
+{
+   return segmented_cmpxchg(ctxt, op-addr.mem,
+op-orig_val,
+op-val,
+op-bytes);
+}
+
+static int write_memory_operand(struct x86_emulate_ctxt *ctxt,
+   struct operand *op)
+{
+   return segmented_write(ctxt, op-addr.mem,
+  op-val,
+  op-bytes);
+}
+
 static void write_register_operand(struct operand *op)
 {
/* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
@@ -1572,16 +1595,9 @@ static int writeback(struct x86_emulate_ctxt *ctxt, 
struct operand *op)
break;
case OP_MEM:
if (ctxt-lock_prefix)
-   return segmented_cmpxchg(ctxt,
-op-addr.mem,
-op-orig_val,
-op-val,
-op-bytes);
+   return cmpxchg_memory_operand(ctxt, op);
else
-   return segmented_write(ctxt,
-  op-addr.mem,
-  op-val,
-  op-bytes);
+   return write_memory_operand(ctxt, op);
break;
case OP_MEM_STR:
return segmented_write(ctxt,
@@ -4588,16 +4604,14 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
}
 
if ((ctxt-src.type == OP_MEM)  !(ctxt-d  NoAccess)) {
-   rc = segmented_read(ctxt, ctxt-src.addr.mem,
-   ctxt-src.valptr, ctxt-src.bytes);
+   rc = prepare_memory_operand(ctxt, ctxt-src);
if (rc != X86EMUL_CONTINUE)
goto done;
ctxt-src.orig_val64 = ctxt-src.val64;
}
 
if (ctxt-src2.type == OP_MEM) {
-   rc = segmented_read(ctxt, ctxt-src2.addr.mem,
-   ctxt-src2.val, ctxt-src2.bytes);
+   rc = prepare_memory_operand(ctxt, ctxt-src2);
if (rc != X86EMUL_CONTINUE)
goto done;
}
@@ -4608,8 +4622,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
 
if ((ctxt-dst.type == OP_MEM)  !(ctxt-d  Mov)) {
/* optimisation - avoid slow emulated read if Mov */
-   rc = segmented_read(ctxt, ctxt-dst.addr.mem,
-  ctxt-dst.val, ctxt-dst.bytes);
+   rc = prepare_memory_operand(ctxt, ctxt-dst);
if (rc != X86EMUL_CONTINUE)
goto done;
}
-- 
1.8.3.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/6] KVM: emulate: activate memory access optimization

2014-04-01 Thread Paolo Bonzini
memory_prepare lets us replace segmented_read/segmented_write with direct
calls to __copy_from_user/__copy_to_user.

This saves about 70 cycles (15%) on arithmetic with a memory source
operand, and about 150 cycles (25%) on arithmetic with a memory
destination operand.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/include/asm/kvm_emulate.h |  2 ++
 arch/x86/kvm/emulate.c | 50 ++
 2 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index 4a580be2553e..a572d4fabd4f 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -263,6 +263,8 @@ struct operand {
u64 mm_val;
void *data;
};
+   unsigned long hva;
+   void *opaque;
 };
 
 struct fetch_cache {
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c7ef72c1289e..2c881e5cf5ad 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1546,9 +1546,37 @@ exception:
 }
 
 static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt,
- struct operand *op)
+ struct operand *op,
+ bool write)
 {
-   return segmented_read(ctxt, op-addr.mem, op-val, op-bytes);
+   int rc;
+   unsigned long gva;
+   unsigned int size = op-bytes;
+
+   rc = linearize(ctxt, op-addr.mem, size, write, gva);
+   if (rc != X86EMUL_CONTINUE)
+   return rc;
+
+   rc = ctxt-ops-memory_prepare(ctxt, gva, size,
+   ctxt-exception, true,
+   op-opaque, op-hva);
+   if (rc != X86EMUL_CONTINUE)
+   return rc;
+
+   if (likely(!kvm_is_error_hva(op-hva))) {
+   rc = __copy_from_user(op-val, (void __user *)op-hva,
+ size);
+   if (!write)
+   ctxt-ops-memory_finish(ctxt, op-opaque, op-hva);
+
+   if (likely(!rc))
+   return X86EMUL_CONTINUE;
+
+   /* Should not happen.  */
+   op-hva = KVM_HVA_ERR_BAD;
+   }
+
+   return read_emulated(ctxt, gva, op-val, size);
 }
 
 static int cmpxchg_memory_operand(struct x86_emulate_ctxt *ctxt,
@@ -1563,6 +1591,17 @@ static int cmpxchg_memory_operand(struct 
x86_emulate_ctxt *ctxt,
 static int write_memory_operand(struct x86_emulate_ctxt *ctxt,
struct operand *op)
 {
+   int rc;
+
+   if (likely(!kvm_is_error_hva(op-hva))) {
+   rc = __copy_to_user((void __user *)op-hva, op-val,
+   op-bytes);
+   ctxt-ops-memory_finish(ctxt, op-opaque, op-hva);
+
+   if (likely(!rc))
+   return X86EMUL_CONTINUE;
+   }
+
return segmented_write(ctxt, op-addr.mem,
   op-val,
   op-bytes);
@@ -4604,14 +4643,14 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
}
 
if ((ctxt-src.type == OP_MEM)  !(ctxt-d  NoAccess)) {
-   rc = prepare_memory_operand(ctxt, ctxt-src);
+   rc = prepare_memory_operand(ctxt, ctxt-src, false);
if (rc != X86EMUL_CONTINUE)
goto done;
ctxt-src.orig_val64 = ctxt-src.val64;
}
 
if (ctxt-src2.type == OP_MEM) {
-   rc = prepare_memory_operand(ctxt, ctxt-src2);
+   rc = prepare_memory_operand(ctxt, ctxt-src2, false);
if (rc != X86EMUL_CONTINUE)
goto done;
}
@@ -4622,7 +4661,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
 
if ((ctxt-dst.type == OP_MEM)  !(ctxt-d  Mov)) {
/* optimisation - avoid slow emulated read if Mov */
-   rc = prepare_memory_operand(ctxt, ctxt-dst);
+   rc = prepare_memory_operand(ctxt, ctxt-dst,
+   !(ctxt-d  NoWrite));
if (rc != X86EMUL_CONTINUE)
goto done;
}
-- 
1.8.3.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 0/6] KVM: x86: speedups for emulator memory accesses

2014-04-01 Thread Paolo Bonzini
Another emulator speedup series, shaving up to 400 cycles (25%) off
RMW instructions.

The performance of various instructions is now relatively flat:

 jump  919  (down from 2300)
 move  1075 (down from 2700)
 arith 1081 (down from 2600)
 load  1267 (down from 2800, 1400 after previous round)
 store 1213 (down from 2900, 1300 after previous round)
 RMW   1310 (down from 3200, 1700 after previous round)

The next low-hanging fruit is fetching instructions and initializing
the context.  Similar optimizations to those done here could be made
for instruction fetch.

Paolo Bonzini (6):
  KVM: emulate: simplify writeback
  KVM: emulate: abstract handling of memory operands
  KVM: export mark_page_dirty_in_slot
  KVM: emulate: introduce memory_prepare callback to speed up memory access
  KVM: emulate: activate memory access optimization
  KVM: emulate: extend memory access optimization to stores

 arch/x86/include/asm/kvm_emulate.h |  28 ++
 arch/x86/kvm/emulate.c | 107 +++--
 arch/x86/kvm/x86.c |  67 +++
 include/linux/kvm_host.h   |   6 +++
 virt/kvm/kvm_main.c|  17 ++
 5 files changed, 185 insertions(+), 40 deletions(-)

-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6] KVM: emulate: simplify writeback

2014-04-01 Thread Paolo Bonzini
The if/return checks are useless, because we return X86EMUL_CONTINUE
anyway if we do not return.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/kvm/emulate.c | 28 +++-
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 4a3584d419e5..b42184eccbcc 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1566,34 +1566,28 @@ static void write_register_operand(struct operand *op)
 
 static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op)
 {
-   int rc;
-
switch (op-type) {
case OP_REG:
write_register_operand(op);
break;
case OP_MEM:
if (ctxt-lock_prefix)
-   rc = segmented_cmpxchg(ctxt,
+   return segmented_cmpxchg(ctxt,
+op-addr.mem,
+op-orig_val,
+op-val,
+op-bytes);
+   else
+   return segmented_write(ctxt,
   op-addr.mem,
-  op-orig_val,
   op-val,
   op-bytes);
-   else
-   rc = segmented_write(ctxt,
-op-addr.mem,
-op-val,
-op-bytes);
-   if (rc != X86EMUL_CONTINUE)
-   return rc;
break;
case OP_MEM_STR:
-   rc = segmented_write(ctxt,
-   op-addr.mem,
-   op-data,
-   op-bytes * op-count);
-   if (rc != X86EMUL_CONTINUE)
-   return rc;
+   return segmented_write(ctxt,
+  op-addr.mem,
+  op-data,
+  op-bytes * op-count);
break;
case OP_XMM:
write_sse_reg(ctxt, op-vec_val, op-addr.xmm);
-- 
1.8.3.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6] KVM: emulate: extend memory access optimization to stores

2014-04-01 Thread Paolo Bonzini
Even on a store, the optimization saves about 20 clock cycles (4-5%),
mostly because the jump in write_memory_operand becomes much more
predictable.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/kvm/emulate.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 2c881e5cf5ad..541296f62b40 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1547,7 +1547,7 @@ exception:
 
 static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt,
  struct operand *op,
- bool write)
+ bool read, bool write)
 {
int rc;
unsigned long gva;
@@ -1563,6 +1563,10 @@ static int prepare_memory_operand(struct 
x86_emulate_ctxt *ctxt,
if (rc != X86EMUL_CONTINUE)
return rc;
 
+   /* optimisation - avoid slow emulated read if Mov */
+   if (!read)
+   return X86EMUL_CONTINUE;
+
if (likely(!kvm_is_error_hva(op-hva))) {
rc = __copy_from_user(op-val, (void __user *)op-hva,
  size);
@@ -4643,14 +4647,14 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
}
 
if ((ctxt-src.type == OP_MEM)  !(ctxt-d  NoAccess)) {
-   rc = prepare_memory_operand(ctxt, ctxt-src, false);
+   rc = prepare_memory_operand(ctxt, ctxt-src, true, false);
if (rc != X86EMUL_CONTINUE)
goto done;
ctxt-src.orig_val64 = ctxt-src.val64;
}
 
if (ctxt-src2.type == OP_MEM) {
-   rc = prepare_memory_operand(ctxt, ctxt-src2, false);
+   rc = prepare_memory_operand(ctxt, ctxt-src2, true, false);
if (rc != X86EMUL_CONTINUE)
goto done;
}
@@ -4659,9 +4663,9 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
goto special_insn;
 
 
-   if ((ctxt-dst.type == OP_MEM)  !(ctxt-d  Mov)) {
-   /* optimisation - avoid slow emulated read if Mov */
+   if (ctxt-dst.type == OP_MEM) {
rc = prepare_memory_operand(ctxt, ctxt-dst,
+   !(ctxt-d  Mov),
!(ctxt-d  NoWrite));
if (rc != X86EMUL_CONTINUE)
goto done;
-- 
1.8.3.1
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM call for agenda 2014-04-01

2014-04-01 Thread Juan Quintela


2014-04-01
--

- port of virtio to vmstate
  * problem are lists (virtio-net/virtio-serial)

Relaese process
- Delay release for a few days to be able to test the tarball.
- Michael Rooth do the realese
- Stefan get used as realse manager backup

- Dates agreed:

2014-04-01  Tag v2.0.0-rc1
2014-04-08  Tag v2.0.0 


Conference call
- we sent calls for agenda one day before, 2 weeks before, 1 week before
  and not a lot of calls.  What should we do to improve?
  Plan is to sent: one week before and a reminder the day before
- Qemu is better modularized nowadays, and less changes involve the whole tree
- If you have any suggestion for the conference call told me.


Stefan virtio property patches

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM call for agenda 2014-04-01

2014-04-01 Thread Juan Quintela


2014-04-01
--

- port of virtio to vmstate
  * problem are lists (virtio-net/virtio-serial)

Relaese process
- Delay release for a few days to be able to test the tarball.
- Michael Rooth do the realese
- Stefan get used as realse manager backup

- Dates agreed:

2014-04-01  Tag v2.0.0-rc1
2014-04-08  Tag v2.0.0 


Conference call
- we sent calls for agenda one day before, 2 weeks before, 1 week before
  and not a lot of calls.  What should we do to improve?
  Plan is to sent: one week before and a reminder the day before
- Qemu is better modularized nowadays, and less changes involve the whole tree
- If you have any suggestion for the conference call told me.


Stefan virtio property patches

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] s390x/kvm: implement and use QEMU config device for s390

2014-04-01 Thread Paolo Bonzini

Il 01/04/2014 16:47, Christian Borntraeger ha scritto:

From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com

The following patch adds Qemu CONFIG device,
which interacts with kvm CONFIG device by calling KVM_CREATE_DEVICE
(to create the device in kernel), KVM_GET_DEVICE_ATTR and
KVM_SET_DEVICE_ATTR (to get and set the particular attributes of
KVM CONFIG device respectively).
This patch uses the Qemu CONFIG device to copy the
qemu_name (if specified) into the kvm CONFIG device.

Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Michael Mueller m...@linux.vnet.ibm.com
Signed-off-by: Jens Freimann jf...@linux.vnet.ibm.com


On the QEMU side we'll soon have per-machine-type options in -machine. 
s390 can then use that.


Paolo


---
 default-configs/s390x-softmmu.mak |   1 +
 hw/s390x/Makefile.objs|   1 +
 hw/s390x/config.c | 113 ++
 hw/s390x/s390-virtio-ccw.c|   6 ++
 hw/s390x/s390-virtio.c|  24 
 hw/s390x/s390-virtio.h|   1 +
 include/hw/s390x/config.h |  51 +
 linux-headers/asm-s390/kvm.h  |  11 
 linux-headers/linux/kvm.h |   1 +
 trace-events  |   4 ++
 10 files changed, 213 insertions(+)
 create mode 100644 hw/s390x/config.c
 create mode 100644 include/hw/s390x/config.h

diff --git a/default-configs/s390x-softmmu.mak 
b/default-configs/s390x-softmmu.mak
index d843dc0..a07697d 100644
--- a/default-configs/s390x-softmmu.mak
+++ b/default-configs/s390x-softmmu.mak
@@ -1,3 +1,4 @@
 CONFIG_VIRTIO=y
 CONFIG_SCLPCONSOLE=y
 CONFIG_S390_FLIC=$(CONFIG_KVM)
+CONFIG_S390_CONFIG=$(CONFIG_KVM)
diff --git a/hw/s390x/Makefile.objs b/hw/s390x/Makefile.objs
index 1ba6c3a..dc03e1d 100644
--- a/hw/s390x/Makefile.objs
+++ b/hw/s390x/Makefile.objs
@@ -8,3 +8,4 @@ obj-y += ipl.o
 obj-y += css.o
 obj-y += s390-virtio-ccw.o
 obj-y += virtio-ccw.o
+obj-$(CONFIG_S390_CONFIG) += config.o
diff --git a/hw/s390x/config.c b/hw/s390x/config.c
new file mode 100644
index 000..f70e880
--- /dev/null
+++ b/hw/s390x/config.c
@@ -0,0 +1,113 @@
+/*
+ * S390 configuration and control device
+ *
+ * Copyright 2014 IBM Corp.
+ * Author(s): Ekaterina Tumanova tuman...@linux.vnet.ibm.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#include sys/ioctl.h
+#include hw/s390x/config.h
+#include trace.h
+
+S390ConfigState *get_config_state(void)
+{
+return S390_CONFIG(object_resolve_path(/machine/s390-config, NULL));
+}
+
+static int config_set(S390ConfigState *config, struct kvm_device_attr *attr)
+{
+int rc;
+
+rc = ioctl(config-fd, KVM_SET_DEVICE_ATTR, attr);
+return rc  0 ? -errno : rc;
+}
+
+/*
+static int config_get(S390ConfigState *config, struct kvm_device_attr *attr)
+{
+int rc;
+
+rc = ioctl(config-fd, KVM_GET_DEVICE_ATTR, attr);
+return rc  0 ? -errno : rc;
+}
+*/
+
+static int config_set_name(S390ConfigState *config, char *name)
+{
+struct kvm_device_attr attr = {
+.group = KVM_DEV_CONFIG_GROUP,
+.attr  = KVM_DEV_CONFIG_NAME,
+.addr  = (uint64_t)name,
+};
+
+if (!name ||
+(qemu_strnlen(name, S390_MACHINE_NAME_SIZE)
+ = S390_MACHINE_NAME_SIZE)) {
+return -EINVAL;
+}
+
+return config_set(config, attr);
+}
+
+static void s390_config_realize(DeviceState *dev, Error **errp)
+{
+S390ConfigState *cfg = S390_CONFIG(dev);
+struct kvm_create_device cd = {0};
+int ret;
+
+if (!kvm_check_extension(kvm_state, KVM_CAP_DEVICE_CTRL)) {
+trace_config_no_device_api(errno);
+return;
+}
+
+cd.type = KVM_DEV_TYPE_S390_CONFIG;
+ret = kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, cd);
+if (ret  0) {
+trace_config_create_device(errno);
+return;
+}
+cfg-fd = cd.fd;
+}
+
+void s390_config_dev_init(void)
+{
+DeviceState *dev;
+
+if (kvm_enabled()) {
+dev  = qdev_create(NULL, TYPE_S390_CONFIG);
+
+object_property_add_child(qdev_get_machine(),
+  TYPE_S390_CONFIG,
+  OBJECT(dev), NULL);
+qdev_init_nofail(dev);
+}
+}
+
+static void s390_config_class_init(ObjectClass *oc, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(oc);
+S390ConfigClass *scc = S390_CONFIG_CLASS(oc);
+
+dc-realize = s390_config_realize;
+scc-set_name = config_set_name;
+}
+
+static const TypeInfo s390_config_info = {
+.name  = TYPE_S390_CONFIG,
+.parent= TYPE_SYS_BUS_DEVICE,
+.instance_size = sizeof(S390ConfigState),
+.class_init= s390_config_class_init,
+.class_size= sizeof(S390ConfigClass),
+};
+
+
+static void s390_config_register_types(void)
+{
+type_register_static(s390_config_info);
+}
+

[RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override

2014-04-01 Thread Alex Williamson
The driver_override field allows us to specify the driver for a device
rather than relying on the driver to provide a positive match of the
device.  This shortcuts the existing process of looking up the vendor
and device ID, adding them to the driver new_id, binding the device,
then removing the ID, but it also provides a couple advantages.

First, the above process allows the driver to bind to any device
matching the new_id for the window where it's enabled.  This is often
not desired, such as the case of trying to bind a single device to a
meta driver like pci-stub or vfio-pci.  Using driver_override we can
do this deterministically using:

echo pci-stub  /sys/bus/pci/devices/:03:00.0/driver_override
echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
echo :03:00.0  /sys/bus/pci/drivers_probe

Previously we could not invoke drivers_probe after adding a device
to new_id for a driver as we get non-deterministic behavior whether
the driver we intend or the standard driver will claim the device.
Now it becomes a deterministic process, only the driver matching
driver_override will probe the device.

To return the device to the standard driver, we simply clear the
driver_override and reprobe the device, ex:

echo  /sys/bus/pci/devices/:03:00.0/preferred_driver
echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
echo :03:00.0  /sys/bus/pci/drivers_probe

Another advantage to this approach is that we can specify a driver
override to force a specific binding or prevent any binding.  For
instance when an IOMMU group is exposed to userspace through VFIO
we require that all devices within that group are owned by VFIO.
However, devices can be hot-added into an IOMMU group, in which case
we want to prevent the device from binding to any driver (preferred
driver = none) or perhaps have it automatically bind to vfio-pci.
With driver_override it's a simple matter for this field to be set
internally when the device is first discovered to prevent driver
matches.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

Apologies for the exceptionally long cc list, this is a follow-up to
Stuart's Subject: mechanism to allow a driver to bind to any device
thread.  This is effectively a v2 of the proof-of-concept patch I
posted in that thread.  This version changes to use a dummy id struct
to return on an override match, which removes the collateral damage
and greatly simplifies the patch.  This feels fairly well baked for
PCI and I would expect that platform drivers could do a similar
implementation.  From there perhaps we can discuss whether there's
any advantage to placing driver_override on struct device.  The logic
for incorporating it into the match still needs to happen per bus
driver, so it might only contribute to consistency of the show/store
sysfs attributes to move it up to struct device.  Please comment.
Thanks,

Alex

 drivers/pci/pci-driver.c |   25 ++---
 drivers/pci/pci-sysfs.c  |   40 
 include/linux/pci.h  |1 +
 3 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 25f0bc6..f780eb8 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -216,6 +216,13 @@ const struct pci_device_id *pci_match_id(const struct 
pci_device_id *ids,
return NULL;
 }
 
+static const struct pci_device_id pci_device_id_any = {
+   .vendor = PCI_ANY_ID,
+   .device = PCI_ANY_ID,
+   .subvendor = PCI_ANY_ID,
+   .subdevice = PCI_ANY_ID,
+};
+
 /**
  * pci_match_device - Tell if a PCI device structure has a matching PCI device 
id structure
  * @drv: the PCI driver to match against
@@ -229,18 +236,30 @@ static const struct pci_device_id 
*pci_match_device(struct pci_driver *drv,
struct pci_dev *dev)
 {
struct pci_dynid *dynid;
+   const struct pci_device_id *found_id = NULL;
+
+   /* When driver_override is set, only bind to the matching driver */
+   if (dev-driver_override  strcmp(dev-driver_override, drv-name))
+   return NULL;
 
/* Look at the dynamic ids first, before the static ones */
spin_lock(drv-dynids.lock);
list_for_each_entry(dynid, drv-dynids.list, node) {
if (pci_match_one_device(dynid-id, dev)) {
-   spin_unlock(drv-dynids.lock);
-   return dynid-id;
+   found_id = dynid-id;
+   break;
}
}
spin_unlock(drv-dynids.lock);
 
-   return pci_match_id(drv-id_table, dev);
+   if (!found_id)
+   found_id = pci_match_id(drv-id_table, dev);
+
+   /* driver_override will always match, send a dummy id */
+   if (!found_id  dev-driver_override)
+   found_id = pci_device_id_any;
+
+   return found_id;
 }
 
 struct drv_dev_and_id {
diff 

Re: [RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override

2014-04-01 Thread Greg KH
On Tue, Apr 01, 2014 at 10:28:54AM -0600, Alex Williamson wrote:
 The driver_override field allows us to specify the driver for a device
 rather than relying on the driver to provide a positive match of the
 device.  This shortcuts the existing process of looking up the vendor
 and device ID, adding them to the driver new_id, binding the device,
 then removing the ID, but it also provides a couple advantages.
 
 First, the above process allows the driver to bind to any device
 matching the new_id for the window where it's enabled.  This is often
 not desired, such as the case of trying to bind a single device to a
 meta driver like pci-stub or vfio-pci.  Using driver_override we can
 do this deterministically using:
 
 echo pci-stub  /sys/bus/pci/devices/:03:00.0/driver_override
 echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
 echo :03:00.0  /sys/bus/pci/drivers_probe
 
 Previously we could not invoke drivers_probe after adding a device
 to new_id for a driver as we get non-deterministic behavior whether
 the driver we intend or the standard driver will claim the device.
 Now it becomes a deterministic process, only the driver matching
 driver_override will probe the device.
 
 To return the device to the standard driver, we simply clear the
 driver_override and reprobe the device, ex:
 
 echo  /sys/bus/pci/devices/:03:00.0/preferred_driver
 echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
 echo :03:00.0  /sys/bus/pci/drivers_probe
 
 Another advantage to this approach is that we can specify a driver
 override to force a specific binding or prevent any binding.  For
 instance when an IOMMU group is exposed to userspace through VFIO
 we require that all devices within that group are owned by VFIO.
 However, devices can be hot-added into an IOMMU group, in which case
 we want to prevent the device from binding to any driver (preferred
 driver = none) or perhaps have it automatically bind to vfio-pci.
 With driver_override it's a simple matter for this field to be set
 internally when the device is first discovered to prevent driver
 matches.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
 Apologies for the exceptionally long cc list, this is a follow-up to
 Stuart's Subject: mechanism to allow a driver to bind to any device
 thread.  This is effectively a v2 of the proof-of-concept patch I
 posted in that thread.  This version changes to use a dummy id struct
 to return on an override match, which removes the collateral damage
 and greatly simplifies the patch.  This feels fairly well baked for
 PCI and I would expect that platform drivers could do a similar
 implementation.  From there perhaps we can discuss whether there's
 any advantage to placing driver_override on struct device.  The logic
 for incorporating it into the match still needs to happen per bus
 driver, so it might only contribute to consistency of the show/store
 sysfs attributes to move it up to struct device.  Please comment.
 Thanks,
 
 Alex
 
  drivers/pci/pci-driver.c |   25 ++---
  drivers/pci/pci-sysfs.c  |   40 
  include/linux/pci.h  |1 +
  3 files changed, 63 insertions(+), 3 deletions(-)

No Documentation/ABI/ update to reflect the ABI you are creating?

thanks,

greg k-h
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-04-01 Thread Scott Wood
On Tue, 2014-04-01 at 07:47 +0200, Alexander Graf wrote:
 
  Am 01.04.2014 um 01:03 schrieb Scott Wood scottw...@freescale.com:
  
  On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote:
  On 03/26/2014 10:17 PM, Scott Wood wrote:
  On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
  +/*
  + * Another thread may rewrite the TLB entry in parallel, don't
  + * execute from the address if the execute permission is not set
  + */
  
  What happens when another thread rewrites the TLB entry in parallel? 
  Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? 
  Are the contents of the MAS registers consistent at this point or 
  inconsistent?
  
  If another thread rewrites the TLB entry, then we use the new TLB entry,
  just as if it had raced in hardware.  This check ensures that we don't
  execute from the new TLB entry if it doesn't have execute permissions
  (just as hardware would).
  
  What happens if the new TLB entry is valid and executable, and the
  instruction pointed to is valid, but doesn't trap (and thus we don't
  have emulation for it)?
  
  There has to be a good way to detect such a race and deal with it, no?
  
  How would you detect it?  We don't get any information from the trap
  about what physical address the instruction was fetched from, or what
  the instruction was.
 
 Ah, this is a guest race where the guest modifies exactly the TLB in 
 question. I see.
 
 Why would this ever happen in practice (in a case where we're not executing 
 malicious code)?

I don't know.  It probably wouldn't.  But it seems wrong to not check
the exec bit.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call for agenda 2014-04-01

2014-04-01 Thread Peter Maydell
On 1 April 2014 16:34, Juan Quintela quint...@redhat.com wrote:

 Relaese process
 - Delay release for a few days to be able to test the tarball.
 - Michael Rooth do the realese
 - Stefan get used as realse manager backup

A point I didn't get a clear answer to on the call:
for rc1 are we going to give Michael commit access to
master so he can do the whole update-VERSION-tag-roll-tarball
process, or do I need to do at least the VERSION-and-tag
part this time round?

thanks
-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-04-01 Thread Alexander Graf

On 04/01/2014 06:58 PM, Scott Wood wrote:

On Tue, 2014-04-01 at 07:47 +0200, Alexander Graf wrote:

Am 01.04.2014 um 01:03 schrieb Scott Wood scottw...@freescale.com:


On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote:

On 03/26/2014 10:17 PM, Scott Wood wrote:

On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
+/*
+ * Another thread may rewrite the TLB entry in parallel, don't
+ * execute from the address if the execute permission is not set
+ */

What happens when another thread rewrites the TLB entry in parallel?
Does tlbsx succeed? Does it fail? Do we see failure indicated somehow?
Are the contents of the MAS registers consistent at this point or
inconsistent?

If another thread rewrites the TLB entry, then we use the new TLB entry,
just as if it had raced in hardware.  This check ensures that we don't
execute from the new TLB entry if it doesn't have execute permissions
(just as hardware would).

What happens if the new TLB entry is valid and executable, and the
instruction pointed to is valid, but doesn't trap (and thus we don't
have emulation for it)?


There has to be a good way to detect such a race and deal with it, no?

How would you detect it?  We don't get any information from the trap
about what physical address the instruction was fetched from, or what
the instruction was.

Ah, this is a guest race where the guest modifies exactly the TLB in question. 
I see.

Why would this ever happen in practice (in a case where we're not executing 
malicious code)?

I don't know.  It probably wouldn't.  But it seems wrong to not check
the exec bit.


Right, I'm just saying that a program interrupt is not too bad of an 
answer in case the guest does something as stupid as this in kernel 
space :). It's definitely good practice to check for the exec bit.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override

2014-04-01 Thread Alex Williamson
On Tue, 2014-04-01 at 09:47 -0700, Greg KH wrote:
 On Tue, Apr 01, 2014 at 10:28:54AM -0600, Alex Williamson wrote:
  The driver_override field allows us to specify the driver for a device
  rather than relying on the driver to provide a positive match of the
  device.  This shortcuts the existing process of looking up the vendor
  and device ID, adding them to the driver new_id, binding the device,
  then removing the ID, but it also provides a couple advantages.
  
  First, the above process allows the driver to bind to any device
  matching the new_id for the window where it's enabled.  This is often
  not desired, such as the case of trying to bind a single device to a
  meta driver like pci-stub or vfio-pci.  Using driver_override we can
  do this deterministically using:
  
  echo pci-stub  /sys/bus/pci/devices/:03:00.0/driver_override
  echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
  echo :03:00.0  /sys/bus/pci/drivers_probe
  
  Previously we could not invoke drivers_probe after adding a device
  to new_id for a driver as we get non-deterministic behavior whether
  the driver we intend or the standard driver will claim the device.
  Now it becomes a deterministic process, only the driver matching
  driver_override will probe the device.
  
  To return the device to the standard driver, we simply clear the
  driver_override and reprobe the device, ex:
  
  echo  /sys/bus/pci/devices/:03:00.0/preferred_driver
  echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
  echo :03:00.0  /sys/bus/pci/drivers_probe
  
  Another advantage to this approach is that we can specify a driver
  override to force a specific binding or prevent any binding.  For
  instance when an IOMMU group is exposed to userspace through VFIO
  we require that all devices within that group are owned by VFIO.
  However, devices can be hot-added into an IOMMU group, in which case
  we want to prevent the device from binding to any driver (preferred
  driver = none) or perhaps have it automatically bind to vfio-pci.
  With driver_override it's a simple matter for this field to be set
  internally when the device is first discovered to prevent driver
  matches.
  
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  ---
  
  Apologies for the exceptionally long cc list, this is a follow-up to
  Stuart's Subject: mechanism to allow a driver to bind to any device
  thread.  This is effectively a v2 of the proof-of-concept patch I
  posted in that thread.  This version changes to use a dummy id struct
  to return on an override match, which removes the collateral damage
  and greatly simplifies the patch.  This feels fairly well baked for
  PCI and I would expect that platform drivers could do a similar
  implementation.  From there perhaps we can discuss whether there's
  any advantage to placing driver_override on struct device.  The logic
  for incorporating it into the match still needs to happen per bus
  driver, so it might only contribute to consistency of the show/store
  sysfs attributes to move it up to struct device.  Please comment.
  Thanks,
  
  Alex
  
   drivers/pci/pci-driver.c |   25 ++---
   drivers/pci/pci-sysfs.c  |   40 
   include/linux/pci.h  |1 +
   3 files changed, 63 insertions(+), 3 deletions(-)
 
 No Documentation/ABI/ update to reflect the ABI you are creating?

Oops, thanks for the reminder.  I'd propose this:

diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
b/Documentation/ABI/testing/sysfs-bus-pci
index a3c5a66..55ca6e5 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -250,3 +250,21 @@ Description:
valid.  For example, writing a 2 to this file when sriov_numvfs
is not 0 and not 2 already will return an error. Writing a 10
when the value of sriov_totalvfs is 8 will return an error.
+
+What:  /sys/bus/pci/devices/.../driver_override
+Date:  April 2014
+Contact:   Alex Williamson alex.william...@redhat.com
+Description:
+   This file allows the driver for a device to be specified
+   which will override standard static and dynamic ID matching.
+   When specified, only a driver with a name matching the value
+   written to driver_override will have an opportunity to bind
+   to the device.  The override may be cleared by writing an
+   empty string (ex. echo  driver_override), returning the
+   device to standard matching rules binding.  Writing to
+   driver_override does not automatically unbind the device from
+   its current driver or make any attempt to automatically load
+   the specified driver name.  If no driver with a matching name
+   is currently loaded in the kernel, no match will be found.
+   This also 

Re: VDSO pvclock may increase host cpu consumption, is this a problem?

2014-04-01 Thread Marcelo Tosatti
On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote:
 On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
  On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote:
   On 03/29/2014 01:47 AM, Zhanghailiang wrote:
Hi,
I found when Guest is idle, VDSO pvclock may increase host consumption.
We can calcutate as follow, Correct me if I am wrong.
  (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest)
In Host, VDSO pvclock introduce a notifier chain, pvclock_gtod_chain in 
timekeeping.c. It consume nearly 900 cycles per call. So in 
consideration of 250 Hz, it may consume 225,000 cycles per second, even 
no VM is created.
In Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. 
If the no-kvmclock-vsyscall is configured, gettimeofday consumes 370 
cycles per call. The feature decrease 150 cycles consumption per call.
When call gettimeofday 1500 times,it decrease 225,000 cycles,equal to 
the host consumption.
Both Host and Guest is linux-3.13.6.
So, whether the host cpu consumption is a problem?
  
   Does pvclock serve any real purpose on systems with fully-functional
   TSCs?  The x86 guest implementation is awful, so it's about 2x slower
   than TSC.  It could be improved a lot, but I'm not sure I understand why
   it exists in the first place.
 
  VM migration.
 
 Why does that need percpu stuff?  Wouldn't it be sufficient to
 interrupt all CPUs (or at least all cpus running in userspace) on
 migration and update the normal timing data structures?

Are you suggesting to allow interruption of the timekeeping code 
at any time to update frequency information ?

Do you want to that as a special tsc clocksource driver ? 

 Even better: have the VM offer to invalidate the physical page
 containing the kernel's clock data on migration and interrupt one CPU.
  If another CPU races, it'll fault and wait for the guest kernel to
 update its timing.

Perhaps that is a good idea.

 Does the current kvmclock stuff track CLOCK_MONOTONIC and
 CLOCK_REALTIME separately?

No. kvmclock counting is interrupted on vm pause (the hw clock does not
count during vm pause).

  Can you explain why you consider it so bad ? How you think it could be
  improved ?
 
 The second rdtsc_barrier looks unnecessary.  Even better, if rdtscp is
 available, then rdtscp can replace rdtsc_barrier, rdtsc, and the
 getcpu call.

 It would also be nice to avoid having two sets of rescalings of the timing 
 data.

Yep, probably good improvements, patches are welcome :-)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] s390x/kvm: implement and use QEMU config device for s390

2014-04-01 Thread Christian Borntraeger
On 01/04/14 17:38, Paolo Bonzini wrote:
 Il 01/04/2014 16:47, Christian Borntraeger ha scritto:
 From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com

 The following patch adds Qemu CONFIG device,
 which interacts with kvm CONFIG device by calling KVM_CREATE_DEVICE
 (to create the device in kernel), KVM_GET_DEVICE_ATTR and
 KVM_SET_DEVICE_ATTR (to get and set the particular attributes of
 KVM CONFIG device respectively).
 This patch uses the Qemu CONFIG device to copy the
 qemu_name (if specified) into the kvm CONFIG device.

 Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
 Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
 Signed-off-by: Michael Mueller m...@linux.vnet.ibm.com
 Signed-off-by: Jens Freimann jf...@linux.vnet.ibm.com
 
 On the QEMU side we'll soon have per-machine-type options in -machine. s390 
 can then use that.

Well we need some calls in the kernel at runtime (e.g. during reset), so we 
need more than
static options. (But we can use those for several things). This device was 
mostly meant as an achor for doesnt fit into something else and would require 
a new
ioctl. Lets encapsulate everything in a way that follows the everything is a 
device approach with QOM and all these things.

Is there a branch for the per-machine-type things?

Christian


 
 Paolo
 
 ---
  default-configs/s390x-softmmu.mak |   1 +
  hw/s390x/Makefile.objs|   1 +
  hw/s390x/config.c | 113 
 ++
  hw/s390x/s390-virtio-ccw.c|   6 ++
  hw/s390x/s390-virtio.c|  24 
  hw/s390x/s390-virtio.h|   1 +
  include/hw/s390x/config.h |  51 +
  linux-headers/asm-s390/kvm.h  |  11 
  linux-headers/linux/kvm.h |   1 +
  trace-events  |   4 ++
  10 files changed, 213 insertions(+)
  create mode 100644 hw/s390x/config.c
  create mode 100644 include/hw/s390x/config.h

 diff --git a/default-configs/s390x-softmmu.mak 
 b/default-configs/s390x-softmmu.mak
 index d843dc0..a07697d 100644
 --- a/default-configs/s390x-softmmu.mak
 +++ b/default-configs/s390x-softmmu.mak
 @@ -1,3 +1,4 @@
  CONFIG_VIRTIO=y
  CONFIG_SCLPCONSOLE=y
  CONFIG_S390_FLIC=$(CONFIG_KVM)
 +CONFIG_S390_CONFIG=$(CONFIG_KVM)
 diff --git a/hw/s390x/Makefile.objs b/hw/s390x/Makefile.objs
 index 1ba6c3a..dc03e1d 100644
 --- a/hw/s390x/Makefile.objs
 +++ b/hw/s390x/Makefile.objs
 @@ -8,3 +8,4 @@ obj-y += ipl.o
  obj-y += css.o
  obj-y += s390-virtio-ccw.o
  obj-y += virtio-ccw.o
 +obj-$(CONFIG_S390_CONFIG) += config.o
 diff --git a/hw/s390x/config.c b/hw/s390x/config.c
 new file mode 100644
 index 000..f70e880
 --- /dev/null
 +++ b/hw/s390x/config.c
 @@ -0,0 +1,113 @@
 +/*
 + * S390 configuration and control device
 + *
 + * Copyright 2014 IBM Corp.
 + * Author(s): Ekaterina Tumanova tuman...@linux.vnet.ibm.com
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2 or (at
 + * your option) any later version. See the COPYING file in the top-level
 + * directory.
 + */
 +
 +#include sys/ioctl.h
 +#include hw/s390x/config.h
 +#include trace.h
 +
 +S390ConfigState *get_config_state(void)
 +{
 +return S390_CONFIG(object_resolve_path(/machine/s390-config, NULL));
 +}
 +
 +static int config_set(S390ConfigState *config, struct kvm_device_attr *attr)
 +{
 +int rc;
 +
 +rc = ioctl(config-fd, KVM_SET_DEVICE_ATTR, attr);
 +return rc  0 ? -errno : rc;
 +}
 +
 +/*
 +static int config_get(S390ConfigState *config, struct kvm_device_attr *attr)
 +{
 +int rc;
 +
 +rc = ioctl(config-fd, KVM_GET_DEVICE_ATTR, attr);
 +return rc  0 ? -errno : rc;
 +}
 +*/
 +
 +static int config_set_name(S390ConfigState *config, char *name)
 +{
 +struct kvm_device_attr attr = {
 +.group = KVM_DEV_CONFIG_GROUP,
 +.attr  = KVM_DEV_CONFIG_NAME,
 +.addr  = (uint64_t)name,
 +};
 +
 +if (!name ||
 +(qemu_strnlen(name, S390_MACHINE_NAME_SIZE)
 + = S390_MACHINE_NAME_SIZE)) {
 +return -EINVAL;
 +}
 +
 +return config_set(config, attr);
 +}
 +
 +static void s390_config_realize(DeviceState *dev, Error **errp)
 +{
 +S390ConfigState *cfg = S390_CONFIG(dev);
 +struct kvm_create_device cd = {0};
 +int ret;
 +
 +if (!kvm_check_extension(kvm_state, KVM_CAP_DEVICE_CTRL)) {
 +trace_config_no_device_api(errno);
 +return;
 +}
 +
 +cd.type = KVM_DEV_TYPE_S390_CONFIG;
 +ret = kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, cd);
 +if (ret  0) {
 +trace_config_create_device(errno);
 +return;
 +}
 +cfg-fd = cd.fd;
 +}
 +
 +void s390_config_dev_init(void)
 +{
 +DeviceState *dev;
 +
 +if (kvm_enabled()) {
 +dev  = qdev_create(NULL, TYPE_S390_CONFIG);
 +
 +object_property_add_child(qdev_get_machine(),
 +  TYPE_S390_CONFIG,
 +  OBJECT(dev), NULL);
 +qdev_init_nofail(dev);
 +}
 +}

Re: VDSO pvclock may increase host cpu consumption, is this a problem?

2014-04-01 Thread Andy Lutomirski
On Tue, Apr 1, 2014 at 11:01 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote:
 On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
  On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote:
   On 03/29/2014 01:47 AM, Zhanghailiang wrote:
Hi,
I found when Guest is idle, VDSO pvclock may increase host consumption.
We can calcutate as follow, Correct me if I am wrong.
  (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest)
In Host, VDSO pvclock introduce a notifier chain, pvclock_gtod_chain 
in timekeeping.c. It consume nearly 900 cycles per call. So in 
consideration of 250 Hz, it may consume 225,000 cycles per second, 
even no VM is created.
In Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. 
If the no-kvmclock-vsyscall is configured, gettimeofday consumes 370 
cycles per call. The feature decrease 150 cycles consumption per call.
When call gettimeofday 1500 times,it decrease 225,000 cycles,equal to 
the host consumption.
Both Host and Guest is linux-3.13.6.
So, whether the host cpu consumption is a problem?
  
   Does pvclock serve any real purpose on systems with fully-functional
   TSCs?  The x86 guest implementation is awful, so it's about 2x slower
   than TSC.  It could be improved a lot, but I'm not sure I understand why
   it exists in the first place.
 
  VM migration.

 Why does that need percpu stuff?  Wouldn't it be sufficient to
 interrupt all CPUs (or at least all cpus running in userspace) on
 migration and update the normal timing data structures?

 Are you suggesting to allow interruption of the timekeeping code
 at any time to update frequency information ?

I'm not sure what you mean by interruption of the timekeeping code.
I'm suggesting sending an interrupt to the guest (via a virtio device,
presumably) to tell it that it has been paused and resumed.

This is probably worth getting John's input if you actually want to do
this.  I'm not about to :)

Is there any case in which the TSC is stable and the kvmclock data for
different cpus is actually different?


 Do you want to that as a special tsc clocksource driver ?

 Even better: have the VM offer to invalidate the physical page
 containing the kernel's clock data on migration and interrupt one CPU.
  If another CPU races, it'll fault and wait for the guest kernel to
 update its timing.

 Perhaps that is a good idea.

 Does the current kvmclock stuff track CLOCK_MONOTONIC and
 CLOCK_REALTIME separately?

 No. kvmclock counting is interrupted on vm pause (the hw clock does not
 count during vm pause).

Makes sense.


  Can you explain why you consider it so bad ? How you think it could be
  improved ?

 The second rdtsc_barrier looks unnecessary.  Even better, if rdtscp is
 available, then rdtscp can replace rdtsc_barrier, rdtsc, and the
 getcpu call.

 It would also be nice to avoid having two sets of rescalings of the timing 
 data.

 Yep, probably good improvements, patches are welcome :-)


I may get to it at some point.  No guarantees.  I did just rewrite all
the mapping-related code for every other x86 vdso timesource, so maybe
I should try to add this to the pile.  The fact that the data is a
variable number of pages makes it messy, though, and since I don't
understand why there's a separate structure for each CPU, I'm hesitant
to change it too much.

--Andy
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device

2014-04-01 Thread Christian Borntraeger
On 01/04/14 17:12, Alexander Graf wrote:
 On 04/01/2014 05:04 PM, Christian Borntraeger wrote:
 On 01/04/14 16:58, Alexander Graf wrote:
 On 04/01/2014 04:47 PM, Christian Borntraeger wrote:
 From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com

 Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains
 configuration and control attributes of particular vm.
 The device is created by KVM_CREATE_DEVICE ioctl.
 The attributes may be retrieved and stored by calling
 KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls.

 Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
 Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
 I don't think a device is particularly the best fit. A device can usually 
 be instantiated multiple times. The configuration device can only be 
 created once. A device also gets created by user space which enables it to 
 receive the fd to drive it. Your device has to be created during VM 
 creation.
 I remember some discussion a year or 2 ago, and IIRC a config device
 was actually your idea ;-) (The other idea that we had, was ONE_REG for the 
 VM)
 
 Omg, really? :o
 
 A device would make sense for a specific system information instruction 
 trap that we handle in-kernel for whatever reason (usually because it's 
 performance critical) and some mandatory say to make sure user space always 
 creates it. And some checks to make sure it can't get created twice.

Well the device created twice problem  is also true for all existing KVM 
devices. 
The only missing piece is the check in the config device, no?

 
 Speaking of which, why don't we just forward STSI to user space with an 
 ENABLE_CAP and handle all of this there? It's not performance critical at 
 all, right?

No, performance is not critical.
The thing is, that we definitely need the kernel to handle parts of STSI, as we 
have to provide information from the upper hipervisor (LPAR or zVM). This 
information is only available in kernel space. So in essence we could only 
forward a small subset of STSI, namely stsi3_2_2. But we still have to call 
stsi_3_2_2 in the kernel,
as 3_2_2 does contain the list of hipervisors underneath us (KVM under z/VM). 

So then only thing that we could do is to forward STSI_3_2_2 to qemu when a 
capability is set and after the kernel has filled in the upper layers. 
QEMU then has to modify the page that the kernel touched and go back. Would 
work, but needs a capability and preferably an own exit. An new ioctl or
a subcode of an ioctl (attr/group whatever) seems easier.

 
 I think VM configuration is common enough to just make this a separate 
 interface.
 So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, 
 instead?
 Seems like an easy enough change. Would you reuse the kvm_attr structure for 
 that?
 
 Yeah, reuse whatever we can. Basically just remove the device boilerplate - I 
 don't think it's impressively useful for a non-device.

See above, name is just a simple first user.  
The thing is, that we have to have the ioctl either define a proper namespace 
(unique groups attrs) or to make it s390 specific. The device approach does 
help us here. 

I personally dont mind which way to go, as long as Paolo is fine with the 
approach, and nobody complains about the functions being non-QOM.

Christian

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] s390: Provide a configuration and control device

2014-04-01 Thread Christian Borntraeger
On 01/04/14 16:59, Alexander Graf wrote:
 On 04/01/2014 04:47 PM, Christian Borntraeger wrote:
 We want to configure several things in KVM that go beyond what
 ENABLE_CAP (we need payload) or ONE_REG (we need it for the VM
 and we need to do more complex actions) can provide. Instead of
 adding several s390 specific ioctls, lets provide a configuration
 and control device that encapsulates different commands into
 groups of the same area (MEMORY, CPU, ..)

 We also provide an initial nameless base group, with a simple first
 user to set the guest name. We need that name in the kernel for
 the emulation of STSI (which provides the guest name to the guest)
 but we need to implement the emulation in supervisor mode, as it
 also provides the underlying levels of hipervisors.

 Currently we have the following GROUPS and ATTRs pending, which
 configure some memory management related function or allow to set
 the guest facilities, cpuids etc:

 #define KVM_DEV_CONFIG_GROUP0
 #define KVM_DEV_CONFIG_NAME 0

 #define KVM_DEV_CONFIG_GROUP_MEM1
 #define KVM_DEV_CONFIG_MEM_ENABLE_CMMA  0
 #define KVM_DEV_CONFIG_MEM_CLR_CMMA 1
 #define KVM_DEV_CONFIG_MEM_CLR_PAGES2

 #define KVM_DEV_CONFIG_GROUP_CPU2
 #define KVM_DEV_CONFIG_CPU_TYPE 0
 #define KVM_DEV_CONFIG_CPU_FAC  1
 #define KVM_DEV_CONFIG_CPU_FAC_MASK 2
 #define KVM_DEV_CONFIG_CPU_IBC  3
 #define KVM_DEV_CONFIG_CPU_IBC_RANGE4
 
 Why would CPU specific information be set in the VM?

Might be a misleading name here. This is about CPU id and facility list (mostly 
CPU features).
The list of facilities and the cpu id is unique on VM level. It is queried via 
a CPU instruction, though. Providing some VM-wide setting via a CPU interface 
would make the interface look like the user can have different settings 
per-CPU. Since we cant, its better to make this explicit.
 
 
 Alex
 



 In addition other groups like
 #define KVM_DEV_CONFIG_GROUP_CRYPTO
 are under consideration to configure crypto acceleration.

 Unless there is a major concern, I will add this to the next
 s390 PULL requests for KVM.

 Christian

 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device

2014-04-01 Thread Alexander Graf

On 01.04.2014, at 21:19, Christian Borntraeger borntrae...@de.ibm.com wrote:

 On 01/04/14 17:12, Alexander Graf wrote:
 On 04/01/2014 05:04 PM, Christian Borntraeger wrote:
 On 01/04/14 16:58, Alexander Graf wrote:
 On 04/01/2014 04:47 PM, Christian Borntraeger wrote:
 From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
 
 Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains
 configuration and control attributes of particular vm.
 The device is created by KVM_CREATE_DEVICE ioctl.
 The attributes may be retrieved and stored by calling
 KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls.
 
 Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
 Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
 I don't think a device is particularly the best fit. A device can usually 
 be instantiated multiple times. The configuration device can only be 
 created once. A device also gets created by user space which enables it to 
 receive the fd to drive it. Your device has to be created during VM 
 creation.
 I remember some discussion a year or 2 ago, and IIRC a config device
 was actually your idea ;-) (The other idea that we had, was ONE_REG for the 
 VM)
 
 Omg, really? :o
 
 A device would make sense for a specific system information instruction 
 trap that we handle in-kernel for whatever reason (usually because it's 
 performance critical) and some mandatory say to make sure user space always 
 creates it. And some checks to make sure it can't get created twice.
 
 Well the device created twice problem  is also true for all existing KVM 
 devices. 
 The only missing piece is the check in the config device, no?
 
 
 Speaking of which, why don't we just forward STSI to user space with an 
 ENABLE_CAP and handle all of this there? It's not performance critical at 
 all, right?
 
 No, performance is not critical.
 The thing is, that we definitely need the kernel to handle parts of STSI, as 
 we have to provide information from the upper hipervisor (LPAR or zVM). This 
 information is only available in kernel space. So in essence we could only 
 forward a small subset of STSI, namely stsi3_2_2. But we still have to call 
 stsi_3_2_2 in the kernel,
 as 3_2_2 does contain the list of hipervisors underneath us (KVM under z/VM). 
 
 So then only thing that we could do is to forward STSI_3_2_2 to qemu when a 
 capability is set and after the kernel has filled in the upper layers. 
 QEMU then has to modify the page that the kernel touched and go back. Would 
 work, but needs a capability and preferably an own exit. An new ioctl or
 a subcode of an ioctl (attr/group whatever) seems easier.

I would consider these 2 orthogonal bits of information. User space wants to 
get information about its underlying hypervisors regardless of KVM, no? So we 
should have some interface to bubble STSI information of the current system to 
user space either way.

QEMU could use that and add a few bits of its own. That way we could handle all 
of STSI in QEMU and get out of the business of defining complicated interfaces.

 
 
 I think VM configuration is common enough to just make this a separate 
 interface.
 So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, 
 instead?
 Seems like an easy enough change. Would you reuse the kvm_attr structure 
 for that?
 
 Yeah, reuse whatever we can. Basically just remove the device boilerplate - 
 I don't think it's impressively useful for a non-device.
 
 See above, name is just a simple first user.  
 The thing is, that we have to have the ioctl either define a proper namespace 
 (unique groups attrs) or to make it s390 specific. The device approach does 
 help us here. 

If you like the device approach, make sure to create it on VM creation and only 
implement a specific ioctl to fetch its fd. We don't create the configuration 
information pseudo device after VM creation - it's always there :).

 I personally dont mind which way to go, as long as Paolo is fine with the 
 approach, and nobody complains about the functions being non-QOM.

I think the most obvious and straight forward way would be to deal with all of 
STSI in user space. Make it a separate exit type similar to hypercalls and 
don't worry about QOM'ification of anything. This thing is on the same level as 
CPUID really - just VM wide :).


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] s390: Provide a configuration and control device

2014-04-01 Thread Alexander Graf

On 01.04.2014, at 21:23, Christian Borntraeger borntrae...@de.ibm.com wrote:

 On 01/04/14 16:59, Alexander Graf wrote:
 On 04/01/2014 04:47 PM, Christian Borntraeger wrote:
 We want to configure several things in KVM that go beyond what
 ENABLE_CAP (we need payload) or ONE_REG (we need it for the VM
 and we need to do more complex actions) can provide. Instead of
 adding several s390 specific ioctls, lets provide a configuration
 and control device that encapsulates different commands into
 groups of the same area (MEMORY, CPU, ..)
 
 We also provide an initial nameless base group, with a simple first
 user to set the guest name. We need that name in the kernel for
 the emulation of STSI (which provides the guest name to the guest)
 but we need to implement the emulation in supervisor mode, as it
 also provides the underlying levels of hipervisors.
 
 Currently we have the following GROUPS and ATTRs pending, which
 configure some memory management related function or allow to set
 the guest facilities, cpuids etc:
 
 #define KVM_DEV_CONFIG_GROUP0
 #define KVM_DEV_CONFIG_NAME 0
 
 #define KVM_DEV_CONFIG_GROUP_MEM1
 #define KVM_DEV_CONFIG_MEM_ENABLE_CMMA  0
 #define KVM_DEV_CONFIG_MEM_CLR_CMMA 1
 #define KVM_DEV_CONFIG_MEM_CLR_PAGES2
 
 #define KVM_DEV_CONFIG_GROUP_CPU2
 #define KVM_DEV_CONFIG_CPU_TYPE 0
 #define KVM_DEV_CONFIG_CPU_FAC  1
 #define KVM_DEV_CONFIG_CPU_FAC_MASK 2
 #define KVM_DEV_CONFIG_CPU_IBC  3
 #define KVM_DEV_CONFIG_CPU_IBC_RANGE4
 
 Why would CPU specific information be set in the VM?
 
 Might be a misleading name here. This is about CPU id and facility list 
 (mostly CPU features).
 The list of facilities and the cpu id is unique on VM level.

Is this a hardware limitation? I thought s390's virtualization instruction 
doesn't have a notion of a VM, but I could easily be wrong.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device

2014-04-01 Thread Alexander Graf

On 01.04.2014, at 21:19, Christian Borntraeger borntrae...@de.ibm.com wrote:

 On 01/04/14 17:12, Alexander Graf wrote:
 On 04/01/2014 05:04 PM, Christian Borntraeger wrote:
 On 01/04/14 16:58, Alexander Graf wrote:
 On 04/01/2014 04:47 PM, Christian Borntraeger wrote:
 From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
 
 Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains
 configuration and control attributes of particular vm.
 The device is created by KVM_CREATE_DEVICE ioctl.
 The attributes may be retrieved and stored by calling
 KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls.
 
 Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com
 Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
 I don't think a device is particularly the best fit. A device can usually 
 be instantiated multiple times. The configuration device can only be 
 created once. A device also gets created by user space which enables it to 
 receive the fd to drive it. Your device has to be created during VM 
 creation.
 I remember some discussion a year or 2 ago, and IIRC a config device
 was actually your idea ;-) (The other idea that we had, was ONE_REG for the 
 VM)
 
 Omg, really? :o
 
 A device would make sense for a specific system information instruction 
 trap that we handle in-kernel for whatever reason (usually because it's 
 performance critical) and some mandatory say to make sure user space always 
 creates it. And some checks to make sure it can't get created twice.
 
 Well the device created twice problem  is also true for all existing KVM 
 devices. 
 The only missing piece is the check in the config device, no?
 
 
 Speaking of which, why don't we just forward STSI to user space with an 
 ENABLE_CAP and handle all of this there? It's not performance critical at 
 all, right?
 
 No, performance is not critical.
 The thing is, that we definitely need the kernel to handle parts of STSI, as 
 we have to provide information from the upper hipervisor (LPAR or zVM). This 
 information is only available in kernel space. So in essence we could only 
 forward a small subset of STSI, namely stsi3_2_2. But we still have to call 
 stsi_3_2_2 in the kernel,
 as 3_2_2 does contain the list of hipervisors underneath us (KVM under z/VM). 
 
 So then only thing that we could do is to forward STSI_3_2_2 to qemu when a 
 capability is set and after the kernel has filled in the upper layers. 
 QEMU then has to modify the page that the kernel touched and go back. Would 
 work, but needs a capability and preferably an own exit. An new ioctl or
 a subcode of an ioctl (attr/group whatever) seems easier.

I would consider these 2 orthogonal bits of information. User space wants to 
get information about its underlying hypervisors regardless of KVM, no? So we 
should have some interface to bubble STSI information of the current system to 
user space either way.

QEMU could use that and add a few bits of its own. That way we could handle all 
of STSI in QEMU and get out of the business of defining complicated interfaces.

 
 
 I think VM configuration is common enough to just make this a separate 
 interface.
 So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, 
 instead?
 Seems like an easy enough change. Would you reuse the kvm_attr structure 
 for that?
 
 Yeah, reuse whatever we can. Basically just remove the device boilerplate - 
 I don't think it's impressively useful for a non-device.
 
 See above, name is just a simple first user.  
 The thing is, that we have to have the ioctl either define a proper namespace 
 (unique groups attrs) or to make it s390 specific. The device approach does 
 help us here. 

If you like the device approach, make sure to create it on VM creation and only 
implement a specific ioctl to fetch its fd. We don't create the configuration 
information pseudo device after VM creation - it's always there :).

 I personally dont mind which way to go, as long as Paolo is fine with the 
 approach, and nobody complains about the functions being non-QOM.

I think the most obvious and straight forward way would be to deal with all of 
STSI in user space. Make it a separate exit type similar to hypercalls and 
don't worry about QOM'ification of anything. This thing is on the same level as 
CPUID really - just VM wide :).


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device

2014-04-01 Thread Christian Borntraeger
On 01/04/14 21:36, Alexander Graf wrote:
[...]
 Speaking of which, why don't we just forward STSI to user space with an 
 ENABLE_CAP and handle all of this there? It's not performance critical at 
 all, right?

 No, performance is not critical.
 The thing is, that we definitely need the kernel to handle parts of STSI, as 
 we have to provide information from the upper hipervisor (LPAR or zVM). This 
 information is only available in kernel space. So in essence we could only 
 forward a small subset of STSI, namely stsi3_2_2. But we still have to call 
 stsi_3_2_2 in the kernel,
 as 3_2_2 does contain the list of hipervisors underneath us (KVM under 
 z/VM). 

 So then only thing that we could do is to forward STSI_3_2_2 to qemu when a 
 capability is set and after the kernel has filled in the upper layers. 
 QEMU then has to modify the page that the kernel touched and go back. Would 
 work, but needs a capability and preferably an own exit. An new ioctl or
 a subcode of an ioctl (attr/group whatever) seems easier.
 
 I would consider these 2 orthogonal bits of information. User space wants to 
 get information about its underlying hypervisors regardless of KVM, no? So we 
 should have some interface to bubble STSI information of the current system 
 to user space either way.

There is /proc/sysinfo and it provides most (but not all) of the STSI 
information. But we dont want to go that path, really. Providing STSI as
binary blob is also complex and error-prone.
Think about it again: Instead of having a small qemu-kernel interface for 
specific subsets of stsi, you suggest a kernel-qemu interface
for the complete set of STSI, so that qemu can then emulate it? (We talk about 
20 pages in the principles of operation). T
 
 QEMU could use that and add a few bits of its own. That way we could handle 
 all of STSI in QEMU and get out of the business of defining complicated 
 interfaces.

This also misses my main point: The name is just the first user. I want to have 
some interface that allows me to do other things:
enable CMMA, reset the CMMA memory usage state on reset, zap the page tables on 
clear reset, set the cpu facilities (some parts need to be in kernel so that we 
can block specific operations), enable s390 specific settings, whatever. 

The usual approach was create new KVM ioctls, each for one feature + 
capability. This sounded like overkill.
So please dont focus on stsi, I need some interface that I can use for other 
pending patches. And if that interface is generic enough, we might also use 
that for name (or not depending on the stsi discussion). I am not married to 
the device idea, anything that works out fine is ok with me.

 I think VM configuration is common enough to just make this a separate 
 interface.
 So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, 
 instead?
 Seems like an easy enough change. Would you reuse the kvm_attr structure 
 for that?

 Yeah, reuse whatever we can. Basically just remove the device boilerplate - 
 I don't think it's impressively useful for a non-device.

 See above, name is just a simple first user.  
 The thing is, that we have to have the ioctl either define a proper 
 namespace (unique groups attrs) or to make it s390 specific. The device 
 approach does help us here. 
 
 If you like the device approach, make sure to create it on VM creation and 
 only implement a specific ioctl to fetch its fd. We don't create the 
 configuration information pseudo device after VM creation - it's always there 
 :).
 
 I personally dont mind which way to go, as long as Paolo is fine with the 
 approach, and nobody complains about the functions being non-QOM.
 
 I think the most obvious and straight forward way would be to deal with all 
 of STSI in user space. Make it a separate exit type similar to hypercalls and 
 don't worry about QOM'ification of anything. This thing is on the same level 
 as CPUID really - just VM wide :).

For 3_2_2 this might be possible solution, but not for the other codes. But as 
I said, the name is not my main problem here.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override

2014-04-01 Thread Kim Phillips
On Tue, 01 Apr 2014 10:28:54 -0600
Alex Williamson alex.william...@redhat.com wrote:

 The driver_override field allows us to specify the driver for a device
 rather than relying on the driver to provide a positive match of the
 device.  This shortcuts the existing process of looking up the vendor
 and device ID, adding them to the driver new_id, binding the device,
 then removing the ID, but it also provides a couple advantages.
 
 First, the above process allows the driver to bind to any device
 matching the new_id for the window where it's enabled.  This is often
 not desired, such as the case of trying to bind a single device to a
 meta driver like pci-stub or vfio-pci.  Using driver_override we can
 do this deterministically using:
 
 echo pci-stub  /sys/bus/pci/devices/:03:00.0/driver_override
 echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
 echo :03:00.0  /sys/bus/pci/drivers_probe
 
 Previously we could not invoke drivers_probe after adding a device
 to new_id for a driver as we get non-deterministic behavior whether
 the driver we intend or the standard driver will claim the device.
 Now it becomes a deterministic process, only the driver matching
 driver_override will probe the device.
 
 To return the device to the standard driver, we simply clear the
 driver_override and reprobe the device, ex:
 
 echo  /sys/bus/pci/devices/:03:00.0/preferred_driver
 echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
 echo :03:00.0  /sys/bus/pci/drivers_probe
 
 Another advantage to this approach is that we can specify a driver
 override to force a specific binding or prevent any binding.  For
 instance when an IOMMU group is exposed to userspace through VFIO
 we require that all devices within that group are owned by VFIO.
 However, devices can be hot-added into an IOMMU group, in which case
 we want to prevent the device from binding to any driver (preferred
 driver = none) or perhaps have it automatically bind to vfio-pci.
 With driver_override it's a simple matter for this field to be set
 internally when the device is first discovered to prevent driver
 matches.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
 Apologies for the exceptionally long cc list, this is a follow-up to
 Stuart's Subject: mechanism to allow a driver to bind to any device
 thread.  This is effectively a v2 of the proof-of-concept patch I
 posted in that thread.  This version changes to use a dummy id struct
 to return on an override match, which removes the collateral damage
 and greatly simplifies the patch.  This feels fairly well baked for
 PCI and I would expect that platform drivers could do a similar
 implementation.  From there perhaps we can discuss whether there's
 any advantage to placing driver_override on struct device.  The logic
 for incorporating it into the match still needs to happen per bus
 driver, so it might only contribute to consistency of the show/store
 sysfs attributes to move it up to struct device.  Please comment.

Sounds like Greg likes this approach more than {drv,dev}_sysfs_only.

The diff below is the result of duplicating and converting this patch
for platform devices, and, indeed, binding a device to the
vfio-platform driver succeeds with:

echo vfio-platform  /sys/bus/platform/devices/fff51000.ethernet/driver_override
echo fff51000.ethernet  
/sys/bus/platform/devices/fff51000.ethernet/driver/unbind
echo fff51000.ethernet  /sys/bus/platform/drivers_probe

However, it's almost pure duplication modulo the bus match code.  The
only other place I can see where to put the common bus check is
drivers/base/base.h:driver_match_device(), which I'm guessing is
off-limits?  So should we leave this as per-bus code, and somehow
refactor driver_override_{show,store}?

Kim

diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index bc78848..621c5bd2 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -22,6 +22,7 @@
 #include linux/pm_runtime.h
 #include linux/idr.h
 #include linux/acpi.h
+#include linux/limits.h
 
 #include base.h
 #include power/power.h
@@ -693,8 +694,49 @@ static ssize_t modalias_show(struct device *dev, struct 
device_attribute *a,
 }
 static DEVICE_ATTR_RO(modalias);
 
+static ssize_t driver_override_store(struct device *dev,
+struct device_attribute *attr,
+const char *buf, size_t count)
+{
+   struct platform_device *pdev = to_platform_device(dev);
+   char *driver_override, *old = pdev-driver_override;
+
+   if (count  PATH_MAX)
+   return -EINVAL;
+
+   driver_override = kstrndup(buf, count, GFP_KERNEL);
+   if (!driver_override)
+   return -ENOMEM;
+
+   while (strlen(driver_override) 
+  driver_override[strlen(driver_override) - 1] == '\n')
+   driver_override[strlen(driver_override) - 1] = '\0';
+
+   if (strlen(driver_override)) {
+

Re: VDSO pvclock may increase host cpu consumption, is this a problem?

2014-04-01 Thread Marcelo Tosatti
On Tue, Apr 01, 2014 at 12:17:16PM -0700, Andy Lutomirski wrote:
 On Tue, Apr 1, 2014 at 11:01 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote:
  On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  
   On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote:
On 03/29/2014 01:47 AM, Zhanghailiang wrote:
 Hi,
 I found when Guest is idle, VDSO pvclock may increase host 
 consumption.
 We can calcutate as follow, Correct me if I am wrong.
   (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest)
 In Host, VDSO pvclock introduce a notifier chain, pvclock_gtod_chain 
 in timekeeping.c. It consume nearly 900 cycles per call. So in 
 consideration of 250 Hz, it may consume 225,000 cycles per second, 
 even no VM is created.
 In Guest, gettimeofday consumes 220 cycles per call with VDSO 
 pvclock. If the no-kvmclock-vsyscall is configured, gettimeofday 
 consumes 370 cycles per call. The feature decrease 150 cycles 
 consumption per call.
 When call gettimeofday 1500 times,it decrease 225,000 cycles,equal 
 to the host consumption.
 Both Host and Guest is linux-3.13.6.
 So, whether the host cpu consumption is a problem?
   
Does pvclock serve any real purpose on systems with fully-functional
TSCs?  The x86 guest implementation is awful, so it's about 2x slower
than TSC.  It could be improved a lot, but I'm not sure I understand 
why
it exists in the first place.
  
   VM migration.
 
  Why does that need percpu stuff?  Wouldn't it be sufficient to
  interrupt all CPUs (or at least all cpus running in userspace) on
  migration and update the normal timing data structures?
 
  Are you suggesting to allow interruption of the timekeeping code
  at any time to update frequency information ?
 
 I'm not sure what you mean by interruption of the timekeeping code.
 I'm suggesting sending an interrupt to the guest (via a virtio device,
 presumably) to tell it that it has been paused and resumed.

code:

1) disable interrupts
2) A = RDTSC
3) B = SCALE(A, TSC.FREQ)

If migration happens between 2 and 3, you've got an incorrect value.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: VDSO pvclock may increase host cpu consumption, is this a problem?

2014-04-01 Thread Andy Lutomirski
On Tue, Apr 1, 2014 at 5:12 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Tue, Apr 01, 2014 at 12:17:16PM -0700, Andy Lutomirski wrote:
 On Tue, Apr 1, 2014 at 11:01 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote:
  On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  
   On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote:
On 03/29/2014 01:47 AM, Zhanghailiang wrote:
 Hi,
 I found when Guest is idle, VDSO pvclock may increase host 
 consumption.
 We can calcutate as follow, Correct me if I am wrong.
   (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest)
 In Host, VDSO pvclock introduce a notifier chain, 
 pvclock_gtod_chain in timekeeping.c. It consume nearly 900 cycles 
 per call. So in consideration of 250 Hz, it may consume 225,000 
 cycles per second, even no VM is created.
 In Guest, gettimeofday consumes 220 cycles per call with VDSO 
 pvclock. If the no-kvmclock-vsyscall is configured, gettimeofday 
 consumes 370 cycles per call. The feature decrease 150 cycles 
 consumption per call.
 When call gettimeofday 1500 times,it decrease 225,000 cycles,equal 
 to the host consumption.
 Both Host and Guest is linux-3.13.6.
 So, whether the host cpu consumption is a problem?
   
Does pvclock serve any real purpose on systems with fully-functional
TSCs?  The x86 guest implementation is awful, so it's about 2x slower
than TSC.  It could be improved a lot, but I'm not sure I understand 
why
it exists in the first place.
  
   VM migration.
 
  Why does that need percpu stuff?  Wouldn't it be sufficient to
  interrupt all CPUs (or at least all cpus running in userspace) on
  migration and update the normal timing data structures?
 
  Are you suggesting to allow interruption of the timekeeping code
  at any time to update frequency information ?

 I'm not sure what you mean by interruption of the timekeeping code.
 I'm suggesting sending an interrupt to the guest (via a virtio device,
 presumably) to tell it that it has been paused and resumed.

 code:

 1) disable interrupts
 2) A = RDTSC
 3) B = SCALE(A, TSC.FREQ)

 If migration happens between 2 and 3, you've got an incorrect value.


Fair enough.

I guess

1) disable interrupts
2) A = RDTSC
3) B = SCALE(A, TSC.FREQ)

is also bad if (3) blocks due to magic invalidation of the physical page.

--Andy
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override

2014-04-01 Thread Greg KH
On Tue, Apr 01, 2014 at 06:52:12PM -0500, Kim Phillips wrote:
 On Tue, 01 Apr 2014 10:28:54 -0600
 Alex Williamson alex.william...@redhat.com wrote:
 
  The driver_override field allows us to specify the driver for a device
  rather than relying on the driver to provide a positive match of the
  device.  This shortcuts the existing process of looking up the vendor
  and device ID, adding them to the driver new_id, binding the device,
  then removing the ID, but it also provides a couple advantages.
  
  First, the above process allows the driver to bind to any device
  matching the new_id for the window where it's enabled.  This is often
  not desired, such as the case of trying to bind a single device to a
  meta driver like pci-stub or vfio-pci.  Using driver_override we can
  do this deterministically using:
  
  echo pci-stub  /sys/bus/pci/devices/:03:00.0/driver_override
  echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
  echo :03:00.0  /sys/bus/pci/drivers_probe
  
  Previously we could not invoke drivers_probe after adding a device
  to new_id for a driver as we get non-deterministic behavior whether
  the driver we intend or the standard driver will claim the device.
  Now it becomes a deterministic process, only the driver matching
  driver_override will probe the device.
  
  To return the device to the standard driver, we simply clear the
  driver_override and reprobe the device, ex:
  
  echo  /sys/bus/pci/devices/:03:00.0/preferred_driver
  echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
  echo :03:00.0  /sys/bus/pci/drivers_probe
  
  Another advantage to this approach is that we can specify a driver
  override to force a specific binding or prevent any binding.  For
  instance when an IOMMU group is exposed to userspace through VFIO
  we require that all devices within that group are owned by VFIO.
  However, devices can be hot-added into an IOMMU group, in which case
  we want to prevent the device from binding to any driver (preferred
  driver = none) or perhaps have it automatically bind to vfio-pci.
  With driver_override it's a simple matter for this field to be set
  internally when the device is first discovered to prevent driver
  matches.
  
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  ---
  
  Apologies for the exceptionally long cc list, this is a follow-up to
  Stuart's Subject: mechanism to allow a driver to bind to any device
  thread.  This is effectively a v2 of the proof-of-concept patch I
  posted in that thread.  This version changes to use a dummy id struct
  to return on an override match, which removes the collateral damage
  and greatly simplifies the patch.  This feels fairly well baked for
  PCI and I would expect that platform drivers could do a similar
  implementation.  From there perhaps we can discuss whether there's
  any advantage to placing driver_override on struct device.  The logic
  for incorporating it into the match still needs to happen per bus
  driver, so it might only contribute to consistency of the show/store
  sysfs attributes to move it up to struct device.  Please comment.
 
 Sounds like Greg likes this approach more than {drv,dev}_sysfs_only.

I have made no such judgement, I only pointed out that if you
modify/add/remove a sysfs file, it needs to have documentation for it.

 The diff below is the result of duplicating and converting this patch
 for platform devices, and, indeed, binding a device to the
 vfio-platform driver succeeds with:
 
 echo vfio-platform  
 /sys/bus/platform/devices/fff51000.ethernet/driver_override
 echo fff51000.ethernet  
 /sys/bus/platform/devices/fff51000.ethernet/driver/unbind
 echo fff51000.ethernet  /sys/bus/platform/drivers_probe
 
 However, it's almost pure duplication modulo the bus match code.  The
 only other place I can see where to put the common bus check is
 drivers/base/base.h:driver_match_device(), which I'm guessing is
 off-limits?  So should we leave this as per-bus code, and somehow
 refactor driver_override_{show,store}?

If you can provide a way for this to be done in a bus-independant way,
like we did for new_id and the like, I'd be open to reviewing it.


greg k-h
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override

2014-04-01 Thread Christoffer Dall
On Tue, Apr 01, 2014 at 11:15:40AM -0600, Alex Williamson wrote:
 On Tue, 2014-04-01 at 09:47 -0700, Greg KH wrote:
  On Tue, Apr 01, 2014 at 10:28:54AM -0600, Alex Williamson wrote:
   The driver_override field allows us to specify the driver for a device
   rather than relying on the driver to provide a positive match of the
   device.  This shortcuts the existing process of looking up the vendor
   and device ID, adding them to the driver new_id, binding the device,
   then removing the ID, but it also provides a couple advantages.
   
   First, the above process allows the driver to bind to any device
   matching the new_id for the window where it's enabled.  This is often
   not desired, such as the case of trying to bind a single device to a
   meta driver like pci-stub or vfio-pci.  Using driver_override we can
   do this deterministically using:
   
   echo pci-stub  /sys/bus/pci/devices/:03:00.0/driver_override
   echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
   echo :03:00.0  /sys/bus/pci/drivers_probe
   
   Previously we could not invoke drivers_probe after adding a device
   to new_id for a driver as we get non-deterministic behavior whether
   the driver we intend or the standard driver will claim the device.
   Now it becomes a deterministic process, only the driver matching
   driver_override will probe the device.
   
   To return the device to the standard driver, we simply clear the
   driver_override and reprobe the device, ex:
   
   echo  /sys/bus/pci/devices/:03:00.0/preferred_driver
   echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
   echo :03:00.0  /sys/bus/pci/drivers_probe
   
   Another advantage to this approach is that we can specify a driver
   override to force a specific binding or prevent any binding.  For
   instance when an IOMMU group is exposed to userspace through VFIO
   we require that all devices within that group are owned by VFIO.
   However, devices can be hot-added into an IOMMU group, in which case
   we want to prevent the device from binding to any driver (preferred
   driver = none) or perhaps have it automatically bind to vfio-pci.
   With driver_override it's a simple matter for this field to be set
   internally when the device is first discovered to prevent driver
   matches.
   
   Signed-off-by: Alex Williamson alex.william...@redhat.com
   ---
   
   Apologies for the exceptionally long cc list, this is a follow-up to
   Stuart's Subject: mechanism to allow a driver to bind to any device
   thread.  This is effectively a v2 of the proof-of-concept patch I
   posted in that thread.  This version changes to use a dummy id struct
   to return on an override match, which removes the collateral damage
   and greatly simplifies the patch.  This feels fairly well baked for
   PCI and I would expect that platform drivers could do a similar
   implementation.  From there perhaps we can discuss whether there's
   any advantage to placing driver_override on struct device.  The logic
   for incorporating it into the match still needs to happen per bus
   driver, so it might only contribute to consistency of the show/store
   sysfs attributes to move it up to struct device.  Please comment.
   Thanks,
   
   Alex
   
drivers/pci/pci-driver.c |   25 ++---
drivers/pci/pci-sysfs.c  |   40 
include/linux/pci.h  |1 +
3 files changed, 63 insertions(+), 3 deletions(-)
  
  No Documentation/ABI/ update to reflect the ABI you are creating?
 
 Oops, thanks for the reminder.  I'd propose this:
 
 diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
 b/Documentation/ABI/testing/sysfs-bus-pci
 index a3c5a66..55ca6e5 100644
 --- a/Documentation/ABI/testing/sysfs-bus-pci
 +++ b/Documentation/ABI/testing/sysfs-bus-pci
 @@ -250,3 +250,21 @@ Description:
   valid.  For example, writing a 2 to this file when sriov_numvfs
   is not 0 and not 2 already will return an error. Writing a 10
   when the value of sriov_totalvfs is 8 will return an error.
 +
 +What:/sys/bus/pci/devices/.../driver_override
 +Date:April 2014
 +Contact: Alex Williamson alex.william...@redhat.com
 +Description:
 + This file allows the driver for a device to be specified
 + which will override standard static and dynamic ID matching.
 + When specified, only a driver with a name matching the value
 + written to driver_override will have an opportunity to bind
 + to the device.  The override may be cleared by writing an
 + empty string (ex. echo  driver_override), returning the
 + device to standard matching rules binding.  Writing to
 + driver_override does not automatically unbind the device from
 + its current driver or make any attempt to automatically load
 + the specified driver 

Re: VDSO pvclock may increase host cpu consumption, is this a problem?

2014-04-01 Thread Andy Lutomirski
On Tue, Apr 1, 2014 at 5:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Tue, Apr 01, 2014 at 12:17:16PM -0700, Andy Lutomirski wrote:
 On Tue, Apr 1, 2014 at 11:01 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote:
  On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
  
   On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote:
On 03/29/2014 01:47 AM, Zhanghailiang wrote:
 Hi,
 I found when Guest is idle, VDSO pvclock may increase host 
 consumption.
 We can calcutate as follow, Correct me if I am wrong.
   (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest)
 In Host, VDSO pvclock introduce a notifier chain, 
 pvclock_gtod_chain in timekeeping.c. It consume nearly 900 cycles 
 per call. So in consideration of 250 Hz, it may consume 225,000 
 cycles per second, even no VM is created.
 In Guest, gettimeofday consumes 220 cycles per call with VDSO 
 pvclock. If the no-kvmclock-vsyscall is configured, gettimeofday 
 consumes 370 cycles per call. The feature decrease 150 cycles 
 consumption per call.
 When call gettimeofday 1500 times,it decrease 225,000 cycles,equal 
 to the host consumption.
 Both Host and Guest is linux-3.13.6.
 So, whether the host cpu consumption is a problem?
   
Does pvclock serve any real purpose on systems with fully-functional
TSCs?  The x86 guest implementation is awful, so it's about 2x slower
than TSC.  It could be improved a lot, but I'm not sure I understand 
why
it exists in the first place.
  
   VM migration.
 
  Why does that need percpu stuff?  Wouldn't it be sufficient to
  interrupt all CPUs (or at least all cpus running in userspace) on
  migration and update the normal timing data structures?
 
  Are you suggesting to allow interruption of the timekeeping code
  at any time to update frequency information ?

 I'm not sure what you mean by interruption of the timekeeping code.
 I'm suggesting sending an interrupt to the guest (via a virtio device,
 presumably) to tell it that it has been paused and resumed.

 This is probably worth getting John's input if you actually want to do
 this.  I'm not about to :)

 Honestly, neither am i at the moment. But i'll think about it.

 Is there any case in which the TSC is stable and the kvmclock data for
 different cpus is actually different?

 No. However, kvmclock_data.flags field is an interface for watchdog
 unpause.

  Do you want to that as a special tsc clocksource driver ?
 
  Even better: have the VM offer to invalidate the physical page
  containing the kernel's clock data on migration and interrupt one CPU.
   If another CPU races, it'll fault and wait for the guest kernel to
  update its timing.
 
  Perhaps that is a good idea.
 
  Does the current kvmclock stuff track CLOCK_MONOTONIC and
  CLOCK_REALTIME separately?
 
  No. kvmclock counting is interrupted on vm pause (the hw clock does not
  count during vm pause).

 Makes sense.

 
   Can you explain why you consider it so bad ? How you think it could be
   improved ?
 
  The second rdtsc_barrier looks unnecessary.  Even better, if rdtscp is
  available, then rdtscp can replace rdtsc_barrier, rdtsc, and the
  getcpu call.
 
  It would also be nice to avoid having two sets of rescalings of the 
  timing data.
 
  Yep, probably good improvements, patches are welcome :-)
 

 I may get to it at some point.  No guarantees.  I did just rewrite all
 the mapping-related code for every other x86 vdso timesource, so maybe
 I should try to add this to the pile.  The fact that the data is a
 variable number of pages makes it messy, though, and since I don't
 understand why there's a separate structure for each CPU, I'm hesitant
 to change it too much.

 --Andy

 kvmclock.data? Because each VCPU can have different .flags fields for
 example.

It looks like the vdso kvmclock code only runs if
PVCLOCK_TSC_STABLE_BIT is set, which in turn is only the case if the
TSC is guaranteed to be monotonic across all CPUs.  If we can rely on
the fact that that bit will only be set if tsc_to_system_mul and
tsc_shift are the same on all CPUs and that (system_time -
(tsc_timestamp * mul)  shift) is the same on all CPUs, then there
should be no reason for the vdso to read the pvclock data for anything
but CPU 0.  That will make it a lot faster and simpler.

Can we rely on that?

I wonder what happens if the guest runs ntpd or otherwise uses
adjtimex.  Presumably it starts drifting relative to the host.

--Andy
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] drivers/vfio/pci: Fix MSIx message lost

2014-04-01 Thread Alexey Kardashevskiy
On 03/10/2014 04:46 PM, Gavin Shan wrote:
 The problem is specific to the case of BIST reset issued to IPR
 adapter on the guest side. The IPR driver calls pci_save_state(),
 issues BIST reset and then pci_restore_state(). The issued BIST
 cleans out MSIx table and pci_restore_state() doesn't restore
 the MSIx table as expected. Eventually, MSIx messages with all
 zeros are sent and causes EEH error on Power platform.
 
 The patch fixes it by writing MSIx message to MSIx table before
 reenabling the individual interrupts in the following path:
 
 qemu/hw/pci/msix.c::msix_table_mmio_write
 msix_handle_mask_update
 msix_fire_vector_notifier
 qemu/hw/misc/vfio.c::vfio_msix_vector_use
 vfio_msix_vector_do_use
 
 IOCTL Command VFIO_DEVICE_SET_IRQS to VFIO-PCI
 vfio/pci/vfio_pci.c::vfio_pci_ioctl
 vfio/pci/vfio_pci_intrs.c::vfio_pci_set_irqs_ioctl
 vfio_pci_set_msi_trigger
 vfio_msi_set_block
 vfio_msi_set_vector_signal
 
 Reported-by: Wen Xiong wenxi...@us.ibm.com
 Signed-off-by: Gavin Shan sha...@linux.vnet.ibm.com
 ---
  drivers/vfio/pci/vfio_pci_intrs.c |   11 +++
  1 file changed, 11 insertions(+)
 
 diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
 b/drivers/vfio/pci/vfio_pci_intrs.c
 index 2103576..83e0638 100644
 --- a/drivers/vfio/pci/vfio_pci_intrs.c
 +++ b/drivers/vfio/pci/vfio_pci_intrs.c
 @@ -17,6 +17,7 @@
  #include linux/interrupt.h
  #include linux/eventfd.h
  #include linux/pci.h
 +#include linux/msi.h
  #include linux/file.h
  #include linux/poll.h
  #include linux/vfio.h
 @@ -517,6 +518,7 @@ static int vfio_msi_set_vector_signal(struct 
 vfio_pci_device *vdev,
   struct pci_dev *pdev = vdev-pdev;
   int irq = msix ? vdev-msix[vector].vector : pdev-irq + vector;
   char *name = msix ? vfio-msix : vfio-msi;
 + struct msi_msg msg;
   struct eventfd_ctx *trigger;
   int ret;
  
 @@ -544,6 +546,15 @@ static int vfio_msi_set_vector_signal(struct 
 vfio_pci_device *vdev,
   return PTR_ERR(trigger);
   }
  
 + /* We possiblly lose the MSI/MSIx message in some cases, one
 +  * of which is pci_save_state(), BIST reset and pci_restore_state()
 +  * for IPR adapter. The BIST reset cleans out MSIx table and we
 +  * don't have chance to restore it. Here, we have the trick to
 +  * restore it before enabling individual interrupts. For MSI messages,
 +  * it's harmless to write them back.
 +  */
 + get_cached_msi_msg(irq, msg);
 + write_msi_msg(irq, msg);
   ret = request_irq(irq, vfio_msihandler, 0,
 vdev-ctx[vector].name, trigger);
   if (ret) {
 


This is missing to successfully compile VFIO as modules:


diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 2c10752..80ba2e9 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -283,6 +283,7 @@ void get_cached_msi_msg(unsigned int irq, struct
msi_msg *msg)

__get_cached_msi_msg(entry, msg);
 }
+EXPORT_SYMBOL_GPL(get_cached_msi_msg);

 void __write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 {
@@ -327,6 +328,7 @@ void write_msi_msg(unsigned int irq, struct msi_msg *msg)

__write_msi_msg(entry, msg);
 }
+EXPORT_SYMBOL_GPL(write_msi_msg);

 static void free_msi_irqs(struct pci_dev *dev)
 {




-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] drivers/vfio/pci: Fix MSIx message lost

2014-04-01 Thread Alex Williamson
On Wed, 2014-04-02 at 12:16 +1100, Alexey Kardashevskiy wrote:
 On 03/10/2014 04:46 PM, Gavin Shan wrote:
  The problem is specific to the case of BIST reset issued to IPR
  adapter on the guest side. The IPR driver calls pci_save_state(),
  issues BIST reset and then pci_restore_state(). The issued BIST
  cleans out MSIx table and pci_restore_state() doesn't restore
  the MSIx table as expected. Eventually, MSIx messages with all
  zeros are sent and causes EEH error on Power platform.
  
  The patch fixes it by writing MSIx message to MSIx table before
  reenabling the individual interrupts in the following path:
  
  qemu/hw/pci/msix.c::msix_table_mmio_write
  msix_handle_mask_update
  msix_fire_vector_notifier
  qemu/hw/misc/vfio.c::vfio_msix_vector_use
  vfio_msix_vector_do_use
  
  IOCTL Command VFIO_DEVICE_SET_IRQS to VFIO-PCI
  vfio/pci/vfio_pci.c::vfio_pci_ioctl
  vfio/pci/vfio_pci_intrs.c::vfio_pci_set_irqs_ioctl
  vfio_pci_set_msi_trigger
  vfio_msi_set_block
  vfio_msi_set_vector_signal
  
  Reported-by: Wen Xiong wenxi...@us.ibm.com
  Signed-off-by: Gavin Shan sha...@linux.vnet.ibm.com
  ---
   drivers/vfio/pci/vfio_pci_intrs.c |   11 +++
   1 file changed, 11 insertions(+)
  
  diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
  b/drivers/vfio/pci/vfio_pci_intrs.c
  index 2103576..83e0638 100644
  --- a/drivers/vfio/pci/vfio_pci_intrs.c
  +++ b/drivers/vfio/pci/vfio_pci_intrs.c
  @@ -17,6 +17,7 @@
   #include linux/interrupt.h
   #include linux/eventfd.h
   #include linux/pci.h
  +#include linux/msi.h
   #include linux/file.h
   #include linux/poll.h
   #include linux/vfio.h
  @@ -517,6 +518,7 @@ static int vfio_msi_set_vector_signal(struct 
  vfio_pci_device *vdev,
  struct pci_dev *pdev = vdev-pdev;
  int irq = msix ? vdev-msix[vector].vector : pdev-irq + vector;
  char *name = msix ? vfio-msix : vfio-msi;
  +   struct msi_msg msg;
  struct eventfd_ctx *trigger;
  int ret;
   
  @@ -544,6 +546,15 @@ static int vfio_msi_set_vector_signal(struct 
  vfio_pci_device *vdev,
  return PTR_ERR(trigger);
  }
   
  +   /* We possiblly lose the MSI/MSIx message in some cases, one
  +* of which is pci_save_state(), BIST reset and pci_restore_state()
  +* for IPR adapter. The BIST reset cleans out MSIx table and we
  +* don't have chance to restore it. Here, we have the trick to
  +* restore it before enabling individual interrupts. For MSI messages,
  +* it's harmless to write them back.
  +*/
  +   get_cached_msi_msg(irq, msg);
  +   write_msi_msg(irq, msg);
  ret = request_irq(irq, vfio_msihandler, 0,
vdev-ctx[vector].name, trigger);
  if (ret) {
  
 
 
 This is missing to successfully compile VFIO as modules:

Yep, I noted this in my reply on 3/18.  Thanks,

Alex

 diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
 index 2c10752..80ba2e9 100644
 --- a/drivers/pci/msi.c
 +++ b/drivers/pci/msi.c
 @@ -283,6 +283,7 @@ void get_cached_msi_msg(unsigned int irq, struct
 msi_msg *msg)
 
 __get_cached_msi_msg(entry, msg);
  }
 +EXPORT_SYMBOL_GPL(get_cached_msi_msg);
 
  void __write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
  {
 @@ -327,6 +328,7 @@ void write_msi_msg(unsigned int irq, struct msi_msg *msg)
 
 __write_msi_msg(entry, msg);
  }
 +EXPORT_SYMBOL_GPL(write_msi_msg);
 
  static void free_msi_irqs(struct pci_dev *dev)
  {
 
 
 
 



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC PATCH]pci-assign: Fix memory out of bound when MSI-X table not fit in a single page

2014-04-01 Thread Gonglei (Arei)
  Hi,
 
  I have a problem about SR-IOV pass-through.
 
  The PF is Emulex Corporation OneConnect NIC (Lancer)(rev 10),
  and the VF pci config is as follow:
 
  LINUX:/sys/bus/pci/devices/:04:00.6 # hexdump config
  000    0010 0010 0200  0080
  010        
  020       10df e264
  030   0054     
  040   0008     
  050   6009 0008 2b41 c002  
  060 7805 018a      
  070     8411 03ff 4000 
  080 3400  9403     
  090   0010 0002 8724 1000  
  0a0 dc83 0041      
  0b0     001f 0010  
  0c0 000e       
  0d0        
 
  We can see the msix_max is 0x3ff and msix_table_entry is 0x4000 (4 pages).
 But QEMU
  only mmap MSIX_PAGE_SIZE memory for all pci devices in funciton
 assigned_dev_register_msix_mmio,
  meanwhile the set the one page memmory to zero, so the rest memory will
 be random value
  (maybe etnry.data is not 0).
 
  In function assigned_dev_update_msix_mmio maybe occur the issue of
 entry_nr  256,
  and the kmod reports the EINVAL error.
 
  My patch fix this issue which alloc memory according to the real size of pci
 device config.
 
  Any ideas? Thnaks.
 
  Signed-off-by: Gonglei arei.gong...@huawei.com
  ---
   hw/i386/kvm/pci-assign.c |   24 +++-
   1 files changed, 19 insertions(+), 5 deletions(-)
 
  diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
  index a825871..daa191c 100644
  --- a/hw/i386/kvm/pci-assign.c
  +++ b/hw/i386/kvm/pci-assign.c
  @@ -1591,10 +1591,6 @@ static void
 assigned_dev_msix_reset(AssignedDevice *dev)
   MSIXTableEntry *entry;
   int i;
 
  -if (!dev-msix_table) {
  -return;
  -}
  -
   memset(dev-msix_table, 0, MSIX_PAGE_SIZE);
 
   for (i = 0, entry = dev-msix_table; i  dev-msix_max; i++, entry++) {
  @@ -1604,13 +1600,31 @@ static void
 assigned_dev_msix_reset(AssignedDevice *dev)
 
   static int assigned_dev_register_msix_mmio(AssignedDevice *dev)
   {
  -dev-msix_table = mmap(NULL, MSIX_PAGE_SIZE,
 PROT_READ|PROT_WRITE,
  +int nr_pages;
  +int size;
  +int entry_per_page = MSIX_PAGE_SIZE / sizeof(struct MSIXTableEntry);
  +
  +if (dev-msix_max  entry_per_page) {
  +nr_pages = dev-msix_max / entry_per_page;
  +if (dev-msix_max % entry_per_page) {
  +nr_pages += 1;
  +}
  +} else {
  +nr_pages = 1;
  +}
 
 It's usually not a good idea to special-case corner-cases like this.
 
IMHO, we should assure the memory page-aligned, so I use ROUND_UP according to 
dev-msix_max.

 
  +
  +size = MSIX_PAGE_SIZE * nr_pages;
 
 Just use ROUND_UP?
 
  +dev-msix_table = mmap(NULL, size, PROT_READ|PROT_WRITE,
  MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
 
 Need to fix unmap as well?
 
Yep, it should be unmap for new size memory in 
assigned_dev_unregister_msix_mmio. Thanks, Michael.

BTW, do you think the KVM should upsize the max support MSI-X entry to 2048.
Because the MSI-X supports a maximum table size of 2048 entries, which is 
descript in 
PCI specification 3.0 version 6.8.3.2: MSI-X Configuration.

The history patch about downsize the MSI-X entry size to 256:
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/38852/focus=38849

Best regards,
-Gonglei
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-04-01 Thread Scott Wood
On Tue, 2014-04-01 at 07:47 +0200, Alexander Graf wrote:
 
  Am 01.04.2014 um 01:03 schrieb Scott Wood scottw...@freescale.com:
  
  On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote:
  On 03/26/2014 10:17 PM, Scott Wood wrote:
  On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
  +/*
  + * Another thread may rewrite the TLB entry in parallel, don't
  + * execute from the address if the execute permission is not set
  + */
  
  What happens when another thread rewrites the TLB entry in parallel? 
  Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? 
  Are the contents of the MAS registers consistent at this point or 
  inconsistent?
  
  If another thread rewrites the TLB entry, then we use the new TLB entry,
  just as if it had raced in hardware.  This check ensures that we don't
  execute from the new TLB entry if it doesn't have execute permissions
  (just as hardware would).
  
  What happens if the new TLB entry is valid and executable, and the
  instruction pointed to is valid, but doesn't trap (and thus we don't
  have emulation for it)?
  
  There has to be a good way to detect such a race and deal with it, no?
  
  How would you detect it?  We don't get any information from the trap
  about what physical address the instruction was fetched from, or what
  the instruction was.
 
 Ah, this is a guest race where the guest modifies exactly the TLB in 
 question. I see.
 
 Why would this ever happen in practice (in a case where we're not executing 
 malicious code)?

I don't know.  It probably wouldn't.  But it seems wrong to not check
the exec bit.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-04-01 Thread Alexander Graf

On 04/01/2014 06:58 PM, Scott Wood wrote:

On Tue, 2014-04-01 at 07:47 +0200, Alexander Graf wrote:

Am 01.04.2014 um 01:03 schrieb Scott Wood scottw...@freescale.com:


On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote:

On 03/26/2014 10:17 PM, Scott Wood wrote:

On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
+/*
+ * Another thread may rewrite the TLB entry in parallel, don't
+ * execute from the address if the execute permission is not set
+ */

What happens when another thread rewrites the TLB entry in parallel?
Does tlbsx succeed? Does it fail? Do we see failure indicated somehow?
Are the contents of the MAS registers consistent at this point or
inconsistent?

If another thread rewrites the TLB entry, then we use the new TLB entry,
just as if it had raced in hardware.  This check ensures that we don't
execute from the new TLB entry if it doesn't have execute permissions
(just as hardware would).

What happens if the new TLB entry is valid and executable, and the
instruction pointed to is valid, but doesn't trap (and thus we don't
have emulation for it)?


There has to be a good way to detect such a race and deal with it, no?

How would you detect it?  We don't get any information from the trap
about what physical address the instruction was fetched from, or what
the instruction was.

Ah, this is a guest race where the guest modifies exactly the TLB in question. 
I see.

Why would this ever happen in practice (in a case where we're not executing 
malicious code)?

I don't know.  It probably wouldn't.  But it seems wrong to not check
the exec bit.


Right, I'm just saying that a program interrupt is not too bad of an 
answer in case the guest does something as stupid as this in kernel 
space :). It's definitely good practice to check for the exec bit.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html