[Bug 73331] New: Nested Virtualization, L2 cannot boot up on Ivybridge and Haswell
https://bugzilla.kernel.org/show_bug.cgi?id=73331 Bug ID: 73331 Summary: Nested Virtualization, L2 cannot boot up on Ivybridge and Haswell Product: Virtualization Version: unspecified Kernel Version: 3.14.0-rc3 Hardware: x86-64 OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: kvm Assignee: virtualization_...@kernel-bugs.osdl.org Reporter: robert...@intel.com CC: jan.kis...@web.de Regression: No Environment: Host OS (ia32/ia32e/IA64):ia32e Guest OS (ia32/ia32e/IA64):ia32e Guest OS Type (Linux/Windows):Linux kvm.git Commit:7227fc006b0df2c0d2966a7f4859b01bdf74 qemu.git Commit: 8648fcd52a9bcc2aa415cbe87b7c636e545acb38 Host Kernel Version:3.14.0-rc3 Hardware:Ivytown_EP, Grantley-EP Bug detailed description: -- create L2 guest, L2 guest cannot boot up on Ivytown. Reproduce steps: 1.create L1 guests: qemu-system-x86_64 --enable-kvm -m 10240 -smp 8 -net nic,macaddr=00:12:45:67:2B:1C -net tap,script=/etc/kvm/qemu-ifup nested-kvm-rhel6u4.qcow -cpu host 2. create L2 guests: qemu-system-x86_64 -enable-kvm -m 1024 -smp 2 -net none rhel6u4.qcow Current result: L2 guest cannot boot up fine Expected result: L2 guest boot up fine. Basic root-causing log: -- [root@rhel6u4 ~]# qemu-system-x86_64 -enable-kvm -m 1024 -smp 2 -net none rhel6u4.qcow VNC server running on `::1:5900' KVM: entry failed, hardware error 0x0 EAX= EBX= ECX= EDX=0623 ESI= EDI= EBP= ESP= EIP=fff0 EFL=0002 [---] CPL=3 II=0 A20=1 SMM=0 HLT=0 ES = 9300 CS =f000 000f f300 SS = f300 DS = 9300 FS = 9300 GS = 9300 LDT= 8200 TR = 8b00 GDT= IDT= CR0=6010 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= Code=ff 66 89 c8 66 5b 66 5e 66 5f 66 5d 66 c3 fb f4 fa fc 66 c3 ea 5b e0 00 f0 30 36 2f 32 33 2f 39 39 00 fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 -- You are receiving this mail because: You are watching the assignee of the bug. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for 2014-04-01
On 31 March 2014 20:57, Michael Roth mdr...@linux.vnet.ibm.com wrote: Agreed, though I feel a little weird about creating releases for tags that aren't in the official repo. Would that be acceptable from a community stand-point? I'm honestly not sure. Otherwise I think Anthony/Peter would probably still need to process a pull for stable-y.x branch in advance before we do the tarball/release. Would still help simplify things a bit though by keeping tasks compartmentalized. Anthony, Peter: in the past, prior to release, I just sent an email with a pointer to my github branch with the stable release tagged. Would a proper pull request (with a for-stable-x.y tag or somesuch) be preferable? The simplest approach seems to me to be giving you commit access to master so you can directly push release tags to it as part of making a release. thanks -- PMM -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4
Il 01/04/2014 03:17, Wu, Feng ha scritto: A fault is considered + * as a SMAP violation if all of the following + * conditions are ture: + * - X86_CR4_SMAP is set in CR4 + * - An user page is accessed + * - Page fault in kernel mode + * - !(CPL3 X86_EFLAGS_AC is set) - if CPL 3, EFLAGS.AC is clear Should it be if CPL =3 or EFLAGS.AC is clear ? What I meant is if CPL 3, EFLAGS.AC must also be clear, but your version is fine too. You choose! Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2] ARM VM System Specification
On Mon, 2014-03-31 at 10:26 -0700, Christoffer Dall wrote: However, that may be a pain to update Very much, and there would be nothing worse than having multiple divergent copies of the spec. so the preferred method could be to host it in either its current form (clear-text) or publish some versioned PDF somewhere and simply point to it from the compliant software pieces. I think that would be best. I'm at a loss to suggest a suitable neutral home though I'm afraid. Ian. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2] ARM VM System Specification
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 04/01/2014 05:49 AM, Ian Campbell wrote: On Mon, 2014-03-31 at 10:26 -0700, Christoffer Dall wrote: However, that may be a pain to update Very much, and there would be nothing worse than having multiple divergent copies of the spec. so the preferred method could be to host it in either its current form (clear-text) or publish some versioned PDF somewhere and simply point to it from the compliant software pieces. I think that would be best. I'm at a loss to suggest a suitable neutral home though I'm afraid. I agree with formal PDF (or TeX/plain text) releases + versioning; something I can print out and hit people with always works great for spec compliance. On the topic of hosting, if Linaro itself isn't neutral enough, what about under the Linux Foundation? They already host a bunch of standards like the FHS. Michael -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.14 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJTOo2JAAoJEGKVRgSEnX1Qu+gH/ja+aqOqc04gbpFvuarHtzeM nLtJ1ie4JKOJTVqqQz+lgPhx+r3H9PrMFZnGvzs8H6GYSaU66i/SHKU150J1Ig5M 1w4DYqQj7DEClpoNmpPZVJJbevp70/wDH1wFSJZXJmxV9yOpMtrNZXwB60KANP2f MJDF+iDLIkv8vRPYQJEmuV4PMcJQs88X91eEb67jChX1ntMiiDAFSM6t2cRyfVG0 la4Ujk/p/u+Kr43wbhQITOKXZpWCtI8gjp7wTuZc+EDIhmufd5uiQ2VyTZaMkPkV xuX9871Y8VHGxCkAmhHPODbp3Wd4VUP2ewOYnTRYR2IIRpa4gNNLUa+MSdrL/sU= =2ub7 -END PGP SIGNATURE- -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Public key for qemu 1.7.1
Where is the public key for qemu-1.7.1? I can't find it on qemu.org, or anywhere else. I code, therefore I am -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2] ARM VM System Specification
Start with Linaro publishing it. It can always be moved to another venue if someone dislikes Linaro having maintainership. g. On Tue, Apr 1, 2014 at 10:57 AM, Michael Casadevall michael.casadev...@linaro.org wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 04/01/2014 05:49 AM, Ian Campbell wrote: On Mon, 2014-03-31 at 10:26 -0700, Christoffer Dall wrote: However, that may be a pain to update Very much, and there would be nothing worse than having multiple divergent copies of the spec. so the preferred method could be to host it in either its current form (clear-text) or publish some versioned PDF somewhere and simply point to it from the compliant software pieces. I think that would be best. I'm at a loss to suggest a suitable neutral home though I'm afraid. I agree with formal PDF (or TeX/plain text) releases + versioning; something I can print out and hit people with always works great for spec compliance. On the topic of hosting, if Linaro itself isn't neutral enough, what about under the Linux Foundation? They already host a bunch of standards like the FHS. Michael -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.14 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJTOo2JAAoJEGKVRgSEnX1Qu+gH/ja+aqOqc04gbpFvuarHtzeM nLtJ1ie4JKOJTVqqQz+lgPhx+r3H9PrMFZnGvzs8H6GYSaU66i/SHKU150J1Ig5M 1w4DYqQj7DEClpoNmpPZVJJbevp70/wDH1wFSJZXJmxV9yOpMtrNZXwB60KANP2f MJDF+iDLIkv8vRPYQJEmuV4PMcJQs88X91eEb67jChX1ntMiiDAFSM6t2cRyfVG0 la4Ujk/p/u+Kr43wbhQITOKXZpWCtI8gjp7wTuZc+EDIhmufd5uiQ2VyTZaMkPkV xuX9871Y8VHGxCkAmhHPODbp3Wd4VUP2ewOYnTRYR2IIRpa4gNNLUa+MSdrL/sU= =2ub7 -END PGP SIGNATURE- -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4
-Original Message- From: Paolo Bonzini [mailto:pbonz...@redhat.com] Sent: Tuesday, April 01, 2014 5:27 PM To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org Subject: Re: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4 Il 01/04/2014 03:17, Wu, Feng ha scritto: A fault is considered + * as a SMAP violation if all of the following + * conditions are ture: + * - X86_CR4_SMAP is set in CR4 + * - An user page is accessed + * - Page fault in kernel mode + * - !(CPL3 X86_EFLAGS_AC is set) - if CPL 3, EFLAGS.AC is clear Should it be if CPL =3 or EFLAGS.AC is clear ? What I meant is if CPL 3, EFLAGS.AC must also be clear, but your version is fine too. You choose! Thanks for your comments! :) BTW: I tested nested today, this patch set works well for nested virtualization! Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM call agenfda for 2014-04-01
Juan Quintela quint...@redhat.com wrote: Hi Please, send any topic that you are interested in covering. Thanks, Juan. Call details: 10:00 AM to 11:00 AM EDT Every two weeks Time clarification. This time was wrong, it is 1h early. 15:00 CEST 13:00 UTC 09:00 EDT Sorry for the inconvenience (I copy paste from and old email, and did it wrong.) Later, Juan. If you need phone number details, contact me privately. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
On Mon, Mar 31, 2014 at 09:51:23PM -0300, Alejandro Comisario wrote: Again, thanks to everyone. Did you reach a conclusion or is there still a problem that might be a bug in KVM? Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
The conclusion is that the backing file stored on NFS that is the same for all 950 hosts / 10500 guests was deleted and immediatelly raised a read-only filesystem on the guest, seems that there's no way to avoid that. We developed a script to recover from that scenario if the same happens. Basically doing: * virsh stop * qemu-ndb connect * fsck * qemu-ndb disconnect * virsh start best regards. Alejandro Comisario MercadoLibre Cloud Services Arias 3751, Piso 7 (C1430CRG) Ciudad de Buenos Aires - Argentina Cel: +549(11) 15-3770-1857 Tel : +54(11) 4640-8443 On Tue, Apr 1, 2014 at 10:52 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Mon, Mar 31, 2014 at 09:51:23PM -0300, Alejandro Comisario wrote: Again, thanks to everyone. Did you reach a conclusion or is there still a problem that might be a bug in KVM? Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device
From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains configuration and control attributes of particular vm. The device is created by KVM_CREATE_DEVICE ioctl. The attributes may be retrieved and stored by calling KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com --- Documentation/virtual/kvm/devices/s390_config.txt | 13 +++ arch/s390/include/asm/kvm_host.h | 6 + arch/s390/include/uapi/asm/kvm.h | 11 ++ arch/s390/kvm/Makefile| 2 +- arch/s390/kvm/config.c| 133 ++ arch/s390/kvm/kvm-s390.c | 7 ++ arch/s390/kvm/kvm-s390.h | 4 + arch/s390/kvm/priv.c | 6 +- include/linux/kvm_host.h | 1 + include/uapi/linux/kvm.h | 1 + virt/kvm/kvm_main.c | 3 + 11 files changed, 185 insertions(+), 2 deletions(-) create mode 100644 Documentation/virtual/kvm/devices/s390_config.txt create mode 100644 arch/s390/kvm/config.c diff --git a/Documentation/virtual/kvm/devices/s390_config.txt b/Documentation/virtual/kvm/devices/s390_config.txt new file mode 100644 index 000..f9bc251 --- /dev/null +++ b/Documentation/virtual/kvm/devices/s390_config.txt @@ -0,0 +1,13 @@ +S390 Config Device +== + +The s390 config device is used for storage and retrieval of various kinds +of configuration related information which should be shared between kvm +and userspace. + +Groups: +KVM_DEV_CONFIG_GROUP + +Attributes for KVM_DEV_CONFIG_GROUP: +KVM_DEV_CONFIG_NAME - The name for this vm instance. A string of at + most 128 characters. diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h index 154b600..da93b6e 100644 --- a/arch/s390/include/asm/kvm_host.h +++ b/arch/s390/include/asm/kvm_host.h @@ -277,6 +277,11 @@ struct s390_io_adapter { #define MAX_S390_IO_ADAPTERS ((MAX_ISC + 1) * 8) #define MAX_S390_ADAPTER_MAPS 256 +struct kvm_s390_config { + struct mutex lock; + struct kvm_s390_attr_name name; +}; + struct kvm_arch{ struct sca_block *sca; debug_info_t *dbf; @@ -286,6 +291,7 @@ struct kvm_arch{ int css_support; int use_irqchip; struct s390_io_adapter *adapters[MAX_S390_IO_ADAPTERS]; + struct kvm_s390_config *cfg; }; #define KVM_HVA_ERR_BAD(-1UL) diff --git a/arch/s390/include/uapi/asm/kvm.h b/arch/s390/include/uapi/asm/kvm.h index c003c6a..d53dd45 100644 --- a/arch/s390/include/uapi/asm/kvm.h +++ b/arch/s390/include/uapi/asm/kvm.h @@ -79,6 +79,17 @@ struct kvm_debug_exit_arch { struct kvm_guest_debug_arch { }; + + +/* config device specific GROUPS */ + +#define KVM_DEV_CONFIG_GROUP 0 +#define KVM_DEV_CONFIG_NAME0 + +struct kvm_s390_attr_name { + char name[128]; +}; + #define KVM_SYNC_PREFIX (1UL 0) #define KVM_SYNC_GPRS (1UL 1) #define KVM_SYNC_ACRS (1UL 2) diff --git a/arch/s390/kvm/Makefile b/arch/s390/kvm/Makefile index d3adb37..79f130b 100644 --- a/arch/s390/kvm/Makefile +++ b/arch/s390/kvm/Makefile @@ -11,5 +11,5 @@ common-objs = $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/async_pf.o $(KVM)/irqch ccflags-y := -Ivirt/kvm -Iarch/s390/kvm -kvm-objs := $(common-objs) kvm-s390.o intercept.o interrupt.o priv.o sigp.o diag.o +kvm-objs := $(common-objs) kvm-s390.o intercept.o interrupt.o priv.o sigp.o diag.o config.o obj-$(CONFIG_KVM) += kvm.o diff --git a/arch/s390/kvm/config.c b/arch/s390/kvm/config.c new file mode 100644 index 000..f887116 --- /dev/null +++ b/arch/s390/kvm/config.c @@ -0,0 +1,133 @@ +/* + * handling CONFIG kvm device + * + * Copyright IBM Corp. 2014 + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License (version 2 only) + * as published by the Free Software Foundation. + * + *Author(s): Ekaterina Tumanova tuman...@linux.vnet.ibm.com + * Christian Borntraeger borntrae...@de.ibm.com + */ +#include linux/kvm.h +#include linux/kvm_host.h +#include asm/uaccess.h +#include kvm-s390.h + +static const struct kvm_s390_config config_defaults = { + .name = { KVMguest }, +}; + +/* + * This is called from architecture code, since we access config even if + * userspace does not create a config device. + */ +struct kvm_s390_config *kvm_s390_config_init(void) +{ + struct kvm_s390_config *config; + + config = kmalloc(sizeof(struct kvm_s390_config), GFP_KERNEL); + if (config) { + memcpy(config, config_defaults, sizeof(*config)); + mutex_init(config-lock); + } + + return config; +} + +void kvm_s390_config_free(struct
[PATCH/RFC] s390x/kvm: implement and use QEMU config device for s390
From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com The following patch adds Qemu CONFIG device, which interacts with kvm CONFIG device by calling KVM_CREATE_DEVICE (to create the device in kernel), KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR (to get and set the particular attributes of KVM CONFIG device respectively). This patch uses the Qemu CONFIG device to copy the qemu_name (if specified) into the kvm CONFIG device. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com Signed-off-by: Michael Mueller m...@linux.vnet.ibm.com Signed-off-by: Jens Freimann jf...@linux.vnet.ibm.com --- default-configs/s390x-softmmu.mak | 1 + hw/s390x/Makefile.objs| 1 + hw/s390x/config.c | 113 ++ hw/s390x/s390-virtio-ccw.c| 6 ++ hw/s390x/s390-virtio.c| 24 hw/s390x/s390-virtio.h| 1 + include/hw/s390x/config.h | 51 + linux-headers/asm-s390/kvm.h | 11 linux-headers/linux/kvm.h | 1 + trace-events | 4 ++ 10 files changed, 213 insertions(+) create mode 100644 hw/s390x/config.c create mode 100644 include/hw/s390x/config.h diff --git a/default-configs/s390x-softmmu.mak b/default-configs/s390x-softmmu.mak index d843dc0..a07697d 100644 --- a/default-configs/s390x-softmmu.mak +++ b/default-configs/s390x-softmmu.mak @@ -1,3 +1,4 @@ CONFIG_VIRTIO=y CONFIG_SCLPCONSOLE=y CONFIG_S390_FLIC=$(CONFIG_KVM) +CONFIG_S390_CONFIG=$(CONFIG_KVM) diff --git a/hw/s390x/Makefile.objs b/hw/s390x/Makefile.objs index 1ba6c3a..dc03e1d 100644 --- a/hw/s390x/Makefile.objs +++ b/hw/s390x/Makefile.objs @@ -8,3 +8,4 @@ obj-y += ipl.o obj-y += css.o obj-y += s390-virtio-ccw.o obj-y += virtio-ccw.o +obj-$(CONFIG_S390_CONFIG) += config.o diff --git a/hw/s390x/config.c b/hw/s390x/config.c new file mode 100644 index 000..f70e880 --- /dev/null +++ b/hw/s390x/config.c @@ -0,0 +1,113 @@ +/* + * S390 configuration and control device + * + * Copyright 2014 IBM Corp. + * Author(s): Ekaterina Tumanova tuman...@linux.vnet.ibm.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or (at + * your option) any later version. See the COPYING file in the top-level + * directory. + */ + +#include sys/ioctl.h +#include hw/s390x/config.h +#include trace.h + +S390ConfigState *get_config_state(void) +{ +return S390_CONFIG(object_resolve_path(/machine/s390-config, NULL)); +} + +static int config_set(S390ConfigState *config, struct kvm_device_attr *attr) +{ +int rc; + +rc = ioctl(config-fd, KVM_SET_DEVICE_ATTR, attr); +return rc 0 ? -errno : rc; +} + +/* +static int config_get(S390ConfigState *config, struct kvm_device_attr *attr) +{ +int rc; + +rc = ioctl(config-fd, KVM_GET_DEVICE_ATTR, attr); +return rc 0 ? -errno : rc; +} +*/ + +static int config_set_name(S390ConfigState *config, char *name) +{ +struct kvm_device_attr attr = { +.group = KVM_DEV_CONFIG_GROUP, +.attr = KVM_DEV_CONFIG_NAME, +.addr = (uint64_t)name, +}; + +if (!name || +(qemu_strnlen(name, S390_MACHINE_NAME_SIZE) + = S390_MACHINE_NAME_SIZE)) { +return -EINVAL; +} + +return config_set(config, attr); +} + +static void s390_config_realize(DeviceState *dev, Error **errp) +{ +S390ConfigState *cfg = S390_CONFIG(dev); +struct kvm_create_device cd = {0}; +int ret; + +if (!kvm_check_extension(kvm_state, KVM_CAP_DEVICE_CTRL)) { +trace_config_no_device_api(errno); +return; +} + +cd.type = KVM_DEV_TYPE_S390_CONFIG; +ret = kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, cd); +if (ret 0) { +trace_config_create_device(errno); +return; +} +cfg-fd = cd.fd; +} + +void s390_config_dev_init(void) +{ +DeviceState *dev; + +if (kvm_enabled()) { +dev = qdev_create(NULL, TYPE_S390_CONFIG); + +object_property_add_child(qdev_get_machine(), + TYPE_S390_CONFIG, + OBJECT(dev), NULL); +qdev_init_nofail(dev); +} +} + +static void s390_config_class_init(ObjectClass *oc, void *data) +{ +DeviceClass *dc = DEVICE_CLASS(oc); +S390ConfigClass *scc = S390_CONFIG_CLASS(oc); + +dc-realize = s390_config_realize; +scc-set_name = config_set_name; +} + +static const TypeInfo s390_config_info = { +.name = TYPE_S390_CONFIG, +.parent= TYPE_SYS_BUS_DEVICE, +.instance_size = sizeof(S390ConfigState), +.class_init= s390_config_class_init, +.class_size= sizeof(S390ConfigClass), +}; + + +static void s390_config_register_types(void) +{ +type_register_static(s390_config_info); +} + +type_init(s390_config_register_types) diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c index 0d4f6ae..1e4b9ad 100644 --- a/hw/s390x/s390-virtio-ccw.c
[PATCH/RFC] s390: Provide a configuration and control device
We want to configure several things in KVM that go beyond what ENABLE_CAP (we need payload) or ONE_REG (we need it for the VM and we need to do more complex actions) can provide. Instead of adding several s390 specific ioctls, lets provide a configuration and control device that encapsulates different commands into groups of the same area (MEMORY, CPU, ..) We also provide an initial nameless base group, with a simple first user to set the guest name. We need that name in the kernel for the emulation of STSI (which provides the guest name to the guest) but we need to implement the emulation in supervisor mode, as it also provides the underlying levels of hipervisors. Currently we have the following GROUPS and ATTRs pending, which configure some memory management related function or allow to set the guest facilities, cpuids etc: #define KVM_DEV_CONFIG_GROUP0 #define KVM_DEV_CONFIG_NAME 0 #define KVM_DEV_CONFIG_GROUP_MEM1 #define KVM_DEV_CONFIG_MEM_ENABLE_CMMA 0 #define KVM_DEV_CONFIG_MEM_CLR_CMMA 1 #define KVM_DEV_CONFIG_MEM_CLR_PAGES2 #define KVM_DEV_CONFIG_GROUP_CPU2 #define KVM_DEV_CONFIG_CPU_TYPE 0 #define KVM_DEV_CONFIG_CPU_FAC 1 #define KVM_DEV_CONFIG_CPU_FAC_MASK 2 #define KVM_DEV_CONFIG_CPU_IBC 3 #define KVM_DEV_CONFIG_CPU_IBC_RANGE4 In addition other groups like #define KVM_DEV_CONFIG_GROUP_CRYPTO are under consideration to configure crypto acceleration. Unless there is a major concern, I will add this to the next s390 PULL requests for KVM. Christian -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device
On 04/01/2014 04:47 PM, Christian Borntraeger wrote: From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains configuration and control attributes of particular vm. The device is created by KVM_CREATE_DEVICE ioctl. The attributes may be retrieved and stored by calling KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com I don't think a device is particularly the best fit. A device can usually be instantiated multiple times. The configuration device can only be created once. A device also gets created by user space which enables it to receive the fd to drive it. Your device has to be created during VM creation. I think VM configuration is common enough to just make this a separate interface. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] s390: Provide a configuration and control device
On 04/01/2014 04:47 PM, Christian Borntraeger wrote: We want to configure several things in KVM that go beyond what ENABLE_CAP (we need payload) or ONE_REG (we need it for the VM and we need to do more complex actions) can provide. Instead of adding several s390 specific ioctls, lets provide a configuration and control device that encapsulates different commands into groups of the same area (MEMORY, CPU, ..) We also provide an initial nameless base group, with a simple first user to set the guest name. We need that name in the kernel for the emulation of STSI (which provides the guest name to the guest) but we need to implement the emulation in supervisor mode, as it also provides the underlying levels of hipervisors. Currently we have the following GROUPS and ATTRs pending, which configure some memory management related function or allow to set the guest facilities, cpuids etc: #define KVM_DEV_CONFIG_GROUP0 #define KVM_DEV_CONFIG_NAME 0 #define KVM_DEV_CONFIG_GROUP_MEM1 #define KVM_DEV_CONFIG_MEM_ENABLE_CMMA 0 #define KVM_DEV_CONFIG_MEM_CLR_CMMA 1 #define KVM_DEV_CONFIG_MEM_CLR_PAGES2 #define KVM_DEV_CONFIG_GROUP_CPU2 #define KVM_DEV_CONFIG_CPU_TYPE 0 #define KVM_DEV_CONFIG_CPU_FAC 1 #define KVM_DEV_CONFIG_CPU_FAC_MASK 2 #define KVM_DEV_CONFIG_CPU_IBC 3 #define KVM_DEV_CONFIG_CPU_IBC_RANGE4 Why would CPU specific information be set in the VM? Alex In addition other groups like #define KVM_DEV_CONFIG_GROUP_CRYPTO are under consideration to configure crypto acceleration. Unless there is a major concern, I will add this to the next s390 PULL requests for KVM. Christian -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device
On 01/04/14 16:58, Alexander Graf wrote: On 04/01/2014 04:47 PM, Christian Borntraeger wrote: From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains configuration and control attributes of particular vm. The device is created by KVM_CREATE_DEVICE ioctl. The attributes may be retrieved and stored by calling KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com I don't think a device is particularly the best fit. A device can usually be instantiated multiple times. The configuration device can only be created once. A device also gets created by user space which enables it to receive the fd to drive it. Your device has to be created during VM creation. I remember some discussion a year or 2 ago, and IIRC a config device was actually your idea ;-) (The other idea that we had, was ONE_REG for the VM) I think VM configuration is common enough to just make this a separate interface. So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, instead? Seems like an easy enough change. Would you reuse the kvm_attr structure for that? Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device
On 04/01/2014 05:04 PM, Christian Borntraeger wrote: On 01/04/14 16:58, Alexander Graf wrote: On 04/01/2014 04:47 PM, Christian Borntraeger wrote: From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains configuration and control attributes of particular vm. The device is created by KVM_CREATE_DEVICE ioctl. The attributes may be retrieved and stored by calling KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com I don't think a device is particularly the best fit. A device can usually be instantiated multiple times. The configuration device can only be created once. A device also gets created by user space which enables it to receive the fd to drive it. Your device has to be created during VM creation. I remember some discussion a year or 2 ago, and IIRC a config device was actually your idea ;-) (The other idea that we had, was ONE_REG for the VM) Omg, really? :o A device would make sense for a specific system information instruction trap that we handle in-kernel for whatever reason (usually because it's performance critical) and some mandatory say to make sure user space always creates it. And some checks to make sure it can't get created twice. Speaking of which, why don't we just forward STSI to user space with an ENABLE_CAP and handle all of this there? It's not performance critical at all, right? I think VM configuration is common enough to just make this a separate interface. So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, instead? Seems like an easy enough change. Would you reuse the kvm_attr structure for that? Yeah, reuse whatever we can. Basically just remove the device boilerplate - I don't think it's impressively useful for a non-device. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/6] KVM: export mark_page_dirty_in_slot
Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 12 +--- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ba3180a5e6d3..56289b487ddd 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -582,6 +582,7 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn); int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn); unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn); void mark_page_dirty(struct kvm *kvm, gfn_t gfn); +void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn); void kvm_vcpu_block(struct kvm_vcpu *vcpu); void kvm_vcpu_kick(struct kvm_vcpu *vcpu); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index bf2fa29c8beb..98833e21d580 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -99,8 +99,6 @@ static void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new, u64 last_generation); static void kvm_release_pfn_dirty(pfn_t pfn); -static void mark_page_dirty_in_slot(struct kvm *kvm, - struct kvm_memory_slot *memslot, gfn_t gfn); bool kvm_rebooting; EXPORT_SYMBOL_GPL(kvm_rebooting); @@ -1583,7 +1581,7 @@ int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, r = __copy_to_user((void __user *)ghc-hva, data, len); if (r) return -EFAULT; - mark_page_dirty_in_slot(kvm, ghc-memslot, ghc-gpa PAGE_SHIFT); + mark_page_dirty_in_slot(ghc-memslot, ghc-gpa PAGE_SHIFT); return 0; } @@ -1641,9 +1639,8 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len) } EXPORT_SYMBOL_GPL(kvm_clear_guest); -static void mark_page_dirty_in_slot(struct kvm *kvm, - struct kvm_memory_slot *memslot, - gfn_t gfn) +void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, + gfn_t gfn) { if (memslot memslot-dirty_bitmap) { unsigned long rel_gfn = gfn - memslot-base_gfn; @@ -1651,13 +1648,14 @@ static void mark_page_dirty_in_slot(struct kvm *kvm, set_bit_le(rel_gfn, memslot-dirty_bitmap); } } +EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot); void mark_page_dirty(struct kvm *kvm, gfn_t gfn) { struct kvm_memory_slot *memslot; memslot = gfn_to_memslot(kvm, gfn); - mark_page_dirty_in_slot(kvm, memslot, gfn); + mark_page_dirty_in_slot(memslot, gfn); } EXPORT_SYMBOL_GPL(mark_page_dirty); -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] KVM: emulate: introduce memory_prepare callback to speed up memory access
Emulating a RMW instruction currently walks the page tables twice. To avoid this, store the virtual address in the host and access it directly. In fact, it turns out that the optimizations we can do actually benefit all memory accesses. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/include/asm/kvm_emulate.h | 26 +++ arch/x86/kvm/x86.c | 67 ++ include/linux/kvm_host.h | 5 +++ virt/kvm/kvm_main.c| 5 --- 4 files changed, 98 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index f7b1e45eb753..4a580be2553e 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -167,6 +167,32 @@ struct x86_emulate_ops { const void *new, unsigned int bytes, struct x86_exception *fault); + + /* +* memory_prepare: Prepare userspace access fastpath. +* @addr: [IN ] Linear address to access. +* @bytes:[IN ] Number of bytes to access. +* @write:[IN ] True if *p_hva will be written to. +* @p_opaque: [OUT] Value passed back to memory_finish. +* @p_hva:[OUT] Host virtual address for __copy_from/to_user. +*/ + int (*memory_prepare)(struct x86_emulate_ctxt *ctxt, + unsigned long addr, + unsigned int bytes, + struct x86_exception *exception, + bool write, + void **p_opaque, + unsigned long *p_hva); + + /* +* memory_finish: Complete userspace access fastpath. +* @opaque: [OUT] Value passed back from memory_prepare. +* @hva:[OUT] Host virtual address computed in memory_prepare. +*/ + void (*memory_finish)(struct x86_emulate_ctxt *ctxt, + void *p_opaque, + unsigned long p_hva); + void (*invlpg)(struct x86_emulate_ctxt *ctxt, ulong addr); int (*pio_in_emulated)(struct x86_emulate_ctxt *ctxt, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e70a0e3227b1..64d0f171996b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4273,6 +4273,71 @@ static const struct read_write_emulator_ops write_emultor = { .write = true, }; +static int emulator_memory_prepare(struct x86_emulate_ctxt *ctxt, + unsigned long addr, + unsigned int bytes, + struct x86_exception *exception, + bool write, + void **p_opaque, + unsigned long *p_hva) +{ + struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); + struct kvm_memory_slot *memslot; + int ret; + gpa_t gpa; + gfn_t gfn; + unsigned long hva; + + if (unlikely(((addr + bytes - 1) ^ addr) PAGE_MASK)) + goto no_hva; + + ret = vcpu_mmio_gva_to_gpa(vcpu, addr, gpa, exception, true); + if (ret != 0) { + if (ret 0) + return X86EMUL_PROPAGATE_FAULT; + goto no_hva; + } + + /* A (heavily) simplified version of kvm_gfn_to_hva_cache_init. */ + gfn = gpa PAGE_SHIFT; + memslot = gfn_to_memslot(vcpu-kvm, gfn); + if (!memslot) + goto no_hva; + + if (write) { + if (memslot_is_readonly(memslot)) + goto no_hva; + + *p_opaque = memslot-dirty_bitmap ? memslot : NULL; + } + + hva = __gfn_to_hva_memslot(memslot, gfn); + if (kvm_is_error_hva(hva)) + goto no_hva; + + *p_hva = hva + offset_in_page(gpa); + return X86EMUL_CONTINUE; + +no_hva: + *p_hva = KVM_HVA_ERR_BAD; + return X86EMUL_CONTINUE; +} + +static void emulator_memory_finish(struct x86_emulate_ctxt *ctxt, + void *opaque, + unsigned long hva) +{ + struct kvm_memory_slot *memslot; + gfn_t gfn; + + if (!opaque) + return; + + memslot = opaque; + gfn = hva_to_gfn_memslot(hva, memslot); + mark_page_dirty_in_slot(memslot, gfn); +} + static int emulator_read_write_onepage(unsigned long addr, void *val, unsigned int bytes, struct x86_exception *exception, @@ -4816,6 +4881,8 @@ static const struct x86_emulate_ops emulate_ops = { .read_std= kvm_read_guest_virt_system, .write_std = kvm_write_guest_virt_system, .fetch = kvm_fetch_guest_virt, + .memory_prepare
[PATCH 2/6] KVM: emulate: abstract handling of memory operands
Abstract the pre-execution processing and writeback of memory operands in new functions. We will soon do some work before execution even for move destination, so call the function in that case too; but not for the memory operand of lea, invlpg etc. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/kvm/emulate.c | 43 --- 1 file changed, 28 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index b42184eccbcc..c7ef72c1289e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1545,6 +1545,29 @@ exception: return X86EMUL_PROPAGATE_FAULT; } +static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt, + struct operand *op) +{ + return segmented_read(ctxt, op-addr.mem, op-val, op-bytes); +} + +static int cmpxchg_memory_operand(struct x86_emulate_ctxt *ctxt, + struct operand *op) +{ + return segmented_cmpxchg(ctxt, op-addr.mem, +op-orig_val, +op-val, +op-bytes); +} + +static int write_memory_operand(struct x86_emulate_ctxt *ctxt, + struct operand *op) +{ + return segmented_write(ctxt, op-addr.mem, + op-val, + op-bytes); +} + static void write_register_operand(struct operand *op) { /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */ @@ -1572,16 +1595,9 @@ static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op) break; case OP_MEM: if (ctxt-lock_prefix) - return segmented_cmpxchg(ctxt, -op-addr.mem, -op-orig_val, -op-val, -op-bytes); + return cmpxchg_memory_operand(ctxt, op); else - return segmented_write(ctxt, - op-addr.mem, - op-val, - op-bytes); + return write_memory_operand(ctxt, op); break; case OP_MEM_STR: return segmented_write(ctxt, @@ -4588,16 +4604,14 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) } if ((ctxt-src.type == OP_MEM) !(ctxt-d NoAccess)) { - rc = segmented_read(ctxt, ctxt-src.addr.mem, - ctxt-src.valptr, ctxt-src.bytes); + rc = prepare_memory_operand(ctxt, ctxt-src); if (rc != X86EMUL_CONTINUE) goto done; ctxt-src.orig_val64 = ctxt-src.val64; } if (ctxt-src2.type == OP_MEM) { - rc = segmented_read(ctxt, ctxt-src2.addr.mem, - ctxt-src2.val, ctxt-src2.bytes); + rc = prepare_memory_operand(ctxt, ctxt-src2); if (rc != X86EMUL_CONTINUE) goto done; } @@ -4608,8 +4622,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) if ((ctxt-dst.type == OP_MEM) !(ctxt-d Mov)) { /* optimisation - avoid slow emulated read if Mov */ - rc = segmented_read(ctxt, ctxt-dst.addr.mem, - ctxt-dst.val, ctxt-dst.bytes); + rc = prepare_memory_operand(ctxt, ctxt-dst); if (rc != X86EMUL_CONTINUE) goto done; } -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/6] KVM: emulate: activate memory access optimization
memory_prepare lets us replace segmented_read/segmented_write with direct calls to __copy_from_user/__copy_to_user. This saves about 70 cycles (15%) on arithmetic with a memory source operand, and about 150 cycles (25%) on arithmetic with a memory destination operand. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/include/asm/kvm_emulate.h | 2 ++ arch/x86/kvm/emulate.c | 50 ++ 2 files changed, 47 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 4a580be2553e..a572d4fabd4f 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -263,6 +263,8 @@ struct operand { u64 mm_val; void *data; }; + unsigned long hva; + void *opaque; }; struct fetch_cache { diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c7ef72c1289e..2c881e5cf5ad 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1546,9 +1546,37 @@ exception: } static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt, - struct operand *op) + struct operand *op, + bool write) { - return segmented_read(ctxt, op-addr.mem, op-val, op-bytes); + int rc; + unsigned long gva; + unsigned int size = op-bytes; + + rc = linearize(ctxt, op-addr.mem, size, write, gva); + if (rc != X86EMUL_CONTINUE) + return rc; + + rc = ctxt-ops-memory_prepare(ctxt, gva, size, + ctxt-exception, true, + op-opaque, op-hva); + if (rc != X86EMUL_CONTINUE) + return rc; + + if (likely(!kvm_is_error_hva(op-hva))) { + rc = __copy_from_user(op-val, (void __user *)op-hva, + size); + if (!write) + ctxt-ops-memory_finish(ctxt, op-opaque, op-hva); + + if (likely(!rc)) + return X86EMUL_CONTINUE; + + /* Should not happen. */ + op-hva = KVM_HVA_ERR_BAD; + } + + return read_emulated(ctxt, gva, op-val, size); } static int cmpxchg_memory_operand(struct x86_emulate_ctxt *ctxt, @@ -1563,6 +1591,17 @@ static int cmpxchg_memory_operand(struct x86_emulate_ctxt *ctxt, static int write_memory_operand(struct x86_emulate_ctxt *ctxt, struct operand *op) { + int rc; + + if (likely(!kvm_is_error_hva(op-hva))) { + rc = __copy_to_user((void __user *)op-hva, op-val, + op-bytes); + ctxt-ops-memory_finish(ctxt, op-opaque, op-hva); + + if (likely(!rc)) + return X86EMUL_CONTINUE; + } + return segmented_write(ctxt, op-addr.mem, op-val, op-bytes); @@ -4604,14 +4643,14 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) } if ((ctxt-src.type == OP_MEM) !(ctxt-d NoAccess)) { - rc = prepare_memory_operand(ctxt, ctxt-src); + rc = prepare_memory_operand(ctxt, ctxt-src, false); if (rc != X86EMUL_CONTINUE) goto done; ctxt-src.orig_val64 = ctxt-src.val64; } if (ctxt-src2.type == OP_MEM) { - rc = prepare_memory_operand(ctxt, ctxt-src2); + rc = prepare_memory_operand(ctxt, ctxt-src2, false); if (rc != X86EMUL_CONTINUE) goto done; } @@ -4622,7 +4661,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) if ((ctxt-dst.type == OP_MEM) !(ctxt-d Mov)) { /* optimisation - avoid slow emulated read if Mov */ - rc = prepare_memory_operand(ctxt, ctxt-dst); + rc = prepare_memory_operand(ctxt, ctxt-dst, + !(ctxt-d NoWrite)); if (rc != X86EMUL_CONTINUE) goto done; } -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 0/6] KVM: x86: speedups for emulator memory accesses
Another emulator speedup series, shaving up to 400 cycles (25%) off RMW instructions. The performance of various instructions is now relatively flat: jump 919 (down from 2300) move 1075 (down from 2700) arith 1081 (down from 2600) load 1267 (down from 2800, 1400 after previous round) store 1213 (down from 2900, 1300 after previous round) RMW 1310 (down from 3200, 1700 after previous round) The next low-hanging fruit is fetching instructions and initializing the context. Similar optimizations to those done here could be made for instruction fetch. Paolo Bonzini (6): KVM: emulate: simplify writeback KVM: emulate: abstract handling of memory operands KVM: export mark_page_dirty_in_slot KVM: emulate: introduce memory_prepare callback to speed up memory access KVM: emulate: activate memory access optimization KVM: emulate: extend memory access optimization to stores arch/x86/include/asm/kvm_emulate.h | 28 ++ arch/x86/kvm/emulate.c | 107 +++-- arch/x86/kvm/x86.c | 67 +++ include/linux/kvm_host.h | 6 +++ virt/kvm/kvm_main.c| 17 ++ 5 files changed, 185 insertions(+), 40 deletions(-) -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/6] KVM: emulate: simplify writeback
The if/return checks are useless, because we return X86EMUL_CONTINUE anyway if we do not return. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/kvm/emulate.c | 28 +++- 1 file changed, 11 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 4a3584d419e5..b42184eccbcc 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1566,34 +1566,28 @@ static void write_register_operand(struct operand *op) static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op) { - int rc; - switch (op-type) { case OP_REG: write_register_operand(op); break; case OP_MEM: if (ctxt-lock_prefix) - rc = segmented_cmpxchg(ctxt, + return segmented_cmpxchg(ctxt, +op-addr.mem, +op-orig_val, +op-val, +op-bytes); + else + return segmented_write(ctxt, op-addr.mem, - op-orig_val, op-val, op-bytes); - else - rc = segmented_write(ctxt, -op-addr.mem, -op-val, -op-bytes); - if (rc != X86EMUL_CONTINUE) - return rc; break; case OP_MEM_STR: - rc = segmented_write(ctxt, - op-addr.mem, - op-data, - op-bytes * op-count); - if (rc != X86EMUL_CONTINUE) - return rc; + return segmented_write(ctxt, + op-addr.mem, + op-data, + op-bytes * op-count); break; case OP_XMM: write_sse_reg(ctxt, op-vec_val, op-addr.xmm); -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/6] KVM: emulate: extend memory access optimization to stores
Even on a store, the optimization saves about 20 clock cycles (4-5%), mostly because the jump in write_memory_operand becomes much more predictable. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/kvm/emulate.c | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 2c881e5cf5ad..541296f62b40 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1547,7 +1547,7 @@ exception: static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt, struct operand *op, - bool write) + bool read, bool write) { int rc; unsigned long gva; @@ -1563,6 +1563,10 @@ static int prepare_memory_operand(struct x86_emulate_ctxt *ctxt, if (rc != X86EMUL_CONTINUE) return rc; + /* optimisation - avoid slow emulated read if Mov */ + if (!read) + return X86EMUL_CONTINUE; + if (likely(!kvm_is_error_hva(op-hva))) { rc = __copy_from_user(op-val, (void __user *)op-hva, size); @@ -4643,14 +4647,14 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) } if ((ctxt-src.type == OP_MEM) !(ctxt-d NoAccess)) { - rc = prepare_memory_operand(ctxt, ctxt-src, false); + rc = prepare_memory_operand(ctxt, ctxt-src, true, false); if (rc != X86EMUL_CONTINUE) goto done; ctxt-src.orig_val64 = ctxt-src.val64; } if (ctxt-src2.type == OP_MEM) { - rc = prepare_memory_operand(ctxt, ctxt-src2, false); + rc = prepare_memory_operand(ctxt, ctxt-src2, true, false); if (rc != X86EMUL_CONTINUE) goto done; } @@ -4659,9 +4663,9 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) goto special_insn; - if ((ctxt-dst.type == OP_MEM) !(ctxt-d Mov)) { - /* optimisation - avoid slow emulated read if Mov */ + if (ctxt-dst.type == OP_MEM) { rc = prepare_memory_operand(ctxt, ctxt-dst, + !(ctxt-d Mov), !(ctxt-d NoWrite)); if (rc != X86EMUL_CONTINUE) goto done; -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call for agenda 2014-04-01
2014-04-01 -- - port of virtio to vmstate * problem are lists (virtio-net/virtio-serial) Relaese process - Delay release for a few days to be able to test the tarball. - Michael Rooth do the realese - Stefan get used as realse manager backup - Dates agreed: 2014-04-01 Tag v2.0.0-rc1 2014-04-08 Tag v2.0.0 Conference call - we sent calls for agenda one day before, 2 weeks before, 1 week before and not a lot of calls. What should we do to improve? Plan is to sent: one week before and a reminder the day before - Qemu is better modularized nowadays, and less changes involve the whole tree - If you have any suggestion for the conference call told me. Stefan virtio property patches -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call for agenda 2014-04-01
2014-04-01 -- - port of virtio to vmstate * problem are lists (virtio-net/virtio-serial) Relaese process - Delay release for a few days to be able to test the tarball. - Michael Rooth do the realese - Stefan get used as realse manager backup - Dates agreed: 2014-04-01 Tag v2.0.0-rc1 2014-04-08 Tag v2.0.0 Conference call - we sent calls for agenda one day before, 2 weeks before, 1 week before and not a lot of calls. What should we do to improve? Plan is to sent: one week before and a reminder the day before - Qemu is better modularized nowadays, and less changes involve the whole tree - If you have any suggestion for the conference call told me. Stefan virtio property patches -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] s390x/kvm: implement and use QEMU config device for s390
Il 01/04/2014 16:47, Christian Borntraeger ha scritto: From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com The following patch adds Qemu CONFIG device, which interacts with kvm CONFIG device by calling KVM_CREATE_DEVICE (to create the device in kernel), KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR (to get and set the particular attributes of KVM CONFIG device respectively). This patch uses the Qemu CONFIG device to copy the qemu_name (if specified) into the kvm CONFIG device. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com Signed-off-by: Michael Mueller m...@linux.vnet.ibm.com Signed-off-by: Jens Freimann jf...@linux.vnet.ibm.com On the QEMU side we'll soon have per-machine-type options in -machine. s390 can then use that. Paolo --- default-configs/s390x-softmmu.mak | 1 + hw/s390x/Makefile.objs| 1 + hw/s390x/config.c | 113 ++ hw/s390x/s390-virtio-ccw.c| 6 ++ hw/s390x/s390-virtio.c| 24 hw/s390x/s390-virtio.h| 1 + include/hw/s390x/config.h | 51 + linux-headers/asm-s390/kvm.h | 11 linux-headers/linux/kvm.h | 1 + trace-events | 4 ++ 10 files changed, 213 insertions(+) create mode 100644 hw/s390x/config.c create mode 100644 include/hw/s390x/config.h diff --git a/default-configs/s390x-softmmu.mak b/default-configs/s390x-softmmu.mak index d843dc0..a07697d 100644 --- a/default-configs/s390x-softmmu.mak +++ b/default-configs/s390x-softmmu.mak @@ -1,3 +1,4 @@ CONFIG_VIRTIO=y CONFIG_SCLPCONSOLE=y CONFIG_S390_FLIC=$(CONFIG_KVM) +CONFIG_S390_CONFIG=$(CONFIG_KVM) diff --git a/hw/s390x/Makefile.objs b/hw/s390x/Makefile.objs index 1ba6c3a..dc03e1d 100644 --- a/hw/s390x/Makefile.objs +++ b/hw/s390x/Makefile.objs @@ -8,3 +8,4 @@ obj-y += ipl.o obj-y += css.o obj-y += s390-virtio-ccw.o obj-y += virtio-ccw.o +obj-$(CONFIG_S390_CONFIG) += config.o diff --git a/hw/s390x/config.c b/hw/s390x/config.c new file mode 100644 index 000..f70e880 --- /dev/null +++ b/hw/s390x/config.c @@ -0,0 +1,113 @@ +/* + * S390 configuration and control device + * + * Copyright 2014 IBM Corp. + * Author(s): Ekaterina Tumanova tuman...@linux.vnet.ibm.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or (at + * your option) any later version. See the COPYING file in the top-level + * directory. + */ + +#include sys/ioctl.h +#include hw/s390x/config.h +#include trace.h + +S390ConfigState *get_config_state(void) +{ +return S390_CONFIG(object_resolve_path(/machine/s390-config, NULL)); +} + +static int config_set(S390ConfigState *config, struct kvm_device_attr *attr) +{ +int rc; + +rc = ioctl(config-fd, KVM_SET_DEVICE_ATTR, attr); +return rc 0 ? -errno : rc; +} + +/* +static int config_get(S390ConfigState *config, struct kvm_device_attr *attr) +{ +int rc; + +rc = ioctl(config-fd, KVM_GET_DEVICE_ATTR, attr); +return rc 0 ? -errno : rc; +} +*/ + +static int config_set_name(S390ConfigState *config, char *name) +{ +struct kvm_device_attr attr = { +.group = KVM_DEV_CONFIG_GROUP, +.attr = KVM_DEV_CONFIG_NAME, +.addr = (uint64_t)name, +}; + +if (!name || +(qemu_strnlen(name, S390_MACHINE_NAME_SIZE) + = S390_MACHINE_NAME_SIZE)) { +return -EINVAL; +} + +return config_set(config, attr); +} + +static void s390_config_realize(DeviceState *dev, Error **errp) +{ +S390ConfigState *cfg = S390_CONFIG(dev); +struct kvm_create_device cd = {0}; +int ret; + +if (!kvm_check_extension(kvm_state, KVM_CAP_DEVICE_CTRL)) { +trace_config_no_device_api(errno); +return; +} + +cd.type = KVM_DEV_TYPE_S390_CONFIG; +ret = kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, cd); +if (ret 0) { +trace_config_create_device(errno); +return; +} +cfg-fd = cd.fd; +} + +void s390_config_dev_init(void) +{ +DeviceState *dev; + +if (kvm_enabled()) { +dev = qdev_create(NULL, TYPE_S390_CONFIG); + +object_property_add_child(qdev_get_machine(), + TYPE_S390_CONFIG, + OBJECT(dev), NULL); +qdev_init_nofail(dev); +} +} + +static void s390_config_class_init(ObjectClass *oc, void *data) +{ +DeviceClass *dc = DEVICE_CLASS(oc); +S390ConfigClass *scc = S390_CONFIG_CLASS(oc); + +dc-realize = s390_config_realize; +scc-set_name = config_set_name; +} + +static const TypeInfo s390_config_info = { +.name = TYPE_S390_CONFIG, +.parent= TYPE_SYS_BUS_DEVICE, +.instance_size = sizeof(S390ConfigState), +.class_init= s390_config_class_init, +.class_size= sizeof(S390ConfigClass), +}; + + +static void s390_config_register_types(void) +{ +type_register_static(s390_config_info); +} +
[RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override
The driver_override field allows us to specify the driver for a device rather than relying on the driver to provide a positive match of the device. This shortcuts the existing process of looking up the vendor and device ID, adding them to the driver new_id, binding the device, then removing the ID, but it also provides a couple advantages. First, the above process allows the driver to bind to any device matching the new_id for the window where it's enabled. This is often not desired, such as the case of trying to bind a single device to a meta driver like pci-stub or vfio-pci. Using driver_override we can do this deterministically using: echo pci-stub /sys/bus/pci/devices/:03:00.0/driver_override echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Previously we could not invoke drivers_probe after adding a device to new_id for a driver as we get non-deterministic behavior whether the driver we intend or the standard driver will claim the device. Now it becomes a deterministic process, only the driver matching driver_override will probe the device. To return the device to the standard driver, we simply clear the driver_override and reprobe the device, ex: echo /sys/bus/pci/devices/:03:00.0/preferred_driver echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Another advantage to this approach is that we can specify a driver override to force a specific binding or prevent any binding. For instance when an IOMMU group is exposed to userspace through VFIO we require that all devices within that group are owned by VFIO. However, devices can be hot-added into an IOMMU group, in which case we want to prevent the device from binding to any driver (preferred driver = none) or perhaps have it automatically bind to vfio-pci. With driver_override it's a simple matter for this field to be set internally when the device is first discovered to prevent driver matches. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Apologies for the exceptionally long cc list, this is a follow-up to Stuart's Subject: mechanism to allow a driver to bind to any device thread. This is effectively a v2 of the proof-of-concept patch I posted in that thread. This version changes to use a dummy id struct to return on an override match, which removes the collateral damage and greatly simplifies the patch. This feels fairly well baked for PCI and I would expect that platform drivers could do a similar implementation. From there perhaps we can discuss whether there's any advantage to placing driver_override on struct device. The logic for incorporating it into the match still needs to happen per bus driver, so it might only contribute to consistency of the show/store sysfs attributes to move it up to struct device. Please comment. Thanks, Alex drivers/pci/pci-driver.c | 25 ++--- drivers/pci/pci-sysfs.c | 40 include/linux/pci.h |1 + 3 files changed, 63 insertions(+), 3 deletions(-) diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 25f0bc6..f780eb8 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -216,6 +216,13 @@ const struct pci_device_id *pci_match_id(const struct pci_device_id *ids, return NULL; } +static const struct pci_device_id pci_device_id_any = { + .vendor = PCI_ANY_ID, + .device = PCI_ANY_ID, + .subvendor = PCI_ANY_ID, + .subdevice = PCI_ANY_ID, +}; + /** * pci_match_device - Tell if a PCI device structure has a matching PCI device id structure * @drv: the PCI driver to match against @@ -229,18 +236,30 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv, struct pci_dev *dev) { struct pci_dynid *dynid; + const struct pci_device_id *found_id = NULL; + + /* When driver_override is set, only bind to the matching driver */ + if (dev-driver_override strcmp(dev-driver_override, drv-name)) + return NULL; /* Look at the dynamic ids first, before the static ones */ spin_lock(drv-dynids.lock); list_for_each_entry(dynid, drv-dynids.list, node) { if (pci_match_one_device(dynid-id, dev)) { - spin_unlock(drv-dynids.lock); - return dynid-id; + found_id = dynid-id; + break; } } spin_unlock(drv-dynids.lock); - return pci_match_id(drv-id_table, dev); + if (!found_id) + found_id = pci_match_id(drv-id_table, dev); + + /* driver_override will always match, send a dummy id */ + if (!found_id dev-driver_override) + found_id = pci_device_id_any; + + return found_id; } struct drv_dev_and_id { diff
Re: [RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override
On Tue, Apr 01, 2014 at 10:28:54AM -0600, Alex Williamson wrote: The driver_override field allows us to specify the driver for a device rather than relying on the driver to provide a positive match of the device. This shortcuts the existing process of looking up the vendor and device ID, adding them to the driver new_id, binding the device, then removing the ID, but it also provides a couple advantages. First, the above process allows the driver to bind to any device matching the new_id for the window where it's enabled. This is often not desired, such as the case of trying to bind a single device to a meta driver like pci-stub or vfio-pci. Using driver_override we can do this deterministically using: echo pci-stub /sys/bus/pci/devices/:03:00.0/driver_override echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Previously we could not invoke drivers_probe after adding a device to new_id for a driver as we get non-deterministic behavior whether the driver we intend or the standard driver will claim the device. Now it becomes a deterministic process, only the driver matching driver_override will probe the device. To return the device to the standard driver, we simply clear the driver_override and reprobe the device, ex: echo /sys/bus/pci/devices/:03:00.0/preferred_driver echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Another advantage to this approach is that we can specify a driver override to force a specific binding or prevent any binding. For instance when an IOMMU group is exposed to userspace through VFIO we require that all devices within that group are owned by VFIO. However, devices can be hot-added into an IOMMU group, in which case we want to prevent the device from binding to any driver (preferred driver = none) or perhaps have it automatically bind to vfio-pci. With driver_override it's a simple matter for this field to be set internally when the device is first discovered to prevent driver matches. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Apologies for the exceptionally long cc list, this is a follow-up to Stuart's Subject: mechanism to allow a driver to bind to any device thread. This is effectively a v2 of the proof-of-concept patch I posted in that thread. This version changes to use a dummy id struct to return on an override match, which removes the collateral damage and greatly simplifies the patch. This feels fairly well baked for PCI and I would expect that platform drivers could do a similar implementation. From there perhaps we can discuss whether there's any advantage to placing driver_override on struct device. The logic for incorporating it into the match still needs to happen per bus driver, so it might only contribute to consistency of the show/store sysfs attributes to move it up to struct device. Please comment. Thanks, Alex drivers/pci/pci-driver.c | 25 ++--- drivers/pci/pci-sysfs.c | 40 include/linux/pci.h |1 + 3 files changed, 63 insertions(+), 3 deletions(-) No Documentation/ABI/ update to reflect the ABI you are creating? thanks, greg k-h -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
On Tue, 2014-04-01 at 07:47 +0200, Alexander Graf wrote: Am 01.04.2014 um 01:03 schrieb Scott Wood scottw...@freescale.com: On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote: On 03/26/2014 10:17 PM, Scott Wood wrote: On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote: +/* + * Another thread may rewrite the TLB entry in parallel, don't + * execute from the address if the execute permission is not set + */ What happens when another thread rewrites the TLB entry in parallel? Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? Are the contents of the MAS registers consistent at this point or inconsistent? If another thread rewrites the TLB entry, then we use the new TLB entry, just as if it had raced in hardware. This check ensures that we don't execute from the new TLB entry if it doesn't have execute permissions (just as hardware would). What happens if the new TLB entry is valid and executable, and the instruction pointed to is valid, but doesn't trap (and thus we don't have emulation for it)? There has to be a good way to detect such a race and deal with it, no? How would you detect it? We don't get any information from the trap about what physical address the instruction was fetched from, or what the instruction was. Ah, this is a guest race where the guest modifies exactly the TLB in question. I see. Why would this ever happen in practice (in a case where we're not executing malicious code)? I don't know. It probably wouldn't. But it seems wrong to not check the exec bit. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call for agenda 2014-04-01
On 1 April 2014 16:34, Juan Quintela quint...@redhat.com wrote: Relaese process - Delay release for a few days to be able to test the tarball. - Michael Rooth do the realese - Stefan get used as realse manager backup A point I didn't get a clear answer to on the call: for rc1 are we going to give Michael commit access to master so he can do the whole update-VERSION-tag-roll-tarball process, or do I need to do at least the VERSION-and-tag part this time round? thanks -- PMM -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
On 04/01/2014 06:58 PM, Scott Wood wrote: On Tue, 2014-04-01 at 07:47 +0200, Alexander Graf wrote: Am 01.04.2014 um 01:03 schrieb Scott Wood scottw...@freescale.com: On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote: On 03/26/2014 10:17 PM, Scott Wood wrote: On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote: +/* + * Another thread may rewrite the TLB entry in parallel, don't + * execute from the address if the execute permission is not set + */ What happens when another thread rewrites the TLB entry in parallel? Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? Are the contents of the MAS registers consistent at this point or inconsistent? If another thread rewrites the TLB entry, then we use the new TLB entry, just as if it had raced in hardware. This check ensures that we don't execute from the new TLB entry if it doesn't have execute permissions (just as hardware would). What happens if the new TLB entry is valid and executable, and the instruction pointed to is valid, but doesn't trap (and thus we don't have emulation for it)? There has to be a good way to detect such a race and deal with it, no? How would you detect it? We don't get any information from the trap about what physical address the instruction was fetched from, or what the instruction was. Ah, this is a guest race where the guest modifies exactly the TLB in question. I see. Why would this ever happen in practice (in a case where we're not executing malicious code)? I don't know. It probably wouldn't. But it seems wrong to not check the exec bit. Right, I'm just saying that a program interrupt is not too bad of an answer in case the guest does something as stupid as this in kernel space :). It's definitely good practice to check for the exec bit. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override
On Tue, 2014-04-01 at 09:47 -0700, Greg KH wrote: On Tue, Apr 01, 2014 at 10:28:54AM -0600, Alex Williamson wrote: The driver_override field allows us to specify the driver for a device rather than relying on the driver to provide a positive match of the device. This shortcuts the existing process of looking up the vendor and device ID, adding them to the driver new_id, binding the device, then removing the ID, but it also provides a couple advantages. First, the above process allows the driver to bind to any device matching the new_id for the window where it's enabled. This is often not desired, such as the case of trying to bind a single device to a meta driver like pci-stub or vfio-pci. Using driver_override we can do this deterministically using: echo pci-stub /sys/bus/pci/devices/:03:00.0/driver_override echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Previously we could not invoke drivers_probe after adding a device to new_id for a driver as we get non-deterministic behavior whether the driver we intend or the standard driver will claim the device. Now it becomes a deterministic process, only the driver matching driver_override will probe the device. To return the device to the standard driver, we simply clear the driver_override and reprobe the device, ex: echo /sys/bus/pci/devices/:03:00.0/preferred_driver echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Another advantage to this approach is that we can specify a driver override to force a specific binding or prevent any binding. For instance when an IOMMU group is exposed to userspace through VFIO we require that all devices within that group are owned by VFIO. However, devices can be hot-added into an IOMMU group, in which case we want to prevent the device from binding to any driver (preferred driver = none) or perhaps have it automatically bind to vfio-pci. With driver_override it's a simple matter for this field to be set internally when the device is first discovered to prevent driver matches. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Apologies for the exceptionally long cc list, this is a follow-up to Stuart's Subject: mechanism to allow a driver to bind to any device thread. This is effectively a v2 of the proof-of-concept patch I posted in that thread. This version changes to use a dummy id struct to return on an override match, which removes the collateral damage and greatly simplifies the patch. This feels fairly well baked for PCI and I would expect that platform drivers could do a similar implementation. From there perhaps we can discuss whether there's any advantage to placing driver_override on struct device. The logic for incorporating it into the match still needs to happen per bus driver, so it might only contribute to consistency of the show/store sysfs attributes to move it up to struct device. Please comment. Thanks, Alex drivers/pci/pci-driver.c | 25 ++--- drivers/pci/pci-sysfs.c | 40 include/linux/pci.h |1 + 3 files changed, 63 insertions(+), 3 deletions(-) No Documentation/ABI/ update to reflect the ABI you are creating? Oops, thanks for the reminder. I'd propose this: diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index a3c5a66..55ca6e5 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -250,3 +250,21 @@ Description: valid. For example, writing a 2 to this file when sriov_numvfs is not 0 and not 2 already will return an error. Writing a 10 when the value of sriov_totalvfs is 8 will return an error. + +What: /sys/bus/pci/devices/.../driver_override +Date: April 2014 +Contact: Alex Williamson alex.william...@redhat.com +Description: + This file allows the driver for a device to be specified + which will override standard static and dynamic ID matching. + When specified, only a driver with a name matching the value + written to driver_override will have an opportunity to bind + to the device. The override may be cleared by writing an + empty string (ex. echo driver_override), returning the + device to standard matching rules binding. Writing to + driver_override does not automatically unbind the device from + its current driver or make any attempt to automatically load + the specified driver name. If no driver with a matching name + is currently loaded in the kernel, no match will be found. + This also
Re: VDSO pvclock may increase host cpu consumption, is this a problem?
On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote: On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote: On 03/29/2014 01:47 AM, Zhanghailiang wrote: Hi, I found when Guest is idle, VDSO pvclock may increase host consumption. We can calcutate as follow, Correct me if I am wrong. (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest) In Host, VDSO pvclock introduce a notifier chain, pvclock_gtod_chain in timekeeping.c. It consume nearly 900 cycles per call. So in consideration of 250 Hz, it may consume 225,000 cycles per second, even no VM is created. In Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If the no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per call. The feature decrease 150 cycles consumption per call. When call gettimeofday 1500 times,it decrease 225,000 cycles,equal to the host consumption. Both Host and Guest is linux-3.13.6. So, whether the host cpu consumption is a problem? Does pvclock serve any real purpose on systems with fully-functional TSCs? The x86 guest implementation is awful, so it's about 2x slower than TSC. It could be improved a lot, but I'm not sure I understand why it exists in the first place. VM migration. Why does that need percpu stuff? Wouldn't it be sufficient to interrupt all CPUs (or at least all cpus running in userspace) on migration and update the normal timing data structures? Are you suggesting to allow interruption of the timekeeping code at any time to update frequency information ? Do you want to that as a special tsc clocksource driver ? Even better: have the VM offer to invalidate the physical page containing the kernel's clock data on migration and interrupt one CPU. If another CPU races, it'll fault and wait for the guest kernel to update its timing. Perhaps that is a good idea. Does the current kvmclock stuff track CLOCK_MONOTONIC and CLOCK_REALTIME separately? No. kvmclock counting is interrupted on vm pause (the hw clock does not count during vm pause). Can you explain why you consider it so bad ? How you think it could be improved ? The second rdtsc_barrier looks unnecessary. Even better, if rdtscp is available, then rdtscp can replace rdtsc_barrier, rdtsc, and the getcpu call. It would also be nice to avoid having two sets of rescalings of the timing data. Yep, probably good improvements, patches are welcome :-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] s390x/kvm: implement and use QEMU config device for s390
On 01/04/14 17:38, Paolo Bonzini wrote: Il 01/04/2014 16:47, Christian Borntraeger ha scritto: From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com The following patch adds Qemu CONFIG device, which interacts with kvm CONFIG device by calling KVM_CREATE_DEVICE (to create the device in kernel), KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR (to get and set the particular attributes of KVM CONFIG device respectively). This patch uses the Qemu CONFIG device to copy the qemu_name (if specified) into the kvm CONFIG device. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com Signed-off-by: Michael Mueller m...@linux.vnet.ibm.com Signed-off-by: Jens Freimann jf...@linux.vnet.ibm.com On the QEMU side we'll soon have per-machine-type options in -machine. s390 can then use that. Well we need some calls in the kernel at runtime (e.g. during reset), so we need more than static options. (But we can use those for several things). This device was mostly meant as an achor for doesnt fit into something else and would require a new ioctl. Lets encapsulate everything in a way that follows the everything is a device approach with QOM and all these things. Is there a branch for the per-machine-type things? Christian Paolo --- default-configs/s390x-softmmu.mak | 1 + hw/s390x/Makefile.objs| 1 + hw/s390x/config.c | 113 ++ hw/s390x/s390-virtio-ccw.c| 6 ++ hw/s390x/s390-virtio.c| 24 hw/s390x/s390-virtio.h| 1 + include/hw/s390x/config.h | 51 + linux-headers/asm-s390/kvm.h | 11 linux-headers/linux/kvm.h | 1 + trace-events | 4 ++ 10 files changed, 213 insertions(+) create mode 100644 hw/s390x/config.c create mode 100644 include/hw/s390x/config.h diff --git a/default-configs/s390x-softmmu.mak b/default-configs/s390x-softmmu.mak index d843dc0..a07697d 100644 --- a/default-configs/s390x-softmmu.mak +++ b/default-configs/s390x-softmmu.mak @@ -1,3 +1,4 @@ CONFIG_VIRTIO=y CONFIG_SCLPCONSOLE=y CONFIG_S390_FLIC=$(CONFIG_KVM) +CONFIG_S390_CONFIG=$(CONFIG_KVM) diff --git a/hw/s390x/Makefile.objs b/hw/s390x/Makefile.objs index 1ba6c3a..dc03e1d 100644 --- a/hw/s390x/Makefile.objs +++ b/hw/s390x/Makefile.objs @@ -8,3 +8,4 @@ obj-y += ipl.o obj-y += css.o obj-y += s390-virtio-ccw.o obj-y += virtio-ccw.o +obj-$(CONFIG_S390_CONFIG) += config.o diff --git a/hw/s390x/config.c b/hw/s390x/config.c new file mode 100644 index 000..f70e880 --- /dev/null +++ b/hw/s390x/config.c @@ -0,0 +1,113 @@ +/* + * S390 configuration and control device + * + * Copyright 2014 IBM Corp. + * Author(s): Ekaterina Tumanova tuman...@linux.vnet.ibm.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or (at + * your option) any later version. See the COPYING file in the top-level + * directory. + */ + +#include sys/ioctl.h +#include hw/s390x/config.h +#include trace.h + +S390ConfigState *get_config_state(void) +{ +return S390_CONFIG(object_resolve_path(/machine/s390-config, NULL)); +} + +static int config_set(S390ConfigState *config, struct kvm_device_attr *attr) +{ +int rc; + +rc = ioctl(config-fd, KVM_SET_DEVICE_ATTR, attr); +return rc 0 ? -errno : rc; +} + +/* +static int config_get(S390ConfigState *config, struct kvm_device_attr *attr) +{ +int rc; + +rc = ioctl(config-fd, KVM_GET_DEVICE_ATTR, attr); +return rc 0 ? -errno : rc; +} +*/ + +static int config_set_name(S390ConfigState *config, char *name) +{ +struct kvm_device_attr attr = { +.group = KVM_DEV_CONFIG_GROUP, +.attr = KVM_DEV_CONFIG_NAME, +.addr = (uint64_t)name, +}; + +if (!name || +(qemu_strnlen(name, S390_MACHINE_NAME_SIZE) + = S390_MACHINE_NAME_SIZE)) { +return -EINVAL; +} + +return config_set(config, attr); +} + +static void s390_config_realize(DeviceState *dev, Error **errp) +{ +S390ConfigState *cfg = S390_CONFIG(dev); +struct kvm_create_device cd = {0}; +int ret; + +if (!kvm_check_extension(kvm_state, KVM_CAP_DEVICE_CTRL)) { +trace_config_no_device_api(errno); +return; +} + +cd.type = KVM_DEV_TYPE_S390_CONFIG; +ret = kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, cd); +if (ret 0) { +trace_config_create_device(errno); +return; +} +cfg-fd = cd.fd; +} + +void s390_config_dev_init(void) +{ +DeviceState *dev; + +if (kvm_enabled()) { +dev = qdev_create(NULL, TYPE_S390_CONFIG); + +object_property_add_child(qdev_get_machine(), + TYPE_S390_CONFIG, + OBJECT(dev), NULL); +qdev_init_nofail(dev); +} +}
Re: VDSO pvclock may increase host cpu consumption, is this a problem?
On Tue, Apr 1, 2014 at 11:01 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote: On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote: On 03/29/2014 01:47 AM, Zhanghailiang wrote: Hi, I found when Guest is idle, VDSO pvclock may increase host consumption. We can calcutate as follow, Correct me if I am wrong. (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest) In Host, VDSO pvclock introduce a notifier chain, pvclock_gtod_chain in timekeeping.c. It consume nearly 900 cycles per call. So in consideration of 250 Hz, it may consume 225,000 cycles per second, even no VM is created. In Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If the no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per call. The feature decrease 150 cycles consumption per call. When call gettimeofday 1500 times,it decrease 225,000 cycles,equal to the host consumption. Both Host and Guest is linux-3.13.6. So, whether the host cpu consumption is a problem? Does pvclock serve any real purpose on systems with fully-functional TSCs? The x86 guest implementation is awful, so it's about 2x slower than TSC. It could be improved a lot, but I'm not sure I understand why it exists in the first place. VM migration. Why does that need percpu stuff? Wouldn't it be sufficient to interrupt all CPUs (or at least all cpus running in userspace) on migration and update the normal timing data structures? Are you suggesting to allow interruption of the timekeeping code at any time to update frequency information ? I'm not sure what you mean by interruption of the timekeeping code. I'm suggesting sending an interrupt to the guest (via a virtio device, presumably) to tell it that it has been paused and resumed. This is probably worth getting John's input if you actually want to do this. I'm not about to :) Is there any case in which the TSC is stable and the kvmclock data for different cpus is actually different? Do you want to that as a special tsc clocksource driver ? Even better: have the VM offer to invalidate the physical page containing the kernel's clock data on migration and interrupt one CPU. If another CPU races, it'll fault and wait for the guest kernel to update its timing. Perhaps that is a good idea. Does the current kvmclock stuff track CLOCK_MONOTONIC and CLOCK_REALTIME separately? No. kvmclock counting is interrupted on vm pause (the hw clock does not count during vm pause). Makes sense. Can you explain why you consider it so bad ? How you think it could be improved ? The second rdtsc_barrier looks unnecessary. Even better, if rdtscp is available, then rdtscp can replace rdtsc_barrier, rdtsc, and the getcpu call. It would also be nice to avoid having two sets of rescalings of the timing data. Yep, probably good improvements, patches are welcome :-) I may get to it at some point. No guarantees. I did just rewrite all the mapping-related code for every other x86 vdso timesource, so maybe I should try to add this to the pile. The fact that the data is a variable number of pages makes it messy, though, and since I don't understand why there's a separate structure for each CPU, I'm hesitant to change it too much. --Andy -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device
On 01/04/14 17:12, Alexander Graf wrote: On 04/01/2014 05:04 PM, Christian Borntraeger wrote: On 01/04/14 16:58, Alexander Graf wrote: On 04/01/2014 04:47 PM, Christian Borntraeger wrote: From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains configuration and control attributes of particular vm. The device is created by KVM_CREATE_DEVICE ioctl. The attributes may be retrieved and stored by calling KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com I don't think a device is particularly the best fit. A device can usually be instantiated multiple times. The configuration device can only be created once. A device also gets created by user space which enables it to receive the fd to drive it. Your device has to be created during VM creation. I remember some discussion a year or 2 ago, and IIRC a config device was actually your idea ;-) (The other idea that we had, was ONE_REG for the VM) Omg, really? :o A device would make sense for a specific system information instruction trap that we handle in-kernel for whatever reason (usually because it's performance critical) and some mandatory say to make sure user space always creates it. And some checks to make sure it can't get created twice. Well the device created twice problem is also true for all existing KVM devices. The only missing piece is the check in the config device, no? Speaking of which, why don't we just forward STSI to user space with an ENABLE_CAP and handle all of this there? It's not performance critical at all, right? No, performance is not critical. The thing is, that we definitely need the kernel to handle parts of STSI, as we have to provide information from the upper hipervisor (LPAR or zVM). This information is only available in kernel space. So in essence we could only forward a small subset of STSI, namely stsi3_2_2. But we still have to call stsi_3_2_2 in the kernel, as 3_2_2 does contain the list of hipervisors underneath us (KVM under z/VM). So then only thing that we could do is to forward STSI_3_2_2 to qemu when a capability is set and after the kernel has filled in the upper layers. QEMU then has to modify the page that the kernel touched and go back. Would work, but needs a capability and preferably an own exit. An new ioctl or a subcode of an ioctl (attr/group whatever) seems easier. I think VM configuration is common enough to just make this a separate interface. So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, instead? Seems like an easy enough change. Would you reuse the kvm_attr structure for that? Yeah, reuse whatever we can. Basically just remove the device boilerplate - I don't think it's impressively useful for a non-device. See above, name is just a simple first user. The thing is, that we have to have the ioctl either define a proper namespace (unique groups attrs) or to make it s390 specific. The device approach does help us here. I personally dont mind which way to go, as long as Paolo is fine with the approach, and nobody complains about the functions being non-QOM. Christian -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] s390: Provide a configuration and control device
On 01/04/14 16:59, Alexander Graf wrote: On 04/01/2014 04:47 PM, Christian Borntraeger wrote: We want to configure several things in KVM that go beyond what ENABLE_CAP (we need payload) or ONE_REG (we need it for the VM and we need to do more complex actions) can provide. Instead of adding several s390 specific ioctls, lets provide a configuration and control device that encapsulates different commands into groups of the same area (MEMORY, CPU, ..) We also provide an initial nameless base group, with a simple first user to set the guest name. We need that name in the kernel for the emulation of STSI (which provides the guest name to the guest) but we need to implement the emulation in supervisor mode, as it also provides the underlying levels of hipervisors. Currently we have the following GROUPS and ATTRs pending, which configure some memory management related function or allow to set the guest facilities, cpuids etc: #define KVM_DEV_CONFIG_GROUP0 #define KVM_DEV_CONFIG_NAME 0 #define KVM_DEV_CONFIG_GROUP_MEM1 #define KVM_DEV_CONFIG_MEM_ENABLE_CMMA 0 #define KVM_DEV_CONFIG_MEM_CLR_CMMA 1 #define KVM_DEV_CONFIG_MEM_CLR_PAGES2 #define KVM_DEV_CONFIG_GROUP_CPU2 #define KVM_DEV_CONFIG_CPU_TYPE 0 #define KVM_DEV_CONFIG_CPU_FAC 1 #define KVM_DEV_CONFIG_CPU_FAC_MASK 2 #define KVM_DEV_CONFIG_CPU_IBC 3 #define KVM_DEV_CONFIG_CPU_IBC_RANGE4 Why would CPU specific information be set in the VM? Might be a misleading name here. This is about CPU id and facility list (mostly CPU features). The list of facilities and the cpu id is unique on VM level. It is queried via a CPU instruction, though. Providing some VM-wide setting via a CPU interface would make the interface look like the user can have different settings per-CPU. Since we cant, its better to make this explicit. Alex In addition other groups like #define KVM_DEV_CONFIG_GROUP_CRYPTO are under consideration to configure crypto acceleration. Unless there is a major concern, I will add this to the next s390 PULL requests for KVM. Christian -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device
On 01.04.2014, at 21:19, Christian Borntraeger borntrae...@de.ibm.com wrote: On 01/04/14 17:12, Alexander Graf wrote: On 04/01/2014 05:04 PM, Christian Borntraeger wrote: On 01/04/14 16:58, Alexander Graf wrote: On 04/01/2014 04:47 PM, Christian Borntraeger wrote: From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains configuration and control attributes of particular vm. The device is created by KVM_CREATE_DEVICE ioctl. The attributes may be retrieved and stored by calling KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com I don't think a device is particularly the best fit. A device can usually be instantiated multiple times. The configuration device can only be created once. A device also gets created by user space which enables it to receive the fd to drive it. Your device has to be created during VM creation. I remember some discussion a year or 2 ago, and IIRC a config device was actually your idea ;-) (The other idea that we had, was ONE_REG for the VM) Omg, really? :o A device would make sense for a specific system information instruction trap that we handle in-kernel for whatever reason (usually because it's performance critical) and some mandatory say to make sure user space always creates it. And some checks to make sure it can't get created twice. Well the device created twice problem is also true for all existing KVM devices. The only missing piece is the check in the config device, no? Speaking of which, why don't we just forward STSI to user space with an ENABLE_CAP and handle all of this there? It's not performance critical at all, right? No, performance is not critical. The thing is, that we definitely need the kernel to handle parts of STSI, as we have to provide information from the upper hipervisor (LPAR or zVM). This information is only available in kernel space. So in essence we could only forward a small subset of STSI, namely stsi3_2_2. But we still have to call stsi_3_2_2 in the kernel, as 3_2_2 does contain the list of hipervisors underneath us (KVM under z/VM). So then only thing that we could do is to forward STSI_3_2_2 to qemu when a capability is set and after the kernel has filled in the upper layers. QEMU then has to modify the page that the kernel touched and go back. Would work, but needs a capability and preferably an own exit. An new ioctl or a subcode of an ioctl (attr/group whatever) seems easier. I would consider these 2 orthogonal bits of information. User space wants to get information about its underlying hypervisors regardless of KVM, no? So we should have some interface to bubble STSI information of the current system to user space either way. QEMU could use that and add a few bits of its own. That way we could handle all of STSI in QEMU and get out of the business of defining complicated interfaces. I think VM configuration is common enough to just make this a separate interface. So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, instead? Seems like an easy enough change. Would you reuse the kvm_attr structure for that? Yeah, reuse whatever we can. Basically just remove the device boilerplate - I don't think it's impressively useful for a non-device. See above, name is just a simple first user. The thing is, that we have to have the ioctl either define a proper namespace (unique groups attrs) or to make it s390 specific. The device approach does help us here. If you like the device approach, make sure to create it on VM creation and only implement a specific ioctl to fetch its fd. We don't create the configuration information pseudo device after VM creation - it's always there :). I personally dont mind which way to go, as long as Paolo is fine with the approach, and nobody complains about the functions being non-QOM. I think the most obvious and straight forward way would be to deal with all of STSI in user space. Make it a separate exit type similar to hypercalls and don't worry about QOM'ification of anything. This thing is on the same level as CPUID really - just VM wide :). Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] s390: Provide a configuration and control device
On 01.04.2014, at 21:23, Christian Borntraeger borntrae...@de.ibm.com wrote: On 01/04/14 16:59, Alexander Graf wrote: On 04/01/2014 04:47 PM, Christian Borntraeger wrote: We want to configure several things in KVM that go beyond what ENABLE_CAP (we need payload) or ONE_REG (we need it for the VM and we need to do more complex actions) can provide. Instead of adding several s390 specific ioctls, lets provide a configuration and control device that encapsulates different commands into groups of the same area (MEMORY, CPU, ..) We also provide an initial nameless base group, with a simple first user to set the guest name. We need that name in the kernel for the emulation of STSI (which provides the guest name to the guest) but we need to implement the emulation in supervisor mode, as it also provides the underlying levels of hipervisors. Currently we have the following GROUPS and ATTRs pending, which configure some memory management related function or allow to set the guest facilities, cpuids etc: #define KVM_DEV_CONFIG_GROUP0 #define KVM_DEV_CONFIG_NAME 0 #define KVM_DEV_CONFIG_GROUP_MEM1 #define KVM_DEV_CONFIG_MEM_ENABLE_CMMA 0 #define KVM_DEV_CONFIG_MEM_CLR_CMMA 1 #define KVM_DEV_CONFIG_MEM_CLR_PAGES2 #define KVM_DEV_CONFIG_GROUP_CPU2 #define KVM_DEV_CONFIG_CPU_TYPE 0 #define KVM_DEV_CONFIG_CPU_FAC 1 #define KVM_DEV_CONFIG_CPU_FAC_MASK 2 #define KVM_DEV_CONFIG_CPU_IBC 3 #define KVM_DEV_CONFIG_CPU_IBC_RANGE4 Why would CPU specific information be set in the VM? Might be a misleading name here. This is about CPU id and facility list (mostly CPU features). The list of facilities and the cpu id is unique on VM level. Is this a hardware limitation? I thought s390's virtualization instruction doesn't have a notion of a VM, but I could easily be wrong. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device
On 01.04.2014, at 21:19, Christian Borntraeger borntrae...@de.ibm.com wrote: On 01/04/14 17:12, Alexander Graf wrote: On 04/01/2014 05:04 PM, Christian Borntraeger wrote: On 01/04/14 16:58, Alexander Graf wrote: On 04/01/2014 04:47 PM, Christian Borntraeger wrote: From: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Add KVM_DEV_TYPE_S390_CONFIG kvm device that contains configuration and control attributes of particular vm. The device is created by KVM_CREATE_DEVICE ioctl. The attributes may be retrieved and stored by calling KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com I don't think a device is particularly the best fit. A device can usually be instantiated multiple times. The configuration device can only be created once. A device also gets created by user space which enables it to receive the fd to drive it. Your device has to be created during VM creation. I remember some discussion a year or 2 ago, and IIRC a config device was actually your idea ;-) (The other idea that we had, was ONE_REG for the VM) Omg, really? :o A device would make sense for a specific system information instruction trap that we handle in-kernel for whatever reason (usually because it's performance critical) and some mandatory say to make sure user space always creates it. And some checks to make sure it can't get created twice. Well the device created twice problem is also true for all existing KVM devices. The only missing piece is the check in the config device, no? Speaking of which, why don't we just forward STSI to user space with an ENABLE_CAP and handle all of this there? It's not performance critical at all, right? No, performance is not critical. The thing is, that we definitely need the kernel to handle parts of STSI, as we have to provide information from the upper hipervisor (LPAR or zVM). This information is only available in kernel space. So in essence we could only forward a small subset of STSI, namely stsi3_2_2. But we still have to call stsi_3_2_2 in the kernel, as 3_2_2 does contain the list of hipervisors underneath us (KVM under z/VM). So then only thing that we could do is to forward STSI_3_2_2 to qemu when a capability is set and after the kernel has filled in the upper layers. QEMU then has to modify the page that the kernel touched and go back. Would work, but needs a capability and preferably an own exit. An new ioctl or a subcode of an ioctl (attr/group whatever) seems easier. I would consider these 2 orthogonal bits of information. User space wants to get information about its underlying hypervisors regardless of KVM, no? So we should have some interface to bubble STSI information of the current system to user space either way. QEMU could use that and add a few bits of its own. That way we could handle all of STSI in QEMU and get out of the business of defining complicated interfaces. I think VM configuration is common enough to just make this a separate interface. So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, instead? Seems like an easy enough change. Would you reuse the kvm_attr structure for that? Yeah, reuse whatever we can. Basically just remove the device boilerplate - I don't think it's impressively useful for a non-device. See above, name is just a simple first user. The thing is, that we have to have the ioctl either define a proper namespace (unique groups attrs) or to make it s390 specific. The device approach does help us here. If you like the device approach, make sure to create it on VM creation and only implement a specific ioctl to fetch its fd. We don't create the configuration information pseudo device after VM creation - it's always there :). I personally dont mind which way to go, as long as Paolo is fine with the approach, and nobody complains about the functions being non-QOM. I think the most obvious and straight forward way would be to deal with all of STSI in user space. Make it a separate exit type similar to hypercalls and don't worry about QOM'ification of anything. This thing is on the same level as CPUID really - just VM wide :). Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] KVM: s390: Add S390 configuration and control kvm device
On 01/04/14 21:36, Alexander Graf wrote: [...] Speaking of which, why don't we just forward STSI to user space with an ENABLE_CAP and handle all of this there? It's not performance critical at all, right? No, performance is not critical. The thing is, that we definitely need the kernel to handle parts of STSI, as we have to provide information from the upper hipervisor (LPAR or zVM). This information is only available in kernel space. So in essence we could only forward a small subset of STSI, namely stsi3_2_2. But we still have to call stsi_3_2_2 in the kernel, as 3_2_2 does contain the list of hipervisors underneath us (KVM under z/VM). So then only thing that we could do is to forward STSI_3_2_2 to qemu when a capability is set and after the kernel has filled in the upper layers. QEMU then has to modify the page that the kernel touched and go back. Would work, but needs a capability and preferably an own exit. An new ioctl or a subcode of an ioctl (attr/group whatever) seems easier. I would consider these 2 orthogonal bits of information. User space wants to get information about its underlying hypervisors regardless of KVM, no? So we should have some interface to bubble STSI information of the current system to user space either way. There is /proc/sysinfo and it provides most (but not all) of the STSI information. But we dont want to go that path, really. Providing STSI as binary blob is also complex and error-prone. Think about it again: Instead of having a small qemu-kernel interface for specific subsets of stsi, you suggest a kernel-qemu interface for the complete set of STSI, so that qemu can then emulate it? (We talk about 20 pages in the principles of operation). T QEMU could use that and add a few bits of its own. That way we could handle all of STSI in QEMU and get out of the business of defining complicated interfaces. This also misses my main point: The name is just the first user. I want to have some interface that allows me to do other things: enable CMMA, reset the CMMA memory usage state on reset, zap the page tables on clear reset, set the cpu facilities (some parts need to be in kernel so that we can block specific operations), enable s390 specific settings, whatever. The usual approach was create new KVM ioctls, each for one feature + capability. This sounded like overkill. So please dont focus on stsi, I need some interface that I can use for other pending patches. And if that interface is generic enough, we might also use that for name (or not depending on the stsi discussion). I am not married to the device idea, anything that works out fine is ok with me. I think VM configuration is common enough to just make this a separate interface. So you propose to define a new base ioctl (e.g. VM_REG) on the vm fd, instead? Seems like an easy enough change. Would you reuse the kvm_attr structure for that? Yeah, reuse whatever we can. Basically just remove the device boilerplate - I don't think it's impressively useful for a non-device. See above, name is just a simple first user. The thing is, that we have to have the ioctl either define a proper namespace (unique groups attrs) or to make it s390 specific. The device approach does help us here. If you like the device approach, make sure to create it on VM creation and only implement a specific ioctl to fetch its fd. We don't create the configuration information pseudo device after VM creation - it's always there :). I personally dont mind which way to go, as long as Paolo is fine with the approach, and nobody complains about the functions being non-QOM. I think the most obvious and straight forward way would be to deal with all of STSI in user space. Make it a separate exit type similar to hypercalls and don't worry about QOM'ification of anything. This thing is on the same level as CPUID really - just VM wide :). For 3_2_2 this might be possible solution, but not for the other codes. But as I said, the name is not my main problem here. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override
On Tue, 01 Apr 2014 10:28:54 -0600 Alex Williamson alex.william...@redhat.com wrote: The driver_override field allows us to specify the driver for a device rather than relying on the driver to provide a positive match of the device. This shortcuts the existing process of looking up the vendor and device ID, adding them to the driver new_id, binding the device, then removing the ID, but it also provides a couple advantages. First, the above process allows the driver to bind to any device matching the new_id for the window where it's enabled. This is often not desired, such as the case of trying to bind a single device to a meta driver like pci-stub or vfio-pci. Using driver_override we can do this deterministically using: echo pci-stub /sys/bus/pci/devices/:03:00.0/driver_override echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Previously we could not invoke drivers_probe after adding a device to new_id for a driver as we get non-deterministic behavior whether the driver we intend or the standard driver will claim the device. Now it becomes a deterministic process, only the driver matching driver_override will probe the device. To return the device to the standard driver, we simply clear the driver_override and reprobe the device, ex: echo /sys/bus/pci/devices/:03:00.0/preferred_driver echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Another advantage to this approach is that we can specify a driver override to force a specific binding or prevent any binding. For instance when an IOMMU group is exposed to userspace through VFIO we require that all devices within that group are owned by VFIO. However, devices can be hot-added into an IOMMU group, in which case we want to prevent the device from binding to any driver (preferred driver = none) or perhaps have it automatically bind to vfio-pci. With driver_override it's a simple matter for this field to be set internally when the device is first discovered to prevent driver matches. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Apologies for the exceptionally long cc list, this is a follow-up to Stuart's Subject: mechanism to allow a driver to bind to any device thread. This is effectively a v2 of the proof-of-concept patch I posted in that thread. This version changes to use a dummy id struct to return on an override match, which removes the collateral damage and greatly simplifies the patch. This feels fairly well baked for PCI and I would expect that platform drivers could do a similar implementation. From there perhaps we can discuss whether there's any advantage to placing driver_override on struct device. The logic for incorporating it into the match still needs to happen per bus driver, so it might only contribute to consistency of the show/store sysfs attributes to move it up to struct device. Please comment. Sounds like Greg likes this approach more than {drv,dev}_sysfs_only. The diff below is the result of duplicating and converting this patch for platform devices, and, indeed, binding a device to the vfio-platform driver succeeds with: echo vfio-platform /sys/bus/platform/devices/fff51000.ethernet/driver_override echo fff51000.ethernet /sys/bus/platform/devices/fff51000.ethernet/driver/unbind echo fff51000.ethernet /sys/bus/platform/drivers_probe However, it's almost pure duplication modulo the bus match code. The only other place I can see where to put the common bus check is drivers/base/base.h:driver_match_device(), which I'm guessing is off-limits? So should we leave this as per-bus code, and somehow refactor driver_override_{show,store}? Kim diff --git a/drivers/base/platform.c b/drivers/base/platform.c index bc78848..621c5bd2 100644 --- a/drivers/base/platform.c +++ b/drivers/base/platform.c @@ -22,6 +22,7 @@ #include linux/pm_runtime.h #include linux/idr.h #include linux/acpi.h +#include linux/limits.h #include base.h #include power/power.h @@ -693,8 +694,49 @@ static ssize_t modalias_show(struct device *dev, struct device_attribute *a, } static DEVICE_ATTR_RO(modalias); +static ssize_t driver_override_store(struct device *dev, +struct device_attribute *attr, +const char *buf, size_t count) +{ + struct platform_device *pdev = to_platform_device(dev); + char *driver_override, *old = pdev-driver_override; + + if (count PATH_MAX) + return -EINVAL; + + driver_override = kstrndup(buf, count, GFP_KERNEL); + if (!driver_override) + return -ENOMEM; + + while (strlen(driver_override) + driver_override[strlen(driver_override) - 1] == '\n') + driver_override[strlen(driver_override) - 1] = '\0'; + + if (strlen(driver_override)) { +
Re: VDSO pvclock may increase host cpu consumption, is this a problem?
On Tue, Apr 01, 2014 at 12:17:16PM -0700, Andy Lutomirski wrote: On Tue, Apr 1, 2014 at 11:01 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote: On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote: On 03/29/2014 01:47 AM, Zhanghailiang wrote: Hi, I found when Guest is idle, VDSO pvclock may increase host consumption. We can calcutate as follow, Correct me if I am wrong. (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest) In Host, VDSO pvclock introduce a notifier chain, pvclock_gtod_chain in timekeeping.c. It consume nearly 900 cycles per call. So in consideration of 250 Hz, it may consume 225,000 cycles per second, even no VM is created. In Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If the no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per call. The feature decrease 150 cycles consumption per call. When call gettimeofday 1500 times,it decrease 225,000 cycles,equal to the host consumption. Both Host and Guest is linux-3.13.6. So, whether the host cpu consumption is a problem? Does pvclock serve any real purpose on systems with fully-functional TSCs? The x86 guest implementation is awful, so it's about 2x slower than TSC. It could be improved a lot, but I'm not sure I understand why it exists in the first place. VM migration. Why does that need percpu stuff? Wouldn't it be sufficient to interrupt all CPUs (or at least all cpus running in userspace) on migration and update the normal timing data structures? Are you suggesting to allow interruption of the timekeeping code at any time to update frequency information ? I'm not sure what you mean by interruption of the timekeeping code. I'm suggesting sending an interrupt to the guest (via a virtio device, presumably) to tell it that it has been paused and resumed. code: 1) disable interrupts 2) A = RDTSC 3) B = SCALE(A, TSC.FREQ) If migration happens between 2 and 3, you've got an incorrect value. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: VDSO pvclock may increase host cpu consumption, is this a problem?
On Tue, Apr 1, 2014 at 5:12 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Tue, Apr 01, 2014 at 12:17:16PM -0700, Andy Lutomirski wrote: On Tue, Apr 1, 2014 at 11:01 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote: On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote: On 03/29/2014 01:47 AM, Zhanghailiang wrote: Hi, I found when Guest is idle, VDSO pvclock may increase host consumption. We can calcutate as follow, Correct me if I am wrong. (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest) In Host, VDSO pvclock introduce a notifier chain, pvclock_gtod_chain in timekeeping.c. It consume nearly 900 cycles per call. So in consideration of 250 Hz, it may consume 225,000 cycles per second, even no VM is created. In Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If the no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per call. The feature decrease 150 cycles consumption per call. When call gettimeofday 1500 times,it decrease 225,000 cycles,equal to the host consumption. Both Host and Guest is linux-3.13.6. So, whether the host cpu consumption is a problem? Does pvclock serve any real purpose on systems with fully-functional TSCs? The x86 guest implementation is awful, so it's about 2x slower than TSC. It could be improved a lot, but I'm not sure I understand why it exists in the first place. VM migration. Why does that need percpu stuff? Wouldn't it be sufficient to interrupt all CPUs (or at least all cpus running in userspace) on migration and update the normal timing data structures? Are you suggesting to allow interruption of the timekeeping code at any time to update frequency information ? I'm not sure what you mean by interruption of the timekeeping code. I'm suggesting sending an interrupt to the guest (via a virtio device, presumably) to tell it that it has been paused and resumed. code: 1) disable interrupts 2) A = RDTSC 3) B = SCALE(A, TSC.FREQ) If migration happens between 2 and 3, you've got an incorrect value. Fair enough. I guess 1) disable interrupts 2) A = RDTSC 3) B = SCALE(A, TSC.FREQ) is also bad if (3) blocks due to magic invalidation of the physical page. --Andy -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override
On Tue, Apr 01, 2014 at 06:52:12PM -0500, Kim Phillips wrote: On Tue, 01 Apr 2014 10:28:54 -0600 Alex Williamson alex.william...@redhat.com wrote: The driver_override field allows us to specify the driver for a device rather than relying on the driver to provide a positive match of the device. This shortcuts the existing process of looking up the vendor and device ID, adding them to the driver new_id, binding the device, then removing the ID, but it also provides a couple advantages. First, the above process allows the driver to bind to any device matching the new_id for the window where it's enabled. This is often not desired, such as the case of trying to bind a single device to a meta driver like pci-stub or vfio-pci. Using driver_override we can do this deterministically using: echo pci-stub /sys/bus/pci/devices/:03:00.0/driver_override echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Previously we could not invoke drivers_probe after adding a device to new_id for a driver as we get non-deterministic behavior whether the driver we intend or the standard driver will claim the device. Now it becomes a deterministic process, only the driver matching driver_override will probe the device. To return the device to the standard driver, we simply clear the driver_override and reprobe the device, ex: echo /sys/bus/pci/devices/:03:00.0/preferred_driver echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Another advantage to this approach is that we can specify a driver override to force a specific binding or prevent any binding. For instance when an IOMMU group is exposed to userspace through VFIO we require that all devices within that group are owned by VFIO. However, devices can be hot-added into an IOMMU group, in which case we want to prevent the device from binding to any driver (preferred driver = none) or perhaps have it automatically bind to vfio-pci. With driver_override it's a simple matter for this field to be set internally when the device is first discovered to prevent driver matches. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Apologies for the exceptionally long cc list, this is a follow-up to Stuart's Subject: mechanism to allow a driver to bind to any device thread. This is effectively a v2 of the proof-of-concept patch I posted in that thread. This version changes to use a dummy id struct to return on an override match, which removes the collateral damage and greatly simplifies the patch. This feels fairly well baked for PCI and I would expect that platform drivers could do a similar implementation. From there perhaps we can discuss whether there's any advantage to placing driver_override on struct device. The logic for incorporating it into the match still needs to happen per bus driver, so it might only contribute to consistency of the show/store sysfs attributes to move it up to struct device. Please comment. Sounds like Greg likes this approach more than {drv,dev}_sysfs_only. I have made no such judgement, I only pointed out that if you modify/add/remove a sysfs file, it needs to have documentation for it. The diff below is the result of duplicating and converting this patch for platform devices, and, indeed, binding a device to the vfio-platform driver succeeds with: echo vfio-platform /sys/bus/platform/devices/fff51000.ethernet/driver_override echo fff51000.ethernet /sys/bus/platform/devices/fff51000.ethernet/driver/unbind echo fff51000.ethernet /sys/bus/platform/drivers_probe However, it's almost pure duplication modulo the bus match code. The only other place I can see where to put the common bus check is drivers/base/base.h:driver_match_device(), which I'm guessing is off-limits? So should we leave this as per-bus code, and somehow refactor driver_override_{show,store}? If you can provide a way for this to be done in a bus-independant way, like we did for new_id and the like, I'd be open to reviewing it. greg k-h -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] PCI: Introduce new device binding path using pci_dev.driver_override
On Tue, Apr 01, 2014 at 11:15:40AM -0600, Alex Williamson wrote: On Tue, 2014-04-01 at 09:47 -0700, Greg KH wrote: On Tue, Apr 01, 2014 at 10:28:54AM -0600, Alex Williamson wrote: The driver_override field allows us to specify the driver for a device rather than relying on the driver to provide a positive match of the device. This shortcuts the existing process of looking up the vendor and device ID, adding them to the driver new_id, binding the device, then removing the ID, but it also provides a couple advantages. First, the above process allows the driver to bind to any device matching the new_id for the window where it's enabled. This is often not desired, such as the case of trying to bind a single device to a meta driver like pci-stub or vfio-pci. Using driver_override we can do this deterministically using: echo pci-stub /sys/bus/pci/devices/:03:00.0/driver_override echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Previously we could not invoke drivers_probe after adding a device to new_id for a driver as we get non-deterministic behavior whether the driver we intend or the standard driver will claim the device. Now it becomes a deterministic process, only the driver matching driver_override will probe the device. To return the device to the standard driver, we simply clear the driver_override and reprobe the device, ex: echo /sys/bus/pci/devices/:03:00.0/preferred_driver echo :03:00.0 /sys/bus/pci/devices/:03:00.0/driver/unbind echo :03:00.0 /sys/bus/pci/drivers_probe Another advantage to this approach is that we can specify a driver override to force a specific binding or prevent any binding. For instance when an IOMMU group is exposed to userspace through VFIO we require that all devices within that group are owned by VFIO. However, devices can be hot-added into an IOMMU group, in which case we want to prevent the device from binding to any driver (preferred driver = none) or perhaps have it automatically bind to vfio-pci. With driver_override it's a simple matter for this field to be set internally when the device is first discovered to prevent driver matches. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Apologies for the exceptionally long cc list, this is a follow-up to Stuart's Subject: mechanism to allow a driver to bind to any device thread. This is effectively a v2 of the proof-of-concept patch I posted in that thread. This version changes to use a dummy id struct to return on an override match, which removes the collateral damage and greatly simplifies the patch. This feels fairly well baked for PCI and I would expect that platform drivers could do a similar implementation. From there perhaps we can discuss whether there's any advantage to placing driver_override on struct device. The logic for incorporating it into the match still needs to happen per bus driver, so it might only contribute to consistency of the show/store sysfs attributes to move it up to struct device. Please comment. Thanks, Alex drivers/pci/pci-driver.c | 25 ++--- drivers/pci/pci-sysfs.c | 40 include/linux/pci.h |1 + 3 files changed, 63 insertions(+), 3 deletions(-) No Documentation/ABI/ update to reflect the ABI you are creating? Oops, thanks for the reminder. I'd propose this: diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index a3c5a66..55ca6e5 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -250,3 +250,21 @@ Description: valid. For example, writing a 2 to this file when sriov_numvfs is not 0 and not 2 already will return an error. Writing a 10 when the value of sriov_totalvfs is 8 will return an error. + +What:/sys/bus/pci/devices/.../driver_override +Date:April 2014 +Contact: Alex Williamson alex.william...@redhat.com +Description: + This file allows the driver for a device to be specified + which will override standard static and dynamic ID matching. + When specified, only a driver with a name matching the value + written to driver_override will have an opportunity to bind + to the device. The override may be cleared by writing an + empty string (ex. echo driver_override), returning the + device to standard matching rules binding. Writing to + driver_override does not automatically unbind the device from + its current driver or make any attempt to automatically load + the specified driver
Re: VDSO pvclock may increase host cpu consumption, is this a problem?
On Tue, Apr 1, 2014 at 5:29 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Tue, Apr 01, 2014 at 12:17:16PM -0700, Andy Lutomirski wrote: On Tue, Apr 1, 2014 at 11:01 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote: On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote: On 03/29/2014 01:47 AM, Zhanghailiang wrote: Hi, I found when Guest is idle, VDSO pvclock may increase host consumption. We can calcutate as follow, Correct me if I am wrong. (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest) In Host, VDSO pvclock introduce a notifier chain, pvclock_gtod_chain in timekeeping.c. It consume nearly 900 cycles per call. So in consideration of 250 Hz, it may consume 225,000 cycles per second, even no VM is created. In Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If the no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per call. The feature decrease 150 cycles consumption per call. When call gettimeofday 1500 times,it decrease 225,000 cycles,equal to the host consumption. Both Host and Guest is linux-3.13.6. So, whether the host cpu consumption is a problem? Does pvclock serve any real purpose on systems with fully-functional TSCs? The x86 guest implementation is awful, so it's about 2x slower than TSC. It could be improved a lot, but I'm not sure I understand why it exists in the first place. VM migration. Why does that need percpu stuff? Wouldn't it be sufficient to interrupt all CPUs (or at least all cpus running in userspace) on migration and update the normal timing data structures? Are you suggesting to allow interruption of the timekeeping code at any time to update frequency information ? I'm not sure what you mean by interruption of the timekeeping code. I'm suggesting sending an interrupt to the guest (via a virtio device, presumably) to tell it that it has been paused and resumed. This is probably worth getting John's input if you actually want to do this. I'm not about to :) Honestly, neither am i at the moment. But i'll think about it. Is there any case in which the TSC is stable and the kvmclock data for different cpus is actually different? No. However, kvmclock_data.flags field is an interface for watchdog unpause. Do you want to that as a special tsc clocksource driver ? Even better: have the VM offer to invalidate the physical page containing the kernel's clock data on migration and interrupt one CPU. If another CPU races, it'll fault and wait for the guest kernel to update its timing. Perhaps that is a good idea. Does the current kvmclock stuff track CLOCK_MONOTONIC and CLOCK_REALTIME separately? No. kvmclock counting is interrupted on vm pause (the hw clock does not count during vm pause). Makes sense. Can you explain why you consider it so bad ? How you think it could be improved ? The second rdtsc_barrier looks unnecessary. Even better, if rdtscp is available, then rdtscp can replace rdtsc_barrier, rdtsc, and the getcpu call. It would also be nice to avoid having two sets of rescalings of the timing data. Yep, probably good improvements, patches are welcome :-) I may get to it at some point. No guarantees. I did just rewrite all the mapping-related code for every other x86 vdso timesource, so maybe I should try to add this to the pile. The fact that the data is a variable number of pages makes it messy, though, and since I don't understand why there's a separate structure for each CPU, I'm hesitant to change it too much. --Andy kvmclock.data? Because each VCPU can have different .flags fields for example. It looks like the vdso kvmclock code only runs if PVCLOCK_TSC_STABLE_BIT is set, which in turn is only the case if the TSC is guaranteed to be monotonic across all CPUs. If we can rely on the fact that that bit will only be set if tsc_to_system_mul and tsc_shift are the same on all CPUs and that (system_time - (tsc_timestamp * mul) shift) is the same on all CPUs, then there should be no reason for the vdso to read the pvclock data for anything but CPU 0. That will make it a lot faster and simpler. Can we rely on that? I wonder what happens if the guest runs ntpd or otherwise uses adjtimex. Presumably it starts drifting relative to the host. --Andy -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] drivers/vfio/pci: Fix MSIx message lost
On 03/10/2014 04:46 PM, Gavin Shan wrote: The problem is specific to the case of BIST reset issued to IPR adapter on the guest side. The IPR driver calls pci_save_state(), issues BIST reset and then pci_restore_state(). The issued BIST cleans out MSIx table and pci_restore_state() doesn't restore the MSIx table as expected. Eventually, MSIx messages with all zeros are sent and causes EEH error on Power platform. The patch fixes it by writing MSIx message to MSIx table before reenabling the individual interrupts in the following path: qemu/hw/pci/msix.c::msix_table_mmio_write msix_handle_mask_update msix_fire_vector_notifier qemu/hw/misc/vfio.c::vfio_msix_vector_use vfio_msix_vector_do_use IOCTL Command VFIO_DEVICE_SET_IRQS to VFIO-PCI vfio/pci/vfio_pci.c::vfio_pci_ioctl vfio/pci/vfio_pci_intrs.c::vfio_pci_set_irqs_ioctl vfio_pci_set_msi_trigger vfio_msi_set_block vfio_msi_set_vector_signal Reported-by: Wen Xiong wenxi...@us.ibm.com Signed-off-by: Gavin Shan sha...@linux.vnet.ibm.com --- drivers/vfio/pci/vfio_pci_intrs.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index 2103576..83e0638 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -17,6 +17,7 @@ #include linux/interrupt.h #include linux/eventfd.h #include linux/pci.h +#include linux/msi.h #include linux/file.h #include linux/poll.h #include linux/vfio.h @@ -517,6 +518,7 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev, struct pci_dev *pdev = vdev-pdev; int irq = msix ? vdev-msix[vector].vector : pdev-irq + vector; char *name = msix ? vfio-msix : vfio-msi; + struct msi_msg msg; struct eventfd_ctx *trigger; int ret; @@ -544,6 +546,15 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev, return PTR_ERR(trigger); } + /* We possiblly lose the MSI/MSIx message in some cases, one + * of which is pci_save_state(), BIST reset and pci_restore_state() + * for IPR adapter. The BIST reset cleans out MSIx table and we + * don't have chance to restore it. Here, we have the trick to + * restore it before enabling individual interrupts. For MSI messages, + * it's harmless to write them back. + */ + get_cached_msi_msg(irq, msg); + write_msi_msg(irq, msg); ret = request_irq(irq, vfio_msihandler, 0, vdev-ctx[vector].name, trigger); if (ret) { This is missing to successfully compile VFIO as modules: diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 2c10752..80ba2e9 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -283,6 +283,7 @@ void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg) __get_cached_msi_msg(entry, msg); } +EXPORT_SYMBOL_GPL(get_cached_msi_msg); void __write_msi_msg(struct msi_desc *entry, struct msi_msg *msg) { @@ -327,6 +328,7 @@ void write_msi_msg(unsigned int irq, struct msi_msg *msg) __write_msi_msg(entry, msg); } +EXPORT_SYMBOL_GPL(write_msi_msg); static void free_msi_irqs(struct pci_dev *dev) { -- Alexey -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] drivers/vfio/pci: Fix MSIx message lost
On Wed, 2014-04-02 at 12:16 +1100, Alexey Kardashevskiy wrote: On 03/10/2014 04:46 PM, Gavin Shan wrote: The problem is specific to the case of BIST reset issued to IPR adapter on the guest side. The IPR driver calls pci_save_state(), issues BIST reset and then pci_restore_state(). The issued BIST cleans out MSIx table and pci_restore_state() doesn't restore the MSIx table as expected. Eventually, MSIx messages with all zeros are sent and causes EEH error on Power platform. The patch fixes it by writing MSIx message to MSIx table before reenabling the individual interrupts in the following path: qemu/hw/pci/msix.c::msix_table_mmio_write msix_handle_mask_update msix_fire_vector_notifier qemu/hw/misc/vfio.c::vfio_msix_vector_use vfio_msix_vector_do_use IOCTL Command VFIO_DEVICE_SET_IRQS to VFIO-PCI vfio/pci/vfio_pci.c::vfio_pci_ioctl vfio/pci/vfio_pci_intrs.c::vfio_pci_set_irqs_ioctl vfio_pci_set_msi_trigger vfio_msi_set_block vfio_msi_set_vector_signal Reported-by: Wen Xiong wenxi...@us.ibm.com Signed-off-by: Gavin Shan sha...@linux.vnet.ibm.com --- drivers/vfio/pci/vfio_pci_intrs.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index 2103576..83e0638 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -17,6 +17,7 @@ #include linux/interrupt.h #include linux/eventfd.h #include linux/pci.h +#include linux/msi.h #include linux/file.h #include linux/poll.h #include linux/vfio.h @@ -517,6 +518,7 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev, struct pci_dev *pdev = vdev-pdev; int irq = msix ? vdev-msix[vector].vector : pdev-irq + vector; char *name = msix ? vfio-msix : vfio-msi; + struct msi_msg msg; struct eventfd_ctx *trigger; int ret; @@ -544,6 +546,15 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev, return PTR_ERR(trigger); } + /* We possiblly lose the MSI/MSIx message in some cases, one +* of which is pci_save_state(), BIST reset and pci_restore_state() +* for IPR adapter. The BIST reset cleans out MSIx table and we +* don't have chance to restore it. Here, we have the trick to +* restore it before enabling individual interrupts. For MSI messages, +* it's harmless to write them back. +*/ + get_cached_msi_msg(irq, msg); + write_msi_msg(irq, msg); ret = request_irq(irq, vfio_msihandler, 0, vdev-ctx[vector].name, trigger); if (ret) { This is missing to successfully compile VFIO as modules: Yep, I noted this in my reply on 3/18. Thanks, Alex diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 2c10752..80ba2e9 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -283,6 +283,7 @@ void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg) __get_cached_msi_msg(entry, msg); } +EXPORT_SYMBOL_GPL(get_cached_msi_msg); void __write_msi_msg(struct msi_desc *entry, struct msi_msg *msg) { @@ -327,6 +328,7 @@ void write_msi_msg(unsigned int irq, struct msi_msg *msg) __write_msi_msg(entry, msg); } +EXPORT_SYMBOL_GPL(write_msi_msg); static void free_msi_irqs(struct pci_dev *dev) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC PATCH]pci-assign: Fix memory out of bound when MSI-X table not fit in a single page
Hi, I have a problem about SR-IOV pass-through. The PF is Emulex Corporation OneConnect NIC (Lancer)(rev 10), and the VF pci config is as follow: LINUX:/sys/bus/pci/devices/:04:00.6 # hexdump config 000 0010 0010 0200 0080 010 020 10df e264 030 0054 040 0008 050 6009 0008 2b41 c002 060 7805 018a 070 8411 03ff 4000 080 3400 9403 090 0010 0002 8724 1000 0a0 dc83 0041 0b0 001f 0010 0c0 000e 0d0 We can see the msix_max is 0x3ff and msix_table_entry is 0x4000 (4 pages). But QEMU only mmap MSIX_PAGE_SIZE memory for all pci devices in funciton assigned_dev_register_msix_mmio, meanwhile the set the one page memmory to zero, so the rest memory will be random value (maybe etnry.data is not 0). In function assigned_dev_update_msix_mmio maybe occur the issue of entry_nr 256, and the kmod reports the EINVAL error. My patch fix this issue which alloc memory according to the real size of pci device config. Any ideas? Thnaks. Signed-off-by: Gonglei arei.gong...@huawei.com --- hw/i386/kvm/pci-assign.c | 24 +++- 1 files changed, 19 insertions(+), 5 deletions(-) diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c index a825871..daa191c 100644 --- a/hw/i386/kvm/pci-assign.c +++ b/hw/i386/kvm/pci-assign.c @@ -1591,10 +1591,6 @@ static void assigned_dev_msix_reset(AssignedDevice *dev) MSIXTableEntry *entry; int i; -if (!dev-msix_table) { -return; -} - memset(dev-msix_table, 0, MSIX_PAGE_SIZE); for (i = 0, entry = dev-msix_table; i dev-msix_max; i++, entry++) { @@ -1604,13 +1600,31 @@ static void assigned_dev_msix_reset(AssignedDevice *dev) static int assigned_dev_register_msix_mmio(AssignedDevice *dev) { -dev-msix_table = mmap(NULL, MSIX_PAGE_SIZE, PROT_READ|PROT_WRITE, +int nr_pages; +int size; +int entry_per_page = MSIX_PAGE_SIZE / sizeof(struct MSIXTableEntry); + +if (dev-msix_max entry_per_page) { +nr_pages = dev-msix_max / entry_per_page; +if (dev-msix_max % entry_per_page) { +nr_pages += 1; +} +} else { +nr_pages = 1; +} It's usually not a good idea to special-case corner-cases like this. IMHO, we should assure the memory page-aligned, so I use ROUND_UP according to dev-msix_max. + +size = MSIX_PAGE_SIZE * nr_pages; Just use ROUND_UP? +dev-msix_table = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, 0, 0); Need to fix unmap as well? Yep, it should be unmap for new size memory in assigned_dev_unregister_msix_mmio. Thanks, Michael. BTW, do you think the KVM should upsize the max support MSI-X entry to 2048. Because the MSI-X supports a maximum table size of 2048 entries, which is descript in PCI specification 3.0 version 6.8.3.2: MSI-X Configuration. The history patch about downsize the MSI-X entry size to 256: http://thread.gmane.org/gmane.comp.emulators.kvm.devel/38852/focus=38849 Best regards, -Gonglei -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
On Tue, 2014-04-01 at 07:47 +0200, Alexander Graf wrote: Am 01.04.2014 um 01:03 schrieb Scott Wood scottw...@freescale.com: On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote: On 03/26/2014 10:17 PM, Scott Wood wrote: On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote: +/* + * Another thread may rewrite the TLB entry in parallel, don't + * execute from the address if the execute permission is not set + */ What happens when another thread rewrites the TLB entry in parallel? Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? Are the contents of the MAS registers consistent at this point or inconsistent? If another thread rewrites the TLB entry, then we use the new TLB entry, just as if it had raced in hardware. This check ensures that we don't execute from the new TLB entry if it doesn't have execute permissions (just as hardware would). What happens if the new TLB entry is valid and executable, and the instruction pointed to is valid, but doesn't trap (and thus we don't have emulation for it)? There has to be a good way to detect such a race and deal with it, no? How would you detect it? We don't get any information from the trap about what physical address the instruction was fetched from, or what the instruction was. Ah, this is a guest race where the guest modifies exactly the TLB in question. I see. Why would this ever happen in practice (in a case where we're not executing malicious code)? I don't know. It probably wouldn't. But it seems wrong to not check the exec bit. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
On 04/01/2014 06:58 PM, Scott Wood wrote: On Tue, 2014-04-01 at 07:47 +0200, Alexander Graf wrote: Am 01.04.2014 um 01:03 schrieb Scott Wood scottw...@freescale.com: On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote: On 03/26/2014 10:17 PM, Scott Wood wrote: On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote: +/* + * Another thread may rewrite the TLB entry in parallel, don't + * execute from the address if the execute permission is not set + */ What happens when another thread rewrites the TLB entry in parallel? Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? Are the contents of the MAS registers consistent at this point or inconsistent? If another thread rewrites the TLB entry, then we use the new TLB entry, just as if it had raced in hardware. This check ensures that we don't execute from the new TLB entry if it doesn't have execute permissions (just as hardware would). What happens if the new TLB entry is valid and executable, and the instruction pointed to is valid, but doesn't trap (and thus we don't have emulation for it)? There has to be a good way to detect such a race and deal with it, no? How would you detect it? We don't get any information from the trap about what physical address the instruction was fetched from, or what the instruction was. Ah, this is a guest race where the guest modifies exactly the TLB in question. I see. Why would this ever happen in practice (in a case where we're not executing malicious code)? I don't know. It probably wouldn't. But it seems wrong to not check the exec bit. Right, I'm just saying that a program interrupt is not too bad of an answer in case the guest does something as stupid as this in kernel space :). It's definitely good practice to check for the exec bit. Alex -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html