Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Wed, Mar 06, 2013 at 10:29:12AM -0600, Michael Wolf wrote:
On Wed, 2013-03-06 at 12:13 +0400, Glauber Costa wrote:
On 03/06/2013 05:41 AM, Marcelo Tosatti wrote:
On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:

Sorry for the delay in the response. I did not see the email right away.

On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:

In the case where you have a system that is running in a capped or overcommitted environment, the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user.

Sorry, I'm no expert in this area. But I don't really understand what is confusing for the end user here.

I suppose that what is wanted is to subtract stolen time due to 'known reasons' from steal time reporting. 'Known reasons' being, for example, hard caps. So a vcpu executing instructions with no halt, but limited to 80% of available bandwidth, would not have 20% of stolen time reported.

Yes, exactly. The end user many times did not set up the guest and is not aware of the capping. The end user is only aware of the performance level that they were told they would get with the guest.

But yes, a description of the scenario that is being dealt with, with details, is important.

I will add more detail to the description next time I submit the patches. How about something like, "In a cloud environment the user of a kvm guest is not aware of the underlying hardware or how many other guests are running on it. The end user is only aware of a level of performance that they should see." Or does that just muddy the picture more?

So what the feature aims for is to report stolen time relative to hard capping. That is: stolen time should be counted as time stolen from the guest _beyond_ hard capping. Yes?

Probably don't need to report new data to the guest for that.
If we take into account that 1 second always has one second, I believe that you can just subtract the consigned time from the steal time the host passes to the guest. During each second, the numbers won't sum up to 100. The delta to 100 is the consigned time, if anyone cares.

Adopting this would simplify this a lot. All you need to do, really, is to get your calculation right from the bandwidth given by the cpu controller. Subtract it in the host, and voila.

I looked at doing that once but was told that I was changing the interface in an unacceptable way, because now I was not reporting all of the elapsed time. I agree it would make things simpler.

Pointer to that claim, please?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Wed, Mar 06, 2013 at 10:27:13AM -0600, Michael Wolf wrote:
On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote:
On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:

Sorry for the delay in the response. I did not see the email right away.

On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:

In the case where you have a system that is running in a capped or overcommitted environment, the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user.

Sorry, I'm no expert in this area. But I don't really understand what is confusing for the end user here.

I suppose that what is wanted is to subtract stolen time due to 'known reasons' from steal time reporting. 'Known reasons' being, for example, hard caps. So a vcpu executing instructions with no halt, but limited to 80% of available bandwidth, would not have 20% of stolen time reported.

Yes, exactly. The end user many times did not set up the guest and is not aware of the capping. The end user is only aware of the performance level that they were told they would get with the guest.

But yes, a description of the scenario that is being dealt with, with details, is important.

I will add more detail to the description next time I submit the patches. How about something like, "In a cloud environment the user of a kvm guest is not aware of the underlying hardware or how many other guests are running on it. The end user is only aware of a level of performance that they should see." Or does that just muddy the picture more?

So what the feature aims for is to report stolen time relative to hard capping. That is: stolen time should be counted as time stolen from the guest _beyond_ hard capping. Yes?

Yes, that is the goal.

Probably don't need to report new data to the guest for that.

Not sure I understand what you are saying here.
Do you mean that I don't need to report the expected steal from the guest? If I don't do that then I'm not reporting all of the time and changing /proc/stat in a bigger way than adding another category. Also I thought I would need to provide the consigned time and the steal time for debugging purposes. Maybe I'm missing your point.

OK, so the usefulness of steal time comes from the ability to measure CPU cycles that the guest is being deprived of, relative to some unit (implicitly the CPU frequency presented to the VM). That way, it becomes easier to properly allocate resources. From the top man page:

st: time stolen from this vm by the hypervisor

Not only is it a problem for the lender; it is also confusing for the user, who has to subtract the hard capping from the reported steal value himself.

The problem with the algorithm in the patchset is the following (practical example):

- Hard capping set to 80% of available CPU.
- vcpu does not exceed its threshold, say a workload with 40% CPU utilization.
- Under this scenario it is possible for the vcpu to be deprived of cycles (because out of the 40% that the workload uses, only 30% of actual CPU time is being provided).
- The algorithm in this patchset will not report any stolen time, because it assumes the 20% of stolen time reported via 'run_delay' is fixed at all times (which is false); therefore any valid stolen time below 20% will not be reported.

Makes sense?

Not sure what the concrete way to report stolen time relative to hard capping is (probably easier inside the scheduler, where run_delay is calculated).

Reporting the hard capping to the guest is a good idea (which saves the user from having to measure it themselves), but better done separately via a new field.
Re: [PATCH v2] KVM: x86: Convert INIT and SIPI signals into synchronously handled requests
On Mon, Mar 04, 2013 at 10:41:43PM +0100, Jan Kiszka wrote:
From: Jan Kiszka jan.kis...@siemens.com

A VCPU sending INIT or SIPI to some other VCPU races for setting the remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED was overwritten by kvm_emulate_halt and, thus, got lost. Fix this by raising requests on the sender side that will then be handled synchronously over the target VCPU context.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Why is kvm_emulate_halt being executed from KVM_MP_STATE_INIT_RECEIVED/KVM_MP_STATE_SIPI_RECEIVED again? Why is it not true that the only valid transition from KVM_MP_STATE_HALTED is from KVM_MP_STATE_RUNNABLE?

It would be good for KVM_MP_STATE_HALTED to indicate guest executed HLT instruction (which is impossible without INIT/SIPI being received).
Re: [PATCH v5 0/6] kvm: Make ioeventfd usable on s390.
On Thu, Feb 28, 2013 at 12:33:15PM +0100, Cornelia Huck wrote:

v5 of the ioeventfd patch set, this time with a proper return code from __diag_virtio_hypercall(), otherwise unchanged.

v4 -> v5:
- Proper return code in __diag_virtio_hypercall()

v3 -> v4:
- Pass cookies in virtio-ccw notify hypercall
- Coding style

v2 -> v3:
- Added a patch exporting the virtio-ccw api and use it for the diagnose implementation.
- Better naming: We're dealing with virtio-ccw notifications only.

v1 -> v2:
- Move irqfd initialization from a module init function to kvm_init, eliminating the need for a second module for kvm/s390.
- Use kvm_io_device for s390 css devices.

Cornelia Huck (5):
  KVM: s390: Export virtio-ccw api.
  KVM: Initialize irqfd from kvm_init().
  KVM: Introduce KVM_VIRTIO_CCW_NOTIFY_BUS.
  KVM: ioeventfd for virtio-ccw devices.
  KVM: s390: Wire up ioeventfd.

Michael S. Tsirkin (1):
  virtio_ccw: pass a cookie value to kvm hypercall

Applied, thanks.
Re: [PATCH v2] KVM: x86: Convert INIT and SIPI signals into synchronously handled requests
On Tue, Mar 05, 2013 at 08:16:41PM -0300, Marcelo Tosatti wrote:
On Mon, Mar 04, 2013 at 10:41:43PM +0100, Jan Kiszka wrote:
From: Jan Kiszka jan.kis...@siemens.com

A VCPU sending INIT or SIPI to some other VCPU races for setting the remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED was overwritten by kvm_emulate_halt and, thus, got lost. Fix this by raising requests on the sender side that will then be handled synchronously over the target VCPU context.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Why is kvm_emulate_halt being executed from KVM_MP_STATE_INIT_RECEIVED/KVM_MP_STATE_SIPI_RECEIVED again? Why is it not true that the only valid transition from

s/from/to/

KVM_MP_STATE_HALTED is from KVM_MP_STATE_RUNNABLE? It would be good for KVM_MP_STATE_HALTED to indicate guest executed HLT instruction (which is impossible without INIT/SIPI being received).
Re: [PATCH v2] KVM: x86: Convert INIT and SIPI signals into synchronously handled requests
On Tue, Mar 05, 2013 at 08:16:41PM -0300, Marcelo Tosatti wrote:
On Mon, Mar 04, 2013 at 10:41:43PM +0100, Jan Kiszka wrote:
From: Jan Kiszka jan.kis...@siemens.com

A VCPU sending INIT or SIPI to some other VCPU races for setting the remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED was overwritten by kvm_emulate_halt and, thus, got lost. Fix this by raising requests on the sender side that will then be handled synchronously over the target VCPU context.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Why is kvm_emulate_halt being executed from KVM_MP_STATE_INIT_RECEIVED/KVM_MP_STATE_SIPI_RECEIVED again? Why is it not true that the only valid transition from KVM_MP_STATE_HALTED is from KVM_MP_STATE_RUNNABLE?

See Paolo's table, it is. So why fix a race which should not be happening in the first place?

It would be good for KVM_MP_STATE_HALTED to indicate guest executed HLT instruction (which is impossible without INIT/SIPI being received).
Re: [PATCH v2] KVM: nVMX: Reset RFLAGS on VM-exit
On Mon, Mar 04, 2013 at 11:00:38AM +0200, Gleb Natapov wrote:
On Sun, Mar 03, 2013 at 08:47:11PM +0100, Jan Kiszka wrote:
From: Jan Kiszka jan.kis...@siemens.com

Ouch, how could this work so well that far? We need to clear RFLAGS to the reset value as specified by the SDM. Particularly, IF must be off after VM-exit!

Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Reviewed-by: Gleb Natapov g...@redhat.com

Applied, thanks.
Re: [PATCH] s3: wake up on RTC event
On Tue, Mar 05, 2013 at 10:16:18AM +0100, Paolo Bonzini wrote:

The S3 test requires user interaction to terminate. Set up the RTC alarm instead, so that the test will end after one second.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 x86/s3.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Applied, thanks.
Re: [PATCH v2] KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS
On Mon, Mar 04, 2013 at 05:15:48PM +0100, Jan Kiszka wrote:

Properly set those bits to 1 that the spec demands in case bit 55 of VMX_BASIC is 0 - like in our case.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
Changes in v2:
- use symbolic constants

 arch/x86/include/asm/vmx.h |  4 ++++
 arch/x86/kvm/vmx.c         | 13 ++++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)

Please regenerate against the queue branch.
Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
On Tue, Mar 05, 2013 at 02:17:57PM -0600, Michael Wolf wrote:

Sorry for the delay in the response. I did not see your question.

On Mon, 2013-02-18 at 20:57 -0300, Marcelo Tosatti wrote:
On Tue, Feb 05, 2013 at 03:49:41PM -0600, Michael Wolf wrote:

Add a helper routine to scheduler/core.c to allow the kvm module to retrieve the cpu hardlimit settings. The values will be used to set up a timer that is used to separate the consigned from the steal time.

1) Can you please describe, in English, how the mechanics of subtracting cpu hardlimit values from steal time reported via run_delay are supposed to work?

The period and the quota used to separate the consigned time (expected steal) from the steal time are taken from the cfs bandwidth control settings. Any other steal time accruing during that period will show as the traditional steal time.

There is no expected steal time over a fixed period of real time.

There is expected steal time in the sense that the administrator of the system sets up guests on the host so that there will be cpu overcommitment.

I refer to:

+	/* split the delta into steal and consigned */
+	if (vcpu->arch.current_consigned < vcpu->arch.consigned_quota) {
+		vcpu->arch.current_consigned += delta;
+		if (vcpu->arch.current_consigned > vcpu->arch.consigned_quota) {
+			steal_delta = vcpu->arch.current_consigned
+				- vcpu->arch.consigned_quota;
+			consigned_delta = delta - steal_delta;
+		} else {

You can't expect there to be any amount of stolen time over a fixed period of time.

The end user who is using the guest does not know this; they only know they have been guaranteed a certain level of performance. So if steal time shows up, the end user typically thinks they are not getting their guaranteed performance. So this patchset is meant to allow top to show 100% utilization and ONLY show steal time if it is over the level of steal time that the host administrator set up. So take a simple example of a host with 1 cpu and two guests on it.
If each guest is fully utilized, a user will see 50% utilization and 50% steal in either of the guests. In this case the amount of steal time that the host administrator would expect to see is 50%. As long as the steal in the guest does not exceed 50%, the guest is running as expected. If for some reason the steal increases to 60%, now something is wrong and the steal time needs to be reported and the end user will make inquiries.

This is the purpose of stolen time: to report the amount of time the guest vcpu was runnable, but not running (IOW: starved).

2) From the description of patch 1: "In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat." This is outdated, right? Because an overcommitted environment is exactly what steal time should report.

I hope I'm not missing your point here. But again this comes down to the point of view. The end user is guaranteed a capability/level of performance that may not be a whole cpu. So only show steal time if the amount of steal time exceeds what the host admin expected when the guest was set up.

The real values must be reported. If the host system suddenly becomes loaded beyond what the host can provide to the guest, should the system report an incorrect value, to keep users from complaining? Sounds incorrect.
Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:

Sorry for the delay in the response. I did not see the email right away.

On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:

In the case where you have a system that is running in a capped or overcommitted environment, the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user.

Sorry, I'm no expert in this area. But I don't really understand what is confusing for the end user here.

I suppose that what is wanted is to subtract stolen time due to 'known reasons' from steal time reporting. 'Known reasons' being, for example, hard caps. So a vcpu executing instructions with no halt, but limited to 80% of available bandwidth, would not have 20% of stolen time reported.

Yes, exactly. The end user many times did not set up the guest and is not aware of the capping. The end user is only aware of the performance level that they were told they would get with the guest.

But yes, a description of the scenario that is being dealt with, with details, is important.

I will add more detail to the description next time I submit the patches. How about something like, "In a cloud environment the user of a kvm guest is not aware of the underlying hardware or how many other guests are running on it. The end user is only aware of a level of performance that they should see." Or does that just muddy the picture more?

So what the feature aims for is to report stolen time relative to hard capping. That is: stolen time should be counted as time stolen from the guest _beyond_ hard capping. Yes?

Probably don't need to report new data to the guest for that.
Re: virtio PCI on KVM without IO BARs
On Thu, Feb 28, 2013 at 05:24:33PM +0200, Michael S. Tsirkin wrote:

OK, we talked about this a while ago; here's a summary and some proposals:

At the moment, virtio PCI uses IO BARs for all accesses. The reason for IO use is the cost of different VM exit types of transactions and their emulation on KVM on x86 (it would be trivial to use memory BARs on non-x86 platforms if they don't have PIO). Example benchmark (cycles per transaction):

(io access) outw: 1737
(memory access) movw: 4341

for comparison:

(hypercall access) vmcall: 1566
(pv memory access) movw_fast: 1817 (*explanation what this is below)

This creates a problem if we want to make virtio devices proper PCI express devices with native hotplug support. This is because each hotpluggable PCI express device always has a PCI express port (port per device), where each port is represented by a PCI to PCI bridge. In turn, a PCI to PCI bridge claims a 4Kbyte aligned range of IO addresses. This means that we can have at most 15 such devices; this is a nasty limitation.

Another problem with PIO is support for physical virtio devices, and nested virt: KVM currently programs all PIO accesses to cause a vm exit, so using this device in a VM will be slow.

So we really want to stop using IO BARs completely if at all possible, but looking at the table above, switching to a memory BAR and movw for notifications will not work well.

Possible solutions:

1. hypercall instead of PIO

Basically, add a hypercall that gets an MMIO address/data and does an MMIO write for us. We'll want some capability in the device to let the guest know this is what it should do.

Pros: even faster than PIO
Cons: this won't help nested or assigned devices (won't hurt them either, as it will be conditional on the capability above).
Cons: need host kernel support, which then has to be maintained forever, even if intel speeds up MMIO exits.

2. pv memory access

There are two reasons that memory access is slower:
- one is that it's handled as an EPT misconfiguration error, so it is handled by the cpu slow path
- one is that we need to decode the x86 instruction in software, to calculate address/data for the access.

We could agree that guests would use a specific instruction for virtio accesses, and fast-path it specifically. This is the pv memory access option above.

Pros: helps assigned devices and nested virt
Pros: easy to drop if hardware support is there
Cons: a bit slower than IO
Cons: need host kernel support

3. hypervisor assigned IO address

qemu can reserve IO addresses and assign them to virtio devices. 2 bytes per device (for notification and ISR access) will be enough. So we can reserve 4K and this gets us 2000 devices. From the KVM perspective, nothing changes. We'll want some capability in the device to let the guest know this is what it should do, and pass the io address. One way to reserve the addresses is by using the bridge.

Pros: no need for host kernel support
Pros: regular PIO, so fast
Cons: does not help assigned devices, breaks nested virt

Simply counting pros/cons, option 3 seems best. It's also the easiest to implement.

Agree.
Re: [PATCH v5 6/6] KVM: s390: Wire up ioeventfd.
On Thu, Feb 28, 2013 at 12:33:21PM +0100, Cornelia Huck wrote:

Enable ioeventfd support on s390 and hook up diagnose 500 virtio-ccw notifications.

Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 arch/s390/kvm/Kconfig    |  1 +
 arch/s390/kvm/Makefile   |  2 +-
 arch/s390/kvm/diag.c     | 26 ++++++++++++++++++++++++++
 arch/s390/kvm/kvm-s390.c |  1 +
 4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index b58dd86..3c43e30 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -22,6 +22,7 @@ config KVM
 	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
+	select HAVE_KVM_EVENTFD
 	---help---
 	  Support hosting paravirtualized guest machines using the SIE
 	  virtualization capability on the mainframe. This should work

diff --git a/arch/s390/kvm/Makefile b/arch/s390/kvm/Makefile
index 3975722..8fe9d65 100644
--- a/arch/s390/kvm/Makefile
+++ b/arch/s390/kvm/Makefile
@@ -6,7 +6,7 @@
 # it under the terms of the GNU General Public License (version 2 only)
 # as published by the Free Software Foundation.

-common-objs = $(addprefix ../../../virt/kvm/, kvm_main.o)
+common-objs = $(addprefix ../../../virt/kvm/, kvm_main.o eventfd.o)

 ccflags-y := -Ivirt/kvm -Iarch/s390/kvm

diff --git a/arch/s390/kvm/diag.c b/arch/s390/kvm/diag.c
index a390687..1c01a99 100644
--- a/arch/s390/kvm/diag.c
+++ b/arch/s390/kvm/diag.c
@@ -13,6 +13,7 @@
 #include <linux/kvm.h>
 #include <linux/kvm_host.h>
+#include <asm/virtio-ccw.h>
 #include "kvm-s390.h"
 #include "trace.h"
 #include "trace-s390.h"
@@ -104,6 +105,29 @@ static int __diag_ipl_functions(struct kvm_vcpu *vcpu)
 	return -EREMOTE;
 }

+static int __diag_virtio_hypercall(struct kvm_vcpu *vcpu)
+{
+	int ret, idx;
+
+	/* No virtio-ccw notification? Get out quickly. */
+	if (!vcpu->kvm->arch.css_support ||
+	    (vcpu->run->s.regs.gprs[1] != KVM_S390_VIRTIO_CCW_NOTIFY))
+		return -EOPNOTSUPP;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	/*
+	 * The layout is as follows:
+	 * - gpr 2 contains the subchannel id (passed as addr)
+	 * - gpr 3 contains the virtqueue index (passed as datamatch)
+	 */
+	ret = kvm_io_bus_write(vcpu->kvm, KVM_VIRTIO_CCW_NOTIFY_BUS,
+			       vcpu->run->s.regs.gprs[2],
+			       8, &vcpu->run->s.regs.gprs[3]);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	/* kvm_io_bus_write returns -EOPNOTSUPP if it found no match. */
+	return ret < 0 ? ret : 0;
+}
+

What about the cookie?
Re: [PATCH v12 rebased 2/8] start vm after resetting it
On Wed, Feb 20, 2013 at 04:13:49PM +0800, Hu Tao wrote:
On Thu, Feb 07, 2013 at 11:50:28PM -0200, Marcelo Tosatti wrote:
On Wed, Jan 23, 2013 at 03:19:23PM +0800, Hu Tao wrote:
From: Wen Congyang we...@cn.fujitsu.com

The guest should run after resetting it, but it does not run if its old state is RUN_STATE_INTERNAL_ERROR or RUN_STATE_PAUSED. We don't set runstate to RUN_STATE_PAUSED when resetting the guest, so the runstate will be changed from RUN_STATE_INTERNAL_ERROR or RUN_STATE_PAUSED to RUN_STATE_RUNNING (not RUN_STATE_PAUSED).

It appears the last hunk will automatically reset state from RUN_STATE_INTERNAL_ERROR to RUN_STATE_RUNNING?

Yes.

I suppose the transition table allows, from RUN_STATE_INTERNAL_ERROR:

monitor system_reset
monitor cont

to resume the machine?

True. I think the purpose of this patch is to always reset and _run_ the guest by `system_reset', avoiding an additional `cont' following `system_reset'.

Unclear why it's essential to the feature being proposed?
Re: [PATCH v12 rebased] kvm: notify host when the guest is panicked
On Thu, Feb 28, 2013 at 04:54:25PM +0800, Hu Tao wrote:

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 06fdbd9..c15ef33 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -96,5 +96,7 @@ struct kvm_vcpu_pv_apf_data {
 #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
 #define KVM_PV_EOI_DISABLED 0x0

+#define KVM_PV_EVENT_PORT	(0x505UL)
+

No need for the ioport to be hard coded. What are the options to communicate an address to the guest? An MSR, via ACPI?

I'm not quite understanding here. By 'address', you mean an ioport? How to communicate an address? (I have little knowledge about ACPI.)

Yes, the ioport. The address of the ioport should not be fixed (for example, a future emulated board could use that fixed ioport address, 0x505UL). One option is to pass the address via an MSR.

Yes, that is probably the best option, because there is no dependency on ACPI.

pv-event is a bad name for an interface which is specific to notifying panic events. Please use pv-panic everywhere.

panic event is one of the events supported. Can we keep the name?

Call the initialization code from kvm_guest_init; only one function is necessary.

At the point of kvm_guest_init, request_region (called by kvm_pv_event_init) will block, so the guest kernel won't come up.

Why does it block?

#define PANIC_TIMER_STEP 100
#define PANIC_BLINK_SPD 18
@@ -132,6 +133,9 @@ void panic(const char *fmt, ...)
 	if (!panic_blink)
 		panic_blink = no_blink;

+	if (kvm_pv_event_enabled())
+		panic_timeout = 0;
+

What is the rationale behind this?

This is a hack to disable reset_on_panic if the user enables pv-event.

Condition it on kvm_pv_event_enabled() directly?
Re: KVM guest OS reports zero CPU frequency via /proc/cpuinfo
https://lists.gnu.org/archive/html/qemu-devel/2012-10/msg02269.html

On Wed, Feb 20, 2013 at 01:25:04AM +0400, Dmitry Krivenok wrote:

Hello, I encountered an odd issue with running a VMWare .vmdk image inside KVM. I run KVM as follows:

qemu-kvm -hda ./guest.vmdk -m 4096

and the guest OS (SLES11) boots fine, but reports zero CPU freq via /proc/cpuinfo (cpu MHz):

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 2
model name      : QEMU Virtual CPU version 0.15.1
stepping        : 3
cpu MHz         : 0.000
cache size      : 4096 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm up rep_good nopl pni cx16 popcnt lahf_lm
bogomips        : 2088.96
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
$

I tried to convert the .vmdk to a qcow2 image and also tried different -cpu=MODEL options, but I always get zero frequency in this guest. Another guest (Ubuntu in a native qcow2 image) works fine and reports the correct cpu frequency, so I only see this problem with this particular .vmdk image.

I'm running "QEMU emulator version 0.15.1 (kvm-0.15.1-0.17.3), Copyright (c) 2003-2008 Fabrice Bellard" on a SLES11-SP2 host (3.0 kernel + patches).

This problem is critical for me because my guest runs a binary which relies on a non-zero CPU freq in /proc/cpuinfo. Any ideas? Thanks!

--
Sincerely yours, Dmitry V. Krivenok
e-mail: krivenok.dmi...@gmail.com
skype: krivenok_dmitry
jabber: krivenok_dmi...@jabber.ru
icq: 242-526-443
Re: [Bisected][-next-20130204+] [x86/kvm] udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d
On Sun, Feb 24, 2013 at 04:23:44PM -0500, Peter Hurley wrote:
On Tue, 2013-02-19 at 10:26 +0200, Gleb Natapov wrote:
On Mon, Feb 18, 2013 at 08:12:21PM -0500, Peter Hurley wrote:
On Mon, 2013-02-18 at 19:59 -0300, Marcelo Tosatti wrote:
On Wed, Feb 13, 2013 at 06:57:09AM -0500, Peter Hurley wrote:
On Wed, 2013-02-13 at 12:51 +0200, Gleb Natapov wrote:
On Tue, Feb 12, 2013 at 04:39:03PM -0800, H. Peter Anvin wrote:
On 02/12/2013 04:26 PM, Peter Hurley wrote:

With -next-20130204+ in an ubuntu 12.10 VM (so the 80x25 VGA device/console):

[0.666410] udevd[97]: starting version 175
[0.674043] udevd[97]: udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d

and boots to an initramfs prompt. git bisect (log attached) blames:

commit 7b5c4a65cc27f017c170b025f8d6d75dabb11c6f
Merge: 3596f5b 949db15
Author: H. Peter Anvin h...@linux.intel.com
Date: Fri Jan 25 16:31:21 2013 -0800

    Merge tag 'v3.8-rc5' into x86/mm

    The __pa() fixup series that follows touches KVM code that is not present in the existing branch based on v3.7-rc5, so merge in the current upstream from Linus.

    Signed-off-by: H. Peter Anvin h...@linux.intel.com

This only happens with the VGA device/console, but that is the default configuration for Ubuntu/KVM because it blacklists pretty much every fb driver.

I am guessing this is another bad use of __pa()... need to look into that.

Can't find this commit on kvm.git or linux-2.6.git. Is it reproducible there?

This is in the linux-next repo (any git tag after 'next-20130204' will reproduce this). It's a pretty large merge commit. This doesn't happen on 3.8-rc7. I'll try to repro this on kvm.git sometime this week. Otherwise, we can wait for it to show up in 3.9.

Can you also drop 5dfd486c4750c9278c63fa96e6e85bdd2fb58e9d from linux-next and reproduce?

Ok, found and fixed. This will need to go to stable for 3.8 as well.

ACK, please send an email with the patch to kvm@vger.kernel.org.
Regards, Peter Hurley

--- % ---

From: Peter Hurley pe...@hurleysoftware.com
Date: Sun, 24 Feb 2013 10:55:09 -0500
Subject: [PATCH] x86/kvm: Fix pvclock vsyscall fixmap

The physical memory fixmapped for the pvclock clock_gettime vsyscall was allocated, and thus is not a kernel symbol. __pa() is the proper method to use in this case.

Fixes the crash below when booting a next-20130204+ smp guest on a 3.8-rc5+ KVM host.

[0.666410] udevd[97]: starting version 175
[0.674043] udevd[97]: udevd:[97]: segfault at ff5fd020 ip 7fff069e277f sp 7fff068c9ef8 error d

Signed-off-by: Peter Hurley pe...@hurleysoftware.com
---
 arch/x86/kernel/pvclock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 85c3959..2cb9470 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -185,7 +185,7 @@ int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
 	for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
 		__set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
-			     __pa_symbol(i) + (idx*PAGE_SIZE),
+			     __pa(i) + (idx*PAGE_SIZE),
 			     PAGE_KERNEL_VVAR);
 	}
--
1.8.1.2

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6] KVM: Clean up and optimize set_memory_region() - part2
On Wed, Feb 27, 2013 at 07:41:03PM +0900, Takuya Yoshikawa wrote: Note: this is based on the latest master branch. I'm sending this before 3.9-rc1 is released since this can cause extra conflicts unless we make this available to all arch before we start adding new stuff. Please review: every change is trivial and should not change anything.

Takuya Yoshikawa (6):
  KVM: set_memory_region: Drop user_alloc from prepare/commit_memory_region()
  KVM: set_memory_region: Drop user_alloc from set_memory_region()
  KVM: set_memory_region: Make kvm_mr_change available to arch code
  KVM: set_memory_region: Refactor prepare_memory_region()
  KVM: set_memory_region: Refactor commit_memory_region()
  KVM: ARM: Remove kvm_arch_set_memory_region()

 arch/arm/kvm/arm.c                 | 15 ++--
 arch/ia64/kvm/kvm-ia64.c           | 25 ++---
 arch/powerpc/include/asm/kvm_ppc.h |  2 +-
 arch/powerpc/kvm/book3s_hv.c       |  4 +-
 arch/powerpc/kvm/book3s_pr.c       |  2 +-
 arch/powerpc/kvm/booke.c           |  2 +-
 arch/powerpc/kvm/powerpc.c         | 13 +--
 arch/s390/kvm/kvm-s390.c           | 10 ++--
 arch/x86/kvm/vmx.c                 |  6 ++--
 arch/x86/kvm/x86.c                 | 25 +
 include/linux/kvm_host.h           | 35 +-
 virt/kvm/kvm_main.c                | 40 ---
 12 files changed, 67 insertions(+), 112 deletions(-)
--
1.7.4.1

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Mon, Feb 25, 2013 at 11:13:25AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-25: On Mon, Feb 25, 2013 at 11:04:25AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-25: On Mon, Feb 25, 2013 at 08:42:52AM +, Zhang, Yang Z wrote: Avi Kivity wrote on 2013-02-25: I didn't really follow, but is the root cause the need to keep track of interrupt coalescing? If so we can recommend that users use KVM_IRQ_LINE when coalescing is unneeded, and move interrupt injection with irq coalescing support to vcpu context. So we can hide the capability KVM_CAP_IRQ_INJECT_STATUS when posted interrupt is enabled to force users not to use KVM_IRQ_LINE_STATUS. Is this acceptable? The only case in KVM that needs to know the interrupt injection status is the vlapic timer. But since the vlapic timer and the vcpu are always on the same pcpu, there is no problem. Not really. The primary user of this interface is RTC interrupt re-injection for Windows guests. So without the KVM_IRQ_LINE_STATUS capability, RTC cannot work well? Windows guests may experience time drift under CPU overcommit scenarios. Ok, I see. Seems we are stuck. :( Do you have any suggestion to solve or work around the current problem? Depend on knowledge about atomicity (item 5 IIRC) of the sequence in the manual.
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Mon, Feb 25, 2013 at 06:55:58AM +, Zhang, Yang Z wrote:

p1)
cpu0                       cpu1                   vcpu0
test_and_set_bit(PIR-A)
set KVM_REQ_EVENT
                                                  process REQ_EVENT
                                                  PIR-A->IRR
                                                  vcpu->mode = IN_GUEST
if (vcpu0->guest_mode)
  if (!t_a_s_bit(PIR notif))
    send IPI
                           linux_pir_handler
                           t_a_s_b(PIR-B) = 1
                           no PIR IPI sent
p2)

No, this exists with your implementation, not mine. As I said, in this patch, it will make the request after vcpu->mode == guest mode, then send the ipi. Even if the ipi is lost, the request will still be tracked, because we have the following check:

	vcpu->mode = GUEST_MODE
	(ipi may arrive here and be lost)
	local irq disable
	check request (this will ensure the pir modification will be picked up by the vcpu before vmentry)

Please read the sequence above again, between p1) and p2). Yes, if the IPI is lost, the request _for the CPU which sent the PIR IPI_ is guaranteed to be processed, but not the request for another cpu (cpu1). If no PIR IPI is sent by cpu1, delivered is false, and it will fall back to the old logic:

__apic_accept_irq():
	if (!delivered) {
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		kvm_vcpu_kick(vcpu);
	}

This can solve the problem you mentioned above. Right? Should not be doing this kick in the first place. One result of it is that you'll always take a VM-exit if a second injection happens while a VCPU has not handled the first one. What is the drawback of the suggestion to unconditionally clear the PIR notification bit on VM-entry? We can avoid it, but let's get it correct first (BTW can stick a comment saying FIXME: optimize) on that PIR clearing.

cpu0                                 cpu1
check pir (pass)
check irr (pass)
injected = !test_and_set_bit(pir)
                                     xchg(pir)

No information can be lost because all accesses to shared data are atomic. Sorry, I cannot get your point. Why can 'xchg(pir)' solve the problem I mentioned above? Can you elaborate? The spinlock I used is trying to protect the whole procedure of syncing pir to irr, not just access to pir. You're right, it's the same problem as with the hardware update.
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Mon, Feb 25, 2013 at 03:34:19PM +0200, Gleb Natapov wrote: On Mon, Feb 25, 2013 at 11:13:25AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-25: On Mon, Feb 25, 2013 at 11:04:25AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-25: On Mon, Feb 25, 2013 at 08:42:52AM +, Zhang, Yang Z wrote: Avi Kivity wrote on 2013-02-25: I didn't really follow, but is the root cause the need to keep track of interrupt coalescing? If so we can recommend that users use KVM_IRQ_LINE when coalescing is unneeded, and move interrupt injection with irq coalescing support to vcpu context. So we can hide the capability KVM_CAP_IRQ_INJECT_STATUS when posted interrupt is enabled to force users doesn't to use KVM_IRQ_LINE_STATUS. Does this acceptable? The only case in KVM that need to know the interrupt injection status is vlapic timer. But since vlapic timer and vcpu are always in same pcpu, so there is no problem. Not really. The primary user of this interface is RTC interrupt re-injection for Windows guests. So without KVM_IRQ_LINE_STATUS capability, RTC cannot work well? Windows guests may experience timedrift under CPU overcommit scenario. Ok, I see. Seems we are stuck. :( Do you have any suggestion to solve or workaround current problem? I see a couple of possible solutions: 1. Do what Avi said. Make KVM_IRQ_LINE_STATUS be synchronous. Cons: current QEMU uses KVM_IRQ_LINE_STATUS always and it means that it will be slow on newer kernels Can add a capability to QEMU and enable APICv selectively only in newer QEMU, which can issue KVM_IRQ_LINE_STATUS on target vcpu only when necessary (and KVM_IRQ_LINE otherwise). Even a lock serializing injection is not safe because ON bit is cleared before XCHG(PIR, 0). Must do something heavier (such as running on target vcpu context). 2. Make KVM_IRQ_LINE_STATUS report coalescing only when vcpu is not running during injection. 
This assumes that if the vcpu is running and does not process the interrupt it is the guest's fault, and the same can happen on real HW too. Coalescing when the vcpu is not running, though, is the result of CPU overcommit and should be reported. Cons: the interface definition is kind of murky. 3. Do not report the KVM_IRQ_LINE_STATUS capability and move RTC to use EOI notifiers for interrupt reinjection. This requires us to add an interface for reporting EOI to userspace. This is not in the scope of this patchset. Cons: need to introduce a new interface (and one that will not work on AMD, BTW). Breaks older userspace? Other ideas? Can HW write a 'finished' bit after 6 in the reserved area? Suppose it's not a KVM-specific problem?
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Mon, Feb 25, 2013 at 11:00:21AM -0300, Marcelo Tosatti wrote: I see a couple of possible solutions: 1. Do what Avi said. Make KVM_IRQ_LINE_STATUS be synchronous. Cons: current QEMU uses KVM_IRQ_LINE_STATUS always and it means that it will be slow on newer kernels Can add a capability to QEMU and enable APICv selectively only in newer QEMU, which can issue KVM_IRQ_LINE_STATUS on target vcpu only when necessary (and KVM_IRQ_LINE otherwise). Bad idea. What happens with mixed scenarios. Even a lock serializing injection is not safe because ON bit is cleared before XCHG(PIR, 0). Must do something heavier (such as running on target vcpu context). Note always running on target vcpu is likely to be slower than no-APICv. So need to do something heavier on the kernel under serialization, if firmware cannot be changed (injection from simultaneous CPUs should be rare so if data to serialize __accept_apic_irq is cache-line aligned it should reduce performance impact). 2. Make KVM_IRQ_LINE_STATUS report coalescing only when vcpu is not running during injection. This assumes that if vcpu is running and does not process interrupt it is guest fault and the same can happen on real HW too. Coalescing when vcpu is not running though is the result of CPU overcommit and should be reported. Cons interface definition is kind of murky. 3. Do not report KVM_IRQ_LINE_STATUS capability and move RTC to use EOI notifiers for interrupt reinjection. This requires us to add interface for reporting EOI to userspace. This is not in the scope of this patchset. Cons: need to introduce new interface (and the one that will not work on AMD BTW) Is there no int-ack notification at RTC HW level? Breaks older userspace? Other ideas? Can HW write a 'finished' bit after 6 in the reserved area? Suppose its not a KVM-specific problem? 
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Mon, Feb 25, 2013 at 07:40:07PM +0200, Gleb Natapov wrote: On Mon, Feb 25, 2013 at 11:17:39AM -0300, Marcelo Tosatti wrote: On Mon, Feb 25, 2013 at 11:00:21AM -0300, Marcelo Tosatti wrote: I see a couple of possible solutions: 1. Do what Avi said. Make KVM_IRQ_LINE_STATUS be synchronous. Cons: current QEMU uses KVM_IRQ_LINE_STATUS always and it means that it will be slow on newer kernels Can add a capability to QEMU and enable APICv selectively only in newer QEMU, which can issue KVM_IRQ_LINE_STATUS on target vcpu only when necessary (and KVM_IRQ_LINE otherwise). Bad idea. What happens with mixed scenarios. Exactly. Even a lock serializing injection is not safe because ON bit is cleared before XCHG(PIR, 0). Must do something heavier (such as running on target vcpu context). Note always running on target vcpu is likely to be slower than no-APICv. Yes, but we will do it only for RTC interrupt. Still performance hit should be very visible when RTC is in 1000Hz mode. So older qemu-kvm on APICv enabled hardware becomes slow? If that is acceptable, fine. So need to do something heavier on the kernel under serialization, if firmware cannot be changed (injection from simultaneous CPUs should be rare so if data to serialize __accept_apic_irq is cache-line aligned it should reduce performance impact). 2. Make KVM_IRQ_LINE_STATUS report coalescing only when vcpu is not running during injection. This assumes that if vcpu is running and does not process interrupt it is guest fault and the same can happen on real HW too. Coalescing when vcpu is not running though is the result of CPU overcommit and should be reported. Cons interface definition is kind of murky. 3. Do not report KVM_IRQ_LINE_STATUS capability and move RTC to use EOI notifiers for interrupt reinjection. This requires us to add interface for reporting EOI to userspace. This is not in the scope of this patchset. 
Cons: need to introduce a new interface (and one that will not work on AMD, BTW). Is there no int-ack notification at RTC HW level? There is, but Windows calls it twice for each injected interrupt. I tried to use it in the past to detect interrupt coalescing, but this double ack prevented me from doing so. Maybe revisit this approach with willingness to be more hacky about it. Also it is possible to disable RTC ack for HyperV guests. We do not do it, and if we use the ack we will not be able to. Is it OK to kill the ability to have coalescing information? Breaks older userspace? Older userspace will have time drift with Windows, yes. Note that with some versions of Windows it has it now too. Other ideas? Can HW write a 'finished' bit after 6 in the reserved area? Suppose it's not a KVM-specific problem? I still think adding locking just to obtain correct injection status is a bad trade-off even if HW would allow us to get away with it. Perhaps with a notification at the end of the copy it could be lockless. With the current scheme, it is not possible to get it right because the notification bit disappears temporarily from sight. And it is not possible to distinguish between 'interrupt injected' and 'interrupt merged', as discussed here. So one would have to serialize completely along the lines of: lock and only inject if we can identify that the interrupt handler is not running. If it's possible to drop KVM_IRQ_LINE_STATUS, then demand that APICv HW use recent software, fine (did not grasp Avi's second idea).
Re: [PATCH] x86: kvmclock: Do not setup kvmclock vsyscall in the absence of that clock
On Sat, Feb 23, 2013 at 05:05:29PM +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com

This fixes boot lockups with no-kvmclock, when the host is not exposing this particular feature (QEMU: -cpu ...,-kvmclock) or when the kvmclock initialization failed for whatever reason.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
Should go to 3.8 as well, I presume.

 arch/x86/kernel/kvmclock.c | 9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5bedbdd..b730efa 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -160,8 +160,12 @@ int kvm_register_clock(char *txt)
 {
 	int cpu = smp_processor_id();
 	int low, high, ret;
-	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
+	struct pvclock_vcpu_time_info *src;
+
+	if (!hv_clock)
+		return 0;
+
+	src = &hv_clock[cpu].pvti;
 	low = (int)__pa(src) | 1;
 	high = ((u64)__pa(src) >> 32);
 	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
@@ -276,6 +280,9 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 	struct pvclock_vcpu_time_info *vcpu_time;
 	unsigned int size;
 
+	if (!hv_clock)
+		return 0;
+
 	size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
 
 	preempt_disable();
--
1.7.3.4

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Sun, Feb 24, 2013 at 01:17:59PM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-24: On Sat, Feb 23, 2013 at 03:16:11PM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-23: On Sat, Feb 23, 2013 at 02:05:28PM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-23: On Fri, Feb 22, 2013 at 09:42:32PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com Posted Interrupt allows APIC interrupts to inject into guest directly without any vmexit. - When delivering a interrupt to guest, if target vcpu is running, update Posted-interrupt requests bitmap and send a notification event to the vcpu. Then the vcpu will handle this interrupt automatically, without any software involvemnt. - If target vcpu is not running or there already a notification event pending in the vcpu, do nothing. The interrupt will be handled by next vm entry Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index e4595f1..64616cc 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -228,6 +228,26 @@ void smp_x86_platform_ipi(struct pt_regs *regs) set_irq_regs(old_regs); } +#ifdef CONFIG_HAVE_KVM +/* + * Handler for POSTED_INTERRUPT_VECTOR. + */ +void smp_kvm_posted_intr_ipi(struct pt_regs *regs) +{ + struct pt_regs *old_regs = set_irq_regs(regs); + + ack_APIC_irq(); + + irq_enter(); + + exit_idle(); + + irq_exit(); + + set_irq_regs(old_regs); +} +#endif + Add per-cpu counters, similarly to other handlers in the same file. Sure. @@ -379,6 +398,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic) if (!apic-irr_pending) return -1; + kvm_x86_ops-sync_pir_to_irr(apic-vcpu); result = apic_search_irr(apic);ASSERT(result == -1 || result = 16); kvm_x86_ops-sync_pir_to_irr() should be called from vcpu_enter_guest, before inject_pending_event. if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) { - inject_pending_event ... 
} Some other places will search irr to do something, like kvm_cpu_has_interrupt(). So we must to sync pir to irr before touch irr, not just before enter guest. I see. The only call site that needs IRR/PIR information with posted interrupt enabled is kvm_arch_vcpu_runnable, correct? Yes. If so, can we contain reading PIR to only that location? Yes, we can do it. Actually, why I do pir-irr here is to avoid the wrong usage in future of check pending interrupt just by call kvm_vcpu_has_interrupt(). Yes, i see. Also, there may need an extra callback to check pir. So you think your suggestions is better? Because it respects standard request processing mechanism, yes. Sure. Will change it in next patch. And have PIR-IRR sync at KVM_REQ_EVENT processing only (to respect the standard way of event processing and also reduce reading the PIR). Since we will check ON bit before each reading PIR, the cost can be ignored. Note reading ON bit is not OK, because if injection path did not see vcpu-mode == IN_GUEST_MODE, ON bit will not be set. In my patch, It will set ON bit before check vcpu-mode. So it's ok. Actually, I think the ON bit must be set unconditionally after change PIR regardless of vcpu-mode. So it is possible that a PIR IPI is delivered and handled before guest entry. By clearing PIR notification bit after local_irq_disable, but before the last check for vcpu-requests, we guarantee that a PIR IPI is never lost. Makes sense? (please check the logic, it might be flawed). I am not sure whether I understand you. 
But I don't think the IPI will be lost in the current patch:

	if (!pi_test_and_set_on(&vmx->pi_desc) &&
	    (vcpu->mode == IN_GUEST_MODE)) {
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
				    POSTED_INTR_VECTOR);
		*delivered = true;
	}

vcpu entry has:

	vcpu->mode = GUEST_MODE
	local irq disable
	check request

We will send the IPI after making the request, so even if the IPI is received after setting guest_mode but before disabling irqs, it will still be handled by the following check of requests. Please correct me if I am wrong.

p1)
cpu0                       cpu1                   vcpu0
test_and_set_bit(PIR-A)
set KVM_REQ_EVENT
                                                  process REQ_EVENT
                                                  PIR-A->IRR
                                                  vcpu->mode = IN_GUEST
if (vcpu0->guest_mode)
  if (!t_a_s_bit(PIR notif))
    send IPI
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Sun, Feb 24, 2013 at 01:55:07PM +, Zhang, Yang Z wrote: contexts, but only two use locks. See the following logic; I think the problem you mentioned will not happen with the current logic:

	get lock
	if test_pir (this will ensure there is no in-flight IPI for the same
	    interrupt. Since we are taking the lock now, no IPI will be sent
	    before we release the lock and no pir->irr transfer is performed
	    by hardware for the same interrupt.)
		return coalesced
	if test_irr
		return coalesced
	set_pir
	injected = true
	if t_a_s(on) && in guest mode
		send ipi
	unlock

I'd rather think about a proper way to do lockless injection before committing anything. The case where we care about correct injection status is rare, but we always care about scalability, and since we violate the spec by reading the vapic page while the vcpu is in non-root operation anyway, the result may be incorrect with or without the lock. Our use case was not in the HW designers' minds when they designed this thing, obviously :( Zhang, can you comment on whether it is safe to read the vapic page while a CPU in VMX non-root operation is accessing it? See 24.11.4: In addition to data in the VMCS region itself, VMX non-root operation can be controlled by data structures that are referenced by pointers in a VMCS (for example, the I/O bitmaps). While the pointers to these data structures are parts of the VMCS, the data structures themselves are not. They are not accessible using VMREAD and VMWRITE but by ordinary memory writes. Software should ensure that each such data structure is modified only when no logical processor with a current VMCS that references it is in VMX non-root operation. Doing otherwise may lead to unpredictable behavior. This means the initial design of KVM is wrong: it should not modify vIRR directly. The good thing is that reading is ok. OK. Gleb, yes, a comment mentioning the race (instead of the spinlock) and an explanation of why it is believed to be benign (given how the injection return value is interpreted) could also work. It's ugly, though... murphy is around.
The race above is not benign. It will report an interrupt as coalesced while in reality it is injected. This may cause too many interrupts to be injected. If this happens rarely enough, ntp may be able to fix the time drift resulting from it. Please check the above logic. Does it have the same problem? If yes, please point it out. 1) set_pir must be test_and_set_bit() (so the lock at the vcpu entry path can be dropped). 2) 'if t_a_s(on) && in guest mode: send ipi' should be inverted: 'if in guest mode && t_a_s(on): send ipi'. So you avoid setting the ON bit if the guest is outside of guest mode.
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Sun, Feb 24, 2013 at 04:19:17PM +0200, Gleb Natapov wrote: On Sun, Feb 24, 2013 at 01:55:07PM +, Zhang, Yang Z wrote: I do not think it fixes it. There is no guarantee that the IPI will be processed by the remote cpu while the sending cpu is still in the locked section, so the same race may happen regardless. As you say above there are 3 contexts, but only two use locks. See the following logic; I think the problem you mentioned will not happen with the current logic: get lock; if test_pir (this will ensure there is no in-flight IPI for the same interrupt. Since we are taking the lock now, no IPI will be sent before we release the lock and no pir->irr transfer is performed by hardware for the same interrupt.) Does the PIR IPI interrupt return synchronously _after_ the PIR->IRR transfer is made? Don't think so. The PIR IPI interrupt returns after the remote CPU has acked receipt of the interrupt, not after the remote CPU has acked _and_ performed the PIR->IRR transfer. Right? (yes, right, see step 4. below). Should try to make it easier to drop the lock later, so depend on it as little as possible (also please document what it protects in detail). Also note: 3. The processor clears the outstanding-notification bit in the posted-interrupt descriptor. This is done atomically so as to leave the remainder of the descriptor unmodified (e.g., with a locked AND operation). 4. The processor writes zero to the EOI register in the local APIC; this dismisses the interrupt with the posted-interrupt notification vector from the local APIC. 5. The logical processor performs a logical-OR of PIR into VIRR and clears PIR. No other agent can read or write a PIR bit (or group of bits) between the time it is read (to determine what to OR into VIRR) and when it is cleared. So checking the ON bit does not mean the HW has finished performing the PIR->VIRR transfer (which means the ON bit can only be used as an indication of whether to send the interrupt, not whether the PIR->VIRR transfer is finished).
So it's fine to do:

	- atomic set pir
	- if (!atomic_test_and_set(PIR ON bit))
	-	send IPI

But HW can transfer two distinct bits, set by different serialized instances of kvm_set_irq (protected with a lock), in a single PIR->IRR pass. I do not see where those assumptions are coming from. Testing pir does not guarantee that the IPI is not being processed by the VCPU right now:

	iothread:                          vcpu:
	send_irq()
	lock(pir)
	check pir and irr
	set pir
	send IPI (*)
	unlock(pir)

	send_irq()
	lock(pir)
	                                   receive IPI (*)
	                                   atomic {
	                                       pir_tmp = pir
	                                       pir = 0
	                                   }
	check pir and irr
	                                   irr = pir_tmp
	set pir
	send IPI
	unlock(pir)

At this point both pir and irr are set and the interrupt may be coalesced, but it is reported as delivered. s/set pir/injected = !t_a_s(pir)/ So what prevents the scenario above from happening? -- Gleb.
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Fri, Feb 22, 2013 at 09:42:32PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com

Posted Interrupt allows APIC interrupts to be injected into the guest directly without any vmexit.

- When delivering an interrupt to the guest, if the target vcpu is running, update the Posted-interrupt requests bitmap and send a notification event to the vcpu. Then the vcpu will handle this interrupt automatically, without any software involvement.
- If the target vcpu is not running or there is already a notification event pending in the vcpu, do nothing. The interrupt will be handled by the next vm entry.

Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index e4595f1..64616cc 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -228,6 +228,26 @@ void smp_x86_platform_ipi(struct pt_regs *regs)
 	set_irq_regs(old_regs);
 }
 
+#ifdef CONFIG_HAVE_KVM
+/*
+ * Handler for POSTED_INTERRUPT_VECTOR.
+ */
+void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	ack_APIC_irq();
+	irq_enter();
+	exit_idle();
+	irq_exit();
+	set_irq_regs(old_regs);
+}
+#endif
+

Add per-cpu counters, similarly to other handlers in the same file.

@@ -379,6 +398,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
 	if (!apic->irr_pending)
 		return -1;
 
+	kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
 	result = apic_search_irr(apic);
 	ASSERT(result == -1 || result >= 16);

kvm_x86_ops->sync_pir_to_irr() should be called from vcpu_enter_guest, before inject_pending_event:

	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
		-> inject_pending_event
		...
	}

@@ -685,6 +705,7 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 {
 	int result = 0;
 	struct kvm_vcpu *vcpu = apic->vcpu;
+	bool delivered = false;
 
 	switch (delivery_mode) {
 	case APIC_DM_LOWEST:
@@ -700,7 +721,10 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 		} else
 			apic_clear_vector(vector, apic->regs + APIC_TMR);
 
-		result = !apic_test_and_set_irr(vector, apic);
+		if (!kvm_x86_ops->deliver_posted_interrupt(vcpu, vector,
+						&result, &delivered))
+			result = !apic_test_and_set_irr(vector, apic);
+
 		trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
 					  trig_mode, vector, !result);
 		if (!result) {
@@ -710,8 +734,10 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 			break;
 		}
 
-		kvm_make_request(KVM_REQ_EVENT, vcpu);
-		kvm_vcpu_kick(vcpu);
+		if (!delivered) {
+			kvm_make_request(KVM_REQ_EVENT, vcpu);
+			kvm_vcpu_kick(vcpu);
+		}

This set bit / kick pair should be for non-APICv only (please check for it directly).

+	return test_bit(vector, (unsigned long *)pi_desc->pir);
+}
+
+static void pi_set_pir(int vector, struct pi_desc *pi_desc)
+{
+	__set_bit(vector, (unsigned long *)pi_desc->pir);
+}

This must be a locked atomic operation (set_bit).

 struct vcpu_vmx {
 	struct kvm_vcpu vcpu;
 	unsigned long host_rsp;
@@ -429,6 +465,10 @@ struct vcpu_vmx {
 	bool rdtscp_enabled;
 
+	/* Posted interrupt descriptor */
+	struct pi_desc pi_desc;
+	spinlock_t pi_lock;

Don't see why the lock is necessary. Please use the following pseudocode for vmx_deliver_posted_interrupt:

vmx_deliver_posted_interrupt(), returns 'bool injected'.

1. orig_irr = read irr from vapic page
2. if (orig_irr != 0)
3.	return false;
4. kvm_make_request(KVM_REQ_EVENT)
5. bool injected = !test_and_set_bit(PIR)
6. if (vcpu->guest_mode && injected)
7.	if (!test_and_set_bit(PIR notification bit))
8.		send PIR IPI
9. return injected

On vcpu_enter_guest:

	if (kvm_check_request(KVM_REQ_EVENT)) {
		pir->virr sync		(*)
		...
	}
	...
	vcpu->mode = IN_GUEST_MODE;
	local_irq_disable
	clear pir notification bit unconditionally (*) Right after local_irq_disable.

+static bool vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu,
+					 int vector, int *result, bool *delivered)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	unsigned long flags;
+
+	if (!vmx_vm_has_apicv(vcpu->kvm))
+		return false;
+
+	spin_lock_irqsave(&vmx->pi_lock, flags);
+	if (pi_test_pir(vector, &vmx->pi_desc) ||
+		kvm_apic_test_irr(vector, vcpu->arch.apic)) {
+		spin_unlock_irqrestore(&vmx->pi_lock, flags);
+		return true;
+	}
+
+
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Sat, Feb 23, 2013 at 02:05:28PM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-23: On Fri, Feb 22, 2013 at 09:42:32PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com Posted Interrupt allows APIC interrupts to inject into guest directly without any vmexit. - When delivering a interrupt to guest, if target vcpu is running, update Posted-interrupt requests bitmap and send a notification event to the vcpu. Then the vcpu will handle this interrupt automatically, without any software involvemnt. - If target vcpu is not running or there already a notification event pending in the vcpu, do nothing. The interrupt will be handled by next vm entry Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index e4595f1..64616cc 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -228,6 +228,26 @@ void smp_x86_platform_ipi(struct pt_regs *regs) set_irq_regs(old_regs); } +#ifdef CONFIG_HAVE_KVM +/* + * Handler for POSTED_INTERRUPT_VECTOR. + */ +void smp_kvm_posted_intr_ipi(struct pt_regs *regs) +{ + struct pt_regs *old_regs = set_irq_regs(regs); + + ack_APIC_irq(); + + irq_enter(); + + exit_idle(); + + irq_exit(); + + set_irq_regs(old_regs); +} +#endif + Add per-cpu counters, similarly to other handlers in the same file. Sure. @@ -379,6 +398,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic) if (!apic-irr_pending) return -1; + kvm_x86_ops-sync_pir_to_irr(apic-vcpu); result = apic_search_irr(apic); ASSERT(result == -1 || result = 16); kvm_x86_ops-sync_pir_to_irr() should be called from vcpu_enter_guest, before inject_pending_event. if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) { - inject_pending_event ... } Some other places will search irr to do something, like kvm_cpu_has_interrupt(). So we must to sync pir to irr before touch irr, not just before enter guest. I see. 
The only call site that needs IRR/PIR information with posted interrupt enabled is kvm_arch_vcpu_runnable, correct? If so, can we contain reading PIR to only that location? And have PIR->IRR sync at KVM_REQ_EVENT processing only (to respect the standard way of event processing and also reduce reading the PIR).

+{
+	__set_bit(vector, (unsigned long *)pi_desc->pir);
+}

This must be a locked atomic operation (set_bit).

If using a spinlock, pi_set_pir is not called concurrently. It is safe to use __set_bit.

The manual demands locked atomic accesses (remember this is a remote access). See the posted interrupt page.

Don't see why the lock is necessary. Please use the following pseudocode for vmx_deliver_posted_interrupt:

vmx_deliver_posted_interrupt(), returns 'bool injected'.

1. orig_irr = read irr from vapic page
2. if (orig_irr != 0)
3.	return false;
4. kvm_make_request(KVM_REQ_EVENT)
5. bool injected = !test_and_set_bit(PIR)
6. if (vcpu->guest_mode && injected)
7.	if (!test_and_set_bit(PIR notification bit))
8.		send PIR IPI
9. return injected

Consider the following case:

vcpu0                                  vcpu1
send intr to vcpu1
check irr
                                       receive a posted intr
                                       pir->irr (pir is cleared, irr is set)
injected = test_and_set_bit = true
pir = set

Then both irr and pir have the interrupt pending; they may merge into one later, but injected is reported as true. Wrong.

True. Have a lock around it, and at step 1 also verify if PIR is set. That would do it, yes?

On vcpu_enter_guest:

if (kvm_check_request(KVM_REQ_EVENT)) {
	pir->virr sync (*)
	...
}
...
vcpu->mode = IN_GUEST_MODE;
local_irq_disable
clear pir notification bit unconditionally (*)

Why clear the ON bit here? If there is no pir->irr operation in this vmentry, clearing it here is redundant.

A PIR IPI must not be lost, because if it is lost, interrupt injection can be delayed while it must be performed immediately.

vcpu entry path has:
1. vcpu->mode = GUEST_MODE
2. local_irq_disable

injection path has:
1. 
if (vcpu->guest_mode && !test_and_set_bit(PIR notif bit))
	send IPI

So it is possible that a PIR IPI is delivered and handled before guest entry. By clearing the PIR notification bit after local_irq_disable, but before the last check for vcpu->requests, we guarantee that a PIR IPI is never lost. Makes sense? (please check the logic, it might be flawed).

Please confirm whether a spinlock is necessary with the pseudocode above.

In the current implementation, the spinlock is necessary to avoid a race condition between vcpus when delivering posted interrupts and syncing pir->irr.

Sync PIR->IRR uses xchg, which is locked and atomic, so I can't see the need for the spinlock.
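To make the nine-step pseudocode above concrete, here is a minimal user-space sketch. C11 atomics stand in for the kernel's locked test_and_set_bit, and the structure, field names, and the request/IPI bookkeeping are illustrative assumptions, not kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Illustrative stand-in for the vcpu/posted-interrupt state in the thread. */
struct vcpu {
	_Atomic unsigned long irr[4];   /* vIRR words from the vapic page */
	_Atomic unsigned long pir[4];   /* posted-interrupt request bitmap */
	_Atomic unsigned long pir_on;   /* PIR notification ("ON") bit */
	bool guest_mode;
	bool req_event;                 /* KVM_REQ_EVENT stand-in */
	int ipis_sent;                  /* counts "send PIR IPI" */
};

/* Locked test-and-set: returns the previous value of the bit. */
static bool test_and_set_bit(unsigned int nr, _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	return atomic_fetch_or(&addr[nr / BITS_PER_LONG], mask) & mask;
}

static bool test_bit(unsigned int nr, _Atomic unsigned long *addr)
{
	return atomic_load(&addr[nr / BITS_PER_LONG]) & (1UL << (nr % BITS_PER_LONG));
}

/* Returns 'bool injected', mirroring pseudocode steps 1-9. */
static bool deliver_posted_interrupt(struct vcpu *vcpu, unsigned int vector)
{
	if (test_bit(vector, vcpu->irr))               /* 1-3: already in IRR */
		return false;
	vcpu->req_event = true;                        /* 4: kvm_make_request */
	bool injected = !test_and_set_bit(vector, vcpu->pir);   /* 5 */
	if (vcpu->guest_mode && injected)              /* 6 */
		if (!test_and_set_bit(0, &vcpu->pir_on))   /* 7: ON bit gates IPI */
			vcpu->ipis_sent++;                 /* 8: send PIR IPI */
	return injected;                               /* 9 */
}
```

Note how the ON bit makes at most one IPI pending at a time: a second delivery to a different vector sets PIR but finds the notification bit already set, exactly the situation the rest of the thread worries about.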
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Sat, Feb 23, 2013 at 04:35:30PM +0200, Gleb Natapov wrote:

On Sat, Feb 23, 2013 at 02:05:28PM +0000, Zhang, Yang Z wrote:

+	return test_bit(vector, (unsigned long *)pi_desc->pir);
+}
+
+static void pi_set_pir(int vector, struct pi_desc *pi_desc)
+{
+	__set_bit(vector, (unsigned long *)pi_desc->pir);
+}

This must be a locked atomic operation (set_bit).

If using a spinlock, pi_set_pir is not called concurrently. It is safe to use __set_bit.

HW still accesses it concurrently without bothering to take your lock.

+
 struct vcpu_vmx {
 	struct kvm_vcpu vcpu;
 	unsigned long host_rsp;
@@ -429,6 +465,10 @@ struct vcpu_vmx {
 	bool rdtscp_enabled;
+	/* Posted interrupt descriptor */
+	struct pi_desc pi_desc;
+	spinlock_t pi_lock;

Don't see why the lock is necessary. Please use the following pseudocode for vmx_deliver_posted_interrupt:

vmx_deliver_posted_interrupt(), returns 'bool injected'.

1. orig_irr = read irr from vapic page
2. if (orig_irr != 0)
3.	return false;
4. kvm_make_request(KVM_REQ_EVENT)
5. bool injected = !test_and_set_bit(PIR)
6. if (vcpu->guest_mode && injected)
7.	if (!test_and_set_bit(PIR notification bit))
8.		send PIR IPI
9. return injected

Consider the following case:

vcpu0                                  vcpu1
send intr to vcpu1
check irr
                                       receive a posted intr
                                       pir->irr (pir is cleared, irr is set)
injected = test_and_set_bit = true
pir = set

Then both irr and pir have the interrupt pending; they may merge into one later, but injected is reported as true. Wrong.

Marcelo and I discussed the lockless logic that should be used here on the previous patch submission. All that is left for you is to implement it. We worked hard to make the irq injection path lockless; we are not going to add a lock now.

He is right, the scheme is still flawed (because of concurrent injectors along with HW in VMX non-root). I'd say let's add a spinlock and think about a lockless scheme in the meantime. 
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
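For contrast with the lockless pseudocode, the spinlock-protected scheme of the patch under review (check PIR and IRR under the lock, otherwise set PIR) can be sketched as follows. An atomic_flag spinlock stands in for spin_lock_irqsave, and every name here is an illustrative assumption rather than the kernel's:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Illustrative vcpu state: PIR and IRR bitmaps plus the proposed pi_lock. */
struct vcpu {
	unsigned long pir[4];
	unsigned long irr[4];
	atomic_flag pi_lock;
};

static bool bit_set(unsigned int nr, const unsigned long *w)
{
	return (w[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

/* Mirrors the quoted vmx_deliver_posted_interrupt: returns true if the
 * vector was already pending in PIR or IRR (i.e. the interrupt coalesced),
 * otherwise marks it pending in PIR and returns false. */
static bool deliver_locked(struct vcpu *vcpu, unsigned int vector)
{
	while (atomic_flag_test_and_set(&vcpu->pi_lock))
		;                               /* spin_lock_irqsave stand-in */
	bool pending = bit_set(vector, vcpu->pir) || bit_set(vector, vcpu->irr);
	if (!pending)
		vcpu->pir[vector / BITS_PER_LONG] |= 1UL << (vector % BITS_PER_LONG);
	atomic_flag_clear(&vcpu->pi_lock);      /* spin_unlock_irqrestore */
	return pending;
}
```

The lock makes the PIR-or-IRR check and the PIR update one atomic step with respect to other injectors, which is exactly what the lockless version cannot guarantee while hardware may concurrently move PIR into IRR.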
Re: [RFC PATCH 1/6] kvm: add device control API
On Thu, Feb 21, 2013 at 08:00:25PM -0600, Scott Wood wrote:

On 02/21/2013 05:03:32 PM, Marcelo Tosatti wrote:

On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:

On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:

On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:

It is then not necessary to set device attributes on a live guest and deal with the complications associated with that.

Which complications? -Scott

Semantics of individual attribute writes, for one.

When the attribute is a device register, the hardware documentation takes care of that.

You are not writing to the registers from the CPU's point of view.

That's exactly how KVM_DEV_MPIC_GRP_REGISTER is defined and implemented on MPIC (with the exception of registers whose behavior changes based on which specific vcpu you use to access them). If/when we have a need to set/get state in a different manner, that's a separate attribute group.

Can you describe the usage of this register again?

Otherwise, the semantics are documented in the device-specific documentation -- which can include restricting when the access is allowed. Same as with any other interface documentation.

Again, you are talking about the semantics of device access from the point of view of software operating on the machine. We are discussing the hypervisor userspace <-> hypervisor kernel interface.

And I was talking about the userspace-to-hypervisor-kernel interface documentation. It just happens that one specific MPIC device group (when the attribute is a device register) is defined with respect to what guest software would see if it did a similar access.

In general you never have to set attributes on a device after it has been initialized, because there is state associated with that device that requires proper handling (example: if you modify a timer counter register of a timer device, any software timers used to emulate the timer counter must be cancelled).

Yes, it requires proper handling, and the MMIO code does that. 
If and when we add raw state accessors, it's totally reasonable for there to be command/attribute-specific documented restrictions on when the access can be done.

Also, it is necessary to provide proper locking of device attribute writes versus vcpu device accesses. So far we have been focusing on having a lockless vcpu path.

How is device access related to vcpus? Existing irqchip code is not lockless.

VCPUs access in-kernel devices. Yes, it is lockless (see RCU usage in virt/kvm/).

So when device attributes can be modified has implications beyond what may seem visible at first. Are these reasonable arguments? Basically, abstract 'device attributes' are too abstract.

It's up to the device-specific documentation to make them not abstract (I admit there are a few details missing in mpic.txt that I've pointed out in this thread -- it is RFC v1 after all). This wouldn't be any different if we used separate ioctls for everything. It's like saying abstract 'ioctl' is too abstract.

Perhaps a better way to put it is that it's too permissive. However, your proposed interface deals with the sucky capability, versioning and namespace conflicts we have now. Note these items can probably be improved separately.

Any particular proposals?

Namespace conflicts: reserve ranges for each arch. The other two items, I haven't thought about. I am not the one bothered :-) (yes, they suck).

I suppose mpic.txt could use an additional statement that KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed by the guest.

Locking versus currently executing VCPUs, for another (see KVM_SET_IRQ's RCU usage, for instance; that is something that should be shared).

If you mean kvm_set_irq() in irq_comm.c, that's only relevant when you have a GSI routing table, which this patchset doesn't. Assuming we end up having a routing table to support irqfd, we still can't share the code as is, since it's APIC-specific.

Suppose it is worthwhile to attempt to share code as much as possible.

Sure... 
My point is that it isn't a case of "the common code is right over there, why aren't you using it?" I'll try to share what I reasonably can, subject to my limited knowledge of how the APIC stuff works. The irqfd code is substantial enough that refactoring for sharing should be worthwhile. I'm not so sure about irq_comm.c.

-scott

Note: just pointing out drawbacks of device attributes (if something of that sort is integrated, x86 should also use it).
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Sat, Feb 23, 2013 at 05:31:44PM +0200, Gleb Natapov wrote:

On Sat, Feb 23, 2013 at 11:48:54AM -0300, Marcelo Tosatti wrote:

1. orig_irr = read irr from vapic page
2. if (orig_irr != 0)
3.	return false;
4. kvm_make_request(KVM_REQ_EVENT)
5. bool injected = !test_and_set_bit(PIR)
6. if (vcpu->guest_mode && injected)
7.	if (!test_and_set_bit(PIR notification bit))
8.		send PIR IPI
9. return injected

Consider the following case:

vcpu0                                  vcpu1
send intr to vcpu1
check irr
                                       receive a posted intr
                                       pir->irr (pir is cleared, irr is set)
injected = test_and_set_bit = true
pir = set

Then both irr and pir have the interrupt pending; they may merge into one later, but injected is reported as true. Wrong.

Marcelo and I discussed the lockless logic that should be used here on the previous patch submission. All that is left for you is to implement it. We worked hard to make the irq injection path lockless; we are not going to add a lock now.

He is right, the scheme is still flawed (because of concurrent injectors along with HW in VMX non-root). I'd say let's add a spinlock and think about a lockless scheme in the meantime.

The logic proposed was (from that thread):

apic_accept_interrupt() {
	if (PIR || IRR)
		return coalesced;
	else
		set PIR;
}

Which should map to something like:

if (test_and_set_bit(PIR))
	return coalesced;
/* HW transfers PIR to IRR here, say due to a PIR IPI sent
 * for the setting of a different vector. */
if (irr on vapic page is set)
	return coalesced;

I do not see how the race above can happen this way. Another one can, though, if the vcpu is outside the guest. Maybe we should deliver interrupts differently depending on whether the vcpu is in guest mode or not.

The problem is with 3 contexts: two injectors and one vcpu in guest mode.

Earlier on that thread you mentioned "The point is that we need to check PIR and IRR atomically and this is impossible."

That would be one way to fix it.

I'd rather think about a proper way to do lockless injection before committing anything. 
The case where we care about correct injection status is rare, but we always care about scalability, and since we violate the spec by reading the vapic page while the vcpu is in non-root operation anyway, the result may be incorrect with or without the lock. Our use case was not in the HW designers' minds when they designed this thing, obviously :(

Zhang, can you comment on whether reading the vapic page while a CPU in VMX non-root operation is accessing it is safe?

Gleb, yes, a comment mentioning the race (instead of the spinlock) and an explanation of why it is believed to be benign (given how the injection return value is interpreted) could also work. It's ugly, though... Murphy is around.

OTOH a spinlock is not the end of the world; we can figure out something later (we've tried without success so far).
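The xchg-based PIR->IRR sync that the thread keeps referring to can be sketched as below. atomic_exchange stands in for the kernel's locked xchg; the function and array layout are illustrative models, not the actual KVM code. The key property is that each PIR word is read and cleared in one atomic step, so a concurrently posted bit is either picked up by this sync or left intact in PIR for the next one, but never lost:

```c
#include <assert.h>
#include <stdatomic.h>

#define PIR_WORDS 4

/* Merge the posted-interrupt request bitmap into the IRR, clearing PIR. */
static void sync_pir_to_irr(_Atomic unsigned long pir[PIR_WORDS],
                            _Atomic unsigned long irr[PIR_WORDS])
{
	for (int i = 0; i < PIR_WORDS; i++) {
		/* xchg: atomically read the word and clear it. */
		unsigned long val = atomic_exchange(&pir[i], 0);
		if (val)
			atomic_fetch_or(&irr[i], val);
	}
}
```

Note that while each word is handled atomically, the check of PIR *and* IRR together is still not atomic, which is the crux of the injection-status race discussed above.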
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Sat, Feb 23, 2013 at 03:16:11PM +0000, Zhang, Yang Z wrote:

Marcelo Tosatti wrote on 2013-02-23:

On Sat, Feb 23, 2013 at 02:05:28PM +0000, Zhang, Yang Z wrote:

Marcelo Tosatti wrote on 2013-02-23:

On Fri, Feb 22, 2013 at 09:42:32PM +0800, Yang Zhang wrote:

From: Yang Zhang yang.z.zh...@intel.com

Posted Interrupt allows APIC interrupts to be injected into the guest directly, without any vmexit.
- When delivering an interrupt to the guest, if the target vcpu is running, update the Posted-interrupt requests bitmap and send a notification event to the vcpu. The vcpu will then handle the interrupt automatically, without any software involvement.
- If the target vcpu is not running, or there is already a notification event pending in the vcpu, do nothing. The interrupt will be handled on the next vm entry.

Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index e4595f1..64616cc 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -228,6 +228,26 @@ void smp_x86_platform_ipi(struct pt_regs *regs)
 	set_irq_regs(old_regs);
 }
+#ifdef CONFIG_HAVE_KVM
+/*
+ * Handler for POSTED_INTERRUPT_VECTOR.
+ */
+void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	ack_APIC_irq();
+
+	irq_enter();
+
+	exit_idle();
+
+	irq_exit();
+
+	set_irq_regs(old_regs);
+}
+#endif
+

Add per-cpu counters, similarly to other handlers in the same file.

Sure.

@@ -379,6 +398,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
 	if (!apic->irr_pending)
 		return -1;
+	kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
 	result = apic_search_irr(apic);
 	ASSERT(result == -1 || result >= 16);

kvm_x86_ops->sync_pir_to_irr() should be called from vcpu_enter_guest, before inject_pending_event:

if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
	- inject_pending_event
	...
}

Some other places will search the irr to do something, like kvm_cpu_has_interrupt(). So we must sync pir to irr before touching the irr, not just before entering the guest. 
I see. The only call site that needs IRR/PIR information with posted interrupt enabled is kvm_arch_vcpu_runnable, correct?

Yes.

If so, can we contain reading PIR to only that location?

Yes, we can do that. Actually, the reason I do pir->irr here is to avoid future misuse where pending interrupts are checked just by calling kvm_vcpu_has_interrupt().

Yes, I see.

Also, there may need to be an extra callback to check pir. So you think your suggestion is better?

Because it respects the standard request processing mechanism, yes.

And have PIR->IRR sync at KVM_REQ_EVENT processing only (to respect the standard way of event processing and also reduce reading the PIR).

Since we will check the ON bit before each read of PIR, the cost can be ignored.

Note reading the ON bit is not OK, because if the injection path did not see vcpu->mode == IN_GUEST_MODE, the ON bit will not be set.

So it is possible that a PIR IPI is delivered and handled before guest entry. By clearing the PIR notification bit after local_irq_disable, but before the last check for vcpu->requests, we guarantee that a PIR IPI is never lost. Makes sense? (please check the logic, it might be flawed).

I am not sure I understand you, but I don't think the IPI can be lost in the current patch:

if (!pi_test_and_set_on(&vmx->pi_desc) &&
    (vcpu->mode == IN_GUEST_MODE)) {
	kvm_make_request(KVM_REQ_EVENT, vcpu);
	apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
			    POSTED_INTR_VECTOR);
	*delivered = true;
}

vcpu entry has:

vcpu->mode = GUEST_MODE
local irq disable
check request

We will send the IPI after making the request; if the IPI is received after setting guest_mode but before disabling irqs, it will still be handled by the following check of requests. Please correct me if I am wrong.

cpu0                            cpu1                    vcpu0
test_and_set_bit(PIR-A)
set KVM_REQ_EVENT
                                                        process REQ_EVENT
                                                        PIR-A -> IRR
                                                        vcpu->mode = IN_GUEST
if (vcpu0->guest_mode)
	if (!t_a_s_bit(PIR notif))
		send IPI
                                                        linux_pir_handler
                                t_a_s_b(PIR-B) = 1
                                no PIR IPI sent

Sync PIR->IRR uses xchg, which is locked and atomic, so I can't see the need for the spinlock.
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Sat, Feb 23, 2013 at 09:42:14PM +0200, Gleb Natapov wrote:

An explanation of why it is believed to be benign (given how the injection return value is interpreted) could also work. It's ugly, though... Murphy is around.

The race above is not benign. It will report an interrupt as coalesced while in reality it is injected. This may cause too many interrupts to be injected. If this happens rarely enough, ntp may be able to fix the time drift resulting from it.

OK.

OTOH a spinlock is not the end of the world; we can figure out something later (we've tried without success so far).

It serializes all injections into the vcpu. I do not believe now that we are safe even with the lock, for the reason I mention above. We could use the pir->on bit as a lock, but that only emphasises how ridiculous serialization of injections becomes.

Please review the 2nd iteration of the pseudocode in the patchset v4 thread; it should be good.
Re: [RFC PATCH 1/6] kvm: add device control API
On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:

On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:

On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:

It is then not necessary to set device attributes on a live guest and deal with the complications associated with that.

Which complications? -Scott

Semantics of individual attribute writes, for one.

When the attribute is a device register, the hardware documentation takes care of that.

You are not writing to the registers from the CPU's point of view.

Otherwise, the semantics are documented in the device-specific documentation -- which can include restricting when the access is allowed. Same as with any other interface documentation.

Again, you are talking about the semantics of device access from the point of view of software operating on the machine. We are discussing the hypervisor userspace <-> hypervisor kernel interface.

In general you never have to set attributes on a device after it has been initialized, because there is state associated with that device that requires proper handling (example: if you modify a timer counter register of a timer device, any software timers used to emulate the timer counter must be cancelled).

Also, it is necessary to provide proper locking of device attribute writes versus vcpu device accesses. So far we have been focusing on having a lockless vcpu path.

So when device attributes can be modified has implications beyond what may seem visible at first. Are these reasonable arguments? Basically, abstract 'device attributes' are too abstract.

However, your proposed interface deals with the sucky capability, versioning and namespace conflicts we have now. Note these items can probably be improved separately.

I suppose mpic.txt could use an additional statement that KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed by the guest.

Locking versus currently executing VCPUs, for another (see KVM_SET_IRQ's RCU usage, for instance; that is something that should be shared). 
If you mean kvm_set_irq() in irq_comm.c, that's only relevant when you have a GSI routing table, which this patchset doesn't. Assuming we end up having a routing table to support irqfd, we still can't share the code as is, since it's APIC-specific.

Suppose it is worthwhile to attempt to share code as much as possible.
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Sat, Feb 16, 2013 at 01:56:14PM +1100, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote: On 02/15/2013 05:18:31 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. I don't think it makes such an assumption. The MPIC device has physical registers, so it exposes them, but it also exposes things that are not physical registers (e.g. the per-IRQ input state). The generic device control layer leaves interpretation of attributes up to the device. I think it would be easier to fit XICS into the device control api model than to fit MPIC into this model, not to mention what would happen if we later want to emulate some other type of device -- x86 already has at least one non-irqchip emulated device (i8254). 
I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation.

First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control.

This is probably a stupid question, but why is the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface not appropriate for your purposes? x86 sets up a default GSI->irqchip-pin mapping on creation (during KVM_CREATE_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.

Further, the magic means that you can only have one instance of the device, whereas you might want to model the interrupt controller architecture as several devices. You could do that using several device types, but then the interconnections between them would also be magic.

Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them.

<snip>
Re: [PATCH v2][QEMU] vmxcap: Report APIC register emulation and RDTSCP control
On Mon, Feb 18, 2013 at 07:56:54AM +0100, Jan Kiszka wrote:

From: Jan Kiszka jan.kis...@siemens.com

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
This time I've checked twice that I'm no longer missing a field.

Applied, thanks.
Re: [RFC PATCH 1/6] kvm: add device control API
On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:

On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:

Copying Christoffer since ARM has an in-kernel irqchip too.

On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:

Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace.

Creating ioctls has advantages. It makes explicit what the data contains and how it should be interpreted. This means that, for example, policy control can be performed at the ioctl level (one can group, by ioctl number, which ioctl calls are allowed after VM creation, for example). It also makes it explicit that it's a userspace interface on which applications depend. With a single 'set device attribute' ioctl it is more difficult to do that. Abstracting the details also makes it cumbersome to read strace output :-)

This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus.

You not only provide a different way to create an in-kernel irqchip, you also use an alternate way to trigger interrupt lines. Before going into interface specifics, let's think about whether it is really worth it.

Which "it" do you mean here? The ability to set/get attributes is needed. Sorry, but get or set one blob of data, up to 512 bytes, for the entire irqchip is just not good enough -- assuming you don't want us to start sticking pointers and commands in *that* data. :-)

Why not? 
Is it necessary to constantly read/write attributes? If you mean the way to inject interrupts, it's simpler this way. Are you injecting interrupts via this new SET_DEVICE_ATTRIBUTE ioctl? Why go out of our way to inject common glue code into a communication path between hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM? Or rather, why make that common glue be specific to this one function when we could reuse the same communication glue used for other things, such as device state? And that's just for regular interrupts. MSIs are vastly simpler on MPIC than what x86 does. x86 obviously supports the old way and will have to for some, very long, time. Sure. ARM vGIC code, that is ready to go upstream, uses the old way too. So it will be 2 archs against one. I wasn't aware that that's how it worked. :-P I was trying to be considerate by not making the entire thing gratuitously PPC or MPIC specific, as some others seem inclined to do (e.g. see irqfd and APIC). We already had a discussion on ARM's set address ioctl and rather than extend *that* interface, they preferred to just stick something ARM-specific in ASAP with the understanding that it would be replaced (or more accurately, kept around as a thin wrapper around the new stuff) later. Christoffer, do you think the proposed way is better for your needs? Are you willing to make vGIC use it? Scott, what other devices are you planning to support with this interface? At the moment I do not have plans for other devices, though what does it hurt for the capability to be there? -Scott
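For readers following the thread, the shape being debated looks roughly like this. This is an illustrative C sketch only: a create ioctl returning a device fd, plus (group, attr, addr) tuples for state access, mirrors the style of the proposal, but the exact field and struct names here are assumptions, not the patchset's literal uapi.

```c
#include <stdint.h>

/* Illustrative sketch of the "device control" shape under discussion:
 * create a device, then address individual pieces of its state as
 * (group, attribute) pairs instead of one monolithic GET/SET blob.
 * Names follow the style of KVM's uapi but are assumptions here. */

struct kvm_create_device {
	uint32_t type;	/* in: device type, e.g. an MPIC model */
	uint32_t fd;	/* out: fd for the new device */
	uint32_t flags;	/* in: e.g. a "probe only" flag */
};

struct kvm_device_attr {
	uint32_t flags;	/* no flags currently defined */
	uint32_t group;	/* device-defined attribute group */
	uint64_t attr;	/* group-defined attribute id */
	uint64_t addr;	/* userspace pointer to the attribute payload */
};

/* Pack an attribute reference; the payload travels via the addr
 * pointer, so the same ioctl shape covers a 4-byte register or a
 * larger state blob. */
static inline struct kvm_device_attr
device_attr(uint32_t group, uint64_t id, void *payload)
{
	struct kvm_device_attr a = {0};

	a.group = group;
	a.attr = id;
	a.addr = (uint64_t)(uintptr_t)payload;
	return a;
}
```

The addr indirection is the point of contention: one ioctl shape covers both a small register and a large blob, which is exactly the flexibility (and strace-opacity) being argued over in this thread.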
Re: [RFC PATCH 1/6] kvm: add device control API
On Wed, Feb 20, 2013 at 03:09:49PM +0200, Gleb Natapov wrote: On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote: On 02/19/2013 06:24:18 AM, Gleb Natapov wrote: On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote: The ability to set/get attributes is needed. Sorry, but get or set one blob of data, up to 512 bytes, for the entire irqchip is just not good enough -- assuming you don't want us to start sticking pointers and commands in *that* data. :-) Proposed interface sticks pointers into ioctl data, so why doing the same for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile. There's a difference between putting a pointer in an ioctl control structure that is specifically documented as being that way (as in ONE_REG), versus taking an ioctl that claims to be setting/getting a blob of state and embedding pointers in it. It would be like sticking a pointer in the attribute payload of this API, which I think is something to be discouraged. If documentation is what differentiate for you between silly and smart then write documentation instead of new interfaces. KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of data on x86, nothing prevent you from adding MPIC specifics to the interface, Add mpic state into kvm_irqchip structure and if 512 bytes is not enough for you to transfer the state put pointers there and _document_ them. But with 512 bytes you can transfer properties inline, so you probably do not need pointer there anyway. I see you have three properties 2 of them 32bit and one 64bit. It'd also be using KVM_SET_IRQCHIP to read data, which is the sort of thing you object to later on regarding KVM_IRQ_LINE_STATUS. Do not see why. Then there's the silliness of transporting 512 bytes just to read a descriptor for transporting something else. Yes, agree. But is this enough of a reason to introduce entirely new interface? Is it on performance critical path? 
Doubt it, unless you abuse the interface to send interrupts, but then isn't it silly to do copy_from_user() twice to inject an interrupt like the proposed interface does? For signaling irqs (I think this is what you mean by commands) we have KVM_IRQ_LINE. It's one type of command. Another is setting the address. Another is writing to registers that have side effects (this is how MSI injection is done on MPIC, just as in real hardware). Setting the address is setting an attribute. Sending MSI is a command. Things you set/get during init/migration are attributes. Things you do to cause side-effects are commands. Yes, it would be good to restrict what can be done via the interface (to avoid abstracting away problems). At first glance, I would say it should allow for initial configuration of device state, such as registers etc. Why exactly do you need to set attributes, Scott? What is the benefit of KVM_IRQ_LINE over what MPIC does? What real (non-glue/wrapper) code can become common? No new ioctl with exactly the same result (well, actually even faster since less copying is done). You need to show us the benefits of the new interface vs the existing one, not vice versa. Also related to this question is the point of avoiding the implementation of devices to be spread across userspace and the kernel (this is one point Avi brought up often). If the device emulation is implemented entirely in the kernel, all that is necessary are the initial values of device registers (and retrieval of those values later for savevm/migration). It is then not necessary to set device attributes on a live guest and deal with the complications associated with that. snip
Re: [RFC PATCH 1/6] kvm: add device control API
On Wed, Feb 20, 2013 at 06:28:24PM -0300, Marcelo Tosatti wrote: snip I suppose the atomic-test-and-set type operations introduced to ONE_REG ioctls are one example of such complications. Avi, can you remind us what are the drawbacks of having separate userspace/kernel device emulation?
Re: [RFC PATCH 1/6] kvm: add device control API
On Wed, Feb 20, 2013 at 05:20:50PM -0600, Scott Wood wrote: On 02/20/2013 03:17:27 PM, Marcelo Tosatti wrote: On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote: On 02/18/2013 06:21:59 AM, Gleb Natapov wrote: Copying Christoffer since ARM has in kernel irq chip too. On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. Creation of ioctl has advantages. It makes explicit what the data contains and how it should be interpreted. This means that for example, policy control can be performed at ioctl level (can group, by ioctl number, which ioctl calls are allowed after VM creation, for example). It also makes it explicit that its a userspace interface which applications depend. With a single 'set device attribute' ioctl its more difficult to do that. I don't see why creating ioctls rather than device attributes (or whatever you want to call them) differs this way. Device attributes are inherently userspace interfaces, just as ioctls are. What the data contains and how it is interpreted are documented, albeit in a more lightweight format than KVM usually uses for ioctls. An ioctl is no guarantee of good documentation. KVM is far better than most of the kernel in that regard, but even there some ioctls are missing (e.g. KVM_IRQ_LINE_STATUS as mentioned elsewhere in this thread), and some others are inadequately explained (e.g. IRQ routing). By policy control, do you just mean that some ioctls act on a /dev/kvm fd, others on a vm fd, and others on a vcpu fd? This is pretty similar to having a device fd, except for the mechanism used. No, I mean policy control acting on ioctl numbers.
It's essentially the same issue as with 'strace' (in that it's necessary to parse the ioctl data further). The main things that I dislike about adding new things at the ioctl level are:
- limited numberspace, and more prone to merge conflicts than a device-specific numberspace
Can reserve per-architecture ranges to deal with that.
- having to add a new pathway to get into the driver for each ioctl, rather than having all operations on a particular device go to the right place, and the device implementation interprets further (this assumes that we're talking about vm ioctls, and not returning a new fd for the device)
Agree with that point.
- awkward way of negotiating capabilities with userspace (have to declare the capability id separately, and add code somewhere outside the device implementation to respond to capability requests)
Agree with that point.
- api.txt formatting that imposes yet another document-internal numberspace to conflict on :-)
The main problem, to me, is that the interface allows flexibility but the details of using such flexibility are not worked out. That is, lack of definitions such as when setting attributes is allowed. With a big blob, that information is implicit: a SET operation is a full device reset. With individual attributes:
- Which attributes can be set individually?
- Is there an order which must be respected?
- Locking.
- End result: more logic/code necessary.
Abstracting the details also makes it cumbersome to read strace output :-) You'd have to update strace one way or another if you want pretty-printed output. Having a more restricted API than arbitrary ioctls could actually improve the situation -- this could be a good reason for including the attribute length as an explicit parameter rather than being implicit in the attribute id. Then you'd just teach strace about the device control API once, and not for each new attribute or device that gets added.
This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. You are not only provide different way to create in kernel irq chip you also use an alternate way to trigger interrupt lines. Before going into interface specifics lets think about whether it is really worth it? Which it do you mean here? The ability to set/get attributes is needed. Sorry, but get or set one blob of data, up to 512 bytes, for the entire irqchip is just not good enough -- assuming you don't want us to start sticking pointers and commands in *that* data. :-) Why not? Is it necessary to constantly read/write attributes? It's
[GIT PULL] KVM updates for the 3.9 merge window
: move the code that installs new slots array to a separate function.
  KVM: emulator: drop RPL check from linearize() function
  KVM: emulator: implement fninit, fnstsw, fnstcw
  KVM: VMX: make rmode_segment_valid() more strict.
  KVM: VMX: fix emulation of invalid guest state.
  KVM: VMX: Do not fix segment register during vcpu initialization.
  KVM: VMX: handle IO when emulation is due to #GP in real mode.
  KVM: mmu: remove unused trace event
  KVM: MMU: simplify folding of dirty bit into accessed_dirty
  KVM: x86: remove unused variable from walk_addr_generic()
  Merge branch 'kvm-ppc-next' of https://github.com/agraf/linux-2.6 into queue
  KVM: VMX: remove special CPL cache access during transition to real mode.
  KVM: VMX: reset CPL only on CS register write.
  KVM: VMX: if unrestricted guest is enabled vcpu state is always valid.
  KVM: VMX: remove hack that disables emulation on vcpu reset/init
  KVM: VMX: skip vmx->rmode.vm86_active check on cr0 write if unrestricted guest is enabled
  KVM: VMX: don't clobber segment AR of unusable segments.
  KVM: VMX: rename fix_pmode_dataseg to fix_pmode_seg.
  KVM: x86: fix use of uninitialized memory as segment descriptor in emulator.
  KVM: VMX: set vmx->emulation_required only when needed.
  KVM: MMU: make spte_is_locklessly_modifiable() more clear
  KVM: MMU: drop unneeded checks.
  KVM: MMU: set base_role.nxe during mmu initialization.
  KVM: MMU: drop superfluous min() call.
  KVM: MMU: drop superfluous is_present_gpte() check.
  Revert "KVM: MMU: split kvm_mmu_free_page"
  KVM: VMX: add missing exit names to VMX_EXIT_REASONS array
  KVM: VMX: cleanup vmx_set_cr0().
  x86 emulator: fix parity calculation for AAD instruction

Jan Kiszka (1):
  KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr

Jesse Larrew (1):
  x86: kvm_para: fix typo in hypercall comments

Marcelo Tosatti (3):
  KVM: VMX: fix incorrect cached cpl value with real/v8086 modes
  x86: pvclock kvm: align allocation size to page size
  Revert "KVM: MMU: lazily drop large spte"

Mihai Caraman (2):
  KVM: PPC: Fix SREGS documentation reference
  KVM: PPC: Fix mfspr/mtspr MMUCFG emulation

Nadav Amit (1):
  KVM: x86: fix mov immediate emulation for 64-bit operands

Nickolai Zeldovich (1):
  kvm: fix i8254 counter 0 wraparound

Peter Zijlstra (1):
  sched: Bail out of yield_to when source and target runqueue has one task

Raghavendra K T (1):
  kvm: Handle yield_to failure return code for potential undercommit case

Takuya Yoshikawa (13):
  KVM: Write protect the updated slot only when dirty logging is enabled
  KVM: MMU: Remove unused parameter level from __rmap_write_protect()
  KVM: MMU: Make kvm_mmu_slot_remove_write_access() rmap based
  KVM: Remove unused slot_bitmap from kvm_mmu_page
  KVM: Make kvm_mmu_change_mmu_pages() take mmu_lock by itself
  KVM: Make kvm_mmu_slot_remove_write_access() take mmu_lock by itself
  KVM: MMU: Conditionally reschedule when kvm_mmu_slot_remove_write_access() takes a long time
  KVM: set_memory_region: Don't jump to out_free unnecessarily
  KVM: set_memory_region: Don't check for overlaps unless we create or move a slot
  KVM: set_memory_region: Remove unnecessary variable memslot
  KVM: set_memory_region: Identify the requested change explicitly
  KVM: set_memory_region: Disallow changing read-only attribute later
  KVM: Remove user_alloc from struct kvm_memory_slot

Xiao Guangrong (9):
  KVM: MMU: fix Dirty bit missed if CR0.WP = 0
  KVM: MMU: fix infinite fault access retry
  KVM: x86: clean up reexecute_instruction
  KVM: x86: let reexecute_instruction work for tdp
  KVM: x86: improve reexecute_instruction
  KVM: MMU: lazily drop large spte
  KVM: MMU: cleanup mapping-level
  KVM: MMU: remove pt_access in mmu_set_spte
  KVM: MMU: cleanup __direct_map

Yang Zhang (5):
  KVM: remove a wrong hack of delivery PIT intr to vcpu0
  x86, apicv: add APICv register virtualization support
  x86, apicv: add virtual x2apic support
  x86, apicv: add virtual interrupt delivery support
  KVM: VMX: disable apicv by default

 Documentation/virtual/kvm/api.txt    | 108 +-
 Documentation/virtual/kvm/mmu.txt    |   7
 arch/ia64/include/asm/kvm_host.h     |   4
 arch/ia64/kvm/kvm-ia64.c             |   8
 arch/ia64/kvm/lapic.h                |   6
 arch/powerpc/include/asm/kvm_host.h  |  10
 arch/powerpc/include/asm/kvm_ppc.h   |  12
 arch/powerpc/include/asm/reg.h       |   2
 arch/powerpc/include/asm/reg_booke.h |   1
 arch/powerpc/include/uapi/asm/kvm.h  |   6
 arch/powerpc/kernel/asm-offsets.c    |   2
 arch/powerpc/kvm/Makefile            |   9
 arch/powerpc/kvm/book3s_emulate.c    |  30
 arch/powerpc/kvm/book3s_hv.c         |   2
 arch/powerpc/kvm/book3s_pr.c
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote: On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. To start, the whole IRQ routing stuff is poorly documented. Am I supposed to make up GSI numbers and use the routing thing to associate them with real interrupts? I have no idea. Is mapping from one integer linear space (GSIs) to real interrupts suitable for you? Are there constraints on what sort of GSI numbers I can choose (I now see in the code that KVM_MAX_IRQ_ROUTES is returned from the capability check, but where is that documented? Don't think it is. It looks like the APIC implementation has default routes, where is that documented?)? In the code. Where does the code live to manage this table, and how APICy is it (looks like the answer is irq_comm.c, and very)? Thinking faster than typing? Not sure what you mean. I suppose I could write another implementation of the table management code for MPIC, though the placement of irqchip inside the route entry, rather than as an argument to KVM_IRQ_LINE, suggests the table is supposed to be global, not in the individual interrupt controller. Yes the table is global. It maps GSI (Global System Interrupt IIRC) (integer) to (irqchip,pin) pair. It looks like I'm going to have to do this anyway for irqfd, though that doesn't make the other uses of the device control api go away. Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading the status for debugging (reading device registers doesn't quite do it, since the active bit won't show up if the interrupt is masked). 
At that point, is it more offensive to make it read-only even though it would be trivial to make it read/write (which would allow users who don't need it to bypass the routing API), or to make it read/write and live with there being more than one way to do something? Can't follow this sentence. KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes of state, and because it doesn't allow debugging access to device registers (e.g. inspecting from the QEMU command line), and because it's hard to add new pieces of state if we realize we left something out. It reminds me of GET/SET_SREGS. With that, I did what you seem to want here, which is to adapt the existing interfaces, using feature flags to control optional state. It ended up being a mess, and ONE_REG was introduced as a replacement. The device control API is the equivalent of ONE_REG for things other than vcpus. -Scott
- ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2?
- Agree on poor extensibility of the interface. Adding a reserved amount of space as padding, and versioning such as has been done so far, is not acceptable?
- Debugging: why is reading the entire register state not acceptable? Yes, it's slow.
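Since the routing-table layout keeps coming up: the global table maps each GSI either to an (irqchip, pin) pair or to an MSI message. A sketch of that shape, modeled on KVM's kvm_irq_routing_entry but with illustrative names and type values, plus a trivial lookup:

```c
#include <stdint.h>

/* Sketch of a GSI routing entry: a flat global interrupt number (GSI)
 * maps either to an (irqchip, pin) pair or to an MSI message.  Layout
 * mirrors KVM's kvm_irq_routing_entry; names here are illustrative. */

enum { ROUTING_IRQCHIP = 1, ROUTING_MSI = 2 };

struct irq_routing_irqchip {
	uint32_t irqchip;	/* which in-kernel irqchip */
	uint32_t pin;		/* input pin on that chip */
};

struct irq_routing_msi {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
	uint32_t pad;
};

struct irq_routing_entry {
	uint32_t gsi;
	uint32_t type;		/* ROUTING_IRQCHIP or ROUTING_MSI */
	uint32_t flags;
	uint32_t pad;
	union {
		struct irq_routing_irqchip irqchip;
		struct irq_routing_msi msi;
		uint32_t pad[8];	/* reserve room for future types */
	} u;
};

/* Resolve a GSI to its (irqchip, pin) target; returns 0 if found. */
static int resolve_gsi(const struct irq_routing_entry *tab, int n,
		       uint32_t gsi, struct irq_routing_irqchip *out)
{
	for (int i = 0; i < n; i++) {
		if (tab[i].gsi == gsi && tab[i].type == ROUTING_IRQCHIP) {
			*out = tab[i].u.irqchip;
			return 0;
		}
	}
	return -1;
}
```

Note how the irqchip identifier lives inside the route entry rather than in KVM_IRQ_LINE itself, which is exactly why the table reads as global rather than per-controller in the discussion above.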
Re: [RFC PATCH 1/6] kvm: add device control API
On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote: Why exactly you need to set attributes Scott? That's best answered by looking at patch 6/6 and discussing the actual attributes that are defined so far. The need to set the base address of the registers is straightforward. When ARM added their special ioctl for this, we discussed it being eventually replaced with a more general API (in fact, that was the reason they put ARM in the name). Access to device registers was originally intended for debugging, and eventually migration. It turned out to also be very useful for injecting MSIs -- nothing special needed to be done. It Just Works(tm) the same way it does in hardware, with an MMIO write from a PCI device to a specific MPIC register. Again, irqfd may complicate this, but there's no need for QEMU-generated MSIs to have to take a circuitous path. Finally, there's the interrupt source attribute. Even if we get rid of that, we'll need MPIC-specific documentation on how to flatten the IRQ numberspace, and if we ever have a cascaded PIC situation things could get ugly since there's no way to identify a specific IRQ controller in KVM_IRQ_LINE. Also related to this question is the point of avoiding the implementation of devices to be spread across userspace and the kernel (this is one point Avi brought up often). If the device emulation is implemented entirely in the kernel, all is necessary are initial values of device registers (and retrieval of those values later for savevm/migration). MPIC emulation is entirely in the kernel with this patchset -- though the register that lets you reset cores will likely need to be kicked back to QEMU. It is then not necessary to set device attributes on a live guest and deal with the complications associated with that. Which complications? -Scott Semantics of individual attribute writes, for one. 
Locking versus currently executing VCPUs, for another (see KVM_SET_IRQ's RCU usage, for instance; that is something that should be shared).
[uq/master PATCH] target-i386: kvm: save/restore steal time MSR
Read and write steal time MSR, so that reporting is functional across migration.

Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 7577e4f..17c7293 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -792,6 +792,7 @@ typedef struct CPUX86State {
 #endif
     uint64_t system_time_msr;
     uint64_t wall_clock_msr;
+    uint64_t steal_time_msr;
     uint64_t async_pf_en_msr;
     uint64_t pv_eoi_en_msr;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 0cf413d..9ae9d74 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -68,6 +68,7 @@ static bool has_msr_tsc_deadline;
 static bool has_msr_async_pf_en;
 static bool has_msr_pv_eoi_en;
 static bool has_msr_misc_enable;
+static bool has_msr_kvm_steal_time;
 static int lm_capable_kernel;
 
 bool kvm_allows_irq0_override(void)
@@ -507,6 +508,8 @@ int kvm_arch_init_vcpu(CPUState *cs)
     has_msr_pv_eoi_en = c->eax & (1 << KVM_FEATURE_PV_EOI);
 
+    has_msr_kvm_steal_time = c->eax & (1 << KVM_FEATURE_STEAL_TIME);
+
     cpu_x86_cpuid(env, 0, 0, &limit, &unused, &unused, &unused);
 
     for (i = 0; i <= limit; i++) {
@@ -1107,6 +1110,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
         kvm_msr_entry_set(&msrs[n++], MSR_KVM_PV_EOI_EN,
                           env->pv_eoi_en_msr);
     }
+    if (has_msr_kvm_steal_time) {
+        kvm_msr_entry_set(&msrs[n++], MSR_KVM_STEAL_TIME,
+                          env->steal_time_msr);
+    }
     if (hyperv_hypercall_available()) {
         kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_GUEST_OS_ID, 0);
         kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_HYPERCALL, 0);
@@ -1360,6 +1367,9 @@ static int kvm_get_msrs(X86CPU *cpu)
     if (has_msr_pv_eoi_en) {
         msrs[n++].index = MSR_KVM_PV_EOI_EN;
     }
+    if (has_msr_kvm_steal_time) {
+        msrs[n++].index = MSR_KVM_STEAL_TIME;
+    }
 
     if (env->mcg_cap) {
         msrs[n++].index = MSR_MCG_STATUS;
@@ -1445,6 +1455,9 @@ static int kvm_get_msrs(X86CPU *cpu)
         case MSR_KVM_PV_EOI_EN:
             env->pv_eoi_en_msr = msrs[i].data;
             break;
+        case MSR_KVM_STEAL_TIME:
+            env->steal_time_msr = msrs[i].data;
+            break;
         }
     }
diff --git a/target-i386/machine.c b/target-i386/machine.c
index 8df6a6b..1feb9ca 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -287,6 +287,24 @@ static bool pv_eoi_msr_needed(void *opaque)
     return cpu->pv_eoi_en_msr != 0;
 }
 
+static bool steal_time_msr_needed(void *opaque)
+{
+    CPUX86State *cpu = opaque;
+
+    return cpu->steal_time_msr != 0;
+}
+
+static const VMStateDescription vmstate_steal_time_msr = {
+    .name = "cpu/steal_time_msr",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .minimum_version_id_old = 1,
+    .fields = (VMStateField []) {
+        VMSTATE_UINT64(steal_time_msr, CPUX86State),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static const VMStateDescription vmstate_async_pf_msr = {
     .name = "cpu/async_pf_msr",
     .version_id = 1,
@@ -494,6 +512,9 @@ static const VMStateDescription vmstate_cpu = {
             .vmsd = &vmstate_pv_eoi_msr,
             .needed = pv_eoi_msr_needed,
         } , {
+            .vmsd = &vmstate_steal_time_msr,
+            .needed = steal_time_msr_needed,
+        } , {
             .vmsd = &vmstate_fpop_ip_dp,
             .needed = fpop_ip_dp_needed,
         }, {
Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
On Tue, Feb 05, 2013 at 03:49:41PM -0600, Michael Wolf wrote:

> Add a helper routine to scheduler/core.c to allow the kvm module to
> retrieve the cpu hardlimit settings. The values will be used to set up
> a timer that is used to separate the consigned from the steal time.

1) Can you please describe, in English, how the mechanics of subtracting cpu hardlimit values from the steal time reported via run_delay are supposed to work?

> The period and the quota used to separate the consigned time (expected
> steal) from the steal time are taken from the cfs bandwidth control
> settings. Any other steal time accruing during that period will show as
> the traditional steal time.

There is no expected steal time over a fixed period of real time.

2) From the description of patch 1:

> In the case of where you have a system that is running in a capped or
> overcommitted environment the user may see steal time being reported in
> accounting tools such as top or vmstat.

This is outdated, right? Because an overcommitted environment is exactly what steal time should report.

Thanks
Re: [Bisected][-next-20130204+] [x86/kvm] udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d
On Wed, Feb 13, 2013 at 06:57:09AM -0500, Peter Hurley wrote: On Wed, 2013-02-13 at 12:51 +0200, Gleb Natapov wrote: On Tue, Feb 12, 2013 at 04:39:03PM -0800, H. Peter Anvin wrote: On 02/12/2013 04:26 PM, Peter Hurley wrote:

With -next-20130204+ in ubuntu 12.10 VM (so the 80x25 VGA device/console):

[0.666410] udevd[97]: starting version 175
[0.674043] udevd[97]: udevd:[97]: segfault at ff5fd020 ip 7fff069e277f sp 7fff068c9ef8 error d

and boots to an initramfs prompt. git bisect (log attached) blames:

commit 7b5c4a65cc27f017c170b025f8d6d75dabb11c6f
Merge: 3596f5b 949db15
Author: H. Peter Anvin h...@linux.intel.com
Date: Fri Jan 25 16:31:21 2013 -0800

    Merge tag 'v3.8-rc5' into x86/mm

    The __pa() fixup series that follows touches KVM code that is not present in the
    existing branch based on v3.7-rc5, so merge in the current upstream from Linus.

    Signed-off-by: H. Peter Anvin h...@linux.intel.com

This only happens with the VGA device/console, but that is the default configuration for Ubuntu/KVM because it blacklists pretty much every fb driver.

I am guessing this is another bad use of __pa()... need to look into that.

Can't find this commit on kvm.git or linux-2.6.git. Is it reproducible there?

He is using a 64bit guest and on those __pa() happens to be working. Is it possible that slow_virt_to_phys() does not work as expected? Peter (the bug reporter :)), can you run your guest kernel with loglevel=7 and send me the console output?

Attached.
BTW, this message happens on 'good' boots too:

[0.00] [ cut here ]
[0.00] WARNING: at /home/peter/src/kernels/next/arch/x86/kernel/pvclock.c:182 pvclock_init_vsyscall+0x22/0x60()
[0.00] Hardware name: Bochs
[0.00] Modules linked in:
[0.00] Pid: 0, comm: swapper Not tainted 3.8.0-next-20130204-xeon #20130204
[0.00] Call Trace:
[0.00]  [8105812f] warn_slowpath_common+0x7f/0xc0
[0.00]  [8105818a] warn_slowpath_null+0x1a/0x20
[0.00]  [81d20521] pvclock_init_vsyscall+0x22/0x60
[0.00]  [81d20480] kvm_setup_vsyscall_timeinfo+0x74/0xd8
[0.00]  [81d201d1] kvm_guest_init+0xd0/0xe9
[0.00]  [81d13f7c] setup_arch+0xbee/0xcaf
[0.00]  [816cbceb] ? printk+0x61/0x63
[0.00]  [81d0cbc3] start_kernel+0xd3/0x3f0
[0.00]  [81d0c5e4] x86_64_start_reservations+0x2a/0x2c
[0.00]  [81d0c6d7] x86_64_start_kernel+0xf1/0x100
[0.00] ---[ end trace b47bb564b2d6ec76 ]---

Regards,
Peter Hurley

Sending a patch for this, thanks for the report.
Re: [PULL 00/14] ppc patch queue 2013-02-15
Pulled, thanks.

On Fri, Feb 15, 2013 at 01:16:16AM +0100, Alexander Graf wrote:

Hi Marcelo / Gleb,

This is my current patch queue for ppc. Please pull.

Highlights of this queue drop are:
- BookE: Fast mapping support for 4k backed memory
- BookE: Handle alignment interrupts

Alex

The following changes since commit cbd29cb6e38af6119df2cdac0c58acf0e85c177e:

  Jan Kiszka (1):
        KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr

are available in the git repository at:

  git://github.com/agraf/linux-2.6.git kvm-ppc-next

Alexander Graf (11):
      KVM: PPC: E500: Move write_stlbe higher
      KVM: PPC: E500: Explicitly mark shadow maps invalid
      KVM: PPC: E500: Propagate errors when shadow mapping
      KVM: PPC: e500: Call kvmppc_mmu_map for initial mapping
      KVM: PPC: E500: Split host and guest MMU parts
      KVM: PPC: e500: Implement TLB1-in-TLB0 mapping
      KVM: PPC: E500: Make clear_tlb_refs and clear_tlb1_bitmap static
      KVM: PPC: E500: Remove kvmppc_e500_tlbil_all usage from guest TLB code
      Merge commit 'origin/next' into kvm-ppc-next
      KVM: PPC: BookE: Handle alignment interrupts
      Merge commit 'origin/next' into kvm-ppc-next

Bharat Bhushan (3):
      KVM: PPC: booke: use vcpu reference from thread_struct
      KVM: PPC: booke: Allow multiple exception types
      booke: Added DBCR4 SPR number

 arch/powerpc/include/asm/kvm_ppc.h   |    2 -
 arch/powerpc/include/asm/reg.h       |    2 -
 arch/powerpc/include/asm/reg_booke.h |    1 +
 arch/powerpc/kernel/asm-offsets.c    |    2 +-
 arch/powerpc/kvm/Makefile            |    9 +-
 arch/powerpc/kvm/booke.c             |   30 +-
 arch/powerpc/kvm/booke.h             |    1 +
 arch/powerpc/kvm/booke_interrupts.S  |   49 +-
 arch/powerpc/kvm/e500.c              |   16 +-
 arch/powerpc/kvm/e500.h              |    1 +
 arch/powerpc/kvm/e500_mmu.c          |  809 +++
 arch/powerpc/kvm/e500_mmu_host.c     |  699 +
 arch/powerpc/kvm/e500_mmu_host.h     |   18 +
 arch/powerpc/kvm/e500_tlb.c          | 1430 --
 14 files changed, 1610 insertions(+), 1459 deletions(-)
 create mode 100644 arch/powerpc/kvm/e500_mmu.c
 create mode 100644 arch/powerpc/kvm/e500_mmu_host.c
 create mode 100644 arch/powerpc/kvm/e500_mmu_host.h
 delete mode 100644 arch/powerpc/kvm/e500_tlb.c
Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote: 2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:

In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user.

Sorry, I'm no expert in this area. But I don't really understand what is confusing for the end user here.

I suppose that what is wanted is to subtract stolen time due to 'known reasons' from steal time reporting. 'Known reasons' being, for example, hard caps. So a vcpu executing instructions with no halt, but limited to 80% of available bandwidth, would not have 20% of stolen time reported.

But yes, a description of the scenario that is being dealt with, with details, is important.

To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The steal time will only be altered if hard limits (cfs bandwidth control) are used. The period and the quota used to separate the consigned time (expected steal) from the steal time are taken from the cfs bandwidth control settings. Any other steal time accruing during that period will show as the traditional steal time.

I'm also a bit confused here. Steal time will then only account the cpu time lost due to quotas from cfs bandwidth control? Also, what exactly do you mean by "expected steal time"? Is it steal time due to overcommitting minus scheduler quotas?

Thanks.
x86: pvclock kvm: align allocation size to page size
To match what's mapped via vsyscalls to userspace.

Reported-by: Peter Hurley pe...@hurleysoftware.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 220a360..5bedbdd 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -218,6 +218,9 @@ static void kvm_shutdown(void)
 void __init kvmclock_init(void)
 {
 	unsigned long mem;
+	int size;
+
+	size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
 
 	if (!kvm_para_available())
 		return;
@@ -231,16 +234,14 @@ void __init kvmclock_init(void)
 	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
 		msr_kvm_system_time, msr_kvm_wall_clock);
 
-	mem = memblock_alloc(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS,
-			     PAGE_SIZE);
+	mem = memblock_alloc(size, PAGE_SIZE);
 	if (!mem)
 		return;
 	hv_clock = __va(mem);
 
 	if (kvm_register_clock("boot clock")) {
 		hv_clock = NULL;
-		memblock_free(mem,
-			sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
+		memblock_free(mem, size);
 		return;
 	}
 	pv_time_ops.sched_clock = kvm_clock_read;
@@ -275,7 +276,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 	struct pvclock_vcpu_time_info *vcpu_time;
 	unsigned int size;
 
-	size = sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS;
+	size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
 
 	preempt_disable();
 	cpu = smp_processor_id();
Re: [PATCH v6] KVM: nVMX: Improve I/O exit handling
On Mon, Feb 18, 2013 at 12:32:37PM +0200, Gleb Natapov wrote: On Mon, Feb 18, 2013 at 11:21:16AM +0100, Jan Kiszka wrote: This prevents trapping L2 I/O exits if L1 has neither unconditional nor bitmap-based exiting enabled. Furthermore, it implements I/O bitmap handling. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Reviewed-by: Gleb Natapov g...@redhat.com Applied, thanks.
Re: [PATCH] KVM: nVMX: Trap unconditionally if msr bitmap access fails
On Sun, Feb 17, 2013 at 10:56:36AM +0200, Gleb Natapov wrote: On Thu, Feb 14, 2013 at 07:46:27PM +0100, Jan Kiszka wrote: This avoids basing decisions on uninitialized variables, potentially leaking kernel data to the L1 guest. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Reviewed-by: Gleb Natapov g...@redhat.com Applied, thanks.
Re: KVM: x86: switch unhandled MSR printk's to tracepoints
On Tue, Feb 12, 2013 at 02:33:17PM +0200, Gleb Natapov wrote: On Mon, Feb 11, 2013 at 09:50:09PM -0200, Marcelo Tosatti wrote:

Such messages in the system log very often cause confusion because users understand the "kvm: unhandled MSR" messages as errors.

They are errors. Linux compiled with PV support ignores them, but other guest OSes, or Linux without PV, will crash. Some indication that a guest crashed due to use of an unsupported MSR is a helpful hint where to look.

OK, closing https://bugzilla.redhat.com/show_bug.cgi?id=883216 as wontfix, because printk_ratelimit renders the messages unreliable.

Switch to trace points, making the information available on demand.

What new information do those trace points provide that we cannot get today from trace_kvm_msr()?

None.
KVM: x86: switch unhandled MSR printk's to tracepoints
Such messages in the system log very often cause confusion because users understand the "kvm: unhandled MSR" messages as errors. Switch to trace points, making the information available on demand.

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index fe5e00e..c461368 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -318,6 +318,40 @@ TRACE_EVENT(kvm_msr,
 #define trace_kvm_msr_write_ex(ecx, data) trace_kvm_msr(1, ecx, data, true)
 
 /*
+ * Tracepoint for unhandled guest MSR access.
+ */
+TRACE_EVENT(kvm_msr_unhandled,
+	TP_PROTO(bool write, u32 ecx, u64 data, bool unhandled),
+	TP_ARGS(write, ecx, data, unhandled),
+
+	TP_STRUCT__entry(
+		__field(u32,	ecx)
+		__field(u64,	data)
+		__field(bool,	write)
+		__field(bool,	unhandled)
+	),
+
+	TP_fast_assign(
+		__entry->ecx = ecx;
+		__entry->data = data;
+		__entry->write = write;
+		__entry->unhandled = unhandled;
+	),
+
+	TP_printk("unhandled MSR %s %x = 0x%llx (%s)",
+		  __entry->write ? "write" : "read",
+		  __entry->ecx, __entry->data,
+		  __entry->unhandled ? "injected #GP" : "ignored")
+);
+
+#define trace_kvm_msr_unhandled_read(ecx) \
+	trace_kvm_msr_unhandled(0, ecx, 0, false)
+#define trace_kvm_msr_unhandled_read_ex(ecx) \
+	trace_kvm_msr_unhandled(0, ecx, 0, true)
+#define trace_kvm_msr_unhandled_write(ecx, data) \
+	trace_kvm_msr_unhandled(1, ecx, data, false)
+#define trace_kvm_msr_unhandled_write_ex(ecx, data) \
+	trace_kvm_msr_unhandled(1, ecx, data, true)
+
+/*
  * Tracepoint for guest CR access.
  */
 TRACE_EVENT(kvm_cr,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c243b81..f5abcce 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1772,8 +1772,7 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		break;
 	}
 	default:
-		vcpu_unimpl(vcpu, "HYPER-V unimplemented wrmsr: 0x%x "
-			    "data 0x%llx\n", msr, data);
+		trace_kvm_msr_unhandled_write_ex(msr, data);
 		return 1;
 	}
 	return 0;
@@ -1805,8 +1804,7 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	case HV_X64_MSR_TPR:
 		return kvm_hv_vapic_msr_write(vcpu, APIC_TASKPRI, data);
 	default:
-		vcpu_unimpl(vcpu, "HYPER-V unimplemented wrmsr: 0x%x "
-			    "data 0x%llx\n", msr, data);
+		trace_kvm_msr_unhandled_write_ex(msr, data);
 		return 1;
 	}
 
@@ -1888,15 +1886,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		data &= ~(u64)0x100;	/* ignore ignne emulation enable */
 		data &= ~(u64)0x8;	/* ignore TLB cache disable */
 		if (data != 0) {
-			vcpu_unimpl(vcpu, "unimplemented HWCR wrmsr: 0x%llx\n",
-				    data);
+			trace_kvm_msr_unhandled_write_ex(msr, data);
 			return 1;
 		}
 		break;
 	case MSR_FAM10H_MMIO_CONF_BASE:
 		if (data != 0) {
-			vcpu_unimpl(vcpu, "unimplemented MMIO_CONF_BASE wrmsr: "
-				    "0x%llx\n", data);
+			trace_kvm_msr_unhandled_write_ex(msr, data);
 			return 1;
 		}
 		break;
@@ -1911,8 +1907,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			   thus reserved and should throw a #GP */
 			return 1;
 		}
-		vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
-			    __func__, data);
+
+		trace_kvm_msr_unhandled_write(msr, data);
 		break;
 	case MSR_IA32_UCODE_REV:
 	case MSR_IA32_UCODE_WRITE:
@@ -2020,8 +2016,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_K7_EVNTSEL2:
 	case MSR_K7_EVNTSEL3:
 		if (data != 0)
-			vcpu_unimpl(vcpu, "unimplemented perfctr wrmsr: "
-				    "0x%x data 0x%llx\n", msr, data);
+			trace_kvm_msr_unhandled_write(msr, data);
 		break;
 	/* at least RHEL 4 unconditionally writes to the perfctr registers,
 	 * so we ignore writes to make it happy.
 	 */
@@ -2030,8 +2025,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_K7_PERFCTR1:
 	case MSR_K7_PERFCTR2:
 	case MSR_K7_PERFCTR3:
-		vcpu_unimpl(vcpu, "unimplemented
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Fri, Feb 08, 2013 at 02:28:44PM +0200, Gleb Natapov wrote: On Thu, Feb 07, 2013 at 07:49:47PM -0200, Marcelo Tosatti wrote: On Thu, Feb 07, 2013 at 04:01:11PM +0200, Gleb Natapov wrote: On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote:

Second is that an interrupt may be reported as delivered, but it will be coalesced (possible only with a self IPI with the same vector):

Starting condition: PIR=0, IRR=0, vcpu is in guest mode

    io thread               | vcpu
    accept_apic_interrupt() |
    PIR and IRR are zero    |
    set PIR                 |
    return delivered        |
                            | self IPI
                            | set IRR
                            | merge PIR to IRR (*)

At (*) an interrupt that was reported as delivered is coalesced.

Only the vcpu itself should send a self-IPI, so it's fine.

It is fine only because this is not happening in practice (I hope) for the single interrupt we care about. Otherwise this is a serious problem.

Coalesced information is only interesting for non-IPI cases, that is, device emulation (at the moment, at least). And an incorrect result will be returned for an interrupt injected by an emulated device in the scenario above.

The above case can happen when loading APIC registers, but "delivered" is not interesting in that case. Good to document, however.

Or:

apic_accept_interrupt() {
1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC enabled, even if outside of guest mode.
2. Set PIR and let HW or SW VM-entry transfer it to IRR.
3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
}

This can report an interrupt as coalesced, but it will eventually be delivered as a separate interrupt:

Starting condition: PIR=0, IRR=1, vcpu is in guest mode

    io thread               | vcpu
    accept_apic_interrupt() |
    ORIG_PIR=0, ORIG_IRR=1  |
                            | EOI
                            | clear IRR, set ISR
    set PIR                 |
    return coalesced        |
                            | clear PIR, set IRR
                            | EOI
                            | clear IRR, set ISR (*)

At (*) an interrupt that was reported as coalesced is delivered.

So still no perfect solution. But the first one has much less serious problems for our practical needs.

Two or more concurrent set_irq calls can race with each other, though. Can either document the race or add a lock.

--
	Gleb.

Ok, then:

accept_apic_irq:
1. coalesced = test_and_set_bit(PIR)
2. set KVM_REQ_EVENT bit (*)
3. if (vcpu->in_guest_mode)
4.     if (test_and_set_bit(pir notification bit))
5.         send PIR IPI
6. return coalesced

Do not see how it will help.

Starting condition: PIR=0, IRR=1, vcpu is in guest mode

    io thread               | vcpu
    accept_apic_interrupt() |
    coalesced = 0, PIR=1    |
    vcpu in guest mode:     |
    send PIR IPI            |
                            | receive PIR IPI
                            | merge PIR to IRR (*)
    return not coalesced    |

At (*) an interrupt that was reported as delivered is coalesced.

Of course! The point is that we need to check PIR and IRR atomically, and this is impossible.

Ok, next try:

1. orig_irr = read irr from vapic page
2. if (orig_irr == 0)
3.     return test_and_set_bit(pir)
4. return 0

I think this is exactly the same solution we are discussing above:

apic_accept_interrupt() {
	if (PIR || IRR)
		return coalesced;
	else
		set PIR;
}

It's not exactly the same. Without test_and_set_bit(PIR) two vcpus can race.

with the same self-IPI problem. IMO this is the best we can do and will work correctly for the RTC interrupt re-injection case.

Yes.
Re: [PATCH v3 5/5] KVM: MMU: fast drop all spte on the pte_list
On Tue, Feb 05, 2013 at 04:55:37PM +0800, Xiao Guangrong wrote:

If a shadow page is being zapped or a host page is going to be freed, kvm will drop all the reverse mappings on the shadow page or the gfn. Currently, it drops the reverse mappings one by one: it deletes the first reverse mapping, then moves other reverse mappings between the description-tables. When the last description-table becomes empty, it is freed.

It works well if we only have a few reverse mappings, but some pte_lists are very long; during my tracking, I saw some gfns with more than 200 sptes listed on their pte-list (1G memory in guest on softmmu). Optimizing the drop of such long pte-lists is worthwhile; at least it is good for deleting memslots and for ksm/thp merged pages.

This patch introduces a better way to optimize for this case: it walks all the reverse mappings and clears them, then frees all the description-tables together.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c | 36 +++-
 1 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 58f813a..aa7a887 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -945,6 +945,25 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
 	}
 }
 
+static void pte_list_destroy(unsigned long *pte_list)
+{
+	struct pte_list_desc *desc;
+	unsigned long list_value = *pte_list;
+
+	*pte_list = 0;
+
+	if (!(list_value & 1))
+		return;
+
+	desc = (struct pte_list_desc *)(list_value & ~1ul);
+	while (desc) {
+		struct pte_list_desc *next_desc = desc->more;
+
+		mmu_free_pte_list_desc(desc);
+		desc = next_desc;
+	}
+}
+
 /*
  * Used by the following functions to iterate through the sptes linked by a
  * pte_list. All fields are private and not assumed to be used outside.
@@ -1183,17 +1202,17 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
			   struct kvm_memory_slot *slot, unsigned long data)
 {
-	u64 *sptep;
 	struct pte_list_iterator iter;
+	u64 *sptep;
 	int need_tlb_flush = 0;
 
-restart:
 	for_each_spte_in_rmap(*rmapp, iter, sptep) {
-		drop_spte(kvm, sptep);
+		mmu_spte_clear_track_bits(sptep);
 		need_tlb_flush = 1;
-		goto restart;
 	}
 
+	pte_list_destroy(rmapp);
+
 	return need_tlb_flush;
 }
 
@@ -2016,11 +2035,10 @@ static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 	u64 *sptep;
 	struct pte_list_iterator iter;
 
-restart:
-	for_each_spte_in_pte_list(sp->parent_ptes, iter, sptep) {
-		drop_parent_pte(sp, sptep);
-		goto restart;
-	}
+	for_each_spte_in_pte_list(sp->parent_ptes, iter, sptep)
+		mmu_spte_clear_no_track(sptep);
+
+	pte_list_destroy(&sp->parent_ptes);
 }

Do we lose the crash information of pte_list_remove? It has been shown to be useful in several cases.
Re: [PATCH v3 0/3] KVM: MMU: simple cleanups
On Tue, Feb 05, 2013 at 03:26:21PM +0800, Xiao Guangrong wrote:

These are simple cleanups for MMU; no function / logic changed.

Marcelo, Gleb, please apply them after applying [PATCH v3] KVM: MMU: lazily drop large spte

Changelog: no change, just split them from the previous patchset for good review.

Thanks!

Applied, thanks.
Re: [PATCH v3 1/5] KVM: MMU: introduce mmu_spte_establish
On Tue, Feb 05, 2013 at 04:53:19PM +0800, Xiao Guangrong wrote:

There is a small difference between walking the parent pte list and walking the rmap: all sptes on the rmap must be present, but this is not true of the parent pte list; in kvm_mmu_alloc_page, we always link the parent list before setting the spte to present.

This patch introduces mmu_spte_establish, which sets the spte before linking it to the parent list. That eliminates the difference, and then it is possible to unify the code for walking pte lists.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c         | 81 ++-
 arch/x86/kvm/paging_tmpl.h | 16 -
 2 files changed, 48 insertions(+), 49 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8041454..68d4d5f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1482,9 +1482,6 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
				    struct kvm_mmu_page *sp, u64 *parent_pte)
 {
-	if (!parent_pte)
-		return;
-
 	pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
 }
 
@@ -1502,7 +1499,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 }
 
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
-					       u64 *parent_pte, int direct)
+					       int direct)
 {
 	struct kvm_mmu_page *sp;
 
 	sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
@@ -1512,7 +1509,6 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
 	sp->parent_ptes = 0;
-	mmu_page_add_parent_pte(vcpu, sp, parent_pte);
 	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
 	return sp;
 }
@@ -1845,8 +1841,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
					     gva_t gaddr,
					     unsigned level,
					     int direct,
-					     unsigned access,
-					     u64 *parent_pte)
+					     unsigned access)
 {
 	union kvm_mmu_page_role role;
 	unsigned quadrant;
@@ -1876,19 +1871,15 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
 			break;
 
-		mmu_page_add_parent_pte(vcpu, sp, parent_pte);
-		if (sp->unsync_children) {
+		if (sp->unsync_children)
 			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
-			kvm_mmu_mark_parents_unsync(sp);
-		} else if (sp->unsync)
-			kvm_mmu_mark_parents_unsync(sp);
 
 		__clear_sp_write_flooding_count(sp);
 		trace_kvm_mmu_get_page(sp, false);
 		return sp;
 	}
 
 	++vcpu->kvm->stat.mmu_cache_miss;
-	sp = kvm_mmu_alloc_page(vcpu, parent_pte, direct);
+	sp = kvm_mmu_alloc_page(vcpu, direct);
 	if (!sp)
 		return sp;
 
 	sp->gfn = gfn;
@@ -1908,6 +1899,35 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
+{
+	u64 spte;
+
+	spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
+	       shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
+
+	mmu_spte_set(sptep, spte);
+}
+
+static struct kvm_mmu_page *
+mmu_spte_establish(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn, gva_t gaddr,
+		   unsigned level, int direct, unsigned access)
+{
+	struct kvm_mmu_page *sp;
+
+	WARN_ON(is_shadow_present_pte(*spte));
+
+	sp = kvm_mmu_get_page(vcpu, gfn, gaddr, level, direct, access);
+
+	link_shadow_page(spte, sp);
+	mmu_page_add_parent_pte(vcpu, sp, spte);
+
+	if (sp->unsync_children || sp->unsync)
+		kvm_mmu_mark_parents_unsync(sp);
+
+	return sp;
+}
+
 static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
			     struct kvm_vcpu *vcpu, u64 addr)
 {
@@ -1957,16 +1977,6 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	return __shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
-{
-	u64 spte;
-
-	spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
-	       shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
-
-	mmu_spte_set(sptep, spte);
-}
-
 static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned
Re: Q: Why not use struct mm_struct to manage guest physical addresses in new port?
On Tue, Feb 05, 2013 at 11:02:32AM -0800, David Daney wrote:

Hi,

I am starting to work on a port of KVM to an architecture that has a dual TLB. The Guest Virtual Addresses (GVA) are translated to Guest Physical Addresses (GPA) by the first TLB, then a second TLB translates the GPA to a Root Physical Address (RPA). For the sake of this question, we will ignore the GVA->GPA TLB and consider only the GPA->RPA TLB.

It seems that most existing ports have a bunch of custom code that manages the GPA->RPA TLB and page tables. Here is what I would like to try: create an mm for the GPA->RPA mappings; each vma would have a fault handler that calls gfn_to_pfn() to look up the proper page. In kvm_arch_vcpu_ioctl_run() we would call switch_mm() to this new 'gva_mm'.

gfn_to_pfn uses the address space of the controlling process. GPA->RPA translation does:

1) Find the 'memory slot' (indexed by gfn).
2) From the 'memory slot', find the virtual address (relative to the controlling process).
3) Walk the page table of the controlling process and retrieve the physical address.

Upon exiting guest mode we would switch back to the original mm of the controlling process.

For me the benefit of this approach is that all the code that manages the TLB is already implemented and works well for struct mm_struct. The only thing I need to do is write a vma fault handler. That is a lot easier and less error prone than maintaining a parallel TLB management framework and making sure it interacts properly with the existing TLB code for 'normal' processes.

Q1: Am I crazy for wanting to try this?

You need the mm_struct of the controlling process to be active when doing GPA->RPA translations.

Q2: Have others tried this and rejected it? What were the reasons?

I think you'll have to switch_mm back to the controlling process mm on every page fault (and then back to gva_mm).

Thanks in advance,
David Daney
Cavium, Inc.
A 'vma' is a process-level construct. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
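The three translation steps Marcelo lists (slot lookup, slot-relative address, page-table walk) can be sketched as below. This is an illustrative C model, not KVM's actual code: the `memslot` structure and `gfn_to_hva()` helper are simplified stand-ins, and step 3 (the page-table walk that gfn_to_pfn() ultimately performs via the controlling process's mm) is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical, simplified memory slot: a range of guest frame numbers
 * mapped at a userspace virtual address of the controlling process. */
struct memslot {
	uint64_t base_gfn;       /* first guest frame number in the slot */
	uint64_t npages;         /* number of pages in the slot */
	uint64_t userspace_addr; /* HVA of the slot's start */
};

/* Steps 1 and 2: find the slot covering gfn, then derive the host
 * virtual address. Step 3 (walking the controlling process's page
 * table to get the physical address) is left out here. */
static uint64_t gfn_to_hva(const struct memslot *slots, size_t n, uint64_t gfn)
{
	for (size_t i = 0; i < n; i++) {
		if (gfn >= slots[i].base_gfn &&
		    gfn < slots[i].base_gfn + slots[i].npages)
			return slots[i].userspace_addr +
			       ((gfn - slots[i].base_gfn) << PAGE_SHIFT);
	}
	return 0; /* no slot: MMIO or error case */
}
```

This also makes Marcelo's point concrete: the returned address is only meaningful while the controlling process's mm_struct is active.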
Re: [PATCH] KVM: Remove user_alloc from struct kvm_memory_slot
On Thu, Feb 07, 2013 at 06:55:57PM +0900, Takuya Yoshikawa wrote:

This field was needed to differentiate memory slots created by the new API, KVM_SET_USER_MEMORY_REGION, from those by the old equivalent, KVM_SET_MEMORY_REGION, whose support was dropped long before:

commit b74a07beed0e64bfba413dcb70dd6749c57f43dc KVM: Remove kernel-allocated memory regions

Although we also have private memory slots to which KVM allocates memory with vm_mmap(), !user_alloc slots in other words, the slot id should be enough for differentiating them.

Note: corresponding function parameters will be removed later.

Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Thu, Feb 07, 2013 at 04:01:11PM +0200, Gleb Natapov wrote: On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote:

Second is that an interrupt may be reported as delivered, but it will be coalesced (possible only with a self-IPI with the same vector):

Starting condition: PIR=0, IRR=0, vcpu is in guest mode

    io thread               | vcpu
    accept_apic_interrupt() |
    PIR and IRR are zero    |
    set PIR                 |
    return delivered        |
                            | self IPI
                            | set IRR
                            | merge PIR to IRR (*)

At (*) the interrupt that was reported as delivered is coalesced.

Only the vcpu itself should send a self-IPI, so it's fine.

It is fine only because this is not happening in practice (I hope) for the single interrupt we care about. Otherwise this is a serious problem.

Coalesced information is only interesting for non-IPI cases, that is, device emulation (at the moment, at least). The above case can happen when loading APIC registers, but "delivered" is not interesting in that case. Good to document, however.

Or:

    apic_accept_interrupt() {
    1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
       Never set IRR when HWAPIC enabled, even if outside of guest mode.
    2. Set PIR and let HW or SW VM-entry transfer it to IRR.
    3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
    }

This can report an interrupt as coalesced, but it will eventually be delivered as a separate interrupt:

Starting condition: PIR=0, IRR=1, vcpu is in guest mode

    io thread               | vcpu
                            | accept_apic_interrupt()
                            | ORIG_PIR=0, ORIG_IRR=1
                            | EOI
                            | clear IRR, set ISR
    set PIR                 |
    return coalesced        |
                            | clear PIR, set IRR
                            | EOI
                            | clear IRR, set ISR (*)

At (*) the interrupt that was reported as coalesced is delivered.

So still no perfect solution. But the first one has much less serious problems for our practical needs.

Two or more concurrent set_irq calls can race with each other, though. Can either document the race or add a lock.

-- Gleb.

Ok, then:

    accept_apic_irq:
    1. coalesced = test_and_set_bit(PIR)
    2. set KVM_REQ_EVENT bit (*)
    3. if (vcpu->in_guest_mode)
    4.     if (test_and_set_bit(pir notification bit))
    5.         send PIR IPI
    6. return coalesced

Do not see how it will help.

Starting condition: PIR=0, IRR=1, vcpu is in guest mode

    io thread               | vcpu
    accept_apic_interrupt() |
    coalesced = 0, PIR=1    |
    vcpu in a guest mode:   |
    send PIR IPI            |
                            | receive PIR IPI
                            | merge PIR to IRR (*)
    return not coalesced    |

At (*) the interrupt that was reported as delivered is coalesced.

Of course! The point is that we need to check PIR and IRR atomically and this is impossible.

Ok, next try:

    1. orig_irr = read irr from vapic page
    2. if (orig_irr == 0)
    3.     return test_and_test_bit(pir)
    4. return 0

To summarize, given irr and pir (the bit for the vector in question) the possibilities are:

    irr=0, pir=0: inject,       return coalesced=0.
    irr=0, pir=1: not injected, return coalesced=1.
    irr=1, pir=0: not injected, return coalesced=1.
    irr=1, pir=1: not injected, return coalesced=1.

Note the behaviour regarding whether to inject or not is the same as today: it depends on IRR being set. If PIR=1 and IRR=0, IRR will eventually be set.

Other sites:

A: On VM-entry, after disabling interrupts, but before the last check for ->requests, clear the pir notification bit (unconditionally). (*) This is _necessary_ also because during VM-exit a PIR IPI interrupt can be missed, so the KVM_REQ_EVENT indicates that SW is responsible for the PIR->IRR transfer.

-- Gleb.
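The final scheme, check IRR first and only then test-and-set the PIR bit, yields exactly the four-case table above. A minimal C model of that decision (illustrative only: `vlapic_model` is not the real kvm_lapic, and a single word stands in for the 256-bit PIR and IRR):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the vAPIC state involved in the decision. */
struct vlapic_model {
	_Atomic unsigned long pir; /* posted-interrupt requests */
	unsigned long irr;         /* IRR snapshot from the vAPIC page */
};

/* Returns true if the interrupt was coalesced, false if it was queued.
 * Mirrors the agreed logic: if IRR already holds the vector, report
 * coalesced without touching PIR; otherwise test_and_set the PIR bit
 * and report coalesced when it was already set. */
static bool accept_apic_interrupt(struct vlapic_model *apic, int vec)
{
	unsigned long mask = 1UL << vec;

	/* orig_irr = read irr from vapic page; set -> coalesced */
	if (apic->irr & mask)
		return true;

	/* test_and_set_bit(PIR): old value 1 means coalesced */
	return (atomic_fetch_or(&apic->pir, mask) & mask) != 0;
}
```

As in the thread, the injection decision still depends only on IRR; when PIR ends up set with IRR clear, the vector reaches IRR on the next PIR->IRR merge.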
Re: [PATCH v12 rebased] kvm: notify host when the guest is panicked
Hi, On Wed, Jan 23, 2013 at 03:19:21PM +0800, Hu Tao wrote: We can know the guest is panicked when the guest runs on xen. But we do not have such feature on kvm. Another purpose of this feature is: management app(for example: libvirt) can do auto dump when the guest is panicked. If management app does not do auto dump, the guest's user can do dump by hand if he sees the guest is panicked. We have three solutions to implement this feature: 1. use vmcall 2. use I/O port 3. use virtio-serial. We have decided to avoid touching hypervisor. The reason why I choose choose the I/O port is: 1. it is easier to implememt 2. it does not depend any virtual device 3. it can work when starting the kernel Signed-off-by: Wen Congyang we...@cn.fujitsu.com Signed-off-by: Hu Tao hu...@cn.fujitsu.com --- arch/ia64/kvm/irq.h | 19 + arch/powerpc/include/asm/kvm_para.h | 18 arch/s390/include/asm/kvm_para.h | 19 + arch/x86/include/asm/kvm_para.h | 20 ++ arch/x86/include/uapi/asm/kvm_para.h | 2 ++ arch/x86/kernel/kvm.c| 53 include/linux/kvm_para.h | 18 include/uapi/linux/kvm_para.h| 6 kernel/panic.c | 4 +++ 9 files changed, 159 insertions(+) diff --git a/arch/ia64/kvm/irq.h b/arch/ia64/kvm/irq.h index c0785a7..b3870f8 100644 --- a/arch/ia64/kvm/irq.h +++ b/arch/ia64/kvm/irq.h @@ -30,4 +30,23 @@ static inline int irqchip_in_kernel(struct kvm *kvm) return 1; } +static inline int kvm_arch_pv_event_init(void) +{ + return 0; +} + +static inline unsigned int kvm_arch_pv_features(void) +{ + return 0; +} + +static inline void kvm_arch_pv_eject_event(unsigned int event) +{ +} + +static inline bool kvm_arch_pv_event_enabled(void) +{ + return false; +} + The interface is x86 only, no need to touch other architectures. 
#endif diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index 2b11965..17dd013 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -144,4 +144,22 @@ static inline bool kvm_check_and_clear_guest_paused(void) return false; } +static inline int kvm_arch_pv_event_init(void) +{ + return 0; +} + +static inline unsigned int kvm_arch_pv_features(void) +{ + return 0; +} + +static inline void kvm_arch_pv_eject_event(unsigned int event) +{ +} + +static inline bool kvm_arch_pv_event_enabled(void) +{ + return false; +} #endif /* __POWERPC_KVM_PARA_H__ */ diff --git a/arch/s390/include/asm/kvm_para.h b/arch/s390/include/asm/kvm_para.h index e0f8423..81d87ec 100644 --- a/arch/s390/include/asm/kvm_para.h +++ b/arch/s390/include/asm/kvm_para.h @@ -154,4 +154,23 @@ static inline bool kvm_check_and_clear_guest_paused(void) return false; } +static inline int kvm_arch_pv_event_init(void) +{ + return 0; +} + +static inline unsigned int kvm_arch_pv_features(void) +{ + return 0; +} + +static inline void kvm_arch_pv_eject_event(unsigned int event) +{ +} + +static inline bool kvm_arch_pv_event_enabled(void) +{ + return false; +} + #endif /* __S390_KVM_PARA_H */ --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -133,4 +133,24 @@ static inline void kvm_disable_steal_time(void) } #endif +static inline int kvm_arch_pv_event_init(void) +{ + if (!request_region(KVM_PV_EVENT_PORT, 4, KVM_PV_EVENT)) + return -1; + + return 0; +} This should be in a driver in arch/x86/kernel/kvm-panic.c, or so. 
+ +static inline unsigned int kvm_arch_pv_features(void) +{ + return inl(KVM_PV_EVENT_PORT); +} + +static inline void kvm_arch_pv_eject_event(unsigned int event) +{ + outl(event, KVM_PV_EVENT_PORT); +} + +bool kvm_arch_pv_event_enabled(void); + #endif /* _ASM_X86_KVM_PARA_H */ diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 06fdbd9..c15ef33 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -96,5 +96,7 @@ struct kvm_vcpu_pv_apf_data { #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK #define KVM_PV_EOI_DISABLED 0x0 +#define KVM_PV_EVENT_PORT(0x505UL) + No need for the ioport to be hard coded. What are the options to communicate an address to the guest? An MSR, via ACPI? #endif /* _UAPI_ASM_X86_KVM_PARA_H */ diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 9c2bd8b..0aa7b3e 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -73,6 +73,20 @@ static int parse_no_kvmclock_vsyscall(char *arg) early_param(no-kvmclock-vsyscall, parse_no_kvmclock_vsyscall); +static int pv_event = 1; +static int parse_no_pv_event(char *arg) +{ + pv_event = 0; + return 0; +}
Re: [PATCH v12 rebased 2/8] start vm after resetting it
On Wed, Jan 23, 2013 at 03:19:23PM +0800, Hu Tao wrote: From: Wen Congyang we...@cn.fujitsu.com The guest should run after resetting it, but it does not run if its old state is RUN_STATE_INTERNAL_ERROR or RUN_STATE_PAUSED. We don't set runstate to RUN_STATE_PAUSED when resetting the guest, so the runstate will be changed from RUN_STATE_INTERNAL_ERROR or RUN_STATE_PAUSED to RUN_STATE_RUNNING(not RUN_STATE_PAUSED). It appears the last hunk will automatically reset state from RUN_STATE_INTERNAL_ERROR to RUN_STATE_RUNNING ? I suppose the transition table allows, from RUN_STATE_INTERNAL_ERROR: monitor system_reset monitor cont To resume the machine? Signed-off-by: Wen Congyang we...@cn.fujitsu.com --- include/block/block.h | 2 ++ qmp.c | 2 +- vl.c | 7 --- 3 files changed, 7 insertions(+), 4 deletions(-) diff --git a/include/block/block.h b/include/block/block.h index ffd1936..5e82ccb 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -366,6 +366,8 @@ void bdrv_disable_copy_on_read(BlockDriverState *bs); void bdrv_set_in_use(BlockDriverState *bs, int in_use); int bdrv_in_use(BlockDriverState *bs); +void iostatus_bdrv_it(void *opaque, BlockDriverState *bs); + #ifdef CONFIG_LINUX_AIO int raw_get_aio_fd(BlockDriverState *bs); #else diff --git a/qmp.c b/qmp.c index 55b056b..5f1bed1 100644 --- a/qmp.c +++ b/qmp.c @@ -130,7 +130,7 @@ SpiceInfo *qmp_query_spice(Error **errp) }; #endif -static void iostatus_bdrv_it(void *opaque, BlockDriverState *bs) +void iostatus_bdrv_it(void *opaque, BlockDriverState *bs) { bdrv_iostatus_reset(bs); } diff --git a/vl.c b/vl.c index b0bcf1e..1d2edaa 100644 --- a/vl.c +++ b/vl.c @@ -534,7 +534,7 @@ static const RunStateTransition runstate_transitions_def[] = { { RUN_STATE_INMIGRATE, RUN_STATE_RUNNING }, { RUN_STATE_INMIGRATE, RUN_STATE_PAUSED }, -{ RUN_STATE_INTERNAL_ERROR, RUN_STATE_PAUSED }, +{ RUN_STATE_INTERNAL_ERROR, RUN_STATE_RUNNING }, { RUN_STATE_INTERNAL_ERROR, RUN_STATE_FINISH_MIGRATE }, { RUN_STATE_IO_ERROR, 
			      RUN_STATE_RUNNING },
@@ -569,7 +569,7 @@ static const RunStateTransition runstate_transitions_def[] = {
     { RUN_STATE_SAVE_VM, RUN_STATE_RUNNING },
 
-    { RUN_STATE_SHUTDOWN, RUN_STATE_PAUSED },
+    { RUN_STATE_SHUTDOWN, RUN_STATE_RUNNING },
     { RUN_STATE_SHUTDOWN, RUN_STATE_FINISH_MIGRATE },
 
     { RUN_STATE_DEBUG, RUN_STATE_SUSPENDED },
@@ -1951,7 +1951,8 @@ static bool main_loop_should_exit(void)
         resume_all_vcpus();
         if (runstate_check(RUN_STATE_INTERNAL_ERROR) ||
             runstate_check(RUN_STATE_SHUTDOWN)) {
-            runstate_set(RUN_STATE_PAUSED);
+            bdrv_iterate(iostatus_bdrv_it, NULL);
+            vm_start();
         }
     }
     if (qemu_wakeup_requested()) {
--
1.8.0.1.240.ge8a1f5a
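The effect of the table hunks above can be modeled directly: QEMU validates every runstate change against a table of allowed (from, to) pairs. A toy version follows; the enum values and helper names here are illustrative, not QEMU's full RunState set, and it only encodes the handful of transitions this patch touches.

```c
#include <assert.h>
#include <stdbool.h>

enum run_state {
	R_RUNNING, R_PAUSED, R_INTERNAL_ERROR, R_SHUTDOWN, R_MAX
};

static bool valid[R_MAX][R_MAX];

/* Table as it stands with the patch applied: INTERNAL_ERROR and
 * SHUTDOWN may go straight back to RUNNING (system_reset restarts
 * the guest) instead of landing in PAUSED. */
static void init_transitions(void)
{
	valid[R_INTERNAL_ERROR][R_RUNNING] = true;
	valid[R_SHUTDOWN][R_RUNNING]       = true;
	valid[R_RUNNING][R_INTERNAL_ERROR] = true;
	valid[R_RUNNING][R_SHUTDOWN]       = true;
	valid[R_RUNNING][R_PAUSED]         = true;
	valid[R_PAUSED][R_RUNNING]         = true;
}

/* Mirrors runstate_set(): reject transitions the table does not allow. */
static bool runstate_change(enum run_state *cur, enum run_state next)
{
	if (!valid[*cur][next])
		return false;
	*cur = next;
	return true;
}
```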
Re: [PATCH v12 rebased 1/8] preserve cpu runstate
On Wed, Jan 23, 2013 at 03:19:22PM +0800, Hu Tao wrote: This patch enables preservation of cpu runstate during save/load vm. So when a vm is restored from snapshot, the cpu runstate is restored, too. See following example: # save two vms: one is running, the other is paused (qemu) info status VM status: running (qemu) savevm running (qemu) stop (qemu) info status VM status: paused (qemu) savevm paused # restore the one running (qemu) info status VM status: paused (qemu) loadvm running (qemu) info status VM status: running # restore the one paused (qemu) loadvm paused (qemu) info status VM status: paused (qemu) cont (qemu)info status VM status: running Signed-off-by: Hu Tao hu...@cn.fujitsu.com Lack of pause state on guest images is annoying. Fail to see why the panic feature depends on preservation of cpu runstate. include/sysemu/sysemu.h | 2 ++ migration.c | 6 +- monitor.c | 5 ++--- savevm.c| 1 + vl.c| 34 ++ 5 files changed, 40 insertions(+), 8 deletions(-) diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h index 337ce7d..7a69fde 100644 --- a/include/sysemu/sysemu.h +++ b/include/sysemu/sysemu.h @@ -19,6 +19,8 @@ extern uint8_t qemu_uuid[]; int qemu_uuid_parse(const char *str, uint8_t *uuid); #define UUID_FMT %02hhx%02hhx%02hhx%02hhx-%02hhx%02hhx-%02hhx%02hhx-%02hhx%02hhx-%02hhx%02hhx%02hhx%02hhx%02hhx%02hhx +void save_run_state(void); +void load_run_state(void); bool runstate_check(RunState state); void runstate_set(RunState new_state); int runstate_is_running(void); diff --git a/migration.c b/migration.c index 77c1971..f96cfd6 100644 --- a/migration.c +++ b/migration.c @@ -108,11 +108,7 @@ static void process_incoming_migration_co(void *opaque) /* Make sure all file formats flush their mutable metadata */ bdrv_invalidate_cache_all(); -if (autostart) { -vm_start(); -} else { -runstate_set(RUN_STATE_PAUSED); -} +load_run_state(); } static void enter_migration_coroutine(void *opaque) diff --git a/monitor.c b/monitor.c index 20bd19b..9381ed0 100644 --- 
a/monitor.c +++ b/monitor.c @@ -2059,13 +2059,12 @@ void qmp_closefd(const char *fdname, Error **errp) static void do_loadvm(Monitor *mon, const QDict *qdict) { -int saved_vm_running = runstate_is_running(); const char *name = qdict_get_str(qdict, name); vm_stop(RUN_STATE_RESTORE_VM); -if (load_vmstate(name) == 0 saved_vm_running) { -vm_start(); +if (load_vmstate(name) == 0) { +load_run_state(); } } diff --git a/savevm.c b/savevm.c index 304d1ef..10f1d56 100644 --- a/savevm.c +++ b/savevm.c @@ -2112,6 +2112,7 @@ void do_savevm(Monitor *mon, const QDict *qdict) } saved_vm_running = runstate_is_running(); +save_run_state(); vm_stop(RUN_STATE_SAVE_VM); memset(sn, 0, sizeof(*sn)); diff --git a/vl.c b/vl.c index 4ee1302..b0bcf1e 100644 --- a/vl.c +++ b/vl.c @@ -520,6 +520,7 @@ static int default_driver_check(QemuOpts *opts, void *opaque) /* QEMU state */ static RunState current_run_state = RUN_STATE_PRELAUNCH; +static RunState saved_run_state = RUN_STATE_PRELAUNCH; typedef struct { RunState from; @@ -543,6 +544,7 @@ static const RunStateTransition runstate_transitions_def[] = { { RUN_STATE_PAUSED, RUN_STATE_FINISH_MIGRATE }, { RUN_STATE_POSTMIGRATE, RUN_STATE_RUNNING }, +{ RUN_STATE_POSTMIGRATE, RUN_STATE_PAUSED }, { RUN_STATE_POSTMIGRATE, RUN_STATE_FINISH_MIGRATE }, { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING }, @@ -553,6 +555,7 @@ static const RunStateTransition runstate_transitions_def[] = { { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE }, { RUN_STATE_RESTORE_VM, RUN_STATE_RUNNING }, +{ RUN_STATE_RESTORE_VM, RUN_STATE_PAUSED }, { RUN_STATE_RUNNING, RUN_STATE_DEBUG }, { RUN_STATE_RUNNING, RUN_STATE_INTERNAL_ERROR }, @@ -582,11 +585,39 @@ static const RunStateTransition runstate_transitions_def[] = { static bool runstate_valid_transitions[RUN_STATE_MAX][RUN_STATE_MAX]; +void save_run_state(void) +{ +saved_run_state = current_run_state; +} + +void load_run_state(void) +{ +if (saved_run_state == RUN_STATE_RUNNING) { +vm_start(); +} else if 
(!runstate_check(saved_run_state)) { +runstate_set(saved_run_state); +} else { +; /* leave unchanged */ +} +} + bool runstate_check(RunState state) { return current_run_state == state; } +static void runstate_save(QEMUFile *f, void *opaque) +{ +qemu_put_byte(f, saved_run_state); +} + +static int runstate_load(QEMUFile *f, void *opaque, int version_id) +{ +saved_run_state = qemu_get_byte(f); + +return 0; +} This breaks loading images without support for
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Thu, Feb 07, 2013 at 03:52:24PM +0200, Gleb Natapov wrote:

It's not a bad idea to have a new KVM_REQ_ bit for PIR processing (just as the current patches do).

Without the numbers I do not see why. KVM_REQ_EVENT already means... counting... many things.

It's a well-defined request, to sync PIR->VIRR; don't see your point about performance.
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Tue, Feb 05, 2013 at 09:32:50AM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 06:47:30PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if interrupt is set in IRR but not in PIR interrupt will not be reported as coalesced, but it will be coalesced during PIR-IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. set_irq return value at 7 is incorrect, interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it at 4 set_irq() will return incorrect result. Basically when PIR is set to 1 while IRR has 1 for the vector the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to IRR. At 6 it has been coalesced: PIR bit merged into IRR bit. Yes, that's the case. Frankly I do not see how it can be fixed without any race with present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR already set, don't set PIR. Need to check both IRR and PIR. Something like that: apic_accept_interrupt() { if (PIR || IRR) return coalesced; else set PIR; } This has two problems. Firs is that interrupt that can be delivered will be not (IRR is cleared just after it was tested), but it will be reported as coalesced, so this is benign race. Yes, and the same condition exists today with IRR, its fine. 
Second is that interrupt may be reported as delivered, but it will be coalesced (possible only with the self IPI with the same vector): Starting condition: PIR=0, IRR=0 vcpu is in a guest mode io thread | vcpu accept_apic_interrupt() | PIR and IRR is zero | set PIR | return delivered | | self IPI | set IRR | merge PIR to IRR (*) At (*) interrupt that was reported as delivered is coalesced. Only vcpu itself should send self-IPI, so its fine. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC enabled, even if outside of guest mode. 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). } This can report interrupt as coalesced, but it will be eventually delivered as separate interrupt: Starting condition: PIR=0, IRR=1 vcpu is in a guest mode io thread | vcpu | accept_apic_interrupt() | ORIG_PIR=0, ORIG_IRR=1 | |EOI |clear IRR, set ISR set PIR | return coalesced| |clear PIR, set IRR |EOI |clear IRR, set ISR (*) At (*) interrupt that was reported as coalesced is delivered. So still no perfect solution. But first one has much less serious problems for our practical needs. Two or more concurrent set_irq can race with each other, though. Can either document the race or add a lock. -- Gleb. Ok, then: accept_apic_irq: 1. coalesced = test_and_set_bit(PIR) 2. set KVM_REQ_EVENT bit(*) 3. if (vcpu-in_guest_mode) 4. if (test_and_set_bit(pir notification bit)) 5. send PIR IPI 6. return coalesced Other sites: A: On VM-entry, after disabling interrupts, but before the last check for -requests, clear pir notification bit (unconditionally). (*) This is _necessary_ also because during VM-exit a PIR IPI interrupt can be missed, so the KVM_REQ_EVENT indicates that SW is responsible for PIR-IRR transfer. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5] target-i386: Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs
On Wed, Feb 06, 2013 at 10:22:32PM +, Auld, Will wrote:

Marcelo, Hi, I have been watching for this patch in the upstream but have not seen it yet. What version of QEMU should it be in? Thanks, Will

Will, it's in the GIT tree: https://github.com/qemu/qemu/commit/f28558d3d37ad3bc4e35e8ac93f7bf81a0d5622c

As for the next release: http://www.mail-archive.com/qemu-devel@nongnu.org/msg153579.html
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote: On Tue, Feb 05, 2013 at 09:32:50AM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 06:47:30PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if interrupt is set in IRR but not in PIR interrupt will not be reported as coalesced, but it will be coalesced during PIR-IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. set_irq return value at 7 is incorrect, interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it at 4 set_irq() will return incorrect result. Basically when PIR is set to 1 while IRR has 1 for the vector the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to IRR. At 6 it has been coalesced: PIR bit merged into IRR bit. Yes, that's the case. Frankly I do not see how it can be fixed without any race with present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR already set, don't set PIR. Need to check both IRR and PIR. Something like that: apic_accept_interrupt() { if (PIR || IRR) return coalesced; else set PIR; } This has two problems. Firs is that interrupt that can be delivered will be not (IRR is cleared just after it was tested), but it will be reported as coalesced, so this is benign race. 
Yes, and the same condition exists today with IRR, its fine. Second is that interrupt may be reported as delivered, but it will be coalesced (possible only with the self IPI with the same vector): Starting condition: PIR=0, IRR=0 vcpu is in a guest mode io thread | vcpu accept_apic_interrupt() | PIR and IRR is zero | set PIR | return delivered | | self IPI | set IRR | merge PIR to IRR (*) At (*) interrupt that was reported as delivered is coalesced. Only vcpu itself should send self-IPI, so its fine. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC enabled, even if outside of guest mode. 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). } This can report interrupt as coalesced, but it will be eventually delivered as separate interrupt: Starting condition: PIR=0, IRR=1 vcpu is in a guest mode io thread | vcpu | accept_apic_interrupt() | ORIG_PIR=0, ORIG_IRR=1 | |EOI |clear IRR, set ISR set PIR | return coalesced| |clear PIR, set IRR |EOI |clear IRR, set ISR (*) At (*) interrupt that was reported as coalesced is delivered. So still no perfect solution. But first one has much less serious problems for our practical needs. Two or more concurrent set_irq can race with each other, though. Can either document the race or add a lock. -- Gleb. Ok, then: accept_apic_irq: 1. coalesced = test_and_set_bit(PIR) 2. set KVM_REQ_EVENT bit (*) 3. if (vcpu-in_guest_mode) 4.if (test_and_set_bit(pir notification bit)) 5.send PIR IPI 6. return coalesced Other sites: A: On VM-entry, after disabling interrupts, but before the last check for -requests, clear pir notification bit (unconditionally). (*) This is _necessary_ also because during VM-exit a PIR IPI interrupt can be missed, so the KVM_REQ_EVENT indicates that SW is responsible for PIR-IRR transfer. Its not a bad idea to have a new KVM_REQ_ bit for PIR processing (just as the current patches do). 
Re: [PATCH v3] KVM: MMU: lazily drop large spte
On Tue, Feb 05, 2013 at 03:11:09PM +0800, Xiao Guangrong wrote:

Currently, kvm zaps the large spte if write protection is needed, so a later read can fault on that spte. Actually, we can make the large spte readonly instead of making it un-present, so the page fault caused by read access can be avoided.

The idea is from Avi: | As I mentioned before, write-protecting a large spte is a good idea, | since it moves some work from protect-time to fault-time, so it reduces | jitter. This removes the need for the return value.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com

Applied, thanks.
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
According to the SDM, software should not touch the IRR when the target vcpu is running. Instead, use a locked way to access the PIR. So your solution may be wrong.

Then your apicv patches are broken, because they do exactly that.

Which code is broken? The one that updates IRR directly on the apic page.

No, all the updates ensure the target vcpu is not running, so it's safe to touch IRR.

Not at all. Read the code.

Sorry, I still cannot figure out which code is wrong. All the places that call sync_pir_to_irr() are on the target vcpu. Can you point out the code? Thanks.

I am talking about the vapic patches which are already in, not the pir patches.

Yes, but the issue will be fixed with the pir patches. With posted interrupts, it will touch PIR instead of IRR, and access to PIR is allowed by HW. Best regards, Yang

From http://www.mail-archive.com/kvm@vger.kernel.org/msg82824.html:

2. Section 29.6 mentions that "Use of the posted-interrupt descriptor differs from that of other data structures that are referenced by pointers in a VMCS. There is a general requirement that software ensure that each such data structure is modified only when no logical processor with a current VMCS that references it is in VMX non-root operation. That requirement does not apply to the posted-interrupt descriptor. There is a requirement, however, that such modifications be done using locked read-modify-write instructions."

The APIC virtual page is being modified by a CPU while a logical processor with a current VMCS that references it is in VMX non-root operation; in fact the APIC virtual page is even modified by EOI virtualization, virtual interrupt delivery, etc. What are the requirements in this case?

It should be the same as with posted interrupts. Software must ensure atomic access to the virtual apic page.

Can this point be clarified? Can software access the virtual APIC page while a VMCS that references it is in VMX non-root operation, or not?
Because if it cannot, then it means the current code is broken and VID usage without PIR should not be allowed.
Re: [PATCH v2] KVM: VMX: disable SMEP feature when guest is in non-paging mode
On Mon, Feb 04, 2013 at 11:50:43AM +0800, Dongxiao Xu wrote:

Changes from v1 to v2: - Modify commit message and comments according to Gleb's suggestions.

SMEP is disabled if the CPU is in non-paging mode in hardware. However KVM always uses paging mode to emulate guest non-paging mode with TDP. To emulate this behavior, SMEP needs to be manually disabled when the guest switches to non-paging mode.

We met an issue where an SMP Linux guest with a recent kernel (with SMEP support, for example, 3.5.3) would crash with a triple fault if unrestricted_guest=0 is set. This is because KVM uses an identity-mapping page table to emulate non-paging mode, and that page table is set with the USER flag. If SMEP is still enabled in this case, the guest will meet an unhandleable page fault and then crash.

Signed-off-by: Dongxiao Xu dongxiao...@intel.com Signed-off-by: Xiantao Zhang xiantao.zh...@intel.com --- arch/x86/kvm/vmx.c | 8 + 1 files changed, 8 insertions(+), 0 deletions(-)

Applied, thanks.
Re: [PATCH] KVM: VMX: add missing exit names to VMX_EXIT_REASONS array
Applied, thanks.
Re: [PATCH] KVM: Remove duplicate text in api.txt
On Thu, Jan 31, 2013 at 12:06:08PM -0800, Geoff Levand wrote:

Signed-off-by: Geoff Levand ge...@infradead.org --- Saw this in v3.8-rc5, please apply.

Documentation/virtual/kvm/api.txt | 13 - 1 file changed, 13 deletions(-)

Applied, thanks.
Re: [PATCH 0/8] KVM: Reduce mmu_lock hold time when zapping mmu pages
On Wed, Jan 23, 2013 at 07:12:31PM +0900, Takuya Yoshikawa wrote:

This patch set mitigates another mmu_lock hold time issue. Although this is not enough and I'm thinking of additional work already, this alone can reduce the lock hold time to some extent.

Takuya Yoshikawa (8):
  KVM: MMU: Fix and clean up for_each_gfn_* macros
  KVM: MMU: Use list_for_each_entry_safe in kvm_mmu_commit_zap_page()
  KVM: MMU: Add a parameter to kvm_mmu_prepare_zap_page() to update the next position
  KVM: MMU: Introduce for_each_gfn_indirect_valid_sp_safe macro
  KVM: MMU: Delete hash_link node in kvm_mmu_prepare_zap_page()
  KVM: MMU: Introduce free_zapped_mmu_pages() for freeing mmu pages in a list
  KVM: MMU: Split out free_zapped_mmu_pages() from kvm_mmu_commit_zap_page()
  KVM: MMU: Move free_zapped_mmu_pages() out of the protection of mmu_lock

 arch/x86/kvm/mmu.c | 149 +++-
 1 files changed, 101 insertions(+), 48 deletions(-)

Need a limit on the number of pages whose freeing is delayed. See that n_used_mmu_pages is used by both SLAB freeing (to know how much pressure to apply) and allocators (to decide when to allocate more). You allow n_used_mmu_pages to be inaccurate, which is fine as long as the error is limited.

Perhaps have a max of 64 pages at invalid_pages per round, and if exceeded, release memory inside mmu_lock (one-by-one)?
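Marcelo's suggestion, deferring at most a bounded number of frees and releasing the excess inline, might look like the sketch below. This is pure illustration: `zapped_page`, `commit_zap()`, and the constant are made up for the example, not KVM structures, and user-space free() stands in for the kernel's page freeing under mmu_lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Bound on pages whose freeing is deferred past mmu_lock; this also
 * bounds how stale n_used_mmu_pages can be in the real code. */
#define MAX_DELAYED_PAGES 64

struct zapped_page {
	struct zapped_page *next;
};

/* Frees list entries beyond the bound immediately (inside the lock in
 * the real code) and returns how many were freed now; the remaining
 * (at most MAX_DELAYED_PAGES) entries stay queued to be freed after
 * mmu_lock is dropped. */
static size_t commit_zap(struct zapped_page **invalid_list, size_t n)
{
	size_t freed_now = 0;

	while (n > MAX_DELAYED_PAGES) {
		struct zapped_page *page = *invalid_list;

		*invalid_list = page->next;
		free(page);
		freed_now++;
		n--;
	}
	return freed_now;
}
```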
Re: [PATCH 0/8] KVM: Reduce mmu_lock hold time when zapping mmu pages
On Wed, Jan 23, 2013 at 06:44:52PM +0800, Xiao Guangrong wrote: On 01/23/2013 06:12 PM, Takuya Yoshikawa wrote: This patch set mitigates another mmu_lock hold time issue. Although this is not enough and I'm thinking of additional work already, this alone can reduce the lock hold time to some extent. It is not worth doing this kind of complex thing; usually there are only a few pages on the invalid list. I think it's a good idea - memory freeing can be done outside mmu_lock protection (as long as it's bounded). It reduces mmu_lock contention overall. The *really* heavy case is kvm_mmu_zap_all(), which can be sped up by using a generation number; this is a todo item on the KVM wiki: http://www.linux-kvm.org/page/TODO: O(1) mmu invalidation using a generation number. I have been doing this work for some weeks and will post the patch out during these days. Can you describe the generation number scheme in more detail, please?
Re: [PATCH 8/8] KVM: MMU: Move free_zapped_mmu_pages() out of the protection of mmu_lock
On Wed, Jan 23, 2013 at 07:18:11PM +0900, Takuya Yoshikawa wrote: We noticed that kvm_mmu_zap_all() could take hundreds of milliseconds for zapping mmu pages with mmu_lock held. Although we need to do conditional rescheduling for completely fixing this issue, we can reduce the hold time to some extent by moving free_zapped_mmu_pages() out of the protection. Since invalid_list can be very long, the effect is not negligible. Note: this patch does not treat non-trivial cases. Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp Can you describe the case that's biting? Is it /* * If memory slot is created, or moved, we need to clear all * mmio sptes. */ if (npages && old.base_gfn != mem->guest_phys_addr >> PAGE_SHIFT) { kvm_mmu_zap_all(kvm); kvm_reload_remote_mmus(kvm); } Because conditional rescheduling for kvm_mmu_zap_all() might not be desirable: KVM_SET_USER_MEMORY has low latency requirements.
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Mon, Feb 04, 2013 at 11:55:53AM +0200, Gleb Natapov wrote: On Sun, Feb 03, 2013 at 10:57:00PM -0200, Marcelo Tosatti wrote: On Thu, Jan 31, 2013 at 03:55:56PM +0200, Gleb Natapov wrote: On Thu, Jan 31, 2013 at 11:44:43AM -0200, Marcelo Tosatti wrote: On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote: On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote: On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: Posted interrupt patch: 2) Must move the IN_GUEST_MODE assignment after local_irq_disable, in the vcpu_enter_guest function. Otherwise, with cpu0 injecting into vcpu1 on cpu1: vcpu->mode = IN_GUEST_MODE; if IN_GUEST_MODE == true, send IPI; local_irq_disable; PIR not transferred to VIRR, misses interrupt. cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after local_irq_disable() by the ->requests check. Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose of posted interrupts. You want: if vcpu in guest mode, send posted interrupt IPI; else KVM_REQ_EVENT+kick. I am thinking: set KVM_REQ_EVENT; if pi is enabled, send posted interrupt IPI; else kick. KVM_REQ_EVENT must be after sending the posted interrupt IPI. Otherwise on the vcpu entry side: test_and_clear(KVM_REQ_EVENT) { no bits set in PIR }. It should be after updating PIR, but before sending the posted interrupt IPI. Otherwise, cpu0 vs cpu1 (vcpu): KVM_REQ_EVENT is not set; set pir; send IPI; irq_disable(); ->requests is empty; set KVM_REQ_EVENT. That's the same sequence as with the IRR update, KVM_REQ_EVENT and kick today. Can only send the IPI if vcpu->mode == IN_GUEST_MODE, which must be set after the interrupt flag is cleared as noted. Also, KVM_REQ_EVENT is processed outside of the interrupt-disabled region today. But it is checked in the interrupt-disabled section and vcpu entry is aborted if an event is pending. Or maybe I don't get what you say... write a complete description?
I am saying that we do not need to move vcpu->mode = IN_GUEST_MODE into the irq_disable() section to make things work. Just do: set bit in pir; set KVM_REQ_EVENT; if in guest mode, do IPI. I see. Yeah, probably. What about item 4 below? That's for Yang to answer :) If more than one interrupt is generated with the same vector number, the local APIC can set the bit for the vector both in the IRR and ISR. This means that for the Pentium 4 and Intel Xeon processors, the IRR and ISR can queue two interrupts for each interrupt vector: one in the IRR and one in the ISR. Any additional interrupts issued for the same interrupt vector are collapsed into the single bit in the IRR. Which would mean KVM emulation differs... Eventually 3 interrupts can be queued: one in IRR, one in ISR, one in PIR. I do not think this is the case though. PIR will be folded into IRR either during a guest entry or by the vcpu itself on receiving the notification vector IPI. Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if the interrupt is set in IRR but not in PIR, the interrupt will not be reported as coalesced, but it will be coalesced during the PIR->IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. The set_irq return value at 7 is incorrect: the interrupt event was _not_ queued.
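The coalescing-detection problem traced in the numbered sequence above can be modeled in a few lines (a hypothetical single-vector sketch, not KVM's actual lapic code):

```python
# Minimal model of the coalescing-detection problem described above.
# set_irq reports "coalesced" when the copy of the vector bit it is
# about to set is already set; once PIR exists, a third copy of the
# interrupt can be queued, so the "not coalesced" report at step 6
# turns out to be wrong after the later PIR->IRR merge.

class Apic:
    def __init__(self):
        self.irr = self.isr = self.pir = False
        self.in_guest_mode = False

    def set_irq(self):
        """Returns True if the interrupt was coalesced (lost)."""
        if self.in_guest_mode:
            coalesced = self.pir
            self.pir = True
        else:
            coalesced = self.irr
            self.irr = True
        return coalesced

apic = Apic()
assert apic.set_irq() is False    # 1. IRR=1: queued
apic.irr, apic.isr = False, True  # 2. IRR->ISR transfer
assert apic.set_irq() is False    # 4. IRR=1 again: queued
apic.in_guest_mode = True         # 5. vcpu enters guest mode
assert apic.set_irq() is False    # 6. PIR=1: reported as queued...
# 7. ...but the PIR->IRR merge collapses it into the already-set IRR
# bit, so the interrupt accepted at step 6 was in fact coalesced.
```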
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if the interrupt is set in IRR but not in PIR, the interrupt will not be reported as coalesced, but it will be coalesced during the PIR->IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. The set_irq return value at 7 is incorrect: the interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it, at 4 set_irq() will return an incorrect result. Basically, when PIR is set to 1 while IRR has 1 for the vector, the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to the IRR. At 6 it has been coalesced: the PIR bit merged into the IRR bit. Frankly, I do not see how it can be fixed without any race with the present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting the PIR bit; if IRR is already set, don't set PIR.
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if the interrupt is set in IRR but not in PIR, the interrupt will not be reported as coalesced, but it will be coalesced during the PIR->IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. The set_irq return value at 7 is incorrect: the interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it, at 4 set_irq() will return an incorrect result. Basically, when PIR is set to 1 while IRR has 1 for the vector, the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to the IRR. At 6 it has been coalesced: the PIR bit merged into the IRR bit. Frankly, I do not see how it can be fixed without any race with the present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting the PIR bit; if IRR is already set, don't set PIR. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC is enabled, even if outside of guest mode. 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). } Two or more concurrent set_irq calls can race with each other, though. Can either document the race or add a lock.
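The alternative accept path sketched above can be modeled roughly as follows (hypothetical single-vector Python sketch; it ignores the concurrent set_irq race the message itself flags):

```python
def apic_accept_interrupt(state):
    """Model of the ORIG_PIR/ORIG_IRR proposal: snapshot both copies of
    the vector bit first, always set PIR (never IRR directly while
    hardware APIC virtualization is enabled), and report "coalesced"
    iff either copy was already set before this injection."""
    orig_pir, orig_irr = state['pir'], state['irr']
    state['pir'] = True          # HW or SW VM-entry later merges PIR into IRR
    return orig_pir or orig_irr  # True => this interrupt was coalesced
```

Under this scheme the step-6 injection from the earlier trace would correctly report coalesced, because the snapshot sees the IRR bit that the merge will later collapse into.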
Re: [PATCH v3] KVM: set_memory_region: Identify the requested change explicitly
On Tue, Jan 29, 2013 at 11:00:07AM +0900, Takuya Yoshikawa wrote: KVM_SET_USER_MEMORY_REGION forces __kvm_set_memory_region() to identify what kind of change is being requested by checking the arguments. The current code does this checking at various points in code and each condition being used there is not easy to understand at first glance. This patch consolidates these checks and introduces an enum to name the possible changes to clean up the code. Although this does not introduce any functional changes, there is one change which optimizes the code a bit: if we have nothing to change, the new code returns 0 immediately. Note that the return value for this case cannot be changed since QEMU relies on it: we noticed this when we changed it to -EINVAL and got a section mismatch error at the final stage of live migration. Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp --- v2: updated iommu related parts v3: converted !(A == B) to A != B Note: based on this patch, disallowing RO change will become trivial: check such changes when we identify KVM_MR_FLAGS_ONLY. virt/kvm/kvm_main.c | 64 +++ 1 files changed, 44 insertions(+), 20 deletions(-) Applied, thanks. It would be good to propagate the enum into kvm_arch_prepare/commit memory region.
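The "identify the requested change explicitly" idea can be sketched like this (a Python model; the enum names follow the KVM_MR_* spirit mentioned in the patch notes, but the exact C semantics in virt/kvm/kvm_main.c are approximated):

```python
# Classify a KVM_SET_USER_MEMORY_REGION request up front, instead of
# re-deriving the kind of change at every check site.
from enum import Enum

class Change(Enum):
    CREATE = 'create'       # no old slot: a new slot is being created
    DELETE = 'delete'       # new npages == 0: the slot is being deleted
    MOVE = 'move'           # the base gfn changes
    FLAGS_ONLY = 'flags'    # only the flags differ (e.g. read-only bit)
    NONE = 'none'           # nothing to change: return 0 immediately

def classify(old, new):
    """old/new are dicts with 'npages', 'base_gfn', 'flags';
    old is None when no slot exists yet."""
    if old is None:
        return Change.CREATE
    if new['npages'] == 0:
        return Change.DELETE
    if new['base_gfn'] != old['base_gfn']:
        return Change.MOVE
    if new['flags'] != old['flags']:
        return Change.FLAGS_ONLY
    return Change.NONE
```

With the change named once, follow-up restrictions (such as disallowing a read-only flip, per the v3 note) become a single check on Change.FLAGS_ONLY.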
Re: Investigating abnormal stealtimes
Nice! - 'Steal time' is the amount of time taken while the vcpu is able to run but not runnable. Maybe 'vmexit latency' is a better name. - Perhaps it would be good to subtract the time the thread was involuntarily scheduled out due to 'timeslice' expiration. Otherwise, running a CPU-intensive task returns false positives (that is, long delays due to rescheduling because the 'timeslice' was exhausted by guest CPU activity, not due to KVM or QEMU issues such as voluntarily scheduling in pthread_mutex_lock). Alternatively, you can raise the priority of the vcpu threads (to get rid of the false positives). - Idea: it would be handy to extract trace events in the offending 'latency above threshold' vmexit/vmentry region. Say you enable other trace events (unrelated to kvm) which can help identify the culprit. Instead of scanning the file manually searching for 100466.1062486786, save one vmexit/vmentry cycle, along with other trace events in that period, in a separate file. On Tue, Jan 29, 2013 at 01:27:12PM +0100, Stefan Hajnoczi wrote: Khoa and I have been discussing a workload that triggers softlockups and hung task warnings inside the guest. These warnings can pop up due to bugs in the guest Linux kernel, but they can also be triggered by the hypervisor if vcpus are not being scheduled at reasonable times. I've wanted a tool that reports high steal times and includes the last vmexit reason. This allows us to figure out if specific I/O device emulation is taking too long or if other factors like host memory pressure are degrading guest performance. Here is a first sketch of such a tool. It's a perf-script(1) Python script which can be used to analyze perf.data files recorded with kvm:kvm_entry and kvm:kvm_exit events.
Stealtimes exceeding a threshold will be flagged up: $ perf script -s /absolute/path/to/stealtime.py 100466.1062486786 9690: steal time 0.029318914 secs, exit_reason IO_INSTRUCTION, guest_rip 0x81278f02, isa 1, info1 0xcf80003, info2 0x0 The example above shows an I/O access to 0xcf8 (PCI Configuration Space Address port) that took about 28 milliseconds. The host pid was 9690; this can be used to investigate the QEMU vcpu thread. The guest rip can be used to investigate guest code that triggered this vmexit. Given this information, it becomes possible to debug QEMU to figure out why vmexit handling is taking too long. It might be due to global mutex contention if another thread holds the global mutex while blocking. This sort of investigation needs to be done manually today but it might be possible to add perf event handlers to watch for global mutex contention inside QEMU and automatically identify the culprit. Stalls inside the kvm kernel module can also be investigated since kvm:kvm_exit events are triggered when they happen too. I wanted to share in case it is useful for others. Suggestions for better approaches welcome! Signed-off-by: Stefan Hajnoczi stefa...@redhat.com --- #!/usr/bin/env python # perf script event handlers, generated by perf script -g python # Licensed under the terms of the GNU GPL License version 2 # Script to print steal times longer than a given threshold # # To collect trace data: # $ perf record -a -e kvm:kvm_entry -e kvm:kvm_exit # # To print results from trace data: # # $ perf script -s /absolute/path/to/stealtime.py # 100466.1062486786 9690: steal time 0.029318914 secs, # exit_reason IO_INSTRUCTION, # guest_rip 0x81278f02, # isa 1, info1 0xcf80003, info2 0x0 # # The example above shows an I/O access to 0xcf8 (PCI Configuration Space # Address port) that took about 28 milliseconds. The host pid was 9690; this # can be used to investigate the QEMU vcpu thread. 
The guest rip can be used # to investigate guest code that triggered this vmexit. # Print steal times longer than this threshold in milliseconds THRESHOLD_MS = 100 import os import sys sys.path.append(os.environ['PERF_EXEC_PATH'] + \ '/scripts/python/Perf-Trace-Util/lib/Perf/Trace') from perf_trace_context import * from Core import * vcpu_threads = {} def trace_begin(): print 'argv:', str(sys.argv) def trace_end(): pass def kvm__kvm_exit(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, exit_reason, guest_rip, isa, info1, info2): if common_pid in vcpu_threads: last = vcpu_threads[common_pid] assert last[0] == 'kvm__kvm_entry' while last[2] > common_nsecs: common_secs -= 1 common_nsecs += 1 * 1000 * 1000 * 1000 delta_secs = common_secs - last[1] delta_nsecs = common_nsecs - last[2] vcpu_threads[common_pid] = (event_name, common_secs, common_nsecs,
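The timestamp arithmetic at the heart of the script above can be pulled out into a standalone (Python 3) helper; names here are illustrative, not part of the posted script:

```python
# Normalize the nanosecond borrow between a kvm_entry timestamp and a
# kvm_exit timestamp, then compute the elapsed ("steal") time and
# compare it against a millisecond threshold.
NSEC_PER_SEC = 1000 * 1000 * 1000

def steal_delta(entry_secs, entry_nsecs, exit_secs, exit_nsecs):
    """Return (secs, nsecs) elapsed between kvm_entry and kvm_exit."""
    while entry_nsecs > exit_nsecs:   # borrow whole seconds as needed
        exit_secs -= 1
        exit_nsecs += NSEC_PER_SEC
    return exit_secs - entry_secs, exit_nsecs - entry_nsecs

def over_threshold(secs, nsecs, threshold_ms):
    """True when the delta exceeds the reporting threshold."""
    return secs * 1000.0 + nsecs / 1e6 > threshold_ms
```

For example, an entry at 10.9s and an exit at 11.1s yields a 200ms delta, which exceeds the script's default THRESHOLD_MS of 100 and would be flagged.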
Re: [PATCH 6/6] Revert KVM: MMU: split kvm_mmu_free_page
On Wed, Jan 30, 2013 at 04:45:05PM +0200, Gleb Natapov wrote: This reverts commit bd4c86eaa6ff10abc4e00d0f45d2a28b10b09df4. There is no user of kvm_mmu_isolate_page() any more. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/mmu.c | 21 +++-- Applied all except this one - Takuya's using it.
Re: [PATCH 0/2] KVM: set_memory_region: Cleanup and new restriction
On Thu, Jan 31, 2013 at 12:15:01PM +0200, Gleb Natapov wrote: On Wed, Jan 30, 2013 at 07:38:37PM +0900, Takuya Yoshikawa wrote: Patch 1: just rebased for this series. Patch 2: an API change, so please let me know if you notice any problems. Takuya Yoshikawa (2): KVM: set_memory_region: Identify the requested change explicitly KVM: set_memory_region: Disallow changing read-only attribute later Documentation/virtual/kvm/api.txt | 12 ++-- virt/kvm/kvm_main.c | 95 + 2 files changed, 60 insertions(+), 47 deletions(-) Reviewed-by: Gleb Natapov g...@redhat.com Applied, thanks.
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Thu, Jan 31, 2013 at 03:55:56PM +0200, Gleb Natapov wrote: On Thu, Jan 31, 2013 at 11:44:43AM -0200, Marcelo Tosatti wrote: On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote: On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote: On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: Posted interrupt patch: 2) Must move the IN_GUEST_MODE assignment after local_irq_disable, in the vcpu_enter_guest function. Otherwise, with cpu0 injecting into vcpu1 on cpu1: vcpu->mode = IN_GUEST_MODE; if IN_GUEST_MODE == true, send IPI; local_irq_disable; PIR not transferred to VIRR, misses interrupt. cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after local_irq_disable() by the ->requests check. Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose of posted interrupts. You want: if vcpu in guest mode, send posted interrupt IPI; else KVM_REQ_EVENT+kick. I am thinking: set KVM_REQ_EVENT; if pi is enabled, send posted interrupt IPI; else kick. KVM_REQ_EVENT must be after sending the posted interrupt IPI. Otherwise on the vcpu entry side: test_and_clear(KVM_REQ_EVENT) { no bits set in PIR }. It should be after updating PIR, but before sending the posted interrupt IPI. Otherwise, cpu0 vs cpu1 (vcpu): KVM_REQ_EVENT is not set; set pir; send IPI; irq_disable(); ->requests is empty; set KVM_REQ_EVENT. That's the same sequence as with the IRR update, KVM_REQ_EVENT and kick today. Can only send the IPI if vcpu->mode == IN_GUEST_MODE, which must be set after the interrupt flag is cleared as noted. Also, KVM_REQ_EVENT is processed outside of the interrupt-disabled region today. Or maybe I don't get what you say... write a complete description? What about item 4 below? That's for Yang to answer :) If more than one interrupt is generated with the same vector number, the local APIC can set the bit for the vector both in the IRR and ISR.
This means that for the Pentium 4 and Intel Xeon processors, the IRR and ISR can queue two interrupts for each interrupt vector: one in the IRR and one in the ISR. Any additional interrupts issued for the same interrupt vector are collapsed into the single bit in the IRR. Which would mean KVM emulation differs... Eventually 3 interrupts can be queued: one in IRR, one in ISR, one in PIR. Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour?
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: Posted interrupt patch: 2) Must move the IN_GUEST_MODE assignment after local_irq_disable, in the vcpu_enter_guest function. Otherwise, with cpu0 injecting into vcpu1 on cpu1: vcpu->mode = IN_GUEST_MODE; if IN_GUEST_MODE == true, send IPI; local_irq_disable; PIR not transferred to VIRR, misses interrupt. cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after local_irq_disable() by the ->requests check. Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose of posted interrupts. You want: if vcpu in guest mode, send posted interrupt IPI; else KVM_REQ_EVENT+kick. 3) Must check the outstanding PIR notification bit unconditionally on every VM-entry, because: 1. local_irq_disable 2. vcpu->mode = IN_GUEST_MODE 3. vmenter 4. vmexit 5. vcpu->mode = OUTSIDE_GUEST_MODE. If the PIR-IPI-interrupt is sent between an event which triggers VM-exit (for example, an external interrupt due to a device) and step 5 (assignment of vcpu->mode), the PIR->VIRR transfer before vmentry must be made. Not sure I understand, but I think KVM_REQ_EVENT will cover that too. See above. 4) Today, an interrupt notification is cached in IRR until it is delivered - further interrupt injection does not generate further interrupt notification bits. With PIR, the behaviour changes: it is possible to have one bit in PIR and another in the IRR APIC page (if the timing is right). Is this harmless? Why? -- Gleb.
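The injection-side ordering being argued over in this thread can be sketched as a simplified single-threaded model (hypothetical Python; the real code is the lapic/vmx paths in arch/x86/kvm, and this ignores the cross-CPU memory-ordering details that make the real problem hard):

```python
# Model of the injection ordering discussed above: publish the PIR bit
# first, then set KVM_REQ_EVENT, and send the posted-interrupt IPI only
# when the vcpu is already in guest mode; otherwise the entry path's
# ->requests check picks the event up without an IPI.
IN_GUEST_MODE, OUTSIDE_GUEST_MODE = 'in', 'out'

class Vcpu:
    def __init__(self):
        self.mode = OUTSIDE_GUEST_MODE
        self.pir = set()        # posted-interrupt request bits
        self.requests = set()   # pending request flags
        self.ipis = 0           # posted-interrupt IPIs sent

def inject(vcpu, vector):
    vcpu.pir.add(vector)                 # 1. publish the interrupt bit
    vcpu.requests.add('KVM_REQ_EVENT')   # 2. then request event processing
    if vcpu.mode == IN_GUEST_MODE:       # 3. IPI only if already in guest
        vcpu.ipis += 1
```

Setting KVM_REQ_EVENT before step 1 would be the broken order: the entry path could test_and_clear the request, find no PIR bits yet, and miss the interrupt until the next exit.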
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote: On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote: On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: Posted interrupt patch: 2) Must move the IN_GUEST_MODE assignment after local_irq_disable, in the vcpu_enter_guest function. Otherwise, with cpu0 injecting into vcpu1 on cpu1: vcpu->mode = IN_GUEST_MODE; if IN_GUEST_MODE == true, send IPI; local_irq_disable; PIR not transferred to VIRR, misses interrupt. cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after local_irq_disable() by the ->requests check. Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose of posted interrupts. You want: if vcpu in guest mode, send posted interrupt IPI; else KVM_REQ_EVENT+kick. I am thinking: set KVM_REQ_EVENT; if pi is enabled, send posted interrupt IPI; else kick. KVM_REQ_EVENT must be after sending the posted interrupt IPI. Otherwise on the vcpu entry side: test_and_clear(KVM_REQ_EVENT) { no bits set in PIR }. What about item 4 below? 3) Must check the outstanding PIR notification bit unconditionally on every VM-entry, because: 1. local_irq_disable 2. vcpu->mode = IN_GUEST_MODE 3. vmenter 4. vmexit 5. vcpu->mode = OUTSIDE_GUEST_MODE. If the PIR-IPI-interrupt is sent between an event which triggers VM-exit (for example, an external interrupt due to a device) and step 5 (assignment of vcpu->mode), the PIR->VIRR transfer before vmentry must be made. Not sure I understand, but I think KVM_REQ_EVENT will cover that too. See above. 4) Today, an interrupt notification is cached in IRR until it is delivered - further interrupt injection does not generate further interrupt notification bits. With PIR, the behaviour changes: it is possible to have one bit in PIR and another in the IRR APIC page (if the timing is right). Is this harmless? Why? -- Gleb.
Re: windows 2008 guest causing rcu_shed to emit NMI
On Wed, Jan 30, 2013 at 11:21:08AM +0300, Andrey Korolyov wrote: On Wed, Jan 30, 2013 at 3:15 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Tue, Jan 29, 2013 at 02:35:02AM +0300, Andrey Korolyov wrote: On Mon, Jan 28, 2013 at 5:56 PM, Andrey Korolyov and...@xdel.ru wrote: On Mon, Jan 28, 2013 at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Jan 28, 2013 at 12:04:50AM +0300, Andrey Korolyov wrote: On Sat, Jan 26, 2013 at 12:49 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Fri, Jan 25, 2013 at 10:45:02AM +0300, Andrey Korolyov wrote: On Thu, Jan 24, 2013 at 4:20 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Thu, Jan 24, 2013 at 01:54:03PM +0300, Andrey Korolyov wrote: Thank you Marcelo, The host node is locking up somewhat later than yesterday, but the problem is still here; please see the attached dmesg. The stuck process looks like root 19251 0.0 0.0 228476 12488 ? D 14:42 0:00 /usr/bin/kvm -no-user-config -device ? -device pci-assign,? -device virtio-blk-pci,? -device on the fourth vm by count. Should I try an upstream kernel instead of applying the patch to the latest 3.4, or is that useless? If you can upgrade to an upstream kernel, please do that. With vanilla 3.7.4 there are almost no changes, and the NMI started firing again. The external symptoms look like the following: starting from some count, maybe the third or sixth vm, the qemu-kvm process allocates its memory very slowly and in jumps, 20M-200M-700M-1.6G in minutes. The patch helps, of course - on both patched 3.4 and vanilla 3.7 I`m able to kill stuck kvm processes and the node returns to normal, whereas on 3.2 sending SIGKILL to the process causes zombies and a hung ``ps'' output (the problem and a workaround when no scheduler is involved are described here: http://www.spinics.net/lists/kvm/msg84799.html). Try disabling pause loop exiting with the ple_gap=0 kvm-intel.ko module parameter. Hi Marcelo, thanks, this parameter helped to increase the number of working VMs by half an order of magnitude, from 3-4 to 10-15.
Very high SY load, 10 to 15 percent, persists at such numbers for a long time, whereas Linux guests in the same configuration do not jump over one percent even under a stress bench. After I disabled HT, the crash happens only in long runs and now it is a kernel panic :) The stair-like memory allocation behaviour disappeared, but another symptom leading to the crash, which I had not noted previously, persists: if the VM count is ``enough'' for a crash, some qemu processes start to eat one core, and they`ll panic the system after running tens of minutes in such a state, or if I try to attach a debugger to one of them. If needed, I can log the entire crash output via netconsole; for now I have some tail, almost the same every time: http://xdel.ru/downloads/btwin.png Yes, please log the entire crash output, thanks. Here please, 3.7.4-vanilla, 16 vms, ple_gap=0: http://xdel.ru/downloads/oops-default-kvmintel.txt Just an update: I was able to reproduce that on pure Linux VMs using qemu-1.3.0 and the ``stress'' benchmark running on them - the panic occurs at the start of a vm (with a count of ten working machines at the moment). Qemu-1.1.2 generally is not able to reproduce that, but a host node with the older version crashes on a smaller number of Windows VMs (three to six instead of ten to fifteen) than with 1.3; please see the trace below: http://xdel.ru/downloads/oops-old-qemu.txt Single bit memory error, apparently. Try: 1. memtest86. 2. Boot with the slub_debug=ZFPU kernel parameter. 3.
Reproduce on a different machine. Hi Marcelo, I always follow the rule - if some weird bug exists, check it on an ECC-enabled machine and check the IPMI logs too before starting to complain :) I have finally managed to ``fix'' the problem, but my solution seems a bit strange: - I noticed that if virtual machines are started without any cgroup settings they will not cause this bug under any conditions, - I thought, very wrongly in my mind, that CONFIG_SCHED_AUTOGROUP should regroup only the tasks without any cgroup and should not touch tasks already inside an existing cpu cgroup. A first look at the 200-line patch shows that the autogrouping always applies to all tasks, so I tried to disable it, - wild magic appeared - the VMs didn`t crash the host any more; even at a count of 30+ they work fine. I still don`t know what exactly triggered that and whether I will face it again under different conditions, so my solution is more likely a patch of mud in the wall of the dam instead of a proper fix. There seem to be two possible origins of such an error - a very very hideous race condition involving cgroups and processes like qemu-kvm causing frequent context switches, or a simple incompatibility between NUMA, the logic of CONFIG_SCHED_AUTOGROUP and qemu VMs already doing
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Fri, Jan 25, 2013 at 12:40:21AM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-01-25: On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com Posted Interrupt allows APIC interrupts to inject into guest directly without any vmexit. - When delivering a interrupt to guest, if target vcpu is running, update Posted-interrupt requests bitmap and send a notification event to the vcpu. Then the vcpu will handle this interrupt automatically, without any software involvemnt. - If target vcpu is not running or there already a notification event pending in the vcpu, do nothing. The interrupt will be handled by next vm entry. Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/include/asm/entry_arch.h |1 + arch/x86/include/asm/hw_irq.h |1 + arch/x86/include/asm/irq.h |1 + arch/x86/include/asm/irq_vectors.h |4 + arch/x86/include/asm/kvm_host.h|3 + arch/x86/include/asm/vmx.h |4 + arch/x86/kernel/entry_64.S |2 + arch/x86/kernel/irq.c | 25 +++ arch/x86/kernel/irqinit.c |2 + arch/x86/kvm/lapic.c | 16 +++- arch/x86/kvm/lapic.h |1 + arch/x86/kvm/vmx.c | 133 +--- 12 files changed, 180 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h index 40afa00..7b0a29e 100644 --- a/arch/x86/include/asm/entry_arch.h +++ b/arch/x86/include/asm/entry_arch.h @@ -18,6 +18,7 @@ BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR) #endif BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR) +BUILD_INTERRUPT(posted_intr_ipi, POSTED_INTR_VECTOR) /* * every pentium local APIC has two 'local interrupts', with a diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h index eb92a6e..ee61af3 100644 --- a/arch/x86/include/asm/hw_irq.h +++ b/arch/x86/include/asm/hw_irq.h @@ -28,6 +28,7 @@ /* Interrupt handlers registered during init_IRQ */ extern void apic_timer_interrupt(void); extern void x86_platform_ipi(void); +extern void posted_intr_ipi(void); extern void 
error_interrupt(void);
 extern void irq_work_interrupt(void);
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index ba870bb..cff9933 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -30,6 +30,7 @@ extern void irq_force_complete_move(int);
 #endif
 extern void (*x86_platform_ipi_callback)(void);
+extern void (*posted_intr_callback)(void);
 extern void native_init_IRQ(void);
 extern bool handle_irq(unsigned irq, struct pt_regs *regs);
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 1508e51..8f2e383 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -102,6 +102,10 @@
  */
 #define X86_PLATFORM_IPI_VECTOR		0xf7

+#ifdef CONFIG_HAVE_KVM
+#define POSTED_INTR_VECTOR		0xf2
+#endif
+
 /*
  * IRQ work vector:
  */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7e26d1a..82423a8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -700,6 +700,9 @@ struct kvm_x86_ops {
 	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
 	void (*update_irq)(struct kvm_vcpu *vcpu);
 	void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector, bool set);
+	int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
+	int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
+	void (*update_irr)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*get_tdp_level)(void);
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 1003341..7b9e1d0 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -152,6 +152,7 @@
 #define PIN_BASED_EXT_INTR_MASK			0x00000001
 #define PIN_BASED_NMI_EXITING			0x00000008
 #define PIN_BASED_VIRTUAL_NMIS			0x00000020
+#define PIN_BASED_POSTED_INTR			0x00000080

 #define VM_EXIT_SAVE_DEBUG_CONTROLS		0x00000002
 #define VM_EXIT_HOST_ADDR_SPACE_SIZE		0x00000200
@@ -174,6 +175,7 @@
 /* VMCS Encodings */
 enum vmcs_field {
 	VIRTUAL_PROCESSOR_ID		= 0x00000000,
+	POSTED_INTR_NV			= 0x00000002,
 	GUEST_ES_SELECTOR		= 0x00000800,
 	GUEST_CS_SELECTOR		= 0x00000802,
 	GUEST_SS_SELECTOR		= 0x00000804,
@@ -208,6 +210,8 @@ enum vmcs_field {
 	VIRTUAL_APIC_PAGE_ADDR_HIGH	= 0x00002013,
 	APIC_ACCESS_ADDR
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote:
On Fri, Jan 25, 2013 at 12:40:21AM +0000, Zhang, Yang Z wrote:
Marcelo Tosatti wrote on 2013-01-25:
On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote:

From: Yang Zhang yang.z.zh...@intel.com

Posted Interrupt allows APIC interrupts to be injected into the guest directly, without any vmexit.

- When delivering an interrupt to the guest, if the target vcpu is running, update the Posted-interrupt requests bitmap and send a notification event to the vcpu. The vcpu will then handle this interrupt automatically, without any software involvement.
- If the target vcpu is not running, or a notification event is already pending in the vcpu, do nothing. The interrupt will be handled at the next vm entry.

Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/x86/include/asm/entry_arch.h  |    1 +
 arch/x86/include/asm/hw_irq.h      |    1 +
 arch/x86/include/asm/irq.h         |    1 +
 arch/x86/include/asm/irq_vectors.h |    4 +
 arch/x86/include/asm/kvm_host.h    |    3 +
 arch/x86/include/asm/vmx.h         |    4 +
 arch/x86/kernel/entry_64.S         |    2 +
 arch/x86/kernel/irq.c              |   25 +++
 arch/x86/kernel/irqinit.c          |    2 +
 arch/x86/kvm/lapic.c               |   16 +++-
 arch/x86/kvm/lapic.h               |    1 +
 arch/x86/kvm/vmx.c                 |  133 +---
 12 files changed, 180 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index 40afa00..7b0a29e 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -18,6 +18,7 @@ BUILD_INTERRUPT(reboot_interrupt, REBOOT_VECTOR)
 #endif
 BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
+BUILD_INTERRUPT(posted_intr_ipi, POSTED_INTR_VECTOR)

 /*
  * every pentium local APIC has two 'local interrupts', with a
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index eb92a6e..ee61af3 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -28,6 +28,7 @@
 /* Interrupt handlers registered during init_IRQ */
 extern void apic_timer_interrupt(void);
 extern void x86_platform_ipi(void);
+extern void posted_intr_ipi(void);
 extern void error_interrupt(void);
 extern void irq_work_interrupt(void);
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index ba870bb..cff9933 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -30,6 +30,7 @@ extern void irq_force_complete_move(int);
 #endif
 extern void (*x86_platform_ipi_callback)(void);
+extern void (*posted_intr_callback)(void);
 extern void native_init_IRQ(void);
 extern bool handle_irq(unsigned irq, struct pt_regs *regs);
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 1508e51..8f2e383 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -102,6 +102,10 @@
  */
 #define X86_PLATFORM_IPI_VECTOR		0xf7

+#ifdef CONFIG_HAVE_KVM
+#define POSTED_INTR_VECTOR		0xf2
+#endif
+
 /*
  * IRQ work vector:
  */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7e26d1a..82423a8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -700,6 +700,9 @@ struct kvm_x86_ops {
 	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
 	void (*update_irq)(struct kvm_vcpu *vcpu);
 	void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector, bool set);
+	int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
+	int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
+	void (*update_irr)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*get_tdp_level)(void);
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 1003341..7b9e1d0 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -152,6 +152,7 @@
 #define PIN_BASED_EXT_INTR_MASK			0x00000001
 #define PIN_BASED_NMI_EXITING			0x00000008
 #define PIN_BASED_VIRTUAL_NMIS			0x00000020
+#define PIN_BASED_POSTED_INTR			0x00000080

 #define VM_EXIT_SAVE_DEBUG_CONTROLS		0x00000002
 #define VM_EXIT_HOST_ADDR_SPACE_SIZE		0x00000200
@@ -174,6 +175,7 @@
 /* VMCS Encodings */
 enum vmcs_field {
 	VIRTUAL_PROCESSOR_ID		= 0x00000000,
+	POSTED_INTR_NV			= 0x00000002,
 	GUEST_ES_SELECTOR		= 0x00000800
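[Editor's sketch] The delivery rules in the changelog above — post the vector, and send the notification IPI only if the vcpu is running and no notification is already outstanding — can be modeled in plain userspace C. All structure layouts and names here (pi_desc, pir, on, sync_pir_to_irr) are illustrative stand-ins, not the actual VT-d/KVM definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative model of a posted-interrupt descriptor: a 256-bit
 * Posted-Interrupt Request (PIR) bitmap plus an Outstanding
 * Notification (ON) flag. */
struct pi_desc {
	uint64_t pir[4];	/* one bit per vector, 0..255 */
	bool on;		/* notification IPI already outstanding? */
};

static int ipis_sent;	/* stands in for actually sending the IPI */

static void send_notification_ipi(void) { ipis_sent++; }

/* Deliver 'vector' to a vcpu: always record it in the PIR; send the
 * notification only when the vcpu is running and none is pending yet.
 * Otherwise the interrupt is picked up at the next VM entry. */
static void deliver_posted_interrupt(struct pi_desc *pi, int vector,
				     bool vcpu_running)
{
	pi->pir[vector / 64] |= 1ULL << (vector % 64);
	if (vcpu_running && !pi->on) {
		pi->on = true;
		send_notification_ipi();
	}
}

/* At the next opportunity the PIR is folded into the virtual IRR
 * and cleared, and the ON flag is reset. */
static void sync_pir_to_irr(struct pi_desc *pi, uint64_t irr[4])
{
	for (int i = 0; i < 4; i++) {
		irr[i] |= pi->pir[i];
		pi->pir[i] = 0;
	}
	pi->on = false;
}
```

Note how a second interrupt arriving while ON is set costs no extra IPI — that is the "do nothing" case in the changelog.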
Re: [PATCH v2 06/12] KVM: MMU: introduce a static table to map guest access to spte access
On Mon, Jan 28, 2013 at 10:07:15PM -0200, Marcelo Tosatti wrote:
On Fri, Jan 25, 2013 at 10:46:31AM +0800, Xiao Guangrong wrote:
On 01/25/2013 08:15 AM, Marcelo Tosatti wrote:
On Wed, Jan 23, 2013 at 06:07:20PM +0800, Xiao Guangrong wrote:

It makes set_spte cleaner and reduces branch mispredictions.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   37 ++---
 1 files changed, 26 insertions(+), 11 deletions(-)

I don't see set_spte as being a performance problem; IMO the current code is quite simple.

Yes, this is not a performance problem. I just dislike the many consecutive if-s in the function:

if (xxx)
	xxx
if (xxx)
	xxx

In total, it has 7 if-s before this patch. Okay, if you think this is unnecessary, I will drop this patch. :)

Yes, please (unless you can show set_spte is a performance problem). Same thing for spte fast drop: is it a performance problem?

Please try to group changes into smaller, less controversial sets with a clear goal:
- Debated performance improvement.
- Cleanups (eg mmu_set_spte argument removal).
- Bug fixes.
- Performance improvements.

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
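[Editor's sketch] The cleanup being debated replaces a chain of conditionals with a small table indexed by the guest access bits, built once at init. The access bits and spte bits below are illustrative, not the actual KVM spte encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guest access bits and spte bits for illustration. */
#define ACC_EXEC	1u
#define ACC_WRITE	2u
#define ACC_USER	4u

#define SPTE_NX		(1u << 0)
#define SPTE_W		(1u << 1)
#define SPTE_U		(1u << 2)

/* Style before the patch: one branch per access bit inside set_spte. */
static uint32_t access_to_spte_if(uint32_t acc)
{
	uint32_t spte = 0;

	if (!(acc & ACC_EXEC))
		spte |= SPTE_NX;
	if (acc & ACC_WRITE)
		spte |= SPTE_W;
	if (acc & ACC_USER)
		spte |= SPTE_U;
	return spte;
}

/* Style proposed by the patch: precompute all 8 combinations so the
 * hot path is a single table lookup with no branches. */
static uint32_t spte_access_table[8];

static void build_table(void)
{
	for (uint32_t acc = 0; acc < 8; acc++)
		spte_access_table[acc] = access_to_spte_if(acc);
}
```

Both forms compute the same mapping; the disagreement in the thread is only about whether removing the branches is worth the indirection when set_spte is not a measured hotspot.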
Re: [PATCH v2 03/12] KVM: MMU: simplify mmu_set_spte
On Tue, Jan 29, 2013 at 10:55:24AM +0800, Xiao Guangrong wrote:
On 01/29/2013 08:21 AM, Marcelo Tosatti wrote:
On Wed, Jan 23, 2013 at 06:05:29PM +0800, Xiao Guangrong wrote:

In order to detect spte remapping, we can simply check whether the spte already points to the pfn, even if the spte is not the last spte, since a middle spte points to a kernel pfn which cannot be mapped to userspace.

This check detects spte overwrite, when a large spte is replaced by a pointer to an spte table. I can't see why checking for a different pfn is safe: only 'int level' can differ, with the pfn being equivalent, for example.

The 'u64 *sptep' must be on the level we want to set, that is: page_header(__pa(sptep))->role.level == level.

Right, then stick a comment there noting which cases that condition handles. Keep the current comment and add more. We discussed this before :), the discussion can be found at:
http://marc.info/?l=kvm&m=135345057329427&w=2

Note http://marc.info/?l=kvm&m=135345059929436&w=2, please take it into account in the future.
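[Editor's sketch] The remap-detection argument above — comparing the old spte's pfn against the new pfn is safe because a middle (non-leaf) spte can only point at a kernel page-table page — can be reduced to a few lines. The spte layout here is simplified and hypothetical (bit 0 = present, pfn from bit 12), not KVM's real encoding:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PFN_SHIFT	12
#define SPTE_PRESENT	(1ULL << 0)

/* Simplified spte: present bit plus a pfn field. */
static uint64_t make_spte(uint64_t pfn)
{
	return SPTE_PRESENT | (pfn << PFN_SHIFT);
}

static uint64_t spte_to_pfn(uint64_t spte) { return spte >> PFN_SHIFT; }
static bool is_present(uint64_t spte)      { return spte & SPTE_PRESENT; }

/* The simplification under discussion: when overwriting a present
 * spte, a matching pfn means this is a remap of the same page.
 * A middle spte's pfn is a kernel page-table page, which can never
 * equal a pfn being mapped for userspace, so the comparison does not
 * need a separate is-last-spte check first. */
static bool is_remap(uint64_t old_spte, uint64_t new_pfn)
{
	return is_present(old_spte) && spte_to_pfn(old_spte) == new_pfn;
}
```

Marcelo's counterpoint in the thread is about the other direction: a differing pfn alone does not prove the overwrite is safe, which is why the level invariant on sptep matters.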
Re: [PATCH] kvm: Fix irqfd resampler list walk
On Mon, Jan 28, 2013 at 08:25:00AM -0700, Alex Williamson wrote:
On Mon, 2012-12-10 at 18:16 -0200, Marcelo Tosatti wrote:
On Thu, Dec 06, 2012 at 02:44:59PM -0700, Alex Williamson wrote:

A typo in the next pointer means we're walking random data here.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Cc: sta...@vger.kernel.org [3.7]
---
Not sure if this will make 3.7, so preemptively adding the stable flag.

 virt/kvm/eventfd.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Applied, thanks.

Hi Marcelo,

This didn't seem to make it into 3.7.0 or any stable 3.7. Can we promote it for stable? Thanks,

Alex

Yes, please send the patch, with the backport you're interested in having included, to sta...@kernel.org.
http://www.kernel.org/doc/Documentation/stable_kernel_rules.txt
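[Editor's sketch] The bug class being fixed — a list walk advancing through the wrong member, so the cursor steps into unrelated memory — is easiest to see against a correct kernel-style intrusive list walk. Everything below (struct irqfd, resampler_link, the gsi field) is an illustrative stand-in, not the actual virt/kvm/eventfd.c code:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal kernel-style intrusive list. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct irqfd {			/* hypothetical stand-in */
	int gsi;
	struct list_head resampler_link;
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Correct walk: the cursor advances via resampler_link, the SAME
 * member the entries were chained on by list_add. The fixed typo
 * effectively advanced via a different list_head, so after the first
 * step the cursor pointed into memory that was never part of this
 * list -- "walking random data". */
static int sum_gsi(struct list_head *head)
{
	int sum = 0;

	for (struct list_head *p = head->next; p != head; p = p->next)
		sum += container_of(p, struct irqfd, resampler_link)->gsi;
	return sum;
}
```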
Re: [PATCH v13 0/3] x86, apicv: Add APIC virtualization support
On Fri, Jan 25, 2013 at 10:18:48AM +0800, Yang Zhang wrote:
From: Yang Zhang yang.z.zh...@intel.com

APIC virtualization is a new feature which can eliminate most VM exits when a vcpu handles an interrupt:

APIC register virtualization: APIC read accesses don't cause APIC-access VM exits; APIC writes become trap-like.

Virtual interrupt delivery: virtual interrupt delivery avoids KVM having to inject vAPIC interrupts manually; it is fully taken care of by the hardware.

The first patch adds APIC register virtualization support. The second patch adds virtual x2apic mode support, since it is required by APICv when the guest uses the msr-based way to access the APIC. The third patch adds virtual interrupt delivery support. Please refer to Intel SDM volume 3, chapter 29 for more details.

Changes v12 to v13:
* Remove unnecessary check when setting virtualized apic access.
* Use vm_need_tpr_shadow() instead of reading the vmcs to check the tpr.
* Check irqchip_in_kernel when setting the msr bitmap.
* Correct comment format.
* Remove unnecessary callback when setting the eoi exit bitmap.
* Disable vid when the irqchip is in userspace.
* Rename vmx_vcpu_has_apicv to vmx_vm_has_apicv.
* Rebased on top of KVM upstream.

Changes v11 to v12:
* Check irqchip in kernel when enabling apicv; if using a userspace irq chip, apicv cannot be used and must be disabled.
* Rename some functions to more descriptive names.
* Move ioapic entry parse logic to lapic.c.
* Rebased on top of KVM upstream.

Changes v10 to v11:
* Use two new msr bitmaps for guests that enable x2apic mode: since the msr bitmap is shared by all guests, updating the global msr bitmap would break guests not using x2apic. To solve this, we use two new msr bitmaps for guests using x2apic.

Changes v9 to v10:
* Enable virtualized x2apic mode only when the guest is using x2apic and apicv: there is no point in enabling x2apic mode when apicv is disabled.
* Grab ioapic_lock when traversing ioapic entries to set the eoi exit bitmap.
* Rebased on top of KVM upstream.

Changes v8 to v9:
* Update the eoi exit bitmap from the vcpu itself.
* Enable virtualized x2apic mode when the guest is using x2apic.
* Rebased on top of KVM upstream.

Changes v7 to v8:
* According to Marcelo's suggestion, add comments for irr_pending and isr_count, since the two variables have different meanings when using apicv.
* Set the highest bit in vISR to SVI after migration.
* Use a spinlock to access the eoi exit bitmap synchronously.
* Enable virtualized x2apic mode when the guest is using x2apic.
* Rebased on top of KVM upstream.

Yang Zhang (3):
  x86, apicv: add APICv register virtualization support
  x86, apicv: add virtual x2apic support
  x86, apicv: add virtual interrupt delivery support

 arch/ia64/kvm/lapic.h           |    6 +
 arch/x86/include/asm/kvm_host.h |    6 +
 arch/x86/include/asm/vmx.h      |   14 ++
 arch/x86/kvm/irq.c              |   56 ++-
 arch/x86/kvm/lapic.c            |  140 +
 arch/x86/kvm/lapic.h            |   34 
 arch/x86/kvm/svm.c              |   24 +++
 arch/x86/kvm/vmx.c              |  327 ---
 arch/x86/kvm/x86.c              |   23 +++-
 include/linux/kvm_host.h        |    3 +
 virt/kvm/ioapic.c               |   39 +
 virt/kvm/ioapic.h               |    4 +
 virt/kvm/irq_comm.c             |   25 +++
 virt/kvm/kvm_main.c             |    5 +
 14 files changed, 647 insertions(+), 59 deletions(-)

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
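[Editor's sketch] The v11-to-v12 change — two msr bitmaps instead of one shared, mutated bitmap — is a per-vcpu selection problem. A toy model, with a 1-byte "bitmap" and hypothetical names standing in for the real per-MSR structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy msr bitmap: one bit per MSR group; a set bit means "intercept". */
typedef uint8_t msr_bitmap_t;

#define MSR_X2APIC_BIT	(1u << 0)	/* stands in for the x2apic MSR range */

static msr_bitmap_t bitmap_legacy;	/* shared by guests not in x2apic mode */
static msr_bitmap_t bitmap_x2apic;	/* only for guests in x2apic mode */

static void init_bitmaps(void)
{
	bitmap_legacy = MSR_X2APIC_BIT;	/* legacy guests: intercept x2apic MSRs */
	bitmap_x2apic = 0;		/* x2apic guests: pass them through */
}

/* Each vcpu picks the bitmap matching its mode. The pre-v12 bug was
 * editing the single shared bitmap instead, which flipped interception
 * for every guest at once. */
static msr_bitmap_t select_bitmap(bool guest_uses_x2apic)
{
	return guest_uses_x2apic ? bitmap_x2apic : bitmap_legacy;
}
```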
Re: [PATCH v2 03/12] KVM: MMU: simplify mmu_set_spte
On Wed, Jan 23, 2013 at 06:05:29PM +0800, Xiao Guangrong wrote:

In order to detect spte remapping, we can simply check whether the spte already points to the pfn, even if the spte is not the last spte, since a middle spte points to a kernel pfn which cannot be mapped to userspace.

This check detects spte overwrite, when a large spte is replaced by a pointer to an spte table. I can't see why checking for a different pfn is safe: only 'int level' can differ, with the pfn being equivalent, for example.
Re: [PATCH v2 06/12] KVM: MMU: introduce a static table to map guest access to spte access
On Fri, Jan 25, 2013 at 10:46:31AM +0800, Xiao Guangrong wrote:
On 01/25/2013 08:15 AM, Marcelo Tosatti wrote:
On Wed, Jan 23, 2013 at 06:07:20PM +0800, Xiao Guangrong wrote:

It makes set_spte cleaner and reduces branch mispredictions.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   37 ++---
 1 files changed, 26 insertions(+), 11 deletions(-)

I don't see set_spte as being a performance problem; IMO the current code is quite simple.

Yes, this is not a performance problem. I just dislike the many consecutive if-s in the function:

if (xxx)
	xxx
if (xxx)
	xxx

In total, it has 7 if-s before this patch. Okay, if you think this is unnecessary, I will drop this patch. :)

Yes, please (unless you can show set_spte is a performance problem).