Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-06 Thread Marcelo Tosatti
On Wed, Mar 06, 2013 at 10:29:12AM -0600, Michael Wolf wrote:
 On Wed, 2013-03-06 at 12:13 +0400, Glauber Costa wrote:
  On 03/06/2013 05:41 AM, Marcelo Tosatti wrote:
   On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
   Sorry for the delay in the response.  I did not see the email
   right away.
  
   On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
   On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
   2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:
    In the case where you have a system that is running in a
    capped or overcommitted environment, the user may see steal time
    being reported in accounting tools such as top or vmstat.  This can
    cause confusion for the end user.
  
   Sorry, I'm no expert in this area. But I don't really understand what
   is confusing for the end user here.
  
   I suppose that what is wanted is to subtract stolen time due to 'known
   reasons' from steal time reporting. 'Known reasons' being, for example,
   hard caps. So a vcpu executing instructions with no halt, but limited to
   80% of available bandwidth, would not have 20% of stolen time reported.
  
    Yes, exactly, and many times the end user did not set up the guest and is
    not aware of the capping.  The end user is only aware of the performance
    level that they were told they would get with the guest.
   But yes, a description of the scenario that is being dealt with, with
   details, is important.
  
    I will add more detail to the description next time I submit the
    patches.  How about something like, "In a cloud environment the user of a
    kvm guest is not aware of the underlying hardware or how many other
    guests are running on it.  The end user is only aware of a level of
    performance that they should see."  Or does that just muddy the picture
    more?
   
    So what the feature aims for is to report stolen time relative to hard
    capping. That is: stolen time should be counted as time stolen from
    the guest _beyond_ hard capping. Yes?
   
   Probably don't need to report new data to the guest for that.
   
   If we take into account that 1 second always has one second, I believe
   that you can just subtract the consigned time from the steal time the
   host passes to the guest.
  
  During each second, the numbers won't sum up to 100. The delta to 100 is
  the consigned time, if anyone cares.
  
  Adopting this would simplify this a lot. All you need to do, really, is
  to get your calculation right from the bandwidth given by the cpu
  controller. Subtract it in the host, and voila.
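
For illustration, the subtraction Glauber describes as a minimal user-space
sketch, assuming the cap is expressed as a CFS quota/period pair; the function
and field names are invented and this is not kernel code (Marcelo's later
objection, that real steal below the cap gets hidden, applies to exactly this
kind of fixed split):

/*
 * Illustrative only: split an observed steal delta into "consigned"
 * (expected, due to a hard cap of quota/period) and remaining steal.
 */
#include <stdio.h>
#include <stdint.h>

struct steal_split { uint64_t consigned_ns; uint64_t steal_ns; };

static struct steal_split split_steal(uint64_t delta_ns,    /* run_delay delta  */
                                      uint64_t elapsed_ns,  /* wall-clock delta */
                                      uint64_t quota_us,
                                      uint64_t period_us)
{
        /* Time the cap is expected to keep the vcpu off the cpu. */
        uint64_t expected_ns = elapsed_ns * (period_us - quota_us) / period_us;
        struct steal_split s;

        s.consigned_ns = delta_ns < expected_ns ? delta_ns : expected_ns;
        s.steal_ns = delta_ns - s.consigned_ns;
        return s;
}

int main(void)
{
        /* 1s elapsed, 80% cap (80ms quota per 100ms period), 300ms of run_delay. */
        struct steal_split s = split_steal(300000000ULL, 1000000000ULL, 80000, 100000);

        printf("consigned %llu ns, steal %llu ns\n",
               (unsigned long long)s.consigned_ns, (unsigned long long)s.steal_ns);
        return 0;
}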
 
 I looked at doing that once but was told that I was changing the
 interface in an unacceptable way, because now I was not reporting all of
 the elapsed time.  I agree it would make things simpler.

Pointer to that claim, please?



Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-06 Thread Marcelo Tosatti
On Wed, Mar 06, 2013 at 10:27:13AM -0600, Michael Wolf wrote:
 On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote:
  On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
   Sorry for the delay in the response.  I did not see the email
   right away.
   
   On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
 2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:
   In the case where you have a system that is running in a
   capped or overcommitted environment, the user may see steal time
   being reported in accounting tools such as top or vmstat.  This can
   cause confusion for the end user.
 
 Sorry, I'm no expert in this area. But I don't really understand what
 is confusing for the end user here.

I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
   
    Yes, exactly, and many times the end user did not set up the guest and is
    not aware of the capping.  The end user is only aware of the performance
    level that they were told they would get with the guest.
But yes, a description of the scenario that is being dealt with, with
details, is important.
   
    I will add more detail to the description next time I submit the
    patches.  How about something like, "In a cloud environment the user of a
    kvm guest is not aware of the underlying hardware or how many other
    guests are running on it.  The end user is only aware of a level of
    performance that they should see."  Or does that just muddy the picture
    more?
  
   So what the feature aims for is to report stolen time relative to hard
   capping. That is: stolen time should be counted as time stolen from
   the guest _beyond_ hard capping. Yes?
 Yes, that is the goal.
  
  Probably don't need to report new data to the guest for that.
 Not sure I understand what you are saying here. Do you mean that I don't
 need to report the expected steal from the guest?  If I don't do that
 then I'm not reporting all of the time and changing /proc/stat in a
 bigger way than adding another category.  Also I thought I would need to
 provide the consigned time and the steal time for debugging purposes.
 Maybe I'm missing your point.

OK so the usefulness of steal time comes from the ability to measure 
CPU cycles that the guest is being deprived of, relative to some unit
(implicitly the CPU frequency presented to the VM). That way, it becomes
easier to properly allocate resources.

From top man page:
st : time stolen from this vm by the hypervisor

Not only is it a problem for the lender; it is also confusing for the user,
who has to subtract the hard capping from the reported steal time himself.
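
For reference, the number the user has to correct by hand is the 8th field of
the "cpu" line in /proc/stat (in USER_HZ ticks), which top turns into the 'st'
percentage. A minimal reader, for illustration only:

#include <stdio.h>

int main(void)
{
        unsigned long long user, nice, sys, idle, iowait, irq, softirq, steal;
        FILE *f = fopen("/proc/stat", "r");

        if (!f)
                return 1;
        /* First line: cpu user nice system idle iowait irq softirq steal ... */
        if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &user, &nice, &sys, &idle, &iowait, &irq, &softirq, &steal) == 8)
                printf("steal: %llu ticks\n", steal);
        fclose(f);
        return 0;
}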


The problem with the algorithm in the patchset is the following
(practical example):

- Hard capping set to 80% of available CPU.
- vcpu does not exceed its threshold, say workload with 40%
CPU utilization.
- Under this scenario it is possible for the vcpu to be deprived
of cycles (because out of the 40% that the workload uses, only 30% of
actual CPU time is being provided).
- The algorithm in this patchset will not report any stolen time
because it assumes the 20% of stolen time reported via 'run_delay'
is fixed at all times (which is false); therefore any valid
stolen time below 20% will not be reported.

Makes sense?
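
For illustration, the arithmetic of the example above as a minimal sketch (all
numbers are percentages of one accounting period; the variable names are
invented and this is not the patchset's code):

#include <stdio.h>

int main(void)
{
        int cap = 80;           /* hard cap: vcpu may use at most 80% */
        int wanted = 40;        /* the workload asks for 40% */
        int provided = 30;      /* the host actually provides 30% */

        int real_steal = wanted - provided;        /* 10%: runnable but not running  */
        int assumed_fixed = 100 - cap;             /* 20%: what the patchset subtracts */
        int reported = real_steal - assumed_fixed; /* goes negative ...              */

        if (reported < 0)
                reported = 0;                      /* ... so nothing is reported */

        printf("real steal %d%%, reported steal %d%%\n", real_steal, reported);
        return 0;
}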

Not sure what the concrete way to report stolen time relative to hard
capping is (probably easier inside the scheduler, where run_delay is
calculated).

Reporting the hard capping to the guest is a good idea (which saves the
user from having to measure it themselves), but it is better done separately
via a new field.




Re: [PATCH v2] KVM: x86: Convert INIT and SIPI signals into synchronously handled requests

2013-03-05 Thread Marcelo Tosatti
On Mon, Mar 04, 2013 at 10:41:43PM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 A VCPU sending INIT or SIPI to some other VCPU races for setting the
 remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED
 was overwritten by kvm_emulate_halt and, thus, got lost.
 
 Fix this by raising requests on the sender side that will then be
 handled synchronously over the target VCPU context.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Why is kvm_emulate_halt being executed from
KVM_MP_STATE_INIT_RECEIVED/KVM_MP_STATE_SIPI_RECEIVED again?

Why is it not true that the only valid transition from
KVM_MP_STATE_HALTED is from KVM_MP_STATE_RUNNABLE?

It would be good for KVM_MP_STATE_HALTED to indicate 
guest executed HLT instruction (which is impossible without INIT/SIPI
being received).



Re: [PATCH v5 0/6] kvm: Make ioeventfd usable on s390.

2013-03-05 Thread Marcelo Tosatti
On Thu, Feb 28, 2013 at 12:33:15PM +0100, Cornelia Huck wrote:
 v5 of the ioeventfd patch set, this time with a proper return code
 from __diag_virtio_hypercall(), otherwise unchanged.
 
 v4 -> v5:
 - Proper return code in __diag_virtio_hypercall()
 v3 -> v4:
 - Pass cookies in virtio-ccw notify hypercall
 - Coding style
 v2 -> v3:
 - Added a patch exporting the virtio-ccw api and use it for the
   diagnose implementation.
 - Better naming: We're dealing with virtio-ccw notifications only.
 v1 -> v2:
 - Move irqfd initialization from a module init function to kvm_init,
   eliminating the need for a second module for kvm/s390.
 - Use kvm_io_device for s390 css devices.
 
 Cornelia Huck (5):
   KVM: s390: Export virtio-ccw api.
   KVM: Initialize irqfd from kvm_init().
   KVM: Introduce KVM_VIRTIO_CCW_NOTIFY_BUS.
   KVM: ioeventfd for virtio-ccw devices.
   KVM: s390: Wire up ioeventfd.
 
 Michael S. Tsirkin (1):
   virtio_ccw: pass a cookie value to kvm hypercall

Applied, thanks.



Re: [PATCH v2] KVM: x86: Convert INIT and SIPI signals into synchronously handled requests

2013-03-05 Thread Marcelo Tosatti
On Tue, Mar 05, 2013 at 08:16:41PM -0300, Marcelo Tosatti wrote:
 On Mon, Mar 04, 2013 at 10:41:43PM +0100, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
  
  A VCPU sending INIT or SIPI to some other VCPU races for setting the
  remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED
  was overwritten by kvm_emulate_halt and, thus, got lost.
  
  Fix this by raising requests on the sender side that will then be
  handled synchronously over the target VCPU context.
  
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 
 Why is kvm_emulate_halt being executed from
 KVM_MP_STATE_INIT_RECEIVED/KVM_MP_STATE_SIPI_RECEIVED again?
 
 Why is it not true that the only valid transition from
s/from/to/
 KVM_MP_STATE_HALTED is from KVM_MP_STATE_RUNNABLE?
 
 It would be good for KVM_MP_STATE_HALTED to indicate 
 guest executed HLT instruction (which is impossible without INIT/SIPI
 being received).



Re: [PATCH v2] KVM: x86: Convert INIT and SIPI signals into synchronously handled requests

2013-03-05 Thread Marcelo Tosatti
On Tue, Mar 05, 2013 at 08:16:41PM -0300, Marcelo Tosatti wrote:
 On Mon, Mar 04, 2013 at 10:41:43PM +0100, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
  
  A VCPU sending INIT or SIPI to some other VCPU races for setting the
  remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED
  was overwritten by kvm_emulate_halt and, thus, got lost.
  
  Fix this by raising requests on the sender side that will then be
  handled synchronously over the target VCPU context.
  
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 
 Why is kvm_emulate_halt being executed from
 KVM_MP_STATE_INIT_RECEIVED/KVM_MP_STATE_SIPI_RECEIVED again?
 
 Why is it not true that the only valid transition from
 KVM_MP_STATE_HALTED is from KVM_MP_STATE_RUNNABLE?

See Paolo's table, it is. So why fix a race which should not be
happening in the first place.

 It would be good for KVM_MP_STATE_HALTED to indicate 
 guest executed HLT instruction (which is impossible without INIT/SIPI
 being received).



Re: [PATCH v2] KVM: nVMX: Reset RFLAGS on VM-exit

2013-03-05 Thread Marcelo Tosatti
On Mon, Mar 04, 2013 at 11:00:38AM +0200, Gleb Natapov wrote:
 On Sun, Mar 03, 2013 at 08:47:11PM +0100, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
  
  Ouch, how could this work so well that far? We need to clear RFLAGS to
  the reset value as specified by the SDM. Particularly, IF must be off
  after VM-exit!
  
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 Reviewed-by: Gleb Natapov g...@redhat.com

Applied, thanks.



Re: [PATCH] s3: wake up on RTC event

2013-03-05 Thread Marcelo Tosatti
On Tue, Mar 05, 2013 at 10:16:18AM +0100, Paolo Bonzini wrote:
 The S3 test requires user interaction to terminate.  Set up the
 RTC alarm instead, so that the test will end after one second.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  x86/s3.c | 36 
  1 file changed, 36 insertions(+)

Applied, thanks.



Re: [PATCH v2] KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS

2013-03-05 Thread Marcelo Tosatti
On Mon, Mar 04, 2013 at 05:15:48PM +0100, Jan Kiszka wrote:
 Properly set those bits to 1 that the spec demands in case bit 55 of
 VMX_BASIC is 0 - like in our case.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
 
 Changes in v2:
  - use symbolic constants
 
  arch/x86/include/asm/vmx.h |4 
  arch/x86/kvm/vmx.c |   13 ++---
  2 files changed, 14 insertions(+), 3 deletions(-)

Please regenerate against queue branch.



Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.

2013-03-05 Thread Marcelo Tosatti
On Tue, Mar 05, 2013 at 02:17:57PM -0600, Michael Wolf wrote:
 Sorry for the delay in the response.  I did not see your question.
 
 On Mon, 2013-02-18 at 20:57 -0300, Marcelo Tosatti wrote:
  On Tue, Feb 05, 2013 at 03:49:41PM -0600, Michael Wolf wrote:
   Add a helper routine to scheduler/core.c to allow the kvm module
   to retrieve the cpu hardlimit settings.  The values will be used
   to set up a timer that is used to separate the consigned from the
   steal time.
  
  1) Can you please describe, in English, how the mechanics of subtracting cpu
  hardlimit values from the steal time reported via run_delay are supposed to
  work?
  
  The period and the quota used to separate the consigned time 
  (expected steal) from the steal time are taken
  from the cfs bandwidth control settings. Any other steal time
  accruing during that period will show as the traditional steal time.
  
  There is no expected steal time over a fixed period of real time.
 There is expected steal time in the sense that the administrator of the
 system sets up guests on the host so that there will be cpu
 overcommitment. 

I refer to 

+	/* split the delta into steal and consigned */
+	if (vcpu->arch.current_consigned < vcpu->arch.consigned_quota) {
+		vcpu->arch.current_consigned += delta;
+		if (vcpu->arch.current_consigned > vcpu->arch.consigned_quota) {
+			steal_delta = vcpu->arch.current_consigned
+					- vcpu->arch.consigned_quota;
+			consigned_delta = delta - steal_delta;
+		} else {

You can't expect there to be any amount of stolen time over a fixed
period of time.

  The end user who is using the guest does not know this,
 they only know they have been guaranteed a certain level of performance.
 So if steal time shows up the end user typically thinks they are not
 getting their guaranteed performance. So this patchset is meant to allow
 top to show 100% utilization and ONLY show steal time if it is over the
 level of steal time that the host administrator set up.  So take a simple
 example of a host with 1 cpu and two guests on it.  If each guest is
 fully utilized a user will see 50% utilization and 50% steal in either
 of the guests.  In this case the amount of steal time that the host 
 administrator would expect to see is 50%.  As long as the steal in the
 guest does not exceed 50% the guest is running as expected.  If for some
 reason the steal increases to 60%, now something is wrong and the steal
 time needs to be reported and the end user will make inquiries?

This is the purpose of stolen time: to report the amount of time guest 
vcpu was runnable, but not running (IOW: starved).

  2) From the description of patch 1: In the case where you have
  a system that is running in a capped or overcommitted environment
  the user may see steal time being reported in accounting tools
  such as top or vmstat.
  
  This is outdated, right? Because overcommitted environment is exactly
  what steal time should report.
 
 I hope I'm not missing your point here.  But again this comes down to
 the point of view.  The end user is guaranteed a capability/level of
 performance that may not be a whole cpu.  So only show steal time if the
 amount of steal time exceeds what the host admin expected when the guest
 was set up.

The real values must be reported. If the host system suddenly becomes
loaded beyond what the host can provide to the guest, should the system
report an incorrect value, to keep users from complaining? That sounds
incorrect.



Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-05 Thread Marcelo Tosatti
On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
 Sorry for the delay in the response.  I did not see the email
 right away.
 
 On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
  On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
   2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:
In the case where you have a system that is running in a
capped or overcommitted environment, the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.
   
   Sorry, I'm no expert in this area. But I don't really understand what
   is confusing for the end user here.
  
  I suppose that what is wanted is to subtract stolen time due to 'known
  reasons' from steal time reporting. 'Known reasons' being, for example,
  hard caps. So a vcpu executing instructions with no halt, but limited to
  80% of available bandwidth, would not have 20% of stolen time reported.
 
 Yes, exactly, and many times the end user did not set up the guest and is
 not aware of the capping.  The end user is only aware of the performance
 level that they were told they would get with the guest.
  But yes, a description of the scenario that is being dealt with, with
  details, is important.
 
 I will add more detail to the description next time I submit the
 patches.  How about something like, "In a cloud environment the user of a
 kvm guest is not aware of the underlying hardware or how many other
 guests are running on it.  The end user is only aware of a level of
 performance that they should see."  Or does that just muddy the picture
 more?

So what the feature aims for is to report stolen time relative to hard
capping. That is: stolen time should be counted as time stolen from
the guest _beyond_ hard capping. Yes?

Probably don't need to report new data to the guest for that.



Re: virtio PCI on KVM without IO BARs

2013-03-04 Thread Marcelo Tosatti
On Thu, Feb 28, 2013 at 05:24:33PM +0200, Michael S. Tsirkin wrote:
 OK we talked about this a while ago, here's
 a summary and some proposals:
 At the moment, virtio PCI uses IO BARs for all accesses.
 
 The reason for IO use is the cost of different VM exit types
 of transactions and their emulation on KVM on x86
 (it would be trivial to use memory BARs on non x86 platforms
  if they don't have PIO).
 Example benchmark (cycles per transaction):
   (io access) outw 1737
   (memory access) movw 4341
 for comparison:
   (hypercall access): vmcall 1566
   (pv memory access) movw_fast 1817 (*explanation what this is below)
 
 This creates a problem if we want to make virtio devices
 proper PCI express devices with native hotplug support.
 This is because each hotpluggable PCI express device always has
 a PCI express port (port per device),
 where each port is represented by a PCI to PCI bridge.
 In turn, a PCI to PCI bridge claims a 4Kbyte aligned
 range of IO addresses. This means that we can have at
 most 15 such devices, this is a nasty limitation.
 
 Another problem with PIO is support for physical virtio devices,
 and nested virt: KVM currently programs all PIO accesses
 to cause vm exit, so using this device in a VM will be slow.
 
 So we really want to stop using IO BARs completely if at all possible,
 but looking at the table above, switching to memory BAR and movw for
 notifications will not work well.
 
 Possible solutions:
 1. hypercall instead of PIO
   basically add a hypercall that gets an MMIO address/data
   and does an MMIO write for us.
   We'll want some capability in the device to let guest know
   this is what it should do.
   Pros: even faster than PIO
   Cons: this won't help nested or assigned devices (won't hurt
 them either as it will be conditional on the capability above).
   Cons: need host kernel support, which then has to be maintained
 forever, even if intel speeds up MMIO exits.
 
 2. pv memory access
   There are two reasons that memory access is slower:
   - one is that it's handled as an EPT misconfiguration error
   so handled by cpu slow path
   - one is that we need to decode the x86 instruction in
   software, to calculate address/data for the access.
 
   We could agree that guests would use a specific instruction
   for virtio accesses, and fast-path it specifically.
   This is the pv memory access option above.
   Pros: helps assigned devices and nested virt
   Pros: easy to drop if hardware support is there
   Cons: a bit slower than IO
   Cons: need host kernel support
 
 3. hypervisor assigned IO address
   qemu can reserve IO addresses and assign to virtio devices.
   2 bytes per device (for notification and ISR access) will be
   enough. So we can reserve 4K and this gets us 2000 devices.
 From KVM perspective, nothing changes.
   We'll want some capability in the device to let guest know
   this is what it should do, and pass the io address.
   One way to reserve the addresses is by using the bridge.
   Pros: no need for host kernel support
   Pros: regular PIO so fast
   Cons: does not help assigned devices, breaks nested virt
 
 Simply counting pros/cons, option 3 seems best. It's also the
 easiest to implement.
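
(For context, a rough user-space sketch of the kind of cycles-per-notification
measurement behind the table above. The port number and iteration count are
arbitrary examples, it needs root for ioperm(), and it is not the benchmark
that produced those numbers.)

#include <stdio.h>
#include <sys/io.h>
#include <x86intrin.h>

#define PORT  0xc050            /* arbitrary example IO BAR address */
#define ITERS 100000

int main(void)
{
        unsigned long long start, end;
        int i;

        if (ioperm(PORT, 2, 1) < 0) {
                perror("ioperm");
                return 1;
        }

        start = __rdtsc();
        for (i = 0; i < ITERS; i++)
                outw(0, PORT);  /* queue-notify style write */
        end = __rdtsc();

        printf("%llu cycles per outw\n", (end - start) / ITERS);
        return 0;
}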

Agree.



Re: [PATCH v5 6/6] KVM: s390: Wire up ioeventfd.

2013-03-04 Thread Marcelo Tosatti
On Thu, Feb 28, 2013 at 12:33:21PM +0100, Cornelia Huck wrote:
 Enable ioeventfd support on s390 and hook up diagnose 500 virtio-ccw
 notifications.
 
 Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
 ---
  arch/s390/kvm/Kconfig|  1 +
  arch/s390/kvm/Makefile   |  2 +-
  arch/s390/kvm/diag.c | 26 ++
  arch/s390/kvm/kvm-s390.c |  1 +
  4 files changed, 29 insertions(+), 1 deletion(-)
 
 diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
 index b58dd86..3c43e30 100644
 --- a/arch/s390/kvm/Kconfig
 +++ b/arch/s390/kvm/Kconfig
 @@ -22,6 +22,7 @@ config KVM
   select PREEMPT_NOTIFIERS
   select ANON_INODES
   select HAVE_KVM_CPU_RELAX_INTERCEPT
 + select HAVE_KVM_EVENTFD
   ---help---
 Support hosting paravirtualized guest machines using the SIE
 virtualization capability on the mainframe. This should work
 diff --git a/arch/s390/kvm/Makefile b/arch/s390/kvm/Makefile
 index 3975722..8fe9d65 100644
 --- a/arch/s390/kvm/Makefile
 +++ b/arch/s390/kvm/Makefile
 @@ -6,7 +6,7 @@
  # it under the terms of the GNU General Public License (version 2 only)
  # as published by the Free Software Foundation.
  
 -common-objs = $(addprefix ../../../virt/kvm/, kvm_main.o)
 +common-objs = $(addprefix ../../../virt/kvm/, kvm_main.o eventfd.o)
  
  ccflags-y := -Ivirt/kvm -Iarch/s390/kvm
  
 diff --git a/arch/s390/kvm/diag.c b/arch/s390/kvm/diag.c
 index a390687..1c01a99 100644
 --- a/arch/s390/kvm/diag.c
 +++ b/arch/s390/kvm/diag.c
 @@ -13,6 +13,7 @@
  
   #include <linux/kvm.h>
   #include <linux/kvm_host.h>
  +#include <asm/virtio-ccw.h>
   #include "kvm-s390.h"
   #include "trace.h"
   #include "trace-s390.h"
 @@ -104,6 +105,29 @@ static int __diag_ipl_functions(struct kvm_vcpu *vcpu)
   return -EREMOTE;
  }
  
 +static int __diag_virtio_hypercall(struct kvm_vcpu *vcpu)
 +{
 + int ret, idx;
 +
 + /* No virtio-ccw notification? Get out quickly. */
  +	if (!vcpu->kvm->arch.css_support ||
  +	    (vcpu->run->s.regs.gprs[1] != KVM_S390_VIRTIO_CCW_NOTIFY))
  +		return -EOPNOTSUPP;
  +
  +	idx = srcu_read_lock(&vcpu->kvm->srcu);
  +	/*
  +	 * The layout is as follows:
  +	 * - gpr 2 contains the subchannel id (passed as addr)
  +	 * - gpr 3 contains the virtqueue index (passed as datamatch)
  +	 */
  +	ret = kvm_io_bus_write(vcpu->kvm, KVM_VIRTIO_CCW_NOTIFY_BUS,
  +			       vcpu->run->s.regs.gprs[2],
  +			       8, &vcpu->run->s.regs.gprs[3]);
  +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
  +	/* kvm_io_bus_write returns -EOPNOTSUPP if it found no match. */
  +	return ret < 0 ? ret : 0;
 +}
 +

What about the cookie?



Re: [PATCH v12 rebased 2/8] start vm after resetting it

2013-03-01 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 04:13:49PM +0800, Hu Tao wrote:
 On Thu, Feb 07, 2013 at 11:50:28PM -0200, Marcelo Tosatti wrote:
  On Wed, Jan 23, 2013 at 03:19:23PM +0800, Hu Tao wrote:
   From: Wen Congyang we...@cn.fujitsu.com
   
   The guest should run after resetting it, but it does not run if its
   old state is RUN_STATE_INTERNAL_ERROR or RUN_STATE_PAUSED.
   
   We don't set runstate to RUN_STATE_PAUSED when resetting the guest,
   so the runstate will be changed from RUN_STATE_INTERNAL_ERROR or
   RUN_STATE_PAUSED to RUN_STATE_RUNNING(not RUN_STATE_PAUSED).
  
  It appears the last hunk will automatically reset state from 
  RUN_STATE_INTERNAL_ERROR to RUN_STATE_RUNNING ?
 
 Yes.
 
  
  I suppose the transition table allows, from RUN_STATE_INTERNAL_ERROR:
  
  monitor system_reset
  monitor cont
  
  To resume the machine?
 
 True.
 
 I think the purpose of this patch is to always reset and _run_ the guest
 by `system_reset', avoiding an additional `cont' following `system_reset'.

Unclear why it's essential to the feature being proposed?



Re: [PATCH v12 rebased] kvm: notify host when the guest is panicked

2013-03-01 Thread Marcelo Tosatti
On Thu, Feb 28, 2013 at 04:54:25PM +0800, Hu Tao wrote:
   diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
   b/arch/x86/include/uapi/asm/kvm_para.h
   index 06fdbd9..c15ef33 100644
   --- a/arch/x86/include/uapi/asm/kvm_para.h
   +++ b/arch/x86/include/uapi/asm/kvm_para.h
   @@ -96,5 +96,7 @@ struct kvm_vcpu_pv_apf_data {
#define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
#define KVM_PV_EOI_DISABLED 0x0

    +#define KVM_PV_EVENT_PORT (0x505UL)
   +
  
  No need for the ioport to be hard coded. What are the options to
  communicate an address to the guest? An MSR, via ACPI?
 
 I'm not quite understanding here. By 'address', you mean an ioport?
 how to communicate an address? (I have little knowledge about ACPI)

Yes, the ioport. The address of the ioport should not be fixed (for
example future emulated board could use that fixed ioport address,
0x505UL).

One option is to pass the address via an MSR. Yes, that is probably the
best option because there is no dependency on ACPI.
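
As a sketch of the MSR option on the guest side (kernel-style, illustrative
only: MSR_KVM_PV_EVENT_ADDR is an invented index, not an existing ABI):

/* Sketch: discover the pv-event ioport from a hypothetical MSR instead of
 * hardcoding 0x505. The MSR index below is invented for illustration. */
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <asm/msr.h>
#include <asm/io.h>

#define MSR_KVM_PV_EVENT_ADDR	0x4b564d05	/* invented example index */

static u16 kvm_pv_event_port;

static int __init kvm_pv_event_discover(void)
{
	u64 val;

	rdmsrl(MSR_KVM_PV_EVENT_ADDR, val);
	if (!val)
		return -ENODEV;			/* host does not provide it */
	kvm_pv_event_port = (u16)val;
	return 0;
}

static void kvm_pv_event_notify(u32 event)
{
	if (kvm_pv_event_port)
		outl(event, kvm_pv_event_port);	/* e.g. from the panic notifier */
}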

  pv-event is a bad name for an interface which is specific to notify
  panic events. Please use pv-panic everywhere.
 
 panic event is one of the events supported. Can we keep the name?

  Call the initialization code from kvm_guest_init, only one function is
  necessary.
 
 At the point of kvm_guest_init, request_region (called by
 kvm_pv_event_init) will block, so the guest kernel won't come up.

Why does it block?

#define PANIC_TIMER_STEP 100
#define PANIC_BLINK_SPD 18
   @@ -132,6 +133,9 @@ void panic(const char *fmt, ...)
 if (!panic_blink)
 panic_blink = no_blink;

   + if (kvm_pv_event_enabled())
   + panic_timeout = 0;
   +
  
  What is the rationale behind this?
 
 This is a hack to disable reset_on_panic if user enables
 pv-event.

Condition it to kvm_pv_event_enabled() directly?



Re: KVM guest OS reports zero CPU frequency via /proc/cpuinfo

2013-03-01 Thread Marcelo Tosatti

https://lists.gnu.org/archive/html/qemu-devel/2012-10/msg02269.html

On Wed, Feb 20, 2013 at 01:25:04AM +0400, Dmitry Krivenok wrote:
 Hello,
 
 I encountered an odd issue with running VMWare .vmdk image inside KVM.
 
 I run KVM as follows
 qemu-kvm -hda ./guest.vmdk -m 4096
 
 and guest OS (SLES11) boots fine, but reports zero CPU freq via
 /proc/cpuinfo (cpu MHz):
 
 
 
 $ cat /proc/cpuinfo
 processor   : 0
 vendor_id   : GenuineIntel
 cpu family  : 6
 model   : 2
 model name  : QEMU Virtual CPU version 0.15.1
 stepping: 3
 cpu MHz : 0.000
 cache size  : 4096 KB
 fpu : yes
 fpu_exception   : yes
 cpuid level : 4
 wp  : yes
 flags   : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca
 cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm up rep_good
 nopl pni cx16 popcnt lahf_lm
 bogomips: 2088.96
 clflush size: 64
 cache_alignment : 64
 address sizes   : 40 bits physical, 48 bits virtual
 power management:
 $
 
 
 
 I tried to convert .vmdk to qcow2 image and also tried different
 -cpu=MODEL options, but I always get zero frequency in this guest.
 
 Another guest (Ubuntu in native qcow2 image) works fine and reports
 correct cpu frequency, so I only see this problem with this particular
 .vmdk image.
 
 
 
 I’m running
 
 QEMU emulator version 0.15.1 (kvm-0.15.1-0.17.3), Copyright (c)
 2003-2008 Fabrice Bellard
 
 on SLES11-SP2 host (3.0 kernel + patches).
 
 This problem is critical for me because my guest runs a binary which
 relies on non-zero CPU freq in /proc/cpuinfo.
 
 
 Any ideas?
 
 Thanks!
 
 --
 Sincerely yours, Dmitry V. Krivenok
 e-mail: krivenok.dmi...@gmail.com
 skype: krivenok_dmitry
 jabber: krivenok_dmi...@jabber.ru
 icq: 242-526-443


Re: [Bisected][-next-20130204+] [x86/kvm] udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d

2013-02-27 Thread Marcelo Tosatti
On Sun, Feb 24, 2013 at 04:23:44PM -0500, Peter Hurley wrote:
 On Tue, 2013-02-19 at 10:26 +0200, Gleb Natapov wrote:
  On Mon, Feb 18, 2013 at 08:12:21PM -0500, Peter Hurley wrote:
   On Mon, 2013-02-18 at 19:59 -0300, Marcelo Tosatti wrote:
On Wed, Feb 13, 2013 at 06:57:09AM -0500, Peter Hurley wrote:
 On Wed, 2013-02-13 at 12:51 +0200, Gleb Natapov wrote:
  On Tue, Feb 12, 2013 at 04:39:03PM -0800, H. Peter Anvin wrote:
   On 02/12/2013 04:26 PM, Peter Hurley wrote:
With -next-20130204+ in ubuntu 12.10 VM (so the 80x25 VGA
device/console):

[0.666410] udevd[97]: starting version 175
[0.674043] udevd[97]: udevd:[97]: segfault at 
ff5fd020 ip 7fff069e277f sp 7fff068c9ef8 error d

and boots to an initramfs prompt.

git bisect (log attached) blames:

commit 7b5c4a65cc27f017c170b025f8d6d75dabb11c6f
Merge: 3596f5b 949db15
Author: H. Peter Anvin h...@linux.intel.com
Date:   Fri Jan 25 16:31:21 2013 -0800

Merge tag 'v3.8-rc5' into x86/mm

The __pa() fixup series that follows touches KVM code that 
is not
present in the existing branch based on v3.7-rc5, so merge 
in the
current upstream from Linus.

Signed-off-by: H. Peter Anvin h...@linux.intel.com


This only happens with the VGA device/console but that is the 
default
configuration for Ubuntu/KVM because it blacklists pretty much 
every fb
driver.

   
   I am guessing this is another bad use of __pa()... need to look 
   into that.

Can't find this commit on kvm.git or linux-2.6.git. Is it reproducible
there?
   
   This is in the linux-next repo (any git tag after 'next-20130204' will
   reproduce this). It's a pretty large merge commit.
   
   This doesn't happen on 3.8-rc7.
   
   I'll try to repro this on kvm.git sometime this week. Otherwise, we can
   wait for it to show up in 3.9.
   
  Can you also drop 5dfd486c4750c9278c63fa96e6e85bdd2fb58e9d from linux-next 
  and reproduce?
 
 Ok, found and fixed. This will need to go to stable for 3.8 as well.

ACK, please send an email with the patch to kvm@vger.kernel.org.

 Regards,
 Peter Hurley
 
 --- % ---
 From: Peter Hurley pe...@hurleysoftware.com
 Date: Sun, 24 Feb 2013 10:55:09 -0500
 Subject: [PATCH] x86/kvm: Fix pvclock vsyscall fixmap
 
 The physical memory fixmapped for the pvclock clock_gettime vsyscall
 was allocated, and thus is not a kernel symbol. __pa() is the proper
 method to use in this case.
 
 Fixes the crash below when booting a next-20130204+ smp guest on a
 3.8-rc5+ KVM host.
 
 [0.666410] udevd[97]: starting version 175
 [0.674043] udevd[97]: udevd:[97]: segfault at ff5fd020
  ip 7fff069e277f sp 7fff068c9ef8 error d
 
 Signed-off-by: Peter Hurley pe...@hurleysoftware.com
 ---
  arch/x86/kernel/pvclock.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
 index 85c3959..2cb9470 100644
 --- a/arch/x86/kernel/pvclock.c
 +++ b/arch/x86/kernel/pvclock.c
 @@ -185,7 +185,7 @@ int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
 
  	for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
   __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
 -  __pa_symbol(i) + (idx*PAGE_SIZE),
 +  __pa(i) + (idx*PAGE_SIZE),
PAGE_KERNEL_VVAR);
   }
  
 -- 
 1.8.1.2
 
 
 


Re: [PATCH 0/6] KVM: Clean up and optimize set_memory_region() - part2

2013-02-27 Thread Marcelo Tosatti
On Wed, Feb 27, 2013 at 07:41:03PM +0900, Takuya Yoshikawa wrote:
 Note: this is based on the latest master branch.
 
 I'm sending this before 3.9-rc1 is released since this can cause extra
 conflicts unless we make this available to all arch before we start
 adding new stuff.
 
 Please review: every change is trivial and should not change anything.
 
 Takuya Yoshikawa (6):
   KVM: set_memory_region: Drop user_alloc from prepare/commit_memory_region()
   KVM: set_memory_region: Drop user_alloc from set_memory_region()
   KVM: set_memory_region: Make kvm_mr_change available to arch code
   KVM: set_memory_region: Refactor prepare_memory_region()
   KVM: set_memory_region: Refactor commit_memory_region()
   KVM: ARM: Remove kvm_arch_set_memory_region()
 
  arch/arm/kvm/arm.c |   15 ++--
  arch/ia64/kvm/kvm-ia64.c   |   25 ++---
  arch/powerpc/include/asm/kvm_ppc.h |2 +-
  arch/powerpc/kvm/book3s_hv.c   |4 +-
  arch/powerpc/kvm/book3s_pr.c   |2 +-
  arch/powerpc/kvm/booke.c   |2 +-
  arch/powerpc/kvm/powerpc.c |   13 +--
  arch/s390/kvm/kvm-s390.c   |   10 ++--
  arch/x86/kvm/vmx.c |6 ++--
  arch/x86/kvm/x86.c |   25 +
  include/linux/kvm_host.h   |   35 +-
  virt/kvm/kvm_main.c|   40 ---
  12 files changed, 67 insertions(+), 112 deletions(-)
 
 -- 
 1.7.4.1

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com



Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-25 Thread Marcelo Tosatti
On Mon, Feb 25, 2013 at 11:13:25AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-02-25:
  On Mon, Feb 25, 2013 at 11:04:25AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2013-02-25:
  On Mon, Feb 25, 2013 at 08:42:52AM +, Zhang, Yang Z wrote:
  Avi Kivity wrote on 2013-02-25:
  I didn't really follow, but is the root cause the need to keep track
  of interrupt coalescing?  If so we can recommend that users use
  KVM_IRQ_LINE when coalescing is unneeded, and move interrupt
  injection with irq coalescing support to vcpu context.
   So we can hide the capability KVM_CAP_IRQ_INJECT_STATUS when posted
   interrupt is enabled, to force users not to use
   KVM_IRQ_LINE_STATUS. Is this acceptable?
   
   The only case in KVM that needs to know the interrupt injection status
   is the vlapic timer. But since the vlapic timer and the vcpu are always
   on the same pcpu, there is no problem.
  
  Not really. The primary user of this interface is RTC interrupt
  re-injection for Windows guests.
  So without KVM_IRQ_LINE_STATUS capability, RTC cannot work well?
  
   Windows guests may experience time drift under CPU overcommit scenarios.
 Ok, I see. Seems we are stuck. :(
  Do you have any suggestion to solve or work around the current problem?

Depends on knowledge about the atomicity (item 5 IIRC) of the sequence
in the manual.



Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-25 Thread Marcelo Tosatti
On Mon, Feb 25, 2013 at 06:55:58AM +, Zhang, Yang Z wrote:
  
  p1)
  
   cpu0                    cpu1                    vcpu0
  test_and_set_bit(PIR-A)
  set KVM_REQ_EVENT
process REQ_EVENT
PIR-A->IRR
  
vcpu->mode=IN_GUEST
  
  if (vcpu0->guest_mode)
if (!t_a_s_bit(PIR notif))
send IPI
linux_pir_handler
  
t_a_s_b(PIR-B)=1
no PIR IPI sent
  
  p2)
  
   No, this exists with your implementation, not mine.
   As I said, in this patch it will make the request after vcpu->mode ==
   guest mode, then send the ipi. Even if the ipi is lost, the request will
   still be tracked, because we have the following check:
   vcpu->mode = GUEST_MODE
   (the ipi may arrive here and be lost)
  local irq disable
  check request (this will ensure the pir modification will be picked up by 
  vcpu
  before vmentry)
  
  Please read the sequence above again, between p1) and p2). Yes, if the
  IPI is lost, the request _for the CPU which sent PIR IPI_ is guaranteed
  to be processed, but not the request for another cpu (cpu1).
  If no PIR IPI is sent by cpu1, then delivered is false, and it will fall back
  to the old logic:
 __apic_accept_irq():
 if (!delivered) {
  kvm_make_request(KVM_REQ_EVENT, vcpu);
  kvm_vcpu_kick(vcpu);
 }
 This can solve the problem you mentioned above. Right?

Should not be doing this kick in the first place. One result of it is
that you'll always take a VM-exit if a second injection happens while a VCPU
has not handled the first one.

What is the drawback of the suggestion to unconditionally clear PIR
notification bit on VM-entry?

We can avoid it, but let's get it correct first (BTW, we can stick a comment
saying FIXME: optimize on that PIR clearing).

  cpu0
  
  check pir(pass)
  check irr(pass)
  injected = !test_and_set_bit(pir)
  
  versus
  
  cpu1
 xchg(pir)

  
  No information can be lost because all accesses to shared data are
  atomic.
  Sorry, I cannot get your point. Why can 'xchg(pir)' solve the problem I
  mentioned above? Can you elaborate?
  The spinlock I used is trying to protect the whole procedure of syncing pir
  to irr, not just the access to pir.

You're right, it's the same problem as with the hardware update.
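
For reference, the xchg-based PIR->IRR drain under discussion, as a standalone
sketch (structure and helper names are illustrative, not the real kvm code).
Note it also shows the thread's point: ON is cleared before the xchg, which is
exactly what makes a concurrent injector's coalescing answer ambiguous:

#include <stdint.h>
#include <stdatomic.h>

struct pi_desc {
        _Atomic uint64_t pir[4];        /* 256 posted-interrupt request bits */
        _Atomic uint32_t control;       /* bit 0: outstanding notification (ON) */
};

static void sync_pir_to_irr(struct pi_desc *pi, uint64_t irr[4])
{
        int i;

        /* Clear ON first so a later injection re-arms the notification. */
        atomic_fetch_and(&pi->control, ~UINT32_C(1));

        for (i = 0; i < 4; i++) {
                /* Atomically take and clear this chunk of the PIR; a bit set
                 * concurrently is either seen here or left for the next sync. */
                uint64_t val = atomic_exchange(&pi->pir[i], 0);

                irr[i] |= val;
        }
}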



Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-25 Thread Marcelo Tosatti
On Mon, Feb 25, 2013 at 03:34:19PM +0200, Gleb Natapov wrote:
 On Mon, Feb 25, 2013 at 11:13:25AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2013-02-25:
   On Mon, Feb 25, 2013 at 11:04:25AM +, Zhang, Yang Z wrote:
   Gleb Natapov wrote on 2013-02-25:
   On Mon, Feb 25, 2013 at 08:42:52AM +, Zhang, Yang Z wrote:
   Avi Kivity wrote on 2013-02-25:
   I didn't really follow, but is the root cause the need to keep track
   of interrupt coalescing?  If so we can recommend that users use
   KVM_IRQ_LINE when coalescing is unneeded, and move interrupt
   injection with irq coalescing support to vcpu context.
    So we can hide the capability KVM_CAP_IRQ_INJECT_STATUS when posted
    interrupt is enabled, to force users not to use
    KVM_IRQ_LINE_STATUS. Is this acceptable?
    
    The only case in KVM that needs to know the interrupt injection status
    is the vlapic timer. But since the vlapic timer and the vcpu are always
    on the same pcpu, there is no problem.
   
   Not really. The primary user of this interface is RTC interrupt
   re-injection for Windows guests.
   So without KVM_IRQ_LINE_STATUS capability, RTC cannot work well?
   
    Windows guests may experience time drift under CPU overcommit scenarios.
  Ok, I see. Seems we are stuck. :(
  Do you have any suggestion to solve or work around the current problem?
  
 I see a couple of possible solutions:
 1. Do what Avi said. Make KVM_IRQ_LINE_STATUS be synchronous. Cons:
 current QEMU uses KVM_IRQ_LINE_STATUS always and it means that it
 will be slow on newer kernels

Can add a capability to QEMU and enable APICv selectively only 
in newer QEMU, which can issue KVM_IRQ_LINE_STATUS on target vcpu
only when necessary (and KVM_IRQ_LINE otherwise).

Even a lock serializing injection is not safe because ON bit is cleared
before XCHG(PIR, 0). Must do something heavier (such as running on
target vcpu context).

 2. Make KVM_IRQ_LINE_STATUS report coalescing only when vcpu is not
 running during injection. This assumes that if vcpu is running and does
  not process the interrupt it is the guest's fault and the same can happen on
  real
 HW too. Coalescing when vcpu is not running though is the result of CPU
  overcommit and should be reported. Cons: the interface definition is kind of
  murky.
 3. Do not report KVM_IRQ_LINE_STATUS capability and move RTC to use EOI
 notifiers for interrupt reinjection. This requires us to add interface
 for reporting EOI to userspace. This is not in the scope of this
 patchset. Cons: need to introduce new interface (and the one that will
 not work on AMD BTW)

Breaks older userspace?
 
 Other ideas?

Can HW write a 'finished' bit after 6 in the reserved area? Suppose it's
not a KVM-specific problem?



Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-25 Thread Marcelo Tosatti
On Mon, Feb 25, 2013 at 11:00:21AM -0300, Marcelo Tosatti wrote:
  I see a couple of possible solutions:
  1. Do what Avi said. Make KVM_IRQ_LINE_STATUS be synchronous. Cons:
  current QEMU uses KVM_IRQ_LINE_STATUS always and it means that it
  will be slow on newer kernels
 
 Can add a capability to QEMU and enable APICv selectively only 
 in newer QEMU, which can issue KVM_IRQ_LINE_STATUS on target vcpu
 only when necessary (and KVM_IRQ_LINE otherwise).

Bad idea. What happens with mixed scenarios.

 Even a lock serializing injection is not safe because ON bit is cleared
 before XCHG(PIR, 0). Must do something heavier (such as running on
 target vcpu context).

Note always running on target vcpu is likely to be slower than no-APICv.

So need to do something heavier on the kernel under serialization, if 
firmware cannot be changed (injection from simultaneous CPUs should be
rare so if data to serialize __accept_apic_irq is cache-line aligned 
it should reduce performance impact).

  2. Make KVM_IRQ_LINE_STATUS report coalescing only when vcpu is not
  running during injection. This assumes that if vcpu is running and does
  not process interrupt it is guest fault and the same can happen on real
  HW too. Coalescing when vcpu is not running though is the result of CPU
  overcommit and should be reported. Cons interface definition is kind of
  murky.
  3. Do not report KVM_IRQ_LINE_STATUS capability and move RTC to use EOI
  notifiers for interrupt reinjection. This requires us to add interface
  for reporting EOI to userspace. This is not in the scope of this
  patchset. Cons: need to introduce new interface (and the one that will
  not work on AMD BTW)

Is there no int-ack notification at RTC HW level?

 Breaks older userspace?
  
  Other ideas?
 
 Can HW write a 'finished' bit after 6 in the reserved area? Suppose it's
 not a KVM-specific problem?
 


Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-25 Thread Marcelo Tosatti
On Mon, Feb 25, 2013 at 07:40:07PM +0200, Gleb Natapov wrote:
 On Mon, Feb 25, 2013 at 11:17:39AM -0300, Marcelo Tosatti wrote:
  On Mon, Feb 25, 2013 at 11:00:21AM -0300, Marcelo Tosatti wrote:
I see a couple of possible solutions:
1. Do what Avi said. Make KVM_IRQ_LINE_STATUS be synchronous. Cons:
current QEMU uses KVM_IRQ_LINE_STATUS always and it means that it
will be slow on newer kernels
   
   Can add a capability to QEMU and enable APICv selectively only 
   in newer QEMU, which can issue KVM_IRQ_LINE_STATUS on target vcpu
   only when necessary (and KVM_IRQ_LINE otherwise).
  
  Bad idea. What happens with mixed scenarios.
  
 Exactly.
 
   Even a lock serializing injection is not safe because ON bit is cleared
   before XCHG(PIR, 0). Must do something heavier (such as running on
   target vcpu context).
  
  Note always running on target vcpu is likely to be slower than no-APICv.
  
 Yes, but we will do it only for RTC interrupt. Still performance hit
 should be very visible when RTC is in 1000Hz mode.

So older qemu-kvm on APICv enabled hardware becomes slow? If that is
acceptable, fine.

  So need to do something heavier on the kernel under serialization, if 
  firmware cannot be changed (injection from simultaneous CPUs should be
  rare so if data to serialize __accept_apic_irq is cache-line aligned 
  it should reduce performance impact).
  
2. Make KVM_IRQ_LINE_STATUS report coalescing only when vcpu is not
running during injection. This assumes that if vcpu is running and does
not process interrupt it is guest fault and the same can happen on real
HW too. Coalescing when vcpu is not running though is the result of CPU
overcommit and should be reported. Cons interface definition is kind of
murky.
3. Do not report KVM_IRQ_LINE_STATUS capability and move RTC to use EOI
notifiers for interrupt reinjection. This requires us to add interface
for reporting EOI to userspace. This is not in the scope of this
patchset. Cons: need to introduce new interface (and the one that will
not work on AMD BTW)
  
  Is there no int-ack notification at RTC HW level?
 There is, but Windows calls it twice for each injected interrupt. I
 tried to use it in the past to detect interrupt coalescing, but this
  double ack prevented me from doing so. Maybe revisit this approach with a
  willingness to be more hacky about it. Also it is possible to disable the
  RTC ack for HyperV guests. We do not do that now, and if we start using the
  ack we will not be able to.

Is it OK to kill the ability to have coalesced information?

   Breaks older userspace?
  Older userspace will have timedrift with Windows, yes. Note that with some
  versions of Windows it has it now too.

Other ideas?
   
   Can HW write a 'finished' bit after 6 in the reserved area? Suppose its
   not a KVM-specific problem?
   
  I still think adding locking just to obtain correct injection status is a bad
  trade-off, even if HW would allow us to get away with it.

Perhaps with notification at end of copy it could be lockless.

With the current scheme, it is not possible to get it right because
the notification bit disappears temporarily from sight. And it is not
possible to distinguish between 'interrupt injected' and 'interrupt
merged', as discussed here. So would have to serialize completly along
the lines of:

Lock and only inject if can identify that interrupt handler is not
running.

If it's possible to drop KVM_IRQ_LINE_STATUS, and demand that APICv HW be used
only with recent software, fine (I did not grasp Avi's second idea).




Re: [PATCH] x86: kvmclock: Do not setup kvmclock vsyscall in the absence of that clock

2013-02-25 Thread Marcelo Tosatti
On Sat, Feb 23, 2013 at 05:05:29PM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 This fixes boot lockups with no-kvmclock, when the host is not
 exposing this particular feature (QEMU: -cpu ...,-kvmclock) or when
 the kvmclock initialization failed for whatever reason.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
 
 Should go to 3.8 as well, I presume.
 
  arch/x86/kernel/kvmclock.c |9 -
  1 files changed, 8 insertions(+), 1 deletions(-)
 
 diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
 index 5bedbdd..b730efa 100644
 --- a/arch/x86/kernel/kvmclock.c
 +++ b/arch/x86/kernel/kvmclock.c
 @@ -160,8 +160,12 @@ int kvm_register_clock(char *txt)
  {
   int cpu = smp_processor_id();
   int low, high, ret;
  - struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
 + struct pvclock_vcpu_time_info *src;
 +
 + if (!hv_clock)
 + return 0;
  
  + src = &hv_clock[cpu].pvti;
   low = (int)__pa(src) | 1;
   high = ((u64)__pa(src) >> 32);
   ret = native_write_msr_safe(msr_kvm_system_time, low, high);
 @@ -276,6 +280,9 @@ int __init kvm_setup_vsyscall_timeinfo(void)
   struct pvclock_vcpu_time_info *vcpu_time;
   unsigned int size;
  
 + if (!hv_clock)
 + return 0;
 +
   size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
  
   preempt_disable();
 -- 
 1.7.3.4
 

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com



Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-24 Thread Marcelo Tosatti
On Sun, Feb 24, 2013 at 01:17:59PM +, Zhang, Yang Z wrote:
 Marcelo Tosatti wrote on 2013-02-24:
  On Sat, Feb 23, 2013 at 03:16:11PM +, Zhang, Yang Z wrote:
  Marcelo Tosatti wrote on 2013-02-23:
  On Sat, Feb 23, 2013 at 02:05:28PM +, Zhang, Yang Z wrote:
  Marcelo Tosatti wrote on 2013-02-23:
  
  On Fri, Feb 22, 2013 at 09:42:32PM +0800, Yang Zhang wrote:
  From: Yang Zhang yang.z.zh...@intel.com
  
  Posted Interrupt allows APIC interrupts to inject into guest directly
  without any vmexit.
  
   - When delivering an interrupt to the guest, if the target vcpu is running,
     update the Posted-interrupt requests bitmap and send a notification
     event to the vcpu. The vcpu will then handle this interrupt
     automatically, without any software involvement.
   - If the target vcpu is not running, or there is already a notification
     event pending in the vcpu, do nothing. The interrupt will be handled
     by the next vm entry.
  Signed-off-by: Yang Zhang yang.z.zh...@intel.com
  ---
  
  diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
  index e4595f1..64616cc 100644
  --- a/arch/x86/kernel/irq.c
  +++ b/arch/x86/kernel/irq.c
  @@ -228,6 +228,26 @@ void smp_x86_platform_ipi(struct pt_regs *regs)
 set_irq_regs(old_regs);
   }
  +#ifdef CONFIG_HAVE_KVM
  +/*
  + * Handler for POSTED_INTERRUPT_VECTOR.
  + */
  +void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
  +{
  +  struct pt_regs *old_regs = set_irq_regs(regs);
  +
  +  ack_APIC_irq();
  +
  +  irq_enter();
  +
  +  exit_idle();
  +
  +  irq_exit();
  +
  +  set_irq_regs(old_regs);
  +}
  +#endif
  +
  
  Add per-cpu counters, similarly to other handlers in the same file.
  Sure.
  
   @@ -379,6 +398,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
   	if (!apic->irr_pending)
   		return -1;
   +	kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
   	result = apic_search_irr(apic);
   	ASSERT(result == -1 || result >= 16);
  
   kvm_x86_ops->sync_pir_to_irr() should be called from vcpu_enter_guest,
  before inject_pending_event.
  
  if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win)
  {
  -
  inject_pending_event
  ...
  }
   Some other places will search the irr to do something, like
   kvm_cpu_has_interrupt(). So we must sync pir to irr before touching
   the irr, not just before entering the guest.
  
  I see. The only call site that needs IRR/PIR information with posted
  interrupt enabled is kvm_arch_vcpu_runnable, correct?
  Yes.
  
  If so, can we contain reading PIR to only that location?
   Yes, we can do it. Actually, the reason I do pir->irr here is to avoid
   future wrong usage of checking for a pending interrupt just by calling
   kvm_vcpu_has_interrupt().
  
   Yes, I see.
  
   Also, there may need to be an extra callback to check the pir.
   So you think your suggestion is better?
  
  Because it respects standard request processing mechanism, yes.
 Sure. Will change it in next patch.
  
   And have the PIR->IRR sync at KVM_REQ_EVENT processing only (to respect the
   standard way of event processing and also reduce reading the PIR).
   Since we will check the ON bit before each read of the PIR, the cost can be
   ignored.
  
  Note reading ON bit is not OK, because if injection path did not see
   vcpu->mode == IN_GUEST_MODE, ON bit will not be set.
  In my patch, it will set the ON bit before checking vcpu->mode. So it's ok.
  Actually, I think the ON bit must be set unconditionally after changing the
  PIR, regardless of vcpu->mode.
  
  
  So it is possible that a PIR IPI is delivered and handled before guest
  entry.
  
  By clearing PIR notification bit after local_irq_disable, but before the
   last check for vcpu->requests, we guarantee that a PIR IPI is never lost.
  
  Makes sense? (please check the logic, it might be flawed).
   I am not sure whether I understand you. But I don't think the IPI will be
   lost in the current patch:
  
   if (!pi_test_and_set_on(&vmx->pi_desc) && (vcpu->mode == IN_GUEST_MODE)) {
           kvm_make_request(KVM_REQ_EVENT, vcpu);
           apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), POSTED_INTR_VECTOR);
           *delivered = true;
   }
  
  vcpu entry has:
   vcpu->mode = GUEST_MODE
  local irq disable
  check request
  
   We will send the IPI after making the request; if the IPI is received after
   setting guest_mode but before disabling irqs, it will still be handled by
   the following request check.
  Please correct me if I am wrong.

p1)

  cpu0                    cpu1                    vcpu0
  test_and_set_bit(PIR-A)
  set KVM_REQ_EVENT
  process REQ_EVENT
  PIR-A->IRR
  
  vcpu->mode=IN_GUEST
  
  if (vcpu0->guest_mode)
  if (!t_a_s_bit(PIR notif))
  send IPI

Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-24 Thread Marcelo Tosatti
On Sun, Feb 24, 2013 at 01:55:07PM +, Zhang, Yang Z wrote:
  contexts, but only two use locks.
  See the following logic; I think the problem you mentioned will not happen
  with the current logic.
  
  get lock
  if test_pir (this will ensure there is no in-flight IPI for the same
  interrupt. Since we are taking the lock now, no IPI will be sent before we
  release the lock and no pir->irr is performed by hardware for the same
  interrupt.)
     return coalesced
  if test_irr
     return coalesced
  set_pir
  injected=true
  if t_a_s(on) && in guest mode
     send ipi
  unlock
 
  
  I'd rather think about proper way to do lockless injection before
  committing anything. The case where we care about correct injection
  status is rare, but we always care about scalability and since we
  violate the spec by reading vapic page while vcpu is in non-root
  operation anyway the result may be incorrect with or without the lock.
   Our use case was not in the HW designers' mind when they designed this thing,
   obviously :(
   
   Zhang, can you comment on whether reading the vapic page while a CPU in
   VMX non-root operation is accessing it is safe?
 See 24.11.4
 In addition to data in the VMCS region itself, VMX non-root operation can be 
 controlled by data structures that are
 referenced by pointers in a VMCS (for example, the I/O bitmaps). While the 
 pointers to these data structures are
 parts of the VMCS, the data structures themselves are not. They are not 
 accessible using VMREAD and VMWRITE
 but by ordinary memory writes.
 Software should ensure that each such data structure is modified only when no 
 logical processor with a current
 VMCS that references it is in VMX non-root operation. Doing otherwise may 
 lead to unpredictable behavior.
 
 This means the initial design of KVM is wrong. It should not modify vIRR 
 directly.
 
 The good thing is that reading is ok.

OK.

   Gleb, yes, a comment mentioning the race (instead of the spinlock) and
   explanation why it's believed to be benign (given how the injection
   return value is interpreted) could also work. It's ugly, though... murphy
   is around.
   The race above is not benign. It will report an interrupt as coalesced
   while in reality it is injected. This may cause too many interrupts to be
   injected. If this happens rarely enough, ntp may be able to fix the time drift
   resulting from this.
  Please check the above logic. Does it have the same problem? If yes, please point 
  it out.

1) set_pir must be test_and_set_bit() (so the lock at vcpu entry path can
be dropped).

2)  if t_a_s(on) && in guest mode
	send ipi

should be inverted:

if guest mode && t_a_s(on)
	send ipi

So you avoid setting the ON bit if the vcpu is outside of guest mode.
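
In other words, roughly (a sketch of just those two steps, reusing helper names
from the code quoted earlier in the thread and keeping its negation on the
test-and-set; not the actual patch):

    /* 1) locked test-and-set of the vector, so the vcpu entry path no longer
     *    needs the lock to look at the PIR safely */
    injected = !test_and_set_bit(vector, (unsigned long *)vmx->pi_desc.pir);

    /* 2) check guest mode first, so the ON bit is never set for a vcpu outside
     *    guest mode, where the notification IPI would only be consumed by the
     *    host handler */
    if (vcpu->mode == IN_GUEST_MODE && !pi_test_and_set_on(&vmx->pi_desc))
    	apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), POSTED_INTR_VECTOR);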




Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-24 Thread Marcelo Tosatti
On Sun, Feb 24, 2013 at 04:19:17PM +0200, Gleb Natapov wrote:
 On Sun, Feb 24, 2013 at 01:55:07PM +, Zhang, Yang Z wrote:
   I do not think it fixes it. There is no guaranty that IPI will be
   processed by remote cpu while sending cpu is still in locked section, so
   the same race may happen regardless. As you say above there are 3
   contexts, but only two use locks.
  See following logic, I think the problem you mentioned will not happened 
  with current logic.
  
  get lock
  if test_pir (this will ensure there is no in flight IPI for same interrupt. 
  Since we are taking the lock now, no IPI will be sent before we release the 
  lock and no pir-irr is performed by hardware for same interrupt.)

Does the PIR IPI interrupt return synchronously _after_ the PIR->IRR transfer
is made? I don't think so.

The PIR IPI interrupt returns after the remote CPU has acked interrupt receipt,
not after the remote CPU has acked _and_ performed the PIR->IRR transfer.

Right? (yes, right, see step 4. below).

Should try to make it easier to drop the lock later, so depend on it as
little as possible (also please document what it protects in detail).

Also note:


3. The processor clears the outstanding-notification bit in the
posted-interrupt descriptor. This is done atomically
so as to leave the remainder of the descriptor unmodified (e.g., with a
locked AND operation).
4. The processor writes zero to the EOI register in the local APIC; this
dismisses the interrupt with the posted-interrupt notification vector from
the local APIC.
5. The logical processor performs a logical-OR of PIR into VIRR and
clears PIR. No other agent can read or write a
PIR bit (or group of bits) between the time it is read (to determine
what to OR into VIRR) and when it is cleared.


So checking the ON bit does not mean the HW has finished performing
PIR->VIRR transfer (which means ON bit can only be used as an indication 
of whether to send interrupt, not whether PIR->VIRR transfer is
finished).

So it's fine to do

- atomic set pir
- if (atomic_test_and_set(PIR ON bit))
-  send IPI

But HW can transfer two distinct bits, set by different serialized instances
of kvm_set_irq (protected with a lock), in a single PIR->IRR pass.

 I do not see where those assumptions are coming from. Testing pir does
 not guarantee that the IPI is not processed by the VCPU right now.
 
 iothread:                      vcpu:
 send_irq()
 lock(pir)
 check pir and irr
 set pir
 send IPI (*)
 unlock(pir)
 
 send_irq()
 lock(pir)
                                receive IPI (*)
                                atomic {
                                  pir_tmp = pir
                                  pir = 0
                                }
 check pir and irr
                                irr = pir_tmp
 set pir
 send IPI
 unlock(pir)
 
 At this point both pir and irr are set and interrupt may be coalesced,
 but it is reported as delivered.

s/set pir/injected = !t_a_s(pir)/

 So what prevents the scenario above from happening?
 
 --
   Gleb.


Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-23 Thread Marcelo Tosatti

On Fri, Feb 22, 2013 at 09:42:32PM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 Posted Interrupt allows APIC interrupts to be injected into the guest directly
 without any vmexit.
 
 - When delivering an interrupt to the guest, if the target vcpu is running,
   update the Posted-interrupt requests bitmap and send a notification event
   to the vcpu. Then the vcpu will handle this interrupt automatically,
   without any software involvement.
 
 - If the target vcpu is not running or there is already a notification event
   pending in the vcpu, do nothing. The interrupt will be handled by the
   next vm entry.
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---

 diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
 index e4595f1..64616cc 100644
 --- a/arch/x86/kernel/irq.c
 +++ b/arch/x86/kernel/irq.c
 @@ -228,6 +228,26 @@ void smp_x86_platform_ipi(struct pt_regs *regs)
   set_irq_regs(old_regs);
  }
  
 +#ifdef CONFIG_HAVE_KVM
 +/*
 + * Handler for POSTED_INTERRUPT_VECTOR.
 + */
 +void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 +{
 + struct pt_regs *old_regs = set_irq_regs(regs);
 +
 + ack_APIC_irq();
 +
 + irq_enter();
 +
 + exit_idle();
 +
 + irq_exit();
 +
 + set_irq_regs(old_regs);
 +}
 +#endif
 +

Add per-cpu counters, similarly to other handlers in the same file.
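
For example (a sketch only; the kvm_posted_intr_ipis name is an assumption, and
a matching field would also have to be added to irq_cpustat_t in
arch/x86/include/asm/hardirq.h, next to the existing x86_platform_ipis counter):

    void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
    {
    	struct pt_regs *old_regs = set_irq_regs(regs);

    	ack_APIC_irq();
    	inc_irq_stat(kvm_posted_intr_ipis);	/* per-cpu accounting */
    	irq_enter();
    	exit_idle();
    	irq_exit();
    	set_irq_regs(old_regs);
    }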

 @@ -379,6 +398,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
 	if (!apic->irr_pending)
 		return -1;
  
 +	kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
 	result = apic_search_irr(apic);
 	ASSERT(result == -1 || result >= 16);

kvm_x86_ops->sync_pir_to_irr() should be called from vcpu_enter_guest,
before inject_pending_event.

if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
-
inject_pending_event
...
}
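
I.e., roughly (a sketch of the intended placement only, not the actual code):

    if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
    	/* fold any posted bits into the IRR before deciding what to inject */
    	kvm_x86_ops->sync_pir_to_irr(vcpu);
    	inject_pending_event(vcpu);
    	/* ... rest of event/interrupt injection ... */
    }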


 @@ -685,6 +705,7 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
  {
 	int result = 0;
 	struct kvm_vcpu *vcpu = apic->vcpu;
 +	bool delivered = false;
  
 	switch (delivery_mode) {
 	case APIC_DM_LOWEST:
 @@ -700,7 +721,10 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 	} else
 		apic_clear_vector(vector, apic->regs + APIC_TMR);
  
 -	result = !apic_test_and_set_irr(vector, apic);
 +	if (!kvm_x86_ops->deliver_posted_interrupt(vcpu, vector,
 +			&result, &delivered))
 +		result = !apic_test_and_set_irr(vector, apic);
 +
 	trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
 			trig_mode, vector, !result);
 	if (!result) {
 @@ -710,8 +734,10 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 		break;
 	}
  
 -	kvm_make_request(KVM_REQ_EVENT, vcpu);
 -	kvm_vcpu_kick(vcpu);
 +	if (!delivered) {
 +		kvm_make_request(KVM_REQ_EVENT, vcpu);
 +		kvm_vcpu_kick(vcpu);
 +	}

This set bit / kick pair should be for non-APICv only (please
check for it directly).
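
I.e., something like (a sketch; whether a vm_has_apicv()-style hook is
available to the lapic code here is an assumption):

    	if (!kvm_x86_ops->vm_has_apicv(vcpu->kvm)) {
    		kvm_make_request(KVM_REQ_EVENT, vcpu);
    		kvm_vcpu_kick(vcpu);
    	}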

 +	return test_bit(vector, (unsigned long *)pi_desc->pir);
 +}
 +
 +static void pi_set_pir(int vector, struct pi_desc *pi_desc)
 +{
 +	__set_bit(vector, (unsigned long *)pi_desc->pir);
 +}

This must be locked atomic operation (set_bit).
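
That is, simply (a sketch):

    static void pi_set_pir(int vector, struct pi_desc *pi_desc)
    {
    	/* locked RMW (LOCK-prefixed on x86): the descriptor is also written
    	 * by other CPUs and by the processor doing posted-interrupt
    	 * processing, so the non-atomic __set_bit() is not safe here */
    	set_bit(vector, (unsigned long *)pi_desc->pir);
    }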

 +
  struct vcpu_vmx {
   struct kvm_vcpu   vcpu;
   unsigned long host_rsp;
 @@ -429,6 +465,10 @@ struct vcpu_vmx {
  
   bool rdtscp_enabled;
  
 + /* Posted interrupt descriptor */
 + struct pi_desc pi_desc;
 + spinlock_t pi_lock;

Don't see why the lock is necessary. Please use the following
pseudocode for vmx_deliver_posted_interrupt:

vmx_deliver_posted_interrupt(), returns 'bool injected'.

1. orig_irr = read irr from vapic page
2. if (orig_irr != 0)
3.  return false;
4. kvm_make_request(KVM_REQ_EVENT)
5. bool injected = !test_and_set_bit(PIR)
6. if (vcpu->guest_mode && injected)
7.  if (test_and_set_bit(PIR notification bit))
8.  send PIR IPI
9. return injected

On vcpu_enter_guest:

if (kvm_check_request(KVM_REQ_EVENT))  {
pir-virr sync  (*)
...
}
...
vcpu->mode = IN_GUEST_MODE;
local_irq_disable
clear pir notification bit unconditionally (*)

Right after local_irq_disable.
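
One possible C rendering of the pseudocode above (a sketch only: the pi_desc
control word and POSTED_INTR_ON bit index are assumptions, vcpu->guest_mode is
rendered as vcpu->mode == IN_GUEST_MODE, and the ON-bit condition is written so
the IPI is sent only when the bit was previously clear):

    #define POSTED_INTR_ON	0	/* assumed: ON is bit 0 of the control word */

    static bool vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
    {
    	struct vcpu_vmx *vmx = to_vmx(vcpu);
    	bool injected;

    	/* 1-3: vector already pending in the IRR -> report coalesced */
    	if (kvm_apic_test_irr(vector, vcpu->arch.apic))
    		return false;

    	/* 4: make sure the bit is noticed even if the vcpu is not yet in
    	 *    guest mode: the entry path rechecks requests with irqs off */
    	kvm_make_request(KVM_REQ_EVENT, vcpu);

    	/* 5: locked set of the vector in the PIR */
    	injected = !test_and_set_bit(vector,
    				     (unsigned long *)vmx->pi_desc.pir);

    	/* 6-8: notify only a vcpu that is already in guest mode */
    	if (injected && vcpu->mode == IN_GUEST_MODE &&
    	    !test_and_set_bit(POSTED_INTR_ON,
    			      (unsigned long *)&vmx->pi_desc.control))
    		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
    				    POSTED_INTR_VECTOR);

    	return injected;
    }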

 +static bool vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu,
 +		int vector, int *result, bool *delivered)
 +{
 +	struct vcpu_vmx *vmx = to_vmx(vcpu);
 +	unsigned long flags;
 +
 +	if (!vmx_vm_has_apicv(vcpu->kvm))
 +		return false;
 +
 +	spin_lock_irqsave(&vmx->pi_lock, flags);
 +	if (pi_test_pir(vector, &vmx->pi_desc) ||
 +			kvm_apic_test_irr(vector, vcpu->arch.apic)) {
 +		spin_unlock_irqrestore(&vmx->pi_lock, flags);
 +		return true;
 +	}
 +
 + 

Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-23 Thread Marcelo Tosatti
On Sat, Feb 23, 2013 at 02:05:28PM +, Zhang, Yang Z wrote:
 Marcelo Tosatti wrote on 2013-02-23:
  
  On Fri, Feb 22, 2013 at 09:42:32PM +0800, Yang Zhang wrote:
  From: Yang Zhang yang.z.zh...@intel.com
  
  Posted Interrupt allows APIC interrupts to be injected into the guest directly
  without any vmexit.
  
  - When delivering an interrupt to the guest, if the target vcpu is running,
    update the Posted-interrupt requests bitmap and send a notification event
    to the vcpu. Then the vcpu will handle this interrupt automatically,
    without any software involvement.
  - If the target vcpu is not running or there is already a notification event
    pending in the vcpu, do nothing. The interrupt will be handled by the
    next vm entry.
  Signed-off-by: Yang Zhang yang.z.zh...@intel.com
  ---
  
  diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
  index e4595f1..64616cc 100644
  --- a/arch/x86/kernel/irq.c
  +++ b/arch/x86/kernel/irq.c
  @@ -228,6 +228,26 @@ void smp_x86_platform_ipi(struct pt_regs *regs)
 set_irq_regs(old_regs);
   }
  +#ifdef CONFIG_HAVE_KVM
  +/*
  + * Handler for POSTED_INTERRUPT_VECTOR.
  + */
  +void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
  +{
  +  struct pt_regs *old_regs = set_irq_regs(regs);
  +
  +  ack_APIC_irq();
  +
  +  irq_enter();
  +
  +  exit_idle();
  +
  +  irq_exit();
  +
  +  set_irq_regs(old_regs);
  +}
  +#endif
  +
  
  Add per-cpu counters, similarly to other handlers in the same file.
 Sure.
 
  @@ -379,6 +398,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
 	if (!apic->irr_pending)
 		return -1;
  +	kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
 	result = apic_search_irr(apic);
 	ASSERT(result == -1 || result >= 16);
  
kvm_x86_ops->sync_pir_to_irr() should be called from vcpu_enter_guest,
  before inject_pending_event.
  
  if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
  -
  inject_pending_event
  ...
  }
 Some other places will search the IRR to do something, like 
 kvm_cpu_has_interrupt(). So we must sync PIR to IRR before touching the IRR, not 
 just before entering the guest.

I see. The only call site that needs IRR/PIR information with posted
interrupt enabled is kvm_arch_vcpu_runnable, correct?

If so, can we contain reading PIR to only that location?

And have PIR-IRR sync at KVM_REQ_EVENT processing only (to respect the
standard way of event processing and also reduce reading the PIR).

  +{
  +	__set_bit(vector, (unsigned long *)pi_desc->pir);
  +}
  
  This must be locked atomic operation (set_bit).
 If using a spinlock, pi_set_pir is not called concurrently. It is safe to use 
 __set_bit.

The manual demands atomic locked accesses (remember this is a remote
access). See the posted interrupt page.

  
  Don't see why the lock is necessary. Please use the following
  pseudocode for vmx_deliver_posted_interrupt:
  
  vmx_deliver_posted_interrupt(), returns 'bool injected'.
  
  1. orig_irr = read irr from vapic page
  2. if (orig_irr != 0)
  3.  return false;
  4. kvm_make_request(KVM_REQ_EVENT)
  5. bool injected = !test_and_set_bit(PIR)
  6. if (vcpu->guest_mode && injected)
  7.  if (test_and_set_bit(PIR notification bit))
  8.  send PIR IPI
  9. return injected
 
 Consider the following case:
 vcpu 0                          | vcpu 1
 send intr to vcpu1
 check irr
                                 | receive a posted intr
                                 | pir->irr (pir is cleared, irr is set)
 injected=test_and_set_bit=true
 pir=set
 
 Then both irr and pir have the interrupt pending; they may merge into one 
 later, but injected is reported as true. Wrong.

True. Have a lock around it and at step 1 also verify if PIR is set. That
would do it, yes?

  On vcpu_enter_guest:
  
  if (kvm_check_request(KVM_REQ_EVENT))  {
pir->virr sync  (*)
  ...
  }
  ...
  vcpu-mode = IN_GUEST_MODE;
  local_irq_disable
  clear pir notification bit unconditionally (*)
 Why clear the ON bit here? If there is no pir->irr operation in this vmentry, 
 clearing ON here is redundant.

A PIR IPI must not be lost. Because if its lost, then interrupt
injection can be delayed while it must be performed immediately.

vcpu entry path has:
1. vcpu->mode = GUEST_MODE
2. local_irq_disable

injection path has:
1. if (vcpu->guest_mode && test_and_set_bit(PIR notif bit))
	send IPI

So it is possible that a PIR IPI is delivered and handled before guest
entry.

By clearing PIR notification bit after local_irq_disable, but before
the last check for vcpu->requests, we guarantee that a PIR IPI is never
lost.

Makes sense? (please check the logic, it might be flawed).
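
Concretely, the entry side being described would look roughly like this (a
sketch; pi_clear_on() is an assumed helper that clears the ON bit with a
locked operation):

    	vcpu->mode = IN_GUEST_MODE;
    	local_irq_disable();

    	/* Clear the notification bit unconditionally: a later injector will
    	 * then send a fresh IPI (which now arrives with the vcpu in guest
    	 * mode, where the CPU processes it), and anything already posted is
    	 * caught by the request check below, since the injection path always
    	 * sets KVM_REQ_EVENT. */
    	pi_clear_on(&vmx->pi_desc);

    	if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
    	    || need_resched() || signal_pending(current)) {
    		/* cancel the entry, re-enable irqs and go around again */
    	}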

  
  Please confirm whether a spinlock is necessary with the pseudocode above.
 In the current implementation, a spinlock is necessary to avoid a race condition 
 between vcpus when delivering a posted interrupt and syncing pir->irr.

Sync PIR->IRR uses xchg, which is locked and atomic, so I can't see the
need for the spinlock.
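
For reference, a PIR->IRR sync built on xchg could look roughly like this (a
sketch; the u32 pir[8] layout and the direct IRR register update are
assumptions, not the actual patch):

    static void sync_pir_to_irr(struct kvm_lapic *apic, u32 *pir)
    {
    	u32 i, pir_val;

    	for (i = 0; i < 8; i++) {
    		/* xchg atomically fetches and clears 32 vectors at a time, so
    		 * a bit set by a concurrent injector is either seen here or
    		 * left in the PIR for the next sync -- never lost */
    		pir_val = xchg(&pir[i], 0);
    		if (pir_val)
    			*((u32 *)(apic->regs + APIC_IRR + i * 0x10)) |= pir_val;
    	}
    }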

Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-23 Thread Marcelo Tosatti
On Sat, Feb 23, 2013 at 04:35:30PM +0200, Gleb Natapov wrote:
 On Sat, Feb 23, 2013 at 02:05:28PM +, Zhang, Yang Z wrote:
   +	return test_bit(vector, (unsigned long *)pi_desc->pir);
   +}
   +
   +static void pi_set_pir(int vector, struct pi_desc *pi_desc)
   +{
   +	__set_bit(vector, (unsigned long *)pi_desc->pir);
   +}
   
   This must be locked atomic operation (set_bit).
  If using a spinlock, pi_set_pir is not called concurrently. It is safe to use 
  __set_bit.
  
 HW still accesses it concurrently without bothering to take your lock.
 
   
   +
struct vcpu_vmx {
struct kvm_vcpu   vcpu;
unsigned long host_rsp;
   @@ -429,6 +465,10 @@ struct vcpu_vmx {
   
bool rdtscp_enabled;
   +/* Posted interrupt descriptor */
   +struct pi_desc pi_desc;
   +spinlock_t pi_lock;
   
   Don't see why the lock is necessary. Please use the following
   pseudocode for vmx_deliver_posted_interrupt:
   
   vmx_deliver_posted_interrupt(), returns 'bool injected'.
   
   1. orig_irr = read irr from vapic page
   2. if (orig_irr != 0)
   3.	return false;
   4. kvm_make_request(KVM_REQ_EVENT)
   5. bool injected = !test_and_set_bit(PIR)
   6. if (vcpu->guest_mode && injected)
   7.	if (test_and_set_bit(PIR notification bit))
   8.		send PIR IPI
   9. return injected
  
  Consider the following case:
  vcpu 0                          | vcpu 1
  send intr to vcpu1
  check irr
                                  | receive a posted intr
                                  | pir->irr (pir is cleared, irr is set)
  injected=test_and_set_bit=true
  pir=set
  
  Then both irr and pir have the interrupt pending; they may merge into one 
  later, but injected is reported as true. Wrong.
  
 Marcelo and I discussed the lockless logic that should be used here on
 the previous patch submission. All that is left for you is to implement it.
 We worked hard to make the irq injection path lockless; we are not going to
 add a lock now.

He is right, the scheme is still flawed (because of concurrent injectors
along with HW in VMX non-root). I'd say let's add a spinlock and think about
a lockless scheme in the meantime.



Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-23 Thread Marcelo Tosatti
On Thu, Feb 21, 2013 at 08:00:25PM -0600, Scott Wood wrote:
 On 02/21/2013 05:03:32 PM, Marcelo Tosatti wrote:
 On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:
  On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:
  On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
   It is then not necessary to set device attributes on a live
  guest and
   deal with the complications associated with that.
  
   Which complications?
  
   -Scott
  
  Semantics of individual attribute writes, for one.
 
  When the attribute is a device register, the hardware documentation
  takes care of that.
 
 You are not writing to the registers from the CPU point of view.
 
 That's exactly how KVM_DEV_MPIC_GRP_REGISTER is defined and
 implemented on MPIC (with the exception of registers whose behavior
 changes based on which specific vcpu you use to access them).
 If/when we have a need to set/get state in a different manner,
 that's a separate attribute group.

Can you describe usage of this register again?

  Otherwise, the semantics are documented in the
  device-specific documentation -- which can include restricting when
  the access is allowed.  Same as with any other interface
  documentation.
 
 Again, you are talking about the semantics of device access from the
 software operating on the machine view. We are discussing hypervisor
 userspace - hypervisor kernel interface.
 
 And I was talking about the userspace-to-hypervisor kernel interface
 documentation.  It just happens that one specific MPIC device group
 (when the attribute is a device register) is defined with respect
 to what guest software would see if it did a similar access.
 
 In general you never have to set attributes on a device after it has
 been initialized, because there is state associated with that device
 that requires proper handling (example: if you modify a timer counter
 register of a timer device, any software timers used to emulate the
 timer counter must be cancelled).
 
 Yes, it requires proper handling and the MMIO code does that.

 If and when we add raw state accessors, it's totally reasonable for
 there to be command/attribute-specific documented restrictions on
 when the access can be done.

 Also, it is necessary to provide proper locking of device attribute
 write versus vcpu device access. So far we have been focusing on
 having
 a lockless vcpu path.
 
 How is device access related to vcpus?  Existing irqchip code is not
 lockless.

VCPUS access in-kernel devices. Yes, it is lockless (see RCU usage in
virt/kvm/).

 So when device attributes can be modified has implications beyond what
 may seem visible at first.
 
 Are these reasonable arguments?
 
 Basically abstract 'device attributes' are too abstract.
 
 It's up to the device-specific documentation to make them not
 abstract (I admit there are a few details missing in mpic.txt that
 I've pointed out in this thread -- it is RFC v1 after all).  This
 wouldn't be any different if we used separate ioctls for everything.
 It's like saying abstract 'ioctl' is too abstract.

Perhaps a better way to put it is that it's too permissive.

 However, your proposed interface deals with sucky capability,
 versioning
 and namespace conflicts we have now. Note these items can probably be
 improved separately.
 
 Any particular proposals?

Namespace conflicts: Reserve ranges for each arch. 

The other two items, I haven't thought about. I am not the one bothered :-) (yes, they
suck).

  I suppose mpic.txt could use an additional statement that
  KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed
  by the guest.
 
  Locking versus currently executing VCPUs, for another (see how
  KVM_SET_IRQ's RCU usage, for instance, that is something should be
  shared).
 
  If you mean kvm_set_irq() in irq_comm.c, that's only relevant when
  you have a GSI routing table, which this patchset doesn't.
 
  Assuming we end up having a routing table to support irqfd, we still
  can't share the code as is, since it's APIC-specific.
 
 Suppose it is worthwhile to attempt to share code as much as possible.
 
 Sure... my point is it isn't a case of the common code is right
 over there, why aren't you using it?  I'll try to share what I
 reasonably can, subject to my limited knowledge of how the APIC
 stuff works.  The irqfd code is substantial enough that refactoring
 for sharing should be worthwhile.  I'm not so sure about irq_comm.c.
 
 -scott

Note just pointing out drawbacks of device attributes (if something of
that sort is integrated, x86 should also use it).



Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-23 Thread Marcelo Tosatti
On Sat, Feb 23, 2013 at 05:31:44PM +0200, Gleb Natapov wrote:
 On Sat, Feb 23, 2013 at 11:48:54AM -0300, Marcelo Tosatti wrote:
 1. orig_irr = read irr from vapic page
 2. if (orig_irr != 0)
 3.	return false;
 4. kvm_make_request(KVM_REQ_EVENT)
 5. bool injected = !test_and_set_bit(PIR)
 6. if (vcpu->guest_mode && injected)
 7.	if (test_and_set_bit(PIR notification bit))
 8.		send PIR IPI
 9. return injected

Consider the following case:
vcpu 0                          | vcpu 1
send intr to vcpu1
check irr
                                | receive a posted intr
                                | pir->irr (pir is cleared, irr is set)
injected=test_and_set_bit=true
pir=set

Then both irr and pir have the interrupt pending; they may merge into one 
later, but injected is reported as true. Wrong.

   Marcelo and I discussed the lockless logic that should be used here on
   the previous patch submission. All that is left for you is to implement it.
   We worked hard to make the irq injection path lockless; we are not going to
   add a lock now.
  
  He is right, the scheme is still flawed (because of concurrent injectors
  along with HW in VMX non-root). I'd say let's add a spinlock and think about
  a lockless scheme in the meantime.
 The logic proposed was (from that thread):
  apic_accept_interrupt() {
   if (PIR || IRR)
 return coalesced;
   else
 set PIR;
  }
 
 Which should map to something like:
 if (!test_and_set_bit(PIR))
   return coalesced;

HW transfers PIR to IRR, here. Say due to PIR IPI sent 
due to setting of a different vector.

 if (irr on vapic page is set)
 return coalesced;

 
 I do not see how the race above can happen this way. Others can, though, if
 the vcpu is outside the guest. Maybe we should deliver the interrupt differently
 depending on whether the vcpu is in guest mode or not.

The problem is with 3 contexts: two injectors and one vcpu in guest
mode. Earlier on that thread you mentioned:

The point is that we need to check PIR and IRR atomically and this is
impossible.

That would be one way to fix it.

 I'd rather think about proper way to do lockless injection before
 committing anything. The case where we care about correct injection
 status is rare, but we always care about scalability and since we
 violate the spec by reading vapic page while vcpu is in non-root
 operation anyway the result may be incorrect with or without the lock.
 Our use case was not in the HW designers' mind when they designed this thing,
 obviously :(

Zhang, can you comment on whether reading the vapic page while a CPU in
VMX non-root operation is accessing it is safe?

Gleb, yes, a comment mentioning the race (instead of the spinlock) and
explanation why it's believed to be benign (given how the injection
return value is interpreted) could also work. It's ugly, though... murphy
is around.

OTOH spinlock is not the end of the world, can figure out something later 
(we've tried without success so far).



Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-23 Thread Marcelo Tosatti
On Sat, Feb 23, 2013 at 03:16:11PM +, Zhang, Yang Z wrote:
 Marcelo Tosatti wrote on 2013-02-23:
  On Sat, Feb 23, 2013 at 02:05:28PM +, Zhang, Yang Z wrote:
  Marcelo Tosatti wrote on 2013-02-23:
  
  On Fri, Feb 22, 2013 at 09:42:32PM +0800, Yang Zhang wrote:
  From: Yang Zhang yang.z.zh...@intel.com
  
  Posted Interrupt allows APIC interrupts to be injected into the guest directly
  without any vmexit.
  
  - When delivering an interrupt to the guest, if the target vcpu is running,
    update the Posted-interrupt requests bitmap and send a notification
    event to the vcpu. Then the vcpu will handle this interrupt
    automatically, without any software involvement.
  - If the target vcpu is not running or there is already a notification event
    pending in the vcpu, do nothing. The interrupt will be handled by the
    next vm entry.
  Signed-off-by: Yang Zhang yang.z.zh...@intel.com
  ---
  
  diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
  index e4595f1..64616cc 100644
  --- a/arch/x86/kernel/irq.c
  +++ b/arch/x86/kernel/irq.c
  @@ -228,6 +228,26 @@ void smp_x86_platform_ipi(struct pt_regs *regs)
   set_irq_regs(old_regs);
   }
  +#ifdef CONFIG_HAVE_KVM
  +/*
  + * Handler for POSTED_INTERRUPT_VECTOR.
  + */
  +void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
  +{
  +struct pt_regs *old_regs = set_irq_regs(regs);
  +
  +ack_APIC_irq();
  +
  +irq_enter();
  +
  +exit_idle();
  +
  +irq_exit();
  +
  +set_irq_regs(old_regs);
  +}
  +#endif
  +
  
  Add per-cpu counters, similarly to other handlers in the same file.
  Sure.
  
  @@ -379,6 +398,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
 	if (!apic->irr_pending)
 		return -1;
  +	kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
 	result = apic_search_irr(apic);
 	ASSERT(result == -1 || result >= 16);
  
kvm_x86_ops->sync_pir_to_irr() should be called from vcpu_enter_guest,
  before inject_pending_event.
  
  if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
-
inject_pending_event
...
}
  Some other places will search the IRR to do something, like
  kvm_cpu_has_interrupt(). So we must sync PIR to IRR before touching the IRR,
  not just before entering the guest.
  
  I see. The only call site that needs IRR/PIR information with posted
  interrupt enabled is kvm_arch_vcpu_runnable, correct?
 Yes.
 
  If so, can we contain reading PIR to only that location?
 Yes, we can do it. 
 Actually, the reason I do PIR->IRR here is to avoid future misuse where
 a pending interrupt is checked just by
 calling kvm_vcpu_has_interrupt().

Yes, i see.

 Also, an extra callback may be needed to check the PIR.
 So you think your suggestion is better?

Because it respects standard request processing mechanism, yes.

  And have the PIR->IRR sync at KVM_REQ_EVENT processing only (to respect the
  standard way of event processing and also reduce reading of the PIR).
 Since we check the ON bit before each read of the PIR, the cost can be
 ignored.

Note reading ON bit is not OK, because if injection path did not see
vcpu->mode == IN_GUEST_MODE, ON bit will not be set.

  
  So it is possible that a PIR IPI is delivered and handled before guest
  entry.
  
  By clearing PIR notification bit after local_irq_disable, but before the
  last check for vcpu-requests, we guarantee that a PIR IPI is never lost.
  
  Makes sense? (please check the logic, it might be flawed).
 I am not sure whether I understand you. But I don't think the IPI will be lost 
 in the current patch:
 
 if (!pi_test_and_set_on(&vmx->pi_desc) && (vcpu->mode == IN_GUEST_MODE)) {
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), POSTED_INTR_VECTOR);
 	*delivered = true;
 }
 
 vcpu entry has: 
 vcpu->mode = GUEST_MODE
 local irq disable
 check request
 
 We send the IPI after making the request; if the IPI is received after 
 guest_mode is set but before irqs are disabled, it will still be handled by the 
 following request check.
 Please correct me if I am wrong.

cpu0                        cpu1                     vcpu0
test_and_set_bit(PIR-A)
set KVM_REQ_EVENT
                                                     process REQ_EVENT
                                                     PIR-A->IRR

                                                     vcpu->mode=IN_GUEST

if (vcpu0->guest_mode)
    if (!t_a_s_bit(PIR notif))
        send IPI
                                                     linux_pir_handler

                            t_a_s_b(PIR-B)=1
                            no PIR IPI sent

  
  Sync PIR->IRR uses xchg, which is locked and atomic, so can't see the
  need

Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-23 Thread Marcelo Tosatti
On Sat, Feb 23, 2013 at 09:42:14PM +0200, Gleb Natapov wrote:
  explanation why it's believed to be benign (given how the injection
  return value is interpreted) could also work. It's ugly, though... murphy
  is around.
 The race above is not benign. It will report an interrupt as coalesced
 while in reality it is injected. This may cause too many interrupts to be
 injected. If this happens rarely enough, ntp may be able to fix the time drift
 resulting from this.

OK.

  OTOH spinlock is not the end of the world, can figure out something later 
  (we've tried without success so far).
 It serializes all injections into the vcpu. I do not believe now that even
 with the lock we are safe, for the reason I mention above. We can use the PIR ON
 bit as a lock, but that only emphasises how ridiculous serialization of
 injections becomes.

Please review the 2nd iteration of pseudocode in patchset v4 thread, it
should be good.

Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-21 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:
 On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:
 On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
  It is then not necessary to set device attributes on a live
 guest and
  deal with the complications associated with that.
 
  Which complications?
 
  -Scott
 
 Semantics of individual attribute writes, for one.
 
 When the attribute is a device register, the hardware documentation
 takes care of that.  

You are not writing to the registers from the CPU point of view.

 Otherwise, the semantics are documented in the
 device-specific documentation -- which can include restricting when
 the access is allowed.  Same as with any other interface
 documentation.

Again, you are talking about the semantics of device access from the
software operating on the machine view. We are discussing hypervisor
userspace - hypervisor kernel interface.

In general you never have to set attributes on a device after it has
been initialized, because there is state associated with that device
that requires proper handling (example: if you modify a timer counter
register of a timer device, any software timers used to emulate the
timer counter must be cancelled).

Also, it is necessary to provide proper locking of device attribute
write versus vcpu device access. So far we have been focusing on having 
a lockless vcpu path.

So when device attributes can be modified has implications beyond what
may seem visible at first.

Are these reasonable arguments?

Basically abstract 'device attributes' are too abstract.

However, your proposed interface deals with sucky capability, versioning
and namespace conflicts we have now. Note these items can probably be
improved separately.

 I suppose mpic.txt could use an additional statement that
 KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed
 by the guest.
 
 Locking versus currently executing VCPUs, for another (see how
 KVM_SET_IRQ's RCU usage, for instance, that is something should be
 shared).
 
 If you mean kvm_set_irq() in irq_comm.c, that's only relevant when
 you have a GSI routing table, which this patchset doesn't.
 
 Assuming we end up having a routing table to support irqfd, we still
 can't share the code as is, since it's APIC-specific.

Suppose it is worthwhile to attempt to share code as much as possible.


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Marcelo Tosatti
On Sat, Feb 16, 2013 at 01:56:14PM +1100, Paul Mackerras wrote:
 On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote:
  On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:
  On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
   On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
   From: Benjamin Herrenschmidt b...@kernel.crashing.org
   
   This adds in-kernel emulation of the XICS (eXternal Interrupt
   Controller Specification) interrupt controller specified by
  PAPR, for
   both HV and PR KVM guests.
   
   This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
   KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
   should use in-kernel interrupt controller emulation, but also
  takes an
   argument struct that contains the type of interrupt controller
   architecture and an optional parameter.  Currently only one
  type value
   is defined, that which indicates the XICS architecture.
  
   Would the device config API I posted a couple days ago work for you?
  
  I suppose it could be made to work.  It doesn't feel like a natural
  fit though, because your API seems to assume (AFAICT) that a device is
  manipulated via registers at specific physical addresses, so I would
  have to invent an artificial set of registers with addresses and bit
  layouts, that aren't otherwise required.  The XICS is operated from
  the guest side via hcalls, not via emulated MMIO.
  
  I don't think it makes such an assumption.  The MPIC device has
  physical registers, so it exposes them, but it also exposes things
  that are not physical registers (e.g. the per-IRQ input state).  The
  generic device control layer leaves interpretation of attributes up
  to the device.
  
  I think it would be easier to fit XICS into the device control api
  model than to fit MPIC into this model, not to mention what would
  happen if we later want to emulate some other type of device -- x86
  already has at least one non-irqchip emulated device (i8254).
 
 I have no particular objection to the device control API per se, but
 I have two objections to using it as the primary interface to the XICS
 emulation.
 
 First, I dislike the magical side-effect where creating a device of a
 particular type (e.g. MPIC or XICS) automatically attaches it to the
 interrupt lines of the vcpus.  I prefer an explicit request to do
 in-kernel interrupt control. 

This is probably a stupid question, but why is the
KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface not appropriate for
your purposes?

x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
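
For reference, updating the routing from userspace looks roughly like this (a
sketch with error handling trimmed; note that KVM_SET_GSI_ROUTING replaces the
entire routing table, so a real caller passes all entries at once):

    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int route_gsi_to_ioapic_pin(int vm_fd, unsigned int gsi, unsigned int pin)
    {
    	struct kvm_irq_routing *r;
    	int ret;

    	r = calloc(1, sizeof(*r) + sizeof(struct kvm_irq_routing_entry));
    	if (!r)
    		return -1;

    	r->nr = 1;
    	r->entries[0].gsi = gsi;
    	r->entries[0].type = KVM_IRQ_ROUTING_IRQCHIP;
    	r->entries[0].u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC;	/* x86 */
    	r->entries[0].u.irqchip.pin = pin;

    	ret = ioctl(vm_fd, KVM_SET_GSI_ROUTING, r);
    	free(r);
    	return ret;
    }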

  Further, the magic means that you can
 only have one instance of the device, whereas you might want to model
 the interrupt controller architecture as several devices.  You could
 do that using several device types, but then the interconnections
 between them would also be magic.
 
 Secondly, it means that we are completely abandoning any attempt to
 define an abstract or generic interface to in-kernel interrupt
 controller emulations.  Each device will have its own unique set of
 attribute groups and its own unique userspace code to drive it, with
 no commonality between them.

snip



Re: [PATCH v2][QEMU] vmxcap: Report APIC register emulation and RDTSCP control

2013-02-20 Thread Marcelo Tosatti
On Mon, Feb 18, 2013 at 07:56:54AM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
 
 This time I've checked twice that I'm no longer missing a field.

Applied, thanks.



Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Marcelo Tosatti
On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
 On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
 Copying Christoffer since ARM has in kernel irq chip too.
 
 On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
  Currently, devices that are emulated inside KVM are configured in a
  hardcoded manner based on an assumption that any given architecture
  only has one way to do it.  If there's any need to access device
 state,
  it is done through inflexible one-purpose-only IOCTLs (e.g.
  KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
  cumbersome and depletes a limited numberspace.

Creation of ioctl has advantages. It makes explicit what the 
data contains and how it should be interpreted.
This means that for example, policy control can be performed at ioctl
level (can group, by ioctl number, which ioctl calls are allowed after
VM creation, for example).

It also makes it explicit that it's a userspace interface on which
applications depend.

With a single 'set device attribute' ioctl it's more difficult
to do that.

Abstracting the details also makes it cumbersome to read 
strace output :-)

  This API provides a mechanism to instantiate a device of a certain
  type, returning an ID that can be used to set/get attributes of the
  device.  Attributes may include configuration parameters (e.g.
  register base address), device state, operational commands, etc.  It
  is similar to the ONE_REG API, except that it acts on devices rather
  than vcpus.
 You not only provide a different way to create an in-kernel irqchip, you
 also use an alternate way to trigger interrupt lines. Before going into
 interface specifics, let's think about whether it is really worth it.
 
 Which it do you mean here?
 
 The ability to set/get attributes is needed.  Sorry, but get or set
 one blob of data, up to 512 bytes, for the entire irqchip is just
 not good enough -- assuming you don't want us to start sticking
 pointers and commands in *that* data. :-)

Why not? Is it necessary to constantly read/write attributes?

 If you mean the way to inject interrupts, it's simpler this way.

Are you injecting interrupts via this new SET_DEVICE_ATTRIBUTE ioctl?

 Why go out of our way to inject common glue code into a
 communication path between hw/kvm/mpic.c in QEMU and
 arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
 glue be specific to this one function when we could reuse the same
 communication glue used for other things, such as device state?
 
 And that's just for regular interrupts.  MSIs are vastly simpler on
 MPIC than what x86 does.
 
 x86 obviously supports the old way and will have to for some, very
 long, time.
 
 Sure.
 
 ARM vGIC code, which is ready to go upstream, uses the old way too. So
 it will be 2 archs against one.
 
 I wasn't aware that that's how it worked. :-P
 
 I was trying to be considerate by not making the entire thing
 gratuitously PPC or MPIC specific, as some others seem inclined to
 do (e.g. see irqfd and APIC).  We already had a discussion on ARM's
 set address ioctl and rather than extend *that* interface, they
 preferred to just stick something ARM-specific in ASAP with the
 understanding that it would be replaced (or more accurately, kept
 around as a thin wrapper around the new stuff) later.
 
 Christoffer, do you think the proposed way is better for your needs?
 Are you willing to make vGIC use it?
 
 Scott, what other devices are you planning to support with this
 interface?
 
 At the moment I do not have plans for other devices, though what
 does it hurt for the capability to be there?
 
 -Scott


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 03:09:49PM +0200, Gleb Natapov wrote:
 On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
  On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
  On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
   The ability to set/get attributes is needed.  Sorry, but get or set
   one blob of data, up to 512 bytes, for the entire irqchip is just
   not good enough -- assuming you don't want us to start sticking
   pointers and commands in *that* data. :-)
  
  The proposed interface sticks pointers into ioctl data, so why does doing
  the same for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP make you smile?
  
  There's a difference between putting a pointer in an ioctl control
  structure that is specifically documented as being that way (as in
  ONE_REG), versus taking an ioctl that claims to be setting/getting a
  blob of state and embedding pointers in it.  It would be like
  sticking a pointer in the attribute payload of this API, which I
  think is something to be discouraged.
 If documentation is what differentiates for you between silly and smart,
 then write documentation instead of new interfaces.
 
 KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on a blob of data on
 x86; nothing prevents you from adding MPIC specifics to the interface.
 Add MPIC state into the kvm_irqchip structure, and if 512 bytes is not enough
 for you to transfer the state, put pointers there and _document_ them.
 But with 512 bytes you can transfer the properties inline, so you probably
 do not need a pointer there anyway. I see you have three properties, 2 of
 them 32-bit and one 64-bit.
 
 It'd also be using
  KVM_SET_IRQCHIP to read data, which is the sort of thing you object
  to later on regarding KVM_IRQ_LINE_STATUS.
  
 Do not see why.
 
  Then there's the silliness of transporting 512 bytes just to read a
  descriptor for transporting something else.
  
 Yes, agree. But is this enough of a reason to introduce an entirely new
 interface? Is it on a performance-critical path? Doubt it, unless you
 abuse the interface to send interrupts, but then isn't it silly to
 do copy_from_user() twice to inject an interrupt like the proposed interface
 does?
 
  For signaling irqs (I think this is what you mean by commands)
  we have KVM_IRQ_LINE.
  
  It's one type of command.  Another is setting the address.  Another
  is writing to registers that have side effects (this is how MSI
  injection is done on MPIC, just as in real hardware).
  
 Setting the address is setting an attribute. Sending MSI is a command.
 Things you set/get during init/migration are attributes. Things you do
 to cause side-effects are commands.

Yes, it would be good to restrict what can be done via the interface
(to avoid abstracting away problems). At first glance, I would say
it should allow for initial configuration of device state, such as
registers etc.

Why exactly do you need to set attributes, Scott?

  What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
  (non-glue/wrapper) code can become common?
  
 No new ioctl, with exactly the same result (well, actually even faster, since
 less copying is done). You need to show us the benefits of the new interface
 vs the existing one, not vice versa.

Also related to this question is the point of avoiding having the
implementation of devices spread across userspace and the kernel
(this is one point Avi brought up often). If the device emulation
is implemented entirely in the kernel, all that is necessary are the initial
values of device registers (and retrieval of those values later for
savevm/migration).

It is then not necessary to set device attributes on a live guest and
deal with the complications associated with that.

snip



Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 06:28:24PM -0300, Marcelo Tosatti wrote:
 On Wed, Feb 20, 2013 at 03:09:49PM +0200, Gleb Natapov wrote:
  On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
   On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
   On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
The ability to set/get attributes is needed.  Sorry, but get or set
one blob of data, up to 512 bytes, for the entire irqchip is just
not good enough -- assuming you don't want us to start sticking
pointers and commands in *that* data. :-)
   
   Proposed interface sticks pointers into ioctl data, so why doing
   the same
   for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
   
   There's a difference between putting a pointer in an ioctl control
   structure that is specifically documented as being that way (as in
   ONE_REG), versus taking an ioctl that claims to be setting/getting a
   blob of state and embedding pointers in it.  It would be like
   sticking a pointer in the attribute payload of this API, which I
   think is something to be discouraged.
  If documentation is what differentiates for you between silly and smart,
  then write documentation instead of new interfaces.
  
  KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of data on
  x86, nothing prevent you from adding MPIC specifics to the interface,
  Add mpic state into kvm_irqchip structure and if 512 bytes is not enough
  for you to transfer the state put pointers there and _document_ them.
  But with 512 bytes you can transfer properties inline, so you probably
  do not need pointer there anyway. I see you have three properties 2 of
  them 32bit and one 64bit.
  
  It'd also be using
   KVM_SET_IRQCHIP to read data, which is the sort of thing you object
   to later on regarding KVM_IRQ_LINE_STATUS.
   
  Do not see why.
  
   Then there's the silliness of transporting 512 bytes just to read a
   descriptor for transporting something else.
   
  Yes, agree. But is this enough of a reason to introduce entirely new
  interface? Is it on performance critical path? Doubt it, unless you
  abuse the interface to send interrupts, but then isn't it silly to
  do copy_from_user() twice to inject an interrupt like proposed interface
  does?
  
   For signaling irqs (I think this is what you mean by commands)
   we have KVM_IRQ_LINE.
   
   It's one type of command.  Another is setting the address.  Another
   is writing to registers that have side effects (this is how MSI
   injection is done on MPIC, just as in real hardware).
   
  Setting the address is setting an attribute. Sending MSI is a command.
  Things you set/get during init/migration are attributes. Things you do
  to cause side-effects are commands.
 
 Yes, it would be good to restrict what can be done via the interface
 (to avoid abstracting away problems). At first glance, I would say
 it should allow for initial configuration of device state, such as
 registers etc.
 
 Why exactly do you need to set attributes, Scott?
 
   What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
   (non-glue/wrapper) code can become common?
   
  No new ioctl with exactly same result (well actually even faster since
  less copying is done). You need to show us the benefits of the new interface
  vs existing one, not vice versa.
 
 Also related to this question is the point of avoiding the
 implementation of devices to be spread across userspace and the kernel
 (this is one point Avi brought up often). If the device emulation
 is implemented entirely in the kernel, all that is necessary are the initial
 values of device registers (and retrieval of those values later for
 savevm/migration).
 
 It is then not necessary to set device attributes on a live guest and
 deal with the complications associated with that.
 
 snip

I suppose the atomic-test-and-set type operations introduced to ONE_REG ioctls
are one example of such complications.

Avi, can you remind us what are the drawbacks of having separate
userspace/kernel device emulation?



Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 05:20:50PM -0600, Scott Wood wrote:
  On 02/20/2013 03:17:27 PM, Marcelo Tosatti wrote:
 On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
  On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
  Copying Christoffer since ARM has in kernel irq chip too.
  
  On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
   Currently, devices that are emulated inside KVM are
 configured in a
   hardcoded manner based on an assumption that any given
 architecture
   only has one way to do it.  If there's any need to access device
  state,
   it is done through inflexible one-purpose-only IOCTLs (e.g.
   KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little
 thing is
   cumbersome and depletes a limited numberspace.
 
 Creation of ioctl has advantages. It makes explicit what the
 data contains and how it should be interpreted.
 This means that for example, policy control can be performed at ioctl
 level (can group, by ioctl number, which ioctl calls are allowed after
 VM creation, for example).
 
 It also makes it explicit that it's a userspace interface on which
 applications depend.
 
 With a single 'set device attribute' ioctl it's more difficult
 to do that.
 
 I don't see why creating ioctls rather than device attributes (or
 whatever you want to call them) differs this way.  Device attributes
 are inherently userspace interfaces, just as ioctls are.  What the
 data contains and how it is interpreted are documented, albeit in a
 more lightweight format than KVM usually uses for ioctls.
 
 An ioctl is no guarantee of good documentation.  KVM is far better
 than most of the kernel in that regard, but even there some ioctls
 are missing (e.g. KVM_IRQ_LINE_STATUS as mentioned elsewhere in this
 thread), and some others are inadequately explained (e.g. IRQ
 routing).
 
 By policy control, do you just mean that some ioctls act on a
 /dev/kvm fd, others on a vm fd, and others on a vcpu fd?  This is
 pretty similar to having a device fd, except for the mechanism used.

No, I mean policy control acting on ioctl numbers. It's essentially the
same issue as with 'strace' (in that it's necessary to parse the
ioctl data further).

 The main things that I dislike about adding new things at the ioctl
 level are:
 - limited numberspace, and more prone to merge conflicts than a
 device-specific numberspace

Can reserve per-architecture ranges to deal with that.

 - having to add a new pathway to get into the driver for each ioctl,
 rather than having all operations on a particular device go to the
   right place, and the device implementation interprets further
 (this assumes that we're talking about vm ioctls, and not returning
 a
   new fd for the device)

Agree with that point.

 - awkward way of negotiating capabilities with userspace (have to
 declare the capability id separately, and add code somewhere outside
 the
   device implementation to respond to capability requests)

Agree with that point.

 - api.txt formatting that imposes yet another document-internal
 numberspace to conflict on :-)

The main problem, to me, is that the interface allows flexibility but
the details of using such flexibility are not worked out.

That is, lack of definitions such as when setting attributes is allowed.
With a big blob, that information is implicit: a SET operation is a full
device reset.

With individual attributes:
- Which attributes can be set individually?
- Is there an order which must be respected?
- Locking.
- End result: more logic/code necessary.

 Abstracting the details also makes it cumbersome to read
 strace output :-)
 
 You'd have to update strace one way or another if you want
 pretty-printed output.  Having a more restricted API than arbitrary
 ioctls could actually improve the situation -- this could be a good
 reason for including the attribute length as an explicit parameter
 rather than being implicit in the attribute id.  Then you'd just
 teach strace about the device control API once, and not for each new
 attribute or device that gets added.
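
For reference, the set/get tuple being discussed is roughly of this shape
(the explicit 'len' field is the variant argued for above; the field names
here are illustrative, not taken from the RFC):

	struct kvm_device_attr_sketch {
		__u32	group;	/* device-defined group of attributes */
		__u32	len;	/* explicit payload size, so strace can decode it */
		__u64	attr;	/* attribute id within the group */
		__u64	addr;	/* userspace pointer to the payload */
	};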
 
   This API provides a mechanism to instantiate a device of a
 certain
   type, returning an ID that can be used to set/get attributes
 of the
   device.  Attributes may include configuration parameters (e.g.
   register base address), device state, operational commands,
 etc.  It
   is similar to the ONE_REG API, except that it acts on devices
 rather
   than vcpus.
  You not only provide a different way to create an in-kernel irqchip, you
  also use an alternate way to trigger interrupt lines. Before going into
  interface specifics, let's think about whether it is really worth it.
 
  Which it do you mean here?
 
  The ability to set/get attributes is needed.  Sorry, but get or set
  one blob of data, up to 512 bytes, for the entire irqchip is just
  not good enough -- assuming you don't want us to start sticking
  pointers and commands in *that* data. :-)
 
 Why not? Is it necessary to constantly read/write attributes?
 
 It's

[GIT PULL] KVM updates for the 3.9 merge window

2013-02-20 Thread Marcelo Tosatti
: move the code that installs new slots array to a separate function.
  KVM: emulator: drop RPL check from linearize() function
  KVM: emulator: implement fninit, fnstsw, fnstcw
  KVM: VMX: make rmode_segment_valid() more strict.
  KVM: VMX: fix emulation of invalid guest state.
  KVM: VMX: Do not fix segment register during vcpu initialization.
  KVM: VMX: handle IO when emulation is due to #GP in real mode.
  KVM: mmu: remove unused trace event
  KVM: MMU: simplify folding of dirty bit into accessed_dirty
  KVM: x86: remove unused variable from walk_addr_generic()
  Merge branch 'kvm-ppc-next' of https://github.com/agraf/linux-2.6 into 
queue
  KVM: VMX: remove special CPL cache access during transition to real mode.
  KVM: VMX: reset CPL only on CS register write.
  KVM: VMX: if unrestricted guest is enabled vcpu state is always valid.
  KVM: VMX: remove hack that disables emulation on vcpu reset/init
  KVM: VMX: skip vmx->rmode.vm86_active check on cr0 write if unrestricted 
guest is enabled
  KVM: VMX: don't clobber segment AR of unusable segments.
  KVM: VMX: rename fix_pmode_dataseg to fix_pmode_seg.
  KVM: x86: fix use of uninitialized memory as segment descriptor in 
emulator.
  KVM: VMX: set vmx-emulation_required only when needed.
  KVM: MMU: make spte_is_locklessly_modifiable() more clear
  KVM: MMU: drop unneeded checks.
  KVM: MMU: set base_role.nxe during mmu initialization.
  KVM: MMU: drop superfluous min() call.
  KVM: MMU: drop superfluous is_present_gpte() check.
  Revert "KVM: MMU: split kvm_mmu_free_page"
  KVM: VMX: add missing exit names to VMX_EXIT_REASONS array
  KVM: VMX: cleanup vmx_set_cr0().
  x86 emulator: fix parity calculation for AAD instruction

Jan Kiszka (1):
  KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr

Jesse Larrew (1):
  x86: kvm_para: fix typo in hypercall comments

Marcelo Tosatti (3):
  KVM: VMX: fix incorrect cached cpl value with real/v8086 modes
  x86: pvclock kvm: align allocation size to page size
  Revert "KVM: MMU: lazily drop large spte"

Mihai Caraman (2):
  KVM: PPC: Fix SREGS documentation reference
  KVM: PPC: Fix mfspr/mtspr MMUCFG emulation

Nadav Amit (1):
  KVM: x86: fix mov immediate emulation for 64-bit operands

Nickolai Zeldovich (1):
  kvm: fix i8254 counter 0 wraparound

Peter Zijlstra (1):
  sched: Bail out of yield_to when source and target runqueue has one task

Raghavendra K T (1):
  kvm: Handle yield_to failure return code for potential undercommit case

Takuya Yoshikawa (13):
  KVM: Write protect the updated slot only when dirty logging is enabled
  KVM: MMU: Remove unused parameter level from __rmap_write_protect()
  KVM: MMU: Make kvm_mmu_slot_remove_write_access() rmap based
  KVM: Remove unused slot_bitmap from kvm_mmu_page
  KVM: Make kvm_mmu_change_mmu_pages() take mmu_lock by itself
  KVM: Make kvm_mmu_slot_remove_write_access() take mmu_lock by itself
  KVM: MMU: Conditionally reschedule when 
kvm_mmu_slot_remove_write_access() takes a long time
  KVM: set_memory_region: Don't jump to out_free unnecessarily
  KVM: set_memory_region: Don't check for overlaps unless we create or move 
a slot
  KVM: set_memory_region: Remove unnecessary variable memslot
  KVM: set_memory_region: Identify the requested change explicitly
  KVM: set_memory_region: Disallow changing read-only attribute later
  KVM: Remove user_alloc from struct kvm_memory_slot

Xiao Guangrong (9):
  KVM: MMU: fix Dirty bit missed if CR0.WP = 0
  KVM: MMU: fix infinite fault access retry
  KVM: x86: clean up reexecute_instruction
  KVM: x86: let reexecute_instruction work for tdp
  KVM: x86: improve reexecute_instruction
  KVM: MMU: lazily drop large spte
  KVM: MMU: cleanup mapping-level
  KVM: MMU: remove pt_access in mmu_set_spte
  KVM: MMU: cleanup __direct_map

Yang Zhang (5):
  KVM: remove a wrong hack of delivery PIT intr to vcpu0
  x86, apicv: add APICv register virtualization support
  x86, apicv: add virtual x2apic support
  x86, apicv: add virtual interrupt delivery support
  KVM: VMX: disable apicv by default

 Documentation/virtual/kvm/api.txt|  108 +-
 Documentation/virtual/kvm/mmu.txt|7 
 arch/ia64/include/asm/kvm_host.h |4 
 arch/ia64/kvm/kvm-ia64.c |8 
 arch/ia64/kvm/lapic.h|6 
 arch/powerpc/include/asm/kvm_host.h  |   10 
 arch/powerpc/include/asm/kvm_ppc.h   |   12 
 arch/powerpc/include/asm/reg.h   |2 
 arch/powerpc/include/asm/reg_booke.h |1 
 arch/powerpc/include/uapi/asm/kvm.h  |6 
 arch/powerpc/kernel/asm-offsets.c|2 
 arch/powerpc/kvm/Makefile|9 
 arch/powerpc/kvm/book3s_emulate.c|   30 
 arch/powerpc/kvm/book3s_hv.c |2 
 arch/powerpc/kvm/book3s_pr.c

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
 On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
 This is probably a stupid question, but why the
 KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
 your purposes?
 
 x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
 KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
 
 To start, the whole IRQ routing stuff is poorly documented.
 
 Am I supposed to make up GSI numbers and use the routing thing to
 associate them with real interrupts? 

I have no idea. Is mapping from one integer linear space (GSIs) 
to real interrupts suitable for you? 

 Are there constraints on what
 sort of GSI numbers I can choose (I now see in the code that
 KVM_MAX_IRQ_ROUTES is returned from the capability check, but where
 is that documented?  

Don't think it is.

 It looks like the APIC implementation has
 default routes, where is that documented?)? 

In the code.

 Where does the code live to manage this table, and how APICy is it (looks 
 like the
 answer is irq_comm.c, and very)? 

Thinking faster than typing? Not sure what you mean.

 I suppose I could write another
 implementation of the table management code for MPIC, though the
 placement of irqchip inside the route entry, rather than as an
 argument to KVM_IRQ_LINE, suggests the table is supposed to be
 global, not in the individual interrupt controller.

Yes the table is global. It maps GSI (Global System Interrupt IIRC)
(integer) to (irqchip,pin) pair.
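
For reference, an entry in that table looks roughly like this (see struct
kvm_irq_routing_entry in the uapi headers; only the irqchip variant of the
union is shown here):

	struct kvm_irq_routing_entry {
		__u32 gsi;	/* index in the linear GSI space */
		__u32 type;	/* e.g. KVM_IRQ_ROUTING_IRQCHIP */
		__u32 flags;
		__u32 pad;
		union {
			struct {
				__u32 irqchip;	/* which interrupt controller */
				__u32 pin;	/* pin on that controller */
			} irqchip;
			__u32 pad[8];
		} u;
	};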

 It looks like I'm going to have to do this anyway for irqfd, though
 that doesn't make the other uses of the device control api go away.
 Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading
 the status for debugging (reading device registers doesn't quite do
 it, since the active bit won't show up if the interrupt is
 masked).  

 At that point, is it more offensive to make it read-only
 even though it would be trivial to make it read/write (which would
 allow users who don't need it to bypass the routing API), or to make
 it read/write and live with there being more than one way to do
 something?

Can't follow this sentence.

 KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes
 of state, and because it doesn't allow debugging access to device
 registers (e.g. inspecting from the QEMU command line), and because
 it's hard to add new pieces of state if we realize we left something
 out.  It reminds me of GET/SET_SREGS.  With that, I did what you
 seem to want here, which is to adapt the existing interfaces, using
 feature flags to control optional state.  It ended up being a mess,
 and ONE_REG was introduced as a replacement.  The device control API
 is the equivalent of ONE_REG for things other than vcpus.
 
 -Scott

- ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2?
- Agree on poor extensibility of interface. Adding a reserved amount
of space as padding and versioning such as has been done so far 
is not acceptable? 
- Debugging: why is reading the entire register state not acceptable? Yes,
  it's slow.
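
The padding/versioning idea in the second point would look something like
this (entirely hypothetical, just to make the question concrete):

	struct kvm_irqchip2 {
		__u32 chip_id;
		__u32 version;		/* bumped whenever new state is appended */
		__u32 len;		/* bytes of 'state' actually valid */
		__u32 pad[13];		/* reserved for future growth */
		__u8  state[2048];	/* larger, but still fixed-size, blob */
	};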

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
 Why exactly do you need to set attributes, Scott?
 
 That's best answered by looking at patch 6/6 and discussing the
 actual attributes that are defined so far.
 
 The need to set the base address of the registers is
 straightforward.  When ARM added their special ioctl for this, we
 discussed it being eventually replaced with a more general API (in
 fact, that was the reason they put ARM in the name).
 
 Access to device registers was originally intended for debugging,
 and eventually migration.  It turned out to also be very useful for
 injecting MSIs -- nothing special needed to be done.  It Just
 Works(tm) the same way it does in hardware, with an MMIO write from
 a PCI device to a specific MPIC register.  Again, irqfd may
 complicate this, but there's no need for QEMU-generated MSIs to have
 to take a circuitous path.
 
 Finally, there's the interrupt source attribute.  Even if we get rid
 of that, we'll need MPIC-specific documentation on how to flatten
 the IRQ numberspace, and if we ever have a cascaded PIC situation
 things could get ugly since there's no way to identify a specific
 IRQ controller in KVM_IRQ_LINE.
 
 Also related to this question is the point of avoiding the
 implementation of devices to be spread across userspace and the kernel
 (this is one point Avi brought up often). If the device emulation
 is implemented entirely in the kernel, all that is necessary are the initial
 values of device registers (and retrieval of those values later for
 savevm/migration).
 
 MPIC emulation is entirely in the kernel with this patchset --
 though the register that lets you reset cores will likely need to be
 kicked back to QEMU.
 
 It is then not necessary to set device attributes on a live guest and
 deal with the complications associated with that.
 
 Which complications?
 
 -Scott

Semantics of individual attribute writes, for one.

Locking versus currently executing VCPUs, for another (see how
KVM_SET_IRQ uses RCU, for instance; that is something that should be
shared).

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[uq/master PATCH] target-i386: kvm: save/restore steal time MSR

2013-02-19 Thread Marcelo Tosatti

Read and write steal time MSR, so that reporting is functional across
migration.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 7577e4f..17c7293 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -792,6 +792,7 @@ typedef struct CPUX86State {
 #endif
 uint64_t system_time_msr;
 uint64_t wall_clock_msr;
+uint64_t steal_time_msr;
 uint64_t async_pf_en_msr;
 uint64_t pv_eoi_en_msr;
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 0cf413d..9ae9d74 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -68,6 +68,7 @@ static bool has_msr_tsc_deadline;
 static bool has_msr_async_pf_en;
 static bool has_msr_pv_eoi_en;
 static bool has_msr_misc_enable;
+static bool has_msr_kvm_steal_time;
 static int lm_capable_kernel;
 
 bool kvm_allows_irq0_override(void)
@@ -507,6 +508,8 @@ int kvm_arch_init_vcpu(CPUState *cs)
 
 has_msr_pv_eoi_en = c->eax & (1 << KVM_FEATURE_PV_EOI);
 
+has_msr_kvm_steal_time = c->eax & (1 << KVM_FEATURE_STEAL_TIME);
+
 cpu_x86_cpuid(env, 0, 0, &limit, &unused, &unused, &unused);
 
 for (i = 0; i <= limit; i++) {
@@ -1107,6 +1110,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 kvm_msr_entry_set(&msrs[n++], MSR_KVM_PV_EOI_EN,
   env->pv_eoi_en_msr);
 }
+if (has_msr_kvm_steal_time) {
+kvm_msr_entry_set(&msrs[n++], MSR_KVM_STEAL_TIME,
+  env->steal_time_msr);
+}
 if (hyperv_hypercall_available()) {
 kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_GUEST_OS_ID, 0);
 kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_HYPERCALL, 0);
@@ -1360,6 +1367,9 @@ static int kvm_get_msrs(X86CPU *cpu)
 if (has_msr_pv_eoi_en) {
 msrs[n++].index = MSR_KVM_PV_EOI_EN;
 }
+if (has_msr_kvm_steal_time) {
+msrs[n++].index = MSR_KVM_STEAL_TIME;
+}
 
 if (env->mcg_cap) {
 msrs[n++].index = MSR_MCG_STATUS;
@@ -1445,6 +1455,9 @@ static int kvm_get_msrs(X86CPU *cpu)
 case MSR_KVM_PV_EOI_EN:
 env->pv_eoi_en_msr = msrs[i].data;
 break;
+case MSR_KVM_STEAL_TIME:
+env->steal_time_msr = msrs[i].data;
+break;
 }
 }
 
diff --git a/target-i386/machine.c b/target-i386/machine.c
index 8df6a6b..1feb9ca 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -287,6 +287,24 @@ static bool pv_eoi_msr_needed(void *opaque)
 return cpu->pv_eoi_en_msr != 0;
 }
 
+static bool steal_time_msr_needed(void *opaque)
+{
+CPUX86State *cpu = opaque;
+
+return cpu->steal_time_msr != 0;
+}
+
+static const VMStateDescription vmstate_steal_time_msr = {
+.name = "cpu/steal_time_msr",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(steal_time_msr, CPUX86State),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static const VMStateDescription vmstate_async_pf_msr = {
 .name = "cpu/async_pf_msr",
 .version_id = 1,
@@ -494,6 +512,9 @@ static const VMStateDescription vmstate_cpu = {
 .vmsd = &vmstate_pv_eoi_msr,
 .needed = pv_eoi_msr_needed,
 } , {
+.vmsd = &vmstate_steal_time_msr,
+.needed = steal_time_msr_needed,
+} , {
 .vmsd = &vmstate_fpop_ip_dp,
 .needed = fpop_ip_dp_needed,
 }, {
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.

2013-02-18 Thread Marcelo Tosatti
On Tue, Feb 05, 2013 at 03:49:41PM -0600, Michael Wolf wrote:
 Add a helper routine to scheduler/core.c to allow the kvm module
 to retrieve the cpu hardlimit settings.  The values will be used
 to set up a timer that is used to separate the consigned from the
 steal time.

1) Can you please describe, in english, the mechanics of subtracting cpu
hardlimit values from steal time reported via run_delay supposed to
work?

The period and the quota used to separate the consigned time 
(expected steal) from the steal time are taken
from the cfs bandwidth control settings. Any other steal time
accruing during that period will show as the traditional steal time.
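
Read literally, the proposed split amounts to something like the following
(an illustration of the description above, not code taken from the patches):

	/* Per accounting period, with cfs hard caps configured: */
	static void split_steal(u64 stolen_delta, u64 quota_ns, u64 period_ns,
				u64 *consigned, u64 *steal)
	{
		/* time the hard cap is expected to take away each period */
		u64 expected = period_ns - quota_ns;
		u64 c = stolen_delta < expected ? stolen_delta : expected;

		*consigned += c;		/* "expected" steal due to the cap */
		*steal += stolen_delta - c;	/* only steal beyond the cap remains */
	}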

There is no expected steal time over a fixed period of real time.

2) From the description of patch 1: In the case of where you have
a system that is running in a capped or overcommitted environment 
the user may see steal time being reported in accounting tools 
such as top or vmstat. 

This is outdated, right? Because an overcommitted environment is exactly
what steal time should report.


Thanks

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bisected][-next-20130204+] [x86/kvm] udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d

2013-02-18 Thread Marcelo Tosatti
On Wed, Feb 13, 2013 at 06:57:09AM -0500, Peter Hurley wrote:
 On Wed, 2013-02-13 at 12:51 +0200, Gleb Natapov wrote:
  On Tue, Feb 12, 2013 at 04:39:03PM -0800, H. Peter Anvin wrote:
   On 02/12/2013 04:26 PM, Peter Hurley wrote:
With -next-20130204+ in ubuntu 12.10 VM (so the 80x25 VGA
device/console):

[0.666410] udevd[97]: starting version 175
[0.674043] udevd[97]: udevd:[97]: segfault at ff5fd020 ip 
7fff069e277f sp 7fff068c9ef8 error d

and boots to an initramfs prompt.

git bisect (log attached) blames:

commit 7b5c4a65cc27f017c170b025f8d6d75dabb11c6f
Merge: 3596f5b 949db15
Author: H. Peter Anvin h...@linux.intel.com
Date:   Fri Jan 25 16:31:21 2013 -0800

Merge tag 'v3.8-rc5' into x86/mm

The __pa() fixup series that follows touches KVM code that is not
present in the existing branch based on v3.7-rc5, so merge in the
current upstream from Linus.

Signed-off-by: H. Peter Anvin h...@linux.intel.com


This only happens with the VGA device/console but that is the default
configuration for Ubuntu/KVM because it blacklists pretty much every fb
driver.

   
   I am guessing this is another bad use of __pa()... need to look into that.

Can't find this commit on kvm.git or linux-2.6.git. Is it reproducible
there?

  He is using 64bit guest and on those __pa() happens to be working. Is it
  possible that slow_virt_to_phys() does not work as expected? Peter (the
  bug reporter :)) can you run your guest kernel with loglevel=7 and
  attach/send me the console output?
 
 Attached.
 
 BTW, this message happens on 'good' boots too:
 
 [0.00] [ cut here ]
 [0.00] WARNING: at 
 /home/peter/src/kernels/next/arch/x86/kernel/pvclock.c:182 
 pvclock_init_vsyscall+0x22/0x60()
 [0.00] Hardware name: Bochs
 [0.00] Modules linked in:
 [0.00] Pid: 0, comm: swapper Not tainted 3.8.0-next-20130204-xeon 
 #20130204
 [0.00] Call Trace:
 [0.00]  [8105812f] warn_slowpath_common+0x7f/0xc0
 [0.00]  [8105818a] warn_slowpath_null+0x1a/0x20
 [0.00]  [81d20521] pvclock_init_vsyscall+0x22/0x60
 [0.00]  [81d20480] kvm_setup_vsyscall_timeinfo+0x74/0xd8
 [0.00]  [81d201d1] kvm_guest_init+0xd0/0xe9
 [0.00]  [81d13f7c] setup_arch+0xbee/0xcaf
 [0.00]  [816cbceb] ? printk+0x61/0x63
 [0.00]  [81d0cbc3] start_kernel+0xd3/0x3f0
 [0.00]  [81d0c5e4] x86_64_start_reservations+0x2a/0x2c
 [0.00]  [81d0c6d7] x86_64_start_kernel+0xf1/0x100
 [0.00] ---[ end trace b47bb564b2d6ec76 ]---
 
 Regards,
 Peter Hurley

Sending a patch for this, thanks for the report.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PULL 00/14] ppc patch queue 2013-02-15

2013-02-18 Thread Marcelo Tosatti
Pulled, thanks.

On Fri, Feb 15, 2013 at 01:16:16AM +0100, Alexander Graf wrote:
 Hi Marcelo / Gleb,
 
 This is my current patch queue for ppc.  Please pull.
 
 Highlights of this queue drop are:
 
   - BookE: Fast mapping support for 4k backed memory
   - BookE: Handle alignment interrupts
 
 Alex
 
 
 The following changes since commit cbd29cb6e38af6119df2cdac0c58acf0e85c177e:
   Jan Kiszka (1):
 KVM: nVMX: Remove redundant get_vmcs12 from 
 nested_vmx_exit_handled_msr
 
 are available in the git repository at:
 
   git://github.com/agraf/linux-2.6.git kvm-ppc-next
 
 Alexander Graf (11):
   KVM: PPC: E500: Move write_stlbe higher
   KVM: PPC: E500: Explicitly mark shadow maps invalid
   KVM: PPC: E500: Propagate errors when shadow mapping
   KVM: PPC: e500: Call kvmppc_mmu_map for initial mapping
   KVM: PPC: E500: Split host and guest MMU parts
   KVM: PPC: e500: Implement TLB1-in-TLB0 mapping
   KVM: PPC: E500: Make clear_tlb_refs and clear_tlb1_bitmap static
   KVM: PPC: E500: Remove kvmppc_e500_tlbil_all usage from guest TLB code
   Merge commit 'origin/next' into kvm-ppc-next
   KVM: PPC: BookE: Handle alignment interrupts
   Merge commit 'origin/next' into kvm-ppc-next
 
 Bharat Bhushan (3):
   KVM: PPC: booke: use vcpu reference from thread_struct
   KVM: PPC: booke: Allow multiple exception types
   booke: Added DBCR4 SPR number
 
  arch/powerpc/include/asm/kvm_ppc.h   |2 -
  arch/powerpc/include/asm/reg.h   |2 -
  arch/powerpc/include/asm/reg_booke.h |1 +
  arch/powerpc/kernel/asm-offsets.c|2 +-
  arch/powerpc/kvm/Makefile|9 +-
  arch/powerpc/kvm/booke.c |   30 +-
  arch/powerpc/kvm/booke.h |1 +
  arch/powerpc/kvm/booke_interrupts.S  |   49 +-
  arch/powerpc/kvm/e500.c  |   16 +-
  arch/powerpc/kvm/e500.h  |1 +
  arch/powerpc/kvm/e500_mmu.c  |  809 +++
  arch/powerpc/kvm/e500_mmu_host.c |  699 +
  arch/powerpc/kvm/e500_mmu_host.h |   18 +
  arch/powerpc/kvm/e500_tlb.c  | 1430 
 --
  14 files changed, 1610 insertions(+), 1459 deletions(-)
  create mode 100644 arch/powerpc/kvm/e500_mmu.c
  create mode 100644 arch/powerpc/kvm/e500_mmu_host.c
  create mode 100644 arch/powerpc/kvm/e500_mmu_host.h
  delete mode 100644 arch/powerpc/kvm/e500_tlb.c
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-02-18 Thread Marcelo Tosatti
On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
 2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:
  In the case of where you have a system that is running in a
  capped or overcommitted environment the user may see steal time
  being reported in accounting tools such as top or vmstat.  This can
  cause confusion for the end user.
 
 Sorry, I'm no expert in this area. But I don't really understand what
 is confusing for the end user here.

I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.

But yes, a description of the scenario that is being dealt with, with
details, is important.

  To ease the confusion this patch set
  adds the idea of consigned (expected steal) time.  The host will separate
  the consigned time from the steal time.  The steal time will only be 
  altered
  if hard limits (cfs bandwidth control) is used.  The period and the quota 
  used
  to separate the consigned time (expected steal) from the steal time are 
  taken
  from the cfs bandwidth control settings. Any other steal time accruing 
  during
  that period will show as the traditional steal time.
 
 I'm also a bit confused here. steal time will then only account the
 cpu time lost due to quotas from cfs bandwidth control? Also what do
 you exactly mean by expected steal time? Is it steal time due to
 overcommitting minus scheduler quotas?
 
 Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


x86: pvclock kvm: align allocation size to page size

2013-02-18 Thread Marcelo Tosatti

To match what's mapped via vsyscalls to userspace.

Reported-by: Peter Hurley pe...@hurleysoftware.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 220a360..5bedbdd 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -218,6 +218,9 @@ static void kvm_shutdown(void)
 void __init kvmclock_init(void)
 {
unsigned long mem;
+   int size;
+
+   size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
 
if (!kvm_para_available())
return;
@@ -231,16 +234,14 @@ void __init kvmclock_init(void)
 printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);
 
-   mem = memblock_alloc(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS,
-PAGE_SIZE);
+   mem = memblock_alloc(size, PAGE_SIZE);
if (!mem)
return;
hv_clock = __va(mem);
 
 if (kvm_register_clock("boot clock")) {
hv_clock = NULL;
-   memblock_free(mem,
-   sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
+   memblock_free(mem, size);
return;
}
pv_time_ops.sched_clock = kvm_clock_read;
@@ -275,7 +276,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
struct pvclock_vcpu_time_info *vcpu_time;
unsigned int size;
 
-   size = sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS;
+   size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
 
preempt_disable();
cpu = smp_processor_id();


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6] KVM: nVMX: Improve I/O exit handling

2013-02-18 Thread Marcelo Tosatti
On Mon, Feb 18, 2013 at 12:32:37PM +0200, Gleb Natapov wrote:
 On Mon, Feb 18, 2013 at 11:21:16AM +0100, Jan Kiszka wrote:
  This prevents trapping L2 I/O exits if L1 has neither unconditional nor
  bitmap-based exiting enabled. Furthermore, it implements I/O bitmap
  handling.
  
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 Reviewed-by: Gleb Natapov g...@redhat.com

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: nVMX: Trap unconditionally if msr bitmap access fails

2013-02-18 Thread Marcelo Tosatti
On Sun, Feb 17, 2013 at 10:56:36AM +0200, Gleb Natapov wrote:
 On Thu, Feb 14, 2013 at 07:46:27PM +0100, Jan Kiszka wrote:
  This avoids basing decisions on uninitialized variables, potentially
  leaking kernel data to the L1 guest.
  
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 Reviewed-by: Gleb Natapov g...@redhat.com

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM: x86: switch unhandled MSR printk's to tracepoints

2013-02-12 Thread Marcelo Tosatti
On Tue, Feb 12, 2013 at 02:33:17PM +0200, Gleb Natapov wrote:
 On Mon, Feb 11, 2013 at 09:50:09PM -0200, Marcelo Tosatti wrote:
  
   Such messages in the system log very often cause confusion because users
   understand the "kvm: unhandled MSR" messages as errors.
  
 They are errors. Linux compiled with PV support ignores them, but other
 guest OSes or Linux without PV will crash. Some indication that guest
 crash due to use of unsupported MSR is a helpful hint where to look.

OK, closing 
https://bugzilla.redhat.com/show_bug.cgi?id=883216
as wontfix because printk_ratelimit renders the messages unreliable.

  Switch to trace points, making the information available on demand.
 What new information those trace points provide that we cannot get today
 from trace_kvm_msr()?

None. 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM: x86: switch unhandled MSR printk's to tracepoints

2013-02-11 Thread Marcelo Tosatti

Such messages in the system log very often cause confusion because users
understand the "kvm: unhandled MSR" messages as errors.

Switch to trace points, making the information available on demand.

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index fe5e00e..c461368 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -318,6 +318,40 @@ TRACE_EVENT(kvm_msr,
 #define trace_kvm_msr_write_ex(ecx, data)  trace_kvm_msr(1, ecx, data, true)
 
 /*
+ * Tracepoint for unhandled guest MSR access.
+ */
+TRACE_EVENT(kvm_msr_unhandled,
+   TP_PROTO(bool write, u32 ecx, u64 data, bool unhandled),
+   TP_ARGS(write, ecx, data, unhandled),
+
+   TP_STRUCT__entry(
+   __field(u32,ecx )
+   __field(u64,data)
+   __field(bool,   write   )
+   __field(bool,   unhandled   )
+   ),
+
+   TP_fast_assign(
+   __entry->ecx= ecx;
+   __entry->data   = data;
+   __entry->write  = write;
+   __entry->unhandled  = unhandled;
+   ),
+
+   TP_printk("unhandled MSR %s %x = 0x%llx (%s)",
+ __entry->write ? "write" : "read",
+ __entry->ecx, __entry->data,
+ __entry->unhandled ? "injected #GP" : "ignored")
+);
+
+#define trace_kvm_msr_unhandled_read(ecx) trace_kvm_msr_unhandled(0, ecx, 0, false)
+#define trace_kvm_msr_unhandled_read_ex(ecx) trace_kvm_msr_unhandled(0, ecx, 0, true)
+#define trace_kvm_msr_unhandled_write(ecx, data)   \
+ trace_kvm_msr_unhandled(1, ecx, data, false)
+#define trace_kvm_msr_unhandled_write_ex(ecx, data)\
+ trace_kvm_msr_unhandled(1, ecx, data, true)
+
+/*
  * Tracepoint for guest CR access.
  */
 TRACE_EVENT(kvm_cr,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c243b81..f5abcce 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1772,8 +1772,7 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 
msr, u64 data)
break;
}
default:
-   vcpu_unimpl(vcpu, "HYPER-V unimplemented wrmsr: 0x%x "
-   "data 0x%llx\n", msr, data);
+   trace_kvm_msr_unhandled_write_ex(msr, data);
return 1;
}
return 0;
@@ -1805,8 +1804,7 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
case HV_X64_MSR_TPR:
return kvm_hv_vapic_msr_write(vcpu, APIC_TASKPRI, data);
default:
-   vcpu_unimpl(vcpu, "HYPER-V unimplemented wrmsr: 0x%x "
-   "data 0x%llx\n", msr, data);
+   trace_kvm_msr_unhandled_write_ex(msr, data);
return 1;
}
 
@@ -1888,15 +1886,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
 data &= ~(u64)0x100;/* ignore ignne emulation enable */
 data &= ~(u64)0x8;  /* ignore TLB cache disable */
if (data != 0) {
-   vcpu_unimpl(vcpu, "unimplemented HWCR wrmsr: 0x%llx\n",
-   data);
+   trace_kvm_msr_unhandled_write_ex(msr, data);
return 1;
}
break;
case MSR_FAM10H_MMIO_CONF_BASE:
if (data != 0) {
-   vcpu_unimpl(vcpu, "unimplemented MMIO_CONF_BASE wrmsr: "
-   "0x%llx\n", data);
+   trace_kvm_msr_unhandled_write_ex(msr, data);
return 1;
}
break;
@@ -1911,8 +1907,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
   thus reserved and should throw a #GP */
return 1;
}
-   vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
-   __func__, data);
+
+   trace_kvm_msr_unhandled_write(msr, data);
break;
case MSR_IA32_UCODE_REV:
case MSR_IA32_UCODE_WRITE:
@@ -2020,8 +2016,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
case MSR_K7_EVNTSEL2:
case MSR_K7_EVNTSEL3:
if (data != 0)
-   vcpu_unimpl(vcpu, "unimplemented perfctr wrmsr: "
-   "0x%x data 0x%llx\n", msr, data);
+   trace_kvm_msr_unhandled_write(msr, data);
break;
/* at least RHEL 4 unconditionally writes to the perfctr registers,
 * so we ignore writes to make it happy.
@@ -2030,8 +2025,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
case MSR_K7_PERFCTR1:
case MSR_K7_PERFCTR2:
case MSR_K7_PERFCTR3:
-   vcpu_unimpl(vcpu, unimplemented 

Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-08 Thread Marcelo Tosatti
On Fri, Feb 08, 2013 at 02:28:44PM +0200, Gleb Natapov wrote:
 On Thu, Feb 07, 2013 at 07:49:47PM -0200, Marcelo Tosatti wrote:
  On Thu, Feb 07, 2013 at 04:01:11PM +0200, Gleb Natapov wrote:
   On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote:
 Second is that interrupt may be
 reported as delivered, but it will be coalesced (possible only with 
 the self
 IPI with the same vector):
 
 Starting condition: PIR=0, IRR=0 vcpu is in a guest mode
 
 io thread |   vcpu
 accept_apic_interrupt()   |
  PIR and IRR is zero  |
  set PIR  |
  return delivered |
   |  self IPI
   |  set IRR
   | merge PIR to IRR (*)
 
 At (*) interrupt that was reported as delivered is coalesced.

   Only vcpu itself should send self-IPI, so it's fine.

   It is fine only because this is not happening in practice (I hope) for 
   single interrupt
   we care about. Otherwise this is serious problem.
  
  Coalesced information is only interesting for non IPI cases, that
  is, device emulation (at the moment, at least).
  
 And incorrect result will be returned for an interrupt injected by an 
 emulated device
 in the scenario above.

   The above case can happen when loading APIC registers, but delivered
  is not interesting in that case. Good to document, however.
  
  Or:
  
  apic_accept_interrupt() {
  
  1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
  Never set IRR when HWAPIC enabled, even if outside of guest mode.
  2. Set PIR and let HW or SW VM-entry transfer it to IRR.
  3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
  }
  
 This can report interrupt as coalesced, but it will be eventually 
 delivered
 as separate interrupt:
 
 Starting condition: PIR=0, IRR=1 vcpu is in a guest mode
 
  io thread  | vcpu
 | 
 accept_apic_interrupt() |
 ORIG_PIR=0, ORIG_IRR=1  |
 |EOI
 |clear IRR, set ISR
 set PIR |
 return coalesced|
 |clear PIR, set IRR
 |EOI
 |clear IRR, set ISR (*)
 
 At (*) interrupt that was reported as coalesced is delivered.
 
 
 So still no perfect solution. But first one has much less serious
 problems for our practical needs.
 
  Two or more concurrent set_irq can race with each other, though. Can
  either document the race or add a lock.
  
 
 --
   Gleb.

Ok, then:

accept_apic_irq:
1. coalesced = test_and_set_bit(PIR)
2. set KVM_REQ_EVENT bit(*)
3. if (vcpu-in_guest_mode)
4.  if (test_and_set_bit(pir notification bit))
5.  send PIR IPI
6. return coalesced
   Do not see how it will help.
   
   Starting condition: PIR=0, IRR=1 vcpu is in a guest mode
   
   io thread  |  vcpu
   accept_apic_interrupt()|
   coalesced = 0, PIR=1   |
   vcpu in a guest mode:  |
   send PIR IPI   |
  |  receive PIR IPI
  |  merge PIR to IRR (*)
   return not coalesced   |
   
   At (*) interrupt that was reported as delivered is coalesced.
  
  Of course!
  
   The point is that we need to check PIR and IRR atomically and this is
   impossible.
  
  Ok, next try:
  
  1. orig_irr = read irr from vapic page
  2. if (orig_irr == 0)
  3.  return test_and_test_bit(pir)
  4. return 0
  
 I think this is exactly same solution we are discussing above:
 
 apic_accept_interrupt() {
  if (PIR || IRR)
return coalesced;
  else
set PIR;
 }

It's not exactly the same. Without test_and_set_bit(PIR) two vcpus can race.

 with the same self-IPI problem. IMO this is the best we can do and will
 work correctly for RTC interrupt re-injection case.

Yes.
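
Something along these lines, then (a sketch of the scheme converged on above,
with illustrative names; 'pir' stands for wherever the posted-interrupt
request bits end up living):

	static bool accept_apic_irq_sketch(const unsigned long *irr,
					   unsigned long *pir, int vec)
	{
		/* vector already pending in IRR on the vAPIC page? */
		if (test_bit(vec, irr))
			return true;		/* report coalesced */

		/* otherwise mark PIR atomically; an already-set bit means coalesced */
		return test_and_set_bit(vec, pir);
	}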

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 5/5] KVM: MMU: fast drop all spte on the pte_list

2013-02-08 Thread Marcelo Tosatti
On Tue, Feb 05, 2013 at 04:55:37PM +0800, Xiao Guangrong wrote:
 If a shadow page is being zapped or a host page is going to be freed, kvm
 will drop all the reverse-mappings on the shadow page or the gfn. Currently,
 it drops the reverse-mappings one by one: it deletes the first reverse
 mapping, then moves the other reverse-mappings between the description-tables.
 When the last description-table becomes empty, it will be freed.
 
 It works well if we only have a few reverse-mappings, but some pte_lists are
 very long; during my tracking, I saw some gfns with more than 200 sptes listed
 on their pte-list (1G memory in guest on softmmu). Optimizing the dropping of
 such long pte-lists is worthwhile; at least it is good for deleting memslots
 and ksm/thp merged pages.
 
 This patch introduces a better way to optimize for this case: it walks all the
 reverse-mappings and clears them, then frees all description-tables together.
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   36 +++-
  1 files changed, 27 insertions(+), 9 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 58f813a..aa7a887 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -945,6 +945,25 @@ static void pte_list_remove(u64 *spte, unsigned long 
 *pte_list)
   }
  }
 
 +static void pte_list_destroy(unsigned long *pte_list)
 +{
 + struct pte_list_desc *desc;
 + unsigned long list_value = *pte_list;
 +
 + *pte_list = 0;
 +
 + if (!(list_value & 1))
 + return;
 +
 + desc = (struct pte_list_desc *)(list_value & ~1ul);
 + while (desc) {
 + struct pte_list_desc *next_desc = desc->more;
 +
 + mmu_free_pte_list_desc(desc);
 + desc = next_desc;
 + }
 +}
 +
  /*
   * Used by the following functions to iterate through the sptes linked by a
   * pte_list.  All fields are private and not assumed to be used outside.
 @@ -1183,17 +1202,17 @@ static bool rmap_write_protect(struct kvm *kvm, u64 
 gfn)
  static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
  struct kvm_memory_slot *slot, unsigned long data)
  {
 - u64 *sptep;
   struct pte_list_iterator iter;
 + u64 *sptep;
   int need_tlb_flush = 0;
 
 -restart:
   for_each_spte_in_rmap(*rmapp, iter, sptep) {
 - drop_spte(kvm, sptep);
 + mmu_spte_clear_track_bits(sptep);
   need_tlb_flush = 1;
 - goto restart;
   }
 
 + pte_list_destroy(rmapp);
 +
   return need_tlb_flush;
  }
 
 @@ -2016,11 +2035,10 @@ static void kvm_mmu_unlink_parents(struct kvm *kvm, 
 struct kvm_mmu_page *sp)
   u64 *sptep;
   struct pte_list_iterator iter;
 
 -restart:
 - for_each_spte_in_pte_list(sp->parent_ptes, iter, sptep) {
 - drop_parent_pte(sp, sptep);
 - goto restart;
 - }
 + for_each_spte_in_pte_list(sp->parent_ptes, iter, sptep)
 + mmu_spte_clear_no_track(sptep);
 +
 + pte_list_destroy(&sp->parent_ptes);
  }

Do we lose the crash information of pte_list_remove? 
It has been shown to be useful in several cases.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] KVM: MMU: simple cleanups

2013-02-08 Thread Marcelo Tosatti
On Tue, Feb 05, 2013 at 03:26:21PM +0800, Xiao Guangrong wrote:
 There are the simple cleanups for MMU, no function / logic changed.
 
 Marcelo, Gleb, please apply them after applying
 [PATCH v3] KVM: MMU: lazily drop large spte
 
 Changelog:
 no change, just split them from the previous patchset for good review.
 
 Thanks!

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/5] KVM: MMU: introduce mmu_spte_establish

2013-02-08 Thread Marcelo Tosatti
On Tue, Feb 05, 2013 at 04:53:19PM +0800, Xiao Guangrong wrote:
 There is little difference between walking the parent ptes and walking the rmap:
 all sptes in the rmap must be present, but this is not true for the parent pte
 list; in kvm_mmu_alloc_page, we always link the parent list before setting the
 spte to present.
 
 This patch introduces mmu_spte_establish, which sets the spte before linking
 it to the parent list, to eliminate the difference; then it is possible to
 unify the code for walking pte lists.
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   81 ++-
  arch/x86/kvm/paging_tmpl.h |   16 -
  2 files changed, 48 insertions(+), 49 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 8041454..68d4d5f 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1482,9 +1482,6 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
  static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
   struct kvm_mmu_page *sp, u64 *parent_pte)
  {
 - if (!parent_pte)
 - return;
 -
  pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
  }
 
 @@ -1502,7 +1499,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
  }
 
  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 -u64 *parent_pte, int direct)
 +int direct)
  {
   struct kvm_mmu_page *sp;
  sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 @@ -1512,7 +1509,6 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct 
 kvm_vcpu *vcpu,
  set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
  list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
  sp->parent_ptes = 0;
 - mmu_page_add_parent_pte(vcpu, sp, parent_pte);
  kvm_mod_used_mmu_pages(vcpu->kvm, +1);
   return sp;
  }
 @@ -1845,8 +1841,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
 kvm_vcpu *vcpu,
gva_t gaddr,
unsigned level,
int direct,
 -  unsigned access,
 -  u64 *parent_pte)
 +  unsigned access)
  {
   union kvm_mmu_page_role role;
   unsigned quadrant;
 @@ -1876,19 +1871,15 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
 kvm_vcpu *vcpu,
  if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
   break;
 
 - mmu_page_add_parent_pte(vcpu, sp, parent_pte);
 - if (sp-unsync_children) {
 + if (sp-unsync_children)
   kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
 - kvm_mmu_mark_parents_unsync(sp);
 - } else if (sp-unsync)
 - kvm_mmu_mark_parents_unsync(sp);
 
   __clear_sp_write_flooding_count(sp);
   trace_kvm_mmu_get_page(sp, false);
   return sp;
   }
  ++vcpu->kvm->stat.mmu_cache_miss;
 - sp = kvm_mmu_alloc_page(vcpu, parent_pte, direct);
 + sp = kvm_mmu_alloc_page(vcpu, direct);
   if (!sp)
   return sp;
  sp->gfn = gfn;
 @@ -1908,6 +1899,35 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
 kvm_vcpu *vcpu,
   return sp;
  }
 
 +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
 +{
 + u64 spte;
 +
 + spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
 +shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
 +
 + mmu_spte_set(sptep, spte);
 +}
 +
 +static struct kvm_mmu_page *
 +mmu_spte_establish(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn, gva_t gaddr,
 +unsigned level, int direct, unsigned access)
 +{
 + struct kvm_mmu_page *sp;
 +
 + WARN_ON(is_shadow_present_pte(*spte));
 +
 + sp = kvm_mmu_get_page(vcpu, gfn, gaddr, level, direct, access);
 +
 + link_shadow_page(spte, sp);
 + mmu_page_add_parent_pte(vcpu, sp, spte);
 +
 + if (sp->unsync_children || sp->unsync)
 + kvm_mmu_mark_parents_unsync(sp);
 +
 + return sp;
 +}
 +
  static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
struct kvm_vcpu *vcpu, u64 addr)
  {
 @@ -1957,16 +1977,6 @@ static void shadow_walk_next(struct 
 kvm_shadow_walk_iterator *iterator)
  return __shadow_walk_next(iterator, *iterator->sptep);
  }
 
 -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
 -{
 - u64 spte;
 -
 - spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
 -shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
 -
 - mmu_spte_set(sptep, spte);
 -}
 -
  static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
  unsigned 

Re: Q: Why not use struct mm_struct to manage guest physical addresses in new port?

2013-02-08 Thread Marcelo Tosatti
On Tue, Feb 05, 2013 at 11:02:32AM -0800, David Daney wrote:
 Hi,
 
 I am starting to working on a port of KVM to an architecture that
 has a dual TLB.  The Guest Virtual Addresses (GVA) are translated to
 Guest Physical Addresses (GPA) by the first TLB, then a second TLB
 translates the GPA to a Root Physical Address (RPA).  For the sake
 of this question, we will ignore the GVA-GPA TLB and consider only
 the GPA-RPA TLB.
 
 It seems that most existing ports have a bunch of custom code that
 manages the GPA-RPA TLB and page tables.
 
 Here is what I would like to try:  Create a mm for the GPA-RPA
 mappings; each vma would have a fault handler that calls gfn_to_pfn()
 to look up the proper page.  In kvm_arch_vcpu_ioctl_run() we would
 call switch_mm() to this new 'gva_mm'.

gfn_to_pfn uses the address space of the controlling process. GPA-RPA
translation does:

1) Find 'memory slot' (indexed by gfn)
2) From 'memory slot', find virtual address (relative to controlling
process).
3) Walk the pagetable of the controlling process and retrieve the physical
address of the page.
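
Step 2 is essentially the following calculation (a simplified sketch of what
gfn_to_hva()/gfn_to_pfn() do; the real code in virt/kvm/kvm_main.c also
handles slot flags, holes and error cases):

	static unsigned long gfn_to_hva_sketch(struct kvm_memory_slot *slot,
					       gfn_t gfn)
	{
		/* offset of the gfn within the slot, scaled to bytes */
		return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
	}

Step 3 then pins/translates that host virtual address through the controlling
process' page tables (get_user_pages and friends).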

  Upon exiting guest mode we
 would switch back to the original mm of the controlling process.
 For me the benefit of this approach is that all the code that
 manages the TLB is already implemented and works well for struct
 mm_struct.  The only thing I need to do is write a vma fault
 handler.  That is a lot easier and less error prone than maintaining
 a parallel TLB management framework and making sure it interacts
 properly with the existing TLB code for 'normal' processes.
 
 
 Q1: Am I crazy for wanting to try this?

You need the mm_struct of the controlling process to be active when doing
GPA->RPA translations.

 Q2: Have others tried this and rejected it?  What were the reasons?

I think you'll have to switch_mm back to the controlling process mm on
every page fault (and then back to gva_mm).

 
 Thanks in advance,
 David Daney
 Cavium, Inc.

'vma' is a process 



Re: [PATCH] KVM: Remove user_alloc from struct kvm_memory_slot

2013-02-08 Thread Marcelo Tosatti
On Thu, Feb 07, 2013 at 06:55:57PM +0900, Takuya Yoshikawa wrote:
 This field was needed to differentiate memory slots created by the new
 API, KVM_SET_USER_MEMORY_REGION, from those by the old equivalent,
 KVM_SET_MEMORY_REGION, whose support was dropped long before:
 
   commit b74a07beed0e64bfba413dcb70dd6749c57f43dc
   KVM: Remove kernel-allocated memory regions
 
 Although we also have private memory slots to which KVM allocates
 memory with vm_mmap(), !user_alloc slots in other words, the slot id
 should be enough for differentiating them.
 
 Note: corresponding function parameters will be removed later.
 
 Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com



Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-07 Thread Marcelo Tosatti
On Thu, Feb 07, 2013 at 04:01:11PM +0200, Gleb Natapov wrote:
 On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote:
   Second is that interrupt may be
   reported as delivered, but it will be coalesced (possible only with the 
   self
   IPI with the same vector):
   
   Starting condition: PIR=0, IRR=0 vcpu is in a guest mode
   
   io thread |   vcpu
   accept_apic_interrupt()   |
PIR and IRR is zero  |
set PIR  |
return delivered |
 |  self IPI
 |  set IRR
 | merge PIR to IRR (*)
   
   At (*) interrupt that was reported as delivered is coalesced.
  
  Only vcpu itself should send self-IPI, so its fine.
  
 It is fine only because this is not happening in practice (I hope) for single 
 interrupt
 we care about. Otherwise this is serious problem.

Coalesced information is only interesting for non IPI cases, that
is, device emulation (at the moment, at least).

The above cause can happen when loading APIC registers, but delivered
is not interesting in that case. Good to document, however.

Or:

apic_accept_interrupt() {

1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
Never set IRR when HWAPIC enabled, even if outside of guest mode.
2. Set PIR and let HW or SW VM-entry transfer it to IRR.
3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
}

   This can report interrupt as coalesced, but it will be eventually 
   delivered
   as separate interrupt:
   
   Starting condition: PIR=0, IRR=1 vcpu is in a guest mode
   
io thread  | vcpu
   | 
   accept_apic_interrupt() |
   ORIG_PIR=0, ORIG_IRR=1  |
   |EOI
   |clear IRR, set ISR
   set PIR |
   return coalesced|
   |clear PIR, set IRR
   |EOI
   |clear IRR, set ISR (*)
   
   At (*) interrupt that was reported as coalesced is delivered.
   
   
   So still no perfect solution. But first one has much less serious
   problems for our practical needs.
   
Two or more concurrent set_irq can race with each other, though. Can
either document the race or add a lock.

   
   --
 Gleb.
  
  Ok, then:
  
  accept_apic_irq:
  1. coalesced = test_and_set_bit(PIR)
  2. set KVM_REQ_EVENT bit(*)
  3. if (vcpu-in_guest_mode)
  4.  if (test_and_set_bit(pir notification bit))
  5.  send PIR IPI
  6. return coalesced
 Do not see how it will help.
 
 Starting condition: PIR=0, IRR=1 vcpu is in a guest mode
 
 io thread  |  vcpu
 accept_apic_interrupt()|
 coalesced = 0, PIR=1   |
 vcpu in a guest mode:  |
 send PIR IPI   |
|  receive PIR IPI
|  merge PIR to IRR (*)
 return not coalesced   |
 
 At (*) interrupt that was reported as delivered is coalesced.

Of course!

 The point is that we need to check PIR and IRR atomically and this is
 impossible.

Ok, next try:

1. orig_irr = read irr from vapic page
2. if (orig_irr == 0)
3.  return test_and_set_bit(pir)
4. return 0

To summarize, given irr and pir (the bit for the vector in question) 
possibilities:

irr=0, pir=0, inject, return coalesced=0.
irr=0, pir=1, not injected, return coalesced=1.
irr=1, pir=0, not injected, return coalesced=1.
irr=1, pir=1, not injected, return coalesced=1.

Note the behaviour regarding whether to inject or not is the same as
today: it depends on IRR being set. If PIR=1 and IRR=0, IRR will be
eventually set.
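
A minimal sketch matching the table (illustrative names; apic_test_vector()
and the 256-bit 'pir' bitmap are assumed to be what the posted-interrupt
series provides):

/* Returns 1 if the interrupt is reported as coalesced, 0 otherwise. */
static int apic_accept_irq_sketch(struct kvm_lapic *apic, int vector, u64 *pir)
{
    /* irr=1 rows: never inject again, always report coalesced */
    if (apic_test_vector(vector, apic->regs + APIC_IRR))
        return 1;

    /*
     * irr=0 rows: queue in PIR.  If the bit was already set the
     * interrupt is coalesced; otherwise HW or SW VM-entry will
     * eventually transfer it to IRR.
     */
    return test_and_set_bit(vector, (unsigned long *)pir);
}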

  Other sites:
  A: On VM-entry, after disabling interrupts, but before
  the last check for -requests, clear pir notification bit 
  (unconditionally).
  
  (*) This is _necessary_ also because during VM-exit a PIR IPI interrupt can 
  be missed, so the KVM_REQ_EVENT indicates that SW is responsible for
  PIR-IRR transfer.
  
 
 --
   Gleb.


Re: [PATCH v12 rebased] kvm: notify host when the guest is panicked

2013-02-07 Thread Marcelo Tosatti
Hi,

On Wed, Jan 23, 2013 at 03:19:21PM +0800, Hu Tao wrote:
 We can know the guest is panicked when the guest runs on xen.
 But we do not have such feature on kvm.
 
 Another purpose of this feature is: management app(for example:
 libvirt) can do auto dump when the guest is panicked. If management
 app does not do auto dump, the guest's user can do dump by hand if
 he sees the guest is panicked.
 
 We have three solutions to implement this feature:
 1. use vmcall
 2. use I/O port
 3. use virtio-serial.
 
 We have decided to avoid touching the hypervisor. The reason why I chose
 the I/O port is:
 1. it is easier to implement
 2. it does not depend on any virtual device
 3. it can work when starting the kernel
 
 Signed-off-by: Wen Congyang we...@cn.fujitsu.com
 Signed-off-by: Hu Tao hu...@cn.fujitsu.com
 ---
  arch/ia64/kvm/irq.h  | 19 +
  arch/powerpc/include/asm/kvm_para.h  | 18 
  arch/s390/include/asm/kvm_para.h | 19 +
  arch/x86/include/asm/kvm_para.h  | 20 ++
  arch/x86/include/uapi/asm/kvm_para.h |  2 ++
  arch/x86/kernel/kvm.c| 53 
 
  include/linux/kvm_para.h | 18 
  include/uapi/linux/kvm_para.h|  6 
  kernel/panic.c   |  4 +++
  9 files changed, 159 insertions(+)
 
 diff --git a/arch/ia64/kvm/irq.h b/arch/ia64/kvm/irq.h
 index c0785a7..b3870f8 100644
 --- a/arch/ia64/kvm/irq.h
 +++ b/arch/ia64/kvm/irq.h
 @@ -30,4 +30,23 @@ static inline int irqchip_in_kernel(struct kvm *kvm)
   return 1;
  }
  
 +static inline int kvm_arch_pv_event_init(void)
 +{
 + return 0;
 +}
 +
 +static inline unsigned int kvm_arch_pv_features(void)
 +{
 + return 0;
 +}
 +
 +static inline void kvm_arch_pv_eject_event(unsigned int event)
 +{
 +}
 +
 +static inline bool kvm_arch_pv_event_enabled(void)
 +{
 + return false;
 +}
 +

The interface is x86 only, no need to touch other architectures.

  #endif
 diff --git a/arch/powerpc/include/asm/kvm_para.h 
 b/arch/powerpc/include/asm/kvm_para.h
 index 2b11965..17dd013 100644
 --- a/arch/powerpc/include/asm/kvm_para.h
 +++ b/arch/powerpc/include/asm/kvm_para.h
 @@ -144,4 +144,22 @@ static inline bool kvm_check_and_clear_guest_paused(void)
   return false;
  }
  
 +static inline int kvm_arch_pv_event_init(void)
 +{
 + return 0;
 +}
 +
 +static inline unsigned int kvm_arch_pv_features(void)
 +{
 + return 0;
 +}
 +
 +static inline void kvm_arch_pv_eject_event(unsigned int event)
 +{
 +}
 +
 +static inline bool kvm_arch_pv_event_enabled(void)
 +{
 + return false;
 +}
  #endif /* __POWERPC_KVM_PARA_H__ */
 diff --git a/arch/s390/include/asm/kvm_para.h 
 b/arch/s390/include/asm/kvm_para.h
 index e0f8423..81d87ec 100644
 --- a/arch/s390/include/asm/kvm_para.h
 +++ b/arch/s390/include/asm/kvm_para.h
 @@ -154,4 +154,23 @@ static inline bool kvm_check_and_clear_guest_paused(void)
   return false;
  }
  
 +static inline int kvm_arch_pv_event_init(void)
 +{
 + return 0;
 +}
 +
 +static inline unsigned int kvm_arch_pv_features(void)
 +{
 + return 0;
 +}
 +
 +static inline void kvm_arch_pv_eject_event(unsigned int event)
 +{
 +}
 +
 +static inline bool kvm_arch_pv_event_enabled(void)
 +{
 + return false;
 +}
 +
  #endif /* __S390_KVM_PARA_H */
 --- a/arch/x86/include/asm/kvm_para.h
 +++ b/arch/x86/include/asm/kvm_para.h
 @@ -133,4 +133,24 @@ static inline void kvm_disable_steal_time(void)
  }
  #endif
  
 +static inline int kvm_arch_pv_event_init(void)
 +{
 + if (!request_region(KVM_PV_EVENT_PORT, 4, KVM_PV_EVENT))
 + return -1;
 +
 + return 0;
 +}

This should be in a driver in arch/x86/kernel/kvm-panic.c, or so.
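
For illustration only, the shape of such a driver could be (the event id,
the notifier wiring and the file itself are hypothetical; KVM_PV_EVENT_PORT
and request_region() are taken from the patch):

/* arch/x86/kernel/kvm-panic.c -- rough sketch, not the submitted code */
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/ioport.h>
#include <linux/notifier.h>
#include <linux/io.h>
#include <asm/kvm_para.h>

#define KVM_PV_PANICKED 1   /* placeholder event id */

static int kvm_pv_panic_notify(struct notifier_block *nb,
                               unsigned long code, void *unused)
{
    outl(KVM_PV_PANICKED, KVM_PV_EVENT_PORT);   /* tell the host */
    return NOTIFY_DONE;
}

static struct notifier_block kvm_pv_panic_nb = {
    .notifier_call = kvm_pv_panic_notify,
};

static int __init kvm_pv_panic_init(void)
{
    if (!kvm_para_available())
        return 0;
    if (!request_region(KVM_PV_EVENT_PORT, 4, "KVM_PV_EVENT"))
        return -EBUSY;
    atomic_notifier_chain_register(&panic_notifier_list, &kvm_pv_panic_nb);
    return 0;
}
arch_initcall(kvm_pv_panic_init);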

 +
 +static inline unsigned int kvm_arch_pv_features(void)
 +{
 + return inl(KVM_PV_EVENT_PORT);
 +}
 +
 +static inline void kvm_arch_pv_eject_event(unsigned int event)
 +{
 + outl(event, KVM_PV_EVENT_PORT);
 +}

 +
 +bool kvm_arch_pv_event_enabled(void);
 +
  #endif /* _ASM_X86_KVM_PARA_H */
 diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
 b/arch/x86/include/uapi/asm/kvm_para.h
 index 06fdbd9..c15ef33 100644
 --- a/arch/x86/include/uapi/asm/kvm_para.h
 +++ b/arch/x86/include/uapi/asm/kvm_para.h
 @@ -96,5 +96,7 @@ struct kvm_vcpu_pv_apf_data {
  #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
  #define KVM_PV_EOI_DISABLED 0x0
  
 +#define KVM_PV_EVENT_PORT (0x505UL)
 +

No need for the ioport to be hard coded. What are the options to
communicate an address to the guest? An MSR, via ACPI?

  
  #endif /* _UAPI_ASM_X86_KVM_PARA_H */
 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
 index 9c2bd8b..0aa7b3e 100644
 --- a/arch/x86/kernel/kvm.c
 +++ b/arch/x86/kernel/kvm.c
 @@ -73,6 +73,20 @@ static int parse_no_kvmclock_vsyscall(char *arg)
  
  early_param(no-kvmclock-vsyscall, parse_no_kvmclock_vsyscall);
  
 +static int pv_event = 1;
 +static int parse_no_pv_event(char *arg)
 +{
 + pv_event = 0;
 + return 0;
 +}

Re: [PATCH v12 rebased 2/8] start vm after resetting it

2013-02-07 Thread Marcelo Tosatti
On Wed, Jan 23, 2013 at 03:19:23PM +0800, Hu Tao wrote:
 From: Wen Congyang we...@cn.fujitsu.com
 
 The guest should run after resetting it, but it does not run if its
 old state is RUN_STATE_INTERNAL_ERROR or RUN_STATE_PAUSED.
 
 We don't set runstate to RUN_STATE_PAUSED when resetting the guest,
 so the runstate will be changed from RUN_STATE_INTERNAL_ERROR or
 RUN_STATE_PAUSED to RUN_STATE_RUNNING(not RUN_STATE_PAUSED).

It appears the last hunk will automatically reset state from 
RUN_STATE_INTERNAL_ERROR to RUN_STATE_RUNNING ?

I suppose the transition table allows, from RUN_STATE_INTERNAL_ERROR:

monitor system_reset
monitor cont

To resume the machine?

 
 Signed-off-by: Wen Congyang we...@cn.fujitsu.com
 ---
  include/block/block.h | 2 ++
  qmp.c | 2 +-
  vl.c  | 7 ---
  3 files changed, 7 insertions(+), 4 deletions(-)
 
 diff --git a/include/block/block.h b/include/block/block.h
 index ffd1936..5e82ccb 100644
 --- a/include/block/block.h
 +++ b/include/block/block.h
 @@ -366,6 +366,8 @@ void bdrv_disable_copy_on_read(BlockDriverState *bs);
  void bdrv_set_in_use(BlockDriverState *bs, int in_use);
  int bdrv_in_use(BlockDriverState *bs);
  
 +void iostatus_bdrv_it(void *opaque, BlockDriverState *bs);
 +
  #ifdef CONFIG_LINUX_AIO
  int raw_get_aio_fd(BlockDriverState *bs);
  #else
 diff --git a/qmp.c b/qmp.c
 index 55b056b..5f1bed1 100644
 --- a/qmp.c
 +++ b/qmp.c
 @@ -130,7 +130,7 @@ SpiceInfo *qmp_query_spice(Error **errp)
  };
  #endif
  
 -static void iostatus_bdrv_it(void *opaque, BlockDriverState *bs)
 +void iostatus_bdrv_it(void *opaque, BlockDriverState *bs)
  {
  bdrv_iostatus_reset(bs);
  }
 diff --git a/vl.c b/vl.c
 index b0bcf1e..1d2edaa 100644
 --- a/vl.c
 +++ b/vl.c
 @@ -534,7 +534,7 @@ static const RunStateTransition 
 runstate_transitions_def[] = {
  { RUN_STATE_INMIGRATE, RUN_STATE_RUNNING },
  { RUN_STATE_INMIGRATE, RUN_STATE_PAUSED },
  
 -{ RUN_STATE_INTERNAL_ERROR, RUN_STATE_PAUSED },
 +{ RUN_STATE_INTERNAL_ERROR, RUN_STATE_RUNNING },
  { RUN_STATE_INTERNAL_ERROR, RUN_STATE_FINISH_MIGRATE },
  
  { RUN_STATE_IO_ERROR, RUN_STATE_RUNNING },
 @@ -569,7 +569,7 @@ static const RunStateTransition 
 runstate_transitions_def[] = {
  
  { RUN_STATE_SAVE_VM, RUN_STATE_RUNNING },
  
 -{ RUN_STATE_SHUTDOWN, RUN_STATE_PAUSED },
 +{ RUN_STATE_SHUTDOWN, RUN_STATE_RUNNING },
  { RUN_STATE_SHUTDOWN, RUN_STATE_FINISH_MIGRATE },
  
  { RUN_STATE_DEBUG, RUN_STATE_SUSPENDED },
 @@ -1951,7 +1951,8 @@ static bool main_loop_should_exit(void)
  resume_all_vcpus();
  if (runstate_check(RUN_STATE_INTERNAL_ERROR) ||
  runstate_check(RUN_STATE_SHUTDOWN)) {
 -runstate_set(RUN_STATE_PAUSED);
 +bdrv_iterate(iostatus_bdrv_it, NULL);
 +vm_start();
  }
  }
  if (qemu_wakeup_requested()) {
 -- 
 1.8.0.1.240.ge8a1f5a
 


Re: [PATCH v12 rebased 1/8] preserve cpu runstate

2013-02-07 Thread Marcelo Tosatti
On Wed, Jan 23, 2013 at 03:19:22PM +0800, Hu Tao wrote:
 This patch enables preservation of cpu runstate during save/load vm.
 So when a vm is restored from snapshot, the cpu runstate is restored,
 too.
 
 See following example:
 
 # save two vms: one is running, the other is paused
 (qemu) info status
 VM status: running
 (qemu) savevm running
 (qemu) stop
 (qemu) info status
 VM status: paused
 (qemu) savevm paused
 
 # restore the one running
 (qemu) info status
 VM status: paused
 (qemu) loadvm running
 (qemu) info status
 VM status: running
 
 # restore the one paused
 (qemu) loadvm paused
 (qemu) info status
 VM status: paused
 (qemu) cont
 (qemu)info status
 VM status: running
 
 
 Signed-off-by: Hu Tao hu...@cn.fujitsu.com

Lack of pause state on guest images is annoying. 

Fail to see why the panic feature depends on preservation of cpu
runstate.

  include/sysemu/sysemu.h |  2 ++
  migration.c |  6 +-
  monitor.c   |  5 ++---
  savevm.c|  1 +
  vl.c| 34 ++
  5 files changed, 40 insertions(+), 8 deletions(-)
 
 diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
 index 337ce7d..7a69fde 100644
 --- a/include/sysemu/sysemu.h
 +++ b/include/sysemu/sysemu.h
 @@ -19,6 +19,8 @@ extern uint8_t qemu_uuid[];
  int qemu_uuid_parse(const char *str, uint8_t *uuid);
  #define UUID_FMT 
 %02hhx%02hhx%02hhx%02hhx-%02hhx%02hhx-%02hhx%02hhx-%02hhx%02hhx-%02hhx%02hhx%02hhx%02hhx%02hhx%02hhx
  
 +void save_run_state(void);
 +void load_run_state(void);
  bool runstate_check(RunState state);
  void runstate_set(RunState new_state);
  int runstate_is_running(void);
 diff --git a/migration.c b/migration.c
 index 77c1971..f96cfd6 100644
 --- a/migration.c
 +++ b/migration.c
 @@ -108,11 +108,7 @@ static void process_incoming_migration_co(void *opaque)
  /* Make sure all file formats flush their mutable metadata */
  bdrv_invalidate_cache_all();
  
 -if (autostart) {
 -vm_start();
 -} else {
 -runstate_set(RUN_STATE_PAUSED);
 -}
 +load_run_state();
  }
  
  static void enter_migration_coroutine(void *opaque)
 diff --git a/monitor.c b/monitor.c
 index 20bd19b..9381ed0 100644
 --- a/monitor.c
 +++ b/monitor.c
 @@ -2059,13 +2059,12 @@ void qmp_closefd(const char *fdname, Error **errp)
  
  static void do_loadvm(Monitor *mon, const QDict *qdict)
  {
 -int saved_vm_running  = runstate_is_running();
  const char *name = qdict_get_str(qdict, name);
  
  vm_stop(RUN_STATE_RESTORE_VM);
  
 -if (load_vmstate(name) == 0  saved_vm_running) {
 -vm_start();
 +if (load_vmstate(name) == 0) {
 +load_run_state();
  }
  }
  
 diff --git a/savevm.c b/savevm.c
 index 304d1ef..10f1d56 100644
 --- a/savevm.c
 +++ b/savevm.c
 @@ -2112,6 +2112,7 @@ void do_savevm(Monitor *mon, const QDict *qdict)
  }
  
  saved_vm_running = runstate_is_running();
 +save_run_state();
  vm_stop(RUN_STATE_SAVE_VM);
  
  memset(sn, 0, sizeof(*sn));
 diff --git a/vl.c b/vl.c
 index 4ee1302..b0bcf1e 100644
 --- a/vl.c
 +++ b/vl.c
 @@ -520,6 +520,7 @@ static int default_driver_check(QemuOpts *opts, void 
 *opaque)
  /* QEMU state */
  
  static RunState current_run_state = RUN_STATE_PRELAUNCH;
 +static RunState saved_run_state = RUN_STATE_PRELAUNCH;
  
  typedef struct {
  RunState from;
 @@ -543,6 +544,7 @@ static const RunStateTransition 
 runstate_transitions_def[] = {
  { RUN_STATE_PAUSED, RUN_STATE_FINISH_MIGRATE },
  
  { RUN_STATE_POSTMIGRATE, RUN_STATE_RUNNING },
 +{ RUN_STATE_POSTMIGRATE, RUN_STATE_PAUSED },
  { RUN_STATE_POSTMIGRATE, RUN_STATE_FINISH_MIGRATE },
  
  { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING },
 @@ -553,6 +555,7 @@ static const RunStateTransition 
 runstate_transitions_def[] = {
  { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE },
  
  { RUN_STATE_RESTORE_VM, RUN_STATE_RUNNING },
 +{ RUN_STATE_RESTORE_VM, RUN_STATE_PAUSED },
  
  { RUN_STATE_RUNNING, RUN_STATE_DEBUG },
  { RUN_STATE_RUNNING, RUN_STATE_INTERNAL_ERROR },
 @@ -582,11 +585,39 @@ static const RunStateTransition 
 runstate_transitions_def[] = {
  
  static bool runstate_valid_transitions[RUN_STATE_MAX][RUN_STATE_MAX];
  
 +void save_run_state(void)
 +{
 +saved_run_state = current_run_state;
 +}
 +
 +void load_run_state(void)
 +{
 +if (saved_run_state == RUN_STATE_RUNNING) {
 +vm_start();
 +} else if (!runstate_check(saved_run_state)) {
 +runstate_set(saved_run_state);
 +} else {
 +; /* leave unchanged */
 +}
 +}
 +
  bool runstate_check(RunState state)
  {
  return current_run_state == state;
  }
  
 +static void runstate_save(QEMUFile *f, void *opaque)
 +{
 +qemu_put_byte(f, saved_run_state);
 +}
 +
 +static int runstate_load(QEMUFile *f, void *opaque, int version_id)
 +{
 +saved_run_state = qemu_get_byte(f);
 +
 +return 0;
 +}

This breaks loading images without support for 

Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-07 Thread Marcelo Tosatti
On Thu, Feb 07, 2013 at 03:52:24PM +0200, Gleb Natapov wrote:
  Its not a bad idea to have a new KVM_REQ_ bit for PIR processing (just
  as the current patches do).
 Without the numbers I do not see why.

KVM_REQ_EVENT already means... counting... many things. It's a well-defined
request, to sync PIR->VIRR; I don't see your point about performance.



Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-06 Thread Marcelo Tosatti
On Tue, Feb 05, 2013 at 09:32:50AM +0200, Gleb Natapov wrote:
 On Mon, Feb 04, 2013 at 06:47:30PM -0200, Marcelo Tosatti wrote:
  On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
   On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
   Any example how software relies on such 
   two-interrupts-queued-in-IRR/ISR behaviour?
  Don't know about guests, but KVM relies on it to detect interrupt
  coalescing. So if interrupt is set in IRR but not in PIR interrupt 
  will
  not be reported as coalesced, but it will be coalesced during 
  PIR-IRR
  merge.
 
 Yes, so:
 
 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
 3. vcpu outside of guest mode.
 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
 5. vcpu enters guest mode.
 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
 7. HW transfers PIR into IRR.
 
 set_irq return value at 7 is incorrect, interrupt event was _not_
 queued.
Not sure I understand the flow of events in your description correctly. 
As I
understand it at 4 set_irq() will return incorrect result. Basically
when PIR is set to 1 while IRR has 1 for the vector the value of
set_irq() will be incorrect.
   
   At 4 it has not been coalesced: it has been queued to IRR.
   At 6 it has been coalesced: PIR bit merged into IRR bit.
   
 Yes, that's the case.
 
Frankly I do not see how it can be fixed
without any race with present HW PIR design.
   
   At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR
   already set, don't set PIR.
 Need to check both IRR and PIR. Something like that:
 
 apic_accept_interrupt() {
  if (PIR || IRR)
return coalesced;
  else
set PIR;
 }
 
 This has two problems. Firs is that interrupt that can be delivered will
 be not (IRR is cleared just after it was tested), but it will be reported
 as coalesced, so this is benign race. 

Yes, and the same condition exists today with IRR, its fine.

 Second is that interrupt may be
 reported as delivered, but it will be coalesced (possible only with the self
 IPI with the same vector):
 
 Starting condition: PIR=0, IRR=0 vcpu is in a guest mode
 
 io thread |   vcpu
 accept_apic_interrupt()   |
  PIR and IRR is zero  |
  set PIR  |
  return delivered |
   |  self IPI
   |  set IRR
   | merge PIR to IRR (*)
 
 At (*) interrupt that was reported as delivered is coalesced.

Only vcpu itself should send self-IPI, so its fine.

  Or:
  
  apic_accept_interrupt() {
  
  1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
  Never set IRR when HWAPIC enabled, even if outside of guest mode.
  2. Set PIR and let HW or SW VM-entry transfer it to IRR.
  3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
  }
  
 This can report interrupt as coalesced, but it will be eventually delivered
 as separate interrupt:
 
 Starting condition: PIR=0, IRR=1 vcpu is in a guest mode
 
  io thread  | vcpu
 | 
 accept_apic_interrupt() |
 ORIG_PIR=0, ORIG_IRR=1  |
 |EOI
 |clear IRR, set ISR
 set PIR |
 return coalesced|
 |clear PIR, set IRR
 |EOI
 |clear IRR, set ISR (*)
 
 At (*) interrupt that was reported as coalesced is delivered.
 
 
 So still no perfect solution. But first one has much less serious
 problems for our practical needs.
 
  Two or more concurrent set_irq can race with each other, though. Can
  either document the race or add a lock.
  
 
 --
   Gleb.

Ok, then:

accept_apic_irq:
1. coalesced = test_and_set_bit(PIR)
2. set KVM_REQ_EVENT bit(*)
3. if (vcpu->in_guest_mode)
4.  if (test_and_set_bit(pir notification bit))
5.  send PIR IPI
6. return coalesced

Other sites:
A: On VM-entry, after disabling interrupts, but before
the last check for ->requests, clear pir notification bit 
(unconditionally).

(*) This is _necessary_ also because during VM-exit a PIR IPI interrupt can 
be missed, so the KVM_REQ_EVENT indicates that SW is responsible for
PIR->IRR transfer.
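
In C the above could be shaped roughly as follows (sketch only; 'struct
pi_desc', POSTED_INTR_ON and POSTED_INTR_VECTOR stand for the descriptor
layout and notification vector introduced by the posted-interrupt patches):

static int accept_apic_irq_sketch(struct kvm_vcpu *vcpu, int vector,
                                  struct pi_desc *pi)
{
    /* 1. queue in PIR; a previously set bit means coalesced */
    int coalesced = test_and_set_bit(vector, (unsigned long *)pi->pir);

    /* 2. SW stays responsible for the PIR->IRR transfer, see (*) */
    kvm_make_request(KVM_REQ_EVENT, vcpu);

    /*
     * 3.-5. notify with the PIR IPI only while the vcpu runs the
     * guest, and only if no notification is outstanding yet.
     */
    if (vcpu->mode == IN_GUEST_MODE &&
        !test_and_set_bit(POSTED_INTR_ON, (unsigned long *)&pi->control))
        apic->send_IPI_mask(cpumask_of(vcpu->cpu), POSTED_INTR_VECTOR);

    /* 6. */
    return coalesced;
}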




Re: [PATCH V5] target-i386: Enabling IA32_TSC_ADJUST for QEMU KVM guest VMs

2013-02-06 Thread Marcelo Tosatti
On Wed, Feb 06, 2013 at 10:22:32PM +, Auld, Will wrote:
 Marcelo, Hi,
 
 I have been watching for this patch in the upstream but have not seen it yet. 
 What version of QEMU should it be in?
 
 Thanks,
 
 Will

Will, its in the GIT tree:
https://github.com/qemu/qemu/commit/f28558d3d37ad3bc4e35e8ac93f7bf81a0d5622c

As for the next release:
http://www.mail-archive.com/qemu-devel@nongnu.org/msg153579.html



Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-06 Thread Marcelo Tosatti
On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote:
 On Tue, Feb 05, 2013 at 09:32:50AM +0200, Gleb Natapov wrote:
  On Mon, Feb 04, 2013 at 06:47:30PM -0200, Marcelo Tosatti wrote:
   On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
 On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
Any example how software relies on such 
two-interrupts-queued-in-IRR/ISR behaviour?
   Don't know about guests, but KVM relies on it to detect interrupt
   coalescing. So if interrupt is set in IRR but not in PIR 
   interrupt will
   not be reported as coalesced, but it will be coalesced during 
   PIR-IRR
   merge.
  
  Yes, so:
  
  1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
  2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
  3. vcpu outside of guest mode.
  4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
  5. vcpu enters guest mode.
  6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
  7. HW transfers PIR into IRR.
  
  set_irq return value at 7 is incorrect, interrupt event was _not_
  queued.
 Not sure I understand the flow of events in your description 
 correctly. As I
 understand it at 4 set_irq() will return incorrect result. Basically
 when PIR is set to 1 while IRR has 1 for the vector the value of
 set_irq() will be incorrect.

At 4 it has not been coalesced: it has been queued to IRR.
At 6 it has been coalesced: PIR bit merged into IRR bit.

  Yes, that's the case.
  
 Frankly I do not see how it can be fixed
 without any race with present HW PIR design.

At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR
already set, don't set PIR.
  Need to check both IRR and PIR. Something like that:
  
  apic_accept_interrupt() {
   if (PIR || IRR)
 return coalesced;
   else
 set PIR;
  }
  
  This has two problems. Firs is that interrupt that can be delivered will
  be not (IRR is cleared just after it was tested), but it will be reported
  as coalesced, so this is benign race. 
 
 Yes, and the same condition exists today with IRR, its fine.
 
  Second is that interrupt may be
  reported as delivered, but it will be coalesced (possible only with the self
  IPI with the same vector):
  
  Starting condition: PIR=0, IRR=0 vcpu is in a guest mode
  
  io thread |   vcpu
  accept_apic_interrupt()   |
   PIR and IRR is zero  |
   set PIR  |
   return delivered |
|  self IPI
|  set IRR
| merge PIR to IRR (*)
  
  At (*) interrupt that was reported as delivered is coalesced.
 
 Only vcpu itself should send self-IPI, so its fine.
 
   Or:
   
   apic_accept_interrupt() {
   
   1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
   Never set IRR when HWAPIC enabled, even if outside of guest mode.
   2. Set PIR and let HW or SW VM-entry transfer it to IRR.
   3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
   }
   
  This can report interrupt as coalesced, but it will be eventually delivered
  as separate interrupt:
  
  Starting condition: PIR=0, IRR=1 vcpu is in a guest mode
  
   io thread  | vcpu
  | 
  accept_apic_interrupt() |
  ORIG_PIR=0, ORIG_IRR=1  |
  |EOI
  |clear IRR, set ISR
  set PIR |
  return coalesced|
  |clear PIR, set IRR
  |EOI
  |clear IRR, set ISR (*)
  
  At (*) interrupt that was reported as coalesced is delivered.
  
  
  So still no perfect solution. But first one has much less serious
  problems for our practical needs.
  
   Two or more concurrent set_irq can race with each other, though. Can
   either document the race or add a lock.
   
  
  --
  Gleb.
 
 Ok, then:
 
 accept_apic_irq:
 1. coalesced = test_and_set_bit(PIR)
 2. set KVM_REQ_EVENT bit  (*)
 3. if (vcpu-in_guest_mode)
 4.if (test_and_set_bit(pir notification bit))
 5.send PIR IPI
 6. return coalesced
 
 Other sites:
 A: On VM-entry, after disabling interrupts, but before
 the last check for -requests, clear pir notification bit 
 (unconditionally).
 
 (*) This is _necessary_ also because during VM-exit a PIR IPI interrupt can 
 be missed, so the KVM_REQ_EVENT indicates that SW is responsible for
 PIR-IRR transfer.

Its not a bad idea to have a new KVM_REQ_ bit for PIR processing (just
as the current patches do).



Re: [PATCH v3] KVM: MMU: lazily drop large spte

2013-02-06 Thread Marcelo Tosatti
On Tue, Feb 05, 2013 at 03:11:09PM +0800, Xiao Guangrong wrote:
 Currently, kvm zaps the large spte if write protection is needed, so a later
 read can fault on that spte. Actually, we can make the large spte readonly
 instead of making it un-present, so the page fault caused by read access can
 be avoided.
 
 The idea is from Avi:
 | As I mentioned before, write-protecting a large spte is a good idea,
 | since it moves some work from protect-time to fault-time, so it reduces
 | jitter.  This removes the need for the return value.
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com

Applied, thanks.



Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-06 Thread Marcelo Tosatti
  According the SDM, software should not touch the IRR when target
  vcpu
  is
  running. Instead, use locked way to access PIR. So your solution may
  wrong. Then your apicv patches are broken, because they do exactly
  that.
  Which code is broken?
  
  The one that updates IRR directly on the apic page.
  No, all the updates are ensuring the target vcpu is not running. So
  it's safe to touch IRR.
  
  Not at all. Read the code.
  Sorry. I still cannot figure out which code is wrong. All the places
  call sync_pir_to_irr() are on target vcpu. Can you point out the code?
  Thanks.
  
  I am taking about vapic patches which are already in, not pir patches.
 Yes, but the issue will be fixed with pir patches. With posted interrupt, it 
 will touch PIR instead IRR and access PIR is allowed by HW.
 
 Best regards,
 Yang
 

From http://www.mail-archive.com/kvm@vger.kernel.org/msg82824.html:


 2. Section 29.6 mentions that Use of the posted-interrupt descriptor
 differs from that of other data structures that are referenced by
 pointers in a VMCS. There is a general requirement that software
 ensure
 that each such data structure is modified only when no logical
 processor
 with a current VMCS that references it is in VMX non-root operation.
 That requirement does not apply to the posted-interrupt descriptor.
 There is a requirement, however, that such modifications be done using
 locked read-modify-write instructions.

 The APIC virtual page is being modified by a CPU while a logical
 processor with current VMCS that references it is in VMX non-root
 operation, in fact even modifying the APIC virtual page with EOI
 virtualizaton, virtual interrupt delivery, etc. What are the
 requirements in this case?
It should be same with posted interrupt. Software must ensure to use
atomic access to virtual apic page.


Can this point be clarified? Software can or cannot access virtual APIC
page while VMCS that references it is in VMX non-root operation?

Because if it cannot, then it means the current code is broken and
VID usage without PIR should not be allowed.



Re: [PATCH v2] KVM: VMX: disable SMEP feature when guest is in non-paging mode

2013-02-05 Thread Marcelo Tosatti
On Mon, Feb 04, 2013 at 11:50:43AM +0800, Dongxiao Xu wrote:
 Changes from v1 to v2:
  - Modify commit message and comments according to Gleb's suggestions.
 
 SMEP is disabled if CPU is in non-paging mode in hardware.
 However KVM always uses paging mode to emulate guest non-paging
 mode with TDP. To emulate this behavior, SMEP needs to be manually
 disabled when guest switches to non-paging mode.
 
 We met an issue that, SMP Linux guest with recent kernel (enable
 SMEP support, for example, 3.5.3) would crash with triple fault if
 setting unrestricted_guest=0. This is because KVM uses an identity
 mapping page table to emulate the non-paging mode, where the page
 table is set with USER flag. If SMEP is still enabled in this case,
 guest will meet unhandlable page fault and then crash.
 
 Signed-off-by: Dongxiao Xu dongxiao...@intel.com
 Signed-off-by: Xiantao Zhang xiantao.zh...@intel.com
 ---
  arch/x86/kvm/vmx.c |8 
  1 files changed, 8 insertions(+), 0 deletions(-)

Applied, thanks.



Re: [PATCH] KVM: VMX: add missing exit names to VMX_EXIT_REASONS array

2013-02-05 Thread Marcelo Tosatti

Applied, thanks.



Re: [PATCH] KVM: Remove duplicate text in api.txt

2013-02-05 Thread Marcelo Tosatti
On Thu, Jan 31, 2013 at 12:06:08PM -0800, Geoff Levand wrote:
 Signed-off-by: Geoff Levand ge...@infradead.org
 ---
 
 Saw this in v3.8-rc5, please apply.
 
  Documentation/virtual/kvm/api.txt |   13 -
  1 file changed, 13 deletions(-)

Applied, thanks.



Re: [PATCH 0/8] KVM: Reduce mmu_lock hold time when zapping mmu pages

2013-02-04 Thread Marcelo Tosatti
On Wed, Jan 23, 2013 at 07:12:31PM +0900, Takuya Yoshikawa wrote:
 This patch set mitigates another mmu_lock hold time issue.  Although
 this is not enough and I'm thinking of additional work already, this
 alone can reduce the lock hold time to some extent.
 
 Takuya Yoshikawa (8):
   KVM: MMU: Fix and clean up for_each_gfn_* macros
   KVM: MMU: Use list_for_each_entry_safe in kvm_mmu_commit_zap_page()
   KVM: MMU: Add a parameter to kvm_mmu_prepare_zap_page() to update the next 
 position
   KVM: MMU: Introduce for_each_gfn_indirect_valid_sp_safe macro
   KVM: MMU: Delete hash_link node in kvm_mmu_prepare_zap_page()
   KVM: MMU: Introduce free_zapped_mmu_pages() for freeing mmu pages in a list
   KVM: MMU: Split out free_zapped_mmu_pages() from kvm_mmu_commit_zap_page()
   KVM: MMU: Move free_zapped_mmu_pages() out of the protection of mmu_lock
 
  arch/x86/kvm/mmu.c |  149 
 +++-
  1 files changed, 101 insertions(+), 48 deletions(-)

Need a limit on the number of pages whose freeing is delayed. See that
n_used_mmu_pages is used by both SLAB freeing (to know how much pressure
to apply) and allocators (to decide when to allocate more).

You allow n_used_mmu_pages to be inaccurate, which is fine as long as
the error is limited.

Perhaps have a max of 64 pages at invalid_pages per round and if exceeded
release memory inside mmu_lock (one-by-one) ?
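
Concretely, the bound could be applied at commit time, something like this
(sketch only; free_zapped_mmu_pages() is the helper from this series and is
assumed to empty the list it is handed):

#define MAX_LAZY_FREE_MMU_PAGES 64  /* illustrative limit */

/* called with mmu_lock held, after the kvm_mmu_prepare_zap_page() loop */
static void commit_and_free_sketch(struct kvm *kvm,
                                   struct list_head *invalid_list,
                                   int nr_zapped)
{
    if (nr_zapped > MAX_LAZY_FREE_MMU_PAGES)
        free_zapped_mmu_pages(kvm, invalid_list); /* free under the lock */

    spin_unlock(&kvm->mmu_lock);

    /*
     * Small batches (or nothing, if freed above) are released here,
     * outside the lock, so n_used_mmu_pages never drifts by more
     * than the cap.
     */
    free_zapped_mmu_pages(kvm, invalid_list);
}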



Re: [PATCH 0/8] KVM: Reduce mmu_lock hold time when zapping mmu pages

2013-02-04 Thread Marcelo Tosatti
On Wed, Jan 23, 2013 at 06:44:52PM +0800, Xiao Guangrong wrote:
 On 01/23/2013 06:12 PM, Takuya Yoshikawa wrote:
  This patch set mitigates another mmu_lock hold time issue.  Although
  this is not enough and I'm thinking of additional work already, this
  alone can reduce the lock hold time to some extent.
  
 
 It is not worth doing this kind of complex thing, usually, only a few pages on
 the invalid list.

I think its a good idea - memory freeing can be done outside mmu_lock
protection (as long as its bounded). It reduces mmu_lock contention
overall.

 The *really* heavily case is kvm_mmu_zap_all() which can be speeded
 up by using generation number, this is a todo work in kvm wiki:
 
 http://www.linux-kvm.org/page/TODO: O(1) mmu invalidation using a generation 
 number
 
 I am doing this work for some weeks and will post the patch out during these 
 days.

Can you describe the generation number scheme in more detail, please?



Re: [PATCH 8/8] KVM: MMU: Move free_zapped_mmu_pages() out of the protection of mmu_lock

2013-02-04 Thread Marcelo Tosatti
On Wed, Jan 23, 2013 at 07:18:11PM +0900, Takuya Yoshikawa wrote:
 We noticed that kvm_mmu_zap_all() could take hundreds of milliseconds
 for zapping mmu pages with mmu_lock held.
 
 Although we need to do conditional rescheduling for completely
 fixing this issue, we can reduce the hold time to some extent by moving
 free_zapped_mmu_pages() out of the protection.  Since invalid_list can
 be very long, the effect is not negligible.
 
 Note: this patch does not treat non-trivial cases.
 
 Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp

Can you describe the case thats biting? Is it

/*
 * If memory slot is created, or moved, we need to clear all
 * mmio sptes.
 */
if (npages && old.base_gfn != mem->guest_phys_addr >> PAGE_SHIFT) {
kvm_mmu_zap_all(kvm);
kvm_reload_remote_mmus(kvm);
}

Because conditional rescheduling for kvm_mmu_zap_all() might not be
desirable: KVM_SET_USER_MEMORY has low latency requirements.




Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-04 Thread Marcelo Tosatti
On Mon, Feb 04, 2013 at 11:55:53AM +0200, Gleb Natapov wrote:
 On Sun, Feb 03, 2013 at 10:57:00PM -0200, Marcelo Tosatti wrote:
  On Thu, Jan 31, 2013 at 03:55:56PM +0200, Gleb Natapov wrote:
   On Thu, Jan 31, 2013 at 11:44:43AM -0200, Marcelo Tosatti wrote:
On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote:
 On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote:
  On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote:
   On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote:
Posted interrupt patch:
2) Must move IN_GUEST_MODE assignment after local_irq_disable, 
in
vcpu_enter_guest function. Otherwise:

cpu0vcpu1-cpu1

vcpu-mode = IN_GUEST_MODE

if IN_GUEST_MODE == true
send IPI
local_irq_disable

PIR not transferred to VIRR, misses interrupt.

   cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after
   local_irq_disable() by -requests check.
  
  Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose 
  of posted interrupts. You want
  
  if vcpu in guest mode
  send posted interrupt IPI
  else
  KVM_REQ_EVENT+kick
  
 I am thinking:
 
  set KVM_REQ_EVENT
  if pi is enabled
  send posted interrupt IPI
  else
  kick

KVM_REQ_EVENT must be after sending posted interrupt IPI. Otherwise on
the vcpu entry side

test_and_clear(KVM_REQ_EVENT) {
No bits set in PIR
   }

   It should be after updating PIR, but before sending posted interrupt
   IPI. Otherwise:
   
cpu0 cpu1/vcpu
   
 KVM_REQ_EVENT is not set
   set pir 
   send IPI
 irq_disable()
 -request is empty.
   set KVM_REQ_EVENT
   
   That's the same sequence as with IRR update, KVM_REQ_EVENT and kick
   today.
  
  Can only send IPI if vcpu-mode == IN_GUEST_MODE, which must be set
  after interrupt flag is cleared as noted.
  
  Also KVM_REQ_EVENT is processed outside of interrupt disabled region today.
 But it is checked in interrupt disabled section and vcpu entry is
 aborted if event is pending.
 
  
  Or maybe i don't get what you say... write a complete
  description?
  
 I am saying that we do not need to move  vcpu-mode = IN_GUEST_MODE to
 irq_disable() section to make things work. Just do:
 
 set bit in pir
 set KVM_REQ_EVENT
 if in guest mode do IPI

I see. Yeah, probably.

What about item 4 below?

   That's for Yang to answer :)
  
  If more than one interrupt is generated with the same vector number,
  the local APIC can set the bit for the vector both in the IRR and ISR.
  This means that for the Pentium 4 and Intel Xeon processors, the IRR
  and ISR can queue two interrupts for each interrupt vector: one in the
  IRR and one in the ISR. Any additional interrupts issued for the same
  interrupt vector are collapsed into the single bit in IRR
  
  Which would mean KVM emulation differs... Eventually 3 interrupts can
  be queued: one in IRR, one in ISR, one in PIR.
 I do not think this is the case though. PIR will be folded into IRR
 either during a guest entry or by vcpu itself on receiving of
 notification vector IPI.
 
  
  Any example how software relies on such two-interrupts-queued-in-IRR/ISR 
  behaviour?
 Don't know about guests, but KVM relies on it to detect interrupt
 coalescing. So if interrupt is set in IRR but not in PIR interrupt will
 not be reported as coalesced, but it will be coalesced during PIR-IRR
 merge.

Yes, so:

1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer.
3. vcpu outside of guest mode.
4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
5. vcpu enters guest mode.
6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
7. HW transfers PIR into IRR.

set_irq return value at 7 is incorrect, interrupt event was _not_
queued.



Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-04 Thread Marcelo Tosatti
On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
 On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
Any example how software relies on such 
two-interrupts-queued-in-IRR/ISR behaviour?
   Don't know about guests, but KVM relies on it to detect interrupt
   coalescing. So if interrupt is set in IRR but not in PIR interrupt will
   not be reported as coalesced, but it will be coalesced during PIR-IRR
   merge.
  
  Yes, so:
  
  1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
  2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
  3. vcpu outside of guest mode.
  4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
  5. vcpu enters guest mode.
  6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
  7. HW transfers PIR into IRR.
  
  set_irq return value at 7 is incorrect, interrupt event was _not_
  queued.
 Not sure I understand the flow of events in your description correctly. As I
 understand it at 4 set_irq() will return incorrect result. Basically
 when PIR is set to 1 while IRR has 1 for the vector the value of
 set_irq() will be incorrect.

At 4 it has not been coalesced: it has been queued to IRR.
At 6 it has been coalesced: PIR bit merged into IRR bit.

 Frankly I do not see how it can be fixed
 without any race with present HW PIR design.

At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR
already set, don't set PIR.



Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-04 Thread Marcelo Tosatti
On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
 On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
  On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
 Any example how software relies on such 
 two-interrupts-queued-in-IRR/ISR behaviour?
Don't know about guests, but KVM relies on it to detect interrupt
coalescing. So if interrupt is set in IRR but not in PIR interrupt will
not be reported as coalesced, but it will be coalesced during PIR-IRR
merge.
   
   Yes, so:
   
   1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
   2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
   3. vcpu outside of guest mode.
   4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
   5. vcpu enters guest mode.
   6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
   7. HW transfers PIR into IRR.
   
   set_irq return value at 7 is incorrect, interrupt event was _not_
   queued.
  Not sure I understand the flow of events in your description correctly. As I
  understand it at 4 set_irq() will return incorrect result. Basically
  when PIR is set to 1 while IRR has 1 for the vector the value of
  set_irq() will be incorrect.
 
 At 4 it has not been coalesced: it has been queued to IRR.
 At 6 it has been coalesced: PIR bit merged into IRR bit.
 
  Frankly I do not see how it can be fixed
  without any race with present HW PIR design.
 
 At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR
 already set, don't set PIR.

Or:

apic_accept_interrupt() {

1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
Never set IRR when HWAPIC enabled, even if outside of guest mode.
2. Set PIR and let HW or SW VM-entry transfer it to IRR.
3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
}
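
Expressed as code, the snapshot variant reads roughly as follows (sketch;
apic_test_vector() and the 'pir' bitmap as assumed elsewhere in this
thread):

/* Returns 1 when the interrupt is reported as coalesced. */
static int accept_irq_snapshot_sketch(struct kvm_lapic *apic, int vector,
                                      u64 *pir)
{
    /* 1. snapshot both bits before touching anything */
    int orig_pir = test_bit(vector, (unsigned long *)pir);
    int orig_irr = apic_test_vector(vector, apic->regs + APIC_IRR);

    /* 2. only PIR is ever set; HW or SW VM-entry moves it to IRR */
    set_bit(vector, (unsigned long *)pir);

    /* 3. coalesced if either bit was already set */
    return orig_pir || orig_irr;
}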

Two or more concurrent set_irq can race with each other, though. Can
either document the race or add a lock.




Re: [PATCH v3] KVM: set_memory_region: Identify the requested change explicitly

2013-02-04 Thread Marcelo Tosatti
On Tue, Jan 29, 2013 at 11:00:07AM +0900, Takuya Yoshikawa wrote:
 KVM_SET_USER_MEMORY_REGION forces __kvm_set_memory_region() to identify
 what kind of change is being requested by checking the arguments.  The
 current code does this checking at various points in code and each
 condition being used there is not easy to understand at first glance.
 
 This patch consolidates these checks and introduces an enum to name the
 possible changes to clean up the code.
 
 Although this does not introduce any functional changes, there is one
 change which optimizes the code a bit: if we have nothing to change, the
 new code returns 0 immediately.
 
 Note that the return value for this case cannot be changed since QEMU
 relies on it: we noticed this when we changed it to -EINVAL and got a
 section mismatch error at the final stage of live migration.
 
 Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
 ---
 v2: updated iommu related parts
 v3: converted !(A == B) to A != B
 Note: based on this patch, disallowing RO change will become trivial:
 check such changes when we identify KVM_MR_FLAGS_ONLY.
 
  virt/kvm/kvm_main.c |   64 
 +++
  1 files changed, 44 insertions(+), 20 deletions(-)

Applied, thanks. It would be good to propagate the enum into
kvm_arch_prepare/commit memory region.
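
For reference, the classification the patch introduces boils down to
something like this (reconstructed sketch, not the applied diff; only
KVM_MR_FLAGS_ONLY is named in the note above, the other enumerators are
assumed):

enum kvm_mr_change {
    KVM_MR_CREATE,
    KVM_MR_DELETE,
    KVM_MR_MOVE,
    KVM_MR_FLAGS_ONLY,
};

/* Decide what KVM_SET_USER_MEMORY_REGION is being asked to do. */
static enum kvm_mr_change classify_change(const struct kvm_memory_slot *old,
                                          gfn_t base_gfn, unsigned long npages)
{
    if (!old->npages)
        return KVM_MR_CREATE;
    if (!npages)
        return KVM_MR_DELETE;
    if (base_gfn != old->base_gfn)
        return KVM_MR_MOVE;
    return KVM_MR_FLAGS_ONLY;   /* e.g. toggling dirty logging */
}

Propagating the enum to kvm_arch_prepare/commit_memory_region() would then
just mean passing this value through instead of re-deriving it per arch.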



Re: Investigating abnormal stealtimes

2013-02-04 Thread Marcelo Tosatti

Nice!

- 'Steal time' is the amount of time during which the vcpu is runnable
but not running. Maybe 'vmexit latency' is a better name for what the
script measures.

- Perhaps it would be good to subtract the time the thread was
involuntarily scheduled out due to 'timeslice' expiration. Otherwise,
running a CPU intensive task returns false positives (that is, long
delays due to rescheduling because the 'timeslice' was exhausted by guest
CPU activity, not due to KVM or QEMU issues such as voluntarily
scheduling in pthread_mutex_lock).

Alternatively you can raise the priority of the vcpu threads (to get rid
of the false positives).

- Idea: it would be handy to extract the trace events in the offending
'latency above threshold' vmexit/vmentry region.
Say that you enable other trace events (unrelated to kvm) which can
help identify the culprit. Instead of scanning the file manually,
searching for 100466.1062486786, save one vmexit/vmentry cycle,
along with the other trace events in that period, in a separate file.

On Tue, Jan 29, 2013 at 01:27:12PM +0100, Stefan Hajnoczi wrote:
 Khoa and I have been discussing a workload that triggers softlockups and
 hung task warnings inside the guest.  These warnings can pop up due to
 bugs in the guest Linux kernel but they can also be triggered by the
 hypervisor if vcpus are not being scheduled at reasonable times.
 
 I've wanted a tool that reports high stealtimes and includes the last
 vmexit reason.  This allows us to figure out if specific I/O device
 emulation is taking too long or if other factors like host memory
 pressure are degrading guest performance.
 
 Here is a first sketch of such a tool.  It's a perf-script(1) Python
 script which can be used to analyze perf.data files recorded with
 kvm:kvm_entry and kvm:kvm_exit events.
 
 Stealtimes exceeding a threshold will be flagged up:
 
   $ perf script -s /absolute/path/to/stealtime.py
   100466.1062486786 9690: steal time 0.029318914 secs,
   exit_reason IO_INSTRUCTION,
   guest_rip 0x81278f02,
   isa 1, info1 0xcf80003, info2 0x0
 
 The example above shows an I/O access to 0xcf8 (PCI Configuration Space
 Address port) that took about 28 milliseconds.  The host pid was 9690; this
 can be used to investigate the QEMU vcpu thread.  The guest rip can be used
 to investigate guest code that triggered this vmexit.
 
 Given this information, it becomes possible to debug QEMU to figure out
 why vmexit handling is taking too long.  It might be due to global mutex
 contention if another thread holds the global mutex while blocking.
 This sort of investigation needs to be done manually today but it might
 be possible to add perf event handlers to watch for global mutex
 contention inside QEMU and automatically identify the culprit.
 
 Stalls inside the kvm kernel module can also be investigated since
 kvm:kvm_exit events are triggered when they happen too.
 
 I wanted to share in case it is useful for others.  Suggestions for
 better approaches welcome!
 
 Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
 
 ---
 
 #!/usr/bin/env python
 # perf script event handlers, generated by perf script -g python
 # Licensed under the terms of the GNU GPL License version 2
 
 # Script to print steal times longer than a given threshold
 #
 # To collect trace data:
 # $ perf record -a -e kvm:kvm_entry -e kvm:kvm_exit
 #
 # To print results from trace data:
 #
 # $ perf script -s /absolute/path/to/stealtime.py
 # 100466.1062486786 9690: steal time 0.029318914 secs,
 # exit_reason IO_INSTRUCTION,
 # guest_rip 0x81278f02,
 # isa 1, info1 0xcf80003, info2 0x0
 #
 # The example above shows an I/O access to 0xcf8 (PCI Configuration Space
 # Address port) that took about 28 milliseconds.  The host pid was 9690; this
 # can be used to investigate the QEMU vcpu thread.  The guest rip can be used
 # to investigate guest code that triggered this vmexit.
 
 # Print steal times longer than this threshold in milliseconds
 THRESHOLD_MS = 100
 
 import os
 import sys
 
 sys.path.append(os.environ['PERF_EXEC_PATH'] + \
   '/scripts/python/Perf-Trace-Util/lib/Perf/Trace')
 
 from perf_trace_context import *
 from Core import *
 
 vcpu_threads = {}
 
 def trace_begin():
 print 'argv:', str(sys.argv)
 
 def trace_end():
   pass
 
 def kvm__kvm_exit(event_name, context, common_cpu,
   common_secs, common_nsecs, common_pid, common_comm,
   exit_reason, guest_rip, isa, info1, 
   info2):
 
 if common_pid in vcpu_threads:
 last = vcpu_threads[common_pid]
 assert last[0] == 'kvm__kvm_entry'
 while last[2] > common_nsecs:
 common_secs -= 1
 common_nsecs += 1 * 1000 * 1000 * 1000
 delta_secs = common_secs - last[1]
 delta_nsecs = common_nsecs - last[2]
 
 vcpu_threads[common_pid] = (event_name, common_secs, common_nsecs, 
 

Re: [PATCH 6/6] Revert KVM: MMU: split kvm_mmu_free_page

2013-02-04 Thread Marcelo Tosatti
On Wed, Jan 30, 2013 at 04:45:05PM +0200, Gleb Natapov wrote:
 This reverts commit bd4c86eaa6ff10abc4e00d0f45d2a28b10b09df4.
 
 There is no user of kvm_mmu_isolate_page() any more.
 
 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
  arch/x86/kvm/mmu.c |   21 +++--

Applied all except this - Takuya's using it.



Re: [PATCH 0/2] KVM: set_memory_region: Cleanup and new restriction

2013-02-04 Thread Marcelo Tosatti
On Thu, Jan 31, 2013 at 12:15:01PM +0200, Gleb Natapov wrote:
 On Wed, Jan 30, 2013 at 07:38:37PM +0900, Takuya Yoshikawa wrote:
  Patch 1: just rebased for this series.
  Patch 2: an API change, so please let me know if you notice any problems.
  
  Takuya Yoshikawa (2):
KVM: set_memory_region: Identify the requested change explicitly
KVM: set_memory_region: Disallow changing read-only attribute later
  
   Documentation/virtual/kvm/api.txt |   12 ++--
   virt/kvm/kvm_main.c   |   95 
  +
   2 files changed, 60 insertions(+), 47 deletions(-)
  
 Reviewed-by: Gleb Natapov g...@redhat.com

Applied, thanks.



Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-03 Thread Marcelo Tosatti
On Thu, Jan 31, 2013 at 03:55:56PM +0200, Gleb Natapov wrote:
 On Thu, Jan 31, 2013 at 11:44:43AM -0200, Marcelo Tosatti wrote:
  On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote:
   On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote:
On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote:
 On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote:
  Posted interrupt patch:
  2) Must move IN_GUEST_MODE assignment after local_irq_disable, in
  vcpu_enter_guest function. Otherwise:
  
  cpu0vcpu1-cpu1
  
  vcpu-mode = IN_GUEST_MODE
  
  if IN_GUEST_MODE == true
  send IPI
  local_irq_disable
  
  PIR not transferred to VIRR, misses interrupt.
  
 cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after
 local_irq_disable() by -requests check.

Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose 
of posted interrupts. You want

if vcpu in guest mode
send posted interrupt IPI
else
KVM_REQ_EVENT+kick

   I am thinking:
   
set KVM_REQ_EVENT
if pi is enabled
send posted interrupt IPI
else
kick
  
  KVM_REQ_EVENT must be after sending posted interrupt IPI. Otherwise on
  the vcpu entry side
  
  test_and_clear(KVM_REQ_EVENT) {
  No bits set in PIR
 }
  
 It should be after updating PIR, but before sending posted interrupt
 IPI. Otherwise:
 
  cpu0 cpu1/vcpu
 
   KVM_REQ_EVENT is not set
 set pir 
 send IPI
   irq_disable()
   -request is empty.
 set KVM_REQ_EVENT
 
 That's the same sequence as with IRR update, KVM_REQ_EVENT and kick
 today.

Can only send IPI if vcpu-mode == IN_GUEST_MODE, which must be set
after interrupt flag is cleared as noted.

Also KVM_REQ_EVENT is processed outside of interrupt disabled region today.

Or maybe i don't get what you say... write a complete
description?

  What about item 4 below?
  
 That's for Yang to answer :)

If more than one interrupt is generated with the same vector number,
the local APIC can set the bit for the vector both in the IRR and ISR.
This means that for the Pentium 4 and Intel Xeon processors, the IRR
and ISR can queue two interrupts for each interrupt vector: one in the
IRR and one in the ISR. Any additional interrupts issued for the same
interrupt vector are collapsed into the single bit in IRR

Which would mean KVM emulation differs... Eventually 3 interrupts can
be queued: one in IRR, one in ISR, one in PIR.

Any example how software relies on such two-interrupts-queued-in-IRR/ISR 
behaviour?



Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-01-31 Thread Marcelo Tosatti
On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote:
 On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote:
  Posted interrupt patch:
  2) Must move IN_GUEST_MODE assignment after local_irq_disable, in
  vcpu_enter_guest function. Otherwise:
  
  cpu0                                    vcpu1->cpu1
  
                                          vcpu->mode = IN_GUEST_MODE
  
  if IN_GUEST_MODE == true
          send IPI
                                          local_irq_disable
  
  PIR not transferred to VIRR, misses interrupt.
  
 cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after
 local_irq_disable() by ->requests check.

Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose 
of posted interrupts. You want

if vcpu in guest mode
send posted interrupt IPI
else
KVM_REQ_EVENT+kick

  3) Must check outstanding PIR notification bit unconditionally on
  every VM-entry, because:
  
  1. local_irq_disable
  2. vcpu->mode = IN_GUEST_MODE
  3. vmenter
  4. vmexit
  5. vcpu->mode = OUTSIDE_GUEST_MODE
  
  If PIR-IPI-interrupt is sent between an event which triggers VM-exit
  (for example, an external interrupt due to a device), and step 5
  (assignment of vcpu->mode), the PIR->VIRR transfer before vmentry must
  be made.
 Not sure I understand, but I think KVM_REQ_EVENT will cover that too.

See above.
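
(To make items 2 and 3 concrete, a self-contained C11 model of the entry-side
ordering they call for is below.  Field and function names are invented for
the illustration and do not match the real vcpu_enter_guest/vmx code; the
PIR and IRR are shrunk to single words.)

#include <stdatomic.h>
#include <stdio.h>

enum vcpu_mode { OUTSIDE_GUEST_MODE, IN_GUEST_MODE };

struct vcpu_model {
    _Atomic unsigned long pir;      /* toy posted-interrupt request word */
    _Atomic unsigned long virr;     /* stand-in for the virtual-APIC IRR */
    _Atomic int mode;
    _Atomic unsigned long requests;
};

static void local_irq_disable(void) { /* stub */ }
static void local_irq_enable(void)  { /* stub */ }
static void vmenter(void)           { /* stub */ }

/*
 * Entry side, in the order argued for above: the mode flag is published
 * only after interrupts are off (item 2), and PIR is folded into the
 * virtual IRR unconditionally on every entry (item 3), so a notification
 * that raced with a VM-exit is not lost.
 */
static void enter_guest(struct vcpu_model *v)
{
    unsigned long pending;

    local_irq_disable();
    atomic_store(&v->mode, IN_GUEST_MODE);

    pending = atomic_exchange(&v->pir, 0);      /* unconditional PIR check */
    if (pending)
        atomic_fetch_or(&v->virr, pending);

    if (atomic_load(&v->requests)) {            /* a sender fell back to a request */
        atomic_store(&v->mode, OUTSIDE_GUEST_MODE);
        local_irq_enable();
        return;                                 /* abort this entry */
    }

    vmenter();
    atomic_store(&v->mode, OUTSIDE_GUEST_MODE);
    local_irq_enable();
}

int main(void)
{
    struct vcpu_model v = { 0 };

    atomic_fetch_or(&v.pir, 1UL << 9);          /* a vector posted earlier */
    enter_guest(&v);
    printf("virr=%#lx\n", (unsigned long)atomic_load(&v.virr));
    return 0;
}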

  
  4) Today, an interrupt notification is cached on IRR until it's delivered -
  further
  interrupt injection is not generating further interrupt notification
  bits. With PIR, behaviour changes: It's possible to have one bit in PIR and
  another 
  on IRR APIC page (if timing is right). Is this harmless? Why?
  
  
 
 --
   Gleb.


Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-01-31 Thread Marcelo Tosatti
On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote:
 On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote:
  On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote:
   On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote:
Posted interrupt patch:
2) Must move IN_GUEST_MODE assignment after local_irq_disable, in
vcpu_enter_guest function. Otherwise:

cpu0                                    vcpu1->cpu1

                                        vcpu->mode = IN_GUEST_MODE

if IN_GUEST_MODE == true
        send IPI
                                        local_irq_disable

PIR not transferred to VIRR, misses interrupt.

   cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after
    local_irq_disable() by ->requests check.
  
  Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose 
  of posted interrupts. You want
  
  if vcpu in guest mode
  send posted interrupt IPI
  else
  KVM_REQ_EVENT+kick
  
 I am thinking:
 
  set KVM_REQ_EVENT
  if pi is enabled
  send posted interrupt IPI
  else
  kick

KVM_REQ_EVENT must be after sending posted interrupt IPI. Otherwise on
the vcpu entry side

test_and_clear(KVM_REQ_EVENT) {
No bits set in PIR
   }

What about item 4 below?

3) Must check outstanding PIR notification bit unconditionally on
every VM-entry, because:

1. local_irq_disable
2. vcpu->mode = IN_GUEST_MODE
3. vmenter
4. vmexit
5. vcpu->mode = OUTSIDE_GUEST_MODE

If PIR-IPI-interrupt is sent between an event which triggers VM-exit
(for example, an external interrupt due to a device), and step 5
(assignment of vcpu->mode), the PIR->VIRR transfer before vmentry must
be made.
   Not sure I understand, but I think KVM_REQ_EVENT will cover that too.
  
  See above.
  

4) Today, an interrupt notification is cached on IRR until it's
delivered - further
interrupt injection is not generating further interrupt notification
bits. With PIR, behaviour changes: It's possible to have one bit in PIR
and another 
on IRR APIC page (if timing is right). Is this harmless? Why?


   
   --
 Gleb.
 
 --
   Gleb.


Re: windows 2008 guest causing rcu_sched to emit NMI

2013-01-30 Thread Marcelo Tosatti
On Wed, Jan 30, 2013 at 11:21:08AM +0300, Andrey Korolyov wrote:
 On Wed, Jan 30, 2013 at 3:15 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Tue, Jan 29, 2013 at 02:35:02AM +0300, Andrey Korolyov wrote:
  On Mon, Jan 28, 2013 at 5:56 PM, Andrey Korolyov and...@xdel.ru wrote:
   On Mon, Jan 28, 2013 at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com 
   wrote:
   On Mon, Jan 28, 2013 at 12:04:50AM +0300, Andrey Korolyov wrote:
   On Sat, Jan 26, 2013 at 12:49 AM, Marcelo Tosatti 
   mtosa...@redhat.com wrote:
On Fri, Jan 25, 2013 at 10:45:02AM +0300, Andrey Korolyov wrote:
On Thu, Jan 24, 2013 at 4:20 PM, Marcelo Tosatti 
mtosa...@redhat.com wrote:
 On Thu, Jan 24, 2013 at 01:54:03PM +0300, Andrey Korolyov wrote:
 Thank you Marcelo,

 Host node locking up sometimes later than yesterday, bur problem 
 still
 here, please see attached dmesg. Stuck process looks like
 root 19251  0.0  0.0 228476 12488 ?    D    14:42   0:00
 /usr/bin/kvm -no-user-config -device ? -device pci-assign,? 
 -device
 virtio-blk-pci,? -device

 on fourth vm by count.

 Should I try upstream kernel instead of applying patch to the 
 latest
 3.4 or it is useless?

 If you can upgrade to an upstream kernel, please do that.

   
With vanilla 3.7.4 there is almost no changes, and NMI started 
firing
again. External symptoms looks like following: starting from some
count, may be third or sixth vm, qemu-kvm process allocating its
memory very slowly and by jumps, 20M-200M-700M-1.6G in minutes. 
Patch
helps, of course - on both patched 3.4 and vanilla 3.7 I`m able to
kill stuck kvm processes and node returned back to the normal, when 
on
3.2 sending SIGKILL to the process causing zombies and hanged ``ps''
output (problem and workaround when no scheduler involved described
here http://www.spinics.net/lists/kvm/msg84799.html).
   
Try disabling pause loop exiting with ple_gap=0 kvm-intel.ko module 
parameter.
   
  
   Hi Marcelo,
  
   thanks, this parameter helped to increase number of working VMs in a
   half of order of magnitude, from 3-4 to 10-15. Very high SY load, 10
   to 15 percents, persists on such numbers for a long time, where linux
   guests in same configuration do not jump over one percent even under
   stress bench. After I disabled HT, crash happens only in long runs and
   now it is kernel panic :)
   Stair-like memory allocation behaviour disappeared, but other symptom
   leading to the crash which I have not counted previously, persists: if
   VM count is ``enough'' for crash, some qemu processes starting to eat
   one core, and they`ll panic system after run in tens of minutes in
   such state or if I try to attach debugger to one of them. If needed, I
   can log entire crash output via netconsole, now I have some tail,
   almost the same every time:
   http://xdel.ru/downloads/btwin.png
  
   Yes, please log entire crash output, thanks.
  
  
   Here please, 3.7.4-vanilla, 16 vms, ple_gap=0:
  
   http://xdel.ru/downloads/oops-default-kvmintel.txt
 
  Just an update: I was able to reproduce that on pure linux VMs using
  qemu-1.3.0 and ``stress'' benchmark running on them - panic occurs at
  start of vm(with count ten working machines at the moment). Qemu-1.1.2
  generally is not able to reproduce that, but host node with older
  version crashing on less amount of Windows VMs(three to six instead
  ten to fifteen) than with 1.3, please see trace below:
 
  http://xdel.ru/downloads/oops-old-qemu.txt
 
  Single bit memory error, apparently. Try:
 
  1. memtest86.
  2. Boot with slub_debug=ZFPU kernel parameter.
  3. Reproduce on different machine
 
 
 
 Hi Marcelo,
 
 I always follow the rule - if some weird bug exists, check it on
 ECC-enabled machine and check IPMI logs too before start complaining
 :) I have finally managed to ``fix'' the problem, but my solution
 seems a bit strange:
 - I have noticed that if virtual machines started without any cgroup
 setting they will not cause this bug under any conditions,
 - I have thought, very wrong in my mind, that the
 CONFIG_SCHED_AUTOGROUP should regroup the tasks without any cgroup and
 should not touch tasks already inside any existing cpu cgroup. First
 sight on the 200-line patch shows that the autogrouping always applies
 to all tasks, so I tried to disable it,
 - wild magic appears - VMs didn`t crashed host any more, even in count
 30+ they work fine.
 I still don`t know what exactly triggered that and will I face it
 again under different conditions, so my solution more likely to be a
 patch of mud in wall of the dam, instead of proper fixing.
 
 There seems to be two possible origins of such error - a very very
 hideous race condition involving cgroups and processes like qemu-kvm
 causing frequent context switches and simple incompatibility between
 NUMA, logic of CONFIG_SCHED_AUTOGROUP and qemu VMs already doing

Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-01-30 Thread Marcelo Tosatti
On Fri, Jan 25, 2013 at 12:40:21AM +, Zhang, Yang Z wrote:
 Marcelo Tosatti wrote on 2013-01-25:
  On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote:
  From: Yang Zhang yang.z.zh...@intel.com
  
  Posted Interrupt allows APIC interrupts to inject into guest directly
  without any vmexit.
  
  - When delivering an interrupt to the guest, if the target vcpu is running,
    update the Posted-interrupt requests bitmap and send a notification event
    to the vcpu. The vcpu will then handle this interrupt automatically,
    without any software involvement.
  - If the target vcpu is not running, or there is already a notification
    event pending in the vcpu, do nothing. The interrupt will be handled on
    the next vm entry.
  Signed-off-by: Yang Zhang yang.z.zh...@intel.com
  ---
   arch/x86/include/asm/entry_arch.h  |    1 +
   arch/x86/include/asm/hw_irq.h      |    1 +
   arch/x86/include/asm/irq.h         |    1 +
   arch/x86/include/asm/irq_vectors.h |    4 +
   arch/x86/include/asm/kvm_host.h    |    3 +
   arch/x86/include/asm/vmx.h         |    4 +
   arch/x86/kernel/entry_64.S         |    2 +
   arch/x86/kernel/irq.c              |   25 +++
   arch/x86/kernel/irqinit.c          |    2 +
   arch/x86/kvm/lapic.c               |   16 +++-
   arch/x86/kvm/lapic.h               |    1 +
   arch/x86/kvm/vmx.c                 |  133 +---
   12 files changed, 180 insertions(+), 13 deletions(-)
  diff --git a/arch/x86/include/asm/entry_arch.h
  b/arch/x86/include/asm/entry_arch.h index 40afa00..7b0a29e 100644 ---
  a/arch/x86/include/asm/entry_arch.h +++
  b/arch/x86/include/asm/entry_arch.h @@ -18,6 +18,7 @@
  BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR)
   #endif
   
   BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
  +BUILD_INTERRUPT(posted_intr_ipi, POSTED_INTR_VECTOR)
  
   /*
* every pentium local APIC has two 'local interrupts', with a
  diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
  index eb92a6e..ee61af3 100644
  --- a/arch/x86/include/asm/hw_irq.h
  +++ b/arch/x86/include/asm/hw_irq.h
  @@ -28,6 +28,7 @@
   /* Interrupt handlers registered during init_IRQ */ extern void
   apic_timer_interrupt(void); extern void x86_platform_ipi(void);
   +extern void posted_intr_ipi(void); extern void error_interrupt(void);
   extern void irq_work_interrupt(void);
  diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
  index ba870bb..cff9933 100644
  --- a/arch/x86/include/asm/irq.h
  +++ b/arch/x86/include/asm/irq.h
  @@ -30,6 +30,7 @@ extern void irq_force_complete_move(int);
   #endif
   
   extern void (*x86_platform_ipi_callback)(void); +extern void
   (*posted_intr_callback)(void); extern void native_init_IRQ(void);
   extern bool handle_irq(unsigned irq, struct pt_regs *regs);
  diff --git a/arch/x86/include/asm/irq_vectors.h
  b/arch/x86/include/asm/irq_vectors.h index 1508e51..8f2e383 100644 ---
  a/arch/x86/include/asm/irq_vectors.h +++
  b/arch/x86/include/asm/irq_vectors.h @@ -102,6 +102,10 @@
*/
   #define X86_PLATFORM_IPI_VECTOR   0xf7
  +#ifdef CONFIG_HAVE_KVM
  +#define POSTED_INTR_VECTOR0xf2
  +#endif
  +
   /*
* IRQ work vector:
*/
  diff --git a/arch/x86/include/asm/kvm_host.h
  b/arch/x86/include/asm/kvm_host.h index 7e26d1a..82423a8 100644 ---
  a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h
  @@ -700,6 +700,9 @@ struct kvm_x86_ops {
 int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
 void (*update_irq)(struct kvm_vcpu *vcpu);
 void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector, bool set);
  +  int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
  +  int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
  +  void (*update_irr)(struct kvm_vcpu *vcpu);
 int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 int (*get_tdp_level)(void);
 u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
  diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
  index 1003341..7b9e1d0 100644
  --- a/arch/x86/include/asm/vmx.h
  +++ b/arch/x86/include/asm/vmx.h
  @@ -152,6 +152,7 @@
   #define PIN_BASED_EXT_INTR_MASK 0x0001
   #define PIN_BASED_NMI_EXITING   0x0008
   #define PIN_BASED_VIRTUAL_NMIS  0x0020
  +#define PIN_BASED_POSTED_INTR   0x0080
  
   #define VM_EXIT_SAVE_DEBUG_CONTROLS 0x0002 #define
   VM_EXIT_HOST_ADDR_SPACE_SIZE0x0200 @@ -174,6 +175,7 @@
   /* VMCS Encodings */ enum vmcs_field {VIRTUAL_PROCESSOR_ID  
= 0x, +  POSTED_INTR_NV  = 0x0002,
 GUEST_ES_SELECTOR   = 0x0800,   GUEST_CS_SELECTOR 
= 0x0802,GUEST_SS_SELECTOR   = 0x0804,
   @@ -208,6 +210,8 @@ enum vmcs_field { VIRTUAL_APIC_PAGE_ADDR_HIGH
   = 0x2013, APIC_ACCESS_ADDR

Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-01-30 Thread Marcelo Tosatti
On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote:
 On Fri, Jan 25, 2013 at 12:40:21AM +, Zhang, Yang Z wrote:
  Marcelo Tosatti wrote on 2013-01-25:
   On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote:
   From: Yang Zhang yang.z.zh...@intel.com
   
   Posted Interrupt allows APIC interrupts to inject into guest directly
   without any vmexit.
   
    - When delivering an interrupt to the guest, if the target vcpu is running,
      update the Posted-interrupt requests bitmap and send a notification event
      to the vcpu. The vcpu will then handle this interrupt automatically,
      without any software involvement.
    - If the target vcpu is not running, or there is already a notification
      event pending in the vcpu, do nothing. The interrupt will be handled on
      the next vm entry.
   Signed-off-by: Yang Zhang yang.z.zh...@intel.com
   ---
 arch/x86/include/asm/entry_arch.h  |    1 +
 arch/x86/include/asm/hw_irq.h      |    1 +
 arch/x86/include/asm/irq.h         |    1 +
 arch/x86/include/asm/irq_vectors.h |    4 +
 arch/x86/include/asm/kvm_host.h    |    3 +
 arch/x86/include/asm/vmx.h         |    4 +
 arch/x86/kernel/entry_64.S         |    2 +
 arch/x86/kernel/irq.c              |   25 +++
 arch/x86/kernel/irqinit.c          |    2 +
 arch/x86/kvm/lapic.c               |   16 +++-
 arch/x86/kvm/lapic.h               |    1 +
 arch/x86/kvm/vmx.c                 |  133 +---
 12 files changed, 180 insertions(+), 13 deletions(-)
   diff --git a/arch/x86/include/asm/entry_arch.h
   b/arch/x86/include/asm/entry_arch.h index 40afa00..7b0a29e 100644 ---
   a/arch/x86/include/asm/entry_arch.h +++
   b/arch/x86/include/asm/entry_arch.h @@ -18,6 +18,7 @@
   BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR)
#endif

BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
   +BUILD_INTERRUPT(posted_intr_ipi, POSTED_INTR_VECTOR)
   
/*
 * every pentium local APIC has two 'local interrupts', with a
   diff --git a/arch/x86/include/asm/hw_irq.h 
   b/arch/x86/include/asm/hw_irq.h
   index eb92a6e..ee61af3 100644
   --- a/arch/x86/include/asm/hw_irq.h
   +++ b/arch/x86/include/asm/hw_irq.h
   @@ -28,6 +28,7 @@
/* Interrupt handlers registered during init_IRQ */ extern void
apic_timer_interrupt(void); extern void x86_platform_ipi(void);
+extern void posted_intr_ipi(void); extern void error_interrupt(void);
extern void irq_work_interrupt(void);
   diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
   index ba870bb..cff9933 100644
   --- a/arch/x86/include/asm/irq.h
   +++ b/arch/x86/include/asm/irq.h
   @@ -30,6 +30,7 @@ extern void irq_force_complete_move(int);
#endif

extern void (*x86_platform_ipi_callback)(void); +extern void
(*posted_intr_callback)(void); extern void native_init_IRQ(void);
extern bool handle_irq(unsigned irq, struct pt_regs *regs);
   diff --git a/arch/x86/include/asm/irq_vectors.h
   b/arch/x86/include/asm/irq_vectors.h index 1508e51..8f2e383 100644 ---
   a/arch/x86/include/asm/irq_vectors.h +++
   b/arch/x86/include/asm/irq_vectors.h @@ -102,6 +102,10 @@
 */
#define X86_PLATFORM_IPI_VECTOR 0xf7
   +#ifdef CONFIG_HAVE_KVM
   +#define POSTED_INTR_VECTOR  0xf2
   +#endif
   +
/*
 * IRQ work vector:
 */
   diff --git a/arch/x86/include/asm/kvm_host.h
   b/arch/x86/include/asm/kvm_host.h index 7e26d1a..82423a8 100644 ---
   a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h
   @@ -700,6 +700,9 @@ struct kvm_x86_ops {
int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
void (*update_irq)(struct kvm_vcpu *vcpu);
void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector, 
   bool set);
   +int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
   +int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
   +void (*update_irr)(struct kvm_vcpu *vcpu);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool 
   is_mmio);
   diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
   index 1003341..7b9e1d0 100644
   --- a/arch/x86/include/asm/vmx.h
   +++ b/arch/x86/include/asm/vmx.h
   @@ -152,6 +152,7 @@
#define PIN_BASED_EXT_INTR_MASK 0x0001
#define PIN_BASED_NMI_EXITING   0x0008
#define PIN_BASED_VIRTUAL_NMIS  0x0020
   +#define PIN_BASED_POSTED_INTR   0x0080
   
#define VM_EXIT_SAVE_DEBUG_CONTROLS 0x0002 #define
VM_EXIT_HOST_ADDR_SPACE_SIZE0x0200 @@ -174,6 +175,7 @@
/* VMCS Encodings */ enum vmcs_field {  VIRTUAL_PROCESSOR_ID  
 = 0x, +POSTED_INTR_NV  = 0x0002,
GUEST_ES_SELECTOR   = 0x0800

Re: [PATCH v2 06/12] KVM: MMU: introduce a static table to map guest access to spte access

2013-01-29 Thread Marcelo Tosatti
On Mon, Jan 28, 2013 at 10:07:15PM -0200, Marcelo Tosatti wrote:
 On Fri, Jan 25, 2013 at 10:46:31AM +0800, Xiao Guangrong wrote:
  On 01/25/2013 08:15 AM, Marcelo Tosatti wrote:
   On Wed, Jan 23, 2013 at 06:07:20PM +0800, Xiao Guangrong wrote:
   It makes set_spte more clean and reduces branch prediction
  
   Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
   ---
arch/x86/kvm/mmu.c |   37 ++---
1 files changed, 26 insertions(+), 11 deletions(-)
   
   Don't see set_spte as being a performance problem?
   IMO the current code is quite simple.
  
  Yes, this is not a performance problem.
  
  I just dislike this many continuous if-s in the function:
  
  if (xxx)
  xxx
  if (xxx)
  xxx
  
  
  Totally, it has 7 if-s before this patch.
  
  Okay, if you think this is unnecessary, i will drop this patch. :)
 
 Yes, please (unless you can show set_spte is a performance problem).

Same thing for spte fast drop: is it a performance problem? 

Please try to group changes into smaller, less controversial sets with 
a clear goal:

- Debated performance improvement.
- Cleanups (eg mmu_set_spte argument removal).
- Bug fixes.
- Performance improvements.

Thanks.
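
(For readers coming to this thread cold, the "static table" idea being
discussed can be illustrated with a self-contained toy: build, once, a table
indexed by every possible guest-access combination and let the set_spte-style
hot path become a single lookup instead of a chain of ifs.  The bit names and
values below are invented for the example; they are not KVM's ACC_ or
shadow-PTE encodings.)

#include <stdint.h>
#include <stdio.h>

#define ACC_EXEC   (1u << 0)
#define ACC_WRITE  (1u << 1)
#define ACC_USER   (1u << 2)

#define SPTE_NX    (1ull << 63)
#define SPTE_W     (1ull << 1)
#define SPTE_U     (1ull << 2)

/* One entry per guest-access combination, filled in once at init time. */
static uint64_t access_map[8];

static void build_access_map(void)
{
    for (unsigned int acc = 0; acc < 8; acc++) {
        uint64_t spte = 0;

        if (!(acc & ACC_EXEC))
            spte |= SPTE_NX;
        if (acc & ACC_WRITE)
            spte |= SPTE_W;
        if (acc & ACC_USER)
            spte |= SPTE_U;
        access_map[acc] = spte;
    }
}

/* The per-fault path is reduced to a table lookup. */
static uint64_t make_spte(unsigned int acc)
{
    return access_map[acc & 7];
}

int main(void)
{
    build_access_map();
    printf("%#llx\n", (unsigned long long)make_spte(ACC_WRITE | ACC_USER));
    return 0;
}

Whether this is worth doing is exactly the question raised above: it trades a
few easily-predicted branches for a table, so it only pays off if set_spte is
actually hot.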



Re: [PATCH v2 03/12] KVM: MMU: simplify mmu_set_spte

2013-01-29 Thread Marcelo Tosatti
On Tue, Jan 29, 2013 at 10:55:24AM +0800, Xiao Guangrong wrote:
 On 01/29/2013 08:21 AM, Marcelo Tosatti wrote:
  On Wed, Jan 23, 2013 at 06:05:29PM +0800, Xiao Guangrong wrote:
  In order to detect spte remapping, we can simply check whether the
  spte is already pointing to the pfn, even if the spte is not the
  last spte, since a middle spte points to a kernel pfn which cannot
  be mapped to userspace.
  
  This check is detecting spte overwrite, when a large spte is replaced by 
  pointer to spte table.
  
  Can't see why check for different pfn is safe: only 'int level' can
  differ, and pfn be equivalent, for example.
 
 The 'u64 *sptep' must be on the int level we want to set, that means:
  page_header(__pa(sptep)).role.level == int level.

Right, then stick a comment there noting which cases that condition handles.
Keep the current comment and add more.

 We discussed this before :), the discussion can be found at:
 http://marc.info/?l=kvm&m=135345057329427&w=2.

Note http://marc.info/?l=kvm&m=135345059929436&w=2, please take into
account in the future.



Re: [PATCH] kvm: Fix irqfd resampler list walk

2013-01-28 Thread Marcelo Tosatti
On Mon, Jan 28, 2013 at 08:25:00AM -0700, Alex Williamson wrote:
 On Mon, 2012-12-10 at 18:16 -0200, Marcelo Tosatti wrote:
  On Thu, Dec 06, 2012 at 02:44:59PM -0700, Alex Williamson wrote:
   Typo for the next pointer means we're walking random data here.
   
   Signed-off-by: Alex Williamson alex.william...@redhat.com
   Cc: sta...@vger.kernel.org [3.7]
   ---
   
   Not sure if this will make 3.7, so preemptively adding the stable flag
   
virt/kvm/eventfd.c |2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
  
  Applied, thanks.
  
 
 Hi Marcelo,
 
 This didn't seem to make it into 3.7.0 or any stable 3.7.  Can we
 promote it for stable?  Thanks,
 
 Alex

Yes, please send the patch with the backport you're interested in being
included to sta...@kernel.org.

http://www.kernel.org/doc/Documentation/stable_kernel_rules.txt




Re: [PATCH v13 0/3] x86, apicv: Add APIC virtualization support

2013-01-28 Thread Marcelo Tosatti
On Fri, Jan 25, 2013 at 10:18:48AM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 APIC virtualization is a new feature which can eliminate most VM exits
 when a vcpu handles an interrupt:
 
 APIC register virtualization:
 APIC read access doesn't cause APIC-access VM exits.
 APIC write becomes trap-like.
 
 Virtual interrupt delivery:
 Virtual interrupt delivery avoids KVM having to inject vAPIC interrupts
 manually; this is fully taken care of by the hardware.
 
 The first patch adds APIC register virtualization support.
 The second patch adds virtual x2apic mode support, since it is required by
 APICv when the guest uses the msr based way to access the APIC.
 The third patch adds virtual interrupt delivery support.
 
 Please refer to Intel SDM volume 3, chapter 29 for more details.
 Changes v12 to v13:
  * Remove unnecessary check when set virtualized apic access
  * Use vm_need_tpr_shadow() instead read vmcs to check tpr.
  * Check irqchip_in_kernel when set msr bitmap.
  * Correct comment format
  * Remove unnecessary callback when set eoi exit bitmap.
  * Disable vid when irqchip is in userspace.
  * Rename vmx_vcpu_has_apicv to vmx_vm_has_apicv.
  * Rebased on top of KVM upstream
 
 Changes v11 to v12:
  * Check irqchip in kernel when enabling apicv, if using userspace irq chip,
apicv cannot be used and must be disabled.
  * Rename some functions to more descriptive names.
  * Move ioapic entry parse logic to lapic.c
  * Rebased on top of KVM upstream
 
 Changes v10 to v11:
  * Use two new msr bitmaps for guests that enable x2apic mode:
    Since the msr bitmap is shared by all guests, updating the global msr
    bitmap would break guests that are not using x2apic. To solve this,
    we use two new msr bitmaps for guests which use x2apic.
 
 Changes v9 to v10:
  * Enable virtualize x2apic mode when guest is using x2apic and apicv:
There is no point to enable x2apic mode when apicv is disabled.
  * Grab ioapic_lock when traversing ioapic entries to set the eoi exit bitmap
  * Rebased on top of KVM upstream
 
 Changes v8 to v9:
  * Update eoi exit bitmap by vcpu itself.
  * Enable virtualize x2apic mode when guest is using x2apic.
  * Rebase on top of KVM upstream
 
 Changes v7 to v8:
  * According to Marcelo's suggestion, add comments for irr_pending and isr_count,
    since the two variables have different meanings when using apicv.
  * Set highest bit in vISR to SVI after migration.
  * Use spinlock to access eoi exit bitmap synchronously.
  * Enable virtualize x2apic mode when guest is using x2apic
  * Rebased on top of KVM upstream.
  
 
 Yang Zhang (3):
   x86, apicv: add APICv register virtualization support
   x86, apicv: add virtual x2apic support
   x86, apicv: add virtual interrupt delivery support
 
  arch/ia64/kvm/lapic.h   |6 +
  arch/x86/include/asm/kvm_host.h |6 +
  arch/x86/include/asm/vmx.h  |   14 ++
  arch/x86/kvm/irq.c  |   56 ++-
  arch/x86/kvm/lapic.c|  140 +
  arch/x86/kvm/lapic.h|   34 
  arch/x86/kvm/svm.c  |   24 +++
  arch/x86/kvm/vmx.c  |  327 ---
  arch/x86/kvm/x86.c  |   23 +++-
  include/linux/kvm_host.h|3 +
  virt/kvm/ioapic.c   |   39 +
  virt/kvm/ioapic.h   |4 +
  virt/kvm/irq_comm.c |   25 +++
  virt/kvm/kvm_main.c |5 +
  14 files changed, 647 insertions(+), 59 deletions(-)

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
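
(As an aside for readers new to the series: the "eoi exit bitmap" mentioned in
the changelog can be pictured with a small self-contained model.  With virtual
interrupt delivery the CPU completes most EOIs on its own; only vectors that
still need emulation, for example ones routed through the emulated ioapic, are
marked so that their EOI forces an exit.  The vector numbers below are made up
and this is a toy illustration, not the patch's code.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NR_VECTORS 256

struct eoi_exit_bitmap {
    uint64_t bits[NR_VECTORS / 64];
};

static void set_eoi_exit(struct eoi_exit_bitmap *bm, unsigned int vector)
{
    bm->bits[vector / 64] |= 1ull << (vector % 64);
}

static int eoi_causes_exit(const struct eoi_exit_bitmap *bm, unsigned int vector)
{
    return (bm->bits[vector / 64] >> (vector % 64)) & 1;
}

int main(void)
{
    struct eoi_exit_bitmap bm;
    unsigned int ioapic_routed[] = { 33, 40 };      /* hypothetical vectors */

    memset(&bm, 0, sizeof(bm));
    for (unsigned int i = 0; i < sizeof(ioapic_routed) / sizeof(ioapic_routed[0]); i++)
        set_eoi_exit(&bm, ioapic_routed[i]);

    printf("vector 33 exits on EOI: %d\n", eoi_causes_exit(&bm, 33));
    printf("vector 34 exits on EOI: %d\n", eoi_causes_exit(&bm, 34));
    return 0;
}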



Re: [PATCH v2 03/12] KVM: MMU: simplify mmu_set_spte

2013-01-28 Thread Marcelo Tosatti
On Wed, Jan 23, 2013 at 06:05:29PM +0800, Xiao Guangrong wrote:
 In order to detect spte remapping, we can simply check whether the
 spte is already pointing to the pfn, even if the spte is not the
 last spte, since a middle spte points to a kernel pfn which cannot
 be mapped to userspace.

This check is detecting spte overwrite, when a large spte is replaced by 
pointer to spte table.

Can't see why check for different pfn is safe: only 'int level' can
differ, and pfn be equivalent, for example.



Re: [PATCH v2 06/12] KVM: MMU: introduce a static table to map guest access to spte access

2013-01-28 Thread Marcelo Tosatti
On Fri, Jan 25, 2013 at 10:46:31AM +0800, Xiao Guangrong wrote:
 On 01/25/2013 08:15 AM, Marcelo Tosatti wrote:
  On Wed, Jan 23, 2013 at 06:07:20PM +0800, Xiao Guangrong wrote:
  It makes set_spte more clean and reduces branch prediction
 
  Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
  ---
   arch/x86/kvm/mmu.c |   37 ++---
   1 files changed, 26 insertions(+), 11 deletions(-)
  
  Don't see set_spte as being a performance problem?
  IMO the current code is quite simple.
 
 Yes, this is not a performance problem.
 
 I just dislike this many continuous if-s in the function:
 
 if (xxx)
   xxx
 if (xxx)
   xxx
 
 
 Totally, it has 7 if-s before this patch.
 
 Okay, if you think this is unnecessary, i will drop this patch. :)

Yes, please (unless you can show set_spte is a performance problem).



