[Bug 50921] kvm hangs booting Windows 2000

2012-11-27 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50921





--- Comment #17 from Gleb g...@redhat.com  2012-11-27 08:22:15 ---
(In reply to comment #16)
 @xiaoguangrong: YOU ARE THE MAN! 'emulate_invalid_guest_state = 0' did the
 trick, now I have win2000 running in a 3.6.7 kvm guest! Thanks.
 
 Still guessing why it works with plain kvm-intel.ko in Debian kernels is out of
 my reach, but I can safely shove that mystery in the when-I'll-have-time-stuff
 drawer and live happy with this solution meanwhile.

Because emulate_invalid_guest_state = 0 was the default before 3.6.0. This is
why I wanted you to try 3.5.0.
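
(For reference, this kind of knob is an ordinary boolean module parameter whose
compiled-in default was flipped for 3.6; a simplified sketch of such a
declaration, illustrative only and not the verbatim vmx.c source:)

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    /* Illustrative: a boolean module parameter whose built-in default
     * changed from 0 to 1, which is why loading kvm-intel with
     * emulate_invalid_guest_state=0 restores the pre-3.6 behaviour. */
    static bool __read_mostly emulate_invalid_guest_state = true;
    module_param(emulate_invalid_guest_state, bool, S_IRUGO);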



Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-27 Thread Glauber Costa
Hi,

On 11/27/2012 12:36 AM, Michael Wolf wrote:
 In the case where you have a system that is running in a
 capped or overcommitted environment, the user may see steal time
 being reported in accounting tools such as top or vmstat.  This can
 cause confusion for the end user.  To ease the confusion, this patch set
 adds the idea of consigned (expected steal) time.  The host will separate
 the consigned time from the steal time.  The consignment limit passed to the
 host will be the amount of steal time expected within a fixed period of
 time.  Any other steal time accruing during that period will show as the
 traditional steal time.
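
(For concreteness, a minimal sketch of the accounting split described above;
the function and field names here are hypothetical, not the series' actual API:)

    #include <linux/types.h>

    /* Hypothetical illustration: steal time up to the consign limit for the
     * accounting period is reported as consigned time; anything beyond it
     * remains traditional steal time. */
    static void split_steal_time(u64 steal, u64 consign_limit,
                                 u64 *consigned, u64 *remaining_steal)
    {
            if (steal <= consign_limit) {
                    *consigned = steal;
                    *remaining_steal = 0;
            } else {
                    *consigned = consign_limit;
                    *remaining_steal = steal - consign_limit;
            }
    }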

If you submit this again, please include a version number in your series.

It would also be helpful to include a small changelog about what changed
between last version and this version, so we could focus on that.

As for the rest, I answered your previous two submissions saying I don't
agree with the concept. If you haven't changed anything, resending it
won't change my mind.

I could, of course, be mistaken or misguided. But I had also not seen any
wave of support in favor of this previously, so basically I have no new
data to make me believe I should see it any differently.

Let's try this again:

* Rik asked you in your last submission how ppc handles this. You
said, and I quote: "In the case of lpar on POWER systems they simply
report steal time and do not alter it in any way.
They do however report how much processor is assigned to the partition
and that information is in /proc/ppc64/lparcfg."

Now, that is a *way* more sensible thing to do. Much more. Confusing
users is something extremely subjective. This is especially true about
concepts that have been known for quite some time, like steal time. If you
all of a sudden change the meaning of this, it is sure to confuse a lot more
users than it would clarify.





 
 ---
 
 Michael Wolf (5):
   Alter the amount of steal time reported by the guest.
   Expand the steal time msr to also contain the consigned time.
   Add the code to send the consigned time from the host to the guest
   Add a timer to allow the separation of consigned from steal time.
   Add an ioctl to communicate the consign limit to the host.
 
 
  arch/x86/include/asm/kvm_host.h   |   11 +++
  arch/x86/include/asm/kvm_para.h   |3 +-
  arch/x86/include/asm/paravirt.h   |4 +--
  arch/x86/include/asm/paravirt_types.h |2 +
  arch/x86/kernel/kvm.c |8 ++---
  arch/x86/kernel/paravirt.c|4 +--
  arch/x86/kvm/x86.c|   50 -
  fs/proc/stat.c|9 +-
  include/linux/kernel_stat.h   |2 +
  include/linux/kvm_host.h  |2 +
  include/uapi/linux/kvm.h  |2 +
  kernel/sched/core.c   |   10 ++-
  kernel/sched/cputime.c|   21 +-
  kernel/sched/sched.h  |2 +
  virt/kvm/kvm_main.c   |7 +
  15 files changed, 120 insertions(+), 17 deletions(-)
 



Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-27 Thread Gleb Natapov
On Tue, Nov 27, 2012 at 03:38:05AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2012-11-26:
  On Mon, Nov 26, 2012 at 12:29:54PM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2012-11-26:
  On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2012-11-25:
  On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
  Posted Interrupt allows vAPICV interrupts to inject into guest directly
  without any vmexit.
  
  - When delivering a interrupt to guest, if target vcpu is running,
update Posted-interrupt requests bitmap and send a notification
event to the vcpu. Then the vcpu will handle this interrupt
  automatically, without any software involvement.
  Looks like you allocating one irq vector per vcpu per pcpu and then
  migrate it or reallocate when vcpu move from one pcpu to another.
  This is not scalable and migrating irq migration slows things down.
  What's wrong with allocating one global vector for posted interrupt
  during vmx initialization and use it for all vcpus?
  
  Consider the following situation:
  If vcpu A is running when notification event which belong to vcpu B is
  arrived,
  since the vector match the vcpu A's notification vector, then this event
  will be consumed by vcpu A(even it do nothing) and the interrupt cannot
  be handled in time. The exact same situation is possible with your code.
  vcpu B can be migrated from pcpu and vcpu A will take its place and will
  be assigned the same vector as vcpu B. But I fail to see why is this a
  No, the on bit will be set to prevent notification event when vcpu B start
  migration. And it only free the vector before it going to run in another 
  pcpu.
  There is a race. Sender check on bit, vcpu B migrate to another pcpu and
  starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
  A gets it.
 Yes, it do exist. But I think it should be ok even this happens.
 
Then it is OK to use a global PI vector. The race should be dealt with
anyway.

  
  problem. vcpu A will ignore PI since pir will be empty and vcpu B should
  detect new event during next vmentry.
  Yes, but the next vmentry may happen long time later and interrupt cannot 
  be
  serviced until next vmentry. In current way, it will cause vmexit and 
  re-schedule
  the vcpu B.
  Vmentry will happen when scheduler will decide that vcpu can run. There
 I don't know how scheduler can know the vcpu can run in this case, can you 
 elaborate it? 
 I thought it may have problems with global vector in some cases(maybe I am 
 wrong, since I am not familiar with KVM scheduler):
 If target VCPU is in idle, and this notification event is consumed by other 
 VCPU, then how can scheduler know the vcpu is ready to run? Even there is a 
 way for scheduler to know, then when? Isn't it too late?
 If notification event arrived in hypervisor, then how the handler know which 
 VCPU the notification event belong to?
When a vcpu is idle, its thread sleeps inside the host kernel (see
virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
you should call kvm_vcpu_kick(), but only after changing the vcpu
state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
checks whether a vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
there, which checks the apic IRR but not the PIR, so it is not enough to set a
bit in the PIR and call kvm_vcpu_kick() to wake up the vcpu.
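
(A minimal sketch of that ordering; kvm_vcpu_kick() is the real helper, while
the PIR-to-IRR sync step below is a hypothetical placeholder for whatever makes
the interrupt visible to kvm_arch_vcpu_runnable():)

    #include <linux/kvm_host.h>

    /* Hypothetical: copy pending bits from the PIR into the apic IRR,
     * i.e. somewhere kvm_arch_vcpu_runnable() actually looks. */
    static void sync_pir_to_irr(struct kvm_vcpu *vcpu)
    {
    }

    /* The wake-up must come only after the vcpu has been made runnable,
     * otherwise kvm_vcpu_block() just goes back to sleep. */
    static void deliver_and_wake(struct kvm_vcpu *vcpu)
    {
            sync_pir_to_irr(vcpu);  /* make the vcpu runnable first */
            kvm_vcpu_kick(vcpu);    /* then wake the thread in kvm_vcpu_block() */
    }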

 
  is no problem here. What you probably want to say is that vcpu may not be
  aware of interrupt happening since it was migrated to different vcpu
  just after PI IPI was sent and thus missed it. But than PIR interrupts
  should be processed during vmentry on another pcpu:
  
  Sender:                                 Guest:

  set pir
  set on
  if (vcpu in guest mode on pcpu1)
                                          vmexit on pcpu1
                                          vmentry on pcpu2
                                          process pir, deliver interrupt
     send PI IPI to pcpu1
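
(The sender side of the diagram as plain C, purely illustrative; the pi_desc
layout and send_notification_ipi() are assumptions, not the posted code:)

    #include <linux/bitops.h>
    #include <asm/barrier.h>

    struct pi_desc {
            unsigned long pir[256 / BITS_PER_LONG]; /* posted-interrupt requests */
            unsigned long on;                       /* outstanding notification  */
    };

    static void send_notification_ipi(int pcpu)
    {
            /* hypothetical wrapper around the notification-vector IPI */
    }

    static void post_interrupt(struct pi_desc *pi, int vector,
                               bool in_guest_mode, int dest_pcpu)
    {
            set_bit(vector, pi->pir);       /* "set pir" */
            smp_mb();                       /* pir must be visible before on */
            set_bit(0, &pi->on);            /* "set on" */
            smp_mb();                       /* order against the guest-mode check */

            if (in_guest_mode)
                    send_notification_ipi(dest_pcpu);
            /* otherwise the bits are picked up from pir on the next vmentry */
    }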
 
  
  
  
  +  if (!cfg) {
  +  free_irq_at(irq, NULL);
  +  return 0;
  +  }
  +
  +  raw_spin_lock_irqsave(vector_lock, flags);
  +  if (!__assign_irq_vector(irq, cfg, mask))
  +  ret = irq;
  +  raw_spin_unlock_irqrestore(vector_lock, flags);
  +
  +  if (ret) {
  +  irq_set_chip_data(irq, cfg);
  +  irq_clear_status_flags(irq, IRQ_NOREQUEST);
  +  } else {
  +  free_irq_at(irq, cfg);
  +  }
  +  return ret;
  +}
  
  This function is mostly cutpaste of create_irq_nr().
  
  Yes, this function allow to allocate vector from specified cpu.
  
  Does not justify code duplication.
  ok. will change it in next version.
  
  Please use single global vector in the next version.
  
  
 if (kvm_x86_ops-has_virtual_interrupt_delivery(vcpu))
 apic-vid_enabled = true;
  +
  +  if (kvm_x86_ops-has_posted_interrupt(vcpu))
  +  apic-pi_enabled = true;
  +
  This is global 

[net-next rfc v7 0/3] Multiqueue virtio-net

2012-11-27 Thread Jason Wang
Hi all:

This series is an updated version of the multiqueue virtio-net driver, based on
Krishna Kumar's work, which lets virtio-net use multiple rx/tx queues for
packet reception and transmission. Please review and comment.

A prototype implementation of qemu-kvm support can be found in
git://github.com/jasowang/qemu-kvm-mq.git. To start a guest with two queues, you
can specify the queues parameter for both tap and virtio-net like:

./qemu-kvm -netdev tap,queues=2,... -device virtio-net-pci,queues=2,...

then enable the multiqueue through ethtool by:

ethtool -L eth0 combined 2

Changes from V6:
- Align the implementation with the RFC spec update v5
- Addressing Rusty's comments:
  * split the patches
  * rename to max_queue_pairs and curr_queue_pairs
  * remove the useless status
  * fix the hibernation bug
- Addressing Ben's comments:
  * check other parameters in ethtool_set_queues

Changes from v5:
- Align the implementation with the RFC spec update v4
- Switch the mode between single mode and multiqueue mode without reset
- Remove the 256 limitation of queues
- Use helpers to do the mapping between virtqueues and tx/rx queues
- Use combined channels instead of separate rx/tx queues when doing the queue
number configuration
- Other coding style comments from Michael

Changes from V4:
- Add ability to negotiate the number of queues through control virtqueue
- Ethtool -{L|l} support and default the tx/rx queue number to 1
- Expose the API to set irq affinity instead of irq itself

Changes from V3:
- Rebase to the net-next
- Let queue 2 to be the control virtqueue to obey the spec
- Provides irq affinity
- Choose txq based on processor id

Reference:
- Virtio spec RFC: http://patchwork.ozlabs.org/patch/201303/
- V6: https://lkml.org/lkml/2012/10/30/127
- V5: http://lwn.net/Articles/505388/
- V4: https://lkml.org/lkml/2012/6/25/120
- V2: http://lwn.net/Articles/467283/


Perf Numbers:

Will do some basic test and post as a reply to this mail.

Jason Wang (3):
  virtio-net: separate fields of sending/receiving queue from
virtnet_info
  virtio_net: multiqueue support
  virtio-net: change the number of queues through ethtool

 drivers/net/virtio_net.c|  716 ---
 include/uapi/linux/virtio_net.h |   16 +
 2 files changed, 536 insertions(+), 196 deletions(-)



[net-next rfc v7 1/3] virtio-net: separate fields of sending/receiving queue from virtnet_info

2012-11-27 Thread Jason Wang
To support multiqueue transmitq/receiveq, the first step is to separate the
queue-related structures from virtnet_info. This patch introduces send_queue and
receive_queue structures and uses pointers to them as the parameters of the
functions handling sending/receiving.

Signed-off-by: Krishna Kumar krkum...@in.ibm.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/net/virtio_net.c |  289 +-
 1 files changed, 158 insertions(+), 131 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 26c502e..7975133 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -51,16 +51,44 @@ struct virtnet_stats {
u64 rx_packets;
 };
 
+/* Internal representation of a send virtqueue */
+struct send_queue {
+   /* Virtqueue associated with this send_queue */
+   struct virtqueue *vq;
+
+   /* TX: fragments + linear part + virtio header */
+   struct scatterlist sg[MAX_SKB_FRAGS + 2];
+};
+
+/* Internal representation of a receive virtqueue */
+struct receive_queue {
+   /* Virtqueue associated with this receive_queue */
+   struct virtqueue *vq;
+
+struct napi_struct napi;
+
+/* Number of input buffers, and max we've ever had. */
+unsigned int num, max;
+
+   /* Work struct for refilling if we run low on memory. */
+   struct delayed_work refill;
+
+   /* Chain pages by the private ptr. */
+   struct page *pages;
+
+   /* RX: fragments + linear part + virtio header */
+   struct scatterlist sg[MAX_SKB_FRAGS + 2];
+};
+
 struct virtnet_info {
struct virtio_device *vdev;
-   struct virtqueue *rvq, *svq, *cvq;
+   struct virtqueue *cvq;
struct net_device *dev;
struct napi_struct napi;
+   struct send_queue sq;
+   struct receive_queue rq;
unsigned int status;
 
-   /* Number of input buffers, and max we've ever had. */
-   unsigned int num, max;
-
/* I like... big packets and I cannot lie! */
bool big_packets;
 
@@ -73,21 +101,11 @@ struct virtnet_info {
/* Active statistics */
struct virtnet_stats __percpu *stats;
 
-   /* Work struct for refilling if we run low on memory. */
-   struct delayed_work refill;
-
/* Work struct for config space updates */
struct work_struct config_work;
 
/* Lock for config space updates */
struct mutex config_lock;
-
-   /* Chain pages by the private ptr. */
-   struct page *pages;
-
-   /* fragments + linear part + virtio header */
-   struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
-   struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 };
 
 struct skb_vnet_hdr {
@@ -117,22 +135,22 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct 
sk_buff *skb)
  * private is used to chain pages for big packets, put the whole
  * most recent used list in the beginning for reuse
  */
-static void give_pages(struct virtnet_info *vi, struct page *page)
+static void give_pages(struct receive_queue *rq, struct page *page)
 {
struct page *end;
 
-   /* Find end of list, sew whole thing into vi-pages. */
+   /* Find end of list, sew whole thing into vi-rq.pages. */
for (end = page; end-private; end = (struct page *)end-private);
-   end-private = (unsigned long)vi-pages;
-   vi-pages = page;
+   end-private = (unsigned long)rq-pages;
+   rq-pages = page;
 }
 
-static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
+static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 {
-   struct page *p = vi-pages;
+   struct page *p = rq-pages;
 
if (p) {
-   vi-pages = (struct page *)p-private;
+   rq-pages = (struct page *)p-private;
/* clear private here, it is used to chain pages */
p-private = 0;
} else
@@ -140,12 +158,12 @@ static struct page *get_a_page(struct virtnet_info *vi, 
gfp_t gfp_mask)
return p;
 }
 
-static void skb_xmit_done(struct virtqueue *svq)
+static void skb_xmit_done(struct virtqueue *vq)
 {
-   struct virtnet_info *vi = svq-vdev-priv;
+   struct virtnet_info *vi = vq-vdev-priv;
 
/* Suppress further interrupts. */
-   virtqueue_disable_cb(svq);
+   virtqueue_disable_cb(vq);
 
/* We were probably waiting for more output buffers. */
netif_wake_queue(vi-dev);
@@ -167,9 +185,10 @@ static void set_skb_frag(struct sk_buff *skb, struct page 
*page,
 }
 
 /* Called from bottom half context */
-static struct sk_buff *page_to_skb(struct virtnet_info *vi,
+static struct sk_buff *page_to_skb(struct receive_queue *rq,
   struct page *page, unsigned int len)
 {
+   struct virtnet_info *vi = rq-vq-vdev-priv;
struct sk_buff *skb;
struct skb_vnet_hdr *hdr;
unsigned int copy, hdr_len, offset;
@@ -224,12 +243,12 @@ static struct sk_buff 

[net-next rfc v7 2/3] virtio_net: multiqueue support

2012-11-27 Thread Jason Wang
This adds multiqueue support to the virtio_net driver. In multiqueue mode, the
driver expects the number of queue pairs to be equal to the number of vcpus. To
eliminate the contention between vcpus and virtqueues, per-cpu virtqueue pairs
were implemented through:

- selecting the txq based on the smp processor id.
- setting the smp affinity hint to the vcpu that owns the queue pair.

Signed-off-by: Krishna Kumar krkum...@in.ibm.com
Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/net/virtio_net.c|  454 ++-
 include/uapi/linux/virtio_net.h |   16 ++
 2 files changed, 371 insertions(+), 99 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7975133..bcaa6e5 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -84,17 +84,25 @@ struct virtnet_info {
struct virtio_device *vdev;
struct virtqueue *cvq;
struct net_device *dev;
-   struct napi_struct napi;
-   struct send_queue sq;
-   struct receive_queue rq;
+   struct send_queue *sq;
+   struct receive_queue *rq;
unsigned int status;
 
+   /* Max # of queue pairs supported by the device */
+   u16 max_queue_pairs;
+
+   /* # of queue pairs currently used by the driver */
+   u16 curr_queue_pairs;
+
/* I like... big packets and I cannot lie! */
bool big_packets;
 
/* Host will merge rx buffers for big packets (shake it! shake it!) */
bool mergeable_rx_bufs;
 
+   /* Has control virtqueue */
+   bool has_cvq;
+
/* enable config space updates */
bool config_enable;
 
@@ -126,6 +134,34 @@ struct padded_vnet_hdr {
char padding[6];
 };
 
+static const struct ethtool_ops virtnet_ethtool_ops;
+
+/*
+ * Converting between virtqueue no. and kernel tx/rx queue no.
+ * 0:rx0 1:tx0 2:cvq 3:rx1 4:tx1 ... 2N+1:rxN 2N+2:txN
+ */
+static int vq2txq(struct virtqueue *vq)
+{
+   int index = virtqueue_get_queue_index(vq);
+   return index == 1 ? 0 : (index - 2) / 2;
+}
+
+static int txq2vq(int txq)
+{
+   return txq ? 2 * txq + 2 : 1;
+}
+
+static int vq2rxq(struct virtqueue *vq)
+{
+   int index = virtqueue_get_queue_index(vq);
+   return index ? (index - 1) / 2 : 0;
+}
+
+static int rxq2vq(int rxq)
+{
+   return rxq ? 2 * rxq + 1 : 0;
+}
+
 static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
 {
return (struct skb_vnet_hdr *)skb-cb;
@@ -166,7 +202,7 @@ static void skb_xmit_done(struct virtqueue *vq)
virtqueue_disable_cb(vq);
 
/* We were probably waiting for more output buffers. */
-   netif_wake_queue(vi-dev);
+   netif_wake_subqueue(vi-dev, vq2txq(vq));
 }
 
 static void set_skb_frag(struct sk_buff *skb, struct page *page,
@@ -503,7 +539,7 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t 
gfp)
 static void skb_recv_done(struct virtqueue *rvq)
 {
struct virtnet_info *vi = rvq-vdev-priv;
-   struct receive_queue *rq = vi-rq;
+   struct receive_queue *rq = vi-rq[vq2rxq(rvq)];
 
/* Schedule NAPI, Suppress further interrupts if successful. */
if (napi_schedule_prep(rq-napi)) {
@@ -650,7 +686,8 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff 
*skb)
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
-   struct send_queue *sq = vi-sq;
+   int qnum = skb_get_queue_mapping(skb);
+   struct send_queue *sq = vi-sq[qnum];
int capacity;
 
/* Free up any pending old buffers before queueing new ones. */
@@ -664,13 +701,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct 
net_device *dev)
if (likely(capacity == -ENOMEM)) {
if (net_ratelimit())
dev_warn(dev-dev,
-TX queue failure: out of memory\n);
+TXQ (%d) failure: out of memory\n,
+qnum);
} else {
dev-stats.tx_fifo_errors++;
if (net_ratelimit())
dev_warn(dev-dev,
-Unexpected TX queue failure: %d\n,
-capacity);
+Unexpected TXQ (%d) failure: %d\n,
+qnum, capacity);
}
dev-stats.tx_dropped++;
kfree_skb(skb);
@@ -685,12 +723,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
+		netif_stop_subqueue(dev, qnum);

[net-next rfc v7 3/3] virtio-net: change the number of queues through ethtool

2012-11-27 Thread Jason Wang
This patch implements the {set|get}_channels methods of ethtool to allow the user
to change the number of queues dynamically while the device is running. This
lets the user configure it on demand.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/net/virtio_net.c |   41 +
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bcaa6e5..f08ec2a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1578,10 +1578,51 @@ static struct virtio_driver virtio_net_driver = {
 #endif
 };
 
+/* TODO: Eliminate OOO packets during switching */
+static int virtnet_set_channels(struct net_device *dev,
+				struct ethtool_channels *channels)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	u16 queue_pairs = channels->combined_count;
+
+	/* We don't support separate rx/tx channels.
+	 * We don't allow setting 'other' channels.
+	 */
+	if (channels->rx_count || channels->tx_count || channels->other_count)
+		return -EINVAL;
+
+	/* Only two modes were support currently */
+	if (queue_pairs != vi->max_queue_pairs && queue_pairs != 1)
+		return -EINVAL;
+
+	vi->curr_queue_pairs = queue_pairs;
+	BUG_ON(virtnet_set_queues(vi));
+
+	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
+	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
+
+	return 0;
+}
+
+static void virtnet_get_channels(struct net_device *dev,
+				 struct ethtool_channels *channels)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+
+	channels->combined_count = vi->curr_queue_pairs;
+	channels->max_combined = vi->max_queue_pairs;
+	channels->max_other = 0;
+	channels->rx_count = 0;
+	channels->tx_count = 0;
+	channels->other_count = 0;
+}
+
 static const struct ethtool_ops virtnet_ethtool_ops = {
.get_drvinfo = virtnet_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_ringparam = virtnet_get_ringparam,
+   .set_channels = virtnet_set_channels,
+   .get_channels = virtnet_get_channels,
 };
 
 static int __init init(void)
-- 
1.7.1



Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case

2012-11-27 Thread Raghavendra K T

On 11/26/2012 07:13 PM, Andrew Jones wrote:

On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

yield_to returns -ESRCH when the run queue length of both the source and
target of yield_to is one. When we see three successive failures of
yield_to we assume we are in a potential undercommit case and abort
from the PLE handler.
The assumption is backed by the low probability of a wrong decision
even for worst-case scenarios such as an average runqueue length
between 1 and 2.

Note that we do not update the last boosted vcpu in failure cases.
Thanks to Avi for raising the question of aborting after the first failure of
yield_to.


Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
  virt/kvm/kvm_main.c |   26 --
  1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..053f494 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 {
 	struct pid *pid;
 	struct task_struct *task = NULL;
+	bool ret = false;

 	rcu_read_lock();
 	pid = rcu_dereference(target->pid);
@@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 	task = get_pid_task(target->pid, PIDTYPE_PID);
 	rcu_read_unlock();
 	if (!task)
-		return false;
+		return ret;
 	if (task->flags & PF_VCPU) {
 		put_task_struct(task);
-		return false;
-	}
-	if (yield_to(task, 1)) {
-		put_task_struct(task);
-		return true;
+		return ret;
 	}
+	ret = yield_to(task, 1);
 	put_task_struct(task);
-	return false;
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);

@@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 	return eligible;
 }
 #endif
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
 	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
 	int yielded = 0;
+	int try = 3;
 	int pass;
 	int i;

@@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 2 && !yielded && try; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
@@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 				continue;
-			if (kvm_vcpu_yield_to(vcpu)) {
+
+			yielded = kvm_vcpu_yield_to(vcpu);
+			if (yielded > 0) {
 				kvm->last_boosted_vcpu = i;
-				yielded = 1;
 				break;
+			} else if (yielded < 0) {
+				try--;
+				if (!try)
+					break;
 			}
 		}
 	}



Drew, Thanks for reviewing this.


The check done in patch 1/2 is done before the double_rq_lock, so it's
cheap. Now, this patch is to avoid doing too many get_pid_task calls. I
wonder if it would make more sense to change the vcpu state from tracking
the pid to tracking the task. If that was done, then I don't believe this
patch is necessary.


We would need logic not to break out upon the first failure of yield_to
(which happens otherwise with patch 1 alone). Breaking out of the PLE
handler upon the first failure resulted in degradation in moderate
overcommit due to false exits, even when we have more than one task in
other cpus' run queues.

But your suggestion triggered an idea: what would be the cost of
iterating over all vcpus despite yield_to failures?

(Where we break out of the PLE handler only if we have a successful yield,
i.e. yielded > 0), with something like this:


-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 2 && yielded <= 0; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
@@ -1727,11 +1727,12 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 				continue;
-			if (kvm_vcpu_yield_to(vcpu)) {
+
+ 

Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task

2012-11-27 Thread Raghavendra K T

On 11/26/2012 07:05 PM, Andrew Jones wrote:

On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:

From: Peter Zijlstra pet...@infradead.org

In case of undercommitted scenarios, especially in large guests,
yield_to overhead is significantly high. When the run queue length of
source and target is one, take the opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of the PLE handler.

(History: Raghavendra initially worked on breaking out of the kvm PLE handler upon
  seeing source runqueue length = 1, but it had to export the rq length.
  Peter came up with the elegant idea of returning -ESRCH in the scheduler core.)

Signed-off-by: Peter Zijlstra pet...@infradead.org
Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---

  kernel/sched/core.c |   25 +++--
  1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
   * It's the caller's job to ensure that the target task struct
   * can't go away on us before we can do any checks.
   *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *	true (>0) if we indeed boosted the target task.
+ *	false (0) if we failed to boost the target.
+ *	-ESRCH if there's no task to yield to.
   */
  bool __sched yield_to(struct task_struct *p, bool preempt)
  {
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)

 again:
 	p_rq = task_rq(p);
+	/*
+	 * If we're the only runnable task on the rq and target rq also
+	 * has only one task, there's absolutely no point in yielding.
+	 */
+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
 	}

 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;

 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;

 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;

 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4329,11 +4341,12 @@ again:
 		resched_task(p_rq->curr);
 	}

-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);

-	if (yielded)
+	if (yielded > 0)
 		schedule();

 	return yielded;



Acked-by: Andrew Jones drjo...@redhat.com



Thank you Drew.

Marcelo, Gleb: please let me know if you have comments / concerns on the
patches.

Andrew, Vinod: IMO, the patch set looks good for undercommit scenarios,
especially for large guests where we do have the overhead of vcpu iteration
in the PLE handler.



Re: Re: Re: Re: Re: Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-27 Thread Yoshihiro YUNOMAE

Hi Marcelo,

(2012/11/27 8:16), Marcelo Tosatti wrote:

On Mon, Nov 26, 2012 at 08:05:10PM +0900, Yoshihiro YUNOMAE wrote:

500h. event tsc_write tsc_offset=-3000

Then a guest trace containing events with a TSC timestamp.
Which tsc_offset to use?

(that is the problem, which unless i am mistaken can only be solved
easily if the guest can convert RDTSC - TSC of host).


There are the following three cases of changing the TSC offset:
  1. Resetting the TSC at guest boot time
  2. Adjusting the TSC offset due to some host-side problem
  3. Writing the TSC on the guest
The scenario you mentioned is case 3, so we'll discuss this case.
Here, we assume that the guest is allocated a single CPU for the sake of
simplicity.

If a guest executes write_tsc, the TSC value jumps forward or backward.
For the forward case, the trace data are as follows:

host                        guest
cycles  events              cycles  events
3000    tsc_offset=-2950
3001    kvm_enter
                            53      eventX
                            ...
                            100     (write_tsc=+900)
3060    kvm_exit
3075    tsc_offset=-2050
3080    kvm_enter
                            1050    event1
                            1055    event2
                            ...

This case is simple. The guest TSC of the first kvm_enter is calculated
as follows:

   (host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51

Similarly, the guest TSC of the second kvm_enter is 130. So the guest
events between 51 and 130, that is, "53 eventX", are inserted between the
first pair of kvm_enter and kvm_exit. To insert guest events between 51
and 130, we convert the guest TSC to the host TSC using the TSC
offset 2950.
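
(The conversion used above, written out as a pair of tiny helpers; types are
simplified and the helper names are just for illustration:)

    #include <linux/types.h>

    /* guest_tsc = host_tsc + tsc_offset, so with tsc_offset = -2950 the
     * kvm_enter at host cycle 3001 maps to guest cycle 51, and a guest
     * timestamp maps back to the host timeline by subtracting the offset
     * that was current at that point in the trace. */
    static inline u64 host_to_guest_tsc(u64 host_tsc, s64 tsc_offset)
    {
            return host_tsc + tsc_offset;
    }

    static inline u64 guest_to_host_tsc(u64 guest_tsc, s64 tsc_offset)
    {
            return guest_tsc - tsc_offset;
    }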

For the backward case, the trace data are as follows:

host                        guest
cycles  events              cycles  events
3000    tsc_offset=-2950
3001    kvm_enter
                            53      eventX
                            ...
                            100     (write_tsc=-50)
3060    kvm_exit
3075    tsc_offset=-3000
3080    kvm_enter
                            90      event1
                            95      event2
                            ...

3400                        100     (write_tsc=-50)
                            90      event3
                            95      event4


As you say, in this case the previous method is invalid. When we
calculate the guest TSC value for the tsc_offset=-3000 event, the value
is 75 on the guest. This looks like it precedes the write_tsc=-50 event.
So we need to consider this further.

In this case, it is important that we can tell where the guest
executed write_tsc or the host rewrote the TSC offset. write_tsc on
the guest is wrmsr 0x0010, so this instruction induces a vm_exit.
This implies that the guest does not run while the host changes the TSC
offset on that cpu. In other words, the guest cannot use the new TSC before
the host writes the new TSC offset. So, if the timestamps in the guest trace
are not monotonically increasing, we know the guest executed
write_tsc. Moreover, from the region where the timestamps decrease, we
can tell where the host rewrote the TSC offset in the guest trace
data. Therefore, we can sort the trace data in chronological order.
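
(A sketch of that sorting rule; the event structure and function name are
illustrative only:)

    #include <linux/types.h>

    struct guest_event {
            u64 tsc;                /* guest timestamp */
            /* ... payload ... */
    };

    /* Return the index of the first event whose timestamp goes backwards,
     * i.e. the first event recorded after the guest's write_tsc (and the
     * host's matching write_tsc_offset); return n if the range is monotonic. */
    static size_t find_tsc_rewrite(const struct guest_event *ev, size_t n)
    {
            size_t i;

            for (i = 1; i < n; i++)
                    if (ev[i].tsc < ev[i - 1].tsc)
                            return i;
            return n;
    }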


This requires an entire trace of events. That is, to be able
to reconstruct timeline you require the entire trace from the moment
guest starts. So that you can correlate wrmsr-to-tsc on the guest with
vmexit-due-to-tsc-write on the host.

Which means that running out of space for trace buffer equals losing
ability to order events.

Is that desirable? It seems cumbersome to me.


As you say, tracing events can overwrite important events like
kvm_exit/entry or write_tsc_offset. So Steven's multiple-buffer work is
needed by this feature: normal events, which hit often, are recorded in
buffer A, and important events, which hit rarely, are recorded in buffer B.
In our case, the important event is write_tsc_offset.

Also the need to correlate each write_tsc event in the guest trace
with a corresponding tsc_offset write in the host trace means that it
is _necessary_ for the guest and host to enable tracing simultaneously.
Correct?

Also, there are WRMSR executions in the guest for which there is
no event in the trace buffer. From SeaBIOS, during boot.
In that case, there is no explicit event in the guest trace which you
can correlate with tsc_offset changes in the host side.


I understand what you want to say, but we do not correlate the
write_tsc events and the write_tsc_offset events directly. This is because
a write_tsc tracepoint (for the WRMSR instruction) is not provided in
the current kernel. So, in the previous mail
(https://lkml.org/lkml/2012/11/22/53), I suggested a method for which we
don't need to prepare a write_tsc tracepoint.

In that method, we enable ftrace before the guest boots, and we need to
keep all write_tsc_offset events in the buffer. If we forgot 

RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-27 Thread Zhang, Yang Z
Gleb Natapov wrote on 2012-11-27:
 On Tue, Nov 27, 2012 at 03:38:05AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2012-11-26:
 On Mon, Nov 26, 2012 at 12:29:54PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2012-11-26:
 On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2012-11-25:
 On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
 Posted Interrupt allows vAPICV interrupts to inject into guest directly
 without any vmexit.
 
 - When delivering a interrupt to guest, if target vcpu is running,
   update Posted-interrupt requests bitmap and send a notification
   event to the vcpu. Then the vcpu will handle this interrupt
   automatically, without any software involvemnt.
 Looks like you allocating one irq vector per vcpu per pcpu and then
 migrate it or reallocate when vcpu move from one pcpu to another.
 This is not scalable and migrating irq migration slows things down.
 What's wrong with allocating one global vector for posted interrupt
 during vmx initialization and use it for all vcpus?
 
 Consider the following situation:
 If vcpu A is running when notification event which belong to vcpu B is
 arrived,
 since the vector match the vcpu A's notification vector, then this event
 will be consumed by vcpu A(even it do nothing) and the interrupt cannot
 be handled in time. The exact same situation is possible with your code.
 vcpu B can be migrated from pcpu and vcpu A will take its place and will
 be assigned the same vector as vcpu B. But I fail to see why is this a
 No, the on bit will be set to prevent notification event when vcpu B start
 migration. And it only free the vector before it going to run in another 
 pcpu.
 There is a race. Sender check on bit, vcpu B migrate to another pcpu and
 starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
 A gets it.
 Yes, it do exist. But I think it should be ok even this happens.
 
 Then it is OK to use global PI vector. The race should be dealt with
 anyway.
Or using a lock can deal with it too.
 
 
 problem. vcpu A will ignore PI since pir will be empty and vcpu B should
 detect new event during next vmentry.
 Yes, but the next vmentry may happen long time later and interrupt cannot
 be
 serviced until next vmentry. In current way, it will cause vmexit and
 re-schedule the vcpu B. Vmentry will happen when scheduler will decide
 that vcpu can run. There
 I don't know how scheduler can know the vcpu can run in this case, can
 you elaborate it? I thought it may have problems with global vector in
 some cases(maybe I am wrong, since I am not familiar with KVM
 scheduler): If target VCPU is in idle, and this notification event is
 consumed by other VCPU,
 then how can scheduler know the vcpu is ready to run? Even there is a way for
 scheduler to know, then when? Isn't it too late?
 If notification event arrived in hypervisor, then how the handler know which
 VCPU the notification event belong to?
 When vcpu is idle its thread sleeps inside host kernel (see
 virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
 you should call kvm_vcpu_kick(), but only after changing vcpu
 state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
 checks if vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
 there which checks apic IRR, but not PIR, so it is not enough to set
 bit in PIR and call kvm_vcpu_kick() to wake up vcpu.
Sorry, I cannot understand it. As you said, we need to call kvm_vcpu_kick when 
the awaited event happens in order to wake up the blocked vcpu, but this event is 
consumed by another vcpu without any chance for us to kick it. Then how will it 
move out of the blocked list onto the run queue?
BTW, what I am talking about is the interrupt-from-VT-d case. For virtual 
interrupts, I think a global vector is ok.
Also, the second problem is about the VT-d case as well.
When the cpu is running in VMX root mode and a notification event arrives, 
since all VCPUs use the same notification vector, we cannot distinguish which 
VCPU the notification event is meant for. And we cannot put the right 
VCPU on the run queue.
 
 
 is no problem here. What you probably want to say is that vcpu may not be
 aware of interrupt happening since it was migrated to different vcpu
 just after PI IPI was sent and thus missed it. But than PIR interrupts
 should be processed during vmentry on another pcpu:
 
 Sender: Guest:
 
 set pir
 set on
 if (vcpu in guest mode on pcpu1)
vmexit on pcpu1
vmentry on pcpu2
process pir, deliver interrupt
 send PI IPI to pcpu1
 
 
 
 
 +  if (!cfg) {
 +  free_irq_at(irq, NULL);
 +  return 0;
 +  }
 +
 +  raw_spin_lock_irqsave(vector_lock, flags);
 +  if (!__assign_irq_vector(irq, cfg, mask))
 +  ret = irq;
 +  raw_spin_unlock_irqrestore(vector_lock, flags);
 +
 +  if (ret) {

RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-27 Thread Veruca Salt




 From: yang.z.zh...@intel.com
 To: g...@redhat.com
 CC: kvm@vger.kernel.org; mtosa...@redhat.com; haitao.s...@intel.com; 
 xiantao.zh...@intel.com
 Subject: RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
 Date: Tue, 27 Nov 2012 11:10:20 +

 Gleb Natapov wrote on 2012-11-27:
  On Tue, Nov 27, 2012 at 03:38:05AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2012-11-26:
  On Mon, Nov 26, 2012 at 12:29:54PM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2012-11-26:
  On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2012-11-25:
  On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
  Posted Interrupt allows vAPICV interrupts to inject into guest 
  directly
  without any vmexit.
 
  - When delivering a interrupt to guest, if target vcpu is running,
  update Posted-interrupt requests bitmap and send a notification
  event to the vcpu. Then the vcpu will handle this interrupt
  automatically, without any software involvemnt.
  Looks like you allocating one irq vector per vcpu per pcpu and then
  migrate it or reallocate when vcpu move from one pcpu to another.
  This is not scalable and migrating irq migration slows things down.
  What's wrong with allocating one global vector for posted interrupt
  during vmx initialization and use it for all vcpus?
 
  Consider the following situation:
  If vcpu A is running when notification event which belong to vcpu B is
  arrived,
  since the vector match the vcpu A's notification vector, then this event
  will be consumed by vcpu A(even it do nothing) and the interrupt cannot
  be handled in time. The exact same situation is possible with your code.
  vcpu B can be migrated from pcpu and vcpu A will take its place and will
  be assigned the same vector as vcpu B. But I fail to see why is this a
  No, the on bit will be set to prevent notification event when vcpu B 
  start
  migration. And it only free the vector before it going to run in another 
  pcpu.
  There is a race. Sender check on bit, vcpu B migrate to another pcpu and
  starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
  A gets it.
  Yes, it do exist. But I think it should be ok even this happens.
 
  Then it is OK to use global PI vector. The race should be dealt with
  anyway.
 Or using lock can deal with it too.

 
  problem. vcpu A will ignore PI since pir will be empty and vcpu B should
  detect new event during next vmentry.
  Yes, but the next vmentry may happen long time later and interrupt cannot
  be
  serviced until next vmentry. In current way, it will cause vmexit and
  re-schedule the vcpu B. Vmentry will happen when scheduler will decide
  that vcpu can run. There
  I don't know how scheduler can know the vcpu can run in this case, can
  you elaborate it? I thought it may have problems with global vector in
  some cases(maybe I am wrong, since I am not familiar with KVM
  scheduler): If target VCPU is in idle, and this notification event is
  consumed by other VCPU,
  then how can scheduler know the vcpu is ready to run? Even there is a way 
  for
  scheduler to know, then when? Isn't it too late?
  If notification event arrived in hypervisor, then how the handler know 
  which
  VCPU the notification event belong to?
  When vcpu is idle its thread sleeps inside host kernel (see
  virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
  you should call kvm_vcpu_kick(), but only after changing vcpu
  state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
  checks if vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
  there which checks apic IRR, but not PIR, so it is not enough to set
  bit in PIR and call kvm_vcpu_kick() to wake up vcpu.
 Sorry, I cannot understand it. As you said, we need to call kvm_vcpu_kick 
 when the waiting event happened to wake up the blocked vcpu, but this event 
 is consumed by other vcpu without any chance for us to kick it. Then how it 
 will move out from blocked list to run queue.
 BTW, what I am talking is for the interrupt from VT-d case. For virtual 
 interrupt, I think global vector is ok.
 Also, the second problem is also about the VT-d case.
 When cpu is running in VM root mode, and then an notification event arrives, 
 since all VCPU use the same notification vector, we cannot distinguish which 
 VCPU the notification event want to deliver to. And we cannot put the right 
 VCPU to run queue.

If 'kick' is non-destructive of VCPU state, why not cycle a 'kick' to all the 
VCPUs so that you are guaranteed to get the one you want?
(Tell me I'm so, so wrong. :))

 
  is no problem here. What you probably want to say is that vcpu may not be
  aware of interrupt happening since it was migrated to different vcpu
  just after PI IPI was sent and thus missed it. But than PIR interrupts
  should be processed during vmentry on another pcpu:
 
  Sender: Guest:
 
  set pir
  set on
  if (vcpu in guest 

Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-27 Thread Gleb Natapov
On Tue, Nov 27, 2012 at 11:10:20AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2012-11-27:
  On Tue, Nov 27, 2012 at 03:38:05AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2012-11-26:
  On Mon, Nov 26, 2012 at 12:29:54PM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2012-11-26:
  On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2012-11-25:
  On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
  Posted Interrupt allows vAPICV interrupts to inject into guest 
  directly
  without any vmexit.
  
  - When delivering a interrupt to guest, if target vcpu is running,
update Posted-interrupt requests bitmap and send a notification
event to the vcpu. Then the vcpu will handle this interrupt
automatically, without any software involvemnt.
  Looks like you allocating one irq vector per vcpu per pcpu and then
  migrate it or reallocate when vcpu move from one pcpu to another.
  This is not scalable and migrating irq migration slows things down.
  What's wrong with allocating one global vector for posted interrupt
  during vmx initialization and use it for all vcpus?
  
  Consider the following situation:
  If vcpu A is running when notification event which belong to vcpu B is
  arrived,
  since the vector match the vcpu A's notification vector, then this event
  will be consumed by vcpu A(even it do nothing) and the interrupt cannot
  be handled in time. The exact same situation is possible with your code.
  vcpu B can be migrated from pcpu and vcpu A will take its place and will
  be assigned the same vector as vcpu B. But I fail to see why is this a
  No, the on bit will be set to prevent notification event when vcpu B 
  start
  migration. And it only free the vector before it going to run in another 
  pcpu.
  There is a race. Sender check on bit, vcpu B migrate to another pcpu and
  starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
  A gets it.
  Yes, it do exist. But I think it should be ok even this happens.
  
  Then it is OK to use global PI vector. The race should be dealt with
  anyway.
 Or using lock can deal with it too.
  
The last thing we want is to hold a lock while injecting an interrupt. Also,
VT-d HW will not be able to use the lock.

  
  problem. vcpu A will ignore PI since pir will be empty and vcpu B should
  detect new event during next vmentry.
  Yes, but the next vmentry may happen long time later and interrupt cannot
  be
  serviced until next vmentry. In current way, it will cause vmexit and
  re-schedule the vcpu B. Vmentry will happen when scheduler will decide
  that vcpu can run. There
  I don't know how scheduler can know the vcpu can run in this case, can
  you elaborate it? I thought it may have problems with global vector in
  some cases(maybe I am wrong, since I am not familiar with KVM
  scheduler): If target VCPU is in idle, and this notification event is
  consumed by other VCPU,
  then how can scheduler know the vcpu is ready to run? Even there is a way 
  for
  scheduler to know, then when? Isn't it too late?
  If notification event arrived in hypervisor, then how the handler know 
  which
  VCPU the notification event belong to?
  When vcpu is idle its thread sleeps inside host kernel (see
  virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
  you should call kvm_vcpu_kick(), but only after changing vcpu
  state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
  checks if vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
  there which checks apic IRR, but not PIR, so it is not enough to set
  bit in PIR and call kvm_vcpu_kick() to wake up vcpu.
 Sorry, I cannot understand it. As you said, we need to call kvm_vcpu_kick 
 when the waiting event happened to wake up the blocked vcpu, but this event 
 is consumed by other vcpu without any chance for us to kick it. Then how it 
 will move out from blocked list to run queue.
 BTW, what I am talking is for the interrupt from VT-d case. For virtual 
 interrupt, I think global vector is ok.
 Also, the second problem is also about the VT-d case.
 When cpu is running in VM root mode, and then an notification event arrives, 
 since all VCPU use the same notification vector, we cannot distinguish which 
 VCPU the notification event want to deliver to. And we cannot put the right 
 VCPU to run queue.
  
There is no VT-d code in the proposed patches (is there a spec available about
how VT-d integrates with PI?), so the discussion is purely theoretical. The VT-d
device will have to be reprogrammed to generate a regular interrupt when the
vcpu thread goes to sleep. This interrupt will be injected by VFIO via
irqfd just like assigned devices do now. When the vcpu becomes runnable, the
VT-d device is reprogrammed back to generate PI.
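
(A purely hypothetical sketch of that switch-over; none of these types or
helpers exist in the posted patches:)

    struct irte;            /* interrupt remapping entry (VT-d) */
    struct pi_desc;         /* posted-interrupt descriptor      */

    struct vtd_route {      /* hypothetical per-interrupt bookkeeping */
            struct irte *irte;
            struct pi_desc *pi_desc;
    };

    static void irte_set_remapped_mode(struct irte *irte) { /* hypothetical */ }
    static void irte_set_posted_mode(struct irte *irte, struct pi_desc *pd) { /* hypothetical */ }

    static void vtd_vcpu_blocking(struct vtd_route *r)
    {
            /* deliver as a normal remapped interrupt; VFIO injects it via irqfd */
            irte_set_remapped_mode(r->irte);
    }

    static void vtd_vcpu_runnable(struct vtd_route *r)
    {
            /* switch back to direct posted delivery */
            irte_set_posted_mode(r->irte, r->pi_desc);
    }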

  
  is no problem here. What you probably want to say is that vcpu may not be
  aware of interrupt happening since it was migrated to different vcpu
  just after PI IPI was sent and thus missed it. But than PIR 

Re: [PATCH v9 1/2] x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary

2012-11-27 Thread Gleb Natapov
Eric, can you ACK it?

On Tue, Nov 27, 2012 at 11:26:02AM +0800, Zhang Yanfei wrote:
 This patch provides a way to VMCLEAR the VMCSs related to guests
 on all cpus before executing VMXOFF when doing kdump. This
 is used to ensure that the VMCSs in the vmcore are up to date and
 not corrupted.
 
 Signed-off-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
 ---
  arch/x86/include/asm/kexec.h |2 ++
  arch/x86/kernel/crash.c  |   25 +
  2 files changed, 27 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
 index 317ff17..28feeba 100644
 --- a/arch/x86/include/asm/kexec.h
 +++ b/arch/x86/include/asm/kexec.h
 @@ -163,6 +163,8 @@ struct kimage_arch {
  };
  #endif
  
 +extern void (*crash_vmclear_loaded_vmcss)(void);
 +
  #endif /* __ASSEMBLY__ */
  
  #endif /* _ASM_X86_KEXEC_H */
 diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
 index 13ad899..4a2a12f 100644
 --- a/arch/x86/kernel/crash.c
 +++ b/arch/x86/kernel/crash.c
  @@ -16,6 +16,7 @@
   #include <linux/delay.h>
   #include <linux/elf.h>
   #include <linux/elfcore.h>
  +#include <linux/module.h>
   
   #include <asm/processor.h>
   #include <asm/hardirq.h>
  @@ -29,6 +30,20 @@
   #include <asm/virtext.h>
  
  int in_crash_kexec;
 + 
 +/*
 + * This is used to VMCLEAR all VMCSs loaded on the
 + * processor. And when loading kvm_intel module, the
 + * callback function pointer will be assigned.
 + */
 +void (*crash_vmclear_loaded_vmcss)(void) = NULL;
 +EXPORT_SYMBOL_GPL(crash_vmclear_loaded_vmcss);
 +
 +static inline void cpu_emergency_vmclear_loaded_vmcss(void)
 +{
 + if (crash_vmclear_loaded_vmcss)
 + crash_vmclear_loaded_vmcss();
 +}
  
   #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
   
  @@ -46,6 +61,11 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
  #endif
   crash_save_cpu(regs, cpu);
  
 + /*
 +  * VMCLEAR VMCSs loaded on all cpus if needed.
 +  */
 + cpu_emergency_vmclear_loaded_vmcss();
 +
   /* Disable VMX or SVM if needed.
*
* We need to disable virtualization on all CPUs.
 @@ -88,6 +108,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
  
   kdump_nmi_shootdown_cpus();
  
 + /*
 +  * VMCLEAR VMCSs loaded on this cpu if needed.
 +  */
 + cpu_emergency_vmclear_loaded_vmcss();
 +
   /* Booting kdump kernel with VMX or SVM enabled won't work,
* because (among other limitations) we can't disable paging
* with the virt flags.
 -- 
 1.7.1
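
(For reference, a sketch of how the kvm-intel side might hook this callback;
patch 2/2 of the series is not included in this digest, so everything below
except crash_vmclear_loaded_vmcss itself is illustrative:)

    #include <linux/module.h>
    #include <asm/kexec.h>

    /* Illustrative VMX-side handler: VMCLEAR every VMCS loaded on this cpu. */
    static void example_vmclear_local_loaded_vmcss(void)
    {
            /* walk the per-cpu loaded-VMCS list and VMCLEAR each entry */
    }

    static int __init example_init(void)
    {
            crash_vmclear_loaded_vmcss = example_vmclear_local_loaded_vmcss;
            return 0;
    }

    static void __exit example_exit(void)
    {
            /* unhook before the module text goes away */
            crash_vmclear_loaded_vmcss = NULL;
    }

    module_init(example_init);
    module_exit(example_exit);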

--
Gleb.


Re: Performance issue

2012-11-27 Thread Gleb Natapov
On Mon, Nov 26, 2012 at 09:31:19PM +0200, George-Cristian Bîrzan wrote:
 On Sun, Nov 25, 2012 at 6:17 PM, George-Cristian Bîrzan g...@birzan.org 
 wrote:
  On Sun, Nov 25, 2012 at 5:19 PM, Gleb Natapov g...@redhat.com wrote:
  What Windows is this? Can you try changing -cpu host to -cpu
  host,+hv_relaxed?
 
  This is on Windows Server 2008 R2 (sorry, forgot to mention that I
  guess), and I can try it tomorrow (US time), as getting a stream my
  way depends on complicated stuff. I will though, and let you know how
  it goes.
 
 I changed that, no difference.
 
 
Heh, I forgot that the part that should make a difference is not yet
upstream :(

--
Gleb.


Re: Performance issue

2012-11-27 Thread George-Cristian Bîrzan
On Tue, Nov 27, 2012 at 2:20 PM, Gleb Natapov g...@redhat.com wrote:
 On Mon, Nov 26, 2012 at 09:31:19PM +0200, George-Cristian Bîrzan wrote:
 On Sun, Nov 25, 2012 at 6:17 PM, George-Cristian Bîrzan g...@birzan.org 
 wrote:
  On Sun, Nov 25, 2012 at 5:19 PM, Gleb Natapov g...@redhat.com wrote:
  What Windows is this? Can you try changing -cpu host to -cpu
  host,+hv_relaxed?
 
  This is on Windows Server 2008 R2 (sorry, forgot to mention that I
  guess), and I can try it tomorrow (US time), as getting a stream my
  way depends on complicated stuff. I will though, and let you know how
  it goes.

 I changed that, no difference.


 Heh, I forgot that the part that should make difference is not yet
 upstream :(

We can try recompiling kvm/qemu with some patches, if that'd help. At
this point, anything is on the table except changing Windows and the
hardware :-)

Also, it might be that the software doing the actual work is not well
written, but even so...

--
George-Cristian Bîrzan


Re: [PATCH v3] kvm tools: add support for ARMv7 processors

2012-11-27 Thread Will Deacon
On Sat, Nov 24, 2012 at 09:12:59AM +, Pekka Enberg wrote:
 On Fri, Nov 23, 2012 at 4:50 PM, Marc Zyngier marc.zyng...@arm.com wrote:
  On 23/11/12 14:40, Will Deacon wrote:
  This patch adds initial support for ARMv7 processors (more specifically,
  Cortex-A15) to kvmtool.
 
  Everything is driven by FDT, including dynamic generation of virtio nodes
  for MMIO devices (PCI is not used due to lack of a suitable host-bridge).
 
  The virtual timers and virtual interrupt controller (VGIC) are provided
  by the kernel and require very little in terms of userspace code.
 
  Signed-off-by: Will Deacon will.dea...@arm.com
 
  FWIW:
 
  Tested-by: Marc Zyngier marc.zyng...@arm.com
 
 Applied, thanks guys!

Cheers Pekka! I got my ppc64 machine up and running at the weekend and it
looks like I broke the build with my device registration stuff, so I'll post
a fix in a second...

Will


[PATCH] kvm tools: powerpc: convert spapr pci to new device iterators

2012-11-27 Thread Will Deacon
Commit 8d35d32d0148 (kvm tools: add generic device registration
mechanism) introduced a tree-based device lookup-by-bus mechanism as
well as iterators to enumerate the devices on a particular bus.

Whilst both x86 and ppc were converted by the original patch, the spapr
pci changes were incomplete, so include the required changes here.

Compile-tested only on ppc64 970mp. Note that I had to hack the Makefile
in order to build guest_init.o with a toolchain defaulting to ppc:

 $(GUEST_INIT): guest/init.c
$(E)   LINK $@
-   $(Q) $(CC) -static guest/init.c -o $@
-   $(Q) $(LD) -r -b binary -o guest/guest_init.o $(GUEST_INIT)
+   $(Q) $(CC) -m64 -static guest/init.c -o $@
+   $(Q) $(LD) -m elf64ppc -r -b binary -o guest/guest_init.o $(GUEST_INIT)

 $(DEPS):

Cc: Matt Evans m...@ozlabs.org
Signed-off-by: Will Deacon will.dea...@arm.com
---
 tools/kvm/powerpc/spapr_pci.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/kvm/powerpc/spapr_pci.c b/tools/kvm/powerpc/spapr_pci.c
index 5f4016c..ed4b9ab 100644
--- a/tools/kvm/powerpc/spapr_pci.c
+++ b/tools/kvm/powerpc/spapr_pci.c
@@ -15,6 +15,7 @@
 
 #include "spapr.h"
 #include "spapr_pci.h"
+#include "kvm/devices.h"
 #include "kvm/fdt.h"
 #include "kvm/util.h"
 #include "kvm/pci.h"
@@ -248,6 +249,7 @@ int spapr_populate_pci_devices(struct kvm *kvm,
   void *fdt)
 {
int bus_off, node_off = 0, devid, fn, i, n, devices;
+   struct device_header *dev_hdr;
char nodename[256];
struct {
uint32_t hi;
@@ -301,14 +303,15 @@ int spapr_populate_pci_devices(struct kvm *kvm,
 
/* Populate PCI devices and allocate IRQs */
devices = 0;
-
-   for (devid = 0; devid < KVM_MAX_DEVICES; devid++) {
+   dev_hdr = device__first_dev(DEVICE_BUS_PCI);
+   while (dev_hdr) {
uint32_t *irqmap = interrupt_map[devices];
-   struct pci_device_header *hdr = pci__find_dev(devid);
+   struct pci_device_header *hdr = dev_hdr-data;
 
if (!hdr)
continue;
 
+   devid = dev_hdr-dev_num;
fn = 0; /* kvmtool doesn't yet do multifunction devices */
 
sprintf(nodename, pci@%u,%u, devid, fn);
@@ -413,6 +416,7 @@ int spapr_populate_pci_devices(struct kvm *kvm,
/* We don't set ibm,dma-window property as we don't have an 
IOMMU. */
 
++devices;
+   dev_hdr = device__next_dev(dev_hdr);
}
 
/* Write interrupt map */
-- 
1.8.0



Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case

2012-11-27 Thread Andrew Jones
On Tue, Nov 27, 2012 at 03:57:25PM +0530, Raghavendra K T wrote:
 On 11/26/2012 07:13 PM, Andrew Jones wrote:
 On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
 From: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 
 yield_to returns -ESRCH, When source and target of yield_to
 run queue length is one. When we see three successive failures of
 yield_to we assume we are in potential undercommit case and abort
 from PLE handler.
 The assumption is backed by low probability of wrong decision
 for even worst case scenarios such as average runqueue length
 between 1 and 2.
 
 note that we do not update last boosted vcpu in failure cases.
 Thank Avi for raising question on aborting after first fail from
 yield_to.
 
 Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
   virt/kvm/kvm_main.c |   26 --
   1 file changed, 16 insertions(+), 10 deletions(-)
 
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index be70035..053f494 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
   {
 struct pid *pid;
 struct task_struct *task = NULL;
 +   bool ret = false;
 
 rcu_read_lock();
 pid = rcu_dereference(target-pid);
 @@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 task = get_pid_task(target-pid, PIDTYPE_PID);
 rcu_read_unlock();
 if (!task)
 -   return false;
 +   return ret;
 if (task->flags & PF_VCPU) {
 put_task_struct(task);
 -   return false;
 -   }
 -   if (yield_to(task, 1)) {
 -   put_task_struct(task);
 -   return true;
 +   return ret;
 }
 +   ret = yield_to(task, 1);
 put_task_struct(task);
 -   return false;
 +
 +   return ret;
   }
   EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
 
 @@ -1697,12 +1696,14 @@ bool
 kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 return eligible;
   }
   #endif
 +
   void kvm_vcpu_on_spin(struct kvm_vcpu *me)
   {
 struct kvm *kvm = me-kvm;
 struct kvm_vcpu *vcpu;
 int last_boosted_vcpu = me-kvm-last_boosted_vcpu;
 int yielded = 0;
 +   int try = 3;
 int pass;
 int i;
 
 @@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
  * VCPU is holding the lock that we need and will release it.
  * We approximate round-robin by starting at the last boosted VCPU.
  */
 -   for (pass = 0; pass < 2 && !yielded; pass++) {
 +   for (pass = 0; pass < 2 && !yielded && try; pass++) {
 kvm_for_each_vcpu(i, vcpu, kvm) {
 if (!pass  i = last_boosted_vcpu) {
 i = last_boosted_vcpu;
 @@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 continue;
 if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 continue;
 -   if (kvm_vcpu_yield_to(vcpu)) {
 +
 +   yielded = kvm_vcpu_yield_to(vcpu);
 +   if (yielded > 0) {
 kvm-last_boosted_vcpu = i;
 -   yielded = 1;
 break;
 +   } else if (yielded < 0) {
 +   try--;
 +   if (!try)
 +   break;
 }
 }
 }
 
 
 Drew, Thanks for reviewing this.
 
 The check done in patch 1/2 is done before the double_rq_lock, so it's
 cheap. Now, this patch is to avoid doing too many get_pid_task calls.I
 wonder if it would make more sense to change the vcpu state from tracking
 the pid to tracking the task. If that was done, then I don't believe this
 patch is necessary.
 
 We would need logic not to break upon the first failure of yield_to
 (which happens otherwise with patch 1 alone). Breaking out of the PLE
 handler upon the first failure resulted in degradation in moderate
 overcommits due to false exits, even when we have more than one task in
 other cpu run queues.
 
 But your suggestion triggered an idea for me: what would be the cost of
 iterating over all vcpus despite a yield_to failure?
 
 (Where we break out of the PLE handler only if we have a successful yield,
 i.e. yielded > 0) with something like this:
 
 -   for (pass = 0; pass < 2 && !yielded; pass++) {
 +   for (pass = 0; pass < 2 && yielded <= 0; pass++) {
 kvm_for_each_vcpu(i, vcpu, kvm) {
 if (!pass  i = last_boosted_vcpu) {
 i = last_boosted_vcpu;
 @@ -1727,11 +1727,12 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 continue;
 if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 continue;
 -   if (kvm_vcpu_yield_to(vcpu)) {
 +
 +   yielded = 
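
(The snippet above is truncated in the archive. Purely as an illustrative
sketch, not the posted patch, the loop being described, which keeps scanning
vcpus and leaves the PLE handler only after a successful boost, could look
roughly like this; names follow the existing kvm_vcpu_on_spin() code:)

	for (pass = 0; pass < 2 && yielded <= 0; pass++) {
		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (!pass && i <= last_boosted_vcpu) {
				i = last_boosted_vcpu;
				continue;
			}
			if (vcpu == me)
				continue;
			if (waitqueue_active(&vcpu->wq))
				continue;
			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
				continue;

			/* a failed yield (yielded <= 0) no longer ends the search */
			yielded = kvm_vcpu_yield_to(vcpu);
			if (yielded > 0) {
				kvm->last_boosted_vcpu = i;
				break;
			}
		}
	}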

Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task

2012-11-27 Thread Andrew Theurer
On Tue, 2012-11-27 at 16:00 +0530, Raghavendra K T wrote:
 On 11/26/2012 07:05 PM, Andrew Jones wrote:
  On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
  From: Peter Zijlstra pet...@infradead.org
 
  In case of undercommitted scenarios, especially in large guests
  yield_to overhead is significantly high. when run queue length of
  source and target is one, take an opportunity to bail out and return
  -ESRCH. This return condition can be further exploited to quickly come
  out of PLE handler.
 
  (History: Raghavendra initially worked on break out of kvm ple handler upon
seeing source runqueue length = 1, but it had to export rq length).
Peter came up with the elegant idea of return -ESRCH in scheduler core.
 
  Signed-off-by: Peter Zijlstra pet...@infradead.org
  Raghavendra, Checking the rq length of target vcpu condition added.(thanks 
  Avi)
  Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
  Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
  ---
 
kernel/sched/core.c |   25 +++--
1 file changed, 19 insertions(+), 6 deletions(-)
 
  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
  index 2d8927f..fc219a5 100644
  --- a/kernel/sched/core.c
  +++ b/kernel/sched/core.c
  @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
 * It's the caller's job to ensure that the target task struct
 * can't go away on us before we can do any checks.
 *
  - * Returns true if we indeed boosted the target task.
  + * Returns:
  + *true (0) if we indeed boosted the target task.
  + *false (0) if we failed to boost the target.
  + *-ESRCH if there's no task to yield to.
 */
bool __sched yield_to(struct task_struct *p, bool preempt)
{
  @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool 
  preempt)
 
again:
 p_rq = task_rq(p);
  +  /*
  +   * If we're the only runnable task on the rq and target rq also
  +   * has only one task, there's absolutely no point in yielding.
  +   */
  +  if (rq->nr_running == 1 && p_rq->nr_running == 1) {
  +  yielded = -ESRCH;
  +  goto out_irq;
  +  }
  +
 double_rq_lock(rq, p_rq);
 while (task_rq(p) != p_rq) {
 double_rq_unlock(rq, p_rq);
  @@ -4310,13 +4322,13 @@ again:
 }
 
 if (!curr-sched_class-yield_to_task)
  -  goto out;
  +  goto out_unlock;
 
 if (curr-sched_class != p-sched_class)
  -  goto out;
  +  goto out_unlock;
 
 if (task_running(p_rq, p) || p-state)
  -  goto out;
  +  goto out_unlock;
 
 yielded = curr-sched_class-yield_to_task(rq, p, preempt);
 if (yielded) {
  @@ -4329,11 +4341,12 @@ again:
 resched_task(p_rq-curr);
 }
 
  -out:
  +out_unlock:
 double_rq_unlock(rq, p_rq);
  +out_irq:
 local_irq_restore(flags);
 
  -  if (yielded)
  +  if (yielded > 0)
 schedule();
 
 return yielded;
 
 
  Acked-by: Andrew Jones drjo...@redhat.com
 
 
 Thank you Drew.
 
 Marcelo Gleb.. Please let me know if you have comments / concerns on the 
 patches..
 
 Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios
 especially for large guests where we do have overhead of vcpu iteration
 of ple handler..

I agree, looks fine for undercommit scenarios.  I do wonder what happens
with 1.5x overcommit, where we might see 1/2 the host cpus with runqueue
of 2 and 1/2 of the host cpus with a runqueue of 1.  Even with this
change that scenario still might be fine, but it would be nice to see a
comparison.

-Andrew




Re: KVM call agenda for 2012-11-27

2012-11-27 Thread Juan Quintela
Juan Quintela quint...@redhat.com wrote:
 Hi

 Please send in any agenda topics you are interested in.

As there are no topic's call is cancelled.

Have a nice day, Juan.


Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task

2012-11-27 Thread Chegu Vinod

On 11/27/2012 2:30 AM, Raghavendra K T wrote:

On 11/26/2012 07:05 PM, Andrew Jones wrote:

On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:

From: Peter Zijlstra pet...@infradead.org

In case of undercommitted scenarios, especially in large guests
yield_to overhead is significantly high. when run queue length of
source and target is one, take an opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of PLE handler.

(History: Raghavendra initially worked on break out of kvm ple 
handler upon

  seeing source runqueue length = 1, but it had to export rq length).
  Peter came up with the elegant idea of return -ESRCH in scheduler 
core.


Signed-off-by: Peter Zijlstra pet...@infradead.org
Raghavendra, Checking the rq length of target vcpu condition 
added.(thanks Avi)

Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---

  kernel/sched/core.c |   25 +++--
  1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
   * It's the caller's job to ensure that the target task struct
   * can't go away on us before we can do any checks.
   *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *true (0) if we indeed boosted the target task.
+ *false (0) if we failed to boost the target.
+ *-ESRCH if there's no task to yield to.
   */
  bool __sched yield_to(struct task_struct *p, bool preempt)
  {
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, 
bool preempt)


  again:
  p_rq = task_rq(p);
+/*
+ * If we're the only runnable task on the rq and target rq also
+ * has only one task, there's absolutely no point in yielding.
+ */
+if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+yielded = -ESRCH;
+goto out_irq;
+}
+
  double_rq_lock(rq, p_rq);
  while (task_rq(p) != p_rq) {
  double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
  }

  if (!curr-sched_class-yield_to_task)
-goto out;
+goto out_unlock;

  if (curr-sched_class != p-sched_class)
-goto out;
+goto out_unlock;

  if (task_running(p_rq, p) || p-state)
-goto out;
+goto out_unlock;

  yielded = curr-sched_class-yield_to_task(rq, p, preempt);
  if (yielded) {
@@ -4329,11 +4341,12 @@ again:
  resched_task(p_rq-curr);
  }

-out:
+out_unlock:
  double_rq_unlock(rq, p_rq);
+out_irq:
  local_irq_restore(flags);

-if (yielded)
+if (yielded > 0)
  schedule();

  return yielded;



Acked-by: Andrew Jones drjo...@redhat.com



Thank you Drew.

Marcelo Gleb.. Please let me know if you have comments / concerns on 
the patches..


Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios
especially for large guests where we do have overhead of vcpu iteration
of ple handler..

.

Thanks Raghu. Will try to get this latest patch set evaluated and get 
back to you.


Vinod



Re: Performance issue

2012-11-27 Thread Gleb Natapov
On Tue, Nov 27, 2012 at 02:29:20PM +0200, George-Cristian Bîrzan wrote:
 On Tue, Nov 27, 2012 at 2:20 PM, Gleb Natapov g...@redhat.com wrote:
  On Mon, Nov 26, 2012 at 09:31:19PM +0200, George-Cristian Bîrzan wrote:
  On Sun, Nov 25, 2012 at 6:17 PM, George-Cristian Bîrzan g...@birzan.org 
  wrote:
   On Sun, Nov 25, 2012 at 5:19 PM, Gleb Natapov g...@redhat.com wrote:
   What Windows is this? Can you try changing -cpu host to -cpu
   host,+hv_relaxed?
  
   This is on Windows Server 2008 R2 (sorry, forgot to mention that I
   guess), and I can try it tomorrow (US time), as getting a stream my
   way depends on complicated stuff. I will though, and let you know how
   it goes.
 
  I changed that, no difference.
 
 
  Heh, I forgot that the part that should make difference is not yet
  upstream :(
 
 We can try recompiling kvm/qemu with some patches, if that'd help. At
 this point, anything is on the table except changing Windows and the
 hardware :-)

Vadim do you have Hyper-v reference timer patches for KVM to try?

 
 Also, it might be that the software doing the actual work is not well
 written, but even so...
 
 --
 George-Cristian Bîrzan

--
Gleb.


Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-27 Thread Michael Wolf

On 11/27/2012 02:48 AM, Glauber Costa wrote:

Hi,

On 11/27/2012 12:36 AM, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

If you submit this again, please include a version number in your series.

Will do.  The patchset was sent twice yesterday by mistake.  Got an error
the first time and didn't think the patches went out.  This has been
corrected.


It would also be helpful to include a small changelog about what changed
between last version and this version, so we could focus on that.
Yes, will do that.  When I took the RFC off the patches I was looking at it
as a new patchset, which was a mistake.  I will make sure to add a changelog
when I submit again.


As for the rest, I answered your previous two submissions saying I don't
agree with the concept. If you hadn't changed anything, resending it
won't change my mind.

I could of course, be mistaken or misguided. But I had also not seen any
wave of support in favor of this previously, so basically I have no new
data to make me believe I should see it any differently.

Let's try this again:

* Rik asked you in your last submission how does ppc handle this. You
said, and I quote: In the case of lpar on POWER systems they simply
report steal time and do not alter it in any way.
They do however report how much processor is assigned to the partition
and that information is in /proc/ppc64/lparcfg.
Yes, but we still get questions from users asking "what is steal time?" and
"why am I seeing this?"


Now, that is a *way* more sensible thing to do. Much more. Confusing
users is something extremely subjective. This is specially true about
concepts that are know for quite some time, like steal time. If you out
of a sudden change the meaning of this, it is sure to confuse a lot more
users than it would clarify.
Something like this could certainly be done.  But when I was submitting the
patch set as an RFC, qemu was passing a cpu percentage that would be used by
the guest kernel to adjust the steal time.  This percentage was being stored
on the guest as a sysctl value.

Avi stated he didn't like that kind of coupling, and that the value could get
out of sync.  Anthony stated "The guest shouldn't need to know it's
entitlement. Or at least, it's up to a management tool to report that in a
way that's meaningful for the guest."

So perhaps I misunderstood what they were suggesting, but I took it to mean
that they did not want the guest to know what the entitlement was.  That the
host should take care of it and just report the already adjusted data to the
guest.  So in this version of the code the host would use a set period for a
timer and be passed essentially a number of ticks of expected steal time.
The host would then use the timer to break out the steal time into consigned
and steal buckets, which would be reported to the guest.

Both the consigned and the steal would be reported via /proc/stat, so anyone
needing to see total time away could add the two fields together.  The user,
however, when using tools like top or vmstat would see the usage based on
what the guest is entitled to.

Do you have suggestions for how I can build consensus around one of the 
two approaches?
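
(For illustration only, not code from the posted series: a minimal sketch of
the kind of host-side split described above, assuming the host is handed a
consign limit in the same units as the steal accumulated over the accounting
period. The consigned/steal naming mirrors the /proc/stat description;
everything else is made up for the example.)

	/*
	 * Sketch: report steal up to the expected ("consigned") allotment
	 * for the period as consigned time, and only the excess as
	 * traditional steal time.
	 */
	static void split_steal(u64 steal_delta, u64 consign_limit,
				u64 *consigned, u64 *steal)
	{
		if (steal_delta <= consign_limit) {
			*consigned = steal_delta;
			*steal = 0;
		} else {
			*consigned = consign_limit;
			*steal = steal_delta - consign_limit;
		}
	}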









---

Michael Wolf (5):
   Alter the amount of steal time reported by the guest.
   Expand the steal time msr to also contain the consigned time.
   Add the code to send the consigned time from the host to the guest
   Add a timer to allow the separation of consigned from steal time.
   Add an ioctl to communicate the consign limit to the host.


  arch/x86/include/asm/kvm_host.h   |   11 +++
  arch/x86/include/asm/kvm_para.h   |3 +-
  arch/x86/include/asm/paravirt.h   |4 +--
  arch/x86/include/asm/paravirt_types.h |2 +
  arch/x86/kernel/kvm.c |8 ++---
  arch/x86/kernel/paravirt.c|4 +--
  arch/x86/kvm/x86.c|   50 -
  fs/proc/stat.c|9 +-
  include/linux/kernel_stat.h   |2 +
  include/linux/kvm_host.h  |2 +
  include/uapi/linux/kvm.h  |2 +
  kernel/sched/core.c   |   10 ++-
  kernel/sched/cputime.c|   21 +-
  kernel/sched/sched.h  |2 +
  virt/kvm/kvm_main.c   |7 +
  15 files changed, 120 insertions(+), 17 deletions(-)



[Bug 50891] The smp_affinity cannot work correctly on guest os when PCI passthrough device using msi/msi-x with KVM

2012-11-27 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=50891





--- Comment #3 from Alex Williamson alex.william...@redhat.com  2012-11-27 
18:13:07 ---
I tested a BCM5716 on 3.7.0-rc7 with both qemu-kvm-1.2.0 and current qemu.git
using pci-assign.  MSI-X pinning works exactly as expected.  Note that Linux
MSI affinity is set up lazily on the next interrupt for a vector, so it's normal
that after setting the affinity for a vector you might see a single
interrupt on another CPU before the interrupt is moved.  Also note that setting
the affinity in the guest only changes the affinity of the virtual interrupt to
the guest; the physical interrupt affinity must be separately configured on the
host.  Perhaps the steps you're missing in comment 0 above are to disable
irqbalance in the host and set the irq affinity of the kvm interrupts in the
host.
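
(For anyone following along, a minimal sketch of the host-side affinity step;
the IRQ number and CPU mask are made-up examples, and the same thing is
normally done with a simple echo into /proc/irq/<n>/smp_affinity for the
kvm/vfio interrupts listed in /proc/interrupts, after stopping irqbalance.)

	#include <stdio.h>

	/* Sketch: pin host IRQ "irq" to the CPUs in "cpumask" (hex bitmask). */
	static int set_host_irq_affinity(int irq, unsigned int cpumask)
	{
		char path[64];
		FILE *f;

		snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fprintf(f, "%x\n", cpumask);	/* e.g. 0x2 restricts to CPU1 */
		return fclose(f);
	}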



RE: [PATCH V4 1/2] Add code to track call origin for msr assignment.

2012-11-27 Thread Auld, Will
Gleb,

This last change to emulator_set_msr() was wrong as you point out. I will 
change it back to what it was in V3 with the exception of fixing the bool that 
Marcelo pointed out. 

However, the change of:

struct kvm_x86_ops {...
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
...
};

To match emulator_set_msr() prototype seems wrong because emulator_set_msr() is 
used with the following structure: 

struct x86_emulate_ops {...
int (*set_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 data);
...
};

Rather than with struct kvm_x86_ops, and I don't see that these two need to 
agree. In fact, they don't agree in the first parameter, which precedes my 
mucking.

If this is correct I will not make any changes to struct kvm_x86_ops 
set_msr() prototype (remain as is in V4).

Does this make sense? Do you agree?

Will

 -Original Message-
 From: Gleb Natapov [mailto:g...@redhat.com]
 Sent: Monday, November 26, 2012 10:59 PM
 To: Auld, Will
 Cc: qemu-devel; mtosa...@redhat.com; kvm@vger.kernel.org; Dugger,
 Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com
 Subject: Re: [PATCH V4 1/2] Add code to track call origin for msr
 assignment.
 
 On Mon, Nov 26, 2012 at 05:35:07PM -0800, Will Auld wrote:
  In order to track who initiated the call (host or guest) to modify an
  msr value I have changed function call parameters along the call
 path.
  The specific change is to add a struct pointer parameter that points
  to (index, data, caller) information rather than having this
  information passed as individual parameters.
 
  The initial use for this capability is for updating the
  IA32_TSC_ADJUST msr while setting the tsc value. It is anticipated
  that this capability is useful other tasks.
 
  Signed-off-by: Will Auld will.a...@intel.com
  ---
   arch/x86/include/asm/kvm_host.h | 12 +---
   arch/x86/kvm/svm.c  | 21 +++--
   arch/x86/kvm/vmx.c  | 24 +---
   arch/x86/kvm/x86.c  | 23 +++
   arch/x86/kvm/x86.h  |  2 +-
   5 files changed, 57 insertions(+), 25 deletions(-)
 
  diff --git a/arch/x86/include/asm/kvm_host.h
  b/arch/x86/include/asm/kvm_host.h index 09155d6..da34027 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -598,6 +598,12 @@ struct kvm_vcpu_stat {
 
   struct x86_instruction_info;
 
  +struct msr_data {
  +bool host_initiated;
  +u32 index;
  +u64 data;
  +};
  +
   struct kvm_x86_ops {
  int (*cpu_has_kvm_support)(void);  /* __init */
  int (*disabled_by_bios)(void); /* __init */
  @@ -621,7 +627,7 @@ struct kvm_x86_ops {
  void (*set_guest_debug)(struct kvm_vcpu *vcpu,
  struct kvm_guest_debug *dbg);
  int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
  -   int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
  +   int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
 Emulator still calls the function pointer with msr_index  data. It
 would be better to leave set_msr pointer as is and construct msr_data
 emulator_set_msr() like in V3. Just drop this set_msr signature change.
 
  u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
  void (*get_segment)(struct kvm_vcpu *vcpu,
  struct kvm_segment *var, int seg); @@ -772,7
 +778,7 @@ static
  inline int emulate_instruction(struct kvm_vcpu *vcpu,
 
   void kvm_enable_efer_bits(u64);
   int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
  -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
  +int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
   struct x86_emulate_ctxt;
 
  @@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu,
  int *db, int *l);  int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index,
  u64 xcr);
 
   int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
  -int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
  +int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
   unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);  void
  kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); diff
  --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index
 baead95..5ac11f0
  100644
  --- a/arch/x86/kvm/svm.c
  +++ b/arch/x86/kvm/svm.c
  @@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct
 kvm *kvm, unsigned int id)
  struct page *msrpm_pages;
  struct page *hsave_page;
  struct page *nested_msrpm_pages;
  +   struct msr_data msr;
  int err;
 
  svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); @@ -1255,7
  +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm,
 unsigned int id)
  svm-vmcb_pa = page_to_pfn(page)  PAGE_SHIFT;
  svm-asid_generation = 0;
  init_vmcb(svm);
  -   kvm_write_tsc(svm-vcpu, 0);
  +   msr.data = 0x0;
  +   

Re: [PATCH V4 1/2] Add code to track call origin for msr assignment.

2012-11-27 Thread Gleb Natapov
On Tue, Nov 27, 2012 at 06:19:21PM +, Auld, Will wrote:
 Gleb,
 
 This last change to emulator_set_msr() was wrong as you point out. I will 
 change it back to what it was in V3 with the exception of fixing the bool 
 that Marcelo pointed out. 
 
 However, the change of:
 
 struct kvm_x86_ops {...
   int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
 ...
 };
 
 To match emulator_set_msr() prototype seems wrong because emulator_set_msr() 
 is used with the following structure: 
 
 struct x86_emulate_ops {...
   int (*set_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 data);
 ...
 };
 
 Rather than with struct kvm_x86_ops and I don't see that these two need to 
 agree. In fact, they don't agree in the first parameter which proceeds my 
 mucking.
 
 If this is correct I will not make any changes to struct kvm_x86_ops 
 set_msr() prototype (remain as is in V4).
 
 Does this make since? Do you agree?
 
Ugh, yeah. I looked at the wrong structure. Disregard this. Change true
to false in V3. Sorry about the confusion.
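
(For reference, a sketch of the V3-style shape being agreed on here: msr_data
built inside the emulator callback, the kvm_x86_ops set_msr prototype left
alone, and the bool false because the write is guest-initiated. This is not
the actual patch; emul_to_vcpu() is assumed to be the usual ctxt-to-vcpu
helper in x86.c.)

	static int emulator_set_msr(struct x86_emulate_ctxt *ctxt,
				    u32 msr_index, u64 data)
	{
		struct msr_data msr;

		msr.data = data;
		msr.index = msr_index;
		msr.host_initiated = false;	/* guest-initiated wrmsr */
		return kvm_set_msr(emul_to_vcpu(ctxt), &msr);
	}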

 Will
 
  -Original Message-
  From: Gleb Natapov [mailto:g...@redhat.com]
  Sent: Monday, November 26, 2012 10:59 PM
  To: Auld, Will
  Cc: qemu-devel; mtosa...@redhat.com; kvm@vger.kernel.org; Dugger,
  Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com
  Subject: Re: [PATCH V4 1/2] Add code to track call origin for msr
  assignment.
  
  On Mon, Nov 26, 2012 at 05:35:07PM -0800, Will Auld wrote:
   In order to track who initiated the call (host or guest) to modify an
   msr value I have changed function call parameters along the call
  path.
   The specific change is to add a struct pointer parameter that points
   to (index, data, caller) information rather than having this
   information passed as individual parameters.
  
   The initial use for this capability is for updating the
   IA32_TSC_ADJUST msr while setting the tsc value. It is anticipated
   that this capability is useful other tasks.
  
   Signed-off-by: Will Auld will.a...@intel.com
   ---
arch/x86/include/asm/kvm_host.h | 12 +---
arch/x86/kvm/svm.c  | 21 +++--
arch/x86/kvm/vmx.c  | 24 +---
arch/x86/kvm/x86.c  | 23 +++
arch/x86/kvm/x86.h  |  2 +-
5 files changed, 57 insertions(+), 25 deletions(-)
  
   diff --git a/arch/x86/include/asm/kvm_host.h
   b/arch/x86/include/asm/kvm_host.h index 09155d6..da34027 100644
   --- a/arch/x86/include/asm/kvm_host.h
   +++ b/arch/x86/include/asm/kvm_host.h
   @@ -598,6 +598,12 @@ struct kvm_vcpu_stat {
  
struct x86_instruction_info;
  
   +struct msr_data {
   +bool host_initiated;
   +u32 index;
   +u64 data;
   +};
   +
struct kvm_x86_ops {
 int (*cpu_has_kvm_support)(void);  /* __init */
 int (*disabled_by_bios)(void); /* __init */
   @@ -621,7 +627,7 @@ struct kvm_x86_ops {
 void (*set_guest_debug)(struct kvm_vcpu *vcpu,
 struct kvm_guest_debug *dbg);
 int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
   - int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
   + int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
  Emulator still calls the function pointer with msr_index  data. It
  would be better to leave set_msr pointer as is and construct msr_data
  emulator_set_msr() like in V3. Just drop this set_msr signature change.
  
 u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
 void (*get_segment)(struct kvm_vcpu *vcpu,
 struct kvm_segment *var, int seg); @@ -772,7
  +778,7 @@ static
   inline int emulate_instruction(struct kvm_vcpu *vcpu,
  
void kvm_enable_efer_bits(u64);
int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
   -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
   +int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
  
struct x86_emulate_ctxt;
  
   @@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu,
   int *db, int *l);  int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index,
   u64 xcr);
  
int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
   -int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
   +int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr);
  
unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);  void
   kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); diff
   --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index
  baead95..5ac11f0
   100644
   --- a/arch/x86/kvm/svm.c
   +++ b/arch/x86/kvm/svm.c
   @@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct
  kvm *kvm, unsigned int id)
 struct page *msrpm_pages;
 struct page *hsave_page;
 struct page *nested_msrpm_pages;
   + struct msr_data msr;
 int err;
  
 svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); @@ 

[PATCH V5 0/2] Enable guest use of TSC_ADJUST functionality

2012-11-27 Thread Will Auld
With this version (V5) I have gone back to the V3 implementation of
emulator_set_msr(), changing the bool to false.

Will Auld (2):
  Add code to track call origin for msr assignment.
  Enabling IA32_TSC_ADJUST for KVM guest VM support

 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_host.h   | 15 +---
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/kvm/cpuid.c  |  2 ++
 arch/x86/kvm/cpuid.h  |  8 +++
 arch/x86/kvm/svm.c| 28 +-
 arch/x86/kvm/vmx.c| 33 --
 arch/x86/kvm/x86.c| 49 ---
 arch/x86/kvm/x86.h|  2 +-
 9 files changed, 114 insertions(+), 25 deletions(-)

-- 
1.8.0.rc0





[PATCH V5 1/2] Add code to track call origin for msr assignment.

2012-11-27 Thread Will Auld
In order to track who initiated the call (host or guest) to modify an msr
value I have changed function call parameters along the call path. The
specific change is to add a struct pointer parameter that points to (index,
data, caller) information rather than having this information passed as
individual parameters.

The initial use for this capability is for updating the IA32_TSC_ADJUST
msr while setting the tsc value. It is anticipated that this capability
is useful for other tasks.

Signed-off-by: Will Auld will.a...@intel.com
---
 arch/x86/include/asm/kvm_host.h | 12 +---
 arch/x86/kvm/svm.c  | 21 +++--
 arch/x86/kvm/vmx.c  | 24 +---
 arch/x86/kvm/x86.c  | 27 +++
 arch/x86/kvm/x86.h  |  2 +-
 5 files changed, 61 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09155d6..da34027 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -598,6 +598,12 @@ struct kvm_vcpu_stat {
 
 struct x86_instruction_info;
 
+struct msr_data {
+bool host_initiated;
+u32 index;
+u64 data;
+};
+
 struct kvm_x86_ops {
int (*cpu_has_kvm_support)(void);  /* __init */
int (*disabled_by_bios)(void); /* __init */
@@ -621,7 +627,7 @@ struct kvm_x86_ops {
void (*set_guest_debug)(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg);
int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
-   int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+   int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
void (*get_segment)(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
@@ -772,7 +778,7 @@ static inline int emulate_instruction(struct kvm_vcpu *vcpu,
 
 void kvm_enable_efer_bits(u64);
 int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
+int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 struct x86_emulate_ctxt;
 
@@ -799,7 +805,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, 
int *l);
 int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr);
 
 int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
-int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
+int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
 void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index baead95..5ac11f0 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1211,6 +1211,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, 
unsigned int id)
struct page *msrpm_pages;
struct page *hsave_page;
struct page *nested_msrpm_pages;
+   struct msr_data msr;
int err;
 
svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
@@ -1255,7 +1256,10 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, 
unsigned int id)
svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
svm-asid_generation = 0;
init_vmcb(svm);
-   kvm_write_tsc(svm-vcpu, 0);
+   msr.data = 0x0;
+   msr.index = MSR_IA32_TSC;
+   msr.host_initiated = true;
+   kvm_write_tsc(svm-vcpu, msr);
 
err = fx_init(svm-vcpu);
if (err)
@@ -3147,13 +3151,15 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 
data)
return 0;
 }
 
-static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
+static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
+   u32 ecx = msr-index;
+   u64 data = msr-data;
switch (ecx) {
case MSR_IA32_TSC:
-   kvm_write_tsc(vcpu, data);
+   kvm_write_tsc(vcpu, msr);
break;
case MSR_STAR:
svm-vmcb-save.star = data;
@@ -3208,20 +3214,23 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned 
ecx, u64 data)
vcpu_unimpl(vcpu, unimplemented wrmsr: 0x%x data 0x%llx\n, 
ecx, data);
break;
default:
-   return kvm_set_msr_common(vcpu, ecx, data);
+   return kvm_set_msr_common(vcpu, msr);
}
return 0;
 }
 
 static int wrmsr_interception(struct vcpu_svm *svm)
 {
+   struct msr_data msr;
u32 ecx = svm-vcpu.arch.regs[VCPU_REGS_RCX];
u64 data = (svm->vcpu.arch.regs[VCPU_REGS_RAX] & -1u)
| ((u64)(svm->vcpu.arch.regs[VCPU_REGS_RDX] & -1u) << 32);
 
-
+   msr.data = data;
+   msr.index = ecx;
+   msr.host_initiated = false;
svm-next_rip = kvm_rip_read(svm-vcpu) + 2;
-   if 

[PATCH V5 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support

2012-11-27 Thread Will Auld
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to a guest
vcpu specific location to store the value of the emulated MSR while adding
the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will
be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This
is of course as long as the "use TSC counter offsetting" VM-execution
control is enabled, as well as the IA32_TSC_ADJUST control.

However, because hardware will only return the TSC + IA32_TSC_ADJUST + vmcs
tsc_offset for a guest process when it does a rdtsc (with the correct
settings), the value of our virtualized IA32_TSC_ADJUST must be stored in
one of these three locations. The argument against storing it in the actual
MSR is performance. This is likely to be seldom used while the save/restore
is required on every transition. IA32_TSC_ADJUST was created as a way to
solve some issues with writing TSC itself so that is not an option either.
The remaining option, defined above as our solution, has the problem of
returning incorrect vmcs tsc_offset values (unless we intercept and fix, not
done here) as mentioned above. However, more problematic is that storing the
data in vmcs tsc_offset will have a different semantic effect on the system
than does using the actual MSR. This is illustrated in the following example:
The hypervisor sets the IA32_TSC_ADJUST, then the guest sets it and a guest
process performs a rdtsc. In this case the guest process will get TSC +
IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset including IA32_TSC_ADJUST_guest.
While the total system semantics changed, the semantics as seen by the guest
do not, and hence this will not cause a problem.

Signed-off-by: Will Auld will.a...@intel.com
---
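
(Not part of the diff below, just a sketch of the emulation scheme the commit
message describes: a guest write to IA32_TSC_ADJUST updates the vcpu-local
shadow value and folds the delta into the vmcs tsc_offset, so later
rdtsc/rdmsr reads include it without saving and restoring the real MSR on
every transition. The helper name is invented; ia32_tsc_adjust_msr,
read_tsc_offset() and write_tsc_offset() are the fields this patch adds.)

	static void emulate_tsc_adjust_write(struct kvm_vcpu *vcpu, s64 new_adj)
	{
		s64 delta = new_adj - vcpu->arch.ia32_tsc_adjust_msr;

		kvm_x86_ops->write_tsc_offset(vcpu,
				kvm_x86_ops->read_tsc_offset(vcpu) + delta);
		vcpu->arch.ia32_tsc_adjust_msr = new_adj;
	}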
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_host.h   |  3 +++
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/kvm/cpuid.c  |  2 ++
 arch/x86/kvm/cpuid.h  |  8 
 arch/x86/kvm/svm.c|  7 +++
 arch/x86/kvm/vmx.c|  9 +
 arch/x86/kvm/x86.c| 22 ++
 8 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h 
b/arch/x86/include/asm/cpufeature.h
index 6b7ee5f..e574d81 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -199,6 +199,7 @@
 
 /* Intel-defined CPU features, CPUID level 0x0007:0 (ebx), word 9 */
 #define X86_FEATURE_FSGSBASE   (9*32+ 0) /* {RD/WR}{FS/GS}BASE instructions*/
+#define X86_FEATURE_TSC_ADJUST  (9*32+ 1) /* TSC adjustment MSR 0x3b */
 #define X86_FEATURE_BMI1   (9*32+ 3) /* 1st group bit manipulation 
extensions */
 #define X86_FEATURE_HLE(9*32+ 4) /* Hardware Lock Elision */
 #define X86_FEATURE_AVX2   (9*32+ 5) /* AVX2 instructions */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da34027..cf8c7e0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -442,6 +442,8 @@ struct kvm_vcpu_arch {
u32 virtual_tsc_mult;
u32 virtual_tsc_khz;
 
+   s64 ia32_tsc_adjust_msr;
+
atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
unsigned nmi_pending; /* NMI queued after currently running handler */
bool nmi_injected;/* Trying to inject an NMI this entry */
@@ -690,6 +692,7 @@ struct kvm_x86_ops {
bool (*has_wbinvd_exit)(void);
 
void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool 
scale);
+   u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 957ec87..6486569 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -231,6 +231,7 @@
 #define MSR_IA32_EBL_CR_POWERON0x002a
 #define MSR_EBC_FREQUENCY_ID   0x002c
 #define MSR_IA32_FEATURE_CONTROL0x003a
+#define MSR_IA32_TSC_ADJUST 0x003b
 
 #define FEATURE_CONTROL_LOCKED (1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX   (1<<1)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0595f13..e817bac 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -320,6 +320,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 
function,
if (index == 0) {
entry-ebx = kvm_supported_word9_x86_features;
cpuid_mask(entry-ebx, 9);
+   // TSC_ADJUST is emulated 
+   entry-ebx |= F(TSC_ADJUST);
} else
entry-ebx = 0;
entry-eax = 0;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 

Re: [PATCH v2 00/18] KVM for MIPS32 Processors

2012-11-27 Thread Sanjay Lal

On Nov 26, 2012, at 1:53 PM, David Daney wrote:

 
 I have several general questions about this patch...
 
 On 11/21/2012 06:33 PM, Sanjay Lal wrote:
 The following patchset implements KVM support for MIPS32R2 processors,
 using Trap & Emulate, with basic runtime binary translation to improve
 performance.  The goal has been to keep the Guest kernel changes to a
 minimum.
 
 What is the point of minimizing guest kernel changes?
 
 Because you are using an invented memory map, instead of the architecturally 
 defined map, there is no hope of running a single kernel image both natively 
 and as a guest.  So why do you care about how many changes there are.

It makes porting the code easier.  Since we need a special guest kernel,
keeping the changes to a minimum helps when migrating from one Linux version to
another.  At this point we've migrated the code from 2.6.32 to 3.7 with 3.0
along the way, without any issues or anything more than an automatic merge.

 
 
 The patch is against Linux 3.7-rc6.  This is Version 2 of the patch set.
 
 There is a companion patchset for QEMU that adds KVM support for the
 MIPS target.
 
 KVM/MIPS should support MIPS32-R2 processors and beyond.
 It has been tested on the following platforms:
  - Malta Board with FPGA based 34K (Little Endian).
  - Sigma Designs TangoX board with a 24K based 8654 SoC (Little Endian).
  - Malta Board with 74K @ 1GHz (Little Endian).
  - OVPSim MIPS simulator from Imperas emulating a Malta board with
24Kc and 1074Kc cores (Little Endian).
 
 Unlike x86, there is no concept of a canonical MIPS system for you to 
 implement.  So the choice of emulating a Malta or one of the SigmaDesigns 
 boards doesn't seem to me to give you anything.
 
 Why not just define the guest system to be exactly the facilities provided by 
 the VirtIO drivers?

The above list is a list of host systems that we've tested KVM/MIPS on.  The 
guest kernel runs on the Malta system that is emulated by QEMU regardless of 
the host system.

And yes we do support VirtIO devices on the emulated Malta board to speed up 
I/O, but since they attach to the emulated systems' PCI bus, we still need a 
kernel and system that supports PCI.

Just an FYI, we'll be posting the QEMU patch set shortly.

 
 
 Perhaps it is obvious from the patches, but I wasn't able to figure out how 
 you solve the problem of the Root/Host kernel clobbering the K0 and K1 
 registers in its exception handlers.  These registers are also used by the 
 Guest kernel (aren't they)?

Yes k0/k1 do need to be saved as they are used by both the guest and host 
kernels.  The code is in kvm_locore.S around line 250 where the L1 exception 
vectors are installed.

Regards
Sanjay




[PATCH for-1.3] vfio-pci: Fix KVM disabled path

2012-11-27 Thread Alex Williamson
kvm_check_extension() explodes when KVM isn't enabled so we need to
first test whether KVM is enabled.  Use kvm_irqchip_in_kernel() for
this since it matches the test we do before using this result.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/vfio_pci.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Anthony, please include this for 1.3.  Let me know if you prefer a
pull request.  Thanks,

Alex

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 7c27834..2acc757 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -438,7 +438,8 @@ static int vfio_enable_intx(VFIODevice *vdev)
  * Only conditional to avoid generating error messages on platforms
  * where we won't actually use the result anyway.
  */
-if (kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
+if (kvm_irqchip_in_kernel() &&
+kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
 vdev-intx.route = pci_device_route_intx_to_irq(vdev-pdev,
 vdev-intx.pin);
 }



Re: Performance issue

2012-11-27 Thread Vadim Rozenfeld
On Tuesday, November 27, 2012 04:54:47 PM Gleb Natapov wrote:
 On Tue, Nov 27, 2012 at 02:29:20PM +0200, George-Cristian Bîrzan wrote:
  On Tue, Nov 27, 2012 at 2:20 PM, Gleb Natapov g...@redhat.com wrote:
   On Mon, Nov 26, 2012 at 09:31:19PM +0200, George-Cristian Bîrzan wrote:
   On Sun, Nov 25, 2012 at 6:17 PM, George-Cristian Bîrzan 
   g...@birzan.org 
wrote:
On Sun, Nov 25, 2012 at 5:19 PM, Gleb Natapov g...@redhat.com 
wrote:
What Windows is this? Can you try changing -cpu host to -cpu
host,+hv_relaxed?

This is on Windows Server 2008 R2 (sorry, forgot to mention that I
guess), and I can try it tomorrow (US time), as getting a stream my
way depends on complicated stuff. I will though, and let you know
how it goes.
   
   I changed that, no difference.
   
   Heh, I forgot that the part that should make difference is not yet
   upstream :(
  
  We can try recompiling kvm/qemu with some patches, if that'd help. At
  this point, anything is on the table except changing Windows and the
  hardware :-)
 
 Vadim do you have Hyper-v reference timer patches for KVM to try?
I have some code which does both reference time and invariant TSC, but it
will not work after migration. I will send it later today.
Vadim.
 
  Also, it might be that the software doing the actual work is not well
  written, but even so...
  
  --
  George-Cristian Bîrzan
 
 --
   Gleb.


Re: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.

2012-11-27 Thread Konrad Rzeszutek Wilk
On Mon, Nov 26, 2012 at 02:36:45PM -0600, Michael Wolf wrote:
 Add a consigned field.  This field will hold the time lost due to capping or 
 overcommit.
 The rest of the time will still show up in the steal-time field.
 
 Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
 ---
  arch/x86/include/asm/paravirt.h   |4 ++--
  arch/x86/include/asm/paravirt_types.h |2 +-
  arch/x86/kernel/kvm.c |7 ++-
  kernel/sched/core.c   |   10 +-
  kernel/sched/cputime.c|2 +-
  5 files changed, 15 insertions(+), 10 deletions(-)
 
 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
 index a0facf3..a5f9f30 100644
 --- a/arch/x86/include/asm/paravirt.h
 +++ b/arch/x86/include/asm/paravirt.h
 @@ -196,9 +196,9 @@ struct static_key;
  extern struct static_key paravirt_steal_enabled;
  extern struct static_key paravirt_steal_rq_enabled;
  
 -static inline u64 paravirt_steal_clock(int cpu)
 +static inline u64 paravirt_steal_clock(int cpu, u64 *steal)

So it's u64 here.
  {
 - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
 + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
  }
  
  static inline unsigned long long paravirt_read_pmc(int counter)
 diff --git a/arch/x86/include/asm/paravirt_types.h 
 b/arch/x86/include/asm/paravirt_types.h
 index 142236e..5d4fc8b 100644
 --- a/arch/x86/include/asm/paravirt_types.h
 +++ b/arch/x86/include/asm/paravirt_types.h
 @@ -95,7 +95,7 @@ struct pv_lazy_ops {
  
  struct pv_time_ops {
   unsigned long long (*sched_clock)(void);
 - unsigned long long (*steal_clock)(int cpu);
 + void (*steal_clock)(int cpu, unsigned long long *steal);

But not u64 here? Any particular reason?


Re: Performance issue

2012-11-27 Thread George-Cristian Bîrzan
On Tue, Nov 27, 2012 at 10:38 PM, Vadim Rozenfeld vroze...@redhat.com wrote:
 I have some code which do both reference time and invariant TSC but it
 will not work after migration. I will send it later today.

Do you mean migrating guests? This is not an issue for us.

Also, it would be much appreciated!

--
George-Cristian Bîrzan


[RFC PATCH 3/4] pci-assign: Refactor MSI virq array

2012-11-27 Thread Alex Williamson
Convert the msi_virq array into a struct array so we can easily add
a place to track the MSIMessage for each vector.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/kvm/pci-assign.c |   51 ++-
 1 file changed, 26 insertions(+), 25 deletions(-)

diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
index e80dad0..3e667d1 100644
--- a/hw/kvm/pci-assign.c
+++ b/hw/kvm/pci-assign.c
@@ -130,8 +130,10 @@ typedef struct AssignedDevice {
 } cap;
 uint8_t emulate_config_read[PCI_CONFIG_SPACE_SIZE];
 uint8_t emulate_config_write[PCI_CONFIG_SPACE_SIZE];
-int msi_virq_nr;
-int *msi_virq;
+int msi_nr;
+struct {
+int virq;
+} *msi;
 MSIXTableEntry *msix_table;
 hwaddr msix_table_addr;
 uint16_t msix_max;
@@ -683,19 +685,18 @@ again:
 return 0;
 }
 
-static void free_msi_virqs(AssignedDevice *dev)
+static void free_msi(AssignedDevice *dev)
 {
 int i;
 
-for (i = 0; i  dev-msi_virq_nr; i++) {
-if (dev-msi_virq[i] = 0) {
-kvm_irqchip_release_virq(kvm_state, dev-msi_virq[i]);
-dev-msi_virq[i] = -1;
+for (i = 0; i  dev-msi_nr; i++) {
+if (dev-msi[i].virq = 0) {
+kvm_irqchip_release_virq(kvm_state, dev-msi[i].virq);
 }
 }
-g_free(dev-msi_virq);
-dev-msi_virq = NULL;
-dev-msi_virq_nr = 0;
+g_free(dev-msi);
+dev-msi = NULL;
+dev-msi_nr = 0;
 }
 
 static void free_assigned_device(AssignedDevice *dev)
@@ -750,7 +751,7 @@ static void free_assigned_device(AssignedDevice *dev)
 close(dev-real_device.config_fd);
 }
 
-free_msi_virqs(dev);
+free_msi(dev);
 }
 
 static void assign_failed_examine(AssignedDevice *dev)
@@ -989,7 +990,7 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev)
 perror(assigned_dev_update_msi: deassign irq);
 }
 
-free_msi_virqs(assigned_dev);
+free_msi(assigned_dev);
 
 assigned_dev-assigned_irq_type = ASSIGNED_IRQ_NONE;
 pci_device_set_intx_routing_notifier(pci_dev, NULL);
@@ -1005,9 +1006,9 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev)
 return;
 }
 
-assigned_dev-msi_virq = g_malloc(sizeof(*assigned_dev-msi_virq));
-assigned_dev-msi_virq_nr = 1;
-assigned_dev-msi_virq[0] = virq;
+assigned_dev-msi = g_malloc(sizeof(*assigned_dev-msi));
+assigned_dev-msi_nr = 1;
+assigned_dev-msi[0].virq = virq;
 if (kvm_device_msi_assign(kvm_state, assigned_dev-dev_id, virq)  0) {
 perror(assigned_dev_update_msi: kvm_device_msi_assign);
 }
@@ -1055,14 +1056,14 @@ static int assigned_dev_update_msix_mmio(PCIDevice 
*pci_dev)
 return r;
 }
 
-free_msi_virqs(adev);
+free_msi(adev);
 
-adev-msi_virq_nr = adev-msix_max;
-adev-msi_virq = g_malloc(adev-msix_max * sizeof(*adev-msi_virq));
+adev-msi_nr = adev-msix_max;
+adev-msi = g_malloc(adev-msix_max * sizeof(*adev-msi));
 
 entry = adev-msix_table;
 for (i = 0; i  adev-msix_max; i++, entry++) {
-adev-msi_virq[i] = -1;
+adev-msi[i].virq = -1;
 
 if (assigned_dev_msix_masked(entry)) {
 continue;
@@ -1074,13 +1075,13 @@ static int assigned_dev_update_msix_mmio(PCIDevice 
*pci_dev)
if (r < 0) {
 return r;
 }
-adev-msi_virq[i] = r;
+adev-msi[i].virq = r;
 
 DEBUG(MSI-X vector %d, gsi %d, addr %08x_%08x, data %08x\n, i,
   r, entry-addr_hi, entry-addr_lo, entry-data);
 
 r = kvm_device_msix_set_vector(kvm_state, adev-dev_id, i,
-   adev-msi_virq[i]);
+   adev-msi[i].virq);
 if (r) {
 error_report(fail to set MSI-X entry! %s, strerror(-r));
 break;
@@ -1108,7 +1109,7 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev)
 perror(assigned_dev_update_msix: deassign irq);
 }
 
-free_msi_virqs(assigned_dev);
+free_msi(assigned_dev);
 
 assigned_dev-assigned_irq_type = ASSIGNED_IRQ_NONE;
 pci_device_set_intx_routing_notifier(pci_dev, NULL);
@@ -1120,7 +1121,7 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev)
 return;
 }
 
-if (assigned_dev->msi_virq_nr > 0) {
+if (assigned_dev->msi_nr > 0) {
 if (kvm_device_msix_assign(kvm_state, assigned_dev-dev_id)  0) {
 perror(assigned_dev_enable_msix: assign irq);
 return;
@@ -1548,7 +1549,7 @@ static void assigned_dev_msix_mmio_write(void *opaque, 
hwaddr addr,
 } else if (assigned_dev_msix_masked(&orig) &&
!assigned_dev_msix_masked(entry)) {
 /* Vector unmasked */
-if (i >= adev->msi_virq_nr || adev->msi_virq[i] < 0) {
+if (i >= adev->msi_nr || adev->msi[i].virq < 0) {
 /* Previously 

[RFC PATCH 2/4] vfio-pci: Add support for MSI affinity

2012-11-27 Thread Alex Williamson
When MSI is accelerated through KVM the vectors are only programmed
when the guest first enables MSI support.  Subsequent writes to the
vector address or data fields are ignored.  Unfortunately that means
we ignore updates done to adjust SMP affinity of the vectors.
MSI SMP affinity already works in non-KVM mode because the address
and data fields are read from their backing store on each interrupt.

This patch stores the MSIMessage programmed into KVM so that we can
determine when changes are made and update the routes.  The message
is stored for both MSI and MSI-X for consistency, but we only have
a use for them in MSI mode.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/vfio_pci.c |   31 ---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 7c27834..49b9550 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -75,6 +75,7 @@ struct VFIODevice;
 typedef struct VFIOMSIVector {
 EventNotifier interrupt; /* eventfd triggered on interrupt */
 struct VFIODevice *vdev; /* back pointer to device */
+MSIMessage msg; /* cache the MSI message so we know when it changes */
 int virq; /* KVM irqchip route for QEMU bypass */
 bool use;
 } VFIOMSIVector;
@@ -574,6 +575,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
 
 vector = vdev-msi_vectors[nr];
 vector-vdev = vdev;
+vector-msg = msg;
 vector-use = true;
 
 msix_vector_use(pdev, nr);
@@ -716,7 +718,6 @@ retry:
 vdev-msi_vectors = g_malloc0(vdev-nr_vectors * sizeof(VFIOMSIVector));
 
 for (i = 0; i  vdev-nr_vectors; i++) {
-MSIMessage msg;
 VFIOMSIVector *vector = vdev-msi_vectors[i];
 
 vector-vdev = vdev;
@@ -726,13 +727,13 @@ retry:
 error_report(vfio: Error: event_notifier_init failed\n);
 }
 
-msg = msi_get_message(vdev-pdev, i);
+vector-msg = msi_get_message(vdev-pdev, i);
 
 /*
  * Attempt to enable route through KVM irqchip,
  * default to userspace handling if unavailable.
  */
-vector-virq = kvm_irqchip_add_msi_route(kvm_state, msg);
+vector-virq = kvm_irqchip_add_msi_route(kvm_state, vector-msg);
 if (vector->virq < 0 ||
 kvm_irqchip_add_irqfd_notifier(kvm_state, vector-interrupt,
vector-virq)  0) {
@@ -1022,6 +1023,30 @@ static void vfio_pci_write_config(PCIDevice *pdev, 
uint32_t addr,
 vfio_enable_msi(vdev);
 } else if (was_enabled  !is_enabled) {
 vfio_disable_msi(vdev);
+} else if (was_enabled  is_enabled) {
+int i;
+
+for (i = 0; i  vdev-nr_vectors; i++) {
+VFIOMSIVector *vector = vdev-msi_vectors[i];
+MSIMessage msg;
+
+if (!vector->use || vector->virq < 0) {
+continue;
+}
+
+msg = msi_get_message(pdev, i);
+
+if (msg.address != vector-msg.address ||
+msg.data != vector-msg.data) {
+
+DPRINTF(%s(%04x:%02x:%02x.%x) MSI vector %d changed\n,
+__func__, vdev-host.domain, vdev-host.bus,
+vdev-host.slot, vdev-host.function, i);
+
+kvm_irqchip_update_msi_route(kvm_state, vector-virq, msg);
+vector-msg = msg;
+}
+}
 }
 }
 



[RFC PATCH 4/4] pci-assign: Add MSI affinity support

2012-11-27 Thread Alex Williamson
Track the last MSIMessage programmed so we can determine when it has
changed and update the routing to the guest.  We track MSI-X for
consistency, but we don't do anything with it here.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/kvm/pci-assign.c |   29 +
 1 file changed, 29 insertions(+)

diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
index 3e667d1..500c62c 100644
--- a/hw/kvm/pci-assign.c
+++ b/hw/kvm/pci-assign.c
@@ -133,6 +133,7 @@ typedef struct AssignedDevice {
 int msi_nr;
 struct {
 int virq;
+MSIMessage msg;
 } *msi;
 MSIXTableEntry *msix_table;
 hwaddr msix_table_addr;
@@ -1009,6 +1010,7 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev)
 assigned_dev-msi = g_malloc(sizeof(*assigned_dev-msi));
 assigned_dev-msi_nr = 1;
 assigned_dev-msi[0].virq = virq;
+assigned_dev-msi[0].msg = msg;
 if (kvm_device_msi_assign(kvm_state, assigned_dev-dev_id, virq)  0) {
 perror(assigned_dev_update_msi: kvm_device_msi_assign);
 }
@@ -1021,6 +1023,27 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev)
 }
 }
 
+static void assigned_dev_update_msi_msg(PCIDevice *pci_dev)
+{
+AssignedDevice *assigned_dev = DO_UPCAST(AssignedDevice, dev, pci_dev);
+uint8_t ctrl_byte = pci_get_byte(pci_dev-config + pci_dev-msi_cap +
+ PCI_MSI_FLAGS);
+MSIMessage msg;
+
+if (assigned_dev-assigned_irq_type != ASSIGNED_IRQ_MSI ||
+!(ctrl_byte  PCI_MSI_FLAGS_ENABLE)) {
+return;
+}
+
+msg = msi_get_message(pci_dev, 0);
+
+if (msg.address != assigned_dev-msi[0].msg.address ||
+msg.data != assigned_dev-msi[0].msg.data) {
+kvm_irqchip_update_msi_route(kvm_state, assigned_dev-msi[0].virq, 
msg);
+assigned_dev-msi[0].msg = msg;
+}
+}
+
 static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
 {
 return (entry-ctrl  cpu_to_le32(0x1)) != 0;
@@ -1076,6 +1099,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice 
*pci_dev)
 return r;
 }
 adev-msi[i].virq = r;
+adev-msi[i].msg = msg;
 
 DEBUG(MSI-X vector %d, gsi %d, addr %08x_%08x, data %08x\n, i,
   r, entry-addr_hi, entry-addr_lo, entry-data);
@@ -1183,6 +1207,10 @@ static void assigned_dev_pci_write_config(PCIDevice 
*pci_dev, uint32_t address,
 if (range_covers_byte(address, len,
   pci_dev-msi_cap + PCI_MSI_FLAGS)) {
 assigned_dev_update_msi(pci_dev);
+} else if (ranges_overlap(address, len,
+  pci_dev-msi_cap + PCI_MSI_ADDRESS_LO,
+  10 - PCI_MSI_ADDRESS_LO)) {
+assigned_dev_update_msi_msg(pci_dev);
 }
 }
 if (assigned_dev-cap.available  ASSIGNED_DEVICE_CAP_MSIX) {
@@ -1561,6 +1589,7 @@ static void assigned_dev_msix_mmio_write(void *opaque, 
hwaddr addr,
 msg.address = entry-addr_lo |
 ((uint64_t)entry-addr_hi  32);
 msg.data = entry-data;
+adev-msi[i].msg = msg;
 
 ret = kvm_irqchip_update_msi_route(kvm_state,
adev-msi[i].virq, msg);



[RFC PATCH 0/4] MSI affinity for assigned devices

2012-11-27 Thread Alex Williamson
This is post-1.3 material, so I'll just post it as an RFC for now.

MSI routing updates aren't currently handled by pci-assign or
vfio-pci (when using KVM acceleration), which means that trying to
set interrupt SMP affinity in the guest has no effect unless MSI is
completely disabled and re-enabled.  This series fixes this for both
device assignment backends using similar schemes.  We store the last
MSIMessage programmed to KVM and do updates to the MSI route when it
changes.  pci-assign takes a little bit of refactoring to make this
happen cleanly.  Thanks,
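For reference, the common pattern both backends end up with boils down to the
sketch below (the function name is made up; the types and helpers are the ones
used in the patches in this series):

static void vfio_msi_update_route(VFIODevice *vdev, int nr)
{
    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
    MSIMessage msg;

    if (!vector->use || vector->virq < 0) {
        return; /* vector unused or not routed through the KVM irqchip */
    }

    /* compare what the guest has programmed now vs. what we last told KVM */
    msg = msi_get_message(&vdev->pdev, nr);
    if (msg.address == vector->msg.address &&
        msg.data == vector->msg.data) {
        return; /* nothing changed, keep the existing route */
    }

    kvm_irqchip_update_msi_route(kvm_state, vector->virq, msg);
    vector->msg = msg; /* remember the last MSIMessage programmed to KVM */
}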

Alex

---

Alex Williamson (4):
  kvm: Stub kvm_irqchip_update_msi_route
  vfio-pci: Add support for MSI affinity
  pci-assign: Refactor MSI virq array
  pci-assign: Add MSI affinity support


 hw/kvm/pci-assign.c |   80 +++
 hw/vfio_pci.c   |   31 ++--
 kvm-all.c   |5 +++
 kvm-stub.c  |5 +++
 4 files changed, 93 insertions(+), 28 deletions(-)


[RFC PATCH 1/4] kvm: Stub kvm_irqchip_update_msi_route

2012-11-27 Thread Alex Williamson
Its friends are already stubbed out, add this one too so vfio can
call it without wrapping it in #ifdefs.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 kvm-all.c  |5 +
 kvm-stub.c |5 +
 2 files changed, 10 insertions(+)

diff --git a/kvm-all.c b/kvm-all.c
index 3bc3347..93e9f0e 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1179,6 +1179,11 @@ int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage 
msg)
 return -ENOSYS;
 }
 
+int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
+{
+return -ENOSYS;
+}
+
 static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int virq, bool assign)
 {
 abort();
diff --git a/kvm-stub.c b/kvm-stub.c
index a3455e2..3a3bf85 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -127,6 +127,11 @@ int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg)
 return -ENOSYS;
 }
 
+int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
+{
+return -ENOSYS;
+}
+
 void kvm_irqchip_release_virq(KVMState *s, int virq)
 {
 }



Re: [RFC PATCH 0/4] MSI affinity for assigned devices

2012-11-27 Thread Jan Kiszka
On 2012-11-27 23:00, Alex Williamson wrote:
 This is post-1.3 material, so I'll just post it as an RFC for now.
 
 MSI routing updates aren't currently handled by pci-assign or
 vfio-pci (when using KVM acceleration), which means that trying to
 set interrupt SMP affinity in the guest has no effect unless MSI is
 completely disabled and re-enabled.  This series fixes this for both
 device assignment backends using similar schemes.  We store the last
 MSIMessage programmed to KVM and do updates to the MSI route when it
 changes.  pci-assign takes a little bit of refactoring to make this
 happen cleanly.  Thanks,

This should rather be done by implementing vector notifiers for MSI as
well. That way the device model no longer has to track reasons for
vector changes in an open-coded fashion, just like we already do for MSI-X.
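For illustration, such MSI vector notifiers could mirror the existing MSI-X
vector use/release notifier scheme; every name below is hypothetical and does
not exist in the tree:

/* hypothetical MSI counterparts to the MSI-X vector notifiers */
typedef int  (*MSIVectorUseNotifier)(PCIDevice *dev, unsigned int vector,
                                     MSIMessage msg);
typedef void (*MSIVectorReleaseNotifier)(PCIDevice *dev, unsigned int vector);

int msi_set_vector_notifiers(PCIDevice *dev,
                             MSIVectorUseNotifier use,
                             MSIVectorReleaseNotifier release);
/* the MSI core (e.g. its config-write path) would then invoke the notifiers
 * when the guest enables/disables MSI or rewrites address/data, instead of
 * each device model tracking the reasons for vector changes itself */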

Was on my todo list for a long time, but I never reached this item.

Jan






Re: [RFC PATCH 0/4] MSI affinity for assigned devices

2012-11-27 Thread Alex Williamson
On Wed, 2012-11-28 at 00:08 +0100, Jan Kiszka wrote:
 On 2012-11-27 23:00, Alex Williamson wrote:
  This is post-1.3 material, so I'll just post it as an RFC for now.
  
  MSI routing updates aren't currently handled by pci-assign or
  vfio-pci (when using KVM acceleration), which means that trying to
  set interrupt SMP affinity in the guest has no effect unless MSI is
  completely disabled and re-enabled.  This series fixes this for both
  device assignment backends using similar schemes.  We store the last
  MSIMessage programmed to KVM and do updates to the MSI route when it
  changes.  pci-assign takes a little bit of refactoring to make this
  happen cleanly.  Thanks,
 
 This should rather be done by implementing vector notifiers for MSI as
 well. That way the device model no longer has to track reasons for
 vector changes in an open-coded fashion, just like we already do for MSI-X.
 
 Was on my todo list for a long time, but I never reached this item.

MSI masking is optional and not many devices seem to support it.  What I
see with a linux guest is that it just overwrites the address/data while
MSI is enabled.  What were you thinking for notifiers? mask, unmask,
update?  I'm not sure I'm interested enough in this to add MSI vector
notifiers.  Thanks,

Alex





Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-27 Thread Marcelo Tosatti
On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.

The definition of stolen time is 'time during which the virtual CPU is
runnable to not running'. Overcommit is the main scenario which steal
time helps to detect.

Can you describe the 'capped' case?

  This can
 cause confusion for the end user.  To ease the confusion this patch set
 adds the idea of consigned (expected steal) time.  The host will separate
 the consigned time from the steal time.  The consignment limit passed to the
 host will be the amount of steal time expected within a fixed period of
 time.  Any other steal time accruing during that period will show as the
 traditional steal time.
 
 ---
 
 Michael Wolf (5):
   Alter the amount of steal time reported by the guest.
   Expand the steal time msr to also contain the consigned time.
   Add the code to send the consigned time from the host to the guest
   Add a timer to allow the separation of consigned from steal time.
   Add an ioctl to communicate the consign limit to the host.
 
 
  arch/x86/include/asm/kvm_host.h   |   11 +++
  arch/x86/include/asm/kvm_para.h   |3 +-
  arch/x86/include/asm/paravirt.h   |4 +--
  arch/x86/include/asm/paravirt_types.h |2 +
  arch/x86/kernel/kvm.c |8 ++---
  arch/x86/kernel/paravirt.c|4 +--
  arch/x86/kvm/x86.c|   50 
 -
  fs/proc/stat.c|9 +-
  include/linux/kernel_stat.h   |2 +
  include/linux/kvm_host.h  |2 +
  include/uapi/linux/kvm.h  |2 +
  kernel/sched/core.c   |   10 ++-
  kernel/sched/cputime.c|   21 +-
  kernel/sched/sched.h  |2 +
  virt/kvm/kvm_main.c   |7 +
  15 files changed, 120 insertions(+), 17 deletions(-)
 
 -- 
 Signature
 


Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction

2012-11-27 Thread Marcelo Tosatti
On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote:
 On 11/27/2012 06:41 AM, Marcelo Tosatti wrote:
 
 
  -  return false;
  +again:
  +  page_fault_count = ACCESS_ONCE(vcpu-kvm-arch.page_fault_count);
  +
  +  /*
  +   * if emulation was due to access to shadowed page table
  +   * and it failed try to unshadow page and re-enter the
  +   * guest to let CPU execute the instruction.
  +   */
  +  kvm_mmu_unprotect_page(vcpu-kvm, gpa_to_gfn(gpa));
  +  emulate = vcpu-arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false);
  
  Can you explain what is the objective here?
  
 
 Sure. :)
 
 The instruction emulation is caused by fault access on cr3. After unprotect
 the target page, we call vcpu-arch.mmu.page_fault to fix the mapping of cr3.
 if it return 1, mmu can not fix the mapping, we should report the error,
 otherwise it is good to return to guest and let it re-execute the instruction
 again.
 
 page_fault_count is used to avoid the race on other vcpus, since after we
 unprotect the target page, other cpu can enter page fault path and let the
 page be write-protected again.
 
 This way can help us to detect all the case that mmu can not be fixed.

How about recording the gfn numbers of the shadow pages that have been
shadowed in the current page-fault run? (That is cheap compared to
shadowing these pages.)

If the failed instruction emulation is a write to one of these gfns,
then fail.
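A rough sketch of that idea (all field and helper names here are made up for
illustration, not actual KVM code):

#define MAX_FAULT_GFNS 4	/* shadow pages shadowed in one page-fault run */

/* recorded from the shadow-page allocation path of the current fault */
static void record_shadowed_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	if (vcpu->arch.nr_fault_gfns < MAX_FAULT_GFNS)
		vcpu->arch.fault_gfns[vcpu->arch.nr_fault_gfns++] = gfn;
}

/* checked from reexecute_instruction(): give up instead of looping */
static bool wrote_shadowed_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	int i;

	for (i = 0; i < vcpu->arch.nr_fault_gfns; i++)
		if (vcpu->arch.fault_gfns[i] == gfn)
			return true;
	return false;
}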




Re: [PATCH 2/3] KVM: x86: let reexecute_instruction work for tdp

2012-11-27 Thread Marcelo Tosatti
On Tue, Nov 27, 2012 at 11:13:11AM +0800, Xiao Guangrong wrote:
  +static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long 
  cr2)
   {
  -  gpa_t gpa;
  +  gpa_t gpa = cr2;
 pfn_t pfn;
 
  -  if (tdp_enabled)
  +  if (!ACCESS_ONCE(vcpu-kvm-arch.indirect_shadow_pages))
 return false;
  
  How is indirect_shadow_pages protected? Why is ACCESS_ONCE() being used
  to read it?
 
 Hi Marcelo,
 
 It is protected by mmu-lock, since it is only changed while mmu-lock is held,
 and ACCESS_ONCE is used on the read path to avoid surprising compiler
 optimizations.

Please switch to mmu_lock protection, there is no reason to have access
to this variable locklessly - not performance critical.

For example, there is no use of barriers when modifying the variable.
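Something like this sketch is what the locked read would look like
(illustrative only; the field is the one added by the patch):

static bool have_indirect_shadow_pages(struct kvm_vcpu *vcpu)
{
	unsigned int n;

	/* read under mmu_lock instead of a lockless ACCESS_ONCE() */
	spin_lock(&vcpu->kvm->mmu_lock);
	n = vcpu->kvm->arch.indirect_shadow_pages;
	spin_unlock(&vcpu->kvm->mmu_lock);

	return n != 0;
}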



Networking latency - what to expect?

2012-11-27 Thread David Mohr

Hi,

we were investigating some performance issue that an application on our 
kvm VMs had and noticed that network latency is much worse in a VM 
compared to actual hardware.


Initially we were using virtio network cards, but not vhost-net. After 
enabling vhost the performance improved quite a bit, but is still only 
at around 50% of what the actual hardware can do.


The question is: what can we expect? CPU utilization was never higher
than 50% for kvm and 20% for vhost. Other details are below; please let
me know if you can recommend other tests or important details that I
could provide.


I ran the following tests using netperf -t UDP_RR (all results are 
transactions/sec):

* host -> host            19k
* vm -> its host          17k
* vm -> vm (same host)    22k
* vm -> vm (diff. hosts)   7k

Host: Debian squeeze using a 3.5.2 kernel
KVM:  1.1.2 (bpo)
Host CPU: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
VM:   Debian squeeze using a 3.2.20 kernel
Network:  e1000  gigabit switch
Command-line:
-cpu host -name testvm -m 12288 -smp 16 -pidfile 
/var/run/ganeti/kvm-hypervisor/pid/testvm -balloon virtio -daemonize 
-monitor 
unix:/var/run/ganeti/kvm-hypervisor/ctrl/testvm.monitor,server,nowait 
-serial 
unix:/var/run/ganeti/kvm-hypervisor/ctrl/testvm.serial,server,nowait 
-usbdevice tablet -vnc 127.0.0.1:5102 -netdev 
type=tap,id=netdev0,fd=8,vhost=on -device 
virtio-net-pci,mac=52:54:00:00:01:e6,netdev=netdev0 -netdev 
type=tap,id=netdev1,fd=9,vhost=on -device 
virtio-net-pci,mac=52:54:00:00:01:e7,netdev=netdev1 -qmp 
unix:/var/run/ganeti/kvm-hypervisor/ctrl/testvm.qmp,server,nowait -S


Thanks,
~David


Re: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM

2012-11-27 Thread Marcelo Tosatti
On Thu, Nov 22, 2012 at 12:51:59PM +0800, Dongxiao Xu wrote:
 The launch state is not a member in the VMCS area, use a separate
 variable (list) to store it instead.
 
 Signed-off-by: Dongxiao Xu dongxiao...@intel.com

1. What is the problem with keeping launched state in the VMCS?
Assuming there is a positive answer to the above:

2. Don't you have to change VMCS ID?

3. Can't it be kept somewhere else other than a list? Current scheme 
allows guest to allocate unlimited amounts of host memory.

4. What is the state of migration / nested vmx again? If vmcs12 is
migrated, this means launched state is not migrated anymore.

Patches 1-3 seem fine.



Re: [PATCH] Resend - Added x86/tsc_adjust.c to test the ia32_tsc_adjust functionality.

2012-11-27 Thread Marcelo Tosatti
Will,

1. Please check CPUID before using the IA32_TSC_ADJUST MSR, and exit the test
successfully if the CPUID bit is disabled.

2. Please test the implementation of the IA32_TSC_ADJUST MSR (functional test).
The vmexit.flat test can be used for the performance of MSR emulation.

Example

tsc1 = rdtsc();
wrmsr(IA32_TSC_ADJUST, 1000ull);
tsc2 = rdtsc();
if (tsc2 - tsc1 < 1000ull) {
        printf("failure ... should be ...");
        failure++;
}
...
write to tsc
check if IA32_TSC_ADJUST is adjusted accordingly

return !failure ? 0 : 1;
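A slightly fuller sketch of the two functional checks, in the style of the
test below and reusing its helpers (rdtsc/wrtsc/rdmsr/wrmsr); the thresholds
are illustrative and the CPUID.7.0:EBX[1] gate from item 1 is omitted:

#include "libcflat.h"
#include "processor.h"

#define IA32_TSC_ADJUST 0x3b

int main(void)
{
	u64 t1, t2, adj_before, adj_after;
	int failures = 0;

	/* writing IA32_TSC_ADJUST should move the TSC forward by the delta */
	t1 = rdtsc();
	wrmsr(IA32_TSC_ADJUST, 1000000ull);
	t2 = rdtsc();
	if (t2 - t1 < 1000000ull) {
		printf("FAIL: TSC did not reflect IA32_TSC_ADJUST write\n");
		failures++;
	}

	/* writing the TSC should be reflected back in IA32_TSC_ADJUST */
	adj_before = rdmsr(IA32_TSC_ADJUST);
	wrtsc(rdtsc() + 1000000ull);
	adj_after = rdmsr(IA32_TSC_ADJUST);
	if (adj_after == adj_before) {
		printf("FAIL: IA32_TSC_ADJUST did not follow TSC write\n");
		failures++;
	}

	return failures ? 1 : 0;
}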

Thanks

On Mon, Nov 26, 2012 at 10:44:14AM -0800, Will Auld wrote:
 Signed-off-by: Will Auld will.a...@intel.com
 ---
  config-x86-common.mak |  5 -
  x86/tsc_adjust.c  | 43 +++
  2 files changed, 47 insertions(+), 1 deletion(-)
  create mode 100644 x86/tsc_adjust.c
 
 diff --git a/config-x86-common.mak b/config-x86-common.mak
 index c76cd11..47a9056 100644
 --- a/config-x86-common.mak
 +++ b/config-x86-common.mak
 @@ -34,7 +34,8 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat 
 \
 $(TEST_DIR)/realmode.flat $(TEST_DIR)/msr.flat \
 $(TEST_DIR)/hypercall.flat $(TEST_DIR)/sieve.flat \
 $(TEST_DIR)/kvmclock_test.flat  $(TEST_DIR)/eventinj.flat \
 -   $(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat 
 $(TEST_DIR)/asyncpf.flat
 +   $(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat \
 +$(TEST_DIR)/tsc_adjust.flat $(TEST_DIR)/asyncpf.flat
  
  ifdef API
  tests-common += api/api-sample
 @@ -64,6 +65,8 @@ $(TEST_DIR)/port80.elf: $(cstart.o) $(TEST_DIR)/port80.o
  
  $(TEST_DIR)/tsc.elf: $(cstart.o) $(TEST_DIR)/tsc.o
  
 +$(TEST_DIR)/tsc_adjust.elf: $(cstart.o) $(TEST_DIR)/tsc_adjust.o
 +
  $(TEST_DIR)/apic.elf: $(cstart.o) $(TEST_DIR)/apic.o
  
  $(TEST_DIR)/realmode.elf: $(TEST_DIR)/realmode.o
 diff --git a/x86/tsc_adjust.c b/x86/tsc_adjust.c
 new file mode 100644
 index 000..bcb8982
 --- /dev/null
 +++ b/x86/tsc_adjust.c
 @@ -0,0 +1,43 @@
 +#include libcflat.h
 +#include processor.h
 +
 +#define IA32_TSC_ADJUST 0x3b
 +
 +int main()
 +{
 + u64 t1, t2, t3, t4, t5;
 + u64 lat;
 +
 + t3 = 0x0;
 +
 + t1 = rdtsc();
 + wrmsr(IA32_TSC_ADJUST, t3);
 + t2 = rdtsc();
 + lat = t2 - t1;
 + printf(rdtsc/wrmsr/rdtsc latency %lld\n, lat);
 + printf(Initial rdtsc: %lld\n, t2);
 +
 + t1 = rdmsr(IA32_TSC_ADJUST);
 + printf(Initial rdmsr IA32_TSC_ADJUST: %lld\n, t1);
 + 
 + t5 = 1000ull;
 + wrtsc(t5);
 + t1 = rdmsr(IA32_TSC_ADJUST);
 + printf(wrtsc %lld, rdmsr IA32_TSC_ADJUST: %lld\n, t5, t1);
 +
 + wrmsr(IA32_TSC_ADJUST, t3);
 + t2 = rdtsc();
 + t1 = rdmsr(IA32_TSC_ADJUST);
 + printf( wrmsr IA32_TSC_ADJUST %lld, rdmsr IA32_TSC_ADJUST: %lld, 
 rdtsc: %lld\n, t3, t1, t2);
 + 
 + t3 = 0x;
 + t4 = rdtsc();
 + wrmsr(IA32_TSC_ADJUST, t3);
 + t2 = rdtsc();
 + t1 = rdmsr(IA32_TSC_ADJUST);
 + printf( wrmsr IA32_TSC_ADJUST %lld, rdmsr IA32_TSC_ADJUST: %lld, 
 rdtsc: %lld\n, t3, t1, t2);
 + lat = t2 - t4;
 + printf(rdtsc/wrmsr/rdtsc latency %lld\n, lat);
 + 
 + return 0;
 +}
 -- 
 1.8.0.rc0
 
 
 


Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-27 Thread Marcelo Tosatti
On Tue, Nov 27, 2012 at 09:24:42PM -0200, Marcelo Tosatti wrote:
 On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
  In the case of where you have a system that is running in a
  capped or overcommitted environment the user may see steal time
  being reported in accounting tools such as top or vmstat.
 
 The definition of stolen time is 'time during which the virtual CPU is
 runnable to not running'. Overcommit is the main scenario which steal
 time helps to detect.

Meant 'runnable but not running'.


 Can you describe the 'capped' case?



Re: [PATCH V5 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support

2012-11-27 Thread Marcelo Tosatti

Hi Will,

On Tue, Nov 27, 2012 at 11:09:00AM -0800, Will Auld wrote:
 CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported
 
 The basic design is to emulate the MSR by allowing reads and writes to a
 guest vcpu specific location to store the value of the emulated MSR while
 adding the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST
 value will be included in all reads of the TSC MSR, whether through rdmsr or
 rdtsc. This is of course as long as the "use TSC counter offsetting"
 VM-execution control is enabled, as well as the IA32_TSC_ADJUST control.
 
 However, because hardware will only return the TSC + IA32_TSC_ADJUST + vmcs
 tsc_offset for a guest process when it does an rdtsc (with the correct
 settings), the value of our virtualized IA32_TSC_ADJUST must be stored in
 one of these three locations. The argument against storing it in the actual
 MSR is performance: it is likely to be seldom used, while the save/restore
 would be required on every transition. IA32_TSC_ADJUST was created as a way
 to solve some issues with writing the TSC itself, so that is not an option
 either. The remaining option, defined above as our solution, has the problem
 of returning incorrect vmcs tsc_offset values (unless we intercept and fix,
 not done here) as mentioned above. However, more problematic is that storing
 the data in vmcs tsc_offset has a different semantic effect on the system
 than using the actual MSR. This is illustrated in the following example:
 the hypervisor sets IA32_TSC_ADJUST, then the guest sets it, and a guest
 process performs an rdtsc. In this case the guest process will get TSC +
 IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset including IA32_TSC_ADJUST_guest.
 While the total system semantics changed, the semantics as seen by the guest
 do not, and hence this will not cause a problem.
 
 Signed-off-by: Will Auld will.a...@intel.com
 ---
  arch/x86/include/asm/cpufeature.h |  1 +
  arch/x86/include/asm/kvm_host.h   |  3 +++
  arch/x86/include/asm/msr-index.h  |  1 +
  arch/x86/kvm/cpuid.c  |  2 ++
  arch/x86/kvm/cpuid.h  |  8 
  arch/x86/kvm/svm.c|  7 +++
  arch/x86/kvm/vmx.c|  9 +
  arch/x86/kvm/x86.c| 22 ++
  8 files changed, 53 insertions(+)
 
 diff --git a/arch/x86/include/asm/cpufeature.h 
 b/arch/x86/include/asm/cpufeature.h
 index 6b7ee5f..e574d81 100644
 --- a/arch/x86/include/asm/cpufeature.h
 +++ b/arch/x86/include/asm/cpufeature.h
 @@ -199,6 +199,7 @@
  
  /* Intel-defined CPU features, CPUID level 0x0007:0 (ebx), word 9 */
  #define X86_FEATURE_FSGSBASE (9*32+ 0) /* {RD/WR}{FS/GS}BASE instructions*/
 +#define X86_FEATURE_TSC_ADJUST  (9*32+ 1) /* TSC adjustment MSR 0x3b */
  #define X86_FEATURE_BMI1 (9*32+ 3) /* 1st group bit manipulation 
 extensions */
  #define X86_FEATURE_HLE  (9*32+ 4) /* Hardware Lock Elision */
  #define X86_FEATURE_AVX2 (9*32+ 5) /* AVX2 instructions */
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index da34027..cf8c7e0 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -442,6 +442,8 @@ struct kvm_vcpu_arch {
   u32 virtual_tsc_mult;
   u32 virtual_tsc_khz;
  
 + s64 ia32_tsc_adjust_msr;
 +
   atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
   unsigned nmi_pending; /* NMI queued after currently running handler */
   bool nmi_injected;/* Trying to inject an NMI this entry */
 @@ -690,6 +692,7 @@ struct kvm_x86_ops {
   bool (*has_wbinvd_exit)(void);
  
   void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool 
 scale);
 + u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
   void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
  
   u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
 diff --git a/arch/x86/include/asm/msr-index.h 
 b/arch/x86/include/asm/msr-index.h
 index 957ec87..6486569 100644
 --- a/arch/x86/include/asm/msr-index.h
 +++ b/arch/x86/include/asm/msr-index.h
 @@ -231,6 +231,7 @@
  #define MSR_IA32_EBL_CR_POWERON  0x002a
  #define MSR_EBC_FREQUENCY_ID 0x002c
  #define MSR_IA32_FEATURE_CONTROL0x003a
 +#define MSR_IA32_TSC_ADJUST 0x003b
  
  #define FEATURE_CONTROL_LOCKED   (10)
  #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX (11)
 diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
 index 0595f13..e817bac 100644
 --- a/arch/x86/kvm/cpuid.c
 +++ b/arch/x86/kvm/cpuid.c
 @@ -320,6 +320,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
 u32 function,
   if (index == 0) {
   entry-ebx = kvm_supported_word9_x86_features;
   cpuid_mask(entry-ebx, 9);
 + // TSC_ADJUST is emulated 
 + entry-ebx |= F(TSC_ADJUST);
   } else
 

Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case

2012-11-27 Thread Marcelo Tosatti

Don't understand the reasoning behind why 3 is a good choice.

On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
 From: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 
 yield_to returns -ESRCH when the run queue length of both the source
 and the target of yield_to is one. When we see three successive
 failures of yield_to we assume we are in a potential undercommit case
 and abort from the PLE handler.
 The assumption is backed by the low probability of a wrong decision
 even for worst case scenarios such as an average runqueue length
 between 1 and 2.
 
 Note that we do not update the last boosted vcpu in failure cases.
 Thanks to Avi for raising the question about aborting after the first failure of yield_to.
 
 Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  virt/kvm/kvm_main.c |   26 --
  1 file changed, 16 insertions(+), 10 deletions(-)
 
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index be70035..053f494 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
  {
   struct pid *pid;
   struct task_struct *task = NULL;
 + bool ret = false;
  
   rcu_read_lock();
   pid = rcu_dereference(target-pid);
 @@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
   task = get_pid_task(target-pid, PIDTYPE_PID);
   rcu_read_unlock();
   if (!task)
 - return false;
 + return ret;
   if (task-flags  PF_VCPU) {
   put_task_struct(task);
 - return false;
 - }
 - if (yield_to(task, 1)) {
 - put_task_struct(task);
 - return true;
 + return ret;
   }
 + ret = yield_to(task, 1);
   put_task_struct(task);
 - return false;
 +
 + return ret;
  }
  EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
  
 @@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct 
 kvm_vcpu *vcpu)
   return eligible;
  }
  #endif
 +
  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
  {
   struct kvm *kvm = me-kvm;
   struct kvm_vcpu *vcpu;
   int last_boosted_vcpu = me-kvm-last_boosted_vcpu;
   int yielded = 0;
 + int try = 3;
   int pass;
   int i;
  
 @@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
* VCPU is holding the lock that we need and will release it.
* We approximate round-robin by starting at the last boosted VCPU.
*/
 - for (pass = 0; pass < 2 && !yielded; pass++) {
 + for (pass = 0; pass < 2 && !yielded && try; pass++) {
   kvm_for_each_vcpu(i, vcpu, kvm) {
   if (!pass  i = last_boosted_vcpu) {
   i = last_boosted_vcpu;
 @@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
   continue;
   if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
   continue;
 - if (kvm_vcpu_yield_to(vcpu)) {
 +
 + yielded = kvm_vcpu_yield_to(vcpu);
 + if (yielded > 0) {
   kvm-last_boosted_vcpu = i;
 - yielded = 1;
   break;
 + } else if (yielded < 0) {
 + try--;
 + if (!try)
 + break;
   }
   }
   }
 


Re: [kvm:queue 23/25] arch/s390/kvm/kvm-s390.c:358:6: error: conflicting types for 'kvm_arch_vcpu_postcreate'

2012-11-27 Thread Marcelo Tosatti
On Tue, Nov 27, 2012 at 10:56:50AM +0800, Fengguang Wu wrote:
 
 Hi Marcelo,
 
 FYI, kernel build failed on
 
 tree:   git://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
 head:   fc1ddea318fa2c1ac3d496d8653ca4bc9b66e679
 commit: 438d76a60e7be59a558f8712a47565fa8258d17d [23/25] KVM: x86: add 
 kvm_arch_vcpu_postcreate callback, move TSC initialization
 config: make ARCH=s390 allmodconfig

Fengguang Wu,

Thanks. Repository has been updated, it would be good
if you can rerun the tests.



RE: [PATCH V5 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM support

2012-11-27 Thread Auld, Will
Thanks Marcelo, 

I'll address these items.

- Please rebase against queue branch on kvm.git.

I am not sure how to do this. The repository I have been working against is:

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

I assume that I need to change this in some way but I am not sure what I need 
to do. Will you explain or give me the needed path?

Thanks,

Will

 -Original Message-
 From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
 Sent: Tuesday, November 27, 2012 5:00 PM
 To: Auld, Will
 Cc: kvm@vger.kernel.org; Dugger, Donald D; Liu, Jinsong; Zhang,
 Xiantao; a...@redhat.com; qemu-devel; Gleb
 Subject: Re: [PATCH V5 2/2] Enabling IA32_TSC_ADJUST for KVM guest VM
 support
 
 
 Hi Will,
 
 On Tue, Nov 27, 2012 at 11:09:00AM -0800, Will Auld wrote:
  CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported
 
  Basic design is to emulate the MSR by allowing reads and writes to a
  guest vcpu specific location to store the value of the emulated MSR
  while adding the value to the vmcs tsc_offset. In this way the
  IA32_TSC_ADJUST value will be included in all reads to the TSC MSR
  whether through rdmsr or rdtsc. This is of course as long as the use
  TSC counter offsetting VM-execution control is enabled as well as
 the IA32_TSC_ADJUST control.
 
  However, because hardware will only return the TSC + IA32_TSC_ADJUST
 +
  vmcs tsc_offset for a guest process when it does an rdtsc (with the
  correct
  settings) the value of our virtualized IA32_TSC_ADJUST must be stored
  in one of these three locations. The argument against storing it in
  the actual MSR is performance. This is likely to be seldom used while
  the save/restore is required on every transition. IA32_TSC_ADJUST was
  created as a way to solve some issues with writing TSC itself so that
 is not an option either.
  The remaining option, defined above as our solution has the problem
 of
  returning incorrect vmcs tsc_offset values (unless we intercept and
  fix, not done here) as mentioned above. However, more problematic is
  that storing the data in vmcs tsc_offset will have a different
  semantic effect on the system than does using the actual MSR. This is
 illustrated in the following example:
  The hypervisor sets IA32_TSC_ADJUST, then the guest sets it and a
  guest process performs an rdtsc. In this case the guest process will
  get TSC + IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset including
 IA32_TSC_ADJUST_guest.
  While the total system semantics changed the semantics as seen by the
  guest do not and hence this will not cause a problem.
 
  Signed-off-by: Will Auld will.a...@intel.com
  ---
   arch/x86/include/asm/cpufeature.h |  1 +
   arch/x86/include/asm/kvm_host.h   |  3 +++
   arch/x86/include/asm/msr-index.h  |  1 +
   arch/x86/kvm/cpuid.c  |  2 ++
   arch/x86/kvm/cpuid.h  |  8 
   arch/x86/kvm/svm.c|  7 +++
   arch/x86/kvm/vmx.c|  9 +
   arch/x86/kvm/x86.c| 22 ++
   8 files changed, 53 insertions(+)
 
  diff --git a/arch/x86/include/asm/cpufeature.h
  b/arch/x86/include/asm/cpufeature.h
  index 6b7ee5f..e574d81 100644
  --- a/arch/x86/include/asm/cpufeature.h
  +++ b/arch/x86/include/asm/cpufeature.h
  @@ -199,6 +199,7 @@
 
   /* Intel-defined CPU features, CPUID level 0x0007:0 (ebx), word
 9 */
   #define X86_FEATURE_FSGSBASE   (9*32+ 0) /* {RD/WR}{FS/GS}BASE
 instructions*/
  +#define X86_FEATURE_TSC_ADJUST  (9*32+ 1) /* TSC adjustment MSR 0x3b
  +*/
   #define X86_FEATURE_BMI1   (9*32+ 3) /* 1st group bit manipulation
 extensions */
   #define X86_FEATURE_HLE(9*32+ 4) /* Hardware Lock Elision
 */
   #define X86_FEATURE_AVX2   (9*32+ 5) /* AVX2 instructions */
  diff --git a/arch/x86/include/asm/kvm_host.h
  b/arch/x86/include/asm/kvm_host.h index da34027..cf8c7e0 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -442,6 +442,8 @@ struct kvm_vcpu_arch {
  u32 virtual_tsc_mult;
  u32 virtual_tsc_khz;
 
  +   s64 ia32_tsc_adjust_msr;
  +
  atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
  unsigned nmi_pending; /* NMI queued after currently running
 handler */
  bool nmi_injected;/* Trying to inject an NMI this entry */
  @@ -690,6 +692,7 @@ struct kvm_x86_ops {
  bool (*has_wbinvd_exit)(void);
 
  void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool
  scale);
  +   u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
  void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
  u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
  diff --git a/arch/x86/include/asm/msr-index.h
  b/arch/x86/include/asm/msr-index.h
  index 957ec87..6486569 100644
  --- a/arch/x86/include/asm/msr-index.h
  +++ b/arch/x86/include/asm/msr-index.h
  @@ -231,6 +231,7 @@
   #define MSR_IA32_EBL_CR_POWERON0x002a
   #define MSR_EBC_FREQUENCY_ID  

Re: [PATCH 2/3] KVM: x86: let reexecute_instruction work for tdp

2012-11-27 Thread Xiao Guangrong
On 11/28/2012 07:32 AM, Marcelo Tosatti wrote:
 On Tue, Nov 27, 2012 at 11:13:11AM +0800, Xiao Guangrong wrote:
 +static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long 
 cr2)
  {
 -  gpa_t gpa;
 +  gpa_t gpa = cr2;
pfn_t pfn;

 -  if (tdp_enabled)
 +  if (!ACCESS_ONCE(vcpu-kvm-arch.indirect_shadow_pages))
return false;

 How is indirect_shadow_pages protected? Why is ACCESS_ONCE() being used
 to read it?

 Hi Marcelo,

 It is protected by mmu-lock, since it is only changed while mmu-lock is held,
 and ACCESS_ONCE is used on the read path to avoid surprising compiler
 optimizations.
 
 Please switch to mmu_lock protection, there is no reason to have access
 to this variable locklessly - not performance critical.
 
 For example, there is no use of barriers when modifying the variable.

This is not bad; the worst case is that the direct mmu fails to unprotect the
shadow pages (it sees indirect_shadow_pages == 0 while there are still shadow
pages being shadowed). After re-entering the guest we will go into
reexecute_instruction again, and it will then remove the shadow pages.

But I do not have a strong opinion on it; I respect your idea! :)



Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction

2012-11-27 Thread Xiao Guangrong
On 11/28/2012 07:42 AM, Marcelo Tosatti wrote:
 On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote:
 On 11/27/2012 06:41 AM, Marcelo Tosatti wrote:


 -  return false;
 +again:
 +  page_fault_count = ACCESS_ONCE(vcpu-kvm-arch.page_fault_count);
 +
 +  /*
 +   * if emulation was due to access to shadowed page table
 +   * and it failed try to unshadow page and re-enter the
 +   * guest to let CPU execute the instruction.
 +   */
 +  kvm_mmu_unprotect_page(vcpu-kvm, gpa_to_gfn(gpa));
 +  emulate = vcpu-arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false);

 Can you explain what is the objective here?


 Sure. :)

 The instruction emulation is caused by fault access on cr3. After unprotect
 the target page, we call vcpu-arch.mmu.page_fault to fix the mapping of cr3.
 if it return 1, mmu can not fix the mapping, we should report the error,
 otherwise it is good to return to guest and let it re-execute the instruction
 again.

 page_fault_count is used to avoid the race on other vcpus, since after we
 unprotect the target page, other cpu can enter page fault path and let the
 page be write-protected again.

 This way can help us to detect all the case that mmu can not be fixed.
 
 How about recording the gfn number for shadow pages that have been
 shadowed in the current pagefault run? (which is cheap, compared to
 shadowing these pages).
 

Marcelo,

Thanks for your idea!

If we use this approach, we need to cache the gfns in the vcpu struct.

Actually, I have considered an approach like yours, that is, getting
all page tables of the guest and then checking whether the target gfn
is among the page-table gfns. But we would need to change
mmu->gva_to_pfn or introduce a new method to get the page tables of
the guest.

But reexecute_instruction is really an unlikely path; both of these
ways can make the mmu code more complex and/or introduce unnecessary
overhead for the common cases.

It looks like the way used in this patch is the simplest and does no
harm to the core code.





Re: [PATCHv4 0/2] kvm: direct msix injection

2012-11-27 Thread Alex Williamson
On Wed, 2012-11-21 at 21:26 +0200, Michael S. Tsirkin wrote:
 On Wed, Oct 17, 2012 at 06:05:53PM +0200, Michael S. Tsirkin wrote:
  We can deliver certain interrupts, notably MSIX,
  from atomic context.
  Here's an untested patch to do this (compiled only).
  
  Changes from v2:
  Don't inject broadcast interrupts directly
  Changes from v1:
  Tried to address comments from v1, except unifying
  with kvm_set_irq: passing flags to it looks too ugly.
  Added a comment.
  
  Jan, you said you can test this?
 
 I have tested this with some networking workloads
 and this patchset seems to work fine.
 My setup isn't a good fit for benchmarking device
 assignment though.
Alex, could you please verify that this solves the
 latency issue that you sometimes observe?
 With this patchset device assignment latency should be
 as fast as vfio.

Yep, that seems to cover the gap.  My environment is too noisy to
declare an absolute winner, but pci-assign and vfio-pci are now very,
very similar under netperf TCP_RR with these patches.  Thanks,

Alex



Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case

2012-11-27 Thread Raghavendra K T

On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:


Don't understand the reasoning behind why 3 is a good choice.


Here is where I came from (explaining from scratch for completeness,
forgive me :)).

In moderate overcommit, we can falsely exit from the PLE handler even
when a preempted task of the same VM is waiting on other cpus. To
reduce this problem, we try a few times before exiting.
The problem boils down to: what is the probability that we exit the
PLE handler even though more than one task is runnable on other cpus?
The theoretical worst case should be around 1.5x overcommit (as also
pointed out by Andrew Theurer). [But the practical worst case may be
around 2x-3x overcommit, as indicated by the results for the patch
series.]

So if p is the probability of finding rq length one on a particular cpu,
and we do n tries, then the probability of exiting the PLE handler is:

 p^(n+1) [ because we would have come across one source with rq length
 1 and n target cpu rqs with length 1 ]

So, for 1.5x overcommit (p = 1/2):

num tries: probability of aborting ple handler (1.5x overcommit)
 1  1/4
 2  1/8
 3  1/16

We can reduce this probability further with more tries, but the problem
is the overhead.
Also, if we have tried three times, that means we have iterated over 3
good eligible vcpus along with many non-eligible candidates. In the
worst case, if we iterate over all the vcpus, we hurt the 1x case and
overcommit performance takes a hit [as in the results].

I have tried num_tries = 1, 2, 3 and n already (not 4 yet), so I
concluded 3 is enough.

In fact I have also run kernbench and hackbench, which are giving 5-20%
improvement.

[As a side note, I also considered num_tries = f(n) =
ceil(log(num_online_cpus)/2), but I thought the calculation is too much
overhead and there is probably no point in making it dependent on the
number of online cpus.]

Please let me know if you are happy with this rationale, or correct me
if you foresee some problem. (In fact, Avi's and Rik's concern about
false exiting made me arrive at the 'try' logic, which I did not have
earlier.)

I am currently running the 1.5x overcommit case and will post the
result.



On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

yield_to returns -ESRCH when the run queue length of both the source
and the target of yield_to is one. When we see three successive
failures of yield_to we assume we are in a potential undercommit case
and abort from the PLE handler.
The assumption is backed by the low probability of a wrong decision
even for worst case scenarios such as an average runqueue length
between 1 and 2.

Note that we do not update the last boosted vcpu in failure cases.
Thanks to Avi for raising the question about aborting after the first failure of yield_to.

Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com

[...]



Re: [PATCH] KVM: MMU: lazily drop large spte

2012-11-27 Thread Xiao Guangrong
On 11/18/2012 11:00 AM, Marcelo Tosatti wrote:
map gfn 4?  See corrected step 7 above.

 Ah, this is a real bug, and unfortunately, it exists in current
 code. I will make a separate patchset to fix it. Thank you, Marcelo!
 
 Is it? Hum..
 
 Anyway, it would be great if you can write a testcase (should be similar
 in size to rmap_chain).

Marcelo, is this patch acceptable?



Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task

2012-11-27 Thread Raghavendra K T

On 11/27/2012 07:34 PM, Andrew Theurer wrote:

On Tue, 2012-11-27 at 16:00 +0530, Raghavendra K T wrote:

On 11/26/2012 07:05 PM, Andrew Jones wrote:

On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:

From: Peter Zijlstra pet...@infradead.org

In case of undercommitted scenarios, especially in large guests
yield_to overhead is significantly high. when run queue length of
source and target is one, take an opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of PLE handler.

(History: Raghavendra initially worked on break out of kvm ple handler upon
   seeing source runqueue length = 1, but it had to export rq length).
   Peter came up with the elegant idea of return -ESRCH in scheduler core.

Signed-off-by: Peter Zijlstra pet...@infradead.org
Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---

   kernel/sched/core.c |   25 +++--
   1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
* It's the caller's job to ensure that the target task struct
* can't go away on us before we can do any checks.
*
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ * true (>0) if we indeed boosted the target task.
+ * false (0) if we failed to boost the target.
+ * -ESRCH if there's no task to yield to.
*/
   bool __sched yield_to(struct task_struct *p, bool preempt)
   {
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool 
preempt)

   again:
p_rq = task_rq(p);
+   /*
+* If we're the only runnable task on the rq and target rq also
+* has only one task, there's absolutely no point in yielding.
+*/
+   if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+   yielded = -ESRCH;
+   goto out_irq;
+   }
+
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
}

if (!curr-sched_class-yield_to_task)
-   goto out;
+   goto out_unlock;

if (curr-sched_class != p-sched_class)
-   goto out;
+   goto out_unlock;

if (task_running(p_rq, p) || p-state)
-   goto out;
+   goto out_unlock;

yielded = curr-sched_class-yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4329,11 +4341,12 @@ again:
resched_task(p_rq-curr);
}

-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);

-   if (yielded)
+   if (yielded > 0)
schedule();

return yielded;



Acked-by: Andrew Jones drjo...@redhat.com



Thank you Drew.

Marcelo Gleb.. Please let me know if you have comments / concerns on the
patches..

Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios
especially for large guests where we do have overhead of vcpu iteration
of ple handler..


I agree, looks fine for undercommit scenarios.  I do wonder what happens
with 1.5x overcommit, where we might see 1/2 the host cpus with runqueue
of 2 and 1/2 of the host cpus with a runqueue of 1.  Even with this
change that scenario still might be fine, but it would be nice to see a
comparison.



Hi Andrew, yes, thanks for pointing out the 1.5x case, which should be
the theoretical worst case.
I tried with two 24-vcpu guests on the same 32-core machine. Here is
the result:

Ebizzy (rec/sec, higher is better)
x base
+ patched
    N      Avg       Stddev
x  10   2688.6    347.55917
+  10   2707.6    260.93728

improvement 0.706%

dbench (Throughput MB/sec, higher is better)
x base
+ patched
    N      Avg       Stddev
x  10   3164.712  140.24468
+  10   3244.021  185.92434

Improvement 2.5%

So there is no significant improvement / degradation seen at 1.5x.



[PATCH] vfio powerpc: implemented IOMMU driver for VFIO

2012-11-27 Thread Alexey Kardashevskiy
VFIO implements platform independent stuff such as
a PCI driver, BAR access (via read/write on a file descriptor
or direct mapping when possible) and IRQ signaling.

The platform dependent part includes IOMMU initialization
and handling. This patch implements an IOMMU driver for VFIO
which does mapping/unmapping pages for the guest IO and
provides information about DMA window (required by a POWERPC
guest).

The counterpart in QEMU is required to support this functionality.

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 drivers/vfio/Kconfig|6 +
 drivers/vfio/Makefile   |1 +
 drivers/vfio/vfio_iommu_spapr_tce.c |  332 +++
 include/linux/vfio.h|   33 
 4 files changed, 372 insertions(+)
 create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 7cd5dec..b464687 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
depends on VFIO
default n
 
+config VFIO_IOMMU_SPAPR_TCE
+   tristate
+   depends on VFIO && SPAPR_TCE_IOMMU
+   default n
+
 menuconfig VFIO
tristate VFIO Non-Privileged userspace driver framework
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if X86
+   select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
help
  VFIO provides a framework for secure userspace device drivers.
  See Documentation/vfio.txt for more details.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 2398d4a..72bfabc 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_VFIO) += vfio.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
+obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_PCI) += pci/
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
new file mode 100644
index 000..b98770e
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -0,0 +1,332 @@
+/*
+ * VFIO: IOMMU DMA mapping support for TCE on POWER
+ *
+ * Copyright (C) 2012 IBM Corp.  All rights reserved.
+ * Author: Alexey Kardashevskiy a...@ozlabs.ru
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio_iommu_type1.c:
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ * Author: Alex Williamson alex.william...@redhat.com
+ */
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/err.h>
+#include <linux/vfio.h>
+#include <asm/iommu.h>
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "a...@ozlabs.ru"
+#define DRIVER_DESC     "VFIO IOMMU SPAPR TCE"
+
+static void tce_iommu_detach_group(void *iommu_data,
+   struct iommu_group *iommu_group);
+
+/*
+ * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
+ */
+
+/*
+ * This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU
+ */
+
+#define NPAGE_TO_SIZE(npage)	((size_t)(npage) << PAGE_SHIFT)
+
+struct vwork {
+	struct mm_struct	*mm;
+	long			npage;
+	struct work_struct	work;
+};
+
+/* delayed decrement/increment for locked_vm */
+static void lock_acct_bg(struct work_struct *work)
+{
+	struct vwork *vwork = container_of(work, struct vwork, work);
+	struct mm_struct *mm;
+
+	mm = vwork->mm;
+	down_write(&mm->mmap_sem);
+	mm->locked_vm += vwork->npage;
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+	kfree(vwork);
+}
+
+static void lock_acct(long npage)
+{
+	struct vwork *vwork;
+	struct mm_struct *mm;
+
+	if (!current->mm)
+		return; /* process exited */
+
+	if (down_write_trylock(&current->mm->mmap_sem)) {
+		current->mm->locked_vm += npage;
+		up_write(&current->mm->mmap_sem);
+		return;
+	}
+
+	/*
+	 * Couldn't get mmap_sem lock, so must setup to update
+	 * mm->locked_vm later. If locked_vm were atomic, we
+	 * wouldn't need this silliness
+	 */
+	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
+	if (!vwork)
+		return;
+	mm = get_task_mm(current);
+	if (!mm) {
+		kfree(vwork);
+		return;
+	}
+	INIT_WORK(&vwork->work, lock_acct_bg);
+	vwork->mm = mm;
+	vwork->npage = npage;
+	schedule_work(&vwork->work);
+}
+
+/*
+ * The container descriptor supports only a single group per container.
+ * Required by the API as the container is not supplied with the IOMMU group
+ * at the moment of initialization.
+ */
+struct tce_container {
+   struct mutex lock;
+   struct iommu_table *tbl;
+};
+
+static void