Re: [PATCH] virtio-balloon spec: provide a version of the silent deflate feature that works

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 07:50:13AM +0200, Paolo Bonzini wrote:
 Il 09/09/2012 00:22, Michael S. Tsirkin ha scritto:
  Almost.  One is the guest, if really needed, can tell the host of
  pages.  If not negotiated, and the host does not support it, the host
  must break the guest (e.g. fail to offer any virtqueues).
  
  There is no way in spec to break the guest.
  You can not fail to offer virtqueues.
 
 You can always return 0 for the first queue.

I don't think guest drivers recover gracefully from this.
Do they?

  Besides, there is no guarantee that virtqueue setup
  happens after feature negotiation.
 
 It is the only way that makes sense though (unless the guest would write
 0 for its features).
  Should we change that?

Not sure.  This was not always the case.  Further
setup can fail with e.g. ENOMEM, and the
driver could retry with a set of more conservative features.

I do think it would be nice to add a generic way for device to
notify guest about an internal failure.
This can only happen after DRIVER_OK status is written though,
and since existing drivers do not expect such failure, it might
be too late.

  The other is the guest, though, would prefer not to do so.  It is
  different because the guest can proceed in a fallback mode even if the
  host doesn't offer it.
  
  I think I get what your proposed SILENT means; what I do not get
  is the motivation. It looks like a premature optimization to me.
 
 The motivation is to let the driver choose between two behaviors: the
 current one where ballooning is only done on request, and a more
 aggressive one.

Yes but why is being silent any good? Optimization?
Any data to show that it will help some workload?

...

  OK so TELL says *when* to notify host, SILENT if set allows guest
  to skip leak notifications? In this case TELL should just be ignored
  when SILENT is set.
 
 Yeah, that was my first idea.  However, there are existing drivers that
 ignore SILENT, so that would not be 100% exact.

Not sure I follow the logic.
They don't ack SILENT so that would be 100% exact.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv4] virtio-spec: virtio network device multiqueue support

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 11:42:25AM +0930, Rusty Russell wrote:
 OK, I read the spec (pasted below for ease of reading), but I'm still
 confused over how this will work.
 
 I thought normal net drivers have the hardware provide an rxhash for
 each packet, and we map that to CPU to queue the packet on[1].  We hope
 that the receiving process migrates to that CPU, so xmit queue
 matches.

This only works sometimes.  For example it's common to pin netperf to a
cpu to get consistent performance.  Proper hardware must obey what
applications want it to do, not the other way around.

 For virtio this would mean a new per-packet rxhash value, right?
 
 Why are we doing something different?  What am I missing?
 
 Thanks,
 Rusty.
 [1] Everything I Know About Networking I Learned From LWN:
 https://lwn.net/Articles/362339/

I think you missed this:

"Some network interfaces can help with the distribution of incoming
packets; they have multiple receive queues and multiple interrupt lines.
Others, though, are equipped with a single queue, meaning that the
driver for that hardware must deal with all incoming packets in a
single, serialized stream. Parallelizing such a stream requires some
intelligence on the part of the host operating system."

In other words, RPS is a hack to speed up networking on cheapo
hardware; this is one of the reasons it is off by default.
Good hardware has multiple receive queues.
We can implement a good one so we do not need RPS.

Also not all guest OS-es support RPS.

Does this clarify?

 ---
 Transmit Packet Steering
 
 When the VIRTIO_NET_F_MULTIQUEUE feature bit is negotiated, the guest can use
 any of multiple configured transmit queues to transmit a given packet. To avoid
 packet reordering by the device (which generally leads to performance
 degradation), the driver should attempt to utilize the same transmit virtqueue for
 all packets of a given transmit flow. For bi-directional protocols (in 
 practice, TCP), a given network connection can utilize both transmit and 
 receive queues. For best performance, packets from a single connection should 
 utilize the paired transmit and receive queues from the same virtqueue pair; 
 for example both transmitqN and receiveqN. This rule makes it possible to 
 optimize processing on the device side, but this is not a hard requirement: 
 devices should function correctly even when this rule is not followed.
 
 The driver selects an active steering rule using the VIRTIO_NET_CTRL_STEERING
 command (this controls which virtqueue is selected for a given packet for
 receive, and notifies the device which virtqueues are about to be used for
 transmit).
 
 This command accepts a single out argument in the following format:
 
 #define VIRTIO_NET_CTRL_STEERING   4
 
 The field rule specifies the function used to select transmit virtqueue for a 
 given packet; the field param makes it possible to pass an extra parameter if 
 appropriate. When rule is set to VIRTIO_NET_CTRL_STEERING_SINGLE (the default),
 all packets are steered to the default virtqueue transmitq and param is
 unused. When rule is set to VIRTIO_NET_CTRL_STEERING_RX_FOLLOWS_TX, packets
 are steered by the driver to the first N=(param+1) multiqueue virtqueues
 transmitq1...transmitqN; the default transmitq is unused. The driver must have
 configured all these (param+1) virtqueues beforehand.
 
 Supported steering rules can be added and removed in the future. The driver
 should check that the request to change the steering rule was successful by
 checking the ack values of the command. As selecting a specific steering rule
 is an optimization feature, drivers should avoid hard failure and fall back on
 using a supported steering rule if this command fails. The default steering 
 rule is VIRTIO_NET_CTRL_STEERING_SINGLE. It will not be removed.
 
 When the steering rule is modified, some packets can still be outstanding in 
 one or more of the transmit virtqueues. Since drivers might choose to modify
 the current steering rule at a high rate (e.g. adaptively in response to
 changes in the workload), to avoid reordering packets the device is recommended
 to complete processing of the transmit queue(s) utilized under the original
 steering rule before processing any packets delivered under the modified
 steering rule.
 
 For debugging, the current steering rule can also be read from the 
 configuration space.
 
 Receive Packet Steering
 
 When the VIRTIO_NET_F_MULTIQUEUE feature bit is negotiated, the device can use
 any of multiple configured receive queues to pass a given packet to the driver.
 The driver controls which virtqueue is selected in practice by configuring packet
 steering rule using the VIRTIO_NET_CTRL_STEERING command, as described
 above (see Transmit Packet Steering).
 
 The field rule specifies the function used to select receive virtqueue for a 
 given packet; the field param makes it possible to pass an extra parameter 

Re: [PATCH 4/5] virtio-scsi: Add start/stop functionality for vhost-scsi

2012-09-10 Thread Paolo Bonzini
Il 09/09/2012 00:40, Michael S. Tsirkin ha scritto:
 On Fri, Sep 07, 2012 at 06:00:50PM +0200, Paolo Bonzini wrote:
 Il 07/09/2012 08:48, Nicholas A. Bellinger ha scritto:
 Cc: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
 Cc: Zhi Yong Wu wu...@linux.vnet.ibm.com
 Cc: Michael S. Tsirkin m...@redhat.com
 Cc: Paolo Bonzini pbonz...@redhat.com
 Signed-off-by: Nicholas Bellinger n...@linux-iscsi.org
 ---
  hw/virtio-pci.c  |2 ++
  hw/virtio-scsi.c |   49 +
  hw/virtio-scsi.h |1 +
  3 files changed, 52 insertions(+), 0 deletions(-)

 Please create a completely separate device vhost-scsi-pci instead (or
 virtio-scsi-tcm-pci, or something like that).  It is used completely
 differently from virtio-scsi-pci, it does not make sense to conflate the
 two.
 
 Ideally the name would say how it is different, not what backend it
 uses. Any good suggestions?

I chose the backend name because, ideally, there would be no other
difference.  QEMU _could_ implement all the goodies in vhost-scsi (such
as reservations or ALUA), it just doesn't do that yet.

Paolo


Re: [PATCH 4/5] virtio-scsi: Add start/stop functionality for vhost-scsi

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 08:16:54AM +0200, Paolo Bonzini wrote:
 Il 09/09/2012 00:40, Michael S. Tsirkin ha scritto:
  On Fri, Sep 07, 2012 at 06:00:50PM +0200, Paolo Bonzini wrote:
  Il 07/09/2012 08:48, Nicholas A. Bellinger ha scritto:
  Cc: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
  Cc: Zhi Yong Wu wu...@linux.vnet.ibm.com
  Cc: Michael S. Tsirkin m...@redhat.com
  Cc: Paolo Bonzini pbonz...@redhat.com
  Signed-off-by: Nicholas Bellinger n...@linux-iscsi.org
  ---
   hw/virtio-pci.c  |2 ++
   hw/virtio-scsi.c |   49 +
   hw/virtio-scsi.h |1 +
   3 files changed, 52 insertions(+), 0 deletions(-)
 
  Please create a completely separate device vhost-scsi-pci instead (or
  virtio-scsi-tcm-pci, or something like that).  It is used completely
  differently from virtio-scsi-pci, it does not make sense to conflate the
  two.
  
  Ideally the name would say how it is different, not what backend it
  uses. Any good suggestions?
 
 I chose the backend name because, ideally, there would be no other
 difference.  QEMU _could_ implement all the goodies in vhost-scsi (such
 as reservations or ALUA), it just doesn't do that yet.
 
 Paolo

Then why do you say "It is used completely differently from
virtio-scsi-pci"?  Isn't it just a different backend?

If yes then it should be a backend option, like it is
for virtio-net.

-- 
MST


Re: [PATCHv4] virtio-spec: virtio network device multiqueue support

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 09:16:29AM +0300, Michael S. Tsirkin wrote:
 On Mon, Sep 10, 2012 at 11:42:25AM +0930, Rusty Russell wrote:
  OK, I read the spec (pasted below for ease of reading), but I'm still
  confused over how this will work.
  
  I thought normal net drivers have the hardware provide an rxhash for
  each packet, and we map that to CPU to queue the packet on[1].  We hope
  that the receiving process migrates to that CPU, so xmit queue
  matches.
 
 This only works sometimes.  For example it's common to pin netperf to a
 cpu to get consistent performance.  Proper hardware must obey what
 applications want it to do, not the other way around.
 
  For virtio this would mean a new per-packet rxhash value, right?
  
  Why are we doing something different?  What am I missing?
  
  Thanks,
  Rusty.
  [1] Everything I Know About Networking I Learned From LWN:
  https://lwn.net/Articles/362339/
 
 I think you missed this:
 
   Some network interfaces can help with the distribution of incoming
   packets; they have multiple receive queues and multiple interrupt lines.
   Others, though, are equipped with a single queue, meaning that the
   driver for that hardware must deal with all incoming packets in a
   single, serialized stream. Parallelizing such a stream requires some
   intelligence on the part of the host operating system. 
 
 In other words, RPS is a hack to speed up networking on cheapo
 hardware; this is one of the reasons it is off by default.
 Good hardware has multiple receive queues.
 We can implement a good one so we do not need RPS.
 
 Also not all guest OS-es support RPS.
 
 Does this clarify?

I would like to add that on many processors, sending
IPIs between guest CPUs requires exits on the sending *and*
receiving paths, making them very expensive.

  ...

Re: [PATCHv4] virtio-spec: virtio network device multiqueue support

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 09:27:38AM +0300, Michael S. Tsirkin wrote:
 On Mon, Sep 10, 2012 at 09:16:29AM +0300, Michael S. Tsirkin wrote:
  On Mon, Sep 10, 2012 at 11:42:25AM +0930, Rusty Russell wrote:
   OK, I read the spec (pasted below for ease of reading), but I'm still
   confused over how this will work.
   
   I thought normal net drivers have the hardware provide an rxhash for
   each packet, and we map that to CPU to queue the packet on[1].  We hope
   that the receiving process migrates to that CPU, so xmit queue
   matches.
  
  This only works sometimes.  For example it's common to pin netperf to a
  cpu to get consistent performance.  Proper hardware must obey what
  applications want it to do, not the other way around.
  
   For virtio this would mean a new per-packet rxhash value, right?
   
   Why are we doing something different?  What am I missing?
   
   Thanks,
   Rusty.
   [1] Everything I Know About Networking I Learned From LWN:
   https://lwn.net/Articles/362339/
  
  I think you missed this:
  
  Some network interfaces can help with the distribution of incoming
  packets; they have multiple receive queues and multiple interrupt lines.
  Others, though, are equipped with a single queue, meaning that the
  driver for that hardware must deal with all incoming packets in a
  single, serialized stream. Parallelizing such a stream requires some
  intelligence on the part of the host operating system. 
  
  In other words, RPS is a hack to speed up networking on cheapo
  hardware; this is one of the reasons it is off by default.
  Good hardware has multiple receive queues.
  We can implement a good one so we do not need RPS.
  
  Also not all guest OS-es support RPS.
  
  Does this clarify?
 
 I would like to add that on many processors, sending
 IPIs between guest CPUs requires exits on the sending *and*
 receiving paths, making them very expensive.

A final addition: what you suggest above would be
"TX follows RX", right?
It is in anticipation of something like that that I made
the steering programming so generic.
I think "TX follows RX" is more immediately useful for reasons above
but we can add both to spec and let drivers and devices
decide what they want to support.
Pls let me know.

   ...

Re: [PATCH] virtio-balloon spec: provide a version of the silent deflate feature that works

2012-09-10 Thread Paolo Bonzini
Il 10/09/2012 08:03, Michael S. Tsirkin ha scritto:
 On Mon, Sep 10, 2012 at 07:50:13AM +0200, Paolo Bonzini wrote:
 Il 09/09/2012 00:22, Michael S. Tsirkin ha scritto:
 Almost.  One is the guest, if really needed, can tell the host of
 pages.  If not negotiated, and the host does not support it, the host
 must break the guest (e.g. fail to offer any virtqueues).

 There is no way in spec to break the guest.
 You can not fail to offer virtqueues.

 You can always return 0 for the first queue.
 
 I don't think guest drivers recover gracefully from this.
 Do they?

No, that's the point ("break the guest" really means "break the driver").

 Besides, there is no guarantee that virtqueue setup
 happens after feature negotiation.

 It is the only way that makes sense though (unless the guest would write
 0 for its features).  Should we change that?
 
 I do think it would be nice to add a generic way for device to
 notify guest about an internal failure.
 This can only happen after DRIVER_OK status is written though,
 and since existing drivers do not expect such failure, it might
 be too late.

Agreed.

 The other is the guest, though, would prefer not to do so.  It is
 different because the guest can proceed in a fallback mode even if the
 host doesn't offer it.

 I think I get what your proposed SILENT means; what I do not get
 is the motivation. It looks like a premature optimization to me.

 The motivation is to let the driver choose between two behaviors: the
 current one where ballooning is only done on request, and a more
 aggressive one.
 
 Yes but why is being silent any good? Optimization?
 Any data to show that it will help some workload?

Idle guests can move cache pages to the balloon.  You can overcommit
more aggressively, because the host can madvise away a lot more memory.

 OK so TELL says *when* to notify host, SILENT if set allows guest
 to skip leak notifications? In this case TELL should just be ignored
 when SILENT is set.

 Yeah, that was my first idea.  However, there are existing drivers that
 ignore SILENT, so that would not be 100% exact.
 
 Not sure I follow the logic.
 They don't ack SILENT so that would be 100% exact.

Hmm, then I'm not sure I follow yours.  We agreed that delaying
notifications or skipping them is really the same thing, right?

I think we're just stuck in a linguistic problem, with "must not" being
wrong and "does not have to" being too verbose.  Calling it
VIRTIO_BALLOON_F_SILENT_DEFLATE was a workaround for this, but perhaps
it adds more confusion.

Paolo


Re: [PATCH] virtio-balloon spec: provide a version of the silent deflate feature that works

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 08:38:09AM +0200, Paolo Bonzini wrote:
 Il 10/09/2012 08:03, Michael S. Tsirkin ha scritto:
  On Mon, Sep 10, 2012 at 07:50:13AM +0200, Paolo Bonzini wrote:
  Il 09/09/2012 00:22, Michael S. Tsirkin ha scritto:
  Almost.  One is the guest, if really needed, can tell the host of
  pages.  If not negotiated, and the host does not support it, the host
  must break the guest (e.g. fail to offer any virtqueues).
 
  There is no way in spec to break the guest.
  You can not fail to offer virtqueues.
 
  You can always return 0 for the first queue.
  
  I don't think guest drivers recover gracefully from this.
  Do they?
 
 No, that's the point ("break the guest" really means "break the driver").

You can just stop the VM then. No need for a side channel.

...

  The other is the guest, though, would prefer not to do so.  It is
  different because the guest can proceed in a fallback mode even if the
  host doesn't offer it.
 
  I think I get what your proposed SILENT means; what I do not get
  is the motivation. It looks like a premature optimization to me.
 
  The motivation is to let the driver choose between two behaviors: the
  current one where ballooning is only done on request, and a more
  aggressive one.
  
  Yes but why is being silent any good? Optimization?
  Any data to show that it will help some workload?
 
 Idle guests can move cache pages to the balloon.  You can overcommit
 more aggressively, because the host can madvise away a lot more memory.

IMHO this feature needs more thought. E.g. how will this work with device assignment?
If we build something let's build it in a way that plays nicely
with other features.

  OK so TELL says *when* to notify host, SILENT if set allows guest
  to skip leak notifications? In this case TELL should just be ignored
  when SILENT is set.
 
  Yeah, that was my first idea.  However, there are existing drivers that
  ignore SILENT, so that would not be 100% exact.
  
  Not sure I follow the logic.
  They don't ack SILENT so that would be 100% exact.
 
 Hmm, then I'm not sure I follow yours.  We agreed that delaying
 notifications or skipping them is really the same thing, right?
 
 I think we're just stuck in a linguistic problem, with "must not" being
 wrong and "does not have to" being too verbose.  Calling it
 VIRTIO_BALLOON_F_SILENT_DEFLATE was a workaround for this, but perhaps
 it adds more confusion.
 
 Paolo

Looks like it does :)


Re: [PATCH 4/5] virtio-scsi: Add start/stop functionality for vhost-scsi

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 08:42:15AM +0200, Paolo Bonzini wrote:
 Il 10/09/2012 08:24, Michael S. Tsirkin ha scritto:
   I chose the backend name because, ideally, there would be no other
   difference.  QEMU _could_ implement all the goodies in vhost-scsi (such
   as reservations or ALUA), it just doesn't do that yet.
   
   Paolo
  Then why do you say "It is used completely differently from
  virtio-scsi-pci"?
 
 It is configured differently (and I haven't seen a proposal yet for how
 to bridge the two), it does not interoperate, and right now it has a
 different set of features.
 
 The "does not interoperate" bit is particularly important.  Say QEMU
 were to implement persistent reservations (right now only a vhost-scsi
 feature).  Then QEMU and vhost-scsi PR would not be interchangeable, a
 reservation made by QEMU would not be visible in vhost and vice versa.

So this is backend stuff, right?

  Isn't it just a different backend?
  
  If yes then it should be a backend option, like it is
  for virtio-net.
 
 You mean a -drive option?

Yes.

 That would mean adding the logic to configure
 vhost-scsi to the QEMU block layer; that's a completely different project...
 
 Paolo

This is an implementation detail. You can make it a -drive option
but still have all the actual logic outside the block layer.
All you need in the block layer is option parsing code.
Please take a look at how -net does this:
we did *not* add all the logic to the qemu net layer.



Re: [PATCH] virtio-balloon spec: provide a version of the silent deflate feature that works

2012-09-10 Thread Paolo Bonzini
Il 10/09/2012 08:47, Michael S. Tsirkin ha scritto:
 On Mon, Sep 10, 2012 at 08:38:09AM +0200, Paolo Bonzini wrote:
 Il 10/09/2012 08:03, Michael S. Tsirkin ha scritto:
 On Mon, Sep 10, 2012 at 07:50:13AM +0200, Paolo Bonzini wrote:
 Il 09/09/2012 00:22, Michael S. Tsirkin ha scritto:
 Almost.  One is the guest, if really needed, can tell the host of
 pages.  If not negotiated, and the host does not support it, the host
 must break the guest (e.g. fail to offer any virtqueues).

 There is no way in spec to break the guest.
 You can not fail to offer virtqueues.

 You can always return 0 for the first queue.

 I don't think guest drivers recover gracefully from this.
 Do they?

  No, that's the point ("break the guest" really means "break the driver").
 
  You can just stop the VM then. No need for a side channel.

Keeping the VM running, just with no balloon driver, is preferable.

 The other is the guest, though, would prefer not to do so.  It is
 different because the guest can proceed in a fallback mode even if the
 host doesn't offer it.

 I think I get what your proposed SILENT means; what I do not get
 is the motivation. It looks like a premature optimization to me.

 The motivation is to let the driver choose between two behaviors: the
 current one where ballooning is only done on request, and a more
 aggressive one.

 Yes but why is being silent any good? Optimization?
 Any data to show that it will help some workload?

 Idle guests can move cache pages to the balloon.  You can overcommit
 more aggressively, because the host can madvise away a lot more memory.
 
 IMHO this feature needs more thought. E.g. how will this work with device assignment?

Revert to normal cooperative ballooning.

 If we build something let's build it in a way that plays nicely
 with other features.

Yes, that's the point of SILENT. :)

Paolo


Re: [PATCH v2] kvm/fpu: Enable fully eager restore kvm FPU

2012-09-10 Thread Avi Kivity
On 09/10/2012 06:29 AM, Hao, Xudong wrote:
 
 Doesn't help.  We can have:
 
 host: deactivate fpu for some reason
 guest: set cr4.osxsave, xcr0.bit3
 host: enter guest with deactivated fpu
 guest: touch fpu
 
 result: host fpu corrupted.
 
 Avi, I'm not sure I fully understand you. Do you mean enter the guest with
 fpu_active=0 and then the FPU is not restored?

Yes.

 If so, I will add fpu_active=1 in the no-lazy case.
 
 -   kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
 +   if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
 +   (vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY))) {
 +   kvm_x86_ops->fpu_activate(vcpu);
 +   vcpu->fpu_active = 1;
 +   }
 +   else
 +   kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
 

It doesn't help here.

  1 guest boot
  2 kvm_userspace_exit (deactivates fpu)
  3 XSETBV exit that sets xcr0.new_bit
  4 kvm_enter

There is no call to kvm_put_guest_fpu() between 3 and 4, you need
something in __kvm_set_xcr() to activate the fpu.

kvm_put_guest_fpu() doesn't need to activate the fpu then, just to avoid
deactivating it.

Note you also need to consider writes to xcr0 and cr4 that happen in the
reverse order due to live migration.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH] KVM: tsc deadline timer works only when hrtimer high resolution configured

2012-09-10 Thread Avi Kivity
On 09/09/2012 06:10 PM, Liu, Jinsong wrote:
 Avi Kivity wrote:
 On 09/09/2012 05:54 PM, Liu, Jinsong wrote:
 
 hrtimers is an intrusive feature, I don't think we should
 force-enable it.  Please change it to a depends on.
 
 Hmm, if it changed as
 config KVM
 depends on HIGH_RES_TIMERS
  The item 'Kernel-based Virtual Machine (KVM) support (NEW)' doesn't
  even appear to the user in make menuconfig (when HIGH_RES_TIMERS is
  disabled).
 
 Is it good? I just have a little concern here:)
 
 It's not good, but that's what we have.
 
 It's okay to force-enable low-impact features (like preempt notifies).
 
 hrtimers, on the other hand, change kernel behaviour quite deeply.
 
 Maybe over time someone will fix the config tools to unhide features
 that can be enabled by turning on a dependency.
 
 OK, updated as attached.

Thanks, applied.




Re: [PATCH 1/2] kvm tools: Export DISPLAY ENV as our default host ip address

2012-09-10 Thread Avi Kivity
On 09/10/2012 04:26 AM, Asias He wrote:


 Or you can
 make the guest talk to an internal unix-domain socket, tunnel that
 through virtio-serial, terminate virtio-serial in lkvm, and direct it
 towards the local X socket.

 Doesn't this require some user agent or config modification to the guest?

 It does, a daemon that listens locally and forwards data over
 virtio-serial.  But you build your own initrd anyway, don't you?
 
 Using our custom init file is one use case. User may use distro disk
 image as guest also.

We could push it into the distros, it should be useful for qemu as well.
 This could be done even more properly by adding support for
virtio-serial in X itself (is anything really needed on the client side?)

The only problem is that it is insecure, but maybe
http://www.nsa.gov/research/_files/selinux/papers/x11/t1.shtml can help.
 I don't know if it's integrated yet.

 
 Another option is ppp-over-virtio-serial.
 
 Seems this still uses tcp where the link layer changes from ethernet to 
 serial.

Right, you have to terminate it on the host side without routing.




Re: [PATCH 1/3] KVM: MMU: release noslot pfn on the fail path properly

2012-09-10 Thread Avi Kivity
On 09/07/2012 09:13 AM, Xiao Guangrong wrote:
 We can not directly call kvm_release_pfn_clean to release the pfn
 since we can meet noslot pfn which is used to cache mmio info into
 spte
 
 Introduce mmu_release_pfn_clean to do this kind of thing
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   19 ++-
  arch/x86/kvm/paging_tmpl.h |4 ++--
  2 files changed, 16 insertions(+), 7 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 399c177..3c10bca 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -2432,6 +2432,16 @@ done:
   return ret;
  }
 
 +/*
 + * The primary user is the page fault path, which calls it to properly
 + * release noslot_pfn.
 + */
 +static void mmu_release_pfn_clean(pfn_t pfn)
 +{
 + if (!is_error_pfn(pfn))
 + kvm_release_pfn_clean(pfn);
 +}
 +

Too many APIs, each slightly different.  How do I know which one to call?

Please change kvm_release_pfn_*() instead, calling some arch hook (or
even #ifdef CONFIG_KVM_HAS_FAST_MMIO) to check for the special case.






Re: [PATCH 3/3] KVM: MMU: remove unnecessary check

2012-09-10 Thread Avi Kivity
On 09/07/2012 09:15 AM, Xiao Guangrong wrote:
 Checking the return of kvm_mmu_get_page is unnecessary since it is
 guaranteed by memory cache
 


Thanks, applied.




Re: [PATCH 2/3] KVM: fix release error page

2012-09-10 Thread Avi Kivity
On 09/07/2012 09:14 AM, Xiao Guangrong wrote:
 This bug was triggered:
 [ 4220.198458] BUG: unable to handle kernel paging request at fffe
 [ 4220.203907] IP: [81104d85] put_page+0xf/0x34
 ..
 [ 4220.237326] Call Trace:
 [ 4220.237361]  [a03830d0] kvm_arch_destroy_vm+0xf9/0x101 [kvm]
 [ 4220.237382]  [a036fe53] kvm_put_kvm+0xcc/0x127 [kvm]
 [ 4220.237401]  [a03702bc] kvm_vcpu_release+0x18/0x1c [kvm]
 [ 4220.237407]  [81145425] __fput+0x111/0x1ed
 [ 4220.237411]  [8114550f] fput+0xe/0x10
 [ 4220.237418]  [81063511] task_work_run+0x5d/0x88
 [ 4220.237424]  [8104c3f7] do_exit+0x2bf/0x7ca
 
 The test case:
 
  #include <stdio.h>
  #include <stdlib.h>
  #include <pthread.h>
  #include <fcntl.h>
  #include <unistd.h>
  
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  
  #include <linux/kvm.h>
  
  #define die(fmt, args...) do {\
  	printf(fmt, ##args);\
  	exit(-1);} while (0)
  
  static int create_vm(void)
  {
  	int sys_fd, vm_fd;
  
  	sys_fd = open("/dev/kvm", O_RDWR);
  	if (sys_fd < 0)
  		die("open /dev/kvm fail.\n");
  
  	vm_fd = ioctl(sys_fd, KVM_CREATE_VM, 0);
  	if (vm_fd < 0)
  		die("KVM_CREATE_VM fail.\n");
  
  	return vm_fd;
  }
  
  static int create_vcpu(int vm_fd)
  {
  	int vcpu_fd;
  
  	vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
  	if (vcpu_fd < 0)
  		die("KVM_CREATE_VCPU ioctl.\n");
  	printf("Create vcpu.\n");
  	return vcpu_fd;
  }
  
  static void *vcpu_thread(void *arg)
  {
  	int vm_fd = (int)(long)arg;
  
  	create_vcpu(vm_fd);
  	return NULL;
  }
  
  int main(int argc, char *argv[])
  {
  	pthread_t thread;
  	int vm_fd;
  
  	(void)argc;
  	(void)argv;
  
  	vm_fd = create_vm();
  	pthread_create(&thread, NULL, vcpu_thread, (void *)(long)vm_fd);
  	printf("Exit.\n");
  	return 0;
  }
 
  It is caused by releasing kvm->arch.ept_identity_map_addr, which is the
  error page.
  
  The parent thread can send a KILL signal to the vcpu thread when it is
  exiting, which stops faulting pages and potentially allocating memory.
  So gfn_to_pfn/gfn_to_page may fail at this time.
  
  Fixed by checking the page before it is used.
 

Thanks, applied to master for 3.6.




Re: [PATCH 1/3] KVM: MMU: release noslot pfn on the fail path properly

2012-09-10 Thread Xiao Guangrong
On 09/10/2012 04:22 PM, Avi Kivity wrote:
 On 09/07/2012 09:13 AM, Xiao Guangrong wrote:
 We can not directly call kvm_release_pfn_clean to release the pfn
 since we can meet noslot pfn which is used to cache mmio info into
 spte

 Introduce mmu_release_pfn_clean to do this kind of thing

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   19 ++-
  arch/x86/kvm/paging_tmpl.h |4 ++--
  2 files changed, 16 insertions(+), 7 deletions(-)

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 399c177..3c10bca 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -2432,6 +2432,16 @@ done:
  return ret;
  }

 +/*
 + * The primary user is page fault path which call it to properly
 + * release noslot_pfn.
 + */
 +static void mmu_release_pfn_clean(pfn_t pfn)
 +{
 +if (!is_error_pfn(pfn))
 +kvm_release_pfn_clean(pfn);
 +}
 +
 
 Too many APIs, each slightly different.  How do I know which one to call?

It is only used in mmu and it is a static function.

 
 Please change kvm_release_pfn_*() instead, calling some arch hook (or
 even #ifdef CONFIG_KVM_HAS_FAST_MMIO) to check for the special case.

We only need to call it on page fault path. If we change the common API
other x86 components have to suffer from it.



Re: [PATCH 1/3] KVM: MMU: release noslot pfn on the fail path properly

2012-09-10 Thread Avi Kivity
On 09/10/2012 11:37 AM, Xiao Guangrong wrote:
 On 09/10/2012 04:22 PM, Avi Kivity wrote:
 On 09/07/2012 09:13 AM, Xiao Guangrong wrote:
 We can not directly call kvm_release_pfn_clean to release the pfn
 since we can meet noslot pfn which is used to cache mmio info into
 spte

 Introduce mmu_release_pfn_clean to do this kind of thing

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   19 ++-
  arch/x86/kvm/paging_tmpl.h |4 ++--
  2 files changed, 16 insertions(+), 7 deletions(-)

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 399c177..3c10bca 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -2432,6 +2432,16 @@ done:
 return ret;
  }

 +/*
 + * The primary user is page fault path which call it to properly
 + * release noslot_pfn.
 + */
 +static void mmu_release_pfn_clean(pfn_t pfn)
 +{
 +   if (!is_error_pfn(pfn))
 +   kvm_release_pfn_clean(pfn);
 +}
 +
 
 Too many APIs, each slightly different.  How do I know which one to call?
 
 It is only used in mmu and it is a static function.

Still, how do I know which one to call?  The name tells me nothing.
When I read the code, how do I know if a call is correct or not?

 
 
 Please change kvm_release_pfn_*() instead, calling some arch hook (or
 even #ifdef CONFIG_KVM_HAS_FAST_MMIO) to check for the special case.
 
 We only need to call it on page fault path. If we change the common API
 other x86 components have to suffer from it.

This way, I have to suffer from it.

btw, what about another approach, to avoid those paths completely?
Avoid calling __direct_map() with error_pfn, and jump to a label after
kvm_release_pfn_clean() in page_fault(), etc?



Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Michael S. Tsirkin
On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
 This implementation of a virtio balloon driver uses the page cache to
 store pages that have been released to the host.  The communication
 (outside of target counts) is one way--the guest notifies the host when
 it adds a page to the page cache, allowing the host to madvise(2) with
 MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
 (via the regular page reclaim).  This means that inflating the balloon
 is similar to the existing balloon mechanism, but the deflate is
 different--it re-uses existing Linux kernel functionality to
 automatically reclaim.
 
 Signed-off-by: Frank Swiderski f...@google.com

I've been trying to understand this, and I have
a question: what exactly is the benefit
of this new device?

Note that users could not care less about how a driver
is implemented internally.

Is there some workload where you see VM working better with
this than regular balloon? Any numbers?

Also, can't we just replace existing balloon implementation
with this one?  Why it is so important to deflate silently?
I guess the filesystem does not currently get a callback
before a page is reclaimed, but this is an implementation detail -
maybe this can be fixed?

Also can you pls answer Avi's question?
How is overcommit managed?


 ---
  drivers/virtio/Kconfig  |   13 +
  drivers/virtio/Makefile |1 +
   drivers/virtio/virtio_fileballoon.c |  636 +++
  include/linux/virtio_balloon.h  |9 +
  include/linux/virtio_ids.h  |1 +
  5 files changed, 660 insertions(+), 0 deletions(-)
  create mode 100644 drivers/virtio/virtio_fileballoon.c
 
 diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
 index f38b17a..cffa2a7 100644
 --- a/drivers/virtio/Kconfig
 +++ b/drivers/virtio/Kconfig
 @@ -35,6 +35,19 @@ config VIRTIO_BALLOON
  
If unsure, say M.
  
 +config VIRTIO_FILEBALLOON
  + tristate "Virtio page cache-backed balloon driver"
 + select VIRTIO
 + select VIRTIO_RING
 + ---help---
 +  This driver supports decreasing and automatically reclaiming the
 +  memory within a guest VM.  Unlike VIRTIO_BALLOON, this driver instead
 +  tries to maintain a specific target balloon size using the page cache.
 +  This allows the guest to implicitly deflate the balloon by flushing
 +  pages from the cache and touching the page.
 +
 +  If unsure, say N.
 +
   config VIRTIO_MMIO
    tristate "Platform bus driver for memory mapped virtio devices (EXPERIMENTAL)"
    depends on HAS_IOMEM && EXPERIMENTAL
 diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
 index 5a4c63c..7ca0a3f 100644
 --- a/drivers/virtio/Makefile
 +++ b/drivers/virtio/Makefile
 @@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
  obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 +obj-$(CONFIG_VIRTIO_FILEBALLOON) += virtio_fileballoon.o
 diff --git a/drivers/virtio/virtio_fileballoon.c 
 b/drivers/virtio/virtio_fileballoon.c
 new file mode 100644
 index 000..ff252ec
 --- /dev/null
 +++ b/drivers/virtio/virtio_fileballoon.c
 @@ -0,0 +1,636 @@
  +/* Virtio file (page cache-backed) balloon implementation, inspired by
  + * Dor Laor and Marcelo Tosatti's implementations, and based on Rusty
  + * Russell's implementation.
  + *
  + * This implementation of the virtio balloon driver re-uses the page cache
  + * to allow memory consumed by inflating the balloon to be reclaimed by
  + * linux.  It creates and mounts a bare-bones filesystem containing a
  + * single inode.  When the host requests the balloon to inflate, it does
  + * so by reading pages at offsets into the inode mapping's page_tree.  The
  + * host is notified when the pages are added to the page_tree, allowing it
  + * (the host) to madvise(2) the corresponding host memory, reducing the
  + * RSS of the virtual machine.  In this implementation, the host is only
  + * notified when a page is added to the balloon.  Reclaim happens under
  + * the existing TTFP logic, which flushes unused pages in the page cache.
  + * If the host used MADV_DONTNEED, then when the guest uses the page, the
  + * zero page will be mapped in, allowing automatic (and fast, compared to
  + * requiring a host notification via a virtio queue to get memory back)
  + * reclaim.
  + *
 + *  Copyright 2008 Rusty Russell IBM Corporation
 + *  Copyright 2011 Frank Swiderski Google Inc
 + *
 + *  This program is free software; you can redistribute it and/or modify
 + *  it under the terms of the GNU General Public License as published by
 + *  the Free Software Foundation; either version 2 of the License, or
 + *  (at your option) any later version.
 + *
 + *  This program is distributed in the hope that it will be useful,
 + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 + *  MERCHANTABILITY or FITNESS FOR A 

Re: [PATCH v2] KVM: trace the events of mmu_notifier

2012-09-10 Thread Avi Kivity
On 09/07/2012 09:16 AM, Xiao Guangrong wrote:
 mmu_notifier is the interface to broadcast the mm events to KVM, the
 tracepoints introduced in this patch can trace all these events, it is
 very helpful for us to notice and fix the bug caused by mm

There is nothing kvm specific here.  Perhaps this can be made generic
(with a mm parameter so we can filter by process).




Re: [PATCH 1/3] KVM: MMU: release noslot pfn on the fail path properly

2012-09-10 Thread Xiao Guangrong
On 09/10/2012 05:02 PM, Avi Kivity wrote:
 On 09/10/2012 11:37 AM, Xiao Guangrong wrote:
 On 09/10/2012 04:22 PM, Avi Kivity wrote:
 On 09/07/2012 09:13 AM, Xiao Guangrong wrote:
 We can not directly call kvm_release_pfn_clean to release the pfn
 since we can meet noslot pfn which is used to cache mmio info into
 spte

 Introduce mmu_release_pfn_clean to do this kind of thing

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   19 ++-
  arch/x86/kvm/paging_tmpl.h |4 ++--
  2 files changed, 16 insertions(+), 7 deletions(-)

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 399c177..3c10bca 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -2432,6 +2432,16 @@ done:
return ret;
  }

 +/*
 + * The primary user is page fault path which call it to properly
 + * release noslot_pfn.
 + */
 +static void mmu_release_pfn_clean(pfn_t pfn)
 +{
 +  if (!is_error_pfn(pfn))
 +  kvm_release_pfn_clean(pfn);
 +}
 +

 Too many APIs, each slightly different.  How do I know which one to call?

 It is only used in mmu and it is a static function.
 
 Still, how do I know which one to call?  The name tells me nothing.
 When I read the code, how do I know if a call is correct or not?
 


 Please change kvm_release_pfn_*() instead, calling some arch hook (or
 even #ifdef CONFIG_KVM_HAS_FAST_MMIO) to check for the special case.

 We only need to call it on page fault path. If we change the common API
 other x86 components have to suffer from it.
 
 This way, I have to suffer from it.

Sorry. :(

 
 btw, what about another approach, to avoid those paths completely?
 Avoid calling __direct_map() with error_pfn, and jump to a label after
 kvm_release_pfn_clean() in page_fault(), etc?

I will try it.




Re: [User question] Huge buffer size on KVM host

2012-09-10 Thread Martin Wawro
On 08/16/2012 04:57 PM, Avi Kivity wrote:

Hi Avi,

No, there was no reason and we disabled it there too. Interestingly, the
buffer size did not go down significantly, even when manually flushing the
pages using /proc/sys/vm/drop_caches (3). Finally, after rebooting the
host, the buffer size was back to normal again, and with all caches
disabled we were also able to manually drop the buffered pages.

However, after 2 weeks or so, we encountered significant instabilities in
the system. Inside the guest OS, the load went up like crazy (way past 30)
and the guest was virtually unusable. Checking the host, buffer size,
load, memory and I/O were all in a well-defined range.

When rebooting the guest OS, the problem manifested again after about
20-30 mins (reproducible a couple of times). The only thing that worked to
stop this (I am afraid only for a couple of weeks or so) was rebooting the
host system. On the guest and the host there was nothing suspicious in the
logs or dmesg output.


Best regards,

Martin

 Hi Avi,

 It seems that the kernel on that particular machine is too old, those entries 
 are
 not featured. We checked on a comparable setup with a newer kernel and both 
 entries
 were set to 512.

 We also did have a third more thorough look on the caching. It turns out that 
 the
 virt-manager does not seem to honor the caching adjusted in the GUI correctly.
 We disabled caching on all virtual devices for this particular VM and checking
 with ps -fxal revealed, that only one of those devices (and a rather small 
 one too)
 had this set. We corrected this in the XML file directly and the buffer size
 currently resides at around 1.8 GB after rebooting the VM (the only virtio 
 device
 not having the cache=none option set is now the (non-mounted) cdrom).

 cc += libvirt-list

 Is there a reason that cdroms don't get cache=none?




-- 

Martin Wawro   | Digital Medics GmbH 
Managing Director  |  Otto-Hahn-Str. 15, 44227 Dortmund, Germany
Tel. +49-231-9742-6622 |  Fax: +49-231-9742-6623
Key: 0xB0A225BD|Registered at AG Dortmund, HRB 19360




Re: [PATCH 3/3] support readonly memory feature in qemu

2012-09-10 Thread Jan Kiszka
On 2012-09-09 17:45, Avi Kivity wrote:
 On 09/07/2012 11:50 AM, Jan Kiszka wrote:

  +} else {
  +cpu_physical_memory_rw(run->mmio.phys_addr,
  +   run->mmio.data,
  +   run->mmio.len,
  +   run->mmio.is_write);
  +}
 +
  ret = 0;
  break;
  case KVM_EXIT_IRQ_WINDOW_OPEN:


 Great to see this feature for KVM finally! I'm just afraid that this
 will finally break good old isapc - due to broken Seabios. KVM used to
 unbreak it as it didn't respect write protections. ;)
 
 Can you describe the breakage?

Try qemu -machine isapc [-enable-kvm]. Seabios is writing to some
read-only marked area. Don't recall where precisely.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: [PATCH v2] KVM: trace the events of mmu_notifier

2012-09-10 Thread Xiao Guangrong
On 09/10/2012 05:09 PM, Avi Kivity wrote:
 On 09/07/2012 09:16 AM, Xiao Guangrong wrote:
 mmu_notifier is the interface to broadcast the mm events to KVM, the
 tracepoints introduced in this patch can trace all these events, it is
 very helpful for us to notice and fix the bug caused by mm
 
 There is nothing kvm specific here.  Perhaps this can be made generic
 (with a mm parameter so we can filter by process).

Hmm, I would like to put these tracepoints inside the mmu-lock; then we can
clearly know the sequence between mm and kvm mmu. It is useful for us to
detect issues/races between them.



Re: [PATCH v3 4/4] kvm: i386: Add classic PCI device assignment

2012-09-10 Thread Jan Kiszka
On 2012-09-09 16:13, Avi Kivity wrote:
 On 09/06/2012 11:44 AM, Jan Kiszka wrote:
 On 2012-08-30 20:30, Jan Kiszka wrote:
 This adds PCI device assignment for i386 targets using the classic KVM
 interfaces. This version is 100% identical to what is being maintained
 in qemu-kvm for several years and is supported by libvirt as well. It is
 expected to remain relevant for another couple of years until kernels
 without full-features and performance-wise equivalent VFIO support are
 obsolete.

 A refactoring to-do that should be done in-tree is to model MSI and
 MSI-X support via the generic PCI layer, similar to what VFIO is already
 doing for MSI-X. This should improve the correctness and clean up the
 code from duplicate logic.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---

 Changes in v3:
  - addressed comment by Peter (changed device name to kvm-pci-assign +
alias)
  - addressed (most) comments by Michael
  - fixed INT pin regression

 Does someone _disagree_ that there are no open (and reasonably solvable)
 issues and that this can now be merged through uq/master?
 
 Is patch 4 the only one that is at v3, and the rest are to be taken from
 the original posting?

That is correct.

Jan



Re: [PATCH uq/master] kvm: Clean up irqfd API

2012-09-10 Thread Jan Kiszka
On 2012-09-09 16:01, Avi Kivity wrote:
 On 08/20/2012 11:55 AM, Jan Kiszka wrote:
 No need to expose the fd-based interface, everyone will already be fine
 with the more handy EventNotifier variant. Rename the latter to clarify
 that we are still talking about irqfds here.
 
 Thanks, applied.
 
  
 -int kvm_irqchip_add_irqfd(KVMState *s, int fd, int virq);
 -int kvm_irqchip_remove_irqfd(KVMState *s, int fd, int virq);
 -int kvm_irqchip_add_irq_notifier(KVMState *s, EventNotifier *n, int virq);
 -int kvm_irqchip_remove_irq_notifier(KVMState *s, EventNotifier *n, int 
 virq);
 +int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, int virq);
 +int kvm_irqchip_remove_irqfd_notifier(KVMState *s, EventNotifier *n, int 
 virq);
  #endif
 
 Those names aren't particularly satisfying.  add_irqfd_notifier implies
 you want to be notified about irqfd events, but that's not what the
 function does.  Not sure what a good name would be.

Now that there are no more variants, we could also drop the notifier
from the name again. Better?

Jan



[ANNOUNCE] qemu-kvm-1.1.2

2012-09-10 Thread Avi Kivity
qemu-kvm-1.1.2 is now available. This release is based on the upstream
qemu 1.1.2, plus kvm-specific enhancements. Please see the
original QEMU 1.1.2 release announcement [1] for details.

This release can be used with the kvm kernel modules provided by your
distribution kernel, or by the modules in the kvm-kmod package, such
as kvm-kmod-3.5.

http://www.linux-kvm.org

[1] http://lists.nongnu.org/archive/html/qemu-devel/2012-09/msg00587.html


[ANNOUNCE] qemu-kvm-1.2.0

2012-09-10 Thread Avi Kivity
qemu-kvm-1.2.0 is now available. This release is based on the upstream
qemu 1.2.0, plus kvm-specific enhancements. Please see the
original QEMU 1.2.0 release announcement [1] for details.

This release can be used with the kvm kernel modules provided by your
distribution kernel, or by the modules in the kvm-kmod package, such
as kvm-kmod-3.5.

http://www.linux-kvm.org

[1] http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg00506.html


Re: [ANNOUNCE] qemu-kvm-1.2.0

2012-09-10 Thread Jan Kiszka
On 2012-09-10 11:53, Avi Kivity wrote:
 qemu-kvm-1.2.0 is now available. This release is based on the upstream
 qemu 1.2.0, plus kvm-specific enhancements. Please see the
 original QEMU 1.2.0 release announcement [1] for details.

To be more precise about the kvm-specific enhancements: The only major
difference is now PCI passthrough. Beyond that, there are smaller
differences like old special command-line switches (for which upstream
equivalents exist), some adjustments that allow migration from older
qemu-kvm versions (cannot be merged upstream), and the test device for
kvm unit tests. That's it.

IOW: Anyone who can reboot his/her guests for an upgrade and doesn't
depend on PCI passthrough can safely use upstream QEMU 1.2.

Jan

 
 This release can be used with the kvm kernel modules provided by your
 distribution kernel, or by the modules in the kvm-kmod package, such
 as kvm-kmod-3.5.
 
 http://www.linux-kvm.org
 
 [1] http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg00506.html



qemu-kvm log

2012-09-10 Thread Liu, Jinsong
Hi,

I'm currently debugging a qemu-kvm issue. I added some print code like 
'fprintf(stderr, ...)', however I fail to see any info on stdio. Can anyone 
tell me where the qemu-kvm logfile is, or what I need to do to record my fprintf info?

Thanks,
Jinsong


Re: [PATCHv4] virtio-spec: virtio network device multiqueue support

2012-09-10 Thread Jason Wang

On 09/10/2012 02:33 PM, Michael S. Tsirkin wrote:

On Mon, Sep 10, 2012 at 09:27:38AM +0300, Michael S. Tsirkin wrote:

On Mon, Sep 10, 2012 at 09:16:29AM +0300, Michael S. Tsirkin wrote:

On Mon, Sep 10, 2012 at 11:42:25AM +0930, Rusty Russell wrote:

OK, I read the spec (pasted below for easy of reading), but I'm still
confused over how this will work.

I thought normal net drivers have the hardware provide an rxhash for
each packet, and we map that to CPU to queue the packet on[1].  We hope
that the receiving process migrates to that CPU, so xmit queue
matches.

This only works sometimes.  For example, it's common to pin netperf to a
cpu to get consistent performance.  Proper hardware must obey what
applications want it to do, not the other way around.


For virtio this would mean a new per-packet rxhash value, right?

Why are we doing something different?  What am I missing?

Thanks,
Rusty.
[1] Everything I Know About Networking I Learned From LWN:
 https://lwn.net/Articles/362339/

I think you missed this:

Some network interfaces can help with the distribution of incoming
packets; they have multiple receive queues and multiple interrupt lines.
Others, though, are equipped with a single queue, meaning that the
driver for that hardware must deal with all incoming packets in a
single, serialized stream. Parallelizing such a stream requires some
intelligence on the part of the host operating system.

In other words, RPS is a hack to speed up networking on cheapo
hardware; this is one of the reasons it is off by default.
Good hardware has multiple receive queues.
We can implement a good one so we do not need RPS.

Also not all guest OS-es support RPS.

Does this clarify?

I would like to add that on many processors, sending
IPIs between guest CPUs requires exits on the sending *and*
receiving path, making it very expensive.

A final addition: what you suggest above would be
TX follows RX, right?
It is in anticipation of something like that, that I made
steering programming so generic.
I think TX follows RX is more immediately useful for reasons above
but we can add both to spec and let drivers and devices
decide what they want to support.
Pls let me know.


AFAIK, ixgbe does rx follows tx. The only difference between ixgbe 
and virtio-net is that the ixgbe driver programs the flow director during 
packet transmission, while we suggest doing it silently in the device for 
simplicity. Even with this, more co-operation is still needed from the 
driver (e.g. ixgbe tries to use a per-cpu queue by setting an affinity hint 
and using cpuid to choose the txq, which could be reused in the virtio-net driver).



---
Transmit Packet Steering

When the VIRTIO_NET_F_MULTIQUEUE feature bit is negotiated, the guest can use 
any of the multiple configured transmit queues to transmit a given packet. To 
avoid packet reordering by the device (which generally leads to performance 
degradation), the driver should attempt to use the same transmit virtqueue for 
all packets of a given transmit flow. For bi-directional protocols (in 
practice, TCP), a given network connection can utilize both transmit and 
receive queues. For best performance, packets from a single connection should 
utilize the paired transmit and receive queues from the same virtqueue pair; 
for example both transmitqN and receiveqN. This rule makes it possible to 
optimize processing on the device side, but this is not a hard requirement: 
devices should function correctly even when this rule is not followed.

Driver selects an active steering rule using VIRTIO_NET_CTRL_STEERING command 
(this controls both which virtqueue is selected for a given packet for receive 
and notifies the device which virtqueues are about to be used for transmit).

This command accepts a single out argument in the following format:

#define VIRTIO_NET_CTRL_STEERING   4

The field rule specifies the function used to select the transmit virtqueue for 
a given packet; the field param makes it possible to pass an extra parameter if 
appropriate. When rule is set to VIRTIO_NET_CTRL_STEERING_SINGLE (the 
default), all packets are steered to the default virtqueue transmitq (1); 
param is unused. When rule is set to 
VIRTIO_NET_CTRL_STEERING_RX_FOLLOWS_TX, packets are steered by the driver to 
the first N=(param+1) multiqueue virtqueues transmitq1...transmitqN; the 
default transmitq is unused. The driver must have configured all these 
(param+1) virtqueues beforehand.

Supported steering rules can be added and removed in the future. The driver 
should check that the request to change the steering rule was successful by 
checking the ack value of the command. As selecting a specific steering rule 
is an optimization feature, drivers should avoid hard failure and fall back on 
a supported steering rule if this command fails. The default steering rule is 
VIRTIO_NET_CTRL_STEERING_SINGLE; it will not be removed.

When the steering rule is 

Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Peter Lieven

On 09/06/12 16:58, Avi Kivity wrote:

On 08/22/2012 06:06 PM, Peter Lieven wrote:

Hi,

has anyone ever tested to run memtest with -cpu host flag passed to
qemu-kvm?
For me it resets when probing the chipset. With -cpu qemu64 it works
just fine.

Maybe this is specific to memtest, but it might be something that can happen
in other applications too.

Any thoughts?

Try to identify the cpu flag that causes this by removing them
successively (-cpu host,-flag...).  Alternatively capture a trace
(http://www.linux-kvm.org/page/Tracing) look for TRIPLE_FAULT (Intel),
and post the few hundred lines preceding it.


Here we go:

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason EXCEPTION_NMI 
rip 0xd185 info 0 8307

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_fpu: load
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc60 info cf80003 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_write at 0xcf8 
size 4 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_fpu: unload
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc29 info cfc0009 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_emulate_insn: [FAILED TO 
PARSE] rip=52265 csbase=0 len=2 insn=fí%ÿÿ flags=5 failed=0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_read at 0xcfc size 
2 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc60 info cf80003 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_write at 0xcf8 
size 4 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc29 info cfe0009 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_emulate_insn: [FAILED TO 
PARSE] rip=52265 csbase=0 len=2 insn=fí%ÿÿ flags=5 failed=0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_read at 0xcfe size 
2 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason EXCEPTION_NMI 
rip 0xd185 info 0 8307

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_fpu: load
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc60 info cf80003 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_write at 0xcf8 
size 4 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_fpu: unload
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc29 info cfc0009 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_emulate_insn: [FAILED TO 
PARSE] rip=52265 csbase=0 len=2 insn=fí%ÿÿ flags=5 failed=0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_read at 0xcfc size 
2 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc60 info cf80003 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_write at 0xcf8 
size 4 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc29 info cfc0009 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_emulate_insn: [FAILED TO 
PARSE] rip=52265 csbase=0 len=2 insn=fí%ÿÿ flags=5 failed=0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_read at 0xcfc size 
2 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason EPT_MISCONFIG 
rip 0x86e0 info 0 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_emulate_insn: [FAILED TO 
PARSE] rip=34528 csbase=0 len=3 insn=ˆF@Š„Òuõ‰L$ flags=5 failed=0
qemu-kvm-1.0.1-5107 [007] 410771.148000: vcpu_match_mmio: gva 0xb873c 
gpa 0xb873c Write GPA
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_mmio: mmio write len 1 gpa 
0xb873c val 0x6f

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: 

Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Paolo Bonzini
Il 10/09/2012 13:06, Peter Lieven ha scritto:
 
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason MSR_READ rip
 0x11478 info 0 0
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_msr: msr_read 194 = 0x0 (#GP)
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_inj_exception: #GP (0x0)
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason TRIPLE_FAULT
 rip 0x11478 info 0 0

Memory controller MSR:

static float getNHMmultiplier(void)
{
    unsigned int msr_lo, msr_hi;
    float coef;

    /* Find multiplier (by MSR) */
    /* First, check if Flexible Ratio is Enabled */
    rdmsr(0x194, msr_lo, msr_hi);
    if ((msr_lo >> 16) & 1) {
        coef = (msr_lo >> 8) & 0xFF;
    } else {
        rdmsr(0xCE, msr_lo, msr_hi);
        coef = (msr_lo >> 8) & 0xFF;
    }

    return coef;
}

Looks like we need to emulate it since memtest only looks at the cpuid
to detect an integrated memory controller.  What does this return for you?

dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests

2012-09-10 Thread Tomoki Sekiyama
Hi Jan,

On 2012/09/07 17:26, Jan Kiszka wrote:

 On 2012-09-06 13:27, Tomoki Sekiyama wrote:
 This RFC patch series provides facility to dedicate CPUs to KVM guests
 and enable the guests to handle interrupts from passed-through PCI devices
 directly (without VM exit and relay by the host).

 With this feature, we can improve throughput and response time of the device
 and the host's CPU usage by reducing the overhead of interrupt handling.
 This is good for the application using very high throughput/frequent
 interrupt device (e.g. 10GbE NIC).
 Real-time applications also benefit from the CPU isolation feature, which
 reduces interference from host kernel tasks and scheduling delay.

 The overview of this patch series is presented in CloudOpen 2012.
 The slides are available at:
 http://events.linuxfoundation.org/images/stories/pdf/lcna_co2012_sekiyama.pdf
 
 One question regarding your benchmarks: If you measured against standard
 KVM, were the vCPU thread running on an isolcpus core of its own as
 well? If not, your numbers about the impact of these patches on maximum,
 maybe also average latencies are likely too good. Also, using a non-RT
 host and possibly a non-prioritized vCPU thread for the standard
 scenario looks like an unfair comparison.


In the standard KVM benchmark, the vCPU thread is pinned down to its own
CPU core. In addition, the vCPU thread and irq/*-kvm threads are both set
to the max priority with SCHED_RR policy.

As you said, RT-host may result in better max latencies as below.
(But min/average latencies became worse; this might be an issue with our
 setup.)
                                 Min / Avg / Max latencies
Normal KVM
  RT-host (3.4.4-rt14)            15us / 21us / 117us
  non-RT-host (3.5.0-rc6)          6us / 11us / 152us
KVM + Direct IRQ
  non-RT-host (3.5.0-rc6 + patch)  1us /  2us /  14us

Thanks,
-- 
Tomoki Sekiyama tomoki.sekiyama...@hitachi.com
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory



Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Peter Lieven

On 09/10/12 13:29, Paolo Bonzini wrote:

Il 10/09/2012 13:06, Peter Lieven ha scritto:

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason MSR_READ rip
0x11478 info 0 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_msr: msr_read 194 = 0x0 (#GP)
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_inj_exception: #GP (0x0)
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason TRIPLE_FAULT
rip 0x11478 info 0 0

Memory controller MSR:

static float getNHMmultiplier(void)
{
    unsigned int msr_lo, msr_hi;
    float coef;

    /* Find multiplier (by MSR) */
    /* First, check if Flexible Ratio is Enabled */
    rdmsr(0x194, msr_lo, msr_hi);
    if ((msr_lo >> 16) & 1) {
        coef = (msr_lo >> 8) & 0xFF;
    } else {
        rdmsr(0xCE, msr_lo, msr_hi);
        coef = (msr_lo >> 8) & 0xFF;
    }

    return coef;
}

Looks like we need to emulate it since memtest only looks at the cpuid
to detect an integrated memory controller.  What does this return for you?

dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd

I/O error.

Peter



Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Peter Lieven

On 09/10/12 13:29, Paolo Bonzini wrote:

Il 10/09/2012 13:06, Peter Lieven ha scritto:

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason MSR_READ rip
0x11478 info 0 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_msr: msr_read 194 = 0x0 (#GP)
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_inj_exception: #GP (0x0)
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason TRIPLE_FAULT
rip 0x11478 info 0 0

Memory controller MSR:

static float getNHMmultiplier(void)
{
    unsigned int msr_lo, msr_hi;
    float coef;

    /* Find multiplier (by MSR) */
    /* First, check if Flexible Ratio is Enabled */
    rdmsr(0x194, msr_lo, msr_hi);
    if ((msr_lo >> 16) & 1) {
        coef = (msr_lo >> 8) & 0xFF;
    } else {
        rdmsr(0xCE, msr_lo, msr_hi);
        coef = (msr_lo >> 8) & 0xFF;
    }

    return coef;
}

Looks like we need to emulate it since memtest only looks at the cpuid
to detect an integrated memory controller.  What does this return for you?

dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd

it only works without the skip. but the msr device returns all zeroes.

peter



Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Avi Kivity
On 09/10/2012 02:29 PM, Paolo Bonzini wrote:
 Il 10/09/2012 13:06, Peter Lieven ha scritto:
 
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason MSR_READ rip
 0x11478 info 0 0
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_msr: msr_read 194 = 0x0 (#GP)
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_inj_exception: #GP (0x0)
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
 qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason TRIPLE_FAULT
 rip 0x11478 info 0 0
 
 Memory controller MSR:
 
 static float getNHMmultiplier(void)
 {
     unsigned int msr_lo, msr_hi;
     float coef;
 
     /* Find multiplier (by MSR) */
     /* First, check if Flexible Ratio is Enabled */
     rdmsr(0x194, msr_lo, msr_hi);
     if ((msr_lo >> 16) & 1) {
         coef = (msr_lo >> 8) & 0xFF;
     } else {
         rdmsr(0xCE, msr_lo, msr_hi);
         coef = (msr_lo >> 8) & 0xFF;
     }
 
     return coef;
 }
 

The SDM only mentions 0x194 as a machine check exception register
(recorded R12).

In the Architectural MSRs (how I love that name) section 18AH-197H are
listed as reserved.

In the Nehalem section they're not there at all.



-- 
error compiling committee.c: too many arguments to function


Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Paolo Bonzini
Il 10/09/2012 13:52, Peter Lieven ha scritto:
 dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
 dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd
 it only works without the skip. but the msr device returns all zeroes.

Hmm, the strange API of the MSR device doesn't work well with dd (dd
skips to offset 0x194 * 8 because bs is 8).  You can try this program:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void rdmsr(int fd, long reg)
{
    char msg[40];
    long long val;
    sprintf(msg, "rdmsr(%#lx)", reg);
    if (pread(fd, &val, 8, reg) < 0) {
        perror(msg);
    } else {
        printf("%s: %#016llx\n", msg, val);
        fflush(stdout);
    }
}


int main()
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }
    rdmsr(fd, 0x194);
    rdmsr(fd, 0xCE);
}

Paolo


Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Gleb Natapov
On Mon, Sep 10, 2012 at 02:15:49PM +0200, Paolo Bonzini wrote:
 Il 10/09/2012 13:52, Peter Lieven ha scritto:
  dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
  dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd
  it only works without the skip. but the msr device returns all zeroes.
 
 Hmm, the strange API of the MSR device doesn't work well with dd (dd
 skips to offset 0x194 * 8 because bs is 8).  You can try this program:
 
There is rdmsr/wrmsr in msr-tools.

 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 
 void rdmsr(int fd, long reg)
 {
     char msg[40];
     long long val;
     sprintf(msg, "rdmsr(%#lx)", reg);
     if (pread(fd, &val, 8, reg) < 0) {
         perror(msg);
     } else {
         printf("%s: %#016llx\n", msg, val);
         fflush(stdout);
     }
 }
 
 
 int main()
 {
     int fd = open("/dev/cpu/0/msr", O_RDONLY);
     if (fd < 0) { perror("open"); exit(1); }
     rdmsr(fd, 0x194);
     rdmsr(fd, 0xCE);
 }
 
 Paolo

--
Gleb.


Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Peter Lieven

On 09/10/12 14:21, Gleb Natapov wrote:

On Mon, Sep 10, 2012 at 02:15:49PM +0200, Paolo Bonzini wrote:

Il 10/09/2012 13:52, Peter Lieven ha scritto:

dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd

it only works without the skip. but the msr device returns all zeroes.

Hmm, the strange API of the MSR device doesn't work well with dd (dd
skips to offset 0x194 * 8 because bs is 8).  You can try this program:


There is rdmsr/wrmsr in msr-tools.
rdmsr reports that it cannot read those MSRs, regardless of whether I use 
-cpu host or -cpu qemu64.


peter

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void rdmsr(int fd, long reg)
{
    char msg[40];
    long long val;
    sprintf(msg, "rdmsr(%#lx)", reg);
    if (pread(fd, &val, 8, reg) < 0) {
        perror(msg);
    } else {
        printf("%s: %#016llx\n", msg, val);
        fflush(stdout);
    }
}


int main()
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }
    rdmsr(fd, 0x194);
    rdmsr(fd, 0xCE);
}

Paolo

--
Gleb.




Re: [PATCH v3 4/4] kvm: i386: Add classic PCI device assignment

2012-09-10 Thread Avi Kivity
On 09/10/2012 12:26 PM, Jan Kiszka wrote:
 
 Is patch 4 the only one that is at v3, and the rest are to be taken from
 the original posting?
 
 That is correct.

Thanks, applied to uq/master, will push shortly.


-- 
error compiling committee.c: too many arguments to function


KVM call agenda for Tuesday, September 11th

2012-09-10 Thread Juan Quintela
Hi

Please send in any agenda items you are interested in covering.

Thanks, Juan.


Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Avi Kivity
On 09/10/2012 03:29 PM, Peter Lieven wrote:
 On 09/10/12 14:21, Gleb Natapov wrote:
 On Mon, Sep 10, 2012 at 02:15:49PM +0200, Paolo Bonzini wrote:
 Il 10/09/2012 13:52, Peter Lieven ha scritto:
 dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
 dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd
 it only works without the skip. but the msr device returns all zeroes.
 Hmm, the strange API of the MSR device doesn't work well with dd (dd
 skips to offset 0x194 * 8 because bs is 8).  You can try this program:

 There is rdmsr/wrmsr in msr-tools.
 rdmsr reports that it cannot read those MSRs, regardless of whether I use
 -cpu host or -cpu qemu64.

On the host.


-- 
error compiling committee.c: too many arguments to function


Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Peter Lieven

On 09/10/12 14:32, Avi Kivity wrote:

On 09/10/2012 03:29 PM, Peter Lieven wrote:

On 09/10/12 14:21, Gleb Natapov wrote:

On Mon, Sep 10, 2012 at 02:15:49PM +0200, Paolo Bonzini wrote:

Il 10/09/2012 13:52, Peter Lieven ha scritto:

dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd

it only works without the skip. but the msr device returns all zeroes.

Hmm, the strange API of the MSR device doesn't work well with dd (dd
skips to offset 0x194 * 8 because bs is 8).  You can try this program:


There is rdmsr/wrmsr in msr-tools.

rdmsr reports that it cannot read those MSRs, regardless of whether I use -cpu host
or -cpu qemu64.

On the host.

aaah ok:

#rdmsr -0 0x194
00011100
#rdmsr -0 0xce
0c0004011103

Peter







[RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Gleb Natapov
Most interrupts are delivered to only one vcpu. Use pre-built tables to
find the interrupt destination instead of looping through all vcpus.

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64adb61..121f308 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 };
 
+struct kvm_apic_map {
+   struct rcu_head rcu;
+   bool flat;
+   u8 ldr_bits;
+   struct kvm_lapic *phys_map[255];
+   struct kvm_lapic *logical_map[16][16];
+};
+
 struct kvm_arch {
unsigned int n_used_mmu_pages;
unsigned int n_requested_mmu_pages;
@@ -528,6 +536,8 @@ struct kvm_arch {
struct kvm_ioapic *vioapic;
struct kvm_pit *vpit;
int vapics_in_nmi_mode;
+   struct kvm_apic_map *apic_map;
+   struct mutex apic_map_lock;
 
unsigned int tss_addr;
struct page *apic_access_page;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 07ad628..d18ddd2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -139,11 +139,101 @@ static inline int apic_enabled(struct kvm_lapic *apic)
(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
 
+static inline int apic_x2apic_mode(struct kvm_lapic *apic)
+{
+   return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
+}
+
+
 static inline int kvm_apic_id(struct kvm_lapic *apic)
 {
 return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
 }
 
+static void rcu_free_apic_map(struct rcu_head *head)
+{
+   struct kvm_apic_map *p = container_of(head,
+   struct kvm_apic_map, rcu);
+
+   kfree(p);
+}
+
+static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
+   u16 *cid, u16 *lid)
+{
+   if (ldr_bits == 32) {
+   *cid = ldr >> 16;
+   *lid = ldr & 0xffff;
+   } else {
+   ldr = GET_APIC_LOGICAL_ID(ldr);
+
+   if (flat) {
+   *cid = 0;
+   *lid = ldr;
+   } else {
+   *cid = ldr >> 4;
+   *lid = ldr & 0xf;
+   }
+   }
+}
+
+static inline int recalculate_apic_map(struct kvm *kvm)
+{
+   struct kvm_apic_map *new, *old = NULL;
+   struct kvm_vcpu *vcpu;
+   int i;
+
+   new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
+
+   if (!new)
+   return -ENOMEM;
+
+   new->ldr_bits = 8;
+   new->flat = true;
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   u16 cid, lid;
+   struct kvm_lapic *apic = vcpu->arch.apic;
+
+   if (!kvm_apic_present(vcpu))
+   continue;
+
+   if (apic_x2apic_mode(apic)) {
+   new->ldr_bits = 32;
+   new->flat = false;
+   } else if (kvm_apic_sw_enabled(apic) && new->flat &&
+   kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
+   new->flat = false;
+
+   new->phys_map[kvm_apic_id(apic)] = apic;
+   kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
+   new->flat, new->ldr_bits, &cid, &lid);
+
+   if (lid)
+   new->logical_map[cid][ffs(lid) - 1] = apic;
+   }
+   mutex_lock(&kvm->arch.apic_map_lock);
+   old = kvm->arch.apic_map;
+   rcu_assign_pointer(kvm->arch.apic_map, new);
+   mutex_unlock(&kvm->arch.apic_map_lock);
+
+   if (old)
+   call_rcu(&old->rcu, rcu_free_apic_map);
+
+   return 0;
+}
+
+static inline int kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
+{
+   apic_set_reg(apic, APIC_ID, id << 24);
+   return recalculate_apic_map(apic->vcpu->kvm);
+}
+
+static inline int kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
+{
+   apic_set_reg(apic, APIC_LDR, id);
+   return recalculate_apic_map(apic->vcpu->kvm);
+}
+
 static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
 {
 return !(kvm_apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED);
@@ -193,11 +283,6 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu)
apic_set_reg(apic, APIC_LVR, v);
 }
 
-static inline int apic_x2apic_mode(struct kvm_lapic *apic)
-{
-   return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
-}
-
 static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
LVT_MASK ,  /* part LVTT mask, timer mode mask added at runtime */
LVT_MASK | APIC_MODE_MASK,  /* LVTTHMR */
@@ -477,6 +562,67 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct 
kvm_lapic *source,
return result;
 }
 
+bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
+   struct kvm_lapic_irq *irq, int *r)
+{
+   unsigned long bitmap = 1;
+   struct 

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Andrew Theurer
On Sat, 2012-09-08 at 14:13 +0530, Srikar Dronamraju wrote:
  
  signed-off-by: Andrew Theurer haban...@linux.vnet.ibm.com
  
  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
  index fbf1fd0..c767915 100644
  --- a/kernel/sched/core.c
  +++ b/kernel/sched/core.c
  @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool
  preempt)
  
   again:
  p_rq = task_rq(p);
  +   if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & PF_VCPU)) {
  +   goto out_no_unlock;
  +   }
  double_rq_lock(rq, p_rq);
  while (task_rq(p) != p_rq) {
  double_rq_unlock(rq, p_rq);
  @@ -4856,8 +4859,6 @@ again:
  if (curr->sched_class != p->sched_class)
  goto out;
  
  -   if (task_running(p_rq, p) || p->state)
  -   goto out;
 
 Is it possible that by the time the current thread takes the double rq
 lock, thread p could actually be running?  I.e., is there merit to keeping
 this check around even with your similar check above?

I think that's a good idea.  I'll add that back in.
 
  
  yielded = curr->sched_class->yield_to_task(rq, p, preempt);
  if (yielded) {
  @@ -4879,6 +4880,7 @@ again:
  
   out:
  double_rq_unlock(rq, p_rq);
  +out_no_unlock:
  local_irq_restore(flags);
  
  if (yielded)
  
  
 

-Andrew Theurer




Re: [Qemu-devel] qemu-kvm log

2012-09-10 Thread Avi Kivity
On 09/10/2012 01:44 PM, Liu, Jinsong wrote:
 Hi,
 
 I'm currently debugging a qemu-kvm issue. I added some print code like 
 'fprintf(stderr, ...)', however I fail to see any output on stdio. Can anyone 
 tell me where the qemu-kvm logfile is, or what I need to do to record my 
 fprintf output?


If you're running via libvirt, the log is in /var/log/libvirt/qemu.  If
you're running from the command line, printf()s should end up on your
terminal.


-- 
error compiling committee.c: too many arguments to function


RE: [Qemu-devel] qemu-kvm log

2012-09-10 Thread Liu, Jinsong
Avi Kivity wrote:
 On 09/10/2012 01:44 PM, Liu, Jinsong wrote:
 Hi,
 
 I'm currently debugging a qemu-kvm issue. I added some print code like
 'fprintf(stderr, ...)', however I fail to see any output on stdio.
 Can anyone tell me where the qemu-kvm logfile is, or what I need to do to
 record my fprintf output?
 
 
 If you're running via libvirt, the log is in /var/log/libvirt/qemu.
 If you're running from the command line, printf()s should end up on
 your terminal.

Thanks!

I run qemu-kvm from the command line, but because SDL could not be initialized, 
I run the qemu command via VNC, like:
1). test kvm-unit-tests: qemu-system-x86_64 -device testdev,chardev=testlog 
-chardev file,id=testlog,path=apic.out -serial stdio -kernel ./x86/apic.flat 
-cpu host
2). test kvm guest: qemu-system-x86_64 -smp 2 -m 512 -hda test.qcow.img -cpu host
For case 1, there is some printf output from kvm-unit-tests on the terminal but 
no qemu-kvm printf output; for case 2, there is no qemu-kvm printf output on 
the terminal.

Best Regards,
Jinsong


Re: [RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 04:09:15PM +0300, Gleb Natapov wrote:
 Most interrupts are delivered to only one vcpu. Use pre-built tables to
 find the interrupt destination instead of looping through all vcpus.
 
 Signed-off-by: Gleb Natapov g...@redhat.com

Looks good overall. I think I see some bugs, with RCU use and others.
The most suspicious thing is that the code seems to use RCU but
there are no calls to rcu_dereference anywhere.
Pls see below.

Thanks for looking into this, once ready I intend to rework
direct injection to work on top of this.


 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 64adb61..121f308 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
   struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
  };
  
 +struct kvm_apic_map {
 + struct rcu_head rcu;
 + bool flat;
 + u8 ldr_bits;
 + struct kvm_lapic *phys_map[255];

Not 256? The apic id is sometimes used in the lookup - is it guaranteed
that the guest cannot set it to 0xff? If not, this will overflow.

 + struct kvm_lapic *logical_map[16][16];


These are large arrays: 2 Kbytes each, which is bad
for the cache.
Would it not be better to store the vcpu index there?
That would reduce the cache footprint by 8x.

 +};
 +
  struct kvm_arch {
   unsigned int n_used_mmu_pages;
   unsigned int n_requested_mmu_pages;
 @@ -528,6 +536,8 @@ struct kvm_arch {
   struct kvm_ioapic *vioapic;
   struct kvm_pit *vpit;
   int vapics_in_nmi_mode;
 + struct kvm_apic_map *apic_map;
 + struct mutex apic_map_lock;
  
   unsigned int tss_addr;
   struct page *apic_access_page;
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 07ad628..d18ddd2 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -139,11 +139,101 @@ static inline int apic_enabled(struct kvm_lapic *apic)
   (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
  
 +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
 +{
 + return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
 +}
 +
 +

two empty lines are not needed

  static inline int kvm_apic_id(struct kvm_lapic *apic)
  {
  return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
  }
  
 +static void rcu_free_apic_map(struct rcu_head *head)
 +{
 + struct kvm_apic_map *p = container_of(head,
 + struct kvm_apic_map, rcu);

why break this line? it is not too long as is.

 +
 + kfree(p);
 +}
 +
 +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
 + u16 *cid, u16 *lid)
 +{
 + if (ldr_bits == 32) {
 + *cid = ldr >> 16;
 + *lid = ldr & 0xffff;
 + } else {
 + ldr = GET_APIC_LOGICAL_ID(ldr);
 +
 + if (flat) {
 + *cid = 0;
 + *lid = ldr;
 + } else {
 + *cid = ldr  4;
 + *lid = ldr  0xf;
 + }
 + }
 +}
 +
 +static inline int recalculate_apic_map(struct kvm *kvm)

__must_check?

 +{
 + struct kvm_apic_map *new, *old = NULL;
 + struct kvm_vcpu *vcpu;
 + int i;
 +
 + new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
 +

empty line not needed here :)

 + if (!new)
 + return -ENOMEM;
 +
 +        new->ldr_bits = 8;
 +        new->flat = true;
 +        kvm_for_each_vcpu(i, vcpu, kvm) {
 +                u16 cid, lid;
 +                struct kvm_lapic *apic = vcpu->arch.apic;
 +
 +                if (!kvm_apic_present(vcpu))
 +                        continue;
 +
 +                if (apic_x2apic_mode(apic)) {
 +                        new->ldr_bits = 32;
 +                        new->flat = false;
 +                } else if (kvm_apic_sw_enabled(apic) && new->flat &&
 +                                kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
 +                        new->flat = false;
 +
 +                new->phys_map[kvm_apic_id(apic)] = apic;
 +                kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
 +                                new->flat, new->ldr_bits, &cid, &lid);
 +
 +                if (lid)
 +                        new->logical_map[cid][ffs(lid) - 1] = apic;
 +        }
 +        mutex_lock(&kvm->arch.apic_map_lock);
 +        old = kvm->arch.apic_map;

rcu_dereference_protected?

 +        rcu_assign_pointer(kvm->arch.apic_map, new);
 +        mutex_unlock(&kvm->arch.apic_map_lock);
 +
 +        if (old)
 +                call_rcu(&old->rcu, rcu_free_apic_map);

What guarantees rcu_free_apic_map is called before module goes away?
I suspect we need at least rcu_barrier here:
https://lwn.net/Articles/217484/

Also, this is done upon guest request?
This allocates 2K and delays the free until the next rcu grace
period. The guest can buffer up *a lot of* host kernel
memory before that happens - looks like a DOS potential?
Maybe simply synchronize_rcu here? Or, if that looks too
aggressive, adding this to some per-kvm pending 
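The module-unload concern above maps to the usual pattern from the linked LWN article: an rcu_barrier() on the teardown path. A hedged sketch only - the hook name is illustrative and the real KVM teardown path may differ:

```c
/* Sketch: on VM teardown, free the final map directly and then wait for
 * any rcu_free_apic_map() callbacks still queued by call_rcu(), so the
 * code implementing the callback cannot be unloaded while they are pending. */
static void example_arch_destroy(struct kvm *kvm)
{
	kfree(kvm->arch.apic_map);	/* no readers remain at this point */
	rcu_barrier();			/* drain pending call_rcu() callbacks */
}
```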

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Raghavendra K T

On 09/08/2012 01:12 AM, Andrew Theurer wrote:

On Fri, 2012-09-07 at 23:36 +0530, Raghavendra K T wrote:

CCing PeterZ also.

On 09/07/2012 06:41 PM, Andrew Theurer wrote:

I have noticed recently that PLE/yield_to() is still not that scalable
for really large guests, sometimes even with no CPU over-commit.  I have
a small change that makes a very big difference.

[...]

We are indeed avoiding CPUs in guest mode when we check
task->flags & PF_VCPU in the vcpu_on_spin path.  Doesn't that suffice?

My understanding is that it checks if the candidate vcpu task is in
guest mode (let's call this vcpu g1vcpuN), and that vcpu will not be a
target to yield to if it is already in guest mode.  I am concerned about
a different vcpu, possibly from a different VM (let's call it g2vcpuN),
but one that is also located on the same runqueue as g1vcpuN -and- running.  That
vcpu, g2vcpuN, may also be doing a directed yield, and it may already be
holding the rq lock.  Or it could be in guest mode.  If it is in guest
mode, then let's still target this rq, and try to yield to g1vcpuN.
However, if g2vcpuN is not in guest mode, then don't bother trying.


- If a non-vcpu task is currently running, this change can ignore a
request to yield to a target vcpu. The target vcpu could be the most
eligible vcpu, causing other vcpus to do PLE exits.

Is it possible to modify the check to deal with only vcpu tasks?

- Should we use p_rq->cfs_rq->skip instead to let us know that some
yield was active at this time?


-

cpu1                      cpu2                      cpu3
a1                        a2                        a3
b1                        b2                        b3
                          c2 (yield target of a1)   c3 (yield target of a2)

If vcpu a1 is doing a directed yield to vcpu c2 while the current vcpu a2
on the target cpu is also doing a directed yield (to some vcpu c3), then
this change will only allow vcpu a2 to schedule() to b2 (if the a2 -> c3
yield is successful). Do we miss yielding to vcpu c2?
a1 might not find a suitable vcpu to yield to and might go back to
spinning. Is my understanding correct?



Patch include below.

Here's the new, v2 result with the previous two:

10 VMs, 16-way each, all running dbench (2x cpu over-commit)
 throughput +/- stddev
  - -
ple on:   2552 +/- .70%
ple on: w/fixv1:  4621 +/- 2.12%  (81% improvement)
ple on: w/fixv2:  6115*   (139% improvement)



The numbers look great.


[*] I do not have stdev yet because all 10 runs are not complete

for v1 to v2, host CPU dropped from 60% to 50%.  Time in spin_lock() is
also dropping:


[...]


So this seems to be working.  However, I wonder just how far we can take
this.  Ideally we need to be within 3-4% in the host for PLE to work, like I
observe for the 8-way VMs.  We are still way off.

-Andrew


Signed-off-by: Andrew Theurer haban...@linux.vnet.ibm.com

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..c767915 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
  again:
        p_rq = task_rq(p);
+       if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & PF_VCPU)) {


While we are checking the flags of the p_rq->curr task, the task p can
migrate to some other runqueue. In this case, will we miss yielding to
the most eligible vcpu?



+   goto out_no_unlock;
+   }


Nit:
We don't need the parentheses above.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Avi Kivity
On 09/10/2012 04:09 PM, Gleb Natapov wrote:
 Most interrupts are delivered to only one vcpu. Use pre-built tables to
 find the interrupt destination instead of looping through all vcpus.
 
 +
 +static inline int recalculate_apic_map(struct kvm *kvm)
 +{
 + struct kvm_apic_map *new, *old = NULL;
 + struct kvm_vcpu *vcpu;
 + int i;
 +
 + new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
 +
 + if (!new)
 + return -ENOMEM;
 +
 +        new->ldr_bits = 8;
 +        new->flat = true;
 +        kvm_for_each_vcpu(i, vcpu, kvm) {
 +                u16 cid, lid;
 +                struct kvm_lapic *apic = vcpu->arch.apic;
 +
 +                if (!kvm_apic_present(vcpu))
 +                        continue;
 +
 +                if (apic_x2apic_mode(apic)) {
 +                        new->ldr_bits = 32;
 +                        new->flat = false;
 +                } else if (kvm_apic_sw_enabled(apic) && new->flat &&
 +                                kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
 +                        new->flat = false;
 +
 +                new->phys_map[kvm_apic_id(apic)] = apic;
 +                kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
 +                                new->flat, new->ldr_bits, &cid, &lid);
 +
 +                if (lid)
 +                        new->logical_map[cid][ffs(lid) - 1] = apic;
 +        }
 +        mutex_lock(&kvm->arch.apic_map_lock);
 +        old = kvm->arch.apic_map;
 +        rcu_assign_pointer(kvm->arch.apic_map, new);
 +        mutex_unlock(&kvm->arch.apic_map_lock);

is this not racy?

cpu 0 cpu 1
  -
new apic id
recalculate

  new apic id
  recalculate

  take lock
  install map
  drop lock

take lock
install map
drop lock


 +
 +        if (old)
 +                call_rcu(&old->rcu, rcu_free_apic_map);
 +
 + return 0;
 +}
 +

 -
  static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
   LVT_MASK ,  /* part LVTT mask, timer mode mask added at runtime */
   LVT_MASK | APIC_MODE_MASK,  /* LVTTHMR */
 @@ -477,6 +562,67 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct 
 kvm_lapic *source,
   return result;
  }
  
 +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 +                struct kvm_lapic_irq *irq, int *r)
 +{
 +        unsigned long bitmap = 1;
 +        struct kvm_lapic **dst;
 +        int i;
 +
 +        if (irq->shorthand == APIC_DEST_SELF) {
 +                *r = kvm_apic_set_irq(src->vcpu, irq);
 +                return true;
 +        }
 +
 +        if (irq->shorthand)
 +                return false;
 +
 +        if (irq->dest_mode == 0) { /* physical mode */
 +                if (irq->delivery_mode == APIC_DM_LOWEST ||
 +                                irq->dest_id == 0xff)
 +                        return false;
 +                rcu_read_lock();

Unmatched rcu_read_lock() inside a block is ugly, please start it earlier.

 +                dst = &kvm->arch.apic_map->phys_map[irq->dest_id & 0xff];

Like mst says, rcu_dereference().

 +        } else {
 +                u16 cid, lid;
 +                u32 mda = irq->dest_id;
 +
 +                rcu_read_lock();
 +                if (kvm->arch.apic_map->ldr_bits == 8)
 +                        mda <<= 24;
 +
 +                kvm_apic_get_logical_id(mda, kvm->arch.apic_map->flat,
 +                                kvm->arch.apic_map->ldr_bits, &cid, &lid);
 +                dst = kvm->arch.apic_map->logical_map[cid];
 +
 +                bitmap = lid;
 +                if (irq->delivery_mode == APIC_DM_LOWEST) {
 +                        int l = -1;
 +                        for_each_set_bit(i, &bitmap, 16) {
 +                                if (!dst[i])
 +                                        continue;
 +                                if (l < 0)
 +                                        l = i;
 +                                else if (kvm_apic_compare_prio(dst[i]->vcpu,
 +                                                dst[l]->vcpu) < 0)
 +                                        l = i;
 +                        }
 +
 +                        bitmap = (l >= 0) ? 1 << l : 0;
 +                }
 +        }
 +
 +        for_each_set_bit(i, &bitmap, 16) {
 +                if (!dst[i])
 +                        continue;
 +                if (*r < 0)
 +                        *r = 0;
 +                *r += kvm_apic_set_irq(dst[i]->vcpu, irq);
 +        }
 +
 +        rcu_read_unlock();
 +        return true;
 +}
 +
  /*
   * Add a pending IRQ into lapic.
   * Return 1 if successfully added and 0 if discarded.
 @@ -880,7 +1026,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 
 reg, u32 val)
   switch (reg) {
   case APIC_ID:   /* Local APIC ID */
   if (!apic_x2apic_mode(apic))
 -                apic_set_reg(apic, APIC_ID, val);
 +                kvm_apic_set_id(apic, val >> 24);
   else
   ret = 1;
   break;
 @@ -896,15 +1042,16 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 
 reg, u32 val)
  
   case APIC_LDR:
   if (!apic_x2apic_mode(apic))
 -   

[PULL] KVM updates

2012-09-10 Thread Avi Kivity
Please pull from:

  git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master

to merge some kvm updates, most notably a port of qemu-kvm's pre-vfio device
assignment.  With this there are no significant changes left between qemu and
qemu-kvm (though some work remains).


Christian Borntraeger (1):
  qemu: Use valgrind annotations to mark kvm guest memory as defined

Jan Kiszka (5):
  kvm: Clean up irqfd API
  kvm: Introduce kvm_irqchip_update_msi_route
  kvm: Introduce kvm_has_intx_set_mask
  kvm: i386: Add services required for PCI device assignment
  kvm: i386: Add classic PCI device assignment

 configure  |3 +-
 hw/kvm/Makefile.objs   |2 +-
 hw/kvm/pci-assign.c| 1915 
 hw/qdev-monitor.c  |1 +
 hw/virtio-pci.c|4 +-
 kvm-all.c  |   75 +-
 kvm-stub.c |   14 +-
 kvm.h  |8 +-
 target-i386/kvm.c  |  141 
 target-i386/kvm_i386.h |   22 +
 10 files changed, 2151 insertions(+), 34 deletions(-)
 create mode 100644 hw/kvm/pci-assign.c

-- 
error compiling committee.c: too many arguments to function


Re: [PULL] KVM updates

2012-09-10 Thread Jan Kiszka
On 2012-09-10 17:25, Avi Kivity wrote:
 Please pull from:
 
   git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master
 
 to merge some kvm updates, most notably a port of qemu-kvm's pre-vfio device
 assignment.  With this there are no significant changes left between qemu and
 qemu-kvm (though some work remains).
 
 
 Christian Borntraeger (1):
   qemu: Use valgrind annotations to mark kvm guest memory as defined
 
 Jan Kiszka (5):
   kvm: Clean up irqfd API
   kvm: Introduce kvm_irqchip_update_msi_route
   kvm: Introduce kvm_has_intx_set_mask
   kvm: i386: Add services required for PCI device assignment
   kvm: i386: Add classic PCI device assignment

uq/master used to contain two patches from Peter regarding the kernel
headers and my coalesced MMIO decoupling series - dropped intentionally?

Jan

 
  configure  |3 +-
  hw/kvm/Makefile.objs   |2 +-
  hw/kvm/pci-assign.c| 1915 
 
  hw/qdev-monitor.c  |1 +
  hw/virtio-pci.c|4 +-
  kvm-all.c  |   75 +-
  kvm-stub.c |   14 +-
  kvm.h  |8 +-
  target-i386/kvm.c  |  141 
  target-i386/kvm_i386.h |   22 +
  10 files changed, 2151 insertions(+), 34 deletions(-)
  create mode 100644 hw/kvm/pci-assign.c
 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] [PATCH 4/4] kvm: i386: Add classic PCI device assignment

2012-09-10 Thread Andreas Färber
Am 05.09.2012 22:46, schrieb Anthony Liguori:
 What do we do when the FSF comes out with the GPLv4 and relicenses again
 in an incompatible fashion?  Do we do this exercise every couple of
 years?

That's exactly why I suggested GPLv2+ because it was supposed to be a
preparation for the future. I'm pretty sure I mentioned the potential
GPLv4 back then!

Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


[GIT PULL] KVM updates for 3.6-rc5

2012-09-10 Thread Avi Kivity
Linus, please pull from the repo and branch at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/kvm-3.6-2

to receive a trio of KVM fixes: a fix for incorrect lookup of guest cpuid,
an uninitialized variable fix, and an error path cleanup fix.

Shortlog/diffstat follow.


KVM updates for 3.6-rc5


Jamie Iles (1):
  KVM: PIC: fix use of uninitialised variable.

Ren, Yongjie (1):
  KVM: x86: Check INVPCID feature bit in EBX of leaf 7

Xiao Guangrong (1):
  KVM: fix error paths for failed gfn_to_page() calls

 arch/x86/kvm/i8259.c |  2 +-
 arch/x86/kvm/vmx.c   | 23 ++-
 arch/x86/kvm/x86.c   | 13 ++---
 3 files changed, 29 insertions(+), 9 deletions(-)

-- 
error compiling committee.c: too many arguments to function


Re: [RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Gleb Natapov
On Mon, Sep 10, 2012 at 06:09:01PM +0300, Avi Kivity wrote:
 On 09/10/2012 04:09 PM, Gleb Natapov wrote:
  Most interrupts are delivered to only one vcpu. Use pre-built tables to
  find the interrupt destination instead of looping through all vcpus.
  
  +
  +static inline int recalculate_apic_map(struct kvm *kvm)
  +{
  +   struct kvm_apic_map *new, *old = NULL;
  +   struct kvm_vcpu *vcpu;
  +   int i;
  +
  +   new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
  +
  +   if (!new)
  +   return -ENOMEM;
  +
  +        new->ldr_bits = 8;
  +        new->flat = true;
  +        kvm_for_each_vcpu(i, vcpu, kvm) {
  +                u16 cid, lid;
  +                struct kvm_lapic *apic = vcpu->arch.apic;
  +
  +                if (!kvm_apic_present(vcpu))
  +                        continue;
  +
  +                if (apic_x2apic_mode(apic)) {
  +                        new->ldr_bits = 32;
  +                        new->flat = false;
  +                } else if (kvm_apic_sw_enabled(apic) && new->flat &&
  +                                kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
  +                        new->flat = false;
  +
  +                new->phys_map[kvm_apic_id(apic)] = apic;
  +                kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
  +                                new->flat, new->ldr_bits, &cid, &lid);
  +
  +                if (lid)
  +                        new->logical_map[cid][ffs(lid) - 1] = apic;
  +        }
  +        mutex_lock(&kvm->arch.apic_map_lock);
  +        old = kvm->arch.apic_map;
  +        rcu_assign_pointer(kvm->arch.apic_map, new);
  +        mutex_unlock(&kvm->arch.apic_map_lock);
 
 is this not racy?
 
 cpu 0 cpu 1
   -
 new apic id
 recalculate
 
   new apic id
   recalculate
 
   take lock
   install map
   drop lock
 
 take lock
 install map
 drop lock
 
Hmm, yes. Will have to hold lock during recalculation. Not a big deal.
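A sketch of that locking change (illustrative only; based on the function quoted above, with the map built inside the critical section and rcu_dereference_protected() added per the earlier review comment):

```c
static int recalculate_apic_map(struct kvm *kvm)
{
	struct kvm_apic_map *new, *old;

	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
	if (!new)
		return -ENOMEM;

	mutex_lock(&kvm->arch.apic_map_lock);
	/* ... build new as before, now serialized against other updaters,
	 * so two concurrent recalculations cannot install maps out of order ... */
	old = rcu_dereference_protected(kvm->arch.apic_map,
			lockdep_is_held(&kvm->arch.apic_map_lock));
	rcu_assign_pointer(kvm->arch.apic_map, new);
	mutex_unlock(&kvm->arch.apic_map_lock);

	if (old)
		call_rcu(&old->rcu, rcu_free_apic_map);
	return 0;
}
```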

 
  +
  +   if (old)
  +                call_rcu(&old->rcu, rcu_free_apic_map);
  +
  +   return 0;
  +}
  +
 
  -
   static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
  LVT_MASK ,  /* part LVTT mask, timer mode mask added at runtime */
  LVT_MASK | APIC_MODE_MASK,  /* LVTTHMR */
  @@ -477,6 +562,67 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct 
  kvm_lapic *source,
  return result;
   }
   
  +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
  +                struct kvm_lapic_irq *irq, int *r)
  +{
  +        unsigned long bitmap = 1;
  +        struct kvm_lapic **dst;
  +        int i;
  +
  +        if (irq->shorthand == APIC_DEST_SELF) {
  +                *r = kvm_apic_set_irq(src->vcpu, irq);
  +                return true;
  +        }
  +
  +        if (irq->shorthand)
  +                return false;
  +
  +        if (irq->dest_mode == 0) { /* physical mode */
  +                if (irq->delivery_mode == APIC_DM_LOWEST ||
  +                                irq->dest_id == 0xff)
  +                        return false;
  +                rcu_read_lock();
 
 Unmatched rcu_read_lock() inside a block is ugly, please start it earlier.
 
Doing the rcu_read_unlock() that way is ugly too, but if you prefer it this way...

  +                dst = &kvm->arch.apic_map->phys_map[irq->dest_id & 0xff];
 
 Like mst says, rcu_dereference().
 
Ugh.

  +        } else {
  +                u16 cid, lid;
  +                u32 mda = irq->dest_id;
  +
  +                rcu_read_lock();
  +                if (kvm->arch.apic_map->ldr_bits == 8)
  +                        mda <<= 24;
  +
  +                kvm_apic_get_logical_id(mda, kvm->arch.apic_map->flat,
  +                                kvm->arch.apic_map->ldr_bits, &cid, &lid);
  +                dst = kvm->arch.apic_map->logical_map[cid];
  +
  +                bitmap = lid;
  +                if (irq->delivery_mode == APIC_DM_LOWEST) {
  +                        int l = -1;
  +                        for_each_set_bit(i, &bitmap, 16) {
  +                                if (!dst[i])
  +                                        continue;
  +                                if (l < 0)
  +                                        l = i;
  +                                else if (kvm_apic_compare_prio(dst[i]->vcpu,
  +                                                dst[l]->vcpu) < 0)
  +                                        l = i;
  +                        }
  +
  +                        bitmap = (l >= 0) ? 1 << l : 0;
  +                }
  +        }
  +
  +        for_each_set_bit(i, &bitmap, 16) {
  +                if (!dst[i])
  +                        continue;
  +                if (*r < 0)
  +                        *r = 0;
  +                *r += kvm_apic_set_irq(dst[i]->vcpu, irq);
  +        }
  +
  +        rcu_read_unlock();
  +        return true;
  +}
  +
   /*
* Add a pending IRQ into lapic.
* Return 1 if successfully added and 0 if discarded.
  @@ -880,7 +1026,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 
  reg, u32 val)
  switch (reg) {
  case APIC_ID:   /* Local APIC ID */
  if (!apic_x2apic_mode(apic))
  -                apic_set_reg(apic, APIC_ID, val);
  +                kvm_apic_set_id(apic, val >> 24);
  else
  ret = 1;
  

Re: [PULL] KVM updates

2012-09-10 Thread Marcelo Tosatti
On Mon, Sep 10, 2012 at 05:32:39PM +0200, Jan Kiszka wrote:
 On 2012-09-10 17:25, Avi Kivity wrote:
  Please pull from:
  
git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master
  
  to merge some kvm updates, most notably a port of qemu-kvm's pre-vfio device
  assignment.  With this there are no significant changes left between qemu 
  and
  qemu-kvm (though some work remains).
  
  
  Christian Borntraeger (1):
qemu: Use valgrind annotations to mark kvm guest memory as defined
  
  Jan Kiszka (5):
kvm: Clean up irqfd API
kvm: Introduce kvm_irqchip_update_msi_route
kvm: Introduce kvm_has_intx_set_mask
kvm: i386: Add services required for PCI device assignment
kvm: i386: Add classic PCI device assignment
 
 uq/master used to contain two patches from Peter regarding the kernel
 headers and my coalesced MMIO decoupling series - dropped intentionally?
 
 Jan

Probably unintentionally, they haven't been lost though (will be in the
next batch once this batch is pulled).



Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-10 Thread Paolo Bonzini
Il 06/09/2012 07:02, Michael S. Tsirkin ha scritto:
  It might be worth just unconditionally having a cache for the 2
  descriptor case.  This is what I get with qemu tap, though for some
  reason the device features don't have guest or host CSUM, so my setup is
  probably screwed:
 Yes without checksum net core always linearizes packets, so yes it is
 screwed.
 For -net, skb always allocates space for 17 frags + linear part so
 it seems sane to do same in virtio core, and allocate, for -net,
 up to max_frags + 1 from cache.
 We can adjust it: no _SG -> 2, otherwise 18.
 
 Not sure about other drivers, maybe really use 2 there for now.

2 should also be good for virtio-blk and virtio-scsi 4KB random rw
workloads.

Paolo



Re: [PULL] KVM updates

2012-09-10 Thread Avi Kivity
On 09/10/2012 06:49 PM, Marcelo Tosatti wrote:
 On Mon, Sep 10, 2012 at 05:32:39PM +0200, Jan Kiszka wrote:
 On 2012-09-10 17:25, Avi Kivity wrote:
  Please pull from:
  
git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master
  
  to merge some kvm updates, most notably a port of qemu-kvm's pre-vfio 
  device
  assignment.  With this there are no significant changes left between qemu 
  and
  qemu-kvm (though some work remains).
  
  
  Christian Borntraeger (1):
qemu: Use valgrind annotations to mark kvm guest memory as defined
  
  Jan Kiszka (5):
kvm: Clean up irqfd API
kvm: Introduce kvm_irqchip_update_msi_route
kvm: Introduce kvm_has_intx_set_mask
kvm: i386: Add services required for PCI device assignment
kvm: i386: Add classic PCI device assignment
 
 uq/master used to contain two patches from Peter regarding the kernel
 headers and my coalesced MMIO decoupling series - dropped intentionally?

Thanks for checking.

 Probably unintentionally, they haven't been lost though (will be in the
 next batch once this batch is pulled).

Any idea how they were lost?

-- 
error compiling committee.c: too many arguments to function


Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Peter Zijlstra
On Mon, 2012-09-10 at 08:16 -0500, Andrew Theurer wrote:
  @@ -4856,8 +4859,6 @@ again:
          if (curr->sched_class != p->sched_class)
                  goto out;
  
  -       if (task_running(p_rq, p) || p->state)
  -               goto out;
  
 Is it possible that by the time the current thread takes the double rq
 lock, thread p could actually be running?  i.e. is there merit to keeping
 this check around even with your similar check above?
 
 I think that's a good idea.  I'll add that back in. 

Right, it needs to still be there, the test before acquiring p_rq is an
optimistic test to avoid work, but you have to still test it once you
acquire p_rq since the rest of the code relies on this not being so.

How about something like this instead.. ?

---
 kernel/sched/core.c | 35 ++-
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c46a011..c9ecab2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4300,6 +4300,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
+{
+       if (!curr->sched_class->yield_to_task)
+               return false;
+
+       if (curr->sched_class != p->sched_class)
+               return false;
+
+       if (task_running(p_rq, p) || p->state)
+               return false;
+
+       return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
rq = this_rq();
 
 again:
+   /* optimistic test to avoid taking locks */
+   if (!__yield_to_candidate(curr, p))
+   goto out_irq;
+
p_rq = task_rq(p);
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
@@ -4330,14 +4351,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
goto again;
}
 
-       if (!curr->sched_class->yield_to_task)
-               goto out;
-
-       if (curr->sched_class != p->sched_class)
-               goto out;
-
-       if (task_running(p_rq, p) || p->state)
-               goto out;
+       /* validate state, holding p_rq ensures p's state cannot change */
+       if (!__yield_to_candidate(curr, p))
+               goto out_unlock;
 
        yielded = curr->sched_class->yield_to_task(rq, p, preempt);
        if (yielded) {
@@ -4350,8 +4366,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
                resched_task(p_rq->curr);
}
 
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
 
if (yielded)



Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 10:47:15AM -0500, Thomas Lendacky wrote:
 On Friday, September 07, 2012 09:19:04 AM Rusty Russell wrote:
 
  Michael S. Tsirkin m...@redhat.com writes:
 
   On Thu, Sep 06, 2012 at 05:27:23PM +0930, Rusty Russell wrote:
 
   Michael S. Tsirkin m...@redhat.com writes:
 
Yes without checksum net core always linearizes packets, so yes it is
 
screwed.
 
For -net, skb always allocates space for 17 frags + linear part so
 
it seems sane to do same in virtio core, and allocate, for -net,
 
up to max_frags + 1 from cache.
 
 We can adjust it: no _SG -> 2, otherwise 18.
 
  
 
   But I thought it used individual buffers these days?
 
  
 
   Yes for receive, no for transmit. That's probably why
 
   we should have the threshold per vq, not per device, BTW.
 
 
 
  Can someone actually run with my histogram patch and see what the real
 
  numbers are?
 
 
 
  
 
 I ran some TCP_RR and TCP_STREAM sessions, both host-to-guest and
 
 guest-to-host, with a form of the histogram patch applied against a
 
 RHEL6.3 kernel. The histogram values were reset after each test.
 
  
 
 Here are the results:
 
  
 
 60 session TCP_RR from host-to-guest with 256 byte request and 256 byte
 
 response for 60 seconds:
 
  
 
 Queue histogram for virtio1:
 
 Size distribution for input (max=7818456):
 
 1: 7818456 
 
 Size distribution for output (max=7816698):
 
 2: 149
 
 3: 7816698 

Here, a threshold would help.

 
 4: 2
 
 5: 1
 
 Size distribution for control (max=1):
 
 0: 0
 
  
 
  
 
 4 session TCP_STREAM from host-to-guest with 4K message size for 60 seconds:
 
  
 
 Queue histogram for virtio1:
 
 Size distribution for input (max=16050941):
 
 1: 16050941 
 
 Size distribution for output (max=1877796):
 
 2: 1877796 
 
 3: 5
 
 Size distribution for control (max=1):
 
 0: 0
 
  
 
 4 session TCP_STREAM from host-to-guest with 16K message size for 60 seconds:
 
  
 
 Queue histogram for virtio1:
 
 Size distribution for input (max=16831151):
 
 1: 16831151 
 
 Size distribution for output (max=1923965):
 
 2: 1923965 
 
 3: 5
 
 Size distribution for control (max=1):
 
 0: 0
 

Hmm, for virtio-net output we always use 2 s/g entries; this is because of a
qemu bug. Maybe it's time we fixed this and added a feature bit?
This would fix the above without threshold hacks.


 
 4 session TCP_STREAM from guest-to-host with 4K message size for 60 seconds:
 
  
 
 Queue histogram for virtio1:
 
 Size distribution for input (max=1316069):
 
 1: 1316069 
 
 Size distribution for output (max=879213):
 
 2: 24
 
 3: 24097 #
 
 4: 23176 #
 
 5: 3412
 
 6: 4446
 
 7: 4663
 
 8: 4195
 
 9: 3772
 
 10: 3388
 
 11: 3666
 
 12: 2885
 
 13: 2759
 
 14: 2997
 
 15: 3060
 
 16: 2651
 
 17: 2235
 
 18: 92721 ##
 
 19: 879213 
 
 Size distribution for control (max=1):
 
 0: 0
 
  
 
 4 session TCP_STREAM from guest-to-host with 16K message size for 60 seconds:
 
  
 
 Queue histogram for virtio1:
 
 Size distribution for input (max=1428590):
 
 1: 1428590 
 
 Size distribution for output (max=957774):
 
 2: 20
 
 3: 54955 ###
 
 4: 34281 ##
 
 5: 2967
 
 6: 3394
 
 7: 9400
 
 8: 3061
 
 9: 3397
 
 10: 3258
 
 11: 3275
 
 12: 3147
 
 13: 2876
 
 14: 2747
 
 15: 2832
 
 16: 2013
 
 17: 1670
 
 18: 100369 ##
 
 19: 957774 
 
 Size distribution for control (max=1):
 
 0: 0
 
  
 
 Thanks,
 
 Tom

In these tests we would have to set the threshold pretty high.
I wonder whether the following makes any difference; the idea is to:
A. get less false cache sharing by allocating full cache lines
B. get better locality by using the same cache for multiple sizes

So we get some of the wins of the threshold without bothering
with a cache.

Will try to test but not until later this week.

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5aa43c3..c184712 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -132,7 +132,8 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
unsigned head;
int i;
 
-   desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
+   desc = kmalloc(L1_CACHE_ALIGN((out + in) * sizeof(struct vring_desc)),
+  gfp);
if (!desc)
return -ENOMEM;
 


-- 
MST

Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-10 Thread Thomas Lendacky
On Friday, September 07, 2012 09:19:04 AM Rusty Russell wrote:
 Michael S. Tsirkin m...@redhat.com writes:
  On Thu, Sep 06, 2012 at 05:27:23PM +0930, Rusty Russell wrote:
  Michael S. Tsirkin m...@redhat.com writes:
   Yes without checksum net core always linearizes packets, so yes it is
   screwed.
   For -net, skb always allocates space for 17 frags + linear part so
   it seems sane to do same in virtio core, and allocate, for -net,
   up to max_frags + 1 from cache.
   We can adjust it: no _SG -> 2, otherwise 18.
  
  But I thought it used individual buffers these days?
  
  Yes for receive, no for transmit. That's probably why
  we should have the threshold per vq, not per device, BTW.
 
 Can someone actually run with my histogram patch and see what the real
 numbers are?

Somehow some HTML got in my first reply, resending...

I ran some TCP_RR and TCP_STREAM sessions, both host-to-guest and
guest-to-host, with a form of the histogram patch applied against a
RHEL6.3 kernel. The histogram values were reset after each test.

Here are the results:

60 session TCP_RR from host-to-guest with 256 byte request and 256 byte
response for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=7818456):
 1: 7818456 
Size distribution for output (max=7816698):
 2: 149 
 3: 7816698 
 4: 2   
 5: 1   
Size distribution for control (max=1):
 0: 0


4 session TCP_STREAM from host-to-guest with 4K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=16050941):
 1: 16050941 
Size distribution for output (max=1877796):
 2: 1877796 
 3: 5   
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from host-to-guest with 16K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=16831151):
 1: 16831151 
Size distribution for output (max=1923965):
 2: 1923965 
 3: 5   
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from guest-to-host with 4K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=1316069):
 1: 1316069 
Size distribution for output (max=879213):
 2: 24  
 3: 24097   #
 4: 23176   #
 5: 3412
 6: 4446
 7: 4663
 8: 4195
 9: 3772
10: 3388
11: 3666
12: 2885
13: 2759
14: 2997
15: 3060
16: 2651
17: 2235
18: 92721   ##  
19: 879213  
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from guest-to-host with 16K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=1428590):
 1: 1428590 
Size distribution for output (max=957774):
 2: 20
 3: 54955   ###
 4: 34281   ##
 5: 2967
 6: 3394
 7: 9400
 8: 3061
 9: 3397
10: 3258
11: 3275
12: 3147
13: 2876
14: 2747
15: 2832
16: 2013
17: 1670
18: 100369  ##
19: 957774  
Size distribution for control (max=1):
 0: 0

Thanks,
Tom

 
 I'm not convinced that the ideal 17-buffer case actually happens as much
 as we think.  And if it's not happening with this netperf test, we're
 testing the wrong thing.
 
 Thanks,
 Rusty.
 



Re: [RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Gleb Natapov
On Mon, Sep 10, 2012 at 05:44:38PM +0300, Michael S. Tsirkin wrote:
 On Mon, Sep 10, 2012 at 04:09:15PM +0300, Gleb Natapov wrote:
  Most interrupt are delivered to only one vcpu. Use pre-build tables to
  find interrupt destination instead of looping through all vcpus.
  
  Signed-off-by: Gleb Natapov g...@redhat.com
 
 Looks good overall. I think I see some bugs, with rcu use and others.
 The most suspicious thing is that the code seems to use rcu but
 there are no calls to rcu_dereference anywhere.
Yep. Forgot about it.

 Pls see below.
 
 Thanks for looking into this, once ready I intend to rework
 direct injection to work on top of this.
 
 
  diff --git a/arch/x86/include/asm/kvm_host.h 
  b/arch/x86/include/asm/kvm_host.h
  index 64adb61..121f308 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
  struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
   };
   
  +struct kvm_apic_map {
  +   struct rcu_head rcu;
  +   bool flat;
  +   u8 ldr_bits;
  +   struct kvm_lapic *phys_map[255];
 
 Not 256? The apic id is sometimes used in the lookup - is it guaranteed
 that the guest can not set it to 0xff? If it can, this will overflow.
 
Yes, it needs to be 256.

  +   struct kvm_lapic *logical_map[16][16];
 
 
 These are large arrays: 2Kbyte each one, which is bad
 for cache.
 Is it not better to have vcpu index stored there?
 that would reduce cache footprint by x8.
 
We do not have vcpu indexes, only vcpu ids, which do not correspond to a
vcpu's place in the vcpu array. The vcpu array will likely be replaced by a
list when vcpu destruction is implemented, so I do not want to
introduce any dependencies on the current implementation detail.

  +};
  +
   struct kvm_arch {
  unsigned int n_used_mmu_pages;
  unsigned int n_requested_mmu_pages;
  @@ -528,6 +536,8 @@ struct kvm_arch {
  struct kvm_ioapic *vioapic;
  struct kvm_pit *vpit;
  int vapics_in_nmi_mode;
  +   struct kvm_apic_map *apic_map;
  +   struct mutex apic_map_lock;
   
  unsigned int tss_addr;
  struct page *apic_access_page;
  diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
  index 07ad628..d18ddd2 100644
  --- a/arch/x86/kvm/lapic.c
  +++ b/arch/x86/kvm/lapic.c
  @@ -139,11 +139,101 @@ static inline int apic_enabled(struct kvm_lapic 
  *apic)
  (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
   APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
   
  +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
  +{
  +   return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
  +}
  +
  +
 
 two empty lines not needed
 
That's terrible, I agree.
 
   static inline int kvm_apic_id(struct kvm_lapic *apic)
   {
   return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
   }
   
  +static void rcu_free_apic_map(struct rcu_head *head)
  +{
  +   struct kvm_apic_map *p = container_of(head,
  +   struct kvm_apic_map, rcu);
 
 why break this line? it is not too long as is.
 
Because it was longer, but then the structure was renamed to be shorter :)

  +
  +   kfree(p);
  +}
  +
  +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
  +   u16 *cid, u16 *lid)
  +{
  +   if (ldr_bits == 32) {
  +   *cid = ldr >> 16;
  +   *lid = ldr & 0xffff;
  +   } else {
  +   ldr = GET_APIC_LOGICAL_ID(ldr);
  +
  +   if (flat) {
  +   *cid = 0;
  +   *lid = ldr;
  +   } else {
  +   *cid = ldr >> 4;
  +   *lid = ldr & 0xf;
  +   }
  +   }
  +}
  +
  +static inline int recalculate_apic_map(struct kvm *kvm)
 
 __must_check?
What is it?
 
  +{
  +   struct kvm_apic_map *new, *old = NULL;
  +   struct kvm_vcpu *vcpu;
  +   int i;
  +
  +   new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
  +
 
 empty line not needed here :)
I like it that way. You think it will break compilation or runtime?

 
  +   if (!new)
  +   return -ENOMEM;
  +
  +   new->ldr_bits = 8;
  +   new->flat = true;
  +   kvm_for_each_vcpu(i, vcpu, kvm) {
  +   u16 cid, lid;
  +   struct kvm_lapic *apic = vcpu->arch.apic;
  +
  +   if (!kvm_apic_present(vcpu))
  +   continue;
  +
  +   if (apic_x2apic_mode(apic)) {
  +   new->ldr_bits = 32;
  +   new->flat = false;
  +   } else if (kvm_apic_sw_enabled(apic) && new->flat &&
  +   kvm_apic_get_reg(apic, APIC_DFR) ==
  +   APIC_DFR_CLUSTER)
  +   new->flat = false;
  +
  +   new->phys_map[kvm_apic_id(apic)] = apic;
  +   kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
  +   new->flat, new->ldr_bits, &cid, &lid);
  +
  +   if (lid)
  +   new->logical_map[cid][ffs(lid) - 1] = apic;
  +   }
  +   mutex_lock(&kvm->arch.apic_map_lock);
  +   old = kvm->arch.apic_map;
 
 rcu_dereference_protected?
What is _protected? I 

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Srikar Dronamraju
* Peter Zijlstra pet...@infradead.org [2012-09-10 18:03:55]:

 On Mon, 2012-09-10 at 08:16 -0500, Andrew Theurer wrote:
@@ -4856,8 +4859,6 @@ again:
if (curr->sched_class != p->sched_class)
goto out;

-   if (task_running(p_rq, p) || p->state)
-   goto out;
   
   Is it possible that by this time the current thread takes double rq
   lock, thread p could actually be running?  i.e is there merit to keep
   this check around even with your similar check above?
  
  I think that's a good idea.  I'll add that back in. 
 
 Right, it needs to still be there, the test before acquiring p_rq is an
 optimistic test to avoid work, but you have to still test it once you
 acquire p_rq since the rest of the code relies on this not being so.
 
 How about something like this instead.. ?
 
 ---
  kernel/sched/core.c | 35 ++-
  1 file changed, 26 insertions(+), 9 deletions(-)
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index c46a011..c9ecab2 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -4300,6 +4300,23 @@ void __sched yield(void)
  }
  EXPORT_SYMBOL(yield);
  
 +/*
 + * Tests preconditions required for sched_class::yield_to().
 + */
 +static bool __yield_to_candidate(struct task_struct *curr, struct
 task_struct *p)
 +{
 + if (!curr->sched_class->yield_to_task)
 + return false;
 +
 + if (curr->sched_class != p->sched_class)
 + return false;


Peter, 

Should we also add a check if the runq has a skip buddy (as pointed out
by Raghu) and return if the skip buddy is already set.  Something akin
to 

if (p_rq->cfs_rq->skip)
return false;

So if somebody has already acquired a double run queue lock and almost
set the next buddy, we dont need to take run queue lock and also avoid
overwriting the already set skip buddy.

 +
 + if (task_running(p_rq, p) || p->state)
 + return false;
 +
 + return true;
 +}
 +
  /**
   * yield_to - yield the current processor to another thread in
   * your thread group, or accelerate that thread toward the
 @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool 
 preempt)
   rq = this_rq();
  
  again:
 + /* optimistic test to avoid taking locks */
 + if (!__yield_to_candidate(curr, p))
 + goto out_irq;
 +
   p_rq = task_rq(p);
   double_rq_lock(rq, p_rq);
   while (task_rq(p) != p_rq) {
 @@ -4330,14 +4351,9 @@ bool __sched yield_to(struct task_struct *p, bool 
 preempt)
   goto again;
   }
  
 - if (!curr->sched_class->yield_to_task)
 - goto out;
 -
 - if (curr->sched_class != p->sched_class)
 - goto out;
 -
 - if (task_running(p_rq, p) || p->state)
 - goto out;
 + /* validate state, holding p_rq ensures p's state cannot change */
 + if (!__yield_to_candidate(curr, p))
 + goto out_unlock;
  
   yielded = curr-sched_class-yield_to_task(rq, p, preempt);
   if (yielded) {
 @@ -4350,8 +4366,9 @@ bool __sched yield_to(struct task_struct *p, bool 
 preempt)
  resched_task(p_rq->curr);
   }
  
 -out:
 +out_unlock:
   double_rq_unlock(rq, p_rq);
 +out_irq:
   local_irq_restore(flags);
  
   if (yielded)
 



Re: [RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Gleb Natapov
On Mon, Sep 10, 2012 at 07:17:54PM +0300, Gleb Natapov wrote:
   + return 0;
   +}
   +
   +static inline int kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
   +{
    + apic_set_reg(apic, APIC_ID, id << 24);
    + return recalculate_apic_map(apic->vcpu->kvm);
   +}
   +
   +static inline int kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
   +{
   + apic_set_reg(apic, APIC_LDR, id);
    + return recalculate_apic_map(apic->vcpu->kvm);
   +}
   +
  
  The return value of these functions never seems to be checked.
  
 Yes, the problem is that we can do nothing about the failure if the failure
 happens during a guest write.
 
Actually I have an idea how to handle the error: never return one. If
the map cannot be allocated, always go down the slow path. phys_map
obviously needs to be checked for NULL during delivery in this case.

--
Gleb.


Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Peter Zijlstra
On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
  +static bool __yield_to_candidate(struct task_struct *curr, struct
  task_struct *p)
  +{
  + if (!curr->sched_class->yield_to_task)
  + return false;
  +
  + if (curr->sched_class != p->sched_class)
  + return false;
 
 
 Peter, 
 
 Should we also add a check if the runq has a skip buddy (as pointed out
 by Raghu) and return if the skip buddy is already set. 

Oh right, I missed that suggestion.. the performance improvement went
from 81% to 139% using this, right?

It might make more sense to keep that separate, outside of this
function, since its not a strict prerequisite.

  
  + if (task_running(p_rq, p) || p->state)
  + return false;
  +
  + return true;
  +} 


  @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
 bool preempt)
rq = this_rq();
   
   again:
  + /* optimistic test to avoid taking locks */
  + if (!__yield_to_candidate(curr, p))
  + goto out_irq;
  +

So add something like:

/* Optimistic, if we 'raced' with another yield_to(), don't bother */
if (p_rq->cfs_rq->skip)
goto out_irq;
 
 
p_rq = task_rq(p);
double_rq_lock(rq, p_rq);
 
 
But I do have a question on this optimization though,.. Why do we check
p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?

That is, I'd like to see this thing explained a little better.

Does it go something like: p_rq is the runqueue of the task we'd like to
yield to, rq is our own, they might be the same. If we have a ->skip,
there's nothing we can do about it, OTOH p_rq having a ->skip and
failing the yield_to() simply means us picking the next VCPU thread,
which might be running on an entirely different cpu (rq) and could
succeed?




problems using virt-manager

2012-09-10 Thread Lentes, Bernd
Hi,

I am trying to run virt-manager on a SLES 11 SP1 box. I'm using kernel 2.6.32.12 and 
virt-manager 0.9.4-106.1.x86_64 .
The system is a 64bit box.

Here is the output:
=
pc56846:/media/idg2/SysAdmin_AG_Wurst/software_und_treiber/virt_manager/sles_11_sp1
 # virt-manager 
[1] 9659
pc56846:/media/idg2/SysAdmin_AG_Wurst/software_und_treiber/virt_manager/sles_11_sp1
 # Traceback (most recent call last):
  File "/usr/share/virt-manager/virt-manager.py", line 386, in <module>
    main()
  File "/usr/share/virt-manager/virt-manager.py", line 247, in main
    from virtManager import cli
  File "/usr/share/virt-manager/virtManager/cli.py", line 29, in <module>
    import libvirt
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 25, in <module>
    raise lib_e
ImportError: /usr/lib64/libvirt.so.0: undefined symbol: selinux_virtual_domain_context_path

[1]+  Exit 1  virt-manager
=

Thanks for any hint.


Bernd

--
Bernd Lentes

Systemadministration
Institut für Entwicklungsgenetik
Gebäude 35.34 - Raum 208
HelmholtzZentrum münchen
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
fax:   +49 89 3187 2294
http://www.helmholtz-muenchen.de/idg

Wir sollten nicht den Tod fürchten, sondern
das schlechte Leben

Helmholtz Zentrum München
Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
Ingolstädter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir´in Bärbel Brumme-Bothe
Geschäftsführer: Prof. Dr. Günther Wess und Dr. Nikolaus Blum
Registergericht: Amtsgericht München HRB 6466
USt-IdNr: DE 129521671


Re: Sending a VM for restart (after it has been shutdown and saved)

2012-09-10 Thread Marcelo Tosatti
On Mon, Sep 03, 2012 at 06:35:24PM +0100, Michael Johns wrote:
 Hi list,
 
 I have been hacking the KVM-QEMU code, but need some help to be able
 to perform a particular operation.
 
 Currently, I perform some operations on the VM image after it has
 received a shutdown call, and after the image itself has been fully
 saved before KVM-QEMU shuts down. I perform these operations within
 the bdrv_close_all () method within block.c, and it works exactly as I
 want it to.
 
 What I would like to be able to do is to send the VM in question back
 for a restart from that part of the code, and I was wondering if this
 was possible with a given command. When I say send it back, I mean
 literally to have the KVM-QEMU treat the VM as if has just been
 started/restarted. Although I know a few things, there is still a lot
 I don't know so it would be great to get some help on this.
 
 Thanks,
 
 M

Do you mean system_reset command from QEMU monitor?



Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Mike Waychison
On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin m...@redhat.com wrote:
 On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
 This implementation of a virtio balloon driver uses the page cache to
 store pages that have been released to the host.  The communication
 (outside of target counts) is one way--the guest notifies the host when
 it adds a page to the page cache, allowing the host to madvise(2) with
 MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
 (via the regular page reclaim).  This means that inflating the balloon
 is similar to the existing balloon mechanism, but the deflate is
 different--it re-uses existing Linux kernel functionality to
 automatically reclaim.

 Signed-off-by: Frank Swiderski f...@google.com

Hi Michael,

I'm very sorry that Frank and I have been silent on these threads.
I've been out of the office and Frank has been swamped :)

I'll take a stab at answering some of your questions below, and
hopefully we can end up on the same page.

 I've been trying to understand this, and I have
 a question: what exactly is the benefit
 of this new device?

The key difference between this device/driver and the pre-existing
virtio_balloon device/driver is in how the memory pressure loop is
controlled.

With the pre-existing balloon device/driver, the control loop for how
much memory a given VM is allowed to use is controlled completely by
the host.  This is probably fine if the goal is to pack as much work
on a given host as possible, but it says nothing about the expected
performance that any given VM is expecting to have.  Specifically, it
allows the host to set a target goal for the size of a VM, and the
driver in the guest does whatever is needed to get to that goal.  This
is great for systems where one wants to grow or shrink a VM from the
outside.


This behaviour however doesn't match what applications actually expect
from a memory control loop.  In a native setup, an application
can usually expect to allocate memory from the kernel on an as-needed
basis, and can in turn return memory back to the system (using a heap
implementation that actually releases memory that is).  The dynamic
size of an application is completely controlled by the application,
and there is very little that cluster management software can do to
ensure that the application fits some prescribed size.

We recognized this in the development of our cluster management
software long ago, so our systems are designed for managing tasks that
have a dynamic memory footprint.  Overcommit is possible (as most
applications do not use the full reservation of memory they asked for
originally), letting us do things like schedule lower priority/lower
service-classification work using resources that are otherwise
available in stand-by for high-priority/low-latency workloads.


 Note that users could not care less about how a driver
 is implemented internally.

 Is there some workload where you see VM working better with
 this than regular balloon? Any numbers?

This device is less about performance as it is about getting the
memory size of a job (or in this case, a job in a VM) to grow and
shrink as the application workload sees fit, much like how processes
today can grow and shrink without external direction.


 Also, can't we just replace existing balloon implementation
 with this one?

Perhaps, but as described above, both devices have very different
characteristics.

 Why it is so important to deflate silently?

It may not be so important to deflate silently.  I'm not sure why it
is important that we deflate loudly though either :)  Doing so seems
like unnecessary guest/host communication IMO, especially if the guest
is expecting to be able to grow to totalram (and the host isn't able
to nack any pages reclaimed anyway...).

 I guess the filesystem does not currently get a callback
 before a page is reclaimed but this is an implementation detail -
 maybe this can be fixed?

I do not follow this question.


 Also can you pls answer Avi's question?
 How is overcommit managed?

Overcommit in our deployments is managed using memory cgroups on the
host.  This allows us to have very directed policies as to how
competing VMs on a host may overcommit.



 ---
  drivers/virtio/Kconfig  |   13 +
  drivers/virtio/Makefile |1 +
  drivers/virtio/virtio_fileballoon.c |  636 
 +++
  include/linux/virtio_balloon.h  |9 +
  include/linux/virtio_ids.h  |1 +
  5 files changed, 660 insertions(+), 0 deletions(-)
  create mode 100644 drivers/virtio/virtio_fileballoon.c

 diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
 index f38b17a..cffa2a7 100644
 --- a/drivers/virtio/Kconfig
 +++ b/drivers/virtio/Kconfig
 @@ -35,6 +35,19 @@ config VIRTIO_BALLOON

If unsure, say M.

 +config VIRTIO_FILEBALLOON
 + tristate "Virtio page cache-backed balloon driver"
 + select VIRTIO
 + select VIRTIO_RING
 +

Re: [libvirt-users] Kernel unresponsive after booting 700+ vm's on a single host

2012-09-10 Thread Eric Blake
On 09/10/2012 07:51 AM, Alfred Bratterud wrote:
 For a research project we are trying to boot a very large number of tiny, 
 custom-built VMs on KVM/Ubuntu. The maximum VM count achieved was 1000, but 
 with substantial slowness, and eventually kernel failure, while the 
 cpu/memory loads were nowhere near their limits. Where is the likely bottleneck? 
 Any solutions, workarounds, hacks or dirty tricks?

Are you using cgroups?  There have been some known bottlenecks in the
kernel cgroup code, where it scales very miserably; and since libvirt
uses a different cgroup per VM by default when cgroups are enabled, that
might explain part of the problem.

Other than that, if you can profile the slowdowns, I'm sure people would
be interested in the results.

-- 
Eric Blake   ebl...@redhat.com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Rik van Riel

On 09/10/2012 01:37 PM, Mike Waychison wrote:

On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin m...@redhat.com wrote:



Also can you pls answer Avi's question?
How is overcommit managed?


Overcommit in our deployments is managed using memory cgroups on the
host.  This allows us to have very directed policies as to how
competing VMs on a host may overcommit.


How do your memory cgroups lead to guests inflating their balloons?

--
All rights reversed


Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Mike Waychison
On Mon, Sep 10, 2012 at 2:04 PM, Rik van Riel r...@redhat.com wrote:
 On 09/10/2012 01:37 PM, Mike Waychison wrote:

 On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin m...@redhat.com
 wrote:


 Also can you pls answer Avi's question?
 How is overcommit managed?


 Overcommit in our deployments is managed using memory cgroups on the
 host.  This allows us to have very directed policies as to how
 competing VMs on a host may overcommit.


 How do your memory cgroups lead to guests inflating their balloons?

The control loop that is driving the cgroup on the host can still move
the balloon target page count causing the balloon in the guest to try
and inflate.  This allows the host to effectively slowly grow the
balloon in the guest, allowing reclaim of guest free memory, followed
by guest page cache (and memory on the host system).  This can then be
compared with the subsequent growth (as this balloon setup allows the
guest to grow as it sees fit), which in effect gives us a memory
pressure indicator on the host, allowing it to back-off shrinking the
guest if the guest balloon quickly deflates.

The net effect is an opportunistic release of memory from the guest
back to the host, and the ability to quickly grow a VM's memory
footprint as the workload within it requires.

This dynamic memory sizing of VMs is much more in line with what we
can expect from native tasks today (and which is what our resource
management systems are designed to handle).


Re: [PATCHv4] virtio-spec: virtio network device multiqueue support

2012-09-10 Thread Rick Jones

On 09/09/2012 07:12 PM, Rusty Russell wrote:

OK, I read the spec (pasted below for easy of reading), but I'm still
confused over how this will work.

I thought normal net drivers have the hardware provide an rxhash for
each packet, and we map that to CPU to queue the packet on[1].  We hope
that the receiving process migrates to that CPU, so xmit queue
matches.

For virtio this would mean a new per-packet rxhash value, right?

Why are we doing something different?  What am I missing?

Thanks,
Rusty.
[1] Everything I Know About Networking I Learned From LWN:
 https://lwn.net/Articles/362339/


In my taxonomy at least, multi-queue predates RPS and RFS and is 
simply where the NIC, via some means - perhaps a hash of the headers - 
separates incoming frames to different queues.


RPS can be thought of as doing something similar inside the host.  That 
could be used to get a spread from an otherwise dumb NIC (certainly 
that is what one of its predecessors - Inbound Packet Scheduling - used 
it for in HP-UX 10.20), or it could be used to augment the multi-queue 
support of a not-so-dumb NIC - say if said NIC had a limit of queues 
that was rather lower than the number of cores/threads in the host. 
Indeed some driver/NIC combinations provide a hash value to the host for 
the host to use as it sees fit.


However, there is still the matter of a single thread of an application 
servicing multiple connections, each of which would hash to different 
locations.


RFS  (Receive Flow Steering) then goes one step further, and looks-up 
where the flow endpoint was last accessed and steers the traffic there. 
 The idea being that a thread of execution servicing multiple flows 
will have the traffic of those flows sent to the same place.  It then 
allows the scheduler to decide where things should be run rather than 
the networking code.


rick jones



Win2k8 guest with very poor disk performance on KVM host

2012-09-10 Thread Daniel Tschritter
Hi everybody,

I got a server with CentOS 6.3 and KVM as a host and a windows 2k8 
guest.

The windows machine's disk performance is very poor.
The windows guest uses VirtIO disk drivers, no cache, and uses an LVM 
partition on a RAID1.

atop shows 100% disk utilization as soon as the windows guest accesses 
the HDD; the data transfer figures given are usually less than 1MB/s 
r/w, with peaks around 3MB/s r/w.

I've run a few tests to see what's going on:

Creating a 10GB test file on CentOS (guests switched off):
time dd if=/dev/random of=testfile bs=1 count=0 seek=10G
0+0 records in
0+0 records out
0 bytes (0 B) copied, 5.777e-06 s, 0.0 kB/s
real 0m0.001s
user 0m0.000s
sys 0m0.000s
 
Create copy of test file created above:
time cp testfile testfile2
real 0m3.136s
user 0m0.440s
sys 0m2.693s
That looks ok to me. According to atop data transfer rates are between 
130 and 180MB/s.

Create copy of test file above while Windows guest boots up:
time cp testfile testfile2
real 0m3.367s
user 0m0.515s
sys 0m2.826s
not much different...

Creating copy of test file above within the Win2k8R2 guest:
Current Time: 20:12:32,41
copy testfile testfile2
1 file(s) copied
Current Time: 20:22:08,64
576,23s
That takes about 170 times longer than the copy under CentOS!

I've run the same test on a CentOS6.3 guest with the following results:
time cp testfile testfile2
real 0m3.950s
user 0m0.470s
sys 0m3.383s
that's almost as quick as the host...

I've run these tests a few times, always giving about the same result.

Why is the disk performance in the Win guest that poor?

What can be done to improve things?

Comments and ideas welcome!



Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Raghavendra K T

On 09/10/2012 10:42 PM, Peter Zijlstra wrote:

On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:

+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct 
*p)
+{
+ if (!curr->sched_class->yield_to_task)
+ return false;
+
+ if (curr->sched_class != p->sched_class)
+ return false;



Peter,

Should we also add a check if the runq has a skip buddy (as pointed out
by Raghu) and return if the skip buddy is already set.


Oh right, I missed that suggestion.. the performance improvement went
from 81% to 139% using this, right?

It might make more sense to keep that separate, outside of this
function, since its not a strict prerequisite.



+ if (task_running(p_rq, p) || p->state)
+ return false;
+
+ return true;
+}




@@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,

bool preempt)

   rq = this_rq();

  again:
+ /* optimistic test to avoid taking locks */
+ if (!__yield_to_candidate(curr, p))
+ goto out_irq;
+


So add something like:

/* Optimistic, if we 'raced' with another yield_to(), don't bother */
if (p_rq->cfs_rq->skip)
goto out_irq;




   p_rq = task_rq(p);
   double_rq_lock(rq, p_rq);




But I do have a question on this optimization though,.. Why do we check
p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?

That is, I'd like to see this thing explained a little better.

Does it go something like: p_rq is the runqueue of the task we'd like to
yield to, rq is our own, they might be the same. If we have a ->skip,
there's nothing we can do about it, OTOH p_rq having a ->skip and
failing the yield_to() simply means us picking the next VCPU thread,
which might be running on an entirely different cpu (rq) and could
succeed?



Yes, that is the intention (I mean checking p_rq->cfs_rq->skip). Though we
may be overdoing this check in the scenario where multiple vcpus of the
same VM are pinned to the same CPU.

I am testing the above patch. Hope to be able to get back with the
results tomorrow.



Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
 On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin m...@redhat.com wrote:
  On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
  This implementation of a virtio balloon driver uses the page cache to
  store pages that have been released to the host.  The communication
  (outside of target counts) is one way--the guest notifies the host when
  it adds a page to the page cache, allowing the host to madvise(2) with
  MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
  (via the regular page reclaim).  This means that inflating the balloon
  is similar to the existing balloon mechanism, but the deflate is
  different--it re-uses existing Linux kernel functionality to
  automatically reclaim.
 
  Signed-off-by: Frank Swiderski f...@google.com
 
 Hi Michael,
 
 I'm very sorry that Frank and I have been silent on these threads.
 I've been out of the office and Frank has been swamped :)
 
 I'll take a stab at answering some of your questions below, and
 hopefully we can end up on the same page.
 
  I've been trying to understand this, and I have
  a question: what exactly is the benefit
  of this new device?
 
 The key difference between this device/driver and the pre-existing
 virtio_balloon device/driver is in how the memory pressure loop is
 controlled.
 
 With the pre-existing balloon device/driver, the control loop for how
 much memory a given VM is allowed to use is controlled completely by
 the host.  This is probably fine if the goal is to pack as much work
 on a given host as possible, but it says nothing about the expected
 performance that any given VM is expecting to have.  Specifically, it
 allows the host to set a target goal for the size of a VM, and the
 driver in the guest does whatever is needed to get to that goal.  This
 is great for systems where one wants to grow or shrink a VM from the
 outside.
 
 
 This behaviour however doesn't match what applications actually expect
 from a memory control loop.  In a native setup, an application
 can usually expect to allocate memory from the kernel on an as-needed
 basis, and can in turn return memory back to the system (using a heap
 implementation that actually releases memory that is).  The dynamic
 size of an application is completely controlled by the application,
 and there is very little that cluster management software can do to
 ensure that the application fits some prescribed size.
 
 We recognized this in the development of our cluster management
 software long ago, so our systems are designed for managing tasks that
 have a dynamic memory footprint.  Overcommit is possible (as most
 applications do not use the full reservation of memory they asked for
 originally), letting us do things like schedule lower priority/lower
 service-classification work using resources that are otherwise
 available in stand-by for high-priority/low-latency workloads.

OK I am not sure I got this right so pls tell me if this summary is
correct (note: this does not talk about what the guest does with memory,
just what it is that the device does):

- existing balloon is told lower limit on target size by host and pulls in at 
least
  target size. Guest can inflate > target size if it likes
  and then it is OK to deflate back to target size but not less.
- your balloon is told upper limit on target size by host and pulls at most
  target size. Guest can deflate down to 0 at any point.

If so I think both approaches make sense and in fact they
can be useful at the same time for the same guest.
In that case, I see two ways how this can be done:

1. two devices: existing balloon + cache balloon
2. add upper limit to existing balloon

A single device looks a bit more natural in that we don't
really care in which balloon a page is as long as we
are between lower and upper limit. Right?
From implementation POV we could have it use
pagecache for pages above lower limit but that
is a separate question about driver design,
I would like to make sure I understand the high
level design first.
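To make the two control loops concrete, here is a minimal userspace C sketch of the combined rule discussed above (a lower limit from the classic balloon, an upper limit from the cache balloon). The struct and helper names are hypothetical, not taken from either driver:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the combined control rule: the guest's ballooned
 * page count must end up between the host-set lower limit
 * (classic balloon target) and upper limit (cache balloon cap).
 * Names are illustrative only.
 */
struct balloon_limits {
    unsigned long lower;   /* classic target: inflate at least this much */
    unsigned long upper;   /* cache cap: never inflate beyond this */
};

static bool balloon_size_ok(const struct balloon_limits *l, unsigned long pages)
{
    return pages >= l->lower && pages <= l->upper;
}

/* Clamp a guest-proposed size into the host-allowed window. */
static unsigned long balloon_clamp(const struct balloon_limits *l,
                                   unsigned long pages)
{
    if (pages < l->lower)
        return l->lower;
    if (pages > l->upper)
        return l->upper;
    return pages;
}
```

Under this model the guest is free to move anywhere inside the window, which is the "we don't really care in which balloon a page is" point above.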





 
  Note that users could not care less about how a driver
  is implemented internally.
 
  Is there some workload where you see VM working better with
  this than regular balloon? Any numbers?
 
 This device is less about performance as it is about getting the
 memory size of a job (or in this case, a job in a VM) to grow and
 shrink as the application workload sees fit, much like how processes
 today can grow and shrink without external direction.

Still, e.g. swap in host achieves more or less the same functionality.
I am guessing balloon can work better by getting more cooperation
from guest but aren't there any tests showing this is true in practice?


 
  Also, can't we just replace existing balloon implementation
  with this one?
 
 Perhaps, but as described above, both devices have very different
 characteristics.
 
  Why it is so important to deflate silently?
 
 It may not 

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Andrew Theurer
On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
 On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
   +static bool __yield_to_candidate(struct task_struct *curr, struct 
   task_struct *p)
   +{
   + if (!curr->sched_class->yield_to_task)
   + return false;
   +
   + if (curr->sched_class != p->sched_class)
   + return false;
  
  
  Peter, 
  
  Should we also add a check if the runq has a skip buddy (as pointed out
  by Raghu) and return if the skip buddy is already set. 
 
 Oh right, I missed that suggestion.. the performance improvement went
 from 81% to 139% using this, right?
 
 It might make more sense to keep that separate, outside of this
 function, since its not a strict prerequisite.
 
   
   + if (task_running(p_rq, p) || p->state)
   + return false;
   +
   + return true;
   +} 
 
 
   @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
  bool preempt)
 rq = this_rq();

again:
   + /* optimistic test to avoid taking locks */
   + if (!__yield_to_candidate(curr, p))
   + goto out_irq;
   +
 
 So add something like:
 
   /* Optimistic, if we 'raced' with another yield_to(), don't bother */
   if (p_rq->cfs_rq->skip)
   goto out_irq;
  
  
 p_rq = task_rq(p);
 double_rq_lock(rq, p_rq);
  
  
 But I do have a question on this optimization though,.. Why do we check
 p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
 
 That is, I'd like to see this thing explained a little better.
 
 Does it go something like: p_rq is the runqueue of the task we'd like to
 yield to, rq is our own, they might be the same. If we have a ->skip,
 there's nothing we can do about it, OTOH p_rq having a ->skip and
 failing the yield_to() simply means us picking the next VCPU thread,
 which might be running on an entirely different cpu (rq) and could
 succeed?

Here's two new versions, both include a __yield_to_candidate(): v3
uses the check for p_rq->curr in guest mode, and v4 uses the cfs_rq
skip check.  Raghu, I am not sure if this is exactly what you want
implemented in v4.

Results:
 ple on:   2552 +/- .70%
 ple on: w/fixv1:  4621 +/- 2.12%  (81% improvement)
 ple on: w/fixv2:  6115   (139% improvement)
   v3:  5735   (124% improvement)
   v4:  4524   (  3% regression)

Both patches included below

-Andrew
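Both patches below use the classic optimistic-check/recheck-under-lock pattern: test the cheap preconditions without the runqueue locks, then re-validate after double_rq_lock() because the state may have changed in between. A minimal userspace analogue, with a pthread mutex standing in for the rq locks (names are illustrative, not the scheduler code itself):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Sketch of the optimistic-check/recheck pattern: a cheap unlocked
 * test first, then take the lock and re-validate, since the state
 * may have changed between the two checks.
 */
struct target {
    pthread_mutex_t lock;
    bool eligible;      /* analogue of __yield_to_candidate() result */
    int yields;
};

static bool try_yield_to(struct target *t)
{
    /* optimistic test to avoid taking the lock */
    if (!t->eligible)
        return false;

    pthread_mutex_lock(&t->lock);
    /* re-validate: holding the lock makes the state stable */
    if (!t->eligible) {
        pthread_mutex_unlock(&t->lock);
        return false;
    }
    t->yields++;
    pthread_mutex_unlock(&t->lock);
    return true;
}
```

The win measured above comes from the first, unlocked test: most non-candidates bail out before ever touching the contended locks.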

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..0d98a67 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct 
*p, struct rq *p_rq)
+{
+   if (!curr->sched_class->yield_to_task)
+   return false;
+
+   if (curr->sched_class != p->sched_class)
+   return false;
+
+   if (task_running(p_rq, p) || p->state)
+   return false;
+
+   return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,27 @@ bool __sched yield_to(struct task_struct *p, bool 
preempt)
 
 again:
p_rq = task_rq(p);
+
+   /* optimistic test to avoid taking locks */
+   if (!__yield_to_candidate(curr, p, p_rq))
+   goto out_irq;
+
+   /*
+* if the target task is not running, then only yield if the
+* current task is in guest mode
+*/
+   if (!(p_rq->curr->flags & PF_VCPU))
+   goto out_irq;
+
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
goto again;
}
 
-   if (!curr->sched_class->yield_to_task)
-   goto out;
-
-   if (curr->sched_class != p->sched_class)
-   goto out;
-
-   if (task_running(p_rq, p) || p->state)
-   goto out;
+   /* validate state, holding p_rq ensures p's state cannot change */
+   if (!__yield_to_candidate(curr, p, p_rq))
+   goto out_unlock;
 
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4877,8 +4901,9 @@ again:
rq->skip_clock_update = 0;
}
 
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
 
if (yielded)




v4:


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..2bec2ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct 
*p, struct rq *p_rq)
+{
+   if (!curr->sched_class->yield_to_task)
+   return false;
+
+   

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Peter Zijlstra
On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:
 +   /*
 +* if the target task is not running, then only yield if the
 +* current task is in guest mode
 +*/
 +   if (!(p_rq->curr->flags & PF_VCPU))
 +   goto out_irq; 

This would make yield_to() only ever work on KVM, not that I mind this
too much, its a horrid thing and making it less useful for (ab)use is a
good thing, still this probably wants mention somewhere :-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Rik van Riel

On 09/10/2012 04:19 PM, Peter Zijlstra wrote:

On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:

+   /*
+* if the target task is not running, then only yield if the
+* current task is in guest mode
+*/
+   if (!(p_rq->curr->flags & PF_VCPU))
+   goto out_irq;


This would make yield_to() only ever work on KVM, not that I mind this
too much, its a horrid thing and making it less useful for (ab)use is a
good thing, still this probably wants mention somewhere :-)


Also, it would not preempt a non-kvm task, even if we need
to do that to boost a VCPU. I think the lines above should
be dropped.

--
All rights reversed


Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Mike Waychison
On Mon, Sep 10, 2012 at 3:59 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
 On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin m...@redhat.com wrote:
  On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
  This implementation of a virtio balloon driver uses the page cache to
  store pages that have been released to the host.  The communication
  (outside of target counts) is one way--the guest notifies the host when
  it adds a page to the page cache, allowing the host to madvise(2) with
  MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
  (via the regular page reclaim).  This means that inflating the balloon
  is similar to the existing balloon mechanism, but the deflate is
  different--it re-uses existing Linux kernel functionality to
  automatically reclaim.
 
  Signed-off-by: Frank Swiderski f...@google.com

 Hi Michael,

 I'm very sorry that Frank and I have been silent on these threads.
 I've been out of the office and Frank has been been swamped :)

 I'll take a stab at answering some of your questions below, and
 hopefully we can end up on the same page.

  I've been trying to understand this, and I have
  a question: what exactly is the benefit
  of this new device?

 The key difference between this device/driver and the pre-existing
 virtio_balloon device/driver is in how the memory pressure loop is
 controlled.

 With the pre-existing balloon device/driver, the control loop for how
 much memory a given VM is allowed to use is controlled completely by
 the host.  This is probably fine if the goal is to pack as much work
 on a given host as possible, but it says nothing about the expected
 performance that any given VM is expecting to have.  Specifically, it
 allows the host to set a target goal for the size of a VM, and the
 driver in the guest does whatever is needed to get to that goal.  This
 is great for systems where one wants to grow or shrink a VM from the
 outside.


 This behaviour, however, doesn't match what applications actually expect
 from a memory control loop.  In a native setup, an application
 can usually expect to allocate memory from the kernel on an as-needed
 basis, and can in turn return memory back to the system (using a heap
 implementation that actually releases memory that is).  The dynamic
 size of an application is completely controlled by the application,
 and there is very little that cluster management software can do to
 ensure that the application fits some prescribed size.

 We recognized this in the development of our cluster management
 software long ago, so our systems are designed for managing tasks that
 have a dynamic memory footprint.  Overcommit is possible (as most
 applications do not use the full reservation of memory they asked for
 originally), letting us do things like schedule lower priority/lower
 service-classification work using resources that are otherwise
 available in stand-by for high-priority/low-latency workloads.

 OK I am not sure I got this right so pls tell me if this summary is
 correct (note: this does not talk about what guest does with memory,
 just what it is that the device does):

 - existing balloon is told lower limit on target size by host and pulls in at 
 least
   target size. Guest can inflate > target size if it likes
   and then it is OK to deflate back to target size but not less.

Is this true?   I take it nothing is keeping the existing balloon
driver from going over the target, but the same can be said about
either balloon implementation.

 - your balloon is told upper limit on target size by host and pulls at most
   target size. Guest can deflate down to 0 at any point.

 If so I think both approaches make sense and in fact they
 can be useful at the same time for the same guest.
 In that case, I see two ways how this can be done:

 1. two devices: existing balloon + cache balloon
 2. add upper limit to existing balloon

 A single device looks a bit more natural in that we don't
 really care in which balloon a page is as long as we
 are between lower and upper limit. Right?

I agree that this may be better done using a single device if possible.

 From implementation POV we could have it use
 pagecache for pages above lower limit but that
 is a separate question about driver design,
 I would like to make sure I understand the high
 level design first.

I agree that this is an implementation detail that is separate from
discussions of high and low limits.  That said, there are several
advantages to pushing these pages to the page cache (memory defrag
still works for one).

  Note that users could not care less about how a driver
  is implemented internally.
 

Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 04:49:40PM -0400, Mike Waychison wrote:
 On Mon, Sep 10, 2012 at 3:59 PM, Michael S. Tsirkin m...@redhat.com wrote:
  On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
  On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin m...@redhat.com 
  wrote:
   On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
   This implementation of a virtio balloon driver uses the page cache to
   store pages that have been released to the host.  The communication
   (outside of target counts) is one way--the guest notifies the host when
   it adds a page to the page cache, allowing the host to madvise(2) with
   MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
   (via the regular page reclaim).  This means that inflating the balloon
   is similar to the existing balloon mechanism, but the deflate is
   different--it re-uses existing Linux kernel functionality to
   automatically reclaim.
  
   Signed-off-by: Frank Swiderski f...@google.com
 
  Hi Michael,
 
  I'm very sorry that Frank and I have been silent on these threads.
  I've been out of the office and Frank has been been swamped :)
 
  I'll take a stab at answering some of your questions below, and
  hopefully we can end up on the same page.
 
   I've been trying to understand this, and I have
   a question: what exactly is the benefit
   of this new device?
 
  The key difference between this device/driver and the pre-existing
  virtio_balloon device/driver is in how the memory pressure loop is
  controlled.
 
  With the pre-existing balloon device/driver, the control loop for how
  much memory a given VM is allowed to use is controlled completely by
  the host.  This is probably fine if the goal is to pack as much work
  on a given host as possible, but it says nothing about the expected
  performance that any given VM is expecting to have.  Specifically, it
  allows the host to set a target goal for the size of a VM, and the
  driver in the guest does whatever is needed to get to that goal.  This
  is great for systems where one wants to grow or shrink a VM from the
  outside.
 
 
  This behaviour, however, doesn't match what applications actually expect
  from a memory control loop.  In a native setup, an application
  can usually expect to allocate memory from the kernel on an as-needed
  basis, and can in turn return memory back to the system (using a heap
  implementation that actually releases memory that is).  The dynamic
  size of an application is completely controlled by the application,
  and there is very little that cluster management software can do to
  ensure that the application fits some prescribed size.
 
  We recognized this in the development of our cluster management
  software long ago, so our systems are designed for managing tasks that
  have a dynamic memory footprint.  Overcommit is possible (as most
  applications do not use the full reservation of memory they asked for
  originally), letting us do things like schedule lower priority/lower
  service-classification work using resources that are otherwise
  available in stand-by for high-priority/low-latency workloads.
 
  OK I am not sure I got this right so pls tell me if this summary is
  correct (note: this does not talk about what guest does with memory,
  just what it is that the device does):
 
  - existing balloon is told lower limit on target size by host and pulls in
  at least target size. Guest can inflate > target size if it likes
  and then it is OK to deflate back to target size but not less.
 
 Is this true?   I take it nothing is keeping the existing balloon
 driver from going over the target, but the same can be said about
 either balloon implementation.
 
  - your balloon is told upper limit on target size by host and pulls at most
target size. Guest can deflate down to 0 at any point.
 
  If so I think both approaches make sense and in fact they
  can be useful at the same time for the same guest.
  In that case, I see two ways how this can be done:
 
  1. two devices: existing balloon + cache balloon
  2. add upper limit to existing balloon
 
  A single device looks a bit more natural in that we don't
  really care in which balloon a page is as long as we
  are between lower and upper limit. Right?
 
 I agree that this may be better done using a single device if possible.

I am not sure myself, just asking.

  From implementation POV we could have it use
  pagecache for pages above lower limit but that
  is a separate question about driver design,
  I would like to make sure I understand the high
  level design first.
 
 I agree that this is an implementation detail that is separate from
 discussions of high and low limits.  

Re: [PATCH v6 12/12] KVM: indicate readonly access fault

2012-09-10 Thread Marcelo Tosatti
On Fri, Sep 07, 2012 at 05:56:39PM +0800, Xiao Guangrong wrote:
 On 09/06/2012 10:09 PM, Avi Kivity wrote:
  On 08/22/2012 03:47 PM, Xiao Guangrong wrote:
  On 08/22/2012 08:06 PM, Avi Kivity wrote:
  On 08/21/2012 06:03 AM, Xiao Guangrong wrote:
  Introduce write_readonly_mem in mmio-exit-info to indicate this exit is
  caused by write access on readonly memslot
 
  Please document this in chapter 5 of api.txt.
 
 
  Okay, please review this one.
 
  Subject: [PATCH v6 12/12] KVM: indicate readonly access fault
 
  Introduce write_readonly_mem in mmio-exit-info to indicate this exit is
  caused by write access on readonly memslot
 
  
  I'm not sure whether this indication can be trusted by userspace.  By
  the time userspace gets to process this, the slot may no longer exist,
  or it may be writable.
 
 The case of deleting memslot is ok, because userspace just skips this fault
 if no readonly mem or no fault handler can be found.
 
 Switching a memslot from readonly to writable sounds strange; I agree with you
 that this flag is untrustworthy in that case.
 
 Marcelo, any comments?

The same can happen with slot deletion, for example. 

Userspace (which performed the modification which can result in faults
to non-existant/read-only/.../new-tag memslot), must handle the faults 
properly or avoid the possibility for reference to memslot information 
from the past.

I think it's worthwhile to add a note about this in the API
documentation: The user of this interface is responsible for handling 
references to stale memslot information, either by handling
exit notifications which reference stale memslot information or not
allowing these notifications to exist by stopping all vcpus in userspace
before performing modifications to the memslots map.
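One way userspace can follow this advice without stopping all vcpus is to version the memslot map and discard exit notifications recorded against an older version. A toy sketch of that idea (the generation field and names are hypothetical, not part of the KVM API):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of stale-memslot handling: userspace bumps a generation
 * counter on every memslot modification, stamps each recorded
 * fault with the generation it was observed under, and drops
 * faults whose generation no longer matches.
 */
struct memslot_table { uint64_t generation; };

struct fault_exit {
    uint64_t slot_generation;   /* snapshot when the fault was recorded */
    bool write_readonly_mem;
};

static bool fault_is_stale(const struct memslot_table *t,
                           const struct fault_exit *e)
{
    return e->slot_generation != t->generation;
}

/* Called by userspace whenever it modifies the memslot map. */
static void memslots_modify(struct memslot_table *t)
{
    t->generation++;
}
```

A stale fault is then re-driven against the current map rather than trusted, which covers both the deletion and the readonly-to-writable cases above.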

  (in the same way an mmio exit might actually hit RAM)
 
 So, in userspace, to be safe, we should walk all memslots and not
 just the mmio handlers?
 


Re: [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic

2012-09-10 Thread Maciej W. Rozycki
On Sun, 9 Sep 2012, Matthew Ogilvie wrote:

 This bug manifested itself when the guest was Microport UNIX
 System V/386 v2.1 (ca. 1987), because it would sometimes mask
 off IRQ14 in the slave IMR after it had already been asserted.
 The master would still try to deliver an interrupt even though
 IRQ2 had dropped again, resulting in a spurious interrupt
 (IRQ15) and a panicked UNIX kernel.

 That is quite weird actually -- from my experience the spurious vector is 
never sent from a slave (quite understandably -- since the interrupt is 
gone and no other is pending, the master has no reason to select a slave 
to supply a vector and therefore supplies the spurious vector itself) and 
therefore a spurious IRQ7 is always issued regardless of whether the 
discarded request came from a slave or from the master.

 Is there a bug elsewhere then too?  I would have expected a reasonable 
(and especially an old-school) x86 OS to be able to cope with spurious 
8259A interrupts, but then obviously one would expect them on IRQ7 only.
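For reference, here is a toy model of the (pre-patch) latching behaviour Matthew describes: the master latches the IR2 edge and never re-evaluates it when the slave's request is masked, so the eventual INTA cycle reaches an idle slave and yields spurious IRQ15. This models the device-model bug under discussion, not real 8259A silicon, and the structure and function names are illustrative:

```c
#include <stdint.h>

/* Minimal two-register model of one 8259A: request and mask. */
struct pic { uint8_t irr, imr; };

static void slave_raise(struct pic *master, struct pic *slave, int irq)
{
    slave->irr |= 1u << irq;
    if (slave->irr & ~slave->imr)
        master->irr |= 1u << 2;          /* edge latched on master IR2 */
}

static void slave_mask(struct pic *master, struct pic *slave, int irq)
{
    slave->imr |= 1u << irq;
    (void)master;                        /* bug: IR2 latch not re-evaluated */
}

/* INTA cycle: returns the delivered IRQ number (8..15 via the slave). */
static int master_ack(struct pic *master, struct pic *slave)
{
    if (!(master->irr & ~master->imr & (1u << 2)))
        return 7;                        /* master's own spurious IRQ7 */
    master->irr &= ~(1u << 2);
    uint8_t pending = slave->irr & ~slave->imr;
    for (int i = 0; i < 8; i++)
        if (pending & (1u << i))
            return 8 + i;
    return 15;                           /* idle slave: spurious IRQ15 */
}
```

Raising slave IR6 (system IRQ14) and then masking it before the ack reproduces the spurious IRQ15 that panicked the Microport UNIX guest; real hardware, per Maciej, would instead supply spurious IRQ7 from the master.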

  Maciej


Re: problems using virt-manager

2012-09-10 Thread 王金浦
2012/9/11 Lentes, Bernd bernd.len...@helmholtz-muenchen.de

 Hi,

 i try to run virt-manager on a SLES 11 SP1 box. I'm using kernel 2.6.32.12
 and virt-manager 0.9.4-106.1.x86_64 .
 The system is a 64bit box.

 Here is the output:
 =

 pc56846:/media/idg2/SysAdmin_AG_Wurst/software_und_treiber/virt_manager/sles_11_sp1
 # virt-manager 
 [1] 9659

 pc56846:/media/idg2/SysAdmin_AG_Wurst/software_und_treiber/virt_manager/sles_11_sp1
 # Traceback (most recent call last):
  File /usr/share/virt-manager/virt-manager.py, line 386, in <module>
 main()
   File /usr/share/virt-manager/virt-manager.py, line 247, in main
 from virtManager import cli
  File /usr/share/virt-manager/virtManager/cli.py, line 29, in <module>
 import libvirt
  File /usr/lib64/python2.6/site-packages/libvirt.py, line 25, in
 <module>
 raise lib_e
 ImportError: /usr/lib64/libvirt.so.0: undefined symbol:
 selinux_virtual_domain_context_path

 [1]+  Exit 1  virt-manager
 =

It seems libvirt raised an ImportError, so the libvir-list may be a better
place to ask; adding it to cc.

Jack


Re: [PATCH 3/3] support readonly memory feature in qemu

2012-09-10 Thread Kevin O'Connor
On Mon, Sep 10, 2012 at 11:25:38AM +0200, Jan Kiszka wrote:
 On 2012-09-09 17:45, Avi Kivity wrote:
  On 09/07/2012 11:50 AM, Jan Kiszka wrote:
 
  +} else {
  +cpu_physical_memory_rw(run->mmio.phys_addr,
  +   run->mmio.data,
  +   run->mmio.len,
  +   run->mmio.is_write);
  +}
  +
   ret = 0;
   break;
   case KVM_EXIT_IRQ_WINDOW_OPEN:
 
 
  Great to see this feature for KVM finally! I'm just afraid that this
  will finally break good old isapc - due to broken Seabios. KVM used to
  unbreak it as it didn't respect write protections. ;)
  
  Can you describe the breakage?
 
 Try qemu -machine isapc [-enable-kvm]. Seabios is writing to some
 read-only marked area. Don't recall where precisely.

On boot, QEMU marks the memory at 0xc0000-0x100000 as read-only.
SeaBIOS then makes the area read-write, performs its init, and then
makes portions of it read-only before launching the OS.

The registers SeaBIOS uses to make the memory read-write are on a PCI
device.  On isapc, this device is not reachable, and thus SeaBIOS
can't make the memory writable.

The easiest way to fix this is to change QEMU to boot with the area
read-write.  There's no real gain in booting with the memory read-only
as the first thing SeaBIOS does is make it read-write.
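For context, the registers in question are the i440FX-style PAM (Programmable Attribute Map) fields in PCI config space: two 4-bit attribute fields per byte, where (as I recall the i440FX datasheet; treat the bit meanings as an assumption) bit 0 enables DRAM reads and bit 1 enables DRAM writes for the covered region of 0xc0000-0xfffff. A tiny decoder sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of PAM attribute decoding.  Field value 0x0 = DRAM
 * disabled (accesses go to ROM/PCI), 0x1 = read-only DRAM,
 * 0x3 = read/write DRAM.  Bit meanings assumed from the i440FX
 * datasheet, not verified against QEMU's implementation.
 */
struct pam_attr { bool read_dram; bool write_dram; };

static struct pam_attr pam_decode(uint8_t field)
{
    struct pam_attr a = {
        .read_dram  = field & 1,  /* RE bit */
        .write_dram = field & 2,  /* WE bit */
    };
    return a;
}
```

SeaBIOS flips these fields to 0x3 early in boot, which is why it needs the PCI host device to be reachable; on isapc it isn't, hence Kevin's suggestion to boot with the region already read-write.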

-Kevin


Re: [PATCH 0/5] vhost-scsi: Add support for host virtualized target

2012-09-10 Thread Asias He
Hello Nicholas,

On 09/07/2012 02:48 PM, Nicholas A. Bellinger wrote:
 From: Nicholas Bellinger n...@linux-iscsi.org
 
 Hello Anthony & Co,
 
 This is the fourth installment to add host virtualized target support for
 the mainline tcm_vhost fabric driver using Linux v3.6-rc into QEMU 1.3.0-rc.
 
 The series is available directly from the following git branch:
 
git://git.kernel.org/pub/scm/virt/kvm/nab/qemu-kvm.git vhost-scsi-for-1.3
 
 Note the code is cut against yesterday's QEMU head, and despite the name
 of the tree is based upon mainline qemu.org git code + has thus far been
 running overnight with > 100K IOPs small block 4k workloads using v3.6-rc2+
 based target code with RAMDISK_DR backstores.

Are you still seeing the performance degradation discussed in the thread

 vhost-scsi port to v1.1.0 + MSI-X performance regression

?

 Other than some minor fuzz between jumping from QEMU 1.2.0 - 1.2.50, this
 series is functionally identical to what's been posted for vhost-scsi RFC-v3
 to qemu-devel.
 
 Please consider applying these patches for an initial vhost-scsi merge into
 QEMU 1.3.0-rc code, or let us know what else you'd like to see addressed for
 this series to in order to merge.
 
 Thank you!
 
 --nab
 
 Nicholas Bellinger (2):
   monitor: Rename+move net_handle_fd_param -> monitor_handle_fd_param
   virtio-scsi: Set max_target=0 during vhost-scsi operation
 
 Stefan Hajnoczi (3):
   vhost: Pass device path to vhost_dev_init()
   vhost-scsi: add -vhost-scsi host device for use with tcm-vhost
   virtio-scsi: Add start/stop functionality for vhost-scsi
 
  configure|   10 +++
  hw/Makefile.objs |1 +
  hw/qdev-properties.c |   41 +++
  hw/vhost-scsi.c  |  190 
 ++
  hw/vhost-scsi.h  |   62 
  hw/vhost.c   |5 +-
  hw/vhost.h   |3 +-
  hw/vhost_net.c   |2 +-
  hw/virtio-pci.c  |2 +
  hw/virtio-scsi.c |   55 ++-
  hw/virtio-scsi.h |1 +
  monitor.c|   18 +
  monitor.h|1 +
  net.c|   18 -
  net.h|2 -
  net/socket.c |2 +-
  net/tap.c|4 +-
  qemu-common.h|1 +
  qemu-config.c|   19 +
  qemu-options.hx  |4 +
  vl.c |   18 +
  21 files changed, 431 insertions(+), 28 deletions(-)
  create mode 100644 hw/vhost-scsi.c
  create mode 100644 hw/vhost-scsi.h
 


-- 
Asias


Re: [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic

2012-09-10 Thread Matthew Ogilvie
On Tue, Sep 11, 2012 at 01:49:51AM +0100, Maciej W. Rozycki wrote:
 On Sun, 9 Sep 2012, Matthew Ogilvie wrote:
 
  This bug manifested itself when the guest was Microport UNIX
  System V/386 v2.1 (ca. 1987), because it would sometimes mask
  off IRQ14 in the slave IMR after it had already been asserted.
  The master would still try to deliver an interrupt even though
  IRQ2 had dropped again, resulting in a spurious interrupt
  (IRQ15) and a panicked UNIX kernel.
 
  That is quite weird actually -- from my experience the spurious vector is 
 never sent from a slave (quite understandably -- since the interrupt is 
 gone and no other is pending, the master has no reason to select a slave 
 to supply a vector and therefore supplies the spurious vector itself) and 
 therefore a spurious IRQ7 is always issued regardless of whether the 
 discarded request came from a slave or from the master.

Keep in mind that this paragraph is describing QEMU's 8259 device
model behavior (and also KVM's), not real hardware.  Reading the
unpatched code, the master clearly latches on to the momentary IRQ2,
does not cancel it when it is cleared again, and ultimately delivers
a spurious IRQ15.

As for what the OS is doing with the IRQ15 (or IRQ7), I only have a large
disassembly listing (with only a vague idea of its overall interrupt
handling strategy), and some printf logs of stuff happening in the
8259 model when the OS is running (more useful).

 
  Is there a bug elsewhere then too?  I would have expected a reasonable 
 (and especially an old-school) x86 OS to be able to cope with spurious 
 8259A interrupts, but then obviously one would expect them on IRQ7 only.
 
   Maciej