Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-11-30 Thread Michael S. Tsirkin
On Thu, Dec 01, 2011 at 01:12:25PM +1030, Rusty Russell wrote:
> On Wed, 30 Nov 2011 18:11:51 +0200, Sasha Levin  
> wrote:
> > On Tue, 2011-11-29 at 16:58 +0200, Avi Kivity wrote:
> > > On 11/29/2011 04:54 PM, Michael S. Tsirkin wrote:
> > > > > 
> > > > > Which is actually strange, weren't indirect buffers introduced to make
> > > > > the performance *better*? From what I see it's pretty much the
> > > > > same/worse for virtio-blk.
> > > >
> > > > I know they were introduced to allow adding very large bufs.
> > > > See 9fa29b9df32ba4db055f3977933cd0c1b8fe67cd
> > > > Mark, you wrote the patch, could you tell us which workloads
> > > > benefit the most from indirect bufs?
> > > >
> > > 
> > > Indirects are really for block devices with many spindles, since there
> > > the limiting factor is the number of requests in flight.  Network
> > > interfaces are limited by bandwidth, it's better to increase the ring
> > > size and use direct buffers there (so the ring size more or less
> > > corresponds to the buffer size).
> > > 
> > 
> > I did some testing of indirect descriptors under different workloads.
> 
> MST and I discussed getting clever with dynamic limits ages ago, but it
> was down low on the TODO list.  Thanks for diving into this...
> 
> AFAICT, if the ring never fills, direct is optimal.  When the ring
> fills, indirect is optimal (we're better to queue now than later).
> 
> Why not something simple, like a threshold which drops every time we
> fill the ring?
> 
> struct vring_virtqueue
> {
> ...
> int indirect_thresh;
> ...
> }
> 
> virtqueue_add_buf_gfp()
> {
> ...
> 
> if (vq->indirect &&
> (vq->vring.num - vq->num_free) + out + in > vq->indirect_thresh)
> return indirect()
> ...
> 
>   if (vq->num_free < out + in) {
> if (vq->indirect && vq->indirect_thresh > 0)
> vq->indirect_thresh--;
> 
> ...
> }
> 
> Too dumb?
> 
> Cheers,
> Rusty.

We'll presumably need some logic to increment is back,
to account for random workload changes.
Something like slow start?

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Thu, Dec 1, 2011 at 4:28 AM, Rusty Russell  wrote:
> Hmm, we got away with light barriers because we knew we were not
> *really* talking to a device.  But now with virtio-mmio, turns out we
> are :)
>
> I'm really tempted to revert d57ed95 for 3.2, and we can revisit this
> optimization later if it proves worthwhile.

+1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Thu, Dec 1, 2011 at 1:43 AM, Michael S. Tsirkin  wrote:
> And these accesses need to be ordered with DSB? Or DMB?

DMB (i.e. smp barriers) should be enough within Normal memory
accesses, though the other issues that were reported to me are a bit
concerning. I'm still trying to get more information about them, in
the hopes that I can eventually reproduce them myself.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Thu, Dec 1, 2011 at 1:13 AM, Michael S. Tsirkin  wrote:
> For x86, stores into memory are ordered. So I think that yes, smp_XXX
> can be selected at compile time.

But then you can't use the same kernel image for both scenarios.

It won't take long until people will use virtio on ARM for both
virtualization and for talking to devices, and having to rebuild the
kernel for different use cases is nasty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Rusty Russell
On Thu, 1 Dec 2011 01:13:07 +0200, "Michael S. Tsirkin"  wrote:
> For x86, stores into memory are ordered. So I think that yes, smp_XXX
> can be selected at compile time.
> 
> So let's forget the virtio strangeness for a minute,

Hmm, we got away with light barriers because we knew we were not
*really* talking to a device.  But now with virtio-mmio, turns out we
are :)

I'm really tempted to revert d57ed95 for 3.2, and we can revisit this
optimization later if it proves worthwhile.

Thoughts?
Rusty. 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-30 Thread Rusty Russell
On Wed, 30 Nov 2011 15:12:43 +0200, Sasha Levin  wrote:
> On Wed, 2011-11-30 at 10:10 +1030, Rusty Russell wrote:
> > On Mon, 28 Nov 2011 11:15:31 +0200, Sasha Levin  
> > wrote:
> > > On Mon, 2011-11-28 at 11:25 +1030, Rusty Russell wrote:
> > > > I'd like to see kvmtools remove support for legacy mode altogether,
> > > > but they probably have existing users.
> > > 
> > > While we can't simply remove it right away, instead of mixing our
> > > implementation for both legacy and new spec in the same code we can
> > > split the virtio-pci implementation into two:
> > > 
> > >   - virtio/virtio-pci-legacy.c
> > >   - virtio/virtio-pci.c
> > > 
> > > At that point we can #ifdef the entire virtio-pci-legacy.c for now and
> > > remove it at the same time legacy virtio-pci is removed from the kernel.
> > 
> > Hmm, that might be neat, but we can't tell the driver core to try
> > virtio-pci before virtio-pci-legacy, so we need detection code in both
> > modules (and add a "force" flag to virtio-pci-legacy to tell it to
> > accept the device even if it's not a legacy-only one).
> 
> I was thinking more in the direction of fallback code in virtio-pci.c to
> virtio-pci-legacy.c.
> 
> Something like:
> #ifdef VIRTIO_PCI_LEGACY
> [Create BAR0 and map it to virtio-pci-legacy.c]
> #endif
> 
> So BAR0 isn't defined as long as legacy code is there, which makes
> falling back to legacy pretty simple.

But it's nicer to see the driver actually labelled "virtio-pci-legacy",
and such a module.

I'll code something up, see what it looks like.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-11-30 Thread Rusty Russell
On Wed, 30 Nov 2011 18:11:51 +0200, Sasha Levin  wrote:
> On Tue, 2011-11-29 at 16:58 +0200, Avi Kivity wrote:
> > On 11/29/2011 04:54 PM, Michael S. Tsirkin wrote:
> > > > 
> > > > Which is actually strange, weren't indirect buffers introduced to make
> > > > the performance *better*? From what I see it's pretty much the
> > > > same/worse for virtio-blk.
> > >
> > > I know they were introduced to allow adding very large bufs.
> > > See 9fa29b9df32ba4db055f3977933cd0c1b8fe67cd
> > > Mark, you wrote the patch, could you tell us which workloads
> > > benefit the most from indirect bufs?
> > >
> > 
> > Indirects are really for block devices with many spindles, since there
> > the limiting factor is the number of requests in flight.  Network
> > interfaces are limited by bandwidth, it's better to increase the ring
> > size and use direct buffers there (so the ring size more or less
> > corresponds to the buffer size).
> > 
> 
> I did some testing of indirect descriptors under different workloads.

MST and I discussed getting clever with dynamic limits ages ago, but it
was down low on the TODO list.  Thanks for diving into this...

AFAICT, if the ring never fills, direct is optimal.  When the ring
fills, indirect is optimal (we're better to queue now than later).

Why not something simple, like a threshold which drops every time we
fill the ring?

struct vring_virtqueue
{
...
int indirect_thresh;
...
}

virtqueue_add_buf_gfp()
{
...

if (vq->indirect &&
(vq->vring.num - vq->num_free) + out + in > vq->indirect_thresh)
return indirect()
...

if (vq->num_free < out + in) {
if (vq->indirect && vq->indirect_thresh > 0)
vq->indirect_thresh--;

...
}

Too dumb?

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm tools: Support virtio indirect buffers

2011-11-30 Thread Rusty Russell
On Mon, 28 Nov 2011 23:05:21 +0200, Sasha Levin  wrote:
> On Mon, 2011-11-28 at 22:17 +0200, Sasha Levin wrote:
> btw, on an unrelated subject, I think that with this patch we've fully
> covered the virtio spec, and as far as I know it's the first userspace
> implementation which covers the entire spec :)

BTW, why did you bother implementing virtio-mmio?  It seems like
gratuitous bloat.  I hope you're not going the same way as qemu :(

Thanks,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Michael S. Tsirkin
On Thu, Dec 01, 2011 at 01:27:10AM +0200, Ohad Ben-Cohen wrote:
> On Wed, Nov 30, 2011 at 6:24 PM, Ohad Ben-Cohen  wrote:
> > On Wed, Nov 30, 2011 at 6:15 PM, Michael S. Tsirkin  wrote:
> >> How are the rings mapped? normal memory, right?
> >
> > No, device memory.
> 
> Ok, I have more info.
> 
> Originally remoteproc was mapping the rings using ioremap, and that
> meant ARM Device memory.
> 
> Recently, though, we moved to CMA (allocating memory for the rings via
> dma_alloc_coherent), and that isn't Device memory anymore: it's
> uncacheable Normal memory (on ARM v6+).

And these accesses need to be ordered with DSB? Or DMB?

> We still require mandatory barriers though: one very reproducible
> problem I personally face is that the avail index doesn't get updated
> before the kick.

Aha! The *kick* really is MMIO. So I think we do need a mandatory barrier
before the kick.  Maybe we need it for virtio-pci as well
(not on kvm, naturally :) Off-hand this seems to belong in the transport
layer but need to think about it.

> As a result, the remote processor misses a buffer
> that was just added (the kick wakes it up only to find that the avail
> index wasn't changed yet). In this case, it probably happens because
> the mailbox, used to kick the remote processor, is mapped as Device
> memory, and therefore the kick can be reordered before the updates to
> the ring  can be observed.
> 
> I did get two additional reports about reordering issues, on different
> setups than my own, and which I can't personally reproduce: the one
> I've described earlier (avail index gets updated before the avail
> array) and one in the receive path (reading a used buffer which we
> already read). I couldn't personally verify those, but both issues
> were reported to be gone when mandatory barriers were used.

Hmm. So it's a hint that something is wrong with memory
but not what's wrong exactly.

> I expect those reports only to increase: the diversity of platforms
> that are now looking into adopting virtio for this kind of
> inter-process communication is quite huge, with several different
> architectures and even more hardware implementations on the way (not
> only ARM).
> 
> Thanks,
> Ohad.

Right. We need to be very careful with memory,
it's a tricky field. One known problem with virtio
is its insistance on using native endian-ness
for some fields. If power is used, we'll have to fix this.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-30 Thread Chris Wright
* Sridhar Samudrala (s...@us.ibm.com) wrote:
> On 11/30/2011 3:00 PM, Chris Wright wrote:
> >physical port
> >  |
> >+++
> >| +-+ |
> >| | VEB | |
> >| +-+ |
> >|/   |   \|
> >|   /|\   |
> >|  / | \  |
> >+-+--+--+-+
> >   |  |   |
> >  PFVF 1VF 2
> >  /   |   |
> >  +---+---+  VM4  +---+---+
> >  |  sw   |   |macvtap|
> >  | switch|   +---+---+
> >  +-+-+-+-+   |
> >/ | \VM5
> >   /  |  \
> >VM1 VM2 VM3
> >
> >This has VMs 1-3 hanging of the PF via a linux bridge (traditional hv
> >switching), VM4 directly owning VF1 (pci device assignement), and VM5
> >indirectly owning VF2 (macvtap passthrough, that started this whole
> >thing).
> >
> >So, I'm understanding you saying that VM4 or VM4 sending a packet to VM1
> >goes in to VEB, out PF, and into linux bridging code, rigth?  At which
> >point the PF is in promiscuous mode (btw, same does not work if bridge is
> >attached to VF, at least for some VFs, due to lack of promiscuous mode).
> >
> >>Packets sent from a guest with a VF to the address of another guest with
> >>a VF need to be forwarded similarly, but the driver should be able to
> >>infer that from (3).
> >Right, and that works currently for the case where both guests are like
> >VM4, they directly own the VF via PCI device assignement.  But for VM4
> >to talk to VM5, VF3 is not in promiscuous mode and has a different MAC
> >address than VM5's vNIC.  If the embedded bridge does not learn, and
> >nobody programmed it to fwd frames for VM5 via VF3...
> I think you are referring to VF2. There is no VF3 in your picture.

*sigh*  (also meant 'VM4 or VM5' up above, not 'VM4 or VM4')...

> In macvtap passthru mode, VF2 will be set to the same mac address as VM5's
> MAC.  So VM4 should be be able to talk to VM5.

yes (i think macvtap in bridging or vepa mode w/ single VM has that issue,
not passthru)

> >I believe this is what Roopa's patch will allow.  The question now is
> >whether there's a better way to handle this?
> My understanding is that Roopa's patch will allow setting additional mac
> addresses to
> VM5 without the need to put VF5 in promiscous mode.

Thanks for your corrections Sridar.

cheers,
-chris
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-30 Thread Sridhar Samudrala

On 11/30/2011 3:00 PM, Chris Wright wrote:

* Ben Hutchings (bhutchi...@solarflare.com) wrote:

On Wed, 2011-11-30 at 13:04 -0800, Chris Wright wrote:

I agree that it's confusing.  Couldn't you simplify your ascii art
(hopefully removing hw assumptions about receive processing, and
completely ignoring vlans for the moment) to something like:

  |RX
  v
++-+
| +--++|
| | RX MAC filter ||
| |and port select||
| +---+|
|/|\   |
|   / | \   match 2|
|  /  v  \ |
| /match  \|
|/  1 |\   |
|   / | \  |
|match /  |  \ |
|  0  /   |   \|
|v|v   |
||||   |
++++---+
  |||
 PF   VF 1 VF 2

And there's an unclear number of ways to update "RX MAC filter and port
select" table.

1) PF ndo_set_mac_addr
I expect that to be implicit to match 0.

2) PF ndo_set_rx_mode
Less clear, but I'd still expect these to implicitly match 0

3) PF ndo_set_vf_mac
I expect these to be an explicit match to VF N (given the interface
specifices which VF's MAC is being programmed).

I'm not sure whether this is supposed to implicitly add to the MAC
filter or whether that has to be changed too.  That's the main
difference between my models (a) and (b).

I see now.  I wasn't entirely clear on the difference before.  It's also
going to be hw specific.  I think (Intel folks can verify) that the
Intel SR-IOV devices have a single global unicast exact match table,
for example.


There's also PF ndo_set_vf_vlan.

Right, although I had mentioned I was trying to limit just to MAC
filtering to simplify.


4) VF ndo_set_mac_addr
This one may or may not be allowed (setting MAC+port if the VF is owned
by a guest is likely not allowed), but would expect an implicit VF N.

5) VF ndo_set_rx_mode
Same as 4) above.

So this is where we are today.

Cool, good that we agree there.


6) PF or VF? ndo_set_rx_filter_addr
The new proposal, which has an explicit VF, although when it's VF_SELF
I'm not clear if this is just the same as 5) above?

Have I missed anything?

Any physical port can be bridged to a mixture of guests with and without
their own VFs.  Packets sent from a guest with a VF to the address of a
guest without a VF need to be forwarded to the PF rather than the
physical port, but none of the drivers currently get to know about those
addresses.

To clarify, do you mean something like this?

physical port
  |
+++
| +-+ |
| | VEB | |
| +-+ |
|/   |   \|
|   /|\   |
|  / | \  |
+-+--+--+-+
   |  |   |
  PFVF 1VF 2
  /   |   |
  +---+---+  VM4  +---+---+
  |  sw   |   |macvtap|
  | switch|   +---+---+
  +-+-+-+-+   |
/ | \VM5
   /  |  \
VM1 VM2 VM3

This has VMs 1-3 hanging of the PF via a linux bridge (traditional hv
switching), VM4 directly owning VF1 (pci device assignement), and VM5
indirectly owning VF2 (macvtap passthrough, that started this whole
thing).

So, I'm understanding you saying that VM4 or VM4 sending a packet to VM1
goes in to VEB, out PF, and into linux bridging code, rigth?  At which
point the PF is in promiscuous mode (btw, same does not work if bridge is
attached to VF, at least for some VFs, due to lack of promiscuous mode).


Packets sent from a guest with a VF to the address of another guest with
a VF need to be forwarded similarly, but the driver should be able to
infer that from (3).

Right, and that works currently for the case where both guests are like
VM4, they directly own the VF via PCI device assignement.  But for VM4
to talk to VM5, VF3 is not in promiscuous mode and has a different MAC
address than VM5's vNIC.  If the embedded bridge does not learn, and
nobody programmed it to fwd frames for VM5 via VF3...

I think you are referring to VF2. There is no VF3 in your picture.
In macvtap passthru mode, VF2 will be set to the same mac address as 
VM5's MAC.

So VM4 should be be able to talk to VM5.


I believe this is what Roopa's patch will allow.  The question now is
whether there's a better way to handle this?
My understanding is that Roopa's patch will allow setting additional mac 
addresses to

VM5 without the need to put VF5 in promiscous mode.

Thanks
Sridhar


In my mind, we'd model the NIC's embedded bridge as, well, a bridge.
And set anti-spoofing, port mirroring, port mac/vlan filtering, etc via
that bridge.

thanks,
-chris



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Wed, Nov 30, 2011 at 6:24 PM, Ohad Ben-Cohen  wrote:
> On Wed, Nov 30, 2011 at 6:15 PM, Michael S. Tsirkin  wrote:
>> How are the rings mapped? normal memory, right?
>
> No, device memory.

Ok, I have more info.

Originally remoteproc was mapping the rings using ioremap, and that
meant ARM Device memory.

Recently, though, we moved to CMA (allocating memory for the rings via
dma_alloc_coherent), and that isn't Device memory anymore: it's
uncacheable Normal memory (on ARM v6+).

We still require mandatory barriers though: one very reproducible
problem I personally face is that the avail index doesn't get updated
before the kick. As a result, the remote processor misses a buffer
that was just added (the kick wakes it up only to find that the avail
index wasn't changed yet). In this case, it probably happens because
the mailbox, used to kick the remote processor, is mapped as Device
memory, and therefore the kick can be reordered before the updates to
the ring  can be observed.

I did get two additional reports about reordering issues, on different
setups than my own, and which I can't personally reproduce: the one
I've described earlier (avail index gets updated before the avail
array) and one in the receive path (reading a used buffer which we
already read). I couldn't personally verify those, but both issues
were reported to be gone when mandatory barriers were used.

I expect those reports only to increase: the diversity of platforms
that are now looking into adopting virtio for this kind of
inter-process communication is quite huge, with several different
architectures and even more hardware implementations on the way (not
only ARM).

Thanks,
Ohad.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-30 Thread Rose, Gregory V
> -Original Message-
> From: Chris Wright [mailto:chr...@redhat.com]
> Sent: Wednesday, November 30, 2011 3:01 PM
> To: Ben Hutchings
> Cc: Chris Wright; Rose, Gregory V; Roopa Prabhu; net...@vger.kernel.org;
> da...@davemloft.net; s...@us.ibm.com; dragos.tatu...@gmail.com;
> kvm@vger.kernel.org; a...@arndb.de; m...@redhat.com; mc...@broadcom.com;
> dwa...@cisco.com; shemmin...@vyatta.com; eric.duma...@gmail.com;
> ka...@trash.net; be...@cisco.com
> Subject: Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering
> support for passthru mode
> 
> * Ben Hutchings (bhutchi...@solarflare.com) wrote:
> > On Wed, 2011-11-30 at 13:04 -0800, Chris Wright wrote:
> > > I agree that it's confusing.  Couldn't you simplify your ascii art
> > > (hopefully removing hw assumptions about receive processing, and
> > > completely ignoring vlans for the moment) to something like:
> > >
> > >  |RX
> > >  v
> > > ++-+
> > > | +--++|
> > > | | RX MAC filter ||
> > > | |and port select||
> > > | +---+|
> > > |/|\   |
> > > |   / | \   match 2|
> > > |  /  v  \ |
> > > | /match  \|
> > > |/  1 |\   |
> > > |   / | \  |
> > > |match /  |  \ |
> > > |  0  /   |   \|
> > > |v|v   |
> > > ||||   |
> > > ++++---+
> > >  |||
> > > PF   VF 1 VF 2
> > >
> > > And there's an unclear number of ways to update "RX MAC filter and
> port
> > > select" table.
> > >
> > > 1) PF ndo_set_mac_addr
> > > I expect that to be implicit to match 0.
> > >
> > > 2) PF ndo_set_rx_mode
> > > Less clear, but I'd still expect these to implicitly match 0
> > >
> > > 3) PF ndo_set_vf_mac
> > > I expect these to be an explicit match to VF N (given the interface
> > > specifices which VF's MAC is being programmed).
> >
> > I'm not sure whether this is supposed to implicitly add to the MAC
> > filter or whether that has to be changed too.  That's the main
> > difference between my models (a) and (b).
> 
> I see now.  I wasn't entirely clear on the difference before.  It's also
> going to be hw specific.  I think (Intel folks can verify) that the
> Intel SR-IOV devices have a single global unicast exact match table,
> for example.
> 
> > There's also PF ndo_set_vf_vlan.
> 
> Right, although I had mentioned I was trying to limit just to MAC
> filtering to simplify.
> 
> > > 4) VF ndo_set_mac_addr
> > > This one may or may not be allowed (setting MAC+port if the VF is
> owned
> > > by a guest is likely not allowed), but would expect an implicit VF N.
> > >
> > > 5) VF ndo_set_rx_mode
> > > Same as 4) above.
> >
> > So this is where we are today.
> 
> Cool, good that we agree there.
> 
> > > 6) PF or VF? ndo_set_rx_filter_addr
> > > The new proposal, which has an explicit VF, although when it's VF_SELF
> > > I'm not clear if this is just the same as 5) above?
> > >
> > > Have I missed anything?
> >
> > Any physical port can be bridged to a mixture of guests with and without
> > their own VFs.  Packets sent from a guest with a VF to the address of a
> > guest without a VF need to be forwarded to the PF rather than the
> > physical port, but none of the drivers currently get to know about those
> > addresses.
> 
> To clarify, do you mean something like this?
> 
>physical port
>  |
> +++
> | +-+ |
> | | VEB | |
> | +-+ |
> |/   |   \|
> |   /|\   |
> |  / | \  |
> +-+--+--+-+
>   |  |   |
>  PFVF 1VF 2
>  /   |   |
>  +---+---+  VM4  +---+---+
>  |  sw   |   |macvtap|
>  | switch|   +---+---+
>  +-+-+-+-+   |
>/ | \VM5
>   /  |  \
> VM1 VM2 VM3
> 
> This has VMs 1-3 hanging of the PF via a linux bridge (traditional hv
> switching), VM4 directly owning VF1 (pci device assignement), and VM5
> indirectly owning VF2 (macvtap passthrough, that started this whole
> thing).
> 
> So, I'm understanding you saying that VM4 or VM4 sending a packet to VM1
> goes in to VEB, out PF, and into linux bridging code, rigth?  At which
> point the PF is in promiscuous mode (btw, same does not work if bridge is
> attached to VF, at least for some VFs, due to lack of promiscuous mode).
> 
> > Packets sent from a guest with a VF to the address of another guest with
> > a VF need to be forwarded similarly, but the driver should be able to
> > infer that from (3).
> 
> Right, and that works currently for the case where both guests are like
> VM4, they directly own the VF via PCI device assignement.  But for VM4
> to talk to VM5, VF3 is not in promiscuous mode and has a different MAC
> address than VM5's vNIC.  If the embedded bridge does not learn, 

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Michael S. Tsirkin
On Thu, Dec 01, 2011 at 12:43:08AM +0200, Ohad Ben-Cohen wrote:
> On Wed, Nov 30, 2011 at 4:50 PM, Michael S. Tsirkin  wrote:
> > make headers_install
> > make -C tools/virtio/
> > (you'll need an empty stub for tools/virtio/linux/module.h,
> >  I just sent a patch to add that)
> > sudo insmod tools/virtio/vhost_test/vhost_test.ko
> > ./tools/virtio/virtio_test
> 
> Ok, I gave this a spin.
> 
> I've tried to see if reverting d57ed95 has any measurable effect on
> the execution time of virtio_test's run_test(), but I couldn't see any
> (several attempts with and without d57ed95 yielded very similar range
> of execution times).
> 
> YMMV though, especially with real workloads.
> 
> > Real virtualization/x86 can keep using current smp_XX barriers, right?
> 
> Yes, sure. ARM virtualization can too, since smp_XX barriers are
> enough for that scenario.
> 
> > We can have some config for your kind of setup.
> 
> Please note that it can't be a compile-time decision though (unless
> we're willing to effectively revert d57ed95 when this config kicks
> in): it's not unlikely that one would want to have both use cases
> running on the same time.
> 
> Thanks,
> Ohad.

For x86, stores into memory are ordered. So I think that yes, smp_XXX
can be selected at compile time.

So let's forget the virtio strangeness for a minute,

To me it starts looking like we need some new kind of barrier
that handles accesses to DMA coherent memory. dma_Xmb()?
dma_coherent_Xmb()?  For example, on x86, dma_wmb() can be barrier(),
but on your system it needs to do DSB.

We can set the rule that dma barriers are guaranteed stronger
than smp ones, and we can just use dma_ everywhere.
So the strength will be:

smp < dma < mandatory

And now virtio can use DMA barriers and instead of adding
overhead for x86, x86 will actually gain from this,
as we'll drop mandatory barriers on UP systems.

Hmm?

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-30 Thread Chris Wright
* Ben Hutchings (bhutchi...@solarflare.com) wrote:
> On Wed, 2011-11-30 at 13:04 -0800, Chris Wright wrote:
> > I agree that it's confusing.  Couldn't you simplify your ascii art
> > (hopefully removing hw assumptions about receive processing, and
> > completely ignoring vlans for the moment) to something like:
> >
> >  |RX
> >  v
> > ++-+
> > | +--++|
> > | | RX MAC filter ||
> > | |and port select||
> > | +---+|
> > |/|\   |
> > |   / | \   match 2|
> > |  /  v  \ |
> > | /match  \|
> > |/  1 |\   |
> > |   / | \  |
> > |match /  |  \ |
> > |  0  /   |   \|
> > |v|v   |
> > ||||   |
> > ++++---+
> >  |||
> > PF   VF 1 VF 2
> > 
> > And there's an unclear number of ways to update "RX MAC filter and port
> > select" table.
> > 
> > 1) PF ndo_set_mac_addr
> > I expect that to be implicit to match 0.
> > 
> > 2) PF ndo_set_rx_mode
> > Less clear, but I'd still expect these to implicitly match 0
> > 
> > 3) PF ndo_set_vf_mac
> > I expect these to be an explicit match to VF N (given the interface
> > specifices which VF's MAC is being programmed).
> 
> I'm not sure whether this is supposed to implicitly add to the MAC
> filter or whether that has to be changed too.  That's the main
> difference between my models (a) and (b).

I see now.  I wasn't entirely clear on the difference before.  It's also
going to be hw specific.  I think (Intel folks can verify) that the
Intel SR-IOV devices have a single global unicast exact match table,
for example.

> There's also PF ndo_set_vf_vlan.

Right, although I had mentioned I was trying to limit just to MAC
filtering to simplify.

> > 4) VF ndo_set_mac_addr
> > This one may or may not be allowed (setting MAC+port if the VF is owned
> > by a guest is likely not allowed), but would expect an implicit VF N.
> > 
> > 5) VF ndo_set_rx_mode
> > Same as 4) above.
> 
> So this is where we are today.

Cool, good that we agree there.

> > 6) PF or VF? ndo_set_rx_filter_addr
> > The new proposal, which has an explicit VF, although when it's VF_SELF
> > I'm not clear if this is just the same as 5) above?
> > 
> > Have I missed anything?
> 
> Any physical port can be bridged to a mixture of guests with and without
> their own VFs.  Packets sent from a guest with a VF to the address of a
> guest without a VF need to be forwarded to the PF rather than the
> physical port, but none of the drivers currently get to know about those
> addresses.

To clarify, do you mean something like this?

   physical port
 |
+++
| +-+ |
| | VEB | |
| +-+ |
|/   |   \|
|   /|\   |
|  / | \  |
+-+--+--+-+
  |  |   |
 PFVF 1VF 2
 /   |   | 
 +---+---+  VM4  +---+---+
 |  sw   |   |macvtap|
 | switch|   +---+---+
 +-+-+-+-+   |
   / | \VM5
  /  |  \
VM1 VM2 VM3

This has VMs 1-3 hanging of the PF via a linux bridge (traditional hv
switching), VM4 directly owning VF1 (pci device assignement), and VM5
indirectly owning VF2 (macvtap passthrough, that started this whole
thing).

So, I'm understanding you saying that VM4 or VM4 sending a packet to VM1
goes in to VEB, out PF, and into linux bridging code, rigth?  At which
point the PF is in promiscuous mode (btw, same does not work if bridge is
attached to VF, at least for some VFs, due to lack of promiscuous mode).

> Packets sent from a guest with a VF to the address of another guest with
> a VF need to be forwarded similarly, but the driver should be able to
> infer that from (3).

Right, and that works currently for the case where both guests are like
VM4, they directly own the VF via PCI device assignement.  But for VM4
to talk to VM5, VF3 is not in promiscuous mode and has a different MAC
address than VM5's vNIC.  If the embedded bridge does not learn, and
nobody programmed it to fwd frames for VM5 via VF3...

I believe this is what Roopa's patch will allow.  The question now is
whether there's a better way to handle this?

In my mind, we'd model the NIC's embedded bridge as, well, a bridge.
And set anti-spoofing, port mirroring, port mac/vlan filtering, etc via
that bridge.

thanks,
-chris
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Wed, Nov 30, 2011 at 4:50 PM, Michael S. Tsirkin  wrote:
> make headers_install
> make -C tools/virtio/
> (you'll need an empty stub for tools/virtio/linux/module.h,
>  I just sent a patch to add that)
> sudo insmod tools/virtio/vhost_test/vhost_test.ko
> ./tools/virtio/virtio_test

Ok, I gave this a spin.

I've tried to see if reverting d57ed95 has any measurable effect on
the execution time of virtio_test's run_test(), but I couldn't see any
(several attempts with and without d57ed95 yielded very similar range
of execution times).

YMMV though, especially with real workloads.

> Real virtualization/x86 can keep using current smp_XX barriers, right?

Yes, sure. ARM virtualization can too, since smp_XX barriers are
enough for that scenario.

> We can have some config for your kind of setup.

Please note that it can't be a compile-time decision though (unless
we're willing to effectively revert d57ed95 when this config kicks
in): it's not unlikely that one would want to have both use cases
running on the same time.

Thanks,
Ohad.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] ivshmem: fix guest unable to start with ioeventfd

2011-11-30 Thread Cam Macdonell
2011/11/30 Zang Hongyong :
> Can this bug fix patch be applied yet?

Sorry, for not replying yet.  I'll test your patch within the next day.

> With this bug, guest os cannot successfully boot with ioeventfd.
> Thus the new PIO DoorBell patch cannot be posted.

Well, you can certainly post the new patch, just clarify that it's
dependent on this patch.

Sincerely,
Cam

>
> Thanks,
> Hongyong
>
> 于 2011/11/24,星期四 18:05, zanghongy...@huawei.com 写道:
>> From: Hongyong Zang 
>>
>> When a guest boots with ioeventfd, an error (by gdb) occurs:
>>   Program received signal SIGSEGV, Segmentation fault.
>>   0x006009cc in setup_ioeventfds (s=0x171dc40)
>>   at /home/louzhengwei/git_source/qemu-kvm/hw/ivshmem.c:363
>>   363 for (j = 0; j < s->peers[i].nb_eventfds; j++) {
>> The bug is due to accessing s->peers which is NULL.
>>
>> This patch uses the memory region API to replace the old one 
>> kvm_set_ioeventfd_mmio_long().
>> And this patch makes memory_region_add_eventfd() called in ivshmem_read() 
>> when qemu receives
>> eventfd information from ivshmem_server.
>>
>> Signed-off-by: Hongyong Zang 
>> ---
>>  hw/ivshmem.c |   41 ++---
>>  1 files changed, 14 insertions(+), 27 deletions(-)
>>
>> diff --git a/hw/ivshmem.c b/hw/ivshmem.c
>> index 242fbea..be26f03 100644
>> --- a/hw/ivshmem.c
>> +++ b/hw/ivshmem.c
>> @@ -58,7 +58,6 @@ typedef struct IVShmemState {
>>  CharDriverState *server_chr;
>>  MemoryRegion ivshmem_mmio;
>>
>> -pcibus_t mmio_addr;
>>  /* We might need to register the BAR before we actually have the memory.
>>   * So prepare a container MemoryRegion for the BAR immediately and
>>   * add a subregion when we have the memory.
>> @@ -346,8 +345,14 @@ static void close_guest_eventfds(IVShmemState *s, int 
>> posn)
>>  guest_curr_max = s->peers[posn].nb_eventfds;
>>
>>  for (i = 0; i < guest_curr_max; i++) {
>> -kvm_set_ioeventfd_mmio_long(s->peers[posn].eventfds[i],
>> -s->mmio_addr + DOORBELL, (posn << 16) | i, 0);
>> +if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD)) {
>> +memory_region_del_eventfd(&s->ivshmem_mmio,
>> + DOORBELL,
>> + 4,
>> + true,
>> + (posn << 16) | i,
>> + s->peers[posn].eventfds[i]);
>> +}
>>  close(s->peers[posn].eventfds[i]);
>>  }
>>
>> @@ -355,22 +360,6 @@ static void close_guest_eventfds(IVShmemState *s, int 
>> posn)
>>  s->peers[posn].nb_eventfds = 0;
>>  }
>>
>> -static void setup_ioeventfds(IVShmemState *s) {
>> -
>> -int i, j;
>> -
>> -for (i = 0; i <= s->max_peer; i++) {
>> -for (j = 0; j < s->peers[i].nb_eventfds; j++) {
>> -memory_region_add_eventfd(&s->ivshmem_mmio,
>> -  DOORBELL,
>> -  4,
>> -  true,
>> -  (i << 16) | j,
>> -  s->peers[i].eventfds[j]);
>> -}
>> -}
>> -}
>> -
>>  /* this function increase the dynamic storage need to store data about other
>>   * guests */
>>  static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
>> @@ -491,10 +480,12 @@ static void ivshmem_read(void *opaque, const uint8_t * 
>> buf, int flags)
>>  }
>>
>>  if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD)) {
>> -if (kvm_set_ioeventfd_mmio_long(incoming_fd, s->mmio_addr + 
>> DOORBELL,
>> -(incoming_posn << 16) | guest_max_eventfd, 1) < 0) {
>> -fprintf(stderr, "ivshmem: ioeventfd not available\n");
>> -}
>> +memory_region_add_eventfd(&s->ivshmem_mmio,
>> +  DOORBELL,
>> +  4,
>> +  true,
>> +  (incoming_posn << 16) | guest_max_eventfd,
>> +  incoming_fd);
>>  }
>>
>>  return;
>> @@ -659,10 +650,6 @@ static int pci_ivshmem_init(PCIDevice *dev)
>>  memory_region_init_io(&s->ivshmem_mmio, &ivshmem_mmio_ops, s,
>>"ivshmem-mmio", IVSHMEM_REG_BAR_SIZE);
>>
>> -if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD)) {
>> -setup_ioeventfds(s);
>> -}
>> -
>>  /* region for registers*/
>>  pci_register_bar(&s->dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
>>   &s->ivshmem_mmio);
>
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-30 Thread Ben Hutchings
On Wed, 2011-11-30 at 13:04 -0800, Chris Wright wrote:
> * Ben Hutchings (bhutchi...@solarflare.com) wrote:
> > On Wed, 2011-11-30 at 09:34 -0800, Greg Rose wrote:
> > > On 11/29/2011 9:19 AM, Ben Hutchings wrote:
> > > > On Tue, 2011-11-29 at 16:35 +, Ben Hutchings wrote:
> > > >>
> > > >> Maybe I missed something!
> > [...]
> > > >> If not, please explain what the new model *is*.
> > > 
> > > The new model is to incorporate a VEB into the NIC.  The current model 
> > > doesn't address any of the requirements of a VEB in the NIC and this 
> > > proposed set of patches allow us to set MAC filters for the *ports* on 
> > > the internal NIC VEB.  Consider the PF and each of the VFs as just a 
> > > port on the VEB.  We need the ability to set L2 filters (MAC, MC and 
> > > VLAN) for each of the ports on that VEB.  There is no currently 
> > > supported method for doing this.  So yes, this is a new model although 
> > > it's a fairly simple one.
> > 
> > Explain precisely how the VEB changes the existing model.  Explain how
> > the existing MAC filter and VF filter APIs interact with port filters on
> > the VEB.  Refer to any relevant standards.
> 
> I agree that it's confusing.  Couldn't you simplify your ascii art
> (hopefully removing hw assumptions about receive processing, and
> completely ignoring vlans for the moment) to something like:
>
>  |RX
>  v
> ++-+
> | +--++|
> | | RX MAC filter ||
> | |and port select||
> | +---+|
> |/|\   |
> |   / | \   match 2|
> |  /  v  \ |
> | /match  \|
> |/  1 |\   |
> |   / | \  |
> |match /  |  \ |
> |  0  /   |   \|
> |v|v   |
> ||||   |
> ++++---+
>  |||
> PF   VF 1 VF 2
> 
> And there's an unclear number of ways to update "RX MAC filter and port
> select" table.
> 
> 1) PF ndo_set_mac_addr
> I expect that to be implicit to match 0.
> 
> 2) PF ndo_set_rx_mode
> Less clear, but I'd still expect these to implicitly match 0
> 
> 3) PF ndo_set_vf_mac
> I expect these to be an explicit match to VF N (given the interface
> specifices which VF's MAC is being programmed).

I'm not sure whether this is supposed to implicitly add to the MAC
filter or whether that has to be changed too.  That's the main
difference between my models (a) and (b).

There's also PF ndo_set_vf_vlan.

> 4) VF ndo_set_mac_addr
> This one may or may not be allowed (setting MAC+port if the VF is owned
> by a guest is likely not allowed), but would expect an implicit VF N.
> 
> 5) VF ndo_set_rx_mode
> Same as 4) above.

So this is where we are today.

> 6) PF or VF? ndo_set_rx_filter_addr
> The new proposal, which has an explicit VF, although when it's VF_SELF
> I'm not clear if this is just the same as 5) above?
> 
> Have I missed anything?

Any physical port can be bridged to a mixture of guests with and without
their own VFs.  Packets sent from a guest with a VF to the address of a
guest without a VF need to be forwarded to the PF rather than the
physical port, but none of the drivers currently get to know about those
addresses.

Packets sent from a guest with a VF to the address of another guest with
a VF need to be forwarded similarly, but the driver should be able to
infer that from (3).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] Virt: screendump thread - handle IOErrors on PPM conversion

2011-11-30 Thread Lucas Meneghel Rodrigues
Under some conditions, monitor screendumps can get truncated,
generating IOError exceptions during PIL conversion. So
handle those errors and log a warning rather than failing
the entire screendump thread.

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/virt/virt_env_process.py |   14 +++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/client/virt/virt_env_process.py b/client/virt/virt_env_process.py
index 3f996ca..ab2f77e 100644
--- a/client/virt/virt_env_process.py
+++ b/client/virt/virt_env_process.py
@@ -511,9 +511,17 @@ def _take_screendumps(test, params, env):
 pass
 else:
 try:
-image = PIL.Image.open(temp_filename)
-image.save(screendump_filename, format="JPEG", 
quality=quality)
-cache[hash] = screendump_filename
+try:
+image = PIL.Image.open(temp_filename)
+image.save(screendump_filename, format="JPEG",
+   quality=quality)
+cache[hash] = screendump_filename
+except IOError, error_detail:
+logging.warning("VM '%s' failed to produce a "
+"screendump: %s", vm.name, 
error_detail)
+# Decrement the counter as we in fact failed to
+# produce a converted screendump
+counter[vm] -= 1
 except NameError:
 pass
 os.unlink(temp_filename)
-- 
1.7.7.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] KVM test: subtests.cfg.sample: Decrease boot_savevm login timeout

2011-11-30 Thread Lucas Meneghel Rodrigues
This way we'll have much more stress by doing more save/load
cycles during boot time. We did see some disk corruption problems
using this value, but the condition is not 100% reproducible.

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/subtests.cfg.sample |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/client/tests/kvm/subtests.cfg.sample 
b/client/tests/kvm/subtests.cfg.sample
index 03ddbe2..55f1b21 100644
--- a/client/tests/kvm/subtests.cfg.sample
+++ b/client/tests/kvm/subtests.cfg.sample
@@ -348,7 +348,7 @@ variants:
 - boot_savevm: install setup image_copy unattended_install.cdrom
 type = boot_savevm
 savevm_delay = 0.3
-savevm_login_delay = 120
+savevm_login_delay = 5
 savevm_timeout = 2000
 kill_vm_on_error = yes
 kill_vm_gracefully = yes
-- 
1.7.7.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] KVM test: boot_savevm: Add more debug and kernel panic detection

2011-11-30 Thread Lucas Meneghel Rodrigues
Print total time elapsed and number of save/load
VM cycles performed during the test. Also, verify whether
a kernel panic happened during the test execution.

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/tests/boot_savevm.py |   23 +--
 1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/client/tests/kvm/tests/boot_savevm.py 
b/client/tests/kvm/tests/boot_savevm.py
index d02463d..d4899de 100644
--- a/client/tests/kvm/tests/boot_savevm.py
+++ b/client/tests/kvm/tests/boot_savevm.py
@@ -19,10 +19,14 @@ def run_boot_savevm(test, params, env):
 vm.verify_alive()
 savevm_delay = float(params.get("savevm_delay"))
 savevm_login_delay = float(params.get("savevm_login_delay"))
-end_time = time.time() + float(params.get("savevm_timeout"))
+savevm_login_timeout = float(params.get("savevm_timeout"))
+start_time = time.time()
+
+cycles = 0
 
 successful_login = False
-while time.time() < end_time:
+while (time.time() - start_time) < savevm_login_timeout:
+logging.info("Save/load cycle %d", cycles + 1)
 time.sleep(savevm_delay)
 try:
 vm.monitor.cmd("stop")
@@ -45,13 +49,20 @@ def run_boot_savevm(test, params, env):
 except kvm_monitor.MonitorError, e:
 logging.error(e)
 
+vm.verify_kernel_crash()
+
 try:
 vm.wait_for_login(timeout=savevm_login_delay)
 successful_login = True
 break
-except Exception, detail:
-logging.debug(detail)
+except:
+pass
+
+cycles += 1
 
+time_elapsed = int(time.time() - start_time)
+info = "after %s s, %d load/save cycles" % (time_elapsed, cycles + 1)
 if not successful_login:
-raise error.TestFail("Not possible to log onto the vm after %s s" %
- params.get("savevm_timeout"))
+raise error.TestFail("Can't log on '%s' %s" % (vm.name, info))
+else:
+logging.info("Test ended %s", info)
-- 
1.7.7.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] KVM test: Avoid race condition on boot_savevm

2011-11-30 Thread Lucas Meneghel Rodrigues
As while loops termination conditions are checked at
the beginning of the loop, if we successfuly log onto
the vm on the last cycle, but the timeout happened
to be ended by then, we'd have a test failure.

So, introduce a variable that records whether the
test managed to log onto the VM, and use this variable
as the criteria for PASS/FAIL.

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/tests/boot_savevm.py |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/client/tests/kvm/tests/boot_savevm.py 
b/client/tests/kvm/tests/boot_savevm.py
index 91199fb..d02463d 100644
--- a/client/tests/kvm/tests/boot_savevm.py
+++ b/client/tests/kvm/tests/boot_savevm.py
@@ -21,6 +21,7 @@ def run_boot_savevm(test, params, env):
 savevm_login_delay = float(params.get("savevm_login_delay"))
 end_time = time.time() + float(params.get("savevm_timeout"))
 
+successful_login = False
 while time.time() < end_time:
 time.sleep(savevm_delay)
 try:
@@ -46,10 +47,11 @@ def run_boot_savevm(test, params, env):
 
 try:
 vm.wait_for_login(timeout=savevm_login_delay)
+successful_login = True
 break
 except Exception, detail:
 logging.debug(detail)
 
-if (time.time() > end_time):
+if not successful_login:
 raise error.TestFail("Not possible to log onto the vm after %s s" %
  params.get("savevm_timeout"))
-- 
1.7.7.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-30 Thread Chris Wright
* Ben Hutchings (bhutchi...@solarflare.com) wrote:
> On Wed, 2011-11-30 at 09:34 -0800, Greg Rose wrote:
> > On 11/29/2011 9:19 AM, Ben Hutchings wrote:
> > > On Tue, 2011-11-29 at 16:35 +, Ben Hutchings wrote:
> > >>
> > >> Maybe I missed something!
> [...]
> > >> If not, please explain what the new model *is*.
> > 
> > The new model is to incorporate a VEB into the NIC.  The current model 
> > doesn't address any of the requirements of a VEB in the NIC and this 
> > proposed set of patches allow us to set MAC filters for the *ports* on 
> > the internal NIC VEB.  Consider the PF and each of the VFs as just a 
> > port on the VEB.  We need the ability to set L2 filters (MAC, MC and 
> > VLAN) for each of the ports on that VEB.  There is no currently 
> > supported method for doing this.  So yes, this is a new model although 
> > it's a fairly simple one.
> 
> Explain precisely how the VEB changes the existing model.  Explain how
> the existing MAC filter and VF filter APIs interact with port filters on
> the VEB.  Refer to any relevant standards.

I agree that it's confusing.  Couldn't you simplify your ascii art
(hopefully removing hw assumptions about receive processing, and
completely ignoring vlans for the moment) to something like:

 |RX
 v
++-+
| +--++|
| | RX MAC filter ||
| |and port select||
| +---+|
|/|\   |
|   / | \   match 2|
|  /  v  \ |
| /match  \|
|/  1 |\   |
|   / | \  |
|match /  |  \ |
|  0  /   |   \|
|v|v   |
||||   |
++++---+
 |||
PF   VF 1 VF 2

And there's an unclear number of ways to update "RX MAC filter and port
select" table.

1) PF ndo_set_mac_addr
I expect that to be implicit to match 0.

2) PF ndo_set_rx_mode
Less clear, but I'd still expect these to implicitly match 0

3) PF ndo_set_vf_mac
I expect these to be an explicit match to VF N (given the interface
specifices which VF's MAC is being programmed).

4) VF ndo_set_mac_addr
This one may or may not be allowed (setting MAC+port if the VF is owned
by a guest is likely not allowed), but would expect an implicit VF N.

5) VF ndo_set_rx_mode
Same as 4) above.

6) PF or VF? ndo_set_rx_filter_addr
The new proposal, which has an explicit VF, although when it's VF_SELF
I'm not clear if this is just the same as 5) above?

Have I missed anything?

thanks,
chris
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is it possible to have SDL without X?

2011-11-30 Thread Brian Jackson

On 11/29/2011 9:29 PM, Matt Graham wrote:

Hello,

Can a guest with SDL graphics run on a host without X? I get an error:
"init kbd.
Could not initialize SDL - exiting"

The above happens on a host with X after running "/etc/init.d/xdm stop" and "chmod 
-R 777 /dev".
If I don't do the chmod, SDL complains about not being able to open the 
framebuffer and exits.

The host is Debian Squeeze with the standard qemu-kvm package in their 
repository (version 0.12.5).
The guest xml does not specify a keyboard device.
The same guest runs fine under X.

If there is any other information that could be useful, I will be very happy to 
provide it.
If this is not the right place for such questions, apologies, please let me 
know what the right place is.



The qemu list might be better since that code all originated there. What 
exactly are you trying to achieve? It sounds like you are trying to get 
the guest to display on a Linux console via sdl. From what I understand 
that's going to be severely limited functionality wise.





Thanks!
Richard

  --
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-30 Thread Ben Hutchings
On Wed, 2011-11-30 at 09:34 -0800, Greg Rose wrote:
> On 11/29/2011 9:19 AM, Ben Hutchings wrote:
> > On Tue, 2011-11-29 at 16:35 +, Ben Hutchings wrote:
> >>
> >> Maybe I missed something!
[...]
> >> If not, please explain what the new model *is*.
> 
> The new model is to incorporate a VEB into the NIC.  The current model 
> doesn't address any of the requirements of a VEB in the NIC and this 
> proposed set of patches allow us to set MAC filters for the *ports* on 
> the internal NIC VEB.  Consider the PF and each of the VFs as just a 
> port on the VEB.  We need the ability to set L2 filters (MAC, MC and 
> VLAN) for each of the ports on that VEB.  There is no currently 
> supported method for doing this.  So yes, this is a new model although 
> it's a fairly simple one.

Explain precisely how the VEB changes the existing model.  Explain how
the existing MAC filter and VF filter APIs interact with port filters on
the VEB.  Refer to any relevant standards.

(I have really had enough of net driver API proposals where all the
difficult questions are punted to the implementation.  Either
implementations diverge and users and userspace developers are left
horribly confused, or else the second and subsequent implementations
have to follow whatever quirks the first implementation had.  It's an
essential part of the review process that such questions are asked and
answered.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Embeddedxen-devel] [Xen-devel] [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions

2011-11-30 Thread Stefano Stabellini
On Wed, 30 Nov 2011, Arnd Bergmann wrote:
> > In principal we could also offer the user options as to which particular
> > platform a guest looks like.
> 
> At least when using a qemu based simulation. Most platforms have some
> characteristics that are not meaningful in a classic virtualization
> scenario, but it would certainly be helpful to use the virtualization
> extensions to run a kernel that was built for a particular platform
> faster than with pure qemu, when you want to test that kernel image.
> 
> It has been suggested in the past that it would be nice to run the
> guest kernel built for the same platform as the host kernel by
> default, but I think it would be much better to have just one
> platform that we end up using for guests on any host platform,
> unless there is a strong reason to do otherwise.
> 
> There is also ongoing restructuring in the ARM Linux kernel to
> allow running the same kernel binary on multiple platforms. While
> there is still a lot of work to be done, you should assume that
> we will finish it before you see lots of users in production, there
> is no need to plan for the current one-kernel-per-board case.

It is very good to hear, I am counting on it.


> > > Ok. It would of course still be possible to agree on an argument passing
> > > convention so that we can share the macros used to issue the hcalls,
> > > even if the individual commands are all different.
> > 
> > I think it likely that we can all agree on a common calling convention
> > for N-argument hypercalls. It doubt there are that many useful choices
> > with conflicting requirements yet strongly compelling advantages.
> 
> Exactly. I think it's only lack of communication that has resulted in
> different interfaces for each hypervisor on the other architectures.

It is also due to history: on X86 it was possible to issue hypercalls to
Xen before VMCALL (the X86 version of HVC) was available.


> KVM and Xen at least both fall into the single-return-value category,
> so we should be able to agree on a calling conventions. KVM does not
> have an hcall API on ARM yet, and I see no reason not to use the
> same implementation that you have in the Xen guest.
> 
> Stefano, can you split out the generic parts of your asm/xen/hypercall.h
> file into a common asm/hypercall.h and submit it for review to the
> arm kernel list?

Sure, I can do that.
Usually the hypercall calling convention is very hypervisor specific,
but if it turns out that we have the same requirements I happy to design
a common interface.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions

2011-11-30 Thread Arnd Bergmann
On Wednesday 30 November 2011, Ian Campbell wrote:
> On Wed, 2011-11-30 at 14:32 +, Arnd Bergmann wrote:
> > On Wednesday 30 November 2011, Ian Campbell wrote:
> > What I suggested to the KVM developers is to start out with the
> > vexpress platform, but then generalize it to the point where it fits
> > your needs. All hardware that one expects a guest to have (GIC, timer,
> > ...) will still show up in the same location as on a real vexpress,
> > while anything that makes no sense or is better paravirtualized (LCD,
> > storage, ...) just becomes optional and has to be described in the
> > device tree if it's actually there.
> 
> That's along the lines of what I was thinking as well.
> 
> The DT contains the address of GIC, timer etc as well right? So at least
> in principal we needn't provide e.g. the GIC at the same address as any
> real platform but in practice I expect we will.

Yes.

> In principal we could also offer the user options as to which particular
> platform a guest looks like.

At least when using a qemu based simulation. Most platforms have some
characteristics that are not meaningful in a classic virtualization
scenario, but it would certainly be helpful to use the virtualization
extensions to run a kernel that was built for a particular platform
faster than with pure qemu, when you want to test that kernel image.

It has been suggested in the past that it would be nice to run the
guest kernel built for the same platform as the host kernel by
default, but I think it would be much better to have just one
platform that we end up using for guests on any host platform,
unless there is a strong reason to do otherwise.

There is also ongoing restructuring in the ARM Linux kernel to
allow running the same kernel binary on multiple platforms. While
there is still a lot of work to be done, you should assume that
we will finish it before you see lots of users in production, there
is no need to plan for the current one-kernel-per-board case.

> > Ok. It would of course still be possible to agree on an argument passing
> > convention so that we can share the macros used to issue the hcalls,
> > even if the individual commands are all different.
> 
> I think it likely that we can all agree on a common calling convention
> for N-argument hypercalls. It doubt there are that many useful choices
> with conflicting requirements yet strongly compelling advantages.

Exactly. I think it's only lack of communication that has resulted in
different interfaces for each hypervisor on the other architectures.

KVM and Xen at least both fall into the single-return-value category,
so we should be able to agree on a calling conventions. KVM does not
have an hcall API on ARM yet, and I see no reason not to use the
same implementation that you have in the Xen guest.

Stefano, can you split out the generic parts of your asm/xen/hypercall.h
file into a common asm/hypercall.h and submit it for review to the
arm kernel list?

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-30 Thread Andrea Arcangeli
On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote:
> create the guest topology correctly and optimize for NUMA. This
> would work for us.

Even on the case of 1 guest that fits in one node, you're not going to
max out the full bandwidth of all memory channels with this.

qemu all can do with ms_mbind/tbind is to create a vtopology that
matches the hardware topology. It has these limits:

1) requires all userland applications to be modified to scan either
   the physical topology if run on host, or the vtopology if run on
   guest to get the full benefit.

2) breaks across live migration if host physical topology changes

3) 1 small guest on a idle numa system that fits in one numa node will
   tell not enough information to the host kernel

4) if used outside of qemu and one threads allocates more memory than
   what fits in one node it won't tell enough info to the host kernel.

About 3): if you've just one guest that fits in one node, each vcpu
should be spread across all the nodes probably, and behave like
MADV_INTERLEAVE if the guest CPU scheduler migrate guests processes in
reverse, the global memory bandwidth will still be used full even if
they will both access remote memory. I've just seen benchmarks where
no pinning runs more than _twice_ as fast than pinning with just 1
guest and only 10 vcpu threads, probably because of that.

About 4): even if the thread scans the numa topology it won't be able
to tell tell enough info to the kernel to know which parts of the
memory may be used more or less (ok it may be possible to call mbind
and vary it at runtime but it adds even more complexity left to the
programmer).

If the vcpu is free to go in any node, and we've a automatic
vcpu<->memory affinity, then the memory will follow the vcpu. And the
scheduler domains should already optimize for maxing out the full
memory bandwidth of all channels.

Trouble 1/2/3/4 applies to the hard bindings as well, not just to
mbind/tbin.

In short it's an incremental step that moves some logic to the kernel
but I don't see it solving all situations optimally and it shares a
lot of the limits of the hard bindings.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-30 Thread Greg Rose


On 11/29/2011 9:19 AM, Ben Hutchings wrote:

On Tue, 2011-11-29 at 16:35 +, Ben Hutchings wrote:


Maybe I missed something!

Let's be clear on what our models are for filtering.  At the moment we
have MAC filters set through ndo_set_rx_mode and VF filters set through
ndo_set_vf_{mac,vlan}.

Ignoring anti-spoofing for the moment, should the currently defined
filters look like this (a):

 TX ^   | RX
|   v
+--+---+-+
|  |  +++|
|  |  |RX MAC filter||
|  |  +++|
|  |   |match|
|  ^   v |
|  |  +++|
|  |  |RX VF filters||
|  |  +---+-+|
| /|\ no /|\ |
|| | \ match/ | |match 2 |
|| ^  \/  v ||
|| |   \  /match||
||  \   \/  1/  ||
||   \  /\  /   ||
|^\/  \/v|
||/\  /\||
||   /  ||  \   ||
||  /   ||   \  ||
|| /||\ ||
||| || |||
+++-++-+++
  || || ||
  PFVF 1   VF 2

or like this (b):

 TX ^   | RX
|   v
+--+---+-+
|  |  +++|
|  |  |RX VF filters||
|  |  +++---+|
|  | no|match  /||
|  ^   v  | ||
|  | +-++ | ||
|  | |RX MAC| | ||
|  | |filter| | ||
|  | +--+ | ||
|  |   |match | ||
| /|\  |  | ||
|| | \ | match| |match 2 |
|| ^  \/1 v ||
|| |  /\  | ||
||  \/  \/  ||
||  /\   \  /   ||
|^ /  \   \/v|
|||\  /\||
||| ||  \   ||
||| ||   \  ||
||| ||\ ||
||| || |||
+++-++-+++
  || || ||
  PFVF 1   VF 2

I think the current model is (a); do you agree?

So is the proposed new model something like this (c):


Corrected diagram:

 TX ^   | RX
|   v
+--+---+-+
|  |  +++|
|  |  |RX MAC filter||
|  ^  +++|
|  |   |match|
|  no match|   v |
|  ++ +++|
|  |loopback filters| |RX VF filters||
|  +-+-++ +---+-+|
|   /|\   /|\ match  /|\ |
|  v | `-+>+-+-.2   / | ||
|   \ \  | |m \ \   / | ||
| match 0\ `-+-+.a \ \ /  v ||
| \  | | \t \ X   / ||
|  \ |  \ \c X \ /  ||
|   \|   \ \h \ X   ||
|\\/\1 X \  v|
|||   /\ |/ \ \ ||
||v  /  ||   \ \||
||| /   ^|\ ||
|||/|v |||
||| || |||
+++-++-+++
  || || ||
  PFVF 1   VF 2


(I've labelled the new filters as loopback filters here, and I'm still
leaving out anti-spoofing.)

If not, please explain what the new model *is*.


The new model is to incorporate a VEB into the NIC.  The current model 
doesn't address any of the requirements of a VEB in the NIC and this 
proposed set of patches allow us to set MAC filters for the *ports* on 
the internal NIC VEB.  Consider the PF and each of the VFs as just a 
port on the VEB.  We need the ability to set L2 filters (MAC, MC and 
VLAN) for each of the ports on that VEB.  There is no currently 
supported method for doing this.  So yes, this is a new model although 
it's a fairly simple one.


If you have an alternative proposal for allowing us to set L2 filters 
for the ports on our NIC VEB then I'm all ears (or eyes as the case may be).


- Greg

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework

2011-11-30 Thread Alex Williamson
On Wed, 2011-11-30 at 09:41 -0600, Stuart Yoder wrote:
> On Tue, Nov 29, 2011 at 5:44 PM, Alex Williamson
>  wrote:
> > On Tue, 2011-11-29 at 17:20 -0600, Stuart Yoder wrote:
> >> >
> >> > BTW, github now has updated trees:
> >> >
> >> > git://github.com/awilliam/linux-vfio.git vfio-next-2029
> >> > git://github.com/awilliam/qemu-vfio.git vfio-ng
> >>
> >> Hi Alex,
> >>
> >> Have been looking at vfio a bit.   A few observations and things
> >> we'll need to figure out as it relates to the Freescale iommu.
> >>
> >> __vfio_dma_map() assumes that mappings are broken into
> >> 4KB pages.   That will not be true for us.   We normally will be mapping
> >> much larger physically contiguous chunks for our guests.  Guests will
> >> get hugetlbfs backed memory with very large pages (e.g. 16MB,
> >> 64MB) or very large chunks allocated by some proprietary
> >> means.
> >
> > Hi Stuart,
> >
> > I think practically everyone has commented on the 4k mappings ;)  There
> > are a few problems around this.  The first is that iommu drivers don't
> > necessarily support sub-region unmapping, so if we map 1GB and later
> > want to unmap 4k, we can't do it atomically.  4k gives us the most
> > flexibility for supporting fine granularities.  Another problem is that
> > we're using get_user_pages to pin memory.  It's been suggested that we
> > should use mlock for this, but I can't find anything that prevents a
> > user from later munlock'ing the memory and then getting access to memory
> > they shouldn't have.  Those kind of limit us, but I don't see it being
> > an API problem for VFIO, just implementation.
> 
> Ok.
> 
> >> Also, mappings will have additional Freescale-specific attributes
> >> that need to get passed through to dma_map somehow.   For
> >> example, the iommu can stash directly into a CPU's cache
> >> and we have iommu mapping properties like the cache stash
> >> target id and an operation mapping attribute.
> >>
> >> How do you envision handling proprietary attributes
> >> in struct vfio_dma_map?
> >
> > Let me turn the question around, how do you plan to support proprietary
> > attributes in the IOMMU API?  Is the user level the appropriate place to
> > specify them, or are they an intrinsic feature of the domain?  We've
> > designed struct vfio_dma_map for extension, so depending on how many
> > bits you need, we can make a conduit using the flags directly or setting
> > a new flag to indicate presence of an arch specific attributes field.
> 
> The attributes are not intrinsic features of the domain.  User space will
> need to set them.  But in thinking about it a bit more I think the attributes
> are more properties of the domain rather than a per map() operation
> characteristic.  I think a separate API might be appropriate.  Define a
> new set_domain_attrs() op in the iommu_ops.In user space, perhaps
>  a new vfio group API-- VFIO_GROUP_SET_ATTRS,
> VFIO_GROUP_GET_ATTRS.

In that case, you should definitely be following what Alexey is thinking
about with an iommu_setup IOMMU API callback.  I think it's shaping up
to do:

x86:
 - Report any IOVA range restrictions imposed by hw implementation
POWER:
 - Request IOVA window size, report size and base
powerpc:
 - Set domain attributes, probably report range as well.

Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-30 Thread Chris Wright
* Peter Zijlstra (a.p.zijls...@chello.nl) wrote:
> On Wed, 2011-11-30 at 21:52 +0530, Dipankar Sarma wrote:
> > 
> > Also, if at all topology changes due to migration or host kernel decisions,
> > we can make use of something like VPHN (virtual processor home node)
> > capability on Power systems to have guest kernel update its topology
> > knowledge. You can refer to that in
> > arch/powerpc/mm/numa.c. 
> 
> I think that fail^Wfeature of PPC is terminally broken. You simply
> cannot change the topology after the fact. 

Agreed, there's too many things that consult topology once and never
look back.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions

2011-11-30 Thread Ian Campbell
On Wed, 2011-11-30 at 14:32 +, Arnd Bergmann wrote:
> On Wednesday 30 November 2011, Ian Campbell wrote:
> > On Wed, 2011-11-30 at 13:03 +, Arnd Bergmann wrote:
> > For domU the DT would presumably be constructed by the toolstack (in
> > dom0 userspace) as appropriate for the guest configuration. I guess this
> > needn't correspond to any particular "real" hardware platform.
> 
> Correct, but it needs to correspond to some platform that is supported
> by the guest OS, which leaves the choice between emulating a real
> hardware platform, adding a completely new platform specifically for
> virtual machines, or something in between the two.
> 
> What I suggested to the KVM developers is to start out with the
> vexpress platform, but then generalize it to the point where it fits
> your needs. All hardware that one expects a guest to have (GIC, timer,
> ...) will still show up in the same location as on a real vexpress,
> while anything that makes no sense or is better paravirtualized (LCD,
> storage, ...) just becomes optional and has to be described in the
> device tree if it's actually there.

That's along the lines of what I was thinking as well.

The DT contains the address of GIC, timer etc as well right? So at least
in principal we needn't provide e.g. the GIC at the same address as any
real platform but in practice I expect we will.

In principal we could also offer the user options as to which particular
platform a guest looks like.

> > > This would also be the place where you tell the guest that it should
> > > look for PV devices. I'm not familiar with how Xen announces PV
> > > devices to the guest on other architectures, but you have the
> > > choice between providing a full "binding", i.e. a formal specification
> > > in device tree format for the guest to detect PV devices in the
> > > same way as physical or emulated devices, or just providing a single
> > > place in the device tree in which the guest detects the presence
> > > of a xen device bus and then uses hcalls to find the devices on that
> > > bus.
> > 
> > On x86 there is an emulated PCI device which serves as the hooking point
> > for the PV drivers. For ARM I don't think it would be unreasonable to
> > have a DT entry instead. I think it would be fine just represent the
> > root of the "xenbus" and further discovery would occur using the normal
> > xenbus mechanisms (so not a full binding). AIUI for buses which are
> > enumerable this is the preferred DT scheme to use.
> 
> In general that is the case, yes. One could argue that any software
> protocol between Xen and the guest is as good as any other, so it
> makes sense to use the device tree to describe all devices here.
> The counterargument to that is that Linux and other OSs already
> support Xenbus, so there is no need to come up with a new binding.

Right.

> I don't care much either way, but I think it would be good to
> use similar solutions across all hypervisors. The two options
> that I've seen discussed for KVM were to use either a virtual PCI
> bus with individual virtio-pci devices as on the PC, or to
> use the new virtio-mmio driver and individually put virtio devices
> into the device tree.
> 
> > > Another topic is the question whether there are any hcalls that
> > > we should try to standardize before we get another architecture
> > > with multiple conflicting hcall APIs as we have on x86 and powerpc.
> > 
> > The hcall API we are currently targeting is the existing Xen API (at
> > least the generic parts of it). These generally deal with fairly Xen
> > specific concepts like grant tables etc.
> 
> Ok. It would of course still be possible to agree on an argument passing
> convention so that we can share the macros used to issue the hcalls,
> even if the individual commands are all different.

I think it likely that we can all agree on a common calling convention
for N-argument hypercalls. It doubt there are that many useful choices
with conflicting requirements yet strongly compelling advantages.

>  I think I also
> remember talk about the need for a set of hypervisor independent calls
> that everyone should implement, but I can't remember what those were.

I'd not heard of this, maybe I just wasn't looking the right way though.

> Maybe we can split the number space into a range of some generic and
> some vendor specific hcalls?

Ian.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-30 Thread Peter Zijlstra
On Wed, 2011-11-30 at 21:52 +0530, Dipankar Sarma wrote:
> 
> Also, if at all topology changes due to migration or host kernel decisions,
> we can make use of something like VPHN (virtual processor home node)
> capability on Power systems to have guest kernel update its topology
> knowledge. You can refer to that in
> arch/powerpc/mm/numa.c. 

I think that fail^Wfeature of PPC is terminally broken. You simply
cannot change the topology after the fact. 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Wed, Nov 30, 2011 at 6:15 PM, Michael S. Tsirkin  wrote:
> How are the rings mapped? normal memory, right?

No, device memory.

> We allocate them with plan alloc_pages_exact in virtio_pci.c ...

I'm not using virtio_pci.c; remoteproc is allocating the rings using
the DMA API.

> Yes wmb() is required to ensure ordering for MMIO.
> But here both accesses: index and ring - are for
> memory, not MMIO.

I'm doing IO with a device over shared memory. It does require
mandatory barriers as I explained.

> Is this something you see in practice?

Yes. These bugs are very real.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-30 Thread Dipankar Sarma
On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote:
> On 11/23/2011 04:03 PM, Andrea Arcangeli wrote:
> >Hi!
> >
> >
> >In my view the trouble of the numa hard bindings is not the fact
> >they're hard and qemu has to also decide the location (in fact it
> >doesn't need to decide the location if you use cpusets and relative
> >mbinds). The bigger problem is the fact either the admin or the app
> >developer has to explicitly scan the numa physical topology (both cpus
> >and memory) and tell the kernel how much memory to bind to each
> >thread. ms_mbind/ms_tbind only partially solve that problem. They're
> >similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
> >don't need an admin or a cpuset-job-scheduler (or a perl script) to
> >redistribute the hardware resources.
> 
> Well yeah, of course the guest needs to see some topology. I don't
> see why we'd have to actually scan the host for this though. All we
> need to tell the kernel is "this memory region is close to that
> thread".
> 
> So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able
> to tell the kernel that this GB of RAM actually is close to that
> vCPU thread.
> 
> Of course the admin still needs to decide how to split up memory.
> That's the deal with emulating real hardware. You get the interfaces
> hardware gets :). However, if you follow a reasonable default
> strategy such as numa splitting your RAM into equal chunks between
> guest vCPUs you're probably close enough to optimal usage models. Or
> at least you could have a close enough approximation of how this
> mapping could work for the _guest_ regardless of the host and when
> you migrate it somewhere else it should also work reasonably well.

Allowing specification of the numa nodes to qemu, allowing
qemu to create cpu+mem grouping (without binding) and letting
the kernel decide how to manage them seems like a reasonable incremental 
step between no guest/host NUMA awareness and automatic NUMA 
configuration in the host kernel. It would be suffice for the current 
needs we see.

Besides migration, we also have use cases where we may want to
have large multi-node VMs that are static (like LPARs), having the guest 
aware of the topology there is helpful. 

Also, if at all topology changes due to migration or host kernel decisions,
we can make use of something like VPHN (virtual processor home node)
capability on Power systems to have guest kernel update its topology
knowledge. You can refer to that in
arch/powerpc/mm/numa.c. Otherwise, as long as the host kernel
maintains mappings requested by ms_tbind()/ms_mbind(), we can
create the guest topology correctly and optimize for NUMA. This
would work for us.

Thanks
Dipankar

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-11-30 Thread Sasha Levin
Sorry, I forgot to copy-paste one of the results :)

On Wed, 2011-11-30 at 18:11 +0200, Sasha Levin wrote:
> I did some testing of indirect descriptors under different workloads.
> 
> All tests were on a 2 vcpu guest with vhost on. Simple TCP_STREAM using
> netperf.
> 
> Indirect desc off:
> guest -> host, 1 stream: ~4600mb/s
> host -> guest, 1 stream: ~5900mb/s
> guest -> host, 8 streams: ~620mb/s (on average)
> host -> guest, 8 stream: ~600mb/s (on average)
> 
> Indirect desc on:
> guest -> host, 1 stream: ~4900mb/s
> host -> guest, 1 stream: ~5400mb/s
> guest -> host, 8 streams: ~620mb/s (on average)
> host -> guest, 8 stream: ~600mb/s (on average)
Should be:
host -> guest, 8 stream: ~515mb/s (on average)

> 
> Which means that for one stream, guest to host gets faster while host to
> guest gets slower when indirect descriptors are on.
> 

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Michael S. Tsirkin
On Wed, Nov 30, 2011 at 06:04:56PM +0200, Ohad Ben-Cohen wrote:
> On Wed, Nov 30, 2011 at 4:59 PM, Michael S. Tsirkin  wrote:
> > I see. And this happens because the ARM processor reorders
> > memory writes
> 
> Yes.
> 
> > And in an SMP configuration, writes are somehow not reordered?
> 
> They are, but then the smp memory barriers are enough to control these
> effects. It's not enough to control reordering as seen by a device
> (which is what our AMP processors are) though.
> 
> (btw, the difference between an SMP processor and a device here lies
> in how the memory is mapped: normal memory vs. device memory
> attributes. it's an ARM thingy).

How are the rings mapped? normal memory, right?
We allocate them with plan alloc_pages_exact in virtio_pci.c ...

> > Just checking that this is not a bug in the smp_wmb implementation
> > for the specific platform.
> 
> No, it's not.
> 
> ARM's smp memory barriers use ARM's DMB instruction, which is enough
> to control SMP effects, whereas ARM's mandatory memory barriers use
> ARM's DSB instruction, which is required to ensure the ordering
> between Device and Normal memory accesses.
> 
> Thanks,
> Ohad.

Yes wmb() is required to ensure ordering for MMIO.
But here both accesses: index and ring - are for
memory, not MMIO.

I could understand ring kick bypassing index write, maybe ...
But you described an index write bypassing descriptor write.
Is this something you see in practice?


-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

2011-11-30 Thread Sasha Levin
On Tue, 2011-11-29 at 16:58 +0200, Avi Kivity wrote:
> On 11/29/2011 04:54 PM, Michael S. Tsirkin wrote:
> > > 
> > > Which is actually strange, weren't indirect buffers introduced to make
> > > the performance *better*? From what I see it's pretty much the
> > > same/worse for virtio-blk.
> >
> > I know they were introduced to allow adding very large bufs.
> > See 9fa29b9df32ba4db055f3977933cd0c1b8fe67cd
> > Mark, you wrote the patch, could you tell us which workloads
> > benefit the most from indirect bufs?
> >
> 
> Indirects are really for block devices with many spindles, since there
> the limiting factor is the number of requests in flight.  Network
> interfaces are limited by bandwidth, it's better to increase the ring
> size and use direct buffers there (so the ring size more or less
> corresponds to the buffer size).
> 

I did some testing of indirect descriptors under different workloads.

All tests were on a 2 vcpu guest with vhost on. Simple TCP_STREAM using
netperf.

Indirect desc off:
guest -> host, 1 stream: ~4600mb/s
host -> guest, 1 stream: ~5900mb/s
guest -> host, 8 streams: ~620mb/s (on average)
host -> guest, 8 stream: ~600mb/s (on average)

Indirect desc on:
guest -> host, 1 stream: ~4900mb/s
host -> guest, 1 stream: ~5400mb/s
guest -> host, 8 streams: ~620mb/s (on average)
host -> guest, 8 stream: ~600mb/s (on average)

Which means that for one stream, guest to host gets faster while host to
guest gets slower when indirect descriptors are on.

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Wed, Nov 30, 2011 at 4:59 PM, Michael S. Tsirkin  wrote:
> I see. And this happens because the ARM processor reorders
> memory writes

Yes.

> And in an SMP configuration, writes are somehow not reordered?

They are, but then the smp memory barriers are enough to control these
effects. It's not enough to control reordering as seen by a device
(which is what our AMP processors are) though.

(btw, the difference between an SMP processor and a device here lies
in how the memory is mapped: normal memory vs. device memory
attributes. it's an ARM thingy).

> Just checking that this is not a bug in the smp_wmb implementation
> for the specific platform.

No, it's not.

ARM's smp memory barriers use ARM's DMB instruction, which is enough
to control SMP effects, whereas ARM's mandatory memory barriers use
ARM's DSB instruction, which is required to ensure the ordering
between Device and Normal memory accesses.

Thanks,
Ohad.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework

2011-11-30 Thread Stuart Yoder
On Tue, Nov 29, 2011 at 5:44 PM, Alex Williamson
 wrote:
> On Tue, 2011-11-29 at 17:20 -0600, Stuart Yoder wrote:
>> >
>> > BTW, github now has updated trees:
>> >
>> > git://github.com/awilliam/linux-vfio.git vfio-next-2029
>> > git://github.com/awilliam/qemu-vfio.git vfio-ng
>>
>> Hi Alex,
>>
>> Have been looking at vfio a bit.   A few observations and things
>> we'll need to figure out as it relates to the Freescale iommu.
>>
>> __vfio_dma_map() assumes that mappings are broken into
>> 4KB pages.   That will not be true for us.   We normally will be mapping
>> much larger physically contiguous chunks for our guests.  Guests will
>> get hugetlbfs backed memory with very large pages (e.g. 16MB,
>> 64MB) or very large chunks allocated by some proprietary
>> means.
>
> Hi Stuart,
>
> I think practically everyone has commented on the 4k mappings ;)  There
> are a few problems around this.  The first is that iommu drivers don't
> necessarily support sub-region unmapping, so if we map 1GB and later
> want to unmap 4k, we can't do it atomically.  4k gives us the most
> flexibility for supporting fine granularities.  Another problem is that
> we're using get_user_pages to pin memory.  It's been suggested that we
> should use mlock for this, but I can't find anything that prevents a
> user from later munlock'ing the memory and then getting access to memory
> they shouldn't have.  Those kind of limit us, but I don't see it being
> an API problem for VFIO, just implementation.

Ok.

>> Also, mappings will have additional Freescale-specific attributes
>> that need to get passed through to dma_map somehow.   For
>> example, the iommu can stash directly into a CPU's cache
>> and we have iommu mapping properties like the cache stash
>> target id and an operation mapping attribute.
>>
>> How do you envision handling proprietary attributes
>> in struct vfio_dma_map?
>
> Let me turn the question around, how do you plan to support proprietary
> attributes in the IOMMU API?  Is the user level the appropriate place to
> specify them, or are they an intrinsic feature of the domain?  We've
> designed struct vfio_dma_map for extension, so depending on how many
> bits you need, we can make a conduit using the flags directly or setting
> a new flag to indicate presence of an arch specific attributes field.

The attributes are not intrinsic features of the domain.  User space will
need to set them.  But in thinking about it a bit more I think the attributes
are more properties of the domain rather than a per map() operation
characteristic.  I think a separate API might be appropriate.  Define a
new set_domain_attrs() op in the iommu_ops.In user space, perhaps
 a new vfio group API-- VFIO_GROUP_SET_ATTRS,
VFIO_GROUP_GET_ATTRS.

Stuart
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Michael S. Tsirkin
On Wed, Nov 30, 2011 at 01:45:05PM +0200, Ohad Ben-Cohen wrote:
> > So you put virtio rings in MMIO memory?
> 
> I'll be precise: the vrings are created in non-cacheable memory, which
> both processors have access to.
> 
> > Could you please give a couple of examples of breakage?
> 
> Sure. Basically, the order of the vring memory operations appear
> differently to the observing processor. For example, avail->idx gets
> updated before the new entry is put in the available array...

I see. And this happens because the ARM processor reorders
memory writes to this uncacheable memory?
And in an SMP configuration, writes are somehow not reordered?

For example, if we had such an AMP configuration with and x86
processor, wmb() (sfence) would be wrong and smp_wmb() would be sufficient.

Just checking that this is not a bug in the smp_wmb implementation
for the specific platform.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions

2011-11-30 Thread Pawel Moll
On Wed, 2011-11-30 at 14:32 +, Arnd Bergmann wrote:
> I don't care much either way, but I think it would be good to
> use similar solutions across all hypervisors. The two options
> that I've seen discussed for KVM were to use either a virtual PCI
> bus with individual virtio-pci devices as on the PC, or to
> use the new virtio-mmio driver and individually put virtio devices
> into the device tree.

Let me just add that the virtio-mmio devices can already be instantiated
from DT (see Documentation/devicetree/bindings/virtio/mmio.txt).

For A9-based VE I'd suggest placing them around 0x1001e000, eg.:

virtio_block@1001e000 {
compatible = "virtio,mmio";
reg = <0x1001e000 0x100>;
interrupts = <41>;
}

Cheers!

Paweł


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Michael S. Tsirkin
On Wed, Nov 30, 2011 at 01:55:53PM +0200, Ohad Ben-Cohen wrote:
> On Tue, Nov 29, 2011 at 5:19 PM, Michael S. Tsirkin  wrote:
> > On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote:
> >> > Is an extra branch faster or slower than reverting d57ed95?
> >>
> >> Sorry, unfortunately I have no way to measure this, as I don't have
> >> any virtualization/x86 setup. I'm developing on ARM SoCs, where
> >> virtualization hardware is coming, but not here yet.
> >
> > You can try using the micro-benchmark in tools/virtio/.
> 
> Hmm, care to show me exactly what do you mean ?

make headers_install
make -C tools/virtio/
(you'll need an empty stub for tools/virtio/linux/module.h,
 I just sent a patch to add that)
sudo insmod tools/virtio/vhost_test/vhost_test.ko
./tools/virtio/virtio_test

> Though I somewhat suspect that any micro-benchmarking I'll do with my
> random ARM SoC will not have much value to real virtualization/x86
> workloads.
> 
> Thanks,
> Ohad.

Real virtualization/x86 can keep using current smp_XX barriers, right?
We can have some config for your kind of setup.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call minutes for November 29

2011-11-30 Thread Anthony Liguori

On 11/30/2011 08:35 AM, Alon Levy wrote:

On Wed, Nov 30, 2011 at 07:54:30AM -0600, Anthony Liguori wrote:
[snip]

But the way we're structuring QOM, we could do very simple bindings
that just used introspection (much like GObject does).


Is this the current tree?
http://repo.or.cz/w/qemu/aliguori.git/tree/refs/heads/qom


That's the end goal, more or less.  The current submission tree is:

https://github.com/aliguori/qemu/tree/qom-upstream.4

I just need to rebase and send those out.

Regards,

Anthony Liguroi





The vast majority of work is fitting everything into an object
model.  Doing the bindings is actually fairly simple.

Regards,

Anthony Liguori


[snip]



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call minutes for November 29

2011-11-30 Thread Alon Levy
On Wed, Nov 30, 2011 at 07:54:30AM -0600, Anthony Liguori wrote:
[snip]
> But the way we're structuring QOM, we could do very simple bindings
> that just used introspection (much like GObject does).

Is this the current tree?
http://repo.or.cz/w/qemu/aliguori.git/tree/refs/heads/qom

> 
> The vast majority of work is fitting everything into an object
> model.  Doing the bindings is actually fairly simple.
> 
> Regards,
> 
> Anthony Liguori
> 
[snip]
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions

2011-11-30 Thread Arnd Bergmann
On Wednesday 30 November 2011, Ian Campbell wrote:
> On Wed, 2011-11-30 at 13:03 +, Arnd Bergmann wrote:
> > On Wednesday 30 November 2011, Stefano Stabellini wrote:
> > This is the same choice people have made for KVM, but it's not
> > necessarily the best option in the long run. In particular, this
> > board has a lot of hardware that you claim to have by putting the
> > machine number there, when you don't really want to emulate it.
> 
> This code is actually setting up dom0 which (for the most part) sees the
> real hardware.

Ok, I see.

> > Pawell Moll is working on a variant of the vexpress code that uses
> > the flattened device tree to describe the present hardware [1], and
> > I think that would be a much better target for an official release.
> > Ideally, the hypervisor should provide the device tree binary (dtb)
> > to the guest OS describing the hardware that is actually there.
> 
> Agreed. Our intention was to use DT so this fits perfectly with our
> plans.
> 
> For dom0 we would expose a (possibly filtered) version of the DT given
> to us by the firmware (e.g. we might hide a serial port to reserve it
> for Xen's use, we'd likely fiddle with the memory map etc).

Ah, very good.

> For domU the DT would presumably be constructed by the toolstack (in
> dom0 userspace) as appropriate for the guest configuration. I guess this
> needn't correspond to any particular "real" hardware platform.

Correct, but it needs to correspond to some platform that is supported
by the guest OS, which leaves the choice between emulating a real
hardware platform, adding a completely new platform specifically for
virtual machines, or something in between the two.

What I suggested to the KVM developers is to start out with the
vexpress platform, but then generalize it to the point where it fits
your needs. All hardware that one expects a guest to have (GIC, timer,
...) will still show up in the same location as on a real vexpress,
while anything that makes no sense or is better paravirtualized (LCD,
storage, ...) just becomes optional and has to be described in the
device tree if it's actually there.

> > This would also be the place where you tell the guest that it should
> > look for PV devices. I'm not familiar with how Xen announces PV
> > devices to the guest on other architectures, but you have the
> > choice between providing a full "binding", i.e. a formal specification
> > in device tree format for the guest to detect PV devices in the
> > same way as physical or emulated devices, or just providing a single
> > place in the device tree in which the guest detects the presence
> > of a xen device bus and then uses hcalls to find the devices on that
> > bus.
> 
> On x86 there is an emulated PCI device which serves as the hooking point
> for the PV drivers. For ARM I don't think it would be unreasonable to
> have a DT entry instead. I think it would be fine just represent the
> root of the "xenbus" and further discovery would occur using the normal
> xenbus mechanisms (so not a full binding). AIUI for buses which are
> enumerable this is the preferred DT scheme to use.

In general that is the case, yes. One could argue that any software
protocol between Xen and the guest is as good as any other, so it
makes sense to use the device tree to describe all devices here.
The counterargument to that is that Linux and other OSs already
support Xenbus, so there is no need to come up with a new binding.

I don't care much either way, but I think it would be good to
use similar solutions across all hypervisors. The two options
that I've seen discussed for KVM were to use either a virtual PCI
bus with individual virtio-pci devices as on the PC, or to
use the new virtio-mmio driver and individually put virtio devices
into the device tree.

> > Another topic is the question whether there are any hcalls that
> > we should try to standardize before we get another architecture
> > with multiple conflicting hcall APIs as we have on x86 and powerpc.
> 
> The hcall API we are currently targeting is the existing Xen API (at
> least the generic parts of it). These generally deal with fairly Xen
> specific concepts like grant tables etc.

Ok. It would of course still be possible to agree on an argument passing
convention so that we can share the macros used to issue the hcalls,
even if the individual commands are all different. I think I also
remember talk about the need for a set of hypervisor independent calls
that everyone should implement, but I can't remember what those were.
Maybe we can split the number space into a range of some generic and
some vendor specific hcalls?

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions

2011-11-30 Thread Stefano Stabellini
On Wed, 30 Nov 2011, Catalin Marinas wrote:
> On 30 November 2011 11:39, Stefano Stabellini
>  wrote:
> > A git branch is available here (not ready for submission):
> >
> > git://xenbits.xen.org/people/sstabellini/linux-pvhvm.git arm
> >
> > the branch above is based on git://linux-arm.org/linux-2.6.git arm-lpae,
> > even though guests don't really need lpae support to run on Xen.
> 
> Indeed, you don't really need LPAE. What you may need though is
> generic timers support for A15, it would allow less Hypervisor traps.
> For up-to-date architecture patches (well, development tree, not
> guaranteed to be stable), I would recommend this (they get into
> mainline at some point):
> 
> http://git.kernel.org/?p=linux/kernel/git/cmarinas/linux-arm-arch.git;a=summary
> 
> Either use master or just cherry-pick the branches that you are interested in.

Thanks, I'll rebase on that.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions

2011-11-30 Thread Catalin Marinas
On 30 November 2011 11:39, Stefano Stabellini
 wrote:
> A git branch is available here (not ready for submission):
>
> git://xenbits.xen.org/people/sstabellini/linux-pvhvm.git arm
>
> the branch above is based on git://linux-arm.org/linux-2.6.git arm-lpae,
> even though guests don't really need lpae support to run on Xen.

Indeed, you don't really need LPAE. What you may need though is
generic timers support for A15, it would allow less Hypervisor traps.
For up-to-date architecture patches (well, development tree, not
guaranteed to be stable), I would recommend this (they get into
mainline at some point):

http://git.kernel.org/?p=linux/kernel/git/cmarinas/linux-arm-arch.git;a=summary

Either use master or just cherry-pick the branches that you are interested in.

-- 
Catalin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call minutes for November 29

2011-11-30 Thread Anthony Liguori

On 11/30/2011 03:54 AM, Daniel P. Berrange wrote:

On Wed, Nov 30, 2011 at 11:22:37AM +0200, Alon Levy wrote:

On Tue, Nov 29, 2011 at 04:59:51PM -0600, Anthony Liguori wrote:

On 11/29/2011 10:59 AM, Avi Kivity wrote:

On 11/29/2011 05:51 PM, Juan Quintela wrote:

How to do high level stuff?
- python?



One of the disadvantages of the various scripting languages is the lack
of static type checking, which makes it harder to do full sweeps of the
source for API changes, relying on the compiler to catch type (or other)
errors.


This is less interesting to me (figuring out the perfectest language to use).

I think what's more interesting is the practical execution of
something like this.  Just assuming we used python (since that's
what I know best), I think we could do something like this:

1) We could write a binding layer to expose the QMP interface as a
python module.  This would be very little binding code but would
bring a bunch of functionality to python bits.


If going this route, I would propose to use gobject-introspection [1]
instead of directly binding to python. You should be able to get
multiple languages support this way, including python. I think it
requires using glib 3.0, but I haven't tested it myself (yet). Maybe
someone more knowledgable can shoot it down.

[1] http://live.gnome.org/GObjectIntrospection/

Actually this might make sense for the whole of QEMU. I think for a
defined interface like QMP implementing the interface directly in python
makes more sense. But having qemu itself GObject'ified and scriptable
is cool. It would also lend it self to 4) without going through 2), but
also make 2) possible (with any language, not just python).


I think taking advantage of GObject introspection is fine idea


GObject isn't flexible enough for our needs within the device model 
unfortunately.

The main problem is GObject properties.  They are tied to the class and only 
support types with copy semantics.  We need object based properties and full 
builder semantics for accessors.


But the way we're structuring QOM, we could do very simple bindings that just 
used introspection (much like GObject does).


The vast majority of work is fitting everything into an object model.  Doing the 
bindings is actually fairly simple.


Regards,

Anthony Liguori

 - I

certainly don't want to manually create python (or any other language)
bindings for any C code ever again. GObject + introspection takes away
all the burden of supporting access to C code from non-C languages.
Given that QEMU has already adopted GLib as mandatory infrastructure,
going down the GObject route seems like a very natural fit/direction
to take.

If people like the idea of a higher level language for QEMU, but are
concerned about performance / overhead of embedding a scripting
language in QEMU, then GObject introspection opens the possibilty of
writing in Vala, which is a higher level language which compiles
straight down to machine code like C does.

Regards,
Daniel


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions

2011-11-30 Thread Ian Campbell
On Wed, 2011-11-30 at 13:03 +, Arnd Bergmann wrote:
> On Wednesday 30 November 2011, Stefano Stabellini wrote:
> > On Tue, 29 Nov 2011, Arnd Bergmann wrote:
> > > On Tuesday 29 November 2011, Stefano Stabellini wrote:
> > > 
> > > Do you have a pointer to the kernel sources for the Linux guest?
> > 
> > We have very few changes to the Linux kernel at the moment (only 3
> > commits!), just enough to be able to issue hypercalls and start a PV
> > console.
> > 
> > A git branch is available here (not ready for submission):
> > 
> > git://xenbits.xen.org/people/sstabellini/linux-pvhvm.git arm
> 
> Ok, interesting. There really isn't much of the platform support
> that I was expecting there. I finally found the information
> I was looking for in the xen construct_dom0() function:
> 
>  167 regs->r0 = 0; /* SBZ */
>  168 regs->r1 = 2272; /* Machine NR: Versatile Express */
>  169 regs->r2 = 0xc100; /* ATAGS */
> 
> What this means is that you are emulating the current ARM/Keil reference
> board, at least to the degree that is necessary to get the guest started.
> 
> This is the same choice people have made for KVM, but it's not
> necessarily the best option in the long run. In particular, this
> board has a lot of hardware that you claim to have by putting the
> machine number there, when you don't really want to emulate it.

This code is actually setting up dom0 which (for the most part) sees the
real hardware.

The hardcoding of the platform is just a short term hack.

> Pawell Moll is working on a variant of the vexpress code that uses
> the flattened device tree to describe the present hardware [1], and
> I think that would be a much better target for an official release.
> Ideally, the hypervisor should provide the device tree binary (dtb)
> to the guest OS describing the hardware that is actually there.

Agreed. Our intention was to use DT so this fits perfectly with our
plans.

For dom0 we would expose a (possibly filtered) version of the DT given
to us by the firmware (e.g. we might hide a serial port to reserve it
for Xen's use, we'd likely fiddle with the memory map etc).

For domU the DT would presumably be constructed by the toolstack (in
dom0 userspace) as appropriate for the guest configuration. I guess this
needn't correspond to any particular "real" hardware platform.

> This would also be the place where you tell the guest that it should
> look for PV devices. I'm not familiar with how Xen announces PV
> devices to the guest on other architectures, but you have the
> choice between providing a full "binding", i.e. a formal specification
> in device tree format for the guest to detect PV devices in the
> same way as physical or emulated devices, or just providing a single
> place in the device tree in which the guest detects the presence
> of a xen device bus and then uses hcalls to find the devices on that
> bus.

On x86 there is an emulated PCI device which serves as the hooking point
for the PV drivers. For ARM I don't think it would be unreasonable to
have a DT entry instead. I think it would be fine just represent the
root of the "xenbus" and further discovery would occur using the normal
xenbus mechanisms (so not a full binding). AIUI for buses which are
enumerable this is the preferred DT scheme to use.

> Another topic is the question whether there are any hcalls that
> we should try to standardize before we get another architecture
> with multiple conflicting hcall APIs as we have on x86 and powerpc.

The hcall API we are currently targeting is the existing Xen API (at
least the generic parts of it). These generally deal with fairly Xen
specific concepts like grant tables etc.

Ian.

> 
>   Arnd
> 
> [1] http://www.spinics.net/lists/arm-kernel/msg149604.html
> 
> ___
> Xen-devel mailing list
> xen-de...@lists.xensource.com
> http://lists.xensource.com/xen-devel


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-30 Thread Sasha Levin
On Wed, 2011-11-30 at 10:10 +1030, Rusty Russell wrote:
> On Mon, 28 Nov 2011 11:15:31 +0200, Sasha Levin  
> wrote:
> > On Mon, 2011-11-28 at 11:25 +1030, Rusty Russell wrote:
> > > I'd like to see kvmtools remove support for legacy mode altogether,
> > > but they probably have existing users.
> > 
> > While we can't simply remove it right away, instead of mixing our
> > implementation for both legacy and new spec in the same code we can
> > split the virtio-pci implementation into two:
> > 
> > - virtio/virtio-pci-legacy.c
> > - virtio/virtio-pci.c
> > 
> > At that point we can #ifdef the entire virtio-pci-legacy.c for now and
> > remove it at the same time legacy virtio-pci is removed from the kernel.
> 
> Hmm, that might be neat, but we can't tell the driver core to try
> virtio-pci before virtio-pci-legacy, so we need detection code in both
> modules (and add a "force" flag to virtio-pci-legacy to tell it to
> accept the device even if it's not a legacy-only one).

I was thinking more in the direction of fallback code in virtio-pci.c to
virtio-pci-legacy.c.

Something like:
#ifdef VIRTIO_PCI_LEGACY
[Create BAR0 and map it to virtio-pci-legacy.c]
#endif

So BAR0 isn't defined as long as legacy code is there, which makes
falling back to legacy pretty simple.

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions

2011-11-30 Thread Arnd Bergmann
On Wednesday 30 November 2011, Stefano Stabellini wrote:
> On Tue, 29 Nov 2011, Arnd Bergmann wrote:
> > On Tuesday 29 November 2011, Stefano Stabellini wrote:
> > 
> > Do you have a pointer to the kernel sources for the Linux guest?
> 
> We have very few changes to the Linux kernel at the moment (only 3
> commits!), just enough to be able to issue hypercalls and start a PV
> console.
> 
> A git branch is available here (not ready for submission):
> 
> git://xenbits.xen.org/people/sstabellini/linux-pvhvm.git arm

Ok, interesting. There really isn't much of the platform support
that I was expecting there. I finally found the information
I was looking for in the xen construct_dom0() function:

 167 regs->r0 = 0; /* SBZ */
 168 regs->r1 = 2272; /* Machine NR: Versatile Express */
 169 regs->r2 = 0xc100; /* ATAGS */

What this means is that you are emulating the current ARM/Keil reference
board, at least to the degree that is necessary to get the guest started.

This is the same choice people have made for KVM, but it's not
necessarily the best option in the long run. In particular, this
board has a lot of hardware that you claim to have by putting the
machine number there, when you don't really want to emulate it.

Pawell Moll is working on a variant of the vexpress code that uses
the flattened device tree to describe the present hardware [1], and
I think that would be a much better target for an official release.
Ideally, the hypervisor should provide the device tree binary (dtb)
to the guest OS describing the hardware that is actually there.

This would also be the place where you tell the guest that it should
look for PV devices. I'm not familiar with how Xen announces PV
devices to the guest on other architectures, but you have the
choice between providing a full "binding", i.e. a formal specification
in device tree format for the guest to detect PV devices in the
same way as physical or emulated devices, or just providing a single
place in the device tree in which the guest detects the presence
of a xen device bus and then uses hcalls to find the devices on that
bus.

Another topic is the question whether there are any hcalls that
we should try to standardize before we get another architecture
with multiple conflicting hcall APIs as we have on x86 and powerpc.

Arnd

[1] http://www.spinics.net/lists/arm-kernel/msg149604.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Tue, Nov 29, 2011 at 5:19 PM, Michael S. Tsirkin  wrote:
> On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote:
>> > Is an extra branch faster or slower than reverting d57ed95?
>>
>> Sorry, unfortunately I have no way to measure this, as I don't have
>> any virtualization/x86 setup. I'm developing on ARM SoCs, where
>> virtualization hardware is coming, but not here yet.
>
> You can try using the micro-benchmark in tools/virtio/.

Hmm, care to show me exactly what do you mean ?

Though I somewhat suspect that any micro-benchmarking I'll do with my
random ARM SoC will not have much value to real virtualization/x86
workloads.

Thanks,
Ohad.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Tue, Nov 29, 2011 at 5:16 PM, Michael S. Tsirkin  wrote:
> This mentions iommu - is there a need to use dma api to let
> the firmware acess the rings? Or does it have access to all
> of memory?

IOMMU may or may not be used, it really depends on the hardware (my
personal SoC does employ one, while others don't).

The vrings are created in non-cacheable memory, which is allocated
using dma_alloc_coherent, but that isn't necessarily controlling the
remote processor access to the memory (a notable example is an
iommu-less remote processor which can directly access the physical
memory).

> Is there cache snooping? If yes access from an external device
> typically works mostly in the same way as smp ...

No, nothing fancy like that. Every processor has its own cache, with
no coherency protocol. The remote processor should really be treated
as a device, and not as a processor that is part of an SMP
configuration, and we must prohibit both the compiler and the CPU from
reordering memory operations.

> So you put virtio rings in MMIO memory?

I'll be precise: the vrings are created in non-cacheable memory, which
both processors have access to.

> Could you please give a couple of examples of breakage?

Sure. Basically, the order of the vring memory operations appear
differently to the observing processor. For example, avail->idx gets
updated before the new entry is put in the available array...

Thanks,
Ohad.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions

2011-11-30 Thread Stefano Stabellini
On Wed, 30 Nov 2011, Anup Patel wrote:
> Hi all,
> 
> I wanted to know how Xen-ARM for A15 will address following concerns:
> 
> - How will Xen-ARM for A15 support legacy guest environment like ARMv5 or 
> ARMv6 ?

It is not our focus at the moment; we are targeting operating systems
that support a modern ARMv7 machine with GIC support. 
That said, it might be possible to run legacy guests in the future,
introducing more emulation to the hypervisor.


> - What if my Cortex-A15 board does not have a GIC with virtualization support 
> ?

We expect most hardware vendors to provide a GIC with virtualization
support. However if they do not, in order to support their boards we'll
have to do more emulation in the hypervisor.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions

2011-11-30 Thread Stefano Stabellini
On Tue, 29 Nov 2011, Arnd Bergmann wrote:
> On Tuesday 29 November 2011, Stefano Stabellini wrote:
> > Hi all,
> > a few weeks ago I (and a few others) started hacking on a
> > proof-of-concept hypervisor port to Cortex-A15 which uses and requires
> > ARMv7 virtualization extensions. The intention of this work was to find
> > out how to best support ARM v7+ on Xen. See
> > http://old-list-archives.xen.org/archives/html/xen-arm/2011-09/msg00013.html
> > for more details. 
> > 
> > I am pleased to announce that significant progress has been made, and
> > that we now have a nascent Xen port for Cortex-A15. The port is based on
> > xen-unstable (HG CS 8d6edc3d26d2) and written from scratch exploiting
> > the latest virtualization, LPAE, GIC and generic timer support in
> > hardware.
> 
> Very nice!
> 
> Do you have a pointer to the kernel sources for the Linux guest?

We have very few changes to the Linux kernel at the moment (only 3
commits!), just enough to be able to issue hypercalls and start a PV
console.


A git branch is available here (not ready for submission):

git://xenbits.xen.org/people/sstabellini/linux-pvhvm.git arm

the branch above is based on git://linux-arm.org/linux-2.6.git arm-lpae,
even though guests don't really need lpae support to run on Xen.


> Since Xen and KVM are both in an early working state right now,
> it would be very nice if we could agree on the guest model to make
> sure that it's always possible to run the same kernel in both
> (and potentially other future) hypervisors without modifications.

Yes, that would be ideal.
We don't plan on making many changes other than enabling PV frontends
and backends. 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call minutes for November 29

2011-11-30 Thread Daniel P. Berrange
On Wed, Nov 30, 2011 at 11:22:37AM +0200, Alon Levy wrote:
> On Tue, Nov 29, 2011 at 04:59:51PM -0600, Anthony Liguori wrote:
> > On 11/29/2011 10:59 AM, Avi Kivity wrote:
> > >On 11/29/2011 05:51 PM, Juan Quintela wrote:
> > >>How to do high level stuff?
> > >>- python?
> > >>
> > >
> > >One of the disadvantages of the various scripting languages is the lack
> > >of static type checking, which makes it harder to do full sweeps of the
> > >source for API changes, relying on the compiler to catch type (or other)
> > >errors.
> > 
> > This is less interesting to me (figuring out the perfectest language to 
> > use).
> > 
> > I think what's more interesting is the practical execution of
> > something like this.  Just assuming we used python (since that's
> > what I know best), I think we could do something like this:
> > 
> > 1) We could write a binding layer to expose the QMP interface as a
> > python module.  This would be very little binding code but would
> > bring a bunch of functionality to python bits.
> 
> If going this route, I would propose to use gobject-introspection [1]
> instead of directly binding to python. You should be able to get
> multiple languages support this way, including python. I think it
> requires using glib 3.0, but I haven't tested it myself (yet). Maybe
> someone more knowledgable can shoot it down.
> 
> [1] http://live.gnome.org/GObjectIntrospection/
> 
> Actually this might make sense for the whole of QEMU. I think for a
> defined interface like QMP implementing the interface directly in python
> makes more sense. But having qemu itself GObject'ified and scriptable
> is cool. It would also lend it self to 4) without going through 2), but
> also make 2) possible (with any language, not just python).

I think taking advantage of GObject introspection is fine idea - I
certainly don't want to manually create python (or any other language)
bindings for any C code ever again. GObject + introspection takes away
all the burden of supporting access to C code from non-C languages.
Given that QEMU has already adopted GLib as mandatory infrastructure,
going down the GObject route seems like a very natural fit/direction
to take.

If people like the idea of a higher level language for QEMU, but are
concerned about performance / overhead of embedding a scripting
language in QEMU, then GObject introspection opens the possibilty of
writing in Vala, which is a higher level language which compiles
straight down to machine code like C does.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] KVM: MMU: audit: replace mmu audit tracepoint with jump-lable

2011-11-30 Thread Xiao Guangrong
On 11/29/2011 06:02 PM, Avi Kivity wrote:

> On 11/29/2011 05:56 AM, Xiao Guangrong wrote:
>> Subject: [PATCH v2 2/5] KVM: MMU: audit: replace mmu audit tracepoint with 
>> jump-lable
>>
>> The tracepoint is only used to audit mmu code, it should not be exposed to
>> user, let us replace it with jump-lable
>>
>>
>>  static bool mmu_audit;
>> +static struct jump_label_key mmu_audit_key;
>> +
>> +#define kvm_mmu_audit(vcpu, point)  \
>> +if (static_branch((&mmu_audit_key))) {  \
>> +__kvm_mmu_audit(vcpu, point);   \
>> +}
>>
>>
> 
> 
> static inline function, please, and as an incremental against next. I'll
> fold it to the parent patch.
> 


OK, this is the new one. Thanks!

Subject: [PATCH] KVM: MMU: audit: inline audit function

inline audit function and little cleanup

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c   |   28 +++-
 arch/x86/kvm/mmu_audit.c |   29 +
 2 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b1178d1..7a8e99c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -59,21 +59,6 @@ enum {
AUDIT_POST_SYNC
 };

-char *audit_point_name[] = {
-   "pre page fault",
-   "post page fault",
-   "pre pte write",
-   "post pte write",
-   "pre sync",
-   "post sync"
-};
-
-#ifdef CONFIG_KVM_MMU_AUDIT
-static void kvm_mmu_audit(struct kvm_vcpu *vcpu, int point);
-#else
-static void kvm_mmu_audit(struct kvm_vcpu *vcpu, int point) { }
-#endif
-
 #undef MMU_DEBUG

 #ifdef MMU_DEBUG
@@ -1539,6 +1524,13 @@ static int kvm_sync_page_transient(struct kvm_vcpu *vcpu,
return ret;
 }

+#ifdef CONFIG_KVM_MMU_AUDIT
+#include "mmu_audit.c"
+#else
+static void kvm_mmu_audit(struct kvm_vcpu *vcpu, int point) { }
+static void mmu_audit_disable(void) { }
+#endif
+
 static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 struct list_head *invalid_list)
 {
@@ -4035,12 +4027,6 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
mmu_free_memory_caches(vcpu);
 }

-#ifdef CONFIG_KVM_MMU_AUDIT
-#include "mmu_audit.c"
-#else
-static void mmu_audit_disable(void) { }
-#endif
-
 void kvm_mmu_module_exit(void)
 {
mmu_destroy_caches();
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index 5df6736..fe15dcc 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -19,6 +19,15 @@

 #include 

+char const *audit_point_name[] = {
+   "pre page fault",
+   "post page fault",
+   "pre pte write",
+   "post pte write",
+   "pre sync",
+   "post sync"
+};
+
 #define audit_printk(kvm, fmt, args...)\
printk(KERN_ERR "audit: (%s) error: "   \
fmt, audit_point_name[kvm->arch.audit_point], ##args)
@@ -227,18 +236,22 @@ static void audit_vcpu_spte(struct kvm_vcpu *vcpu)
 static bool mmu_audit;
 static struct jump_label_key mmu_audit_key;

-static void kvm_mmu_audit(struct kvm_vcpu *vcpu, int point)
+static void __kvm_mmu_audit(struct kvm_vcpu *vcpu, int point)
 {
static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 10);

-   if (static_branch((&mmu_audit_key))) {
-   if (!__ratelimit(&ratelimit_state))
-   return;
+   if (!__ratelimit(&ratelimit_state))
+   return;

-   vcpu->kvm->arch.audit_point = point;
-   audit_all_active_sps(vcpu->kvm);
-   audit_vcpu_spte(vcpu);
-   }
+   vcpu->kvm->arch.audit_point = point;
+   audit_all_active_sps(vcpu->kvm);
+   audit_vcpu_spte(vcpu);
+}
+
+static inline void kvm_mmu_audit(struct kvm_vcpu *vcpu, int point)
+{
+   if (static_branch((&mmu_audit_key)))
+   __kvm_mmu_audit(vcpu, point);
 }

 static void mmu_audit_enable(void)
-- 
1.7.7.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call minutes for November 29

2011-11-30 Thread Alon Levy
On Tue, Nov 29, 2011 at 04:59:51PM -0600, Anthony Liguori wrote:
> On 11/29/2011 10:59 AM, Avi Kivity wrote:
> >On 11/29/2011 05:51 PM, Juan Quintela wrote:
> >>How to do high level stuff?
> >>- python?
> >>
> >
> >One of the disadvantages of the various scripting languages is the lack
> >of static type checking, which makes it harder to do full sweeps of the
> >source for API changes, relying on the compiler to catch type (or other)
> >errors.
> 
> This is less interesting to me (figuring out the perfectest language to use).
> 
> I think what's more interesting is the practical execution of
> something like this.  Just assuming we used python (since that's
> what I know best), I think we could do something like this:
> 
> 1) We could write a binding layer to expose the QMP interface as a
> python module.  This would be very little binding code but would
> bring a bunch of functionality to python bits.

If going this route, I would propose to use gobject-introspection [1]
instead of directly binding to python. You should be able to get
multiple languages support this way, including python. I think it
requires using glib 3.0, but I haven't tested it myself (yet). Maybe
someone more knowledgable can shoot it down.

[1] http://live.gnome.org/GObjectIntrospection/

Actually this might make sense for the whole of QEMU. I think for a
defined interface like QMP implementing the interface directly in python
makes more sense. But having qemu itself GObject'ified and scriptable
is cool. It would also lend it self to 4) without going through 2), but
also make 2) possible (with any language, not just python).

> 
> 2) We could then add a binding layer to let python code implement a
> character device.
> 
> 3) We could implement the HMP logic in Python.
> 
> 4) We could add a GTK widget to replace the SDL displaystate and
> then use python code to implement a more friendly UI.  Most of the
> interaction with such an interface would probably go through (1).
> With clever coding, you could probably let the UI also be stand
> alone using GtkVnc in place of the builtin widget and using a remote
> interface for QMP.
> 
> Regards,
> 
> Anthony Liguori
> 
> >
> >On the other hand, the statically typed languages usually have more
> >boilerplate.  Since one of the goals is to simplify things, this
> >indicates the need for a language with type inference.
> >
> >On the third hand, languages with type inferences are still immature
> >(golang?), so we probably need to keep this discussion going until an
> >obvious choice presents itself.
> >
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V3 2/4] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2011-11-30 Thread Raghavendra K T
Add a hypercall to KVM hypervisor to support pv-ticketlocks 

KVM_HC_KICK_CPU allows the calling vcpu to kick another vcpu out of halt state.

The presence of these hypercalls is indicated to guest via
KVM_FEATURE_KICK_VCPU/KVM_CAP_KICK_VCPU.

Qemu needs a corresponding patch to pass up the presence of this feature to 
guest via cpuid. Patch to qemu will be sent separately.

There is no Xen/KVM hypercall interface to await kick from.

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Suzuki Poulose 
Signed-off-by: Raghavendra K T 
---
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 734c376..8b1d65d 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -16,12 +16,14 @@
 #define KVM_FEATURE_CLOCKSOURCE0
 #define KVM_FEATURE_NOP_IO_DELAY   1
 #define KVM_FEATURE_MMU_OP 2
+
 /* This indicates that the new set of kvmclock msrs
  * are available. The use of 0x11 and 0x12 is deprecated
  */
 #define KVM_FEATURE_CLOCKSOURCE23
 #define KVM_FEATURE_ASYNC_PF   4
 #define KVM_FEATURE_STEAL_TIME 5
+#define KVM_FEATURE_KICK_VCPU  6
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c38efd7..6e1c8b4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2103,6 +2103,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_XSAVE:
case KVM_CAP_ASYNC_PF:
case KVM_CAP_GET_TSC_KHZ:
+   case KVM_CAP_KICK_VCPU:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
@@ -2577,7 +2578,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
u32 function,
 (1 << KVM_FEATURE_NOP_IO_DELAY) |
 (1 << KVM_FEATURE_CLOCKSOURCE2) |
 (1 << KVM_FEATURE_ASYNC_PF) |
-(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
+(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+(1 << KVM_FEATURE_KICK_VCPU);
 
if (sched_info_on())
entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
@@ -5305,6 +5307,26 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
return 1;
 }
 
+/*
+ * kvm_pv_kick_cpu_op:  Kick a vcpu.
+ *
+ * @cpu - vcpu to be kicked.
+ */
+static void kvm_pv_kick_cpu_op(struct kvm *kvm, int cpu)
+{
+   struct kvm_vcpu *vcpu = kvm_get_vcpu(kvm, cpu);
+   struct kvm_mp_state mp_state;
+
+   mp_state.mp_state = KVM_MP_STATE_RUNNABLE;
+   if (vcpu) {
+   vcpu->kicked = 1;
+   /* Ensure kicked is always set before wakeup */
+   barrier();
+   }
+   kvm_arch_vcpu_ioctl_set_mpstate(vcpu, &mp_state);
+   kvm_vcpu_kick(vcpu);
+}
+
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
unsigned long nr, a0, a1, a2, a3, ret;
@@ -5341,6 +5363,10 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_MMU_OP:
r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
break;
+   case KVM_HC_KICK_CPU:
+   kvm_pv_kick_cpu_op(vcpu->kvm, a0);
+   ret = 0;
+   break;
default:
ret = -KVM_ENOSYS;
break;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index f47fcd3..e760035 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -558,6 +558,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_HIOR 67
 #define KVM_CAP_PPC_PAPR 68
 #define KVM_CAP_S390_GMAP 71
+#define KVM_CAP_KICK_VCPU 72
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d526231..ff3b6ff 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -154,6 +154,11 @@ struct kvm_vcpu {
 #endif
 
struct kvm_vcpu_arch arch;
+
+   /*
+* blocked vcpu wakes up by checking this flag set by unlocker.
+*/
+   int kicked;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index 47a070b..19f10bd 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -19,6 +19,7 @@
 #define KVM_HC_MMU_OP  2
 #define KVM_HC_FEATURES3
 #define KVM_HC_PPC_MAP_MAGIC_PAGE  4
+#define KVM_HC_KICK_CPU5
 
 /*
  * hypercalls use architecture specific
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9cfb78..8f4b6db 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -226,6 +226,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, 
unsigned id)
vcpu->kvm = kvm;
vcpu->vcpu_id = id;
vcpu->pid = NULL;
+   vcpu->kicked = 0;
init_waitqueue_head(&vcpu->wq);
kv

[PATCH RFC V3 4/4] kvm : pv-ticketlocks support for linux guests running on KVM hypervisor

2011-11-30 Thread Raghavendra K T
This patch extends Linux guests running on KVM hypervisor to support
pv-ticketlocks. 
During smp_boot_cpus  paravirtualied KVM guest detects if the hypervisor has
required feature (KVM_FEATURE_KICK_VCPU) to support pv-ticketlocks. If so,
 support for pv-ticketlocks is registered via pv_lock_ops.

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Suzuki Poulose 
Signed-off-by: Raghavendra K T 
---
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 8b1d65d..7e419ad 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -195,10 +195,21 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* CONFIG_PARAVIRT_SPINLOCKS */
+static void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
+#define kvm_spinlock_init() do {} while (0)
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a9c2116..dffeea3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -545,6 +546,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
 #endif
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
+   kvm_spinlock_init();
 }
 
 static void __cpuinit kvm_guest_cpu_online(void *dummy)
@@ -627,3 +629,248 @@ static __init int activate_jump_labels(void)
return 0;
 }
 arch_initcall(activate_jump_labels);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+enum kvm_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+
+static struct kvm_spinlock_stats
+{
+   u32 contention_stats[NR_CONTENTION_STATS];
+
+#define HISTO_BUCKETS  30
+   u32 histo_spin_blocked[HISTO_BUCKETS+1];
+
+   u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+   u8 ret;
+   u8 old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+   }
+}
+
+static inline void add_stats(enum kvm_contention_stat var, int val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+   unsigned index = ilog2(delta);
+
+   check_zero();
+
+   if (index < HISTO_BUCKETS)
+   array[index]++;
+   else
+   array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u32 delta = sched_clock() - start;
+
+   __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+   spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+   d_kvm_debug = debugfs_create_dir("kvm", NULL);
+   if (!d_kvm_debug)
+   printk(KERN_WARNING "Could not create 'kvm' debugfs 
directory\n");
+
+   return d_kvm_debug;
+}
+
+static int __init kvm_spinlock_debugfs(void)
+{
+   struct dentry *d_kvm = kvm_init_debugfs();
+
+   if (d_kvm == NULL)
+   return -ENOMEM;
+
+   d_spin_debug = debugfs_create_dir("spinlocks", d_kvm);
+
+   debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);
+
+   debugfs_create_u32("taken_slow", 0444, d_spin_debug,
+  &spinlock_stats.contention_stats[TAKEN_SLOW]);
+   debugfs_create_u32("taken_slow_pickup", 0444, d_spin_debug,
+  &spinlock_stats.contention_stats[TAKEN_SLOW_PICKUP]);
+
+   debugfs_create_u32("released_slow", 0444, d_spin_debug,
+  &spinlock_stats.contention_stats[RELEASED_SLOW]);
+   debugfs_create_u32("released_slow_kicked", 0444, d_spin_debug,
+  &spinlock_stats.contention_stats[RELEASED_SLOW_KICKED]);
+
+   debugfs_create_u64("time_blocked", 0444, d_spin_debug,
+  &spinlock_stats.time_blocked);
+
+   debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
+spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
+
+   return 0;
+}
+fs_initcall(kvm_spinlock_debugfs);
+#else  /* !CONFIG_KVM_DEBUG_FS */
+#d

[PATCH RFC V3 3/4] kvm guest : Added configuration support to enable debug information for KVM Guests

2011-11-30 Thread Raghavendra K T
Added configuration support to enable debug information
for KVM Guests in debugfs

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Suzuki Poulose 
Signed-off-by: Raghavendra K T 
---
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5d8152d..526e3ae 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -561,6 +561,15 @@ config KVM_GUEST
  This option enables various optimizations for running under the KVM
  hypervisor.
 
+config KVM_DEBUG_FS
+   bool "Enable debug information for KVM Guests in debugfs"
+   depends on KVM_GUEST && DEBUG_FS
+   default n
+   ---help---
+ This option enables collection of various statistics for KVM guest.
+ Statistics are displayed in debugfs filesystem. Enabling this option
+ may incur significant overhead.
+
 source "arch/x86/lguest/Kconfig"
 
 config PARAVIRT

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V3 1/4] debugfs: Add support to print u32 array in debugfs

2011-11-30 Thread Raghavendra K T
Add debugfs support to print u32-arrays in debugfs. Move the code from Xen to 
debugfs
to make the code common for other users as well.

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Suzuki Poulose 
Signed-off-by: Raghavendra K T 
---
diff --git a/arch/x86/xen/debugfs.c b/arch/x86/xen/debugfs.c
index 7c0fedd..c8377fb 100644
--- a/arch/x86/xen/debugfs.c
+++ b/arch/x86/xen/debugfs.c
@@ -19,107 +19,3 @@ struct dentry * __init xen_init_debugfs(void)
return d_xen_debug;
 }
 
-struct array_data
-{
-   void *array;
-   unsigned elements;
-};
-
-static int u32_array_open(struct inode *inode, struct file *file)
-{
-   file->private_data = NULL;
-   return nonseekable_open(inode, file);
-}
-
-static size_t format_array(char *buf, size_t bufsize, const char *fmt,
-  u32 *array, unsigned array_size)
-{
-   size_t ret = 0;
-   unsigned i;
-
-   for(i = 0; i < array_size; i++) {
-   size_t len;
-
-   len = snprintf(buf, bufsize, fmt, array[i]);
-   len++;  /* ' ' or '\n' */
-   ret += len;
-
-   if (buf) {
-   buf += len;
-   bufsize -= len;
-   buf[-1] = (i == array_size-1) ? '\n' : ' ';
-   }
-   }
-
-   ret++;  /* \0 */
-   if (buf)
-   *buf = '\0';
-
-   return ret;
-}
-
-static char *format_array_alloc(const char *fmt, u32 *array, unsigned 
array_size)
-{
-   size_t len = format_array(NULL, 0, fmt, array, array_size);
-   char *ret;
-
-   ret = kmalloc(len, GFP_KERNEL);
-   if (ret == NULL)
-   return NULL;
-
-   format_array(ret, len, fmt, array, array_size);
-   return ret;
-}
-
-static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len,
- loff_t *ppos)
-{
-   struct inode *inode = file->f_path.dentry->d_inode;
-   struct array_data *data = inode->i_private;
-   size_t size;
-
-   if (*ppos == 0) {
-   if (file->private_data) {
-   kfree(file->private_data);
-   file->private_data = NULL;
-   }
-
-   file->private_data = format_array_alloc("%u", data->array, 
data->elements);
-   }
-
-   size = 0;
-   if (file->private_data)
-   size = strlen(file->private_data);
-
-   return simple_read_from_buffer(buf, len, ppos, file->private_data, 
size);
-}
-
-static int xen_array_release(struct inode *inode, struct file *file)
-{
-   kfree(file->private_data);
-
-   return 0;
-}
-
-static const struct file_operations u32_array_fops = {
-   .owner  = THIS_MODULE,
-   .open   = u32_array_open,
-   .release= xen_array_release,
-   .read   = u32_array_read,
-   .llseek = no_llseek,
-};
-
-struct dentry *xen_debugfs_create_u32_array(const char *name, mode_t mode,
-   struct dentry *parent,
-   u32 *array, unsigned elements)
-{
-   struct array_data *data = kmalloc(sizeof(*data), GFP_KERNEL);
-
-   if (data == NULL)
-   return NULL;
-
-   data->array = array;
-   data->elements = elements;
-
-   return debugfs_create_file(name, mode, parent, data, &u32_array_fops);
-}
diff --git a/arch/x86/xen/debugfs.h b/arch/x86/xen/debugfs.h
index e281320..12ebf33 100644
--- a/arch/x86/xen/debugfs.h
+++ b/arch/x86/xen/debugfs.h
@@ -3,8 +3,4 @@
 
 struct dentry * __init xen_init_debugfs(void);
 
-struct dentry *xen_debugfs_create_u32_array(const char *name, mode_t mode,
-   struct dentry *parent,
-   u32 *array, unsigned elements);
-
 #endif /* _XEN_DEBUGFS_H */
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index fc506e6..14a8961 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -286,7 +286,7 @@ static int __init xen_spinlock_debugfs(void)
debugfs_create_u64("time_blocked", 0444, d_spin_debug,
   &spinlock_stats.time_blocked);
 
-   xen_debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
+   debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
 spinlock_stats.histo_spin_blocked, 
HISTO_BUCKETS + 1);
 
return 0;
diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 90f7657..df44ccf 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static ssize_t default_read_file(struct file *file, char __user *buf,
 size_t count, loff_t *ppos)
@@ -525,3 +526,130 @@ struct dentry *debugfs_create_blob(const char *name, 
mode_t mode,
return debugfs_create_file(name, mode, parent, blob, &fops_blob);
 }
 EXPORT_SYMBOL_GPL(debugfs_create_blob);
+
+struct array_data {
+   

[PATCH RFC V3 0/4] kvm : Paravirt-spinlock support for KVM guests

2011-11-30 Thread Raghavendra K T
The 4-patch series to follow this email extends KVM-hypervisor and Linux guest 
running on KVM-hypervisor to support pv-ticket spinlocks, based on Xen's 
implementation.

One hypercall is introduced in KVM hypervisor,that allows a vcpu to kick
another vcpu out of halt state.
The blocking of vcpu is done using halt() in (lock_spinning) slowpath.

The V2 change discussion was in:  
 https://lkml.org/lkml/2011/10/23/207
 
Previous discussions : (posted by Srivatsa V).
https://lkml.org/lkml/2010/7/26/24
https://lkml.org/lkml/2011/1/19/212

The BASE patch is tip 3.2-rc1 + Jeremy's following patches.
xadd (https://lkml.org/lkml/2011/10/4/328)
x86/ticketlocklock  (https://lkml.org/lkml/2011/10/12/496).

Changes in V3:
- rebased to 3.2-rc1
- use halt() instead of wait for kick hypercall.
- modify kick hyper call to do wakeup halted vcpu.
- hook kvm_spinlock_init to smp_prepare_cpus call (moved the call out of 
head##.c).
- fix the potential race when zero_stat is read.
- export debugfs_create_32 and add documentation to API.
- use static inline and enum instead of ADDSTAT macro. 
- add  barrier() in after setting kick_vcpu.
- empty static inline function for kvm_spinlock_init.
- combine the patches one and two readuce overhead.
- make KVM_DEBUGFS depends on DEBUGFS.
- include debugfs header unconditionally.

Changes in V2:
- rebased patchesto -rc9
- synchronization related changes based on Jeremy's changes (Jeremy 
Fitzhardinge ) pointed by
Stephan Diestelhorst 
- enabling 32 bit guests
- splitted patches into two more chunks

 Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T (4): 
  Add debugfs support to print u32-arrays in debugfs
  Add a hypercall to KVM hypervisor to support pv-ticketlocks
  Added configuration support to enable debug information for KVM Guests
  pv-ticketlocks support for linux guests running on KVM hypervisor
 
Results:
 From the results we can see that patched kernel performance is similar to
 BASE when  there is no lock contention. But once we start seeing more 
 contention, patched kernel outperforms BASE.

set up : 
Kernel for host/guest : 3.2-rc1 + Jeremy's xadd, pv spinlock patches as BASE

3 guests with 8VCPU, 4GB RAM, 1 used for kernbench (kernbench -f -H -M -o 20) 
other for cpuhog (shell script while 
true with an instruction)

scenario A: unpinned

1x: no hogs
2x: 8hogs in one guest
3x: 8hogs each in two guest

Result for Non PLE machine :
Machine : IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU with 8 core , 
64GB RAM
 BASEBASE+patch%improvement
 mean (sd)   mean (sd)
Scenario A: 
case 1x: 157.548 (10.624)156.408 (11.1622)  0.723589
case 2x: 1110.18 (807.019)   310.96 (105.194)   71.9901
case 3x: 3110.36 (2408.03)   303.688 (110.474)  90.2362

Result for PLE machine:
Machine : IBM xSeries with Intel(R) Xeon(R)  X7560 2.27GHz CPU with 32/64 core, 
with 8  
 online cores and 4*64GB RAM

 BASEBASE+patch%improvement
 mean (sd)   mean (sd)
Scenario A: 
case 1x: 159.725 (47.4906)   159.07 (47.8133)   0.41008
case 2x: 190.957 (49.2976)   187.273 (50.5469)  1.92923
case 3x: 226.317 (88.6023)   223.698 (90.4362)  1.15723

---
 13 files changed, 454 insertions(+), 112 deletions(-)
 arch/x86/Kconfig|9 ++
 arch/x86/include/asm/kvm_para.h |   17 +++-
 arch/x86/kernel/kvm.c   |  247 +++
 arch/x86/kvm/x86.c  |   28 +-
 arch/x86/xen/debugfs.c  |  104 
 arch/x86/xen/debugfs.h  |4 -
 arch/x86/xen/spinlock.c |2 +-
 fs/debugfs/file.c   |  128 
 include/linux/debugfs.h |   11 ++
 include/linux/kvm.h |1 +
 include/linux/kvm_host.h|5 +
 include/linux/kvm_para.h|1 +
 virt/kvm/kvm_main.c |7 +
 13 files changed, 452 insertions(+), 112 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] ivshmem: fix guest unable to start with ioeventfd

2011-11-30 Thread Zang Hongyong
Can this bug fix patch be applied yet?
With this bug, guest os cannot successfully boot with ioeventfd.
Thus the new PIO DoorBell patch cannot be posted.

Thanks,
Hongyong

于 2011/11/24,星期四 18:05, zanghongy...@huawei.com 写道:
> From: Hongyong Zang 
>
> When a guest boots with ioeventfd, an error (by gdb) occurs:
>   Program received signal SIGSEGV, Segmentation fault.
>   0x006009cc in setup_ioeventfds (s=0x171dc40)
>   at /home/louzhengwei/git_source/qemu-kvm/hw/ivshmem.c:363
>   363 for (j = 0; j < s->peers[i].nb_eventfds; j++) {
> The bug is due to accessing s->peers which is NULL.
>
> This patch uses the memory region API to replace the old one 
> kvm_set_ioeventfd_mmio_long().
> And this patch makes memory_region_add_eventfd() called in ivshmem_read() 
> when qemu receives
> eventfd information from ivshmem_server.
>
> Signed-off-by: Hongyong Zang 
> ---
>  hw/ivshmem.c |   41 ++---
>  1 files changed, 14 insertions(+), 27 deletions(-)
>
> diff --git a/hw/ivshmem.c b/hw/ivshmem.c
> index 242fbea..be26f03 100644
> --- a/hw/ivshmem.c
> +++ b/hw/ivshmem.c
> @@ -58,7 +58,6 @@ typedef struct IVShmemState {
>  CharDriverState *server_chr;
>  MemoryRegion ivshmem_mmio;
>  
> -pcibus_t mmio_addr;
>  /* We might need to register the BAR before we actually have the memory.
>   * So prepare a container MemoryRegion for the BAR immediately and
>   * add a subregion when we have the memory.
> @@ -346,8 +345,14 @@ static void close_guest_eventfds(IVShmemState *s, int 
> posn)
>  guest_curr_max = s->peers[posn].nb_eventfds;
>  
>  for (i = 0; i < guest_curr_max; i++) {
> -kvm_set_ioeventfd_mmio_long(s->peers[posn].eventfds[i],
> -s->mmio_addr + DOORBELL, (posn << 16) | i, 0);
> +if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD)) {
> +memory_region_del_eventfd(&s->ivshmem_mmio,
> + DOORBELL,
> + 4,
> + true,
> + (posn << 16) | i,
> + s->peers[posn].eventfds[i]);
> +}
>  close(s->peers[posn].eventfds[i]);
>  }
>  
> @@ -355,22 +360,6 @@ static void close_guest_eventfds(IVShmemState *s, int 
> posn)
>  s->peers[posn].nb_eventfds = 0;
>  }
>  
> -static void setup_ioeventfds(IVShmemState *s) {
> -
> -int i, j;
> -
> -for (i = 0; i <= s->max_peer; i++) {
> -for (j = 0; j < s->peers[i].nb_eventfds; j++) {
> -memory_region_add_eventfd(&s->ivshmem_mmio,
> -  DOORBELL,
> -  4,
> -  true,
> -  (i << 16) | j,
> -  s->peers[i].eventfds[j]);
> -}
> -}
> -}
> -
>  /* this function increase the dynamic storage need to store data about other
>   * guests */
>  static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
> @@ -491,10 +480,12 @@ static void ivshmem_read(void *opaque, const uint8_t * 
> buf, int flags)
>  }
>  
>  if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD)) {
> -if (kvm_set_ioeventfd_mmio_long(incoming_fd, s->mmio_addr + DOORBELL,
> -(incoming_posn << 16) | guest_max_eventfd, 1) < 0) {
> -fprintf(stderr, "ivshmem: ioeventfd not available\n");
> -}
> +memory_region_add_eventfd(&s->ivshmem_mmio,
> +  DOORBELL,
> +  4,
> +  true,
> +  (incoming_posn << 16) | guest_max_eventfd,
> +  incoming_fd);
>  }
>  
>  return;
> @@ -659,10 +650,6 @@ static int pci_ivshmem_init(PCIDevice *dev)
>  memory_region_init_io(&s->ivshmem_mmio, &ivshmem_mmio_ops, s,
>"ivshmem-mmio", IVSHMEM_REG_BAR_SIZE);
>  
> -if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD)) {
> -setup_ioeventfds(s);
> -}
> -
>  /* region for registers*/
>  pci_register_bar(&s->dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
>   &s->ivshmem_mmio);


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-30 Thread Michael S. Tsirkin
On Wed, Nov 30, 2011 at 10:10:22AM +1030, Rusty Russell wrote:
> On Mon, 28 Nov 2011 11:15:31 +0200, Sasha Levin  
> wrote:
> > On Mon, 2011-11-28 at 11:25 +1030, Rusty Russell wrote:
> > > I'd like to see kvmtools remove support for legacy mode altogether,
> > > but they probably have existing users.
> > 
> > While we can't simply remove it right away, instead of mixing our
> > implementation for both legacy and new spec in the same code we can
> > split the virtio-pci implementation into two:
> > 
> > - virtio/virtio-pci-legacy.c
> > - virtio/virtio-pci.c
> > 
> > At that point we can #ifdef the entire virtio-pci-legacy.c for now and
> > remove it at the same time legacy virtio-pci is removed from the kernel.
> 
> Hmm, that might be neat, but we can't tell the driver core to try
> virtio-pci before virtio-pci-legacy, so we need detection code in both
> modules (and add a "force" flag to virtio-pci-legacy to tell it to
> accept the device even if it's not a legacy-only one).

This flag might need to be per device ideally, which is tricky ...

> 
> Then it should work...
> Cheers,
> Rusty.

One also wonders whether and how this will work on other OS-es.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html