RE: how to assign a pci device to guest [with qemu.git upstream]?

2011-09-28 Thread Ren, Yongjie
> -Original Message-
> From: Chris Wright [mailto:chr...@sous-sol.org]
> Sent: Thursday, September 29, 2011 12:33 PM
> To: Ren, Yongjie
> Cc: Chris Wright; Avi Kivity; KVM General
> Subject: Re: how to assign a pci device to guest [with qemu.git upstream]?
> 
> * Chris Wright (chr...@sous-sol.org) wrote:
> > * Ren, Yongjie (yongjie@intel.com) wrote:
> > > Chris,
> > > Thanks very much for your kind help.
> > > I can't find hw/device-assignment.c in the qemu.git tree.
> > > Avi,
> > > I cloned qemu from git://github.com/avikivity/qemu.git
> > > So device assignment is not available. But qemu-kvm.git had
> > > device-assignment code before kernel.org went down.
> > > Any update on this issue?
> >
> > Are you using the master branch?  I noticed the github web defaults to
> > the memory/queue branch.
> 
> BTW, if you hadn't used branches much before, something like this will
> get you what you want:
> 
> $ git checkout -b master origin/master
> 
> Now you'll be on the master branch (and it should track upstream master
> properly).
Oh, thanks a lot.  I didn't notice that qemu.git was using the 'memory/queue'
branch by default.
I've switched to the 'master' branch. Device assignment seems to work fine
now.
> 
> thanks,
> -chris


Re: how to assign a pci device to guest [with qemu.git upstream]?

2011-09-28 Thread Chris Wright
* Chris Wright (chr...@sous-sol.org) wrote:
> * Ren, Yongjie (yongjie@intel.com) wrote:
> > Chris,
> > Thanks very much for your kind help.
> > I can't find hw/device-assignment.c in the qemu.git tree.
> > Avi,
> > I cloned qemu from git://github.com/avikivity/qemu.git
> > So device assignment is not available. But qemu-kvm.git had
> > device-assignment code before kernel.org went down.
> > Any update on this issue?
> 
> Are you using the master branch?  I noticed the github web defaults to
> the memory/queue branch.

BTW, if you hadn't used branches much before, something like this will
get you what you want:

$ git checkout -b master origin/master

Now you'll be on the master branch (and it should track upstream master
properly).
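
A quick way to double-check (just a sketch, adjust as needed):

$ git rev-parse --abbrev-ref HEAD      # should print: master
$ git log --oneline -1 origin/master   # confirm you're building current upstream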

thanks,
-chris


Re: [PATCH] qemu-kvm: device assignment: add 82599 PCIe Cap struct quirk

2011-09-28 Thread Alex Williamson
On Wed, 2011-09-28 at 20:20 -0400, Donald Dutile wrote:
> commit f9c29774d2174df6ffc20becec20928948198914
> changed the PCIe Capability structure version check
> from if > 2 fail, to if ==1, size=x, if ==2, size=y,
> else fail.
> Turns out the 82599's VF has an erratum where its
> PCIe Cap struct version is 0, which now fails device assignment
> due to the else fallout, where before it would blissfully work.
> 
> Add a quirk: if version == 0 & intel-82599, set size to the version 2 struct.
> 
> Signed-off-by: Donald_Dutile 

Ugh.

Acked-by: Alex Williamson 



Re: how to assign a pci device to guest [with qemu.git upstream]?

2011-09-28 Thread Chris Wright
* Ren, Yongjie (yongjie@intel.com) wrote:
> Chris,
> Thanks very much for your kind help.
> I can't find hw/device-assignment.c in the qemu.git tree.
> Avi,
> I cloned qemu from git://github.com/avikivity/qemu.git
> So device assignment is not available. But qemu-kvm.git had device-assignment
> code before kernel.org went down.
> Any update on this issue?

Are you using the master branch?  I noticed the github web defaults to
the memory/queue branch.

thanks,
-chris


RE: how to assign a pci device to guest [with qemu.git upstream]?

2011-09-28 Thread Ren, Yongjie
Chris,
Thanks very much for your kind help.
I can't find hw/device-assignment.c in the qemu.git tree.
Avi,
I cloned qemu from git://github.com/avikivity/qemu.git
So device assignment is not available. But qemu-kvm.git had device-assignment
code before kernel.org went down.
Any update on this issue?

Best Regards,
 Yongjie Ren  (Jay)

> -Original Message-
> From: Chris Wright [mailto:chr...@sous-sol.org]
> Sent: Thursday, September 29, 2011 8:57 AM
> To: Ren, Yongjie
> Cc: KVM General
> Subject: Re: how to assign a pci device to guest [with qemu.git upstream]?
> 
> * Ren, Yongjie (yongjie@intel.com) wrote:
> > I'm using kvm and qemu upstream on https://github.com/avikivity
> > The following command line worked for me about three weeks ago,
> > but now I get an error.
> > # qemu-system-x86_64 -m 1024 -smp 2 -device pci-assign,host=0e:00.0
> > -hda /root/rhel6u1.img
> > The output error is as follows.
> > qemu-system-x86_64: -device pci-assign,host=0d:00.0: Parameter 'driver'
> > expects a driver name
> > Try with argument '?' for a list.
> 
> Looks like you don't have device assignment support compiled in.
> Start with the basics (assuming tree has hw/device-assignment.c):
> 
> did your ./configure output show:
> 
> KVM device assig. yes
> 
> and does your binary agree?
> 
> qemu-system-x86_64 -device ? 2>&1 | grep pci-assign
> 
> thanks,
> -chris


Re: how to assign a pci device to guest [with qemu.git upstream]?

2011-09-28 Thread Chris Wright
* Ren, Yongjie (yongjie@intel.com) wrote:
> I'm using kvm and qemu upstream on https://github.com/avikivity
> The following command line worked for me about three weeks ago, but now I
> get an error.
> # qemu-system-x86_64 -m 1024 -smp 2 -device pci-assign,host=0e:00.0 -hda 
> /root/rhel6u1.img
> The output error is as follows.
> qemu-system-x86_64: -device pci-assign,host=0d:00.0: Parameter 'driver' 
> expects a driver name
> Try with argument '?' for a list.

Looks like you don't have device assignment support compiled in.
Start with the basics (assuming tree has hw/device-assignment.c):

did your ./configure output show:

KVM device assig. yes

and does your binary agree?

qemu-system-x86_64 -device ? 2>&1 | grep pci-assign
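
If that grep comes back empty, a clean reconfigure/rebuild usually shows where
the support got lost (a rough sketch; the target list and job count are just
examples):

./configure --target-list=x86_64-softmmu 2>&1 | grep -i 'device assig'
make -j4
./x86_64-softmmu/qemu-system-x86_64 -device ? 2>&1 | grep pci-assign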

thanks,
-chris


Re: [libvirt] Qemu/KVM is 3x slower under libvirt

2011-09-28 Thread Chris Wright
* Reeted (ree...@shiftmail.org) wrote:
> On 09/28/11 11:28, Daniel P. Berrange wrote:
> >On Wed, Sep 28, 2011 at 11:19:43AM +0200, Reeted wrote:
> >>On 09/28/11 09:51, Daniel P. Berrange wrote:
> >>>You could have equivalently used
> >>>
> >>>  -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
> >>>  -device 
> >>> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
> >>It's this! It's this!! (thanks for the line)
> >>
> >>It raises boot time by 10-13 seconds
> >Ok, that is truly bizarre and I don't really have any explanation
> >for why that is. I guess you could try 'vhost=off' too and see if that
> >makes the difference.
> 
> YES!
> It's the vhost. With vhost=on it takes about 12 seconds more time to boot.

Can you help narrow down what is happening during the additional 12
seconds in the guest?  For example, does a quick simple boot to single
user mode happen at the same boot speed w/ and w/out vhost_net?

I'm guessing (hoping) that it's the network bring-up that is slow.
Are you using dhcp to get an IP address?  Does static IP have the same
slowdown?
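
One quick way to isolate the DHCP step itself (a sketch, assuming dhclient and
eth0 inside the guest) is to time it directly, with and without vhost=on:

time dhclient -1 eth0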

If it's just dhcp, can you recompile qemu with this patch and see if it
causes the same slowdown you saw w/ vhost?

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 0b03b57..0c864f7 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -496,7 +496,7 @@ static int receive_header(VirtIONet *n, struct iovec *iov, int iovcnt,
 if (n->has_vnet_hdr) {
 memcpy(hdr, buf, sizeof(*hdr));
 offset = sizeof(*hdr);
-work_around_broken_dhclient(hdr, buf + offset, size - offset);
+//work_around_broken_dhclient(hdr, buf + offset, size - offset);
 }
 
 /* We only ever receive a struct virtio_net_hdr from the tapfd,


Re: [PATCH] qemu-kvm: device assignment: add 82599 PCIe Cap struct quirk

2011-09-28 Thread Chris Wright
* Donald Dutile (ddut...@redhat.com) wrote:
> commit f9c29774d2174df6ffc20becec20928948198914
> changed the PCIe Capability structure version check
> from if > 2 fail, to if ==1, size=x, if ==2, size=y,
> else fail.
> Turns out the 82599's VF has an erratum where its
> PCIe Cap struct version is 0, which now fails device assignment
> due to the else fallout, where before it would blissfully work.
> 
> Add a quirk: if version == 0 & intel-82599, set size to the version 2 struct.
> 
> Signed-off-by: Donald_Dutile 

(not pretty, but neither is the hw errata...)

Acked-by: Chris Wright 


[PATCH] qemu-kvm: device assignment: add 82599 PCIe Cap struct quirk

2011-09-28 Thread Donald Dutile
commit f9c29774d2174df6ffc20becec20928948198914
changed the PCIe Capability structure version check
from if > 2 fail, to if ==1, size=x, if ==2, size=y,
else fail.
Turns out the 82599's VF has an erratum where its
PCIe Cap struct version is 0, which now fails device assignment
due to the else fallout, where before it would blissfully work.

Add a quirk: if version == 0 & intel-82599, set size to the version 2 struct.

Signed-off-by: Donald_Dutile 
---
 hw/device-assignment.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 288f80c..ed2a883 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1261,12 +1261,20 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
 
 if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) {
 uint8_t version, size;
-uint16_t type, devctl, lnkcap, lnksta;
+uint16_t type, devctl, lnkcap, lnksta, vendor, device;
 uint32_t devcap;
 
+vendor = pci_get_word(pci_dev->config + PCI_VENDOR_ID);
+device = pci_get_word(pci_dev->config + PCI_DEVICE_ID);
 version = pci_get_byte(pci_dev->config + pos + PCI_EXP_FLAGS);
 version &= PCI_EXP_FLAGS_VERS;
-if (version == 1) {
+if (version == 0 && vendor == 0x8086 && device == 0x10ed) {
+/* 
+ * quirk for Intel 82599 VF with invalid PCIe capability version,
+ * should really be version 2 (same as PF)
+ */
+size = 0x3c;
+} else if (version == 1) {
 size = 0x14;
 } else if (version == 2) {
 /*
-- 
1.7.1



Re: [kvm] Re: [kvm] Re: [kvm] Re: Questions about duplicate memory work

2011-09-28 Thread Robin Lee Powell
On Wed, Sep 28, 2011 at 05:11:06PM -0700, Robin Lee Powell wrote:
> On Tue, Sep 27, 2011 at 12:49:29PM +0300, Avi Kivity wrote:
> > On 09/27/2011 12:00 PM, Robin Lee Powell wrote:
> > >On Tue, Sep 27, 2011 at 01:48:43AM -0700, Robin Lee Powell wrote:
> > >>  On Tue, Sep 27, 2011 at 04:41:33PM +0800, Emmanuel Noobadmin wrote:
> > >>  >  On 9/27/11, Robin Lee Powell  wrote:
> > >>  >  >  On Mon, Sep 26, 2011 at 04:15:37PM +0800, Emmanuel Noobadmin
> > >>  >  >  wrote:
> > >>  >  >>  It's unrelated to what you're actually using as the disks,
> > >>  >  >>  whether file or block devices like LVs. I think it just makes
> > >>  >  >>  KVM tell the host not to cache I/O done on the storage device.
> > >>  >  >
> > >>  >  >  Wait, hold on, I think I had it backwards.
> > >>  >  >
> > >>  >  >  It tells the *host* to not cache the device in question, or the
> > >>  >  >  *VMs* to not cache the device in question?
> > >>  >
> > >>  >  I'm fairly certain it tells the qemu not to cache the device in
> > >>  >  question. If you don't want the guest to cache their i/o, then the
> > >>  >  guest OS should be configured if it allows that. Although I'm not
> > >>  >  sure if it's possible to disable disk buffering/caching system
> > >>  >  wide in Linux.
> > >>
> > >>  OK, great, thanks.
> > >>
> > >>  Now if I could just figure out how to stop the host from swapping
> > >>  out much of the VMs' qemu-kvm procs when it has almost a GiB of RAM
> > >>  left.  -_-  swappiness 0 doesn't seem to help there.
> > >
> > >Grrr.
> > >
> > >I turned swap off to clear it.  A few minutes ago, this host was at
> > >zero swap:
> > >
> > >top - 01:59:10 up 10 days, 15:17,  3 users,  load average: 6.39, 4.26, 3.24
> > >Tasks: 151 total,   1 running, 150 sleeping,   0 stopped,   0 zombie
> > >Cpu(s):  6.6%us,  1.0%sy,  0.0%ni, 85.9%id,  6.3%wa,  0.0%hi,  0.2%si,  
> > >0.0%st
> > >Mem:   8128772k total,  656k used,  1617656k free,14800k buffers
> > >Swap:  8388604k total,   672828k used,  7715776k free,97536k cached
> > >
> > >   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> > >  2504 qemu  20   0 2425m 1.8g  448 S 10.0 23.4   3547:59 qemu-kvm
> > >  2258 qemu  20   0 2425m 1.7g  444 S  2.7 21.7   1288:15 qemu-kvm
> > >18061 qemu  20   0 2433m 1.8g  428 S  2.3 23.4 401:01.99 qemu-kvm
> > >10335 qemu  20   0 1864m 861m  456 S  1.0 10.9   2:04.26 qemu-kvm
> > >[snip]
> > >
> > >Why is it doing this?!?  ;'(
> > >
> > 
> > Please post the contents of /proc/meminfo and /proc/zoneinfo when
> > this is happening.
> 
> I just noticed that the amount of RAM the VMs had in VIRT added up
> to considerably more than the host's actual RAM; hard_limit is now
> on.  So I may not be able to replicate this.  :)

Or not; even with hard_limit the VIRT value goes to hundreds of MiB
more than the limit.  Is that expected?
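
For what it's worth, one way to see how much of each qemu-kvm process is
actually sitting in swap (a rough sketch; it just sums the Swap: fields from
smaps):

for p in $(pgrep -f qemu-kvm); do
    awk -v p=$p 'BEGIN {s=0} /^Swap:/ {s+=$2} END {print p": "s" kB"}' /proc/$p/smaps
done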

-Robin


Re: [kvm] Re: [kvm] Re: Questions about duplicate memory work

2011-09-28 Thread Robin Lee Powell
On Tue, Sep 27, 2011 at 12:49:29PM +0300, Avi Kivity wrote:
> On 09/27/2011 12:00 PM, Robin Lee Powell wrote:
> >On Tue, Sep 27, 2011 at 01:48:43AM -0700, Robin Lee Powell wrote:
> >>  On Tue, Sep 27, 2011 at 04:41:33PM +0800, Emmanuel Noobadmin wrote:
> >>  >  On 9/27/11, Robin Lee Powell  wrote:
> >>  >  >  On Mon, Sep 26, 2011 at 04:15:37PM +0800, Emmanuel Noobadmin
> >>  >  >  wrote:
> >>  >  >>  It's unrelated to what you're actually using as the disks,
> >>  >  >>  whether file or block devices like LVs. I think it just makes
> >>  >  >>  KVM tell the host not to cache I/O done on the storage device.
> >>  >  >
> >>  >  >  Wait, hold on, I think I had it backwards.
> >>  >  >
> >>  >  >  It tells the *host* to not cache the device in question, or the
> >>  >  >  *VMs* to not cache the device in question?
> >>  >
> >>  >  I'm fairly certain it tells the qemu not to cache the device in
> >>  >  question. If you don't want the guest to cache their i/o, then the
> >>  >  guest OS should be configured if it allows that. Although I'm not
> >>  >  sure if it's possible to disable disk buffering/caching system
> >>  >  wide in Linux.
> >>
> >>  OK, great, thanks.
> >>
> >>  Now if I could just figure out how to stop the host from swapping
> >>  out much of the VMs' qemu-kvm procs when it has almost a GiB of RAM
> >>  left.  -_-  swappiness 0 doesn't seem to help there.
> >
> >Grrr.
> >
> >I turned swap off to clear it.  A few minutes ago, this host was at
> >zero swap:
> >
> >top - 01:59:10 up 10 days, 15:17,  3 users,  load average: 6.39, 4.26, 3.24
> >Tasks: 151 total,   1 running, 150 sleeping,   0 stopped,   0 zombie
> >Cpu(s):  6.6%us,  1.0%sy,  0.0%ni, 85.9%id,  6.3%wa,  0.0%hi,  0.2%si,  
> >0.0%st
> >Mem:   8128772k total,  656k used,  1617656k free,14800k buffers
> >Swap:  8388604k total,   672828k used,  7715776k free,97536k cached
> >
> >   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> >  2504 qemu  20   0 2425m 1.8g  448 S 10.0 23.4   3547:59 qemu-kvm
> >  2258 qemu  20   0 2425m 1.7g  444 S  2.7 21.7   1288:15 qemu-kvm
> >18061 qemu  20   0 2433m 1.8g  428 S  2.3 23.4 401:01.99 qemu-kvm
> >10335 qemu  20   0 1864m 861m  456 S  1.0 10.9   2:04.26 qemu-kvm
> >[snip]
> >
> >Why is it doing this?!?  ;'(
> >
> 
> Please post the contents of /proc/meminfo and /proc/zoneinfo when
> this is happening.

I just noticed that the amount of RAM the VMs had in VIRT added up
to considerably more than the host's actual RAM; hard_limit is now
on.  So I may not be able to replicate this.  :)

-Robin


Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Reeted

On 09/28/11 16:51, Reeted wrote:

On 09/28/11 14:56, Richard W.M. Jones wrote:

On Wed, Sep 28, 2011 at 12:19:09PM +0200, Reeted wrote:

Ok that seems to work: it removes the vhost part in the virsh launch
hence cutting down 12secs of boot time.

If nobody comes out with an explanation of why, I will open another
thread on the kvm list for this. I would probably need to test disk
performance on vhost=on to see if it degrades or it's for another
reason that boot time is increased.

Is it using CPU during this time, or is the qemu-kvm process idle?

It wouldn't be the first time that a network option ROM sat around
waiting for an imaginary console user to press a key.

Rich.


Of the two qemu-kvm processes (threads?) which I see consuming CPU for 
that VM, one is at about 20%, the other at about 10%. I think it's 
doing something but maybe not much, or maybe it's really I/O bound and 
the I/O is slow (as I originarily thought). I will perform some disk 
benchmarks and follow up, but I can't do that right now...

Thank you


OK, I still haven't run the benchmarks, but I am now fairly convinced that
it's either a disk performance problem or a CPU problem with vhostnet on,
not a network performance problem or idle wait.


I have now installed another virtual machine: a Fedora Core 6 guest (old!),
but with a Debian natty kernel vmlinuz + initrd so that it supports virtio
devices. The initrd part from Ubuntu is extremely short and finishes
immediately, but the Fedora Core 6 boot is much longer than with my previous
Ubuntu barebone virtual machine and prints more messages, so I can see the
various daemons being brought up one by one. I can tell you that this boot
(and also the teardown of services during shutdown) is very much faster with
vhostnet disabled.


With vhostnet disabled it takes 30 seconds to come up (measured from after
grub) and 28 seconds to shut down.
With vhostnet enabled it takes 1m19s to come up (from after grub) and 1m04s
to shut down.



I have some ideas for disk benchmarking: fio or simple dd. What could I use
for CPU benchmarking? Would "openssl speed" be too simple?
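
Something quick and dirty along these lines, run once with vhost=on and once
with vhost=off, should already show whether disk or CPU is the bottleneck
(just a sketch; the paths, sizes and fio options are examples):

dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct
fio --name=randread --filename=/tmp/testfile --rw=randread --bs=4k --size=1G --direct=1 --runtime=30 --time_based
openssl speed aes-256-cbc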


Thank you


Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Benjamin Herrenschmidt
On Wed, 2011-09-28 at 16:20 -0500, Scott Wood wrote:

> Sure, if there might be stale stuff in the icache, the guest will need
> to invalidate that.  But when running on real hardware, an OS does not
> need to flush it out of data cache after a DMA transaction[1].  So
> technically we just want a flush_dcache_range() for DMA.
>
> It's moot unless we can distinguish DMA writes from breakpoint writes,
> though.
> 
> -Scott
> 
> [1] Most OSes may do this anyway, to avoid needing to special case when
> the dirtying is done entirely by DMA (or to avoid making assumptions
> that could be broken by weird hardware), but that doesn't mean QEMU/KVM
> should assume that -- maybe unless there's enough performance to be
> gained by looking like the aforementioned "weird hardware" in certain
> configurations.

I see what you mean. A DMA would have had an implicit cache flush while
qemu memcpy'ing to the guest won't. Hrm.

I'm not sure any guest relies on that since architecturally, the HW is
permitted to do cache intervention tricks, and rather than flush,
transfer the data directly to the cache that originally contained the
lines (cache injection). We even support that on some embedded stuff.

In any case, we should then make that depend on a flag, because the flush is
certainly unnecessary on P5, P6 and P7, which have a snooping icache, and it
can be costly.

Cheers,
Ben.




Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Scott Wood
On 09/28/2011 04:02 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2011-09-28 at 12:27 -0500, Scott Wood wrote:
> 
>> Why would it need to be synchronous?  Even if it's asynchronous emulated
>> DMA, we don't want it sitting around only in a data cache that
>> instruction fetches won't snoop.
> 
> Except that this is exactly what happens on real HW :-)

DMA does not normally go straight to data cache, at least on hardware
I'm familiar with.

> The guest will do the necessary invalidations. DMA doesn't keep the
> icache coherent on HW, why should it on kvm/qemu ?

Sure, if there might be stale stuff in the icache, the guest will need
to invalidate that.  But when running on real hardware, an OS does not
need to flush it out of data cache after a DMA transaction[1].  So
technically we just want a flush_dcache_range() for DMA.

It's moot unless we can distinguish DMA writes from breakpoint writes,
though.

-Scott

[1] Most OSes may do this anyway, to avoid needing to special case when
the dirtying is done entirely by DMA (or to avoid making assumptions
that could be broken by weird hardware), but that doesn't mean QEMU/KVM
should assume that -- maybe unless there's enough performance to be
gained by looking like the aforementioned "weird hardware" in certain
configurations.



Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Benjamin Herrenschmidt
On Wed, 2011-09-28 at 12:27 -0500, Scott Wood wrote:

> Why would it need to be synchronous?  Even if it's asynchronous emulated
> DMA, we don't want it sitting around only in a data cache that
> instruction fetches won't snoop.

Except that this is exactly what happens on real HW :-)

The guest will do the necessary invalidations. DMA doesn't keep the
icache coherent on HW, why should it on kvm/qemu ?

> It's not implemented yet in mainline for powerpc (we have something
> internal that is on the backlog of things to be cleaned up and sent
> out), but this is what we'd do for kvm_arch_insert_sw_breakpoint().

Yes, breakpoints do need a flush, as does the initial program load.

Cheers,
Ben.



Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Benjamin Herrenschmidt
On Wed, 2011-09-28 at 16:26 +0200, Alexander Graf wrote:
> 
> This makes sure that when device emulation overwrites code that is
> already present in the cache of a CPU, it gets flushed from the
> icache. I'm fairly sure we want that :). But let's ask Ben and David
> as well.

Hrm we don't need that. DMA doesn't flush the icache on power. The
kernel will take care of it if necessary.

The only case you do need it is when doing the initial load of the
kernel or SLOF image before you execute it.

Cheers,
Ben.



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jeremy Fitzhardinge
On 09/28/2011 11:49 AM, Linus Torvalds wrote:
> But I don't care all *that* deeply. I do agree that the xaddw trick is
> pretty tricky. I just happen to think that it's actually *less* tricky
> than "read the upper bits separately and depend on subtle ordering
> issues with another writer that happens at the same time on another
> CPU".
>
> So I can live with either form - as long as it works. I think it might
> be easier to argue that the xaddw is guaranteed to work, because all
> values at all points are unarguably atomic (yeah, we read the lower
> bits nonatomically, but as the owner of the lock we know that nobody
> else can write them).

Exactly.  I just did a locked add variant, and while the code looks a
little simpler, it definitely has more actual complexity to analyze.

J


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Linus Torvalds
On Wed, Sep 28, 2011 at 11:08 AM, Stephan Diestelhorst
 wrote:
>
> I must have missed the part when this turned into the propose-the-
> craziest-way-that-this-still-works contest :)

So doing it just with the "lock addb" probably works fine, but I have
to say that I personally shudder at the "surround the locked addb by
reads from the word, in order to approximate an atomic read of the
upper bits".

Because what you get is not really an "atomic read of the upper bits",
it's a "ok, we'll get the worst case of somebody modifying the upper
bits at the same time".

Which certainly should *work*, but from a conceptual standpoint, isn't
it just *much* nicer to say "we actually know *exactly* what the upper
bits were".

But I don't care all *that* deeply. I do agree that the xaddw trick is
pretty tricky. I just happen to think that it's actually *less* tricky
than "read the upper bits separately and depend on subtle ordering
issues with another writer that happens at the same time on another
CPU".

So I can live with either form - as long as it works. I think it might
be easier to argue that the xaddw is guaranteed to work, because all
values at all points are unarguably atomic (yeah, we read the lower
bits nonatomically, but as the owner of the lock we know that nobody
else can write them).

 Linus


Re: [PATCH] virtio-net: Read MAC only after initializing MSI-X

2011-09-28 Thread Sasha Levin
On Mon, 2011-09-19 at 17:19 +0930, Rusty Russell wrote:
> On Mon, 19 Sep 2011 09:01:50 +0300, "Michael S. Tsirkin"  
> wrote:
> > On Mon, Sep 19, 2011 at 01:05:17PM +0930, Rusty Russell wrote:
> > > On Sat, 20 Aug 2011 23:00:44 +0300, "Michael S. Tsirkin" 
> > >  wrote:
> > > > On Fri, Aug 19, 2011 at 07:33:07PM +0300, Sasha Levin wrote:
> > > > > Maybe this is better solved by copying the way it was done in PCI 
> > > > > itself
> > > > > with capability linked list?
> > > > 
> > > > There are any number of ways to lay out the structure.  I went for what
> > > > seemed the simplest one.  For MSI-X the train has left the station.  We
> > > > can probably still tweak where the high 32 bit features
> > > > for 64 bit features are.  No idea if it's worth it.
> > > 
> > > Sorry, this has been in the back of my mind.  I think it's a good idea;
> > > can we use the capability linked list for pre-device specific stuff from
> > > now on?
> > > 
> > > Thanks,
> > > Rusty.
> > 
> > Do we even want capability bits then?
> > We can give each capability an ack flag ...
> 
> We could have, and if I'd known PCI when I designed virtio I might have.
> 
> But it's not easy now to map structure offsets to that scheme, and we
> can't really force such a change on the non-PCI users.  So I'd say we
> should only do it for the non-device-specific options.  ie. we'll still
> have the MSI-X case move the device-specific config, but we'll use a
> linked list from now on, eg. for the next 32 features bits...
> 
> Thoughts?
> Rusty.

What if we create a capability list but place it in the virtio-pci
config space instead of the PCI space?

It'll work fine with non-PCI users and would leave MSI-X as the only
thing that changes offsets (and we could probably deprecate and remove
it at some point in the future).

-- 

Sasha.



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jeremy Fitzhardinge
On 09/28/2011 11:08 AM, Stephan Diestelhorst wrote:
> On Wednesday 28 September 2011 19:50:08 Jeremy Fitzhardinge wrote:
>> On 09/28/2011 10:24 AM, H. Peter Anvin wrote:
>>> On 09/28/2011 10:22 AM, Linus Torvalds wrote:
 On Wed, Sep 28, 2011 at 9:47 AM, Jeremy Fitzhardinge  
 wrote:
> Could do something like:
>
>if (ticket->head >= 254)
>prev = xadd(&ticket->head_tail, 0xff02);
>else
>prev = xadd(&ticket->head_tail, 0x0002);
>
> to compensate for the overflow.
 Oh wow. You havge an even more twisted mind than I do.

 I guess that will work, exactly because we control "head" and thus can
 know about the overflow in the low byte. But boy is that ugly ;)

 But at least you wouldn't need to do the loop with cmpxchg. So it's
 twisted and ugly, but migth be practical.

>>> I suspect it should be coded as -254 in order to use a short immediate
>>> if that is even possible...
>> I'm about to test:
>>
>> static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>> {
>>  if (TICKET_SLOWPATH_FLAG && 
>> unlikely(arch_static_branch(&paravirt_ticketlocks_enabled))) {
>>  arch_spinlock_t prev;
>>  __ticketpair_t inc = TICKET_LOCK_INC;
>>
>>  if (lock->tickets.head >= (1 << TICKET_SHIFT) - TICKET_LOCK_INC)
>>  inc += -1 << TICKET_SHIFT;
>>
>>  prev.head_tail = xadd(&lock->head_tail, inc);
>>
>>  if (prev.tickets.tail & TICKET_SLOWPATH_FLAG)
>>  __ticket_unlock_slowpath(lock, prev);
>>  } else
>>  __ticket_unlock_release(lock);
>> }
>>
>> Which, frankly, is not something I particularly want to put my name to.
> I must have missed the part when this turned into the propose-the-
> craziest-way-that-this-still-works contest :)
>
> What is wrong with converting the original addb into a lock addb? The
> crazy wrap around tricks add a conditional and lots of headache. The
> lock addb/w is clean. We are paying an atomic in both cases, so I just
> don't see the benefit of the second solution.

Well, it does end up generating surprisingly nice code.  And to be
honest, being able to do the unlock and atomically fetch the flag as one
operation makes it much easier to reason about.

I'll do a locked add variant as well to see how it turns out.

Do you think locked add is better than unlocked + mfence?

Thanks,
J


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Stephan Diestelhorst
On Wednesday 28 September 2011 18:44:25 Jeremy Fitzhardinge wrote:
> On 09/28/2011 06:58 AM, Stephan Diestelhorst wrote:
> >> I guess it comes down to throwing myself on the efficiency of some kind
> >> of fence instruction.  I guess an lfence would be sufficient; is that
> >> any more efficient than a full mfence?
> > An lfence should not be sufficient, since that essentially is a NOP on
> > WB memory. You really want a full fence here, since the store needs to
> > be published before reading the lock with the next load.
> 
> The Intel manual reads:
> 
> Reads cannot pass earlier LFENCE and MFENCE instructions.
> Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
> LFENCE instructions cannot pass earlier reads.
> 
> Which I interpreted as meaning that an lfence would prevent forwarding. 
> But I guess it doesn't say "lfence instructions cannot pass earlier
> writes", which means that the lfence could logically happen before the
> write, thereby allowing forwarding?  Or should I be reading this some
> other way?

Indeed. You are reading this the right way. 

> >> Could you give me a pointer to AMD's description of the ordering rules?
> > They should be in "AMD64 Architecture Programmer's Manual Volume 2:
> > System Programming", Section 7.2 Multiprocessor Memory Access Ordering.
> >
> > http://developer.amd.com/documentation/guides/pages/default.aspx#manuals
> >
> > Let me know if you have some clarifying suggestions. We are currently
> > revising these documents...
> 
> I find the English descriptions of these kinds of things frustrating to
> read because of ambiguities in the precise meaning of words like "pass",
> "ahead", "behind" in these contexts.  I find the prose useful to get an
> overview, but when I have a specific question I wonder if something more
> formal would be useful.

It would be, and some have started this effort:

http://www.cl.cam.ac.uk/~pes20/weakmemory/

But I am not sure whether that particular nasty forwarding case is
captured properly in their model. It is on my list of things to check.

> I guess it's implied that anything that is not prohibited by the
> ordering rules is allowed, but it wouldn't hurt to say it explicitly.
> That said, the AMD description seems clearer and more explicit than the
> Intel manual (esp since it specifically discusses the problem here).

Thanks! Glad you like it :)

Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelho...@amd.com, Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany

Geschaeftsfuehrer: Alberto Bozzo;
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551 



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Stephan Diestelhorst
On Wednesday 28 September 2011 19:50:08 Jeremy Fitzhardinge wrote:
> On 09/28/2011 10:24 AM, H. Peter Anvin wrote:
> > On 09/28/2011 10:22 AM, Linus Torvalds wrote:
> >> On Wed, Sep 28, 2011 at 9:47 AM, Jeremy Fitzhardinge  
> >> wrote:
> >>> Could do something like:
> >>>
> >>>if (ticket->head >= 254)
> >>>prev = xadd(&ticket->head_tail, 0xff02);
> >>>else
> >>>prev = xadd(&ticket->head_tail, 0x0002);
> >>>
> >>> to compensate for the overflow.
>> Oh wow. You have an even more twisted mind than I do.
> >>
> >> I guess that will work, exactly because we control "head" and thus can
> >> know about the overflow in the low byte. But boy is that ugly ;)
> >>
> >> But at least you wouldn't need to do the loop with cmpxchg. So it's
>> twisted and ugly, but might be practical.
> >>
> > I suspect it should be coded as -254 in order to use a short immediate
> > if that is even possible...
> 
> I'm about to test:
> 
> static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
>   if (TICKET_SLOWPATH_FLAG && 
> unlikely(arch_static_branch(&paravirt_ticketlocks_enabled))) {
>   arch_spinlock_t prev;
>   __ticketpair_t inc = TICKET_LOCK_INC;
> 
>   if (lock->tickets.head >= (1 << TICKET_SHIFT) - TICKET_LOCK_INC)
>   inc += -1 << TICKET_SHIFT;
> 
>   prev.head_tail = xadd(&lock->head_tail, inc);
> 
>   if (prev.tickets.tail & TICKET_SLOWPATH_FLAG)
>   __ticket_unlock_slowpath(lock, prev);
>   } else
>   __ticket_unlock_release(lock);
> }
> 
> Which, frankly, is not something I particularly want to put my name to.

I must have missed the part when this turned into the propose-the-
craziest-way-that-this-still-works contest :)

What is wrong with converting the original addb into a lock addb? The
crazy wrap around tricks add a conditional and lots of headache. The
lock addb/w is clean. We are paying an atomic in both cases, so I just
don't see the benefit of the second solution.

Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelho...@amd.com, Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany

Geschaeftsfuehrer: Alberto Bozzo;
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551 



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jeremy Fitzhardinge
On 09/28/2011 10:24 AM, H. Peter Anvin wrote:
> On 09/28/2011 10:22 AM, Linus Torvalds wrote:
>> On Wed, Sep 28, 2011 at 9:47 AM, Jeremy Fitzhardinge  wrote:
>>> Could do something like:
>>>
>>>if (ticket->head >= 254)
>>>prev = xadd(&ticket->head_tail, 0xff02);
>>>else
>>>prev = xadd(&ticket->head_tail, 0x0002);
>>>
>>> to compensate for the overflow.
>> Oh wow. You have an even more twisted mind than I do.
>>
>> I guess that will work, exactly because we control "head" and thus can
>> know about the overflow in the low byte. But boy is that ugly ;)
>>
>> But at least you wouldn't need to do the loop with cmpxchg. So it's
>> twisted and ugly, but might be practical.
>>
> I suspect it should be coded as -254 in order to use a short immediate
> if that is even possible...

I'm about to test:

static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
if (TICKET_SLOWPATH_FLAG && 
unlikely(arch_static_branch(&paravirt_ticketlocks_enabled))) {
arch_spinlock_t prev;
__ticketpair_t inc = TICKET_LOCK_INC;

if (lock->tickets.head >= (1 << TICKET_SHIFT) - TICKET_LOCK_INC)
inc += -1 << TICKET_SHIFT;

prev.head_tail = xadd(&lock->head_tail, inc);

if (prev.tickets.tail & TICKET_SLOWPATH_FLAG)
__ticket_unlock_slowpath(lock, prev);
} else
__ticket_unlock_release(lock);
}

Which, frankly, is not something I particularly want to put my name to.

It makes gcc go into paroxysms of trickiness:

 4a8:   80 3f fecmpb   $0xfe,(%rdi)
 4ab:   19 f6   sbb%esi,%esi
 4ad:   66 81 e6 00 01  and$0x100,%si
 4b2:   66 81 ee fe 00  sub$0xfe,%si
 4b7:   f0 66 0f c1 37  lock xadd %si,(%rdi)

...which is pretty neat, actually.

J


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread H. Peter Anvin
On 09/28/2011 10:22 AM, Linus Torvalds wrote:
> On Wed, Sep 28, 2011 at 9:47 AM, Jeremy Fitzhardinge  wrote:
>>
>> Could do something like:
>>
>>if (ticket->head >= 254)
>>prev = xadd(&ticket->head_tail, 0xff02);
>>else
>>prev = xadd(&ticket->head_tail, 0x0002);
>>
>> to compensate for the overflow.
> 
> Oh wow. You have an even more twisted mind than I do.
> 
> I guess that will work, exactly because we control "head" and thus can
> know about the overflow in the low byte. But boy is that ugly ;)
> 
> But at least you wouldn't need to do the loop with cmpxchg. So it's
> twisted and ugly, but might be practical.
> 

I suspect it should be coded as -254 in order to use a short immediate
if that is even possible...

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Scott Wood
On 09/28/2011 09:45 AM, Jan Kiszka wrote:
> On 2011-09-28 16:26, Alexander Graf wrote:
>>
>> On 28.09.2011, at 16:23, Jan Kiszka wrote:
>>
>>> Alex,
>>>
>>> we have this diff in qemu-kvm:
>>>
>>> diff --git a/exec.c b/exec.c
>>> index c1e045d..f188549 100644
>>> --- a/exec.c
>>> +++ b/exec.c
>>> @@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
>>>  cpu_physical_memory_set_dirty_flags(
>>>  addr1, (0xff&  ~CODE_DIRTY_FLAG));
>>>  }
>>> +/* qemu doesn't execute guest code directly, but kvm does
>>> +   therefore flush instruction caches */
>>> +if (kvm_enabled())
>>> +flush_icache_range((unsigned long)ptr,
>>> +   ((unsigned long)ptr)+l);
>>>  qemu_put_ram_ptr(ptr);
>>>  }
>>>  } else {

Been meaning to send a poke about this one:

http://patchwork.ozlabs.org/patch/90403/

I've seen issues with this when the executable images are initially
loaded by qemu.

>>> flush_icache_range() is doing something only on PPC hosts. So do we need
>>> this upstream?
>>
>> This makes sure that when device emulation overwrites code that is
>> already present in the cache of a CPU, it gets flushed from the
>> icache. I'm fairly sure we want that :). But let's ask Ben and David
>> as well.
> 
> /me wondered which write scenario precisely needs this. It could only be
> something synchronous wrt some VCPU.

Why would it need to be synchronous?  Even if it's asynchronous emulated
DMA, we don't want it sitting around only in a data cache that
instruction fetches won't snoop.

> Which operations could trigger
> such a write? Does PPC inject software breakpoints in form of trap
> operations or so?

It's not implemented yet in mainline for powerpc (we have something
internal that is on the backlog of things to be cleaned up and sent
out), but this is what we'd do for kvm_arch_insert_sw_breakpoint().

-Scott



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Linus Torvalds
On Wed, Sep 28, 2011 at 9:47 AM, Jeremy Fitzhardinge  wrote:
>
> Could do something like:
>
>        if (ticket->head >= 254)
>                prev = xadd(&ticket->head_tail, 0xff02);
>        else
>                prev = xadd(&ticket->head_tail, 0x0002);
>
> to compensate for the overflow.

Oh wow. You have an even more twisted mind than I do.

I guess that will work, exactly because we control "head" and thus can
know about the overflow in the low byte. But boy is that ugly ;)

But at least you wouldn't need to do the loop with cmpxchg. So it's
twisted and ugly, but might be practical.

   Linus


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jeremy Fitzhardinge
On 09/28/2011 09:10 AM, Linus Torvalds wrote:
> On Wed, Sep 28, 2011 at 8:55 AM, Jan Beulich  wrote:
>>> just use "lock xaddw" there too.
>> I'm afraid that's not possible, as that might carry from the low 8 bits
>> into the upper 8 ones, which must be avoided.
> Oh damn, you're right. So I guess the "right" way to do things is with
> cmpxchg, but some nasty mfence setup could do it too.

Could do something like:

if (ticket->head >= 254)
prev = xadd(&ticket->head_tail, 0xff02);
else
prev = xadd(&ticket->head_tail, 0x0002);

to compensate for the overflow.
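
For illustration, with head in the low byte and tail in the high byte, take
head_tail = 0x04fe (tail = 0x04, head = 254):

    0x04fe + 0x0002 = 0x0500            (carry out of head corrupts tail)
    0x04fe + 0xff02 = 0x10400 -> 0x0400 (wraps in 16 bits: tail intact, head = 0)

so the extra 0xff00 soaks up the carry out of the low byte.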

J


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jeremy Fitzhardinge
On 09/28/2011 06:58 AM, Stephan Diestelhorst wrote:
> I have tested this and have not seen it fail on publicly released AMD
> systems. But as I have tried to point out, this does not mean it is
> safe to do in software, because future microarchitectures may have more
> capable forwarding engines.

Sure.

>> Have you tested this, or is this just from code analysis (which I
>> agree with after reviewing the ordering rules in the Intel manual).
> We have found a similar issue in Novell's PV ticket lock implementation
> during internal product testing.

Jan may have picked it up from an earlier set of my patches.

>>> Since you want to get that addb out to global memory before the second
>>> read, either use a LOCK prefix for it, add an MFENCE between addb and
>>> movzwl, or use a LOCKed instruction that will have a fencing effect
>>> (e.g., to top-of-stack)between addb and movzwl.
>> Hm.  I don't really want to do any of those because it will probably
>> have a significant effect on the unlock performance; I was really trying
>> to avoid adding any more locked instructions.  A previous version of the
>> code had an mfence in here, but I hit on the idea of using aliasing to
>> get the ordering I want - but overlooked the possible effect of store
>> forwarding.
> Well, I'd be curious about the actual performance impact. If the store
> needs to commit to memory due to aliasing anyways, this would slow down
> execution, too. After all it is better to write working than fast code,
> no? ;-)

Rule of thumb is that AMD tends to do things like lock and fence more
efficiently than Intel - at least historically.  I don't know if that's
still true for current Intel microarchitectures.

>> I guess it comes down to throwing myself on the efficiency of some kind
>> of fence instruction.  I guess an lfence would be sufficient; is that
>> any more efficient than a full mfence?
> An lfence should not be sufficient, since that essentially is a NOP on
> WB memory. You really want a full fence here, since the store needs to
> be published before reading the lock with the next load.

The Intel manual reads:

Reads cannot pass earlier LFENCE and MFENCE instructions.
Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
LFENCE instructions cannot pass earlier reads.

Which I interpreted as meaning that an lfence would prevent forwarding. 
But I guess it doesn't say "lfence instructions cannot pass earlier
writes", which means that the lfence could logically happen before the
write, thereby allowing forwarding?  Or should I be reading this some
other way?

>> Could you give me a pointer to AMD's description of the ordering rules?
> They should be in "AMD64 Architecture Programmer's Manual Volume 2:
> System Programming", Section 7.2 Multiprocessor Memory Access Ordering.
>
> http://developer.amd.com/documentation/guides/pages/default.aspx#manuals
>
> Let me know if you have some clarifying suggestions. We are currently
> revising these documents...

I find the English descriptions of these kinds of things frustrating to
read because of ambiguities in the precise meaning of words like "pass",
"ahead", "behind" in these contexts.  I find the prose useful to get an
overview, but when I have a specific question I wonder if something more
formal would be useful.
I guess it's implied that anything that is not prohibited by the
ordering rules is allowed, but it wouldn't hurt to say it explicitly.
That said, the AMD description seems clearer and more explicit than the
Intel manual (esp since it specifically discusses the problem here).

Thanks,
J


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Linus Torvalds
On Wed, Sep 28, 2011 at 8:55 AM, Jan Beulich  wrote:
>
>> just use "lock xaddw" there too.
>
> I'm afraid that's not possible, as that might carry from the low 8 bits
> into the upper 8 ones, which must be avoided.

Oh damn, you're right. So I guess the "right" way to do things is with
cmpxchg, but some nasty mfence setup could do it too.

  Linus


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jan Beulich
>>> On 28.09.11 at 17:38, Linus Torvalds  wrote:
> On Tue, Sep 27, 2011 at 9:44 AM, Jeremy Fitzhardinge  wrote:
>>
>> I guess it comes down to throwing myself on the efficiency of some kind
>> of fence instruction.  I guess an lfence would be sufficient; is that
>> any more efficient than a full mfence?  At least I can make it so that
>> its only present when pv ticket locks are actually in use, so it won't
>> affect the native case.
> 
> Please don't play with fences, just do the final "addb" as a locked 
> instruction.
> 
> In fact, don't even use an addb, this whole thing is disgusting:
> 
>   movzwl (%rdi),%esi (esi:=0x0400)
>   addb   $0x2,(%rdi) (LOCAL copy of lock is now: 0x0402)
>   movzwl (%rdi),%eax (local forwarding from previous store: eax := 0x0402)
> 
> just use "lock xaddw" there too.

I'm afraid that's not possible, as that might carry from the low 8 bits
into the upper 8 ones, which must be avoided.

Jan



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Linus Torvalds
On Tue, Sep 27, 2011 at 9:44 AM, Jeremy Fitzhardinge  wrote:
>
> I guess it comes down to throwing myself on the efficiency of some kind
> of fence instruction.  I guess an lfence would be sufficient; is that
> any more efficient than a full mfence?  At least I can make it so that
> its only present when pv ticket locks are actually in use, so it won't
> affect the native case.

Please don't play with fences, just do the final "addb" as a locked instruction.

In fact, don't even use an addb, this whole thing is disgusting:

  movzwl (%rdi),%esi (esi:=0x0400)
  addb   $0x2,(%rdi) (LOCAL copy of lock is now: 0x0402)
  movzwl (%rdi),%eax (local forwarding from previous store: eax := 0x0402)

just use "lock xaddw" there too.

The fact that the PV unlock is going to be much more expensive than a
regular native unlock is just a fact of life. It comes from
fundamentally caring about the old/new value, and has nothing to do
with aliasing. You care about the other bits, and it doesn't matter
where in memory they are.

The native unlock can do a simple "addb" (or incb), but that doesn't
mean the PV unlock can. There are no ordering issues with the final
unlock in the native case, because the native unlock is like the honey
badger: it don't care. It only cares that the store make it out *some*
day, but it doesn't care about what order the upper/lower bits get
updated. You do. So you have to use a locked access.

Good catch by Stephan.

 Linus


qemu-kvm: Remaining relevant bits in kvm/

2011-09-28 Thread Jan Kiszka

Hi all,

In order to reduce the diff between qemu-kvm.git and upstream, getting
rid of the kvm subdirectory would be nice. Which bits in there are actually
still in use today?


Off the top of my head, I only remember running kvm_stat and
scripts/vmxcap from time to time. Obviously, nothing in there is actively
built anymore during a normal qemu build. So what is the role of, e.g.,


 - kvm/bios
 - kvm/extboot
 - kvm/vgabios
 - kvm/scripts

today? Anything there we want to push upstream? And then drop the rest?
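
A quick way to see what still references those bits (just a sketch):

git grep -n -e 'kvm/bios' -e 'kvm/extboot' -e 'kvm/vgabios' -e 'kvm/scripts'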

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Alexander Graf

On 28.09.2011 at 16:49, Jan Kiszka wrote:

> On 2011-09-28 16:45, Jan Kiszka wrote:
>> On 2011-09-28 16:26, Alexander Graf wrote:
>>> 
>>> On 28.09.2011, at 16:23, Jan Kiszka wrote:
>>> 
 Alex,
 
 we have this diff in qemu-kvm:
 
 diff --git a/exec.c b/exec.c
 index c1e045d..f188549 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
 cpu_physical_memory_set_dirty_flags(
 addr1, (0xff& ~CODE_DIRTY_FLAG));
 }
 + /* qemu doesn't execute guest code directly, but kvm does
 + therefore flush instruction caches */
 + if (kvm_enabled())
 + flush_icache_range((unsigned long)ptr,
 + ((unsigned long)ptr)+l);
 qemu_put_ram_ptr(ptr);
 }
 } else {
 
 
 flush_icache_range() is doing something only on PPC hosts. So do we need
 this upstream?
>>> 
>>> This makes sure that when device emulation overwrites code that is
>>> already present in the cache of a CPU, it gets flushed from the
>>> icache. I'm fairly sure we want that :). But let's ask Ben and David
>>> as well.
>> 
>> /me wondered which write scenario precisely needs this. It could only be
>> something synchronous wrt some VCPU. Which operations could trigger
>> such a write? Does PPC inject software breakpoints in form of trap
>> operations or so?
>> 
>> Mmm, according to our ancient recordings, the hunk above was once
>> introduced for the sake of IA64: 9dc99a2823. I skipped it in my removal
>> patch as it has some non-IA64 effect, at least potentially.
> 
> Correction: It was introduced by 6d3295f7c8, but generalized with f067512c06. 
> That former commit talks about DMA operations on IA64 that also updates/drops 
> the icache in reality.

Yeah I remember discussions around the topic. IIUC DMA invalidates the cache 
lines of the CPUs in the system for the region it's writing to. At least 
potentially. But again, I'll leave this to the IBM guys to answer :). They know 
best how their hardware works.

Alex

> 


[PATCH] qemu-kvm: Drop redundant cpuid filtering from cpu_x86_cpuid

2011-09-28 Thread Jan Kiszka
This is already done in kvm_arch_init_vcpu based on the kernel-reported
supported features.

Signed-off-by: Jan Kiszka 
---

I'm not 100% sure, so please double-check.

 target-i386/cpuid.c |   22 --
 1 files changed, 0 insertions(+), 22 deletions(-)

diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index f17..1e8bcff 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -1232,28 +1232,6 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
 *ecx |= 1 << 1;/* CmpLegacy bit */
 }
 }
-if (kvm_enabled()) {
-uint32_t h_eax, h_edx;
-
-host_cpuid(index, 0, &h_eax, NULL, NULL, &h_edx);
-
-/* disable CPU features that the host does not support */
-
-/* long mode */
-if ((h_edx & 0x2000) == 0 /* || !lm_capable_kernel */)
-*edx &= ~0x2000;
-/* syscall */
-if ((h_edx & 0x0800) == 0)
-*edx &= ~0x0800;
-/* nx */
-if ((h_edx & 0x0010) == 0)
-*edx &= ~0x0010;
-
-/* disable CPU features that KVM cannot support */
-
-/* 3dnow */
-*edx &= ~0xc000;
-}
 break;
 case 0x8002:
 case 0x8003:
-- 
1.7.3.4


Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Reeted

On 09/28/11 14:56, Richard W.M. Jones wrote:

On Wed, Sep 28, 2011 at 12:19:09PM +0200, Reeted wrote:

Ok that seems to work: it removes the vhost part in the virsh launch
hence cutting down 12secs of boot time.

If nobody comes out with an explanation of why, I will open another
thread on the kvm list for this. I would probably need to test disk
performance on vhost=on to see if it degrades or it's for another
reason that boot time is increased.

Is it using CPU during this time, or is the qemu-kvm process idle?

It wouldn't be the first time that a network option ROM sat around
waiting for an imaginary console user to press a key.

Rich.


Of the two qemu-kvm processes (threads?) which I see consuming CPU for 
that VM, one is at about 20%, the other at about 10%. I think it's doing 
something but maybe not much, or maybe it's really I/O bound and the I/O 
is slow (as I originally thought). I will perform some disk benchmarks 
and follow up, but I can't do that right now...

Thank you
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] qemu-kvm: Drop kvm optimization from code_gen_alloc

2011-09-28 Thread Jan Kiszka
We no longer get here if KVM is enabled.

Signed-off-by: Jan Kiszka 
---
 exec.c |3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/exec.c b/exec.c
index f188549..83ffa39 100644
--- a/exec.c
+++ b/exec.c
@@ -470,9 +470,6 @@ static uint8_t 
static_code_gen_buffer[DEFAULT_CODE_GEN_BUFFER_SIZE]
 
 static void code_gen_alloc(unsigned long tb_size)
 {
-if (kvm_enabled())
-return;
-
 #ifdef USE_STATIC_CODE_GEN_BUFFER
 code_gen_buffer = static_code_gen_buffer;
 code_gen_buffer_size = DEFAULT_CODE_GEN_BUFFER_SIZE;
-- 
1.7.3.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] qemu-kvm: Drop compatfd build hack

2011-09-28 Thread Jan Kiszka
No longer required as we use upstream posix-aio-compat now.

Signed-off-by: Jan Kiszka 
---
 Makefile.objs |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/Makefile.objs b/Makefile.objs
index cb10816..afd6137 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -28,7 +28,6 @@ block-obj-y += nbd.o block.o aio.o aes.o qemu-config.o 
qemu-progress.o qemu-sock
 block-obj-y += $(coroutine-obj-y)
 block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
-block-obj-$(CONFIG_POSIX) += compatfd.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o 
vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
qcow2-cache.o
@@ -158,7 +157,7 @@ common-obj-y += $(addprefix ui/, $(ui-obj-y))
 common-obj-$(CONFIG_VNC) += $(addprefix ui/, $(vnc-obj-y))
 
 common-obj-y += iov.o acl.o
-#common-obj-$(CONFIG_POSIX) += compatfd.o
+common-obj-$(CONFIG_POSIX) += compatfd.o
 common-obj-y += notify.o event_notifier.o
 common-obj-y += qemu-timer.o qemu-timer-common.o
 
-- 
1.7.3.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Jan Kiszka

On 2011-09-28 16:45, Jan Kiszka wrote:

On 2011-09-28 16:26, Alexander Graf wrote:


On 28.09.2011, at 16:23, Jan Kiszka wrote:


Alex,

we have this diff in qemu-kvm:

diff --git a/exec.c b/exec.c
index c1e045d..f188549 100644
--- a/exec.c
+++ b/exec.c
@@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t
addr, uint8_t *buf,
cpu_physical_memory_set_dirty_flags(
addr1, (0xff& ~CODE_DIRTY_FLAG));
}
+ /* qemu doesn't execute guest code directly, but kvm does
+ therefore flush instruction caches */
+ if (kvm_enabled())
+ flush_icache_range((unsigned long)ptr,
+ ((unsigned long)ptr)+l);
qemu_put_ram_ptr(ptr);
}
} else {


flush_icache_range() is doing something only on PPC hosts. So do we need
this upstream?


This makes sure that when device emulation overwrites code that is
already present in the cache of a CPU, it gets flushed from the
icache. I'm fairly sure we want that :). But let's ask Ben and David
as well.


/me wondered which write scenario precisely needs this. It could only be
something synchronous /wrt to some VCPU. Which operations could trigger
such a write? Does PPC inject software breakpoints in form of trap
operations or so?

Mmm, according to our ancient recordings, the hunk above was once
introduced for the sake of IA64: 9dc99a2823. I skipped it in my removal
patch as it has some non-IA64 effect, at least potentially.


Correction: It was introduced by 6d3295f7c8, but generalized with 
f067512c06. The former commit talks about DMA operations on IA64 that 
also update/drop the icache in reality.


Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Jan Kiszka

On 2011-09-28 16:26, Alexander Graf wrote:


On 28.09.2011, at 16:23, Jan Kiszka wrote:


Alex,

we have this diff in qemu-kvm:

diff --git a/exec.c b/exec.c
index c1e045d..f188549 100644
--- a/exec.c
+++ b/exec.c
@@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, 
uint8_t *buf,
 cpu_physical_memory_set_dirty_flags(
 addr1, (0xff&  ~CODE_DIRTY_FLAG));
 }
+   /* qemu doesn't execute guest code directly, but kvm does
+  therefore flush instruction caches */
+   if (kvm_enabled())
+   flush_icache_range((unsigned long)ptr,
+  ((unsigned long)ptr)+l);
 qemu_put_ram_ptr(ptr);
 }
 } else {


flush_icache_range() is doing something only on PPC hosts. So do we need
this upstream?


This makes sure that when device emulation overwrites code that is already 
present in the cache of a CPU, it gets flushed from the icache. I'm fairly sure 
we want that :). But let's ask Ben and David as well.


/me wondered which write scenario precisely needs this. It could only be 
something synchronous /wrt to some VCPU. Which operations could trigger 
such a write? Does PPC inject software breakpoints in form of trap 
operations or so?


Mmm, according to our ancient recordings, the hunk above was once 
introduced for the sake of IA64: 9dc99a2823. I skipped it in my removal 
patch as it has some non-IA64 effect, at least potentially.


Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] virtio-net: Prevent NULL dereference

2011-09-28 Thread Sasha Levin
This patch prevents a NULL dereference when the user has passed a length
longer than an actual buffer to virtio-net.

Cc: Rusty Russell 
Cc: "Michael S. Tsirkin" 
Cc: virtualizat...@lists.linux-foundation.org
Cc: net...@vger.kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Sasha Levin 
---
 drivers/net/virtio_net.c |   12 +++-
 1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bde0dec..4a53d2a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -208,12 +208,22 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
*vi,
return NULL;
}
 
-   while (len) {
+   while (len && page) {
set_skb_frag(skb, page, offset, &len);
page = (struct page *)page->private;
offset = 0;
}
 
+   /*
+* This is the case where we ran out of pages in our linked list, but
+* supposedly have more data to read.
+*/ 
+   if (len > 0) {
+   pr_debug("%s: missing data to assemble skb\n", skb->dev->name);
+   dev_kfree_skb(skb);
+   return NULL;
+   }
+
if (page)
give_pages(vi, page);
 
-- 
1.7.6.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] virtio-net: Verify page list size before fitting into skb

2011-09-28 Thread Sasha Levin
This patch verifies that the length of a buffer stored in a linked list
of pages is small enough to fit into a skb.

If the size is larger than the maximum size of an skb, it means that we shouldn't
go ahead building skbs anyway, since we won't be able to send the buffer as
the user requested.

Cc: Rusty Russell 
Cc: "Michael S. Tsirkin" 
Cc: virtualizat...@lists.linux-foundation.org
Cc: net...@vger.kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Sasha Levin 
---
 drivers/net/virtio_net.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0c7321c..bde0dec 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -195,6 +195,19 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
len -= copy;
offset += copy;
 
+   /*
+* Verify that we can indeed put this data into a skb.
+* This is here to handle cases when the device erroneously
+* tries to receive more than is possible. This is usually
+* the case of a broken device.
+*/
+   if (unlikely(len > MAX_SKB_FRAGS * PAGE_SIZE)) {
+   if (net_ratelimit())
+   pr_debug("%s: too much data\n", skb->dev->name);
+   dev_kfree_skb(skb);
+   return NULL;
+   }
+
while (len) {
set_skb_frag(skb, page, offset, &len);
page = (struct page *)page->private;
-- 
1.7.6.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Alexander Graf

On 28.09.2011, at 16:23, Jan Kiszka wrote:

> Alex,
> 
> we have this diff in qemu-kvm:
> 
> diff --git a/exec.c b/exec.c
> index c1e045d..f188549 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, 
> uint8_t *buf,
> cpu_physical_memory_set_dirty_flags(
> addr1, (0xff & ~CODE_DIRTY_FLAG));
> }
> + /* qemu doesn't execute guest code directly, but kvm does
> +therefore flush instruction caches */
> + if (kvm_enabled())
> + flush_icache_range((unsigned long)ptr,
> +((unsigned long)ptr)+l);
> qemu_put_ram_ptr(ptr);
> }
> } else {
> 
> 
> flush_icache_range() is doing something only on PPC hosts. So do we need
> this upstream?

This makes sure that when device emulation overwrites code that is already 
present in the cache of a CPU, it gets flushed from the icache. I'm fairly sure 
we want that :). But let's ask Ben and David as well.
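
For context: flush_icache_range() is a no-op on x86 hosts, and on PPC hosts it
boils down to the classic dcbst/sync/icbi/isync sequence. A minimal sketch,
assuming a fixed 32-byte cache line instead of the runtime line-size detection
a real implementation would do:

/* Sketch only -- PPC host icache flush after device emulation/DMA has
 * written into guest RAM that may contain code the CPU will execute. */
static void ppc_flush_icache_range_sketch(unsigned long start, unsigned long stop)
{
    const unsigned long line = 32;      /* assumed cache line size */
    unsigned long p;

    for (p = start & ~(line - 1); p < stop; p += line)
        asm volatile ("dcbst 0,%0" : : "r" (p));    /* write data cache back */
    asm volatile ("sync" : : : "memory");           /* order the write-backs */
    for (p = start & ~(line - 1); p < stop; p += line)
        asm volatile ("icbi 0,%0" : : "r" (p));     /* invalidate icache line */
    asm volatile ("sync" : : : "memory");
    asm volatile ("isync" : : : "memory");          /* drop prefetched insns */
}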


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Jan Kiszka
Alex,

we have this diff in qemu-kvm:

diff --git a/exec.c b/exec.c
index c1e045d..f188549 100644
--- a/exec.c
+++ b/exec.c
@@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, 
uint8_t *buf,
 cpu_physical_memory_set_dirty_flags(
 addr1, (0xff & ~CODE_DIRTY_FLAG));
 }
+   /* qemu doesn't execute guest code directly, but kvm does
+  therefore flush instruction caches */
+   if (kvm_enabled())
+   flush_icache_range((unsigned long)ptr,
+  ((unsigned long)ptr)+l);
 qemu_put_ram_ptr(ptr);
 }
 } else {


flush_icache_range() is doing something only on PPC hosts. So do we need
this upstream?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Stephan Diestelhorst
On Tuesday 27 September 2011, 12:44:02 Jeremy Fitzhardinge wrote:
> On 09/27/2011 02:34 AM, Stephan Diestelhorst wrote:
> > On Wednesday 14 September 2011, 17:31:32 Jeremy Fitzhardinge wrote:
> >> This series replaces the existing paravirtualized spinlock mechanism
> >> with a paravirtualized ticketlock mechanism.
> > [...] 
> >> The unlock code is very straightforward:
> >>    prev = *lock;
> >>    __ticket_unlock_release(lock);
> >>    if (unlikely(__ticket_in_slowpath(lock)))
> >>        __ticket_unlock_slowpath(lock, prev);
> >>
> >> which generates:
> >>    push   %rbp
> >>    mov    %rsp,%rbp
> >>
> >>    movzwl (%rdi),%esi
> >>    addb   $0x2,(%rdi)
> >>    movzwl (%rdi),%eax
> >>    testb  $0x1,%ah
> >>    jne    1f
> >>
> >>    pop    %rbp
> >>    retq
> >>
> >>    ### SLOWPATH START
> >> 1: movzwl (%rdi),%edx
> >>    movzbl %dh,%ecx
> >>    mov    %edx,%eax
> >>    and    $-2,%ecx          # clear TICKET_SLOWPATH_FLAG
> >>    mov    %cl,%dh
> >>    cmp    %dl,%cl           # test to see if lock is uncontended
> >>    je     3f
> >>
> >> 2: movzbl %dl,%esi
> >>    callq  *__ticket_unlock_kick     # kick anyone waiting
> >>    pop    %rbp
> >>    retq
> >>
> >> 3: lock cmpxchg %dx,(%rdi)  # use cmpxchg to safely write back flag
> >>    jmp    2b
> >>    ### SLOWPATH END
> > [...]
> >> Thoughts? Comments? Suggestions?
> > You have a nasty data race in your code that can cause a losing
> > acquirer to sleep forever, because its setting the TICKET_SLOWPATH flag
> > can race with the lock holder releasing the lock.
> >
> > I used the code for the slow path from the GIT repo.
> >
> > Let me try to point out an interleaving:
> >
> > Lock is held by one thread, contains 0x0200.
> >
> > _Lock holder_   _Acquirer_
> > mov$0x200,%eax
> > lock xadd %ax,(%rdi)
> > // ax:= 0x0200, lock:= 0x0400
> > ...
> > // this guy spins for a while, reading
> > // the lock
> > ...
> > //trying to free the lock
> > movzwl (%rdi),%esi (esi:=0x0400)
> > addb   $0x2,(%rdi) (LOCAL copy of lock is now: 0x0402)
> > movzwl (%rdi),%eax (local forwarding from previous store: eax := 0x0402)
> > testb  $0x1,%ah(no wakeup of anybody)
> > jne1f
> >
> > callq  *__ticket_lock_spinning
> >   ...
> >   // __ticket_enter_slowpath(lock)
> >   lock or (%rdi), $0x100
> >   // (global view of lock := 0x0500)
> > ...
> >   ACCESS_ONCE(lock->tickets.head) == want
> >   // (reads 0x00)
> > ...
> >   xen_poll_irq(irq); // goes to sleep
> > ...
> > [addb   $0x2,(%rdi)]
> > // (becomes globally visible only now! global view of lock := 0x0502)
> > ...
> >
> > Your code is reusing the (just about) safe version of unlocking a
> > spinlock without understanding the effect that close has on later
> > memory ordering. It may work on CPUs that cannot do narrow -> wide
> > store to load forwarding and have to make the addb store visible
> > globally. This is an implementation artifact of specific uarches, and
> > you mustn't rely on it, since our specified memory model allows looser
> > behaviour.
> 
> Ah, thanks for this observation.  I've seen this bug before when I
> didn't pay attention to the unlock W vs flag R ordering at all, and I
> was hoping the aliasing would be sufficient - and certainly this seems
> to have been OK on my Intel systems.  But you're saying that it will
> fail on current AMD systems?

I have tested this and have not seen it fail on publicly released AMD
systems. But as I have tried to point out, this does not mean it is
safe to do in software, because future microarchitectures may have more
capable forwarding engines.

> Have you tested this, or is this just from code analysis (which I
> agree with after reviewing the ordering rules in the Intel manual).

We have found a similar issue in Novell's PV ticket lock implementation
during internal product testing.
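
Concretely, the mfence variant of the fix (quoted just below) would amount to
something like the following sketch; the lock layout and names are illustrative
of the code under discussion, not the actual arch/x86 sources:

/* Sketch: head in the low byte, tail in the high byte, tickets advance
 * by 2, and bit 0 of the tail is TICKET_SLOWPATH_FLAG. */
typedef struct { volatile unsigned char head, tail; } ticketlock_sketch_t;

static inline void ticket_unlock_sketch(ticketlock_sketch_t *lock)
{
    asm volatile ("addb %1, %0" : "+m" (lock->head) : "i" (2) : "memory");
    /* Make the unlock store globally visible before re-reading the lock
     * word, so a slowpath flag set by a waiter cannot be hidden by
     * store-to-load forwarding of the stale local value. */
    asm volatile ("mfence" : : : "memory");
    if (lock->tail & 1) {                 /* TICKET_SLOWPATH_FLAG set? */
        /* __ticket_unlock_slowpath(lock): kick the sleeping waiter */
    }
}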

> > Since you want to get that addb out to global memory before the second
> > read, either use a LOCK prefix for it, add an MFENCE between addb and
> > movzwl, or use a LOCKed instruction that will have a fencing effect
> > (e.g., to top-of-stack)between addb and movzwl.
> 
> Hm.  I don't really want to do any of those because it will probably
> have a significant effect on the unlock performance; I was really trying
> to avoid adding any more locked instructions.  A previous version of the
> code had an mfence in here, but I hit on the idea of using aliasing to
> get the ordering I want - but overlook

Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Reeted

On 09/28/11 11:53, Daniel P. Berrange wrote:

On Wed, Sep 28, 2011 at 11:49:01AM +0200, Reeted wrote:

YES!
It's the vhost. With vhost=on it takes about 12 seconds more time to boot.

...meaning? :-)

I've no idea. I was always under the impression that 'vhost=on' was
the 'make it go much faster' switch. So something is going wrong
here that I can't explain.

Perhaps one of the network people on this list can explain...


To turn vhost off in the libvirt XML, you should be able to use
<driver name='qemu'/> for the interface in question, eg


 <interface type='network'>
   <model type='virtio'/>
   <driver name='qemu'/>
   ...
 </interface>



Ok that seems to work: it removes the vhost part in the virsh launch 
hence cutting down 12secs of boot time.


If nobody comes up with an explanation of why this happens, I will open another 
thread on the kvm list for this. I would probably need to test disk 
performance with vhost=on to see whether it degrades, or whether the longer 
boot time has some other cause.


Thanks so much for your help Daniel,
Reeted
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Richard W.M. Jones
On Wed, Sep 28, 2011 at 12:19:09PM +0200, Reeted wrote:
> On 09/28/11 11:53, Daniel P. Berrange wrote:
> >On Wed, Sep 28, 2011 at 11:49:01AM +0200, Reeted wrote:
> >>YES!
> >>It's the vhost. With vhost=on it takes about 12 seconds more time to boot.
> >>
> >>...meaning? :-)
> >I've no idea. I was always under the impression that 'vhost=on' was
> >the 'make it go much faster' switch. So something is going wrong
> >here that I can't explain.
> >
> >Perhaps one of the network people on this list can explain...
> >
> >
> >To turn vhost off in the libvirt XML, you should be able to use
> ><driver name='qemu'/> for the interface in question, eg
> >
> >
> > <interface type='network'>
> >   <model type='virtio'/>
> >   <driver name='qemu'/>
> >   ...
> > </interface>
> 
> 
> Ok that seems to work: it removes the vhost part in the virsh launch
> hence cutting down 12secs of boot time.
> 
> If nobody comes out with an explanation of why, I will open another
> thread on the kvm list for this. I would probably need to test disk
> performance on vhost=on to see if it degrades or it's for another
> reason that boot time is increased.

Is it using CPU during this time, or is the qemu-kvm process idle?

It wouldn't be the first time that a network option ROM sat around
waiting for an imaginary console user to press a key.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] virtio-net: Verify page list size before fitting into skb

2011-09-28 Thread Sasha Levin
On Tue, 2011-09-27 at 15:37 +0300, Michael S. Tsirkin wrote:
> On Tue, Sep 27, 2011 at 02:20:29PM +0300, Sasha Levin wrote:
> > On Tue, 2011-09-27 at 10:00 +0300, Michael S. Tsirkin wrote:
> > > >   skb = page_to_skb(vi, page, len);
> > > >   ...
> > > 
> > > Sorry I don't get it yet. Where is mergeable ignored here?
> > 
> > The NULL deref happens in page_to_skb(), before merge buffers are
> > handled.
> 
> The len here is a single buffer length, so for mergeable
> buffers it is <= the size of the buffer we gave to hardware,
> which is PAGE_SIZE.
> 
> err = virtqueue_add_buf_single(vi->rvq, page_address(page),
>PAGE_SIZE, false, page, gfp);
>  
> 
> Unless of course we are trying to work around broken hardware again,
> which I don't have a problem with, but should probably
> get appropriate comments in code and trigger a warning.
> 
> > I'll test it and see if it's really the case.

I've verified it with VIRTIO_NET_F_MRG_RXBUF set, and we still get the
NULL deref.



-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 0/3] acpi: fix up EJ0 in DSDT

2011-09-28 Thread Amos Kong
- Original Message -
> This is a second iteration of the patch.  The patch has been
> significantly reworked to address (offline) comments by Gleb.
> 
> I think the infrastructure created is generic enough
> to be generally useful beyond the specific bug
> that I would like to fix. Specifically it
> will be able to find S3 Name to patch that,
> or process compiled CPU SSDT to avoid the need for
> hardcoded offsets.
> 
> Please comment.
> 
> Main changes:
>   - tools rewritten in python
>   - Original ASL retains _EJ0 methods, BIOS patches that to EJ0_
>   - generic ACP_EXTRACT infrastructure that can match Method
>   and Name Operators
>   - instead of matching specific method name, insert tags
> in original DSL source and match that to AML


Patch looks good to me.
Acked-by: Amos Kong 

> -
> 
> Here's a bug: guest thinks it can eject VGA device and ISA bridge.
> 
> [root@dhcp74-172 ~]#lspci
> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
> (rev 02)
> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA
> [Natoma/Triton II]
> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE
> [Natoma/Triton II]
> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
> 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
> 00:03.0 PCI bridge: Red Hat, Inc. Device 0001
> 00:04.0 Ethernet controller: Qumranet, Inc. Virtio network device
> 00:05.0 SCSI storage controller: Qumranet, Inc. Virtio block device
> 01:00.0 Ethernet controller: Intel Corporation 82540EM Gigabit
> Ethernet Controller (rev 03)
> 
> [root@dhcp74-172 ~]# ls /sys/bus/pci/slots/1/
> adapter  address  attention  latch  module  power
> [root@dhcp74-172 ~]# ls /sys/bus/pci/slots/2/
> adapter  address  attention  latch  module  power
> 
> [root@dhcp74-172 ~]# echo 0 > /sys/bus/pci/slots/2/power
> [root@dhcp74-172 ~]# lspci
> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
> (rev 02)
> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA
> [Natoma/Triton II]
> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE
> [Natoma/Triton II]
> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
> 00:03.0 PCI bridge: Red Hat, Inc. Device 0001
> 00:04.0 Ethernet controller: Qumranet, Inc. Virtio network device
> 00:05.0 SCSI storage controller: Qumranet, Inc. Virtio block device
> 01:00.0 Ethernet controller: Intel Corporation 82540EM Gigabit
> Ethernet Controller (rev 03)
> 
> This is wrong because slots 1 and 2 are marked as not hotpluggable
> in qemu.
> 
> The reason is that our acpi tables declare both _RMV with value 0,
> and _EJ0 method for these slots. What happens in this case
> is undocumented by ACPI spec, so linux ignores _RMV,
> and windows seems to ignore _EJ0.
> 
> The correct way to suppress hotplug is not to have _EJ0,
> so this is what this patch does: it probes PIIX and
> modifies DSDT to match.
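
Purely to illustrate the patching idea -- not the actual BIOS code, and the
real series locates the method through the acpi_extract tooling rather than a
byte search -- the firmware rewrites the method name in the compiled DSDT so a
non-hotpluggable slot no longer exposes _EJ0:

#include <stddef.h>
#include <string.h>

/* Illustrative sketch: rename _EJ0 to EJ0_ at a known offset so the
 * OSPM no longer sees an eject method for this slot. */
static void suppress_ej0_sketch(unsigned char *dsdt, size_t len, size_t name_off)
{
    if (name_off + 4 <= len && memcmp(dsdt + name_off, "_EJ0", 4) == 0)
        memcpy(dsdt + name_off, "EJ0_", 4);
}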
> 
> With these patches applied, we get:
> 
> [root@dhcp74-172 ~]# ls /sys/bus/pci/slots/1/
> address
> [root@dhcp74-172 ~]# ls /sys/bus/pci/slots/2/
> address
> 
> 
> 
> Michael S. Tsirkin (3):
>   acpi: generate and parse mixed asl/aml listing
>   acpi: EJ0 method name patching
>   acpi: remove _RMV
> 
>  Makefile |   10 +-
>  src/acpi-dsdt.dsl|   96 ---
>  src/acpi.c   |   31 ++
>  tools/acpi_extract.py|  195
>  ++
>  tools/acpi_extract_preprocess.py |   37 +++
>  5 files changed, 307 insertions(+), 62 deletions(-)
>  create mode 100755 tools/acpi_extract.py
>  create mode 100755 tools/acpi_extract_preprocess.py
> 
> --
> 1.7.5.53.gc233e
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Daniel P. Berrange
On Wed, Sep 28, 2011 at 12:19:09PM +0200, Reeted wrote:
> On 09/28/11 11:53, Daniel P. Berrange wrote:
> >On Wed, Sep 28, 2011 at 11:49:01AM +0200, Reeted wrote:
> >>YES!
> >>It's the vhost. With vhost=on it takes about 12 seconds more time to boot.
> >>
> >>...meaning? :-)
> >I've no idea. I was always under the impression that 'vhost=on' was
> >the 'make it go much faster' switch. So something is going wrong
> >here that I can't explain.
> >
> >Perhaps one of the network people on this list can explain...
> >
> >
> >To turn vhost off in the libvirt XML, you should be able to use
> ><driver name='qemu'/> for the interface in question, eg
> >
> >
> > <interface type='network'>
> >   <model type='virtio'/>
> >   <driver name='qemu'/>
> >   ...
> > </interface>
> 
> 
> Ok that seems to work: it removes the vhost part in the virsh launch
> hence cutting down 12secs of boot time.
> 
> If nobody comes out with an explanation of why, I will open another
> thread on the kvm list for this. I would probably need to test disk
> performance on vhost=on to see if it degrades or it's for another
> reason that boot time is increased.

Be sure to CC the qemu-devel mailing list too next time, since that has
a wider audience who might be able to help


Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Daniel P. Berrange
On Wed, Sep 28, 2011 at 11:49:01AM +0200, Reeted wrote:
> On 09/28/11 11:28, Daniel P. Berrange wrote:
> >On Wed, Sep 28, 2011 at 11:19:43AM +0200, Reeted wrote:
> >>On 09/28/11 09:51, Daniel P. Berrange wrote:
> This is my bash commandline:
> 
> /opt/qemu-kvm-0.14.1/bin/qemu-system-x86_64 -M pc-0.14 -enable-kvm
> -m 2002 -smp 2,sockets=2,cores=1,threads=1 -name vmname1-1 -uuid
> ee75e28a-3bf3-78d9-3cba-65aa63973380 -nodefconfig -nodefaults
> -chardev 
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname1-1.monitor,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc
> -boot order=dc,menu=on -drive 
> file=/dev/mapper/vgPtpVM-lvVM_Vmname1_d1,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none,aio=native
> -device 
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> -drive 
> if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,cache=none,aio=native
> -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
> -net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no
> -usb -vnc 127.0.0.1:0 -vga cirrus -device
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
> >>>This shows KVM is being requested, but we should validate that KVM is
> >>>definitely being activated when under libvirt. You can test this by
> >>>doing:
> >>>
> >>> virsh qemu-monitor-command vmname1 'info kvm'
> >>kvm support: enabled
> >>
> >>I think I would see a higher impact if it was KVM not enabled.
> >>
> Which was taken from libvirt's command line. The only modifications
> I did to the original libvirt commandline (seen with ps aux) were:
> >
> - Network was: -netdev tap,fd=17,id=hostnet0,vhost=on,vhostfd=18
> -device 
> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
> Has been simplified to: -net nic,model=virtio -net
> tap,ifname=tap0,script=no,downscript=no
> and manual bridging of the tap0 interface.
> >>>You could have equivalently used
> >>>
> >>>  -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
> >>>  -device 
> >>> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
> >>It's this! It's this!! (thanks for the line)
> >>
> >>It raises boot time by 10-13 seconds
> >Ok, that is truly bizarre and I don't really have any explanation
> >for why that is. I guess you could try 'vhost=off' too and see if that
> >makes the difference.
> 
> YES!
> It's the vhost. With vhost=on it takes about 12 seconds more time to boot.
> 
> ...meaning? :-)

I've no idea. I was always under the impression that 'vhost=on' was
the 'make it go much faster' switch. So something is going wrong
here that I can't explain.

Perhaps one of the network people on this list can explain...


To turn vhost off in the libvirt XML, you should be able to use
<driver name='qemu'/> for the interface in question, eg


<interface type='network'>
  <model type='virtio'/>
  <driver name='qemu'/>
  ...
</interface>

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [libvirt] Qemu/KVM is 3x slower under libvirt

2011-09-28 Thread Reeted

On 09/28/11 11:28, Daniel P. Berrange wrote:

On Wed, Sep 28, 2011 at 11:19:43AM +0200, Reeted wrote:

On 09/28/11 09:51, Daniel P. Berrange wrote:

This is my bash commandline:

/opt/qemu-kvm-0.14.1/bin/qemu-system-x86_64 -M pc-0.14 -enable-kvm
-m 2002 -smp 2,sockets=2,cores=1,threads=1 -name vmname1-1 -uuid
ee75e28a-3bf3-78d9-3cba-65aa63973380 -nodefconfig -nodefaults
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname1-1.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc
-boot order=dc,menu=on -drive 
file=/dev/mapper/vgPtpVM-lvVM_Vmname1_d1,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none,aio=native
-device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-drive 
if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,cache=none,aio=native
-device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
-net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no
-usb -vnc 127.0.0.1:0 -vga cirrus -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

This shows KVM is being requested, but we should validate that KVM is
definitely being activated when under libvirt. You can test this by
doing:

 virsh qemu-monitor-command vmname1 'info kvm'

kvm support: enabled

I think I would see a higher impact if it was KVM not enabled.


Which was taken from libvirt's command line. The only modifications
I did to the original libvirt commandline (seen with ps aux) were:



- Network was: -netdev tap,fd=17,id=hostnet0,vhost=on,vhostfd=18
-device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
Has been simplified to: -net nic,model=virtio -net
tap,ifname=tap0,script=no,downscript=no
and manual bridging of the tap0 interface.

You could have equivalently used

  -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
  -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3

It's this! It's this!! (thanks for the line)

It raises boot time by 10-13 seconds

Ok, that is truly bizarre and I don't really have any explanation
for why that is. I guess you could try 'vhost=off' too and see if that
makes the difference.


YES!
It's the vhost. With vhost=on it takes about 12 seconds more time to boot.

...meaning? :-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [libvirt] Qemu/KVM is 3x slower under libvirt

2011-09-28 Thread Daniel P. Berrange
On Wed, Sep 28, 2011 at 11:19:43AM +0200, Reeted wrote:
> On 09/28/11 09:51, Daniel P. Berrange wrote:
> >>This is my bash commandline:
> >>
> >>/opt/qemu-kvm-0.14.1/bin/qemu-system-x86_64 -M pc-0.14 -enable-kvm
> >>-m 2002 -smp 2,sockets=2,cores=1,threads=1 -name vmname1-1 -uuid
> >>ee75e28a-3bf3-78d9-3cba-65aa63973380 -nodefconfig -nodefaults
> >>-chardev 
> >>socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname1-1.monitor,server,nowait
> >>-mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc
> >>-boot order=dc,menu=on -drive 
> >>file=/dev/mapper/vgPtpVM-lvVM_Vmname1_d1,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none,aio=native
> >>-device 
> >>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>-drive 
> >>if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,cache=none,aio=native
> >>-device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
> >>-net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no
> >>-usb -vnc 127.0.0.1:0 -vga cirrus -device
> >>virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
> >
> >This shows KVM is being requested, but we should validate that KVM is
> >definitely being activated when under libvirt. You can test this by
> >doing:
> >
> > virsh qemu-monitor-command vmname1 'info kvm'
> 
> kvm support: enabled
> 
> I think I would see a higher impact if it was KVM not enabled.
> 
> >>Which was taken from libvirt's command line. The only modifications
> >>I did to the original libvirt commandline (seen with ps aux) were:


> >>- Network was: -netdev tap,fd=17,id=hostnet0,vhost=on,vhostfd=18
> >>-device 
> >>virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
> >>Has been simplified to: -net nic,model=virtio -net
> >>tap,ifname=tap0,script=no,downscript=no
> >>and manual bridging of the tap0 interface.
> >You could have equivalently used
> >
> >  -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
> >  -device 
> > virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
> 
> It's this! It's this!! (thanks for the line)
> 
> It raises boot time by 10-13 seconds

Ok, that is truly bizarre and I don't really have any explanation
for why that is. I guess you could try 'vhost=off' too and see if that
makes the difference.

> 
> But now I don't know where to look. During boot there is a pause
> usually between /scripts/init-bottom  (Ubuntu 11.04 guest) and the
> appearance of login prompt, however that is not really meaningful
> because there is probably much background activity going on there,
> with init etc. which don't display messages
> 
> 
> init-bottom does just this
> 
> -
> #!/bin/sh -e
> # initramfs init-bottom script for udev
> 
> PREREQ=""
> 
> # Output pre-requisites
> prereqs()
> {
> echo "$PREREQ"
> }
> 
> case "$1" in
> prereqs)
> prereqs
> exit 0
> ;;
> esac
> 
> 
> # Stop udevd, we'll miss a few events while we run init, but we catch up
> pkill udevd
> 
> # Move /dev to the real filesystem
> mount -n -o move /dev ${rootmnt}/dev
> -
> 
> It doesn't look like it should take time to execute.
> So there is probably some other background activity going on... and
> that is slower, but I don't know what that is.
> 
> 
> Another thing that can be noticed is that the dmesg message:
> 
> [   13.290173] eth0: no IPv6 routers present
> 
> (which is also the last message)
> 
> happens on average 1 (one) second earlier in the fast case (-net)
> than in the slow case (-netdev)

Hmm, none of that looks particularly suspect. So I don't really have
much idea what else to try apart from the 'vhost=off' possibilty.


Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [libvirt] Qemu/KVM is 3x slower under libvirt

2011-09-28 Thread Reeted

On 09/28/11 09:51, Daniel P. Berrange wrote:

On Tue, Sep 27, 2011 at 08:10:21PM +0200, Reeted wrote:

I repost this, this time by also including the libvirt mailing list.

Info on my libvirt: it's the version in Ubuntu 11.04 Natty which is
0.8.8-1ubuntu6.5 . I didn't recompile this one, while Kernel and
qemu-kvm are vanilla and compiled by hand as described below.

My original message follows:

This is really strange.

I just installed a new host with kernel 3.0.3 and Qemu-KVM 0.14.1
compiled by me.

I have created the first VM.
This is on LVM, virtio etc... if I run it directly from bash
console, it boots in 8 seconds (it's a bare ubuntu with no
graphics), while if I boot it under virsh (libvirt) it boots in
20-22 seconds. This is the time from after Grub to the login prompt,
or from after Grub to the ssh-server up.

I was almost able to replicate the whole libvirt command line on the
bash console, and it still goes almost 3x faster when launched from
bash than with virsh start vmname. The part I wasn't able to
replicate is the -netdev part because I still haven't understood the
semantics of it.

-netdev is just an alternative way of setting up networking that
avoids QEMU's nasty VLAN concept. Using -netdev allows QEMU to
use more efficient codepaths for networking, which should improve
the network performance.


This is my bash commandline:

/opt/qemu-kvm-0.14.1/bin/qemu-system-x86_64 -M pc-0.14 -enable-kvm
-m 2002 -smp 2,sockets=2,cores=1,threads=1 -name vmname1-1 -uuid
ee75e28a-3bf3-78d9-3cba-65aa63973380 -nodefconfig -nodefaults
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname1-1.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc
-boot order=dc,menu=on -drive 
file=/dev/mapper/vgPtpVM-lvVM_Vmname1_d1,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none,aio=native
-device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-drive 
if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,cache=none,aio=native
-device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
-net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no
-usb -vnc 127.0.0.1:0 -vga cirrus -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


This shows KVM is being requested, but we should validate that KVM is
definitely being activated when under libvirt. You can test this by
doing:

 virsh qemu-monitor-command vmname1 'info kvm'


kvm support: enabled

I think I would see a higher impact if it was KVM not enabled.


Which was taken from libvirt's command line. The only modifications
I did to the original libvirt commandline (seen with ps aux) were:

- Removed -S

Fine, has no effect on performance.


- Network was: -netdev tap,fd=17,id=hostnet0,vhost=on,vhostfd=18
-device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
Has been simplified to: -net nic,model=virtio -net
tap,ifname=tap0,script=no,downscript=no
and manual bridging of the tap0 interface.

You could have equivalently used

  -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
  -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3


It's this! It's this!! (thanks for the line)

It raises boot time by 10-13 seconds

But now I don't know where to look. During boot there is a pause, 
usually between /scripts/init-bottom (Ubuntu 11.04 guest) and the 
appearance of the login prompt; however, that is not really meaningful 
because there is probably much background activity going on there, with 
init etc. which don't display messages.



init-bottom does just this

-
#!/bin/sh -e
# initramfs init-bottom script for udev

PREREQ=""

# Output pre-requisites
prereqs()
{
echo "$PREREQ"
}

case "$1" in
prereqs)
prereqs
exit 0
;;
esac


# Stop udevd, we'll miss a few events while we run init, but we catch up
pkill udevd

# Move /dev to the real filesystem
mount -n -o move /dev ${rootmnt}/dev
-

It doesn't look like it should take time to execute.
So there is probably some other background activity going on... and that 
is slower, but I don't know what that is.



Another thing that can be noticed is that the dmesg message:

[   13.290173] eth0: no IPv6 routers present

(which is also the last message)

happens on average 1 (one) second earlier in the fast case (-net) than 
in the slow case (-netdev)




That said, I don't expect this has anything todo with the performance
since booting a guest rarely involves much network I/O unless you're
doing something odd like NFS-root / iSCSI-root.


No there is nothing like that. No network disks or nfs.

I had ntpdate, but I removed that and it changed nothing.



Firstly I had thought that this could be fault of the VNC: I have
compiled qemu-kvm with no separate vnc thread. I thought that
libvirt might have connected to the vnc server at all times a

Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files

2011-09-28 Thread Alexander Graf

On 28.09.2011, at 04:40, Alex Williamson wrote:

> On Tue, 2011-09-27 at 16:28 -0500, Scott Wood wrote:
>> On 09/26/2011 07:45 PM, Alex Williamson wrote:
>>> On Mon, 2011-09-26 at 18:59 -0500, Scott Wood wrote:
 On 09/26/2011 01:34 PM, Alex Williamson wrote:
> /* Reset the device */
> #define VFIO_DEVICE_RESET _IO(, ,)
 
 What generic way do we have to do this?  We should probably have a way
 to determine whether it's possible, without actually asking to do it.
>>> 
>>> It's not generic, it could be a VFIO_DEVICE_PCI_RESET or we could add a
>>> bit to the device flags to indicate if it's available or we could add a
>>> "probe" arg to the ioctl to either check for existence or do it.
>> 
>> Even with PCI, isn't this only possible if function-level reset is
>> supported?
> 
> There are a couple other things we can do if FLR isn't present (D3hot
> transition, secondary bus reset, device specific resets are possible).
> 
>> I think we need a flag.
> 
> Ok, PCI has a pci_probe_reset_function() and pci_reset_function().  I'd
> probably mimic those in the vfio device ops.  Common vfio code can probe
> the reset and set the flag appropriately and we can have a common
> VFIO_DEVICE_RESET ioctl that calls into the device ops reset function.
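
To illustrate only the probe-flag-then-reset flow being discussed (every name
and number below is made up for the example, not the eventual vfio ABI):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/types.h>

/* Illustrative sketch: the device-info ioctl reports whether reset is
 * available, and userspace only issues the reset ioctl if it is. */
struct vfio_device_info_sketch {
    __u32 argsz;
    __u32 flags;
#define VFIO_DEVICE_FLAGS_RESET_SKETCH (1 << 0)
    __u32 num_regions;
    __u32 num_irqs;
};

#define VFIO_DEVICE_RESET_SKETCH _IO(';', 100)

static int reset_if_supported(int device_fd,
                              const struct vfio_device_info_sketch *info)
{
    if (!(info->flags & VFIO_DEVICE_FLAGS_RESET_SKETCH))
        return -ENOTSUP;        /* kernel did not advertise reset support */
    return ioctl(device_fd, VFIO_DEVICE_RESET_SKETCH);
}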
> 
>> For devices that can't be reset by the kernel, we'll want the ability to
>> stop/start DMA acccess through the IOMMU (or other bus-specific means),
>> separate from whether the fd is open.  If a device is assigned to a
>> partition and that partition gets reset, we'll want to disable DMA
>> before we re-use the memory, and enable it after the partition has reset
>> or quiesced the device (which requires the fd to be open).
> 
> Maybe this can be accomplished via an iommu_detach_device() to
> temporarily disassociate it from the domain.  We could also unmap all
> the DMA.  Anyway, a few possibilities.
> 
> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS  _IOW(, , int)
> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
> 
> Hope that covers it.
 
 It could be done this way, but I predict that the code (both kernel and
 user side) will be larger.  Maybe not much more complex, but more
 boilerplate.
 
 How will you manage extensions to the interface?
>>> 
>>> I would assume we'd do something similar to the kvm capabilities checks.
>> 
>> This information is already built into the data-structure approach.
> 
> If we define it to be part of the flags, then it's built-in to the ioctl
> approach too...
> 
 The table should not be particularly large, and you'll need to keep the
 information around in some form regardless.  Maybe in the PCI case you
 could produce it dynamically (though I probably wouldn't), but it really
 wouldn't make sense in the device tree case.
>>> 
>>> It would be entirely dynamic for PCI, there's no advantage to caching
>>> it.  Even for device tree, if you can't fetch it dynamically, you'd have
>>> to duplicate it between an internal data structure and a buffer reading
>>> the table.
>> 
>> I don't think we'd need to keep the device tree path/index info around
>> for anything but the table -- but really, this is a minor consideration.
>> 
 You also lose the ability to easily have a human look at the hexdump for
 debugging; you'll need a special "lsvfio" tool.  You might want one
 anyway to pretty-print the info, but with ioctls it's mandatory.
>>> 
>>> I don't think this alone justifies duplicating the data and making it
>>> difficult to parse on both ends.  Chances are we won't need such a tool
>>> for the ioctl interface because it's easier to get it right the first
>>> time ;)
>> 
>> It's not just useful for getting the code right, but for e.g. sanity
>> checking that the devices were bound properly.  I think such a tool
>> would be generally useful, no matter what the kernel interface ends up
>> being.  I don't just use lspci to debug the PCI subsystem. :-)
> 
> This is also a minor consideration.  Looking at hexdumps isn't much to
> rely on for debugging and if we take the step of writing a tool, it's
> not much harder to write for either interface.  The table is more akin
> to dumping the data, but I feel the ioctl is easier for how a driver
> would probably make use of the data (linear vs random access).
> 
>>> Note that I'm not stuck on this interface, I was just thinking about how
>>> to generate the table last week, it seemed like a pain so I thought I'd
>>> spend a few minutes outlining an ioctl interface... turns out it's not
>>> so bad.  Thanks,
>> 
>> Yeah, it can work either way, as long as the information's there and
>> there's a way to add new bits of information, or new bus types, down the
>> road.  Mainly a matter of aesthetics between the two.
> 
> It'd be nice if David would chime back in since he objected to the
> table.  Does an ioctl interface look better?  Alex Graf, a

Re: How many threads should a kvm vm be starting?

2011-09-28 Thread Thomas Fjellstrom
On September 28, 2011, Daniel P. Berrange wrote:
> On Tue, Sep 27, 2011 at 04:04:41PM -0600, Thomas Fjellstrom wrote:
> > On September 27, 2011, Avi Kivity wrote:
> > > On 09/27/2011 03:29 AM, Thomas Fjellstrom wrote:
> > > > I just noticed something interesting, a virtual machine on one of my
> > > > servers seems to have 69 threads (including the main thread). Other
> > > > guests on the machine only have a couple threads.
> > > > 
> > > > Is this normal? or has something gone horribly wrong?
> > > 
> > > It's normal if the guest does a lot of I/O.  The thread count should go
> > > down when the guest idles.
> > 
> > Ah, that would make sense. Though it kind of defeats assigning a vm a
> > single cpu/core. A single VM can now DOS an entire multi-core-cpu
> > server. It pretty much pegged my dual core (with HT) server for a couple
> > hours.
> 
> You can mitigate these problems by putting each KVM process in its own
> cgroup, and using the 'cpu_shares' tunable to ensure that each KVM
> process gets the same relative ratio of CPU time, regardless of how
> many threads it is running. With newer kernels there are other CPU
> tunables for placing hard caps on CPU utilization of the process as
> a whole too.

I'll have to look into how to set that up with libvirt. A brief search leads 
me to believe it's rather easy to set up, so I'll have to do that asap :)

> Regards,
> Daniel


-- 
Thomas Fjellstrom
tho...@fjellstrom.ca
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [libvirt] Qemu/KVM is 3x slower under libvirt

2011-09-28 Thread Daniel P. Berrange
On Tue, Sep 27, 2011 at 08:10:21PM +0200, Reeted wrote:
> I repost this, this time by also including the libvirt mailing list.
> 
> Info on my libvirt: it's the version in Ubuntu 11.04 Natty which is
> 0.8.8-1ubuntu6.5 . I didn't recompile this one, while Kernel and
> qemu-kvm are vanilla and compiled by hand as described below.
> 
> My original message follows:
> 
> This is really strange.
> 
> I just installed a new host with kernel 3.0.3 and Qemu-KVM 0.14.1
> compiled by me.
> 
> I have created the first VM.
> This is on LVM, virtio etc... if I run it directly from bash
> console, it boots in 8 seconds (it's a bare ubuntu with no
> graphics), while if I boot it under virsh (libvirt) it boots in
> 20-22 seconds. This is the time from after Grub to the login prompt,
> or from after Grub to the ssh-server up.
>
> I was almost able to replicate the whole libvirt command line on the
> bash console, and it still goes almost 3x faster when launched from
> bash than with virsh start vmname. The part I wasn't able to
> replicate is the -netdev part because I still haven't understood the
> semantics of it.

-netdev is just an alternative way of setting up networking that
avoids QEMU's nasty VLAN concept. Using -netdev allows QEMU to
use more efficient codepaths for networking, which should improve
the network performance.

> This is my bash commandline:
> 
> /opt/qemu-kvm-0.14.1/bin/qemu-system-x86_64 -M pc-0.14 -enable-kvm
> -m 2002 -smp 2,sockets=2,cores=1,threads=1 -name vmname1-1 -uuid
> ee75e28a-3bf3-78d9-3cba-65aa63973380 -nodefconfig -nodefaults
> -chardev 
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname1-1.monitor,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc
> -boot order=dc,menu=on -drive 
> file=/dev/mapper/vgPtpVM-lvVM_Vmname1_d1,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none,aio=native
> -device 
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> -drive 
> if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,cache=none,aio=native
> -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
> -net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no
> -usb -vnc 127.0.0.1:0 -vga cirrus -device
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


This shows KVM is being requested, but we should validate that KVM is
definitely being activated when under libvirt. You can test this by
doing:

virsh qemu-monitor-command vmname1 'info kvm'

> Which was taken from libvirt's command line. The only modifications
> I did to the original libvirt commandline (seen with ps aux) were:
> 
> - Removed -S

Fine, has no effect on performance.

> - Network was: -netdev tap,fd=17,id=hostnet0,vhost=on,vhostfd=18
> -device 
> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
> Has been simplified to: -net nic,model=virtio -net
> tap,ifname=tap0,script=no,downscript=no
> and manual bridging of the tap0 interface.

You could have equivalently used

 -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3

That said, I don't expect this has anything todo with the performance
since booting a guest rarely involves much network I/O unless you're
doing something odd like NFS-root / iSCSI-root.

> Firstly I had thought that this could be fault of the VNC: I have
> compiled qemu-kvm with no separate vnc thread. I thought that
> libvirt might have connected to the vnc server at all times and this
> could have slowed down the whole VM.
> But then I also tried connecting vith vncviewer to the KVM machine
> launched directly from bash, and the speed of it didn't change. So
> no, it doesn't seem to be that.

Yeah, I have never seen VNC be responsible for the kind of slowdown
you describe.

> BTW: is the slowdown of the VM on "no separate vnc thread" only in
> effect when somebody is actually connected to VNC, or always?

Probably, but again I dont think it is likely to be relevant here.

> Also, note that the time difference is not visible in dmesg once the
> machine has booted. So it's not a slowdown in detecting devices.
> Devices are always detected within the first 3 seconds, according to
> dmesg; at 3.6 seconds the first ext4 mount begins. It seems to be
> really the OS boot that is slow... it seems an hard disk performance
> problem.


There are a couple of things that would be different between running the
VM directly, vs via libvirt.

 - Security drivers - SELinux/AppArmour
 - CGroups

If it was AppArmour causing this slowdown I don't think you would have
been the first person to complain, so let's ignore that. Which leaves
cgroups as a likely culprit. Do a

  grep cgroup /proc/mounts

if any of them are mounted, then for each cgroups mount in turn,

 - Umount the cgroup
 - Restart libvirtd
 - Test your guest boot performance


Regards,
Daniel
-- 
|: http://berrange.com  -o-h

Re: How many threads should a kvm vm be starting?

2011-09-28 Thread Daniel P. Berrange
On Tue, Sep 27, 2011 at 04:04:41PM -0600, Thomas Fjellstrom wrote:
> On September 27, 2011, Avi Kivity wrote:
> > On 09/27/2011 03:29 AM, Thomas Fjellstrom wrote:
> > > I just noticed something interesting, a virtual machine on one of my
> > > servers seems to have 69 threads (including the main thread). Other
> > > guests on the machine only have a couple threads.
> > > 
> > > Is this normal? or has something gone horribly wrong?
> > 
> > It's normal if the guest does a lot of I/O.  The thread count should go
> > down when the guest idles.
> 
> Ah, that would make sense. Though it kind of defeats assigning a vm a single 
> cpu/core. A single VM can now DOS an entire multi-core-cpu server. It pretty 
> much pegged my dual core (with HT) server for a couple hours.

You can mitigate these problems by putting each KVM process in its own
cgroup, and using the 'cpu_shares' tunable to ensure that each KVM
process gets the same relative ratio of CPU time, regardless of how
many threads it is running. With newer kernels there are other CPU
tunables for placing hard caps on CPU utilization of the process as
a whole too.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html