Re: [kvm-devel] [ANNOUNCE] kvm-guest-drivers-windows-2

2008-05-15 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:
 FWIW, virtio-net is much better with my patches applied.

 The can_receive patches?

 Again, I'm not opposed to them in principle, I just think that if they 
 help that this points at a virtio deficiency.  Virtio should never 
 leave the rx queue empty.  Consider the case where the virtio queue 
 isn't tied to a socket buffer, but directly to hardware.

For RX performance:


right now
[  3]  0.0-10.0 sec  1016 MBytes852 Mbits/sec

revert tap hack
[  3]  0.0-10.0 sec564 MBytes473 Mbits/sec

all patches applied
[  3]  0.0-10.0 sec  1.17 GBytes  1.01 Gbits/sec

drop lots of packets
[  3]  0.0-10.0 sec  1.05 GBytes905 Mbits/sec


The last patch is not in my series but it basically makes the ring size 
512 and drops packets when we run out of descriptors.  That was to valid 
that we're not hiding a virtio deficiency.  The reason I want to buffer 
packets is that it avoids having to deal with tuning.   For 
vringfd/vmdq, we'll have to make sure to get the tuning right though.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] pinning, tsc and apic

2008-05-15 Thread Anthony Liguori
Chris Wright wrote:
 * Anthony Liguori ([EMAIL PROTECTED]) wrote:
   
  From a quick look, I suspect that the number of wildly off TSC 
 calibrations correspond to the VMs that are misbehaving.  I think this 
 may mean that we have to re-examine the tsc delta computation.

 10_serial.log:time.c: Detected 1995.038 MHz processor.
 11_serial.log:time.c: Detected 2363.195 MHz processor.
 12_serial.log:time.c: Detected 2492.675 MHz processor.
 13_serial.log:time.c: Detected 1995.061 MHz processor.
 14_serial.log:time.c: Detected 1994.917 MHz processor.
 15_serial.log:time.c: Detected 4100.735 MHz processor.
 16_serial.log:time.c: Detected 2075.800 MHz processor.
 17_serial.log:time.c: Detected 2674.350 MHz processor.
 18_serial.log:time.c: Detected 1995.002 MHz processor.
 19_serial.log:time.c: Detected 1994.978 MHz processor.
 1_serial.log:time.c: Detected 4384.310 MHz processor.
 

 Is this with pinning?  We at least know we're losing small bits on
 migration.  From my measurements it's ~3000 (outliers are 10-20k).

 Also, what happens if you roll back to kvm-userspace 7f5c4d15ece5?

 I'm using this:

 diff -up arch/x86/kvm/svm.c~svm arch/x86/kvm/svm.c
 --- arch/x86/kvm/svm.c~svm2008-04-16 19:49:44.0 -0700
 +++ arch/x86/kvm/svm.c2008-05-14 23:44:18.0 -0700
 @@ -621,6 +621,13 @@ static void svm_free_vcpu(struct kvm_vcp
   kmem_cache_free(kvm_vcpu_cache, svm);
  }
  
 +static void svm_tsc_update(void *arg)
 +{
 + struct vcpu_svm *svm = arg;
 + rdtscll(svm-vcpu.arch.host_tsc);
 +
 +}
 +
  static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
  {
   struct vcpu_svm *svm = to_svm(vcpu);
 @@ -633,6 +640,9 @@ static void svm_vcpu_load(struct kvm_vcp
* Make sure that the guest sees a monotonically
* increasing TSC.
*/
 + if (vcpu-cpu != -1)
 + smp_call_function_single(vcpu-cpu, svm_tsc_update,
 +  svm, 0, 1);
   

I like this approach because of its simplicity although the IPI is not 
wonderful.  I was also thinking of using cpu_clock() to take a timestamp 
on vcpu_put, then on vcpu_load, take another timestamp and use the 
cyc2ns conversion to try and estimate the elapsed tsc ticks on the new cpu.

Regards,

Anthony Liguori

   rdtscll(tsc_this);
   delta = vcpu-arch.host_tsc - tsc_this;
   svm-vmcb-control.tsc_offset += delta;

   


-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ANNOUNCE] kvm-guest-drivers-windows-2

2008-05-15 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:
 Avi Kivity wrote:
 Anthony Liguori wrote:
 FWIW, virtio-net is much better with my patches applied.

 The can_receive patches?

 Again, I'm not opposed to them in principle, I just think that if 
 they help that this points at a virtio deficiency.  Virtio should 
 never leave the rx queue empty.  Consider the case where the virtio 
 queue isn't tied to a socket buffer, but directly to hardware.

 For RX performance:


 right now
 [  3]  0.0-10.0 sec  1016 MBytes852 Mbits/sec

 revert tap hack
 [  3]  0.0-10.0 sec564 MBytes473 Mbits/sec

 all patches applied
 [  3]  0.0-10.0 sec  1.17 GBytes  1.01 Gbits/sec

 drop lots of packets
 [  3]  0.0-10.0 sec  1.05 GBytes905 Mbits/sec


 The last patch is not in my series but it basically makes the ring 
 size 512 and drops packets when we run out of descriptors.  That was 
 to valid that we're not hiding a virtio deficiency.  The reason I 
 want to buffer packets is that it avoids having to deal with 
 tuning.   For vringfd/vmdq, we'll have to make sure to get the tuning 
 right though.

 Okay; I'll apply the patches.  Hopefully we won't diverge too much 
 from upstream qemu.

I am going to push these upstream.  I need to finish the page_desc cache 
first b/c right now the version of virtio that could go into upstream 
QEMU has unacceptable performance for KVM.

Regards,

Anthony Liguori



-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [Qemu-devel] Re: [PATCH] Add support for a configuration file

2008-05-14 Thread Anthony Liguori
Fabrice Bellard wrote:
 Avi Kivity wrote:
 Fabrice Bellard wrote:

 I prefer:

 drive.file=foo.img
 drive.if=scsi


 That doesn't support multiple drives very well.

 Right, I realized it afterwards !

 I suggested it because my original plan for the configuration file was 
 based on this syntax with a strong inspiration from the OpenFirmware 
 device tree. The idea was that the object name (drive here) had no 
 hardcoded meaning, except for some predefined object names in order to 
 keep a kind of backward compatibility with the current QEMU options. 
 In order to create a new drive for example, you just have to do:

 mydrive.class=drive
 mydrive.if=scsi
 mydrive.file=abc.img

 the class field is used to select the device model. Then all the 
 other parameters are used to initialize the device model. That way it 
 is possible to keep the compatibility with the existing options and 
 add a provision to instanciate arbitrary new device models, such as:

I like this syntax primarily because it provides a means to associate 
arbitrary data with a VM.  It also provides a sane way to keep track of 
which device is which so that the config can be updated while the VM 
is running.  I'll update the patch.

Regards,

Anthony Liguori

 mynetworkcard.class=ne2000pci
 mynetworkcard.bus=1 # pci bus selection
 mynetworkcard.macaddr=00:01:02:03:04:05
 mynetworkcard.vlan=1

 I will strongly support configuration file formats having this property.

 Regards,

 Fabrice.



-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [Qemu-devel] Re: [PATCH] Add support for a configuration file

2008-05-14 Thread Anthony Liguori
Paul Brook wrote:
 the class field is used to select the device model. Then all the other
 parameters are used to initialize the device model. That way it is
 possible to keep the compatibility with the existing options and add a
 provision to instanciate arbitrary new device models, such as:
 

 I like the idea, but I'm not so keen on the automatic allocation. I generally 
 prefer explicit declaration over implicit things. The latter makes it very 
 easy to not notice when you make a typo.

 It sounds like what you really want is something similar to an OF device 
 tree.  
 So you have something like:

 # pciide0 may be an alias (possibly provided by qemu)
 # e.g. pci0.slot1.func1.ide
 alias hda ide0.primary.master
   

What I don't like about the ide0.primary.master syntax is that there 
isn't enough structure.  I would prefer:

alias hda ide,bus=0,primary,master

If you combine this with your magic variable idea, you could also do:

alias hda ide,bus=0,unit=$next

But you could also just fold that into Fabrice's syntax (which I prefer):

hda.class = ide,bus=0,unit=$next
hda.type = disk
hda.file = foo.img

If you then default bus, and unit, you can just write:

hda.class = ide
hda.type = disk
hda.file = foo.img

And better yet, you could even allow for something like:

hda.class = ide
hda.bus = 0
hda.unit = 0
hda.type = disk
hda.file = foo.img

So I really actually prefer Fabrice's syntax because there is much more 
structure.  I think it's a good idea to allow .class to contain multiple 
properties and to basically establish an alias.  This way, you could 
predefine a bunch of aliases for today's parameters like hda, hdb, etc.

 hda.type=disk
 hda.file=foo.img

 You can then define some form of magic aliases that select the next unused 
 device. e.g.

 alias mydrive $next_ide_disk

 IMHO This provides the flexibility and structure that Fabrice is talking 
 about, and with suitable aliases can be made to look a lot like the existing 
 options.

 This may require some internal restructuring to allow the machine 
 descriptions 
 to feed into the user config file.
   

I think if we choose a syntax we like, we can start using that pretty 
easily.  We'll have to start adjusting option names keeping them only 
valid on the command line (for things like -m).  However, I think it 
would grow pretty well into a machine description mechanism.

Regards,

Anthony Liguori

 Thoughts?

 Paul
   


-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [Qemu-devel] Re: [PATCH] Add support for a configuration file

2008-05-14 Thread Anthony Liguori
Dor Laor wrote:
 On Wed, 2008-05-14 at 17:41 +0300, Avi Kivity wrote:
   
 Please don't jump over me but I think it is worth mentioning OVF, at
 least for to know what's you opinions.

 Open Virtualization Format -
 http://www.vmware.com/appliances/learn/ovf.html

 It's xml based, supported by all major hypervisors, so qemu/kvm/xen
 users might eventually use a product that support OVF.
   

xml is a non-starter for QEMU.  We go out of our way to be portable.  
Having an XML dependency that could be satisfied on Windows and Linux 
would probably require more code to support the dependency than all of 
QEMU itself.


Regards,

Anthony Liguori

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ANNOUNCE] kvm-guest-drivers-windows-2

2008-05-14 Thread Anthony Liguori
Dor Laor wrote:
 On Wed, 2008-05-14 at 17:49 +0200, Tomasz Chmielewski wrote:
   
 Dor Laor schrieb:

 (...)

 
 - PV Windows (network driver)
 
 About 700Mb+-, there is currently extra copy that we need to omit.
 Thanks for Anthony, we just have to change the driver.

   
 - non-PV Windows
 
 What do you mean? Other fully emulated nics like e1000?
 It does not perform as pv but depending on the guest it can do up to
 600Mb+-.
   
 Just generally, how Windows PV drivers help to improve network performance.

 So, a PV network driver can do about 700Mb/s, and an emulated NIC can do 
 about 
 600 Mb/s, Windows guest to host?

 That would be about 20% improvement?
 

FWIW, virtio-net is much better with my patches applied.  The difference 
between the e1000 and virtio-net is that e1000 consumes almost twice as 
much CPU as virtio-net so in my testing, the performance improvement 
with virtio-net is about 2x.  We were loosing about 20-30% throughput 
because of the delays in handling incoming packets.

Regards,

Anthony LIguori

 

 It's work in progress, doing zero copy in the guest, adding TSO, using
 virtio'd tap will drastically boot performance. There is no reason the
 performance won't match Linux guest.
 Also I don't exactly remember the numbers but the gain in the tx pass is
 grater. 



 -
 This SF.net email is sponsored by: Microsoft 
 Defy all challenges. Microsoft(R) Visual Studio 2008. 
 http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   


-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] pinning, tsc and apic

2008-05-14 Thread Anthony Liguori
Marcelo Tosatti wrote:
 On Mon, May 12, 2008 at 02:19:24PM -0500, Ryan Harper wrote:
   
 Hi Ryan,

 There are two places that attempt to use delivery mode 7: kexec crash
 and io_apic_64.c::check_timer().

 The later will happen if the guest fails to receive PIT IRQ's for 10
 ticks. If you're using HZ=1000 thats 10ms. See timer_irq_works().

 The in-kernel pit emulation has logic which avoids injecting more than
 one IRQ during 10ms.

 Note that the guest 10ms delay is TSC based and uses only the lower
 32-bits of the value. It is quite likely that the TSC adjustment results
 in them increasing more rapidly then they should.
   

Or that the TSC is terribly miscalibrated.  Here is the log of all 48 
guests.  In this case, the host detects 1995.008 as the frequency.  This 
is a Barcelona.  I suspect that we're masking the 
X86_FEATURE_CONSTANT_TSC though.

 From a quick look, I suspect that the number of wildly off TSC 
calibrations correspond to the VMs that are misbehaving.  I think this 
may mean that we have to re-examine the tsc delta computation.

10_serial.log:time.c: Detected 1995.038 MHz processor.
11_serial.log:time.c: Detected 2363.195 MHz processor.
12_serial.log:time.c: Detected 2492.675 MHz processor.
13_serial.log:time.c: Detected 1995.061 MHz processor.
14_serial.log:time.c: Detected 1994.917 MHz processor.
15_serial.log:time.c: Detected 4100.735 MHz processor.
16_serial.log:time.c: Detected 2075.800 MHz processor.
17_serial.log:time.c: Detected 2674.350 MHz processor.
18_serial.log:time.c: Detected 1995.002 MHz processor.
19_serial.log:time.c: Detected 1994.978 MHz processor.
1_serial.log:time.c: Detected 4384.310 MHz processor.
20_serial.log:time.c: Detected 1994.969 MHz processor.
21_serial.log:time.c: Detected 3670.696 MHz processor.
22_serial.log:time.c: Detected 1994.997 MHz processor.
23_serial.log:time.c: Detected 2218.613 MHz processor.
24_serial.log:time.c: Detected 1995.048 MHz processor.
25_serial.log:time.c: Detected 1995.015 MHz processor.
26_serial.log:time.c: Detected 1994.957 MHz processor.
27_serial.log:time.c: Detected 1995.051 MHz processor.
28_serial.log:time.c: Detected 1995.021 MHz processor.
29_serial.log:time.c: Detected 3679.640 MHz processor.
2_serial.log:time.c: Detected 2191.105 MHz processor.
30_serial.log:time.c: Detected 1995.086 MHz processor.
31_serial.log:time.c: Detected 1995.071 MHz processor.
32_serial.log:time.c: Detected 1995.051 MHz processor.
33_serial.log:time.c: Detected 2331.760 MHz processor.
34_serial.log:time.c: Detected 1995.011 MHz processor.
35_serial.log:time.c: Detected 1995.050 MHz processor.
36_serial.log:time.c: Detected 1994.911 MHz processor.
37_serial.log:time.c: Detected 1994.905 MHz processor.
38_serial.log:time.c: Detected 1994.881 MHz processor.
39_serial.log:time.c: Detected 1995.027 MHz processor.
3_serial.log:time.c: Detected 2051.467 MHz processor.
40_serial.log:time.c: Detected 1994.987 MHz processor.
41_serial.log:time.c: Detected 1994.970 MHz processor.
42_serial.log:time.c: Detected 1994.952 MHz processor.
43_serial.log:time.c: Detected 1995.042 MHz processor.
44_serial.log:time.c: Detected 1994.998 MHz processor.
45_serial.log:time.c: Detected 1995.016 MHz processor.
46_serial.log:time.c: Detected 1995.006 MHz processor.
47_serial.log:time.c: Detected 1995.000 MHz processor.
4_serial.log:time.c: Detected 1995.112 MHz processor.
5_serial.log:time.c: Detected 1995.081 MHz processor.
6_serial.log:time.c: Detected 2017.303 MHz processor.
7_serial.log:time.c: Detected 1995.046 MHz processor.
8_serial.log:time.c: Detected 1994.951 MHz processor.
9_serial.log:time.c: Detected 2184.754 MHz processor.

Regards,

Anthony Liguori

 So can you try setting KVM_MAX_PIT_INTR_INTERVAL to a lower value? HZ/10
 or something.

 You can confirm this theory by booting the guests with apic=debug.

 -
 This SF.net email is sponsored by: Microsoft 
 Defy all challenges. Microsoft(R) Visual Studio 2008. 
 http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   


-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 3/4] Use fragmented send for virtio

2008-05-13 Thread Anthony Liguori
This patch converts virtio-net to use the new fragmented send interface.
We should have always supported this.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 85cc9d2..93bca1d 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -239,23 +239,15 @@ again:
 static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
 VirtQueueElement elem;
-int count = 0;
 
 if (!(n-vdev.status  VIRTIO_CONFIG_S_DRIVER_OK))
 return;
 
 while (virtqueue_pop(vq, elem)) {
-   int i;
-   size_t len = 0;
+   ssize_t len = 0;
 
/* ignore the header for now */
-   for (i = 1; i  elem.out_num; i++) {
-   qemu_send_packet(n-vc, elem.out_sg[i].iov_base,
-elem.out_sg[i].iov_len);
-   len += elem.out_sg[i].iov_len;
-   }
-
-   count++;
+   len = qemu_sendv_packet(n-vc, elem.out_sg[1], elem.out_num - 1);
 
virtqueue_push(vq, elem, sizeof(struct virtio_net_hdr) + len);
virtio_notify(n-vdev, vq);

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 2/4] Add fd_readv handler to tap

2008-05-13 Thread Anthony Liguori
This allows fragmented packets to be sent with no additional copies over the
tap interface.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/vl.c b/qemu/vl.c
index 1f0a6ac..7900b76 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -4086,6 +4086,19 @@ static void tap_receive(void *opaque, const uint8_t 
*buf, int size)
 }
 }
 
+static ssize_t tap_readv(void *opaque, const struct iovec *iov,
+int iovcnt)
+{
+TAPState *s = opaque;
+ssize_t len;
+
+do {
+   len = writev(s-fd, iov, iovcnt);
+} while (len == -1  (errno == EINTR || errno == EAGAIN));
+
+return len;
+}
+
 static void tap_send(void *opaque)
 {
 TAPState *s = opaque;
@@ -4135,6 +4148,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
 s-no_poll = 0;
 enable_sigio_timer(fd);
 s-vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s);
+s-vc-fd_readv = tap_readv;
 qemu_set_fd_handler2(s-fd, tap_read_poll, tap_send, NULL, s);
 snprintf(s-vc-info_str, sizeof(s-vc-info_str), tap: fd=%d, fd);
 return s;

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 1/4] Add method to send fragmented packets

2008-05-13 Thread Anthony Liguori
We need to be able to send fragmented packets in KVM to avoid an extra copy
in the TX path.  This patch adds a qemu_sendv_packet() function to send
fragemented packets.  It also provides backwards compatibility for old clients
that don't support the new interface.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..5fb190e 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -1,12 +1,17 @@
 #ifndef QEMU_NET_H
 #define QEMU_NET_H
 
+#include sys/uio.h
+
 /* VLANs support */
 
+typedef ssize_t (IOReadvHandler)(void *, const struct iovec *, int);
+
 typedef struct VLANClientState VLANClientState;
 
 struct VLANClientState {
 IOReadHandler *fd_read;
+IOReadvHandler *fd_readv;
 /* Packets may still be sent if this returns zero.  It's used to
rate-limit the slirp code.  */
 IOCanRWHandler *fd_can_read;
@@ -30,6 +35,8 @@ VLANClientState *qemu_new_vlan_client(VLANState *vlan,
   void *opaque);
 int qemu_can_send_packet(VLANClientState *vc);
 void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
+ssize_t qemu_sendv_packet(VLANClientState *vc, const struct iovec *iov,
+ int iovcnt);
 void qemu_handler_true(void *opaque);
 
 void do_info_network(void);
diff --git a/qemu/vl.c b/qemu/vl.c
index 45c97af..1f0a6ac 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -3820,6 +3820,50 @@ void qemu_send_packet(VLANClientState *vc1, const 
uint8_t *buf, int size)
 }
 }
 
+static ssize_t vc_sendv_compat(VLANClientState *vc, const struct iovec *iov,
+  int iovcnt)
+{
+char buffer[4096];
+size_t offset = 0;
+int i;
+
+for (i = 0; i  iovcnt; i++) {
+   size_t len;
+
+   len = MIN(sizeof(buffer) - offset, iov[i].iov_len);
+   memcpy(buffer + offset, iov[i].iov_base, len);
+   offset += len;
+}
+
+vc-fd_read(vc-opaque, buffer, offset);
+
+return offset;
+}
+
+ssize_t qemu_sendv_packet(VLANClientState *vc1, const struct iovec *iov,
+ int iovcnt)
+{
+VLANState *vlan = vc1-vlan;
+VLANClientState *vc;
+ssize_t max_len = 0;
+
+for (vc = vlan-first_client; vc != NULL; vc = vc-next) {
+   ssize_t len = 0;
+
+   if (vc == vc1)
+   continue;
+
+   if (vc-fd_readv)
+   len = vc-fd_readv(vc-opaque, iov, iovcnt);
+   else if (vc-fd_read)
+   len = vc_sendv_compat(vc, iov, iovcnt);
+
+   max_len = MAX(max_len, len);
+}
+
+return max_len;
+}
+
 #if defined(CONFIG_SLIRP)
 
 /* slirp network adapter */

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH] Add support for a configuration file

2008-05-13 Thread Anthony Liguori
There has been an awful lot of discussion about a configuration file with
almost no general agreement except that one is strongly desired.

I thought about it a bit, and I think a nice step would be to simply allow
the current configuration parameters to be stored in a file using a pretty
familiar format.

I think this is pretty useful as-is.  I think it also gives us a reasonable
way to move forward that will keep everyone pretty happy.

Here's a short example:

qemu-system-x86_64 -hda ~/images/linux.img -snapshot -vnc :2

Would become `foo.qemu':

 # Main disk image
 hda=/home/anthony/images/linux.img

 # Redirect disk writes to a temporary image
 snapshot

 # Make the graphical display available on port 5902
 vnc=:2

With:

qemu-system-x86_64 -config foo.qemu

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu-doc.texi b/qemu-doc.texi
index cca483c..4861fc0 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -395,6 +395,12 @@ Sets the @var{name} of the guest.
 This name will be display in the SDL window caption.
 The @var{name} will also be used for the VNC server.
 
[EMAIL PROTECTED] -config @var{file}
+Reads configuration options from @var{file}.  The format of @var{file} is the
+same as the command line options, except no dash ``-'' is required.  Options
+that take an argument are in the format @var{option=value}.  A pound ``#''
+character can be used as a comment.
+
 @end table
 
 Display options:
diff --git a/vl.c b/vl.c
index 67712f0..2eb39dd 100644
--- a/vl.c
+++ b/vl.c
@@ -7276,6 +7276,7 @@ static void help(int exitcode)
-clock  force the use of the given methods for timer 
alarm.\n
To see what timers are available use -clock ?\n
-startdate  select initial date of the clock\n
+  -config FILEread command line options from FILE\n
\n
During emulation, the following keys are useful:\n
ctrl-alt-f  toggle full screen\n
@@ -7379,6 +7380,7 @@ enum {
 QEMU_OPTION_old_param,
 QEMU_OPTION_clock,
 QEMU_OPTION_startdate,
+QEMU_OPTION_config,
 };
 
 typedef struct QEMUOption {
@@ -7490,6 +7492,7 @@ const QEMUOption qemu_options[] = {
 #endif
 { clock, HAS_ARG, QEMU_OPTION_clock },
 { startdate, HAS_ARG, QEMU_OPTION_startdate },
+{ config, HAS_ARG, QEMU_OPTION_config },
 { NULL },
 };
 
@@ -7665,9 +7668,106 @@ static BOOL WINAPI qemu_ctrl_handler(DWORD type)
 }
 #endif
 
+static char **insert_opts(char **old_argv, int *old_argc, int index,
+ char **argv, int argc)
+{
+char **new_argv;
+
+/* Allocate larger array */
+new_argv = realloc(old_argv, (*old_argc + argc) * sizeof(old_argv[0]));
+if (new_argv == NULL) {
+   fprintf(stderr, allocate failed in insert_opts\n);
+   exit(1);
+}
+
+/* move elements after insertion point to end of array */
+memmove(new_argv+index + argc, new_argv + index,
+   (*old_argc - index) * sizeof(argv[0]));
+
+/* copy in new elements */
+memcpy(new_argv + index, argv, argc * sizeof(argv[0]));
+
+*old_argc += argc;
+
+if (0) { /* for debugging */
+   int i;
+   printf(argv[] = {);
+   for (i = 0; i  *old_argc; i++) {
+   if (i)
+   printf(, );
+   printf(\%s\, new_argv[i]);
+   }
+   printf(}\n);
+}
+
+return new_argv;
+}
+
+static char **parse_config_file(const char *file, int *pargc)
+{
+FILE *f;
+char buffer[4096];
+char **argv = NULL;
+int argc = 0;
+
+f = fopen(file, r);
+if (f == NULL)
+   return NULL;
+
+while (fgets(buffer, sizeof(buffer), f)) {
+   char *ptr = buffer;
+   char *tok, *key, *val;
+   char *targv[2];
+   int targc = 0;
+
+   /* skip whitespace */
+   while (isspace(*ptr)) ptr++;
+
+   /* skip comments or empty lines */
+   if (*ptr == '#' || *ptr == 0)
+   continue;
+
+   /* trim new line and carriage return if necessary */
+   tok = strchr(ptr, '\n');
+   if (tok)
+   *tok = 0;
+   tok = strchr(ptr, '\r');
+   if (tok)
+   *tok = 0;
+
+   /* check if it has an argument */
+   tok = strchr(ptr, '=');
+   if (tok)
+   *tok = 0;
+
+   /* add key */
+   if (asprintf(key, --%s, ptr) == -1)
+   return NULL;
+   targv[targc++] = key;
+
+   /* add argument (optionally) */
+   if (tok) {
+   if (asprintf(val, %s, tok + 1) == -1)
+   return NULL;
+   targv[targc++] = val;
+   }
+
+   /* insert new arguments */
+   argv = insert_opts(argv, argc, argc, targv, targc);
+   if (argv == NULL)
+   return NULL;
+}
+
+fclose(f);
+
+*pargc = argc;
+
+return argv;
+}
+
 #define MAX_NET_CLIENTS 32
 
-int main(int argc, char **argv)
+int main(int orig_argc, char **orig_argv)
 {
 #ifdef CONFIG_GDBSTUB
 int use_gdbstub;
@@ -7700,6 +7800,10 @@ int main(int argc

Re: [kvm-devel] [PATCH] Add support for a configuration file

2008-05-13 Thread Anthony Liguori
Anthony Liguori wrote:
 I think this is pretty useful as-is.  I think it also gives us a reasonable
 way to move forward that will keep everyone pretty happy.
 
 Here's a short example:
 
 qemu-system-x86_64 -hda ~/images/linux.img -snapshot -vnc :2
 
 Would become `foo.qemu':
 
  # Main disk image
  hda=/home/anthony/images/linux.img
 
  # Redirect disk writes to a temporary image
  snapshot
 
  # Make the graphical display available on port 5902
  vnc=:2
 
 With:
 
 qemu-system-x86_64 -config foo.qemu

One thought I had, is that it would be very nice to break up the -drive 
file=foo.img,if=scsi syntax within the config file.  In general, I'm 
thinking something like:

[drive]
file=foo.img
if=scsi

or:

drive {
   file=foo.img
   if=scsi
}

or even:

drive:
   file=foo.img
   if=scsi

Basically, I'm looking for a syntax for sub-options.  This would be 
useful for -drive or -net but I also think would lay the foundations for 
specifying a full machine config.

It would get very unwieldy on the command line to have a large number of 
suboptions but it works reasonably well within a config.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] qemu-kvm: Consolidate kvm_eat_signals

2008-05-12 Thread Anthony Liguori
Avi Kivity wrote:
 Jan Kiszka wrote:
   
 Given that with the iothread we spend very little time processing
 signals in vcpu threads, maybe it's better to drop the loop completely. 
 The common case is zero or one pending signals.  The uncommon case of
 two or more pending signals will be handled by the KVM_RUN ioctl
 returning immediately with -EINTR (i.e. in the outer loop).

 
   
 You mean

 static void kvm_main_loop_wait(CPUState *env, int timeout)
 {
 pthread_mutex_unlock(qemu_mutex);
 kvm_eat_signal(env, timeout);
 pthread_mutex_lock(qemu_mutex);
 cpu_single_env = env;

 vcpu_info[env-cpu_index].signalled = 0;
 }

 ?
   
 

 Yes.  The loop was a (perhaps premature) optimization that is now 
 totally unnecessary, unless I'm missing something quite large.
   

It used to be that kvm_eat_signal() selected after consuming as many 
signals as possible while only sleeping once.  That's why there's a 
combination of sleeping and polling.

Now the VCPU threads never select so the whole loop can be simplified to 
a single sigtimedwait() that always blocks.

In reality, I don't think sigtimedwait() is really needed/useful for 
VCPUs anymore.  We only use it to catch SIG_IPI and we only use SIG_IPI 
to break out of sleeping.  I don't see any reason why we couldn't switch 
over to using a file descriptor for notification (or a pthread 
condition).  In the very least, we could just select() on nothing and 
allow SIG_IPI to break us out of the select.

Regards,

Anthony Liguori

 Oh.  There used to be a bug where we didn't check for a pending signal 
 before the first guest entry, so this would add a lot of latency 
 (effectively making the bug window much larger).  That was only closed 
 in 2.6.24 (by 7e66f350).

   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] qemu-kvm: Consolidate kvm_eat_signals

2008-05-12 Thread Anthony Liguori
Avi Kivity wrote:
 Now the VCPU threads never select so the whole loop can be simplified 
 to a single sigtimedwait() that always blocks.

 In reality, I don't think sigtimedwait() is really needed/useful for 
 VCPUs anymore.  We only use it to catch SIG_IPI and we only use 
 SIG_IPI to break out of sleeping.  I don't see any reason why we 
 couldn't switch over to using a file descriptor for notification (or 
 a pthread condition).

 How would you stop a vcpu running in guest mode?

Yeah, I forgot about that.

 In the very least, we could just select() on nothing and allow 
 SIG_IPI to break us out of the select.

 sigtimedwait() (or just sigwait, now) doesn't require the signal to be 
 delivered, so it's faster.

Yeah, sigtimedwait() is probably the right thing to do since we have to 
use a signal for IPI.

Regards,

Anthony Liguori

 If there's nothing to select, why call select()?



-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] pinning, tsc and apic

2008-05-12 Thread Anthony Liguori
Ryan Harper wrote:
 I've been digging into some of the instability we see when running
 larger numbers of guests at the same time.  The test I'm currently using
 involves launching 64 1vcpu guests on an 8-way AMD box.

Note this is a Barcelona system and therefore should have a 
fixed-frequency TSC.

   With the latest
 kvm-userspace git and kvm.git + Gerd's kvmclock fixes, I can launch all
 64 of these 1 second apart,

BTW, what if you don't pace-out the startups?  Do we still have issues 
with that?

  and only a handful (1 to 3)  end up not
 making it up.  In dmesg on the host, I get a couple messages:

 [321365.362534] vcpu not ready for apic_round_robin

 and 

 [321503.023788] Unsupported delivery mode 7

 Now, the interesting bit for me was when I used numactl to pin the guest
 to a processor, all of the guests come up with no issues at all.  As I
 looked into it, it means that we're not running any of the vcpu
 migration code which on svm is comprised of tsc_offset recalibration and
 apic migration, and on vmx, a little more per-vcpu work
   

Another data point is that -no-kvm-irqchip doesn't make the situation 
better.

 I've convinced myself that svm.c's tsc offset calculation works and
 handles the migration from cpu to cpu quite well.  I added the following
 snippet to trigger if we ever encountered the case where we migrated to
 a tsc that was behind:

 rdtscll(tsc_this);
 delta = vcpu-arch.host_tsc - tsc_this;
 old_time = vcpu-arch.host_tsc + svm-vmcb-control.tsc_offset;
 new_time = tsc_this + svm-vmcb-control.tsc_offset + delta;
 if (new_time  old_time) {
 printk(KERN_ERR ACK! (CPU%d-CPU%d) time goes back %llu\n,
vcpu-cpu, cpu, old_time - new_time);
 }
 svm-vmcb-control.tsc_offset += delta;
   

Time will never go backwards, but what can happen is that the TSC 
frequency will slow down.  This is because upon VCPU migration, we don't 
account for the time between vcpu_put on the old processor and vcpu_load 
on the new processor.  This time then disappears.

A possible way to fix this (that's only valid on a processor with a 
fixed-frequency TSC), is to take a high-res timestamp on vcpu_put, and 
then on vcpu_load, take the delta timestamp since the old TSC was saved, 
and use the TSC frequency on the new pcpu to calculate the number of 
elapsed cycles.

Assuming a fixed frequency TSC, and a calibrated TSC across all 
processors, you could get the same affects by using the VT tsc delta 
logic.  Basically, it always uses the new CPU's TSC unless that would 
cause the guest to move backwards in time.  As long as you have a 
stable, calibrated TSC, this would work out.

Can you try your old patch that did this and see if it fixes the problem?

 Noting that vcpu-arch.host_tsc is the tsc of the previous cpu the vcpu
 was running on (see svm_put_vcpu()).  This allows me to check if we are
 in fact increasing the guest's view of the tsc.  I've not be able to
 trigger this at all when the vcpus are migrating.

 As for the apic, the migrate code seems to be rather simple, but I've
 not yet dived in to see if we've got anything racy in there:

 lapic.c:
 void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
 {
 struct kvm_lapic *apic = vcpu-arch.apic;
 struct hrtimer *timer;

 if (!apic)
 return;

 timer = apic-timer.dev;
 if (hrtimer_cancel(timer))
 hrtimer_start(timer, timer-expires, HRTIMER_MODE_ABS);
 }

   

There's a big FIXME in the __apic_timer_fn() to make sure the timer runs 
on the current pCPU.  As written, it's possible for the timer to 
happen on a different pcpu as the current vcpu's but it wasn't obvious 
to me that it would cause problems.  Eddie, et al: Care to elaborate on 
what the TODO was trying to address?

Regards,

Anthony Liguori

 Ryan Harper
 Software Engineer; Linux Technology Center
 IBM Corp., Austin, Tx
 (512) 838-9253   T/L: 678-9253
 [EMAIL PROTECTED]

 -
 This SF.net email is sponsored by: Microsoft 
 Defy all challenges. Microsoft(R) Visual Studio 2008. 
 http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   


-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 5/5] Stop dropping so many RX packets in tap (v3)

2008-05-11 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:
 Normally, tap always reads packets and simply lets the client drop 
 them if it
 cannot receive them.  For virtio-net, this results in massive packet 
 loss and
 about an 80% performance loss in TCP throughput.

 This patch modifies qemu_send_packet() to only deliver a packet to a 
 VLAN
 client if it doesn't have a fd_can_read method or the fd_can_read method
 indicates that it can receive packets.  We also return a status of 
 whether
 any clients were able to receive the packet.

 If no clients were able to receive a packet, we buffer the packet 
 until a
 client indicates that it can receive packets again.

 This patch also modifies the tap code to only read from the tap fd if 
 at least
 one client on the VLAN is able to receive a packet.

 Finally, this patch changes the tap code to drain all possible 
 packets from
 the tap device when the tap fd is readable.

  
  #if defined(CONFIG_SLIRP)
 @@ -3970,6 +3976,8 @@ typedef struct TAPState {
  VLANClientState *vc;
  int fd;
  char down_script[1024];
 +char buf[4096];
 +int size;
  } TAPState;
   

 This breaks large MTUs.

They've always been broken for tap.

 How about the other way round: when the vlan consumer detects it can 
 no longer receive packets, it tells that to the vlan.  When all vlan 
 consumers can no longer receive, tell the producer to stop producing.  
 For the tap producer, this is simply removing its fd from the read 
 poll list.  When a vlan consumer becomes ready to receive again, it 
 tells the vlan, which tells the producers, which then install their 
 fds back again.

Yeah, that's a nice idea.   I'll think about it.  I don't know if it's 
really worth doing as an intermediate step though.  What I'd really like 
to do is have a vlan interface where consumers published all of their 
receive buffers.  Then there's no need for notifications of receive-ability.

Regards,

Anthony Liguori

 This is  a bit difficult since virtio and tap are both consumers and 
 producers, but could be made to work, I think.




-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 5/5] Stop dropping so many RX packets in tap (v3)

2008-05-11 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:
   
 How about the other way round: when the vlan consumer detects it can 
 no longer receive packets, it tells that to the vlan.  When all vlan 
 consumers can no longer receive, tell the producer to stop 
 producing.  For the tap producer, this is simply removing its fd from 
 the read poll list.  When a vlan consumer becomes ready to receive 
 again, it tells the vlan, which tells the producers, which then 
 install their fds back again.
   
 Yeah, that's a nice idea.   I'll think about it.  I don't know if it's 
 really worth doing as an intermediate step though.  What I'd really 
 like to do is have a vlan interface where consumers published all of 
 their receive buffers.  Then there's no need for notifications of 
 receive-ability.
 

 That's definitely better, and is also more multiqueue nic / vringfd 
 friendly.

 I still think interrupt-on-halfway-mark is needed much more urgently.  
 It deals with concurrency much better:
   

We already sort of do this.  In try_fill_recv() in virtio-net.c, we try 
to allocate as many skbs as we can to fill the rx queue.  We keep track 
of how many we've been able to allocate.  Whenever we process an RX 
packet, we check to see if the current number of RX packets is less than 
half the maximum number of rx packets we've been able to allocate.

In the common case of small queues, this should have exactly the 
behavior you describe.  We don't currently suppress the RX notification 
even though we really could.  The can_receive changes are really the key 
to this.  Unless we are suppressing tap fd select()'ing, we can always 
suppress the RX notification.  That's been sitting on my TODO for a bit.

 rx:
   host interrupts guest on halfway mark
   guest starts processing packets
   host keeps delivering

 tx:
   guest kicks host on halfway mark
   host starts processing packets
   second vcpu on guest keeps on queueing
   

I'm not convinced it's at all practical to suppress notifications in the 
front-end.  We simply don't know whether we'll get more packets which 
means we have to do TX mitigation within the front-end.  We've been 
there, it's not as nice as doing it in the back-end.

What we really ought to do in the back-end though, is start processing 
the TX packets as soon as we begin to do TX mitigation.  This would 
address the ping latency issue while also increasing throughput 
(hopefully).  One thing I've wanted to try is to register a bottom-half 
or a polling function so that the IO thread was always trying to process 
TX packets while the TX timer is active.

Regards,

Anthony Liguori

 It's also much better with multiqueue NICs, where there's no socket 
 buffer to hold the packets while we're out of descriptors.

   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 1/5] Support more than 3.5GB with virtio (v3)

2008-05-09 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:
 We're pretty sloppy in virtio right now about phys_ram_base 
 assumptions.  This
 patch is an incremental step between what we have today and a full 
 blown DMA
 API.  I backported the DMA API but the performance impact was not 
 acceptable
 to me. There's only a slight performance impact with this particular 
 patch.

 Since we're no longer assuming guest physical memory is contiguous, 
 we need
 a more complex way to validate the memory regions than just checking 
 if it's
 within ram_size.
   

 Applied patches 1-2.  Since patch 4 is under contention on qemu-devel, 
 and 3 and 5 depend on it, I'd like to get the can_receive semantic 
 change accepted first.

I'll send it upstream, but I think it's much less of a divergence than 
the current virtio_net_poll hacks.

Regards,

Anthony Liguori


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] qemu-kvm: fix guest resetting

2008-05-09 Thread Anthony Liguori
Marcelo Tosatti wrote:
 diff --git a/qemu/qemu-kvm-ia64.c b/qemu/qemu-kvm-ia64.c
 index 4d0ddd7..d227d22 100644
 --- a/qemu/qemu-kvm-ia64.c
 +++ b/qemu/qemu-kvm-ia64.c
 @@ -61,3 +61,7 @@ int kvm_arch_try_push_interrupts(void *opaque)
  void kvm_arch_update_regs_for_sipi(CPUState *env)
  {
  }
 +
 +void kvm_arch_cpu_reset(CPUState *env)
 +{
 +}
 diff --git a/qemu/qemu-kvm-powerpc.c b/qemu/qemu-kvm-powerpc.c
 index 14ed945..024b18c 100644
 --- a/qemu/qemu-kvm-powerpc.c
 +++ b/qemu/qemu-kvm-powerpc.c
 @@ -213,3 +213,7 @@ int handle_powerpc_dcr_write(int vcpu, uint32_t dcrn, 
 uint32_t data)
  
  return 0; /* XXX ignore failed DCR ops */
  }
 +
 +void kvm_arch_cpu_reset(CPUState *env)
 +{
 +}
 diff --git a/qemu/qemu-kvm-x86.c b/qemu/qemu-kvm-x86.c
 index 9a771ff..28eb5c2 100644
 --- a/qemu/qemu-kvm-x86.c
 +++ b/qemu/qemu-kvm-x86.c
 @@ -689,3 +689,19 @@ int handle_tpr_access(void *opaque, int vcpu,
  kvm_tpr_access_report(cpu_single_env, rip, is_write);
  return 0;
  }
 +
 +void kvm_arch_cpu_reset(CPUState *env)
 +{
 +struct kvm_mp_state mp_state = { .mp_state = KVM_MP_STATE_UNINITIALIZED 
 };
 +
 +kvm_arch_load_regs(env);
 +if (env-cpu_index != 0) {
 + if (kvm_irqchip_in_kernel(kvm_context))
 + kvm_set_mpstate(kvm_context, env-cpu_index, mp_state);
 + else {
 + env-interrupt_request = ~CPU_INTERRUPT_HARD;
 + env-hflags |= HF_HALTED_MASK;
 + env-exception_index = EXCP_HLT;
 + }
 +}
 +}
 diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
 index 3cc6d8e..1e1f065 100644
 --- a/qemu/qemu-kvm.c
 +++ b/qemu/qemu-kvm.c
 @@ -31,8 +31,6 @@ kvm_context_t kvm_context;
  
  extern int smp_cpus;
  
 -static int qemu_kvm_reset_requested;
 -
  pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER;
  pthread_cond_t qemu_vcpu_cond = PTHREAD_COND_INITIALIZER;
 @@ -52,7 +50,6 @@ struct vcpu_info {
  int signalled;
  int stop;
  int stopped;
 -int reload_regs;
  int created;
  } vcpu_info[256];
  
 @@ -242,21 +239,29 @@ static void pause_all_threads(void)
  {
  int i;
  
 +if (cpu_single_env) {
 +fprintf(stderr, qemu-kvm: pause_all_threads from vcpu context\n);
 +exit(0);
 +}
 +
  for (i = 0; i  smp_cpus; ++i) {
   vcpu_info[i].stop = 1;
   pthread_kill(vcpu_info[i].thread, SIG_IPI);
  }
 -while (!all_threads_paused()) {
 - CPUState *env = cpu_single_env;
 +while (!all_threads_paused())
   pthread_cond_wait(qemu_pause_cond, qemu_mutex);
 - cpu_single_env = env;
 -}
 +cpu_single_env = NULL;
  }
   

Personally, I prefer it the old way.  All of the open-coded 
cpu_single_env's are tough to understand and I believe error-prone.  I 
think a strategy of explicitly preserving cpu_single_env whenever we 
drop qemu_mutex is more robust.  Explicitly setting cpu_single_env = 
NULL happens to work because this is only called from the io thread.  
It's less clear to a casual reader why it's necessary.

In fact, I'd be much more inclined to see a wrapper around 
pthread_cond_wait() so that we never explicitly had to set cpu_single_env.

Regards,

Anthony Liguori

  }
  
  void qemu_system_shutdown_request(void)

 -
 This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
 Don't miss this year's exciting event. There's still time to save $100. 
 Use priority code J8TL2D2. 
 http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Fix e1000 can_receive handler

2008-05-08 Thread Anthony Liguori
Aurelien Jarno wrote:
 On Wed, May 07, 2008 at 04:40:58PM -0500, Anthony Liguori wrote:
   
 The current logic of the can_receive handler is to allow packets whenever the
 receiver is disabled or when there are descriptors available in the ring.

 I think the logic ought to be to allow packets whenever the receiver is 
 enabled
 and there are descriptors available in the ring.
 

 The current behaviour is actually correct, this is the way QEMU works:
 when the card is stopped, it should always accept packets, and then
 discard them.
   

The previous patches in my virtio series change that behavior.  Before 
delivering a packet to a VLAN, it checks to see if any of the VLAN 
clients are able to receive a packet.

This is very important for achieving good performance with tap.  With 
virtio-net, we were dropping a ton of packets with tap because there 
weren't descriptors available on the RX ring.

I plan to submit that behavioral change to QEMU upstream along with the 
virtio drivers.  I'm still optimizing phys_page_find() though.  The 
performance impact of switching the ring manipulation to using the 
stl_phys accessors is unacceptable for KVM.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Fix e1000 can_receive handler

2008-05-08 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:
 Aurelien Jarno wrote:
 On Wed, May 07, 2008 at 04:40:58PM -0500, Anthony Liguori wrote:
  
 The current logic of the can_receive handler is to allow packets 
 whenever the
 receiver is disabled or when there are descriptors available in the 
 ring.

 I think the logic ought to be to allow packets whenever the 
 receiver is enabled
 and there are descriptors available in the ring.
 

 The current behaviour is actually correct, this is the way QEMU works:
 when the card is stopped, it should always accept packets, and then
 discard them.
   

 The previous patches in my virtio series change that behavior.  
 Before delivering a packet to a VLAN, it checks to see if any of the 
 VLAN clients are able to receive a packet.

 This is very important for achieving good performance with tap.  With 
 virtio-net, we were dropping a ton of packets with tap because there 
 weren't descriptors available on the RX ring.

 I plan to submit that behavioral change to QEMU upstream along with 
 the virtio drivers.  I'm still optimizing phys_page_find() though.  
 The performance impact of switching the ring manipulation to using 
 the stl_phys accessors is unacceptable for KVM.

 I think this indicates a virtio tuning problem.  The rx queue should 
 never be empty on normal operation, ever.  Signalling on rx queue 
 empty invites the overrun since it takes time for the interrupt to be 
 delivered and for the guest to refill the queue.

Well, to start with, the e1000_can_receive handler is just plan wrong.  
The logic is broken.  This hasn't caused an issue because the code isn't 
used.

That said, it is possible to tune virtio to get back the performance 
lose of dropping packets.  We lose about 20% of iperf from dropped 
packets ATM.  If we bump the ring size up to 512 we get it back.  If we 
bump it to 1024, we start loosing again.  It's much less reliably than 
doing flow control though.

 Long term we need to do this dynamically but we can start with 
 signalling on rx queue half empty (or half full if you're an 
 optimist).  We further need to tune the queue size so that this 
 results in an acceptable number of interrupts:

Part of the trouble with the embedded scatter gather list is that it's 
not at all clear what half empty is unless you count all of the 
descriptors.  There may be one giant descriptor, or many small ones.

 1 Gbps = 83K pps = 83 entries per half queue @ 1 KHz interrupt rate

 So we need 166 entries on the queue to keep a moderate interrupt rate, 
 if we change it to signal at the halfway mark.

 Note that flow control still makes sense since it allows us to buffer 
 some packets if the guest is scheduled out.  But we can't use it as 
 the primary mechanism since it won't exist with multiqueue NICs (where 
 the virtio descriptors are fed to driver).

Yes, we also need a better mechanism with vringfd().  I've been thinking 
about how to structure this API within QEMU but it's still not clear to 
me.  Flow control seems to make sense though with the given API.

Regards,

Anthony Liguori

 Similar reasoning probably applied to tx.



-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 2/3] virtio-net save/restore support

2008-05-08 Thread Anthony Liguori
The only interesting bit here is that we have to ensure that we rearm the
timer if necessary.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index d15c2f4..5fe66ac 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -207,9 +207,40 @@ static void virtio_net_tx_timer(void *opaque)
 virtio_net_flush_tx(n, n-tx_vq);
 }
 
+static void virtio_net_save(QEMUFile *f, void *opaque)
+{
+VirtIONet *n = opaque;
+
+virtio_save(n-vdev, f);
+
+qemu_put_buffer(f, n-mac, 6);
+qemu_put_be32(f, n-tx_timer_active);
+}
+
+static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
+{
+VirtIONet *n = opaque;
+
+if (version_id != 1)
+   return -EINVAL;
+
+virtio_load(n-vdev, f);
+
+qemu_get_buffer(f, n-mac, 6);
+n-tx_timer_active = qemu_get_be32(f);
+
+if (n-tx_timer_active) {
+   qemu_mod_timer(n-tx_timer,
+  qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
+}
+
+return 0;
+}
+
 PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
 {
 VirtIONet *n;
+static int virtio_net_id;
 
 n = (VirtIONet *)virtio_init_pci(bus, virtio-net, 6900, 0x1000,
 0, VIRTIO_ID_NET,
@@ -229,5 +260,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int 
devfn)
 n-tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
 n-tx_timer_active = 0;
 
+register_savevm(virtio-net, virtio_net_id++, 1,
+   virtio_net_save, virtio_net_load, n);
+
 return (PCIDevice *)n;
 }

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 3/3] virtio-blk save/restore support

2008-05-08 Thread Anthony Liguori
No additional state needs to be saved.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio-blk.c b/qemu/hw/virtio-blk.c
index 048285a..148cb75 100644
--- a/qemu/hw/virtio-blk.c
+++ b/qemu/hw/virtio-blk.c
@@ -162,11 +162,30 @@ static uint32_t virtio_blk_get_features(VirtIODevice 
*vdev)
 return (1  VIRTIO_BLK_F_SEG_MAX | 1  VIRTIO_BLK_F_GEOMETRY);
 }
 
+static void virtio_blk_save(QEMUFile *f, void *opaque)
+{
+VirtIOBlock *s = opaque;
+virtio_save(s-vdev, f);
+}
+
+static int virtio_blk_load(QEMUFile *f, void *opaque, int version_id)
+{
+VirtIOBlock *s = opaque;
+
+if (version_id != 1)
+   return -EINVAL;
+
+virtio_load(s-vdev, f);
+
+return 0;
+}
+
 void *virtio_blk_init(PCIBus *bus, uint16_t vendor, uint16_t device,
  BlockDriverState *bs)
 {
 VirtIOBlock *s;
 int cylinders, heads, secs;
+static int virtio_blk_id;
 
 s = (VirtIOBlock *)virtio_init_pci(bus, virtio-blk, vendor, device,
   0, VIRTIO_ID_BLOCK,
@@ -184,5 +203,8 @@ void *virtio_blk_init(PCIBus *bus, uint16_t vendor, 
uint16_t device,
 
 virtio_add_queue(s-vdev, 128, virtio_blk_handle_output);
 
+register_savevm(virtio-blk, virtio_blk_id++, 1,
+   virtio_blk_save, virtio_blk_load, s);
+
 return s;
 }

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 1/3] Virtio save/restore support (v2)

2008-05-08 Thread Anthony Liguori
This patch implements the core of save/restore support for virtio.  It's
modelled after how PCI save/restore works.

N.B. This makes savevm/loadvm work, but not live migration.  The issue with
live migration is that we're manipulating guest memory without updating the
dirty bitmap correctly.  I will submit a patch in the near future that
addresses that problem.

Since v1, I fixed the Signed-off-by line.  Sorry about that.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index a4c9d10..440cc69 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -420,7 +420,6 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int 
queue_size,
 
 vdev-vq[i].vring.num = queue_size;
 vdev-vq[i].handle_output = handle_output;
-vdev-vq[i].index = i;
 
 return vdev-vq[i];
 }
@@ -436,6 +435,71 @@ void virtio_notify(VirtIODevice *vdev, VirtQueue *vq)
 virtio_update_irq(vdev);
 }
 
+void virtio_save(VirtIODevice *vdev, QEMUFile *f)
+{
+int i;
+
+pci_device_save(vdev-pci_dev, f);
+
+qemu_put_be32s(f, vdev-addr);
+qemu_put_8s(f, vdev-status);
+qemu_put_8s(f, vdev-isr);
+qemu_put_be16s(f, vdev-queue_sel);
+qemu_put_be32s(f, vdev-features);
+qemu_put_be32(f, vdev-config_len);
+qemu_put_buffer(f, vdev-config, vdev-config_len);
+
+for (i = 0; i  VIRTIO_PCI_QUEUE_MAX; i++) {
+   if (vdev-vq[i].vring.num == 0)
+   break;
+}
+
+qemu_put_be32(f, i);
+
+for (i = 0; i  VIRTIO_PCI_QUEUE_MAX; i++) {
+   if (vdev-vq[i].vring.num == 0)
+   break;
+
+   qemu_put_be32(f, vdev-vq[i].vring.num);
+   qemu_put_be32s(f, vdev-vq[i].pfn);
+   qemu_put_be16s(f, vdev-vq[i].last_avail_idx);
+}  
+}
+
+void virtio_load(VirtIODevice *vdev, QEMUFile *f)
+{
+int num, i;
+
+pci_device_load(vdev-pci_dev, f);
+
+qemu_get_be32s(f, vdev-addr);
+qemu_get_8s(f, vdev-status);
+qemu_get_8s(f, vdev-isr);
+qemu_get_be16s(f, vdev-queue_sel);
+qemu_get_be32s(f, vdev-features);
+vdev-config_len = qemu_get_be32(f);
+qemu_get_buffer(f, vdev-config, vdev-config_len);
+
+num = qemu_get_be32(f);
+
+for (i = 0; i  num; i++) {
+   vdev-vq[i].vring.num = qemu_get_be32(f);
+   qemu_get_be32s(f, vdev-vq[i].pfn);
+   qemu_get_be16s(f, vdev-vq[i].last_avail_idx);
+
+   if (vdev-vq[i].pfn) {
+   size_t size;
+   target_phys_addr_t pa;
+
+   pa = (ram_addr_t)vdev-vq[i].pfn  TARGET_PAGE_BITS;
+   size = virtqueue_size(vdev-vq[i].vring.num);
+   virtqueue_init(vdev-vq[i], virtio_map_gpa(pa, size));
+   }
+}
+
+virtio_update_irq(vdev);
+}
+
 VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
  uint16_t vendor, uint16_t device,
  uint16_t subvendor, uint16_t subdevice,
diff --git a/qemu/hw/virtio.h b/qemu/hw/virtio.h
index dee97ba..ed8cfd6 100644
--- a/qemu/hw/virtio.h
+++ b/qemu/hw/virtio.h
@@ -87,7 +87,6 @@ struct VirtQueue
 uint32_t pfn;
 uint16_t last_avail_idx;
 void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
-int index;
 };
 
 #define VIRTQUEUE_MAX_SIZE 1024
@@ -108,8 +107,6 @@ struct VirtIODevice
 PCIDevice pci_dev;
 const char *name;
 uint32_t addr;
-uint16_t vendor;
-uint16_t device;
 uint8_t status;
 uint8_t isr;
 uint16_t queue_sel;
@@ -140,4 +137,8 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem);
 
 void virtio_notify(VirtIODevice *vdev, VirtQueue *vq);
 
+void virtio_save(VirtIODevice *vdev, QEMUFile *f);
+
+void virtio_load(VirtIODevice *vdev, QEMUFile *f);
+
 #endif

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] qemu/kvm: support for pci passthrough

2008-05-08 Thread Anthony Liguori
Dor Laor wrote:
 On Thu, 2008-05-08 at 12:03 +0200, Nicolas Daneau wrote:
 Hi,

 I saw there was an active discussion about pci passthrough support in
 KVM. I'm not a Dev, only a sys admin that need this support. As i saw
 on the KVM home page that this feature is plan for 2H2008 in the
 roadmap, I wondered if you already have a more precise time frame for
 this release? Xen already have this feature but for the rest i prfer
 KVM. As this is a key point for the final decsion (we have a Digium
 TDM11B FXO/FXS pci card for an Asterisk VM), i'm waiting for an update
 about this feature.

 
 This is feature is indeed work in progress. We have floating patches for
 several solutions:
 1. Using Intel's VT-d - hw iommu support.
Patches were sent, needs some more scrubbing. The good thing is that 
it can live withing the kvm module and does not have to wait for 
kernel inclusion like the two below.

Nope, it uses the existing Linux VT-d support and requires patches to 
the existing code.  It can't live entirely within the external module.

Regards,

Anthony Liguori


 2. PV dma - good only for Linux guests. Some locking issues needs to be 
solved + get into mainline kernel (=2.6.27)
 3. 1-1 mapping of the guest-host addresses. 
Will work for any guest, no need for special hw, good only for 
single guest and not 100% secure.
Needs a little more attention, worked in the past privately.
 
 Bottom line, soon you'll have more than one option for pci pass through
 using kvm. There is no assurance we'll make it exactly in Q2.
 You can help by applying patches and sending results.
 Cheers,
 Dor
 
 Thanks for your answer



-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] strange problem with virtio on guest with over 3648M RAM

2008-05-07 Thread Anthony Liguori
Henrik Holst wrote:
 Hello KVM developers!

 We are experiencing a very strange problem: we have been testing Ubuntu 
 Hardy Heron Server as a guest (kvm-68 on the host) and when using the 
 virtio nic everything works fine until we give the guest more than 3648 
 MiB of RAM.

 If we for example start with -m 4096 then the ubuntu guest brings up 
 eth0 but it is impossible to get a dhcp address and if one sets a static 
 address no packets what so ever is leaving the guest.

 So there seam to be a problem with the virtio guest drivers when there 
 is more than 3648 MiB of ram available :), is there anyone else that 
 have experienced the same problems or are there people running virtio on 
 guest with more ram where everything is working ok?
   

It's a know bug.  I have a fix for it locally that I'll send out today.

Regards,

Anthony Liguori

 This problem could be local to ubuntu but I don't know of any other 
 distribution with virtio guest drivers out of the box.

 -
 This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
 Don't miss this year's exciting event. There's still time to save $100. 
 Use priority code J8TL2D2. 
 http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 1/6] Add support for eventfd() (v3)

2008-05-07 Thread Anthony Liguori
This patch adds compatibility code so that we can make use of eventfd() within
QEMU.  eventfd() is a pretty useful mechanism as it allows multiple
notifications to be batched in a single system call.

We emulate eventfd() using a standard pipe().

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/Makefile.target b/qemu/Makefile.target
index bb4b9a3..46654f3 100644
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -208,7 +208,7 @@ CPPFLAGS+=-I$(SRC_PATH)/tcg/sparc
 endif
 
 ifeq ($(USE_KVM), 1)
-LIBOBJS+=qemu-kvm.o
+LIBOBJS+=qemu-kvm.o kvm-compatfd.o
 endif
 ifdef CONFIG_SOFTFLOAT
 LIBOBJS+=fpu/softfloat.o
diff --git a/qemu/kvm-compatfd.c b/qemu/kvm-compatfd.c
new file mode 100644
index 000..1b030ba
--- /dev/null
+++ b/qemu/kvm-compatfd.c
@@ -0,0 +1,33 @@
+/*
+ * signalfd/eventfd compatibility
+ *
+ * Copyright IBM, Corp. 2008
+ *
+ * Authors:
+ *  Anthony Liguori   [EMAIL PROTECTED]
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include qemu-common.h
+#include qemu-kvm.h
+
+#include sys/syscall.h
+
+int kvm_eventfd(int *fds)
+{
+#if defined(SYS_eventfd)
+int ret;
+
+ret = syscall(SYS_eventfd, 0);
+if (ret = 0) {
+   fds[0] = fds[1] = ret;
+   return 0;
+} else if (!(ret == -1  errno == ENOSYS))
+   return ret;
+#endif
+
+return pipe(fds);
+}
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h
index 024a653..8fa3c1b 100644
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -79,6 +79,8 @@ int handle_powerpc_dcr_read(int vcpu, uint32_t dcrn, uint32_t 
*data);
 int handle_powerpc_dcr_write(int vcpu,uint32_t dcrn, uint32_t data);
 #endif
 
+int kvm_eventfd(int *fds);
+
 #define ALIGN(x, y)  (((x)+(y)-1)  ~((y)-1))
 #define BITMAP_SIZE(m) (ALIGN(((m)TARGET_PAGE_BITS), HOST_LONG_BITS) / 8)
 

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 5/6] Interrupt io thread in qemu_set_fd_handler2 (v3)

2008-05-07 Thread Anthony Liguori
The select() in the IO thread may wait a long time before rebuilding the
fd set.  Whenever we do something that changes the fd set, we should interrupt
the IO thread.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/vl.c b/qemu/vl.c
index 1192759..e9f0ca4 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -260,6 +260,16 @@ static int event_pending = 1;
 
 #define TFR(expr) do { if ((expr) != -1) break; } while (errno == EINTR)
 
+/* KVM runs the main loop in a separate thread.  If we update one of the lists
+ * that are polled before or after select(), we need to make sure to break out
+ * of the select() to ensure the new item is serviced.
+ */
+static void main_loop_break(void)
+{
+if (kvm_enabled())
+   qemu_kvm_notify_work();
+}
+
 void decorate_application_name(char *appname, int max_len)
 {
 if (kvm_enabled())
@@ -5680,6 +5690,7 @@ int qemu_set_fd_handler2(int fd,
 ioh-opaque = opaque;
 ioh-deleted = 0;
 }
+main_loop_break();
 return 0;
 }
 
@@ -7606,8 +7617,7 @@ void qemu_bh_schedule(QEMUBH *bh)
 if (env) {
 cpu_interrupt(env, CPU_INTERRUPT_EXIT);
 }
-if (kvm_enabled())
-qemu_kvm_notify_work();
+main_loop_break();
 }
 
 void qemu_bh_cancel(QEMUBH *bh)

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 6/6] Only select once per-main_loop iteration (v3)

2008-05-07 Thread Anthony Liguori
QEMU is rather aggressive about exhausting the wait period when selecting.
This is fine when the wait period is low and when there is significant delays
in-between selects as it improves IO throughput.

With the IO thread, there is a very small delay between selects and our wait
period for select is very large.  This patch changes main_loop_wait to only
select once before doing the various other things in the main loop.  This
generally improves responsiveness of things like SDL but also improves
individual file descriptor throughput quite dramatically.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 492c3c4..cc8f292 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -423,24 +423,6 @@ void qemu_kvm_notify_work(void)
fprintf(stderr, failed to notify io thread\n);
 }
 
-static int received_signal;
-
-/* QEMU relies on periodically breaking out of select via EINTR to poll for IO
-   and timer signals.  Since we're now using a file descriptor to handle
-   signals, select() won't be interrupted by a signal.  We need to forcefully
-   break the select() loop when a signal is received hence
-   kvm_check_received_signal(). */
-
-int kvm_check_received_signal(void)
-{
-if (received_signal) {
-   received_signal = 0;
-   return 1;
-}
-
-return 0;
-}
-
 /* If we have signalfd, we mask out the signals we want to handle and then
  * use signalfd to listen for them.  We rely on whatever the current signal
  * handler is to dispatch the signals when we receive them.
@@ -474,8 +456,6 @@ static void sigfd_handler(void *opaque)
pthread_cond_signal(qemu_aio_cond); 
}
 }
-
-received_signal = 1;
 }
 
 /* Used to break IO thread out of select */
@@ -497,8 +477,6 @@ static void io_thread_wakeup(void *opaque)
 
offset += len;
 }
-
-received_signal = 1;
 }
 
 int kvm_main_loop(void)
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h
index df573ec..21606e9 100644
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -128,13 +128,4 @@ static inline void kvm_sleep_end(void)
kvm_mutex_lock();
 }
 
-int kvm_check_received_signal(void);
-
-static inline int kvm_received_signal(void)
-{
-if (kvm_enabled())
-   return kvm_check_received_signal();
-return 0;
-}
-
 #endif
diff --git a/qemu/vl.c b/qemu/vl.c
index 4e25366..a1aa270 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -7381,23 +7381,18 @@ void main_loop_wait(int timeout)
 slirp_select_fill(nfds, rfds, wfds, xfds);
 }
 #endif
- moreio:
 ret = qemu_select(nfds + 1, rfds, wfds, xfds, tv);
 if (ret  0) {
 IOHandlerRecord **pioh;
-int more = 0;
 
 for(ioh = first_io_handler; ioh != NULL; ioh = ioh-next) {
 if (!ioh-deleted  ioh-fd_read  FD_ISSET(ioh-fd, rfds)) {
 ioh-fd_read(ioh-opaque);
-if (!ioh-fd_read_poll || ioh-fd_read_poll(ioh-opaque))
-more = 1;
-else
+if (!(ioh-fd_read_poll  ioh-fd_read_poll(ioh-opaque)))
 FD_CLR(ioh-fd, rfds);
 }
 if (!ioh-deleted  ioh-fd_write  FD_ISSET(ioh-fd, wfds)) {
 ioh-fd_write(ioh-opaque);
-more = 1;
 }
 }
 
@@ -7411,8 +7406,6 @@ void main_loop_wait(int timeout)
 } else
 pioh = ioh-next;
 }
-if (more  !kvm_received_signal())
-goto moreio;
 }
 #if defined(CONFIG_SLIRP)
 if (slirp_inited) {

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 4/6] Use signalfd() in io-thread (v3)

2008-05-07 Thread Anthony Liguori
This patch reworks the IO thread to use signalfd() instead of sigtimedwait().
This will eliminate the need to use SIGIO everywhere.

Since v2, I've fixed a nasty bug in qemu_kvm_aio_wait().  We can't use
main_loop_wait() to sleep if it's at all possible we're being called from
a handler in main_loop_wait() (which is the case with qemu_kvm_aio_wait()).

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 7134e56..492c3c4 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -17,6 +17,7 @@ int kvm_pit = 1;
 #include sysemu.h
 #include qemu-common.h
 #include console.h
+#include block.h
 
 #include qemu-kvm.h
 #include libkvm.h
@@ -36,18 +37,11 @@ pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
 pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER;
 pthread_cond_t qemu_vcpu_cond = PTHREAD_COND_INITIALIZER;
 pthread_cond_t qemu_system_cond = PTHREAD_COND_INITIALIZER;
+pthread_cond_t qemu_pause_cond = PTHREAD_COND_INITIALIZER;
 __thread struct vcpu_info *vcpu;
 
 static int qemu_system_ready;
 
-struct qemu_kvm_signal_table {
-sigset_t sigset;
-sigset_t negsigset;
-};
-
-static struct qemu_kvm_signal_table io_signal_table;
-static struct qemu_kvm_signal_table vcpu_signal_table;
-
 #define SIG_IPI (SIGRTMIN+4)
 
 struct vcpu_info {
@@ -64,6 +58,7 @@ struct vcpu_info {
 
 pthread_t io_thread;
 static int io_thread_fd = -1;
+static int io_thread_sigfd = -1;
 
 static inline unsigned long kvm_get_thread_id(void)
 {
@@ -172,37 +167,23 @@ static int has_work(CPUState *env)
 return kvm_arch_has_work(env);
 }
 
-static int kvm_process_signal(int si_signo)
-{
-struct sigaction sa;
-
-switch (si_signo) {
-case SIGUSR2:
-pthread_cond_signal(qemu_aio_cond);
-break;
-case SIGALRM:
-case SIGIO:
-sigaction(si_signo, NULL, sa);
-sa.sa_handler(si_signo);
-break;
-}
-
-return 1;
-}
-
-static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env,
-  int timeout)
+static int kvm_eat_signal(CPUState *env, int timeout)
 {
 struct timespec ts;
 int r, e, ret = 0;
 siginfo_t siginfo;
+sigset_t waitset;
 
 ts.tv_sec = timeout / 1000;
 ts.tv_nsec = (timeout % 1000) * 100;
-r = sigtimedwait(waitset-sigset, siginfo, ts);
+sigemptyset(waitset);
+sigaddset(waitset, SIG_IPI);
+
+r = sigtimedwait(waitset, siginfo, ts);
 if (r == -1  (errno == EAGAIN || errno == EINTR)  !timeout)
return 0;
 e = errno;
+
 pthread_mutex_lock(qemu_mutex);
 if (env  vcpu)
 cpu_single_env = vcpu-env;
@@ -211,12 +192,12 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table 
*waitset, CPUState *env,
exit(1);
 }
 if (r != -1)
-ret = kvm_process_signal(siginfo.si_signo);
+   ret = 1;
 
 if (env  vcpu_info[env-cpu_index].stop) {
vcpu_info[env-cpu_index].stop = 0;
vcpu_info[env-cpu_index].stopped = 1;
-   qemu_kvm_notify_work();
+   pthread_cond_signal(qemu_pause_cond);
 }
 pthread_mutex_unlock(qemu_mutex);
 
@@ -227,14 +208,13 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table 
*waitset, CPUState *env,
 static void kvm_eat_signals(CPUState *env, int timeout)
 {
 int r = 0;
-struct qemu_kvm_signal_table *waitset = vcpu_signal_table;
 
-while (kvm_eat_signal(waitset, env, 0))
+while (kvm_eat_signal(env, 0))
r = 1;
 if (!r  timeout) {
-   r = kvm_eat_signal(waitset, env, timeout);
+   r = kvm_eat_signal(env, timeout);
if (r)
-   while (kvm_eat_signal(waitset, env, 0))
+   while (kvm_eat_signal(env, 0))
;
 }
 }
@@ -266,12 +246,8 @@ static void pause_all_threads(void)
vcpu_info[i].stop = 1;
pthread_kill(vcpu_info[i].thread, SIG_IPI);
 }
-while (!all_threads_paused()) {
-   pthread_mutex_unlock(qemu_mutex);
-   kvm_eat_signal(io_signal_table, NULL, 1000);
-   pthread_mutex_lock(qemu_mutex);
-   cpu_single_env = NULL;
-}
+while (!all_threads_paused())
+   pthread_cond_wait(qemu_pause_cond, qemu_mutex);
 }
 
 static void resume_all_threads(void)
@@ -310,6 +286,12 @@ static void setup_kernel_sigmask(CPUState *env)
 {
 sigset_t set;
 
+sigemptyset(set);
+sigaddset(set, SIGUSR2);
+sigaddset(set, SIGIO);
+sigaddset(set, SIGALRM);
+sigprocmask(SIG_BLOCK, set, NULL);
+
 sigprocmask(SIG_BLOCK, NULL, set);
 sigdelset(set, SIG_IPI);
 
@@ -346,7 +328,7 @@ static int kvm_main_loop_cpu(CPUState *env)
 cpu_single_env = env;
 while (1) {
while (!has_work(env))
-   kvm_main_loop_wait(env, 10);
+   kvm_main_loop_wait(env, 1000);
if (env-interrupt_request  CPU_INTERRUPT_HARD)
env-hflags = ~HF_HALTED_MASK;
if (!kvm_irqchip_in_kernel(kvm_context)  info-sipi_needed)
@@ -394,18 +376,6 @@ static void *ap_main_loop(void *_env)
 return NULL

[kvm-devel] [PATCH 3/6] Add support for signalfd() (v3)

2008-05-07 Thread Anthony Liguori
This patch adds compatibility code so that we can use signalfd() within QEMU.
signalfd() provides a mechanism to receive signal notification through a
file descriptor.  This is very useful in eliminating the signal/select race
condition.

If signalfd() isn't available, we spawn a thread that uses sigwaitinfo() to
emulate it.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/kvm-compatfd.c b/qemu/kvm-compatfd.c
index 1b030ba..b1311e2 100644
--- a/qemu/kvm-compatfd.c
+++ b/qemu/kvm-compatfd.c
@@ -15,6 +15,97 @@
 #include qemu-kvm.h
 
 #include sys/syscall.h
+#include pthread.h
+
+struct sigfd_compat_info
+{
+sigset_t mask;
+int fd;
+};
+
+static void *sigwait_compat(void *opaque)
+{
+struct sigfd_compat_info *info = opaque;
+int err;
+
+sigprocmask(SIG_BLOCK, info-mask, NULL);
+
+do {
+   siginfo_t siginfo;
+
+   err = sigwaitinfo(info-mask, siginfo);
+   if (err == -1  errno == EINTR)
+   continue;
+
+   if (err  0) {
+   char buffer[128];
+   size_t offset = 0;
+
+   memcpy(buffer, err, sizeof(err));
+   while (offset  sizeof(buffer)) {
+   ssize_t len;
+
+   len = write(info-fd, buffer + offset,
+   sizeof(buffer) - offset);
+   if (len == -1  errno == EINTR)
+   continue;
+
+   if (len = 0) {
+   err = -1;
+   break;
+   }
+
+   offset += len;
+   }
+   }
+} while (err = 0);
+
+return NULL;
+}
+
+static int kvm_signalfd_compat(const sigset_t *mask)
+{
+pthread_attr_t attr;
+pthread_t tid;
+struct sigfd_compat_info *info;
+int fds[2];
+
+info = malloc(sizeof(*info));
+if (info == NULL) {
+   errno = ENOMEM;
+   return -1;
+}
+
+if (pipe(fds) == -1) {
+   free(info);
+   return -1;
+}
+
+memcpy(info-mask, mask, sizeof(*mask));
+info-fd = fds[1];
+
+pthread_attr_init(attr);
+pthread_attr_setdetachstate(attr, PTHREAD_CREATE_DETACHED);
+
+pthread_create(tid, attr, sigwait_compat, info);
+
+pthread_attr_destroy(attr);
+
+return fds[0];
+}
+
+int kvm_signalfd(const sigset_t *mask)
+{
+#if defined(SYS_signalfd)
+int ret;
+
+ret = syscall(SYS_signalfd, -1, mask, _NSIG / 8);
+if (!(ret == -1  errno == ENOSYS))
+   return ret;
+#endif
+
+return kvm_signalfd_compat(mask);
+}
 
 int kvm_eventfd(int *fds)
 {
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h
index 8fa3c1b..a0dd4a8 100644
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -10,6 +10,8 @@
 
 #include cpu.h
 
+#include signal.h
+
 int kvm_main_loop(void);
 int kvm_qemu_init(void);
 int kvm_qemu_create_context(void);
@@ -79,6 +81,16 @@ int handle_powerpc_dcr_read(int vcpu, uint32_t dcrn, 
uint32_t *data);
 int handle_powerpc_dcr_write(int vcpu,uint32_t dcrn, uint32_t data);
 #endif
 
+#if !defined(SYS_signalfd)
+struct signalfd_siginfo {
+uint32_t ssi_signo;
+uint8_t pad[124];
+};
+#else
+#include linux/signalfd.h
+#endif
+
+int kvm_signalfd(const sigset_t *mask);
 int kvm_eventfd(int *fds);
 
 #define ALIGN(x, y)  (((x)+(y)-1)  ~((y)-1))

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 1/5] Support more than 3.5GB with virtio (v3)

2008-05-07 Thread Anthony Liguori
We're pretty sloppy in virtio right now about phys_ram_base assumptions.  This
patch is an incremental step between what we have today and a full blown DMA
API.  I backported the DMA API but the performance impact was not acceptable
to me. There's only a slight performance impact with this particular patch.

Since we're no longer assuming guest physical memory is contiguous, we need
a more complex way to validate the memory regions than just checking if it's
within ram_size.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 6a50001..a4c9d10 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -58,6 +58,48 @@
 
 /* virt queue functions */
 
+static void *virtio_map_gpa(target_phys_addr_t addr, size_t size)
+{
+ram_addr_t off;
+target_phys_addr_t addr1;
+
+off = cpu_get_physical_page_desc(addr);
+if ((off  ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
+   fprintf(stderr, virtio DMA to IO ram\n);
+   exit(1);
+}
+
+off = (off  TARGET_PAGE_MASK) | (addr  ~TARGET_PAGE_MASK);
+
+for (addr1 = addr + TARGET_PAGE_SIZE;
+addr1  TARGET_PAGE_ALIGN(addr + size);
+addr1 += TARGET_PAGE_SIZE) {
+   ram_addr_t off1;
+
+   off1 = cpu_get_physical_page_desc(addr1);
+   if ((off1  ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
+   fprintf(stderr, virtio DMA to IO ram\n);
+   exit(1);
+   }
+
+   off1 = (off1  TARGET_PAGE_MASK) | (addr1  ~TARGET_PAGE_MASK);
+
+   if (off1 != (off + (addr1 - addr))) {
+   fprintf(stderr, discontigous virtio memory\n);
+   exit(1);
+   }
+}
+
+return phys_ram_base + off;
+}
+
+static size_t virtqueue_size(int num)
+{
+return TARGET_PAGE_ALIGN((sizeof(VRingDesc) * num) +
+(sizeof(VRingAvail) + sizeof(uint16_t) * num)) +
+   (sizeof(VRingUsed) + sizeof(VRingUsedElem) * num);
+}
+
 static void virtqueue_init(VirtQueue *vq, void *p)
 {
 vq-vring.desc = p;
@@ -127,9 +169,6 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
 do {
struct iovec *sg;
 
-   if ((vq-vring.desc[i].addr + vq-vring.desc[i].len)  ram_size)
-   errx(1, Guest sent invalid pointer);
-
if (vq-vring.desc[i].flags  VRING_DESC_F_WRITE)
sg = elem-in_sg[elem-in_num++];
else
@@ -137,7 +176,9 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
 
/* Grab the first descriptor, and check it's OK. */
sg-iov_len = vq-vring.desc[i].len;
-   sg-iov_base = phys_ram_base + vq-vring.desc[i].addr;
+   sg-iov_base = virtio_map_gpa(vq-vring.desc[i].addr, sg-iov_len);
+   if (sg-iov_base == NULL)
+   errx(1, Invalid mapping\n);
 
/* If we've got too many, that implies a descriptor loop. */
if ((elem-in_num + elem-out_num)  vq-vring.num)
@@ -198,9 +239,10 @@ static void virtio_ioport_write(void *opaque, uint32_t 
addr, uint32_t val)
vdev-vq[vdev-queue_sel].pfn = val;
if (pa == 0) {
 virtio_reset(vdev);
-   } else if (pa  (ram_size - TARGET_PAGE_SIZE)) {
-   virtqueue_init(vdev-vq[vdev-queue_sel], phys_ram_base + pa);
-   /* FIXME if pa == 0, deal with device tear down */
+   } else {
+   size_t size = virtqueue_size(vdev-vq[vdev-queue_sel].vring.num);
+   virtqueue_init(vdev-vq[vdev-queue_sel],
+  virtio_map_gpa(pa, size));
}
break;
 case VIRTIO_PCI_QUEUE_SEL:

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 2/5] Validate the SG list layouts in virtio

2008-05-07 Thread Anthony Liguori
We should check that the first element is the size we expect instead of
just casting blindly.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio-blk.c b/qemu/hw/virtio-blk.c
index 3af36db..048285a 100644
--- a/qemu/hw/virtio-blk.c
+++ b/qemu/hw/virtio-blk.c
@@ -56,8 +56,6 @@ struct virtio_blk_outhdr
 uint32_t ioprio;
 /* Sector (ie. 512 byte offset) */
 uint64_t sector;
-/* Where to put reply. */
-uint64_t id;
 };
 
 #define VIRTIO_BLK_S_OK0
@@ -94,6 +92,17 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, 
VirtQueue *vq)
off_t off;
int i;
 
+   if (elem.out_num  1 || elem.in_num  1) {
+   fprintf(stderr, virtio-blk missing headers\n);
+   exit(1);
+   }
+
+   if (elem.out_sg[0].iov_len != sizeof(*out) ||
+   elem.in_sg[elem.in_num - 1].iov_len != sizeof(*in)) {
+   fprintf(stderr, virtio-blk header not in correct element\n);
+   exit(1);
+   }
+
out = (void *)elem.out_sg[0].iov_base;
in = (void *)elem.in_sg[elem.in_num - 1].iov_base;
off = out-sector;
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index f727b14..5ac5089 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -125,6 +125,11 @@ static void virtio_net_receive(void *opaque, const uint8_t 
*buf, int size)
return;
 }
 
+if (elem.in_num  1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
+   fprintf(stderr, virtio-net header not in first element\n);
+   exit(1);
+}
+
 hdr = (void *)elem.in_sg[0].iov_base;
 hdr-flags = 0;
 hdr-gso_type = VIRTIO_NET_HDR_GSO_NONE;
@@ -197,6 +202,11 @@ void virtio_net_poll(void)
 continue;
 }
 
+   if (elem.in_num  1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
+   fprintf(stderr, virtio-net header not in first element\n);
+   exit(1);
+   }
+
 hdr = (void *)elem.in_sg[0].iov_base;
 hdr-flags = 0;
 hdr-gso_type = VIRTIO_NET_HDR_GSO_NONE;

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 3/5] Revert virtio tap hack (v3)

2008-05-07 Thread Anthony Liguori
While it has served us well, it is long overdue that we eliminate the
virtio-net tap hack.  It turns out that zero-copy has very little impact on
performance.  The tap hack was gaining such a significant performance boost
not because of zero-copy, but because it avoided dropping packets on receive
which is apparently a significant problem with the tap implementation in QEMU.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/pc.h b/qemu/hw/pc.h
index 73e3918..c284bf1 100644
--- a/qemu/hw/pc.h
+++ b/qemu/hw/pc.h
@@ -155,7 +155,6 @@ void isa_ne2000_init(int base, qemu_irq irq, NICInfo *nd);
 /* virtio-net.c */
 
 PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn);
-void virtio_net_poll(void);
 
 /* virtio-blk.h */
 void *virtio_blk_init(PCIBus *bus, uint16_t vendor, uint16_t device,
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index e2db49c..a26d04e 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -13,7 +13,6 @@
 
 #include virtio.h
 #include net.h
-#include pc.h
 #include qemu-timer.h
 
 /* from Linux's virtio_net.h */
@@ -62,15 +61,10 @@ typedef struct VirtIONet
 VirtQueue *tx_vq;
 VLANClientState *vc;
 int can_receive;
-int tap_fd;
-struct VirtIONet *next;
-int do_notify;
 QEMUTimer *tx_timer;
 int tx_timer_active;
 } VirtIONet;
 
-static VirtIONet *VirtIONetHead = NULL;
-
 static VirtIONet *to_virtio_net(VirtIODevice *vdev)
 {
 return (VirtIONet *)vdev;
@@ -105,7 +99,6 @@ static int virtio_net_can_receive(void *opaque)
 return (n-vdev.status  VIRTIO_CONFIG_S_DRIVER_OK)  n-can_receive;
 }
 
-/* -net user receive function */
 static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
 {
 VirtIONet *n = opaque;
@@ -149,92 +142,6 @@ static void virtio_net_receive(void *opaque, const uint8_t 
*buf, int size)
 virtio_notify(n-vdev, n-rx_vq);
 }
 
-/* -net tap receive handler */
-void virtio_net_poll(void)
-{
-VirtIONet *vnet;
-int len;
-fd_set rfds;
-struct timeval tv;
-int max_fd = -1;
-VirtQueueElement elem;
-struct virtio_net_hdr *hdr;
-int did_notify;
-
-FD_ZERO(rfds);
-tv.tv_sec = 0;
-tv.tv_usec = 0;
-
-while (1) {
-
-// Prepare the set of device to select from
-for (vnet = VirtIONetHead; vnet; vnet = vnet-next) {
-
-if (vnet-tap_fd == -1)
-continue;
-
-vnet-do_notify = 0;
-//first check if the driver is ok
-if (!virtio_net_can_receive(vnet))
-continue;
-
-/* FIXME: the drivers really need to set their status better */
-if (vnet-rx_vq-vring.avail == NULL) {
-vnet-can_receive = 0;
-continue;
-}
-
-FD_SET(vnet-tap_fd, rfds);
-if (max_fd  vnet-tap_fd) max_fd = vnet-tap_fd;
-}
-
-if (select(max_fd + 1, rfds, NULL, NULL, tv) = 0)
-break;
-
-// Now check who has data pending in the tap
-for (vnet = VirtIONetHead; vnet; vnet = vnet-next) {
-
-if (!FD_ISSET(vnet-tap_fd, rfds))
-continue;
-
-if (virtqueue_pop(vnet-rx_vq, elem) == 0) {
-vnet-can_receive = 0;
-continue;
-}
-
-   if (elem.in_num  1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
-   fprintf(stderr, virtio-net header not in first element\n);
-   exit(1);
-   }
-
-hdr = (void *)elem.in_sg[0].iov_base;
-hdr-flags = 0;
-hdr-gso_type = VIRTIO_NET_HDR_GSO_NONE;
-again:
-len = readv(vnet-tap_fd, elem.in_sg[1], elem.in_num - 1);
-if (len == -1) {
-if (errno == EINTR || errno == EAGAIN)
-goto again;
-else
-fprintf(stderr, reading network error %d, len);
-}
-virtqueue_push(vnet-rx_vq, elem, sizeof(*hdr) + len);
-vnet-do_notify = 1;
-}
-
-/* signal other side */
-did_notify = 0;
-for (vnet = VirtIONetHead; vnet; vnet = vnet-next)
-if (vnet-do_notify) {
-virtio_notify(vnet-vdev, vnet-rx_vq);
-did_notify++;
-}
-if (!did_notify)
-break;
- }
-
-}
-
 /* TX */
 static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
@@ -313,12 +220,6 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int 
devfn)
 memcpy(n-mac, nd-macaddr, 6);
 n-vc = qemu_new_vlan_client(nd-vlan, virtio_net_receive,
  virtio_net_can_receive, n);
-n-tap_fd = hack_around_tap(n-vc-vlan-first_client);
-if (n-tap_fd != -1) {
-n-next = VirtIONetHead;
-//push the device on top of the list
-VirtIONetHead = n;
-}
 
 n-tx_timer = qemu_new_timer(rt_clock, virtio_net_tx_timer, n);
 n-tx_timer_active = 0;
diff --git a/qemu/vl.c

[kvm-devel] [PATCH 5/5] Stop dropping so many RX packets in tap (v3)

2008-05-07 Thread Anthony Liguori
Normally, tap always reads packets and simply lets the client drop them if it
cannot receive them.  For virtio-net, this results in massive packet loss and
about an 80% performance loss in TCP throughput.

This patch modifies qemu_send_packet() to only deliver a packet to a VLAN
client if it doesn't have a fd_can_read method or the fd_can_read method
indicates that it can receive packets.  We also return a status of whether
any clients were able to receive the packet.

If no clients were able to receive a packet, we buffer the packet until a
client indicates that it can receive packets again.

This patch also modifies the tap code to only read from the tap fd if at least
one client on the VLAN is able to receive a packet.

Finally, this patch changes the tap code to drain all possible packets from
the tap device when the tap fd is readable.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..dfdf9af 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -29,7 +29,7 @@ VLANClientState *qemu_new_vlan_client(VLANState *vlan,
   IOCanRWHandler *fd_can_read,
   void *opaque);
 int qemu_can_send_packet(VLANClientState *vc);
-void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
+int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
 void qemu_handler_true(void *opaque);
 
 void do_info_network(void);
diff --git a/qemu/vl.c b/qemu/vl.c
index 03051da..28f8d1d 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -3754,10 +3754,11 @@ int qemu_can_send_packet(VLANClientState *vc1)
 return 0;
 }
 
-void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size)
+int qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size)
 {
 VLANState *vlan = vc1-vlan;
 VLANClientState *vc;
+int ret = -EAGAIN;
 
 #if 0
 printf(vlan %d send:\n, vlan-id);
@@ -3765,9 +3766,14 @@ void qemu_send_packet(VLANClientState *vc1, const 
uint8_t *buf, int size)
 #endif
 for(vc = vlan-first_client; vc != NULL; vc = vc-next) {
 if (vc != vc1) {
-vc-fd_read(vc-opaque, buf, size);
+   if (!vc-fd_can_read || vc-fd_can_read(vc-opaque)) {
+   vc-fd_read(vc-opaque, buf, size);
+   ret = 0;
+   }
 }
 }
+
+return ret;
 }
 
 #if defined(CONFIG_SLIRP)
@@ -3970,6 +3976,8 @@ typedef struct TAPState {
 VLANClientState *vc;
 int fd;
 char down_script[1024];
+char buf[4096];
+int size;
 } TAPState;
 
 static void tap_receive(void *opaque, const uint8_t *buf, int size)
@@ -3985,24 +3993,70 @@ static void tap_receive(void *opaque, const uint8_t 
*buf, int size)
 }
 }
 
+static int tap_can_send(void *opaque)
+{
+TAPState *s = opaque;
+VLANClientState *vc;
+int can_receive = 0;
+ 
+/* Check to see if any of our clients can receive a packet */
+for (vc = s-vc-vlan-first_client; vc; vc = vc-next) {
+   /* Skip ourselves */
+   if (vc == s-vc)
+   continue;
+
+   if (!vc-fd_can_read) {
+   /* no fd_can_read handler, they always can receive */
+   can_receive = 1;
+   } else
+   can_receive = vc-fd_can_read(vc-opaque);
+
+   /* Once someone can receive, we try to send a packet */
+   if (can_receive)
+   break;
+}
+
+return can_receive;
+}
+
 static void tap_send(void *opaque)
 {
 TAPState *s = opaque;
-uint8_t buf[4096];
-int size;
 
+/* First try to send any buffered packet */
+if (s-size  0) {
+   int err;
+
+   /* If noone can receive the packet, buffer it */
+   err = qemu_send_packet(s-vc, s-buf, s-size);
+   if (err == -EAGAIN)
+   return;
+}
+
+/* Read packets until we hit EAGAIN */
+do {
 #ifdef __sun__
-struct strbuf sbuf;
-int f = 0;
-sbuf.maxlen = sizeof(buf);
-sbuf.buf = buf;
-size = getmsg(s-fd, NULL, sbuf, f) =0 ? sbuf.len : -1;
+   struct strbuf sbuf;
+   int f = 0;
+   sbuf.maxlen = sizeof(s-buf);
+   sbuf.buf = s-buf;
+   s-size = getmsg(s-fd, NULL, sbuf, f) =0 ? sbuf.len : -1;
 #else
-size = read(s-fd, buf, sizeof(buf));
+   s-size = read(s-fd, s-buf, sizeof(s-buf));
 #endif
-if (size  0) {
-qemu_send_packet(s-vc, buf, size);
-}
+
+   if (s-size == -1  errno == EINTR)
+   continue;
+
+   if (s-size  0) {
+   int err;
+
+   /* If noone can receive the packet, buffer it */
+   err = qemu_send_packet(s-vc, s-buf, s-size);
+   if (err == -EAGAIN)
+   break;
+   }
+} while (s-size  0);
 }
 
 /* fd support */
@@ -4016,7 +4070,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
 return NULL;
 s-fd = fd;
 s-vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s);
-qemu_set_fd_handler2(s-fd, NULL, tap_send, NULL, s);
+qemu_set_fd_handler2(s-fd, tap_can_send, tap_send

[kvm-devel] [PATCH 4/5] Make virtio-net can_receive more accurate (v3)

2008-05-07 Thread Anthony Liguori
In the final patch of this series, we rely on a VLAN client's fd_can_read
method to avoid dropping packets.  Unfortunately, virtio's fd_can_read method
is not very accurate at the moment.  This patch addresses this.

It also generates a notification to the IO thread when more RX packets become
available.  If we say we can't receive a packet because no RX buffers are
available, this may result in the tap file descriptor not being select()'d.
Without notifying the IO thread, we may have to wait until the select() times
out before we can receive a packet (even if there is one pending).

This particular change makes RX performance very consistent.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 8d26832..5538979 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -14,6 +14,7 @@
 #include virtio.h
 #include net.h
 #include qemu-timer.h
+#include qemu-kvm.h
 
 /* from Linux's virtio_net.h */
 
@@ -60,11 +61,14 @@ typedef struct VirtIONet
 VirtQueue *rx_vq;
 VirtQueue *tx_vq;
 VLANClientState *vc;
-int can_receive;
 QEMUTimer *tx_timer;
 int tx_timer_active;
 } VirtIONet;
 
+/* TODO
+ * - we could suppress RX interrupt if we were so inclined.
+ */
+
 static VirtIONet *to_virtio_net(VirtIODevice *vdev)
 {
 return (VirtIONet *)vdev;
@@ -88,15 +92,24 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
 
 static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 {
-VirtIONet *n = to_virtio_net(vdev);
-n-can_receive = 1;
+/* We now have RX buffers, signal to the IO thread to break out of the
+   select to re-poll the tap file descriptor */
+if (kvm_enabled())
+   qemu_kvm_notify_work();
 }
 
 static int virtio_net_can_receive(void *opaque)
 {
 VirtIONet *n = opaque;
 
-return (n-vdev.status  VIRTIO_CONFIG_S_DRIVER_OK)  n-can_receive;
+if (n-rx_vq-vring.avail == NULL ||
+   !(n-vdev.status  VIRTIO_CONFIG_S_DRIVER_OK))
+   return 0;
+
+if (n-rx_vq-vring.avail-idx == n-rx_vq-last_avail_idx)
+   return 0;
+
+return 1;
 }
 
 static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
@@ -106,15 +119,8 @@ static void virtio_net_receive(void *opaque, const uint8_t 
*buf, int size)
 struct virtio_net_hdr *hdr;
 int offset, i;
 
-/* FIXME: the drivers really need to set their status better */
-if (n-rx_vq-vring.avail == NULL) {
-   n-can_receive = 0;
-   return;
-}
-
 if (virtqueue_pop(n-rx_vq, elem) == 0) {
-   /* wait until the guest adds some rx bufs */
-   n-can_receive = 0;
+   fprintf(stderr, virtio_net: this should not happen\n);
return;
 }
 
@@ -209,9 +215,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int 
devfn)
 
 n-vdev.update_config = virtio_net_update_config;
 n-vdev.get_features = virtio_net_get_features;
-n-rx_vq = virtio_add_queue(n-vdev, 512, virtio_net_handle_rx);
+n-rx_vq = virtio_add_queue(n-vdev, 128, virtio_net_handle_rx);
 n-tx_vq = virtio_add_queue(n-vdev, 128, virtio_net_handle_tx);
-n-can_receive = 0;
 memcpy(n-mac, nd-macaddr, 6);
 n-vc = qemu_new_vlan_client(nd-vlan, virtio_net_receive,
  virtio_net_can_receive, n);

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC] [VTD][patch 1/3] vt-d support for pci passthrough: kvm-vtd--kernel.patch

2008-05-07 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:

   
 What should be done for unmodified guest where there is no PV driver in
 the guest?  Would a call to mlock() from
 qemu/hw/pci-passthrough.c/add_pci_passthrough_device() a reasonable
 thing to do?
   

 Yup.  The idea is to ensure that the memory is always present, 
 without necessarily taking a reference to it.  This allows for memory 
 reclaiming which should allow for things like NUMA page migration.  
 We can't swap of course but that doesn't mean reclaimation isn't useful.
   

 I don't think we can do page migration with VT-d.  You need to be able 
 to detect whether the page has been changed by dma after you've copied 
 it but before you changed the pte, but VT-d doesn't allow that AFAICT.

Hrm, I would have to look at the VT-d but I suspect you're right.  
That's unfortunate.

That means mlock() isn't sufficient.  It also means that the VMAs can't 
be updated while the guest is running.  Is there any way to lock a vma 
region such that things like madvise/mmap(MAP_FIXED) will always fail?

Regards,

Anthony Liguori



-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC][PATCH] svm.c: tsc unsigned delta calculation

2008-05-07 Thread Anthony Liguori
Ryan Harper wrote:
 I've been playing around with smp guests on a couple amd systems and
 I've also seen some of the smp/locking issues which lead me to dig into
 some of the tsc code.  The svm_cpu_load, the tsc_offset calc will
 generate a massively large tsc_offset if we're switching cpus and
 tsc_this is ahead of the host_tsc value (delta would normally be
 negative, but since it's unsigned, we get a huge positive number).

 svm_vcpu_load()
 ...
 rdtscll(tsc_this);
 delta = vcpu-arch.host_tsc - tsc_this;
 svm-vmcb-control.tsc_offset += delta;
   

This math will work out fine since the very large number will generate 
an overflow and the result will be identical to if we were using s64s.  
We're using u64s because that's how the tsc_offset is defined by hardware.

 This is handled a little differently on Intel (in vmx.c) where there is
 a check:

 if (tsc_this  vcpu-arch.host_tsc)
 /* do delta and new offset calc */
   

So what your patch really does is change the behavior of the tsc_offset 
to increase the guest's TSC by a potentially large amount depending on 
how out of sync the TSC is on CPU migration.

The question is why this would make things work out better for you..

Do you have Gerd's kvm-clock most recent patch applied?

Regards,

Anthony Liguori

 This check makes sense to me in that we only need to ensure that we
 don't go backwards which means that unless the new cpu is behind the
 current vcpu's host_tsc, we can skip a new delta calc.  If the check
 doesn't make sense then we'll need to do the math with s64s.

 Attached patch fixed the case where an idle guest was live-locked.

   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH] Make sure to restore cpu_single_env when after cond_wait

2008-05-07 Thread Anthony Liguori
I discovered this while testing virtio save/restore this evening.  After
sleeping, cpu_single_env can change so we have to make sure to restore it.

This applies on top of my IO thread series.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index cc8f292..0990af6 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -246,8 +246,11 @@ static void pause_all_threads(void)
vcpu_info[i].stop = 1;
pthread_kill(vcpu_info[i].thread, SIG_IPI);
 }
-while (!all_threads_paused())
+while (!all_threads_paused()) {
+   CPUState *env = cpu_single_env;
pthread_cond_wait(qemu_pause_cond, qemu_mutex);
+   cpu_single_env = env;
+}
 }
 
 static void resume_all_threads(void)

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] Protected mode transitions and big real mode... still an issue

2008-05-06 Thread Anthony Liguori
Guillaume Thouvenin wrote:
 On Mon, 5 May 2008 16:29:21 +0300
 Mohammed Gamal [EMAIL PROTECTED] wrote:

   
 On Mon, May 5, 2008 at 3:57 PM, Anthony Liguori [EMAIL PROTECTED] wrote:

 
  WinXP fails to boot with your patch applied too.  FWIW, Ubuntu 8.04 has
  a fixed version of gfxboot that doesn't do nasty things with SS on
  privileged mode transitions.

   
 WinXP fails with the patch applied too. Ubuntu 7.10 live CD and
 FreeDOS don't boot but complain about instruction mov 0x11,sreg not
 being emulated.
 

 Can you try with this one please?
 On my computer it boots ubuntu-8.04-desktop-i386.iso liveCD and also
 openSUSE-10.3-GM-x86_64-mini.iso
   

8.04 is not a good test-case.  7.10 is what you want to try.

The good news is, 7.10 appears to work!  The bad news is that about 20% 
of the time, it crashes and displays the following:

kvm_run: failed entry, reason 5
kvm_run returned -8

So something appears to be a bit buggy.  Still, very good work!

Regards,

Anthony Liguori


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 1/4] Replace SIGUSR1 in io-thread with eventfd() (v2)

2008-05-06 Thread Anthony Liguori
Marcelo Tosatti wrote:
 Looks good (the whole series).

 Needs some good testing of course... Have you tested migration/loadvm?
   

No, but I will before resubmitting (which should be sometime tomorrow).

Regards,

Anthony Liguori

 On Mon, May 05, 2008 at 08:47:12AM -0500, Anthony Liguori wrote:
   
 It's a little odd to use signals to raise a notification on a file descriptor
 when we can just work directly with a file descriptor instead.  This patch
 converts the SIGUSR1 based notification in the io-thread to instead use an
 eventfd file descriptor.  If eventfd isn't available, we use a pipe() 
 instead.

 The benefit of using eventfd is that multiple notifications will be batched
 into a signal IO event.

 Signed-off-by: Anthony Liguori [EMAIL PROTECTED]
 


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC] [VTD][patch 1/3] vt-d support for pci passthrough: kvm-vtd--kernel.patch

2008-05-06 Thread Anthony Liguori
 and never program the IOMMU.  The initial mlock() in userspace 
is somewhat of a nop here but it's important with MMU-notifiers because 
we will no longer be holding a reference for guest memory.

We have to ensure we don't swap KVM guest memory while using hardware 
pass-through, but AFAICT, we do not need to make the memory 
non-reclaimable  As long as we reprogram the IOMMU with a new, valid, 
mapping everything should be fine.  mlock() really gives us the right 
semantics.

Semantically, a PV API that supports DMA window registration simply 
mlock()s the DMA regions on behalf of the guest.  No special logic 
should be needed.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC] [VTD][patch 3/3] vt-d support for pci passthrough: kvm-intel-iommu.patch

2008-05-06 Thread Anthony Liguori
Kay, Allen M wrote:
 Intel-iommu driver changes for kvm vt-d support.  Important changes are
 in intel-iommu.c.  The rest of the changes are for moving intel-iommu.h
 and iova.h from drivers/pci directory to include/linux directory.

 Signed-off-by: Allen M Kay [EMAIL PROTECTED]

 

  b/drivers/pci/dmar.c  |4 
  b/drivers/pci/intel-iommu.c   |   26 ++-
  b/drivers/pci/iova.c  |2 
  b/include/linux/intel-iommu.h |  344
 ++
  b/include/linux/iova.h|   52 ++
  drivers/pci/intel-iommu.h |  344
 --
  drivers/pci/iova.h|   52 --
  7 files changed, 416 insertions(+), 408 deletions(-)

 

 diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
 index f941f60..a58a5b0 100644
 --- a/drivers/pci/dmar.c
 +++ b/drivers/pci/dmar.c
 @@ -26,8 +26,8 @@
  
  #include linux/pci.h
  #include linux/dmar.h
 -#include iova.h
 -#include intel-iommu.h
 +#include linux/iova.h
 +#include linux/intel-iommu.h
  
  #undef PREFIX
  #define PREFIX DMAR:
 diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
 index 4cb949f..bfa888b 100644
 --- a/drivers/pci/intel-iommu.c
 +++ b/drivers/pci/intel-iommu.c
 @@ -31,8 +31,8 @@
  #include linux/dmar.h
  #include linux/dma-mapping.h
  #include linux/mempool.h
 -#include iova.h
 -#include intel-iommu.h
 +#include linux/iova.h
 +#include linux/intel-iommu.h
  #include asm/proto.h /* force_iommu in this header in x86-64*/
  #include asm/cacheflush.h
  #include asm/gart.h
 @@ -1056,7 +1056,7 @@ static void free_iommu(struct intel_iommu *iommu)
   kfree(iommu);
  }
  
 -static struct dmar_domain * iommu_alloc_domain(struct intel_iommu
 *iommu)
 +struct dmar_domain * iommu_alloc_domain(struct intel_iommu *iommu)
  {
   unsigned long num;
   unsigned long ndomains;
 @@ -1086,8 +1086,9 @@ static struct dmar_domain *
 iommu_alloc_domain(struct intel_iommu *iommu)
  
   return domain;
  }
 +EXPORT_SYMBOL_GPL(iommu_alloc_domain);
  
 -static void iommu_free_domain(struct dmar_domain *domain)
 +void iommu_free_domain(struct dmar_domain *domain)
  {
   unsigned long flags;
  
 @@ -1095,6 +1096,7 @@ static void iommu_free_domain(struct dmar_domain
 *domain)
   clear_bit(domain-id, domain-iommu-domain_ids);
   spin_unlock_irqrestore(domain-iommu-lock, flags);
  }
 +EXPORT_SYMBOL_GPL(iommu_free_domain);
  
  static struct iova_domain reserved_iova_list;
  static struct lock_class_key reserved_alloc_key;
 @@ -1160,7 +1162,7 @@ static inline int guestwidth_to_adjustwidth(int
 gaw)
   return agaw;
  }
  
 -static int domain_init(struct dmar_domain *domain, int guest_width)
 +int domain_init(struct dmar_domain *domain, int guest_width)
  {
   

I think it's already been mentioned, but these are pretty terrible names 
if you're exporting these symbols.  Linux supports other IOMMUs so VT-d 
should not be hogging the iommu_* namespace.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC] [VTD][patch 2/3] vt-d support for pci passthrough: kvm-vtd-user.patch

2008-05-06 Thread Anthony Liguori
Avi Kivity wrote:
 Kay, Allen M wrote:
   
 Still todo: move vt.d to kvm-intel.ko module.
   
 

 Not sure it's the right thing to do. If we get the iommus abstracted 
 properly, we can rename vtd.c to dma.c and move it to virt/kvm/.

 The code is certainly a lot more about managing memory than anything vmx 
 specific. It's hardly x86 specific, even.
   

Really, an external interface to KVM that allowed someone to query the 
GPA = PA mapping would suffice.  It should not fault in pages that 
aren't present and we should provide notifications for when the mapping 
changes for a given reason.  Userspace can enforce the requirement that 
memory remains present via mlock().  This allows us to implement a PV 
API for DMA registration without the IOMMU code having any particular 
knowledge of it.

Regards,

Anthony Liguori


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC] [VTD][patch 1/3] vt-d support for pci passthrough: kvm-vtd--kernel.patch

2008-05-06 Thread Anthony Liguori
Kay, Allen M wrote:
 We have to ensure we don't swap KVM guest memory while using hardware 
 pass-through, but AFAICT, we do not need to make the memory 
 non-reclaimable  As long as we reprogram the IOMMU with a new, valid, 
 mapping everything should be fine.  mlock() really gives us the right 
 semantics.

 Semantically, a PV API that supports DMA window registration simply 
 mlock()s the DMA regions on behalf of the guest.  No special logic 
 should be needed.

 

 What should be done for unmodified guest where there is no PV driver in
 the guest?  Would a call to mlock() from
 qemu/hw/pci-passthrough.c/add_pci_passthrough_device() a reasonable
 thing to do?
   

Yup.  The idea is to ensure that the memory is always present, without 
necessarily taking a reference to it.  This allows for memory reclaiming 
which should allow for things like NUMA page migration.  We can't swap 
of course but that doesn't mean reclaimation isn't useful.

Regards,

Anthony Liguori

 Allen
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] Protected mode transitions and big real mode... still an issue

2008-05-05 Thread Anthony Liguori
Guillaume Thouvenin wrote:
 On Sat, 3 May 2008 13:56:56 +0530
 Balaji Rao [EMAIL PROTECTED] wrote:

  
   
 With your patch applied ubuntu 8.04 livecd fails to boot. Not any better 
 with Marcelo's patch on top.
 

 Hi Balaji,

  And without the patch, can you boot the ubuntu 8.04 livecd? 
   

WinXP fails to boot with your patch applied too.  FWIW, Ubuntu 8.04 has 
a fixed version of gfxboot that doesn't do nasty things with SS on 
privileged mode transitions.

Regards,

Anthony Liguori

 Regards,
 Guillaume

 -
 This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
 Don't miss this year's exciting event. There's still time to save $100. 
 Use priority code J8TL2D2. 
 http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 4/4] Only select once per-main_loop iteration (v2)

2008-05-05 Thread Anthony Liguori
QEMU is rather aggressive about exhausting the wait period when selecting.
This is fine when the wait period is low and when there is significant delays
in-between selects as it improves IO throughput.

With the IO thread, there is a very small delay between selects and our wait
period for select is very large.  This patch changes main_loop_wait to only
select once before doing the various other things in the main loop.  This
generally improves responsiveness of things like SDL but also improves
individual file descriptor throughput quite dramatically.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index e16b261..6a90e68 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -423,24 +423,6 @@ void qemu_kvm_notify_work(void)
fprintf(stderr, failed to notify io thread\n);
 }
 
-static int received_signal;
-
-/* QEMU relies on periodically breaking out of select via EINTR to poll for IO
-   and timer signals.  Since we're now using a file descriptor to handle
-   signals, select() won't be interrupted by a signal.  We need to forcefully
-   break the select() loop when a signal is received hence
-   kvm_check_received_signal(). */
-
-int kvm_check_received_signal(void)
-{
-if (received_signal) {
-   received_signal = 0;
-   return 1;
-}
-
-return 0;
-}
-
 /* If we have signalfd, we mask out the signals we want to handle and then
  * use signalfd to listen for them.  We rely on whatever the current signal
  * handler is to dispatch the signals when we receive them.
@@ -474,8 +456,6 @@ static void sigfd_handler(void *opaque)
pthread_cond_signal(qemu_aio_cond); 
}
 }
-
-received_signal = 1;
 }
 
 /* Used to break IO thread out of select */
@@ -497,8 +477,6 @@ static void io_thread_wakeup(void *opaque)
 
offset += len;
 }
-
-received_signal = 1;
 }
 
 int kvm_main_loop(void)
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h
index e1e461a..34aabd2 100644
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -114,15 +114,6 @@ static inline void kvm_sleep_end(void)
kvm_mutex_lock();
 }
 
-int kvm_check_received_signal(void);
-
-static inline int kvm_received_signal(void)
-{
-if (kvm_enabled())
-   return kvm_check_received_signal();
-return 0;
-}
-
 #if !defined(SYS_signalfd)
 struct signalfd_siginfo {
 uint32_t ssi_signo;
diff --git a/qemu/vl.c b/qemu/vl.c
index e9f0ca4..6935a82 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -7946,23 +7946,18 @@ void main_loop_wait(int timeout)
 slirp_select_fill(nfds, rfds, wfds, xfds);
 }
 #endif
- moreio:
 ret = qemu_select(nfds + 1, rfds, wfds, xfds, tv);
 if (ret  0) {
 IOHandlerRecord **pioh;
-int more = 0;
 
 for(ioh = first_io_handler; ioh != NULL; ioh = ioh-next) {
 if (!ioh-deleted  ioh-fd_read  FD_ISSET(ioh-fd, rfds)) {
 ioh-fd_read(ioh-opaque);
-if (!ioh-fd_read_poll || ioh-fd_read_poll(ioh-opaque))
-more = 1;
-else
+if (!(ioh-fd_read_poll  ioh-fd_read_poll(ioh-opaque)))
 FD_CLR(ioh-fd, rfds);
 }
 if (!ioh-deleted  ioh-fd_write  FD_ISSET(ioh-fd, wfds)) {
 ioh-fd_write(ioh-opaque);
-more = 1;
 }
 }
 
@@ -7976,8 +7971,6 @@ void main_loop_wait(int timeout)
 } else
 pioh = ioh-next;
 }
-if (more  !kvm_received_signal())
-goto moreio;
 }
 #if defined(CONFIG_SLIRP)
 if (slirp_inited) {

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 2/4] Use signalfd() in io-thread (v2)

2008-05-05 Thread Anthony Liguori
This patch reworks the IO thread to use signalfd() instead of sigtimedwait().
This will eliminate the need to use SIGIO everywhere.  In this version of the
patch, we use signalfd() when it's available.  When it isn't available, we
create a separate thread and use sigwaitinfo() to simulate signalfd().

I've tested Windows and Linux guests with SMP without seeing an obvious
regressions.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/kvm-compatfd.c b/qemu/kvm-compatfd.c
index 1b030ba..3c2be28 100644
--- a/qemu/kvm-compatfd.c
+++ b/qemu/kvm-compatfd.c
@@ -15,6 +15,100 @@
 #include qemu-kvm.h
 
 #include sys/syscall.h
+#include pthread.h
+
+struct sigfd_compat_info
+{
+sigset_t mask;
+int fd;
+};
+
+static void *sigwait_compat(void *opaque)
+{
+struct sigfd_compat_info *info = opaque;
+int err;
+
+sigprocmask(SIG_BLOCK, info-mask, NULL);
+
+do {
+   siginfo_t siginfo;
+
+   kvm_sleep_begin();
+   err = sigwaitinfo(info-mask, siginfo);
+   kvm_sleep_end();
+
+   if (err == -1  errno == EINTR)
+   continue;
+
+   if (err  0) {
+   char buffer[128];
+   size_t offset = 0;
+
+   memcpy(buffer, err, sizeof(err));
+   while (offset  sizeof(buffer)) {
+   ssize_t len;
+
+   len = write(info-fd, buffer + offset,
+   sizeof(buffer) - offset);
+   if (len == -1  errno == EINTR)
+   continue;
+
+   if (len = 0) {
+   err = -1;
+   break;
+   }
+
+   offset += len;
+   }
+   }
+} while (err = 0);
+
+return NULL;
+}
+
+static int kvm_signalfd_compat(const sigset_t *mask)
+{
+pthread_attr_t attr;
+pthread_t tid;
+struct sigfd_compat_info *info;
+int fds[2];
+
+info = malloc(sizeof(*info));
+if (info == NULL) {
+   errno = ENOMEM;
+   return -1;
+}
+
+if (pipe(fds) == -1) {
+   free(info);
+   return -1;
+}
+
+memcpy(info-mask, mask, sizeof(*mask));
+info-fd = fds[1];
+
+pthread_attr_init(attr);
+pthread_attr_setdetachstate(attr, PTHREAD_CREATE_DETACHED);
+
+pthread_create(tid, attr, sigwait_compat, info);
+
+pthread_attr_destroy(attr);
+
+return fds[0];
+}
+
+int kvm_signalfd(const sigset_t *mask)
+{
+#if defined(SYS_signalfd)
+int ret;
+
+ret = syscall(SYS_signalfd, -1, mask, _NSIG / 8);
+if (!(ret == -1  errno == ENOSYS))
+   return ret;
+#endif
+
+return kvm_signalfd_compat(mask);
+}
 
 int kvm_eventfd(int *fds)
 {
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 7134e56..0ea03f8 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -12,6 +12,9 @@ int kvm_allowed = 1;
 int kvm_irqchip = 1;
 int kvm_pit = 1;
 
+#include qemu-common.h
+#include console.h
+
 #include string.h
 #include hw/hw.h
 #include sysemu.h
@@ -40,14 +43,6 @@ __thread struct vcpu_info *vcpu;
 
 static int qemu_system_ready;
 
-struct qemu_kvm_signal_table {
-sigset_t sigset;
-sigset_t negsigset;
-};
-
-static struct qemu_kvm_signal_table io_signal_table;
-static struct qemu_kvm_signal_table vcpu_signal_table;
-
 #define SIG_IPI (SIGRTMIN+4)
 
 struct vcpu_info {
@@ -172,37 +167,23 @@ static int has_work(CPUState *env)
 return kvm_arch_has_work(env);
 }
 
-static int kvm_process_signal(int si_signo)
-{
-struct sigaction sa;
-
-switch (si_signo) {
-case SIGUSR2:
-pthread_cond_signal(qemu_aio_cond);
-break;
-case SIGALRM:
-case SIGIO:
-sigaction(si_signo, NULL, sa);
-sa.sa_handler(si_signo);
-break;
-}
-
-return 1;
-}
-
-static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env,
-  int timeout)
+static int kvm_eat_signal(CPUState *env, int timeout)
 {
 struct timespec ts;
 int r, e, ret = 0;
 siginfo_t siginfo;
+sigset_t waitset;
 
 ts.tv_sec = timeout / 1000;
 ts.tv_nsec = (timeout % 1000) * 100;
-r = sigtimedwait(waitset-sigset, siginfo, ts);
+sigemptyset(waitset);
+sigaddset(waitset, SIG_IPI);
+
+r = sigtimedwait(waitset, siginfo, ts);
 if (r == -1  (errno == EAGAIN || errno == EINTR)  !timeout)
return 0;
 e = errno;
+
 pthread_mutex_lock(qemu_mutex);
 if (env  vcpu)
 cpu_single_env = vcpu-env;
@@ -211,7 +192,7 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table 
*waitset, CPUState *env,
exit(1);
 }
 if (r != -1)
-ret = kvm_process_signal(siginfo.si_signo);
+   ret = 1;
 
 if (env  vcpu_info[env-cpu_index].stop) {
vcpu_info[env-cpu_index].stop = 0;
@@ -227,14 +208,13 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table 
*waitset, CPUState *env,
 static void kvm_eat_signals(CPUState *env, int timeout)
 {
 int r = 0;
-struct qemu_kvm_signal_table *waitset = vcpu_signal_table;
 
-while (kvm_eat_signal(waitset, env, 0

[kvm-devel] [PATCH 3/4] Interrupt io thread in qemu_set_fd_handler2 (v2)

2008-05-05 Thread Anthony Liguori
The select() in the IO thread may wait a long time before rebuilding the
fd set.  Whenever we do something that changes the fd set, we should interrupt
the IO thread.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/vl.c b/qemu/vl.c
index 1192759..e9f0ca4 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -260,6 +260,16 @@ static int event_pending = 1;
 
 #define TFR(expr) do { if ((expr) != -1) break; } while (errno == EINTR)
 
+/* KVM runs the main loop in a separate thread.  If we update one of the lists
+ * that are polled before or after select(), we need to make sure to break out
+ * of the select() to ensure the new item is serviced.
+ */
+static void main_loop_break(void)
+{
+if (kvm_enabled())
+   qemu_kvm_notify_work();
+}
+
 void decorate_application_name(char *appname, int max_len)
 {
 if (kvm_enabled())
@@ -5680,6 +5690,7 @@ int qemu_set_fd_handler2(int fd,
 ioh-opaque = opaque;
 ioh-deleted = 0;
 }
+main_loop_break();
 return 0;
 }
 
@@ -7606,8 +7617,7 @@ void qemu_bh_schedule(QEMUBH *bh)
 if (env) {
 cpu_interrupt(env, CPU_INTERRUPT_EXIT);
 }
-if (kvm_enabled())
-qemu_kvm_notify_work();
+main_loop_break();
 }
 
 void qemu_bh_cancel(QEMUBH *bh)

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 1/4] Replace SIGUSR1 in io-thread with eventfd() (v2)

2008-05-05 Thread Anthony Liguori
It's a little odd to use signals to raise a notification on a file descriptor
when we can just work directly with a file descriptor instead.  This patch
converts the SIGUSR1 based notification in the io-thread to instead use an
eventfd file descriptor.  If eventfd isn't available, we use a pipe() instead.

The benefit of using eventfd is that multiple notifications will be batched
into a signal IO event.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/Makefile.target b/qemu/Makefile.target
index 2316c92..db6912e 100644
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -203,7 +203,7 @@ CPPFLAGS+=-I$(SRC_PATH)/tcg/sparc
 endif
 
 ifeq ($(USE_KVM), 1)
-LIBOBJS+=qemu-kvm.o
+LIBOBJS+=qemu-kvm.o kvm-compatfd.o
 endif
 ifdef CONFIG_SOFTFLOAT
 LIBOBJS+=fpu/softfloat.o
diff --git a/qemu/kvm-compatfd.c b/qemu/kvm-compatfd.c
new file mode 100644
index 000..1b030ba
--- /dev/null
+++ b/qemu/kvm-compatfd.c
@@ -0,0 +1,33 @@
+/*
+ * signalfd/eventfd compatibility
+ *
+ * Copyright IBM, Corp. 2008
+ *
+ * Authors:
+ *  Anthony Liguori   [EMAIL PROTECTED]
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include qemu-common.h
+#include qemu-kvm.h
+
+#include sys/syscall.h
+
+int kvm_eventfd(int *fds)
+{
+#if defined(SYS_eventfd)
+int ret;
+
+ret = syscall(SYS_eventfd, 0);
+if (ret = 0) {
+   fds[0] = fds[1] = ret;
+   return 0;
+} else if (!(ret == -1  errno == ENOSYS))
+   return ret;
+#endif
+
+return pipe(fds);
+}
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 9a9bf59..7134e56 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -15,6 +15,8 @@ int kvm_pit = 1;
 #include string.h
 #include hw/hw.h
 #include sysemu.h
+#include qemu-common.h
+#include console.h
 
 #include qemu-kvm.h
 #include libkvm.h
@@ -61,6 +63,7 @@ struct vcpu_info {
 } vcpu_info[256];
 
 pthread_t io_thread;
+static int io_thread_fd = -1;
 
 static inline unsigned long kvm_get_thread_id(void)
 {
@@ -213,7 +216,7 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table 
*waitset, CPUState *env,
 if (env  vcpu_info[env-cpu_index].stop) {
vcpu_info[env-cpu_index].stop = 0;
vcpu_info[env-cpu_index].stopped = 1;
-   pthread_kill(io_thread, SIGUSR1);
+   qemu_kvm_notify_work();
 }
 pthread_mutex_unlock(qemu_mutex);
 
@@ -418,7 +421,6 @@ static void qemu_kvm_init_signal_tables(void)
 
 kvm_add_signal(io_signal_table, SIGIO);
 kvm_add_signal(io_signal_table, SIGALRM);
-kvm_add_signal(io_signal_table, SIGUSR1);
 kvm_add_signal(io_signal_table, SIGUSR2);
 
 kvm_add_signal(vcpu_signal_table, SIG_IPI);
@@ -440,8 +442,51 @@ int kvm_init_ap(void)
 
 void qemu_kvm_notify_work(void)
 {
-if (io_thread)
-pthread_kill(io_thread, SIGUSR1);
+uint64_t value = 1;
+char buffer[8];
+size_t offset = 0;
+
+if (io_thread_fd == -1)
+   return;
+
+memcpy(buffer, value, sizeof(value));
+
+while (offset  8) {
+   ssize_t len;
+
+   len = write(io_thread_fd, buffer + offset, 8 - offset);
+   if (len == -1  errno == EINTR)
+   continue;
+
+   if (len = 0)
+   break;
+
+   offset += len;
+}
+
+if (offset != 8)
+   fprintf(stderr, failed to notify io thread\n);
+}
+
+/* Used to break IO thread out of select */
+static void io_thread_wakeup(void *opaque)
+{
+int fd = (unsigned long)opaque;
+char buffer[8];
+size_t offset = 0;
+
+while (offset  8) {
+   ssize_t len;
+
+   len = read(fd, buffer + offset, 8 - offset);
+   if (len == -1  errno == EINTR)
+   continue;
+
+   if (len = 0)
+   break;
+
+   offset += len;
+}
 }
 
 /*
@@ -452,8 +497,20 @@ void qemu_kvm_notify_work(void)
 
 int kvm_main_loop(void)
 {
+int fds[2];
+
 io_thread = pthread_self();
 qemu_system_ready = 1;
+
+if (kvm_eventfd(fds) == -1) {
+   fprintf(stderr, failed to create eventfd\n);
+   return -errno;
+}
+
+qemu_set_fd_handler2(fds[0], NULL, io_thread_wakeup, NULL,
+(void *)(unsigned long)fds[0]);
+
+io_thread_fd = fds[1];
 pthread_mutex_unlock(qemu_mutex);
 
 pthread_cond_broadcast(qemu_system_cond);
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h
index 024a653..8cd63e6 100644
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -97,4 +97,6 @@ extern kvm_context_t kvm_context;
 #define qemu_kvm_pit_in_kernel() (0)
 #endif
 
+int kvm_eventfd(int *fds);
+
 #endif

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https

Re: [kvm-devel] [patch 0/3] QEMU/KVM: add support for 128 PCI slots (v2)

2008-05-05 Thread Anthony Liguori
Avi Kivity wrote:
 Marcelo Tosatti wrote:
   
 Add three PCI bridges to support 128 slots.

 Changes since v1:
 - Remove I/O address range support (so standard PCI I/O space is used).
 - Verify that there's no special quirks for 82801 PCI bridge.
 - Introduce separate flat IRQ mapping function for non-SPARC targets.

   
 

 I've cooled off on the 128 slot stuff, mainly because most real hosts 
 don't have them.  An unusual configuration will likely lead to problems 
 as most guest OSes and workloads will not have been tested thoroughly 
 with them.

 - it requires a large number of interrupts, which are difficult to 
 provide, and which it is hard to ensure all OSes support.  MSI is 
 relatively new.
 - is only a few interrupts are available, then each interrupt requires 
 scanning a large number of queues

 If we are to do this, then we need better tests than 80 disks show up.

 The alternative approach of having the virtio block device control up to 
 16 disks allows having those 80 disks with just 5 slots (and 5 
 interrupts).  This is similar to the way traditional SCSI controllers 
 behave, and so should not surprise the guest OS.
   

If you have a single virtio-blk device that shows up as 8 functions, we 
could achieve the same thing.  We can cheat with the interrupt handlers 
to avoid cache line bouncing too.  Plus, we can use PCI hotplug so we 
don't have to reinvent a new hotplug mechanism.

I'm inclined to think that ring sharing isn't as useful as it seems as 
long as we don't have indirect scatter gather lists.

Regards,

Anthony Liguori



-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Use pipe() to simulate signalfd() (v2)

2008-05-04 Thread Anthony Liguori
Avi Kivity wrote:
 Please split the signalfd() emulation into a separate (preparatory) 
 patch.  Also, we need to detect signalfd() at run time as well as 
 compile time, since qemu may be compiled on a different machine than it 
 is run on.
   

Ok.

 We can keep the signals blocked, but run the signalfd emulation in a 
 separate thread (where it can dequeue signals using sigwait as an added 
 bonus).  This will reduce the differences between the two modes at the 
 expense of increased signalfd() emulation complexity, which I think is a 
 good tradeoff.
   

signalfd() can't be emulated transparently with a separate thread 
because you won't be able to wait on signals destined for the specific 
thread (only signals sent to the process).  We deliver signals directly 
to the IO thread (specifically, SIGUSR1) so this could get nasty.  We 
could just not block SIGUSR1 and rely on the fact that it will break us 
out of select() but I that makes things a bit more subtle than I'd like.

I personally prefer using pipe() within the same thread although I'm 
willing to also do the separate thread.

Regards,

Anthony LIguori

 We can move signalfd emulation into a separate file in order to improve 
 readability.

   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Use pipe() to simulate signalfd() (v2)

2008-05-04 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:
 We can keep the signals blocked, but run the signalfd emulation in a 
 separate thread (where it can dequeue signals using sigwait as an 
 added bonus).  This will reduce the differences between the two 
 modes at the expense of increased signalfd() emulation complexity, 
 which I think is a good tradeoff.
   

 signalfd() can't be emulated transparently with a separate thread 
 because you won't be able to wait on signals destined for the 
 specific thread (only signals sent to the process).  We deliver 
 signals directly to the IO thread (specifically, SIGUSR1) so this 
 could get nasty.  We could just not block SIGUSR1 and rely on the 
 fact that it will break us out of select() but I that makes things a 
 bit more subtle than I'd like.


 We can completely kill off SIGUSR1 and replace it with its own pipe.  
 There's hardly any point in asking the kernel to signal a task, then 
 having the kernel convert this to a fd write.

 (Or maybe use eventfd())

That's a really good idea.  I'll update the patch.

Regards,

Anthony Liguori


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 1/3] Use signalfd() in io-thread

2008-05-04 Thread Anthony Liguori
This patch reworks the IO thread to use signalfd() instead of sigtimedwait().
This will eliminate the need to use SIGIO everywhere.  In this version of the
patch, we use signalfd() when it's available.  When it isn't available, we
create a separate thread and use sigwaitinfo() to simulate signalfd().

We cannot handle thread-specific signals with signalfd() emulation so also
replace SIGUSR1 notifications to the io-thread with an eventfd.  Since eventfd
isn't always available, use pipe() to emulate eventfd.

I've tested Windows and Linux guests with SMP without seeing an obvious
regressions.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/Makefile.target b/qemu/Makefile.target
index 2316c92..db6912e 100644
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -203,7 +203,7 @@ CPPFLAGS+=-I$(SRC_PATH)/tcg/sparc
 endif
 
 ifeq ($(USE_KVM), 1)
-LIBOBJS+=qemu-kvm.o
+LIBOBJS+=qemu-kvm.o kvm-compatfd.o
 endif
 ifdef CONFIG_SOFTFLOAT
 LIBOBJS+=fpu/softfloat.o
diff --git a/qemu/kvm-compatfd.c b/qemu/kvm-compatfd.c
new file mode 100644
index 000..3c2be28
--- /dev/null
+++ b/qemu/kvm-compatfd.c
@@ -0,0 +1,127 @@
+/*
+ * signalfd/eventfd compatibility
+ *
+ * Copyright IBM, Corp. 2008
+ *
+ * Authors:
+ *  Anthony Liguori   [EMAIL PROTECTED]
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include qemu-common.h
+#include qemu-kvm.h
+
+#include sys/syscall.h
+#include pthread.h
+
+struct sigfd_compat_info
+{
+sigset_t mask;
+int fd;
+};
+
+static void *sigwait_compat(void *opaque)
+{
+struct sigfd_compat_info *info = opaque;
+int err;
+
+sigprocmask(SIG_BLOCK, info-mask, NULL);
+
+do {
+   siginfo_t siginfo;
+
+   kvm_sleep_begin();
+   err = sigwaitinfo(info-mask, siginfo);
+   kvm_sleep_end();
+
+   if (err == -1  errno == EINTR)
+   continue;
+
+   if (err  0) {
+   char buffer[128];
+   size_t offset = 0;
+
+   memcpy(buffer, err, sizeof(err));
+   while (offset  sizeof(buffer)) {
+   ssize_t len;
+
+   len = write(info-fd, buffer + offset,
+   sizeof(buffer) - offset);
+   if (len == -1  errno == EINTR)
+   continue;
+
+   if (len = 0) {
+   err = -1;
+   break;
+   }
+
+   offset += len;
+   }
+   }
+} while (err = 0);
+
+return NULL;
+}
+
+static int kvm_signalfd_compat(const sigset_t *mask)
+{
+pthread_attr_t attr;
+pthread_t tid;
+struct sigfd_compat_info *info;
+int fds[2];
+
+info = malloc(sizeof(*info));
+if (info == NULL) {
+   errno = ENOMEM;
+   return -1;
+}
+
+if (pipe(fds) == -1) {
+   free(info);
+   return -1;
+}
+
+memcpy(info-mask, mask, sizeof(*mask));
+info-fd = fds[1];
+
+pthread_attr_init(attr);
+pthread_attr_setdetachstate(attr, PTHREAD_CREATE_DETACHED);
+
+pthread_create(tid, attr, sigwait_compat, info);
+
+pthread_attr_destroy(attr);
+
+return fds[0];
+}
+
+int kvm_signalfd(const sigset_t *mask)
+{
+#if defined(SYS_signalfd)
+int ret;
+
+ret = syscall(SYS_signalfd, -1, mask, _NSIG / 8);
+if (!(ret == -1  errno == ENOSYS))
+   return ret;
+#endif
+
+return kvm_signalfd_compat(mask);
+}
+
+int kvm_eventfd(int *fds)
+{
+#if defined(SYS_eventfd)
+int ret;
+
+ret = syscall(SYS_eventfd, 0);
+if (ret = 0) {
+   fds[0] = fds[1] = ret;
+   return 0;
+} else if (!(ret == -1  errno == ENOSYS))
+   return ret;
+#endif
+
+return pipe(fds);
+}
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 9a9bf59..e16b261 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -12,6 +12,9 @@ int kvm_allowed = 1;
 int kvm_irqchip = 1;
 int kvm_pit = 1;
 
+#include qemu-common.h
+#include console.h
+
 #include string.h
 #include hw/hw.h
 #include sysemu.h
@@ -38,14 +41,6 @@ __thread struct vcpu_info *vcpu;
 
 static int qemu_system_ready;
 
-struct qemu_kvm_signal_table {
-sigset_t sigset;
-sigset_t negsigset;
-};
-
-static struct qemu_kvm_signal_table io_signal_table;
-static struct qemu_kvm_signal_table vcpu_signal_table;
-
 #define SIG_IPI (SIGRTMIN+4)
 
 struct vcpu_info {
@@ -61,6 +56,7 @@ struct vcpu_info {
 } vcpu_info[256];
 
 pthread_t io_thread;
+static int io_thread_fd = -1;
 
 static inline unsigned long kvm_get_thread_id(void)
 {
@@ -169,37 +165,23 @@ static int has_work(CPUState *env)
 return kvm_arch_has_work(env);
 }
 
-static int kvm_process_signal(int si_signo)
-{
-struct sigaction sa;
-
-switch (si_signo) {
-case SIGUSR2:
-pthread_cond_signal(qemu_aio_cond);
-break;
-case SIGALRM:
-case SIGIO:
-sigaction(si_signo, NULL, sa);
-sa.sa_handler(si_signo);
-break;
-}
-
-return 1;
-}
-
-static int kvm_eat_signal(struct

[kvm-devel] [PATCH 3/3] Stop dropping so many RX packets in tap (v2)

2008-05-04 Thread Anthony Liguori
Normally, tap always reads packets and simply lets the client drop them if it
cannot receive them.  For virtio-net, this results in massive packet loss and
about an 80% performance loss in TCP throughput.

This patch modifies qemu_send_packet() to only deliver a packet to a VLAN
client if it doesn't have a fd_can_read method or the fd_can_read method
indicates that it can receive packets.  We also return a status of whether
any clients were able to receive the packet.

If no clients were able to receive a packet, we buffer the packet until a
client indicates that it can receive packets again.

This patch also modifies the tap code to only read from the tap fd if at least
one client on the VLAN is able to receive a packet.

Finally, this patch changes the tap code to drain all possible packets from
the tap device when the tap fd is readable.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..dfdf9af 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -29,7 +29,7 @@ VLANClientState *qemu_new_vlan_client(VLANState *vlan,
   IOCanRWHandler *fd_can_read,
   void *opaque);
 int qemu_can_send_packet(VLANClientState *vc);
-void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
+int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
 void qemu_handler_true(void *opaque);
 
 void do_info_network(void);
diff --git a/qemu/vl.c b/qemu/vl.c
index c51d704..f4abea3 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -3760,10 +3760,11 @@ int qemu_can_send_packet(VLANClientState *vc1)
 return 0;
 }
 
-void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size)
+int qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size)
 {
 VLANState *vlan = vc1-vlan;
 VLANClientState *vc;
+int ret = -EAGAIN;
 
 #if 0
 printf(vlan %d send:\n, vlan-id);
@@ -3771,9 +3772,14 @@ void qemu_send_packet(VLANClientState *vc1, const 
uint8_t *buf, int size)
 #endif
 for(vc = vlan-first_client; vc != NULL; vc = vc-next) {
 if (vc != vc1) {
-vc-fd_read(vc-opaque, buf, size);
+   if (!vc-fd_can_read || vc-fd_can_read(vc-opaque)) {
+   vc-fd_read(vc-opaque, buf, size);
+   ret = 0;
+   }
 }
 }
+
+return ret;
 }
 
 #if defined(CONFIG_SLIRP)
@@ -3976,6 +3982,8 @@ typedef struct TAPState {
 VLANClientState *vc;
 int fd;
 char down_script[1024];
+char buf[4096];
+int size;
 } TAPState;
 
 static void tap_receive(void *opaque, const uint8_t *buf, int size)
@@ -3991,24 +3999,70 @@ static void tap_receive(void *opaque, const uint8_t 
*buf, int size)
 }
 }
 
+static int tap_can_send(void *opaque)
+{
+TAPState *s = opaque;
+VLANClientState *vc;
+int can_receive = 0;
+ 
+/* Check to see if any of our clients can receive a packet */
+for (vc = s-vc-vlan-first_client; vc; vc = vc-next) {
+   /* Skip ourselves */
+   if (vc == s-vc)
+   continue;
+
+   if (!vc-fd_can_read) {
+   /* no fd_can_read handler, they always can receive */
+   can_receive = 1;
+   } else
+   can_receive = vc-fd_can_read(vc-opaque);
+
+   /* Once someone can receive, we try to send a packet */
+   if (can_receive)
+   break;
+}
+
+return can_receive;
+}
+
 static void tap_send(void *opaque)
 {
 TAPState *s = opaque;
-uint8_t buf[4096];
-int size;
 
+/* First try to send any buffered packet */
+if (s-size  0) {
+   int err;
+
+   /* If noone can receive the packet, buffer it */
+   err = qemu_send_packet(s-vc, s-buf, s-size);
+   if (err == -EAGAIN)
+   return;
+}
+
+/* Read packets until we hit EAGAIN */
+do {
 #ifdef __sun__
-struct strbuf sbuf;
-int f = 0;
-sbuf.maxlen = sizeof(buf);
-sbuf.buf = buf;
-size = getmsg(s-fd, NULL, sbuf, f) =0 ? sbuf.len : -1;
+   struct strbuf sbuf;
+   int f = 0;
+   sbuf.maxlen = sizeof(s-buf);
+   sbuf.buf = s-buf;
+   s-size = getmsg(s-fd, NULL, sbuf, f) =0 ? sbuf.len : -1;
 #else
-size = read(s-fd, buf, sizeof(buf));
+   s-size = read(s-fd, s-buf, sizeof(s-buf));
 #endif
-if (size  0) {
-qemu_send_packet(s-vc, buf, size);
-}
+
+   if (s-size == -1  errno == EINTR)
+   continue;
+
+   if (s-size  0) {
+   int err;
+
+   /* If noone can receive the packet, buffer it */
+   err = qemu_send_packet(s-vc, s-buf, s-size);
+   if (err == -EAGAIN)
+   break;
+   }
+} while (s-size  0);
 }
 
 /* fd support */
@@ -4022,7 +4076,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
 return NULL;
 s-fd = fd;
 s-vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s);
-qemu_set_fd_handler2(s-fd, NULL, tap_send, NULL, s);
+qemu_set_fd_handler2(s-fd, tap_can_send, tap_send

[kvm-devel] [PATCH 3/3] Only select once per-main_loop iteration

2008-05-04 Thread Anthony Liguori
QEMU is rather aggressive about exhausting the wait period when selecting.
This is fine when the wait period is low and when there is significant delays
in-between selects as it improves IO throughput.

With the IO thread, there is a very small delay between selects and our wait
period for select is very large.  This patch changes main_loop_wait to only
select once before doing the various other things in the main loop.  This
generally improves responsiveness of things like SDL but also improves
individual file descriptor throughput quite dramatically.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index e16b261..6a90e68 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -423,24 +423,6 @@ void qemu_kvm_notify_work(void)
fprintf(stderr, failed to notify io thread\n);
 }
 
-static int received_signal;
-
-/* QEMU relies on periodically breaking out of select via EINTR to poll for IO
-   and timer signals.  Since we're now using a file descriptor to handle
-   signals, select() won't be interrupted by a signal.  We need to forcefully
-   break the select() loop when a signal is received hence
-   kvm_check_received_signal(). */
-
-int kvm_check_received_signal(void)
-{
-if (received_signal) {
-   received_signal = 0;
-   return 1;
-}
-
-return 0;
-}
-
 /* If we have signalfd, we mask out the signals we want to handle and then
  * use signalfd to listen for them.  We rely on whatever the current signal
  * handler is to dispatch the signals when we receive them.
@@ -474,8 +456,6 @@ static void sigfd_handler(void *opaque)
pthread_cond_signal(qemu_aio_cond); 
}
 }
-
-received_signal = 1;
 }
 
 /* Used to break IO thread out of select */
@@ -497,8 +477,6 @@ static void io_thread_wakeup(void *opaque)
 
offset += len;
 }
-
-received_signal = 1;
 }
 
 int kvm_main_loop(void)
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h
index e1e461a..34aabd2 100644
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -114,15 +114,6 @@ static inline void kvm_sleep_end(void)
kvm_mutex_lock();
 }
 
-int kvm_check_received_signal(void);
-
-static inline int kvm_received_signal(void)
-{
-if (kvm_enabled())
-   return kvm_check_received_signal();
-return 0;
-}
-
 #if !defined(SYS_signalfd)
 struct signalfd_siginfo {
 uint32_t ssi_signo;
diff --git a/qemu/vl.c b/qemu/vl.c
index e9f0ca4..6935a82 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -7946,23 +7946,18 @@ void main_loop_wait(int timeout)
 slirp_select_fill(nfds, rfds, wfds, xfds);
 }
 #endif
- moreio:
 ret = qemu_select(nfds + 1, rfds, wfds, xfds, tv);
 if (ret  0) {
 IOHandlerRecord **pioh;
-int more = 0;
 
 for(ioh = first_io_handler; ioh != NULL; ioh = ioh-next) {
 if (!ioh-deleted  ioh-fd_read  FD_ISSET(ioh-fd, rfds)) {
 ioh-fd_read(ioh-opaque);
-if (!ioh-fd_read_poll || ioh-fd_read_poll(ioh-opaque))
-more = 1;
-else
+if (!(ioh-fd_read_poll  ioh-fd_read_poll(ioh-opaque)))
 FD_CLR(ioh-fd, rfds);
 }
 if (!ioh-deleted  ioh-fd_write  FD_ISSET(ioh-fd, wfds)) {
 ioh-fd_write(ioh-opaque);
-more = 1;
 }
 }
 
@@ -7976,8 +7971,6 @@ void main_loop_wait(int timeout)
 } else
 pioh = ioh-next;
 }
-if (more  !kvm_received_signal())
-goto moreio;
 }
 #if defined(CONFIG_SLIRP)
 if (slirp_inited) {

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 2/3] Interrupt io thread in qemu_set_fd_handler2

2008-05-04 Thread Anthony Liguori
The select() in the IO thread may wait a long time before rebuilding the
fd set.  Whenever we do something that changes the fd set, we should interrupt
the IO thread.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/vl.c b/qemu/vl.c
index 1192759..e9f0ca4 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -260,6 +260,16 @@ static int event_pending = 1;
 
 #define TFR(expr) do { if ((expr) != -1) break; } while (errno == EINTR)
 
+/* KVM runs the main loop in a separate thread.  If we update one of the lists
+ * that are polled before or after select(), we need to make sure to break out
+ * of the select() to ensure the new item is serviced.
+ */
+static void main_loop_break(void)
+{
+if (kvm_enabled())
+   qemu_kvm_notify_work();
+}
+
 void decorate_application_name(char *appname, int max_len)
 {
 if (kvm_enabled())
@@ -5680,6 +5690,7 @@ int qemu_set_fd_handler2(int fd,
 ioh-opaque = opaque;
 ioh-deleted = 0;
 }
+main_loop_break();
 return 0;
 }
 
@@ -7606,8 +7617,7 @@ void qemu_bh_schedule(QEMUBH *bh)
 if (env) {
 cpu_interrupt(env, CPU_INTERRUPT_EXIT);
 }
-if (kvm_enabled())
-qemu_kvm_notify_work();
+main_loop_break();
 }
 
 void qemu_bh_cancel(QEMUBH *bh)

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 2/3] Make virtio-net can_receive more accurate (v2)

2008-05-04 Thread Anthony Liguori
In the final patch of this series, we rely on a VLAN client's fd_can_read
method to avoid dropping packets.  Unfortunately, virtio's fd_can_read method
is not very accurate at the moment.  This patch addresses this.

It also generates a notification to the IO thread when more RX packets become
available.  If we say we can't receive a packet because no RX buffers are
available, this may result in the tap file descriptor not being select()'d.
Without notifying the IO thread, we may have to wait until the select() times
out before we can receive a packet (even if there is one pending).

This particular change makes RX performance very consistent.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 8d26832..5538979 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -14,6 +14,7 @@
 #include virtio.h
 #include net.h
 #include qemu-timer.h
+#include qemu-kvm.h
 
 /* from Linux's virtio_net.h */
 
@@ -60,11 +61,14 @@ typedef struct VirtIONet
 VirtQueue *rx_vq;
 VirtQueue *tx_vq;
 VLANClientState *vc;
-int can_receive;
 QEMUTimer *tx_timer;
 int tx_timer_active;
 } VirtIONet;
 
+/* TODO
+ * - we could suppress RX interrupt if we were so inclined.
+ */
+
 static VirtIONet *to_virtio_net(VirtIODevice *vdev)
 {
 return (VirtIONet *)vdev;
@@ -88,15 +92,24 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
 
 static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 {
-VirtIONet *n = to_virtio_net(vdev);
-n-can_receive = 1;
+/* We now have RX buffers, signal to the IO thread to break out of the
+   select to re-poll the tap file descriptor */
+if (kvm_enabled())
+   qemu_kvm_notify_work();
 }
 
 static int virtio_net_can_receive(void *opaque)
 {
 VirtIONet *n = opaque;
 
-return (n-vdev.status  VIRTIO_CONFIG_S_DRIVER_OK)  n-can_receive;
+if (n-rx_vq-vring.avail == NULL ||
+   !(n-vdev.status  VIRTIO_CONFIG_S_DRIVER_OK))
+   return 0;
+
+if (n-rx_vq-vring.avail-idx == n-rx_vq-last_avail_idx)
+   return 0;
+
+return 1;
 }
 
 static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
@@ -106,15 +119,8 @@ static void virtio_net_receive(void *opaque, const uint8_t 
*buf, int size)
 struct virtio_net_hdr *hdr;
 int offset, i;
 
-/* FIXME: the drivers really need to set their status better */
-if (n-rx_vq-vring.avail == NULL) {
-   n-can_receive = 0;
-   return;
-}
-
 if (virtqueue_pop(n-rx_vq, elem) == 0) {
-   /* wait until the guest adds some rx bufs */
-   n-can_receive = 0;
+   fprintf(stderr, virtio_net: this should not happen\n);
return;
 }
 
@@ -209,9 +215,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int 
devfn)
 
 n-vdev.update_config = virtio_net_update_config;
 n-vdev.get_features = virtio_net_get_features;
-n-rx_vq = virtio_add_queue(n-vdev, 512, virtio_net_handle_rx);
+n-rx_vq = virtio_add_queue(n-vdev, 128, virtio_net_handle_rx);
 n-tx_vq = virtio_add_queue(n-vdev, 128, virtio_net_handle_tx);
-n-can_receive = 0;
 memcpy(n-mac, nd-macaddr, 6);
 n-vc = qemu_new_vlan_client(nd-vlan, virtio_net_receive,
  virtio_net_can_receive, n);

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 1/3] Revert virtio tap hack (v2)

2008-05-04 Thread Anthony Liguori
While it has served us well, it is long overdue that we eliminate the
virtio-net tap hack.  It turns out that zero-copy has very little impact on
performance.  The tap hack was gaining such a significant performance boost
not because of zero-copy, but because it avoided dropping packets on receive
which is apparently a significant problem with the tap implementation in QEMU.

Patches 3 and 4 in this series address the packet dropping issue and the net
result is a 25% boost in RX performance even in the absence of zero-copy.

Also worth mentioning, is that this makes merging virtio into upstream QEMU
significantly easier.

Since v1, we're just rebasing on the new io thread patch set.

This series depends on my IO thread series.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/pc.h b/qemu/hw/pc.h
index 57d2123..f5157bd 100644
--- a/qemu/hw/pc.h
+++ b/qemu/hw/pc.h
@@ -154,7 +154,6 @@ void isa_ne2000_init(int base, qemu_irq irq, NICInfo *nd);
 /* virtio-net.c */
 
 PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn);
-void virtio_net_poll(void);
 
 /* virtio-blk.h */
 void *virtio_blk_init(PCIBus *bus, uint16_t vendor, uint16_t device,
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index f727b14..8d26832 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -13,7 +13,6 @@
 
 #include virtio.h
 #include net.h
-#include pc.h
 #include qemu-timer.h
 
 /* from Linux's virtio_net.h */
@@ -62,15 +61,10 @@ typedef struct VirtIONet
 VirtQueue *tx_vq;
 VLANClientState *vc;
 int can_receive;
-int tap_fd;
-struct VirtIONet *next;
-int do_notify;
 QEMUTimer *tx_timer;
 int tx_timer_active;
 } VirtIONet;
 
-static VirtIONet *VirtIONetHead = NULL;
-
 static VirtIONet *to_virtio_net(VirtIODevice *vdev)
 {
 return (VirtIONet *)vdev;
@@ -105,7 +99,6 @@ static int virtio_net_can_receive(void *opaque)
 return (n-vdev.status  VIRTIO_CONFIG_S_DRIVER_OK)  n-can_receive;
 }
 
-/* -net user receive function */
 static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
 {
 VirtIONet *n = opaque;
@@ -144,87 +137,6 @@ static void virtio_net_receive(void *opaque, const uint8_t 
*buf, int size)
 virtio_notify(n-vdev, n-rx_vq);
 }
 
-/* -net tap receive handler */
-void virtio_net_poll(void)
-{
-VirtIONet *vnet;
-int len;
-fd_set rfds;
-struct timeval tv;
-int max_fd = -1;
-VirtQueueElement elem;
-struct virtio_net_hdr *hdr;
-int did_notify;
-
-FD_ZERO(rfds);
-tv.tv_sec = 0;
-tv.tv_usec = 0;
-
-while (1) {
-
-// Prepare the set of device to select from
-for (vnet = VirtIONetHead; vnet; vnet = vnet-next) {
-
-if (vnet-tap_fd == -1)
-continue;
-
-vnet-do_notify = 0;
-//first check if the driver is ok
-if (!virtio_net_can_receive(vnet))
-continue;
-
-/* FIXME: the drivers really need to set their status better */
-if (vnet-rx_vq-vring.avail == NULL) {
-vnet-can_receive = 0;
-continue;
-}
-
-FD_SET(vnet-tap_fd, rfds);
-if (max_fd  vnet-tap_fd) max_fd = vnet-tap_fd;
-}
-
-if (select(max_fd + 1, rfds, NULL, NULL, tv) = 0)
-break;
-
-// Now check who has data pending in the tap
-for (vnet = VirtIONetHead; vnet; vnet = vnet-next) {
-
-if (!FD_ISSET(vnet-tap_fd, rfds))
-continue;
-
-if (virtqueue_pop(vnet-rx_vq, elem) == 0) {
-vnet-can_receive = 0;
-continue;
-}
-
-hdr = (void *)elem.in_sg[0].iov_base;
-hdr-flags = 0;
-hdr-gso_type = VIRTIO_NET_HDR_GSO_NONE;
-again:
-len = readv(vnet-tap_fd, elem.in_sg[1], elem.in_num - 1);
-if (len == -1) {
-if (errno == EINTR || errno == EAGAIN)
-goto again;
-else
-fprintf(stderr, reading network error %d, len);
-}
-virtqueue_push(vnet-rx_vq, elem, sizeof(*hdr) + len);
-vnet-do_notify = 1;
-}
-
-/* signal other side */
-did_notify = 0;
-for (vnet = VirtIONetHead; vnet; vnet = vnet-next)
-if (vnet-do_notify) {
-virtio_notify(vnet-vdev, vnet-rx_vq);
-did_notify++;
-}
-if (!did_notify)
-break;
- }
-
-}
-
 /* TX */
 static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
@@ -303,12 +215,6 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int 
devfn)
 memcpy(n-mac, nd-macaddr, 6);
 n-vc = qemu_new_vlan_client(nd-vlan, virtio_net_receive,
  virtio_net_can_receive, n);
-n-tap_fd = hack_around_tap(n-vc-vlan-first_client);
-if (n-tap_fd != -1) {
-n-next = VirtIONetHead;
-//push the device on top

Re: [kvm-devel] [PATCH 3/3] Stop dropping so many RX packets in tap (v2)

2008-05-04 Thread Anthony Liguori
Dor Laor wrote:
 On Sun, 2008-05-04 at 15:21 -0500, Anthony Liguori wrote:
   
 Patchset looks good and reduces some nasty hacks. It probably also
 improves other devices like e1000 et al.
   

Yeah, we just need to make sure they have proper can_receive handlers.  
I took a quick look at the e1000 but it wasn't entirely obvious to me 
how RX packets get queued.  AFAICT, the e1000 can_receive handler only 
reports false when the device isn't initialized so it's likely it's 
dropping packets.

Regards,

Anthony Liguori

 Cheers,
 Dor


 -
 This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
 Don't miss this year's exciting event. There's still time to save $100. 
 Use priority code J8TL2D2. 
 http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 2/4] Revert virtio tap hack

2008-05-03 Thread Anthony Liguori
While it has served us well, it is long overdue that we eliminate the
virtio-net tap hack.  It turns out that zero-copy has very little impact on
performance.  The tap hack was gaining such a significant performance boost
not because of zero-copy, but because it avoided dropping packets on receive
which is apparently a significant problem with the tap implementation in QEMU.

Patches 3 and 4 in this series address the packet dropping issue and the net
result is a 25% boost in RX performance even in the absence of zero-copy.

Also worth mentioning, is that this makes merging virtio into upstream QEMU
significantly easier.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/pc.h b/qemu/hw/pc.h
index 57d2123..f5157bd 100644
--- a/qemu/hw/pc.h
+++ b/qemu/hw/pc.h
@@ -154,7 +154,6 @@ void isa_ne2000_init(int base, qemu_irq irq, NICInfo *nd);
 /* virtio-net.c */
 
 PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn);
-void virtio_net_poll(void);
 
 /* virtio-blk.h */
 void *virtio_blk_init(PCIBus *bus, uint16_t vendor, uint16_t device,
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index f727b14..8d26832 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -13,7 +13,6 @@
 
 #include virtio.h
 #include net.h
-#include pc.h
 #include qemu-timer.h
 
 /* from Linux's virtio_net.h */
@@ -62,15 +61,10 @@ typedef struct VirtIONet
 VirtQueue *tx_vq;
 VLANClientState *vc;
 int can_receive;
-int tap_fd;
-struct VirtIONet *next;
-int do_notify;
 QEMUTimer *tx_timer;
 int tx_timer_active;
 } VirtIONet;
 
-static VirtIONet *VirtIONetHead = NULL;
-
 static VirtIONet *to_virtio_net(VirtIODevice *vdev)
 {
 return (VirtIONet *)vdev;
@@ -105,7 +99,6 @@ static int virtio_net_can_receive(void *opaque)
 return (n-vdev.status  VIRTIO_CONFIG_S_DRIVER_OK)  n-can_receive;
 }
 
-/* -net user receive function */
 static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
 {
 VirtIONet *n = opaque;
@@ -144,87 +137,6 @@ static void virtio_net_receive(void *opaque, const uint8_t 
*buf, int size)
 virtio_notify(n-vdev, n-rx_vq);
 }
 
-/* -net tap receive handler */
-void virtio_net_poll(void)
-{
-VirtIONet *vnet;
-int len;
-fd_set rfds;
-struct timeval tv;
-int max_fd = -1;
-VirtQueueElement elem;
-struct virtio_net_hdr *hdr;
-int did_notify;
-
-FD_ZERO(rfds);
-tv.tv_sec = 0;
-tv.tv_usec = 0;
-
-while (1) {
-
-// Prepare the set of device to select from
-for (vnet = VirtIONetHead; vnet; vnet = vnet-next) {
-
-if (vnet-tap_fd == -1)
-continue;
-
-vnet-do_notify = 0;
-//first check if the driver is ok
-if (!virtio_net_can_receive(vnet))
-continue;
-
-/* FIXME: the drivers really need to set their status better */
-if (vnet-rx_vq-vring.avail == NULL) {
-vnet-can_receive = 0;
-continue;
-}
-
-FD_SET(vnet-tap_fd, rfds);
-if (max_fd  vnet-tap_fd) max_fd = vnet-tap_fd;
-}
-
-if (select(max_fd + 1, rfds, NULL, NULL, tv) = 0)
-break;
-
-// Now check who has data pending in the tap
-for (vnet = VirtIONetHead; vnet; vnet = vnet-next) {
-
-if (!FD_ISSET(vnet-tap_fd, rfds))
-continue;
-
-if (virtqueue_pop(vnet-rx_vq, elem) == 0) {
-vnet-can_receive = 0;
-continue;
-}
-
-hdr = (void *)elem.in_sg[0].iov_base;
-hdr-flags = 0;
-hdr-gso_type = VIRTIO_NET_HDR_GSO_NONE;
-again:
-len = readv(vnet-tap_fd, elem.in_sg[1], elem.in_num - 1);
-if (len == -1) {
-if (errno == EINTR || errno == EAGAIN)
-goto again;
-else
-fprintf(stderr, reading network error %d, len);
-}
-virtqueue_push(vnet-rx_vq, elem, sizeof(*hdr) + len);
-vnet-do_notify = 1;
-}
-
-/* signal other side */
-did_notify = 0;
-for (vnet = VirtIONetHead; vnet; vnet = vnet-next)
-if (vnet-do_notify) {
-virtio_notify(vnet-vdev, vnet-rx_vq);
-did_notify++;
-}
-if (!did_notify)
-break;
- }
-
-}
-
 /* TX */
 static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
@@ -303,12 +215,6 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int 
devfn)
 memcpy(n-mac, nd-macaddr, 6);
 n-vc = qemu_new_vlan_client(nd-vlan, virtio_net_receive,
  virtio_net_can_receive, n);
-n-tap_fd = hack_around_tap(n-vc-vlan-first_client);
-if (n-tap_fd != -1) {
-n-next = VirtIONetHead;
-//push the device on top of the list
-VirtIONetHead = n;
-}
 
 n-tx_timer = qemu_new_timer(vm_clock

[kvm-devel] [PATCH 1/4] Only select once per-main_loop iteration

2008-05-03 Thread Anthony Liguori
QEMU is rather aggressive about exhausting the wait period when selecting.
This is fine when the wait period is low and when there is significant delays
in-between selects as it improves IO throughput.

With the IO thread, there is a very small delay between selects and our wait
period for select is very large.  This patch changes main_loop_wait to only
select once before doing the various other things in the main loop.  This
generally improves responsiveness of things like SDL but also improves
individual file descriptor throughput quite dramatically.

This patch is relies on my io-thread-timerfd.patch.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 0c7f49f..31c7ca7 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -401,24 +401,6 @@ void qemu_kvm_notify_work(void)
 pthread_kill(io_thread, SIGUSR1);
 }
 
-static int received_signal;
-
-/* QEMU relies on periodically breaking out of select via EINTR to poll for IO
-   and timer signals.  Since we're now using a file descriptor to handle
-   signals, select() won't be interrupted by a signal.  We need to forcefully
-   break the select() loop when a signal is received hence
-   kvm_check_received_signal(). */
-
-int kvm_check_received_signal(void)
-{
-if (received_signal) {
-   received_signal = 0;
-   return 1;
-}
-
-return 0;
-}
-
 #if defined(SYS_signalfd)
 #if !defined(HAVE_signalfd)
 #include linux/signalfd.h
@@ -466,8 +448,6 @@ static void sigfd_handler(void *opaque)
if (info.ssi_signo == SIGUSR2)
pthread_cond_signal(qemu_aio_cond); 
 }
-
-received_signal = 1;
 }
 
 static int setup_signal_handlers(int nr_signals, ...)
@@ -576,8 +556,6 @@ static void sigfd_handler(void *opaque)
if (signo == SIGUSR2)
pthread_cond_signal(qemu_aio_cond); 
 }
-
-received_signal = 1;
 }
 
 static int setup_signal_handlers(int nr_signals, ...)
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h
index bcab82c..5109c64 100644
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -112,13 +112,4 @@ static inline void kvm_sleep_end(void)
kvm_mutex_lock();
 }
 
-int kvm_check_received_signal(void);
-
-static inline int kvm_received_signal(void)
-{
-if (kvm_enabled())
-   return kvm_check_received_signal();
-return 0;
-}
-
 #endif
diff --git a/qemu/vl.c b/qemu/vl.c
index 1192759..bcf893f 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -7936,23 +7936,18 @@ void main_loop_wait(int timeout)
 slirp_select_fill(nfds, rfds, wfds, xfds);
 }
 #endif
- moreio:
 ret = qemu_select(nfds + 1, rfds, wfds, xfds, tv);
 if (ret  0) {
 IOHandlerRecord **pioh;
-int more = 0;
 
 for(ioh = first_io_handler; ioh != NULL; ioh = ioh-next) {
 if (!ioh-deleted  ioh-fd_read  FD_ISSET(ioh-fd, rfds)) {
 ioh-fd_read(ioh-opaque);
-if (!ioh-fd_read_poll || ioh-fd_read_poll(ioh-opaque))
-more = 1;
-else
+if (!(ioh-fd_read_poll  ioh-fd_read_poll(ioh-opaque)))
 FD_CLR(ioh-fd, rfds);
 }
 if (!ioh-deleted  ioh-fd_write  FD_ISSET(ioh-fd, wfds)) {
 ioh-fd_write(ioh-opaque);
-more = 1;
 }
 }
 
@@ -7966,8 +7961,6 @@ void main_loop_wait(int timeout)
 } else
 pioh = ioh-next;
 }
-if (more  !kvm_received_signal())
-goto moreio;
 }
 #if defined(CONFIG_SLIRP)
 if (slirp_inited) {

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH 4/4] Stop dropping so many RX packets in tap

2008-05-03 Thread Anthony Liguori
Normally, tap always reads packets and simply lets the client drop them if it
cannot receive them.  For virtio-net, this results in massive packet loss and
about an 80% performance loss in TCP throughput.

This patch modifies qemu_send_packet() to only deliver a packet to a VLAN
client if it doesn't have a fd_can_read method or the fd_can_read method
indicates that it can receive packets.  We also return a status of whether
any clients were able to receive the packet.

If no clients were able to receive a packet, we buffer the packet until a
client indicates that it can receive packets again.

This patch also modifies the tap code to only read from the tap fd if at least
one client on the VLAN is able to receive a packet.

Finally, this patch changes the tap code to drain all possible packets from
the tap device when the tap fd is readable.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..dfdf9af 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -29,7 +29,7 @@ VLANClientState *qemu_new_vlan_client(VLANState *vlan,
   IOCanRWHandler *fd_can_read,
   void *opaque);
 int qemu_can_send_packet(VLANClientState *vc);
-void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
+int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
 void qemu_handler_true(void *opaque);
 
 void do_info_network(void);
diff --git a/qemu/vl.c b/qemu/vl.c
index b8ce485..74c34b6 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -3750,10 +3750,11 @@ int qemu_can_send_packet(VLANClientState *vc1)
 return 0;
 }
 
-void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size)
+int qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size)
 {
 VLANState *vlan = vc1-vlan;
 VLANClientState *vc;
+int ret = -EAGAIN;
 
 #if 0
 printf(vlan %d send:\n, vlan-id);
@@ -3761,9 +3762,14 @@ void qemu_send_packet(VLANClientState *vc1, const 
uint8_t *buf, int size)
 #endif
 for(vc = vlan-first_client; vc != NULL; vc = vc-next) {
 if (vc != vc1) {
-vc-fd_read(vc-opaque, buf, size);
+   if (!vc-fd_can_read || vc-fd_can_read(vc-opaque)) {
+   vc-fd_read(vc-opaque, buf, size);
+   ret = 0;
+   }
 }
 }
+
+return ret;
 }
 
 #if defined(CONFIG_SLIRP)
@@ -3966,6 +3972,8 @@ typedef struct TAPState {
 VLANClientState *vc;
 int fd;
 char down_script[1024];
+char buf[4096];
+int size;
 } TAPState;
 
 static void tap_receive(void *opaque, const uint8_t *buf, int size)
@@ -3981,24 +3989,70 @@ static void tap_receive(void *opaque, const uint8_t 
*buf, int size)
 }
 }
 
+static int tap_can_send(void *opaque)
+{
+TAPState *s = opaque;
+VLANClientState *vc;
+int can_receive = 0;
+ 
+/* Check to see if any of our clients can receive a packet */
+for (vc = s-vc-vlan-first_client; vc; vc = vc-next) {
+   /* Skip ourselves */
+   if (vc == s-vc)
+   continue;
+
+   if (!vc-fd_can_read) {
+   /* no fd_can_read handler, they always can receive */
+   can_receive = 1;
+   } else
+   can_receive = vc-fd_can_read(vc-opaque);
+
+   /* Once someone can receive, we try to send a packet */
+   if (can_receive)
+   break;
+}
+
+return can_receive;
+}
+
 static void tap_send(void *opaque)
 {
 TAPState *s = opaque;
-uint8_t buf[4096];
-int size;
 
+/* First try to send any buffered packet */
+if (s-size  0) {
+   int err;
+
+   /* If noone can receive the packet, buffer it */
+   err = qemu_send_packet(s-vc, s-buf, s-size);
+   if (err == -EAGAIN)
+   return;
+}
+
+/* Read packets until we hit EAGAIN */
+do {
 #ifdef __sun__
-struct strbuf sbuf;
-int f = 0;
-sbuf.maxlen = sizeof(buf);
-sbuf.buf = buf;
-size = getmsg(s-fd, NULL, sbuf, f) =0 ? sbuf.len : -1;
+   struct strbuf sbuf;
+   int f = 0;
+   sbuf.maxlen = sizeof(s-buf);
+   sbuf.buf = s-buf;
+   s-size = getmsg(s-fd, NULL, sbuf, f) =0 ? sbuf.len : -1;
 #else
-size = read(s-fd, buf, sizeof(buf));
+   s-size = read(s-fd, s-buf, sizeof(s-buf));
 #endif
-if (size  0) {
-qemu_send_packet(s-vc, buf, size);
-}
+
+   if (s-size == -1  errno == EINTR)
+   continue;
+
+   if (s-size  0) {
+   int err;
+
+   /* If noone can receive the packet, buffer it */
+   err = qemu_send_packet(s-vc, s-buf, s-size);
+   if (err == -EAGAIN)
+   break;
+   }
+} while (s-size  0);
 }
 
 /* fd support */
@@ -4012,7 +4066,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
 return NULL;
 s-fd = fd;
 s-vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s);
-qemu_set_fd_handler2(s-fd, NULL, tap_send, NULL, s);
+qemu_set_fd_handler2(s-fd, tap_can_send, tap_send

[kvm-devel] [PATCH 3/4] Make virtio-net can_receive more accurate

2008-05-03 Thread Anthony Liguori
In the final patch of this series, we rely on a VLAN client's fd_can_read
method to avoid dropping packets.  Unfortunately, virtio's fd_can_read method
is not very accurate at the moment.  This patch addresses this.

It also generates a notification to the IO thread when more RX packets become
available.  If we say we can't receive a packet because no RX buffers are
available, this may result in the tap file descriptor not being select()'d.
Without notifying the IO thread, we may have to wait until the select() times
out before we can receive a packet (even if there is one pending).

This particular change makes RX performance very consistent.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 8d26832..5538979 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -14,6 +14,7 @@
 #include virtio.h
 #include net.h
 #include qemu-timer.h
+#include qemu-kvm.h
 
 /* from Linux's virtio_net.h */
 
@@ -60,11 +61,14 @@ typedef struct VirtIONet
 VirtQueue *rx_vq;
 VirtQueue *tx_vq;
 VLANClientState *vc;
-int can_receive;
 QEMUTimer *tx_timer;
 int tx_timer_active;
 } VirtIONet;
 
+/* TODO
+ * - we could suppress RX interrupt if we were so inclined.
+ */
+
 static VirtIONet *to_virtio_net(VirtIODevice *vdev)
 {
 return (VirtIONet *)vdev;
@@ -88,15 +92,24 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
 
 static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 {
-VirtIONet *n = to_virtio_net(vdev);
-n-can_receive = 1;
+/* We now have RX buffers, signal to the IO thread to break out of the
+   select to re-poll the tap file descriptor */
+if (kvm_enabled())
+   qemu_kvm_notify_work();
 }
 
 static int virtio_net_can_receive(void *opaque)
 {
 VirtIONet *n = opaque;
 
-return (n-vdev.status  VIRTIO_CONFIG_S_DRIVER_OK)  n-can_receive;
+if (n-rx_vq-vring.avail == NULL ||
+   !(n-vdev.status  VIRTIO_CONFIG_S_DRIVER_OK))
+   return 0;
+
+if (n-rx_vq-vring.avail-idx == n-rx_vq-last_avail_idx)
+   return 0;
+
+return 1;
 }
 
 static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
@@ -106,15 +119,8 @@ static void virtio_net_receive(void *opaque, const uint8_t 
*buf, int size)
 struct virtio_net_hdr *hdr;
 int offset, i;
 
-/* FIXME: the drivers really need to set their status better */
-if (n-rx_vq-vring.avail == NULL) {
-   n-can_receive = 0;
-   return;
-}
-
 if (virtqueue_pop(n-rx_vq, elem) == 0) {
-   /* wait until the guest adds some rx bufs */
-   n-can_receive = 0;
+   fprintf(stderr, virtio_net: this should not happen\n);
return;
 }
 
@@ -209,9 +215,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int 
devfn)
 
 n-vdev.update_config = virtio_net_update_config;
 n-vdev.get_features = virtio_net_get_features;
-n-rx_vq = virtio_add_queue(n-vdev, 512, virtio_net_handle_rx);
+n-rx_vq = virtio_add_queue(n-vdev, 128, virtio_net_handle_rx);
 n-tx_vq = virtio_add_queue(n-vdev, 128, virtio_net_handle_tx);
-n-can_receive = 0;
 memcpy(n-mac, nd-macaddr, 6);
 n-vc = qemu_new_vlan_client(nd-vlan, virtio_net_receive,
  virtio_net_can_receive, n);

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 2/4] Revert virtio tap hack

2008-05-03 Thread Anthony Liguori

Anthony Liguori wrote:

While it has served us well, it is long overdue that we eliminate the
virtio-net tap hack.  It turns out that zero-copy has very little impact on
performance.  The tap hack was gaining such a significant performance boost
not because of zero-copy, but because it avoided dropping packets on receive
which is apparently a significant problem with the tap implementation in QEMU.
  


FWIW, attached is a pretty straight forward zero-copy patch.  What's 
interesting is that I see no change in throughput using this patch.  The 
CPU is pegged at 100% during the iperf run.  Since we're still using 
small MTUs, this isn't surprising.  Copying a 1500 byte packet that we 
have to bring into the cache anyway doesn't seem significant.  I think 
zero-copy will be more important with GSO though.


Regards,

Anthony Liguori


Patches 3 and 4 in this series address the packet dropping issue and the net
result is a 25% boost in RX performance even in the absence of zero-copy.

Also worth mentioning, is that this makes merging virtio into upstream QEMU
significantly easier.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]
  




net-tap-zero-copy.patch
Description: application/mbox
-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] Feedback and errors

2008-05-02 Thread Anthony Liguori
Avi Kivity wrote:
 nadim khemir wrote:
   
 Hi, great work.

 While playing with kvm-qemu I noticed a few points that might be of 
 interrest:

 1/ -loadvm and -snapshot don't work together. It works as if -loadvm wasn't 
 passed as argument

 2/ two instances of kvm can be passed the same -hda. There is no locking 
 whatsoever. This messes up things seriously.

   
 

 These two are upstream qemu problems. Copying qemu-devel.

 I guess using file locking by default would improve the situation, and 
 we can add a -drive ...,exclusive=no option for people playing with 
 cluster filesystems.
   

This is not a situation where the user has a reasonable expectation of 
what will happen that we violate.  If the user is unhappy with the 
results, it's because the user made a mistake.  FWIW, the whole override 
thing for Xen has been an endless source of pain.  It's very difficult 
(if not impossible) to accurately determine if someone else is using the 
disk.  Also, it tends to confuse people trying to do something 
legitimate more often than helping someone doing something stupid.

I very frequently run multiple VMs with the same disk.  I do it strictly 
for the purposes of benchmarking.  There are ways to share a disk 
without using a clustered filesystem.

If a higher level management tool wants to enforce a policy (like 
libvirt), then let it.  We should not be enforcing policies within QEMU 
though.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] Feedback and errors

2008-05-02 Thread Anthony Liguori
Avi Kivity wrote:

 Well, one user (me) has made this mistake, several times.

I guess it's usage patterns.  I'm pretty religious about using -snapshot 
unless I have a very specific reason not to.  I have never encountered 
this problem myself.

 FWIW, the whole override thing for Xen has been an endless source of 
 pain.  It's very difficult (if not impossible) to accurately 
 determine if someone else is using the disk.  

 What's wrong with the standard file locking API?  Of course it won't 
 stop non-qemu apps from accessing it, but that's unlikely anyway.

Xen tries to be very smart about determining whether devices are mounted 
somewhere else or not.

 Also, it tends to confuse people trying to do something legitimate 
 more often than helping someone doing something stupid.

 -drive exclusive=off (or share=yes)

The problem I have is that the default policy gets very complicated.  At 
first thought, I would say it's fine as long as exclusive=off was the 
default for using -snapshot or using raw images.  However, if you create 
a VM with a qcow image using -snapshot, and then create another one 
without using snapshot, you're boned.

What we really need is a global configuration file so that individual 
users can select these defaults according to what makes sense for them.

In the mean time, I think the policy vs. mechanism strongly suggests 
that exclusive=off should be the default (not to mention maintaining 
backwards compatibility).


 I very frequently run multiple VMs with the same disk.  I do it 
 strictly for the purposes of benchmarking.  There are ways to share a 
 disk without using a clustered filesystem.

 I imagine only raw format disks, and only as non-root filesystems (or 
 with -shapshot, which should automatically set exclusive=off)?

Yup.


 If a higher level management tool wants to enforce a policy (like 
 libvirt), then let it.  We should not be enforcing policies within 
 QEMU though.

 I agree that qemu is not the place to enforce policies, but covering a 
 hole that users are likely to step into, while allowing its explicit 
 uncovering, is a good thing.  We're not enforcing the policy, only 
 hinting.

Unfortunately, the solution involves breaking backwards compatibility 
for legitimate use-cases (not to mention making those use-cases more 
awkward).  I think the only way to sanely do this is as a global 
configuration parameter.

Regards,

Anthony Liguori


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH] Use pipe() to simulate signalfd() (v2)

2008-05-02 Thread Anthony Liguori
This patch reworks the IO thread to use signalfd() instead of sigtimedwait().
This will eliminate the need to use SIGIO everywhere.  In this version of the
patch, we use signalfd() when it's available.  When it isn't available, we
instead use a pipe() that is written to in each signal handler.

I've tested Windows and Linux guests with SMP without seeing an obvious
regressions.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 9a9bf59..0c7f49f 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -12,6 +12,9 @@ int kvm_allowed = 1;
 int kvm_irqchip = 1;
 int kvm_pit = 1;
 
+#include qemu-common.h
+#include console.h
+
 #include string.h
 #include hw/hw.h
 #include sysemu.h
@@ -38,14 +41,6 @@ __thread struct vcpu_info *vcpu;
 
 static int qemu_system_ready;
 
-struct qemu_kvm_signal_table {
-sigset_t sigset;
-sigset_t negsigset;
-};
-
-static struct qemu_kvm_signal_table io_signal_table;
-static struct qemu_kvm_signal_table vcpu_signal_table;
-
 #define SIG_IPI (SIGRTMIN+4)
 
 struct vcpu_info {
@@ -169,37 +164,23 @@ static int has_work(CPUState *env)
 return kvm_arch_has_work(env);
 }
 
-static int kvm_process_signal(int si_signo)
-{
-struct sigaction sa;
-
-switch (si_signo) {
-case SIGUSR2:
-pthread_cond_signal(qemu_aio_cond);
-break;
-case SIGALRM:
-case SIGIO:
-sigaction(si_signo, NULL, sa);
-sa.sa_handler(si_signo);
-break;
-}
-
-return 1;
-}
-
-static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env,
-  int timeout)
+static int kvm_eat_signal(CPUState *env, int timeout)
 {
 struct timespec ts;
 int r, e, ret = 0;
 siginfo_t siginfo;
+sigset_t waitset;
 
 ts.tv_sec = timeout / 1000;
 ts.tv_nsec = (timeout % 1000) * 100;
-r = sigtimedwait(waitset-sigset, siginfo, ts);
+sigemptyset(waitset);
+sigaddset(waitset, SIG_IPI);
+
+r = sigtimedwait(waitset, siginfo, ts);
 if (r == -1  (errno == EAGAIN || errno == EINTR)  !timeout)
return 0;
 e = errno;
+
 pthread_mutex_lock(qemu_mutex);
 if (env  vcpu)
 cpu_single_env = vcpu-env;
@@ -208,7 +189,7 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table 
*waitset, CPUState *env,
exit(1);
 }
 if (r != -1)
-ret = kvm_process_signal(siginfo.si_signo);
+   ret = 1;
 
 if (env  vcpu_info[env-cpu_index].stop) {
vcpu_info[env-cpu_index].stop = 0;
@@ -224,14 +205,13 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table 
*waitset, CPUState *env,
 static void kvm_eat_signals(CPUState *env, int timeout)
 {
 int r = 0;
-struct qemu_kvm_signal_table *waitset = vcpu_signal_table;
 
-while (kvm_eat_signal(waitset, env, 0))
+while (kvm_eat_signal(env, 0))
r = 1;
 if (!r  timeout) {
-   r = kvm_eat_signal(waitset, env, timeout);
+   r = kvm_eat_signal(env, timeout);
if (r)
-   while (kvm_eat_signal(waitset, env, 0))
+   while (kvm_eat_signal(env, 0))
;
 }
 }
@@ -264,9 +244,7 @@ static void pause_all_threads(void)
pthread_kill(vcpu_info[i].thread, SIG_IPI);
 }
 while (!all_threads_paused()) {
-   pthread_mutex_unlock(qemu_mutex);
-   kvm_eat_signal(io_signal_table, NULL, 1000);
-   pthread_mutex_lock(qemu_mutex);
+   main_loop_wait(1000);
cpu_single_env = NULL;
 }
 }
@@ -307,6 +285,13 @@ static void setup_kernel_sigmask(CPUState *env)
 {
 sigset_t set;
 
+sigemptyset(set);
+sigaddset(set, SIGUSR1);
+sigaddset(set, SIGUSR2);
+sigaddset(set, SIGIO);
+sigaddset(set, SIGALRM);
+sigprocmask(SIG_BLOCK, set, NULL);
+
 sigprocmask(SIG_BLOCK, NULL, set);
 sigdelset(set, SIG_IPI);
 
@@ -343,7 +328,7 @@ static int kvm_main_loop_cpu(CPUState *env)
 cpu_single_env = env;
 while (1) {
while (!has_work(env))
-   kvm_main_loop_wait(env, 10);
+   kvm_main_loop_wait(env, 1000);
if (env-interrupt_request  CPU_INTERRUPT_HARD)
env-hflags = ~HF_HALTED_MASK;
if (!kvm_irqchip_in_kernel(kvm_context)  info-sipi_needed)
@@ -391,18 +376,6 @@ static void *ap_main_loop(void *_env)
 return NULL;
 }
 
-static void qemu_kvm_init_signal_table(struct qemu_kvm_signal_table *sigtab)
-{
-sigemptyset(sigtab-sigset);
-sigfillset(sigtab-negsigset);
-}
-
-static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
-{
-sigaddset(sigtab-sigset, signum);
-sigdelset(sigtab-negsigset, signum);
-}
-
 void kvm_init_new_ap(int cpu, CPUState *env)
 {
 pthread_create(vcpu_info[cpu].thread, NULL, ap_main_loop, env);
@@ -411,28 +384,12 @@ void kvm_init_new_ap(int cpu, CPUState *env)
pthread_cond_wait(qemu_vcpu_cond, qemu_mutex);
 }
 
-static void qemu_kvm_init_signal_tables(void)
-{
-qemu_kvm_init_signal_table(io_signal_table);
-qemu_kvm_init_signal_table

Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v2)

2008-04-30 Thread Anthony Liguori
Muli Ben-Yehuda wrote:
 On Tue, Apr 29, 2008 at 02:09:20PM -0500, Anthony Liguori wrote:
   
 This patch allows VMA's that contain no backing page to be used for guest
 memory.  This is a drop-in replacement for Ben-Ami's first page in his direct
 mmio series.  Here, we continue to allow mmio pages to be represented in the
 rmap.

 Since v1, I've taken into account Andrea's suggestions at using VM_PFNMAP
 instead of VM_IO and changed the BUG_ON to a return of bad_page.

 Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 1d7991a..64e5efe 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
  struct page *page[1];
  unsigned long addr;
  int npages;
 +pfn_t pfn;
  
  might_sleep();
  
 @@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
  npages = get_user_pages(current, current-mm, addr, 1, 1, 1, page,
  NULL);
  
 -if (npages != 1) {
 -get_page(bad_page);
 -return page_to_pfn(bad_page);
 -}
 +if (unlikely(npages != 1)) {
 +struct vm_area_struct *vma;
  
 -return page_to_pfn(page[0]);
 +vma = find_vma(current-mm, addr);
 +if (vma == NULL || addr = vma-vm_start ||
 +!(vma-vm_flags  VM_PFNMAP)) {
 

 Isn't the check for addr backwards here? For the VMA we would like to
 to find, vma-vm_start = addr  vma-vm_end.
   

Yes it is.  Thanks for spotting that.

Regards,

Anthony Liguori

 Cheers,
 Muli

 -
 This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
 Don't miss this year's exciting event. There's still time to save $100. 
 Use priority code J8TL2D2. 
 http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v2)

2008-04-30 Thread Anthony Liguori
Andrea Arcangeli wrote:
 On Tue, Apr 29, 2008 at 06:12:51PM -0500, Anthony Liguori wrote:
   
 IIUC PPC correctly, all IO pages have corresponding struct pages.  This 
 means that get_user_pages() would succeed and you can reference count them? 
  In this case, we would never take the VM_PFNMAP path.
 

 get_user_pages only works on vmas where only pfn with struct page can
 be mapped, but if a struct page exists it doesn't mean get_user_pages
 will succeed. All mmio regions should be marked VM_IO as reading on
 them affects hardware somehow and that prevents get_user_pages to work
 on them regardless if a struct page exists.
   

Ah, thanks for the clarification.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] cirrusfb division by zero

2008-04-30 Thread Anthony Liguori
Marcelo Tosatti wrote:
 Anthony, 

 The following sequence crashes F9 guests, when using VNC:

 # modprobe cirrusfb
 # vbetool post

 Results in Floating point exception at:

 cirrus_do_copy()
 {

 depth = s-get_bpp((VGAState *)s) / 8 
 ...
 sx = (src % (width * depth)) / depth;
 ...
 }

 Problem is that -get_bpp returns 0.

 Following band-aid fixes it (coff). I have no idea if its correct
 though ?
   

It suggests something is very broken.  get_bpp only is supposed to 
return 0 when in VGA mode.  I don't think blitting should happen when in 
VGA mode.  Applying the patch is not a bad idea but what was the guest 
doing when this happened?  Was it in the process of transitioning from 
one mode to another?

Regards,

Anthony Liguori

 vbetool post corrupts both SDL and VNC displays when using cirrusfb,
 but seems a separate problem.


 diff --git a/qemu/hw/cirrus_vga.c b/qemu/hw/cirrus_vga.c
 index e14ec35..9f860ff 100644
 --- a/qemu/hw/cirrus_vga.c
 +++ b/qemu/hw/cirrus_vga.c
 @@ -709,6 +709,8 @@ static void cirrus_do_copy(CirrusVGAState *s, int dst, 
 int src, int w, in int notify = 0;

  depth = s-get_bpp((VGAState *)s) / 8;
 +if (!depth)
 +depth = 1;
  s-get_resolution((VGAState *)s, width, height);

  /* extra x, y */

   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH] Don't leak EPT identity page table

2008-04-30 Thread Anthony Liguori
In vmx.c:alloc_identity_pagetable() we grab a reference to the EPT identity
page table via gfn_to_page().  We never release this reference though.

This patch releases the reference to this page on VM destruction.  I haven't
tested this with EPT.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 578a0c1..63f46cf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3909,6 +3909,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_free_physmem(kvm);
if (kvm-arch.apic_access_page)
put_page(kvm-arch.apic_access_page);
+   if (kvm-arch.ept_identity_pagetable)
+   put_page(kvm-arch.ept_identity_pagetable);
kfree(kvm);
 }
 

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH] Handle vma regions with no backing page (v3)

2008-04-30 Thread Anthony Liguori
This patch allows VMA's that contain no backing page to be used for guest
memory.  This is a drop-in replacement for Ben-Ami's first page in his direct
mmio series.  Here, we continue to allow mmio pages to be represented in the
rmap.

Since v1, I've taken into account Andrea's suggestions at using VM_PFNMAP
instead of VM_IO and changed the BUG_ON to a return of bad_page.

Since v2, I've incorporated comments from Avi about returning bad_page instead
of NULL and fixed a typo spotted by Muli.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d7991a..9e3e4ee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
struct page *page[1];
unsigned long addr;
int npages;
+   pfn_t pfn;
 
might_sleep();
 
@@ -544,19 +545,38 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
npages = get_user_pages(current, current-mm, addr, 1, 1, 1, page,
NULL);
 
-   if (npages != 1) {
-   get_page(bad_page);
-   return page_to_pfn(bad_page);
-   }
+   if (unlikely(npages != 1)) {
+   struct vm_area_struct *vma;
 
-   return page_to_pfn(page[0]);
+   vma = find_vma(current-mm, addr);
+   if (vma == NULL || addr  vma-vm_start ||
+   !(vma-vm_flags  VM_PFNMAP)) {
+   get_page(bad_page);
+   return page_to_pfn(bad_page);
+   }
+
+   pfn = ((addr - vma-vm_start)  PAGE_SHIFT) + vma-vm_pgoff;
+   BUG_ON(pfn_valid(pfn));
+   } else
+   pfn = page_to_pfn(page[0]);
+
+   return pfn;
 }
 
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 {
-   return pfn_to_page(gfn_to_pfn(kvm, gfn));
+   pfn_t pfn;
+
+   pfn = gfn_to_pfn(kvm, gfn);
+   if (pfn_valid(pfn))
+   return pfn_to_page(pfn);
+
+   WARN_ON(!pfn_valid(pfn));
+
+   get_page(bad_page);
+   return bad_page;
 }
 
 EXPORT_SYMBOL_GPL(gfn_to_page);
@@ -569,7 +589,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 
 void kvm_release_pfn_clean(pfn_t pfn)
 {
-   put_page(pfn_to_page(pfn));
+   if (pfn_valid(pfn))
+   put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
 
@@ -594,21 +615,25 @@ EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
 
 void kvm_set_pfn_dirty(pfn_t pfn)
 {
-   struct page *page = pfn_to_page(pfn);
-   if (!PageReserved(page))
-   SetPageDirty(page);
+   if (pfn_valid(pfn)) {
+   struct page *page = pfn_to_page(pfn);
+   if (!PageReserved(page))
+   SetPageDirty(page);
+   }
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
 
 void kvm_set_pfn_accessed(pfn_t pfn)
 {
-   mark_page_accessed(pfn_to_page(pfn));
+   if (pfn_valid(pfn))
+   mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
 void kvm_get_pfn(pfn_t pfn)
 {
-   get_page(pfn_to_page(pfn));
+   if (pfn_valid(pfn))
+   get_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_get_pfn);
 

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH] Handle vma regions with no backing page

2008-04-29 Thread Anthony Liguori
This patch allows VMA's that contain no backing page to be used for guest
memory.  This is a drop-in replacement for Ben-Ami's first page in his direct
mmio series.  Here, we continue to allow mmio pages to be represented in the
rmap.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f095b73..11b26f5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -531,6 +531,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
struct page *page[1];
unsigned long addr;
int npages;
+   pfn_t pfn;
 
might_sleep();
 
@@ -543,19 +544,36 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
npages = get_user_pages(current, current-mm, addr, 1, 1, 1, page,
NULL);
 
-   if (npages != 1) {
-   get_page(bad_page);
-   return page_to_pfn(bad_page);
-   }
+   if (unlikely(npages != 1)) {
+   struct vm_area_struct *vma;
+
+   vma = find_vma(current-mm, addr);
+   if (vma == NULL) {
+   get_page(bad_page);
+   return page_to_pfn(bad_page);
+   }
+
+   BUG_ON(!(vma-vm_flags  VM_IO));
 
-   return page_to_pfn(page[0]);
+   pfn = ((addr - vma-vm_start)  PAGE_SHIFT) + vma-vm_pgoff;
+   BUG_ON(pfn_valid(pfn));
+   } else
+   pfn = page_to_pfn(page[0]);
+
+   return pfn;
 }
 
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 {
-   return pfn_to_page(gfn_to_pfn(kvm, gfn));
+   pfn_t pfn;
+
+   pfn = gfn_to_pfn(kvm, gfn);
+   if (pfn_valid(pfn))
+   return pfn_to_page(pfn);
+
+   return NULL;
 }
 
 EXPORT_SYMBOL_GPL(gfn_to_page);
@@ -568,7 +586,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 
 void kvm_release_pfn_clean(pfn_t pfn)
 {
-   put_page(pfn_to_page(pfn));
+   if (pfn_valid(pfn))
+   put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
 
@@ -593,21 +612,25 @@ EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
 
 void kvm_set_pfn_dirty(pfn_t pfn)
 {
-   struct page *page = pfn_to_page(pfn);
-   if (!PageReserved(page))
-   SetPageDirty(page);
+   if (pfn_valid(pfn)) {
+   struct page *page = pfn_to_page(pfn);
+   if (!PageReserved(page))
+   SetPageDirty(page);
+   }
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
 
 void kvm_set_pfn_accessed(pfn_t pfn)
 {
-   mark_page_accessed(pfn_to_page(pfn));
+   if (pfn_valid(pfn))
+   mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
 void kvm_get_pfn(pfn_t pfn)
 {
-   get_page(pfn_to_page(pfn));
+   if (pfn_valid(pfn))
+   get_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_get_pfn);
 

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Handle vma regions with no backing page

2008-04-29 Thread Anthony Liguori
Andrea Arcangeli wrote:
 On Tue, Apr 29, 2008 at 09:32:09AM -0500, Anthony Liguori wrote:
   
 +vma = find_vma(current-mm, addr);
 +if (vma == NULL) {
 +get_page(bad_page);
 +return page_to_pfn(bad_page);
 +}
 

 Here you must check vm_start address, find_vma only checks addr 
 vm_end but there's no guarantee addr = vm_start yet.
   

Indeed.

 +
 +BUG_ON(!(vma-vm_flags  VM_IO));
 

 For consistency we should return bad_page and not bug on, VM_IO and
 VM_PFNMAP can theoretically not be set at the same time, otherwise
 get_user_pages would be buggy checking against VM_PFNMAP|VM_IO. I
 doubt anybody isn't setting VM_IO before calling remap_pfn_range but
 anyway...

 Secondly the really correct check is against VM_PFNMAP. This is
 because PFNMAP is set at the same time of vm_pgoff = pfn. VM_IO is not
 even if in theory if a driver uses -fault instead of remap_pfn_range,
 shouldn't set VM_IO and it should only set VM_RESERVED. VM_IO is about
 keeping gdb/coredump out as they could mess with the hardware if they
 read, PFNMAP is about remap_pfn_range having been called and pgoff
 pointing to the first pfn mapped at vm_start address.
   

Good point.  I've updated the patch.  Will send out again once I've 
gotten to test it.

Regards,

Anthony Liguori

 Patch is in the right direction, way to go!
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH][RFC] Use pipe() to simulate signalfd()

2008-04-29 Thread Anthony Liguori
This patch eliminates the use of sigtimedwait() in the IO thread.  To avoid the
signal/select race condition, we use a pipe that we write to in the signal
handlers.  This was suggested by Rusty and seems to work well.

There are a lot of cleanup/simplification opportunities with this but I've
limited it just to the signal masking/eating routines.  We've got at least one
live lock left in the code that I haven't yet identified.  My goal is to get
this all a lot simplier though so that it's easier to fix the remaining
lock-ups.

I'm looking for some feedback that this is a sane direction.  I haven't tested
this enough yet so please don't apply it.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 9a9bf59..46d7425 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -7,6 +7,9 @@
  */
 #include config.h
 #include config-host.h
+#include qemu-common.h
+#include block.h
+#include console.h
 
 int kvm_allowed = 1;
 int kvm_irqchip = 1;
@@ -38,14 +41,6 @@ __thread struct vcpu_info *vcpu;
 
 static int qemu_system_ready;
 
-struct qemu_kvm_signal_table {
-sigset_t sigset;
-sigset_t negsigset;
-};
-
-static struct qemu_kvm_signal_table io_signal_table;
-static struct qemu_kvm_signal_table vcpu_signal_table;
-
 #define SIG_IPI (SIGRTMIN+4)
 
 struct vcpu_info {
@@ -169,53 +164,37 @@ static int has_work(CPUState *env)
 return kvm_arch_has_work(env);
 }
 
-static int kvm_process_signal(int si_signo)
-{
-struct sigaction sa;
-
-switch (si_signo) {
-case SIGUSR2:
-pthread_cond_signal(qemu_aio_cond);
-break;
-case SIGALRM:
-case SIGIO:
-sigaction(si_signo, NULL, sa);
-sa.sa_handler(si_signo);
-break;
-}
-
-return 1;
-}
-
-static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env,
-  int timeout)
+static int kvm_eat_signal(CPUState *env, int timeout)
 {
 struct timespec ts;
 int r, e, ret = 0;
 siginfo_t siginfo;
+sigset_t waitset;
 
+sigemptyset(waitset);
+sigaddset(waitset, SIG_IPI);
 ts.tv_sec = timeout / 1000;
 ts.tv_nsec = (timeout % 1000) * 100;
-r = sigtimedwait(waitset-sigset, siginfo, ts);
+qemu_kvm_unlock();
+r = sigtimedwait(waitset, siginfo, ts);
+qemu_kvm_lock(env);
+cpu_single_env = env;
 if (r == -1  (errno == EAGAIN || errno == EINTR)  !timeout)
return 0;
 e = errno;
-pthread_mutex_lock(qemu_mutex);
 if (env  vcpu)
 cpu_single_env = vcpu-env;
 if (r == -1  !(errno == EAGAIN || errno == EINTR)) {
printf(sigtimedwait: %s\n, strerror(e));
exit(1);
 }
-if (r != -1)
-ret = kvm_process_signal(siginfo.si_signo);
+ret = 1;
 
 if (env  vcpu_info[env-cpu_index].stop) {
vcpu_info[env-cpu_index].stop = 0;
vcpu_info[env-cpu_index].stopped = 1;
pthread_kill(io_thread, SIGUSR1);
 }
-pthread_mutex_unlock(qemu_mutex);
 
 return ret;
 }
@@ -224,24 +203,20 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table 
*waitset, CPUState *env,
 static void kvm_eat_signals(CPUState *env, int timeout)
 {
 int r = 0;
-struct qemu_kvm_signal_table *waitset = vcpu_signal_table;
 
-while (kvm_eat_signal(waitset, env, 0))
+while (kvm_eat_signal(env, 0))
r = 1;
 if (!r  timeout) {
-   r = kvm_eat_signal(waitset, env, timeout);
+   r = kvm_eat_signal(env, timeout);
if (r)
-   while (kvm_eat_signal(waitset, env, 0))
+   while (kvm_eat_signal(env, 0))
;
 }
 }
 
 static void kvm_main_loop_wait(CPUState *env, int timeout)
 {
-pthread_mutex_unlock(qemu_mutex);
 kvm_eat_signals(env, timeout);
-pthread_mutex_lock(qemu_mutex);
-cpu_single_env = env;
 vcpu_info[env-cpu_index].signalled = 0;
 }
 
@@ -263,12 +238,8 @@ static void pause_all_threads(void)
vcpu_info[i].stop = 1;
pthread_kill(vcpu_info[i].thread, SIG_IPI);
 }
-while (!all_threads_paused()) {
-   pthread_mutex_unlock(qemu_mutex);
-   kvm_eat_signal(io_signal_table, NULL, 1000);
-   pthread_mutex_lock(qemu_mutex);
-   cpu_single_env = NULL;
-}
+while (!all_threads_paused())
+   main_loop_wait(10);
 }
 
 static void resume_all_threads(void)
@@ -391,18 +362,6 @@ static void *ap_main_loop(void *_env)
 return NULL;
 }
 
-static void qemu_kvm_init_signal_table(struct qemu_kvm_signal_table *sigtab)
-{
-sigemptyset(sigtab-sigset);
-sigfillset(sigtab-negsigset);
-}
-
-static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
-{
-sigaddset(sigtab-sigset, signum);
-sigdelset(sigtab-negsigset, signum);
-}
-
 void kvm_init_new_ap(int cpu, CPUState *env)
 {
 pthread_create(vcpu_info[cpu].thread, NULL, ap_main_loop, env);
@@ -411,28 +370,12 @@ void kvm_init_new_ap(int cpu, CPUState *env)
pthread_cond_wait(qemu_vcpu_cond, qemu_mutex);
 }
 
-static void

Re: [kvm-devel] Protected mode transitions and big real mode... still an issue

2008-04-29 Thread Anthony Liguori
Guillaume Thouvenin wrote:
 Hello,

  This patch should solve the problem observed during protected mode
 transitions that appears for example during the installation of
 openSuse-10.3. Unfortunately there is an issue that crashes
 kvm-userspace. I'm not sure if it's a problem introduced by the
 patch or if the patch is good and raises a new issue.
   

You still aren't emulating the instructions correctly I think.  Running 
your patch, I see:

[  979.755349] Failed vm entry (exit reason 0x21) invalid guest state
[  979.755354] emulation at (46e4b) rip 6e0b: ea 10 6e 18
[  979.755358] successfully emulated instruction
[  979.756105] Failed vm entry (exit reason 0x21) invalid guest state
[  979.756109] emulation at (46e50) rip 6e10: 66 b8 20 00
[  979.756111] successfully emulated instruction
[  979.756749] Failed vm entry (exit reason 0x21) invalid guest state
[  979.756752] emulation at (46e54) rip 6e14: 8e d8 8c d0
[  979.756755] successfully emulated instruction
[  979.757427] Failed vm entry (exit reason 0x21) invalid guest state
[  979.757430] emulation at (46e56) rip 6e16: 8c d0 81 e4
[  979.757433] successfully emulated instruction
[  979.758074] Failed vm entry (exit reason 0x21) invalid guest state
[  979.758077] emulation at (46e58) rip 6e18: 81 e4 ff ff


The corresponding gfxboot code is:

 16301 6E0B EA[106E]1800jmp 
pm_seg.prog_c32:switch_to_pm_20
 16302  switch_to_pm_20:
 16303 
 16304  bits 32
 16305 
 16306 6E10 66B82000mov ax,pm_seg.prog_d16
 16307 6E14 8ED8mov ds,ax
 16308 
 16309 6E16 8CD0mov eax,ss
 16310 6E18 81E4and esp,0h
 

The VT state should be correct after executing instruction an RIP 6E16 
(mov eax, ss).  The next instruction should not cause a vmentry 
failure.  The fact that it is for you indicates that you're not updating 
guest state correctly.

My guess would be that load_segment_descriptor is not updating the 
values within the VMCS.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] Protected mode transitions and big real mode... still an issue

2008-04-29 Thread Anthony Liguori
Laurent Vivier wrote:
 Le mardi 29 avril 2008 à 11:41 -0500, Anthony Liguori a écrit :
   
 Guillaume Thouvenin wrote:
 
 Hello,

  This patch should solve the problem observed during protected mode
 transitions that appears for example during the installation of
 openSuse-10.3. Unfortunately there is an issue that crashes
 kvm-userspace. I'm not sure if it's a problem introduced by the
 patch or if the patch is good and raises a new issue.
   
   
 You still aren't emulating the instructions correctly I think.  Running 
 your patch, I see:

 [  979.755349] Failed vm entry (exit reason 0x21) invalid guest state
 [  979.755354] emulation at (46e4b) rip 6e0b: ea 10 6e 18
 [  979.755358] successfully emulated instruction
 [  979.756105] Failed vm entry (exit reason 0x21) invalid guest state
 [  979.756109] emulation at (46e50) rip 6e10: 66 b8 20 00
 [  979.756111] successfully emulated instruction
 [  979.756749] Failed vm entry (exit reason 0x21) invalid guest state
 [  979.756752] emulation at (46e54) rip 6e14: 8e d8 8c d0
 [  979.756755] successfully emulated instruction
 [  979.757427] Failed vm entry (exit reason 0x21) invalid guest state
 [  979.757430] emulation at (46e56) rip 6e16: 8c d0 81 e4
 [  979.757433] successfully emulated instruction
 [  979.758074] Failed vm entry (exit reason 0x21) invalid guest state
 [  979.758077] emulation at (46e58) rip 6e18: 81 e4 ff ff


 The corresponding gfxboot code is:

  16301 6E0B EA[106E]1800jmp 
 pm_seg.prog_c32:switch_to_pm_20
  16302  switch_to_pm_20:
  16303 
  16304  bits 32
  16305 
  16306 6E10 66B82000mov ax,pm_seg.prog_d16
  16307 6E14 8ED8mov ds,ax
  16308 
  16309 6E16 8CD0mov eax,ss
  16310 6E18 81E4and esp,0h
  

 The VT state should be correct after executing instruction an RIP 6E16 
 (mov eax, ss).  The next instruction should not cause a vmentry 
 

 Are you sure ? It is intel notation (opcode dst,src) , so it updates
 eax, not ss. Guillaumes gives us (with gdb notation, opcode src,dst):
   

You're right, it's a fair bit down the code before the ss move happens.

Regards,

Anthony Liguori

   0x00046e53:  ljmp   $0x18,$0x6e18

   0x00046e58:  mov$0x20,%ax

 %EAX = 0x20

   0x00046e5c:  mov%eax,%ds

 %DS = 0x20

   0x00046e5e:  mov%ss,%eax

 %EAX = %SS = 0x53E1 (in this particular case)

 For me the issue is with instructions with dst.byte = 0.
 for instance:

 0x00046e66:  shl$0x4,%eax

 [82768.003174] emulation at (46e66) rip 6e26: c1 e0 04 01
 [82768.035153] writeback: dst.byte 0
 [82768.055174] writeback: dst.ptr  0x
 [82768.087177] writeback: dst.val  0x53e1
 [82768.78] writeback: src.ptr  0x6e28
 [82768.143157] writeback: src.val  0x4

 So my questions are:

 Why dst.val is not 0x53e10 ?
 Why dst.byte is 0 ?

   
 failure.  The fact that it is for you indicates that you're not updating 
 guest state correctly.

 My guess would be that load_segment_descriptor is not updating the 
 values within the VMCS.

 Regards,

 Anthony Liguori
 

 Regards
 Laurent
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH] Handle vma regions with no backing page (v2)

2008-04-29 Thread Anthony Liguori
This patch allows VMA's that contain no backing page to be used for guest
memory.  This is a drop-in replacement for Ben-Ami's first page in his direct
mmio series.  Here, we continue to allow mmio pages to be represented in the
rmap.

Since v1, I've taken into account Andrea's suggestions at using VM_PFNMAP
instead of VM_IO and changed the BUG_ON to a return of bad_page.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d7991a..64e5efe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
struct page *page[1];
unsigned long addr;
int npages;
+   pfn_t pfn;
 
might_sleep();
 
@@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
npages = get_user_pages(current, current-mm, addr, 1, 1, 1, page,
NULL);
 
-   if (npages != 1) {
-   get_page(bad_page);
-   return page_to_pfn(bad_page);
-   }
+   if (unlikely(npages != 1)) {
+   struct vm_area_struct *vma;
 
-   return page_to_pfn(page[0]);
+   vma = find_vma(current-mm, addr);
+   if (vma == NULL || addr = vma-vm_start ||
+   !(vma-vm_flags  VM_PFNMAP)) {
+   get_page(bad_page);
+   return page_to_pfn(bad_page);
+   }
+
+   pfn = ((addr - vma-vm_start)  PAGE_SHIFT) + vma-vm_pgoff;
+   BUG_ON(pfn_valid(pfn));
+   } else
+   pfn = page_to_pfn(page[0]);
+
+   return pfn;
 }
 
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 {
-   return pfn_to_page(gfn_to_pfn(kvm, gfn));
+   pfn_t pfn;
+
+   pfn = gfn_to_pfn(kvm, gfn);
+   if (pfn_valid(pfn))
+   return pfn_to_page(pfn);
+
+   return NULL;
 }
 
 EXPORT_SYMBOL_GPL(gfn_to_page);
@@ -569,7 +586,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 
 void kvm_release_pfn_clean(pfn_t pfn)
 {
-   put_page(pfn_to_page(pfn));
+   if (pfn_valid(pfn))
+   put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
 
@@ -594,21 +612,25 @@ EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
 
 void kvm_set_pfn_dirty(pfn_t pfn)
 {
-   struct page *page = pfn_to_page(pfn);
-   if (!PageReserved(page))
-   SetPageDirty(page);
+   if (pfn_valid(pfn)) {
+   struct page *page = pfn_to_page(pfn);
+   if (!PageReserved(page))
+   SetPageDirty(page);
+   }
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
 
 void kvm_set_pfn_accessed(pfn_t pfn)
 {
-   mark_page_accessed(pfn_to_page(pfn));
+   if (pfn_valid(pfn))
+   mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
 void kvm_get_pfn(pfn_t pfn)
 {
-   get_page(pfn_to_page(pfn));
+   if (pfn_valid(pfn))
+   get_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_get_pfn);
 

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v2)

2008-04-29 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:
   
 This patch allows VMA's that contain no backing page to be used for guest
 memory.  This is a drop-in replacement for Ben-Ami's first page in his direct
 mmio series.  Here, we continue to allow mmio pages to be represented in the
 rmap.

   
 

 I like this very much, as it only affects accessors and not the mmu core 
 itself.

 Hollis/Xiantao/Carsten, can you confirm that this approach works for 
 you?  Carsten, I believe you don't have mmio, but at least this 
 shouldn't interfere.

   
  
  struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
  {
 -return pfn_to_page(gfn_to_pfn(kvm, gfn));
 +pfn_t pfn;
 +
 +pfn = gfn_to_pfn(kvm, gfn);
 +if (pfn_valid(pfn))
 +return pfn_to_page(pfn);
 +
 +return NULL;
  }
   
 

 You're returning NULL here, not bad_page.
   

My thinking was that bad_page indicates that the gfn is invalid.  This 
is a different type of error though.  The problem is that the guest is 
we are trying to kmap() a page that has no struct page associated with 
it.  I'm not sure what the right thing to do here is.

Perhaps we should be replacing consumers of gfn_to_page() with 
copy_to_user() instead?

Regards,

Anthony Liguori



-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH][RFC] Use pipe() to simulate signalfd()

2008-04-29 Thread Anthony Liguori
Marcelo Tosatti wrote:
 Hi Anthony,

 How is -no-kvm-irqchip working with the patch?
   

Seems to work fine.  What is your expectation?

 On Tue, Apr 29, 2008 at 09:28:14AM -0500, Anthony Liguori wrote:
   
 This patch eliminates the use of sigtimedwait() in the IO thread.  To avoid 
 the
 signal/select race condition, we use a pipe that we write to in the signal
 handlers.  This was suggested by Rusty and seems to work well.

 +static int kvm_eat_signal(CPUState *env, int timeout)
  {
  struct timespec ts;
  int r, e, ret = 0;
  siginfo_t siginfo;
 +sigset_t waitset;
  
 +sigemptyset(waitset);
 +sigaddset(waitset, SIG_IPI);
  ts.tv_sec = timeout / 1000;
  ts.tv_nsec = (timeout % 1000) * 100;
 -r = sigtimedwait(waitset-sigset, siginfo, ts);
 +qemu_kvm_unlock();
 +r = sigtimedwait(waitset, siginfo, ts);
 +qemu_kvm_lock(env);
 +cpu_single_env = env;
 

 This assignment seems redundant now.
   

Yeah, I have a bigger patch which eliminates all of the explicit 
assignments to cpu_single_env.

  
 @@ -263,12 +238,8 @@ static void pause_all_threads(void)
  vcpu_info[i].stop = 1;
  pthread_kill(vcpu_info[i].thread, SIG_IPI);
 

 Make sure the IO thread has SIG_IPI blocked (those are for APIC vcpu
 initialization only).
   

Just so I'm clear, there's really no harm in not blocking SIG_IPI 
because it would just be ignored by the IO thread (since the SIG_IPI 
handler is a nop).  But yeah, we should explicitly block it.

 +static void sig_aio_fd_read(void *opaque)
 +{
 +int signum;
 +ssize_t len;
 +
 +do { 
 +len = read(kvm_sigfd[0], signum, sizeof(signum));
 +} while (len == -1  errno == EINTR);
 

 What is the reason for this loop instead of a straight read? 

 Its alright to be interrupted by a signal.
   

Just general habit with QEMU.

 +signal(SIGUSR1, sig_aio_handler);
 +signal(SIGUSR2, sig_aio_handler);
 +signal(SIGALRM, sig_aio_handler);
 +signal(SIGIO, sig_aio_handler);
 +
 +if (pipe(kvm_sigfd) == -1)
 +abort();
 

 perror() would be nice.
   

Yeah, everything needs proper error handling.

 -kvm_eat_signal(io_signal_table, NULL, 1000);
  pthread_mutex_lock(qemu_mutex);
 -cpu_single_env = NULL;
 -main_loop_wait(0);
 +main_loop_wait(10);
 

 Increase that 1000 or something. Will make it easier to spot bugs.
   

I have actually and it does introduce some bugs.  I'm not entirely clear 
what is causing them though.

Regards,

Anthony Liguori

 Similarly in qemu_kvm_aio_wait().

   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v2)

2008-04-29 Thread Anthony Liguori
Avi Kivity wrote:
 It depends on what's going on?  Does a page table point to mmio?  Or 
 the glommerclock?

 Not sure there is a single answer.

 Perhaps we should be replacing consumers of gfn_to_page() with 
 copy_to_user() instead?

 Indeed we should.  The problem is access in atomic contexts.  It's 
 easy to detect failure, but not always easy to handle it.

So I think we should replace it with a rate limited printk and returning 
bad_page.  That way the guest can't exploit it and we'll still hopefully 
get printk()s to track down instances of things going bad.

Regards,

Anthony Liguori


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v2)

2008-04-29 Thread Anthony Liguori
Hollis Blanchard wrote:
 On Tuesday 29 April 2008 17:17:49 Avi Kivity wrote:
   
 I like this very much, as it only affects accessors and not the mmu core 
 itself.

 Hollis/Xiantao/Carsten, can you confirm that this approach works for 
 you?  Carsten, I believe you don't have mmio, but at least this 
 shouldn't interfere.
 

 OK, so the idea is to mmap /sys/bus/pci/.../region within the guest RAM area, 
 and just include it within the normal guest RAM memslot?
   

In the case of x86, since the PCI IO region is pretty far away from 
normal RAM, we'll probably use a different memory slot, but yes, that's 
the general idea.

IIUC PPC correctly, all IO pages have corresponding struct pages.  This 
means that get_user_pages() would succeed and you can reference count 
them?  In this case, we would never take the VM_PFNMAP path.

Is that correct?

 How will the IOMMU be programmed? Wouldn't you still need to register a 
 special type of memslot for that?
   

That's independent of this patchset.  For non-aware guests, we'll have 
to pin all of physical memory up front and then create an IOMMU table 
from the pinned physical memory.  For aware guests with a PV DMA window 
API, we'll be able to build that mapping on the fly (enforcing mlock 
allocation limits).

Regards,

Anthony Liguori


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH][RFC] Use pipe() to simulate signalfd()

2008-04-29 Thread Anthony Liguori
Marcelo Tosatti wrote:
 Problem is if the IO thread _receives_ SIGIPI instead of some vcpu
 thread. 

 So there is potential harm in not blocking it.
   

Hrm, aren't SIG_IPIs delivered to a particular thread-id though?  When 
would the IO thread receive a SIG_IPI?

 What is the reason for this loop instead of a straight read? 

 Its alright to be interrupted by a signal.
  
   
 Just general habit with QEMU.
 

 Please don't :-)
   

I don't see the harm.  In fact, I think it's more correct.  Otherwise, 
we have to wait for another invocation of the fd callback.

 -kvm_eat_signal(io_signal_table, NULL, 1000);
 pthread_mutex_lock(qemu_mutex);
 -cpu_single_env = NULL;
 -main_loop_wait(0);
 +  main_loop_wait(10);

 
 Increase that 1000 or something. Will make it easier to spot bugs.
  
   
 I have actually and it does introduce some bugs.  I'm not entirely clear 
 what is causing them though.
 

 Should indicate that some event previously delivered through signals and
 received by sigtimedwait is not waking up the IO thread.
   

I'll take a look and see.  I'm having time keeping issues in the guest 
so it's hard to tell what problems are caused by the IO thread verses time.

Regards,

Anthony Liguori


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH][RFC] Use pipe() to simulate signalfd()

2008-04-29 Thread Anthony Liguori
Marcelo Tosatti wrote:
 Why? If you leave data in the pipe the next select() won't block.

 Isnt there the possibility that this loop can be stuck for significant
 amounts of time? If you're receiving lots of notifications through
 signals.
   

If  you're getting that many signals, then select() is never going to 
run anyway.

 -kvm_eat_signal(io_signal_table, NULL, 1000);
pthread_mutex_lock(qemu_mutex);
 -cpu_single_env = NULL;
 -main_loop_wait(0);
 +main_loop_wait(10);
   

 
 Increase that 1000 or something. Will make it easier to spot bugs.

  
   
 I have actually and it does introduce some bugs.  I'm not entirely clear 
 what is causing them though.

 
 Should indicate that some event previously delivered through signals and
 received by sigtimedwait is not waking up the IO thread.
  
   
 I'll take a look and see.  I'm having time keeping issues in the guest 
 so it's hard to tell what problems are caused by the IO thread verses time.
 

 Time issues only with the patch? If not, please share details.
   

No, they're independent of the patch.  The symptom is that the guest 
tends to hang during boot for prolonged periods of time.  It tends to be 
random whether it will hang at all, how long it will hang, and at what 
point it will hang.  It tends to hang more often in particular places.  
In my ubuntu server guest, for instance, it tends to hang right after 
partition probing, right after Loading hardware drivers, and right 
before spawning a getty on /dev/tty0.

Normally, it will only hang for a few seconds.  However, with -smp N 
where N  number of cpus in the host, it tends to happen more 
frequently.  Using clock=jiffies in the guest makes the problem go away.

-no-kvm-pit doesn't make a difference (the guest is normally using the 
pm-timer).

Regards,

Anthony Liguori


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH][RFC] Use pipe() to simulate signalfd()

2008-04-29 Thread Anthony Liguori
Marcelo Tosatti wrote:
 True.

 Either way, this starvation due to signals can't happen with the current
 scheme because signals are blocked. It seems a significant drawback.
   

Practically speaking, I don't see signal starving being a problem.  Part 
of the benefit of this approach is that we're reducing the overall 
number of signals received.  With the in-kernel PIT, the number of 
userspace signals is even further reduced.

 Moving the signal handling + pipe write to a separate thread should get
 rid of it.
   

Yeah, but then you just introduce buffering problems since if you're 
getting that many signals, the pipe will get full.

No point in designing for something that isn't likely to happen in practice.

Regards,

Anthony Liguori

 No, they're independent of the patch.  The symptom is that the guest 
 tends to hang during boot for prolonged periods of time.  It tends to be 
 random whether it will hang at all, how long it will hang, and at what 
 point it will hang.  It tends to hang more often in particular places.  
 In my ubuntu server guest, for instance, it tends to hang right after 
 partition probing, right after Loading hardware drivers, and right 
 before spawning a getty on /dev/tty0.

 Normally, it will only hang for a few seconds.  However, with -smp N 
 where N  number of cpus in the host, it tends to happen more 
 frequently.  Using clock=jiffies in the guest makes the problem go away.
 

 I'll try to reproduce that, thanks.

   
 -no-kvm-pit doesn't make a difference (the guest is normally using the 
 pm-timer).

 Regards,

 Anthony Liguori
 


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] [RFC] try to reduce kvm impact in core qemu code.

2008-04-29 Thread Anthony Liguori
Glauber Costa wrote:
 Avi Kivity wrote:
   
 Glauber Costa wrote:
 
 Hi. This is a proposal for reducing the impact of kvm functions in 
 core qemu
 code. This is by all means not ready, but I felt like posting it, so a 
 discussion
 on it could follow.

 The idea in this patch is to replace the specific kvm details from 
 core qemu files
 like vl.c, with driver_yyy() functions. When kvm is not running, those 
 functions would
 just return (most of time), absolutely reducing the impact of kvm code.

 As I wanted to test it, in this patch I changed the kvm functions to 
 be called driver_yyy(),
 but that's not my final goal. I intend to use a function pointer 
 schema, similar to what the linux
 kernel already do for a lot of its subsystem, to isolate the changes.

 Comments deeply welcome.
   
   
 While I would be very annoyed if someone referred to kvm as a qemu 
 accelerator, I think accelerator_yyy() is more descriptive than 
 driver_yyy().
 
 How about booster? ;-)
   

I don't think the concern from a QEMU perspective is that QEMU is too 
intimately tied to KVM.  The concern is that overtime, it will be very 
difficult to make changes to QEMU without breaking KVM support because 
of the shear number of hooks we require.  Fabrice had actually suggested 
merging libkvm into QEMU.  We just need to reduce the overall number of 
if (kvm_enabled()) blocks.

You have to understand a lot about KVM to know why it's necessary to do 
an register reload call-out in the vmport hw for instance.  Instead of 
introducing generic hooks in vmport, a more appropriate solution may be 
to add wrappers to read individual register values within the QEMU 
device code.   We can then add our hooks to that.  I think a good 
argument can be made that devices should never access env- directly.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] fork() within a VM with MMU notifiers

2008-04-28 Thread Anthony Liguori
Here's my thinking as to why we don't want to destroy the VM in the mmu 
notifiers -release method.  I don't have a valid use-case for this but 
my argument depends on the fact that this is something that should 
work.  Daemonizing a running VM may be a reasonable use-case.  It's 
useful to wait to daemonize until you are sure that everything is 
working correctly so it's not all that unreasonable to move the 
daemonize until after the VCPUs have been launched.

If you take a running VM, and pause all of the VCPUs, and then issue a 
fork() followed by an immediate exit() in the parent process, the child 
process should be able to unpause all the VCPUs and the guest should 
continue running uninterrupted.

 From KVM's perspective, issuing the fork() will increment the reference 
count of the file descriptor for the VM but otherwise, no real change 
should happen.  The issue would now be that we must completely flush the 
shadow page table cache.  In theory, MMU notifiers should do this for us.

When the parent process exits, this will result in exit_mmap() and will 
destroy the KVM guest.  This leaves the child process with a file 
descriptor that refers to a VM that is no longer valid.

Just avoiding destroying the VM in the -release() method won't fix this 
use-case I don't think.  In general, I think we need to think a little 
more about how fork() is handled with respect to mmu notifiers.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset

2008-04-28 Thread Anthony Liguori
Ryan Harper wrote:
 * Avi Kivity [EMAIL PROTECTED] [2008-04-26 02:23]:
   
 Please reuse qemu_mutex for this, no need for a new one.
 

 I'm having a little trouble wrapping my head around all of the locking
 here.  If I avoid qemu_mutex and use a new one, I've got everything
 working.  However, attemping to use qemu_mutex is stepping into a pile
 of locking crap.  I'm sure there is a good reason... 

 The current code looks like this:

 Thread1:
 main()
   kvm_qemu_init() // mutex_lock()
   machine-init()
pc_init1()
 pc_new_cpu()
  cpu_init()
   cpu_x86_init()
kvm_init_new_ap() // create vcpu Thread2
--  
   --  
  --  
 --  
--  
   --  
   kvm_main_loop() // mutex_unlock()
 
 Thread2:
 ap_main_loop()
  /* vcpu init */
  kvm_main_loop_cpu()
   kvm_main_loop_wait() // mutex_unlock() on enter, lock on exit
kvm_eat_signals()   // mutex_lock() on enter, unlock on exit
--
   --
  --
   

The qemu_mutex is meant to ensure that the QEMU code is only ever 
entered by one thread at a time.  QEMU is not thread-safe so this is 
necessary.  It's a little odd because we're taking this lock initially 
by being called from QEMU code.  Here's the basic theory of locking AFAICT:

kvm_main_loop_cpu() is the main loop for each VCPU.  It must acquire the 
QEMU mutex since it will call into normal QEMU code to process events.  
Whenever a VCPU is allowed to run, or when the VCPU is idling, it needs 
to release the QEMU mutex.  The former is done via the post_kvm_run() 
and pre_kvm_run() hooks.  The later is done within kvm_main_loop_wait().

kvm_main_loop_wait() will release the QEMU mutex and call 
kvm_eat_signals() which calls kvm_eat_signal() which will issue a 
sigtimedwait().  This is where we actually idle (and why SIGIO is so 
important right now).  We don't want to idle with the QEMU mutex held as 
that may result in dead lock so this is why we release it here.

kvm_eat_signal() has to acquire the lock again in order to dispatch IO 
events (via kvm_process_signal()).

Regards,

Anthony Liguori

 It wedges up in kvm_init_new_ap() if I attempt acquire qemu_mutex.
 Quite obvious after I looked at the call trace and discovered
 kvm_qemu_init() locking on exit.  I see other various functions that
 unlock and then lock; I really don't want to wade into this mess...
 rather whomever cooked it up should do some cleanup.  I tried the
 unlock, then re-lock on exit in kvm_init_new_ap() but that also wedged.

 Here is a rework with a new flag in vcpu_info indicating vcpu
 creation.  Tested this with 64 1VCPU guests booting with 1 second delay,
 and single 16-way SMP guest boot.

   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH] Don't race while creating a VCPU

2008-04-28 Thread Anthony Liguori
We hold qemu_mutex while machine-init() executes, which issues a VCPU create.
We need to make sure to not return from the VCPU creation until the VCPU
file descriptor is valid to ensure that APIC creation succeeds.

However, we also need to make sure that the VCPU thread doesn't start running
until the machine-init() is complete.  This is addressed today because the
VCPU thread tries to grab the qemu_mutex before doing anything interesting.
If we release qemu_mutex to wait for VCPU creation, then we open a window for
a race to occur.

This patch introduces two wait conditions.  The first lets the VCPU create
code that runs in the IO thread to wait for a VCPU to initialize.  The second
condition lets the VCPU thread wait for the machine to fully initialize before
running.

An added benefit of this patch is it makes the dependencies now explicit.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 78127de..9a9bf59 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -32,8 +32,12 @@ static int qemu_kvm_reset_requested;
 
 pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
 pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER;
+pthread_cond_t qemu_vcpu_cond = PTHREAD_COND_INITIALIZER;
+pthread_cond_t qemu_system_cond = PTHREAD_COND_INITIALIZER;
 __thread struct vcpu_info *vcpu;
 
+static int qemu_system_ready;
+
 struct qemu_kvm_signal_table {
 sigset_t sigset;
 sigset_t negsigset;
@@ -53,6 +57,7 @@ struct vcpu_info {
 int stop;
 int stopped;
 int reload_regs;
+int created;
 } vcpu_info[256];
 
 pthread_t io_thread;
@@ -324,6 +329,7 @@ static int kvm_main_loop_cpu(CPUState *env)
 struct vcpu_info *info = vcpu_info[env-cpu_index];
 
 setup_kernel_sigmask(env);
+
 pthread_mutex_lock(qemu_mutex);
 if (kvm_irqchip_in_kernel(kvm_context))
env-hflags = ~HF_HALTED_MASK;
@@ -370,6 +376,17 @@ static void *ap_main_loop(void *_env)
 sigprocmask(SIG_BLOCK, signals, NULL);
 kvm_create_vcpu(kvm_context, env-cpu_index);
 kvm_qemu_init_env(env);
+
+/* signal VCPU creation */
+pthread_mutex_lock(qemu_mutex);
+vcpu-created = 1;
+pthread_cond_signal(qemu_vcpu_cond);
+
+/* and wait for machine initialization */
+while (!qemu_system_ready)
+   pthread_cond_wait(qemu_system_cond, qemu_mutex);
+pthread_mutex_unlock(qemu_mutex);
+
 kvm_main_loop_cpu(env);
 return NULL;
 }
@@ -389,8 +406,9 @@ static void kvm_add_signal(struct qemu_kvm_signal_table 
*sigtab, int signum)
 void kvm_init_new_ap(int cpu, CPUState *env)
 {
 pthread_create(vcpu_info[cpu].thread, NULL, ap_main_loop, env);
-/* FIXME: wait for thread to spin up */
-usleep(200);
+
+while (vcpu_info[cpu].created == 0)
+   pthread_cond_wait(qemu_vcpu_cond, qemu_mutex);
 }
 
 static void qemu_kvm_init_signal_tables(void)
@@ -435,7 +453,11 @@ void qemu_kvm_notify_work(void)
 int kvm_main_loop(void)
 {
 io_thread = pthread_self();
+qemu_system_ready = 1;
 pthread_mutex_unlock(qemu_mutex);
+
+pthread_cond_broadcast(qemu_system_cond);
+
 while (1) {
 kvm_eat_signal(io_signal_table, NULL, 1000);
 pthread_mutex_lock(qemu_mutex);

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] Real Mode Improvement on Intel Hosts - GSoC Project

2008-04-26 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:
 The second stage is to use a loop of x86_emulate() to run all 16-bit 
 code (instead of using vm86 mode).  This will allow us to support 
 guests that use big real mode.

   

 Why do that unconditionally, instead of only when in a big-real-mode 
 state?

It can be.  It's probably easier from a development perspective to make 
it unconditional and then to flush out all the necessary instructions.

Regards,

Anthony Liguori


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset

2008-04-26 Thread Anthony Liguori
Ulrich Drepper wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Ryan Harper wrote:
   
 @@ -388,9 +395,10 @@ static void kvm_add_signal(struct qemu_kvm_signal_table 
 *sigtab, int signum)
  
  void kvm_init_new_ap(int cpu, CPUState *env)
  {
 +pthread_mutex_lock(vcpu_mutex);
  pthread_create(vcpu_info[cpu].thread, NULL, ap_main_loop, env);
 -/* FIXME: wait for thread to spin up */
 -usleep(200);
 +pthread_cond_wait(qemu_vcpuup_cond, vcpu_mutex);
 +pthread_mutex_unlock(vcpu_mutex);
 

 And something is very wrong here.  The pattern for using a condvar is

 1 take mutex

 2 check condition

 3 if condition is not fulfilled
 3a   call cond_wait
 3b   when returning, go back to step 2

 4 unlock mutex


 Anything else is buggy.

 So, either your condvar use is wrong or you don't really want a condvar
 in the first place.  I haven't checked the code.
   

A flag is needed in the vcpu structure that indicates whether the vcpu 
spun up or not.  This is what the while () condition should be.  This is 
needed as the thread may spin up before it gets to the 
pthread_cond_wait() in which case the signal happens when noone is 
waiting on it.  The other reason a while () is usually needed is that 
cond_signal may not wake up the right thread so it's necessary to check 
whether you really have something to do.  Not really a problem here but 
the former race is.

A condvar is definitely the right thing to use here.

Regards,

Anthony Liguori

 - --
 ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.7 (GNU/Linux)

 iD8DBQFIE2CD2ijCOnn/RHQRAs4kAJ40kbWjNJAzj2gGdbo/sSxZTx5b0ACglbis
 kw7ST4eJK9CXhNbjKphNsUo=
 =ISaC
 -END PGP SIGNATURE-

 -
 This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
 Don't miss this year's exciting event. There's still time to save $100. 
 Use priority code J8TL2D2. 
 http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] mmu notifier #v14

2008-04-26 Thread Anthony Liguori
-arch.apic_access_page)
 - put_page(kvm-arch.apic_access_page);
 + /*
 +  * kvm_mmu_notifier_release() will be called before
 +  * mmu_notifier_unregister returns, if it didn't run
 +  * already.
 +  */
 + mmu_notifier_unregister(kvm-arch.mmu_notifier, kvm-mm);
   kfree(kvm);
  }
   

Why move the destruction of the vm to the MMU notifier unregister hook? 
Does anything else ever call mmu_notifier_unregister that would 
implicitly destroy the VM?

Regards,

Anthony Liguori

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] mmu notifier #v14

2008-04-26 Thread Anthony Liguori
Andrea Arcangeli wrote:
 On Sat, Apr 26, 2008 at 01:59:23PM -0500, Anthony Liguori wrote:
   
 +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
 +{
 +   struct page *page = pfn_to_page((*spte  PT64_BASE_ADDR_MASK)  
 PAGE_SHIFT);
 +   get_page(page);
   
   
 You should not assume a struct page exists for any given spte. Instead, use 
 kvm_get_pfn() and kvm_release_pfn_clean().
 

 Last email from [EMAIL PROTECTED] in my inbox argues it's useless to build 
 rmap
 on mmio regions, so the above is more efficient so put_page runs
 directly on the page without going back and forth between spte - pfn
 - page - pfn - page in a single function.
   

Avi can correct me if I'm wrong, but I don't think the consensus of that 
discussion was that we're going to avoid putting mmio pages in the 
rmap.  Practically speaking, replacing:

+   struct page *page = pfn_to_page((*spte  PT64_BASE_ADDR_MASK)  
PAGE_SHIFT);
+   get_page(page);


With:

unsigned long pfn = (*spte  PT64_BASE_ADDR_MASK)  PAGE_SHIFT;
kvm_get_pfn(pfn);

Results in exactly the same code except the later allows mmio pfns in 
the rmap.  So ignoring the whole mmio thing, using accessors that are 
already there and used elsewhere seems like a good idea :-)

 Certainly if we start building rmap on mmio regions we'll have to
 change that.

   
 Perhaps I just have a weak stomach but I am uneasy having a function that 
 takes a lock on exit. I walked through the logic and it doesn't appear to 
 be wrong but it also is pretty clear that you could defer the acquisition 
 of the lock to the caller (in this case, kvm_mmu_pte_write) by moving the 
 update_pte assignment into kvm_mmu_pte_write.
 

 I agree out_lock is an uncommon exit path, the problem is that the
 code was buggy, and I tried to fix it with the smallest possible
 change and that resulting in an out_lock. That section likely need a
 refactoring, all those update_pte fields should be at least returned
 by the function guess_  but I tried to reduce the changes to make
 the issue more readable, I didn't want to rewrite certain functions
 just to take a spinlock a few instructions ahead.
   

I appreciate the desire to minimize changes, but taking a lock on return 
seems to take that to a bit of an extreme.  It seems like a simple thing 
to fix though, no?

 Why move the destruction of the vm to the MMU notifier unregister hook? 
 Does anything else ever call mmu_notifier_unregister that would implicitly 
 destroy the VM?
 

 mmu notifier -release can run at anytime before the filehandle is
 closed. -release has to zap all sptes and freeze the mmu (hence all
 vcpus) to prevent any further page fault. After -release returns all
 pages are freed (we'll never relay on the page pin to avoid the
 rmap_remove put_page to be a relevant unpin event). So the idea is
 that I wanted to maintain the same ordering of the current code in the
 vm destroy event, I didn't want to leave a partially shutdown VM on
 the vmlist. If the ordering is entirely irrelevant and the
 kvm_arch_destroy_vm can run well before kvm_destroy_vm is called, then
 I can avoid changes to kvm_main.c but I doubt.

 I've done it in a way that archs not needing mmu notifiers like s390
 can simply add the kvm_destroy_common_vm at the top of their
 kvm_arch_destroy_vm. All others using mmu_notifiers have to invoke
 kvm_destroy_common_vm in the -release of the mmu notifiers.

 This will ensure that everything will be ok regardless if exit_mmap is
 called before/after exit_files, and it won't make a whole lot of
 difference anymore, if the driver fd is pinned through vmas-vm_file
 released in exit_mmap or through the task filedescriptors relased in
 exit_files etc... Infact this allows to call mmu_notifier_unregister
 at anytime later after the task has already been killed, without any
 trouble (like if the mmu notifier owner isn't registering in
 current-mm but some other tasks mm).
   

I see.  It seems a little strange to me as a KVM guest isn't really tied 
to the current mm.  It seems like the net effect of this is that we are 
now tying a KVM guest to an mm.

For instance, if you create a guest, but didn't assign any memory to it, 
you could transfer the fd to another process and then close the fd 
(without destroying the guest).  The other process then could assign 
memory to it and presumably run the guest.

With your change, as soon as the first process exits, the guest will be 
destroyed.  I'm not sure this behavioral difference really matters but 
it is a behavioral difference.

Regards,

Anthony Liguori

 -
 This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
 Don't miss this year's exciting event. There's still time to save $100. 
 Use priority code J8TL2D2. 
 http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
 ___
 kvm-devel mailing list
 kvm

Re: [kvm-devel] Real Mode Improvement on Intel Hosts - GSoC Project

2008-04-24 Thread Anthony Liguori
Mohammed Gamal wrote:
 Hi,
 My Project proposal Improving and Stabilizing Real-Mode Support for
 Intel Hosts has been accepted into Google Summer of Code under the
 Linux Foundation. You may have a look at  the proposal abstract here:
 http://code.google.com/soc/2008/linux/appinfo.html?csaid=1CC1C8B4CCC1120E
 .

 Any pointers on where to start, what would you like to see done, and
 any other comments and suggestions would greatly be appreciated.
   

We have a two stage plan to address real-mode on Intel systems.  Both 
involve using x86_emulate() to emulate 16-bit (and 32-bit) instructions 
that VT cannot handle.

The first stage is to detect vmentry failures and run x86_emulate() for 
a single instruction.  If you look at the mailing list, you'll see 
patches from myself and Guillaume.  This should be enough to allow most 
Ubuntu installer CDs to work under KVM.  In this case, the CDs are using 
a version of GFXBOOT that doesn't use big real mode, but still relies on 
an undefined architectural aspect that VT doesn't support.

The second stage is to use a loop of x86_emulate() to run all 16-bit 
code (instead of using vm86 mode).  This will allow us to support guests 
that use big real mode.

The best place to start is probably to try out some of the patches on 
the list, and get familiar with the GFXBOOT assembly routines.  There's 
a #kvm in FreeNode, that's a good place to start if you're having 
trouble getting started.

Regards,

Anthony Liguori

 Regards,
 Mohammed

 -
 This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
 Don't miss this year's exciting event. There's still time to save $100. 
 Use priority code J8TL2D2. 
 http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/2] Batch writes to MMIO

2008-04-23 Thread Anthony Liguori
Laurent Vivier wrote:
 Le mercredi 23 avril 2008 à 19:25 +0300, Avi Kivity a écrit :
   
 Laurent Vivier wrote:
 
 Le mercredi 23 avril 2008 à 10:10 -0500, Anthony Liguori a écrit :
 [...]
   
   
 The ne2k is pretty mmio heavy.  You should be able to observe a boost 
 with something like iperf (guest=host) I would think if this is a real 
 savings.
 
 
 I like your advices :-D

 I use iperf with e1000 emulation and a slightly modified patch (to
 detect MMIO write in a loop), server is on the host, client on the
 guest, with default values.

 RESULT WITHOUT BATCHING:

 [  4]  0.0-10.0 sec235 MBytes197 Mbits/sec
 [  5]  0.0-10.0 sec194 MBytes163 Mbits/sec
 [  4]  0.0-10.0 sec185 MBytes155 Mbits/sec
 [  5]  0.0-10.0 sec227 MBytes190 Mbits/sec
 [  4]  0.0-10.0 sec196 MBytes164 Mbits/sec
 [  5]  0.0-10.0 sec194 MBytes163 Mbits/sec
 [  4]  0.0-10.0 sec184 MBytes154 Mbits/sec

 RESULT WITH BATCHING:

 
 Server listening on TCP port 5001
 TCP window size: 85.3 KByte (default)
 
 [  4]  0.0-10.0 sec357 MBytes299 Mbits/sec
 [  5]  0.0-10.1 sec418 MBytes347 Mbits/sec
 [  4]  0.0-10.0 sec408 MBytes342 Mbits/sec
 [  5]  0.0-10.0 sec422 MBytes353 Mbits/sec
 [  4]  0.0-10.1 sec436 MBytes362 Mbits/sec
 [  5]  0.0-10.0 sec416 MBytes348 Mbits/sec
 [  4]  0.0-10.0 sec431 MBytes361 Mbits/sec

 Well, it's nice ?
   
   
 It's too good to be true.

 I think we're seeing two bugs cancel each other out, resulting in a 
 performance gain.  Linux doesn't know how to queue outgoing packets, so 
 it bangs on the mmio that starts the transmit after every packet.  mmio 
 batching doesn't know that this mmio register is critical for latency, 
 so it queues it up.  The result is that you you get not just mmio 
 batching, but also packet batching!  Which dramatically improves 
 performace at the expense of latency.
 

 How can I check that ? How can I measure latency ?
   

ping (from guest to host)

Regards,

Anthony Liguori

 Perhaps I can swap server and client between guest and host ?

   
 Sorry (if it's true :)

 

 Thank you for your help,
 Laurent
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 1/2] kvm: Batch writes to MMIO

2008-04-23 Thread Anthony Liguori
Laurent Vivier wrote:
 This patch is the kernel part of the batch writes to MMIO patch.

 When kernel has to send MMIO writes to userspace, it stores them
 in memory until it has to pass the hand to userspace for another
 reason. This avoids to have too many context switches on operations
 that can wait.

 WARNING: this breaks compatibility with old userspace part.

 Signed-off-by: Laurent Vivier [EMAIL PROTECTED]
 ---
  arch/x86/kvm/x86.c |   21 +
  include/asm-x86/kvm_host.h |2 ++
  include/linux/kvm.h|   10 +-
  virt/kvm/kvm_main.c|3 +++
  4 files changed, 35 insertions(+), 1 deletions(-)

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 0ce5563..3881056 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2942,8 +2942,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
 struct kvm_run *kvm_run)
   kvm_x86_ops-decache_regs(vcpu);
   }
  
 +batch:
   r = __vcpu_run(vcpu, kvm_run);
  
 + if (!r  vcpu-mmio_is_write 
 + kvm_run-exit_reason == KVM_EXIT_MMIO 
 + kvm_run-batch_count  KVM_MAX_BATCH) {
 + struct kvm_batch *batch = vcpu-arch.batch_data;
 + int i = kvm_run-batch_count++;
 +
 + batch[i].phys_addr = vcpu-mmio_phys_addr;
 + batch[i].len = vcpu-mmio_size;
 + memcpy(batch[i].data, vcpu-mmio_data, batch[i].len);
 +
 + goto batch;
 + }
   

I wonder if this is sufficient for dynticks enabled guests.  __vcpu_run 
could last a very long time.

Ignoring that for a moment, Avi's comment about having userspace tell 
you which addresses to batch is an important one.  MMIO writes may have 
side effects and the next vcpu_run may rely on those side effects.  For 
instance, MMIO based IRQ acknowledgement.

You need a white list not only for performances purposes but also for 
correctness.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] [RESEND] Make make sync in kernel dir work for multiple archs

2008-04-23 Thread Anthony Liguori

 diff --git a/kernel/Makefile b/kernel/Makefile
 --- a/kernel/Makefile
 +++ b/kernel/Makefile
 @@ -1,5 +1,10 @@ include ../config.mak
  include ../config.mak
  
 +ARCH_DIR=$(ARCH)
 +ifneq '$(filter $(ARCH_DIR), x86_64 i386)' ''
 + ARCH_DIR=x86
 +endif
 +
   

This sets ARCH_DIR to x86

  KVERREL = $(patsubst /lib/modules/%/build,%,$(KERNELDIR))
  
  DESTDIR=
 @@ -18,11 +23,25 @@ _hack = mv $1 $1.orig  \
   | sed '/\#include/! s/\blapic\b/l_apic/g'  $1  rm $1.orig
  
  _unifdef = mv $1 $1.orig  \
 -   unifdef -DCONFIG_X86 $1.orig  $1; \
 + unifdef -DCONFIG_$(shell echo $(ARCH_DIR)|tr '[:lower:]' '[:upper:]') 
 $1.orig  $1; \
[ $$? -le 1 ]  rm $1.orig
  
  hack = $(call _hack,$T/$(strip $1))
  unifdef = $(call _unifdef,$T/$(strip $1))
 +
 +ifneq '$(filter $(ARCH_DIR), x86_64 i386)' ''
   

And I'm not quite sure what this is supposed to do but it doesn't look 
right.  ARCH_DIR is going to equal 'x86' so this is going to break 
things on x86 sinc ethe filter will return an empty string.

Regards,

Anthony Liguori

 +UNIFDEF_FILES = include/linux/kvm.h \
 + include/linux/kvm_para.h \
 + include/asm-$(ARCH_DIR)/kvm.h \
 + include/asm-x86/kvm_para.h
 +
 +HACK_FILES = kvm_main.c \
 + mmu.c \
 + vmx.c \
 + svm.c \
 + x86.c \
 + irq.h 
 +endif
  
  all::
  #include header priority 1) $LINUX 2) $KERNELDIR 3) include-compat
 @@ -39,17 +58,16 @@ header-sync:
   rm -rf $T
   rsync -R \
$(LINUX)/./include/linux/kvm*.h \
 -  $(LINUX)/./include/asm-x86/kvm*.h \
 - $T/
 - mkdir -p include/linux include/asm-x86
 - ln -sf asm-x86 include/asm
 - ln -sf asm-x86 include-compat/asm
 +  $(LINUX)/./include/asm-$(ARCH_DIR)/kvm*.h \
 +  $T/
  
 - $(call unifdef, include/linux/kvm.h)
 - $(call unifdef, include/linux/kvm_para.h)
 - $(call unifdef, include/asm-x86/kvm.h)
 - $(call unifdef, include/asm-x86/kvm_para.h)
 - $(call hack, include/linux/kvm.h)
 + mkdir -p include/linux include/asm-$(ARCH_DIR)
 + ln -s asm-$(ARCH_DIR) include/asm
 + ln -sf asm-$(ARCH_DIR) include-compat/asm
 +
 + for i in $(UNIFDEF_FILES); \
 + do $(call unifdef, $$i); done
 +
   for i in $$(find $T -type f -printf '%P '); \
   do cmp -s $$i $T/$$i || cp $T/$$i $$i; done
   rm -rf $T
 @@ -57,15 +75,13 @@ source-sync:
  source-sync:
   rm -rf $T
   rsync --exclude='*.mod.c' -R \
 - $(LINUX)/arch/x86/kvm/./*.[ch] \
 - $(LINUX)/virt/kvm/./*.[ch] \
 - $T/
 - $(call hack, kvm_main.c)
 - $(call hack, mmu.c)
 - $(call hack, vmx.c)
 - $(call hack, svm.c)
 - $(call hack, x86.c)
 - $(call hack, irq.h)
 +  $(LINUX)/arch/$(ARCH_DIR)/kvm/./*.[ch] \
 +  $(LINUX)/virt/kvm/./*.[ch] \
 +  $T/
 +
 + for i in $(HACK_FILES); \
 + do $(call hack, $$i); done
 +
   for i in $$(find $T -type f -printf '%P '); \
   do cmp -s $$i $T/$$i || cp $T/$$i $$i; done
   rm -rf $T

 -
 This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
 Don't miss this year's exciting event. There's still time to save $100. 
 Use priority code J8TL2D2. 
 http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] What kernel options do I need to properly enable virtio net driver

2008-04-22 Thread Anthony Liguori
Jerone Young wrote:
 What I am asking is do I have all the proper options in my kernel config
 set to use it?
   

Yes.  You just need CONFIG_VIRTIO_NET and CONFIG_VIRTIO_PCI.  The 
remaining options will be automatically selected.

Regards,

Anthony Liguori

 On Mon, 2008-04-21 at 17:13 -0500, Anthony Liguori wrote:
   
 Jerone Young wrote:
 
 virtio net device does not appear to show itself in the guest. I'm
 curious of what options I may be missing. Here is my config
   
 You'll have to be more specific about what does not appear to show 
 itself means.  What's the output of lspci?

 Regards,

 Anthony Liguori

 
 

 -
 This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
 Don't miss this year's exciting event. There's still time to save $100. 
 Use priority code J8TL2D2. 
 http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
 

 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel
   

   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


  1   2   3   4   5   6   7   8   9   >