Re: [kvm-devel] [ANNOUNCE] kvm-guest-drivers-windows-2
Avi Kivity wrote:
> Anthony Liguori wrote:
>> FWIW, virtio-net is much better with my patches applied.
>
> The can_receive patches? Again, I'm not opposed to them in principle, I
> just think that if they help, this points at a virtio deficiency. Virtio
> should never leave the rx queue empty. Consider the case where the
> virtio queue isn't tied to a socket buffer, but directly to hardware.

For RX performance:

right now            [ 3]  0.0-10.0 sec  1016 MBytes   852 Mbits/sec
revert tap hack      [ 3]  0.0-10.0 sec   564 MBytes   473 Mbits/sec
all patches applied  [ 3]  0.0-10.0 sec  1.17 GBytes  1.01 Gbits/sec
drop lots of packets [ 3]  0.0-10.0 sec  1.05 GBytes   905 Mbits/sec

The last patch is not in my series, but it basically makes the ring size
512 and drops packets when we run out of descriptors. That was to
validate that we're not hiding a virtio deficiency.

The reason I want to buffer packets is that it avoids having to deal
with tuning. For vringfd/vmdq, we'll have to make sure to get the tuning
right, though.

Regards,

Anthony Liguori

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] pinning, tsc and apic
Chris Wright wrote:
> * Anthony Liguori ([EMAIL PROTECTED]) wrote:
>> From a quick look, I suspect that the number of wildly off TSC
>> calibrations corresponds to the VMs that are misbehaving. I think this
>> may mean that we have to re-examine the tsc delta computation.
>>
>> 10_serial.log:time.c: Detected 1995.038 MHz processor.
>> 11_serial.log:time.c: Detected 2363.195 MHz processor.
>> 12_serial.log:time.c: Detected 2492.675 MHz processor.
>> 13_serial.log:time.c: Detected 1995.061 MHz processor.
>> 14_serial.log:time.c: Detected 1994.917 MHz processor.
>> 15_serial.log:time.c: Detected 4100.735 MHz processor.
>> 16_serial.log:time.c: Detected 2075.800 MHz processor.
>> 17_serial.log:time.c: Detected 2674.350 MHz processor.
>> 18_serial.log:time.c: Detected 1995.002 MHz processor.
>> 19_serial.log:time.c: Detected 1994.978 MHz processor.
>> 1_serial.log:time.c: Detected 4384.310 MHz processor.
>
> Is this with pinning? We at least know we're losing small bits on
> migration. From my measurements it's ~3000 (outliers are 10-20k).
> Also, what happens if you roll back to kvm-userspace 7f5c4d15ece5?
>
> I'm using this:
>
> diff -up arch/x86/kvm/svm.c~svm arch/x86/kvm/svm.c
> --- arch/x86/kvm/svm.c~svm	2008-04-16 19:49:44.0 -0700
> +++ arch/x86/kvm/svm.c	2008-05-14 23:44:18.0 -0700
> @@ -621,6 +621,13 @@ static void svm_free_vcpu(struct kvm_vcp
>  	kmem_cache_free(kvm_vcpu_cache, svm);
>  }
>  
> +static void svm_tsc_update(void *arg)
> +{
> +	struct vcpu_svm *svm = arg;
> +	rdtscll(svm->vcpu.arch.host_tsc);
> +
> +}
> +
>  static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -633,6 +640,9 @@ static void svm_vcpu_load(struct kvm_vcp
>  	 * Make sure that the guest sees a monotonically
>  	 * increasing TSC.
>  	 */
> +	if (vcpu->cpu != -1)
> +		smp_call_function_single(vcpu->cpu, svm_tsc_update,
> +					 svm, 0, 1);

I like this approach because of its simplicity, although the IPI is not
wonderful.
I was also thinking of using cpu_clock() to take a timestamp on
vcpu_put, then on vcpu_load, take another timestamp and use the cyc2ns
conversion to try and estimate the elapsed tsc ticks on the new cpu.

Regards,

Anthony Liguori

>  	rdtscll(tsc_this);
>  	delta = vcpu->arch.host_tsc - tsc_this;
>  	svm->vmcb->control.tsc_offset += delta;
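The cyc2ns idea above boils down to converting the scheduled-out wall
time into ticks at the new cpu's rate, then folding those ticks into the
offset. Here is a user-space model of the arithmetic only; the function
names and the kHz-based conversion are illustrative, not kvm code:

```c
#include <stdint.h>

/* Estimate how many TSC ticks elapsed on the new CPU from a nanosecond
 * delta, given that CPU's TSC frequency in kHz.  ticks = ns * khz / 1e6,
 * done in 64-bit, which is safe for deltas up to a few seconds. */
static uint64_t ns_to_tsc_ticks(uint64_t delta_ns, uint64_t tsc_khz)
{
    return (delta_ns * tsc_khz) / 1000000ULL;
}

/* New offset that credits the guest for the wall time spent scheduled
 * out, instead of freezing its TSC across the migration.  It solves:
 * new_host_tsc + offset == saved_host_tsc + old_offset + elapsed_ticks */
static uint64_t adjust_tsc_offset(uint64_t old_offset,
                                  uint64_t saved_host_tsc,
                                  uint64_t new_host_tsc,
                                  uint64_t elapsed_ticks)
{
    return old_offset + saved_host_tsc + elapsed_ticks - new_host_tsc;
}
```

With a 1995.008 MHz (1995008 kHz) TSC, one second scheduled out converts
to 1995008000 ticks, which the offset then credits to the guest.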
Re: [kvm-devel] [ANNOUNCE] kvm-guest-drivers-windows-2
Avi Kivity wrote:
> Anthony Liguori wrote:
>> Avi Kivity wrote:
>>> Anthony Liguori wrote:
>>>> FWIW, virtio-net is much better with my patches applied.
>>>
>>> The can_receive patches? Again, I'm not opposed to them in principle,
>>> I just think that if they help, this points at a virtio deficiency.
>>> Virtio should never leave the rx queue empty. Consider the case where
>>> the virtio queue isn't tied to a socket buffer, but directly to
>>> hardware.
>>
>> For RX performance:
>>
>> right now            [ 3]  0.0-10.0 sec  1016 MBytes   852 Mbits/sec
>> revert tap hack      [ 3]  0.0-10.0 sec   564 MBytes   473 Mbits/sec
>> all patches applied  [ 3]  0.0-10.0 sec  1.17 GBytes  1.01 Gbits/sec
>> drop lots of packets [ 3]  0.0-10.0 sec  1.05 GBytes   905 Mbits/sec
>>
>> The last patch is not in my series, but it basically makes the ring
>> size 512 and drops packets when we run out of descriptors. That was to
>> validate that we're not hiding a virtio deficiency.
>>
>> The reason I want to buffer packets is that it avoids having to deal
>> with tuning. For vringfd/vmdq, we'll have to make sure to get the
>> tuning right, though.
>
> Okay; I'll apply the patches. Hopefully we won't diverge too much from
> upstream qemu.

I am going to push these upstream. I need to finish the page_desc cache
first b/c right now the version of virtio that could go into upstream
QEMU has unacceptable performance for KVM.

Regards,

Anthony Liguori
Re: [kvm-devel] [Qemu-devel] Re: [PATCH] Add support for a configuration file
Fabrice Bellard wrote:
> Avi Kivity wrote:
>> Fabrice Bellard wrote:
>>> I prefer:
>>>
>>> drive.file=foo.img
>>> drive.if=scsi
>>
>> That doesn't support multiple drives very well.
>
> Right, I realized it afterwards! I suggested it because my original
> plan for the configuration file was based on this syntax, with a strong
> inspiration from the OpenFirmware device tree. The idea was that the
> object name ("drive" here) had no hardcoded meaning, except for some
> predefined object names in order to keep a kind of backward
> compatibility with the current QEMU options. In order to create a new
> drive, for example, you just have to do:
>
> mydrive.class=drive
> mydrive.if=scsi
> mydrive.file=abc.img
>
> The "class" field is used to select the device model. Then all the
> other parameters are used to initialize the device model. That way it
> is possible to keep compatibility with the existing options and add a
> provision to instantiate arbitrary new device models, such as:
>
> mynetworkcard.class=ne2000pci
> mynetworkcard.bus=1 # pci bus selection
> mynetworkcard.macaddr=00:01:02:03:04:05
> mynetworkcard.vlan=1
>
> I will strongly support configuration file formats having this
> property.
>
> Regards,
>
> Fabrice.

I like this syntax primarily because it provides a means to associate
arbitrary data with a VM. It also provides a sane way to keep track of
which device is which, so that the config can be updated while the VM is
running. I'll update the patch.

Regards,

Anthony Liguori
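Parsing one line of the proposed object.key=value syntax is a simple
two-way split. A minimal sketch (none of this is existing qemu code; the
function name and buffer sizes are illustrative):

```c
#include <stdio.h>
#include <string.h>

/* Split "object.key=value" into its three parts.  Returns 0 on success,
 * -1 on malformed input (missing '.' or '=', or '.' after '=').
 * Fixed buffer sizes of 64/64/256 bytes are arbitrary for the sketch. */
static int parse_cfg_line(const char *line, char *obj, char *key, char *val)
{
    const char *dot = strchr(line, '.');
    const char *eq = strchr(line, '=');

    if (!dot || !eq || dot > eq)
        return -1;

    snprintf(obj, 64, "%.*s", (int)(dot - line), line);
    snprintf(key, 64, "%.*s", (int)(eq - dot - 1), dot + 1);
    snprintf(val, 256, "%s", eq + 1);
    return 0;
}
```

So "mydrive.if=scsi" splits into object "mydrive", key "if", value
"scsi", and the object name carries no hardcoded meaning, exactly as
described above.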
Re: [kvm-devel] [Qemu-devel] Re: [PATCH] Add support for a configuration file
Paul Brook wrote:
>> the class field is used to select the device model. Then all the other
>> parameters are used to initialize the device model. That way it is
>> possible to keep compatibility with the existing options and add a
>> provision to instantiate arbitrary new device models, such as:
>
> I like the idea, but I'm not so keen on the automatic allocation.

I generally prefer explicit declaration over implicit things. The latter
makes it very easy to not notice when you make a typo.

> It sounds like what you really want is something similar to an OF
> device tree. So you have something like:
>
> # pciide0 may be an alias (possibly provided by qemu)
> # e.g. pci0.slot1.func1.ide
> alias hda ide0.primary.master

What I don't like about the ide0.primary.master syntax is that there
isn't enough structure. I would prefer:

alias hda ide,bus=0,primary,master

If you combine this with your magic variable idea, you could also do:

alias hda ide,bus=0,unit=$next

But you could also just fold that into Fabrice's syntax (which I
prefer):

hda.class = ide,bus=0,unit=$next
hda.type = disk
hda.file = foo.img

If you then default bus and unit, you can just write:

hda.class = ide
hda.type = disk
hda.file = foo.img

And better yet, you could even allow for something like:

hda.class = ide
hda.bus = 0
hda.unit = 0
hda.type = disk
hda.file = foo.img

So I really actually prefer Fabrice's syntax because there is much more
structure. I think it's a good idea to allow .class to contain multiple
properties and to basically establish an alias. This way, you could
predefine a bunch of aliases for today's parameters like hda, hdb, etc.

> hda.type=disk
> hda.file=foo.img
>
> You can then define some form of magic aliases that select the next
> unused device. e.g.
>
> alias mydrive $next_ide_disk
>
> IMHO this provides the flexibility and structure that Fabrice is
> talking about, and with suitable aliases can be made to look a lot like
> the existing options.
> This may require some internal restructuring to allow the machine
> descriptions to feed into the user config file.

I think if we choose a syntax we like, we can start using it pretty
easily. We'll have to start adjusting option names, keeping the old ones
valid only on the command line (for things like -m). However, I think it
would grow pretty well into a machine description mechanism.

Regards,

Anthony Liguori

> Thoughts?
>
> Paul
Re: [kvm-devel] [Qemu-devel] Re: [PATCH] Add support for a configuration file
Dor Laor wrote:
> On Wed, 2008-05-14 at 17:41 +0300, Avi Kivity wrote:
>
> Please don't jump over me, but I think it is worth mentioning OVF, at
> least to know what your opinions are.
>
> Open Virtualization Format -
> http://www.vmware.com/appliances/learn/ovf.html
>
> It's xml based and supported by all major hypervisors, so qemu/kvm/xen
> users might eventually use a product that supports OVF.

xml is a non-starter for QEMU. We go out of our way to be portable.
Having an XML dependency that could be satisfied on Windows and Linux
would probably require more code to support the dependency than all of
QEMU itself.

Regards,

Anthony Liguori
Re: [kvm-devel] [ANNOUNCE] kvm-guest-drivers-windows-2
Dor Laor wrote:
> On Wed, 2008-05-14 at 17:49 +0200, Tomasz Chmielewski wrote:
>> Dor Laor schrieb:
>> (...)
>>> - PV Windows (network driver)
>>>   About 700Mb +-; there is currently an extra copy that we need to
>>>   omit. Thanks to Anthony, we just have to change the driver.
>>> - non-PV Windows
>>
>> What do you mean? Other fully emulated nics like e1000?
>
> It does not perform as well as pv, but depending on the guest it can do
> up to 600Mb +-.
>
>> Just generally, how Windows PV drivers help to improve network
>> performance. So, a PV network driver can do about 700Mb/s, and an
>> emulated NIC can do about 600Mb/s, Windows guest to host? That would
>> be about a 20% improvement?

FWIW, virtio-net is much better with my patches applied. The difference
between the e1000 and virtio-net is that e1000 consumes almost twice as
much CPU as virtio-net, so in my testing, the performance improvement
with virtio-net is about 2x. We were losing about 20-30% throughput
because of the delays in handling incoming packets.

Regards,

Anthony Liguori

> It's work in progress: doing zero copy in the guest, adding TSO, and
> using a virtio'd tap will drastically boost performance. There is no
> reason the performance won't match a Linux guest. Also, I don't exactly
> remember the numbers, but the gain in the tx path is greater.
Re: [kvm-devel] pinning, tsc and apic
Marcelo Tosatti wrote:
> On Mon, May 12, 2008 at 02:19:24PM -0500, Ryan Harper wrote:
>
> Hi Ryan,
>
> There are two places that attempt to use delivery mode 7: kexec crash
> and io_apic_64.c::check_timer(). The latter will happen if the guest
> fails to receive PIT IRQs for 10 ticks. If you're using HZ=1000, that's
> 10ms. See timer_irq_works(). The in-kernel pit emulation has logic
> which avoids injecting more than one IRQ during 10ms.
>
> Note that the guest 10ms delay is TSC based and uses only the lower
> 32 bits of the value. It is quite likely that the TSC adjustment
> results in them increasing more rapidly than they should. Or that the
> TSC is terribly miscalibrated.

Here is the log of all 48 guests. In this case, the host detects
1995.008 MHz as the frequency. This is a Barcelona. I suspect that we're
masking the X86_FEATURE_CONSTANT_TSC though.

From a quick look, I suspect that the number of wildly off TSC
calibrations corresponds to the VMs that are misbehaving. I think this
may mean that we have to re-examine the tsc delta computation.

10_serial.log:time.c: Detected 1995.038 MHz processor.
11_serial.log:time.c: Detected 2363.195 MHz processor.
12_serial.log:time.c: Detected 2492.675 MHz processor.
13_serial.log:time.c: Detected 1995.061 MHz processor.
14_serial.log:time.c: Detected 1994.917 MHz processor.
15_serial.log:time.c: Detected 4100.735 MHz processor.
16_serial.log:time.c: Detected 2075.800 MHz processor.
17_serial.log:time.c: Detected 2674.350 MHz processor.
18_serial.log:time.c: Detected 1995.002 MHz processor.
19_serial.log:time.c: Detected 1994.978 MHz processor.
1_serial.log:time.c: Detected 4384.310 MHz processor.
20_serial.log:time.c: Detected 1994.969 MHz processor.
21_serial.log:time.c: Detected 3670.696 MHz processor.
22_serial.log:time.c: Detected 1994.997 MHz processor.
23_serial.log:time.c: Detected 2218.613 MHz processor.
24_serial.log:time.c: Detected 1995.048 MHz processor.
25_serial.log:time.c: Detected 1995.015 MHz processor.
26_serial.log:time.c: Detected 1994.957 MHz processor.
27_serial.log:time.c: Detected 1995.051 MHz processor.
28_serial.log:time.c: Detected 1995.021 MHz processor.
29_serial.log:time.c: Detected 3679.640 MHz processor.
2_serial.log:time.c: Detected 2191.105 MHz processor.
30_serial.log:time.c: Detected 1995.086 MHz processor.
31_serial.log:time.c: Detected 1995.071 MHz processor.
32_serial.log:time.c: Detected 1995.051 MHz processor.
33_serial.log:time.c: Detected 2331.760 MHz processor.
34_serial.log:time.c: Detected 1995.011 MHz processor.
35_serial.log:time.c: Detected 1995.050 MHz processor.
36_serial.log:time.c: Detected 1994.911 MHz processor.
37_serial.log:time.c: Detected 1994.905 MHz processor.
38_serial.log:time.c: Detected 1994.881 MHz processor.
39_serial.log:time.c: Detected 1995.027 MHz processor.
3_serial.log:time.c: Detected 2051.467 MHz processor.
40_serial.log:time.c: Detected 1994.987 MHz processor.
41_serial.log:time.c: Detected 1994.970 MHz processor.
42_serial.log:time.c: Detected 1994.952 MHz processor.
43_serial.log:time.c: Detected 1995.042 MHz processor.
44_serial.log:time.c: Detected 1994.998 MHz processor.
45_serial.log:time.c: Detected 1995.016 MHz processor.
46_serial.log:time.c: Detected 1995.006 MHz processor.
47_serial.log:time.c: Detected 1995.000 MHz processor.
4_serial.log:time.c: Detected 1995.112 MHz processor.
5_serial.log:time.c: Detected 1995.081 MHz processor.
6_serial.log:time.c: Detected 2017.303 MHz processor.
7_serial.log:time.c: Detected 1995.046 MHz processor.
8_serial.log:time.c: Detected 1994.951 MHz processor.
9_serial.log:time.c: Detected 2184.754 MHz processor.

Regards,

Anthony Liguori

> So can you try setting KVM_MAX_PIT_INTR_INTERVAL to a lower value?
> HZ/10 or something.
>
> You can confirm this theory by booting the guests with apic=debug.
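The "wildly off" calibrations in the log above stand out as large
relative deviations from the host's detected 1995.008 MHz. A quick check
could look like this (a sketch; the 1% threshold is an arbitrary choice
for illustration, frequencies in kHz):

```c
#include <stdlib.h>

/* Flag a guest-detected TSC frequency as wildly off when it deviates
 * from the host-detected frequency by more than 1%.  The threshold is
 * an assumption for illustration, not anything kvm uses. */
static int tsc_calibration_off(long host_khz, long guest_khz)
{
    long delta = labs(guest_khz - host_khz);

    return delta * 100 > host_khz;  /* delta / host > 1% */
}
```

Against a 1995008 kHz host, guest readings like 1995.038 MHz pass while
outliers like 4384.310 MHz are flagged, matching the handful of guests
that fail to come up.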
[kvm-devel] [PATCH 3/4] Use fragmented send for virtio
This patch converts virtio-net to use the new fragmented send interface.
We should have always supported this.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 85cc9d2..93bca1d 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -239,23 +239,15 @@ again:
 static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
     VirtQueueElement elem;
-    int count = 0;
 
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return;
 
     while (virtqueue_pop(vq, &elem)) {
-        int i;
-        size_t len = 0;
+        ssize_t len = 0;
 
         /* ignore the header for now */
-        for (i = 1; i < elem.out_num; i++) {
-            qemu_send_packet(n->vc, elem.out_sg[i].iov_base,
-                             elem.out_sg[i].iov_len);
-            len += elem.out_sg[i].iov_len;
-        }
-
-        count++;
+        len = qemu_sendv_packet(n->vc, &elem.out_sg[1], elem.out_num - 1);
 
         virtqueue_push(vq, &elem, sizeof(struct virtio_net_hdr) + len);
         virtio_notify(&n->vdev, vq);
[kvm-devel] [PATCH 2/4] Add fd_readv handler to tap
This allows fragmented packets to be sent with no additional copies over
the tap interface.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/vl.c b/qemu/vl.c
index 1f0a6ac..7900b76 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -4086,6 +4086,19 @@ static void tap_receive(void *opaque, const uint8_t *buf, int size)
     }
 }
 
+static ssize_t tap_readv(void *opaque, const struct iovec *iov,
+                         int iovcnt)
+{
+    TAPState *s = opaque;
+    ssize_t len;
+
+    do {
+        len = writev(s->fd, iov, iovcnt);
+    } while (len == -1 && (errno == EINTR || errno == EAGAIN));
+
+    return len;
+}
+
 static void tap_send(void *opaque)
 {
     TAPState *s = opaque;
@@ -4135,6 +4148,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
     s->no_poll = 0;
     enable_sigio_timer(fd);
     s->vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s);
+    s->vc->fd_readv = tap_readv;
     qemu_set_fd_handler2(s->fd, tap_read_poll, tap_send, NULL, s);
     snprintf(s->vc->info_str, sizeof(s->vc->info_str), "tap: fd=%d", fd);
     return s;
[kvm-devel] [PATCH 1/4] Add method to send fragmented packets
We need to be able to send fragmented packets in KVM to avoid an extra
copy in the TX path. This patch adds a qemu_sendv_packet() function to
send fragmented packets. It also provides backwards compatibility for
old clients that don't support the new interface.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..5fb190e 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -1,12 +1,17 @@
 #ifndef QEMU_NET_H
 #define QEMU_NET_H
 
+#include <sys/uio.h>
+
 /* VLANs support */
 
+typedef ssize_t (IOReadvHandler)(void *, const struct iovec *, int);
+
 typedef struct VLANClientState VLANClientState;
 
 struct VLANClientState {
     IOReadHandler *fd_read;
+    IOReadvHandler *fd_readv;
     /* Packets may still be sent if this returns zero.  It's used to
        rate-limit the slirp code.  */
     IOCanRWHandler *fd_can_read;
@@ -30,6 +35,8 @@ VLANClientState *qemu_new_vlan_client(VLANState *vlan,
                                       void *opaque);
 int qemu_can_send_packet(VLANClientState *vc);
 void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
+ssize_t qemu_sendv_packet(VLANClientState *vc, const struct iovec *iov,
+                          int iovcnt);
 void qemu_handler_true(void *opaque);
 
 void do_info_network(void);
diff --git a/qemu/vl.c b/qemu/vl.c
index 45c97af..1f0a6ac 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -3820,6 +3820,50 @@ void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size)
     }
 }
 
+static ssize_t vc_sendv_compat(VLANClientState *vc, const struct iovec *iov,
+                               int iovcnt)
+{
+    char buffer[4096];
+    size_t offset = 0;
+    int i;
+
+    for (i = 0; i < iovcnt; i++) {
+        size_t len;
+
+        len = MIN(sizeof(buffer) - offset, iov[i].iov_len);
+        memcpy(buffer + offset, iov[i].iov_base, len);
+        offset += len;
+    }
+
+    vc->fd_read(vc->opaque, buffer, offset);
+
+    return offset;
+}
+
+ssize_t qemu_sendv_packet(VLANClientState *vc1, const struct iovec *iov,
+                          int iovcnt)
+{
+    VLANState *vlan = vc1->vlan;
+    VLANClientState *vc;
+    ssize_t max_len = 0;
+
+    for (vc = vlan->first_client; vc != NULL; vc = vc->next) {
+        ssize_t len = 0;
+
+        if (vc == vc1)
+            continue;
+
+        if (vc->fd_readv)
+            len = vc->fd_readv(vc->opaque, iov, iovcnt);
+        else if (vc->fd_read)
+            len = vc_sendv_compat(vc, iov, iovcnt);
+
+        max_len = MAX(max_len, len);
+    }
+
+    return max_len;
+}
+
 #if defined(CONFIG_SLIRP)
 
 /* slirp network adapter */
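The vc_sendv_compat() fallback in the patch just linearizes the iovec
into a bounded buffer for clients without fd_readv. Stand-alone, the
copy loop looks like this (a sketch mirroring the patch, with the
4096-byte limit made a parameter):

```c
#include <string.h>
#include <sys/uio.h>

/* Flatten an iovec into a bounded buffer.  Returns the number of bytes
 * copied; anything past bufsize is silently dropped, which mirrors the
 * patch's fixed 4096-byte compat buffer. */
static size_t iov_flatten(const struct iovec *iov, int iovcnt,
                          char *buf, size_t bufsize)
{
    size_t offset = 0;
    int i;

    for (i = 0; i < iovcnt; i++) {
        size_t len = iov[i].iov_len;

        if (len > bufsize - offset)
            len = bufsize - offset;  /* truncate instead of overflowing */
        memcpy(buf + offset, iov[i].iov_base, len);
        offset += len;
    }
    return offset;
}
```

This is the copy that the fd_readv path avoids: with a real readv-style
handler the fragments go straight to writev() with no linearization.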
[kvm-devel] [PATCH] Add support for a configuration file
There has been an awful lot of discussion about a configuration file,
with almost no general agreement except that one is strongly desired. I
thought about it a bit, and I think a nice step would be to simply allow
the current configuration parameters to be stored in a file using a
pretty familiar format.

I think this is pretty useful as-is. I think it also gives us a
reasonable way to move forward that will keep everyone pretty happy.

Here's a short example:

qemu-system-x86_64 -hda ~/images/linux.img -snapshot -vnc :2

Would become `foo.qemu':

# Main disk image
hda=/home/anthony/images/linux.img
# Redirect disk writes to a temporary image
snapshot
# Make the graphical display available on port 5902
vnc=:2

With:

qemu-system-x86_64 -config foo.qemu

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu-doc.texi b/qemu-doc.texi
index cca483c..4861fc0 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -395,6 +395,12 @@
 Sets the @var{name} of the guest. This name will be display in the SDL
 window caption. The @var{name} will also be used for the VNC server.
 
+@item -config @var{file}
+Reads configuration options from @var{file}. The format of @var{file} is the
+same as the command line options, except no dash ``-'' is required. Options
+that take an argument are in the format @var{option=value}. A pound ``#''
+character can be used as a comment.
+
 @end table
 
 Display options:
diff --git a/vl.c b/vl.c
index 67712f0..2eb39dd 100644
--- a/vl.c
+++ b/vl.c
@@ -7276,6 +7276,7 @@ static void help(int exitcode)
            "-clock          force the use of the given methods for timer alarm.\n"
            "                To see what timers are available use -clock ?\n"
            "-startdate      select initial date of the clock\n"
+           "-config FILE    read command line options from FILE\n"
            "\n"
            "During emulation, the following keys are useful:\n"
            "ctrl-alt-f      toggle full screen\n"
@@ -7379,6 +7380,7 @@ enum {
     QEMU_OPTION_old_param,
     QEMU_OPTION_clock,
     QEMU_OPTION_startdate,
+    QEMU_OPTION_config,
 };
 
 typedef struct QEMUOption {
@@ -7490,6 +7492,7 @@ const QEMUOption qemu_options[] = {
 #endif
     { "clock", HAS_ARG, QEMU_OPTION_clock },
     { "startdate", HAS_ARG, QEMU_OPTION_startdate },
+    { "config", HAS_ARG, QEMU_OPTION_config },
     { NULL },
 };
 
@@ -7665,9 +7668,106 @@ static BOOL WINAPI qemu_ctrl_handler(DWORD type)
 }
 #endif
 
+static char **insert_opts(char **old_argv, int *old_argc, int index,
+                          char **argv, int argc)
+{
+    char **new_argv;
+
+    /* Allocate larger array */
+    new_argv = realloc(old_argv, (*old_argc + argc) * sizeof(old_argv[0]));
+    if (new_argv == NULL) {
+        fprintf(stderr, "allocate failed in insert_opts\n");
+        exit(1);
+    }
+
+    /* move elements after insertion point to end of array */
+    memmove(new_argv + index + argc, new_argv + index,
+            (*old_argc - index) * sizeof(argv[0]));
+
+    /* copy in new elements */
+    memcpy(new_argv + index, argv, argc * sizeof(argv[0]));
+
+    *old_argc += argc;
+
+    if (0) { /* for debugging */
+        int i;
+        printf("argv[] = {");
+        for (i = 0; i < *old_argc; i++) {
+            if (i)
+                printf(", ");
+            printf("\"%s\"", new_argv[i]);
+        }
+        printf("}\n");
+    }
+
+    return new_argv;
+}
+
+static char **parse_config_file(const char *file, int *pargc)
+{
+    FILE *f;
+    char buffer[4096];
+    char **argv = NULL;
+    int argc = 0;
+
+    f = fopen(file, "r");
+    if (f == NULL)
+        return NULL;
+
+    while (fgets(buffer, sizeof(buffer), f)) {
+        char *ptr = buffer;
+        char *tok, *key, *val;
+        char *targv[2];
+        int targc = 0;
+
+        /* skip whitespace */
+        while (isspace(*ptr))
+            ptr++;
+
+        /* skip comments or empty lines */
+        if (*ptr == '#' || *ptr == 0)
+            continue;
+
+        /* trim new line and carriage return if necessary */
+        tok = strchr(ptr, '\n');
+        if (tok)
+            *tok = 0;
+        tok = strchr(ptr, '\r');
+        if (tok)
+            *tok = 0;
+
+        /* check if it has an argument */
+        tok = strchr(ptr, '=');
+        if (tok)
+            *tok = 0;
+
+        /* add key */
+        if (asprintf(&key, "--%s", ptr) == -1)
+            return NULL;
+        targv[targc++] = key;
+
+        /* add argument (optionally) */
+        if (tok) {
+            if (asprintf(&val, "%s", tok + 1) == -1)
+                return NULL;
+            targv[targc++] = val;
+        }
+
+        /* insert new arguments */
+        argv = insert_opts(argv, &argc, argc, targv, targc);
+        if (argv == NULL)
+            return NULL;
+    }
+
+    fclose(f);
+
+    *pargc = argc;
+
+    return argv;
+}
+
 #define MAX_NET_CLIENTS 32
 
-int main(int argc, char **argv)
+int main(int orig_argc, char **orig_argv)
 {
 #ifdef CONFIG_GDBSTUB
     int use_gdbstub;
@@ -7700,6 +7800,10 @@ int main(int argc
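The realloc+memmove splice that insert_opts() performs on the argv array
can be modeled stand-alone like this (a sketch; the function name is
mine, and the exit-on-failure path is reduced to returning NULL):

```c
#include <stdlib.h>
#include <string.h>

/* Splice extra_argc new elements into an argv-style array at index,
 * growing the array with realloc and shifting the tail with memmove,
 * the same shape as the patch's insert_opts(). */
static char **argv_insert(char **old, int *old_argc, int index,
                          char **extra, int extra_argc)
{
    char **nv = realloc(old, (*old_argc + extra_argc) * sizeof(*nv));

    if (nv == NULL)
        return NULL;

    /* shift everything at and after the insertion point to the end */
    memmove(nv + index + extra_argc, nv + index,
            (*old_argc - index) * sizeof(*nv));

    /* drop the new elements into the gap */
    memcpy(nv + index, extra, extra_argc * sizeof(*nv));

    *old_argc += extra_argc;
    return nv;
}
```

Appending (index == argc) is the only case the config parser above uses,
but the memmove makes mid-array insertion work too, which is what lets
-config options land before later command-line options.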
Re: [kvm-devel] [PATCH] Add support for a configuration file
Anthony Liguori wrote:
> I think this is pretty useful as-is. I think it also gives us a
> reasonable way to move forward that will keep everyone pretty happy.
>
> Here's a short example:
>
> qemu-system-x86_64 -hda ~/images/linux.img -snapshot -vnc :2
>
> Would become `foo.qemu':
>
> # Main disk image
> hda=/home/anthony/images/linux.img
> # Redirect disk writes to a temporary image
> snapshot
> # Make the graphical display available on port 5902
> vnc=:2
>
> With:
>
> qemu-system-x86_64 -config foo.qemu

One thought I had is that it would be very nice to break up the -drive
file=foo.img,if=scsi syntax within the config file. In general, I'm
thinking something like:

[drive]
file=foo.img
if=scsi

or:

drive {
    file=foo.img
    if=scsi
}

or even:

drive:
    file=foo.img
    if=scsi

Basically, I'm looking for a syntax for sub-options. This would be
useful for -drive or -net, but I also think it would lay the foundations
for specifying a full machine config. It would get very unwieldy on the
command line to have a large number of sub-options, but it works
reasonably well within a config file.

Regards,

Anthony Liguori
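Until a sectioned syntax exists, the flat comma-separated form still has
to be split somewhere. Pulling one sub-option's value out of a string
like "file=foo.img,if=scsi" could look like this (a sketch, not existing
qemu code; the function name is mine):

```c
#include <stdio.h>
#include <string.h>

/* Find "name=" in a comma-separated sub-option string and copy its
 * value into val.  Returns 1 if found, 0 otherwise.  Valueless flags
 * and empty values are out of scope for this sketch. */
static int get_subopt(const char *opts, const char *name,
                      char *val, size_t valsz)
{
    size_t nlen = strlen(name);
    const char *p = opts;

    while (p) {
        const char *end = strchr(p, ',');
        size_t plen = end ? (size_t)(end - p) : strlen(p);

        if (plen > nlen + 1 && strncmp(p, name, nlen) == 0 &&
            p[nlen] == '=') {
            snprintf(val, valsz, "%.*s", (int)(plen - nlen - 1),
                     p + nlen + 1);
            return 1;
        }
        p = end ? end + 1 : NULL;
    }
    return 0;
}
```

A sectioned config format makes exactly this splitting unnecessary: each
sub-option arrives as its own key=value line.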
Re: [kvm-devel] [PATCH] qemu-kvm: Consolidate kvm_eat_signals
Avi Kivity wrote:
> Jan Kiszka wrote:
>>> Given that with the iothread we spend very little time processing
>>> signals in vcpu threads, maybe it's better to drop the loop
>>> completely. The common case is zero or one pending signals. The
>>> uncommon case of two or more pending signals will be handled by the
>>> KVM_RUN ioctl returning immediately with -EINTR (i.e. in the outer
>>> loop).
>>
>> You mean
>>
>> static void kvm_main_loop_wait(CPUState *env, int timeout)
>> {
>>     pthread_mutex_unlock(&qemu_mutex);
>>     kvm_eat_signal(env, timeout);
>>     pthread_mutex_lock(&qemu_mutex);
>>     cpu_single_env = env;
>>     vcpu_info[env->cpu_index].signalled = 0;
>> }
>>
>> ?
>
> Yes. The loop was a (perhaps premature) optimization that is now
> totally unnecessary, unless I'm missing something quite large.

It used to be that kvm_eat_signal() selected after consuming as many
signals as possible while only sleeping once. That's why there's a
combination of sleeping and polling. Now the VCPU threads never select,
so the whole loop can be simplified to a single sigtimedwait() that
always blocks.

In reality, I don't think sigtimedwait() is really needed/useful for
VCPUs anymore. We only use it to catch SIG_IPI, and we only use SIG_IPI
to break out of sleeping. I don't see any reason why we couldn't switch
over to using a file descriptor for notification (or a pthread
condition). In the very least, we could just select() on nothing and
allow SIG_IPI to break us out of the select.

Regards,

Anthony Liguori

> Oh. There used to be a bug where we didn't check for a pending signal
> before the first guest entry, so this would add a lot of latency
> (effectively making the bug window much larger). That was only closed
> in 2.6.24 (by 7e66f350).

-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
Don't miss this year's exciting event. There's still time to save $100.
Use priority code J8TL2D2.
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] qemu-kvm: Consolidate kvm_eat_signals
Avi Kivity wrote:
>> Now the VCPU threads never select, so the whole loop can be simplified
>> to a single sigtimedwait() that always blocks.
>>
>> In reality, I don't think sigtimedwait() is really needed/useful for
>> VCPUs anymore. We only use it to catch SIG_IPI, and we only use
>> SIG_IPI to break out of sleeping. I don't see any reason why we
>> couldn't switch over to using a file descriptor for notification (or a
>> pthread condition).
>
> How would you stop a vcpu running in guest mode?

Yeah, I forgot about that.

>> In the very least, we could just select() on nothing and allow SIG_IPI
>> to break us out of the select.
>
> sigtimedwait() (or just sigwait, now) doesn't require the signal to be
> delivered, so it's faster.

Yeah, sigtimedwait() is probably the right thing to do since we have to
use a signal for IPI.

Regards,

Anthony Liguori

> If there's nothing to select, why call select()?
Re: [kvm-devel] pinning, tsc and apic
Ryan Harper wrote:
> I've been digging into some of the instability we see when running
> larger numbers of guests at the same time. The test I'm currently
> using involves launching 64 1vcpu guests on an 8-way AMD box.

Note this is a Barcelona system and therefore should have a
fixed-frequency TSC.

> With the latest kvm-userspace git and kvm.git + Gerd's kvmclock fixes,
> I can launch all 64 of these 1 second apart,

BTW, what if you don't pace out the startups? Do we still have issues
with that?

> and only a handful (1 to 3) end up not making it up. In dmesg on the
> host, I get a couple of messages:
>
> [321365.362534] vcpu not ready for apic_round_robin
>
> and
>
> [321503.023788] Unsupported delivery mode 7
>
> Now, the interesting bit for me was when I used numactl to pin the
> guest to a processor, all of the guests come up with no issues at all.
> As I looked into it, it means that we're not running any of the vcpu
> migration code, which on svm is comprised of tsc_offset recalibration
> and apic migration, and on vmx, a little more per-vcpu work.

Another data point is that -no-kvm-irqchip doesn't make the situation
better.

> I've convinced myself that svm.c's tsc offset calculation works and
> handles the migration from cpu to cpu quite well. I added the
> following snippet to trigger if we ever encountered the case where we
> migrated to a tsc that was behind:
>
> 	rdtscll(tsc_this);
> 	delta = vcpu->arch.host_tsc - tsc_this;
> 	old_time = vcpu->arch.host_tsc + svm->vmcb->control.tsc_offset;
> 	new_time = tsc_this + svm->vmcb->control.tsc_offset + delta;
> 	if (new_time < old_time) {
> 		printk(KERN_ERR "ACK! (CPU%d->CPU%d) time goes back %llu\n",
> 		       vcpu->cpu, cpu, old_time - new_time);
> 	}
> 	svm->vmcb->control.tsc_offset += delta;

Time will never go backwards, but what can happen is that the TSC
frequency will slow down. This is because upon VCPU migration, we don't
account for the time between vcpu_put on the old processor and vcpu_load
on the new processor. This time then disappears.
A possible way to fix this (that's only valid on a processor with a fixed-frequency TSC) is to take a high-res timestamp on vcpu_put, and then on vcpu_load, take the delta timestamp since the old TSC was saved, and use the TSC frequency on the new pcpu to calculate the number of elapsed cycles. Assuming a fixed-frequency TSC, and a calibrated TSC across all processors, you could get the same effects by using the VT tsc delta logic. Basically, it always uses the new CPU's TSC unless that would cause the guest to move backwards in time. As long as you have a stable, calibrated TSC, this would work out. Can you try your old patch that did this and see if it fixes the problem? Noting that vcpu->arch.host_tsc is the tsc of the previous cpu the vcpu was running on (see svm_put_vcpu()). This allows me to check if we are in fact increasing the guest's view of the tsc. I've not been able to trigger this at all when the vcpus are migrating. As for the apic, the migrate code seems to be rather simple, but I've not yet dived in to see if we've got anything racy in there: lapic.c: void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->arch.apic; struct hrtimer *timer; if (!apic) return; timer = &apic->timer.dev; if (hrtimer_cancel(timer)) hrtimer_start(timer, timer->expires, HRTIMER_MODE_ABS); } There's a big FIXME in the __apic_timer_fn() to make sure the timer runs on the current pCPU. As written, it's possible for the timer to happen on a different pcpu than the current vcpu's, but it wasn't obvious to me that it would cause problems. Eddie, et al: Care to elaborate on what the TODO was trying to address? Regards, Anthony Liguori Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx (512) 838-9253 T/L: 678-9253 [EMAIL PROTECTED]
Re: [kvm-devel] [PATCH 5/5] Stop dropping so many RX packets in tap (v3)
Avi Kivity wrote: Anthony Liguori wrote: Normally, tap always reads packets and simply lets the client drop them if it cannot receive them. For virtio-net, this results in massive packet loss and about an 80% performance loss in TCP throughput. This patch modifies qemu_send_packet() to only deliver a packet to a VLAN client if it doesn't have a fd_can_read method or the fd_can_read method indicates that it can receive packets. We also return a status of whether any clients were able to receive the packet. If no clients were able to receive a packet, we buffer the packet until a client indicates that it can receive packets again. This patch also modifies the tap code to only read from the tap fd if at least one client on the VLAN is able to receive a packet. Finally, this patch changes the tap code to drain all possible packets from the tap device when the tap fd is readable. #if defined(CONFIG_SLIRP) @@ -3970,6 +3976,8 @@ typedef struct TAPState { VLANClientState *vc; int fd; char down_script[1024]; +char buf[4096]; +int size; } TAPState; This breaks large MTUs. They've always been broken for tap. How about the other way round: when the vlan consumer detects it can no longer receive packets, it tells that to the vlan. When all vlan consumers can no longer receive, tell the producer to stop producing. For the tap producer, this is simply removing its fd from the read poll list. When a vlan consumer becomes ready to receive again, it tells the vlan, which tells the producers, which then install their fds back again. Yeah, that's a nice idea. I'll think about it. I don't know if it's really worth doing as an intermediate step though. What I'd really like to do is have a vlan interface where consumers published all of their receive buffers. Then there's no need for notifications of receive-ability. Regards, Anthony Liguori This is a bit difficult since virtio and tap are both consumers and producers, but could be made to work, I think. 
Re: [kvm-devel] [PATCH 5/5] Stop dropping so many RX packets in tap (v3)
Avi Kivity wrote: Anthony Liguori wrote: How about the other way round: when the vlan consumer detects it can no longer receive packets, it tells that to the vlan. When all vlan consumers can no longer receive, tell the producer to stop producing. For the tap producer, this is simply removing its fd from the read poll list. When a vlan consumer becomes ready to receive again, it tells the vlan, which tells the producers, which then install their fds back again. Yeah, that's a nice idea. I'll think about it. I don't know if it's really worth doing as an intermediate step though. What I'd really like to do is have a vlan interface where consumers published all of their receive buffers. Then there's no need for notifications of receive-ability. That's definitely better, and is also more multiqueue nic / vringfd friendly. I still think interrupt-on-halfway-mark is needed much more urgently. It deals with concurrency much better: We already sort of do this. In try_fill_recv() in virtio-net.c, we try to allocate as many skbs as we can to fill the rx queue. We keep track of how many we've been able to allocate. Whenever we process an RX packet, we check to see if the current number of RX packets is less than half the maximum number of rx packets we've been able to allocate. In the common case of small queues, this should have exactly the behavior you describe. We don't currently suppress the RX notification even though we really could. The can_receive changes are really the key to this. Unless we are suppressing tap fd select()'ing, we can always suppress the RX notification. That's been sitting on my TODO for a bit. rx: host interrupts guest on halfway mark guest starts processing packets host keeps delivering tx: guest kicks host on halfway mark host starts processing packets second vcpu on guest keeps on queueing I'm not convinced it's at all practical to suppress notifications in the front-end. 
We simply don't know whether we'll get more packets which means we have to do TX mitigation within the front-end. We've been there, it's not as nice as doing it in the back-end. What we really ought to do in the back-end though, is start processing the TX packets as soon as we begin to do TX mitigation. This would address the ping latency issue while also increasing throughput (hopefully). One thing I've wanted to try is to register a bottom-half or a polling function so that the IO thread was always trying to process TX packets while the TX timer is active. Regards, Anthony Liguori It's also much better with multiqueue NICs, where there's no socket buffer to hold the packets while we're out of descriptors. - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH 1/5] Support more than 3.5GB with virtio (v3)
Avi Kivity wrote: Anthony Liguori wrote: We're pretty sloppy in virtio right now about phys_ram_base assumptions. This patch is an incremental step between what we have today and a full blown DMA API. I backported the DMA API but the performance impact was not acceptable to me. There's only a slight performance impact with this particular patch. Since we're no longer assuming guest physical memory is contiguous, we need a more complex way to validate the memory regions than just checking if it's within ram_size. Applied patches 1-2. Since patch 4 is under contention on qemu-devel, and 3 and 5 depend on it, I'd like to get the can_receive semantic change accepted first. I'll send it upstream, but I think it's much less of a divergence than the current virtio_net_poll hacks. Regards, Anthony Liguori
Re: [kvm-devel] [PATCH] qemu-kvm: fix guest resetting
Marcelo Tosatti wrote: diff --git a/qemu/qemu-kvm-ia64.c b/qemu/qemu-kvm-ia64.c index 4d0ddd7..d227d22 100644 --- a/qemu/qemu-kvm-ia64.c +++ b/qemu/qemu-kvm-ia64.c @@ -61,3 +61,7 @@ int kvm_arch_try_push_interrupts(void *opaque) void kvm_arch_update_regs_for_sipi(CPUState *env) { } + +void kvm_arch_cpu_reset(CPUState *env) +{ +} diff --git a/qemu/qemu-kvm-powerpc.c b/qemu/qemu-kvm-powerpc.c index 14ed945..024b18c 100644 --- a/qemu/qemu-kvm-powerpc.c +++ b/qemu/qemu-kvm-powerpc.c @@ -213,3 +213,7 @@ int handle_powerpc_dcr_write(int vcpu, uint32_t dcrn, uint32_t data) return 0; /* XXX ignore failed DCR ops */ } + +void kvm_arch_cpu_reset(CPUState *env) +{ +} diff --git a/qemu/qemu-kvm-x86.c b/qemu/qemu-kvm-x86.c index 9a771ff..28eb5c2 100644 --- a/qemu/qemu-kvm-x86.c +++ b/qemu/qemu-kvm-x86.c @@ -689,3 +689,19 @@ int handle_tpr_access(void *opaque, int vcpu, kvm_tpr_access_report(cpu_single_env, rip, is_write); return 0; } + +void kvm_arch_cpu_reset(CPUState *env) +{ +struct kvm_mp_state mp_state = { .mp_state = KVM_MP_STATE_UNINITIALIZED }; + +kvm_arch_load_regs(env); +if (env->cpu_index != 0) { + if (kvm_irqchip_in_kernel(kvm_context)) + kvm_set_mpstate(kvm_context, env->cpu_index, &mp_state); + else { + env->interrupt_request &= ~CPU_INTERRUPT_HARD; + env->hflags |= HF_HALTED_MASK; + env->exception_index = EXCP_HLT; + } +} +} diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index 3cc6d8e..1e1f065 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -31,8 +31,6 @@ kvm_context_t kvm_context; extern int smp_cpus; -static int qemu_kvm_reset_requested; - pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER; pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER; pthread_cond_t qemu_vcpu_cond = PTHREAD_COND_INITIALIZER; @@ -52,7 +50,6 @@ struct vcpu_info { int signalled; int stop; int stopped; -int reload_regs; int created; } vcpu_info[256]; @@ -242,21 +239,29 @@ static void pause_all_threads(void) { int i; +if (cpu_single_env) { +fprintf(stderr, "qemu-kvm: pause_all_threads from vcpu context\n"); +exit(0); +} + for (i = 0; i < smp_cpus; ++i) { vcpu_info[i].stop = 1; pthread_kill(vcpu_info[i].thread, SIG_IPI); } -while (!all_threads_paused()) { - CPUState *env = cpu_single_env; +while (!all_threads_paused()) pthread_cond_wait(&qemu_pause_cond, &qemu_mutex); - cpu_single_env = env; -} +cpu_single_env = NULL; } Personally, I prefer it the old way. All of the open-coded cpu_single_env's are tough to understand and I believe error-prone. I think a strategy of explicitly preserving cpu_single_env whenever we drop qemu_mutex is more robust. Explicitly setting cpu_single_env = NULL happens to work because this is only called from the io thread. It's less clear to a casual reader why it's necessary. In fact, I'd be much more inclined to see a wrapper around pthread_cond_wait() so that we never explicitly had to set cpu_single_env. Regards, Anthony Liguori } void qemu_system_shutdown_request(void)
Re: [kvm-devel] [PATCH] Fix e1000 can_receive handler
Aurelien Jarno wrote: On Wed, May 07, 2008 at 04:40:58PM -0500, Anthony Liguori wrote: The current logic of the can_receive handler is to allow packets whenever the receiver is disabled or when there are descriptors available in the ring. I think the logic ought to be to allow packets whenever the receiver is enabled and there are descriptors available in the ring. The current behaviour is actually correct, this is the way QEMU works: when the card is stopped, it should always accept packets, and then discard them. The previous patches in my virtio series change that behavior. Before delivering a packet to a VLAN, it checks to see if any of the VLAN clients are able to receive a packet. This is very important for achieving good performance with tap. With virtio-net, we were dropping a ton of packets with tap because there weren't descriptors available on the RX ring. I plan to submit that behavioral change to QEMU upstream along with the virtio drivers. I'm still optimizing phys_page_find() though. The performance impact of switching the ring manipulation to using the stl_phys accessors is unacceptable for KVM. Regards, Anthony Liguori - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] Fix e1000 can_receive handler
Avi Kivity wrote: Anthony Liguori wrote: Aurelien Jarno wrote: On Wed, May 07, 2008 at 04:40:58PM -0500, Anthony Liguori wrote: The current logic of the can_receive handler is to allow packets whenever the receiver is disabled or when there are descriptors available in the ring. I think the logic ought to be to allow packets whenever the receiver is enabled and there are descriptors available in the ring. The current behaviour is actually correct, this is the way QEMU works: when the card is stopped, it should always accept packets, and then discard them. The previous patches in my virtio series change that behavior. Before delivering a packet to a VLAN, it checks to see if any of the VLAN clients are able to receive a packet. This is very important for achieving good performance with tap. With virtio-net, we were dropping a ton of packets with tap because there weren't descriptors available on the RX ring. I plan to submit that behavioral change to QEMU upstream along with the virtio drivers. I'm still optimizing phys_page_find() though. The performance impact of switching the ring manipulation to using the stl_phys accessors is unacceptable for KVM. I think this indicates a virtio tuning problem. The rx queue should never be empty in normal operation, ever. Signalling on rx queue empty invites the overrun since it takes time for the interrupt to be delivered and for the guest to refill the queue. Well, to start with, the e1000_can_receive handler is just plain wrong. The logic is broken. This hasn't caused an issue because the code isn't used. That said, it is possible to tune virtio to get back the performance lost to dropped packets. We lose about 20% of iperf from dropped packets ATM. If we bump the ring size up to 512 we get it back. If we bump it to 1024, we start losing again. It's much less reliable than doing flow control though. 
Long term we need to do this dynamically but we can start with signalling on rx queue half empty (or half full if you're an optimist). We further need to tune the queue size so that this results in an acceptable number of interrupts: Part of the trouble with the embedded scatter gather list is that it's not at all clear what half empty is unless you count all of the descriptors. There may be one giant descriptor, or many small ones. 1 Gbps = 83K pps = 83 entries per half queue @ 1 KHz interrupt rate So we need 166 entries on the queue to keep a moderate interrupt rate, if we change it to signal at the halfway mark. Note that flow control still makes sense since it allows us to buffer some packets if the guest is scheduled out. But we can't use it as the primary mechanism since it won't exist with multiqueue NICs (where the virtio descriptors are fed to driver). Yes, we also need a better mechanism with vringfd(). I've been thinking about how to structure this API within QEMU but it's still not clear to me. Flow control seems to make sense though with the given API. Regards, Anthony Liguori Similar reasoning probably applied to tx. - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
[kvm-devel] [PATCH 2/3] virtio-net save/restore support
The only interesting bit here is that we have to ensure that we rearm the timer if necessary. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c index d15c2f4..5fe66ac 100644 --- a/qemu/hw/virtio-net.c +++ b/qemu/hw/virtio-net.c @@ -207,9 +207,40 @@ static void virtio_net_tx_timer(void *opaque) virtio_net_flush_tx(n, n->tx_vq); } +static void virtio_net_save(QEMUFile *f, void *opaque) +{ +VirtIONet *n = opaque; + +virtio_save(&n->vdev, f); + +qemu_put_buffer(f, n->mac, 6); +qemu_put_be32(f, n->tx_timer_active); +} + +static int virtio_net_load(QEMUFile *f, void *opaque, int version_id) +{ +VirtIONet *n = opaque; + +if (version_id != 1) + return -EINVAL; + +virtio_load(&n->vdev, f); + +qemu_get_buffer(f, n->mac, 6); +n->tx_timer_active = qemu_get_be32(f); + +if (n->tx_timer_active) { + qemu_mod_timer(n->tx_timer, + qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL); +} + +return 0; +} + PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn) { VirtIONet *n; +static int virtio_net_id; n = (VirtIONet *)virtio_init_pci(bus, "virtio-net", 6900, 0x1000, 0, VIRTIO_ID_NET, @@ -229,5 +260,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn) n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n); n->tx_timer_active = 0; +register_savevm("virtio-net", virtio_net_id++, 1, + virtio_net_save, virtio_net_load, n); + return (PCIDevice *)n; }
[kvm-devel] [PATCH 3/3] virtio-blk save/restore support
No additional state needs to be saved. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/hw/virtio-blk.c b/qemu/hw/virtio-blk.c index 048285a..148cb75 100644 --- a/qemu/hw/virtio-blk.c +++ b/qemu/hw/virtio-blk.c @@ -162,11 +162,30 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev) return (1 << VIRTIO_BLK_F_SEG_MAX | 1 << VIRTIO_BLK_F_GEOMETRY); } +static void virtio_blk_save(QEMUFile *f, void *opaque) +{ +VirtIOBlock *s = opaque; +virtio_save(&s->vdev, f); +} + +static int virtio_blk_load(QEMUFile *f, void *opaque, int version_id) +{ +VirtIOBlock *s = opaque; + +if (version_id != 1) + return -EINVAL; + +virtio_load(&s->vdev, f); + +return 0; +} + void *virtio_blk_init(PCIBus *bus, uint16_t vendor, uint16_t device, BlockDriverState *bs) { VirtIOBlock *s; int cylinders, heads, secs; +static int virtio_blk_id; s = (VirtIOBlock *)virtio_init_pci(bus, "virtio-blk", vendor, device, 0, VIRTIO_ID_BLOCK, @@ -184,5 +203,8 @@ void *virtio_blk_init(PCIBus *bus, uint16_t vendor, uint16_t device, virtio_add_queue(&s->vdev, 128, virtio_blk_handle_output); +register_savevm("virtio-blk", virtio_blk_id++, 1, + virtio_blk_save, virtio_blk_load, s); + return s; }
[kvm-devel] [PATCH 1/3] Virtio save/restore support (v2)
This patch implements the core of save/restore support for virtio. It's modelled after how PCI save/restore works. N.B. This makes savevm/loadvm work, but not live migration. The issue with live migration is that we're manipulating guest memory without updating the dirty bitmap correctly. I will submit a patch in the near future that addresses that problem. Since v1, I fixed the Signed-off-by line. Sorry about that. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c index a4c9d10..440cc69 100644 --- a/qemu/hw/virtio.c +++ b/qemu/hw/virtio.c @@ -420,7 +420,6 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size, vdev->vq[i].vring.num = queue_size; vdev->vq[i].handle_output = handle_output; -vdev->vq[i].index = i; return &vdev->vq[i]; } @@ -436,6 +435,71 @@ void virtio_notify(VirtIODevice *vdev, VirtQueue *vq) virtio_update_irq(vdev); } +void virtio_save(VirtIODevice *vdev, QEMUFile *f) +{ +int i; + +pci_device_save(&vdev->pci_dev, f); + +qemu_put_be32s(f, &vdev->addr); +qemu_put_8s(f, &vdev->status); +qemu_put_8s(f, &vdev->isr); +qemu_put_be16s(f, &vdev->queue_sel); +qemu_put_be32s(f, &vdev->features); +qemu_put_be32(f, vdev->config_len); +qemu_put_buffer(f, vdev->config, vdev->config_len); + +for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) { + if (vdev->vq[i].vring.num == 0) + break; +} + +qemu_put_be32(f, i); + +for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) { + if (vdev->vq[i].vring.num == 0) + break; + + qemu_put_be32(f, vdev->vq[i].vring.num); + qemu_put_be32s(f, &vdev->vq[i].pfn); + qemu_put_be16s(f, &vdev->vq[i].last_avail_idx); +} +} + +void virtio_load(VirtIODevice *vdev, QEMUFile *f) +{ +int num, i; + +pci_device_load(&vdev->pci_dev, f); + +qemu_get_be32s(f, &vdev->addr); +qemu_get_8s(f, &vdev->status); +qemu_get_8s(f, &vdev->isr); +qemu_get_be16s(f, &vdev->queue_sel); +qemu_get_be32s(f, &vdev->features); +vdev->config_len = qemu_get_be32(f); +qemu_get_buffer(f, vdev->config, vdev->config_len); + +num = qemu_get_be32(f); + +for (i = 0; i < num; i++) { + vdev->vq[i].vring.num = qemu_get_be32(f); + qemu_get_be32s(f, &vdev->vq[i].pfn); + qemu_get_be16s(f, &vdev->vq[i].last_avail_idx); + + if (vdev->vq[i].pfn) { + size_t size; + target_phys_addr_t pa; + + pa = (ram_addr_t)vdev->vq[i].pfn << TARGET_PAGE_BITS; + size = virtqueue_size(vdev->vq[i].vring.num); + virtqueue_init(&vdev->vq[i], virtio_map_gpa(pa, size)); + } +} + +virtio_update_irq(vdev); +} + VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name, uint16_t vendor, uint16_t device, uint16_t subvendor, uint16_t subdevice, diff --git a/qemu/hw/virtio.h b/qemu/hw/virtio.h index dee97ba..ed8cfd6 100644 --- a/qemu/hw/virtio.h +++ b/qemu/hw/virtio.h @@ -87,7 +87,6 @@ struct VirtQueue uint32_t pfn; uint16_t last_avail_idx; void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq); -int index; }; #define VIRTQUEUE_MAX_SIZE 1024 @@ -108,8 +107,6 @@ struct VirtIODevice PCIDevice pci_dev; const char *name; uint32_t addr; -uint16_t vendor; -uint16_t device; uint8_t status; uint8_t isr; uint16_t queue_sel; @@ -140,4 +137,8 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem); void virtio_notify(VirtIODevice *vdev, VirtQueue *vq); +void virtio_save(VirtIODevice *vdev, QEMUFile *f); + +void virtio_load(VirtIODevice *vdev, QEMUFile *f); + #endif
Re: [kvm-devel] qemu/kvm: support for pci passthrough
Dor Laor wrote: On Thu, 2008-05-08 at 12:03 +0200, Nicolas Daneau wrote: Hi, I saw there was an active discussion about pci passthrough support in KVM. I'm not a Dev, only a sys admin that needs this support. As I saw on the KVM home page that this feature is planned for 2H2008 in the roadmap, I wondered if you already have a more precise time frame for this release? Xen already has this feature, but for the rest I prefer KVM. As this is a key point for the final decision (we have a Digium TDM11B FXO/FXS pci card for an Asterisk VM), I'm waiting for an update about this feature. This feature is indeed work in progress. We have floating patches for several solutions: 1. Using Intel's VT-d - hw iommu support. Patches were sent, needs some more scrubbing. The good thing is that it can live within the kvm module and does not have to wait for kernel inclusion like the two below. Nope, it uses the existing Linux VT-d support and requires patches to the existing code. It can't live entirely within the external module. Regards, Anthony Liguori 2. PV dma - good only for Linux guests. Some locking issues need to be solved + get into mainline kernel (>=2.6.27) 3. 1-1 mapping of the guest-host addresses. Will work for any guest, no need for special hw, good only for a single guest and not 100% secure. Needs a little more attention; worked in the past privately. Bottom line, soon you'll have more than one option for pci passthrough using kvm. There is no assurance we'll make it exactly in Q2. You can help by applying patches and sending results. Cheers, Dor Thanks for your answer
Re: [kvm-devel] strange problem with virtio on guest with over 3648M RAM
Henrik Holst wrote: Hello KVM developers! We are experiencing a very strange problem: we have been testing Ubuntu Hardy Heron Server as a guest (kvm-68 on the host) and when using the virtio nic everything works fine until we give the guest more than 3648 MiB of RAM. If we for example start with -m 4096, the Ubuntu guest brings up eth0 but it is impossible to get a dhcp address, and if one sets a static address no packets whatsoever leave the guest. So there seems to be a problem with the virtio guest drivers when there is more than 3648 MiB of ram available :), is there anyone else that has experienced the same problems, or are there people running virtio on guests with more ram where everything is working ok? It's a known bug. I have a fix for it locally that I'll send out today. Regards, Anthony Liguori This problem could be local to Ubuntu but I don't know of any other distribution with virtio guest drivers out of the box.
[kvm-devel] [PATCH 1/6] Add support for eventfd() (v3)
This patch adds compatibility code so that we can make use of eventfd() within QEMU. eventfd() is a pretty useful mechanism as it allows multiple notifications to be batched in a single system call. We emulate eventfd() using a standard pipe(). Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/Makefile.target b/qemu/Makefile.target index bb4b9a3..46654f3 100644 --- a/qemu/Makefile.target +++ b/qemu/Makefile.target @@ -208,7 +208,7 @@ CPPFLAGS+=-I$(SRC_PATH)/tcg/sparc endif ifeq ($(USE_KVM), 1) -LIBOBJS+=qemu-kvm.o +LIBOBJS+=qemu-kvm.o kvm-compatfd.o endif ifdef CONFIG_SOFTFLOAT LIBOBJS+=fpu/softfloat.o diff --git a/qemu/kvm-compatfd.c b/qemu/kvm-compatfd.c new file mode 100644 index 000..1b030ba --- /dev/null +++ b/qemu/kvm-compatfd.c @@ -0,0 +1,33 @@ +/* + * signalfd/eventfd compatibility + * + * Copyright IBM, Corp. 2008 + * + * Authors: + * Anthony Liguori [EMAIL PROTECTED] + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ + +#include "qemu-common.h" +#include "qemu-kvm.h" + +#include <sys/syscall.h> + +int kvm_eventfd(int *fds) +{ +#if defined(SYS_eventfd) +int ret; + +ret = syscall(SYS_eventfd, 0); +if (ret >= 0) { + fds[0] = fds[1] = ret; + return 0; +} else if (!(ret == -1 && errno == ENOSYS)) + return ret; +#endif + +return pipe(fds); +} diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h index 024a653..8fa3c1b 100644 --- a/qemu/qemu-kvm.h +++ b/qemu/qemu-kvm.h @@ -79,6 +79,8 @@ int handle_powerpc_dcr_read(int vcpu, uint32_t dcrn, uint32_t *data); int handle_powerpc_dcr_write(int vcpu,uint32_t dcrn, uint32_t data); #endif +int kvm_eventfd(int *fds); + #define ALIGN(x, y) (((x)+(y)-1) & ~((y)-1)) #define BITMAP_SIZE(m) (ALIGN(((m) >> TARGET_PAGE_BITS), HOST_LONG_BITS) / 8)
[kvm-devel] [PATCH 5/6] Interrupt io thread in qemu_set_fd_handler2 (v3)
The select() in the IO thread may wait a long time before rebuilding the fd set. Whenever we do something that changes the fd set, we should interrupt the IO thread. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/vl.c b/qemu/vl.c index 1192759..e9f0ca4 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -260,6 +260,16 @@ static int event_pending = 1; #define TFR(expr) do { if ((expr) != -1) break; } while (errno == EINTR) +/* KVM runs the main loop in a separate thread. If we update one of the lists + * that are polled before or after select(), we need to make sure to break out + * of the select() to ensure the new item is serviced. + */ +static void main_loop_break(void) +{ +if (kvm_enabled()) + qemu_kvm_notify_work(); +} + void decorate_application_name(char *appname, int max_len) { if (kvm_enabled()) @@ -5680,6 +5690,7 @@ int qemu_set_fd_handler2(int fd, ioh->opaque = opaque; ioh->deleted = 0; } +main_loop_break(); return 0; } @@ -7606,8 +7617,7 @@ void qemu_bh_schedule(QEMUBH *bh) if (env) { cpu_interrupt(env, CPU_INTERRUPT_EXIT); } -if (kvm_enabled()) -qemu_kvm_notify_work(); +main_loop_break(); } void qemu_bh_cancel(QEMUBH *bh)
[kvm-devel] [PATCH 6/6] Only select once per-main_loop iteration (v3)
QEMU is rather aggressive about exhausting the wait period when selecting. This is fine when the wait period is low and when there is significant delays in-between selects as it improves IO throughput. With the IO thread, there is a very small delay between selects and our wait period for select is very large. This patch changes main_loop_wait to only select once before doing the various other things in the main loop. This generally improves responsiveness of things like SDL but also improves individual file descriptor throughput quite dramatically. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index 492c3c4..cc8f292 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -423,24 +423,6 @@ void qemu_kvm_notify_work(void) fprintf(stderr, failed to notify io thread\n); } -static int received_signal; - -/* QEMU relies on periodically breaking out of select via EINTR to poll for IO - and timer signals. Since we're now using a file descriptor to handle - signals, select() won't be interrupted by a signal. We need to forcefully - break the select() loop when a signal is received hence - kvm_check_received_signal(). */ - -int kvm_check_received_signal(void) -{ -if (received_signal) { - received_signal = 0; - return 1; -} - -return 0; -} - /* If we have signalfd, we mask out the signals we want to handle and then * use signalfd to listen for them. We rely on whatever the current signal * handler is to dispatch the signals when we receive them. 
@@ -474,8 +456,6 @@ static void sigfd_handler(void *opaque) pthread_cond_signal(qemu_aio_cond); } } - -received_signal = 1; } /* Used to break IO thread out of select */ @@ -497,8 +477,6 @@ static void io_thread_wakeup(void *opaque) offset += len; } - -received_signal = 1; } int kvm_main_loop(void) diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h index df573ec..21606e9 100644 --- a/qemu/qemu-kvm.h +++ b/qemu/qemu-kvm.h @@ -128,13 +128,4 @@ static inline void kvm_sleep_end(void) kvm_mutex_lock(); } -int kvm_check_received_signal(void); - -static inline int kvm_received_signal(void) -{ -if (kvm_enabled()) - return kvm_check_received_signal(); -return 0; -} - #endif diff --git a/qemu/vl.c b/qemu/vl.c index 4e25366..a1aa270 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -7381,23 +7381,18 @@ void main_loop_wait(int timeout) slirp_select_fill(nfds, rfds, wfds, xfds); } #endif - moreio: ret = qemu_select(nfds + 1, rfds, wfds, xfds, tv); if (ret 0) { IOHandlerRecord **pioh; -int more = 0; for(ioh = first_io_handler; ioh != NULL; ioh = ioh-next) { if (!ioh-deleted ioh-fd_read FD_ISSET(ioh-fd, rfds)) { ioh-fd_read(ioh-opaque); -if (!ioh-fd_read_poll || ioh-fd_read_poll(ioh-opaque)) -more = 1; -else +if (!(ioh-fd_read_poll ioh-fd_read_poll(ioh-opaque))) FD_CLR(ioh-fd, rfds); } if (!ioh-deleted ioh-fd_write FD_ISSET(ioh-fd, wfds)) { ioh-fd_write(ioh-opaque); -more = 1; } } @@ -7411,8 +7406,6 @@ void main_loop_wait(int timeout) } else pioh = ioh-next; } -if (more !kvm_received_signal()) -goto moreio; } #if defined(CONFIG_SLIRP) if (slirp_inited) { - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
[kvm-devel] [PATCH 4/6] Use signalfd() in io-thread (v3)
This patch reworks the IO thread to use signalfd() instead of sigtimedwait(). This will eliminate the need to use SIGIO everywhere. Since v2, I've fixed a nasty bug in qemu_kvm_aio_wait(). We can't use main_loop_wait() to sleep if it's at all possible we're being called from a handler in main_loop_wait() (which is the case with qemu_kvm_aio_wait()). Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index 7134e56..492c3c4 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -17,6 +17,7 @@ int kvm_pit = 1; #include sysemu.h #include qemu-common.h #include console.h +#include block.h #include qemu-kvm.h #include libkvm.h @@ -36,18 +37,11 @@ pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER; pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER; pthread_cond_t qemu_vcpu_cond = PTHREAD_COND_INITIALIZER; pthread_cond_t qemu_system_cond = PTHREAD_COND_INITIALIZER; +pthread_cond_t qemu_pause_cond = PTHREAD_COND_INITIALIZER; __thread struct vcpu_info *vcpu; static int qemu_system_ready; -struct qemu_kvm_signal_table { -sigset_t sigset; -sigset_t negsigset; -}; - -static struct qemu_kvm_signal_table io_signal_table; -static struct qemu_kvm_signal_table vcpu_signal_table; - #define SIG_IPI (SIGRTMIN+4) struct vcpu_info { @@ -64,6 +58,7 @@ struct vcpu_info { pthread_t io_thread; static int io_thread_fd = -1; +static int io_thread_sigfd = -1; static inline unsigned long kvm_get_thread_id(void) { @@ -172,37 +167,23 @@ static int has_work(CPUState *env) return kvm_arch_has_work(env); } -static int kvm_process_signal(int si_signo) -{ -struct sigaction sa; - -switch (si_signo) { -case SIGUSR2: -pthread_cond_signal(qemu_aio_cond); -break; -case SIGALRM: -case SIGIO: -sigaction(si_signo, NULL, sa); -sa.sa_handler(si_signo); -break; -} - -return 1; -} - -static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, - int timeout) +static int kvm_eat_signal(CPUState *env, int timeout) { struct timespec ts; int r, 
e, ret = 0; siginfo_t siginfo; +sigset_t waitset; ts.tv_sec = timeout / 1000; ts.tv_nsec = (timeout % 1000) * 100; -r = sigtimedwait(waitset-sigset, siginfo, ts); +sigemptyset(waitset); +sigaddset(waitset, SIG_IPI); + +r = sigtimedwait(waitset, siginfo, ts); if (r == -1 (errno == EAGAIN || errno == EINTR) !timeout) return 0; e = errno; + pthread_mutex_lock(qemu_mutex); if (env vcpu) cpu_single_env = vcpu-env; @@ -211,12 +192,12 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, exit(1); } if (r != -1) -ret = kvm_process_signal(siginfo.si_signo); + ret = 1; if (env vcpu_info[env-cpu_index].stop) { vcpu_info[env-cpu_index].stop = 0; vcpu_info[env-cpu_index].stopped = 1; - qemu_kvm_notify_work(); + pthread_cond_signal(qemu_pause_cond); } pthread_mutex_unlock(qemu_mutex); @@ -227,14 +208,13 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, static void kvm_eat_signals(CPUState *env, int timeout) { int r = 0; -struct qemu_kvm_signal_table *waitset = vcpu_signal_table; -while (kvm_eat_signal(waitset, env, 0)) +while (kvm_eat_signal(env, 0)) r = 1; if (!r timeout) { - r = kvm_eat_signal(waitset, env, timeout); + r = kvm_eat_signal(env, timeout); if (r) - while (kvm_eat_signal(waitset, env, 0)) + while (kvm_eat_signal(env, 0)) ; } } @@ -266,12 +246,8 @@ static void pause_all_threads(void) vcpu_info[i].stop = 1; pthread_kill(vcpu_info[i].thread, SIG_IPI); } -while (!all_threads_paused()) { - pthread_mutex_unlock(qemu_mutex); - kvm_eat_signal(io_signal_table, NULL, 1000); - pthread_mutex_lock(qemu_mutex); - cpu_single_env = NULL; -} +while (!all_threads_paused()) + pthread_cond_wait(qemu_pause_cond, qemu_mutex); } static void resume_all_threads(void) @@ -310,6 +286,12 @@ static void setup_kernel_sigmask(CPUState *env) { sigset_t set; +sigemptyset(set); +sigaddset(set, SIGUSR2); +sigaddset(set, SIGIO); +sigaddset(set, SIGALRM); +sigprocmask(SIG_BLOCK, set, NULL); + sigprocmask(SIG_BLOCK, NULL, set); 
sigdelset(set, SIG_IPI); @@ -346,7 +328,7 @@ static int kvm_main_loop_cpu(CPUState *env) cpu_single_env = env; while (1) { while (!has_work(env)) - kvm_main_loop_wait(env, 10); + kvm_main_loop_wait(env, 1000); if (env-interrupt_request CPU_INTERRUPT_HARD) env-hflags = ~HF_HALTED_MASK; if (!kvm_irqchip_in_kernel(kvm_context) info-sipi_needed) @@ -394,18 +376,6 @@ static void *ap_main_loop(void *_env) return NULL
[kvm-devel] [PATCH 3/6] Add support for signalfd() (v3)
This patch adds compatibility code so that we can use signalfd() within QEMU. signalfd() provides a mechanism to receive signal notification through a file descriptor. This is very useful in eliminating the signal/select race condition. If signalfd() isn't available, we spawn a thread that uses sigwaitinfo() to emulate it. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/kvm-compatfd.c b/qemu/kvm-compatfd.c index 1b030ba..b1311e2 100644 --- a/qemu/kvm-compatfd.c +++ b/qemu/kvm-compatfd.c @@ -15,6 +15,97 @@ #include qemu-kvm.h #include sys/syscall.h +#include pthread.h + +struct sigfd_compat_info +{ +sigset_t mask; +int fd; +}; + +static void *sigwait_compat(void *opaque) +{ +struct sigfd_compat_info *info = opaque; +int err; + +sigprocmask(SIG_BLOCK, info-mask, NULL); + +do { + siginfo_t siginfo; + + err = sigwaitinfo(info-mask, siginfo); + if (err == -1 errno == EINTR) + continue; + + if (err 0) { + char buffer[128]; + size_t offset = 0; + + memcpy(buffer, err, sizeof(err)); + while (offset sizeof(buffer)) { + ssize_t len; + + len = write(info-fd, buffer + offset, + sizeof(buffer) - offset); + if (len == -1 errno == EINTR) + continue; + + if (len = 0) { + err = -1; + break; + } + + offset += len; + } + } +} while (err = 0); + +return NULL; +} + +static int kvm_signalfd_compat(const sigset_t *mask) +{ +pthread_attr_t attr; +pthread_t tid; +struct sigfd_compat_info *info; +int fds[2]; + +info = malloc(sizeof(*info)); +if (info == NULL) { + errno = ENOMEM; + return -1; +} + +if (pipe(fds) == -1) { + free(info); + return -1; +} + +memcpy(info-mask, mask, sizeof(*mask)); +info-fd = fds[1]; + +pthread_attr_init(attr); +pthread_attr_setdetachstate(attr, PTHREAD_CREATE_DETACHED); + +pthread_create(tid, attr, sigwait_compat, info); + +pthread_attr_destroy(attr); + +return fds[0]; +} + +int kvm_signalfd(const sigset_t *mask) +{ +#if defined(SYS_signalfd) +int ret; + +ret = syscall(SYS_signalfd, -1, mask, _NSIG / 8); +if (!(ret == -1 errno == ENOSYS)) + 
return ret; +#endif + +return kvm_signalfd_compat(mask); +} int kvm_eventfd(int *fds) { diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h index 8fa3c1b..a0dd4a8 100644 --- a/qemu/qemu-kvm.h +++ b/qemu/qemu-kvm.h @@ -10,6 +10,8 @@ #include cpu.h +#include signal.h + int kvm_main_loop(void); int kvm_qemu_init(void); int kvm_qemu_create_context(void); @@ -79,6 +81,16 @@ int handle_powerpc_dcr_read(int vcpu, uint32_t dcrn, uint32_t *data); int handle_powerpc_dcr_write(int vcpu,uint32_t dcrn, uint32_t data); #endif +#if !defined(SYS_signalfd) +struct signalfd_siginfo { +uint32_t ssi_signo; +uint8_t pad[124]; +}; +#else +#include linux/signalfd.h +#endif + +int kvm_signalfd(const sigset_t *mask); int kvm_eventfd(int *fds); #define ALIGN(x, y) (((x)+(y)-1) ~((y)-1)) - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
[kvm-devel] [PATCH 1/5] Support more than 3.5GB with virtio (v3)
We're pretty sloppy in virtio right now about phys_ram_base assumptions. This patch is an incremental step between what we have today and a full blown DMA API. I backported the DMA API but the performance impact was not acceptable to me. There's only a slight performance impact with this particular patch. Since we're no longer assuming guest physical memory is contiguous, we need a more complex way to validate the memory regions than just checking if it's within ram_size. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c index 6a50001..a4c9d10 100644 --- a/qemu/hw/virtio.c +++ b/qemu/hw/virtio.c @@ -58,6 +58,48 @@ /* virt queue functions */ +static void *virtio_map_gpa(target_phys_addr_t addr, size_t size) +{ +ram_addr_t off; +target_phys_addr_t addr1; + +off = cpu_get_physical_page_desc(addr); +if ((off ~TARGET_PAGE_MASK) != IO_MEM_RAM) { + fprintf(stderr, virtio DMA to IO ram\n); + exit(1); +} + +off = (off TARGET_PAGE_MASK) | (addr ~TARGET_PAGE_MASK); + +for (addr1 = addr + TARGET_PAGE_SIZE; +addr1 TARGET_PAGE_ALIGN(addr + size); +addr1 += TARGET_PAGE_SIZE) { + ram_addr_t off1; + + off1 = cpu_get_physical_page_desc(addr1); + if ((off1 ~TARGET_PAGE_MASK) != IO_MEM_RAM) { + fprintf(stderr, virtio DMA to IO ram\n); + exit(1); + } + + off1 = (off1 TARGET_PAGE_MASK) | (addr1 ~TARGET_PAGE_MASK); + + if (off1 != (off + (addr1 - addr))) { + fprintf(stderr, discontigous virtio memory\n); + exit(1); + } +} + +return phys_ram_base + off; +} + +static size_t virtqueue_size(int num) +{ +return TARGET_PAGE_ALIGN((sizeof(VRingDesc) * num) + +(sizeof(VRingAvail) + sizeof(uint16_t) * num)) + + (sizeof(VRingUsed) + sizeof(VRingUsedElem) * num); +} + static void virtqueue_init(VirtQueue *vq, void *p) { vq-vring.desc = p; @@ -127,9 +169,6 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem) do { struct iovec *sg; - if ((vq-vring.desc[i].addr + vq-vring.desc[i].len) ram_size) - errx(1, Guest sent invalid pointer); - if 
(vq-vring.desc[i].flags VRING_DESC_F_WRITE) sg = elem-in_sg[elem-in_num++]; else @@ -137,7 +176,9 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem) /* Grab the first descriptor, and check it's OK. */ sg-iov_len = vq-vring.desc[i].len; - sg-iov_base = phys_ram_base + vq-vring.desc[i].addr; + sg-iov_base = virtio_map_gpa(vq-vring.desc[i].addr, sg-iov_len); + if (sg-iov_base == NULL) + errx(1, Invalid mapping\n); /* If we've got too many, that implies a descriptor loop. */ if ((elem-in_num + elem-out_num) vq-vring.num) @@ -198,9 +239,10 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val) vdev-vq[vdev-queue_sel].pfn = val; if (pa == 0) { virtio_reset(vdev); - } else if (pa (ram_size - TARGET_PAGE_SIZE)) { - virtqueue_init(vdev-vq[vdev-queue_sel], phys_ram_base + pa); - /* FIXME if pa == 0, deal with device tear down */ + } else { + size_t size = virtqueue_size(vdev-vq[vdev-queue_sel].vring.num); + virtqueue_init(vdev-vq[vdev-queue_sel], + virtio_map_gpa(pa, size)); } break; case VIRTIO_PCI_QUEUE_SEL: - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
[kvm-devel] [PATCH 2/5] Validate the SG list layouts in virtio
We should check that the first element is the size we expect instead of just casting blindly.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio-blk.c b/qemu/hw/virtio-blk.c
index 3af36db..048285a 100644
--- a/qemu/hw/virtio-blk.c
+++ b/qemu/hw/virtio-blk.c
@@ -56,8 +56,6 @@ struct virtio_blk_outhdr
     uint32_t ioprio;
     /* Sector (ie. 512 byte offset) */
     uint64_t sector;
-    /* Where to put reply. */
-    uint64_t id;
 };

 #define VIRTIO_BLK_S_OK 0
@@ -94,6 +92,17 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
         off_t off;
         int i;

+        if (elem.out_num < 1 || elem.in_num < 1) {
+            fprintf(stderr, "virtio-blk missing headers\n");
+            exit(1);
+        }
+
+        if (elem.out_sg[0].iov_len != sizeof(*out) ||
+            elem.in_sg[elem.in_num - 1].iov_len != sizeof(*in)) {
+            fprintf(stderr, "virtio-blk header not in correct element\n");
+            exit(1);
+        }
+
         out = (void *)elem.out_sg[0].iov_base;
         in = (void *)elem.in_sg[elem.in_num - 1].iov_base;
         off = out->sector;
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index f727b14..5ac5089 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -125,6 +125,11 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
         return;
     }

+    if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
+        fprintf(stderr, "virtio-net header not in first element\n");
+        exit(1);
+    }
+
     hdr = (void *)elem.in_sg[0].iov_base;
     hdr->flags = 0;
     hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
@@ -197,6 +202,11 @@ void virtio_net_poll(void)
             continue;
         }

+        if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
+            fprintf(stderr, "virtio-net header not in first element\n");
+            exit(1);
+        }
+
         hdr = (void *)elem.in_sg[0].iov_base;
         hdr->flags = 0;
         hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
[kvm-devel] [PATCH 3/5] Revert virtio tap hack (v3)
While it has served us well, it is long overdue that we eliminate the virtio-net tap hack. It turns out that zero-copy has very little impact on performance. The tap hack was gaining such a significant performance boost not because of zero-copy, but because it avoided dropping packets on receive which is apparently a significant problem with the tap implementation in QEMU. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/hw/pc.h b/qemu/hw/pc.h index 73e3918..c284bf1 100644 --- a/qemu/hw/pc.h +++ b/qemu/hw/pc.h @@ -155,7 +155,6 @@ void isa_ne2000_init(int base, qemu_irq irq, NICInfo *nd); /* virtio-net.c */ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn); -void virtio_net_poll(void); /* virtio-blk.h */ void *virtio_blk_init(PCIBus *bus, uint16_t vendor, uint16_t device, diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c index e2db49c..a26d04e 100644 --- a/qemu/hw/virtio-net.c +++ b/qemu/hw/virtio-net.c @@ -13,7 +13,6 @@ #include virtio.h #include net.h -#include pc.h #include qemu-timer.h /* from Linux's virtio_net.h */ @@ -62,15 +61,10 @@ typedef struct VirtIONet VirtQueue *tx_vq; VLANClientState *vc; int can_receive; -int tap_fd; -struct VirtIONet *next; -int do_notify; QEMUTimer *tx_timer; int tx_timer_active; } VirtIONet; -static VirtIONet *VirtIONetHead = NULL; - static VirtIONet *to_virtio_net(VirtIODevice *vdev) { return (VirtIONet *)vdev; @@ -105,7 +99,6 @@ static int virtio_net_can_receive(void *opaque) return (n-vdev.status VIRTIO_CONFIG_S_DRIVER_OK) n-can_receive; } -/* -net user receive function */ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) { VirtIONet *n = opaque; @@ -149,92 +142,6 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) virtio_notify(n-vdev, n-rx_vq); } -/* -net tap receive handler */ -void virtio_net_poll(void) -{ -VirtIONet *vnet; -int len; -fd_set rfds; -struct timeval tv; -int max_fd = -1; -VirtQueueElement elem; -struct virtio_net_hdr 
*hdr; -int did_notify; - -FD_ZERO(rfds); -tv.tv_sec = 0; -tv.tv_usec = 0; - -while (1) { - -// Prepare the set of device to select from -for (vnet = VirtIONetHead; vnet; vnet = vnet-next) { - -if (vnet-tap_fd == -1) -continue; - -vnet-do_notify = 0; -//first check if the driver is ok -if (!virtio_net_can_receive(vnet)) -continue; - -/* FIXME: the drivers really need to set their status better */ -if (vnet-rx_vq-vring.avail == NULL) { -vnet-can_receive = 0; -continue; -} - -FD_SET(vnet-tap_fd, rfds); -if (max_fd vnet-tap_fd) max_fd = vnet-tap_fd; -} - -if (select(max_fd + 1, rfds, NULL, NULL, tv) = 0) -break; - -// Now check who has data pending in the tap -for (vnet = VirtIONetHead; vnet; vnet = vnet-next) { - -if (!FD_ISSET(vnet-tap_fd, rfds)) -continue; - -if (virtqueue_pop(vnet-rx_vq, elem) == 0) { -vnet-can_receive = 0; -continue; -} - - if (elem.in_num 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) { - fprintf(stderr, virtio-net header not in first element\n); - exit(1); - } - -hdr = (void *)elem.in_sg[0].iov_base; -hdr-flags = 0; -hdr-gso_type = VIRTIO_NET_HDR_GSO_NONE; -again: -len = readv(vnet-tap_fd, elem.in_sg[1], elem.in_num - 1); -if (len == -1) { -if (errno == EINTR || errno == EAGAIN) -goto again; -else -fprintf(stderr, reading network error %d, len); -} -virtqueue_push(vnet-rx_vq, elem, sizeof(*hdr) + len); -vnet-do_notify = 1; -} - -/* signal other side */ -did_notify = 0; -for (vnet = VirtIONetHead; vnet; vnet = vnet-next) -if (vnet-do_notify) { -virtio_notify(vnet-vdev, vnet-rx_vq); -did_notify++; -} -if (!did_notify) -break; - } - -} - /* TX */ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq) { @@ -313,12 +220,6 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn) memcpy(n-mac, nd-macaddr, 6); n-vc = qemu_new_vlan_client(nd-vlan, virtio_net_receive, virtio_net_can_receive, n); -n-tap_fd = hack_around_tap(n-vc-vlan-first_client); -if (n-tap_fd != -1) { -n-next = VirtIONetHead; -//push the device on top of the list 
-VirtIONetHead = n; -} n-tx_timer = qemu_new_timer(rt_clock, virtio_net_tx_timer, n); n-tx_timer_active = 0; diff --git a/qemu/vl.c
[kvm-devel] [PATCH 5/5] Stop dropping so many RX packets in tap (v3)
Normally, tap always reads packets and simply lets the client drop them if it cannot receive them. For virtio-net, this results in massive packet loss and about an 80% performance loss in TCP throughput. This patch modifies qemu_send_packet() to only deliver a packet to a VLAN client if it doesn't have a fd_can_read method or the fd_can_read method indicates that it can receive packets. We also return a status of whether any clients were able to receive the packet. If no clients were able to receive a packet, we buffer the packet until a client indicates that it can receive packets again. This patch also modifies the tap code to only read from the tap fd if at least one client on the VLAN is able to receive a packet. Finally, this patch changes the tap code to drain all possible packets from the tap device when the tap fd is readable. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/net.h b/qemu/net.h index 13daa27..dfdf9af 100644 --- a/qemu/net.h +++ b/qemu/net.h @@ -29,7 +29,7 @@ VLANClientState *qemu_new_vlan_client(VLANState *vlan, IOCanRWHandler *fd_can_read, void *opaque); int qemu_can_send_packet(VLANClientState *vc); -void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size); +int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size); void qemu_handler_true(void *opaque); void do_info_network(void); diff --git a/qemu/vl.c b/qemu/vl.c index 03051da..28f8d1d 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -3754,10 +3754,11 @@ int qemu_can_send_packet(VLANClientState *vc1) return 0; } -void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) +int qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) { VLANState *vlan = vc1-vlan; VLANClientState *vc; +int ret = -EAGAIN; #if 0 printf(vlan %d send:\n, vlan-id); @@ -3765,9 +3766,14 @@ void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) #endif for(vc = vlan-first_client; vc != NULL; vc = vc-next) { if (vc != vc1) { 
-vc-fd_read(vc-opaque, buf, size); + if (!vc-fd_can_read || vc-fd_can_read(vc-opaque)) { + vc-fd_read(vc-opaque, buf, size); + ret = 0; + } } } + +return ret; } #if defined(CONFIG_SLIRP) @@ -3970,6 +3976,8 @@ typedef struct TAPState { VLANClientState *vc; int fd; char down_script[1024]; +char buf[4096]; +int size; } TAPState; static void tap_receive(void *opaque, const uint8_t *buf, int size) @@ -3985,24 +3993,70 @@ static void tap_receive(void *opaque, const uint8_t *buf, int size) } } +static int tap_can_send(void *opaque) +{ +TAPState *s = opaque; +VLANClientState *vc; +int can_receive = 0; + +/* Check to see if any of our clients can receive a packet */ +for (vc = s-vc-vlan-first_client; vc; vc = vc-next) { + /* Skip ourselves */ + if (vc == s-vc) + continue; + + if (!vc-fd_can_read) { + /* no fd_can_read handler, they always can receive */ + can_receive = 1; + } else + can_receive = vc-fd_can_read(vc-opaque); + + /* Once someone can receive, we try to send a packet */ + if (can_receive) + break; +} + +return can_receive; +} + static void tap_send(void *opaque) { TAPState *s = opaque; -uint8_t buf[4096]; -int size; +/* First try to send any buffered packet */ +if (s-size 0) { + int err; + + /* If noone can receive the packet, buffer it */ + err = qemu_send_packet(s-vc, s-buf, s-size); + if (err == -EAGAIN) + return; +} + +/* Read packets until we hit EAGAIN */ +do { #ifdef __sun__ -struct strbuf sbuf; -int f = 0; -sbuf.maxlen = sizeof(buf); -sbuf.buf = buf; -size = getmsg(s-fd, NULL, sbuf, f) =0 ? sbuf.len : -1; + struct strbuf sbuf; + int f = 0; + sbuf.maxlen = sizeof(s-buf); + sbuf.buf = s-buf; + s-size = getmsg(s-fd, NULL, sbuf, f) =0 ? 
sbuf.len : -1; #else -size = read(s-fd, buf, sizeof(buf)); + s-size = read(s-fd, s-buf, sizeof(s-buf)); #endif -if (size 0) { -qemu_send_packet(s-vc, buf, size); -} + + if (s-size == -1 errno == EINTR) + continue; + + if (s-size 0) { + int err; + + /* If noone can receive the packet, buffer it */ + err = qemu_send_packet(s-vc, s-buf, s-size); + if (err == -EAGAIN) + break; + } +} while (s-size 0); } /* fd support */ @@ -4016,7 +4070,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd) return NULL; s-fd = fd; s-vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s); -qemu_set_fd_handler2(s-fd, NULL, tap_send, NULL, s); +qemu_set_fd_handler2(s-fd, tap_can_send, tap_send
[kvm-devel] [PATCH 4/5] Make virtio-net can_receive more accurate (v3)
In the final patch of this series, we rely on a VLAN client's fd_can_read method to avoid dropping packets. Unfortunately, virtio's fd_can_read method is not very accurate at the moment. This patch addresses this. It also generates a notification to the IO thread when more RX packets become available. If we say we can't receive a packet because no RX buffers are available, this may result in the tap file descriptor not being select()'d. Without notifying the IO thread, we may have to wait until the select() times out before we can receive a packet (even if there is one pending). This particular change makes RX performance very consistent. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c index 8d26832..5538979 100644 --- a/qemu/hw/virtio-net.c +++ b/qemu/hw/virtio-net.c @@ -14,6 +14,7 @@ #include virtio.h #include net.h #include qemu-timer.h +#include qemu-kvm.h /* from Linux's virtio_net.h */ @@ -60,11 +61,14 @@ typedef struct VirtIONet VirtQueue *rx_vq; VirtQueue *tx_vq; VLANClientState *vc; -int can_receive; QEMUTimer *tx_timer; int tx_timer_active; } VirtIONet; +/* TODO + * - we could suppress RX interrupt if we were so inclined. 
+ */ + static VirtIONet *to_virtio_net(VirtIODevice *vdev) { return (VirtIONet *)vdev; @@ -88,15 +92,24 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev) static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq) { -VirtIONet *n = to_virtio_net(vdev); -n-can_receive = 1; +/* We now have RX buffers, signal to the IO thread to break out of the + select to re-poll the tap file descriptor */ +if (kvm_enabled()) + qemu_kvm_notify_work(); } static int virtio_net_can_receive(void *opaque) { VirtIONet *n = opaque; -return (n-vdev.status VIRTIO_CONFIG_S_DRIVER_OK) n-can_receive; +if (n-rx_vq-vring.avail == NULL || + !(n-vdev.status VIRTIO_CONFIG_S_DRIVER_OK)) + return 0; + +if (n-rx_vq-vring.avail-idx == n-rx_vq-last_avail_idx) + return 0; + +return 1; } static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) @@ -106,15 +119,8 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) struct virtio_net_hdr *hdr; int offset, i; -/* FIXME: the drivers really need to set their status better */ -if (n-rx_vq-vring.avail == NULL) { - n-can_receive = 0; - return; -} - if (virtqueue_pop(n-rx_vq, elem) == 0) { - /* wait until the guest adds some rx bufs */ - n-can_receive = 0; + fprintf(stderr, virtio_net: this should not happen\n); return; } @@ -209,9 +215,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn) n-vdev.update_config = virtio_net_update_config; n-vdev.get_features = virtio_net_get_features; -n-rx_vq = virtio_add_queue(n-vdev, 512, virtio_net_handle_rx); +n-rx_vq = virtio_add_queue(n-vdev, 128, virtio_net_handle_rx); n-tx_vq = virtio_add_queue(n-vdev, 128, virtio_net_handle_tx); -n-can_receive = 0; memcpy(n-mac, nd-macaddr, 6); n-vc = qemu_new_vlan_client(nd-vlan, virtio_net_receive, virtio_net_can_receive, n); - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [RFC] [VTD][patch 1/3] vt-d support for pci passthrough: kvm-vtd--kernel.patch
Avi Kivity wrote:
Anthony Liguori wrote:
What should be done for an unmodified guest where there is no PV driver in the guest? Would a call to mlock() from qemu/hw/pci-passthrough.c/add_pci_passthrough_device() be a reasonable thing to do?

Yup. The idea is to ensure that the memory is always present, without necessarily taking a reference to it. This allows for memory reclaiming, which should allow for things like NUMA page migration. We can't swap of course, but that doesn't mean reclamation isn't useful.

I don't think we can do page migration with VT-d. You need to be able to detect whether the page has been changed by DMA after you've copied it but before you've changed the PTE, but VT-d doesn't allow that AFAICT.

Hrm, I would have to look at VT-d but I suspect you're right. That's unfortunate. That means mlock() isn't sufficient. It also means that the VMAs can't be updated while the guest is running. Is there any way to lock a VMA region such that things like madvise/mmap(MAP_FIXED) will always fail?

Regards,

Anthony Liguori
Re: [kvm-devel] [RFC][PATCH] svm.c: tsc unsigned delta calculation
Ryan Harper wrote:
I've been playing around with SMP guests on a couple of AMD systems and I've also seen some of the smp/locking issues, which led me to dig into some of the TSC code. In svm_vcpu_load(), the tsc_offset calc will generate a massively large tsc_offset if we're switching CPUs and tsc_this is ahead of the host_tsc value (the delta would normally be negative, but since it's unsigned, we get a huge positive number).

svm_vcpu_load()
...
rdtscll(tsc_this);
delta = vcpu->arch.host_tsc - tsc_this;
svm->vmcb->control.tsc_offset += delta;

This math will work out fine, since the very large number will overflow and the result will be identical to what we would get using s64s. We're using u64s because that's how the tsc_offset is defined by hardware.

This is handled a little differently on Intel (in vmx.c), where there is a check:

if (tsc_this < vcpu->arch.host_tsc)
        /* do delta and new offset calc */

So what your patch really does is change the behavior of the tsc_offset to increase the guest's TSC by a potentially large amount, depending on how out of sync the TSC is on CPU migration. The question is why this would make things work out better for you. Do you have Gerd's most recent kvm-clock patch applied?

Regards,

Anthony Liguori

This check makes sense to me in that we only need to ensure that we don't go backwards, which means that unless the new CPU is behind the current vcpu's host_tsc, we can skip a new delta calc. If the check doesn't make sense then we'll need to do the math with s64s. The attached patch fixed the case where an idle guest was live-locked.
[kvm-devel] [PATCH] Make sure to restore cpu_single_env after cond_wait
I discovered this while testing virtio save/restore this evening. After sleeping, cpu_single_env can change, so we have to make sure to restore it. This applies on top of my IO thread series. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index cc8f292..0990af6 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -246,8 +246,11 @@ static void pause_all_threads(void) vcpu_info[i].stop = 1; pthread_kill(vcpu_info[i].thread, SIG_IPI); } -while (!all_threads_paused()) +while (!all_threads_paused()) { + CPUState *env = cpu_single_env; pthread_cond_wait(&qemu_pause_cond, &qemu_mutex); + cpu_single_env = env; +} } static void resume_all_threads(void)
Re: [kvm-devel] Protected mode transitions and big real mode... still an issue
Guillaume Thouvenin wrote: On Mon, 5 May 2008 16:29:21 +0300 Mohammed Gamal [EMAIL PROTECTED] wrote: On Mon, May 5, 2008 at 3:57 PM, Anthony Liguori [EMAIL PROTECTED] wrote: WinXP fails to boot with your patch applied too. FWIW, Ubuntu 8.04 has a fixed version of gfxboot that doesn't do nasty things with SS on privileged mode transitions. WinXP fails with the patch applied too. Ubuntu 7.10 live CD and FreeDOS don't boot but complain about instruction mov 0x11,sreg not being emulated. Can you try with this one please? On my computer it boots ubuntu-8.04-desktop-i386.iso liveCD and also openSUSE-10.3-GM-x86_64-mini.iso 8.04 is not a good test-case. 7.10 is what you want to try. The good news is, 7.10 appears to work! The bad news is that about 20% of the time, it crashes and displays the following: kvm_run: failed entry, reason 5 kvm_run returned -8 So something appears to be a bit buggy. Still, very good work! Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 1/4] Replace SIGUSR1 in io-thread with eventfd() (v2)
Marcelo Tosatti wrote: Looks good (the whole series). Needs some good testing of course... Have you tested migration/loadvm? No, but I will before resubmitting (which should be sometime tomorrow). Regards, Anthony Liguori On Mon, May 05, 2008 at 08:47:12AM -0500, Anthony Liguori wrote: It's a little odd to use signals to raise a notification on a file descriptor when we can just work directly with a file descriptor instead. This patch converts the SIGUSR1 based notification in the io-thread to instead use an eventfd file descriptor. If eventfd isn't available, we use a pipe() instead. The benefit of using eventfd is that multiple notifications will be batched into a single IO event. Signed-off-by: Anthony Liguori [EMAIL PROTECTED]
Re: [kvm-devel] [RFC] [VTD][patch 1/3] vt-d support for pci passthrough: kvm-vtd--kernel.patch
and never program the IOMMU. The initial mlock() in userspace is somewhat of a nop here but it's important with MMU-notifiers because we will no longer be holding a reference for guest memory. We have to ensure we don't swap KVM guest memory while using hardware pass-through, but AFAICT, we do not need to make the memory non-reclaimable. As long as we reprogram the IOMMU with a new, valid, mapping everything should be fine. mlock() really gives us the right semantics. Semantically, a PV API that supports DMA window registration simply mlock()s the DMA regions on behalf of the guest. No special logic should be needed. Regards, Anthony Liguori
Re: [kvm-devel] [RFC] [VTD][patch 3/3] vt-d support for pci passthrough: kvm-intel-iommu.patch
Kay, Allen M wrote: Intel-iommu driver changes for kvm vt-d support. Important changes are in intel-iommu.c. The rest of the changes are for moving intel-iommu.h and iova.h from drivers/pci directory to include/linux directory. Signed-off-by: Allen M Kay [EMAIL PROTECTED] b/drivers/pci/dmar.c |4 b/drivers/pci/intel-iommu.c | 26 ++- b/drivers/pci/iova.c |2 b/include/linux/intel-iommu.h | 344 ++ b/include/linux/iova.h| 52 ++ drivers/pci/intel-iommu.h | 344 -- drivers/pci/iova.h| 52 -- 7 files changed, 416 insertions(+), 408 deletions(-) diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c index f941f60..a58a5b0 100644 --- a/drivers/pci/dmar.c +++ b/drivers/pci/dmar.c @@ -26,8 +26,8 @@ #include linux/pci.h #include linux/dmar.h -#include iova.h -#include intel-iommu.h +#include linux/iova.h +#include linux/intel-iommu.h #undef PREFIX #define PREFIX DMAR: diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c index 4cb949f..bfa888b 100644 --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -31,8 +31,8 @@ #include linux/dmar.h #include linux/dma-mapping.h #include linux/mempool.h -#include iova.h -#include intel-iommu.h +#include linux/iova.h +#include linux/intel-iommu.h #include asm/proto.h /* force_iommu in this header in x86-64*/ #include asm/cacheflush.h #include asm/gart.h @@ -1056,7 +1056,7 @@ static void free_iommu(struct intel_iommu *iommu) kfree(iommu); } -static struct dmar_domain * iommu_alloc_domain(struct intel_iommu *iommu) +struct dmar_domain * iommu_alloc_domain(struct intel_iommu *iommu) { unsigned long num; unsigned long ndomains; @@ -1086,8 +1086,9 @@ static struct dmar_domain * iommu_alloc_domain(struct intel_iommu *iommu) return domain; } +EXPORT_SYMBOL_GPL(iommu_alloc_domain); -static void iommu_free_domain(struct dmar_domain *domain) +void iommu_free_domain(struct dmar_domain *domain) { unsigned long flags; @@ -1095,6 +1096,7 @@ static void iommu_free_domain(struct dmar_domain *domain) clear_bit(domain-id, 
domain->iommu->domain_ids); spin_unlock_irqrestore(&domain->iommu->lock, flags); } +EXPORT_SYMBOL_GPL(iommu_free_domain); static struct iova_domain reserved_iova_list; static struct lock_class_key reserved_alloc_key; @@ -1160,7 +1162,7 @@ static inline int guestwidth_to_adjustwidth(int gaw) return agaw; } -static int domain_init(struct dmar_domain *domain, int guest_width) +int domain_init(struct dmar_domain *domain, int guest_width) { I think it's already been mentioned, but these are pretty terrible names if you're exporting these symbols. Linux supports other IOMMUs so VT-d should not be hogging the iommu_* namespace. Regards, Anthony Liguori
Re: [kvm-devel] [RFC] [VTD][patch 2/3] vt-d support for pci passthrough: kvm-vtd-user.patch
Avi Kivity wrote: Kay, Allen M wrote: Still todo: move vt.d to kvm-intel.ko module. Not sure it's the right thing to do. If we get the iommus abstracted properly, we can rename vtd.c to dma.c and move it to virt/kvm/. The code is certainly a lot more about managing memory than anything vmx specific. It's hardly x86 specific, even. Really, an external interface to KVM that allowed someone to query the GPA => PA mapping would suffice. It should not fault in pages that aren't present and we should provide notifications for when the mapping changes for a given reason. Userspace can enforce the requirement that memory remains present via mlock(). This allows us to implement a PV API for DMA registration without the IOMMU code having any particular knowledge of it. Regards, Anthony Liguori
Re: [kvm-devel] [RFC] [VTD][patch 1/3] vt-d support for pci passthrough: kvm-vtd--kernel.patch
Kay, Allen M wrote: We have to ensure we don't swap KVM guest memory while using hardware pass-through, but AFAICT, we do not need to make the memory non-reclaimable. As long as we reprogram the IOMMU with a new, valid, mapping everything should be fine. mlock() really gives us the right semantics. Semantically, a PV API that supports DMA window registration simply mlock()s the DMA regions on behalf of the guest. No special logic should be needed. What should be done for an unmodified guest where there is no PV driver in the guest? Would a call to mlock() from qemu/hw/pci-passthrough.c/add_pci_passthrough_device() be a reasonable thing to do? Yup. The idea is to ensure that the memory is always present, without necessarily taking a reference to it. This allows for memory reclaiming which should allow for things like NUMA page migration. We can't swap of course but that doesn't mean reclamation isn't useful. Regards, Anthony Liguori Allen
Re: [kvm-devel] Protected mode transitions and big real mode... still an issue
Guillaume Thouvenin wrote: On Sat, 3 May 2008 13:56:56 +0530 Balaji Rao [EMAIL PROTECTED] wrote: With your patch applied ubuntu 8.04 livecd fails to boot. Not any better with Marcelo's patch on top. Hi Balaji, And without the patch, can you boot the ubuntu 8.04 livecd? WinXP fails to boot with your patch applied too. FWIW, Ubuntu 8.04 has a fixed version of gfxboot that doesn't do nasty things with SS on privileged mode transitions. Regards, Anthony Liguori Regards, Guillaume
[kvm-devel] [PATCH 4/4] Only select once per-main_loop iteration (v2)
QEMU is rather aggressive about exhausting the wait period when selecting. This is fine when the wait period is low and when there are significant delays between selects, as it improves IO throughput. With the IO thread, there is a very small delay between selects and our wait period for select is very large. This patch changes main_loop_wait to only select once before doing the various other things in the main loop. This generally improves responsiveness of things like SDL but also improves individual file descriptor throughput quite dramatically. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index e16b261..6a90e68 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -423,24 +423,6 @@ void qemu_kvm_notify_work(void) fprintf(stderr, "failed to notify io thread\n"); } -static int received_signal; - -/* QEMU relies on periodically breaking out of select via EINTR to poll for IO - and timer signals. Since we're now using a file descriptor to handle - signals, select() won't be interrupted by a signal. We need to forcefully - break the select() loop when a signal is received hence - kvm_check_received_signal(). */ - -int kvm_check_received_signal(void) -{ -if (received_signal) { - received_signal = 0; - return 1; -} - -return 0; -} - /* If we have signalfd, we mask out the signals we want to handle and then * use signalfd to listen for them. We rely on whatever the current signal * handler is to dispatch the signals when we receive them.
@@ -474,8 +456,6 @@ static void sigfd_handler(void *opaque) pthread_cond_signal(&qemu_aio_cond); } } - -received_signal = 1; } /* Used to break IO thread out of select */ @@ -497,8 +477,6 @@ static void io_thread_wakeup(void *opaque) offset += len; } - -received_signal = 1; } int kvm_main_loop(void) diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h index e1e461a..34aabd2 100644 --- a/qemu/qemu-kvm.h +++ b/qemu/qemu-kvm.h @@ -114,15 +114,6 @@ static inline void kvm_sleep_end(void) kvm_mutex_lock(); } -int kvm_check_received_signal(void); - -static inline int kvm_received_signal(void) -{ -if (kvm_enabled()) - return kvm_check_received_signal(); -return 0; -} - #if !defined(SYS_signalfd) struct signalfd_siginfo { uint32_t ssi_signo; diff --git a/qemu/vl.c b/qemu/vl.c index e9f0ca4..6935a82 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -7946,23 +7946,18 @@ void main_loop_wait(int timeout) slirp_select_fill(&nfds, &rfds, &wfds, &xfds); } #endif - moreio: ret = qemu_select(nfds + 1, &rfds, &wfds, &xfds, &tv); if (ret > 0) { IOHandlerRecord **pioh; -int more = 0; for(ioh = first_io_handler; ioh != NULL; ioh = ioh->next) { if (!ioh->deleted && ioh->fd_read && FD_ISSET(ioh->fd, &rfds)) { ioh->fd_read(ioh->opaque); -if (!ioh->fd_read_poll || ioh->fd_read_poll(ioh->opaque)) -more = 1; -else +if (!(ioh->fd_read_poll && ioh->fd_read_poll(ioh->opaque))) FD_CLR(ioh->fd, &rfds); } if (!ioh->deleted && ioh->fd_write && FD_ISSET(ioh->fd, &wfds)) { ioh->fd_write(ioh->opaque); -more = 1; } } @@ -7976,8 +7971,6 @@ void main_loop_wait(int timeout) } else pioh = &ioh->next; } -if (more && !kvm_received_signal()) -goto moreio; } #if defined(CONFIG_SLIRP) if (slirp_inited) {
[kvm-devel] [PATCH 2/4] Use signalfd() in io-thread (v2)
This patch reworks the IO thread to use signalfd() instead of sigtimedwait(). This will eliminate the need to use SIGIO everywhere. In this version of the patch, we use signalfd() when it's available. When it isn't available, we create a separate thread and use sigwaitinfo() to simulate signalfd(). I've tested Windows and Linux guests with SMP without seeing an obvious regressions. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/kvm-compatfd.c b/qemu/kvm-compatfd.c index 1b030ba..3c2be28 100644 --- a/qemu/kvm-compatfd.c +++ b/qemu/kvm-compatfd.c @@ -15,6 +15,100 @@ #include qemu-kvm.h #include sys/syscall.h +#include pthread.h + +struct sigfd_compat_info +{ +sigset_t mask; +int fd; +}; + +static void *sigwait_compat(void *opaque) +{ +struct sigfd_compat_info *info = opaque; +int err; + +sigprocmask(SIG_BLOCK, info-mask, NULL); + +do { + siginfo_t siginfo; + + kvm_sleep_begin(); + err = sigwaitinfo(info-mask, siginfo); + kvm_sleep_end(); + + if (err == -1 errno == EINTR) + continue; + + if (err 0) { + char buffer[128]; + size_t offset = 0; + + memcpy(buffer, err, sizeof(err)); + while (offset sizeof(buffer)) { + ssize_t len; + + len = write(info-fd, buffer + offset, + sizeof(buffer) - offset); + if (len == -1 errno == EINTR) + continue; + + if (len = 0) { + err = -1; + break; + } + + offset += len; + } + } +} while (err = 0); + +return NULL; +} + +static int kvm_signalfd_compat(const sigset_t *mask) +{ +pthread_attr_t attr; +pthread_t tid; +struct sigfd_compat_info *info; +int fds[2]; + +info = malloc(sizeof(*info)); +if (info == NULL) { + errno = ENOMEM; + return -1; +} + +if (pipe(fds) == -1) { + free(info); + return -1; +} + +memcpy(info-mask, mask, sizeof(*mask)); +info-fd = fds[1]; + +pthread_attr_init(attr); +pthread_attr_setdetachstate(attr, PTHREAD_CREATE_DETACHED); + +pthread_create(tid, attr, sigwait_compat, info); + +pthread_attr_destroy(attr); + +return fds[0]; +} + +int kvm_signalfd(const sigset_t *mask) +{ +#if defined(SYS_signalfd) 
+int ret; + +ret = syscall(SYS_signalfd, -1, mask, _NSIG / 8); +if (!(ret == -1 errno == ENOSYS)) + return ret; +#endif + +return kvm_signalfd_compat(mask); +} int kvm_eventfd(int *fds) { diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index 7134e56..0ea03f8 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -12,6 +12,9 @@ int kvm_allowed = 1; int kvm_irqchip = 1; int kvm_pit = 1; +#include qemu-common.h +#include console.h + #include string.h #include hw/hw.h #include sysemu.h @@ -40,14 +43,6 @@ __thread struct vcpu_info *vcpu; static int qemu_system_ready; -struct qemu_kvm_signal_table { -sigset_t sigset; -sigset_t negsigset; -}; - -static struct qemu_kvm_signal_table io_signal_table; -static struct qemu_kvm_signal_table vcpu_signal_table; - #define SIG_IPI (SIGRTMIN+4) struct vcpu_info { @@ -172,37 +167,23 @@ static int has_work(CPUState *env) return kvm_arch_has_work(env); } -static int kvm_process_signal(int si_signo) -{ -struct sigaction sa; - -switch (si_signo) { -case SIGUSR2: -pthread_cond_signal(qemu_aio_cond); -break; -case SIGALRM: -case SIGIO: -sigaction(si_signo, NULL, sa); -sa.sa_handler(si_signo); -break; -} - -return 1; -} - -static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, - int timeout) +static int kvm_eat_signal(CPUState *env, int timeout) { struct timespec ts; int r, e, ret = 0; siginfo_t siginfo; +sigset_t waitset; ts.tv_sec = timeout / 1000; ts.tv_nsec = (timeout % 1000) * 100; -r = sigtimedwait(waitset-sigset, siginfo, ts); +sigemptyset(waitset); +sigaddset(waitset, SIG_IPI); + +r = sigtimedwait(waitset, siginfo, ts); if (r == -1 (errno == EAGAIN || errno == EINTR) !timeout) return 0; e = errno; + pthread_mutex_lock(qemu_mutex); if (env vcpu) cpu_single_env = vcpu-env; @@ -211,7 +192,7 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, exit(1); } if (r != -1) -ret = kvm_process_signal(siginfo.si_signo); + ret = 1; if (env vcpu_info[env-cpu_index].stop) { 
vcpu_info[env-cpu_index].stop = 0; @@ -227,14 +208,13 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, static void kvm_eat_signals(CPUState *env, int timeout) { int r = 0; -struct qemu_kvm_signal_table *waitset = vcpu_signal_table; -while (kvm_eat_signal(waitset, env, 0
[kvm-devel] [PATCH 3/4] Interrupt io thread in qemu_set_fd_handler2 (v2)
The select() in the IO thread may wait a long time before rebuilding the fd set. Whenever we do something that changes the fd set, we should interrupt the IO thread. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/vl.c b/qemu/vl.c index 1192759..e9f0ca4 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -260,6 +260,16 @@ static int event_pending = 1; #define TFR(expr) do { if ((expr) != -1) break; } while (errno == EINTR) +/* KVM runs the main loop in a separate thread. If we update one of the lists + * that are polled before or after select(), we need to make sure to break out + * of the select() to ensure the new item is serviced. + */ +static void main_loop_break(void) +{ +if (kvm_enabled()) + qemu_kvm_notify_work(); +} + void decorate_application_name(char *appname, int max_len) { if (kvm_enabled()) @@ -5680,6 +5690,7 @@ int qemu_set_fd_handler2(int fd, ioh->opaque = opaque; ioh->deleted = 0; } +main_loop_break(); return 0; } @@ -7606,8 +7617,7 @@ void qemu_bh_schedule(QEMUBH *bh) if (env) { cpu_interrupt(env, CPU_INTERRUPT_EXIT); } -if (kvm_enabled()) -qemu_kvm_notify_work(); +main_loop_break(); } void qemu_bh_cancel(QEMUBH *bh)
[kvm-devel] [PATCH 1/4] Replace SIGUSR1 in io-thread with eventfd() (v2)
It's a little odd to use signals to raise a notification on a file descriptor when we can just work directly with a file descriptor instead. This patch converts the SIGUSR1 based notification in the io-thread to instead use an eventfd file descriptor. If eventfd isn't available, we use a pipe() instead. The benefit of using eventfd is that multiple notifications will be batched into a signal IO event. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/Makefile.target b/qemu/Makefile.target index 2316c92..db6912e 100644 --- a/qemu/Makefile.target +++ b/qemu/Makefile.target @@ -203,7 +203,7 @@ CPPFLAGS+=-I$(SRC_PATH)/tcg/sparc endif ifeq ($(USE_KVM), 1) -LIBOBJS+=qemu-kvm.o +LIBOBJS+=qemu-kvm.o kvm-compatfd.o endif ifdef CONFIG_SOFTFLOAT LIBOBJS+=fpu/softfloat.o diff --git a/qemu/kvm-compatfd.c b/qemu/kvm-compatfd.c new file mode 100644 index 000..1b030ba --- /dev/null +++ b/qemu/kvm-compatfd.c @@ -0,0 +1,33 @@ +/* + * signalfd/eventfd compatibility + * + * Copyright IBM, Corp. 2008 + * + * Authors: + * Anthony Liguori [EMAIL PROTECTED] + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ * + */ + +#include qemu-common.h +#include qemu-kvm.h + +#include sys/syscall.h + +int kvm_eventfd(int *fds) +{ +#if defined(SYS_eventfd) +int ret; + +ret = syscall(SYS_eventfd, 0); +if (ret = 0) { + fds[0] = fds[1] = ret; + return 0; +} else if (!(ret == -1 errno == ENOSYS)) + return ret; +#endif + +return pipe(fds); +} diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index 9a9bf59..7134e56 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -15,6 +15,8 @@ int kvm_pit = 1; #include string.h #include hw/hw.h #include sysemu.h +#include qemu-common.h +#include console.h #include qemu-kvm.h #include libkvm.h @@ -61,6 +63,7 @@ struct vcpu_info { } vcpu_info[256]; pthread_t io_thread; +static int io_thread_fd = -1; static inline unsigned long kvm_get_thread_id(void) { @@ -213,7 +216,7 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, if (env vcpu_info[env-cpu_index].stop) { vcpu_info[env-cpu_index].stop = 0; vcpu_info[env-cpu_index].stopped = 1; - pthread_kill(io_thread, SIGUSR1); + qemu_kvm_notify_work(); } pthread_mutex_unlock(qemu_mutex); @@ -418,7 +421,6 @@ static void qemu_kvm_init_signal_tables(void) kvm_add_signal(io_signal_table, SIGIO); kvm_add_signal(io_signal_table, SIGALRM); -kvm_add_signal(io_signal_table, SIGUSR1); kvm_add_signal(io_signal_table, SIGUSR2); kvm_add_signal(vcpu_signal_table, SIG_IPI); @@ -440,8 +442,51 @@ int kvm_init_ap(void) void qemu_kvm_notify_work(void) { -if (io_thread) -pthread_kill(io_thread, SIGUSR1); +uint64_t value = 1; +char buffer[8]; +size_t offset = 0; + +if (io_thread_fd == -1) + return; + +memcpy(buffer, value, sizeof(value)); + +while (offset 8) { + ssize_t len; + + len = write(io_thread_fd, buffer + offset, 8 - offset); + if (len == -1 errno == EINTR) + continue; + + if (len = 0) + break; + + offset += len; +} + +if (offset != 8) + fprintf(stderr, failed to notify io thread\n); +} + +/* Used to break IO thread out of select */ +static void io_thread_wakeup(void *opaque) +{ +int fd = 
(unsigned long)opaque; +char buffer[8]; +size_t offset = 0; + +while (offset 8) { + ssize_t len; + + len = read(fd, buffer + offset, 8 - offset); + if (len == -1 errno == EINTR) + continue; + + if (len = 0) + break; + + offset += len; +} } /* @@ -452,8 +497,20 @@ void qemu_kvm_notify_work(void) int kvm_main_loop(void) { +int fds[2]; + io_thread = pthread_self(); qemu_system_ready = 1; + +if (kvm_eventfd(fds) == -1) { + fprintf(stderr, failed to create eventfd\n); + return -errno; +} + +qemu_set_fd_handler2(fds[0], NULL, io_thread_wakeup, NULL, +(void *)(unsigned long)fds[0]); + +io_thread_fd = fds[1]; pthread_mutex_unlock(qemu_mutex); pthread_cond_broadcast(qemu_system_cond); diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h index 024a653..8cd63e6 100644 --- a/qemu/qemu-kvm.h +++ b/qemu/qemu-kvm.h @@ -97,4 +97,6 @@ extern kvm_context_t kvm_context; #define qemu_kvm_pit_in_kernel() (0) #endif +int kvm_eventfd(int *fds); + #endif - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https
Re: [kvm-devel] [patch 0/3] QEMU/KVM: add support for 128 PCI slots (v2)
Avi Kivity wrote: Marcelo Tosatti wrote: Add three PCI bridges to support 128 slots. Changes since v1: - Remove I/O address range support (so standard PCI I/O space is used). - Verify that there's no special quirks for 82801 PCI bridge. - Introduce separate flat IRQ mapping function for non-SPARC targets. I've cooled off on the 128 slot stuff, mainly because most real hosts don't have them. An unusual configuration will likely lead to problems as most guest OSes and workloads will not have been tested thoroughly with them. - it requires a large number of interrupts, which are difficult to provide, and which it is hard to ensure all OSes support. MSI is relatively new. - if only a few interrupts are available, then each interrupt requires scanning a large number of queues If we are to do this, then we need better tests than "80 disks show up". The alternative approach of having the virtio block device control up to 16 disks allows having those 80 disks with just 5 slots (and 5 interrupts). This is similar to the way traditional SCSI controllers behave, and so should not surprise the guest OS. If you have a single virtio-blk device that shows up as 8 functions, we could achieve the same thing. We can cheat with the interrupt handlers to avoid cache line bouncing too. Plus, we can use PCI hotplug so we don't have to reinvent a new hotplug mechanism. I'm inclined to think that ring sharing isn't as useful as it seems as long as we don't have indirect scatter gather lists. Regards, Anthony Liguori
Re: [kvm-devel] [PATCH] Use pipe() to simulate signalfd() (v2)
Avi Kivity wrote: Please split the signalfd() emulation into a separate (preparatory) patch. Also, we need to detect signalfd() at run time as well as compile time, since qemu may be compiled on a different machine than it is run on. Ok. We can keep the signals blocked, but run the signalfd emulation in a separate thread (where it can dequeue signals using sigwait as an added bonus). This will reduce the differences between the two modes at the expense of increased signalfd() emulation complexity, which I think is a good tradeoff. signalfd() can't be emulated transparently with a separate thread because you won't be able to wait on signals destined for the specific thread (only signals sent to the process). We deliver signals directly to the IO thread (specifically, SIGUSR1) so this could get nasty. We could just not block SIGUSR1 and rely on the fact that it will break us out of select(), but I think that makes things a bit more subtle than I'd like. I personally prefer using pipe() within the same thread although I'm willing to also do the separate thread. Regards, Anthony Liguori We can move signalfd emulation into a separate file in order to improve readability.
Re: [kvm-devel] [PATCH] Use pipe() to simulate signalfd() (v2)
Avi Kivity wrote: Anthony Liguori wrote: We can keep the signals blocked, but run the signalfd emulation in a separate thread (where it can dequeue signals using sigwait as an added bonus). This will reduce the differences between the two modes at the expense of increased signalfd() emulation complexity, which I think is a good tradeoff. signalfd() can't be emulated transparently with a separate thread because you won't be able to wait on signals destined for the specific thread (only signals sent to the process). We deliver signals directly to the IO thread (specifically, SIGUSR1) so this could get nasty. We could just not block SIGUSR1 and rely on the fact that it will break us out of select() but I that makes things a bit more subtle than I'd like. We can completely kill off SIGUSR1 and replace it with its own pipe. There's hardly any point in asking the kernel to signal a task, then having the kernel convert this to a fd write. (Or maybe use eventfd()) That's a really good idea. I'll update the patch. Regards, Anthony Liguori
[kvm-devel] [PATCH 1/3] Use signalfd() in io-thread
This patch reworks the IO thread to use signalfd() instead of sigtimedwait(). This will eliminate the need to use SIGIO everywhere. In this version of the patch, we use signalfd() when it's available. When it isn't available, we create a separate thread and use sigwaitinfo() to simulate signalfd(). We cannot handle thread-specific signals with signalfd() emulation so also replace SIGUSR1 notifications to the io-thread with an eventfd. Since eventfd isn't always available, use pipe() to emulate eventfd. I've tested Windows and Linux guests with SMP without seeing an obvious regressions. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/Makefile.target b/qemu/Makefile.target index 2316c92..db6912e 100644 --- a/qemu/Makefile.target +++ b/qemu/Makefile.target @@ -203,7 +203,7 @@ CPPFLAGS+=-I$(SRC_PATH)/tcg/sparc endif ifeq ($(USE_KVM), 1) -LIBOBJS+=qemu-kvm.o +LIBOBJS+=qemu-kvm.o kvm-compatfd.o endif ifdef CONFIG_SOFTFLOAT LIBOBJS+=fpu/softfloat.o diff --git a/qemu/kvm-compatfd.c b/qemu/kvm-compatfd.c new file mode 100644 index 000..3c2be28 --- /dev/null +++ b/qemu/kvm-compatfd.c @@ -0,0 +1,127 @@ +/* + * signalfd/eventfd compatibility + * + * Copyright IBM, Corp. 2008 + * + * Authors: + * Anthony Liguori [EMAIL PROTECTED] + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ *
+ */
+
+#include "qemu-common.h"
+#include "qemu-kvm.h"
+
+#include <sys/syscall.h>
+#include <pthread.h>
+
+struct sigfd_compat_info
+{
+    sigset_t mask;
+    int fd;
+};
+
+static void *sigwait_compat(void *opaque)
+{
+    struct sigfd_compat_info *info = opaque;
+    int err;
+
+    sigprocmask(SIG_BLOCK, &info->mask, NULL);
+
+    do {
+        siginfo_t siginfo;
+
+        kvm_sleep_begin();
+        err = sigwaitinfo(&info->mask, &siginfo);
+        kvm_sleep_end();
+
+        if (err == -1 && errno == EINTR)
+            continue;
+
+        if (err > 0) {
+            char buffer[128];
+            size_t offset = 0;
+
+            memcpy(buffer, &err, sizeof(err));
+            while (offset < sizeof(buffer)) {
+                ssize_t len;
+
+                len = write(info->fd, buffer + offset,
+                            sizeof(buffer) - offset);
+                if (len == -1 && errno == EINTR)
+                    continue;
+
+                if (len <= 0) {
+                    err = -1;
+                    break;
+                }
+
+                offset += len;
+            }
+        }
+    } while (err >= 0);
+
+    return NULL;
+}
+
+static int kvm_signalfd_compat(const sigset_t *mask)
+{
+    pthread_attr_t attr;
+    pthread_t tid;
+    struct sigfd_compat_info *info;
+    int fds[2];
+
+    info = malloc(sizeof(*info));
+    if (info == NULL) {
+        errno = ENOMEM;
+        return -1;
+    }
+
+    if (pipe(fds) == -1) {
+        free(info);
+        return -1;
+    }
+
+    memcpy(&info->mask, mask, sizeof(*mask));
+    info->fd = fds[1];
+
+    pthread_attr_init(&attr);
+    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
+
+    pthread_create(&tid, &attr, sigwait_compat, info);
+
+    pthread_attr_destroy(&attr);
+
+    return fds[0];
+}
+
+int kvm_signalfd(const sigset_t *mask)
+{
+#if defined(SYS_signalfd)
+    int ret;
+
+    ret = syscall(SYS_signalfd, -1, mask, _NSIG / 8);
+    if (!(ret == -1 && errno == ENOSYS))
+        return ret;
+#endif
+
+    return kvm_signalfd_compat(mask);
+}
+
+int kvm_eventfd(int *fds)
+{
+#if defined(SYS_eventfd)
+    int ret;
+
+    ret = syscall(SYS_eventfd, 0);
+    if (ret >= 0) {
+        fds[0] = fds[1] = ret;
+        return 0;
+    } else if (!(ret == -1 && errno == ENOSYS))
+        return ret;
+#endif
+
+    return pipe(fds);
+}
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 9a9bf59..e16b261 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -12,6 +12,9 @@ int
kvm_allowed = 1; int kvm_irqchip = 1; int kvm_pit = 1; +#include qemu-common.h +#include console.h + #include string.h #include hw/hw.h #include sysemu.h @@ -38,14 +41,6 @@ __thread struct vcpu_info *vcpu; static int qemu_system_ready; -struct qemu_kvm_signal_table { -sigset_t sigset; -sigset_t negsigset; -}; - -static struct qemu_kvm_signal_table io_signal_table; -static struct qemu_kvm_signal_table vcpu_signal_table; - #define SIG_IPI (SIGRTMIN+4) struct vcpu_info { @@ -61,6 +56,7 @@ struct vcpu_info { } vcpu_info[256]; pthread_t io_thread; +static int io_thread_fd = -1; static inline unsigned long kvm_get_thread_id(void) { @@ -169,37 +165,23 @@ static int has_work(CPUState *env) return kvm_arch_has_work(env); } -static int kvm_process_signal(int si_signo) -{ -struct sigaction sa; - -switch (si_signo) { -case SIGUSR2: -pthread_cond_signal(qemu_aio_cond); -break; -case SIGALRM: -case SIGIO: -sigaction(si_signo, NULL, sa); -sa.sa_handler(si_signo); -break; -} - -return 1; -} - -static int kvm_eat_signal(struct
[kvm-devel] [PATCH 3/3] Stop dropping so many RX packets in tap (v2)
Normally, tap always reads packets and simply lets the client drop them if it cannot receive them. For virtio-net, this results in massive packet loss and about an 80% performance loss in TCP throughput. This patch modifies qemu_send_packet() to only deliver a packet to a VLAN client if it doesn't have a fd_can_read method or the fd_can_read method indicates that it can receive packets. We also return a status of whether any clients were able to receive the packet. If no clients were able to receive a packet, we buffer the packet until a client indicates that it can receive packets again. This patch also modifies the tap code to only read from the tap fd if at least one client on the VLAN is able to receive a packet. Finally, this patch changes the tap code to drain all possible packets from the tap device when the tap fd is readable. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/net.h b/qemu/net.h index 13daa27..dfdf9af 100644 --- a/qemu/net.h +++ b/qemu/net.h @@ -29,7 +29,7 @@ VLANClientState *qemu_new_vlan_client(VLANState *vlan, IOCanRWHandler *fd_can_read, void *opaque); int qemu_can_send_packet(VLANClientState *vc); -void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size); +int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size); void qemu_handler_true(void *opaque); void do_info_network(void); diff --git a/qemu/vl.c b/qemu/vl.c index c51d704..f4abea3 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -3760,10 +3760,11 @@ int qemu_can_send_packet(VLANClientState *vc1) return 0; } -void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) +int qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) { VLANState *vlan = vc1-vlan; VLANClientState *vc; +int ret = -EAGAIN; #if 0 printf(vlan %d send:\n, vlan-id); @@ -3771,9 +3772,14 @@ void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) #endif for(vc = vlan-first_client; vc != NULL; vc = vc-next) { if (vc != vc1) { 
-            vc->fd_read(vc->opaque, buf, size);
+            if (!vc->fd_can_read || vc->fd_can_read(vc->opaque)) {
+                vc->fd_read(vc->opaque, buf, size);
+                ret = 0;
+            }
         }
     }
+
+    return ret;
 }
 
 #if defined(CONFIG_SLIRP)
@@ -3976,6 +3982,8 @@ typedef struct TAPState {
     VLANClientState *vc;
     int fd;
     char down_script[1024];
+    char buf[4096];
+    int size;
 } TAPState;
 
 static void tap_receive(void *opaque, const uint8_t *buf, int size)
@@ -3991,24 +3999,70 @@ static void tap_receive(void *opaque, const uint8_t *buf, int size)
     }
 }
 
+static int tap_can_send(void *opaque)
+{
+    TAPState *s = opaque;
+    VLANClientState *vc;
+    int can_receive = 0;
+
+    /* Check to see if any of our clients can receive a packet */
+    for (vc = s->vc->vlan->first_client; vc; vc = vc->next) {
+        /* Skip ourselves */
+        if (vc == s->vc)
+            continue;
+
+        if (!vc->fd_can_read) {
+            /* no fd_can_read handler, they always can receive */
+            can_receive = 1;
+        } else
+            can_receive = vc->fd_can_read(vc->opaque);
+
+        /* Once someone can receive, we try to send a packet */
+        if (can_receive)
+            break;
+    }
+
+    return can_receive;
+}
+
 static void tap_send(void *opaque)
 {
     TAPState *s = opaque;
-    uint8_t buf[4096];
-    int size;
 
+    /* First try to send any buffered packet */
+    if (s->size > 0) {
+        int err;
+
+        /* If no one can receive the packet, keep it buffered */
+        err = qemu_send_packet(s->vc, s->buf, s->size);
+        if (err == -EAGAIN)
+            return;
+    }
+
+    /* Read packets until we hit EAGAIN */
+    do {
 #ifdef __sun__
-    struct strbuf sbuf;
-    int f = 0;
-    sbuf.maxlen = sizeof(buf);
-    sbuf.buf = buf;
-    size = getmsg(s->fd, NULL, &sbuf, &f) >= 0 ? sbuf.len : -1;
+        struct strbuf sbuf;
+        int f = 0;
+        sbuf.maxlen = sizeof(s->buf);
+        sbuf.buf = s->buf;
+        s->size = getmsg(s->fd, NULL, &sbuf, &f) >= 0 ? sbuf.len : -1;
 #else
-    size = read(s->fd, buf, sizeof(buf));
+        s->size = read(s->fd, s->buf, sizeof(s->buf));
 #endif
-    if (size > 0) {
-        qemu_send_packet(s->vc, buf, size);
-    }
+
+        if (s->size == -1 && errno == EINTR)
+            continue;
+
+        if (s->size > 0) {
+            int err;
+
+            /* If no one can receive the packet, buffer it */
+            err = qemu_send_packet(s->vc, s->buf, s->size);
+            if (err == -EAGAIN)
+                break;
+        }
+    } while (s->size > 0);
 }
 
 /* fd support */
@@ -4022,7 +4076,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
         return NULL;
     s->fd = fd;
     s->vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s);
-    qemu_set_fd_handler2(s->fd, NULL, tap_send, NULL, s);
+    qemu_set_fd_handler2(s->fd, tap_can_send, tap_send
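Stripped of the QEMU plumbing, the buffering scheme in this patch reduces to a small state machine: deliver if some client can read, otherwise keep exactly one packet buffered and retry it before reading anything new. The types and names below (`client_t`, `sender_t`, `deliver`) are invented for the sketch:

```c
/* Reduced sketch of the tap flow-control scheme (illustrative types, not
 * QEMU's): delivery fails with -EAGAIN when no client can receive, and the
 * one buffered packet is retried before any new data is accepted. */
#include <assert.h>
#include <errno.h>
#include <string.h>

typedef struct {
    int can_read;          /* what fd_can_read would report */
    int received;          /* packets accepted so far */
} client_t;

typedef struct {
    client_t *client;
    char buf[4096];        /* one pending packet, like TAPState.buf */
    int size;              /* 0 means nothing buffered */
} sender_t;

/* Mirror of qemu_send_packet(): deliver only if the client can receive,
 * otherwise report -EAGAIN so the caller buffers the packet. */
static int deliver(client_t *c, const char *buf, int size)
{
    (void)buf;
    (void)size;
    if (!c->can_read)
        return -EAGAIN;
    c->received++;
    return 0;
}

/* Mirror of the tap_send() logic: retry the buffered packet first; on
 * -EAGAIN keep it buffered and return without taking new data. */
int send_or_buffer(sender_t *s, const char *pkt, int size)
{
    if (s->size > 0) {
        if (deliver(s->client, s->buf, s->size) == -EAGAIN)
            return -EAGAIN;        /* still blocked, don't read more */
        s->size = 0;
    }
    if (pkt) {
        if (deliver(s->client, pkt, size) == -EAGAIN) {
            memcpy(s->buf, pkt, size);
            s->size = size;        /* buffer instead of dropping */
            return -EAGAIN;
        }
    }
    return 0;
}
```

The crucial property is that a packet the guest cannot accept is parked, not dropped, and the tap fd simply stops being read until tap_can_send() says a client is ready again.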
[kvm-devel] [PATCH 3/3] Only select once per-main_loop iteration
QEMU is rather aggressive about exhausting the wait period when selecting. This is fine when the wait period is low and when there are significant delays between selects, as it improves IO throughput. With the IO thread, there is a very small delay between selects and our wait period for select is very large. This patch changes main_loop_wait to only select once before doing the various other things in the main loop. This generally improves responsiveness of things like SDL but also improves individual file descriptor throughput quite dramatically. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index e16b261..6a90e68 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -423,24 +423,6 @@ void qemu_kvm_notify_work(void) fprintf(stderr, failed to notify io thread\n); } -static int received_signal; - -/* QEMU relies on periodically breaking out of select via EINTR to poll for IO - and timer signals. Since we're now using a file descriptor to handle - signals, select() won't be interrupted by a signal. We need to forcefully - break the select() loop when a signal is received hence - kvm_check_received_signal(). */ - -int kvm_check_received_signal(void) -{ -if (received_signal) { - received_signal = 0; - return 1; -} - -return 0; -} - /* If we have signalfd, we mask out the signals we want to handle and then * use signalfd to listen for them. We rely on whatever the current signal * handler is to dispatch the signals when we receive them.
@@ -474,8 +456,6 @@ static void sigfd_handler(void *opaque) pthread_cond_signal(qemu_aio_cond); } } - -received_signal = 1; } /* Used to break IO thread out of select */ @@ -497,8 +477,6 @@ static void io_thread_wakeup(void *opaque) offset += len; } - -received_signal = 1; } int kvm_main_loop(void) diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h index e1e461a..34aabd2 100644 --- a/qemu/qemu-kvm.h +++ b/qemu/qemu-kvm.h @@ -114,15 +114,6 @@ static inline void kvm_sleep_end(void) kvm_mutex_lock(); } -int kvm_check_received_signal(void); - -static inline int kvm_received_signal(void) -{ -if (kvm_enabled()) - return kvm_check_received_signal(); -return 0; -} - #if !defined(SYS_signalfd) struct signalfd_siginfo { uint32_t ssi_signo; diff --git a/qemu/vl.c b/qemu/vl.c index e9f0ca4..6935a82 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -7946,23 +7946,18 @@ void main_loop_wait(int timeout) slirp_select_fill(nfds, rfds, wfds, xfds); } #endif - moreio: ret = qemu_select(nfds + 1, rfds, wfds, xfds, tv); if (ret 0) { IOHandlerRecord **pioh; -int more = 0; for(ioh = first_io_handler; ioh != NULL; ioh = ioh-next) { if (!ioh-deleted ioh-fd_read FD_ISSET(ioh-fd, rfds)) { ioh-fd_read(ioh-opaque); -if (!ioh-fd_read_poll || ioh-fd_read_poll(ioh-opaque)) -more = 1; -else +if (!(ioh-fd_read_poll ioh-fd_read_poll(ioh-opaque))) FD_CLR(ioh-fd, rfds); } if (!ioh-deleted ioh-fd_write FD_ISSET(ioh-fd, wfds)) { ioh-fd_write(ioh-opaque); -more = 1; } } @@ -7976,8 +7971,6 @@ void main_loop_wait(int timeout) } else pioh = ioh-next; } -if (more !kvm_received_signal()) -goto moreio; } #if defined(CONFIG_SLIRP) if (slirp_inited) { - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
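This patch leans on select()'s level-triggered semantics: if a handler did not fully drain its fd, the single select() in the next main_loop_wait() iteration reports it readable again, so looping back (`goto moreio`) inside one iteration is unnecessary. A minimal sketch of that property:

```c
/* Demonstrate level-triggered select(): readiness persists across calls
 * until the fd is actually drained. readable_now() is an illustrative
 * helper, not QEMU code. */
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Poll rfd for readability without blocking; 1 = readable, 0 = not. */
int readable_now(int rfd)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };   /* zero timeout: pure poll */

    FD_ZERO(&rfds);
    FD_SET(rfd, &rfds);
    return select(rfd + 1, &rfds, NULL, NULL, &tv) == 1;
}
```

Because undrained data keeps showing up in every subsequent select(), selecting once per main-loop pass loses no events; it only spreads their handling across iterations, which is exactly the responsiveness tradeoff the description argues for.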
[kvm-devel] [PATCH 2/3] Interrupt io thread in qemu_set_fd_handler2
The select() in the IO thread may wait a long time before rebuilding the fd set. Whenever we do something that changes the fd set, we should interrupt the IO thread. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/vl.c b/qemu/vl.c index 1192759..e9f0ca4 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -260,6 +260,16 @@ static int event_pending = 1; #define TFR(expr) do { if ((expr) != -1) break; } while (errno == EINTR) +/* KVM runs the main loop in a separate thread. If we update one of the lists + * that are polled before or after select(), we need to make sure to break out + * of the select() to ensure the new item is serviced. + */ +static void main_loop_break(void) +{ +if (kvm_enabled()) + qemu_kvm_notify_work(); +} + void decorate_application_name(char *appname, int max_len) { if (kvm_enabled()) @@ -5680,6 +5690,7 @@ int qemu_set_fd_handler2(int fd, ioh->opaque = opaque; ioh->deleted = 0; } +main_loop_break(); return 0; } @@ -7606,8 +7617,7 @@ void qemu_bh_schedule(QEMUBH *bh) if (env) { cpu_interrupt(env, CPU_INTERRUPT_EXIT); } -if (kvm_enabled()) -qemu_kvm_notify_work(); +main_loop_break(); } void qemu_bh_cancel(QEMUBH *bh)
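Why main_loop_break() is needed, in miniature: a select() blocked in one thread never notices that another thread changed the fd set, so the changing thread writes to a wakeup pipe that select() is always watching. A sketch with illustrative names, assuming POSIX threads:

```c
/* One thread blocks in select(); a second thread "updates the handler
 * list" and kicks the first awake via a pipe, standing in for
 * main_loop_break()/qemu_kvm_notify_work(). Names are illustrative. */
#include <assert.h>
#include <pthread.h>
#include <sys/select.h>
#include <unistd.h>

static int wake_fd = -1;

static void *changer(void *arg)
{
    (void)arg;
    usleep(50 * 1000);              /* pretend to update the fd list */
    write(wake_fd, "x", 1);         /* the main_loop_break() analogue */
    return NULL;
}

/* Block in select() on the wakeup pipe only; return 1 if we were woken
 * early, 0 on timeout, -1 on error. */
int wait_for_wakeup(void)
{
    int fds[2];
    fd_set rfds;
    pthread_t tid;
    struct timeval tv = { 5, 0 };   /* would otherwise sit here 5 seconds */
    int ret;

    if (pipe(fds) == -1)
        return -1;
    wake_fd = fds[1];

    pthread_create(&tid, NULL, changer, NULL);

    FD_ZERO(&rfds);
    FD_SET(fds[0], &rfds);
    ret = select(fds[0] + 1, &rfds, NULL, NULL, &tv);

    pthread_join(tid, NULL);
    close(fds[0]);
    close(fds[1]);
    return ret;
}
```

Without the kick, the new fd would only be serviced after the full select() timeout elapsed and the set was rebuilt, which is exactly the latency this patch removes.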
[kvm-devel] [PATCH 2/3] Make virtio-net can_receive more accurate (v2)
In the final patch of this series, we rely on a VLAN client's fd_can_read method to avoid dropping packets. Unfortunately, virtio's fd_can_read method is not very accurate at the moment. This patch addresses this. It also generates a notification to the IO thread when more RX packets become available. If we say we can't receive a packet because no RX buffers are available, this may result in the tap file descriptor not being select()'d. Without notifying the IO thread, we may have to wait until the select() times out before we can receive a packet (even if there is one pending). This particular change makes RX performance very consistent. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c index 8d26832..5538979 100644 --- a/qemu/hw/virtio-net.c +++ b/qemu/hw/virtio-net.c @@ -14,6 +14,7 @@ #include virtio.h #include net.h #include qemu-timer.h +#include qemu-kvm.h /* from Linux's virtio_net.h */ @@ -60,11 +61,14 @@ typedef struct VirtIONet VirtQueue *rx_vq; VirtQueue *tx_vq; VLANClientState *vc; -int can_receive; QEMUTimer *tx_timer; int tx_timer_active; } VirtIONet; +/* TODO + * - we could suppress RX interrupt if we were so inclined. 
+ */ + static VirtIONet *to_virtio_net(VirtIODevice *vdev) { return (VirtIONet *)vdev; @@ -88,15 +92,24 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev) static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq) { -VirtIONet *n = to_virtio_net(vdev); -n-can_receive = 1; +/* We now have RX buffers, signal to the IO thread to break out of the + select to re-poll the tap file descriptor */ +if (kvm_enabled()) + qemu_kvm_notify_work(); } static int virtio_net_can_receive(void *opaque) { VirtIONet *n = opaque; -return (n-vdev.status VIRTIO_CONFIG_S_DRIVER_OK) n-can_receive; +if (n-rx_vq-vring.avail == NULL || + !(n-vdev.status VIRTIO_CONFIG_S_DRIVER_OK)) + return 0; + +if (n-rx_vq-vring.avail-idx == n-rx_vq-last_avail_idx) + return 0; + +return 1; } static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) @@ -106,15 +119,8 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) struct virtio_net_hdr *hdr; int offset, i; -/* FIXME: the drivers really need to set their status better */ -if (n-rx_vq-vring.avail == NULL) { - n-can_receive = 0; - return; -} - if (virtqueue_pop(n-rx_vq, elem) == 0) { - /* wait until the guest adds some rx bufs */ - n-can_receive = 0; + fprintf(stderr, virtio_net: this should not happen\n); return; } @@ -209,9 +215,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn) n-vdev.update_config = virtio_net_update_config; n-vdev.get_features = virtio_net_get_features; -n-rx_vq = virtio_add_queue(n-vdev, 512, virtio_net_handle_rx); +n-rx_vq = virtio_add_queue(n-vdev, 128, virtio_net_handle_rx); n-tx_vq = virtio_add_queue(n-vdev, 128, virtio_net_handle_tx); -n-can_receive = 0; memcpy(n-mac, nd-macaddr, 6); n-vc = qemu_new_vlan_client(nd-vlan, virtio_net_receive, virtio_net_can_receive, n); - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. 
[kvm-devel] [PATCH 1/3] Revert virtio tap hack (v2)
While it has served us well, it is long overdue that we eliminate the virtio-net tap hack. It turns out that zero-copy has very little impact on performance. The tap hack was gaining such a significant performance boost not because of zero-copy, but because it avoided dropping packets on receive which is apparently a significant problem with the tap implementation in QEMU. Patches 3 and 4 in this series address the packet dropping issue and the net result is a 25% boost in RX performance even in the absence of zero-copy. Also worth mentioning, is that this makes merging virtio into upstream QEMU significantly easier. Since v1, we're just rebasing on the new io thread patch set. This series depends on my IO thread series. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/hw/pc.h b/qemu/hw/pc.h index 57d2123..f5157bd 100644 --- a/qemu/hw/pc.h +++ b/qemu/hw/pc.h @@ -154,7 +154,6 @@ void isa_ne2000_init(int base, qemu_irq irq, NICInfo *nd); /* virtio-net.c */ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn); -void virtio_net_poll(void); /* virtio-blk.h */ void *virtio_blk_init(PCIBus *bus, uint16_t vendor, uint16_t device, diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c index f727b14..8d26832 100644 --- a/qemu/hw/virtio-net.c +++ b/qemu/hw/virtio-net.c @@ -13,7 +13,6 @@ #include virtio.h #include net.h -#include pc.h #include qemu-timer.h /* from Linux's virtio_net.h */ @@ -62,15 +61,10 @@ typedef struct VirtIONet VirtQueue *tx_vq; VLANClientState *vc; int can_receive; -int tap_fd; -struct VirtIONet *next; -int do_notify; QEMUTimer *tx_timer; int tx_timer_active; } VirtIONet; -static VirtIONet *VirtIONetHead = NULL; - static VirtIONet *to_virtio_net(VirtIODevice *vdev) { return (VirtIONet *)vdev; @@ -105,7 +99,6 @@ static int virtio_net_can_receive(void *opaque) return (n-vdev.status VIRTIO_CONFIG_S_DRIVER_OK) n-can_receive; } -/* -net user receive function */ static void virtio_net_receive(void *opaque, const uint8_t *buf, int 
size) { VirtIONet *n = opaque; @@ -144,87 +137,6 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) virtio_notify(n-vdev, n-rx_vq); } -/* -net tap receive handler */ -void virtio_net_poll(void) -{ -VirtIONet *vnet; -int len; -fd_set rfds; -struct timeval tv; -int max_fd = -1; -VirtQueueElement elem; -struct virtio_net_hdr *hdr; -int did_notify; - -FD_ZERO(rfds); -tv.tv_sec = 0; -tv.tv_usec = 0; - -while (1) { - -// Prepare the set of device to select from -for (vnet = VirtIONetHead; vnet; vnet = vnet-next) { - -if (vnet-tap_fd == -1) -continue; - -vnet-do_notify = 0; -//first check if the driver is ok -if (!virtio_net_can_receive(vnet)) -continue; - -/* FIXME: the drivers really need to set their status better */ -if (vnet-rx_vq-vring.avail == NULL) { -vnet-can_receive = 0; -continue; -} - -FD_SET(vnet-tap_fd, rfds); -if (max_fd vnet-tap_fd) max_fd = vnet-tap_fd; -} - -if (select(max_fd + 1, rfds, NULL, NULL, tv) = 0) -break; - -// Now check who has data pending in the tap -for (vnet = VirtIONetHead; vnet; vnet = vnet-next) { - -if (!FD_ISSET(vnet-tap_fd, rfds)) -continue; - -if (virtqueue_pop(vnet-rx_vq, elem) == 0) { -vnet-can_receive = 0; -continue; -} - -hdr = (void *)elem.in_sg[0].iov_base; -hdr-flags = 0; -hdr-gso_type = VIRTIO_NET_HDR_GSO_NONE; -again: -len = readv(vnet-tap_fd, elem.in_sg[1], elem.in_num - 1); -if (len == -1) { -if (errno == EINTR || errno == EAGAIN) -goto again; -else -fprintf(stderr, reading network error %d, len); -} -virtqueue_push(vnet-rx_vq, elem, sizeof(*hdr) + len); -vnet-do_notify = 1; -} - -/* signal other side */ -did_notify = 0; -for (vnet = VirtIONetHead; vnet; vnet = vnet-next) -if (vnet-do_notify) { -virtio_notify(vnet-vdev, vnet-rx_vq); -did_notify++; -} -if (!did_notify) -break; - } - -} - /* TX */ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq) { @@ -303,12 +215,6 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn) memcpy(n-mac, nd-macaddr, 6); n-vc = 
qemu_new_vlan_client(nd-vlan, virtio_net_receive, virtio_net_can_receive, n); -n-tap_fd = hack_around_tap(n-vc-vlan-first_client); -if (n-tap_fd != -1) { -n-next = VirtIONetHead; -//push the device on top
Re: [kvm-devel] [PATCH 3/3] Stop dropping so many RX packets in tap (v2)
Dor Laor wrote: On Sun, 2008-05-04 at 15:21 -0500, Anthony Liguori wrote: Patchset looks good and reduces some nasty hacks. It probably also improves other devices like e1000 et al. Yeah, we just need to make sure they have proper can_receive handlers. I took a quick look at the e1000 but it wasn't entirely obvious to me how RX packets get queued. AFAICT, the e1000 can_receive handler only reports false when the device isn't initialized, so it's likely dropping packets. Regards, Anthony Liguori Cheers, Dor
[kvm-devel] [PATCH 2/4] Revert virtio tap hack
While it has served us well, it is long overdue that we eliminate the virtio-net tap hack. It turns out that zero-copy has very little impact on performance. The tap hack was gaining such a significant performance boost not because of zero-copy, but because it avoided dropping packets on receive which is apparently a significant problem with the tap implementation in QEMU. Patches 3 and 4 in this series address the packet dropping issue and the net result is a 25% boost in RX performance even in the absence of zero-copy. Also worth mentioning, is that this makes merging virtio into upstream QEMU significantly easier. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/hw/pc.h b/qemu/hw/pc.h index 57d2123..f5157bd 100644 --- a/qemu/hw/pc.h +++ b/qemu/hw/pc.h @@ -154,7 +154,6 @@ void isa_ne2000_init(int base, qemu_irq irq, NICInfo *nd); /* virtio-net.c */ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn); -void virtio_net_poll(void); /* virtio-blk.h */ void *virtio_blk_init(PCIBus *bus, uint16_t vendor, uint16_t device, diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c index f727b14..8d26832 100644 --- a/qemu/hw/virtio-net.c +++ b/qemu/hw/virtio-net.c @@ -13,7 +13,6 @@ #include virtio.h #include net.h -#include pc.h #include qemu-timer.h /* from Linux's virtio_net.h */ @@ -62,15 +61,10 @@ typedef struct VirtIONet VirtQueue *tx_vq; VLANClientState *vc; int can_receive; -int tap_fd; -struct VirtIONet *next; -int do_notify; QEMUTimer *tx_timer; int tx_timer_active; } VirtIONet; -static VirtIONet *VirtIONetHead = NULL; - static VirtIONet *to_virtio_net(VirtIODevice *vdev) { return (VirtIONet *)vdev; @@ -105,7 +99,6 @@ static int virtio_net_can_receive(void *opaque) return (n-vdev.status VIRTIO_CONFIG_S_DRIVER_OK) n-can_receive; } -/* -net user receive function */ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) { VirtIONet *n = opaque; @@ -144,87 +137,6 @@ static void virtio_net_receive(void *opaque, const 
uint8_t *buf, int size) virtio_notify(n-vdev, n-rx_vq); } -/* -net tap receive handler */ -void virtio_net_poll(void) -{ -VirtIONet *vnet; -int len; -fd_set rfds; -struct timeval tv; -int max_fd = -1; -VirtQueueElement elem; -struct virtio_net_hdr *hdr; -int did_notify; - -FD_ZERO(rfds); -tv.tv_sec = 0; -tv.tv_usec = 0; - -while (1) { - -// Prepare the set of device to select from -for (vnet = VirtIONetHead; vnet; vnet = vnet-next) { - -if (vnet-tap_fd == -1) -continue; - -vnet-do_notify = 0; -//first check if the driver is ok -if (!virtio_net_can_receive(vnet)) -continue; - -/* FIXME: the drivers really need to set their status better */ -if (vnet-rx_vq-vring.avail == NULL) { -vnet-can_receive = 0; -continue; -} - -FD_SET(vnet-tap_fd, rfds); -if (max_fd vnet-tap_fd) max_fd = vnet-tap_fd; -} - -if (select(max_fd + 1, rfds, NULL, NULL, tv) = 0) -break; - -// Now check who has data pending in the tap -for (vnet = VirtIONetHead; vnet; vnet = vnet-next) { - -if (!FD_ISSET(vnet-tap_fd, rfds)) -continue; - -if (virtqueue_pop(vnet-rx_vq, elem) == 0) { -vnet-can_receive = 0; -continue; -} - -hdr = (void *)elem.in_sg[0].iov_base; -hdr-flags = 0; -hdr-gso_type = VIRTIO_NET_HDR_GSO_NONE; -again: -len = readv(vnet-tap_fd, elem.in_sg[1], elem.in_num - 1); -if (len == -1) { -if (errno == EINTR || errno == EAGAIN) -goto again; -else -fprintf(stderr, reading network error %d, len); -} -virtqueue_push(vnet-rx_vq, elem, sizeof(*hdr) + len); -vnet-do_notify = 1; -} - -/* signal other side */ -did_notify = 0; -for (vnet = VirtIONetHead; vnet; vnet = vnet-next) -if (vnet-do_notify) { -virtio_notify(vnet-vdev, vnet-rx_vq); -did_notify++; -} -if (!did_notify) -break; - } - -} - /* TX */ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq) { @@ -303,12 +215,6 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn) memcpy(n-mac, nd-macaddr, 6); n-vc = qemu_new_vlan_client(nd-vlan, virtio_net_receive, virtio_net_can_receive, n); -n-tap_fd = 
hack_around_tap(n-vc-vlan-first_client); -if (n-tap_fd != -1) { -n-next = VirtIONetHead; -//push the device on top of the list -VirtIONetHead = n; -} n-tx_timer = qemu_new_timer(vm_clock
[kvm-devel] [PATCH 1/4] Only select once per-main_loop iteration
QEMU is rather aggressive about exhausting the wait period when selecting. This is fine when the wait period is low and when there are significant delays between selects, as it improves IO throughput. With the IO thread, there is a very small delay between selects and our wait period for select is very large. This patch changes main_loop_wait to only select once before doing the various other things in the main loop. This generally improves responsiveness of things like SDL but also improves individual file descriptor throughput quite dramatically. This patch relies on my io-thread-timerfd.patch. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index 0c7f49f..31c7ca7 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -401,24 +401,6 @@ void qemu_kvm_notify_work(void) pthread_kill(io_thread, SIGUSR1); } -static int received_signal; - -/* QEMU relies on periodically breaking out of select via EINTR to poll for IO - and timer signals. Since we're now using a file descriptor to handle - signals, select() won't be interrupted by a signal. We need to forcefully - break the select() loop when a signal is received hence - kvm_check_received_signal(). */ - -int kvm_check_received_signal(void) -{ -if (received_signal) { - received_signal = 0; - return 1; -} - -return 0; -} - #if defined(SYS_signalfd) #if !defined(HAVE_signalfd) #include <linux/signalfd.h> #endif @@ -466,8 +448,6 @@ static void sigfd_handler(void *opaque) if (info.ssi_signo == SIGUSR2) pthread_cond_signal(qemu_aio_cond); } - -received_signal = 1; } static int setup_signal_handlers(int nr_signals, ...) @@ -576,8 +556,6 @@ static void sigfd_handler(void *opaque) if (signo == SIGUSR2) pthread_cond_signal(qemu_aio_cond); } - -received_signal = 1; } static int setup_signal_handlers(int nr_signals, ...)
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h index bcab82c..5109c64 100644 --- a/qemu/qemu-kvm.h +++ b/qemu/qemu-kvm.h @@ -112,13 +112,4 @@ static inline void kvm_sleep_end(void) kvm_mutex_lock(); } -int kvm_check_received_signal(void); - -static inline int kvm_received_signal(void) -{ -if (kvm_enabled()) - return kvm_check_received_signal(); -return 0; -} - #endif diff --git a/qemu/vl.c b/qemu/vl.c index 1192759..bcf893f 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -7936,23 +7936,18 @@ void main_loop_wait(int timeout) slirp_select_fill(nfds, rfds, wfds, xfds); } #endif - moreio: ret = qemu_select(nfds + 1, rfds, wfds, xfds, tv); if (ret 0) { IOHandlerRecord **pioh; -int more = 0; for(ioh = first_io_handler; ioh != NULL; ioh = ioh-next) { if (!ioh-deleted ioh-fd_read FD_ISSET(ioh-fd, rfds)) { ioh-fd_read(ioh-opaque); -if (!ioh-fd_read_poll || ioh-fd_read_poll(ioh-opaque)) -more = 1; -else +if (!(ioh-fd_read_poll ioh-fd_read_poll(ioh-opaque))) FD_CLR(ioh-fd, rfds); } if (!ioh-deleted ioh-fd_write FD_ISSET(ioh-fd, wfds)) { ioh-fd_write(ioh-opaque); -more = 1; } } @@ -7966,8 +7961,6 @@ void main_loop_wait(int timeout) } else pioh = ioh-next; } -if (more !kvm_received_signal()) -goto moreio; } #if defined(CONFIG_SLIRP) if (slirp_inited) { - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
[kvm-devel] [PATCH 4/4] Stop dropping so many RX packets in tap
Normally, tap always reads packets and simply lets the client drop them if it cannot receive them. For virtio-net, this results in massive packet loss and about an 80% performance loss in TCP throughput. This patch modifies qemu_send_packet() to only deliver a packet to a VLAN client if it doesn't have a fd_can_read method or the fd_can_read method indicates that it can receive packets. We also return a status of whether any clients were able to receive the packet. If no clients were able to receive a packet, we buffer the packet until a client indicates that it can receive packets again. This patch also modifies the tap code to only read from the tap fd if at least one client on the VLAN is able to receive a packet. Finally, this patch changes the tap code to drain all possible packets from the tap device when the tap fd is readable. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/net.h b/qemu/net.h index 13daa27..dfdf9af 100644 --- a/qemu/net.h +++ b/qemu/net.h @@ -29,7 +29,7 @@ VLANClientState *qemu_new_vlan_client(VLANState *vlan, IOCanRWHandler *fd_can_read, void *opaque); int qemu_can_send_packet(VLANClientState *vc); -void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size); +int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size); void qemu_handler_true(void *opaque); void do_info_network(void); diff --git a/qemu/vl.c b/qemu/vl.c index b8ce485..74c34b6 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -3750,10 +3750,11 @@ int qemu_can_send_packet(VLANClientState *vc1) return 0; } -void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) +int qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) { VLANState *vlan = vc1-vlan; VLANClientState *vc; +int ret = -EAGAIN; #if 0 printf(vlan %d send:\n, vlan-id); @@ -3761,9 +3762,14 @@ void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) #endif for(vc = vlan-first_client; vc != NULL; vc = vc-next) { if (vc != vc1) { 
-vc-fd_read(vc-opaque, buf, size); + if (!vc-fd_can_read || vc-fd_can_read(vc-opaque)) { + vc-fd_read(vc-opaque, buf, size); + ret = 0; + } } } + +return ret; } #if defined(CONFIG_SLIRP) @@ -3966,6 +3972,8 @@ typedef struct TAPState { VLANClientState *vc; int fd; char down_script[1024]; +char buf[4096]; +int size; } TAPState; static void tap_receive(void *opaque, const uint8_t *buf, int size) @@ -3981,24 +3989,70 @@ static void tap_receive(void *opaque, const uint8_t *buf, int size) } } +static int tap_can_send(void *opaque) +{ +TAPState *s = opaque; +VLANClientState *vc; +int can_receive = 0; + +/* Check to see if any of our clients can receive a packet */ +for (vc = s-vc-vlan-first_client; vc; vc = vc-next) { + /* Skip ourselves */ + if (vc == s-vc) + continue; + + if (!vc-fd_can_read) { + /* no fd_can_read handler, they always can receive */ + can_receive = 1; + } else + can_receive = vc-fd_can_read(vc-opaque); + + /* Once someone can receive, we try to send a packet */ + if (can_receive) + break; +} + +return can_receive; +} + static void tap_send(void *opaque) { TAPState *s = opaque; -uint8_t buf[4096]; -int size; +/* First try to send any buffered packet */ +if (s-size 0) { + int err; + + /* If noone can receive the packet, buffer it */ + err = qemu_send_packet(s-vc, s-buf, s-size); + if (err == -EAGAIN) + return; +} + +/* Read packets until we hit EAGAIN */ +do { #ifdef __sun__ -struct strbuf sbuf; -int f = 0; -sbuf.maxlen = sizeof(buf); -sbuf.buf = buf; -size = getmsg(s-fd, NULL, sbuf, f) =0 ? sbuf.len : -1; + struct strbuf sbuf; + int f = 0; + sbuf.maxlen = sizeof(s-buf); + sbuf.buf = s-buf; + s-size = getmsg(s-fd, NULL, sbuf, f) =0 ? 
sbuf.len : -1; #else -size = read(s-fd, buf, sizeof(buf)); + s-size = read(s-fd, s-buf, sizeof(s-buf)); #endif -if (size 0) { -qemu_send_packet(s-vc, buf, size); -} + + if (s-size == -1 errno == EINTR) + continue; + + if (s-size 0) { + int err; + + /* If noone can receive the packet, buffer it */ + err = qemu_send_packet(s-vc, s-buf, s-size); + if (err == -EAGAIN) + break; + } +} while (s-size 0); } /* fd support */ @@ -4012,7 +4066,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd) return NULL; s-fd = fd; s-vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s); -qemu_set_fd_handler2(s-fd, NULL, tap_send, NULL, s); +qemu_set_fd_handler2(s-fd, tap_can_send, tap_send
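The qemu_send_packet() change above boils down to: hand the packet only to clients that claim they can read, and report -EAGAIN when nobody could so the caller can buffer it. That control flow can be modeled outside QEMU; this is a sketch with invented stand-in names (Client, send_packet), not the QEMU API:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for VLANClientState and its callbacks. */
typedef struct Client {
    int (*can_read)(void *opaque);  /* may be NULL: client is always ready */
    void (*read)(void *opaque, const unsigned char *buf, int size);
    void *opaque;
    struct Client *next;
} Client;

/* Mirrors the patched qemu_send_packet(): deliver to every client that
 * is ready; return -EAGAIN if no client could take the packet, so the
 * caller knows to buffer it and retry later. */
static int send_packet(Client *clients, const unsigned char *buf, int size)
{
    int ret = -EAGAIN;
    Client *c;

    for (c = clients; c != NULL; c = c->next) {
        if (c->can_read == NULL || c->can_read(c->opaque)) {
            c->read(c->opaque, buf, size);
            ret = 0;
        }
    }
    return ret;
}
```

The tap code then only reads from the tap fd while send_packet() keeps succeeding, which is what converts "read and drop" into "leave the packet in the kernel until someone can take it".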
[kvm-devel] [PATCH 3/4] Make virtio-net can_receive more accurate
In the final patch of this series, we rely on a VLAN client's fd_can_read method to avoid dropping packets. Unfortunately, virtio's fd_can_read method is not very accurate at the moment. This patch addresses that.

It also generates a notification to the IO thread when more RX packets become available. If we say we can't receive a packet because no RX buffers are available, this may result in the tap file descriptor not being select()'d. Without notifying the IO thread, we may have to wait until the select() times out before we can receive a packet (even if there is one pending). This particular change makes RX performance very consistent.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 8d26832..5538979 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -14,6 +14,7 @@
 #include "virtio.h"
 #include "net.h"
 #include "qemu-timer.h"
+#include "qemu-kvm.h"
 
 /* from Linux's virtio_net.h */
 
@@ -60,11 +61,14 @@ typedef struct VirtIONet
     VirtQueue *rx_vq;
     VirtQueue *tx_vq;
     VLANClientState *vc;
-    int can_receive;
     QEMUTimer *tx_timer;
     int tx_timer_active;
 } VirtIONet;
 
+/* TODO
+ * - we could suppress RX interrupt if we were so inclined.
+ */
+
 static VirtIONet *to_virtio_net(VirtIODevice *vdev)
 {
     return (VirtIONet *)vdev;
@@ -88,15 +92,24 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
 
 static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 {
-    VirtIONet *n = to_virtio_net(vdev);
-    n->can_receive = 1;
+    /* We now have RX buffers, signal to the IO thread to break out of the
+       select to re-poll the tap file descriptor */
+    if (kvm_enabled())
+        qemu_kvm_notify_work();
 }
 
 static int virtio_net_can_receive(void *opaque)
 {
     VirtIONet *n = opaque;
 
-    return (n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK) && n->can_receive;
+    if (n->rx_vq->vring.avail == NULL ||
+        !(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
+        return 0;
+
+    if (n->rx_vq->vring.avail->idx == n->rx_vq->last_avail_idx)
+        return 0;
+
+    return 1;
 }
 
 static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
@@ -106,15 +119,8 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
     struct virtio_net_hdr *hdr;
     int offset, i;
 
-    /* FIXME: the drivers really need to set their status better */
-    if (n->rx_vq->vring.avail == NULL) {
-        n->can_receive = 0;
-        return;
-    }
-
     if (virtqueue_pop(n->rx_vq, &elem) == 0) {
-        /* wait until the guest adds some rx bufs */
-        n->can_receive = 0;
+        fprintf(stderr, "virtio_net: this should not happen\n");
        return;
     }
 
@@ -209,9 +215,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
     n->vdev.update_config = virtio_net_update_config;
     n->vdev.get_features = virtio_net_get_features;
-    n->rx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_rx);
+    n->rx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_rx);
     n->tx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_tx);
-    n->can_receive = 0;
 
     memcpy(n->mac, nd->macaddr, 6);
     n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
                                  virtio_net_can_receive, n);
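The new can_receive test boils down to comparing the ring's avail->idx against the device's last_avail_idx. A minimal model of that emptiness check, with the types cut down to the two fields the comparison needs (not the real virtio ring layout):

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-ins for the vring avail ring and VirtQueue. */
struct avail_ring {
    uint16_t flags;
    uint16_t idx;               /* guest bumps this as it posts buffers */
};

struct virtqueue {
    struct avail_ring *avail;   /* NULL until the driver sets up the ring */
    uint16_t last_avail_idx;    /* next entry the device will consume */
};

/* Device can receive only if the ring exists and the guest has posted
 * buffers we haven't consumed yet. The indices are free-running 16-bit
 * counters, so inequality (not "greater than") is the right test. */
static int rx_buffers_available(const struct virtqueue *vq)
{
    if (vq->avail == NULL)
        return 0;
    return vq->avail->idx != vq->last_avail_idx;
}
```

Because both counters wrap naturally at 65536, the `!=` comparison keeps working across wraparound, which is why the patch never needs a separate can_receive flag.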
Re: [kvm-devel] [PATCH 2/4] Revert virtio tap hack
Anthony Liguori wrote:
While it has served us well, it is long overdue that we eliminate the virtio-net tap hack. It turns out that zero-copy has very little impact on performance. The tap hack was gaining such a significant performance boost not because of zero-copy, but because it avoided dropping packets on receive, which is apparently a significant problem with the tap implementation in QEMU.

FWIW, attached is a pretty straightforward zero-copy patch. What's interesting is that I see no change in throughput using this patch. The CPU is pegged at 100% during the iperf run. Since we're still using small MTUs, this isn't surprising. Copying a 1500 byte packet that we have to bring into the cache anyway doesn't seem significant. I think zero-copy will be more important with GSO though.

Regards,

Anthony Liguori

Patches 3 and 4 in this series address the packet dropping issue, and the net result is a 25% boost in RX performance even in the absence of zero-copy. Also worth mentioning is that this makes merging virtio into upstream QEMU significantly easier.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

net-tap-zero-copy.patch
Description: application/mbox
Re: [kvm-devel] Feedback and errors
Avi Kivity wrote:
nadim khemir wrote:
Hi, great work. While playing with kvm-qemu I noticed a few points that might be of interest:
1/ -loadvm and -snapshot don't work together. It works as if -loadvm wasn't passed as an argument.
2/ two instances of kvm can be passed the same -hda. There is no locking whatsoever. This messes things up seriously.

These two are upstream qemu problems. Copying qemu-devel. I guess using file locking by default would improve the situation, and we can add a -drive ...,exclusive=no option for people playing with cluster filesystems.

This is not a situation where the user has a reasonable expectation of what will happen that we violate. If the user is unhappy with the results, it's because the user made a mistake.

FWIW, the whole override thing for Xen has been an endless source of pain. It's very difficult (if not impossible) to accurately determine whether someone else is using the disk. Also, it tends to confuse people trying to do something legitimate more often than it helps someone doing something stupid.

I very frequently run multiple VMs with the same disk. I do it strictly for the purposes of benchmarking. There are ways to share a disk without using a clustered filesystem.

If a higher-level management tool (like libvirt) wants to enforce a policy, then let it. We should not be enforcing policies within QEMU though.

Regards,

Anthony Liguori
Re: [kvm-devel] Feedback and errors
Avi Kivity wrote:
Well, one user (me) has made this mistake, several times.

I guess it's usage patterns. I'm pretty religious about using -snapshot unless I have a very specific reason not to. I have never encountered this problem myself.

FWIW, the whole override thing for Xen has been an endless source of pain. It's very difficult (if not impossible) to accurately determine if someone else is using the disk.

What's wrong with the standard file locking API? Of course it won't stop non-qemu apps from accessing it, but that's unlikely anyway.

Xen tries to be very smart about determining whether devices are mounted somewhere else or not.

Also, it tends to confuse people trying to do something legitimate more often than helping someone doing something stupid.

-drive exclusive=off (or share=yes)

The problem I have is that the default policy gets very complicated. At first thought, I would say it's fine as long as exclusive=off is the default when using -snapshot or raw images. However, if you create a VM with a qcow image using -snapshot, and then create another one without using -snapshot, you're boned. What we really need is a global configuration file so that individual users can select these defaults according to what makes sense for them. In the meantime, I think the policy vs. mechanism argument strongly suggests that exclusive=off should be the default (not to mention maintaining backwards compatibility).

I very frequently run multiple VMs with the same disk. I do it strictly for the purposes of benchmarking. There are ways to share a disk without using a clustered filesystem.

I imagine only raw format disks, and only as non-root filesystems (or with -snapshot, which should automatically set exclusive=off)?

Yup.

If a higher level management tool wants to enforce a policy (like libvirt), then let it. We should not be enforcing policies within QEMU though.
I agree that qemu is not the place to enforce policies, but covering a hole that users are likely to step into, while allowing its explicit uncovering, is a good thing. We're not enforcing the policy, only hinting.

Unfortunately, the solution involves breaking backwards compatibility for legitimate use-cases (not to mention making those use-cases more awkward). I think the only way to sanely do this is as a global configuration parameter.

Regards,

Anthony Liguori
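A sketch of the hinting Avi describes, built on POSIX advisory locking via flock(2). flock conflicts are tracked per open file description, so a second open of the same image (from another process, or even another fd) is refused, while nothing stops non-cooperating programs — which is exactly the "hint, not policy" behaviour under discussion. The function name and the exclusive flag (modeling a hypothetical share=yes option) are invented:

```c
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Try to take an advisory lock on an already-open disk image fd.
 * exclusive == 0 models a hypothetical "share=yes" drive option by
 * taking a shared lock instead, so readers can coexist.
 * Returns 0 on success; -1 with errno == EWOULDBLOCK if another open
 * file description already holds a conflicting lock. */
static int lock_disk_image(int fd, int exclusive)
{
    int op = (exclusive ? LOCK_EX : LOCK_SH) | LOCK_NB;

    return flock(fd, op);
}
```

The advisory nature is the point: a second qemu that cooperates gets a clean, early error instead of silent image corruption, while tools that deliberately share a raw image can simply opt out.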
[kvm-devel] [PATCH] Use pipe() to simulate signalfd() (v2)
This patch reworks the IO thread to use signalfd() instead of sigtimedwait(). This will eliminate the need to use SIGIO everywhere. In this version of the patch, we use signalfd() when it's available. When it isn't available, we instead use a pipe() that is written to in each signal handler. I've tested Windows and Linux guests with SMP without seeing an obvious regressions. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index 9a9bf59..0c7f49f 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -12,6 +12,9 @@ int kvm_allowed = 1; int kvm_irqchip = 1; int kvm_pit = 1; +#include qemu-common.h +#include console.h + #include string.h #include hw/hw.h #include sysemu.h @@ -38,14 +41,6 @@ __thread struct vcpu_info *vcpu; static int qemu_system_ready; -struct qemu_kvm_signal_table { -sigset_t sigset; -sigset_t negsigset; -}; - -static struct qemu_kvm_signal_table io_signal_table; -static struct qemu_kvm_signal_table vcpu_signal_table; - #define SIG_IPI (SIGRTMIN+4) struct vcpu_info { @@ -169,37 +164,23 @@ static int has_work(CPUState *env) return kvm_arch_has_work(env); } -static int kvm_process_signal(int si_signo) -{ -struct sigaction sa; - -switch (si_signo) { -case SIGUSR2: -pthread_cond_signal(qemu_aio_cond); -break; -case SIGALRM: -case SIGIO: -sigaction(si_signo, NULL, sa); -sa.sa_handler(si_signo); -break; -} - -return 1; -} - -static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, - int timeout) +static int kvm_eat_signal(CPUState *env, int timeout) { struct timespec ts; int r, e, ret = 0; siginfo_t siginfo; +sigset_t waitset; ts.tv_sec = timeout / 1000; ts.tv_nsec = (timeout % 1000) * 100; -r = sigtimedwait(waitset-sigset, siginfo, ts); +sigemptyset(waitset); +sigaddset(waitset, SIG_IPI); + +r = sigtimedwait(waitset, siginfo, ts); if (r == -1 (errno == EAGAIN || errno == EINTR) !timeout) return 0; e = errno; + pthread_mutex_lock(qemu_mutex); if (env vcpu) cpu_single_env = vcpu-env; @@ 
-208,7 +189,7 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, exit(1); } if (r != -1) -ret = kvm_process_signal(siginfo.si_signo); + ret = 1; if (env vcpu_info[env-cpu_index].stop) { vcpu_info[env-cpu_index].stop = 0; @@ -224,14 +205,13 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, static void kvm_eat_signals(CPUState *env, int timeout) { int r = 0; -struct qemu_kvm_signal_table *waitset = vcpu_signal_table; -while (kvm_eat_signal(waitset, env, 0)) +while (kvm_eat_signal(env, 0)) r = 1; if (!r timeout) { - r = kvm_eat_signal(waitset, env, timeout); + r = kvm_eat_signal(env, timeout); if (r) - while (kvm_eat_signal(waitset, env, 0)) + while (kvm_eat_signal(env, 0)) ; } } @@ -264,9 +244,7 @@ static void pause_all_threads(void) pthread_kill(vcpu_info[i].thread, SIG_IPI); } while (!all_threads_paused()) { - pthread_mutex_unlock(qemu_mutex); - kvm_eat_signal(io_signal_table, NULL, 1000); - pthread_mutex_lock(qemu_mutex); + main_loop_wait(1000); cpu_single_env = NULL; } } @@ -307,6 +285,13 @@ static void setup_kernel_sigmask(CPUState *env) { sigset_t set; +sigemptyset(set); +sigaddset(set, SIGUSR1); +sigaddset(set, SIGUSR2); +sigaddset(set, SIGIO); +sigaddset(set, SIGALRM); +sigprocmask(SIG_BLOCK, set, NULL); + sigprocmask(SIG_BLOCK, NULL, set); sigdelset(set, SIG_IPI); @@ -343,7 +328,7 @@ static int kvm_main_loop_cpu(CPUState *env) cpu_single_env = env; while (1) { while (!has_work(env)) - kvm_main_loop_wait(env, 10); + kvm_main_loop_wait(env, 1000); if (env-interrupt_request CPU_INTERRUPT_HARD) env-hflags = ~HF_HALTED_MASK; if (!kvm_irqchip_in_kernel(kvm_context) info-sipi_needed) @@ -391,18 +376,6 @@ static void *ap_main_loop(void *_env) return NULL; } -static void qemu_kvm_init_signal_table(struct qemu_kvm_signal_table *sigtab) -{ -sigemptyset(sigtab-sigset); -sigfillset(sigtab-negsigset); -} - -static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum) -{ 
-sigaddset(sigtab-sigset, signum); -sigdelset(sigtab-negsigset, signum); -} - void kvm_init_new_ap(int cpu, CPUState *env) { pthread_create(vcpu_info[cpu].thread, NULL, ap_main_loop, env); @@ -411,28 +384,12 @@ void kvm_init_new_ap(int cpu, CPUState *env) pthread_cond_wait(qemu_vcpu_cond, qemu_mutex); } -static void qemu_kvm_init_signal_tables(void) -{ -qemu_kvm_init_signal_table(io_signal_table); -qemu_kvm_init_signal_table
Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v2)
Muli Ben-Yehuda wrote:
On Tue, Apr 29, 2008 at 02:09:20PM -0500, Anthony Liguori wrote:

This patch allows VMAs that contain no backing page to be used for guest memory. This is a drop-in replacement for Ben-Ami's first page in his direct mmio series. Here, we continue to allow mmio pages to be represented in the rmap.

Since v1, I've taken into account Andrea's suggestions at using VM_PFNMAP instead of VM_IO and changed the BUG_ON to a return of bad_page.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d7991a..64e5efe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 	struct page *page[1];
 	unsigned long addr;
 	int npages;
+	pfn_t pfn;
 
 	might_sleep();
 
@@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 	npages = get_user_pages(current, current->mm, addr, 1, 1, 1, page, NULL);
-	if (npages != 1) {
-		get_page(bad_page);
-		return page_to_pfn(bad_page);
-	}
+	if (unlikely(npages != 1)) {
+		struct vm_area_struct *vma;
 
-	return page_to_pfn(page[0]);
+		vma = find_vma(current->mm, addr);
+		if (vma == NULL || addr >= vma->vm_start ||
+		    !(vma->vm_flags & VM_PFNMAP)) {

Isn't the check for addr backwards here? For the VMA we would like to find, vma->vm_start <= addr < vma->vm_end.

Yes it is. Thanks for spotting that.

Regards,

Anthony Liguori

Cheers, Muli
Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v2)
Andrea Arcangeli wrote:
On Tue, Apr 29, 2008 at 06:12:51PM -0500, Anthony Liguori wrote:

IIUC PPC correctly, all IO pages have corresponding struct pages. This means that get_user_pages() would succeed and you can reference count them? In this case, we would never take the VM_PFNMAP path.

get_user_pages only works on vmas where only pfns with a struct page can be mapped, but even if a struct page exists it doesn't mean get_user_pages will succeed. All mmio regions should be marked VM_IO, as reading them affects hardware somehow, and that prevents get_user_pages from working on them regardless of whether a struct page exists.

Ah, thanks for the clarification.

Regards,

Anthony Liguori
Re: [kvm-devel] cirrusfb division by zero
Marcelo Tosatti wrote:
Anthony,

The following sequence crashes F9 guests, when using VNC:

# modprobe cirrusfb
# vbetool post

Results in a Floating point exception at:

cirrus_do_copy()
{
    depth = s->get_bpp((VGAState *)s) / 8;
    ...
    sx = (src % (width * depth)) / depth;
    ...
}

The problem is that ->get_bpp returns 0. The following band-aid fixes it (cough). I have no idea if it's correct though?

It suggests something is very broken. get_bpp is only supposed to return 0 when in VGA mode. I don't think blitting should happen when in VGA mode. Applying the patch is not a bad idea, but what was the guest doing when this happened? Was it in the process of transitioning from one mode to another?

Regards,

Anthony Liguori

vbetool post corrupts both SDL and VNC displays when using cirrusfb, but that seems to be a separate problem.

diff --git a/qemu/hw/cirrus_vga.c b/qemu/hw/cirrus_vga.c
index e14ec35..9f860ff 100644
--- a/qemu/hw/cirrus_vga.c
+++ b/qemu/hw/cirrus_vga.c
@@ -709,6 +709,8 @@ static void cirrus_do_copy(CirrusVGAState *s, int dst, int src, int w, in
     int notify = 0;
 
     depth = s->get_bpp((VGAState *)s) / 8;
+    if (!depth)
+        depth = 1;
     s->get_resolution((VGAState *)s, &width, &height);
 
     /* extra x, y */
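The band-aid simply clamps the bytes-per-pixel so the modulo and divide in cirrus_do_copy() can't trap. As a standalone illustration of why a zero depth faults and what the clamp does (the function name and the reduced coordinate math are simplifications, not the real cirrus_vga code):

```c
/* bpp is bits per pixel as reported by the emulated VGA core;
 * it is 0 while the card is still in text/VGA mode. */
static int copy_src_x(int src, int width, int bpp)
{
    int depth = bpp / 8;

    if (!depth)      /* band-aid from the patch: avoid divide by zero */
        depth = 1;

    /* Without the clamp, both the % and the / below would fault when
     * depth is 0, which is exactly the crash Marcelo reported. */
    return (src % (width * depth)) / depth;
}
```

The open question in the thread remains valid, though: the clamp hides the symptom, but a blit arriving while get_bpp() reports VGA mode suggests the mode-transition state machine is what actually needs fixing.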
[kvm-devel] [PATCH] Don't leak EPT identity page table
In vmx.c:alloc_identity_pagetable() we grab a reference to the EPT identity page table via gfn_to_page(). We never release this reference though. This patch releases the reference to this page on VM destruction.

I haven't tested this with EPT.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 578a0c1..63f46cf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3909,6 +3909,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_free_physmem(kvm);
 	if (kvm->arch.apic_access_page)
 		put_page(kvm->arch.apic_access_page);
+	if (kvm->arch.ept_identity_pagetable)
+		put_page(kvm->arch.ept_identity_pagetable);
 	kfree(kvm);
 }
[kvm-devel] [PATCH] Handle vma regions with no backing page (v3)
This patch allows VMA's that contain no backing page to be used for guest memory. This is a drop-in replacement for Ben-Ami's first page in his direct mmio series. Here, we continue to allow mmio pages to be represented in the rmap. Since v1, I've taken into account Andrea's suggestions at using VM_PFNMAP instead of VM_IO and changed the BUG_ON to a return of bad_page. Since v2, I've incorporated comments from Avi about returning bad_page instead of NULL and fixed a typo spotted by Muli. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1d7991a..9e3e4ee 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn) struct page *page[1]; unsigned long addr; int npages; + pfn_t pfn; might_sleep(); @@ -544,19 +545,38 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn) npages = get_user_pages(current, current-mm, addr, 1, 1, 1, page, NULL); - if (npages != 1) { - get_page(bad_page); - return page_to_pfn(bad_page); - } + if (unlikely(npages != 1)) { + struct vm_area_struct *vma; - return page_to_pfn(page[0]); + vma = find_vma(current-mm, addr); + if (vma == NULL || addr vma-vm_start || + !(vma-vm_flags VM_PFNMAP)) { + get_page(bad_page); + return page_to_pfn(bad_page); + } + + pfn = ((addr - vma-vm_start) PAGE_SHIFT) + vma-vm_pgoff; + BUG_ON(pfn_valid(pfn)); + } else + pfn = page_to_pfn(page[0]); + + return pfn; } EXPORT_SYMBOL_GPL(gfn_to_pfn); struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn) { - return pfn_to_page(gfn_to_pfn(kvm, gfn)); + pfn_t pfn; + + pfn = gfn_to_pfn(kvm, gfn); + if (pfn_valid(pfn)) + return pfn_to_page(pfn); + + WARN_ON(!pfn_valid(pfn)); + + get_page(bad_page); + return bad_page; } EXPORT_SYMBOL_GPL(gfn_to_page); @@ -569,7 +589,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean); void kvm_release_pfn_clean(pfn_t pfn) { - put_page(pfn_to_page(pfn)); + if (pfn_valid(pfn)) + put_page(pfn_to_page(pfn)); } 
EXPORT_SYMBOL_GPL(kvm_release_pfn_clean); @@ -594,21 +615,25 @@ EXPORT_SYMBOL_GPL(kvm_set_page_dirty); void kvm_set_pfn_dirty(pfn_t pfn) { - struct page *page = pfn_to_page(pfn); - if (!PageReserved(page)) - SetPageDirty(page); + if (pfn_valid(pfn)) { + struct page *page = pfn_to_page(pfn); + if (!PageReserved(page)) + SetPageDirty(page); + } } EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty); void kvm_set_pfn_accessed(pfn_t pfn) { - mark_page_accessed(pfn_to_page(pfn)); + if (pfn_valid(pfn)) + mark_page_accessed(pfn_to_page(pfn)); } EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed); void kvm_get_pfn(pfn_t pfn) { - get_page(pfn_to_page(pfn)); + if (pfn_valid(pfn)) + get_page(pfn_to_page(pfn)); } EXPORT_SYMBOL_GPL(kvm_get_pfn); - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
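For VM_PFNMAP vmas, the patch above derives the pfn arithmetically instead of via get_user_pages(). The computation is plain base-plus-offset in page units; here is a sketch with the kernel types reduced to fixed-width integers (struct vma_lite is an invented stand-in for the two vm_area_struct fields involved):

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KiB pages, as on x86 */

typedef uint64_t pfn_t;

/* Minimal stand-in for the vm_area_struct fields this math uses. */
struct vma_lite {
    uint64_t vm_start;   /* first virtual address covered by the vma */
    uint64_t vm_pgoff;   /* pfn mapped at vm_start, for VM_PFNMAP vmas */
};

/* Mirrors the patch's
 *   pfn = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 * valid only once the caller has checked vm_start <= addr and
 * that VM_PFNMAP is set (so vm_pgoff really holds the base pfn). */
static pfn_t pfnmap_addr_to_pfn(const struct vma_lite *vma, uint64_t addr)
{
    return ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
}
```

This is why the containment and VM_PFNMAP checks earlier in the patch matter: remap_pfn_range() is what guarantees vm_pgoff is the pfn at vm_start, and an addr below vm_start would make the subtraction wrap.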
[kvm-devel] [PATCH] Handle vma regions with no backing page
This patch allows VMA's that contain no backing page to be used for guest memory. This is a drop-in replacement for Ben-Ami's first page in his direct mmio series. Here, we continue to allow mmio pages to be represented in the rmap. Signed-off-by: Anthony Liguori [EMAIL PROTECTED] diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f095b73..11b26f5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -531,6 +531,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn) struct page *page[1]; unsigned long addr; int npages; + pfn_t pfn; might_sleep(); @@ -543,19 +544,36 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn) npages = get_user_pages(current, current-mm, addr, 1, 1, 1, page, NULL); - if (npages != 1) { - get_page(bad_page); - return page_to_pfn(bad_page); - } + if (unlikely(npages != 1)) { + struct vm_area_struct *vma; + + vma = find_vma(current-mm, addr); + if (vma == NULL) { + get_page(bad_page); + return page_to_pfn(bad_page); + } + + BUG_ON(!(vma-vm_flags VM_IO)); - return page_to_pfn(page[0]); + pfn = ((addr - vma-vm_start) PAGE_SHIFT) + vma-vm_pgoff; + BUG_ON(pfn_valid(pfn)); + } else + pfn = page_to_pfn(page[0]); + + return pfn; } EXPORT_SYMBOL_GPL(gfn_to_pfn); struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn) { - return pfn_to_page(gfn_to_pfn(kvm, gfn)); + pfn_t pfn; + + pfn = gfn_to_pfn(kvm, gfn); + if (pfn_valid(pfn)) + return pfn_to_page(pfn); + + return NULL; } EXPORT_SYMBOL_GPL(gfn_to_page); @@ -568,7 +586,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean); void kvm_release_pfn_clean(pfn_t pfn) { - put_page(pfn_to_page(pfn)); + if (pfn_valid(pfn)) + put_page(pfn_to_page(pfn)); } EXPORT_SYMBOL_GPL(kvm_release_pfn_clean); @@ -593,21 +612,25 @@ EXPORT_SYMBOL_GPL(kvm_set_page_dirty); void kvm_set_pfn_dirty(pfn_t pfn) { - struct page *page = pfn_to_page(pfn); - if (!PageReserved(page)) - SetPageDirty(page); + if (pfn_valid(pfn)) { + struct page *page = pfn_to_page(pfn); + if (!PageReserved(page)) + SetPageDirty(page); + } } 
EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty); void kvm_set_pfn_accessed(pfn_t pfn) { - mark_page_accessed(pfn_to_page(pfn)); + if (pfn_valid(pfn)) + mark_page_accessed(pfn_to_page(pfn)); } EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed); void kvm_get_pfn(pfn_t pfn) { - get_page(pfn_to_page(pfn)); + if (pfn_valid(pfn)) + get_page(pfn_to_page(pfn)); } EXPORT_SYMBOL_GPL(kvm_get_pfn); - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] Handle vma regions with no backing page
Andrea Arcangeli wrote:
On Tue, Apr 29, 2008 at 09:32:09AM -0500, Anthony Liguori wrote:

+	vma = find_vma(current->mm, addr);
+	if (vma == NULL) {
+		get_page(bad_page);
+		return page_to_pfn(bad_page);
+	}

Here you must check the vm_start address; find_vma only checks addr < vm_end, but there's no guarantee addr >= vm_start yet.

Indeed.

+	BUG_ON(!(vma->vm_flags & VM_IO));

For consistency we should return bad_page and not BUG on it. VM_IO and VM_PFNMAP can theoretically not be set at the same time, otherwise get_user_pages would be buggy checking against VM_PFNMAP|VM_IO. I doubt anybody isn't setting VM_IO before calling remap_pfn_range, but anyway...

Secondly, the really correct check is against VM_PFNMAP. This is because PFNMAP is set at the same time that vm_pgoff = pfn. VM_IO is not; in theory, a driver using ->fault instead of remap_pfn_range shouldn't set VM_IO and should only set VM_RESERVED. VM_IO is about keeping gdb/coredump out, as they could mess with the hardware if they read; PFNMAP is about remap_pfn_range having been called and pgoff pointing to the first pfn mapped at the vm_start address.

Good point. I've updated the patch. Will send it out again once I've gotten to test it.

Regards,

Anthony Liguori

Patch is in the right direction, way to go!
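Andrea's first point is that find_vma() returns the first vma whose vm_end is above addr, so the caller must still verify the lower bound itself. A tiny model of the full containment test (struct vma_range is an invented stand-in, not the kernel type):

```c
#include <stddef.h>
#include <stdint.h>

/* Invented stand-in for the two vm_area_struct bounds. */
struct vma_range {
    uint64_t vm_start;   /* inclusive lower bound */
    uint64_t vm_end;     /* exclusive upper bound */
};

/* A vma covers addr iff vm_start <= addr < vm_end. find_vma() alone
 * only guarantees addr < vm_end; if addr falls in the gap below the
 * returned vma, the lower-bound check is what catches it. */
static int vma_contains(const struct vma_range *vma, uint64_t addr)
{
    return vma != NULL && vma->vm_start <= addr && addr < vma->vm_end;
}
```

Skipping the lower-bound check is exactly the bug Muli spotted in v2 of the patch: an address in a hole between vmas would have been treated as if it belonged to the next vma above it.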
[kvm-devel] [PATCH][RFC] Use pipe() to simulate signalfd()
This patch eliminates the use of sigtimedwait() in the IO thread. To avoid the signal/select race condition, we use a pipe that we write to in the signal handlers. This was suggested by Rusty and seems to work well.

There are a lot of cleanup/simplification opportunities with this but I've limited it just to the signal masking/eating routines. We've got at least one live lock left in the code that I haven't yet identified. My goal is to get this all a lot simpler so that it's easier to fix the remaining lock-ups.

I'm looking for some feedback that this is a sane direction. I haven't tested this enough yet so please don't apply it.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 9a9bf59..46d7425 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -7,6 +7,9 @@
  */
 #include "config.h"
 #include "config-host.h"
+#include "qemu-common.h"
+#include "block.h"
+#include "console.h"

 int kvm_allowed = 1;
 int kvm_irqchip = 1;
@@ -38,14 +41,6 @@ __thread struct vcpu_info *vcpu;

 static int qemu_system_ready;

-struct qemu_kvm_signal_table {
-    sigset_t sigset;
-    sigset_t negsigset;
-};
-
-static struct qemu_kvm_signal_table io_signal_table;
-static struct qemu_kvm_signal_table vcpu_signal_table;
-
 #define SIG_IPI (SIGRTMIN+4)

 struct vcpu_info {
@@ -169,53 +164,37 @@ static int has_work(CPUState *env)
     return kvm_arch_has_work(env);
 }

-static int kvm_process_signal(int si_signo)
-{
-    struct sigaction sa;
-
-    switch (si_signo) {
-    case SIGUSR2:
-        pthread_cond_signal(&qemu_aio_cond);
-        break;
-    case SIGALRM:
-    case SIGIO:
-        sigaction(si_signo, NULL, &sa);
-        sa.sa_handler(si_signo);
-        break;
-    }
-
-    return 1;
-}
-
-static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env,
-                          int timeout)
+static int kvm_eat_signal(CPUState *env, int timeout)
 {
     struct timespec ts;
     int r, e, ret = 0;
     siginfo_t siginfo;
+    sigset_t waitset;
+
+    sigemptyset(&waitset);
+    sigaddset(&waitset, SIG_IPI);

     ts.tv_sec = timeout / 1000;
     ts.tv_nsec = (timeout % 1000) * 1000000;
-    r = sigtimedwait(&waitset->sigset, &siginfo, &ts);
+    qemu_kvm_unlock();
+    r = sigtimedwait(&waitset, &siginfo, &ts);
+    qemu_kvm_lock(env);
+    cpu_single_env = env;
     if (r == -1 && (errno == EAGAIN || errno == EINTR) && !timeout)
         return 0;

     e = errno;
-    pthread_mutex_lock(&qemu_mutex);
     if (env && vcpu)
         cpu_single_env = vcpu->env;
     if (r == -1 && !(errno == EAGAIN || errno == EINTR)) {
         printf("sigtimedwait: %s\n", strerror(e));
         exit(1);
     }
-    if (r != -1)
-        ret = kvm_process_signal(siginfo.si_signo);
+    ret = 1;

     if (env && vcpu_info[env->cpu_index].stop) {
         vcpu_info[env->cpu_index].stop = 0;
         vcpu_info[env->cpu_index].stopped = 1;
         pthread_kill(io_thread, SIGUSR1);
     }
-    pthread_mutex_unlock(&qemu_mutex);

     return ret;
 }
@@ -224,24 +203,20 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env,
 static void kvm_eat_signals(CPUState *env, int timeout)
 {
     int r = 0;
-    struct qemu_kvm_signal_table *waitset = &vcpu_signal_table;

-    while (kvm_eat_signal(waitset, env, 0))
+    while (kvm_eat_signal(env, 0))
         r = 1;
     if (!r && timeout) {
-        r = kvm_eat_signal(waitset, env, timeout);
+        r = kvm_eat_signal(env, timeout);
         if (r)
-            while (kvm_eat_signal(waitset, env, 0))
+            while (kvm_eat_signal(env, 0))
                 ;
     }
 }

 static void kvm_main_loop_wait(CPUState *env, int timeout)
 {
-    pthread_mutex_unlock(&qemu_mutex);
     kvm_eat_signals(env, timeout);
-    pthread_mutex_lock(&qemu_mutex);
-    cpu_single_env = env;

     vcpu_info[env->cpu_index].signalled = 0;
 }
@@ -263,12 +238,8 @@ static void pause_all_threads(void)
         vcpu_info[i].stop = 1;
         pthread_kill(vcpu_info[i].thread, SIG_IPI);
     }
-    while (!all_threads_paused()) {
-        pthread_mutex_unlock(&qemu_mutex);
-        kvm_eat_signal(&io_signal_table, NULL, 1000);
-        pthread_mutex_lock(&qemu_mutex);
-        cpu_single_env = NULL;
-    }
+    while (!all_threads_paused())
+        main_loop_wait(10);
 }

 static void resume_all_threads(void)
@@ -391,18 +362,6 @@ static void *ap_main_loop(void *_env)
     return NULL;
 }

-static void qemu_kvm_init_signal_table(struct qemu_kvm_signal_table *sigtab)
-{
-    sigemptyset(&sigtab->sigset);
-    sigfillset(&sigtab->negsigset);
-}
-
-static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
-{
-    sigaddset(&sigtab->sigset, signum);
-    sigdelset(&sigtab->negsigset, signum);
-}
-
 void kvm_init_new_ap(int cpu, CPUState *env)
 {
     pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
@@ -411,28 +370,12 @@ void kvm_init_new_ap(int cpu, CPUState *env)
     pthread_cond_wait(&qemu_vcpu_cond, &qemu_mutex);
 }

-static void
Re: [kvm-devel] Protected mode transitions and big real mode... still an issue
Guillaume Thouvenin wrote: Hello, This patch should solve the problem observed during protected mode transitions that appears for example during the installation of openSuse-10.3. Unfortunately there is an issue that crashes kvm-userspace. I'm not sure if it's a problem introduced by the patch or if the patch is good and raises a new issue.

You still aren't emulating the instructions correctly I think. Running your patch, I see:

[  979.755349] Failed vm entry (exit reason 0x21) invalid guest state
[  979.755354] emulation at (46e4b) rip 6e0b: ea 10 6e 18
[  979.755358] successfully emulated instruction
[  979.756105] Failed vm entry (exit reason 0x21) invalid guest state
[  979.756109] emulation at (46e50) rip 6e10: 66 b8 20 00
[  979.756111] successfully emulated instruction
[  979.756749] Failed vm entry (exit reason 0x21) invalid guest state
[  979.756752] emulation at (46e54) rip 6e14: 8e d8 8c d0
[  979.756755] successfully emulated instruction
[  979.757427] Failed vm entry (exit reason 0x21) invalid guest state
[  979.757430] emulation at (46e56) rip 6e16: 8c d0 81 e4
[  979.757433] successfully emulated instruction
[  979.758074] Failed vm entry (exit reason 0x21) invalid guest state
[  979.758077] emulation at (46e58) rip 6e18: 81 e4 ff ff

The corresponding gfxboot code is:

16301 6E0B EA[106E]1800    jmp pm_seg.prog_c32:switch_to_pm_20
16302                  switch_to_pm_20:
16303
16304                      bits 32
16305
16306 6E10 66B82000        mov ax,pm_seg.prog_d16
16307 6E14 8ED8            mov ds,ax
16308
16309 6E16 8CD0            mov eax,ss
16310 6E18 81E4            and esp,0h

The VT state should be correct after executing the instruction at RIP 6E16 (mov eax,ss). The next instruction should not cause a vmentry failure. The fact that it does for you indicates that you're not updating guest state correctly. My guess would be that load_segment_descriptor is not updating the values within the VMCS.

Regards, Anthony Liguori - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event.
There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] Protected mode transitions and big real mode... still an issue
Laurent Vivier wrote: Le mardi 29 avril 2008 à 11:41 -0500, Anthony Liguori a écrit : Guillaume Thouvenin wrote: Hello, This patch should solve the problem observed during protected mode transitions that appears for example during the installation of openSuse-10.3. Unfortunately there is an issue that crashes kvm-userspace. I'm not sure if it's a problem introduced by the patch or if the patch is good and raises a new issue. You still aren't emulating the instructions correctly I think. Running your patch, I see: [ 979.755349] Failed vm entry (exit reason 0x21) invalid guest state [ 979.755354] emulation at (46e4b) rip 6e0b: ea 10 6e 18 [ 979.755358] successfully emulated instruction [ 979.756105] Failed vm entry (exit reason 0x21) invalid guest state [ 979.756109] emulation at (46e50) rip 6e10: 66 b8 20 00 [ 979.756111] successfully emulated instruction [ 979.756749] Failed vm entry (exit reason 0x21) invalid guest state [ 979.756752] emulation at (46e54) rip 6e14: 8e d8 8c d0 [ 979.756755] successfully emulated instruction [ 979.757427] Failed vm entry (exit reason 0x21) invalid guest state [ 979.757430] emulation at (46e56) rip 6e16: 8c d0 81 e4 [ 979.757433] successfully emulated instruction [ 979.758074] Failed vm entry (exit reason 0x21) invalid guest state [ 979.758077] emulation at (46e58) rip 6e18: 81 e4 ff ff The corresponding gfxboot code is: 16301 6E0B EA[106E]1800jmp pm_seg.prog_c32:switch_to_pm_20 16302 switch_to_pm_20: 16303 16304 bits 32 16305 16306 6E10 66B82000mov ax,pm_seg.prog_d16 16307 6E14 8ED8mov ds,ax 16308 16309 6E16 8CD0mov eax,ss 16310 6E18 81E4and esp,0h The VT state should be correct after executing instruction an RIP 6E16 (mov eax, ss). The next instruction should not cause a vmentry Are you sure ? It is intel notation (opcode dst,src) , so it updates eax, not ss. Guillaumes gives us (with gdb notation, opcode src,dst): You're right, it's a fair bit down the code before the ss move happens. 
Regards, Anthony Liguori 0x00046e53: ljmp $0x18,$0x6e18 0x00046e58: mov$0x20,%ax %EAX = 0x20 0x00046e5c: mov%eax,%ds %DS = 0x20 0x00046e5e: mov%ss,%eax %EAX = %SS = 0x53E1 (in this particular case) For me the issue is with instructions with dst.byte = 0. for instance: 0x00046e66: shl$0x4,%eax [82768.003174] emulation at (46e66) rip 6e26: c1 e0 04 01 [82768.035153] writeback: dst.byte 0 [82768.055174] writeback: dst.ptr 0x [82768.087177] writeback: dst.val 0x53e1 [82768.78] writeback: src.ptr 0x6e28 [82768.143157] writeback: src.val 0x4 So my questions are: Why dst.val is not 0x53e10 ? Why dst.byte is 0 ? failure. The fact that it is for you indicates that you're not updating guest state correctly. My guess would be that load_segment_descriptor is not updating the values within the VMCS. Regards, Anthony Liguori Regards Laurent - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
[kvm-devel] [PATCH] Handle vma regions with no backing page (v2)
This patch allows VMAs that contain no backing page to be used for guest memory. This is a drop-in replacement for Ben-Ami's first patch in his direct mmio series. Here, we continue to allow mmio pages to be represented in the rmap.

Since v1, I've taken into account Andrea's suggestion of using VM_PFNMAP instead of VM_IO and changed the BUG_ON to a return of bad_page.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d7991a..64e5efe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 	struct page *page[1];
 	unsigned long addr;
 	int npages;
+	pfn_t pfn;

 	might_sleep();

@@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 	npages = get_user_pages(current, current->mm, addr, 1, 1, 1, page,
 				NULL);
-	if (npages != 1) {
-		get_page(bad_page);
-		return page_to_pfn(bad_page);
-	}
+	if (unlikely(npages != 1)) {
+		struct vm_area_struct *vma;

-	return page_to_pfn(page[0]);
+		vma = find_vma(current->mm, addr);
+		if (vma == NULL || addr < vma->vm_start ||
+		    !(vma->vm_flags & VM_PFNMAP)) {
+			get_page(bad_page);
+			return page_to_pfn(bad_page);
+		}
+
+		pfn = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+		BUG_ON(pfn_valid(pfn));
+	} else
+		pfn = page_to_pfn(page[0]);
+
+	return pfn;
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);

 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 {
-	return pfn_to_page(gfn_to_pfn(kvm, gfn));
+	pfn_t pfn;
+
+	pfn = gfn_to_pfn(kvm, gfn);
+	if (pfn_valid(pfn))
+		return pfn_to_page(pfn);
+
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(gfn_to_page);

@@ -569,7 +586,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);

 void kvm_release_pfn_clean(pfn_t pfn)
 {
-	put_page(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

@@ -594,21 +612,25 @@ EXPORT_SYMBOL_GPL(kvm_set_page_dirty);

 void kvm_set_pfn_dirty(pfn_t pfn)
 {
-	struct page *page = pfn_to_page(pfn);
-	if (!PageReserved(page))
-		SetPageDirty(page);
+	if (pfn_valid(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+		if (!PageReserved(page))
+			SetPageDirty(page);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);

 void kvm_set_pfn_accessed(pfn_t pfn)
 {
-	mark_page_accessed(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);

 void kvm_get_pfn(pfn_t pfn)
 {
-	get_page(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		get_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_get_pfn);
Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v2)
Avi Kivity wrote: Anthony Liguori wrote: This patch allows VMAs that contain no backing page to be used for guest memory. This is a drop-in replacement for Ben-Ami's first patch in his direct mmio series. Here, we continue to allow mmio pages to be represented in the rmap.

I like this very much, as it only affects accessors and not the mmu core itself. Hollis/Xiantao/Carsten, can you confirm that this approach works for you? Carsten, I believe you don't have mmio, but at least this shouldn't interfere.

 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 {
-	return pfn_to_page(gfn_to_pfn(kvm, gfn));
+	pfn_t pfn;
+
+	pfn = gfn_to_pfn(kvm, gfn);
+	if (pfn_valid(pfn))
+		return pfn_to_page(pfn);
+
+	return NULL;
 }

You're returning NULL here, not bad_page.

My thinking was that bad_page indicates that the gfn is invalid. This is a different type of error though. The problem is that we are trying to kmap() a page that has no struct page associated with it. I'm not sure what the right thing to do here is. Perhaps we should be replacing consumers of gfn_to_page() with copy_to_user() instead?

Regards, Anthony Liguori
Re: [kvm-devel] [PATCH][RFC] Use pipe() to simulate signalfd()
Marcelo Tosatti wrote: Hi Anthony, How is -no-kvm-irqchip working with the patch? Seems to work fine. What is your expectation? On Tue, Apr 29, 2008 at 09:28:14AM -0500, Anthony Liguori wrote: This patch eliminates the use of sigtimedwait() in the IO thread. To avoid the signal/select race condition, we use a pipe that we write to in the signal handlers. This was suggested by Rusty and seems to work well. +static int kvm_eat_signal(CPUState *env, int timeout) { struct timespec ts; int r, e, ret = 0; siginfo_t siginfo; +sigset_t waitset; +sigemptyset(waitset); +sigaddset(waitset, SIG_IPI); ts.tv_sec = timeout / 1000; ts.tv_nsec = (timeout % 1000) * 100; -r = sigtimedwait(waitset-sigset, siginfo, ts); +qemu_kvm_unlock(); +r = sigtimedwait(waitset, siginfo, ts); +qemu_kvm_lock(env); +cpu_single_env = env; This assignment seems redundant now. Yeah, I have a bigger patch which eliminates all of the explicit assignments to cpu_single_env. @@ -263,12 +238,8 @@ static void pause_all_threads(void) vcpu_info[i].stop = 1; pthread_kill(vcpu_info[i].thread, SIG_IPI); Make sure the IO thread has SIG_IPI blocked (those are for APIC vcpu initialization only). Just so I'm clear, there's really no harm in not blocking SIG_IPI because it would just be ignored by the IO thread (since the SIG_IPI handler is a nop). But yeah, we should explicitly block it. +static void sig_aio_fd_read(void *opaque) +{ +int signum; +ssize_t len; + +do { +len = read(kvm_sigfd[0], signum, sizeof(signum)); +} while (len == -1 errno == EINTR); What is the reason for this loop instead of a straight read? Its alright to be interrupted by a signal. Just general habit with QEMU. +signal(SIGUSR1, sig_aio_handler); +signal(SIGUSR2, sig_aio_handler); +signal(SIGALRM, sig_aio_handler); +signal(SIGIO, sig_aio_handler); + +if (pipe(kvm_sigfd) == -1) +abort(); perror() would be nice. Yeah, everything needs proper error handling. 
-kvm_eat_signal(io_signal_table, NULL, 1000); pthread_mutex_lock(qemu_mutex); -cpu_single_env = NULL; -main_loop_wait(0); +main_loop_wait(10); Increase that 1000 or something. Will make it easier to spot bugs. I have actually and it does introduce some bugs. I'm not entirely clear what is causing them though. Regards, Anthony Liguori Similarly in qemu_kvm_aio_wait(). - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v2)
Avi Kivity wrote: It depends on what's going on? Does a page table point to mmio? Or the glommerclock? Not sure there is a single answer. Perhaps we should be replacing consumers of gfn_to_page() with copy_to_user() instead? Indeed we should. The problem is access in atomic contexts. It's easy to detect failure, but not always easy to handle it. So I think we should replace it with a rate limited printk and returning bad_page. That way the guest can't exploit it and we'll still hopefully get printk()s to track down instances of things going bad. Regards, Anthony Liguori - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v2)
Hollis Blanchard wrote: On Tuesday 29 April 2008 17:17:49 Avi Kivity wrote: I like this very much, as it only affects accessors and not the mmu core itself. Hollis/Xiantao/Carsten, can you confirm that this approach works for you? Carsten, I believe you don't have mmio, but at least this shouldn't interfere. OK, so the idea is to mmap /sys/bus/pci/.../region within the guest RAM area, and just include it within the normal guest RAM memslot? In the case of x86, since the PCI IO region is pretty far away from normal RAM, we'll probably use a different memory slot, but yes, that's the general idea. IIUC PPC correctly, all IO pages have corresponding struct pages. This means that get_user_pages() would succeed and you can reference count them? In this case, we would never take the VM_PFNMAP path. Is that correct? How will the IOMMU be programmed? Wouldn't you still need to register a special type of memslot for that? That's independent of this patchset. For non-aware guests, we'll have to pin all of physical memory up front and then create an IOMMU table from the pinned physical memory. For aware guests with a PV DMA window API, we'll be able to build that mapping on the fly (enforcing mlock allocation limits). Regards, Anthony Liguori - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH][RFC] Use pipe() to simulate signalfd()
Marcelo Tosatti wrote: Problem is if the IO thread _receives_ SIG_IPI instead of some vcpu thread. So there is potential harm in not blocking it.

Hrm, aren't SIG_IPIs delivered to a particular thread-id though? When would the IO thread receive a SIG_IPI?

What is the reason for this loop instead of a straight read? It's alright to be interrupted by a signal. Just general habit with QEMU. Please don't :-)

I don't see the harm. In fact, I think it's more correct. Otherwise, we have to wait for another invocation of the fd callback.

-    kvm_eat_signal(&io_signal_table, NULL, 1000);
     pthread_mutex_lock(&qemu_mutex);
-    cpu_single_env = NULL;
-    main_loop_wait(0);
+    main_loop_wait(10);

Increase that 1000 or something. Will make it easier to spot bugs.

I have actually and it does introduce some bugs. I'm not entirely clear what is causing them though.

Should indicate that some event previously delivered through signals and received by sigtimedwait is not waking up the IO thread.

I'll take a look and see. I'm having time keeping issues in the guest so it's hard to tell what problems are caused by the IO thread versus time.

Regards, Anthony Liguori
Re: [kvm-devel] [PATCH][RFC] Use pipe() to simulate signalfd()
Marcelo Tosatti wrote: Why? If you leave data in the pipe the next select() won't block. Isn't there the possibility that this loop can be stuck for significant amounts of time?

If you're receiving lots of notifications through signals. If you're getting that many signals, then select() is never going to run anyway.

-    kvm_eat_signal(&io_signal_table, NULL, 1000);
     pthread_mutex_lock(&qemu_mutex);
-    cpu_single_env = NULL;
-    main_loop_wait(0);
+    main_loop_wait(10);

Increase that 1000 or something. Will make it easier to spot bugs.

I have actually and it does introduce some bugs. I'm not entirely clear what is causing them though.

Should indicate that some event previously delivered through signals and received by sigtimedwait is not waking up the IO thread.

I'll take a look and see. I'm having time keeping issues in the guest so it's hard to tell what problems are caused by the IO thread versus time.

Time issues only with the patch? If not, please share details.

No, they're independent of the patch. The symptom is that the guest tends to hang during boot for prolonged periods of time. It tends to be random whether it will hang at all, how long it will hang, and at what point it will hang. It tends to hang more often in particular places. In my ubuntu server guest, for instance, it tends to hang right after partition probing, right after "Loading hardware drivers", and right before spawning a getty on /dev/tty0. Normally, it will only hang for a few seconds. However, with -smp N where N number of cpus in the host, it tends to happen more frequently. Using clock=jiffies in the guest makes the problem go away. -no-kvm-pit doesn't make a difference (the guest is normally using the pm-timer).

Regards, Anthony Liguori
Re: [kvm-devel] [PATCH][RFC] Use pipe() to simulate signalfd()
Marcelo Tosatti wrote: True. Either way, this starvation due to signals can't happen with the current scheme because signals are blocked. It seems a significant drawback. Practically speaking, I don't see signal starving being a problem. Part of the benefit of this approach is that we're reducing the overall number of signals received. With the in-kernel PIT, the number of userspace signals is even further reduced. Moving the signal handling + pipe write to a separate thread should get rid of it. Yeah, but then you just introduce buffering problems since if you're getting that many signals, the pipe will get full. No point in designing for something that isn't likely to happen in practice. Regards, Anthony Liguori No, they're independent of the patch. The symptom is that the guest tends to hang during boot for prolonged periods of time. It tends to be random whether it will hang at all, how long it will hang, and at what point it will hang. It tends to hang more often in particular places. In my ubuntu server guest, for instance, it tends to hang right after partition probing, right after Loading hardware drivers, and right before spawning a getty on /dev/tty0. Normally, it will only hang for a few seconds. However, with -smp N where N number of cpus in the host, it tends to happen more frequently. Using clock=jiffies in the guest makes the problem go away. I'll try to reproduce that, thanks. -no-kvm-pit doesn't make a difference (the guest is normally using the pm-timer). Regards, Anthony Liguori - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] [RFC] try to reduce kvm impact in core qemu code.
Glauber Costa wrote: Avi Kivity wrote: Glauber Costa wrote: Hi. This is a proposal for reducing the impact of kvm functions in core qemu code. This is by all means not ready, but I felt like posting it, so a discussion on it could follow. The idea in this patch is to replace the specific kvm details from core qemu files like vl.c with driver_yyy() functions. When kvm is not running, those functions would just return (most of the time), absolutely reducing the impact of kvm code. As I wanted to test it, in this patch I changed the kvm functions to be called driver_yyy(), but that's not my final goal. I intend to use a function pointer schema, similar to what the linux kernel already does for a lot of its subsystems, to isolate the changes. Comments deeply welcome.

While I would be very annoyed if someone referred to kvm as a qemu accelerator, I think accelerator_yyy() is more descriptive than driver_yyy().

How about booster? ;-)

I don't think the concern from a QEMU perspective is that QEMU is too intimately tied to KVM. The concern is that over time, it will be very difficult to make changes to QEMU without breaking KVM support because of the sheer number of hooks we require. Fabrice had actually suggested merging libkvm into QEMU. We just need to reduce the overall number of if (kvm_enabled()) blocks. You have to understand a lot about KVM to know why it's necessary to do a register reload call-out in the vmport hw for instance.

Instead of introducing generic hooks in vmport, a more appropriate solution may be to add wrappers to read individual register values within the QEMU device code. We can then add our hooks to that. I think a good argument can be made that devices should never access env-> directly.

Regards, Anthony Liguori
[kvm-devel] fork() within a VM with MMU notifiers
Here's my thinking as to why we don't want to destroy the VM in the mmu notifiers' ->release method. I don't have a valid use-case for this but my argument depends on the fact that this is something that should work. Daemonizing a running VM may be a reasonable use-case. It's useful to wait to daemonize until you are sure that everything is working correctly so it's not all that unreasonable to move the daemonize until after the VCPUs have been launched.

If you take a running VM, and pause all of the VCPUs, and then issue a fork() followed by an immediate exit() in the parent process, the child process should be able to unpause all the VCPUs and the guest should continue running uninterrupted.

From KVM's perspective, issuing the fork() will increment the reference count of the file descriptor for the VM but otherwise, no real change should happen. The issue would now be that we must completely flush the shadow page table cache. In theory, MMU notifiers should do this for us.

When the parent process exits, this will result in exit_mmap() and will destroy the KVM guest. This leaves the child process with a file descriptor that refers to a VM that is no longer valid. Just avoiding destroying the VM in the ->release() method won't fix this use-case I don't think.

In general, I think we need to think a little more about how fork() is handled with respect to mmu notifiers.

Regards, Anthony Liguori
Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset
Ryan Harper wrote: * Avi Kivity [EMAIL PROTECTED] [2008-04-26 02:23]: Please reuse qemu_mutex for this, no need for a new one.

I'm having a little trouble wrapping my head around all of the locking here. If I avoid qemu_mutex and use a new one, I've got everything working. However, attempting to use qemu_mutex is stepping into a pile of locking crap. I'm sure there is a good reason... The current code looks like this:

Thread1:
  main()
    kvm_qemu_init()              // mutex_lock()
    machine->init()
      pc_init1()
        pc_new_cpu()
          cpu_init()
            cpu_x86_init()
              kvm_init_new_ap()  // create vcpu Thread2
    kvm_main_loop()              // mutex_unlock()

Thread2:
  ap_main_loop()                 /* vcpu init */
    kvm_main_loop_cpu()
      kvm_main_loop_wait()       // mutex_unlock() on enter, lock on exit
        kvm_eat_signals()        // mutex_lock() on enter, unlock on exit

The qemu_mutex is meant to ensure that the QEMU code is only ever entered by one thread at a time. QEMU is not thread-safe so this is necessary. It's a little odd because we're taking this lock initially by being called from QEMU code.

Here's the basic theory of locking AFAICT: kvm_main_loop_cpu() is the main loop for each VCPU. It must acquire the QEMU mutex since it will call into normal QEMU code to process events. Whenever a VCPU is allowed to run, or when the VCPU is idling, it needs to release the QEMU mutex. The former is done via the post_kvm_run() and pre_kvm_run() hooks. The latter is done within kvm_main_loop_wait().

kvm_main_loop_wait() will release the QEMU mutex and call kvm_eat_signals() which calls kvm_eat_signal() which will issue a sigtimedwait(). This is where we actually idle (and why SIGIO is so important right now). We don't want to idle with the QEMU mutex held as that may result in deadlock so this is why we release it here. kvm_eat_signal() has to acquire the lock again in order to dispatch IO events (via kvm_process_signal()).

Regards, Anthony Liguori

It wedges up in kvm_init_new_ap() if I attempt to acquire qemu_mutex.
Quite obvious after I looked at the call trace and discovered kvm_qemu_init() locking on exit. I see various other functions that unlock and then lock; I really don't want to wade into this mess... rather whoever cooked it up should do some cleanup. I tried the unlock, then re-lock on exit in kvm_init_new_ap() but that also wedged. Here is a rework with a new flag in vcpu_info indicating vcpu creation. Tested this with 64 1-VCPU guests booting with 1 second delay, and a single 16-way SMP guest boot.
[kvm-devel] [PATCH] Don't race while creating a VCPU
We hold qemu_mutex while machine->init() executes, which issues a VCPU create. We need to make sure to not return from the VCPU creation until the VCPU file descriptor is valid to ensure that APIC creation succeeds. However, we also need to make sure that the VCPU thread doesn't start running until machine->init() is complete. This is addressed today because the VCPU thread tries to grab the qemu_mutex before doing anything interesting. If we release qemu_mutex to wait for VCPU creation, then we open a window for a race to occur.

This patch introduces two wait conditions. The first lets the VCPU create code that runs in the IO thread wait for a VCPU to initialize. The second condition lets the VCPU thread wait for the machine to fully initialize before running. An added benefit of this patch is it makes the dependencies explicit.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 78127de..9a9bf59 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -32,8 +32,12 @@ static int qemu_kvm_reset_requested;

 pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
 pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER;
+pthread_cond_t qemu_vcpu_cond = PTHREAD_COND_INITIALIZER;
+pthread_cond_t qemu_system_cond = PTHREAD_COND_INITIALIZER;
 __thread struct vcpu_info *vcpu;

+static int qemu_system_ready;
+
 struct qemu_kvm_signal_table {
     sigset_t sigset;
     sigset_t negsigset;
@@ -53,6 +57,7 @@ struct vcpu_info {
     int stop;
     int stopped;
     int reload_regs;
+    int created;
 } vcpu_info[256];

 pthread_t io_thread;
@@ -324,6 +329,7 @@ static int kvm_main_loop_cpu(CPUState *env)
     struct vcpu_info *info = &vcpu_info[env->cpu_index];

     setup_kernel_sigmask(env);
+
     pthread_mutex_lock(&qemu_mutex);
     if (kvm_irqchip_in_kernel(kvm_context))
         env->hflags &= ~HF_HALTED_MASK;
@@ -370,6 +376,17 @@ static void *ap_main_loop(void *_env)
     sigprocmask(SIG_BLOCK, &signals, NULL);
     kvm_create_vcpu(kvm_context, env->cpu_index);
     kvm_qemu_init_env(env);
+
+    /* signal VCPU creation */
+    pthread_mutex_lock(&qemu_mutex);
+    vcpu->created = 1;
+    pthread_cond_signal(&qemu_vcpu_cond);
+
+    /* and wait for machine initialization */
+    while (!qemu_system_ready)
+        pthread_cond_wait(&qemu_system_cond, &qemu_mutex);
+    pthread_mutex_unlock(&qemu_mutex);
+
     kvm_main_loop_cpu(env);
     return NULL;
 }
@@ -389,8 +406,9 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
 void kvm_init_new_ap(int cpu, CPUState *env)
 {
     pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
-    /* FIXME: wait for thread to spin up */
-    usleep(200);
+
+    while (vcpu_info[cpu].created == 0)
+        pthread_cond_wait(&qemu_vcpu_cond, &qemu_mutex);
 }

 static void qemu_kvm_init_signal_tables(void)
@@ -435,7 +453,11 @@ void qemu_kvm_notify_work(void)
 int kvm_main_loop(void)
 {
     io_thread = pthread_self();
+    qemu_system_ready = 1;
     pthread_mutex_unlock(&qemu_mutex);
+
+    pthread_cond_broadcast(&qemu_system_cond);
+
     while (1) {
         kvm_eat_signal(&io_signal_table, NULL, 1000);
         pthread_mutex_lock(&qemu_mutex);
Re: [kvm-devel] Real Mode Improvement on Intel Hosts - GSoC Project
Avi Kivity wrote: Anthony Liguori wrote: The second stage is to use a loop of x86_emulate() to run all 16-bit code (instead of using vm86 mode). This will allow us to support guests that use big real mode. Why do that unconditionally, instead of only when in a big-real-mode state? It can be. It's probably easier from a development perspective to make it unconditional and then to flush out all the necessary instructions. Regards, Anthony Liguori - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset
Ulrich Drepper wrote: Ryan Harper wrote:

@@ -388,9 +395,10 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
 void kvm_init_new_ap(int cpu, CPUState *env)
 {
+    pthread_mutex_lock(&vcpu_mutex);
     pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
-    /* FIXME: wait for thread to spin up */
-    usleep(200);
+    pthread_cond_wait(&qemu_vcpuup_cond, &vcpu_mutex);
+    pthread_mutex_unlock(&vcpu_mutex);

And something is very wrong here. The pattern for using a condvar is

  1. take mutex
  2. check condition
  3. if condition is not fulfilled
     3a. call cond_wait
     3b. when returning, go back to step 2
  4. unlock mutex

Anything else is buggy. So, either your condvar use is wrong or you don't really want a condvar in the first place. I haven't checked the code.

A flag is needed in the vcpu structure that indicates whether the vcpu spun up or not. This is what the while () condition should be. This is needed as the thread may spin up before it gets to the pthread_cond_wait(), in which case the signal happens when no one is waiting on it. The other reason a while () is usually needed is that cond_signal may not wake up the right thread, so it's necessary to check whether you really have something to do. Not really a problem here, but the former race is. A condvar is definitely the right thing to use here.

Regards, Anthony Liguori

--
Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA
Re: [kvm-devel] mmu notifier #v14
-	if (kvm->arch.apic_access_page)
-		put_page(kvm->arch.apic_access_page);
+	/*
+	 * kvm_mmu_notifier_release() will be called before
+	 * mmu_notifier_unregister returns, if it didn't run
+	 * already.
+	 */
+	mmu_notifier_unregister(&kvm->arch.mmu_notifier, kvm->mm);
 	kfree(kvm);
 }

Why move the destruction of the vm to the MMU notifier unregister hook? Does anything else ever call mmu_notifier_unregister that would implicitly destroy the VM?

Regards, Anthony Liguori
Re: [kvm-devel] mmu notifier #v14
Andrea Arcangeli wrote: On Sat, Apr 26, 2008 at 01:59:23PM -0500, Anthony Liguori wrote:

+static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
+{
+	struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+	get_page(page);

You should not assume a struct page exists for any given spte. Instead, use kvm_get_pfn() and kvm_release_pfn_clean().

Last email from [EMAIL PROTECTED] in my inbox argues it's useless to build rmap on mmio regions, so the above is more efficient: put_page runs directly on the page without going back and forth between spte -> pfn -> page -> pfn -> page in a single function.

Avi can correct me if I'm wrong, but I don't think the consensus of that discussion was that we're going to avoid putting mmio pages in the rmap. Practically speaking, replacing:

+	struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+	get_page(page);

With:

	unsigned long pfn = (*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
	kvm_get_pfn(pfn);

Results in exactly the same code, except the latter allows mmio pfns in the rmap. So ignoring the whole mmio thing, using accessors that are already there and used elsewhere seems like a good idea :-) Certainly if we start building rmap on mmio regions we'll have to change that.

Perhaps I just have a weak stomach, but I am uneasy having a function that takes a lock on exit. I walked through the logic and it doesn't appear to be wrong, but it also is pretty clear that you could defer the acquisition of the lock to the caller (in this case, kvm_mmu_pte_write) by moving the update_pte assignment into kvm_mmu_pte_write.

I agree out_lock is an uncommon exit path; the problem is that the code was buggy, and I tried to fix it with the smallest possible change, and that resulted in an out_lock.
That section likely needs a refactoring; all those update_pte fields should at least be returned by the function, I guess, but I tried to reduce the changes to make the issue more readable. I didn't want to rewrite certain functions just to take a spinlock a few instructions ahead.

I appreciate the desire to minimize changes, but taking a lock on return seems to take that to a bit of an extreme. It seems like a simple thing to fix though, no?

Why move the destruction of the vm to the MMU notifier unregister hook? Does anything else ever call mmu_notifier_unregister that would implicitly destroy the VM?

mmu notifier ->release can run at any time before the filehandle is closed. ->release has to zap all sptes and freeze the mmu (hence all vcpus) to prevent any further page fault. After ->release returns, all pages are freed (we'll never rely on the page pin to avoid the rmap_remove put_page to be a relevant unpin event). So the idea is that I wanted to maintain the same ordering of the current code in the vm destroy event; I didn't want to leave a partially shutdown VM on the vmlist. If the ordering is entirely irrelevant and kvm_arch_destroy_vm can run well before kvm_destroy_vm is called, then I can avoid changes to kvm_main.c, but I doubt it. I've done it in a way that archs not needing mmu notifiers, like s390, can simply add the kvm_destroy_common_vm at the top of their kvm_arch_destroy_vm. All others using mmu_notifiers have to invoke kvm_destroy_common_vm in the ->release of the mmu notifiers. This will ensure that everything will be ok regardless of whether exit_mmap is called before/after exit_files, and it won't make a whole lot of difference anymore if the driver fd is pinned through vmas->vm_file released in exit_mmap or through the task file descriptors released in exit_files etc...
In fact this allows calling mmu_notifier_unregister at any time later, after the task has already been killed, without any trouble (like if the mmu notifier owner isn't registering in current->mm but some other task's mm).

I see. It seems a little strange to me, as a KVM guest isn't really tied to the current mm. It seems like the net effect of this is that we are now tying a KVM guest to an mm. For instance, if you create a guest but didn't assign any memory to it, you could transfer the fd to another process and then close the fd (without destroying the guest). The other process then could assign memory to it and presumably run the guest. With your change, as soon as the first process exits, the guest will be destroyed. I'm not sure this behavioral difference really matters, but it is a behavioral difference.

Regards, Anthony Liguori
Re: [kvm-devel] Real Mode Improvement on Intel Hosts - GSoC Project
Mohammed Gamal wrote: Hi, My Project proposal "Improving and Stabilizing Real-Mode Support for Intel Hosts" has been accepted into Google Summer of Code under the Linux Foundation. You may have a look at the proposal abstract here: http://code.google.com/soc/2008/linux/appinfo.html?csaid=1CC1C8B4CCC1120E . Any pointers on where to start, what would you like to see done, and any other comments and suggestions would greatly be appreciated.

We have a two-stage plan to address real mode on Intel systems. Both stages involve using x86_emulate() to emulate 16-bit (and 32-bit) instructions that VT cannot handle.

The first stage is to detect vmentry failures and run x86_emulate() for a single instruction. If you look at the mailing list, you'll see patches from myself and Guillaume. This should be enough to allow most Ubuntu installer CDs to work under KVM. In this case, the CDs are using a version of GFXBOOT that doesn't use big real mode, but still relies on an undefined architectural aspect that VT doesn't support.

The second stage is to use a loop of x86_emulate() to run all 16-bit code (instead of using vm86 mode). This will allow us to support guests that use big real mode.

The best place to start is probably to try out some of the patches on the list, and get familiar with the GFXBOOT assembly routines. There's a #kvm on FreeNode; that's a good place to start if you're having trouble getting started.

Regards, Anthony Liguori

Regards, Mohammed
Re: [kvm-devel] [PATCH 0/2] Batch writes to MMIO
Laurent Vivier wrote: On Wednesday, 23 April 2008 at 19:25 +0300, Avi Kivity wrote: Laurent Vivier wrote: On Wednesday, 23 April 2008 at 10:10 -0500, Anthony Liguori wrote: [...]

The ne2k is pretty mmio heavy. You should be able to observe a boost with something like iperf (guest=host) I would think, if this is a real savings.

I like your advice :-D I used iperf with e1000 emulation and a slightly modified patch (to detect MMIO writes in a loop); the server is on the host, the client on the guest, with default values.

RESULT WITHOUT BATCHING:
[ 4] 0.0-10.0 sec  235 MBytes  197 Mbits/sec
[ 5] 0.0-10.0 sec  194 MBytes  163 Mbits/sec
[ 4] 0.0-10.0 sec  185 MBytes  155 Mbits/sec
[ 5] 0.0-10.0 sec  227 MBytes  190 Mbits/sec
[ 4] 0.0-10.0 sec  196 MBytes  164 Mbits/sec
[ 5] 0.0-10.0 sec  194 MBytes  163 Mbits/sec
[ 4] 0.0-10.0 sec  184 MBytes  154 Mbits/sec

RESULT WITH BATCHING:
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
[ 4] 0.0-10.0 sec  357 MBytes  299 Mbits/sec
[ 5] 0.0-10.1 sec  418 MBytes  347 Mbits/sec
[ 4] 0.0-10.0 sec  408 MBytes  342 Mbits/sec
[ 5] 0.0-10.0 sec  422 MBytes  353 Mbits/sec
[ 4] 0.0-10.1 sec  436 MBytes  362 Mbits/sec
[ 5] 0.0-10.0 sec  416 MBytes  348 Mbits/sec
[ 4] 0.0-10.0 sec  431 MBytes  361 Mbits/sec

Well, it's nice?

It's too good to be true. I think we're seeing two bugs cancel each other out, resulting in a performance gain. Linux doesn't know how to queue outgoing packets, so it bangs on the mmio that starts the transmit after every packet. mmio batching doesn't know that this mmio register is critical for latency, so it queues it up. The result is that you get not just mmio batching, but also packet batching! Which dramatically improves performance at the expense of latency.

How can I check that? How can I measure latency?

ping (from guest to host)

Regards, Anthony Liguori

Perhaps I can swap server and client between guest and host?
Sorry (if it's true :) Thank you for your help, Laurent
Re: [kvm-devel] [PATCH 1/2] kvm: Batch writes to MMIO
Laurent Vivier wrote: This patch is the kernel part of the "batch writes to MMIO" patch. When the kernel has to send MMIO writes to userspace, it stores them in memory until it has to pass the hand to userspace for another reason. This avoids having too many context switches on operations that can wait.

WARNING: this breaks compatibility with old userspace part.

Signed-off-by: Laurent Vivier [EMAIL PROTECTED]
---
 arch/x86/kvm/x86.c         | 21 +++++++++++++++++++++
 include/asm-x86/kvm_host.h |  2 ++
 include/linux/kvm.h        | 10 +++++++++-
 virt/kvm/kvm_main.c        |  3 +++
 4 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0ce5563..3881056 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2942,8 +2942,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		kvm_x86_ops->decache_regs(vcpu);
 	}

+batch:
 	r = __vcpu_run(vcpu, kvm_run);

+	if (!r && vcpu->mmio_is_write &&
+	    kvm_run->exit_reason == KVM_EXIT_MMIO &&
+	    kvm_run->batch_count < KVM_MAX_BATCH) {
+		struct kvm_batch *batch = vcpu->arch.batch_data;
+		int i = kvm_run->batch_count++;
+
+		batch[i].phys_addr = vcpu->mmio_phys_addr;
+		batch[i].len = vcpu->mmio_size;
+		memcpy(batch[i].data, vcpu->mmio_data, batch[i].len);
+
+		goto batch;
+	}

I wonder if this is sufficient for dynticks-enabled guests. __vcpu_run could last a very long time.

Ignoring that for a moment, Avi's comment about having userspace tell you which addresses to batch is an important one. MMIO writes may have side effects, and the next vcpu_run may rely on those side effects. For instance, MMIO-based IRQ acknowledgement. You need a white list not only for performance purposes but also for correctness.

Regards, Anthony Liguori
Re: [kvm-devel] [PATCH] [RESEND] Make make sync in kernel dir work for multiple archs
diff --git a/kernel/Makefile b/kernel/Makefile
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -1,5 +1,10 @@
 include ../config.mak

+ARCH_DIR=$(ARCH)
+ifneq '$(filter $(ARCH_DIR), x86_64 i386)' ''
+    ARCH_DIR=x86
+endif
+

This sets ARCH_DIR to x86

 KVERREL = $(patsubst /lib/modules/%/build,%,$(KERNELDIR))

 DESTDIR=
@@ -18,11 +23,25 @@
 _hack = mv $1 $1.orig && \
 	sed '/\#include/! s/\blapic\b/l_apic/g' $1.orig > $1 && rm $1.orig

 _unifdef = mv $1 $1.orig && \
-	unifdef -DCONFIG_X86 $1.orig > $1; \
+	unifdef -DCONFIG_$(shell echo $(ARCH_DIR)|tr '[:lower:]' '[:upper:]') $1.orig > $1; \
 	[ $$? -le 1 ] && rm $1.orig

 hack = $(call _hack,$T/$(strip $1))
 unifdef = $(call _unifdef,$T/$(strip $1))
+
+ifneq '$(filter $(ARCH_DIR), x86_64 i386)' ''

And I'm not quite sure what this is supposed to do, but it doesn't look right. ARCH_DIR is going to equal 'x86', so this is going to break things on x86 since the filter will return an empty string.

Regards, Anthony Liguori

+UNIFDEF_FILES = include/linux/kvm.h \
+	include/linux/kvm_para.h \
+	include/asm-$(ARCH_DIR)/kvm.h \
+	include/asm-x86/kvm_para.h
+
+HACK_FILES = kvm_main.c \
+	mmu.c \
+	vmx.c \
+	svm.c \
+	x86.c \
+	irq.h
+endif

 all::
 	#include header priority 1) $LINUX 2) $KERNELDIR 3) include-compat
@@ -39,17 +58,16 @@ header-sync:
 	rm -rf $T
 	rsync -R \
 		$(LINUX)/./include/linux/kvm*.h \
-		$(LINUX)/./include/asm-x86/kvm*.h \
-		$T/
-	mkdir -p include/linux include/asm-x86
-	ln -sf asm-x86 include/asm
-	ln -sf asm-x86 include-compat/asm
+		$(LINUX)/./include/asm-$(ARCH_DIR)/kvm*.h \
+		$T/

-	$(call unifdef, include/linux/kvm.h)
-	$(call unifdef, include/linux/kvm_para.h)
-	$(call unifdef, include/asm-x86/kvm.h)
-	$(call unifdef, include/asm-x86/kvm_para.h)
-	$(call hack, include/linux/kvm.h)
+	mkdir -p include/linux include/asm-$(ARCH_DIR)
+	ln -s asm-$(ARCH_DIR) include/asm
+	ln -sf asm-$(ARCH_DIR) include-compat/asm
+
+	for i in $(UNIFDEF_FILES); \
+		do $(call unifdef, $$i); done
+
	for i in $$(find $T -type f -printf '%P '); \
		do cmp -s $$i
 $T/$$i || cp $T/$$i $$i; done
 	rm -rf $T
@@ -57,15 +75,13 @@ source-sync:
 source-sync:
 	rm -rf $T
 	rsync --exclude='*.mod.c' -R \
-		$(LINUX)/arch/x86/kvm/./*.[ch] \
-		$(LINUX)/virt/kvm/./*.[ch] \
-		$T/
-	$(call hack, kvm_main.c)
-	$(call hack, mmu.c)
-	$(call hack, vmx.c)
-	$(call hack, svm.c)
-	$(call hack, x86.c)
-	$(call hack, irq.h)
+		$(LINUX)/arch/$(ARCH_DIR)/kvm/./*.[ch] \
+		$(LINUX)/virt/kvm/./*.[ch] \
+		$T/
+
+	for i in $(HACK_FILES); \
+		do $(call hack, $$i); done
+
 	for i in $$(find $T -type f -printf '%P '); \
 		do cmp -s $$i $T/$$i || cp $T/$$i $$i; done
 	rm -rf $T
Re: [kvm-devel] What kernel options do I need to properly enable virtio net driver
Jerone Young wrote: What I am asking is: do I have all the proper options in my kernel config set to use it?

Yes. You just need CONFIG_VIRTIO_NET and CONFIG_VIRTIO_PCI. The remaining options will be automatically selected.

Regards, Anthony Liguori

On Mon, 2008-04-21 at 17:13 -0500, Anthony Liguori wrote: Jerone Young wrote: virtio net device does not appear to show itself in the guest. I'm curious what options I may be missing. Here is my config

You'll have to be more specific about what "does not appear to show itself" means. What's the output of lspci?

Regards, Anthony Liguori
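For reference, a guest .config fragment along those lines might look like the following. This is a sketch: on kernels of that era, CONFIG_VIRTIO and CONFIG_VIRTIO_RING are pulled in automatically by selecting the PCI transport, and exact option names or select behavior may vary by kernel version.

```
# virtio guest drivers for a KVM guest
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_NET=y
# selected automatically by the above:
CONFIG_VIRTIO=y
CONFIG_VIRTIO_RING=y
```

With these built in (or as modules that are loaded), the virtio NIC should show up in lspci as a Qumranet PCI device and register an eth interface.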