Re: [Qemu-devel] [PATCH V5 16/18] virtio-pci: increase the maximum number of virtqueues to 513
On Tue, Apr 7, 2015 at 6:54 PM, Alexander Graf ag...@suse.de wrote: On 04/01/2015 10:15 AM, Jason Wang wrote: This patch increases the maximum number of virtqueues for pci from 64 to 513. This will allow booting a virtio-net-pci device with 256 queue pairs. ...

  * configuration space */
 #define VIRTIO_PCI_CONFIG_SIZE(dev) VIRTIO_PCI_CONFIG_OFF(msix_enabled(dev))
-#define VIRTIO_PCI_QUEUE_MAX 64
+#define VIRTIO_PCI_QUEUE_MAX 513

513 is an interesting number. Any particular reason for it? Maybe this was mentioned before and I just missed it ;)

It seems quite large, too. I thought multiple queue pairs were useful to split the load for multicore machines, but targeting VMs with up to 256 cores (and presumably an equal number in the host) seems really forward looking. On the other hand, the message is dated April first... cheers luigi
Re: [Qemu-devel] [PATCH] net: Disable netmap backend when not supported
On Thu, Feb 20, 2014 at 10:49:52AM +0100, Stefan Hajnoczi wrote: On Wed, Feb 19, 2014 at 04:57:28PM +0100, Vincenzo Maffione wrote: ... Please mention that in the commit description. (I guess it's too late to give them a NETMAP_* prefix since defining D() and RD() in a system header has a fair chance of causing namespace conflicts.) You are right that the naming is unfortunate, and we will try to address it at a later time by removing the macros. As partial mitigation of the problem, they are intended only for debugging (so they should only be used by lazy programmers; applications should prefer application-specific logging functions), and D(), RD() and ND() are only defined when NETMAP_WITH_LIBS is defined, which should happen only in the single file that implements the netmap backend. cheers luigi
Re: [Qemu-devel] [PATCH] net: Disable netmap backend when not supported
On Wed, Feb 19, 2014 at 7:30 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Feb 14, 2014 at 05:40:24PM +0100, Vincenzo Maffione wrote: This patch fixes configure so that netmap is not compiled in if the host doesn't support an API version >= 11. Moreover, some modifications have been done to net/netmap.c in order to reflect the current netmap API (11). Signed-off-by: Vincenzo Maffione v.maffi...@gmail.com
---
 configure    |  3 +++
 net/netmap.c | 57 ++---
 2 files changed, 17 insertions(+), 43 deletions(-)

diff --git a/configure b/configure
index 88133a1..61eb932 100755
--- a/configure
+++ b/configure
@@ -2118,6 +2118,9 @@ if test $netmap != no ; then
 #include <net/if.h>
 #include <net/netmap.h>
 #include <net/netmap_user.h>
+#if (NETMAP_API < 11) || (NETMAP_API > 15)
+#error
+#endif

Why error when NETMAP_API > 15?

This is meant to simulate a minor/major version number. We will mark minor new features with values up to 15, and if something happens that requires a change in the backend we will move above 15, at which point we will submit backend fixes and also the necessary update to ./configure.

-ring->cur = NETMAP_RING_NEXT(ring, i);
-ring->avail--;
+ring->cur = ring->head = nm_ring_next(ring, i);
 ioctl(s->me.fd, NIOCTXSYNC, NULL);
 return size;

Are these changes related to the NETMAP_WITH_LIBS macro? Please do that in a separate patch so we keep the version checking change separate from the NETMAP_WITH_LIBS change.

netmap version 11 and above has NETMAP_WITH_LIBS, while previous versions do not, so this ./configure change has to go together with the change in the backend. The netmap 11 code has already been committed to the FreeBSD source repositories (for HEAD, 10 and 9) and to code.google.com/p/netmap/ (for those who want it on Linux). So there is really no point, in my opinion, to make one intermediate commit just to ./configure to disable netmap detection on FreeBSD (it is already off on Linux), immediately followed by this one that uses the new feature.
Just to clarify: with one exception (fields in struct netmap_ring) the netmap changes that we have are not at the kernel/user boundary but in a library which avoids replicating long and boring code (interface name parsing, parameter setting) in applications. Avoiding the single incompatible struct change would of course have been possible, but at the cost of extra complexity in the kernel and in userspace (to support two slightly different interfaces). Since netmap is (so far) relatively little used, I thought it was more important to fix the API and keep it simple. cheers luigi
-+---
Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione
http://www.iet.unipi.it/~luigi/ . Universita` di Pisa
TEL +39-050-2211611 . via Diotisalvi 2
Mobile +39-338-6809875 . 56122 PISA (Italy)
-+---
Re: [Qemu-devel] net: Adding netmap network backend
On Fri, Feb 14, 2014 at 2:20 AM, Vincenzo Maffione v.maffi...@gmail.com wrote: Yes, for sure we need to do a check. However, this would involve - I think - some non-trivial modifications to net/netmap.c, because without NS_MOREFRAG you cannot split a packet over more netmap slots/descriptors (both tx and rx side). Therefore I would ask (mainly Luigi, since netmap is in-tree in FreeBSD): wouldn't it be better to force --disable-netmap when we realize that NETMAP_API (the version number for the netmap API) is less than the value required by QEMU? This can be done in ./configure. In this way we keep the QEMU code cleaner. Yes, we should do exactly what Vincenzo suggests. cheers luigi
Re: [Qemu-devel] [PATCH] net: QEMU_NET_PACKET_FLAG_MORE introduced
On Mon, Dec 9, 2013 at 3:02 PM, Michael S. Tsirkin m...@redhat.com wrote: On Mon, Dec 09, 2013 at 01:36:54PM +0100, Stefan Hajnoczi wrote: On Fri, Dec 06, 2013 at 03:44:33PM +0100, Vincenzo Maffione wrote: - This patch is against the net-next tree (https://github.com/stefanha/qemu.git) because the first netmap patch is not in the qemu master (AFAIK). You are right. I am sending a pull request now to get those patches into qemu.git/master.

This only arrived over the weekend and affects all net devices. What's the rush? Why not give people a chance to review and discuss properly?

As I understand, the pull request is for the netmap backend, not for this batching patch. cheers luigi

 hw/net/cadence_gem.c    |  3 ++-
 hw/net/dp8393x.c        |  5 +++--
 hw/net/e1000.c          | 21 -
 hw/net/eepro100.c       |  5 +++--
 hw/net/etraxfs_eth.c    |  5 +++--
 hw/net/lan9118.c        |  2 +-
 hw/net/mcf_fec.c        |  5 +++--
 hw/net/mipsnet.c        |  6 --
 hw/net/ne2000.c         |  5 +++--
 hw/net/ne2000.h         |  3 ++-
 hw/net/opencores_eth.c  |  2 +-
 hw/net/pcnet.c          |  8 +---
 hw/net/pcnet.h          |  3 ++-
 hw/net/rtl8139.c        |  7 ---
 hw/net/smc91c111.c      |  5 +++--
 hw/net/spapr_llan.c     |  2 +-
 hw/net/stellaris_enet.c |  3 ++-
 hw/net/virtio-net.c     | 10 --
 hw/net/vmxnet3.c        |  3 ++-
 hw/net/vmxnet_tx_pkt.c  |  4 ++--
 hw/net/xgmac.c          |  2 +-
 hw/net/xilinx_axienet.c |  2 +-
 hw/usb/dev-network.c    |  8 +---
 include/net/net.h       | 20 +---
 include/net/queue.h     |  1 +
 net/dump.c              |  3 ++-
 net/hub.c               | 10 ++
 net/net.c               | 39 +++
 net/netmap.c            | 17 -
 net/slirp.c             |  5 +++--
 net/socket.c            | 10 ++
 net/tap-win32.c         |  2 +-
 net/tap.c               | 12 +++-
 net/vde.c               |  5 +++--
 savevm.c                |  2 +-
 35 files changed, 155 insertions(+), 90 deletions(-)

Please split this into multiple patches: 1. net subsystem API change that touches all files (if necessary) 2. e1000 MORE support 3. virtio-net MORE support 4. netmap MORE support. This makes it easier to review and bisect. Thanks, Stefan
Re: [Qemu-devel] commit b1bbfe72 causes huge slowdown with no kvm
On Wed, Nov 20, 2013 at 10:41:22AM +0100, Paolo Bonzini wrote: Il 20/11/2013 00:00, Luigi Rizzo ha scritto: I recently found out that without kvm enabled, and especially with -smp 2 or greater, qemu becomes incredibly slow (to the point that you can see kernel messages in the guest print one character at a time). This happens with a Linux host (even with -smp 1) and with a FreeBSD host (in this case -smp 2 or greater; -smp 1 seems to work fine there). Here is my test case that shows the problem:

HOW TO REPRODUCE: boot a (small) FreeBSD image below: http://info.iet.unipi.it/~luigi/20131119-picobsd.bin with this command (note no kvm; also I disconnect the nic so I can generate traffic without causing trouble):

x86_64-softmmu/qemu-system-x86_64 -m 1024 -hda 20131119-picobsd.bin -curses -monitor tcp::,server,nowait -net nic,model=e1000 -smp 2

The image is readonly so you can kill the qemu without problems.

SYMPTOMS: as soon as the kernel starts its timer services (there is an active kernel thread woken up every millisecond) booting and runtime performance in the guest becomes terribly slow. See details on the test below.

HOST / OS: Tried on a Linux host (Ubuntu 13.10, I believe, with kernel 3.11 and CPU Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz), but similar things occur also on a FreeBSD host. With a FreeBSD host, using -smp 1 seems to bypass the problem, whereas on Linux hosts the -smp setting does not seem to make a difference.

QEMU VERSION and MODIFICATIONS:
git branch -v
* master 5c5432e Merge remote-tracking branch 'luiz/queue/qmp' into staging
No changes other than the small diff I proposed to fix the problem:

git diff
diff --git a/qemu-timer.c b/qemu-timer.c
index e15ce47..4180a86 100644
--- a/qemu-timer.c
+++ b/qemu-timer.c
@@ -380,6 +380,7 @@ static void timerlist_rearm(QEMUTimerList *timer_list)
 {
     /* Interrupt execution to force deadline recalculation. */
     qemu_clock_warp(timer_list->clock->type);
+    if (use_icount) // XXX
         timerlist_notify(timer_list);
 }

DETAILED SYMPTOMS: WITH THE ABOVE PATCH, the kernel boots to a login prompt in 32 seconds (10 of which are the initial pause; you can skip it by pressing enter). Then you can run the pkt-gen program which I normally use to test netmap performance in the guest, and you see between 2 and 3 million packets per second (the NIC is disconnected from the host so this should be harmless). Example:

login: root
Password:
setup # pkt-gen -i em0 -f tx

should return something between 2 and 3 Mpps; ctrl-c to terminate it.

WITHOUT THE PATCH, booting becomes slow as soon as the timer tick starts and we load dummynet (which also starts a kernel thread every millisecond). You should be able to see how the printing of kernel messages slows down terribly around this point:

... Timecounters tick every 1.000 msec
ipfw2 initialized, divert loadable, nat loadable, default to accept, logging disabled
DUMMYNET 0 with IPv6 initialized (100409)

and then it takes a long/variable time to reach the login prompt, easily a couple of minutes or more. If you start pkt-gen now, you should see a much lower rate, around 300-500 Kpps.

Since the problem seems timer-related, it makes sense that you are not seeing much difference in Linux, which is probably tickless, and that the trouble comes out on FreeBSD after the timers are initialized. But again I have no idea if my patch (which simply reverts part of the offending commit) makes sense. cheers luigi

I could not reproduce it; the profile here seems normal, too:

24,69% perf-3281.map            [.] 0x7f4ab5ac9903
14,18% qemu-system-x86_64       [.] cpu_x86_exec
 2,67% libglib-2.0.so.0.3800.1  [.] g_source_iter_next
 2,63% qemu-system-x86_64       [.] tcg_optimize
 2,47% qemu-system-x86_64       [.] helper_pcmpeqb_xmm
 2,28% qemu-system-x86_64       [.] phys_page_find
 1,92% qemu-system-x86_64       [.] address_space_translate_internal
 1,53% qemu-system-x86_64       [.] tcg_liveness_analysis
 1,40% qemu-system-x86_64       [.] tcg_reg_alloc_op
 1,17% qemu-system-x86_64       [.] helper_psubb_xmm
 0,97% qemu-system-x86_64       [.] disas_insn
 0,96% qemu-system-x86_64       [.] cpu_x86_handle_mmu_fault
 0,92% qemu-system-x86_64       [.] tcg_out_opc
 0,92% qemu-system-x86_64       [.] helper_pmovmskb_xmm
 0,91% qemu-system-x86_64       [.] tlb_set_page
 0,75% qemu-system-x86_64       [.] address_space_translate
 0,74% qemu-system-x86_64
Re: [Qemu-devel] [PATCH for 1.7] target-i386: yield to another VCPU on PAUSE
On Wed, Nov 20, 2013 at 12:54:02PM +0100, Paolo Bonzini wrote: After commit b1bbfe7 (aio / timers: On timer modification, qemu_notify or aio_notify, 2013-08-21) FreeBSD guests report a huge slowdown. The problem shows up as soon as FreeBSD turns on its periodic (~1 ms) tick, but the timers are only the trigger for a pre-existing problem. Before the offending patch, setting a timer did a timer_settime system call. After, setting the timer exits the event loop (which uses poll) and reenters it with a new deadline. This does not cause any slowdown by itself; the difference is between one system call (timer_settime) and a signal delivery (SIGALRM) before the patch, and two system calls afterwards (a write to a pipe or eventfd + calling poll again when re-entering the event loop). Unfortunately, the exit/enter causes the main loop to grab the iothread lock, which in turn kicks the VCPU thread out of execution. This causes TCG to execute the next VCPU in its round-robin scheduling of VCPUs. When the second VCPU is mostly unused, FreeBSD runs a pause instruction in its idle loop which only burns cycles without any progress. As soon as the timer tick expires, the first VCPU runs the interrupt handler but very soon it sets it again---and QEMU then goes back to doing nothing in the second VCPU. The fix is to make the pause instruction do cpu_loop_exit.

Thank you. This fixes the slow booting problem, but the runtime performance with my test program in the other mail thread "commit b1bbfe72 causes huge slowdown with no kvm" is still about 50% lower than pre-commit b1bbfe7. So I am still wondering if there is a better way to deliver the timerlist_notify() in timerlist_rearm(). I have tried to suppress the entire timerlist_rearm() when the newly inserted event is close to the previous one, but this does not seem to help -- actually it is harmful, presumably because qemu_clock_warp() is also skipped.
Conversely, filtering out timerlist_notify() as in my previous (incorrect) patch seems to speed up things considerably, perhaps because there are other timerlist_notify() calls elsewhere that keep the ball rolling.

Just as a side note: I am not sure whether you were seeing use of the 'pause' instruction in profiling, but by default the idle loop in FreeBSD uses the acpi method (to enter C1 state). Other options are:
- mwait (unavailable on qemu due to some missing VCPU feature);
- hlt (which I tried and gives the same behaviour as acpi);
- spin (which indeed does use the pause instruction, but is not used unless you force it with sysctl machdep.idle=spin).
pause instructions are however used within spinlocks, and when invoking the scheduler, which happens right before and after the idle loop. cheers luigi

Cc: Richard Henderson r...@twiddle.net
Reported-by: Luigi Rizzo ri...@iet.unipi.it
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 target-i386/helper.h      |  1 +
 target-i386/misc_helper.c | 22 --
 target-i386/translate.c   |  5 -
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/target-i386/helper.h b/target-i386/helper.h
index d6974df..3775abe 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -58,6 +58,7 @@ DEF_HELPER_2(sysret, void, env, int)
 DEF_HELPER_2(hlt, void, env, int)
 DEF_HELPER_2(monitor, void, env, tl)
 DEF_HELPER_2(mwait, void, env, int)
+DEF_HELPER_2(pause, void, env, int)
 DEF_HELPER_1(debug, void, env)
 DEF_HELPER_1(reset_rf, void, env)
 DEF_HELPER_3(raise_interrupt, void, env, int, int)
diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c
index 93933fd..b6307ca 100644
--- a/target-i386/misc_helper.c
+++ b/target-i386/misc_helper.c
@@ -566,6 +566,15 @@ void helper_rdmsr(CPUX86State *env)
 }
 #endif

+static void do_pause(X86CPU *cpu)
+{
+    CPUX86State *env = &cpu->env;
+
+    /* Just let another CPU run. */
+    env->exception_index = EXCP_INTERRUPT;
+    cpu_loop_exit(env);
+}
+
 static void do_hlt(X86CPU *cpu)
 {
     CPUState *cs = CPU(cpu);
@@ -611,13 +620,22 @@ void helper_mwait(CPUX86State *env, int next_eip_addend)
     cs = CPU(cpu);
     /* XXX: not complete but not completely erroneous */
     if (cs->cpu_index != 0 || CPU_NEXT(cs) != NULL) {
-        /* more than one CPU: do not sleep because another CPU may
-           wake this one */
+        do_pause(cpu);
     } else {
         do_hlt(cpu);
     }
 }

+void helper_pause(CPUX86State *env, int next_eip_addend)
+{
+    X86CPU *cpu = x86_env_get_cpu(env);
+
+    cpu_svm_check_intercept_param(env, SVM_EXIT_PAUSE, 0);
+    env->eip += next_eip_addend;
+
+    do_pause(cpu);
+}
+
 void helper_debug(CPUX86State *env)
 {
     env->exception_index = EXCP_DEBUG;
diff --git a/target-i386/translate.c b/target-i386/translate.c
index eb0ea93..ecf16b3 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -7224,7 +7224,10 @@ static target_ulong
[Qemu-devel] commit b1bbfe72 causes huge slowdown with no kvm
I recently found out that without kvm enabled, and especially with -smp 2 or greater, qemu becomes incredibly slow (to the point that you can see kernel messages in the guest print one character at a time). This happens with a Linux host (even with -smp 1) and with a FreeBSD host (in this case -smp 2 or greater; -smp 1 seems to work fine there). Bisecting points to this commit as the culprit:

commit b1bbfe72ec1ebf302d97f886cc646466c0abd679
Author: Alex Bligh a...@alex.org.uk
Date: Wed Aug 21 16:02:55 2013 +0100

    aio / timers: On timer modification, qemu_notify or aio_notify

    On qemu_mod_timer_ns, ensure qemu_notify or aio_notify is called to end the appropriate poll(), irrespective of use_icount value. On qemu_clock_enable, ensure qemu_notify or aio_notify is called for all QEMUTimerLists attached to the QEMUClock.

    Signed-off-by: Alex Bligh a...@alex.org.uk
    Signed-off-by: Stefan Hajnoczi stefa...@redhat.com

I could not revert this individual commit into master because of other changes, but I notice that one key change of the commit was to make a call to timerlist_notify() unconditional, whereas before it was controlled by the use_icount variable. So I tried the small patch below and it seems to restore the original performance, but I have no idea what use_icount does and whether the change makes sense. If not, there is an open problem. Any ideas? cheers luigi

diff --git a/qemu-timer.c b/qemu-timer.c
index e15ce47..4180a86 100644
--- a/qemu-timer.c
+++ b/qemu-timer.c
@@ -380,6 +380,7 @@ static void timerlist_rearm(QEMUTimerList *timer_list)
 {
     /* Interrupt execution to force deadline recalculation. */
     qemu_clock_warp(timer_list->clock->type);
+    if (use_icount) // XXX
         timerlist_notify(timer_list);
 }
[Qemu-devel] [PATCH] better interpreter specification for scripts
Some of the new scripts in scripts/ specify an interpreter path which is not always pointing to the right place. I see that an alternative format is also being used:

#!/usr/bin/env interpreter_name

which seems to work better. The patch below is merely to let master compile on FreeBSD, but there are other offending files that can be found with grep '#!' scripts/*

Signed-off-by: Luigi Rizzo ri...@iet.unipi.it
---
 scripts/acpi_extract.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/acpi_extract.py b/scripts/acpi_extract.py
index 22ea468..66c1b3e 100755
--- a/scripts/acpi_extract.py
+++ b/scripts/acpi_extract.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/env python
 # Copyright (C) 2011 Red Hat, Inc., Michael S. Tsirkin m...@redhat.com
 #
 # This program is free software; you can redistribute it and/or modify
--
1.8.1.2
Re: [Qemu-devel] [PATCH v3] net: Adding netmap network backend
On Mon, Nov 4, 2013 at 9:41 AM, Anthony Liguori anth...@codemonkey.ws wrote: On Tue, Oct 29, 2013 at 3:12 AM, Vincenzo Maffione v.maffi...@gmail.com wrote: This patch adds support for a network backend based on netmap. netmap is a framework for high speed packet I/O. You can use it to build extremely fast traffic generators, monitors, software switches or network middleboxes. Its companion software switch VALE lets you interconnect virtual machines. netmap and VALE are implemented as a non intrusive kernel module, support NICs from multiple vendors, are part of standard FreeBSD distributions and available in source format for Linux too.

I don't think it's a good idea to support this on Linux hosts. This is an out of tree module that most likely will never go upstream. I don't want to live through another kqemu with this if it eventually starts to bit-rot.

I believe this is very different from kqemu. First, it is just a one-file backend (the patches to other files are just because there is not yet a way to automatically generate them; but I am sure qemu will get there). Getting rid of it, should the code bit-rot, is completely trivial. Second, there is nothing Linux specific here. Unless configure determines that the (possibly out of tree, as in Linux, or in-tree, as in FreeBSD) netmap headers are installed, it just won't build the backend. I am not sure if you do not want to support netmap _in general_ (despite it being already upstream in FreeBSD), or you'd like to put extra checks in ./configure to actually prevent its use on Linux hosts. In both cases it seems to me a loss of a useful feature with no real return.

 configure         |  31 
 hmp-commands.hx   |   4 +-
 net/Makefile.objs |   1 +
 net/clients.h     |   5 +
 net/net.c         |   6 +
 net/netmap.c      | 423 ++
 qapi-schema.json  |  19 ++-
 qemu-options.hx   |   8 ++
 8 files changed, 494 insertions(+), 3 deletions(-)
 create mode 100644 net/netmap.c

cheers luigi
Re: [Qemu-devel] [PATCH v3] net: Adding netmap network backend
On Mon, Nov 04, 2013 at 10:20:12AM -0800, Anthony Liguori wrote: On Mon, Nov 4, 2013 at 10:08 AM, Luigi Rizzo ri...@iet.unipi.it wrote: ... On Tue, Oct 29, 2013 at 3:12 AM, Vincenzo Maffione v.maffi...@gmail.com wrote: This patch adds support for a network backend based on netmap. netmap is a framework for high speed packet I/O. You can use it to build extremely fast traffic generators, monitors, software switches or network middleboxes. Its companion software switch VALE lets you interconnect virtual machines. netmap and VALE are implemented as a non intrusive kernel module, support NICs from multiple vendors, are part of standard FreeBSD distributions and available in source format for Linux too. I don't think it's a good idea to support this on Linux hosts. This is an out of tree module that most likely will never go upstream. I don't want to live through another kqemu with this if it eventually starts to bit-rot. I believe this is very different from kqemu. First, it is just a one-file backend (the patches to other files are just because there is not yet a way to automatically generate them; but I am sure qemu will get there). Getting rid of it, should the code bit-rot, is completely trivial. Second, there is nothing Linux specific here. Unless configure determines that the (possibly out of tree, as in Linux, or in-tree, as in FreeBSD) netmap headers are installed, it just won't build the backend.

Without being in upstream Linux, we have no guarantee that the API/ABI will be stable over time. I suspect it's also very unlikely that any mainstream distro will include these patches, meaning that the number of users that will test this is very low. I don't think just adding another backend because we can helps us out in the long term. Either this is the Right Approach to networking and we should focus on getting proper kernel support, or if that's not worth it, then there's no reason to include this in QEMU either.
Anthony, I'd still like you to answer the question that I asked before: are you opposed to netmap support just for Linux, or do you oppose it in general (despite netmap being already upstream in FreeBSD)? Your reasoning seems along the lines of "if feature X is not upstream in Linux we do not want to support it". While I can buy it (hey, it may save a lot of maintenance effort), it contrasts with the presence in qemu of a ton of conditional code to support other host platforms, as well as multiple backends for non-Linux features (mostly audio drivers; some GUI too, think of Cocoa). Also the agendas of Qemu, Linux, FreeBSD and other hosts may be different (and it does not mean that there is One Right Way or that others are wrong), so having inclusion in Linux as a prerequisite for support seems a bit restrictive. Specifically, the networking requirements for qemu (a fast virtual switch, tunnels etc.) are different from those of more typical apps or the OS itself. Also regarding API stability: qemu uses a lot of user libraries whose APIs are a moving target without apparent problems. If you are worried about API stability, netmap is just two small headers, and no library. There isn't really much that can go wrong there... cheers luigi

Regards, Anthony Liguori
Re: [Qemu-devel] [PATCH v3] net: Adding netmap network backend
On Mon, Nov 4, 2013 at 12:54 PM, Anthony Liguori anth...@codemonkey.ws wrote: On Mon, Nov 4, 2013 at 11:51 AM, Luigi Rizzo ri...@iet.unipi.it wrote: On Mon, Nov 04, 2013 at 10:20:12AM -0800, Anthony Liguori wrote: On Mon, Nov 4, 2013 at 10:08 AM, Luigi Rizzo ri...@iet.unipi.it wrote: ... On Tue, Oct 29, 2013 at 3:12 AM, Vincenzo Maffione v.maffi...@gmail.com wrote: This patch adds support for a network backend based on netmap. netmap is a framework for high speed packet I/O. You can use it to build extremely fast traffic generators, monitors, software switches or network middleboxes. Its companion software switch VALE lets you interconnect virtual machines. netmap and VALE are implemented as a non intrusive kernel module, support NICs from multiple vendors, are part of standard FreeBSD distributions and available in source format for Linux too. I don't think it's a good idea to support this on Linux hosts. This is an out of tree module that most likely will never go upstream. I don't want to live through another kqemu with this if it eventually starts to bit-rot. I believe this is very different from kqemu. First, it is just a one-file backend (the patches to other files are just because there is not yet a way to automatically generate them; but I am sure qemu will get there). Getting rid of it, should the code bit-rot, is completely trivial. Second, there is nothing Linux specific here. Unless configure determines that the (possibly out of tree, as in Linux, or in-tree, as in FreeBSD) netmap headers are installed, it just won't build the backend. Without being in upstream Linux, we have no guarantee that the API/ABI will be stable over time. I suspect it's also very unlikely that any mainstream distro will include these patches, meaning that the number of users that will test this is very low. I don't think just adding another backend because we can helps us out in the long term.
Either this is the Right Approach to networking and we should focus on getting proper kernel support, or if that's not worth it, then there's no reason to include this in QEMU either.

Anthony, I'd still like you to answer the question that I asked before: are you opposed to netmap support just for Linux, or do you oppose it in general (despite netmap being already upstream in FreeBSD)? Your reasoning seems along the lines of "if feature X is not upstream in Linux we do not want to support it".

Yes. This is the historic policy we have taken for any feature. I have no problem with netmap being used on FreeBSD hosts but I think it should not be enabled on Linux hosts.

OK, thanks for the clarification. So I misunderstood: the policy is "if not upstream in Linux, we do not want to support it _on Linux_". A fundamental difference :) So in ./configure we must change to 'netmap=no' in the initial section to disable it by default, and add a line 'netmap=' in the FreeBSD section to enable autodetect there. cheers luigi
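The ./configure change sketched above would look something like the following. This is an illustrative shell fragment only, not the actual committed diff; the `case` on `$targetos` mirrors how configure usually dispatches per-platform defaults:

```shell
#!/bin/sh
# Sketch of the proposed ./configure policy: netmap defaults to "no",
# and only the FreeBSD branch resets it to "" (configure's convention
# for "probe and enable if the headers are found").
netmap="no"                 # global default: disabled everywhere

targetos="FreeBSD"          # assumption for this sketch
case "$targetos" in
FreeBSD)
    netmap=""               # empty string means "autodetect"
    ;;
esac

echo "netmap=$netmap"
```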
[Qemu-devel] difference between receive_raw() and receive() NetClientInfo methods ?
Can someone clarify what is the difference between the two methods receive_raw() and receive() in NetClientInfo? tap seems to be the only backend actually implementing receive_raw(), but apart from prepending a vnet_hdr I cannot tell what this is for, and whether receive_raw() is a custom addon for tap, or whether it can be used by other backends to implement different features. The reason I am asking is that we are working on a fewer-copies path between guests, and one of the things we need is an asynchronous receive so that a backend can notify the frontend when a buffer has been actually consumed. Right now nc->info->receive() and friends are all synchronous, and there is no mechanism to support asynchronous completion, which forces the backend to make a copy if it cannot complete the request inline. Hence I probably need to add another method to NetClientInfo so that the frontend can register the callback, or pass it to the backend together with the buffer/iov. thanks luigi
Re: [Qemu-devel] [PATCH] e1000: cleanup process_tx_desc
On Tue, Jun 4, 2013 at 9:34 AM, Andrew Jones drjo...@redhat.com wrote: - Original Message - On 06/03/2013 10:20 AM, Andrew Jones wrote: Coverity complains about two overruns in process_tx_desc(). The complaints are false positives, but we might as well eliminate them. The problem is that hdr is defined as an unsigned int, but is then used to offset an array of size 65536, and another of size 256 bytes. hdr will actually never be greater than 255 though, as it's assigned only once and to the value of tp->hdr_len, which is a uint8_t. This patch simply gets rid of hdr, replacing it with tp->hdr_len, which makes it consistent with all other tp member use in the function.

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 hw/net/e1000.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

The logic looks sound, but checkpatch detects some style issues. See below. ... Although the style issues were present to begin with, we may as well take the opportunity to fix them. Running checkpatch on the file yields 142 errors, 41 warnings. I could send a v2 that fixes the 1 error and 2 warnings found in the context of this patch, but why? It's out of the scope of the patch (although I did use "cleanup" in the summary...), and it would hardly make a dent in this file's problems.

Indeed. I find it slightly annoying (and a waste of everyone's time) that patches are bounced on issues that they are not responsible for (this happens in several other open-source projects I have been involved with). I think that a much more effective approach would be to take the patch (on the grounds that it improves the codebase), and then if the committer feels like fixing the pre-existing style issues he can do it separately, or make a comment in the commit log if he has no time (and by the same reasoning, the original submitter may be short of time).
cheers luigi
Re: [Qemu-devel] RFC: handling backend too fast in virtio-net
On Mon, Feb 18, 2013 at 1:50 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Feb 15, 2013 at 11:24:29AM +0100, Stefan Hajnoczi wrote: On Thu, Feb 14, 2013 at 07:21:57PM +0100, Luigi Rizzo wrote: CCed Michael Tsirkin

virtio-style network devices (where the producer and consumer chase each other through a shared memory block) can enter into a bad operating regime when the consumer is too fast. I am hitting this case right now when virtio is used on top of the netmap/VALE backend that I posted a few weeks ago: what happens is that the backend is so fast that the io thread keeps re-enabling notifications every few packets. This results, on my test machine, in a throughput of 250-300 Kpps (and extremely unstable, oscillating between 200 and 600 Kpps). I'd like to get some feedback on the following trivial patch to have the thread keep spinning for a bounded amount of time when the producer is slower than the consumer. This gives a relatively stable throughput between 700 and 800 Kpps (we have something similar in our paravirtualized e1000 driver, which is slightly faster at 900-1100 Kpps).

Did you experiment with a tx timer instead of a bh? It seems that hw/virtio-net.c has two tx mitigation strategies - the bh approach that you've tweaked and a true timer. It seems you don't really want tx batching but you do want to avoid guest->host notifications?

One more thing I forgot: virtio-net does not use ioeventfd by default. ioeventfd changes the cost of guest->host notifications because the notification becomes an eventfd signal inside the kernel and kvm.ko then re-enters the guest. This means a guest->host notification becomes a light-weight exit and we don't return from ioctl(KVM_RUN). Perhaps -device virtio-blk-pci,ioeventfd=on will give similar results to your patch?

Is the ioeventfd the mechanism used by vhost-net?
If so, Giuseppe Lettieri (in Cc) has tried that with a modified netmap backend and experienced the same problem -- making the io thread (user or kernel) spin a bit more has great benefit on the throughput. cheers luigi
Re: [Qemu-devel] RFC: handling backend too fast in virtio-net
On Fri, Feb 15, 2013 at 11:24:29AM +0100, Stefan Hajnoczi wrote: On Thu, Feb 14, 2013 at 07:21:57PM +0100, Luigi Rizzo wrote: CCed Michael Tsirkin virtio-style network devices (where the producer and consumer chase each other through a shared memory block) can enter into a bad operating regime when the consumer is too fast. I am hitting this case right now when virtio is used on top of the netmap/VALE backend that I posted a few weeks ago: what happens is that the backend is so fast that the io thread keeps re-enabling notifications every few packets. This results, on my test machine, in a throughput of 250-300Kpps (and extremely unstable, oscillating between 200 and 600Kpps). I'd like to get some feedback on the following trivial patch to have the thread keep spinning for a bounded amount of time when the producer is slower than the consumer. This gives a relatively stable throughput between 700 and 800 Kpps (we have something similar in our paravirtualized e1000 driver, which is slightly faster at 900-1100 Kpps). Did you experiment with tx timer instead of bh? It seems that hw/virtio-net.c has two tx mitigation strategies - the bh approach that you've tweaked and a true timer. It seems you don't really want tx batching but you do want to avoid guest-host notifications? i have just tried:

  bh (with my tweaks):       700-800Kpps (large variance)
  timer (150us, 256 slots): ~490Kpps (very stable)

As expected, the throughput in the timer version seems proportional to ring_size/timeout, e.g. 1ms and 256 slots give approx 210Kpps, and 1ms and 128 slots give 108Kpps, but it saturates around 490Kpps. cheers luigi EXISTING LOGIC: reschedule the bh when at least a burst of packets has been received. Otherwise enable notifications and do another race-prevention check. NEW LOGIC: when the bh is scheduled the first time, establish a budget (currently 20) of reschedule attempts.
Every time virtio_net_flush_tx() returns 0 packets [this could be changed to 'less than a full burst'] the counter is increased. The bh is rescheduled until the counter reaches the budget, at which point we re-enable notifications as above. I am not 100% happy with this patch because there is a magic constant (the maximum number of retries) which should really be adaptive, or perhaps we should even bound the amount of time the bh runs, rather than the number of retries. In practice, having the thread spin (or sleep) for 10-20us even without new packets is probably preferable to re-enabling notifications and requesting a kick. opinions ? cheers luigi

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 573c669..5389088 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -49,6 +51,7 @@ typedef struct VirtIONet
     NICState *nic;
     uint32_t tx_timeout;
     int32_t tx_burst;
+    int32_t tx_retries; /* keep spinning a bit on tx */
     uint32_t has_vnet_hdr;
     size_t host_hdr_len;
     size_t guest_hdr_len;
@@ -1062,7 +1065,9 @@ static void virtio_net_tx_bh(void *opaque)
 
     /* If we flush a full burst of packets, assume there are
      * more coming and immediately reschedule */
-    if (ret >= n->tx_burst) {
+    if (ret == 0)
+        n->tx_retries++;
+    if (n->tx_retries < 20) {
         qemu_bh_schedule(q->tx_bh);
         q->tx_waiting = 1;
         return;
@@ -1076,6 +1082,8 @@ static void virtio_net_tx_bh(void *opaque)
         virtio_queue_set_notification(q->tx_vq, 0);
         qemu_bh_schedule(q->tx_bh);
         q->tx_waiting = 1;
+    } else {
+        n->tx_retries = 0;
     }
 }
[Qemu-devel] RFC: handling backend too fast in virtio-net
virtio-style network devices (where the producer and consumer chase each other through a shared memory block) can enter into a bad operating regime when the consumer is too fast. I am hitting this case right now when virtio is used on top of the netmap/VALE backend that I posted a few weeks ago: what happens is that the backend is so fast that the io thread keeps re-enabling notifications every few packets. This results, on my test machine, in a throughput of 250-300Kpps (and extremely unstable, oscillating between 200 and 600Kpps). I'd like to get some feedback on the following trivial patch to have the thread keep spinning for a bounded amount of time when the producer is slower than the consumer. This gives a relatively stable throughput between 700 and 800 Kpps (we have something similar in our paravirtualized e1000 driver, which is slightly faster at 900-1100 Kpps). EXISTING LOGIC: reschedule the bh when at least a burst of packets has been received. Otherwise enable notifications and do another race-prevention check. NEW LOGIC: when the bh is scheduled the first time, establish a budget (currently 20) of reschedule attempts. Every time virtio_net_flush_tx() returns 0 packets [this could be changed to 'less than a full burst'] the counter is increased. The bh is rescheduled until the counter reaches the budget, at which point we re-enable notifications as above. I am not 100% happy with this patch because there is a magic constant (the maximum number of retries) which should be really adaptive, or perhaps we should even bound the amount of time the bh runs, rather than the number of retries. In practice, having the thread spin (or sleep) for 10-20us even without new packets is probably preferable to re-enable notifications and request a kick. opinions ? 
cheers luigi

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 573c669..5389088 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -49,6 +51,7 @@ typedef struct VirtIONet
     NICState *nic;
     uint32_t tx_timeout;
     int32_t tx_burst;
+    int32_t tx_retries; /* keep spinning a bit on tx */
     uint32_t has_vnet_hdr;
     size_t host_hdr_len;
     size_t guest_hdr_len;
@@ -1062,7 +1065,9 @@ static void virtio_net_tx_bh(void *opaque)
 
     /* If we flush a full burst of packets, assume there are
      * more coming and immediately reschedule */
-    if (ret >= n->tx_burst) {
+    if (ret == 0)
+        n->tx_retries++;
+    if (n->tx_retries < 20) {
         qemu_bh_schedule(q->tx_bh);
         q->tx_waiting = 1;
         return;
@@ -1076,6 +1082,8 @@ static void virtio_net_tx_bh(void *opaque)
         virtio_queue_set_notification(q->tx_vq, 0);
         qemu_bh_schedule(q->tx_bh);
         q->tx_waiting = 1;
+    } else {
+        n->tx_retries = 0;
     }
 }
Re: [Qemu-devel] [PATCH] implement moderation registers for e1000
On Fri, Feb 8, 2013 at 2:02 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Wed, Feb 06, 2013 at 03:23:41PM +0100, Luigi Rizzo wrote: The following patch implements interrupt moderation registers for the e1000 adapter. This feature is normally used by OS drivers, and its implementation improves performance significantly, especially on the transmit path. The feature can be disabled through a command line option. We have seen only benefits in throughput, while latency slightly increases (that is an inherent feature of interrupt moderation) because very short delays cannot be emulated precisely. For those curious about performance, there is a set of measurements (of this and other changes that we will post separately) at http://info.iet.unipi.it/~luigi/research.html#qemu http://info.iet.unipi.it/~luigi/papers/20130206-qemu.pdf is 404. sorry, fixed now. And, will resubmit a fixed patch with uninit and fixed braces in the if() statement. I am happy to make this default to off. But it would be good if you could actually give it a try. Note that linux and FreeBSD (and presumably windows) do use moderation by default so enabling the emulation of the registers makes the guest OS behave differently (and closer to bare metal). To test that the patch itself does not cause regression in the emulator one should also turn off moderation in the guest (one of the tests i have run). cheers luigi
Re: [Qemu-devel] [PATCH] implement moderation registers for e1000
On Fri, Feb 08, 2013 at 11:46:34AM +0100, Paolo Bonzini wrote: On 08/02/2013 11:20, Luigi Rizzo wrote: ... I am happy to make this default to off. But it would be good if you could actually give it a try. Note that linux and FreeBSD (and presumably windows) do use moderation by default so enabling the emulation of the registers makes the guest OS behave differently (and closer to bare metal). To test that the patch itself does not cause regression in the emulator one should also turn off moderation in the guest (one of the tests i have run). I think making the default on is fine, but you need to add compatibility options to leave it off in older machine types (pc-1.4 and earlier). I am unclear on what you mean by older machine types? The hardware (E1000_DEV_ID_82540EM) does have these registers, and it is entirely up to the guest OS to use them or not. cheers luigi
Re: [Qemu-devel] [PATCH] implement moderation registers for e1000
On Fri, Feb 08, 2013 at 11:59:12AM +0100, Stefan Hajnoczi wrote: ... http://info.iet.unipi.it/~luigi/papers/20130206-qemu.pdf is 404. sorry, fixed now. And, will resubmit a fixed patch with uninit and fixed braces in the if() statement. I am happy to make this default to off. But it would be good if you could actually give it a try. Note that linux and FreeBSD (and presumably windows) do use moderation by default so enabling the emulation of the registers makes the guest OS behave differently (and closer to bare metal). To test that the patch itself does not cause regression in the emulator one should also turn off moderation in the guest (one of the tests i have run). I think making the default on is fine, but you need to add compatibility options to leave it off in older machine types (pc-1.4 and earlier). Latency regression. Would need to see real results to understand how bad it is. most of the numbers in the paper are for FreeBSD, not sure if they qualify as real here :) For Linux, in guest-host configuration, we have these numbers for TCP_RR:

  CONFIGURATION         TCP_RR (KTps)
  1 VCPU TX itr=0           17.7
  1 VCPU TX itr=100          7.3
  2 VCPU TX itr=0           13.9
  2 VCPU TX itr=100          7.4

These are computed by forcing the value in the itr register. However, by default, linux seems to dynamically adjust the itr values depending on the load so it is difficult to make repeatable experiments. This is why I'd encourage you to run some tests with what you think is an appropriate configuration. cheers luigi
Re: [Qemu-devel] [PATCH] implement moderation registers for e1000
On Fri, Feb 08, 2013 at 12:52:12PM +0100, Paolo Bonzini wrote: On 07/02/2106 07:28, Luigi Rizzo wrote: On Fri, Feb 08, 2013 at 11:46:34AM +0100, Paolo Bonzini wrote: On 08/02/2013 11:20, Luigi Rizzo wrote: ... I am happy to make this default to off. But it would be good if you could actually give it a try. Note that linux and FreeBSD (and presumably windows) do use moderation by default so enabling the emulation of the registers makes the guest OS behave differently (and closer to bare metal). To test that the patch itself does not cause regression in the emulator one should also turn off moderation in the guest (one of the tests i have run). I think making the default on is fine, but you need to add compatibility options to leave it off in older machine types (pc-1.4 and earlier). I am unclear on what you mean by older machine types? The hardware (E1000_DEV_ID_82540EM) does have these registers, and it is entirely up to the guest OS to use them or not. Yes, but suppose a guest has a bug that was masked by the old non-implementation. -M pc-1.4 is supposed to bring back the old version's behavior as much as possible, including making the guest bug latent again. ok i see. Is there some example code that already handles a similar '-M' option so i can look at it and use it to set the mit_on parameter? cheers luigi
[Qemu-devel] [PATCH] implement moderation registers for e1000
The following patch implements interrupt moderation registers for the e1000 adapter. This feature is normally used by OS drivers, and its implementation improves performance significantly, especially on the transmit path. The feature can be disabled through a command line option. We have seen only benefits in throughput, while latency slightly increases (that is an inherent feature of interrupt moderation) because very short delays cannot be emulated precisely. For those curious about performance, there is a set of measurements (of this and other changes that we will post separately) at http://info.iet.unipi.it/~luigi/research.html#qemu cheers luigi

Signed-off-by: Luigi Rizzo ri...@iet.unipi.it

diff --git a/hw/e1000.c b/hw/e1000.c
index bb150c6..b4c0f4a 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -131,6 +131,10 @@ typedef struct E1000State_st {
     } eecd_state;
 
     QEMUTimer *autoneg_timer;
+
+    QEMUTimer *mit_timer;   /* handle for the timer */
+    uint32_t mit_timer_on;  /* mitigation timer active */
+    uint32_t mit_cause;     /* pending interrupt cause */
+    uint32_t mit_on;        /* mitigation enable */
 } E1000State;
 
 #define defreg(x)   x = (E1000_##x >> 2)
@@ -146,6 +150,7 @@ enum {
     defreg(TPR), defreg(TPT), defreg(TXDCTL), defreg(WUFC),
     defreg(RA), defreg(MTA), defreg(CRCERRS), defreg(VFTA),
     defreg(VET),
+    defreg(RDTR), defreg(RADV), defreg(TADV), defreg(ITR),
 };
 
 static void
@@ -652,6 +657,73 @@ static uint64_t tx_desc_base(E1000State *s)
     return (bah << 32) + bal;
 }
 
+/* helper function, 0 means the value is not set */
+static inline void
+mit_update_delay(uint32_t *curr, uint32_t value)
+{
+    if (value && (*curr == 0 || value < *curr)) {
+        *curr = value;
+    }
+}
+
+/*
+ * If necessary, rearm the timer and post an interrupt.
+ * Called at the end of tx/rx routines (mit_timer_on == 0),
+ * and when the timer fires (mit_timer_on == 1).
+ * We provide a partial implementation of interrupt mitigation,
+ * emulating only RADV, TADV and ITR (lower 16 bits, 1024ns units for
+ * RADV and TADV, 256ns units for ITR). RDTR is only used to enable RADV;
+ * relative timers based on TIDV and RDTR are not implemented.
+ */
+static void
+mit_rearm_and_int(void *opaque)
+{
+    E1000State *s = opaque;
+    uint32_t mit_delay = 0;
+
+    /*
+     * Clear the flag. It is only set when the callback fires,
+     * and we need to clear it anyways.
+     */
+    s->mit_timer_on = 0;
+    if (s->mit_cause == 0) { /* no events pending, we are done */
+        return;
+    }
+    /*
+     * Compute the next mitigation delay according to pending interrupts
+     * and the current values of RADV (provided RDTR!=0), TADV and ITR.
+     * Then rearm the timer.
+     */
+    if (s->mit_cause & (E1000_ICR_TXQE | E1000_ICR_TXDW)) {
+        mit_update_delay(&mit_delay, s->mac_reg[TADV] * 4);
+    }
+    if (s->mac_reg[RDTR] && (s->mit_cause & E1000_ICS_RXT0)) {
+        mit_update_delay(&mit_delay, s->mac_reg[RADV] * 4);
+    }
+    mit_update_delay(&mit_delay, s->mac_reg[ITR]);
+
+    if (mit_delay) {
+        s->mit_timer_on = 1;
+        qemu_mod_timer(s->mit_timer,
+                qemu_get_clock_ns(vm_clock) + mit_delay * 256);
+    }
+    set_ics(s, 0, s->mit_cause);
+    s->mit_cause = 0;
+}
+
+static void
+mit_set_ics(E1000State *s, uint32_t cause)
+{
+    if (s->mit_on == 0) {
+        set_ics(s, 0, cause);
+        return;
+    }
+    s->mit_cause |= cause;
+    if (!s->mit_timer_on)
+        mit_rearm_and_int(s);
+}
+
 static void
 start_xmit(E1000State *s)
 {
@@ -689,7 +761,7 @@ start_xmit(E1000State *s)
             break;
         }
     }
-    set_ics(s, 0, cause);
+    mit_set_ics(s, cause);
 }
 
 static int
@@ -908,7 +980,7 @@ e1000_receive(NetClientState *nc, const uint8_t *buf, size_t size)
             s->rxbuf_min_shift)
         n |= E1000_ICS_RXDMT0;
 
-    set_ics(s, 0, n);
+    mit_set_ics(s, n);
 
     return size;
 }
@@ -1013,6 +1085,7 @@ static uint32_t (*macreg_readops[])(E1000State *, int) = {
     getreg(RDH), getreg(RDT), getreg(VET), getreg(ICS),
     getreg(TDBAL), getreg(TDBAH), getreg(RDBAH), getreg(RDBAL),
     getreg(TDLEN), getreg(RDLEN),
+    getreg(RDTR), getreg(RADV), getreg(TADV), getreg(ITR),
 
     [TOTH] = mac_read_clr8, [TORH] = mac_read_clr8, [GPRC] = mac_read_clr4,
     [GPTC] = mac_read_clr4, [TPR] = mac_read_clr4, [TPT] = mac_read_clr4,
@@ -1029,6 +1102,8 @@ static void (*macreg_writeops[])(E1000State *, int, uint32_t) = {
     putreg(PBA), putreg(EERD), putreg(SWSM), putreg(WUFC),
     putreg(TDBAL), putreg(TDBAH), putreg(TXDCTL), putreg(RDBAH),
     putreg(RDBAL), putreg(LEDCTL), putreg(VET),
+    [RDTR] = set_16bit, [RADV] = set_16bit, [TADV] = set_16bit,
+    [ITR] = set_16bit,
     [TDLEN] = set_dlen, [RDLEN] = set_dlen, [TCTL] = set_tctl,
     [TDT] = set_tctl, [MDIC] = set_mdic
[Qemu-devel] [PATCH v2] fix qemu_flush_queued_packets() in presence of a hub
On Tue, Jan 29, 2013 at 02:08:27PM +0100, Stefan Hajnoczi wrote: It's cleaner to add a bool (*flush)(NetClientState *nc) function to NetClientInfo but given that net.c already knows about hub.c specifically we don't gain much by trying to decouple this callback. If you want to leave it hardcoded that's fine by me. I'd leave it like this for now. Here is the version with fixed style bugs. DESCRIPTION: When frontend and backend are connected through a hub as below (showing only one direction), and the frontend (or in general, all output ports of the hub) cannot accept more traffic, the backend queues packets in queue-A. When the frontend (or in general, one output port) becomes ready again, qemu tries to flush packets from queue-B, which is unfortunately empty.

  e1000.0 --[queue B]-- hub0port0 (hub) hub0port1 --[queue A]-- tap.0

To fix this i propose to introduce a new function net_hub_flush() which is called when trying to flush a queue connected to a hub. cheers luigi

Signed-off-by: Luigi Rizzo ri...@iet.unipi.it

diff --git a/net/hub.c b/net/hub.c
index a24c9d1..df32074 100644
--- a/net/hub.c
+++ b/net/hub.c
@@ -338,3 +338,17 @@ void net_hub_check_clients(void)
         }
     }
 }
+
+bool net_hub_flush(NetClientState *nc)
+{
+    NetHubPort *port;
+    NetHubPort *source_port = DO_UPCAST(NetHubPort, nc, nc);
+    int ret = 0;
+
+    QLIST_FOREACH(port, &source_port->hub->ports, next) {
+        if (port != source_port) {
+            ret += qemu_net_queue_flush(port->nc.send_queue);
+        }
+    }
+    return ret ? true : false;
+}
diff --git a/net/hub.h b/net/hub.h
index 583ada8..a625eff 100644
--- a/net/hub.h
+++ b/net/hub.h
@@ -21,5 +21,6 @@ NetClientState *net_hub_add_port(int hub_id, const char *name);
 NetClientState *net_hub_find_client_by_name(int hub_id, const char *name);
 void net_hub_info(Monitor *mon);
 void net_hub_check_clients(void);
+bool net_hub_flush(NetClientState *nc);
 
 #endif /* NET_HUB_H */
diff --git a/net/net.c b/net/net.c
index 9806862..9489702 100644
--- a/net/net.c
+++ b/net/net.c
@@ -439,6 +439,12 @@ void qemu_flush_queued_packets(NetClientState *nc)
 {
     nc->receive_disabled = 0;
 
+    if (nc->peer && nc->peer->info->type == NET_CLIENT_OPTIONS_KIND_HUBPORT) {
+        if (net_hub_flush(nc->peer)) {
+            qemu_notify_event();
+        }
+        return;
+    }
     if (qemu_net_queue_flush(nc->send_queue)) {
         /* We emptied the queue successfully, signal to the IO thread to repoll
          * the file descriptor (for tap, for example).
[Qemu-devel] [PATCH v3] fix qemu_flush_queued_packets() in presence of a hub
DESCRIPTION: When frontend and backend are connected through a hub as below (showing only one direction), and the frontend (or in general, all output ports of the hub) cannot accept more traffic, the backend queues packets in queue-A. When the frontend (or in general, one output port) becomes ready again, qemu tries to flush packets from queue-B, which is unfortunately empty.

  e1000.0 --[queue B]-- hub0port0 (hub) hub0port1 --[queue A]-- tap.0

To fix this i propose to introduce a new function net_hub_flush() which is called when trying to flush a queue connected to a hub. cheers luigi

Signed-off-by: Luigi Rizzo ri...@iet.unipi.it

diff --git a/net/hub.c b/net/hub.c
index a24c9d1..df32074 100644
--- a/net/hub.c
+++ b/net/hub.c
@@ -338,3 +338,17 @@ void net_hub_check_clients(void)
         }
     }
 }
+
+bool net_hub_flush(NetClientState *nc)
+{
+    NetHubPort *port;
+    NetHubPort *source_port = DO_UPCAST(NetHubPort, nc, nc);
+    int ret = 0;
+
+    QLIST_FOREACH(port, &source_port->hub->ports, next) {
+        if (port != source_port) {
+            ret += qemu_net_queue_flush(port->nc.send_queue);
+        }
+    }
+    return ret ? true : false;
+}
diff --git a/net/hub.h b/net/hub.h
index 583ada8..a625eff 100644
--- a/net/hub.h
+++ b/net/hub.h
@@ -21,5 +21,6 @@ NetClientState *net_hub_add_port(int hub_id, const char *name);
 NetClientState *net_hub_find_client_by_name(int hub_id, const char *name);
 void net_hub_info(Monitor *mon);
 void net_hub_check_clients(void);
+bool net_hub_flush(NetClientState *nc);
 
 #endif /* NET_HUB_H */
diff --git a/net/net.c b/net/net.c
index 9806862..9489702 100644
--- a/net/net.c
+++ b/net/net.c
@@ -439,6 +439,12 @@ void qemu_flush_queued_packets(NetClientState *nc)
 {
     nc->receive_disabled = 0;
 
+    if (nc->peer && nc->peer->info->type == NET_CLIENT_OPTIONS_KIND_HUBPORT) {
+        if (net_hub_flush(nc->peer)) {
+            qemu_notify_event();
+        }
+        return;
+    }
     if (qemu_net_queue_flush(nc->send_queue)) {
         /* We emptied the queue successfully, signal to the IO thread to repoll
          * the file descriptor (for tap, for example).
[Qemu-devel] [PATCH v2] fix unbounded NetQueue
In the current implementation of qemu, running without a network backend will cause the queue to grow unbounded when the guest is transmitting traffic. This patch fixes the problem by implementing a bounded-size NetQueue, used with an arbitrary limit of 10000 packets, and dropping packets when the queue is full _and_ the sender does not pass a callback. The second condition makes sure that we never drop packets that contain a callback (which would be tricky, because the producer expects the callback to be run when all previous packets have been consumed; so we cannot run it when the packet is dropped). If documentation is correct, producers that submit a callback should stop sending when their packet is queued, so there is no real risk that the queue exceeds the max size by large values.

Signed-off-by: Luigi Rizzo ri...@iet.unipi.it

diff --git a/net/queue.c b/net/queue.c
index 6eaf5b6..859d02a 100644
--- a/net/queue.c
+++ b/net/queue.c
@@ -50,6 +50,8 @@ struct NetPacket {
 
 struct NetQueue {
     void *opaque;
+    uint32_t nq_maxlen;
+    uint32_t nq_count;
 
     QTAILQ_HEAD(packets, NetPacket) packets;
@@ -63,6 +65,8 @@ NetQueue *qemu_new_net_queue(void *opaque)
     queue = g_malloc0(sizeof(NetQueue));
 
     queue->opaque = opaque;
+    queue->nq_maxlen = 10000;
+    queue->nq_count = 0;
 
     QTAILQ_INIT(&queue->packets);
@@ -92,6 +96,9 @@ static void qemu_net_queue_append(NetQueue *queue,
 {
     NetPacket *packet;
 
+    if (queue->nq_count >= queue->nq_maxlen && !sent_cb) {
+        return; /* drop if queue full and no callback */
+    }
     packet = g_malloc(sizeof(NetPacket) + size);
     packet->sender = sender;
     packet->flags = flags;
@@ -99,6 +106,7 @@ static void qemu_net_queue_append(NetQueue *queue,
     packet->sent_cb = sent_cb;
     memcpy(packet->data, buf, size);
 
+    queue->nq_count++;
     QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
@@ -113,6 +121,9 @@ static void qemu_net_queue_append_iov(NetQueue *queue,
     size_t max_len = 0;
     int i;
 
+    if (queue->nq_count >= queue->nq_maxlen && !sent_cb) {
+        return; /* drop if queue full and no callback */
+    }
     for (i = 0; i < iovcnt; i++) {
         max_len += iov[i].iov_len;
     }
@@ -130,6 +141,7 @@ static void qemu_net_queue_append_iov(NetQueue *queue,
         packet->size += len;
     }
 
+    queue->nq_count++;
     QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
@@ -220,6 +232,7 @@ void qemu_net_queue_purge(NetQueue *queue, NetClientState *from)
     QTAILQ_FOREACH_SAFE(packet, &queue->packets, entry, next) {
         if (packet->sender == from) {
             QTAILQ_REMOVE(&queue->packets, packet, entry);
+            queue->nq_count--;
             g_free(packet);
         }
     }
@@ -233,6 +246,7 @@ bool qemu_net_queue_flush(NetQueue *queue)
 
         packet = QTAILQ_FIRST(&queue->packets);
         QTAILQ_REMOVE(&queue->packets, packet, entry);
+        queue->nq_count--;
 
         ret = qemu_net_queue_deliver(queue,
                                      packet->sender,
@@ -240,6 +254,7 @@ bool qemu_net_queue_flush(NetQueue *queue)
                                      packet->data,
                                      packet->size);
         if (ret == 0) {
+            queue->nq_count++;
             QTAILQ_INSERT_HEAD(&queue->packets, packet, entry);
             return false;
         }
Re: [Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))
On Thu, Jan 24, 2013 at 11:24 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Thu, Jan 24, 2013 at 6:35 PM, Luigi Rizzo ri...@iet.unipi.it wrote: never mind, pilot error. in my test program i had swapped the arguments to __builtin_memcpy(). With the correct ones, __builtin_memcpy() == bcopy == memcpy on both machines, and never faster than the pkt_copy(). Are the bcopy()/memcpy() calls given a length that is a multiple of 64 bytes? IIUC pkt_copy() assumes 64-byte multiple lengths and that optimization can be matched with memcpy(dst, src, (len + 63) & ~63). Maybe it helps and at least ensures they are doing equal amounts of byte copying. the length is a parameter from the command line. For short packets, at least on the i7-2600 and freebsd the pkt_copy() is only slightly faster than memcpy on multiples of 64, and *a lot* faster when the length is not a multiple. How about dropping pkt_copy() and instead rounding the memcpy() length up to the next 64 byte multiple? Using memcpy() is more future-proof IMO, that's why I'm pushing for this. fair enough, i'll make this conditional and enable memcpy() rounded to 64 byte multiples by default (though as i said the pkt_copy() is always at least as fast as memcpy() on all machines i tried). cheers luigi
Re: [Qemu-devel] (change __FUNCTION__ to __func__ according to qemu coding style)
On Fri, Jan 25, 2013 at 9:23 AM, Michael Tokarev m...@tls.msk.ru wrote: --- v2: change __FUNCTION__ to __func__ according to qemu coding style will do. However it does not seem a highly followed standard :)

  lrizzo@netmap:~/work-qemu/qemu-git-head-20130121$ grep -r __FUNCTION__ . | wc
       344    2119   25898
  lrizzo@netmap:~/work-qemu/qemu-git-head-20130121$ grep -r __func__ . | wc
       759    5035   58455

cheers luigi
Re: [Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))
On Thu, Jan 24, 2013 at 09:54:19AM +0100, Stefan Hajnoczi wrote: On Wed, Jan 23, 2013 at 06:55:59PM -0800, Luigi Rizzo wrote: On Wed, Jan 23, 2013 at 8:03 AM, Luigi Rizzo ri...@iet.unipi.it wrote: I'm even doubtful that it's always a win on FreeBSD. You have a threshold to fall back to bcopy() and who knows what the best value for various CPUs is. indeed. With the attached program (which however might be affected by the fact that data is not used after copying) it seems that on a recent linux (using gcc 4.6.2) the fastest is __builtin_memcpy() ./testlock -m __builtin_memcpy -l 64 (by a factor of 2 or more) whereas all the other methods have approximately the same speed. never mind, pilot error. in my test program i had swapped the arguments to __builtin_memcpy(). With the correct ones, __builtin_memcpy() == bcopy == memcpy on both machines, and never faster than the pkt_copy(). Are the bcopy()/memcpy() calls given a length that is a multiple of 64 bytes? IIUC pkt_copy() assumes 64-byte multiple lengths and that optimization can be matched with memcpy(dst, src, (len + 63) & ~63). Maybe it helps and at least ensures they are doing equal amounts of byte copying. the length is a parameter from the command line. For short packets, at least on the i7-2600 and freebsd the pkt_copy() is only slightly faster than memcpy on multiples of 64, and *a lot* faster when the length is not a multiple. Again i am not sure whether it depends on the compiler/glibc or simply on the CPU, unfortunately i have no way to swap machines. luigi
Re: [Qemu-devel] [PATCH v2] netmap backend (revised)
On Tue, Jan 22, 2013 at 2:50 PM, Anthony Liguori aligu...@us.ibm.comwrote: Hi, Thank you for submitting your patch series. checkpatch.pl has detected that one or more of the patches in this series violate the QEMU coding style. If you believe this message was sent in error, please ignore it or respond here with an explanation. Otherwise, please correct the coding style issues and resubmit a new version of the patch. will do, thanks for the feedback luigi
Re: [Qemu-devel] [PATCH v2] netmap backend (revised)
On Wed, Jan 23, 2013 at 12:10:55PM +0100, Stefan Hajnoczi wrote: On Tue, Jan 22, 2013 at 08:12:15AM +0100, Luigi Rizzo wrote: reposting a version without changes that implement bounded queues in net/queue.c Hi, the attached patch implements a qemu backend for the netmap API thus allowing machines to attach to the VALE software switch as well as netmap-supported cards (links below). http://info.iet.unipi.it/~luigi/netmap/ http://info.iet.unipi.it/~luigi/vale/ This is a cleaned up version of code written last summer. Looks like a clean software approach to low-level packet I/O. I guess the biggest competitor would be a userspace library with NIC drivers using modern IOMMUs to avoid the security/reliability problem that previous userspace approaches suffered. Pretty cool that netmap reuses kernel NIC driver implementations to avoid duplicating all that code. I wonder how/if netmap handles advanced NIC features like multiqueue and offloads? But I've only read the webpage, not the papers or source code :). there are definitely more details in the papers :) IOMMU is not strictly necessary because userspace only sees packet buffers and a device independent replica of the descriptor ring. NIC registers and rings are only manipulated by the kernel within system calls. multiqueue is completely straightforward -- netmap exposes as many queues as the hardware implements, you can attach file descriptors to individual queues, bind threads to queues by just using pthread_setaffinity(). offloading so far is not supported, and that's part of the design: it adds complexity at runtime to check the various possible combinations of offloading in various places in the stack. A single packet format makes the driver extremely efficient. The thing that may make a difference is tcp segmentation and reassembly, we will look at how to support it at some point. I'll apply the other changes you suggest below, thanks.
Only some comments: The debugging macros (D(), RD() ) are taken from our existing code, +#define ND(fd, ... ) // debugging +#define D(format, ...) \ ... +/* rate limited, lps indicates how many per second */ +#define RD(lps, format, ...)\ ... Have you seen docs/tracing.txt? It can do fprintf(stderr) but also SystemTap/DTrace and a simple built-in binary tracer. will look at it. These debugging macros are the same we use in other netmap code so i'd prefer to keep them. +// a fast copy routine only for multiples of 64 bytes, non overlapped. +static inline void +pkt_copy(const void *_src, void *_dst, int l) ... +*dst++ = *src++; +} +} I wonder how different FreeBSD bcopy() is from glibc memcpy() and if the optimization is even a win. The glibc code is probably hand-written assembly that CPU vendors have contributed for specific CPU model families. Did you compare glibc memcpy() against pkt_copy()? I haven't tried in detail on glibc but will run some tests. In any case not all systems have glibc, and on FreeBSD this pkt_copy was a significant win for small packets (saving some 20ns each; of course this counts only when you approach the 10 Mpps range, which is what you get with netmap, and of course when data is in cache). One reason pkt_copy gains something is that if it can assume there is extra space in the buffer, it can work on large chunks avoiding the extra jumps and instructions for the remaining 1-2-4 bytes. + ring->slot[i].len = size; + pkt_copy(buf, dst, size); How does netmap guarantee that the buffer is large enough for this packet? the netmap buffers are 2k, i'll make sure there is a check that the size is not exceeded. +close(s->me.fd); Missing munmap? yes you are correct. cheers luigi
[Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))
On Wed, Jan 23, 2013 at 02:03:17PM +0100, Stefan Hajnoczi wrote: On Wed, Jan 23, 2013 at 12:50:26PM +0100, Luigi Rizzo wrote: On Wed, Jan 23, 2013 at 12:10:55PM +0100, Stefan Hajnoczi wrote: On Tue, Jan 22, 2013 at 08:12:15AM +0100, Luigi Rizzo wrote: ... +// a fast copy routine only for multiples of 64 bytes, non overlapped. +static inline void +pkt_copy(const void *_src, void *_dst, int l) ... +*dst++ = *src++; +} +} I wonder how different FreeBSD bcopy() is from glibc memcpy() and if the optimization is even a win. The glibc code is probably hand-written assembly that CPU vendors have contributed for specific CPU model families. Did you compare glibc memcpy() against pkt_copy()? I haven't tried in detail on glibc but will run some tests. In any case not all systems have glibc, and on FreeBSD this pkt_copy was a significant win for small packets (saving some 20ns each; of course this counts only when you approach the 10 Mpps range, which is what you get with netmap, and of course when data is in cache). One reason pkt_copy gains something is that if it can assume there is extra space in the buffer, it can work on large chunks avoiding the extra jumps and instructions for the remaining 1-2-4 bytes. I'd like to drop this code or at least make it FreeBSD-specific since there's no guarantee that this is a good idea on any other libc. I'm even doubtful that it's always a win on FreeBSD. You have a threshold to fall back to bcopy() and who knows what the best value for various CPUs is. indeed. With the attached program (which however might be affected by the fact that data is not used after copying) it seems that on a recent linux (using gcc 4.6.2) the fastest is __builtin_memcpy() ./testlock -m __builtin_memcpy -l 64 (by a factor of 2 or more) whereas all the other methods have approximately the same speed. On FreeBSD (with clang, gcc 4.2.1, gcc 4.6.4) the pkt_copy() above ./testlock -m fastcopy -l 64 is largely better than other methods. 
I am a bit puzzled why the builtin method on FreeBSD is not effective, but i will check on some other forum... cheers luigi /* * Copyright (C) 2012 Luigi Rizzo. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the *documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * $Id: testlock.c 12015 2013-01-23 15:51:17Z luigi $ * * Test program to study various ops and concurrency issues. * Create multiple threads, possibly bind to cpus, and run a workload. 
 *
 * cc -O2 -Werror -Wall testlock.c -o testlock -lpthread
 * you might need -lrt
 */

#include <inttypes.h>
#include <sys/types.h>
#include <pthread.h>	/* pthread_* */

#if defined(__APPLE__)

#include <libkern/OSAtomic.h>
#define atomic_add_int(p, n) OSAtomicAdd32(n, (int *)p)
#define atomic_cmpset_32(p, o, n) OSAtomicCompareAndSwap32(o, n, (int *)p)

#elif defined(linux)

int atomic_cmpset_32(volatile uint32_t *p, uint32_t old, uint32_t new)
{
    int ret = *p == old;
    *p = new;
    return ret;
}

#if defined(HAVE_GCC_ATOMICS)
int atomic_add_int(volatile int *p, int v)
{
    return __sync_fetch_and_add(p, v);
}
#else
inline uint32_t
atomic_add_int(uint32_t *p, int v)
{
    __asm __volatile (
        "   lock   xaddl   %0, %1 ;  "
        : "+r" (v),    /* 0 (result) */
          "=m" (*p)    /* 1 */
        : "m" (*p));   /* 2 */
    return (v);
}
#endif

#else /* FreeBSD */
#include <sys/param.h>
#include <machine/atomic.h>
#include <pthread_np.h> /* pthread w/ affinity */

#if __FreeBSD_version > 50
#include <sys/cpuset.h> /* cpu_set */
#if __FreeBSD_version > 80
#define HAVE_AFFINITY
#endif

inline void
Re: [Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))
On Wed, Jan 23, 2013 at 8:03 AM, Luigi Rizzo ri...@iet.unipi.it wrote: I'm even doubtful that it's always a win on FreeBSD. You have a threshold to fall back to bcopy() and who knows what the best value for various CPUs is. indeed. With the attached program (which however might be affected by the fact that data is not used after copying) it seems that on a recent linux (using gcc 4.6.2) the fastest is __builtin_memcpy() ./testlock -m __builtin_memcpy -l 64 (by a factor of 2 or more) whereas all the other methods have approximately the same speed. never mind, pilot error: in my test program I had swapped the source and destination operands for __builtin_memcpy(), and this substantially changed the memory access pattern. With the correct operands, __builtin_memcpy == memcpy == bcopy on both FreeBSD and Linux, and never faster than pkt_copy(). On FreeBSD pkt_copy() is still faster than the other methods for small packets, whereas on Linux they are equivalent. If you are curious why swapping source and dst changed things so dramatically: the test was supposed to read from a large chunk of memory (over 1GB) to avoid always hitting L1 or L2. Swapping operands causes reads to hit always the same line, thus saving a lot of misses. The difference between the two machines then probably is due to how the cache is used on writes. cheers luigi -- -+--- Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione http://www.iet.unipi.it/~luigi/ . Universita` di Pisa TEL +39-050-2211611 . via Diotisalvi 2 Mobile +39-338-6809875 . 56122 PISA (Italy) -+---
[Qemu-devel] [PATCH] netmap backend
Hi, the attached patch implements a qemu backend for the netmap API thus allowing machines to attach to the VALE software switch as well as netmap-supported cards (links below). http://info.iet.unipi.it/~luigi/netmap/ http://info.iet.unipi.it/~luigi/vale/ This is a cleaned up version of code written last summer. guest-guest speed using an e1000 frontend (with some modifications related to interrupt moderation, will repost an updated version later): up to 700 Kpps using sockets, and up to 5 Mpps using netmap within the guests. I have not tried with virtio. cheers luigi Signed-off-by: Luigi Rizzo ri...@iet.unipi.it -- configure | 31 + net/Makefile.objs | 1 + net/clients.h | 4 + net/net.c | 3 + net/qemu-netmap.c | 353 + net/queue.c | 15 +++ qapi-schema.json | 8 +- 7 files changed, 414 insertions(+), 1 deletions(-) diff --git a/configure b/configure index c6172ef..cfdf8a6 100755 --- a/configure +++ b/configure @@ -146,6 +146,7 @@ curl= curses= docs= fdt= +netmap= nptl= pixman= sdl= @@ -739,6 +740,10 @@ for opt do ;; --enable-vde) vde=yes ;; + --disable-netmap) netmap=no + ;; + --enable-netmap) netmap=yes + ;; --disable-xen) xen=no ;; --enable-xen) xen=yes @@ -1112,6 +1117,8 @@ echo --disable-uuid disable uuid support echo --enable-uuid enable uuid support echo --disable-vde disable support for vde network echo --enable-vde enable support for vde network +echo --disable-netmap disable support for netmap network +echo --enable-netmap enable support for netmap network echo --disable-linux-aio disable Linux AIO support echo --enable-linux-aio enable Linux AIO support echo --disable-cap-ng disable libcap-ng support @@ -1914,6 +1921,26 @@ EOF fi ## +# netmap headers probe +if test "$netmap" != "no" ; then + cat > $TMPC <<EOF +#include <inttypes.h> +#include <net/if.h> +#include <net/netmap.h> +#include <net/netmap_user.h> +int main(void) { return 0; } +EOF + if compile_prog ; then + netmap=yes + else + if test "$netmap" = "yes" ; then + feature_not_found netmap + fi + netmap=no + fi +fi + +## # libcap-ng
library probe if test "$cap_ng" != "no" ; then cap_libs="-lcap-ng" @@ -3314,6 +3341,7 @@ echo NPTL support $nptl echo GUEST_BASE $guest_base echo PIE $pie echo vde support $vde +echo netmap support $netmap echo Linux AIO support $linux_aio echo ATTR/XATTR support $attr echo Install blobs $blobs @@ -3438,6 +3466,9 @@ fi if test "$vde" = "yes" ; then echo "CONFIG_VDE=y" >> $config_host_mak fi +if test "$netmap" = "yes" ; then + echo "CONFIG_NETMAP=y" >> $config_host_mak +fi if test "$cap_ng" = "yes" ; then echo "CONFIG_LIBCAP=y" >> $config_host_mak fi diff --git a/net/Makefile.objs b/net/Makefile.objs index a08cd14..068253f 100644 --- a/net/Makefile.objs +++ b/net/Makefile.objs @@ -10,3 +10,4 @@ common-obj-$(CONFIG_AIX) += tap-aix.o common-obj-$(CONFIG_HAIKU) += tap-haiku.o common-obj-$(CONFIG_SLIRP) += slirp.o common-obj-$(CONFIG_VDE) += vde.o +common-obj-$(CONFIG_NETMAP) += qemu-netmap.o diff --git a/net/clients.h b/net/clients.h index 7793294..952d076 100644 --- a/net/clients.h +++ b/net/clients.h @@ -52,4 +52,8 @@ int net_init_vde(const NetClientOptions *opts, const char *name, NetClientState *peer); #endif +#ifdef CONFIG_NETMAP +int net_init_netmap(const NetClientOptions *opts, const char *name, + NetClientState *peer); +#endif #endif /* QEMU_NET_CLIENTS_H */ diff --git a/net/net.c b/net/net.c index cdd9b04..816c987 100644 --- a/net/net.c +++ b/net/net.c @@ -618,6 +618,9 @@ static int (* const net_client_init_fun[NET_CLIENT_OPTIONS_KIND_MAX])( [NET_CLIENT_OPTIONS_KIND_BRIDGE] = net_init_bridge, #endif [NET_CLIENT_OPTIONS_KIND_HUBPORT] = net_init_hubport, +#ifdef CONFIG_NETMAP + [NET_CLIENT_OPTIONS_KIND_NETMAP] = net_init_netmap, +#endif }; diff --git a/net/qemu-netmap.c b/net/qemu-netmap.c new file mode 100644 index 000..79d7c09 --- /dev/null +++ b/net/qemu-netmap.c @@ -0,0 +1,353 @@ +/* + * netmap access for qemu + * + * Copyright (c) 2012-2013 Luigi Rizzo + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files
(the "Software"), to deal + in the Software without restriction, including without limitation the rights + to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + copies of the Software, and to permit persons to whom the Software is + furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE
[Qemu-devel] [PATCH] fix qemu_flush_queued_packets() in presence of a hub
when frontend and backend are connected through a hub as below (showing only one direction), and the frontend (or in general, all output ports of the hub) cannot accept more traffic, the backend queues packets in queue-A. When the frontend (or in general, one output port) becomes ready again, qemu tries to flush packets from queue-B, which is unfortunately empty. e1000.0 --[queue B]-- hub0port0 (hub) hub0port1 --[queue A]-- tap.0 To fix this I propose to introduce a new function net_hub_flush() which is called when trying to flush a queue connected to a hub. cheers luigi Signed-off-by: Luigi Rizzo ri...@iet.unipi.it net/hub.c | 13 + net/hub.h | 1 + net/net.c | 6 ++ diff --git a/net/hub.c b/net/hub.c index a24c9d1..08f95d0 100644 --- a/net/hub.c +++ b/net/hub.c @@ -338,3 +338,16 @@ void net_hub_check_clients(void) } } } + +bool net_hub_flush(NetClientState *nc) +{ +    NetHubPort *port; +    NetHubPort *source_port = DO_UPCAST(NetHubPort, nc, nc); +    int ret = 0; + +    QLIST_FOREACH(port, &source_port->hub->ports, next) { +        if (port != source_port) +            ret += qemu_net_queue_flush(port->nc.send_queue); +    } +    return ret ?
true : false; +} diff --git a/net/hub.h b/net/hub.h index 583ada8..a625eff 100644 --- a/net/hub.h +++ b/net/hub.h @@ -21,5 +21,6 @@ NetClientState *net_hub_add_port(int hub_id, const char *name); NetClientState *net_hub_find_client_by_name(int hub_id, const char *name); void net_hub_info(Monitor *mon); void net_hub_check_clients(void); +bool net_hub_flush(NetClientState *nc); #endif /* NET_HUB_H */ diff --git a/net/net.c b/net/net.c index 816c987..8caa368 100644 --- a/net/net.c +++ b/net/net.c @@ -355,6 +355,12 @@ void qemu_flush_queued_packets(NetClientState *nc) { nc->receive_disabled = 0; +    if (nc->peer && nc->peer->info->type == NET_CLIENT_OPTIONS_KIND_HUBPORT) { +        if (net_hub_flush(nc->peer)) { +            qemu_notify_event(); +        } +        return; +    } if (qemu_net_queue_flush(nc->send_queue)) { /* We emptied the queue successfully, signal to the IO thread to repoll * the file descriptor (for tap, for example).
[Qemu-devel] [PATCH v2] netmap backend (revised)
reposting a version without the changes that implement bounded queues in net/queue.c Hi, the attached patch implements a qemu backend for the netmap API thus allowing machines to attach to the VALE software switch as well as netmap-supported cards (links below). http://info.iet.unipi.it/~luigi/netmap/ http://info.iet.unipi.it/~luigi/vale/ This is a cleaned up version of code written last summer. guest-guest speed using an e1000 frontend (with some modifications related to interrupt moderation, will repost an updated version later): up to 700 Kpps using sockets, and up to 5 Mpps using netmap within the guests. I have not tried with virtio. cheers luigi Signed-off-by: Luigi Rizzo ri...@iet.unipi.it -- configure | 31 + net/Makefile.objs | 1 + net/clients.h | 4 + net/net.c | 3 + net/qemu-netmap.c | 353 + qapi-schema.json | 8 +- diff --git a/configure b/configure index c6172ef..cfdf8a6 100755 --- a/configure +++ b/configure @@ -146,6 +146,7 @@ curl= curses= docs= fdt= +netmap= nptl= pixman= sdl= @@ -739,6 +740,10 @@ for opt do ;; --enable-vde) vde=yes ;; + --disable-netmap) netmap=no + ;; + --enable-netmap) netmap=yes + ;; --disable-xen) xen=no ;; --enable-xen) xen=yes @@ -1112,6 +1117,8 @@ echo --disable-uuid disable uuid support echo --enable-uuid enable uuid support echo --disable-vde disable support for vde network echo --enable-vde enable support for vde network +echo --disable-netmap disable support for netmap network +echo --enable-netmap enable support for netmap network echo --disable-linux-aio disable Linux AIO support echo --enable-linux-aio enable Linux AIO support echo --disable-cap-ng disable libcap-ng support @@ -1914,6 +1921,26 @@ EOF fi ## +# netmap headers probe +if test "$netmap" != "no" ; then + cat > $TMPC <<EOF +#include <inttypes.h> +#include <net/if.h> +#include <net/netmap.h> +#include <net/netmap_user.h> +int main(void) { return 0; } +EOF + if compile_prog ; then + netmap=yes + else + if test "$netmap" = "yes" ; then + feature_not_found netmap + fi + netmap=no + fi +fi + +## #
libcap-ng library probe if test "$cap_ng" != "no" ; then cap_libs="-lcap-ng" @@ -3314,6 +3341,7 @@ echo NPTL support $nptl echo GUEST_BASE $guest_base echo PIE $pie echo vde support $vde +echo netmap support $netmap echo Linux AIO support $linux_aio echo ATTR/XATTR support $attr echo Install blobs $blobs @@ -3438,6 +3466,9 @@ fi if test "$vde" = "yes" ; then echo "CONFIG_VDE=y" >> $config_host_mak fi +if test "$netmap" = "yes" ; then + echo "CONFIG_NETMAP=y" >> $config_host_mak +fi if test "$cap_ng" = "yes" ; then echo "CONFIG_LIBCAP=y" >> $config_host_mak fi diff --git a/net/Makefile.objs b/net/Makefile.objs index a08cd14..068253f 100644 --- a/net/Makefile.objs +++ b/net/Makefile.objs @@ -10,3 +10,4 @@ common-obj-$(CONFIG_AIX) += tap-aix.o common-obj-$(CONFIG_HAIKU) += tap-haiku.o common-obj-$(CONFIG_SLIRP) += slirp.o common-obj-$(CONFIG_VDE) += vde.o +common-obj-$(CONFIG_NETMAP) += qemu-netmap.o diff --git a/net/clients.h b/net/clients.h index 7793294..952d076 100644 --- a/net/clients.h +++ b/net/clients.h @@ -52,4 +52,8 @@ int net_init_vde(const NetClientOptions *opts, const char *name, NetClientState *peer); #endif +#ifdef CONFIG_NETMAP +int net_init_netmap(const NetClientOptions *opts, const char *name, + NetClientState *peer); +#endif #endif /* QEMU_NET_CLIENTS_H */ diff --git a/net/net.c b/net/net.c index cdd9b04..816c987 100644 --- a/net/net.c +++ b/net/net.c @@ -618,6 +618,9 @@ static int (* const net_client_init_fun[NET_CLIENT_OPTIONS_KIND_MAX])( [NET_CLIENT_OPTIONS_KIND_BRIDGE] = net_init_bridge, #endif [NET_CLIENT_OPTIONS_KIND_HUBPORT] = net_init_hubport, +#ifdef CONFIG_NETMAP + [NET_CLIENT_OPTIONS_KIND_NETMAP] = net_init_netmap, +#endif }; diff --git a/net/qemu-netmap.c b/net/qemu-netmap.c new file mode 100644 index 000..79d7c09 --- /dev/null +++ b/net/qemu-netmap.c @@ -0,0 +1,353 @@ +/* + * netmap access for qemu + * + * Copyright (c) 2012-2013 Luigi Rizzo + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation
files (the "Software"), to deal + in the Software without restriction, including without limitation the rights + to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + copies of the Software, and to permit persons to whom the Software is + furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software
[Qemu-devel] wrong argument to qemu_flush_queued_packets() in network frontends ?
While running qemu 1.3.0 with the following network-related flags: -net nic -net tap,ifname=tap0,script='' I encountered the same problem (should be common to several frontends, e.g. e100, eepro100, virtio-net, xen_nic): in net/tap.c :: tap_send(), if qemu_send_packet_async() returns 0 (e.g. because the NIC has no buffers available) traffic stops, despite the fact that the frontend will try to pull queued packets when the receive ring is updated. Upon investigation, it turns out that the backend code does size = qemu_send_packet_async(&s->nc, buf, size, tap_send_completed); if (size == 0) { tap_read_poll(s, 0); and the arguments are s->nc.name = tap.0 s->nc.peer->name = hub0port1 s->nc.send_queue = 0x7f40b2f61e20 s->nc.peer->send_queue = 0x7f40b2f63690 <--- enqueued here whereas the frontend is trying to pull from a different queue qemu_flush_queued_packets(&s->nic->nc); with arguments s->nic->nc.name = e1000.0 s->nic->nc.peer->name = hub0port0 <--- try to flush this s->nic->nc.send_queue = 0x7f40b3008ae0 s->nic->nc.peer->send_queue = 0x7f40b2f63660 Note, regular traffic flows correctly across the hub, but qemu_flush_queued_packets() seems to try and pull from the wrong place. Any idea how to fix this (other than the inefficient solution of leaving read_poll=1 in the frontend)? cheers luigi
Re: [Qemu-devel] wrong argument to qemu_flush_queued_packets() in network frontends ?
small correction: On Sun, Jan 20, 2013 at 6:50 PM, Luigi Rizzo ri...@iet.unipi.it wrote: While running qemu 1.3.0 with the following network-related flags: -net nic -net tap,ifname=tap0,script='' I encountered the same problem (should be common to several frontends, e.g. e100, eepro100, virtio-net, xen_nic): in net/tap.c :: tap_send(), if qemu_send_packet_async() returns 0 (e.g. because the NIC has no buffers available) traffic stops, despite the fact that the frontend will try to pull queued packets when the receive ring is updated. Upon investigation, it turns out that the backend code does size = qemu_send_packet_async(&s->nc, buf, size, tap_send_completed); if (size == 0) { tap_read_poll(s, 0); and the arguments are s->nc.name = tap.0 s->nc.peer->name = hub0port1 s->nc.send_queue = 0x7f40b2f61e20 s->nc.peer->send_queue = 0x7f40b2f63690 <--- enqueued here whereas the frontend is trying to pull from a different queue qemu_flush_queued_packets(&s->nic->nc); with arguments s->nic->nc.name = e1000.0 s->nic->nc.peer->name = hub0port0 <--- try to flush this s->nic->nc.send_queue = 0x7f40b3008ae0 the queue that is actually flushed is s->nic->nc.send_queue or 0x7f40b3008ae0 s->nic->nc.peer->send_queue = 0x7f40b2f63660 Note, regular traffic flows correctly across the hub, but qemu_flush_queued_packets() seems to try and pull from the wrong place. Any idea how to fix this (other than the inefficient solution of leaving read_poll=1 in the frontend)? cheers luigi cheers luigi
Re: [Qemu-devel] wrong argument to qemu_flush_queued_packets() in network frontends ?
... and upon closer inspection, the problem described below (frontend blocks the backend, then tries to drain the wrong queue causing a stall) occurs because the hub in the middle breaks the flow of events. In the configuration below ( -net nic -net tap,ifname=tap0,... ) we have e1000.0 -- hub0port0 [hub] hub0port1 -- tap.0 The hub0port1 reports as non-writable when all other ports (just one in this case) are full, and the packet is queued on hub0port1. However when the e1000 frontend tries to drain the queue, it directly accesses the queue attached to hub0port0, which is empty. So it appears that the only fix is the following: when a node is attached to a hub, instead of draining the queue on the node one should drain all queues attached to the hub. A new function qemu_flush_hub() would be handy, something like QLIST_FOREACH(port, &hub->ports, next) { if (port != source_port) qemu_flush_queued_packets(&port->nc); } The other option (queueing on the output ports of the hub) would require a bit more attention to make sure that the callback is only executed once (and also, to avoid excessive data replication). Not impossible, but it requires reference counting the packet. What do you think, which way do you prefer? cheers luigi On Sun, Jan 20, 2013 at 6:50 PM, Luigi Rizzo ri...@iet.unipi.it wrote: While running qemu 1.3.0 with the following network-related flags: -net nic -net tap,ifname=tap0,script='' I encountered the same problem (should be common to several frontends, e.g. e100, eepro100, virtio-net, xen_nic): in net/tap.c :: tap_send(), if qemu_send_packet_async() returns 0 (e.g. because the NIC has no buffers available) traffic stops, despite the fact that the frontend will try to pull queued packets when the receive ring is updated.
Upon investigation, it turns out that the backend code does size = qemu_send_packet_async(&s->nc, buf, size, tap_send_completed); if (size == 0) { tap_read_poll(s, 0); and the arguments are s->nc.name = tap.0 s->nc.peer->name = hub0port1 s->nc.send_queue = 0x7f40b2f61e20 s->nc.peer->send_queue = 0x7f40b2f63690 <--- enqueued here whereas the frontend is trying to pull from a different queue qemu_flush_queued_packets(&s->nic->nc); with arguments s->nic->nc.name = e1000.0 s->nic->nc.peer->name = hub0port0 <--- try to flush this s->nic->nc.send_queue = 0x7f40b3008ae0 s->nic->nc.peer->send_queue = 0x7f40b2f63660 Note, regular traffic flows correctly across the hub, but qemu_flush_queued_packets() seems to try and pull from the wrong place. Any idea how to fix this (other than the inefficient solution of leaving read_poll=1 in the frontend)? cheers luigi
Re: [Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?
On Fri, Jan 18, 2013 at 05:14:02PM +0100, Paolo Bonzini wrote: ... This relies on the assumption that the ring (which is contiguous in the guest's physical address space) is also contiguous in the host's virtual address space. In principle the property could be easily verified once the ring is set up. IIRC, the amount of contiguous memory is written by address_space_map in the plen parameter. In your case: + s->txring = address_space_map(pci_dma_context(s->dev)->as, + base, &desclen, 0 /* is_write */); that would be desclen on return from address_space_map. ok thanks. And of course, am i missing some important detail ? Unfortunately yes. First, host memory mappings could change (though they rarely do on PC). The result of address_space_map is not guaranteed to be stable. To avoid problems with this, however, you could use something like hw/dataplane/hostmem.c and even avoid address_space_map altogether. I'll look into that. Hopefully there is something that i can use as a notification that the mapping has changed... Second, the pci_dma_*() could have the addresses translated by an IOMMU. virtio is documented to have real physical memory addresses, but this does not apply to other devices. I see. I suppose the ability to have an IOMMU depends on the specific NIC ? I am only planning to use the above shortcut for e1000. thanks a lot for the quick feedback luigi
[Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?
Hi, with a bunch of e1000 improvements we are at a point where we are doing over 1 Mpps (short frames) and 7-8 Gbit/s (1500 byte frames) between two guests, and two things that are high in the perf top stats are phys_page_find() and related memory copies. Both are triggered by the pci_dma_read() and pci_dma_write(), which on e1000 (and presumably other frontends) are called on every single descriptor and every single buffer. I have then tried to access the guest memory without going every time through the page lookup. For the tx and rx rings i have a partial workaround, which tracks changes to the base address of the ring, converts it to a host virtual address +#ifdef MAP_RING +    base = tx_desc_base(s); +    if (base != s->txring_phi) { +        hwaddr desclen = s->mac_reg[TDLEN]; +        s->txring_phi = base; +        s->txring = address_space_map(pci_dma_context(s->dev)->as, +            base, &desclen, 0 /* is_write */); +    } +#endif /* MAP_RING */ ... and then accesses the descriptor directly into guest memory desc = s->txring[s->mac_reg[TDH]]; (sprinkle with memory barriers as needed). This relies on the assumption that the ring (which is contiguous in the guest's physical address space) is also contiguous in the host's virtual address space. In principle the property could be easily verified once the ring is set up. I have not done this for buffers because I am not sure how to verify that the same mapping holds for all packet buffers. One way could be the following: on the first buffer access, make the address translation and try to determine the boundaries of the contiguous (in virtual host memory) region that holds the buffer. Then subsequent buffers can be easily validated against this region. So the problem is now the following: given a guest physical address, is there a quick way to determine the contiguous region of memory in the host that contains it ? And of course, am i missing some important detail ?
Of course the above could be used conditionally if the required conditions hold, and then revert to the pci_dma_*() in other cases. cheers luigi
Re: [Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?
On Fri, Jan 18, 2013 at 05:14:02PM +0100, Paolo Bonzini wrote: Il 18/01/2013 17:04, Luigi Rizzo ha scritto: Hi, with a bunch of e1000 improvements we are at a point where we are doing over 1 Mpps (short frames) and 7-8 Gbit/s (1500 byte frames) between two guests, and two things that are high in the perf top stats are phys_page_find() and related memory copies. Both are triggered by the pci_dma_read() and pci_dma_write(), which on e1000 (and presumably other frontends) are called on every single descriptor and every single buffer. I have then tried to access the guest memory without going every time through the page lookup. [...] This relies on the assumption that the ring (which is contiguous in the guest's physical address space) is also contiguous in the host's virtual address space. In principle the property could be easily verified once the ring is set up. IIRC, the amount of contiguous memory is written by address_space_map in the plen parameter. unfortunately the plen parameter is modified only if the area is smaller than the request, and there is no method that i can find that returns [base,len] of a RAMBlock. What I came up with, also to check for invalid addresses, IOMMU and the like, is something like this: // addr is the address we want to map into host VM int mappable_addr(PCIDevice *dev, hwaddr addr, uint64_t *guest_ha_low, uint64_t *guest_ha_high, uint64_t *gpa_to_hva) { AddressSpace *as = dev->as; AddressSpaceDispatch *d = as->dispatch; MemoryRegionSection *section; RAMBlock *block; if (dma_has_iommu(pci_dma_context(dev))) return 0; // no direct access section = phys_page_find(d, addr >> TARGET_PAGE_BITS); if (!memory_region_is_ram(section->mr) || section->readonly) return 0; // no direct access QLIST_FOREACH(block, &ram_list.blocks, next) { if (addr - block->offset < block->length) { /* set 3 variables indicating the valid range * and the offset between the two address spaces.
*/ *guest_ha_low = block->offset; *guest_ha_high = block->offset + block->length; *gpa_to_hva = (uint64_t)block->host - block->offset; return 1; } } return 0; } (this probably needs to be put in exec.c or some other place that can access phys_page_find() and RAMBlock) The interested client (hw/e1000.c in my case) could then do a memory_listener_register() to be notified of changes, invoke mappable_addr() on the first data buffer it has to translate, and cache the result and use it to speed up the translation subsequently in case of a hit (with the pci_dma_read/write being the fallback methods in case of a miss). The cache is invalidated on updates arriving from the memory listener, and refreshed at the next access. Is this more sound ? The only missing piece then is the call to invalidate_and_set_dirty() cheers luigi
Re: [Qemu-devel] [PATCH] fix unbounded qemu NetQueue
On Thu, Jan 17, 2013 at 2:21 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Thu, Jan 17, 2013 at 07:07:11AM +0100, Luigi Rizzo wrote: The comment at the beginning of net/queue.c says that packets that cannot be sent by qemu_net_queue_send() should not be enqueued unless a callback is set. This patch implements this behaviour, which prevents a queue from growing unbounded (e.g. when a network backend is not connected). Also for good measure the patch implements bounded size queues (though it should not be necessary now because each source can only have one packet queued). When a packet is dropped because of excessive queue size the callback is not supposed to be called. Although I appreciate the semantics that the comment tries to establish, the code doesn't behave like this today and we cannot drop packets in cases where we relied on queuing them. More changes will be required to make the hub, USB, pcap scenario I described previously work. i see. then the other option would be to drop packets only if the queue is oversize AND the callback is not set: +    if (queue->nq_count >= queue->nq_maxlen && !sent_cb) +        return; so we should be able to use the queue to store packets when the USB guest is slow, and avoid dropping precious packets with the callback set (there should be only one of them for each source) ? The queue might grow slightly overlimit but still be kept under control. I cannot think of another way to handle the callback. I think we cannot run it on a drop as it would cause a premature restart of the sender. (at a cursory inspection of the code, just tap.c and virtio-net.c use callbacks) cheers luigi
Re: [Qemu-devel] unbounded qemu NetQue's ?
On Mon, Jan 7, 2013 at 5:49 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Sun, Jan 06, 2013 at 08:23:56PM +0100, Luigi Rizzo wrote: Hi, while testing the tx path in qemu without a network backend connected, i noticed that qemu_net_queue_send() builds up an unbounded queue, because QTAILQ* have no notion of a queue length. As a result, memory usage of a qemu instance can grow to extremely large values. I wonder if it makes sense to implement a hard limit on the size of NetQue's. The patch below is only a partial implementation but probably sufficient for these purposes. cheers luigi Hi Luigi, Good idea, we should bound the queues to prevent rare situations or bugs from hogging memory. ... diff -urp qemu-1.3.0-orig/net/queue.c qemu-1.3.0/net/queue.c --- qemu-1.3.0-orig/net/queue.c 2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/net/queue.c 2013-01-06 19:38:12.0 +0100 @@ -92,6 +92,9 @@ static void qemu_net_queue_append(NetQue Please also do it for qemu_net_queue_append_iov(). { NetPacket *packet; +    if (queue->packets.tqh_count > 1) +        return; // XXX drop + sent_cb() must be invoked. It's best to do this in a QEMUBH in case the caller is not prepared for reentrancy. Stefan, the semantics of callbacks makes it difficult to run them on drops: they are supposed to run only when the queue has been drained, and apparently only once per sender, according to the comment in the header of net/queue.c: /* The delivery handler may only return zero if it will call * qemu_net_queue_flush() when it determines that it is once again able * to deliver packets. It must also call qemu_net_queue_purge() in its * cleanup path. * * If a sent callback is provided to send(), the caller must handle a * zero return from the delivery handler by not sending any more packets * until we have invoked the callback. Only in that case will we queue * the packet. * * If a sent callback isn't provided, we just drop the packet to avoid * unbounded queueing.
*/ This seems to suggest that a packet to qemu_net_queue_send() should never be queued unless it has a callback (hence the existing code is buggy, because it never ever drops packets), so the queue can only hold one packet per sender, hence there is no real risk of overflow. cheers luigi
[Qemu-devel] [PATCH] fix unbounded qemu NetQueue
The comment at the beginning of net/queue.c says that packets that cannot be sent by qemu_net_queue_send() should not be enqueued unless a callback is set. This patch implements that behaviour, which prevents a queue from growing unbounded (e.g. when a network backend is not connected). For good measure, the patch also implements bounded-size queues (though that should not be necessary now, because each source can only have one packet queued). When a packet is dropped because of excessive queue size, the callback is not invoked. cheers luigi Signed-off-by: Luigi Rizzo ri...@iet.unipi.it
---
--- ../orig/net/queue.c 2012-12-03 11:37:05.0 -0800
+++ ./net/queue.c 2013-01-16 21:37:20.109732443 -0800
@@ -50,6 +50,8 @@ struct NetPacket {
 
 struct NetQueue {
     void *opaque;
+    uint32_t nq_maxlen;
+    uint32_t nq_count;
     QTAILQ_HEAD(packets, NetPacket) packets;
 
@@ -63,6 +65,8 @@ NetQueue *qemu_new_net_queue(void *opaqu
     queue = g_malloc0(sizeof(NetQueue));
 
     queue->opaque = opaque;
+    queue->nq_maxlen = 1; /* arbitrary limit */
+    queue->nq_count = 0;
 
     QTAILQ_INIT(&queue->packets);
 
@@ -92,6 +96,8 @@ static void qemu_net_queue_append(NetQue
 {
     NetPacket *packet;
 
+    if (!sent_cb || queue->nq_count >= queue->nq_maxlen)
+        return;
     packet = g_malloc(sizeof(NetPacket) + size);
     packet->sender = sender;
     packet->flags = flags;
@@ -99,6 +105,7 @@ static void qemu_net_queue_append(NetQue
     packet->sent_cb = sent_cb;
     memcpy(packet->data, buf, size);
 
+    queue->nq_count++;
     QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
 
@@ -113,6 +120,8 @@ static void qemu_net_queue_append_iov(Ne
     size_t max_len = 0;
     int i;
 
+    if (!sent_cb || queue->nq_count >= queue->nq_maxlen)
+        return;
     for (i = 0; i < iovcnt; i++) {
         max_len += iov[i].iov_len;
     }
@@ -130,6 +139,7 @@ static void qemu_net_queue_append_iov(Ne
         packet->size += len;
     }
 
+    queue->nq_count++;
     QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
 
@@ -220,6 +230,7 @@ void qemu_net_queue_purge(NetQueue *queu
     QTAILQ_FOREACH_SAFE(packet, &queue->packets, entry, next) {
         if (packet->sender == from) {
             QTAILQ_REMOVE(&queue->packets, packet, entry);
+            queue->nq_count--;
             g_free(packet);
         }
     }
 
@@ -233,6 +244,7 @@ bool qemu_net_queue_flush(NetQueue *queu
         packet = QTAILQ_FIRST(&queue->packets);
         QTAILQ_REMOVE(&queue->packets, packet, entry);
+        queue->nq_count--;
 
         ret = qemu_net_queue_deliver(queue,
                                      packet->sender,
@@ -240,6 +252,7 @@ bool qemu_net_queue_flush(NetQueue *queu
                                      packet->data,
                                      packet->size);
         if (ret == 0) {
+            queue->nq_count++;
             QTAILQ_INSERT_HEAD(&queue->packets, packet, entry);
             return false;
         }
[Qemu-devel] nic-specific options ? (Re: [RFC] updated e1000 mitigation patch)
On Thu, Jan 10, 2013 at 01:25:48PM +0100, Stefan Hajnoczi wrote: On Thu, Dec 27, 2012 at 11:06:58AM +0100, Luigi Rizzo wrote: diff -urp qemu-1.3.0-orig/hw/e1000.c qemu-1.3.0/hw/e1000.c --- qemu-1.3.0-orig/hw/e1000.c 2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/hw/e1000.c 2012-12-27 09:47:16.0 +0100 @@ -35,6 +35,8 @@ #include "e1000_hw.h" +static int mit_on = 1; /* interrupt mitigation enable */ If you want to make this optional then please put it inside E1000State so it can be toggled per NIC. what is the simplest way to add NIC-specific options ? I have added one line to e1000_properties, as below static Property e1000_properties[] = { DEFINE_NIC_PROPERTIES(E1000State, conf), +DEFINE_PROP_UINT32("mit_on", E1000State, mit_on, 0), DEFINE_PROP_END_OF_LIST(), }; and this way the option is recognised on the command line qemu ... -device e1000,mit_on=1 ... but i do not know how to set the property for the NIC i am using. Specifically, i normally run qemu with -net nic,model=e1000 (leaving the NIC unconnected to the host network, so i can test the tx path without being constrained by the backend's speed) Any suggestion ? thanks luigi
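For the record, qdev properties of a NIC created with the legacy -net nic syntax can also be set through -global, which overrides a property's default value for every instance of a driver. A sketch of both invocation styles (mit_on is the experimental property from the patch above, not an upstream option, so these command lines only work with the patched build):

```sh
# Explicit device syntax: the property is given directly on -device
qemu-system-x86_64 -netdev user,id=n1 -device e1000,netdev=n1,mit_on=1

# Legacy -net nic syntax: override the property default for all e1000 devices
qemu-system-x86_64 -net nic,model=e1000 -global e1000.mit_on=1
```

The -global form answers the question in the email: it applies the property even when the device is created implicitly by -net nic rather than by -device.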
Re: [Qemu-devel] unbounded qemu NetQue's ?
On Mon, Jan 7, 2013 at 2:49 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Sun, Jan 06, 2013 at 08:23:56PM +0100, Luigi Rizzo wrote: Hi, while testing the tx path in qemu without a network backend connected, i noticed that qemu_net_queue_send() builds up an unbounded queue, because QTAILQ* have no notion of a queue length. As a result, memory usage of a qemu instance can grow to extremely large values. I wonder if it makes sense to implement a hard limit on the size of NetQueues. The patch below is only a partial implementation but probably sufficient for these purposes. cheers luigi Hi Luigi, Good idea, we should bound the queues to prevent rare situations or bugs from hogging memory. Ideally we would do away with queues completely and make net clients hand buffers to each other ahead of time to impose flow control. I've thought about this a few times and always reached the conclusion that it's not quite possible. given that implementing flow control on the inbound (host->guest) path is impossible, i tend to agree that removing queues is probably not worth the effort even if possible at all. ... Your comments below also remind me of a more general issue with the code: diff -urp qemu-1.3.0-orig/net/queue.c qemu-1.3.0/net/queue.c --- qemu-1.3.0-orig/net/queue.c 2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/net/queue.c 2013-01-06 19:38:12.0 +0100 @@ -92,6 +92,9 @@ static void qemu_net_queue_append(NetQue Please also do it for qemu_net_queue_append_iov(). the qemu code has many duplicate functions of the form foo() and foo_iov(). Not only here, but even in the backends (e.g. see net/tap.c). I think it would be good to settle on a single version of the function and remove or convert the non-iov instances. { NetPacket *packet; +if (queue->packets.tqh_count > 1) + return; // XXX drop + sent_cb() must be invoked. It's best to do this in a QEMUBH in case the caller is not prepared for reentrancy.
i forgot that, but the use of sent_cb() is interesting: the only place that actually uses it seems to be net/tap.c, and the way it uses it only makes sense if the queue has room! packet = g_malloc(sizeof(NetPacket) + size); packet-sender = sender; packet-flags = flags; diff -urp qemu-1.3.0-orig/qemu-queue.h qemu-1.3.0/qemu-queue.h --- qemu-1.3.0-orig/qemu-queue.h 2012-12-03 20:37:05.0+0100 +++ qemu-1.3.0/qemu-queue.h 2013-01-06 19:34:01.0 +0100 Please don't modify qemu-queue.h. It's a generic queue data structure used by all of QEMU. Instead, keep a counter in the NetQueue. good suggestion, that also makes the change smaller. cheers luigi Stefan -- -+--- Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione http://www.iet.unipi.it/~luigi/. Universita` di Pisa TEL +39-050-2211611 . via Diotisalvi 2 Mobile +39-338-6809875 . 56122 PISA (Italy) -+---
[Qemu-devel] unbounded qemu NetQue's ?
Hi, while testing the tx path in qemu without a network backend connected, i noticed that qemu_net_queue_send() builds up an unbounded queue, because QTAILQ* have no notion of a queue length. As a result, memory usage of a qemu instance can grow to extremely large values. I wonder if it makes sense to implement a hard limit on size of NetQue's. The patch below is only a partial implementation but probably sufficient for these purposes. cheers luigi diff -urp qemu-1.3.0-orig/net/queue.c qemu-1.3.0/net/queue.c --- qemu-1.3.0-orig/net/queue.c 2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/net/queue.c 2013-01-06 19:38:12.0 +0100 @@ -92,6 +92,9 @@ static void qemu_net_queue_append(NetQue { NetPacket *packet; +if (queue-packets.tqh_count 1) + return; // XXX drop + packet = g_malloc(sizeof(NetPacket) + size); packet-sender = sender; packet-flags = flags; diff -urp qemu-1.3.0-orig/qemu-queue.h qemu-1.3.0/qemu-queue.h --- qemu-1.3.0-orig/qemu-queue.h2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/qemu-queue.h 2013-01-06 19:34:01.0 +0100 @@ -320,11 +320,12 @@ struct { struct name { \ qual type *tqh_first; /* first element */ \ qual type *qual *tqh_last; /* addr of last next element */ \ + int32_t tqh_count; \ } #define QTAILQ_HEAD(name, type) Q_TAILQ_HEAD(name, struct type,) #define QTAILQ_HEAD_INITIALIZER(head) \ -{ NULL, (head).tqh_first } +{ NULL, (head).tqh_first, 0 } #define Q_TAILQ_ENTRY(type, qual) \ struct {\ @@ -339,6 +340,7 @@ struct { #define QTAILQ_INIT(head) do { \ (head)-tqh_first = NULL; \ (head)-tqh_last = (head)-tqh_first; \ +(head)-tqh_count = 0; \ } while (/*CONSTCOND*/0) #define QTAILQ_INSERT_HEAD(head, elm, field) do { \ @@ -348,6 +350,7 @@ struct { else\ (head)-tqh_last = (elm)-field.tqe_next; \ (head)-tqh_first = (elm); \ +(head)-tqh_count++; \ (elm)-field.tqe_prev = (head)-tqh_first; \ } while (/*CONSTCOND*/0) @@ -356,6 +359,7 @@ struct { (elm)-field.tqe_prev = (head)-tqh_last; \ *(head)-tqh_last = (elm); \ (head)-tqh_last = (elm)-field.tqe_next; \ 
+(head)-tqh_count++; \ } while (/*CONSTCOND*/0) #define QTAILQ_INSERT_AFTER(head, listelm, elm, field) do { \ @@ -381,6 +385,7 @@ struct { (elm)-field.tqe_prev; \ else\ (head)-tqh_last = (elm)-field.tqe_prev; \ +(head)-tqh_count--; \ *(elm)-field.tqe_prev = (elm)-field.tqe_next; \ } while (/*CONSTCOND*/0)
Re: [Qemu-devel] [PULL 0/1] update seabios
are you going to distribute a 1.3.x snapshot with the updated bios that lets FreeBSD boot ? thanks luigi On Wed, Jan 2, 2013 at 5:57 PM, Anthony Liguori anth...@codemonkey.wswrote: Gerd Hoffmann kra...@redhat.com writes: Hi, One more seabios update, fixing the FreeBSD build failure. please pull, Gerd
[Qemu-devel] [RFC] updated e1000 mitigation patch
Before submitting a proper patch, I'd like to hear feedback on the following proposed change to hw/e1000.c to properly implement interrupt mitigation. This is joint work with Vincenzo Maffione and Giuseppe Lettieri (in Cc), and is a follow-up to a similar patch i posted in July https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg03195.html The patch models three of the five (sic!) e1000 registers that control moderation, and uses qemu timers for that (the minimum delay for these timers affects the fidelity of the emulation). Right now there is a static variable that controls whether the feature is enabled. I would probably like to make it a parameter accessible from the command line in qemu, possibly extending it to override the mitigation delay set by the device driver. Right now we have reached transmit speeds in the order of 200-300 Kpps (if matched with a proper guest device driver optimization, see https://groups.google.com/forum/?fromgroups=#!topic/mailing.freebsd.emulator/xQHR_hleCuc ) Some performance data using a FreeBSD guest, for UDP transmissions:

                                   KVM        QEMU
  standard KVM, standard driver    20 Kpps    6.3 Kpps
  modified KVM, standard driver    37 Kpps    28 Kpps
  modified KVM, modified driver    200 Kpps   34 Kpps

As you can see, on kvm this change gives a 5x speedup to the tx path, which combines nicely with the 2x speedup that comes from supporting interrupt mitigation alone in the hypervisor. Without kvm (or kqemu ?) the benefits are much lower, as the guest becomes too slow.
cheers luigi diff -urp qemu-1.3.0-orig/hw/e1000.c qemu-1.3.0/hw/e1000.c --- qemu-1.3.0-orig/hw/e1000.c 2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/hw/e1000.c 2012-12-27 09:47:16.0 +0100 @@ -35,6 +35,8 @@ #include e1000_hw.h +static int mit_on = 1; /* interrupt mitigation enable */ + #define E1000_DEBUG #ifdef E1000_DEBUG @@ -129,6 +131,9 @@ typedef struct E1000State_st { } eecd_state; QEMUTimer *autoneg_timer; +QEMUTimer *mit_timer; // handle for the timer +uint32_t mit_timer_on; // mitigation timer active +uint32_t mit_cause;// pending interrupt cause } E1000State; #definedefreg(x) x = (E1000_##x2) @@ -144,6 +149,7 @@ enum { defreg(TPR), defreg(TPT),defreg(TXDCTL), defreg(WUFC), defreg(RA),defreg(MTA),defreg(CRCERRS),defreg(VFTA), defreg(VET), +defreg(RDTR), defreg(RADV), defreg(TADV), defreg(ITR), }; static void @@ -639,6 +645,68 @@ static uint64_t tx_desc_base(E1000State return (bah 32) + bal; } +/* helper function, 0 means the value is not set */ +static inline void +mit_update_delay(uint32_t *curr, uint32_t value) +{ +if (value (*curr == 0 || value *curr)) + *curr = value; +} + +/* + * If necessary, rearm the timer and post an interrupt. + * Called at the end of tx/rx routines (mit_timer_on == 0), + * and when the timer fires (mit_timer_on == 1). + * We provide a partial implementation of interrupt mitigation, + * emulating only RADV, TADV and ITR (lower 16 bits, 1024ns units for + * RADV and TADV, 256ns units for ITR). RDTR is only used to enable RADV; + * relative timers based on TIDV and RDTR are not implemented. + */ +static void +mit_rearm_and_int(void *opaque) +{ +E1000State *s = opaque; +uint32_t mit_delay = 0; + +/* + * Clear the flag. It is only set when the callback fires, + * and we need to clear it anyways. + */ +s-mit_timer_on = 0; +if (s-mit_cause == 0) /* no events pending, we are done */ + return; +/* + * Compute the next mitigation delay according to pending interrupts + * and the current values of RADV (provided RDTR!=0), TADV and ITR. 
+ * Then rearm the timer. + */ +if (s-mit_cause (E1000_ICR_TXQE | E1000_ICR_TXDW)) + mit_update_delay(mit_delay, s-mac_reg[TADV] * 4); +if (s-mac_reg[RDTR] (s-mit_cause E1000_ICS_RXT0)) + mit_update_delay(mit_delay, s-mac_reg[RADV] * 4); +mit_update_delay(mit_delay, s-mac_reg[ITR]); + +if (mit_delay) { + s-mit_timer_on = 1; + qemu_mod_timer(s-mit_timer, + qemu_get_clock_ns(vm_clock) + mit_delay * 256); +} +set_ics(s, 0, s-mit_cause); +s-mit_cause = 0; +} + +static void +mit_set_ics(E1000State *s, uint32_t cause) +{ +if (mit_on == 0) { + set_ics(s, 0, cause); + return; +} +s-mit_cause |= cause; +if (!s-mit_timer_on) + mit_rearm_and_int(s); +} + static void start_xmit(E1000State *s) { @@ -676,7 +744,7 @@ start_xmit(E1000State *s) break; } } -set_ics(s, 0, cause); +mit_set_ics(s, cause); } static int @@ -894,7 +962,7 @@ e1000_receive(NetClientState *nc, const s-rxbuf_min_shift) n |= E1000_ICS_RXDMT0; -set_ics(s, 0, n); +mit_set_ics(s, n); return size; } @@ -999,6
[Qemu-devel] new pc-bios/bios.bin breaks freebsd booting
I am not sure if it has been reported already but this commit http://git.qemu.org/?p=qemu.git;a=commitdiff;h=d7a51dbbaa70677846453f8c961590913052dd86 (replacing pc-bios/bios.bin with a newer version) breaks booting of FreeBSD on recent qemu (starting roughly with qemu-1.3.0-rc2). Using a FreeBSD host and a FreeBSD guest, the qemu thread runs at 100% and the console is stuck after the 'pci0' probe: ... hpet0: High Precision Event Timer iomem 0xfed0-0xfed003ff on acpi0 Timecounter HPET frequency 1 Hz quality 950 Timecounter ACPI-fast frequency 3579545 Hz quality 900 acpi_timer0: 24-bit timer at 3.579545MHz port 0xb008-0xb00b on acpi0 pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0 pci0: ACPI PCI bus on pcib0 Reverting the bios fixes things. I wonder if this change should be reverted ? cheers luigi
Re: [Qemu-devel] new pc-bios/bios.bin breaks freebsd booting
On Wed, Dec 12, 2012 at 06:47:58PM +0100, Paolo Bonzini wrote: Il 12/12/2012 17:04, Luigi Rizzo ha scritto: I am not sure if it has been reported already but this commit http://git.qemu.org/?p=qemu.git;a=commitdiff;h=d7a51dbbaa70677846453f8c961590913052dd86 (replacing pc-bios/bios.bin with a newer version) breaks booting of FreeBSD on recent qemu (starting roughly with qemu-1.3.0-rc2). Using a FreeBSD host, and a FreeBSD guest, the qemu thread runs at 100% and the console is stuck after the 'pci0' probe: ... hpet0: High Precision Event Timer iomem 0xfed0-0xfed003ff on acpi0 Timecounter HPET frequency 1 Hz quality 950 Timecounter ACPI-fast frequency 3579545 Hz quality 900 acpi_timer0: 24-bit timer at 3.579545MHz port 0xb008-0xb00b on acpi0 pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0 pci0: ACPI PCI bus on pcib0 Reverting the bios fixes things. I wonder if it isn't the case of reverting this change ? Not reverting the change (which fixes other things), but yes---we should get the fix into SeaBIOS. I don't have a FreeBSD VM handy, can you try the attached BIOS so I can have your Tested-by? The patch I used is after my signature. 
thanks, the attached bios successfully boots a FreeBSD guest Tested-by: Luigi Rizzo ri...@iet.unipi.it cheers luigi Paolo diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl index 8019b71..b58ef62 100644 --- a/src/acpi-dsdt.dsl +++ b/src/acpi-dsdt.dsl @@ -187,7 +187,7 @@ DefinitionBlock ( prt_slot0(0x), /* Device 1 is power mgmt device, and can only use irq 9 */ -Package() { 0x1, 0,0, 9 }, +Package() { 0x1, 0, LNKS, 0 }, Package() { 0x1, 1, LNKB, 0 }, Package() { 0x1, 2, LNKC, 0 }, Package() { 0x1, 3, LNKD, 0 }, @@ -278,6 +278,22 @@ DefinitionBlock ( define_link(LNKB, 1, PRQ1) define_link(LNKC, 2, PRQ2) define_link(LNKD, 3, PRQ3) + +Device(LNKS) { +Name(_HID, EISAID(PNP0C0F)) +Name(_UID, 4) +Name(_PRS, ResourceTemplate() { +Interrupt(, Level, ActiveHigh, Shared) { 9 } +}) + +// The SCI cannot be disabled and is always attached to GSI 9, +// so these are no-ops. We only need this link to override the +// polarity to active high and match the content of the MADT. +Method(_STA, 0, NotSerialized) { Return (0x0b) } +Method(_DIS, 0, NotSerialized) { } +Method(_CRS, 0, NotSerialized) { Return (_PRS) } +Method(_SRS, 1, NotSerialized) { } +} } #include acpi-dsdt-cpu-hotplug.dsl
Re: [Qemu-devel] interrupt mitigation for e1000
On Wed, Jul 25, 2012 at 11:53:29AM +0300, Avi Kivity wrote: On 07/24/2012 07:58 PM, Luigi Rizzo wrote: I noticed that the various NIC modules in qemu/kvm do not implement interrupt mitigation, which is very beneficial as it dramatically reduces exits from the hypervisor. As a proof of concept i tried to implement it for the e1000 driver (patch below), and it brings tx performance from 9 to 56Kpps on qemu-softmmu, and from ~20 to 140Kpps on qemu-kvm. I am going to measure the rx interrupt mitigation in the next couple of days. Is there any interest in having this code in ? Indeed. But please drop the #ifdef MITIGATIONs. Thanks for the comments. The #ifdef block MITIGATION was only temporary to point out the differences and run the performance comparisons. Similarly, the magic thresholds below will be replaced with appropriately commented #defines. Note: On the real hardware interrupt mitigation is controlled by a total of four registers (TIDV, TADV, RIDV, RADV) which control it with a granularity of 1024ns , see http://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf An exact emulation of the feature is hard, because the timer resolution we have is much coarser (in the ms range). So i am inclined to use a different approach, similar to the one i have implemented, namely: - the first few packets (whether 1 or 4 or 5 will be decided on the host) report an interrupt immediately; - subsequent interrupts are delayed through qemu_bh_schedule_idle() (which is unpredictable but efficient; i tried qemu_bh_schedule() but it completely defeats mitigation) - when the TX or RX rings are close to getting full, then again an interrupt is delivered immediately. This approach also has the advantage of not requiring specific support in the OS drivers. 
cheers luigi + +#ifdef MITIGATION +QEMUBH *int_bh;// interrupt mitigation handler +int tx_ics_count; // pending tx int requests +int rx_ics_count; // pending rx int requests +int int_cause; // int cause Use uint32_t for int_cause, also a correctly sized type for the packet counts. +#ifdef MITIGATION +/* we transmit the first few packets, or we do if we are + * approaching a full ring. in the latter case, also + * send an ics. + * + */ +{ +int len, pending; +len = s-mac_reg[TDLEN] / sizeof(desc) ; +pending = s-mac_reg[TDT] - s-mac_reg[TDH]; +if (pending 0) + pending += len; +/* ignore requests after the first few ones, as long as + * we are not approaching a full ring. + * Otherwise, deliver packets to the backend. + */ +if (s-tx_ics_count 4 s-tx_ics_count + pending len - 5) + return; Where do the 4 and 5 come from? Shouldn't they be controlled by the guest using a device register? } +#ifdef MITIGATION +s-int_cause |= cause; // remember the interrupt cause. +s-tx_ics_count += pending; +if (s-tx_ics_count = len - 5) { +// if the ring is about to become full, generate an interrupt Another magic number. + set_ics(s, 0, s-int_cause); + s-tx_ics_count = 0; + s-int_cause = 0; +} else { // otherwise just schedule it for later. +qemu_bh_schedule_idle(s-int_bh); +} +} +#else /* !MITIGATION */ set_ics(s, 0, cause); +#endif } -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] interrupt mitigation for e1000
On Wed, Jul 25, 2012 at 12:12:55PM +0200, Paolo Bonzini wrote: On 25/07/2012 11:56, Luigi Rizzo wrote: On Wed, Jul 25, 2012 at 11:53:29AM +0300, Avi Kivity wrote: On 07/24/2012 07:58 PM, Luigi Rizzo wrote: I noticed that the various NIC modules in qemu/kvm do not implement interrupt mitigation, which is very beneficial as it dramatically reduces exits from the hypervisor. As a proof of concept i tried to implement it for the e1000 driver (patch below), and it brings tx performance from 9 to 56 Kpps on qemu-softmmu, and from ~20 to 140 Kpps on qemu-kvm. I am going to measure the rx interrupt mitigation in the next couple of days. Is there any interest in having this code in ? Indeed. But please drop the #ifdef MITIGATIONs. Thanks for the comments. The #ifdef MITIGATION blocks were only temporary, to point out the differences and run the performance comparisons. Similarly, the magic thresholds below will be replaced with appropriately commented #defines. Note: On the real hardware interrupt mitigation is controlled by a total of four registers (TIDV, TADV, RIDV, RADV) which control it with a granularity of 1024 ns, see http://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf An exact emulation of the feature is hard, because the timer resolution we have is much coarser (in the ms range). So i am inclined to use a different approach, similar to the one i have implemented, namely: - the first few packets (whether 1 or 4 or 5 will be decided on the host) report an interrupt immediately; - subsequent interrupts are delayed through qemu_bh_schedule_idle() qemu_bh_schedule_idle() is really a 10ms timer. yes, i figured that out, this is why i said that my code was more a proof of concept than an actual patch. If you have a suggestion on how to schedule a shorter (say 1ms) timer i am all ears. Perhaps qemu_new_timer_ns() and friends ?
This said, i do not plan to implement the full mitigation registers controlled by the guest, just possibly use a parameter as in virtio-net where you can have 'tx=bh' or 'tx=timer' and 'x-txtimer=N' with N is the mitigation delay in nanoseconds (virtually, in practice rounded to whatever the host granularity is) cheers luigi
[Qemu-devel] forgotten commit ? (Re: Proposed patch: huge RX speedup for hw/e1000.c)
Paolo, a few weeks ago you posted the patch below but apparently it did not get in after my 'tested-by' reply of June C1st4th I'd like to confirm that your patch works perfectly for me. Tested-by: Luigi Rizzo ri...@iet.unipi.it cheers luigi On Thu, May 31, 2012 at 01:03:35PM +0200, Paolo Bonzini wrote: Il 31/05/2012 12:40, Jan Kiszka ha scritto: On 2012-05-31 12:03, Paolo Bonzini wrote: Il 31/05/2012 10:23, Jan Kiszka ha scritto: @@ -922,6 +923,7 @@ set_rdt(E1000State *s, int index, uint32_t val) { s-check_rxov = 0; s-mac_reg[index] = val 0x; +qemu_notify_event(); This still looks like the wrong tool: Packets that can't be delivered are queued. Packets that are read from the tap but can't be delivered are queued; packets that are left on the tap need qemu_notify_event to be flushed. So we need to flush the queue and clear the blocked delivery there. qemu_flush_queued_packets appears more appropriate for this. Right, and qemu_flush_queued_packets needs to call qemu_notify_event which makes the call in virtio-net unnecessary. 
Paolo diff --git a/hw/e1000.c b/hw/e1000.c index 4573f13..43d933a 100644 --- a/hw/e1000.c +++ b/hw/e1000.c @@ -295,6 +295,7 @@ set_rx_control(E1000State *s, int index, uint32_t val) s-rxbuf_min_shift = ((val / E1000_RCTL_RDMTS_QUAT) 3) + 1; DBGOUT(RX, RCTL: %d, mac_reg[RCTL] = 0x%x\n, s-mac_reg[RDT], s-mac_reg[RCTL]); +qemu_flush_queued_packets(s-nic-nc); } static void @@ -926,6 +927,9 @@ set_rdt(E1000State *s, int index, uint32_t val) { s-check_rxov = 0; s-mac_reg[index] = val 0x; +if (e1000_has_rxbufs(s, 1)) { +qemu_flush_queued_packets(s-nic-nc); +} } static void diff --git a/hw/virtio-net.c b/hw/virtio-net.c index 3f190d4..0974945 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -447,10 +447,6 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq) VirtIONet *n = to_virtio_net(vdev); qemu_flush_queued_packets(n-nic-nc); - -/* We now have RX buffers, signal to the IO thread to break out of the - * select to re-poll the tap file descriptor */ -qemu_notify_event(); } static int virtio_net_can_receive(VLANClientState *nc) diff --git a/net.c b/net.c index 1922d8a..fa846ae 100644 --- a/net.c +++ b/net.c @@ -491,7 +491,12 @@ void qemu_flush_queued_packets(VLANClientState *vc) queue = vc-send_queue; } -qemu_net_queue_flush(queue); +if (qemu_net_queue_flush(queue)) { +/* We emptied the queue successfully, signal to the IO thread to repoll + * the file descriptor (for tap, for example). 
+ */ +qemu_notify_event(); +} } static ssize_t qemu_send_packet_async_with_flags(VLANClientState *sender, diff --git a/net/queue.c b/net/queue.c index 1ab5247..fd1c7e6 100644 --- a/net/queue.c +++ b/net/queue.c @@ -232,7 +232,7 @@ void qemu_net_queue_purge(NetQueue *queue, VLANClientState *from) } } -void qemu_net_queue_flush(NetQueue *queue) +bool qemu_net_queue_flush(NetQueue *queue) { while (!QTAILQ_EMPTY(queue-packets)) { NetPacket *packet; @@ -248,7 +248,7 @@ void qemu_net_queue_flush(NetQueue *queue) packet-size); if (ret == 0) { QTAILQ_INSERT_HEAD(queue-packets, packet, entry); -break; +return 0; } if (packet-sent_cb) { @@ -257,4 +257,5 @@ void qemu_net_queue_flush(NetQueue *queue) g_free(packet); } +return 1; } diff --git a/net/queue.h b/net/queue.h index a31958e..4bf6d3c 100644 --- a/net/queue.h +++ b/net/queue.h @@ -66,6 +66,6 @@ ssize_t qemu_net_queue_send_iov(NetQueue *queue, NetPacketSent *sent_cb); void qemu_net_queue_purge(NetQueue *queue, VLANClientState *from); -void qemu_net_queue_flush(NetQueue *queue); +bool qemu_net_queue_flush(NetQueue *queue); #endif /* QEMU_NET_QUEUE_H */ Looks good. Luigi, please reply with a Tested-by when possible. Paolo
[Qemu-devel] interrupt mitigation for e1000
I noticed that the various NIC modules in qemu/kvm do not implement interrupt mitigation, which is very beneficial as it dramatically reduces exits from the hypervisor. As a proof of concept i tried to implement it for the e1000 driver (patch below), and it brings tx performance from 9 to 56Kpps on qemu-softmmu, and from ~20 to 140Kpps on qemu-kvm. I am going to measure the rx interrupt mitigation in the next couple of days. Is there any interest in having this code in ? cheers luigi diff -ubwrp --exclude '*.[do]' /tmp/qemu-61dc008/hw/e1000.c ./hw/e1000.c --- /tmp/qemu-61dc008/hw/e1000.c2012-07-20 01:25:52.0 +0200 +++ ./hw/e1000.c2012-07-24 18:21:39.0 +0200 @@ -33,6 +33,8 @@ #include sysemu.h #include dma.h +#define MITIGATION + #include e1000_hw.h #define E1000_DEBUG @@ -127,6 +129,13 @@ typedef struct E1000State_st { } eecd_state; QEMUTimer *autoneg_timer; + +#ifdef MITIGATION +QEMUBH *int_bh;// interrupt mitigation handler +int tx_ics_count; // pending tx int requests +int rx_ics_count; // pending rx int requests +int int_cause; // int cause +#endif // MITIGATION } E1000State; #definedefreg(x) x = (E1000_##x2) @@ -638,6 +648,26 @@ start_xmit(E1000State *s) return; } +#ifdef MITIGATION +/* we transmit the first few packets, or we do if we are + * approaching a full ring. in the latter case, also + * send an ics. + * + */ +{ +int len, pending; +len = s-mac_reg[TDLEN] / sizeof(desc) ; +pending = s-mac_reg[TDT] - s-mac_reg[TDH]; +if (pending 0) + pending += len; +/* ignore requests after the first few ones, as long as + * we are not approaching a full ring. + * Otherwise, deliver packets to the backend. + */ +if (s-tx_ics_count 4 s-tx_ics_count + pending len - 5) + return; +#endif // MITIGATION + while (s-mac_reg[TDH] != s-mac_reg[TDT]) { base = tx_desc_base(s) + sizeof(struct e1000_tx_desc) * s-mac_reg[TDH]; @@ -663,7 +693,21 @@ start_xmit(E1000State *s) break; } } +#ifdef MITIGATION +s-int_cause |= cause; // remember the interrupt cause. 
+s-tx_ics_count += pending; +if (s-tx_ics_count = len - 5) { +// if the ring is about to become full, generate an interrupt + set_ics(s, 0, s-int_cause); + s-tx_ics_count = 0; + s-int_cause = 0; +} else { // otherwise just schedule it for later. +qemu_bh_schedule_idle(s-int_bh); +} +} +#else /* !MITIGATION */ set_ics(s, 0, cause); +#endif } static int @@ -875,7 +919,27 @@ e1000_receive(VLANClientState *nc, const s-rxbuf_min_shift) n |= E1000_ICS_RXDMT0; +#ifdef MITIGATION +#define MIT_RXDMT0_SENT10 // large +s-int_cause |= n; +if (s-rx_ics_count == 0) { + /* deliver the first interrupt */ + set_ics(s, 0, s-int_cause); + s-int_cause = 0; + s-rx_ics_count++; +} else if ( (n E1000_ICS_RXDMT0) s-rx_ics_count MIT_RXDMT0_SENT) { + /* also deliver if we are approaching ring full */ + set_ics(s, 0, s-int_cause); + s-int_cause = 0; + s-rx_ics_count = MIT_RXDMT0_SENT; +} else { + /* otherwise schedule for later */ + s-rx_ics_count++; + qemu_bh_schedule_idle(s-int_bh); +} +#else /* !MITIGATION */ set_ics(s, 0, n); +#endif /* !MITIGATION */ return size; } @@ -1214,6 +1281,20 @@ static NetClientInfo net_e1000_info = { .link_status_changed = e1000_set_link_status, }; +#ifdef MITIGATION +static void e1000_int_bh(void *opaque) +{ +E1000State *s = opaque; +if (s-tx_ics_count 1 s-rx_ics_count 1) + return; +s-tx_ics_count = 0; +s-rx_ics_count = 0; +start_xmit(s); +set_ics(s, 0, s-int_cause); +s-int_cause = 0; +} +#endif /* MITIGATION */ + static int pci_e1000_init(PCIDevice *pci_dev) { E1000State *d = DO_UPCAST(E1000State, dev, pci_dev); @@ -1231,6 +1312,9 @@ static int pci_e1000_init(PCIDevice *pci e1000_mmio_setup(d); +#ifdef MITIGATION +d-int_bh = qemu_bh_new(e1000_int_bh, d); +#endif /* MITIGATION */ pci_register_bar(d-dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, d-mmio); pci_register_bar(d-dev, 1, PCI_BASE_ADDRESS_SPACE_IO, d-io);
Re: [Qemu-devel] qemu 1.1.0 slow as hell booting ?
On Thu, Jul 19, 2012 at 04:43:05PM +0200, Luigi Rizzo wrote: hi, while playing with various qemu versions i noticed that qemu-devel now in FreeBSD ports (1.1.0) is slow as hell when booting FreeBSD (take for instance the netmap image at http://info.iet.unipi.it/~luigi/netmap/20120608-netmap-picobsd-head-amd64.bin while 0.11.x and 1.1.1 and 1.0.1 seem to work well. I should mention that i had the same exact problem a few months ago with a qemu version compiled on my Mac using macports, which was fixed by manually building one myself from the sources (the working Mac version reports version 1.0.91; unfortunately i do not remember which version it was, but judging from the macports history it could have been either 0.14.1 or 1.0) so i do not think this is a specific problem with FreeBSD. Are you able to reproduce this problem ? Any idea on what could it be ? Just for the archives: the problem was that the FreeBSD port, when compiling with CLANG, used --enable-tcg-interpreter which caused the slow code to be generated. I suppose the same problem existed in the macports version cheers luigi cheers luigi
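For anyone hitting the same symptom, a quick way to check whether a given build uses the (much slower) TCG interpreter is to look at the configure invocation or the generated build config. A sketch, assuming a QEMU source/build tree of that era (flag and variable names as i recall them, so verify against your configure --help):

```sh
# When configuring: make sure the interpreter is not enabled
./configure --target-list=x86_64-softmmu --disable-tcg-interpreter

# On an existing build tree: the interpreter build shows up in config-host.mak
grep CONFIG_TCG_INTERPRETER config-host.mak && echo "interpreter build (slow)"
```

The interpreter (TCI) executes TCG ops one by one instead of generating host code, which explains the order-of-magnitude slowdown during boot.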
Re: [Qemu-devel] qemu 1.1.0 slow as hell booting ?
On Fri, Jul 20, 2012 at 09:04:51PM +1000, ronnie sahlberg wrote: Is it only very slow during boot but then becomes normal speed when the system is fully booted and kernel and userspace are all up and running normally? i did not have enough patience to go past the 2 minutes needed to load text+data for the kernel. See my followup, it was due to configure using --enable-tcg-interpreter when using CLANG. cheers luigi regards ronnie sahlberg On Fri, Jul 20, 2012 at 12:43 AM, Luigi Rizzo ri...@iet.unipi.it wrote: hi, while playing with various qemu versions i noticed that qemu-devel now in FreeBSD ports (1.1.0) is slow as hell when booting FreeBSD (take for instance the netmap image at http://info.iet.unipi.it/~luigi/netmap/20120608-netmap-picobsd-head-amd64.bin while 0.11.x and 1.1.1 and 1.0.1 seem to work well. I should mention that i had the same exact problem a few months ago with a qemu version compiled on my Mac using macports, which was fixed by manually building one myself from the sources (the working Mac version reports version 1.0.91; unfortunately i do not remember which version it was, but judging from the macports history it could have been either 0.14.1 or 1.0) so i do not think this is a specific problem with FreeBSD. Are you able to reproduce this problem ? Any idea on what could it be ? cheers luigi
[Qemu-devel] qemu 1.1.0 slow as hell booting ?
hi, while playing with various qemu versions i noticed that qemu-devel now in FreeBSD ports (1.1.0) is slow as hell when booting FreeBSD (take for instance the netmap image at http://info.iet.unipi.it/~luigi/netmap/20120608-netmap-picobsd-head-amd64.bin) while 0.11.x and 1.1.1 and 1.0.1 seem to work well. I should mention that i had the same exact problem a few months ago with a qemu version compiled on my Mac using macports, which was fixed by manually building one myself from the sources (the working Mac version reports version 1.0.91; unfortunately i do not remember which version it was, but judging from the macports history it could have been either 0.14.1 or 1.0), so i do not think this is a specific problem with FreeBSD. Are you able to reproduce this problem ? Any idea what it could be ? cheers luigi
[Qemu-devel] VALE, a Virtual Local Ethernet. http://info.iet.unipi.it/~luigi/vale/
We have just completed a netmap extension that lets you build a local high speed switch called VALE, which i think can be very useful. http://info.iet.unipi.it/~luigi/vale/ VALE is a software Virtual Local Ethernet whose ports are accessible using the netmap API. Designed to be used as the interconnect between virtual machines (or as a fast local bus), it works as a learning bridge and supports speeds of up to 20 Mpps with short frames, and an aggregate 70 Gbit/s with 1514-byte packets. The VALE paper contains more details and performance measurements. VALE is implemented as a small extension of the netmap module, and is available for FreeBSD and Linux. The source code includes a backend for qemu and KVM, so you can use VALE to interconnect virtual machines, launching them with

qemu -net nic -net netmap,ifname=vale0 ...
qemu -net nic -net netmap,ifname=vale1 ...
...

Processes can talk to a VALE switch too, so you can use the pkt-gen or bridge tools that are part of the netmap distribution, or even the pcap.c module that maps libpcap calls into netmap equivalents. This lets you use VALE for all sorts of pcap-based applications. More details, code, and bootable images are on the VALE page, http://info.iet.unipi.it/~luigi/vale/ feedback welcome, as usual. cheers luigi
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
Works for me. I can now receive at 1.15 Mpps, slightly faster than my previous patch, which generated unnecessary writes to the signalling socket. Tested-by: Luigi Rizzo ri...@iet.unipi.it

On Thu, May 31, 2012 at 12:03 PM, Paolo Bonzini pbonz...@redhat.com wrote: On 31/05/2012 10:23, Jan Kiszka wrote:

@@ -922,6 +923,7 @@ set_rdt(E1000State *s, int index, uint32_t val)
 {
     s->check_rxov = 0;
     s->mac_reg[index] = val & 0xffff;
+    qemu_notify_event();

This still looks like the wrong tool: Packets that can't be delivered are queued.

Packets that are read from the tap but can't be delivered are queued; packets that are left on the tap need qemu_notify_event to be flushed. So we need to flush the queue and clear the blocked delivery there. qemu_flush_queued_packets appears more appropriate for this.

Right, and qemu_flush_queued_packets needs to call qemu_notify_event, which makes the call in virtio-net unnecessary.

Paolo

diff --git a/hw/e1000.c b/hw/e1000.c
index 4573f13..43d933a 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -295,6 +295,7 @@ set_rx_control(E1000State *s, int index, uint32_t val)
     s->rxbuf_min_shift = ((val / E1000_RCTL_RDMTS_QUAT) & 3) + 1;
     DBGOUT(RX, "RCTL: %d, mac_reg[RCTL] = 0x%x\n",
            s->mac_reg[RDT], s->mac_reg[RCTL]);
+    qemu_flush_queued_packets(&s->nic->nc);
 }

 static void
@@ -926,6 +927,9 @@ set_rdt(E1000State *s, int index, uint32_t val)
 {
     s->check_rxov = 0;
     s->mac_reg[index] = val & 0xffff;
+    if (e1000_has_rxbufs(s, 1)) {
+        qemu_flush_queued_packets(&s->nic->nc);
+    }
 }

 static void
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 3f190d4..0974945 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -447,10 +447,6 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
     VirtIONet *n = to_virtio_net(vdev);

     qemu_flush_queued_packets(&n->nic->nc);
-
-    /* We now have RX buffers, signal to the IO thread to break out of the
-     * select to re-poll the tap file descriptor */
-    qemu_notify_event();
 }

 static int virtio_net_can_receive(VLANClientState *nc)
diff --git a/net.c b/net.c
index 1922d8a..fa846ae 100644
--- a/net.c
+++ b/net.c
@@ -491,7 +491,12 @@ void qemu_flush_queued_packets(VLANClientState *vc)
         queue = vc->send_queue;
     }

-    qemu_net_queue_flush(queue);
+    if (qemu_net_queue_flush(queue)) {
+        /* We emptied the queue successfully, signal to the IO thread to repoll
+         * the file descriptor (for tap, for example).
+         */
+        qemu_notify_event();
+    }
 }

 static ssize_t qemu_send_packet_async_with_flags(VLANClientState *sender,
diff --git a/net/queue.c b/net/queue.c
index 1ab5247..fd1c7e6 100644
--- a/net/queue.c
+++ b/net/queue.c
@@ -232,7 +232,7 @@ void qemu_net_queue_purge(NetQueue *queue, VLANClientState *from)
     }
 }

-void qemu_net_queue_flush(NetQueue *queue)
+bool qemu_net_queue_flush(NetQueue *queue)
 {
     while (!QTAILQ_EMPTY(&queue->packets)) {
         NetPacket *packet;
@@ -248,7 +248,7 @@ void qemu_net_queue_flush(NetQueue *queue)
                               packet->size);
         if (ret == 0) {
             QTAILQ_INSERT_HEAD(&queue->packets, packet, entry);
-            break;
+            return 0;
         }

         if (packet->sent_cb) {
@@ -257,4 +257,5 @@ void qemu_net_queue_flush(NetQueue *queue)
         g_free(packet);
     }
+    return 1;
 }
diff --git a/net/queue.h b/net/queue.h
index a31958e..4bf6d3c 100644
--- a/net/queue.h
+++ b/net/queue.h
@@ -66,6 +66,6 @@ ssize_t qemu_net_queue_send_iov(NetQueue *queue,
                                 NetPacketSent *sent_cb);

 void qemu_net_queue_purge(NetQueue *queue, VLANClientState *from);
-void qemu_net_queue_flush(NetQueue *queue);
+bool qemu_net_queue_flush(NetQueue *queue);

 #endif /* QEMU_NET_QUEUE_H */
--
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
 TEL +39-050-2211611                    . via Diotisalvi 2
 Mobile +39-338-6809875                 . 56122 PISA (Italy)
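The pattern in the patch above (flush reports whether the queue fully drained, and only then does the caller kick the I/O thread) can be sketched as a standalone toy with no QEMU dependencies. All names here are illustrative, and the qemu_notify_event() call is simulated by a counter:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QLEN 8

typedef struct {
    int pkts[QLEN];
    size_t head, count;
    bool peer_busy;   /* stands in for the frontend refusing packets */
    int notified;     /* counts simulated qemu_notify_event() calls  */
} SketchQueue;

/* Deliver one packet; returns 0 when the peer cannot accept it. */
int sketch_deliver(SketchQueue *q, int pkt)
{
    (void)pkt;
    return q->peer_busy ? 0 : 1;
}

/* Like the patched qemu_net_queue_flush(): true iff the queue emptied. */
bool sketch_flush(SketchQueue *q)
{
    while (q->count > 0) {
        if (sketch_deliver(q, q->pkts[q->head]) == 0) {
            return false;   /* peer blocked: leave the packet queued */
        }
        q->head = (q->head + 1) % QLEN;
        q->count--;
    }
    return true;
}

/* Mirrors qemu_flush_queued_packets(): notify only on a full drain. */
void sketch_flush_and_notify(SketchQueue *q)
{
    if (sketch_flush(q)) {
        q->notified++;      /* stand-in for qemu_notify_event() */
    }
}
```

With this split, a failed drain leaves the packet queued and generates no spurious wakeup, while a successful drain wakes the main loop so it re-polls the tap (or netmap) file descriptor.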
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
On Thu, May 31, 2012 at 9:38 AM, Paolo Bonzini pbonz...@redhat.com wrote: On 31/05/2012 00:53, Luigi Rizzo wrote: The image contains my fast packet generator pkt-gen (a stock traffic generator such as netperf etc. is too slow to show the problem). pkt-gen can send about 1 Mpps in this configuration using -net netmap in the backend. The qemu process in this case takes 100% CPU. On the receive side, i cannot receive more than 50 Kpps, even if i flood the bridge with a huge amount of traffic. The qemu process stays at 5% cpu or less. Then i read the docs in main-loop.h, which say that one case where qemu_notify_event() is needed is when using qemu_set_fd_handler2(), which is exactly what my backend uses (similar to tap.c) The path is a bit involved, but I think Luigi is right. The docs say Remember to call qemu_notify_event whenever the [return value of the fd_read_poll callback] may change from false to true. Now net/tap.c has ... So as a conservative approximation, you need to fire qemu_notify_event whenever you write to RDH, RDT, RDLEN and RCTL, or when check_rxov becomes zero. In practice, only RDT, RCTL and check_rxov matter. Luigi, does this patch work for you? it almost surely works (cannot check today), as my (guest) driver modifies RDT to notify the hardware of read packets. I know/mentioned that the notification can be optimized and sent only in specific cases, as you describe above (i might be missing the check_rxov). But i think it would be more robust to make fewer assumptions about what the guest does, and send the notify on all register writes (those are relatively rare in a driver, and the datapath touches exactly the registers we ought to be interested in), and possibly on those reads that may have side effects, such as interrupt registers, together with a big comment in the code explaining why we need to call qemu_notify_event().
Actually, what would pay off even more is to devise a way to avoid the write() syscall in qemu_notify_event if the other process (or is it a thread ?) has data queued. This is not so important in my netmap driver, but a standard driver is likely to update the RDT on every single packet, which would pretty much kill performance. cheers luigi
--
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
 TEL +39-050-2211611                    . via Diotisalvi 2
 Mobile +39-338-6809875                 . 56122 PISA (Italy)
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
On Thu, May 31, 2012 at 10:23 AM, Jan Kiszka jan.kis...@web.de wrote: On 2012-05-31 09:38, Paolo Bonzini wrote ... This still looks like the wrong tool: Packets that can't be delivered are queued. So we need to flush the queue and clear the blocked delivery there. qemu_flush_queued_packets appears more appropriate for this. Conceptually, the backend should be responsible for kicking the iothread as needed. as i understand the code, the backend _is_ the iothread, and it is sleeping when the frontend becomes able to receive again. cheers luigi -- -+--- Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione http://www.iet.unipi.it/~luigi/. Universita` di Pisa TEL +39-050-2211611 . via Diotisalvi 2 Mobile +39-338-6809875 . 56122 PISA (Italy) -+---
Re: [Qemu-devel] frame reordering in qemu_net_queue_send() ?
On Thu, May 31, 2012 at 11:18 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Wed, May 30, 2012 at 8:59 AM, Luigi Rizzo ri...@iet.unipi.it wrote: Hi, while investigating rx performance for emulated network devices (i am looking at the userspace version, relying on net=tap or similar approaches) i noticed the code in net/queue.c :: qemu_net_queue_send() which looks strange to me (same goes for the iov version). The whole function is below, just for reference. My impression is that the call to qemu_net_queue_flush() is misplaced, and it should come before qemu_net_queue_deliver(), otherwise the current packet is pushed out before anything that was already in the queue. Does it make sense ? cheers luigi

ssize_t qemu_net_queue_send(NetQueue *queue,
                            VLANClientState *sender,
                            unsigned flags,
                            const uint8_t *data,
                            size_t size,
                            NetPacketSent *sent_cb)
{
    ssize_t ret;

    if (queue->delivering) {
        return qemu_net_queue_append(queue, sender, flags, data, size, NULL);
    }

    ret = qemu_net_queue_deliver(queue, sender, flags, data, size);
    if (ret == 0) {
        qemu_net_queue_append(queue, sender, flags, data, size, sent_cb);
        return 0;
    }

    qemu_net_queue_flush(queue);

Think of the case where delivering a packet causes additional (re-entrant) qemu_net_queue_send() calls to this queue. We'll be in ->delivering state and queue those packets. After we've finished delivering the initial packet we flush the queue. This is a weird case but this is how I read the intention of the code.

i was under the impression that qemu_net_queue_deliver() may also return 0 if the other side (frontend network device) has no room for the packet. This would cause the queue to contain data on entry to the next call, even without reentrancy. Consider this:

t0. qemu_net_queue_send(pkt-A): qemu_net_queue_deliver() returns 0, pkt-A is queued and we return

t1. qemu_net_queue_send(pkt-B): qemu_net_queue_deliver() returns successfully, pkt-B is sent to the frontend; then we call qemu_net_queue_flush() and this sends pkt-A to the frontend, in the wrong order

cheers luigi
--
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
 TEL +39-050-2211611                    . via Diotisalvi 2
 Mobile +39-338-6809875                 . 56122 PISA (Italy)
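The t0/t1 scenario above can be reproduced in a standalone toy model (not QEMU code; all names are illustrative): with the deliver-then-flush order of qemu_net_queue_send(), a packet queued earlier (A) reaches the frontend only after a later packet (B).

```c
#include <assert.h>
#include <string.h>

#define MAXPKT 8

typedef struct {
    char backlog[MAXPKT]; int nqueued;    /* queued packets, oldest first */
    char delivered[MAXPKT]; int ndeliv;   /* order seen by the frontend   */
    int frontend_full;                    /* frontend has no room         */
} Wire;

/* Returns 0 when the frontend cannot accept the packet. */
int wire_deliver(Wire *w, char pkt)
{
    if (w->frontend_full)
        return 0;
    w->delivered[w->ndeliv++] = pkt;
    return 1;
}

/* Drain the backlog in order, stopping at the first refusal. */
void wire_flush(Wire *w)
{
    int done = 0;
    while (done < w->nqueued && wire_deliver(w, w->backlog[done]))
        done++;
    memmove(w->backlog, w->backlog + done, (size_t)(w->nqueued - done));
    w->nqueued -= done;
}

/* Mirrors qemu_net_queue_send(): deliver the new packet, then flush. */
void wire_send(Wire *w, char pkt)
{
    if (wire_deliver(w, pkt) == 0) {
        w->backlog[w->nqueued++] = pkt;
        return;
    }
    wire_flush(w);
}
```

Queueing A while the frontend is full and then sending B once it has room yields the delivery order B, A: exactly the reordering described in the message.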
Re: [Qemu-devel] frame reordering in qemu_net_queue_send() ?
On Thu, May 31, 2012 at 03:23:12PM +0200, Jan Kiszka wrote: On 2012-05-31 15:19, Paolo Bonzini wrote: On 31/05/2012 15:07, Zhi Yong Wu wrote: Yeah, this case actually exists, but the tcp/ip protocol stack in the guest will make sure the ordering is eventually corrected. Nevertheless it's not good, and the latest Windows Logo tests will fail if you reorder frames. Can we really enter qemu_net_queue_send with delivering == 0 and a non-empty queue? i have no idea. it doesn't help that comments are used very sparingly throughout the code... cheers luigi
[Qemu-devel] frame reordering in qemu_net_queue_send() ?
Hi, while investigating rx performance for emulated network devices (i am looking at the userspace version, relying on net=tap or similar approaches) i noticed the code in net/queue.c :: qemu_net_queue_send() which looks strange to me (same goes for the iov version). The whole function is below, just for reference. My impression is that the call to qemu_net_queue_flush() is misplaced, and it should come before qemu_net_queue_deliver(), otherwise the current packet is pushed out before anything that was already in the queue. Does it make sense ? cheers luigi

ssize_t qemu_net_queue_send(NetQueue *queue,
                            VLANClientState *sender,
                            unsigned flags,
                            const uint8_t *data,
                            size_t size,
                            NetPacketSent *sent_cb)
{
    ssize_t ret;

    if (queue->delivering) {
        return qemu_net_queue_append(queue, sender, flags, data, size, NULL);
    }

    ret = qemu_net_queue_deliver(queue, sender, flags, data, size);
    if (ret == 0) {
        qemu_net_queue_append(queue, sender, flags, data, size, sent_cb);
        return 0;
    }

    qemu_net_queue_flush(queue);

    return ret;
}
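The fix suggested in this message (flush the backlog before attempting to deliver the new packet, so FIFO order is preserved) can be sketched in a standalone toy; this is not the real QEMU API, and all names are illustrative:

```c
#include <assert.h>
#include <string.h>

#define MAXPKT 8

typedef struct {
    char backlog[MAXPKT]; int nqueued;    /* queued packets, oldest first */
    char delivered[MAXPKT]; int ndeliv;   /* order seen by the frontend   */
    int frontend_full;                    /* frontend has no room         */
} Wire;

/* Returns 0 when the frontend cannot accept the packet. */
int wire_deliver(Wire *w, char pkt)
{
    if (w->frontend_full)
        return 0;
    w->delivered[w->ndeliv++] = pkt;
    return 1;
}

/* Drain the backlog in order, stopping at the first refusal. */
void wire_flush(Wire *w)
{
    int done = 0;
    while (done < w->nqueued && wire_deliver(w, w->backlog[done]))
        done++;
    memmove(w->backlog, w->backlog + done, (size_t)(w->nqueued - done));
    w->nqueued -= done;
}

/* Flush first; queue the new packet if a backlog remains or delivery
 * fails, so older packets always go out before newer ones. */
void wire_send_in_order(Wire *w, char pkt)
{
    wire_flush(w);
    if (w->nqueued > 0 || wire_deliver(w, pkt) == 0)
        w->backlog[w->nqueued++] = pkt;
}
```

With this order a packet queued while the frontend was full is delivered before any packet sent after the frontend recovers, so the frontend sees A, B rather than B, A.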
[Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
Hi, while testing qemu with netmap (see [Note 1] for details) on e1000 emulation, i noticed that my sender program using a custom backend [Note 2] could reach 1 Mpps (million packets per second) but on the receive side i was limited to 50 Kpps (and CPU always below 5%). The problem was fixed by the following one-line addition to hw/e1000.c :: e1000_mmio_write(), to wake up the qemu mainloop and check that some buffers might be available.

--- hw/e1000.c.orig	2012-02-17 20:45:39.0 +0100
+++ hw/e1000.c	2012-05-30 20:01:52.0 +0200
@@ -919,6 +926,7 @@
         DBGOUT(UNKNOWN, "MMIO unknown write addr=0x%08x,val=0x%08"PRIx64"\n",
                index2, val);
     }
+    qemu_notify_event();
 }

 static uint64_t

With this fix, the read throughput reaches 1 Mpps, matching the write speed. Now the system becomes CPU-bound, but this opens the way to more optimizations in the emulator. The same problem seems to exist in other network drivers, e.g. hw/rtl8139.c and others. The only one that seems to get it right is virtio-net.c. I think it would be good if this change could make it into the tree.

[Note 1] Netmap ( http://info.iet.unipi.it/~luigi/netmap ) is an efficient mechanism for packet I/O that bypasses the network stack and provides protected access to the network adapter from userspace. It works especially well on top of qemu because the kernel needs only to trap a single register access for each batch of packets.

[Note 2] the custom backend is a virtual local ethernet called VALE, implemented as a kernel module on the host, that extends netmap to implement communication between virtual machines. VALE is extremely efficient, currently delivering about 10 Mpps with 60-byte frames, and 5 Mpps with 1500-byte frames. The 1 Mpps rates i mentioned are obtained between qemu instances running in userspace on FreeBSD (no kernel acceleration whatsoever) and using VALE as a communication mechanism.

cheers luigi
--
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
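The wakeup the one-line patch relies on (making a main loop blocked in select()/poll() return so it can re-poll its file descriptors) can be sketched with a plain POSIX pipe. This illustrates the general self-pipe idea behind qemu_notify_event(), not QEMU's actual implementation; names are illustrative:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

int notify_fds[2];   /* [0] = read end watched by the main loop */

int notify_init(void)
{
    return pipe(notify_fds);
}

/* Stand-in for qemu_notify_event(): one byte wakes the poller. */
void notify_event(void)
{
    char c = 0;
    (void)!write(notify_fds[1], &c, 1);
}

/* Returns 1 if a poll()ing main loop would wake up, 0 on timeout. */
int main_loop_would_wake(int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fds[0], .events = POLLIN };
    return poll(&pfd, 1, timeout_ms) > 0;
}
```

Without such a kick, a main loop sleeping in poll() never notices that the guest freed RX buffers via an MMIO write; the qemu_notify_event() added by the patch is exactly this kind of kick.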
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
On Wed, May 30, 2012 at 10:23:11PM +0200, Luigi Rizzo wrote: ... The problem was fixed by the following one-line addition to hw/e1000.c :: e1000_mmio_write(), to wake up the qemu mainloop and check that some buffers might be available.

--- hw/e1000.c.orig	2012-02-17 20:45:39.0 +0100
+++ hw/e1000.c	2012-05-30 20:01:52.0 +0200
@@ -919,6 +926,7 @@
         DBGOUT(UNKNOWN, "MMIO unknown write addr=0x%08x,val=0x%08"PRIx64"\n",
                index2, val);
     }
+    qemu_notify_event();
 }

 static uint64_t

With this fix, the read throughput reaches 1 Mpps, matching the write speed. Now the system becomes CPU-bound, but this opens the way to more optimizations in the emulator. The same problem seems to exist in other network drivers, e.g. hw/rtl8139.c and others. The only one that seems to get it right is virtio-net.c. I think it would be good if this change could make it into the tree. [Note 1] Netmap ( http://info.iet.unipi.it/~luigi/netmap ) is an efficient mechanism for packet I/O that bypasses the network stack and provides protected access to the network adapter from userspace. It works especially well on top of qemu because the kernel needs only to trap a single register access for each batch of packets. [Note 2] the custom backend is a virtual local ethernet called VALE, implemented as a kernel module on the host, that extends netmap to implement communication between virtual machines. VALE is extremely efficient, currently delivering about 10 Mpps with 60-byte frames, and 5 Mpps with 1500-byte frames. The 1 Mpps rates i mentioned are obtained between qemu instances running in userspace on FreeBSD (no kernel acceleration whatsoever) and using VALE as a communication mechanism.

Custom backend == you patched QEMU? Or what backend are you using? This sounds a lot like [1] and suggests that you are either a) using slirp in a version that doesn't contain that fix yet (before 1.1-rcX) or b) wrote a backend that suffers from a similar bug.

Jan

[1] http://thread.gmane.org/gmane.comp.emulators.qemu/144433

my custom backend is the one in [Note 2] above. It replaces the -net pcap/user/tap/socket option which defines how qemu communicates with the host network device. The problem is not in my module, but rather in the emulation device exposed to the guest, and i presume this is the same thing you fixed in the slirp patch. I checked the git version http://git.qemu.org/qemu.git and most guest-side devices have the same problem; only virtio-net does the notification. cheers luigi
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
On Wed, May 30, 2012 at 11:39 PM, Jan Kiszka jan.kis...@web.de wrote: Please keep CCs. ok On 2012-05-30 23:23, Luigi Rizzo wrote: On Wed, May 30, 2012 at 10:23:11PM +0200, Luigi Rizzo wrote: ... The problem was fixed by the following one-line addition to hw/e1000.c :: e1000_mmio_write(), to wake up the qemu mainloop and check that some buffers might be available.

--- hw/e1000.c.orig	2012-02-17 20:45:39.0 +0100
+++ hw/e1000.c	2012-05-30 20:01:52.0 +0200
@@ -919,6 +926,7 @@
         DBGOUT(UNKNOWN, "MMIO unknown write addr=0x%08x,val=0x%08"PRIx64"\n",
                index2, val);
     }
+    qemu_notify_event();
 }

 static uint64_t

With this fix, the read throughput reaches 1 Mpps, matching the write speed. Now the system becomes CPU-bound, but this opens the way to more optimizations in the emulator. The same problem seems to exist in other network drivers, e.g. hw/rtl8139.c and others. The only one that seems to get it right is virtio-net.c. I think it would be good if this change could make it into the tree. [Note 1] Netmap ( http://info.iet.unipi.it/~luigi/netmap ) is an efficient mechanism for packet I/O that bypasses the network stack and provides protected access to the network adapter from userspace. It works especially well on top of qemu because the kernel needs only to trap a single register access for each batch of packets. [Note 2] the custom backend is a virtual local ethernet called VALE, implemented as a kernel module on the host, that extends netmap to implement communication between virtual machines. VALE is extremely efficient, currently delivering about 10 Mpps with 60-byte frames, and 5 Mpps with 1500-byte frames. The 1 Mpps rates i mentioned are obtained between qemu instances running in userspace on FreeBSD (no kernel acceleration whatsoever) and using VALE as a communication mechanism. Custom backend == you patched QEMU? Or what backend are you using? This sounds a lot like [1] and suggests that you are either a) using slirp in a version that doesn't contain that fix yet (before 1.1-rcX) or b) wrote a backend that suffers from a similar bug. Jan [1] http://thread.gmane.org/gmane.comp.emulators.qemu/144433 my custom backend is the one in [Note 2] above. It replaces the -net pcap/user/tap/socket option which defines how qemu communicates with the host network device. Any code to share? It's hard to discuss just concepts. you can take the freebsd image from the netmap page in my link and run it in qemu, and then run the pkt-gen program in the image in either send or receive mode. But this is overkill, as you have described the problem exactly in your post: when the guest reads the packets from the emulated device (e1000 in my case, but most of them have the problem) it fails to wake up the thread blocked in main_loop_wait(). I am unclear on the terminology (what is frontend and what is backend ?) but it is the guest side that has to wake up the qemu process: the file descriptor talking to the host side (tap, socket, bpf ...) has already fired its events, and the only thing it could do is cause a busy wait if it keeps passing a readable file descriptor to select. I thought your slirp.c patch was also on the same side as e1000.c cheers luigi The problem is not in my module, but rather in the emulation device exposed to the guest, and i presume this is the same thing you fixed in the slirp patch. I checked the git version http://git.qemu.org/qemu.git and most guest-side devices have the same problem; only virtio-net does the notification. And that is most likely wrong. The bug I cited was not a front-end issue but clearly one of the backend. It lacked kicking of the io-thread once its queue state changed in a way that was not reported otherwise (via some file descriptor the io-thread is subscribed to). If your backend creates such states as well, it has to fix it similarly.
Again, discussing this abstractly is not very efficient. Jan
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
On Thu, May 31, 2012 at 12:11 AM, Jan Kiszka jan.kis...@web.de wrote: On 2012-05-30 23:55, Luigi Rizzo wrote: you can take the freebsd image from the netmap page in my link and run it in qemu, and then run the pkt-gen program in the image in either send or receive mode. But this is overkill, as you have described the problem exactly in your post: when the guest reads the packets from the emulated device (e1000 in my case, but most of them have the problem) it fails to wake up the thread blocked in main_loop_wait(). OK, so there are no QEMU code changes involved? Then, what is your command line and what is the QEMU version?

# x86_64-softmmu/qemu-system-x86_64 -version
QEMU emulator version 1.0,1, Copyright (c) 2003-2008 Fabrice Bellard

the command line for my test is (this is on FreeBSD 9, with a pure userland qemu without any kqemu support)

x86_64-softmmu/qemu-system-x86_64 -m 512 -hda picobsd.bin -net nic,model=e1000 -net netmap,ifname=valeXX

-net netmap is my fast bridge; code will be available in a couple of days, in the meantime you can have a look at the netmap link in my original post to see how qemu accesses the host adapter and how fast it is. The problem also exists if you use -net tap, ... just at lower speed, and maybe too low to notice. The image contains my fast packet generator pkt-gen (a stock traffic generator such as netperf etc. is too slow to show the problem). pkt-gen can send about 1 Mpps in this configuration using -net netmap in the backend. The qemu process in this case takes 100% CPU. On the receive side, i cannot receive more than 50 Kpps, even if i flood the bridge with a huge amount of traffic. The qemu process stays at 5% cpu or less.
Then i read the docs in main-loop.h, which say that one case where qemu_notify_event() is needed is when using qemu_set_fd_handler2(), which is exactly what my backend uses (similar to tap.c). Once i add the notify, the receiver can do 1 Mpps and can use 100% CPU (and it is not in a busy wait: if the offered traffic goes down, the qemu process uses less and less cpu). I am unclear on the terminology (what is frontend and what is backend ?) Frontend is the emulated NIC here, backend the host-side interface, i.e. slirp (user), tap, socket. thanks for the clarification. but it is the guest side that has to wake up the qemu process: the file descriptor talking to the host side (tap, socket, bpf ...) has already fired its events and the only thing it could do is cause a busy wait if it keeps passing a readable file descriptor to select. Can't follow. How did you analyze your delays? see above. I thought your slirp.c patch was also on the same side as e1000.c It was only in the backend, not any frontend driver. It's generally no business of the frontend driver to kick (virtio has some exceptional path). BTW, your patch is rather untargeted as it kicks on every MMIO write of the e1000. Hard to assess what actually makes the difference for you. in my case it is easily answered: the process that reads from the interface on the guest is doing only one read and one write access to the RDT register for each batch of packets. The RDT change is the one that frees the rx buffers. Surely the patch could be made more specific, but this requires a lot of investigation on which register accesses may require attention (and there are so many of them, say if the guest decides to reset the ring, the card, etc., that I don't think it is worth the trouble.) cheers luigi
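The narrower alternative discussed in this exchange (kick only on register writes that can actually free RX buffers, i.e. RDT and RCTL per Paolo's analysis, rather than on every MMIO write) could be sketched as a small predicate. The register indices below are made up for illustration and do not match the real e1000 register layout:

```c
#include <assert.h>

enum { REG_RDT = 1, REG_RCTL = 2, REG_TDT = 3, REG_IMS = 4 };

/* Returns nonzero when a write to this register warrants a wakeup. */
int rx_write_needs_kick(int reg_index)
{
    switch (reg_index) {
    case REG_RDT:   /* guest returned RX descriptors to the ring     */
    case REG_RCTL:  /* RX enable/configuration may unblock reception */
        return 1;
    default:        /* TX and interrupt registers do not free buffers */
        return 0;
    }
}
```

Luigi's point stands either way: the blanket kick in e1000_mmio_write() is robust against guest behavior he cannot predict, while a filter like this trades extra register-by-register analysis for fewer wakeups.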