Re: [Qemu-devel] [PATCH V5 16/18] virtio-pci: increase the maximum number of virtqueues to 513
On Tue, Apr 7, 2015 at 6:54 PM, Alexander Graf ag...@suse.de wrote: On 04/01/2015 10:15 AM, Jason Wang wrote: This patch increases the maximum number of virtqueues for pci from 64 to 513. This will allow booting a virtio-net-pci device with 256 queue pairs. ...

  * configuration space */
 #define VIRTIO_PCI_CONFIG_SIZE(dev) VIRTIO_PCI_CONFIG_OFF(msix_enabled(dev))
-#define VIRTIO_PCI_QUEUE_MAX 64
+#define VIRTIO_PCI_QUEUE_MAX 513

513 is an interesting number. Any particular reason for it? Maybe this was mentioned before and I just missed it ;)

It seems quite large, too. I thought multiple queue pairs were useful to split the load for multicore machines, but targeting VMs with up to 256 cores (and presumably an equal number in the host) seems really forward looking. On the other hand, the message is dated April first... cheers luigi
Re: [Qemu-devel] [PATCH] net: Disable netmap backend when not supported
On Thu, Feb 20, 2014 at 10:49:52AM +0100, Stefan Hajnoczi wrote: On Wed, Feb 19, 2014 at 04:57:28PM +0100, Vincenzo Maffione wrote: ... Please mention that in the commit description. (I guess it's too late to give them a NETMAP_* prefix since defining D() and RD() in a system header has a fair chance of causing namespace conflicts.) You are right that the naming is unfortunate, and we will try to address it at a later time by removing the macros. As partial mitigation of the problem, they are intended only for debugging (so they should only be used by lazy programmers; applications should prefer application-specific logging functions), and D(), RD() and ND() are only defined when NETMAP_WITH_LIBS is defined, which should happen only in the single file that implements the netmap backend. cheers luigi
Re: [Qemu-devel] [PATCH] net: Disable netmap backend when not supported
On Wed, Feb 19, 2014 at 7:30 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Feb 14, 2014 at 05:40:24PM +0100, Vincenzo Maffione wrote: This patch fixes configure so that netmap is not compiled in if the host doesn't support an API version >= 11. Moreover, some modifications have been done to net/netmap.c in order to reflect the current netmap API (11). Signed-off-by: Vincenzo Maffione v.maffi...@gmail.com
---
 configure    |  3 +++
 net/netmap.c | 57 ++---
 2 files changed, 17 insertions(+), 43 deletions(-)

diff --git a/configure b/configure
index 88133a1..61eb932 100755
--- a/configure
+++ b/configure
@@ -2118,6 +2118,9 @@ if test $netmap != no ; then
 #include <net/if.h>
 #include <net/netmap.h>
 #include <net/netmap_user.h>
+#if (NETMAP_API < 11) || (NETMAP_API > 15)
+#error
+#endif

Why error when NETMAP_API > 15?

This is meant to simulate a minor/major version number. We will mark minor new features with values up to 15, and if something happens that requires a change in the backend we will move above 15, at which point we will submit backend fixes and also the necessary update to ./configure.

-ring->cur = NETMAP_RING_NEXT(ring, i);
-ring->avail--;
+ring->cur = ring->head = nm_ring_next(ring, i);
 ioctl(s->me.fd, NIOCTXSYNC, NULL);
 return size;

Are these changes related to the NETMAP_WITH_LIBS macro? Please do that in a separate patch so we keep the version checking change separate from the NETMAP_WITH_LIBS change.

netmap version 11 and above has NETMAP_WITH_LIBS, while previous versions do not, so this ./configure change has to go together with the change in the backend. The netmap 11 code has already been committed to the FreeBSD source repositories (for HEAD, 10 and 9) and to code.google.com/p/netmap/ (for those who want it on Linux). So there is really no point, in my opinion, to make one intermediate commit just to ./configure to disable netmap detection on FreeBSD (it is already off on Linux), immediately followed by this one that uses the new feature.
Just to clarify: with one exception (fields in struct netmap_ring) the netmap changes that we have are not at the kernel/user boundary but in a library which avoids replicating long and boring code (interface name parsing, parameter setting) in applications. Avoiding the single incompatible struct change would of course have been possible, but at the cost of extra complexity in the kernel and in userspace (to support two slightly different interfaces). Since netmap is (so far) relatively little used, I thought it was more important to fix the API and keep it simple. cheers luigi
-+---
Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione
http://www.iet.unipi.it/~luigi/ . Universita` di Pisa
TEL +39-050-2211611 . via Diotisalvi 2
Mobile +39-338-6809875 . 56122 PISA (Italy)
-+---
Re: [Qemu-devel] net: Adding netmap network backend
On Fri, Feb 14, 2014 at 2:20 AM, Vincenzo Maffione v.maffi...@gmail.com wrote: Yes, for sure we need to do a check. However, this would involve - I think - some non-trivial modifications to net/netmap.c, because without NS_MOREFRAG you cannot split a packet over more netmap slots/descriptors (both tx and rx side). Therefore I would ask (mainly Luigi, since netmap is in-tree in FreeBSD): wouldn't it be better to force --disable-netmap when we realize that NETMAP_API (the version number for the netmap API) is less than the value required by QEMU? This can be done in ./configure. In this way we keep the QEMU code cleaner. Yes, we should do exactly what Vincenzo suggests. cheers luigi
Re: [Qemu-devel] [PATCH] net: QEMU_NET_PACKET_FLAG_MORE introduced
On Mon, Dec 9, 2013 at 3:02 PM, Michael S. Tsirkin m...@redhat.com wrote: On Mon, Dec 09, 2013 at 01:36:54PM +0100, Stefan Hajnoczi wrote: On Fri, Dec 06, 2013 at 03:44:33PM +0100, Vincenzo Maffione wrote: - This patch is against the net-next tree (https://github.com/stefanha/qemu.git) because the first netmap patch is not in the qemu master (AFAIK). You are right. I am sending a pull request now to get those patches into qemu.git/master.

This only arrived over the weekend and affects all net devices. What's the rush? Why not give people a chance to review and discuss properly?

As I understand, the pull request is for the netmap backend, not for this batching patch. cheers luigi

 hw/net/cadence_gem.c    |  3 ++-
 hw/net/dp8393x.c        |  5 +++--
 hw/net/e1000.c          | 21 -
 hw/net/eepro100.c       |  5 +++--
 hw/net/etraxfs_eth.c    |  5 +++--
 hw/net/lan9118.c        |  2 +-
 hw/net/mcf_fec.c        |  5 +++--
 hw/net/mipsnet.c        |  6 --
 hw/net/ne2000.c         |  5 +++--
 hw/net/ne2000.h         |  3 ++-
 hw/net/opencores_eth.c  |  2 +-
 hw/net/pcnet.c          |  8 +---
 hw/net/pcnet.h          |  3 ++-
 hw/net/rtl8139.c        |  7 ---
 hw/net/smc91c111.c      |  5 +++--
 hw/net/spapr_llan.c     |  2 +-
 hw/net/stellaris_enet.c |  3 ++-
 hw/net/virtio-net.c     | 10 --
 hw/net/vmxnet3.c        |  3 ++-
 hw/net/vmxnet_tx_pkt.c  |  4 ++--
 hw/net/xgmac.c          |  2 +-
 hw/net/xilinx_axienet.c |  2 +-
 hw/usb/dev-network.c    |  8 +---
 include/net/net.h       | 20 +---
 include/net/queue.h     |  1 +
 net/dump.c              |  3 ++-
 net/hub.c               | 10 ++
 net/net.c               | 39 +++
 net/netmap.c            | 17 -
 net/slirp.c             |  5 +++--
 net/socket.c            | 10 ++
 net/tap-win32.c         |  2 +-
 net/tap.c               | 12 +++-
 net/vde.c               |  5 +++--
 savevm.c                |  2 +-
 35 files changed, 155 insertions(+), 90 deletions(-)

Please split this into multiple patches: 1. net subsystem API change that touches all files (if necessary) 2. e1000 MORE support 3. virtio-net MORE support 4. netmap MORE support. This makes it easier to review and bisect. Thanks, Stefan
Re: [Qemu-devel] commit b1bbfe72 causes huge slowdown with no kvm
On Wed, Nov 20, 2013 at 10:41:22AM +0100, Paolo Bonzini wrote: Il 20/11/2013 00:00, Luigi Rizzo ha scritto: I recently found out that without kvm enabled, and especially with -smp 2 or greater, qemu becomes incredibly slow (to the point that you can see kernel messages in the guest print one character at a time). This happens with a Linux host (even with -smp 1) and with a FreeBSD host (in this case -smp 2 or greater; -smp 1 seems to work fine there). Here is my test case that shows the problem:

HOW TO REPRODUCE: boot a (small) FreeBSD image below: http://info.iet.unipi.it/~luigi/20131119-picobsd.bin with this command (note no kvm; also I disconnect the nic so I can generate traffic without causing trouble):

x86_64-softmmu/qemu-system-x86_64 -m 1024 -hda 20131119-picobsd.bin -curses -monitor tcp::,server,nowait -net nic,model=e1000 -smp 2

The image is readonly so you can kill the qemu without problems.

SYMPTOMS: as soon as the kernel starts its timer services (there is an active kernel thread woken up every millisecond) booting and runtime performance in the guest becomes terribly slow. See details on the test below.

HOST / OS: Tried on a Linux host (Ubuntu 13.10, I believe, with kernel 3.11 and CPU Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz), but similar things occur also on a FreeBSD host. With a FreeBSD host, using -smp 1 seems to bypass the problem, whereas on Linux hosts the -smp setting does not seem to make a difference.

QEMU VERSION and MODIFICATIONS:
git branch -v
* master 5c5432e Merge remote-tracking branch 'luiz/queue/qmp' into staging
No changes other than the small diff I proposed to fix the problem:

git diff
diff --git a/qemu-timer.c b/qemu-timer.c
index e15ce47..4180a86 100644
--- a/qemu-timer.c
+++ b/qemu-timer.c
@@ -380,6 +380,7 @@ static void timerlist_rearm(QEMUTimerList *timer_list)
 {
     /* Interrupt execution to force deadline recalculation. */
     qemu_clock_warp(timer_list->clock->type);
+    if (use_icount) // XXX
         timerlist_notify(timer_list);
 }

DETAILED SYMPTOMS: WITH THE ABOVE PATCH, the kernel boots to a login prompt in 32 seconds (10 of which are the initial pause; you can skip it by pressing enter). Then you can run the pkt-gen program which I normally use to test netmap performance in the guest, and you see between 2 and 3 million packets per second (the NIC is disconnected from the host so this should be harmless). Example:

login: root
Password:
setup # pkt-gen -i em0 -f tx

should return something between 2 and 3 Mpps; ctrl-c to terminate it.

WITHOUT THE PATCH, booting becomes slow as soon as the timer tick starts and we load dummynet (which also starts a kernel thread every millisecond). You should be able to see how the printing of kernel messages slows down terribly around this point:

... Timecounters tick every 1.000 msec
ipfw2 initialized, divert loadable, nat loadable, default to accept, logging disabled
DUMMYNET 0 with IPv6 initialized (100409)

and then it takes a long/variable time to reach the login prompt, easily a couple of minutes or more. If you start pkt-gen now, you should see a much lower rate, around 300-500 Kpps.

Since the problem seems timer-related, it makes sense that you are not seeing much difference in Linux, which is probably tickless, and that the trouble comes out on FreeBSD after the timers are initialized. But again I have no idea if my patch (which simply reverts part of the offending commit) makes sense. cheers luigi

I could not reproduce it; the profile here seems normal, too:

24,69% perf-3281.map            [.] 0x7f4ab5ac9903
14,18% qemu-system-x86_64       [.] cpu_x86_exec
 2,67% libglib-2.0.so.0.3800.1  [.] g_source_iter_next
 2,63% qemu-system-x86_64       [.] tcg_optimize
 2,47% qemu-system-x86_64       [.] helper_pcmpeqb_xmm
 2,28% qemu-system-x86_64       [.] phys_page_find
 1,92% qemu-system-x86_64       [.] address_space_translate_internal
 1,53% qemu-system-x86_64       [.] tcg_liveness_analysis
 1,40% qemu-system-x86_64       [.] tcg_reg_alloc_op
 1,17% qemu-system-x86_64       [.] helper_psubb_xmm
 0,97% qemu-system-x86_64       [.] disas_insn
 0,96% qemu-system-x86_64       [.] cpu_x86_handle_mmu_fault
 0,92% qemu-system-x86_64       [.] tcg_out_opc
 0,92% qemu-system-x86_64       [.] helper_pmovmskb_xmm
 0,91% qemu-system-x86_64       [.] tlb_set_page
 0,75% qemu-system-x86_64       [.] address_space_translate
 0,74% qemu-system-x86_64
Re: [Qemu-devel] [PATCH for 1.7] target-i386: yield to another VCPU on PAUSE
On Wed, Nov 20, 2013 at 12:54:02PM +0100, Paolo Bonzini wrote: After commit b1bbfe7 (aio / timers: On timer modification, qemu_notify or aio_notify, 2013-08-21) FreeBSD guests report a huge slowdown. The problem shows up as soon as FreeBSD turns on its periodic (~1 ms) tick, but the timers are only the trigger for a pre-existing problem. Before the offending patch, setting a timer did a timer_settime system call. After, setting the timer exits the event loop (which uses poll) and reenters it with a new deadline. This does not cause any slowdown by itself; the difference is between one system call (timer_settime) and a signal delivery (SIGALRM) before the patch, and two system calls afterwards (a write to a pipe or eventfd + calling poll again when re-entering the event loop). Unfortunately, the exit/enter causes the main loop to grab the iothread lock, which in turn kicks the VCPU thread out of execution. This causes TCG to execute the next VCPU in its round-robin scheduling of VCPUs. When the second VCPU is mostly unused, FreeBSD runs a pause instruction in its idle loop which only burns cycles without any progress. As soon as the timer tick expires, the first VCPU runs the interrupt handler but very soon it sets it again---and QEMU then goes back to doing nothing in the second VCPU. The fix is to make the pause instruction do cpu_loop_exit.

Thank you. This fixes the slow booting problem, but the runtime performance with my test program in the other mail thread "commit b1bbfe72 causes huge slowdown with no kvm" is still about 50% lower than pre-commit b1bbfe7. So I am still wondering if there is a better way to deliver the timerlist_notify() in timerlist_rearm(). I have tried to suppress the entire timerlist_rearm() when the newly inserted event is close to the previous one, but this does not seem to help -- actually it is harmful, presumably because qemu_clock_warp() is also skipped.
Conversely, filtering out timerlist_notify() as in my previous (incorrect) patch seems to speed up things considerably, perhaps because there are other timerlist_notify() calls elsewhere that keep the ball rolling.

Just as a side note: I am not sure whether you were seeing use of the 'pause' instruction in profiling, but by default the idle loop in FreeBSD uses the acpi method (to enter C1 state). Other options are:
- mwait (unavailable on qemu due to some missing VCPU feature);
- hlt (which I tried and gives the same behaviour as acpi);
- spin (which indeed does use the pause instruction, but is not used unless you force it with sysctl machdep.idle=spin).
pause instructions are however used within spinlocks, and when invoking the scheduler, which happens right before and after the idle loop. cheers luigi

Cc: Richard Henderson r...@twiddle.net
Reported-by: Luigi Rizzo ri...@iet.unipi.it
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 target-i386/helper.h      |  1 +
 target-i386/misc_helper.c | 22 --
 target-i386/translate.c   |  5 -
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/target-i386/helper.h b/target-i386/helper.h
index d6974df..3775abe 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -58,6 +58,7 @@ DEF_HELPER_2(sysret, void, env, int)
 DEF_HELPER_2(hlt, void, env, int)
 DEF_HELPER_2(monitor, void, env, tl)
 DEF_HELPER_2(mwait, void, env, int)
+DEF_HELPER_2(pause, void, env, int)
 DEF_HELPER_1(debug, void, env)
 DEF_HELPER_1(reset_rf, void, env)
 DEF_HELPER_3(raise_interrupt, void, env, int, int)
diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c
index 93933fd..b6307ca 100644
--- a/target-i386/misc_helper.c
+++ b/target-i386/misc_helper.c
@@ -566,6 +566,15 @@ void helper_rdmsr(CPUX86State *env)
 }
 #endif

+static void do_pause(X86CPU *cpu)
+{
+    CPUX86State *env = &cpu->env;
+
+    /* Just let another CPU run. */
+    env->exception_index = EXCP_INTERRUPT;
+    cpu_loop_exit(env);
+}
+
 static void do_hlt(X86CPU *cpu)
 {
     CPUState *cs = CPU(cpu);
@@ -611,13 +620,22 @@ void helper_mwait(CPUX86State *env, int next_eip_addend)
     cs = CPU(cpu);
     /* XXX: not complete but not completely erroneous */
     if (cs->cpu_index != 0 || CPU_NEXT(cs) != NULL) {
-        /* more than one CPU: do not sleep because another CPU may
-           wake this one */
+        do_pause(cpu);
     } else {
         do_hlt(cpu);
     }
 }

+void helper_pause(CPUX86State *env, int next_eip_addend)
+{
+    X86CPU *cpu = x86_env_get_cpu(env);
+
+    cpu_svm_check_intercept_param(env, SVM_EXIT_PAUSE, 0);
+    env->eip += next_eip_addend;
+
+    do_pause(cpu);
+}
+
 void helper_debug(CPUX86State *env)
 {
     env->exception_index = EXCP_DEBUG;
diff --git a/target-i386/translate.c b/target-i386/translate.c
index eb0ea93..ecf16b3 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -7224,7 +7224,10 @@ static target_ulong
[Qemu-devel] commit b1bbfe72 causes huge slowdown with no kvm
I recently found out that without kvm enabled, and especially with -smp 2 or greater, qemu becomes incredibly slow (to the point that you can see kernel messages in the guest print one character at a time). This happens with a Linux host (even with -smp 1) and with a FreeBSD host (in this case -smp 2 or greater; -smp 1 seems to work fine there). Bisecting points to this commit as the culprit:

commit b1bbfe72ec1ebf302d97f886cc646466c0abd679
Author: Alex Bligh a...@alex.org.uk
Date: Wed Aug 21 16:02:55 2013 +0100

    aio / timers: On timer modification, qemu_notify or aio_notify

    On qemu_mod_timer_ns, ensure qemu_notify or aio_notify is called to end the appropriate poll(), irrespective of use_icount value. On qemu_clock_enable, ensure qemu_notify or aio_notify is called for all QEMUTimerLists attached to the QEMUClock.

    Signed-off-by: Alex Bligh a...@alex.org.uk
    Signed-off-by: Stefan Hajnoczi stefa...@redhat.com

I could not revert this individual commit into master because of other changes, but I notice that one key change of the commit was to make a call to timerlist_notify() unconditional, whereas before it was controlled by the use_icount variable. So I tried the small patch below and it seems to restore the original performance, but I have no idea what use_icount does and whether the change makes sense. If not, there is an open problem. Any ideas? cheers luigi

diff --git a/qemu-timer.c b/qemu-timer.c
index e15ce47..4180a86 100644
--- a/qemu-timer.c
+++ b/qemu-timer.c
@@ -380,6 +380,7 @@ static void timerlist_rearm(QEMUTimerList *timer_list)
 {
     /* Interrupt execution to force deadline recalculation. */
     qemu_clock_warp(timer_list->clock->type);
+    if (use_icount) // XXX
         timerlist_notify(timer_list);
 }
[Qemu-devel] [PATCH] better interpreter specification for scripts
Some of the new scripts in scripts/ specify an interpreter path which is not always pointing to the right place. I see that an alternative format is also being used:

#!/usr/bin/env interpreter_name

which seems to work better. The patch below is merely to let master compile on FreeBSD, but there are other offending files that can be found with grep '#!' scripts/*

Signed-off-by: Luigi Rizzo ri...@iet.unipi.it
---
 scripts/acpi_extract.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/acpi_extract.py b/scripts/acpi_extract.py
index 22ea468..66c1b3e 100755
--- a/scripts/acpi_extract.py
+++ b/scripts/acpi_extract.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/env python
 # Copyright (C) 2011 Red Hat, Inc., Michael S. Tsirkin m...@redhat.com
 #
 # This program is free software; you can redistribute it and/or modify
--
1.8.1.2
Re: [Qemu-devel] [PATCH v3] net: Adding netmap network backend
On Mon, Nov 4, 2013 at 9:41 AM, Anthony Liguori anth...@codemonkey.ws wrote: On Tue, Oct 29, 2013 at 3:12 AM, Vincenzo Maffione v.maffi...@gmail.com wrote: This patch adds support for a network backend based on netmap. netmap is a framework for high speed packet I/O. You can use it to build extremely fast traffic generators, monitors, software switches or network middleboxes. Its companion software switch VALE lets you interconnect virtual machines. netmap and VALE are implemented as a non intrusive kernel module, support NICs from multiple vendors, are part of standard FreeBSD distributions and available in source format for Linux too.

I don't think it's a good idea to support this on Linux hosts. This is an out of tree module that most likely will never go upstream. I don't want to live through another kqemu with this if it eventually starts to bit-rot.

I believe this is very different from kqemu. First, it is just a one-file backend (the patches to other files are just because there is not yet a way to automatically generate them; but I am sure qemu will get there). Getting rid of it, should the code bit-rot, is completely trivial. Second, there is nothing Linux specific here. Unless configure determines that the (possibly out of tree, as in Linux, or in-tree, as in FreeBSD) netmap headers are installed, it just won't build the backend. I am not sure if you do not want to support netmap _in general_ (despite it being already upstream in FreeBSD), or you'd like to put extra checks in ./configure to actually prevent its use on Linux hosts. In both cases it seems to me a loss of a useful feature with no real return.

 configure         |  31 
 hmp-commands.hx   |   4 +-
 net/Makefile.objs |   1 +
 net/clients.h     |   5 +
 net/net.c         |   6 +
 net/netmap.c      | 423 ++
 qapi-schema.json  |  19 ++-
 qemu-options.hx   |   8 ++
 8 files changed, 494 insertions(+), 3 deletions(-)
 create mode 100644 net/netmap.c

cheers luigi
Re: [Qemu-devel] [PATCH v3] net: Adding netmap network backend
On Mon, Nov 04, 2013 at 10:20:12AM -0800, Anthony Liguori wrote: On Mon, Nov 4, 2013 at 10:08 AM, Luigi Rizzo ri...@iet.unipi.it wrote: ... On Tue, Oct 29, 2013 at 3:12 AM, Vincenzo Maffione v.maffi...@gmail.com wrote: This patch adds support for a network backend based on netmap. netmap is a framework for high speed packet I/O. You can use it to build extremely fast traffic generators, monitors, software switches or network middleboxes. Its companion software switch VALE lets you interconnect virtual machines. netmap and VALE are implemented as a non intrusive kernel module, support NICs from multiple vendors, are part of standard FreeBSD distributions and available in source format for Linux too. I don't think it's a good idea to support this on Linux hosts. This is an out of tree module that most likely will never go upstream. I don't want to live through another kqemu with this if it eventually starts to bit-rot. I believe this is very different from kqemu. First, it is just a one-file backend (the patches to other files are just because there is not yet a way to automatically generate them; but I am sure qemu will get there). Getting rid of it, should the code bit-rot, is completely trivial. Second, there is nothing Linux specific here. Unless configure determines that the (possibly out of tree, as in Linux, or in-tree, as in FreeBSD) netmap headers are installed, it just won't build the backend.

Without being in upstream Linux, we have no guarantee that the API/ABI will be stable over time. I suspect it's also very unlikely that any mainstream distro will include these patches, meaning that the number of users that will test this is very low. I don't think just adding another backend because we can helps us out in the long term. Either this is the Right Approach to networking and we should focus on getting proper kernel support, or if that's not worth it, then there's no reason to include this in QEMU either.
Anthony, I'd still like you to answer the question that I asked before: are you opposed to netmap support just for Linux, or do you oppose it in general (despite netmap being already upstream in FreeBSD)? Your reasoning seems along the lines of "if feature X is not upstream in Linux we do not want to support it". While I can buy it (hey, it may save a lot of maintenance effort), it contrasts with the presence in qemu of a ton of conditional code to support other host platforms, as well as multiple backends for non-Linux features (mostly audio drivers; some GUI too, think of Cocoa). Also the agendas of Qemu, Linux, FreeBSD and other hosts may be different (and it does not mean that there is One Right Way or that others are wrong), so having inclusion in Linux as a prerequisite for support seems a bit restrictive. Specifically, the networking requirements for qemu (a fast virtual switch, tunnels etc.) are different from those of more typical apps or the OS itself. Also regarding API stability: qemu uses a lot of user libraries whose APIs are a moving target without apparent problems. If you are worried about API stability, netmap is just two small headers, and no library. There isn't really much that can go wrong there... cheers luigi

Regards, Anthony Liguori
Re: [Qemu-devel] [PATCH v3] net: Adding netmap network backend
On Mon, Nov 4, 2013 at 12:54 PM, Anthony Liguori anth...@codemonkey.ws wrote: On Mon, Nov 4, 2013 at 11:51 AM, Luigi Rizzo ri...@iet.unipi.it wrote: On Mon, Nov 04, 2013 at 10:20:12AM -0800, Anthony Liguori wrote: On Mon, Nov 4, 2013 at 10:08 AM, Luigi Rizzo ri...@iet.unipi.it wrote: ... On Tue, Oct 29, 2013 at 3:12 AM, Vincenzo Maffione v.maffi...@gmail.com wrote: This patch adds support for a network backend based on netmap. netmap is a framework for high speed packet I/O. You can use it to build extremely fast traffic generators, monitors, software switches or network middleboxes. Its companion software switch VALE lets you interconnect virtual machines. netmap and VALE are implemented as a non intrusive kernel module, support NICs from multiple vendors, are part of standard FreeBSD distributions and available in source format for Linux too. I don't think it's a good idea to support this on Linux hosts. This is an out of tree module that most likely will never go upstream. I don't want to live through another kqemu with this if it eventually starts to bit-rot. I believe this is very different from kqemu. First, it is just a one-file backend (the patches to other files are just because there is not yet a way to automatically generate them; but I am sure qemu will get there). Getting rid of it, should the code bit-rot, is completely trivial. Second, there is nothing Linux specific here. Unless configure determines that the (possibly out of tree, as in Linux, or in-tree, as in FreeBSD) netmap headers are installed, it just won't build the backend. Without being in upstream Linux, we have no guarantee that the API/ABI will be stable over time. I suspect it's also very unlikely that any mainstream distro will include these patches, meaning that the number of users that will test this is very low. I don't think just adding another backend because we can helps us out in the long term.
Either this is the Right Approach to networking and we should focus on getting proper kernel support, or if that's not worth it, then there's no reason to include this in QEMU either.

Anthony, I'd still like you to answer the question that I asked before: are you opposed to netmap support just for Linux, or do you oppose it in general (despite netmap being already upstream in FreeBSD)? Your reasoning seems along the lines of "if feature X is not upstream in Linux we do not want to support it".

Yes. This is the historic policy we have taken for any feature. I have no problem with netmap being used on FreeBSD hosts but I think it should not be enabled on Linux hosts.

OK, thanks for the clarification. So I misunderstood: the policy is "if not upstream in Linux, we do not want to support it _on Linux_". A fundamental difference :) So in ./configure we must change to 'netmap=no' in the initial section to disable it by default, and add a line 'netmap=' in the FreeBSD section to enable autodetect there. cheers luigi
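The ./configure change sketched above would look something like the following. This is an illustrative shell fragment only, not the actual committed diff; the `case` on `$targetos` mirrors how configure usually dispatches per-platform defaults:

```shell
#!/bin/sh
# Sketch of the proposed ./configure policy: netmap defaults to "no",
# and only the FreeBSD branch resets it to "" (configure's convention
# for "probe and enable if the headers are found").
netmap="no"                 # global default: disabled everywhere

targetos="FreeBSD"          # assumption for this sketch
case "$targetos" in
FreeBSD)
    netmap=""               # empty string means "autodetect"
    ;;
esac

echo "netmap=$netmap"
```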
[Qemu-devel] difference between receive_raw() and receive() NetClientInfo methods ?
Can someone clarify what is the difference between the two methods receive_raw() and receive() in NetClientInfo? tap seems to be the only backend actually implementing receive_raw(), but apart from prepending a vnet_hdr I cannot tell what this is for, and whether receive_raw() is a custom addon for tap, or whether it can be used by other backends to implement different features. The reason I am asking is that we are working on a fewer-copies path between guests, and one of the things we need is an asynchronous receive so that a backend can notify the frontend when a buffer has been actually consumed. Right now nc->info->receive() and friends are all synchronous, and there is no mechanism to support asynchronous completion, which forces the backend to make a copy if it cannot complete the request inline. Hence I probably need to add another method to NetClientInfo so that the frontend can register the callback, or pass it to the backend together with the buffer/iov. thanks luigi
Re: [Qemu-devel] [PATCH] e1000: cleanup process_tx_desc
On Tue, Jun 4, 2013 at 9:34 AM, Andrew Jones drjo...@redhat.com wrote: - Original Message - On 06/03/2013 10:20 AM, Andrew Jones wrote: Coverity complains about two overruns in process_tx_desc(). The complaints are false positives, but we might as well eliminate them. The problem is that hdr is defined as an unsigned int, but is then used to offset an array of size 65536, and another of size 256 bytes. hdr will actually never be greater than 255 though, as it's assigned only once and to the value of tp->hdr_len, which is a uint8_t. This patch simply gets rid of hdr, replacing it with tp->hdr_len, which makes it consistent with all other tp member use in the function.

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 hw/net/e1000.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

The logic looks sound, but checkpatch detects some style issues. See below. ... Although the style issues were present to begin with, we may as well take the opportunity to fix them. Running checkpatch on the file yields 142 errors, 41 warnings. I could send a v2 that fixes the 1 error and 2 warnings found in the context of this patch, but why? It's out of the scope of the patch (although I did use "cleanup" in the summary...), and it would hardly make a dent in this file's problems.

Indeed. I find it slightly annoying (and a waste of everyone's time) that patches are bounced on issues that they are not responsible for (this happens in several other open-source projects I have been involved with). I think that a much more effective approach would be to take the patch (on the grounds that it improves the codebase), and then if the committer feels like fixing the pre-existing style issues he can do it separately, or make a comment in the commit log if he has no time (and by the same reasoning, the original submitter may be short of time).
cheers luigi
Re: [Qemu-devel] RFC: handling backend too fast in virtio-net
On Mon, Feb 18, 2013 at 1:50 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Feb 15, 2013 at 11:24:29AM +0100, Stefan Hajnoczi wrote: On Thu, Feb 14, 2013 at 07:21:57PM +0100, Luigi Rizzo wrote: CCed Michael Tsirkin

virtio-style network devices (where the producer and consumer chase each other through a shared memory block) can enter into a bad operating regime when the consumer is too fast. I am hitting this case right now when virtio is used on top of the netmap/VALE backend that I posted a few weeks ago: what happens is that the backend is so fast that the io thread keeps re-enabling notifications every few packets. This results, on my test machine, in a throughput of 250-300 Kpps (and extremely unstable, oscillating between 200 and 600 Kpps). I'd like to get some feedback on the following trivial patch to have the thread keep spinning for a bounded amount of time when the producer is slower than the consumer. This gives a relatively stable throughput between 700 and 800 Kpps (we have something similar in our paravirtualized e1000 driver, which is slightly faster at 900-1100 Kpps).

Did you experiment with a tx timer instead of a bh? It seems that hw/virtio-net.c has two tx mitigation strategies - the bh approach that you've tweaked and a true timer. It seems you don't really want tx batching but you do want to avoid guest->host notifications?

One more thing I forgot: virtio-net does not use ioeventfd by default. ioeventfd changes the cost of guest->host notifications because the notification becomes an eventfd signal inside the kernel and kvm.ko then re-enters the guest. This means a guest->host notification becomes a light-weight exit and we don't return from ioctl(KVM_RUN). Perhaps -device virtio-blk-pci,ioeventfd=on will give similar results to your patch?

Is the ioeventfd the mechanism used by vhost-net?
If so, Giuseppe Lettieri (in Cc) has tried that with a modified netmap backend and experienced the same problem -- making the io thread (user or kernel) spin a bit more has great benefit on the throughput. cheers luigi
Re: [Qemu-devel] RFC: handling backend too fast in virtio-net
On Fri, Feb 15, 2013 at 11:24:29AM +0100, Stefan Hajnoczi wrote: On Thu, Feb 14, 2013 at 07:21:57PM +0100, Luigi Rizzo wrote: CCed Michael Tsirkin virtio-style network devices (where the producer and consumer chase each other through a shared memory block) can enter into a bad operating regime when the consumer is too fast. I am hitting this case right now when virtio is used on top of the netmap/VALE backend that I posted a few weeks ago: what happens is that the backend is so fast that the io thread keeps re-enabling notifications every few packets. This results, on my test machine, in a throughput of 250-300Kpps (and extremely unstable, oscillating between 200 and 600Kpps). I'd like to get some feedback on the following trivial patch to have the thread keep spinning for a bounded amount of time when the producer is slower than the consumer. This gives a relatively stable throughput between 700 and 800 Kpps (we have something similar in our paravirtualized e1000 driver, which is slightly faster at 900-1100 Kpps). Did you experiment with tx timer instead of bh? It seems that hw/virtio-net.c has two tx mitigation strategies - the bh approach that you've tweaked and a true timer. It seems you don't really want tx batching but you do want to avoid guest-host notifications? i have just tried:

  bh (with my tweaks):       700-800Kpps (large variance)
  timer (150us, 256 slots): ~490Kpps (very stable)

As expected, the throughput in the timer version seems proportional to ring_size/timeout, e.g. 1ms and 256 slots give approx 210Kpps, and 1ms and 128 slots give 108Kpps, but it saturates around 490Kpps. cheers luigi EXISTING LOGIC: reschedule the bh when at least a burst of packets has been received. Otherwise enable notifications and do another race-prevention check. NEW LOGIC: when the bh is scheduled the first time, establish a budget (currently 20) of reschedule attempts.
Every time virtio_net_flush_tx() returns 0 packets [this could be changed to 'less than a full burst'] the counter is increased. The bh is rescheduled until the counter reaches the budget, at which point we re-enable notifications as above. I am not 100% happy with this patch because there is a magic constant (the maximum number of retries) which should really be adaptive, or perhaps we should even bound the amount of time the bh runs, rather than the number of retries. In practice, having the thread spin (or sleep) for 10-20us even without new packets is probably preferable to re-enabling notifications and requesting a kick. opinions ? cheers luigi

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 573c669..5389088 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -49,6 +51,7 @@ typedef struct VirtIONet
     NICState *nic;
     uint32_t tx_timeout;
     int32_t tx_burst;
+    int32_t tx_retries; /* keep spinning a bit on tx */
     uint32_t has_vnet_hdr;
     size_t host_hdr_len;
     size_t guest_hdr_len;
@@ -1062,7 +1065,9 @@ static void virtio_net_tx_bh(void *opaque)
 
     /* If we flush a full burst of packets, assume there are
      * more coming and immediately reschedule */
-    if (ret >= n->tx_burst) {
+    if (ret == 0)
+        n->tx_retries++;
+    if (n->tx_retries < 20) {
         qemu_bh_schedule(q->tx_bh);
         q->tx_waiting = 1;
         return;
@@ -1076,6 +1082,8 @@ static void virtio_net_tx_bh(void *opaque)
         virtio_queue_set_notification(q->tx_vq, 0);
         qemu_bh_schedule(q->tx_bh);
         q->tx_waiting = 1;
+    } else {
+        n->tx_retries = 0;
     }
 }
[Qemu-devel] RFC: handling backend too fast in virtio-net
virtio-style network devices (where the producer and consumer chase each other through a shared memory block) can enter into a bad operating regime when the consumer is too fast. I am hitting this case right now when virtio is used on top of the netmap/VALE backend that I posted a few weeks ago: what happens is that the backend is so fast that the io thread keeps re-enabling notifications every few packets. This results, on my test machine, in a throughput of 250-300Kpps (and extremely unstable, oscillating between 200 and 600Kpps). I'd like to get some feedback on the following trivial patch to have the thread keep spinning for a bounded amount of time when the producer is slower than the consumer. This gives a relatively stable throughput between 700 and 800 Kpps (we have something similar in our paravirtualized e1000 driver, which is slightly faster at 900-1100 Kpps). EXISTING LOGIC: reschedule the bh when at least a burst of packets has been received. Otherwise enable notifications and do another race-prevention check. NEW LOGIC: when the bh is scheduled the first time, establish a budget (currently 20) of reschedule attempts. Every time virtio_net_flush_tx() returns 0 packets [this could be changed to 'less than a full burst'] the counter is increased. The bh is rescheduled until the counter reaches the budget, at which point we re-enable notifications as above. I am not 100% happy with this patch because there is a magic constant (the maximum number of retries) which should be really adaptive, or perhaps we should even bound the amount of time the bh runs, rather than the number of retries. In practice, having the thread spin (or sleep) for 10-20us even without new packets is probably preferable to re-enable notifications and request a kick. opinions ? 
cheers luigi

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 573c669..5389088 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -49,6 +51,7 @@ typedef struct VirtIONet
     NICState *nic;
     uint32_t tx_timeout;
     int32_t tx_burst;
+    int32_t tx_retries; /* keep spinning a bit on tx */
     uint32_t has_vnet_hdr;
     size_t host_hdr_len;
     size_t guest_hdr_len;
@@ -1062,7 +1065,9 @@ static void virtio_net_tx_bh(void *opaque)
 
     /* If we flush a full burst of packets, assume there are
      * more coming and immediately reschedule */
-    if (ret >= n->tx_burst) {
+    if (ret == 0)
+        n->tx_retries++;
+    if (n->tx_retries < 20) {
         qemu_bh_schedule(q->tx_bh);
         q->tx_waiting = 1;
         return;
@@ -1076,6 +1082,8 @@ static void virtio_net_tx_bh(void *opaque)
         virtio_queue_set_notification(q->tx_vq, 0);
         qemu_bh_schedule(q->tx_bh);
         q->tx_waiting = 1;
+    } else {
+        n->tx_retries = 0;
     }
 }
Re: [Qemu-devel] [PATCH] implement moderation registers for e1000
On Fri, Feb 8, 2013 at 2:02 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Wed, Feb 06, 2013 at 03:23:41PM +0100, Luigi Rizzo wrote: The following patch implements interrupt moderation registers for the e1000 adapter. This feature is normally used by OS drivers, and its implementation improves performance significantly, especially on the transmit path. The feature can be disabled through a command line option. We have seen only benefits in throughput, while latency slightly increases (that is an inherent feature of interrupt moderation) because very short delays cannot be emulated precisely. For those curious about performance, there is a set of measurements (of this and other changes that we will post separately) at http://info.iet.unipi.it/~luigi/research.html#qemu http://info.iet.unipi.it/~luigi/papers/20130206-qemu.pdf is 404. sorry, fixed now. And, will resubmit a fixed patch with uninit and fixed braces in the if() statement. I am happy to make this default to off. But it would be good if you could actually give it a try. Note that linux and FreeBSD (and presumably windows) do use moderation by default so enabling the emulation of the registers makes the guest OS behave differently (and closer to bare metal). To test that the patch itself does not cause regression in the emulator one should also turn off moderation in the guest (one of the tests i have run). cheers luigi
Re: [Qemu-devel] [PATCH] implement moderation registers for e1000
On Fri, Feb 08, 2013 at 11:46:34AM +0100, Paolo Bonzini wrote: On 08/02/2013 11:20, Luigi Rizzo wrote: ... I am happy to make this default to off. But it would be good if you could actually give it a try. Note that linux and FreeBSD (and presumably windows) do use moderation by default so enabling the emulation of the registers makes the guest OS behave differently (and closer to bare metal). To test that the patch itself does not cause regression in the emulator one should also turn off moderation in the guest (one of the tests i have run). I think making the default on is fine, but you need to add compatibility options to leave it off in older machine types (pc-1.4 and earlier). I am unclear on what you mean by older machine types? The hardware (E1000_DEV_ID_82540EM) does have these registers, and it is entirely up to the guest OS to use them or not. cheers luigi
Re: [Qemu-devel] [PATCH] implement moderation registers for e1000
On Fri, Feb 08, 2013 at 11:59:12AM +0100, Stefan Hajnoczi wrote: ... http://info.iet.unipi.it/~luigi/papers/20130206-qemu.pdf is 404. sorry, fixed now. And, will resubmit a fixed patch with uninit and fixed braces in the if() statement. I am happy to make this default to off. But it would be good if you could actually give it a try. Note that linux and FreeBSD (and presumably windows) do use moderation by default so enabling the emulation of the registers makes the guest OS behave differently (and closer to bare metal). To test that the patch itself does not cause regression in the emulator one should also turn off moderation in the guest (one of the tests i have run). I think making the default on is fine, but you need to add compatibility options to leave it off in older machine types (pc-1.4 and earlier). Latency regression. Would need to see real results to understand how bad it is. most of the numbers in the paper are for FreeBSD, not sure if they qualify as real here :) For Linux, in guest-host configuration, we have these numbers for TCP_RR:

  CONFIGURATION         TCP_RR (KTps)
  1 VCPU TX itr=0           17.7
  1 VCPU TX itr=100          7.3
  2 VCPU TX itr=0           13.9
  2 VCPU TX itr=100          7.4

These are computed by forcing the value in the itr register. However, by default, linux seems to dynamically adjust the itr values depending on the load so it is difficult to make repeatable experiments. This is why I'd encourage you to run some tests with what you think is an appropriate configuration. cheers luigi
Re: [Qemu-devel] [PATCH] implement moderation registers for e1000
On Fri, Feb 08, 2013 at 12:52:12PM +0100, Paolo Bonzini wrote: On 07/02/2106 07:28, Luigi Rizzo wrote: On Fri, Feb 08, 2013 at 11:46:34AM +0100, Paolo Bonzini wrote: On 08/02/2013 11:20, Luigi Rizzo wrote: ... I am happy to make this default to off. But it would be good if you could actually give it a try. Note that linux and FreeBSD (and presumably windows) do use moderation by default so enabling the emulation of the registers makes the guest OS behave differently (and closer to bare metal). To test that the patch itself does not cause regression in the emulator one should also turn off moderation in the guest (one of the tests i have run). I think making the default on is fine, but you need to add compatibility options to leave it off in older machine types (pc-1.4 and earlier). I am unclear on what you mean by older machine types? The hardware (E1000_DEV_ID_82540EM) does have these registers, and it is entirely up to the guest OS to use them or not. Yes, but suppose a guest has a bug that was masked by the old non-implementation. -M pc-1.4 is supposed to bring back the old version's behavior as much as possible, including making the guest bug latent again. ok i see. Is there some example code that already handles a similar '-M' option so i can look at it and use it to set the mit_on parameter? cheers luigi
[Qemu-devel] [PATCH] implement moderation registers for e1000
The following patch implements interrupt moderation registers for the e1000 adapter. This feature is normally used by OS drivers, and its implementation improves performance significantly, especially on the transmit path. The feature can be disabled through a command line option. We have seen only benefits in throughput, while latency slightly increases (that is an inherent feature of interrupt moderation) because very short delays cannot be emulated precisely. For those curious about performance, there is a set of measurements (of this and other changes that we will post separately) at http://info.iet.unipi.it/~luigi/research.html#qemu cheers luigi

Signed-off-by: Luigi Rizzo ri...@iet.unipi.it

diff --git a/hw/e1000.c b/hw/e1000.c
index bb150c6..b4c0f4a 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -131,6 +131,10 @@ typedef struct E1000State_st {
     } eecd_state;
 
     QEMUTimer *autoneg_timer;
+
+    QEMUTimer *mit_timer;   /* handle for the timer */
+    uint32_t mit_timer_on;  /* mitigation timer active */
+    uint32_t mit_cause;     /* pending interrupt cause */
+    uint32_t mit_on;        /* mitigation enable */
 } E1000State;
 
 #define defreg(x)   x = (E1000_##x >> 2)
@@ -146,6 +150,7 @@ enum {
     defreg(TPR), defreg(TPT), defreg(TXDCTL), defreg(WUFC),
     defreg(RA), defreg(MTA), defreg(CRCERRS), defreg(VFTA),
     defreg(VET),
+    defreg(RDTR), defreg(RADV), defreg(TADV), defreg(ITR),
 };
 
 static void
@@ -652,6 +657,73 @@ static uint64_t tx_desc_base(E1000State *s)
     return (bah << 32) + bal;
 }
 
+/* helper function, 0 means the value is not set */
+static inline void
+mit_update_delay(uint32_t *curr, uint32_t value)
+{
+    if (value && (*curr == 0 || value < *curr)) {
+        *curr = value;
+    }
+}
+
+/*
+ * If necessary, rearm the timer and post an interrupt.
+ * Called at the end of tx/rx routines (mit_timer_on == 0),
+ * and when the timer fires (mit_timer_on == 1).
+ * We provide a partial implementation of interrupt mitigation,
+ * emulating only RADV, TADV and ITR (lower 16 bits, 1024ns units for
+ * RADV and TADV, 256ns units for ITR). RDTR is only used to enable RADV;
+ * relative timers based on TIDV and RDTR are not implemented.
+ */
+static void
+mit_rearm_and_int(void *opaque)
+{
+    E1000State *s = opaque;
+    uint32_t mit_delay = 0;
+
+    /*
+     * Clear the flag. It is only set when the callback fires,
+     * and we need to clear it anyways.
+     */
+    s->mit_timer_on = 0;
+    if (s->mit_cause == 0) { /* no events pending, we are done */
+        return;
+    }
+    /*
+     * Compute the next mitigation delay according to pending interrupts
+     * and the current values of RADV (provided RDTR!=0), TADV and ITR.
+     * Then rearm the timer.
+     */
+    if (s->mit_cause & (E1000_ICR_TXQE | E1000_ICR_TXDW)) {
+        mit_update_delay(&mit_delay, s->mac_reg[TADV] * 4);
+    }
+    if (s->mac_reg[RDTR] && (s->mit_cause & E1000_ICS_RXT0)) {
+        mit_update_delay(&mit_delay, s->mac_reg[RADV] * 4);
+    }
+    mit_update_delay(&mit_delay, s->mac_reg[ITR]);
+
+    if (mit_delay) {
+        s->mit_timer_on = 1;
+        qemu_mod_timer(s->mit_timer,
+                qemu_get_clock_ns(vm_clock) + mit_delay * 256);
+    }
+    set_ics(s, 0, s->mit_cause);
+    s->mit_cause = 0;
+}
+
+static void
+mit_set_ics(E1000State *s, uint32_t cause)
+{
+    if (s->mit_on == 0) {
+        set_ics(s, 0, cause);
+        return;
+    }
+    s->mit_cause |= cause;
+    if (!s->mit_timer_on)
+        mit_rearm_and_int(s);
+}
+
 static void
 start_xmit(E1000State *s)
 {
@@ -689,7 +761,7 @@ start_xmit(E1000State *s)
             break;
         }
     }
-    set_ics(s, 0, cause);
+    mit_set_ics(s, cause);
 }
 
 static int
@@ -908,7 +980,7 @@ e1000_receive(NetClientState *nc, const uint8_t *buf, size_t size)
             s->rxbuf_min_shift)
         n |= E1000_ICS_RXDMT0;
 
-    set_ics(s, 0, n);
+    mit_set_ics(s, n);
 
     return size;
 }
@@ -1013,6 +1085,7 @@ static uint32_t (*macreg_readops[])(E1000State *, int) = {
     getreg(RDH), getreg(RDT), getreg(VET), getreg(ICS),
     getreg(TDBAL), getreg(TDBAH), getreg(RDBAH), getreg(RDBAL),
     getreg(TDLEN), getreg(RDLEN),
+    getreg(RDTR), getreg(RADV), getreg(TADV), getreg(ITR),
 
     [TOTH] = mac_read_clr8, [TORH] = mac_read_clr8, [GPRC] = mac_read_clr4,
     [GPTC] = mac_read_clr4, [TPR] = mac_read_clr4, [TPT] = mac_read_clr4,
@@ -1029,6 +1102,8 @@ static void (*macreg_writeops[])(E1000State *, int, uint32_t) = {
     putreg(PBA), putreg(EERD), putreg(SWSM), putreg(WUFC),
     putreg(TDBAL), putreg(TDBAH), putreg(TXDCTL), putreg(RDBAH),
     putreg(RDBAL), putreg(LEDCTL), putreg(VET),
+    [RDTR] = set_16bit, [RADV] = set_16bit, [TADV] = set_16bit,
+    [ITR] = set_16bit,
     [TDLEN] = set_dlen, [RDLEN] = set_dlen, [TCTL] = set_tctl,
     [TDT] = set_tctl, [MDIC] = set_mdic
[Qemu-devel] [PATCH v2] fix qemu_flush_queued_packets() in presence of a hub
On Tue, Jan 29, 2013 at 02:08:27PM +0100, Stefan Hajnoczi wrote: It's cleaner to add a bool (*flush)(NetClientState *nc) function to NetClientInfo but given that net.c already knows about hub.c specifically we don't gain much by trying to decouple this callback. If you want to leave it hardcoded that's fine by me. I'd leave it like this for now. Here is the version with fixed style bugs. DESCRIPTION: When frontend and backend are connected through a hub as below (showing only one direction), and the frontend (or in general, all output ports of the hub) cannot accept more traffic, the backend queues packets in queue-A. When the frontend (or in general, one output port) becomes ready again, qemu tries to flush packets from queue-B, which is unfortunately empty.

  e1000.0 --[queue B]-- hub0port0 (hub) hub0port1 --[queue A]-- tap.0

To fix this i propose to introduce a new function net_hub_flush() which is called when trying to flush a queue connected to a hub. cheers luigi

Signed-off-by: Luigi Rizzo ri...@iet.unipi.it

diff --git a/net/hub.c b/net/hub.c
index a24c9d1..df32074 100644
--- a/net/hub.c
+++ b/net/hub.c
@@ -338,3 +338,17 @@ void net_hub_check_clients(void)
         }
     }
 }
+
+bool net_hub_flush(NetClientState *nc)
+{
+    NetHubPort *port;
+    NetHubPort *source_port = DO_UPCAST(NetHubPort, nc, nc);
+    int ret = 0;
+
+    QLIST_FOREACH(port, &source_port->hub->ports, next) {
+        if (port != source_port) {
+            ret += qemu_net_queue_flush(port->nc.send_queue);
+        }
+    }
+    return ret ? true : false;
+}
diff --git a/net/hub.h b/net/hub.h
index 583ada8..a625eff 100644
--- a/net/hub.h
+++ b/net/hub.h
@@ -21,5 +21,6 @@ NetClientState *net_hub_add_port(int hub_id, const char *name);
 NetClientState *net_hub_find_client_by_name(int hub_id, const char *name);
 void net_hub_info(Monitor *mon);
 void net_hub_check_clients(void);
+bool net_hub_flush(NetClientState *nc);
 
 #endif /* NET_HUB_H */
diff --git a/net/net.c b/net/net.c
index 9806862..9489702 100644
--- a/net/net.c
+++ b/net/net.c
@@ -439,6 +439,12 @@ void qemu_flush_queued_packets(NetClientState *nc)
 {
     nc->receive_disabled = 0;
 
+    if (nc->peer && nc->peer->info->type == NET_CLIENT_OPTIONS_KIND_HUBPORT) {
+        if (net_hub_flush(nc->peer)) {
+            qemu_notify_event();
+        }
+        return;
+    }
     if (qemu_net_queue_flush(nc->send_queue)) {
         /* We emptied the queue successfully, signal to the IO thread to repoll
          * the file descriptor (for tap, for example).
[Qemu-devel] [PATCH v3] fix qemu_flush_queued_packets() in presence of a hub
DESCRIPTION: When frontend and backend are connected through a hub as below (showing only one direction), and the frontend (or in general, all output ports of the hub) cannot accept more traffic, the backend queues packets in queue-A. When the frontend (or in general, one output port) becomes ready again, qemu tries to flush packets from queue-B, which is unfortunately empty.

  e1000.0 --[queue B]-- hub0port0 (hub) hub0port1 --[queue A]-- tap.0

To fix this i propose to introduce a new function net_hub_flush() which is called when trying to flush a queue connected to a hub. cheers luigi

Signed-off-by: Luigi Rizzo ri...@iet.unipi.it

diff --git a/net/hub.c b/net/hub.c
index a24c9d1..df32074 100644
--- a/net/hub.c
+++ b/net/hub.c
@@ -338,3 +338,17 @@ void net_hub_check_clients(void)
         }
     }
 }
+
+bool net_hub_flush(NetClientState *nc)
+{
+    NetHubPort *port;
+    NetHubPort *source_port = DO_UPCAST(NetHubPort, nc, nc);
+    int ret = 0;
+
+    QLIST_FOREACH(port, &source_port->hub->ports, next) {
+        if (port != source_port) {
+            ret += qemu_net_queue_flush(port->nc.send_queue);
+        }
+    }
+    return ret ? true : false;
+}
diff --git a/net/hub.h b/net/hub.h
index 583ada8..a625eff 100644
--- a/net/hub.h
+++ b/net/hub.h
@@ -21,5 +21,6 @@ NetClientState *net_hub_add_port(int hub_id, const char *name);
 NetClientState *net_hub_find_client_by_name(int hub_id, const char *name);
 void net_hub_info(Monitor *mon);
 void net_hub_check_clients(void);
+bool net_hub_flush(NetClientState *nc);
 
 #endif /* NET_HUB_H */
diff --git a/net/net.c b/net/net.c
index 9806862..9489702 100644
--- a/net/net.c
+++ b/net/net.c
@@ -439,6 +439,12 @@ void qemu_flush_queued_packets(NetClientState *nc)
 {
     nc->receive_disabled = 0;
 
+    if (nc->peer && nc->peer->info->type == NET_CLIENT_OPTIONS_KIND_HUBPORT) {
+        if (net_hub_flush(nc->peer)) {
+            qemu_notify_event();
+        }
+        return;
+    }
     if (qemu_net_queue_flush(nc->send_queue)) {
         /* We emptied the queue successfully, signal to the IO thread to repoll
          * the file descriptor (for tap, for example).
[Qemu-devel] [PATCH v2] fix unbounded NetQueue
In the current implementation of qemu, running without a network backend will cause the queue to grow unbounded when the guest is transmitting traffic. This patch fixes the problem by implementing a bounded-size NetQueue, used with an arbitrary limit of 10000 packets, and dropping packets when the queue is full _and_ the sender does not pass a callback. The second condition makes sure that we never drop packets that contain a callback (which would be tricky, because the producer expects the callback to be run when all previous packets have been consumed; so we cannot run it when the packet is dropped). If documentation is correct, producers that submit a callback should stop sending when their packet is queued, so there is no real risk that the queue exceeds the max size by large values.

Signed-off-by: Luigi Rizzo ri...@iet.unipi.it

diff --git a/net/queue.c b/net/queue.c
index 6eaf5b6..859d02a 100644
--- a/net/queue.c
+++ b/net/queue.c
@@ -50,6 +50,8 @@ struct NetPacket {
 
 struct NetQueue {
     void *opaque;
+    uint32_t nq_maxlen;
+    uint32_t nq_count;
 
     QTAILQ_HEAD(packets, NetPacket) packets;
@@ -63,6 +65,8 @@ NetQueue *qemu_new_net_queue(void *opaque)
     queue = g_malloc0(sizeof(NetQueue));
 
     queue->opaque = opaque;
+    queue->nq_maxlen = 10000;
+    queue->nq_count = 0;
 
     QTAILQ_INIT(&queue->packets);
@@ -92,6 +96,9 @@ static void qemu_net_queue_append(NetQueue *queue,
 {
     NetPacket *packet;
 
+    if (queue->nq_count >= queue->nq_maxlen && !sent_cb) {
+        return; /* drop if queue full and no callback */
+    }
     packet = g_malloc(sizeof(NetPacket) + size);
     packet->sender = sender;
     packet->flags = flags;
@@ -99,6 +106,7 @@ static void qemu_net_queue_append(NetQueue *queue,
     packet->sent_cb = sent_cb;
     memcpy(packet->data, buf, size);
 
+    queue->nq_count++;
     QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
@@ -113,6 +121,9 @@ static void qemu_net_queue_append_iov(NetQueue *queue,
     size_t max_len = 0;
     int i;
 
+    if (queue->nq_count >= queue->nq_maxlen && !sent_cb) {
+        return; /* drop if queue full and no callback */
+    }
     for (i = 0; i < iovcnt; i++) {
         max_len += iov[i].iov_len;
     }
@@ -130,6 +141,7 @@ static void qemu_net_queue_append_iov(NetQueue *queue,
         packet->size += len;
     }
 
+    queue->nq_count++;
     QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
@@ -220,6 +232,7 @@ void qemu_net_queue_purge(NetQueue *queue, NetClientState *from)
     QTAILQ_FOREACH_SAFE(packet, &queue->packets, entry, next) {
         if (packet->sender == from) {
             QTAILQ_REMOVE(&queue->packets, packet, entry);
+            queue->nq_count--;
             g_free(packet);
         }
     }
@@ -233,6 +246,7 @@ bool qemu_net_queue_flush(NetQueue *queue)
 
         packet = QTAILQ_FIRST(&queue->packets);
         QTAILQ_REMOVE(&queue->packets, packet, entry);
+        queue->nq_count--;
 
         ret = qemu_net_queue_deliver(queue,
                                      packet->sender,
@@ -240,6 +254,7 @@ bool qemu_net_queue_flush(NetQueue *queue)
                                      packet->data,
                                      packet->size);
         if (ret == 0) {
+            queue->nq_count++;
             QTAILQ_INSERT_HEAD(&queue->packets, packet, entry);
             return false;
         }
Re: [Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))
On Thu, Jan 24, 2013 at 11:24 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Thu, Jan 24, 2013 at 6:35 PM, Luigi Rizzo ri...@iet.unipi.it wrote: never mind, pilot error. in my test program i had swapped the arguments to __builtin_memcpy(). With the correct ones, __builtin_memcpy() == bcopy == memcpy on both machines, and never faster than the pkt_copy(). Are the bcopy()/memcpy() calls given a length that is a multiple of 64 bytes? IIUC pkt_copy() assumes 64-byte multiple lengths and that optimization can be matched with memcpy(dst, src, (len + 63) & ~63). Maybe it helps and at least ensures they are doing equal amounts of byte copying. the length is a parameter from the command line. For short packets, at least on the i7-2600 and freebsd the pkt_copy() is only slightly faster than memcpy on multiples of 64, and *a lot* faster when the length is not a multiple. How about dropping pkt_copy() and instead rounding the memcpy() length up to the next 64 byte multiple? Using memcpy() is more future-proof IMO, that's why I'm pushing for this. fair enough, i'll make this conditional and enable memcpy() rounded to 64 byte multiples by default (though as i said the pkt_copy() is always at least as fast as memcpy() on all machines i tried). cheers luigi
Re: [Qemu-devel] (change __FUNCTION__ to __func__ according to qemu coding style)
On Fri, Jan 25, 2013 at 9:23 AM, Michael Tokarev m...@tls.msk.ru wrote: --- v2: change __FUNCTION__ to __func__ according to qemu coding style will do. However it does not seem a highly followed standard :)

  lrizzo@netmap:~/work-qemu/qemu-git-head-20130121$ grep -r __FUNCTION__ . | wc
       344    2119   25898
  lrizzo@netmap:~/work-qemu/qemu-git-head-20130121$ grep -r __func__ . | wc
       759    5035   58455

cheers luigi
Re: [Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))
On Thu, Jan 24, 2013 at 09:54:19AM +0100, Stefan Hajnoczi wrote: On Wed, Jan 23, 2013 at 06:55:59PM -0800, Luigi Rizzo wrote: On Wed, Jan 23, 2013 at 8:03 AM, Luigi Rizzo ri...@iet.unipi.it wrote: I'm even doubtful that it's always a win on FreeBSD. You have a threshold to fall back to bcopy() and who knows what the best value for various CPUs is. indeed. With the attached program (which however might be affected by the fact that data is not used after copying) it seems that on a recent linux (using gcc 4.6.2) the fastest is __builtin_memcpy() ./testlock -m __builtin_memcpy -l 64 (by a factor of 2 or more) whereas all the other methods have approximately the same speed. never mind, pilot error. in my test program i had swapped the arguments to __builtin_memcpy(). With the correct ones, __builtin_memcpy() == bcopy == memcpy on both machines, and never faster than the pkt_copy(). Are the bcopy()/memcpy() calls given a length that is a multiple of 64 bytes? IIUC pkt_copy() assumes 64-byte multiple lengths and that optimization can be matched with memcpy(dst, src, (len + 63) & ~63). Maybe it helps and at least ensures they are doing equal amounts of byte copying. the length is a parameter from the command line. For short packets, at least on the i7-2600 and freebsd the pkt_copy() is only slightly faster than memcpy on multiples of 64, and *a lot* faster when the length is not a multiple. Again i am not sure whether it depends on the compiler/glibc or simply on the CPU, unfortunately i have no way to swap machines. luigi
Re: [Qemu-devel] [PATCH v2] netmap backend (revised)
On Tue, Jan 22, 2013 at 2:50 PM, Anthony Liguori aligu...@us.ibm.comwrote: Hi, Thank you for submitting your patch series. checkpatch.pl has detected that one or more of the patches in this series violate the QEMU coding style. If you believe this message was sent in error, please ignore it or respond here with an explanation. Otherwise, please correct the coding style issues and resubmit a new version of the patch. will do, thanks for the feedback luigi
Re: [Qemu-devel] [PATCH v2] netmap backend (revised)
On Wed, Jan 23, 2013 at 12:10:55PM +0100, Stefan Hajnoczi wrote: On Tue, Jan 22, 2013 at 08:12:15AM +0100, Luigi Rizzo wrote: reposting a version without changes that implement bounded queues in net/queue.c Hi, the attached patch implements a qemu backend for the netmap API thus allowing machines to attach to the VALE software switch as well as netmap-supported cards (links below). http://info.iet.unipi.it/~luigi/netmap/ http://info.iet.unipi.it/~luigi/vale/ This is a cleaned up version of code written last summer. Looks like a clean software approach to low-level packet I/O. I guess the biggest competitor would be a userspace library with NIC drivers using modern IOMMUs to avoid the security/reliability problem that previous userspace approaches suffered. Pretty cool that netmap reuses kernel NIC driver implementations to avoid duplicating all that code. I wonder how/if netmap handles advanced NIC features like multiqueue and offloads? But I've only read the webpage, not the papers or source code :). there are definitely more details in the papers :) IOMMU is not strictly necessary because userspace only sees packet buffers and a device independent replica of the descriptor ring. NIC registers and rings are only manipulated by the kernel within system calls. multiqueue is completely straightforward -- netmap exposes as many queues as the hardware implements, you can attach file descriptors to individual queues, bind threads to queues by just using pthread_setaffinity(). offloading so far is not supported, and that's part of the design: it adds complexity at runtime to check the various possible combinations of offloading in various places in the stack. A single packet format makes the driver extremely efficient. The thing that may make a difference is tcp segmentation and reassembly, we will look at how to support it at some point. I'll apply the other changes you suggest below, thanks.
Only some comments: The debugging macros (D(), RD() ) are taken from our existing code, +#define ND(fd, ... ) // debugging +#define D(format, ...) \ ... +/* rate limited, lps indicates how many per second */ +#define RD(lps, format, ...)\ ... Have you seen docs/tracing.txt? It can do fprintf(stderr) but also SystemTap/DTrace and a simple built-in binary tracer. will look at it. These debugging macros are the same we use in other netmap code so i'd prefer to keep them. +// a fast copy routine only for multiples of 64 bytes, non overlapped. +static inline void +pkt_copy(const void *_src, void *_dst, int l) ... +*dst++ = *src++; +} +} I wonder how different FreeBSD bcopy() is from glibc memcpy() and if the optimization is even a win. The glibc code is probably hand-written assembly that CPU vendors have contributed for specific CPU model families. Did you compare glibc memcpy() against pkt_copy()? I haven't tried in detail on glibc but will run some tests. In any case not all systems have glibc, and on FreeBSD this pkt_copy was a significant win for small packets (saving some 20ns each; of course this counts only when you approach the 10 Mpps range, which is what you get with netmap, and of course when data is in cache). One reason pkt_copy gains something is that if it can assume there is extra space in the buffer, it can work on large chunks avoiding the extra jumps and instructions for the remaining 1-2-4 bytes. + ring->slot[i].len = size; + pkt_copy(buf, dst, size); How does netmap guarantee that the buffer is large enough for this packet? the netmap buffers are 2k, i'll make sure there is a check that the size is not exceeded. +close(s->me.fd); Missing munmap? yes you are correct. cheers luigi
[Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))
On Wed, Jan 23, 2013 at 02:03:17PM +0100, Stefan Hajnoczi wrote: On Wed, Jan 23, 2013 at 12:50:26PM +0100, Luigi Rizzo wrote: On Wed, Jan 23, 2013 at 12:10:55PM +0100, Stefan Hajnoczi wrote: On Tue, Jan 22, 2013 at 08:12:15AM +0100, Luigi Rizzo wrote: ... +// a fast copy routine only for multiples of 64 bytes, non overlapped. +static inline void +pkt_copy(const void *_src, void *_dst, int l) ... +*dst++ = *src++; +} +} I wonder how different FreeBSD bcopy() is from glibc memcpy() and if the optimization is even a win. The glibc code is probably hand-written assembly that CPU vendors have contributed for specific CPU model families. Did you compare glibc memcpy() against pkt_copy()? I haven't tried in detail on glibc but will run some tests. In any case not all systems have glibc, and on FreeBSD this pkt_copy was a significant win for small packets (saving some 20ns each; of course this counts only when you approach the 10 Mpps range, which is what you get with netmap, and of course when data is in cache). One reason pkt_copy gains something is that if it can assume there is extra space in the buffer, it can work on large chunks avoiding the extra jumps and instructions for the remaining 1-2-4 bytes. I'd like to drop this code or at least make it FreeBSD-specific since there's no guarantee that this is a good idea on any other libc. I'm even doubtful that it's always a win on FreeBSD. You have a threshold to fall back to bcopy() and who knows what the best value for various CPUs is. indeed. With the attached program (which however might be affected by the fact that data is not used after copying) it seems that on a recent linux (using gcc 4.6.2) the fastest is __builtin_memcpy() ./testlock -m __builtin_memcpy -l 64 (by a factor of 2 or more) whereas all the other methods have approximately the same speed. On FreeBSD (with clang, gcc 4.2.1, gcc 4.6.4) the pkt_copy() above ./testlock -m fastcopy -l 64 is largely better than other methods. 
I am a bit puzzled why the builtin method on FreeBSD is not effective, but i will check on some other forum... cheers luigi /* * Copyright (C) 2012 Luigi Rizzo. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the *documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * $Id: testlock.c 12015 2013-01-23 15:51:17Z luigi $ * * Test program to study various ops and concurrency issues. * Create multiple threads, possibly bind to cpus, and run a workload. 
 *
 * cc -O2 -Werror -Wall testlock.c -o testlock -lpthread
 * you might need -lrt
 */

#include <inttypes.h>
#include <sys/types.h>
#include <pthread.h>	/* pthread_* */

#if defined(__APPLE__)

#include <libkern/OSAtomic.h>
#define atomic_add_int(p, n) OSAtomicAdd32(n, (int *)p)
#define atomic_cmpset_32(p, o, n) OSAtomicCompareAndSwap32(o, n, (int *)p)

#elif defined(linux)

int atomic_cmpset_32(volatile uint32_t *p, uint32_t old, uint32_t new)
{
    int ret = *p == old;
    *p = new;
    return ret;
}

#if defined(HAVE_GCC_ATOMICS)
int atomic_add_int(volatile int *p, int v)
{
    return __sync_fetch_and_add(p, v);
}
#else
inline uint32_t
atomic_add_int(uint32_t *p, int v)
{
    __asm __volatile (
        "   lock   xaddl   %0, %1 ;  "
        : "+r" (v),    /* 0 (result) */
          "=m" (*p)    /* 1 */
        : "m" (*p));   /* 2 */
    return (v);
}
#endif

#else /* FreeBSD */
#include <sys/param.h>
#include <machine/atomic.h>
#include <pthread_np.h> /* pthread w/ affinity */

#if __FreeBSD_version > 50
#include <sys/cpuset.h> /* cpu_set */
#if __FreeBSD_version > 80
#define HAVE_AFFINITY
#endif

inline void
Re: [Qemu-devel] memcpy speed (Re: [PATCH v2] netmap backend (revised))
On Wed, Jan 23, 2013 at 8:03 AM, Luigi Rizzo ri...@iet.unipi.it wrote: I'm even doubtful that it's always a win on FreeBSD. You have a threshold to fall back to bcopy() and who knows what the best value for various CPUs is. indeed. With the attached program (which however might be affected by the fact that data is not used after copying) it seems that on a recent linux (using gcc 4.6.2) the fastest is __builtin_memcpy() ./testlock -m __builtin_memcpy -l 64 (by a factor of 2 or more) whereas all the other methods have approximately the same speed. never mind, pilot error: in my test program I had swapped the source and destination operands for __builtin_memcpy(), and this substantially changed the memory access pattern. With the correct operands, __builtin_memcpy == memcpy == bcopy on both FreeBSD and Linux, and never faster than pkt_copy(). On FreeBSD pkt_copy() is still faster than the other methods for small packets, whereas on Linux they are equivalent. If you are curious why swapping source and dst changed things so dramatically: the test was supposed to read from a large chunk of memory (over 1GB) to avoid always hitting L1 or L2. Swapping operands causes reads to hit always the same line, thus saving a lot of misses. The difference between the two machines then probably is due to how the cache is used on writes. cheers luigi -- -+--- Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione http://www.iet.unipi.it/~luigi/ . Universita` di Pisa TEL +39-050-2211611 . via Diotisalvi 2 Mobile +39-338-6809875 . 56122 PISA (Italy) -+---
[Qemu-devel] [PATCH] netmap backend
Hi, the attached patch implements a qemu backend for the netmap API thus allowing machines to attach to the VALE software switch as well as netmap-supported cards (links below). http://info.iet.unipi.it/~luigi/netmap/ http://info.iet.unipi.it/~luigi/vale/ This is a cleaned up version of code written last summer. guest-guest speed using an e1000 frontend (with some modifications related to interrupt moderation, will repost an updated version later): up to 700 Kpps using sockets, and up to 5 Mpps using netmap within the guests. I have not tried with virtio. cheers luigi Signed-off-by: Luigi Rizzo ri...@iet.unipi.it -- configure | 31 + net/Makefile.objs | 1 + net/clients.h | 4 + net/net.c | 3 + net/qemu-netmap.c | 353 + net/queue.c | 15 +++ qapi-schema.json | 8 +- 7 files changed, 414 insertions(+), 1 deletions(-) diff --git a/configure b/configure index c6172ef..cfdf8a6 100755 --- a/configure +++ b/configure @@ -146,6 +146,7 @@ curl= curses= docs= fdt= +netmap= nptl= pixman= sdl= @@ -739,6 +740,10 @@ for opt do ;; --enable-vde) vde=yes ;; + --disable-netmap) netmap=no + ;; + --enable-netmap) netmap=yes + ;; --disable-xen) xen=no ;; --enable-xen) xen=yes @@ -1112,6 +1117,8 @@ echo --disable-uuid disable uuid support echo --enable-uuid enable uuid support echo --disable-vde disable support for vde network echo --enable-vde enable support for vde network +echo --disable-netmap disable support for netmap network +echo --enable-netmap enable support for netmap network echo --disable-linux-aio disable Linux AIO support echo --enable-linux-aio enable Linux AIO support echo --disable-cap-ng disable libcap-ng support @@ -1914,6 +1921,26 @@ EOF fi ## +# netmap headers probe +if test "$netmap" != "no" ; then + cat > $TMPC <<EOF +#include <inttypes.h> +#include <net/if.h> +#include <net/netmap.h> +#include <net/netmap_user.h> +int main(void) { return 0; } +EOF + if compile_prog ; then + netmap=yes + else + if test "$netmap" = "yes" ; then + feature_not_found netmap + fi + netmap=no + fi +fi + +## # libcap-ng
library probe if test "$cap_ng" != "no" ; then cap_libs="-lcap-ng" @@ -3314,6 +3341,7 @@ echo NPTL support $nptl echo GUEST_BASE $guest_base echo PIE $pie echo vde support $vde +echo netmap support $netmap echo Linux AIO support $linux_aio echo ATTR/XATTR support $attr echo Install blobs $blobs @@ -3438,6 +3466,9 @@ fi if test "$vde" = "yes" ; then echo "CONFIG_VDE=y" >> $config_host_mak fi +if test "$netmap" = "yes" ; then + echo "CONFIG_NETMAP=y" >> $config_host_mak +fi if test "$cap_ng" = "yes" ; then echo "CONFIG_LIBCAP=y" >> $config_host_mak fi diff --git a/net/Makefile.objs b/net/Makefile.objs index a08cd14..068253f 100644 --- a/net/Makefile.objs +++ b/net/Makefile.objs @@ -10,3 +10,4 @@ common-obj-$(CONFIG_AIX) += tap-aix.o common-obj-$(CONFIG_HAIKU) += tap-haiku.o common-obj-$(CONFIG_SLIRP) += slirp.o common-obj-$(CONFIG_VDE) += vde.o +common-obj-$(CONFIG_NETMAP) += qemu-netmap.o diff --git a/net/clients.h b/net/clients.h index 7793294..952d076 100644 --- a/net/clients.h +++ b/net/clients.h @@ -52,4 +52,8 @@ int net_init_vde(const NetClientOptions *opts, const char *name, NetClientState *peer); #endif +#ifdef CONFIG_NETMAP +int net_init_netmap(const NetClientOptions *opts, const char *name, + NetClientState *peer); +#endif #endif /* QEMU_NET_CLIENTS_H */ diff --git a/net/net.c b/net/net.c index cdd9b04..816c987 100644 --- a/net/net.c +++ b/net/net.c @@ -618,6 +618,9 @@ static int (* const net_client_init_fun[NET_CLIENT_OPTIONS_KIND_MAX])( [NET_CLIENT_OPTIONS_KIND_BRIDGE] = net_init_bridge, #endif [NET_CLIENT_OPTIONS_KIND_HUBPORT] = net_init_hubport, +#ifdef CONFIG_NETMAP + [NET_CLIENT_OPTIONS_KIND_NETMAP] = net_init_netmap, +#endif }; diff --git a/net/qemu-netmap.c b/net/qemu-netmap.c new file mode 100644 index 000..79d7c09 --- /dev/null +++ b/net/qemu-netmap.c @@ -0,0 +1,353 @@ +/* + * netmap access for qemu + * + * Copyright (c) 2012-2013 Luigi Rizzo + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files
(the "Software"), to deal + in the Software without restriction, including without limitation the rights + to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + copies of the Software, and to permit persons to whom the Software is + furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE
[Qemu-devel] [PATCH] fix qemu_flush_queued_packets() in presence of a hub
when frontend and backend are connected through a hub as below (showing only one direction), and the frontend (or in general, all output ports of the hub) cannot accept more traffic, the backend queues packets in queue-A. When the frontend (or in general, one output port) becomes ready again, qemu tries to flush packets from queue-B, which is unfortunately empty. e1000.0 --[queue B]-- hub0port0 (hub) hub0port1 --[queue A]-- tap.0 To fix this I propose to introduce a new function net_hub_flush() which is called when trying to flush a queue connected to a hub. cheers luigi Signed-off-by: Luigi Rizzo ri...@iet.unipi.it net/hub.c | 13 + net/hub.h | 1 + net/net.c | 6 ++ diff --git a/net/hub.c b/net/hub.c index a24c9d1..08f95d0 100644 --- a/net/hub.c +++ b/net/hub.c @@ -338,3 +338,16 @@ void net_hub_check_clients(void) } } } + +bool net_hub_flush(NetClientState *nc) +{ +    NetHubPort *port; +    NetHubPort *source_port = DO_UPCAST(NetHubPort, nc, nc); +    int ret = 0; + +    QLIST_FOREACH(port, &source_port->hub->ports, next) { +        if (port != source_port) +            ret += qemu_net_queue_flush(port->nc.send_queue); +    } +    return ret ?
true : false; +} diff --git a/net/hub.h b/net/hub.h index 583ada8..a625eff 100644 --- a/net/hub.h +++ b/net/hub.h @@ -21,5 +21,6 @@ NetClientState *net_hub_add_port(int hub_id, const char *name); NetClientState *net_hub_find_client_by_name(int hub_id, const char *name); void net_hub_info(Monitor *mon); void net_hub_check_clients(void); +bool net_hub_flush(NetClientState *nc); #endif /* NET_HUB_H */ diff --git a/net/net.c b/net/net.c index 816c987..8caa368 100644 --- a/net/net.c +++ b/net/net.c @@ -355,6 +355,12 @@ void qemu_flush_queued_packets(NetClientState *nc) { nc->receive_disabled = 0; +    if (nc->peer && nc->peer->info->type == NET_CLIENT_OPTIONS_KIND_HUBPORT) { +        if (net_hub_flush(nc->peer)) { +            qemu_notify_event(); +        } +        return; +    } if (qemu_net_queue_flush(nc->send_queue)) { /* We emptied the queue successfully, signal to the IO thread to repoll * the file descriptor (for tap, for example).
[Qemu-devel] [PATCH v2] netmap backend (revised)
reposting a version without the changes that implement bounded queues in net/queue.c Hi, the attached patch implements a qemu backend for the netmap API thus allowing machines to attach to the VALE software switch as well as netmap-supported cards (links below). http://info.iet.unipi.it/~luigi/netmap/ http://info.iet.unipi.it/~luigi/vale/ This is a cleaned up version of code written last summer. guest-guest speed using an e1000 frontend (with some modifications related to interrupt moderation, will repost an updated version later): up to 700 Kpps using sockets, and up to 5 Mpps using netmap within the guests. I have not tried with virtio. cheers luigi Signed-off-by: Luigi Rizzo ri...@iet.unipi.it -- configure | 31 + net/Makefile.objs | 1 + net/clients.h | 4 + net/net.c | 3 + net/qemu-netmap.c | 353 + qapi-schema.json | 8 +- diff --git a/configure b/configure index c6172ef..cfdf8a6 100755 --- a/configure +++ b/configure @@ -146,6 +146,7 @@ curl= curses= docs= fdt= +netmap= nptl= pixman= sdl= @@ -739,6 +740,10 @@ for opt do ;; --enable-vde) vde=yes ;; + --disable-netmap) netmap=no + ;; + --enable-netmap) netmap=yes + ;; --disable-xen) xen=no ;; --enable-xen) xen=yes @@ -1112,6 +1117,8 @@ echo --disable-uuid disable uuid support echo --enable-uuid enable uuid support echo --disable-vde disable support for vde network echo --enable-vde enable support for vde network +echo --disable-netmap disable support for netmap network +echo --enable-netmap enable support for netmap network echo --disable-linux-aio disable Linux AIO support echo --enable-linux-aio enable Linux AIO support echo --disable-cap-ng disable libcap-ng support @@ -1914,6 +1921,26 @@ EOF fi ## +# netmap headers probe +if test "$netmap" != "no" ; then + cat > $TMPC <<EOF +#include <inttypes.h> +#include <net/if.h> +#include <net/netmap.h> +#include <net/netmap_user.h> +int main(void) { return 0; } +EOF + if compile_prog ; then + netmap=yes + else + if test "$netmap" = "yes" ; then + feature_not_found netmap + fi + netmap=no + fi +fi + +## #
libcap-ng library probe if test "$cap_ng" != "no" ; then cap_libs="-lcap-ng" @@ -3314,6 +3341,7 @@ echo NPTL support $nptl echo GUEST_BASE $guest_base echo PIE $pie echo vde support $vde +echo netmap support $netmap echo Linux AIO support $linux_aio echo ATTR/XATTR support $attr echo Install blobs $blobs @@ -3438,6 +3466,9 @@ fi if test "$vde" = "yes" ; then echo "CONFIG_VDE=y" >> $config_host_mak fi +if test "$netmap" = "yes" ; then + echo "CONFIG_NETMAP=y" >> $config_host_mak +fi if test "$cap_ng" = "yes" ; then echo "CONFIG_LIBCAP=y" >> $config_host_mak fi diff --git a/net/Makefile.objs b/net/Makefile.objs index a08cd14..068253f 100644 --- a/net/Makefile.objs +++ b/net/Makefile.objs @@ -10,3 +10,4 @@ common-obj-$(CONFIG_AIX) += tap-aix.o common-obj-$(CONFIG_HAIKU) += tap-haiku.o common-obj-$(CONFIG_SLIRP) += slirp.o common-obj-$(CONFIG_VDE) += vde.o +common-obj-$(CONFIG_NETMAP) += qemu-netmap.o diff --git a/net/clients.h b/net/clients.h index 7793294..952d076 100644 --- a/net/clients.h +++ b/net/clients.h @@ -52,4 +52,8 @@ int net_init_vde(const NetClientOptions *opts, const char *name, NetClientState *peer); #endif +#ifdef CONFIG_NETMAP +int net_init_netmap(const NetClientOptions *opts, const char *name, + NetClientState *peer); +#endif #endif /* QEMU_NET_CLIENTS_H */ diff --git a/net/net.c b/net/net.c index cdd9b04..816c987 100644 --- a/net/net.c +++ b/net/net.c @@ -618,6 +618,9 @@ static int (* const net_client_init_fun[NET_CLIENT_OPTIONS_KIND_MAX])( [NET_CLIENT_OPTIONS_KIND_BRIDGE] = net_init_bridge, #endif [NET_CLIENT_OPTIONS_KIND_HUBPORT] = net_init_hubport, +#ifdef CONFIG_NETMAP + [NET_CLIENT_OPTIONS_KIND_NETMAP] = net_init_netmap, +#endif }; diff --git a/net/qemu-netmap.c b/net/qemu-netmap.c new file mode 100644 index 000..79d7c09 --- /dev/null +++ b/net/qemu-netmap.c @@ -0,0 +1,353 @@ +/* + * netmap access for qemu + * + * Copyright (c) 2012-2013 Luigi Rizzo + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation
files (the "Software"), to deal + in the Software without restriction, including without limitation the rights + to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + copies of the Software, and to permit persons to whom the Software is + furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software
[Qemu-devel] wrong argument to qemu_flush_queued_packets() in network frontends ?
While running qemu 1.3.0 with the following network-related flags: -net nic -net tap,ifname=tap0,script='' I encountered the same problem (should be common to several frontends, e.g. e100, eepro100, virtio-net, xen_nic): in net/tap.c :: tap_send(), if qemu_send_packet_async() returns 0 (e.g. because the NIC has no buffers available) traffic stops, despite the fact that the frontend will try to pull queued packets when the receive ring is updated. Upon investigation, it turns out that the backend code does size = qemu_send_packet_async(&s->nc, buf, size, tap_send_completed); if (size == 0) { tap_read_poll(s, 0); and the arguments are s->nc.name = tap.0 s->nc.peer->name = hub0port1 s->nc.send_queue = 0x7f40b2f61e20 s->nc.peer->send_queue = 0x7f40b2f63690 <--- enqueued here whereas the frontend is trying to pull from a different queue qemu_flush_queued_packets(&s->nic->nc); with arguments s->nic->nc.name = e1000.0 s->nic->nc.peer->name = hub0port0 <--- try to flush this s->nic->nc.send_queue = 0x7f40b3008ae0 s->nic->nc.peer->send_queue = 0x7f40b2f63660 Note, regular traffic flows correctly across the hub, but qemu_flush_queued_packets() seems to try and pull from the wrong place. Any idea how to fix this (other than the inefficient solution of leaving read_poll=1 in the frontend)? cheers luigi
Re: [Qemu-devel] wrong argument to qemu_flush_queued_packets() in network frontends ?
small correction: On Sun, Jan 20, 2013 at 6:50 PM, Luigi Rizzo ri...@iet.unipi.it wrote: While running qemu 1.3.0 with the following network-related flags: -net nic -net tap,ifname=tap0,script='' I encountered the same problem (should be common to several frontends, e.g. e100, eepro100, virtio-net, xen_nic): in net/tap.c :: tap_send(), if qemu_send_packet_async() returns 0 (e.g. because the NIC has no buffers available) traffic stops, despite the fact that the frontend will try to pull queued packets when the receive ring is updated. Upon investigation, it turns out that the backend code does size = qemu_send_packet_async(&s->nc, buf, size, tap_send_completed); if (size == 0) { tap_read_poll(s, 0); and the arguments are s->nc.name = tap.0 s->nc.peer->name = hub0port1 s->nc.send_queue = 0x7f40b2f61e20 s->nc.peer->send_queue = 0x7f40b2f63690 <--- enqueued here whereas the frontend is trying to pull from a different queue qemu_flush_queued_packets(&s->nic->nc); with arguments s->nic->nc.name = e1000.0 s->nic->nc.peer->name = hub0port0 <--- try to flush this s->nic->nc.send_queue = 0x7f40b3008ae0 the queue that is actually flushed is s->nic->nc.send_queue or 0x7f40b3008ae0 s->nic->nc.peer->send_queue = 0x7f40b2f63660 Note, regular traffic flows correctly across the hub, but qemu_flush_queued_packets() seems to try and pull from the wrong place. Any idea how to fix this (other than the inefficient solution of leaving read_poll=1 in the frontend)? cheers luigi cheers luigi
Re: [Qemu-devel] wrong argument to qemu_flush_queued_packets() in network frontends ?
... and upon closer inspection, the problem described below (frontend blocks the backend, then tries to drain the wrong queue causing a stall) occurs because the hub in the middle breaks the flow of events. In the configuration below ( -net nic -net tap,ifname=tap0,... ) we have e1000.0 -- hub0port0 [hub] hub0port1 -- tap.0 The hub0port1 reports as non-writable when all other ports (just one in this case) are full, and the packet is queued on hub0port1. However when the e1000 frontend tries to drain the queue, it directly accesses the queue attached to hub0port0, which is empty. So it appears that the only fix is the following: when a node is attached to a hub, instead of draining the queue on the node one should drain all queues attached to the hub. A new function qemu_flush_hub() would be handy, something like QLIST_FOREACH(port, &hub->ports, next) { if (port != source_port) qemu_flush_queued_packets(&port->nc); } The other option (queueing on the output ports of the hub) would require a bit more attention to make sure that the callback is only executed once (and also, to avoid excessive data replication). Not impossible, but it requires reference counting the packet. What do you think, which way do you prefer? cheers luigi On Sun, Jan 20, 2013 at 6:50 PM, Luigi Rizzo ri...@iet.unipi.it wrote: While running qemu 1.3.0 with the following network-related flags: -net nic -net tap,ifname=tap0,script='' I encountered the same problem (should be common to several frontends, e.g. e100, eepro100, virtio-net, xen_nic): in net/tap.c :: tap_send(), if qemu_send_packet_async() returns 0 (e.g. because the NIC has no buffers available) traffic stops, despite the fact that the frontend will try to pull queued packets when the receive ring is updated.
Upon investigation, it turns out that the backend code does size = qemu_send_packet_async(&s->nc, buf, size, tap_send_completed); if (size == 0) { tap_read_poll(s, 0); and the arguments are s->nc.name = tap.0 s->nc.peer->name = hub0port1 s->nc.send_queue = 0x7f40b2f61e20 s->nc.peer->send_queue = 0x7f40b2f63690 <--- enqueued here whereas the frontend is trying to pull from a different queue qemu_flush_queued_packets(&s->nic->nc); with arguments s->nic->nc.name = e1000.0 s->nic->nc.peer->name = hub0port0 <--- try to flush this s->nic->nc.send_queue = 0x7f40b3008ae0 s->nic->nc.peer->send_queue = 0x7f40b2f63660 Note, regular traffic flows correctly across the hub, but qemu_flush_queued_packets() seems to try and pull from the wrong place. Any idea how to fix this (other than the inefficient solution of leaving read_poll=1 in the frontend)? cheers luigi
Re: [Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?
On Fri, Jan 18, 2013 at 05:14:02PM +0100, Paolo Bonzini wrote: ... This relies on the assumption that the ring (which is contiguous in the guest's physical address space) is also contiguous in the host's virtual address space. In principle the property could be easily verified once the ring is set up. IIRC, the amount of contiguous memory is written by address_space_map in the plen parameter. In your case: + s->txring = address_space_map(pci_dma_context(s->dev)->as, + base, &desclen, 0 /* is_write */); that would be desclen on return from address_space_map. ok thanks. And of course, am i missing some important detail ? Unfortunately yes. First, host memory mappings could change (though they rarely do on PC). The result of address_space_map is not guaranteed to be stable. To avoid problems with this, however, you could use something like hw/dataplane/hostmem.c and even avoid address_space_map altogether. I'll look into that. Hopefully there is something that i can use as a notification that the mapping has changed... Second, the pci_dma_*() could have the addresses translated by an IOMMU. virtio is documented to have real physical memory addresses, but this does not apply to other devices. I see. I suppose the ability to have an IOMMU depends on the specific NIC ? I am only planning to use the above shortcut for e1000. thanks a lot for the quick feedback luigi
[Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?
Hi, with a bunch of e1000 improvements we are at a point where we are doing over 1 Mpps (short frames) and 7-8 Gbit/s (1500 byte frames) between two guests, and two things that are high in the perf top stats are phys_page_find() and related memory copies. Both are triggered by the pci_dma_read() and pci_dma_write(), which on e1000 (and presumably other frontends) are called on every single descriptor and every single buffer. I have then tried to access the guest memory without going every time through the page lookup. For the tx and rx rings i have a partial workaround, which tracks changes to the base address of the ring, converts it to a host virtual address +#ifdef MAP_RING +    base = tx_desc_base(s); +    if (base != s->txring_phi) { +        hwaddr desclen = s->mac_reg[TDLEN]; +        s->txring_phi = base; +        s->txring = address_space_map(pci_dma_context(s->dev)->as, +            base, &desclen, 0 /* is_write */); +    } +#endif /* MAP_RING */ ... and then accesses the descriptor directly into guest memory desc = s->txring[s->mac_reg[TDH]]; (sprinkle with memory barriers as needed). This relies on the assumption that the ring (which is contiguous in the guest's physical address space) is also contiguous in the host's virtual address space. In principle the property could be easily verified once the ring is set up. I have not done this for buffers because I am not sure how to verify that the same mapping holds for all packet buffers. One way could be the following: on the first buffer access, make the address translation and try to determine the boundaries of the contiguous (in virtual host memory) region that holds the buffer. Then subsequent buffers can be easily validated against this region. So the problem is now the following: given a guest physical address, is there a quick way to determine the contiguous region of memory in the host that contains it ? And of course, am i missing some important detail ?
Of course the above could be used conditionally if the required conditions hold, and then revert to the pci_dma_*() in other cases. cheers luigi
Re: [Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?
On Fri, Jan 18, 2013 at 05:14:02PM +0100, Paolo Bonzini wrote: Il 18/01/2013 17:04, Luigi Rizzo ha scritto: Hi, with a bunch of e1000 improvements we are at a point where we are doing over 1 Mpps (short frames) and 7-8 Gbit/s (1500 byte frames) between two guests, and two things that are high in the perf top stats are phys_page_find() and related memory copies. Both are triggered by the pci_dma_read() and pci_dma_write(), which on e1000 (and presumably other frontends) are called on every single descriptor and every single buffer. I have then tried to access the guest memory without going every time through the page lookup. [...] This relies on the assumption that the ring (which is contiguous in the guest's physical address space) is also contiguous in the host's virtual address space. In principle the property could be easily verified once the ring is set up. IIRC, the amount of contiguous memory is written by address_space_map in the plen parameter. unfortunately the plen parameter is modified only if the area is smaller than the request, and there is no method that i can find that returns [base,len] of a RAMBlock. What I came up with, also to check for invalid addresses, IOMMU and the like, is something like this: // addr is the address we want to map into host VM int mappable_addr(PCIDevice *dev, hwaddr addr, uint64_t *guest_ha_low, uint64_t *guest_ha_high, uint64_t *gpa_to_hva) { AddressSpace *as = dev->as; AddressSpaceDispatch *d = as->dispatch; MemoryRegionSection *section; RAMBlock *block; if (dma_has_iommu(pci_dma_context(dev))) return 0; // no direct access section = phys_page_find(d, addr >> TARGET_PAGE_BITS); if (!memory_region_is_ram(section->mr) || section->readonly) return 0; // no direct access QLIST_FOREACH(block, &ram_list.blocks, next) { if (addr - block->offset < block->length) { /* set 3 variables indicating the valid range * and the offset between the two address spaces.
*/ *guest_ha_low = block->offset; *guest_ha_high = block->offset + block->length; *gpa_to_hva = (uint64_t)block->host - block->offset; return 1; } } return 0; } (this probably needs to be put in exec.c or some other place that can access phys_page_find() and RAMBlock) The interested client (hw/e1000.c in my case) could then do a memory_listener_register() to be notified of changes, invoke mappable_addr() on the first data buffer it has to translate, and cache the result and use it to speed up the translation subsequently in case of a hit (with the pci_dma_read/write being the fallback methods in case of a miss). The cache is invalidated on updates arriving from the memory listener, and refreshed at the next access. Is this more sound ? The only missing piece then is the call to invalidate_and_set_dirty() cheers luigi
Re: [Qemu-devel] [PATCH] fix unbounded qemu NetQueue
On Thu, Jan 17, 2013 at 2:21 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Thu, Jan 17, 2013 at 07:07:11AM +0100, Luigi Rizzo wrote: The comment at the beginning of net/queue.c says that packets that cannot be sent by qemu_net_queue_send() should not be enqueued unless a callback is set. This patch implements this behaviour, which prevents a queue from growing unbounded (e.g. when a network backend is not connected). Also for good measure the patch implements bounded size queues (though it should not be necessary now because each source can only have one packet queued). When a packet is dropped because of excessive queue size the callback is not supposed to be called. Although I appreciate the semantics that the comment tries to establish, the code doesn't behave like this today and we cannot drop packets in cases where we relied on queuing them. More changes will be required to make the hub, USB, pcap scenario I described previously work. i see. then the other option would be to drop packets only if the queue is oversize AND the callback is not set: +    if (queue->nq_count >= queue->nq_maxlen && !sent_cb) +        return; so we should be able to use the queue to store packets when the USB guest is slow, and avoid dropping precious packets with the callback set (there should be only one of them for each source) ? The queue might grow slightly overlimit but still be kept under control. I cannot think of another way to handle the callback. I think we cannot run it on a drop as it would cause a premature restart of the sender. (at a cursory inspection of the code, just tap.c and virtio-net.c use callbacks) cheers luigi
Re: [Qemu-devel] unbounded qemu NetQue's ?
On Mon, Jan 7, 2013 at 5:49 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Sun, Jan 06, 2013 at 08:23:56PM +0100, Luigi Rizzo wrote: Hi, while testing the tx path in qemu without a network backend connected, i noticed that qemu_net_queue_send() builds up an unbounded queue, because QTAILQ* have no notion of a queue length. As a result, memory usage of a qemu instance can grow to extremely large values. I wonder if it makes sense to implement a hard limit on the size of NetQue's. The patch below is only a partial implementation but probably sufficient for these purposes. cheers luigi Hi Luigi, Good idea, we should bound the queues to prevent rare situations or bugs from hogging memory. ... diff -urp qemu-1.3.0-orig/net/queue.c qemu-1.3.0/net/queue.c --- qemu-1.3.0-orig/net/queue.c 2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/net/queue.c 2013-01-06 19:38:12.0 +0100 @@ -92,6 +92,9 @@ static void qemu_net_queue_append(NetQue Please also do it for qemu_net_queue_append_iov(). { NetPacket *packet; +    if (queue->packets.tqh_count > 1) +        return; // XXX drop + sent_cb() must be invoked. It's best to do this in a QEMUBH in case the caller is not prepared for reentrancy. Stefan, the semantics of callbacks makes it difficult to run them on drops: they are supposed to run only when the queue has been drained, and apparently only once per sender, according to the comment in the header of net/queue.c: /* The delivery handler may only return zero if it will call * qemu_net_queue_flush() when it determines that it is once again able * to deliver packets. It must also call qemu_net_queue_purge() in its * cleanup path. * * If a sent callback is provided to send(), the caller must handle a * zero return from the delivery handler by not sending any more packets * until we have invoked the callback. Only in that case will we queue * the packet. * * If a sent callback isn't provided, we just drop the packet to avoid * unbounded queueing.
*/ This seems to suggest that a packet to qemu_net_queue_send() should never be queued unless it has a callback (hence the existing code is buggy, because it never ever drops packets), so the queue can only hold one packet per sender, hence there is no real risk of overflow. cheers luigi
[Qemu-devel] [PATCH] fix unbounded qemu NetQueue
The comment at the beginning of net/queue.c says that packets that cannot be sent by qemu_net_queue_send() should not be enqueued unless a callback is set. This patch implements that behaviour, which prevents a queue from growing unbounded (e.g. when a network backend is not connected). For good measure, the patch also implements bounded-size queues (though that should not be necessary now, because each source can only have one packet queued). When a packet is dropped because of excessive queue size, the callback is not invoked. cheers luigi Signed-off-by: Luigi Rizzo ri...@iet.unipi.it
---
--- ../orig/net/queue.c 2012-12-03 11:37:05.0 -0800
+++ ./net/queue.c 2013-01-16 21:37:20.109732443 -0800
@@ -50,6 +50,8 @@ struct NetPacket {
 
 struct NetQueue {
     void *opaque;
+    uint32_t nq_maxlen;
+    uint32_t nq_count;
     QTAILQ_HEAD(packets, NetPacket) packets;
 
@@ -63,6 +65,8 @@ NetQueue *qemu_new_net_queue(void *opaqu
     queue = g_malloc0(sizeof(NetQueue));
 
     queue->opaque = opaque;
+    queue->nq_maxlen = 1; /* arbitrary limit */
+    queue->nq_count = 0;
 
     QTAILQ_INIT(&queue->packets);
 
@@ -92,6 +96,8 @@ static void qemu_net_queue_append(NetQue
 {
     NetPacket *packet;
 
+    if (!sent_cb || queue->nq_count >= queue->nq_maxlen)
+        return;
     packet = g_malloc(sizeof(NetPacket) + size);
     packet->sender = sender;
     packet->flags = flags;
@@ -99,6 +105,7 @@ static void qemu_net_queue_append(NetQue
     packet->sent_cb = sent_cb;
     memcpy(packet->data, buf, size);
 
+    queue->nq_count++;
     QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
 
@@ -113,6 +120,8 @@ static void qemu_net_queue_append_iov(Ne
     size_t max_len = 0;
     int i;
 
+    if (!sent_cb || queue->nq_count >= queue->nq_maxlen)
+        return;
     for (i = 0; i < iovcnt; i++) {
         max_len += iov[i].iov_len;
     }
@@ -130,6 +139,7 @@ static void qemu_net_queue_append_iov(Ne
         packet->size += len;
     }
 
+    queue->nq_count++;
     QTAILQ_INSERT_TAIL(&queue->packets, packet, entry);
 }
 
@@ -220,6 +230,7 @@ void qemu_net_queue_purge(NetQueue *queu
     QTAILQ_FOREACH_SAFE(packet, &queue->packets, entry, next) {
         if (packet->sender == from) {
             QTAILQ_REMOVE(&queue->packets, packet, entry);
+            queue->nq_count--;
             g_free(packet);
         }
     }
 
@@ -233,6 +244,7 @@ bool qemu_net_queue_flush(NetQueue *queu
         packet = QTAILQ_FIRST(&queue->packets);
         QTAILQ_REMOVE(&queue->packets, packet, entry);
+        queue->nq_count--;
 
         ret = qemu_net_queue_deliver(queue,
                                      packet->sender,
@@ -240,6 +252,7 @@ bool qemu_net_queue_flush(NetQueue *queu
                                      packet->data,
                                      packet->size);
         if (ret == 0) {
+            queue->nq_count++;
             QTAILQ_INSERT_HEAD(&queue->packets, packet, entry);
             return false;
         }
[Qemu-devel] nic-specific options ? (Re: [RFC] updated e1000 mitigation patch)
On Thu, Jan 10, 2013 at 01:25:48PM +0100, Stefan Hajnoczi wrote: On Thu, Dec 27, 2012 at 11:06:58AM +0100, Luigi Rizzo wrote: diff -urp qemu-1.3.0-orig/hw/e1000.c qemu-1.3.0/hw/e1000.c --- qemu-1.3.0-orig/hw/e1000.c 2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/hw/e1000.c 2012-12-27 09:47:16.0 +0100 @@ -35,6 +35,8 @@ #include "e1000_hw.h" +static int mit_on = 1; /* interrupt mitigation enable */ If you want to make this optional then please put it inside E1000State so it can be toggled per NIC. what is the simplest way to add NIC-specific options ? I have added one line to e1000_properties, as below static Property e1000_properties[] = { DEFINE_NIC_PROPERTIES(E1000State, conf), +DEFINE_PROP_UINT32("mit_on", E1000State, mit_on, 0), DEFINE_PROP_END_OF_LIST(), }; and this way the option is recognised on the command line qemu ... -device e1000,mit_on=1 ... but i do not know how to set the property for the NIC i am using. Specifically, i normally run qemu with -net nic,model=e1000 (leaving the NIC unconnected to the host network, so i can test the tx path without being constrained by the backend's speed) Any suggestion ? thanks luigi
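For the record, qdev properties of a NIC created with the legacy -net nic syntax can also be set through -global, which overrides a property's default value for every instance of a driver. A sketch of both invocation styles (mit_on is the experimental property from the patch above, not an upstream option, so these command lines only work with the patched build):

```sh
# Explicit device syntax: the property is given directly on -device
qemu-system-x86_64 -netdev user,id=n1 -device e1000,netdev=n1,mit_on=1

# Legacy -net nic syntax: override the property default for all e1000 devices
qemu-system-x86_64 -net nic,model=e1000 -global e1000.mit_on=1
```

The -global form answers the question in the email: it applies the property even when the device is created implicitly by -net nic rather than by -device.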
Re: [Qemu-devel] unbounded qemu NetQue's ?
On Mon, Jan 7, 2013 at 2:49 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Sun, Jan 06, 2013 at 08:23:56PM +0100, Luigi Rizzo wrote: Hi, while testing the tx path in qemu without a network backend connected, i noticed that qemu_net_queue_send() builds up an unbounded queue, because QTAILQ* have no notion of a queue length. As a result, memory usage of a qemu instance can grow to extremely large values. I wonder if it makes sense to implement a hard limit on the size of NetQueues. The patch below is only a partial implementation but probably sufficient for these purposes. cheers luigi Hi Luigi, Good idea, we should bound the queues to prevent rare situations or bugs from hogging memory. Ideally we would do away with queues completely and make net clients hand buffers to each other ahead of time to impose flow control. I've thought about this a few times and always reached the conclusion that it's not quite possible. given that implementing flow control on the inbound (host->guest) path is impossible, i tend to agree that removing queues is probably not worth the effort even if possible at all. ... Your comments below also remind me of a more general issue with the code: diff -urp qemu-1.3.0-orig/net/queue.c qemu-1.3.0/net/queue.c --- qemu-1.3.0-orig/net/queue.c 2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/net/queue.c 2013-01-06 19:38:12.0 +0100 @@ -92,6 +92,9 @@ static void qemu_net_queue_append(NetQue Please also do it for qemu_net_queue_append_iov(). the qemu code has many duplicate functions of the form foo() and foo_iov(). Not only here, but even in the backends (e.g. see net/tap.c). I think it would be good to settle on a single version of the function and remove or convert the non-iov instances. { NetPacket *packet; +if (queue->packets.tqh_count > 1) + return; // XXX drop + sent_cb() must be invoked. It's best to do this in a QEMUBH in case the caller is not prepared for reentrancy.
i forgot that, but the use of sent_cb() is interesting: the only place that actually uses it seems to be net/tap.c, and the way it uses it only makes sense if the queue has room! packet = g_malloc(sizeof(NetPacket) + size); packet-sender = sender; packet-flags = flags; diff -urp qemu-1.3.0-orig/qemu-queue.h qemu-1.3.0/qemu-queue.h --- qemu-1.3.0-orig/qemu-queue.h 2012-12-03 20:37:05.0+0100 +++ qemu-1.3.0/qemu-queue.h 2013-01-06 19:34:01.0 +0100 Please don't modify qemu-queue.h. It's a generic queue data structure used by all of QEMU. Instead, keep a counter in the NetQueue. good suggestion, that also makes the change smaller. cheers luigi Stefan -- -+--- Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione http://www.iet.unipi.it/~luigi/. Universita` di Pisa TEL +39-050-2211611 . via Diotisalvi 2 Mobile +39-338-6809875 . 56122 PISA (Italy) -+---
[Qemu-devel] unbounded qemu NetQue's ?
Hi, while testing the tx path in qemu without a network backend connected, i noticed that qemu_net_queue_send() builds up an unbounded queue, because QTAILQ* have no notion of a queue length. As a result, memory usage of a qemu instance can grow to extremely large values. I wonder if it makes sense to implement a hard limit on size of NetQue's. The patch below is only a partial implementation but probably sufficient for these purposes. cheers luigi diff -urp qemu-1.3.0-orig/net/queue.c qemu-1.3.0/net/queue.c --- qemu-1.3.0-orig/net/queue.c 2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/net/queue.c 2013-01-06 19:38:12.0 +0100 @@ -92,6 +92,9 @@ static void qemu_net_queue_append(NetQue { NetPacket *packet; +if (queue-packets.tqh_count 1) + return; // XXX drop + packet = g_malloc(sizeof(NetPacket) + size); packet-sender = sender; packet-flags = flags; diff -urp qemu-1.3.0-orig/qemu-queue.h qemu-1.3.0/qemu-queue.h --- qemu-1.3.0-orig/qemu-queue.h2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/qemu-queue.h 2013-01-06 19:34:01.0 +0100 @@ -320,11 +320,12 @@ struct { struct name { \ qual type *tqh_first; /* first element */ \ qual type *qual *tqh_last; /* addr of last next element */ \ + int32_t tqh_count; \ } #define QTAILQ_HEAD(name, type) Q_TAILQ_HEAD(name, struct type,) #define QTAILQ_HEAD_INITIALIZER(head) \ -{ NULL, (head).tqh_first } +{ NULL, (head).tqh_first, 0 } #define Q_TAILQ_ENTRY(type, qual) \ struct {\ @@ -339,6 +340,7 @@ struct { #define QTAILQ_INIT(head) do { \ (head)-tqh_first = NULL; \ (head)-tqh_last = (head)-tqh_first; \ +(head)-tqh_count = 0; \ } while (/*CONSTCOND*/0) #define QTAILQ_INSERT_HEAD(head, elm, field) do { \ @@ -348,6 +350,7 @@ struct { else\ (head)-tqh_last = (elm)-field.tqe_next; \ (head)-tqh_first = (elm); \ +(head)-tqh_count++; \ (elm)-field.tqe_prev = (head)-tqh_first; \ } while (/*CONSTCOND*/0) @@ -356,6 +359,7 @@ struct { (elm)-field.tqe_prev = (head)-tqh_last; \ *(head)-tqh_last = (elm); \ (head)-tqh_last = (elm)-field.tqe_next; \ 
+(head)-tqh_count++; \ } while (/*CONSTCOND*/0) #define QTAILQ_INSERT_AFTER(head, listelm, elm, field) do { \ @@ -381,6 +385,7 @@ struct { (elm)-field.tqe_prev; \ else\ (head)-tqh_last = (elm)-field.tqe_prev; \ +(head)-tqh_count--; \ *(elm)-field.tqe_prev = (elm)-field.tqe_next; \ } while (/*CONSTCOND*/0)
Re: [Qemu-devel] [PULL 0/1] update seabios
are you going to distribute a 1.3.x snapshot with the updated bios that lets FreeBSD boot ? thanks luigi On Wed, Jan 2, 2013 at 5:57 PM, Anthony Liguori anth...@codemonkey.wswrote: Gerd Hoffmann kra...@redhat.com writes: Hi, One more seabios update, fixing the FreeBSD build failure. please pull, Gerd
[Qemu-devel] [RFC] updated e1000 mitigation patch
Before submitting a proper patch, I'd like to hear feedback on the following proposed change to hw/e1000.c to properly implement interrupt mitigation. This is joint work with Vincenzo Maffione and Giuseppe Lettieri (in Cc), and is a follow-up to a similar patch i posted in July https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg03195.html The patch models three of the five (sic!) e1000 registers that control moderation, and uses qemu timers for that (the minimum delay for these timers affects the fidelity of the emulation). Right now there is a static variable that controls whether the feature is enabled. I would probably like to make it a parameter accessible from the command line in qemu, possibly extending it to override the mitigation delay set by the device driver. Right now we have reached transmit speeds in the order of 200-300 Kpps (if matched with a proper guest device driver optimization, see https://groups.google.com/forum/?fromgroups=#!topic/mailing.freebsd.emulator/xQHR_hleCuc ) Some performance data using a FreeBSD guest, for UDP transmissions:

                                   KVM        QEMU
  standard KVM, standard driver    20 Kpps    6.3 Kpps
  modified KVM, standard driver    37 Kpps    28 Kpps
  modified KVM, modified driver    200 Kpps   34 Kpps

As you can see, on kvm this change gives a 5x speedup to the tx path, which combines nicely with the 2x speedup that comes from supporting interrupt mitigation alone in the hypervisor. Without kvm (or kqemu ?) the benefits are much lower, as the guest becomes too slow.
cheers luigi diff -urp qemu-1.3.0-orig/hw/e1000.c qemu-1.3.0/hw/e1000.c --- qemu-1.3.0-orig/hw/e1000.c 2012-12-03 20:37:05.0 +0100 +++ qemu-1.3.0/hw/e1000.c 2012-12-27 09:47:16.0 +0100 @@ -35,6 +35,8 @@ #include e1000_hw.h +static int mit_on = 1; /* interrupt mitigation enable */ + #define E1000_DEBUG #ifdef E1000_DEBUG @@ -129,6 +131,9 @@ typedef struct E1000State_st { } eecd_state; QEMUTimer *autoneg_timer; +QEMUTimer *mit_timer; // handle for the timer +uint32_t mit_timer_on; // mitigation timer active +uint32_t mit_cause;// pending interrupt cause } E1000State; #definedefreg(x) x = (E1000_##x2) @@ -144,6 +149,7 @@ enum { defreg(TPR), defreg(TPT),defreg(TXDCTL), defreg(WUFC), defreg(RA),defreg(MTA),defreg(CRCERRS),defreg(VFTA), defreg(VET), +defreg(RDTR), defreg(RADV), defreg(TADV), defreg(ITR), }; static void @@ -639,6 +645,68 @@ static uint64_t tx_desc_base(E1000State return (bah 32) + bal; } +/* helper function, 0 means the value is not set */ +static inline void +mit_update_delay(uint32_t *curr, uint32_t value) +{ +if (value (*curr == 0 || value *curr)) + *curr = value; +} + +/* + * If necessary, rearm the timer and post an interrupt. + * Called at the end of tx/rx routines (mit_timer_on == 0), + * and when the timer fires (mit_timer_on == 1). + * We provide a partial implementation of interrupt mitigation, + * emulating only RADV, TADV and ITR (lower 16 bits, 1024ns units for + * RADV and TADV, 256ns units for ITR). RDTR is only used to enable RADV; + * relative timers based on TIDV and RDTR are not implemented. + */ +static void +mit_rearm_and_int(void *opaque) +{ +E1000State *s = opaque; +uint32_t mit_delay = 0; + +/* + * Clear the flag. It is only set when the callback fires, + * and we need to clear it anyways. + */ +s-mit_timer_on = 0; +if (s-mit_cause == 0) /* no events pending, we are done */ + return; +/* + * Compute the next mitigation delay according to pending interrupts + * and the current values of RADV (provided RDTR!=0), TADV and ITR. 
+ * Then rearm the timer. + */ +if (s-mit_cause (E1000_ICR_TXQE | E1000_ICR_TXDW)) + mit_update_delay(mit_delay, s-mac_reg[TADV] * 4); +if (s-mac_reg[RDTR] (s-mit_cause E1000_ICS_RXT0)) + mit_update_delay(mit_delay, s-mac_reg[RADV] * 4); +mit_update_delay(mit_delay, s-mac_reg[ITR]); + +if (mit_delay) { + s-mit_timer_on = 1; + qemu_mod_timer(s-mit_timer, + qemu_get_clock_ns(vm_clock) + mit_delay * 256); +} +set_ics(s, 0, s-mit_cause); +s-mit_cause = 0; +} + +static void +mit_set_ics(E1000State *s, uint32_t cause) +{ +if (mit_on == 0) { + set_ics(s, 0, cause); + return; +} +s-mit_cause |= cause; +if (!s-mit_timer_on) + mit_rearm_and_int(s); +} + static void start_xmit(E1000State *s) { @@ -676,7 +744,7 @@ start_xmit(E1000State *s) break; } } -set_ics(s, 0, cause); +mit_set_ics(s, cause); } static int @@ -894,7 +962,7 @@ e1000_receive(NetClientState *nc, const s-rxbuf_min_shift) n |= E1000_ICS_RXDMT0; -set_ics(s, 0, n); +mit_set_ics(s, n); return size; } @@ -999,6
[Qemu-devel] new pc-bios/bios.bin breaks freebsd booting
I am not sure if it has been reported already but this commit http://git.qemu.org/?p=qemu.git;a=commitdiff;h=d7a51dbbaa70677846453f8c961590913052dd86 (replacing pc-bios/bios.bin with a newer version) breaks booting of FreeBSD on recent qemu (starting roughly with qemu-1.3.0-rc2). Using a FreeBSD host and a FreeBSD guest, the qemu thread runs at 100% and the console is stuck after the 'pci0' probe: ... hpet0: High Precision Event Timer iomem 0xfed0-0xfed003ff on acpi0 Timecounter HPET frequency 1 Hz quality 950 Timecounter ACPI-fast frequency 3579545 Hz quality 900 acpi_timer0: 24-bit timer at 3.579545MHz port 0xb008-0xb00b on acpi0 pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0 pci0: ACPI PCI bus on pcib0 Reverting the bios fixes things. I wonder if this change should be reverted ? cheers luigi
Re: [Qemu-devel] new pc-bios/bios.bin breaks freebsd booting
On Wed, Dec 12, 2012 at 06:47:58PM +0100, Paolo Bonzini wrote: Il 12/12/2012 17:04, Luigi Rizzo ha scritto: I am not sure if it has been reported already but this commit http://git.qemu.org/?p=qemu.git;a=commitdiff;h=d7a51dbbaa70677846453f8c961590913052dd86 (replacing pc-bios/bios.bin with a newer version) breaks booting of FreeBSD on recent qemu (starting roughly with qemu-1.3.0-rc2). Using a FreeBSD host, and a FreeBSD guest, the qemu thread runs at 100% and the console is stuck after the 'pci0' probe: ... hpet0: High Precision Event Timer iomem 0xfed0-0xfed003ff on acpi0 Timecounter HPET frequency 1 Hz quality 950 Timecounter ACPI-fast frequency 3579545 Hz quality 900 acpi_timer0: 24-bit timer at 3.579545MHz port 0xb008-0xb00b on acpi0 pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0 pci0: ACPI PCI bus on pcib0 Reverting the bios fixes things. I wonder if it isn't the case of reverting this change ? Not reverting the change (which fixes other things), but yes---we should get the fix into SeaBIOS. I don't have a FreeBSD VM handy, can you try the attached BIOS so I can have your Tested-by? The patch I used is after my signature. 
thanks, the attached bios successfully boots a FreeBSD guest Tested-by: Luigi Rizzo ri...@iet.unipi.it cheers luigi Paolo diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl index 8019b71..b58ef62 100644 --- a/src/acpi-dsdt.dsl +++ b/src/acpi-dsdt.dsl @@ -187,7 +187,7 @@ DefinitionBlock ( prt_slot0(0x), /* Device 1 is power mgmt device, and can only use irq 9 */ -Package() { 0x1, 0,0, 9 }, +Package() { 0x1, 0, LNKS, 0 }, Package() { 0x1, 1, LNKB, 0 }, Package() { 0x1, 2, LNKC, 0 }, Package() { 0x1, 3, LNKD, 0 }, @@ -278,6 +278,22 @@ DefinitionBlock ( define_link(LNKB, 1, PRQ1) define_link(LNKC, 2, PRQ2) define_link(LNKD, 3, PRQ3) + +Device(LNKS) { +Name(_HID, EISAID(PNP0C0F)) +Name(_UID, 4) +Name(_PRS, ResourceTemplate() { +Interrupt(, Level, ActiveHigh, Shared) { 9 } +}) + +// The SCI cannot be disabled and is always attached to GSI 9, +// so these are no-ops. We only need this link to override the +// polarity to active high and match the content of the MADT. +Method(_STA, 0, NotSerialized) { Return (0x0b) } +Method(_DIS, 0, NotSerialized) { } +Method(_CRS, 0, NotSerialized) { Return (_PRS) } +Method(_SRS, 1, NotSerialized) { } +} } #include acpi-dsdt-cpu-hotplug.dsl
Re: [Qemu-devel] interrupt mitigation for e1000
On Wed, Jul 25, 2012 at 11:53:29AM +0300, Avi Kivity wrote: On 07/24/2012 07:58 PM, Luigi Rizzo wrote: I noticed that the various NIC modules in qemu/kvm do not implement interrupt mitigation, which is very beneficial as it dramatically reduces exits from the hypervisor. As a proof of concept i tried to implement it for the e1000 driver (patch below), and it brings tx performance from 9 to 56Kpps on qemu-softmmu, and from ~20 to 140Kpps on qemu-kvm. I am going to measure the rx interrupt mitigation in the next couple of days. Is there any interest in having this code in ? Indeed. But please drop the #ifdef MITIGATIONs. Thanks for the comments. The #ifdef block MITIGATION was only temporary to point out the differences and run the performance comparisons. Similarly, the magic thresholds below will be replaced with appropriately commented #defines. Note: On the real hardware interrupt mitigation is controlled by a total of four registers (TIDV, TADV, RIDV, RADV) which control it with a granularity of 1024ns , see http://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf An exact emulation of the feature is hard, because the timer resolution we have is much coarser (in the ms range). So i am inclined to use a different approach, similar to the one i have implemented, namely: - the first few packets (whether 1 or 4 or 5 will be decided on the host) report an interrupt immediately; - subsequent interrupts are delayed through qemu_bh_schedule_idle() (which is unpredictable but efficient; i tried qemu_bh_schedule() but it completely defeats mitigation) - when the TX or RX rings are close to getting full, then again an interrupt is delivered immediately. This approach also has the advantage of not requiring specific support in the OS drivers. 
cheers luigi + +#ifdef MITIGATION +QEMUBH *int_bh;// interrupt mitigation handler +int tx_ics_count; // pending tx int requests +int rx_ics_count; // pending rx int requests +int int_cause; // int cause Use uint32_t for int_cause, also a correctly sized type for the packet counts. +#ifdef MITIGATION +/* we transmit the first few packets, or we do if we are + * approaching a full ring. in the latter case, also + * send an ics. + * + */ +{ +int len, pending; +len = s-mac_reg[TDLEN] / sizeof(desc) ; +pending = s-mac_reg[TDT] - s-mac_reg[TDH]; +if (pending 0) + pending += len; +/* ignore requests after the first few ones, as long as + * we are not approaching a full ring. + * Otherwise, deliver packets to the backend. + */ +if (s-tx_ics_count 4 s-tx_ics_count + pending len - 5) + return; Where do the 4 and 5 come from? Shouldn't they be controlled by the guest using a device register? } +#ifdef MITIGATION +s-int_cause |= cause; // remember the interrupt cause. +s-tx_ics_count += pending; +if (s-tx_ics_count = len - 5) { +// if the ring is about to become full, generate an interrupt Another magic number. + set_ics(s, 0, s-int_cause); + s-tx_ics_count = 0; + s-int_cause = 0; +} else { // otherwise just schedule it for later. +qemu_bh_schedule_idle(s-int_bh); +} +} +#else /* !MITIGATION */ set_ics(s, 0, cause); +#endif } -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] interrupt mitigation for e1000
On Wed, Jul 25, 2012 at 12:12:55PM +0200, Paolo Bonzini wrote: On 25/07/2012 11:56, Luigi Rizzo wrote: On Wed, Jul 25, 2012 at 11:53:29AM +0300, Avi Kivity wrote: On 07/24/2012 07:58 PM, Luigi Rizzo wrote: I noticed that the various NIC modules in qemu/kvm do not implement interrupt mitigation, which is very beneficial as it dramatically reduces exits from the hypervisor. As a proof of concept i tried to implement it for the e1000 driver (patch below), and it brings tx performance from 9 to 56 Kpps on qemu-softmmu, and from ~20 to 140 Kpps on qemu-kvm. I am going to measure the rx interrupt mitigation in the next couple of days. Is there any interest in having this code in ? Indeed. But please drop the #ifdef MITIGATIONs. Thanks for the comments. The #ifdef MITIGATION blocks were only temporary, to point out the differences and run the performance comparisons. Similarly, the magic thresholds below will be replaced with appropriately commented #defines. Note: On the real hardware interrupt mitigation is controlled by a total of four registers (TIDV, TADV, RIDV, RADV) which control it with a granularity of 1024 ns, see http://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf An exact emulation of the feature is hard, because the timer resolution we have is much coarser (in the ms range). So i am inclined to use a different approach, similar to the one i have implemented, namely: - the first few packets (whether 1 or 4 or 5 will be decided on the host) report an interrupt immediately; - subsequent interrupts are delayed through qemu_bh_schedule_idle() qemu_bh_schedule_idle() is really a 10ms timer. yes, i figured that out, this is why i said that my code was more a proof of concept than an actual patch. If you have a suggestion on how to schedule a shorter (say 1ms) timer i am all ears. Perhaps qemu_new_timer_ns() and friends ?
This said, i do not plan to implement the full mitigation registers controlled by the guest, just possibly use a parameter as in virtio-net where you can have 'tx=bh' or 'tx=timer' and 'x-txtimer=N' with N is the mitigation delay in nanoseconds (virtually, in practice rounded to whatever the host granularity is) cheers luigi
[Qemu-devel] forgotten commit ? (Re: Proposed patch: huge RX speedup for hw/e1000.c)
Paolo, a few weeks ago you posted the patch below but apparently it did not get in after my 'tested-by' reply of June C1st4th I'd like to confirm that your patch works perfectly for me. Tested-by: Luigi Rizzo ri...@iet.unipi.it cheers luigi On Thu, May 31, 2012 at 01:03:35PM +0200, Paolo Bonzini wrote: Il 31/05/2012 12:40, Jan Kiszka ha scritto: On 2012-05-31 12:03, Paolo Bonzini wrote: Il 31/05/2012 10:23, Jan Kiszka ha scritto: @@ -922,6 +923,7 @@ set_rdt(E1000State *s, int index, uint32_t val) { s-check_rxov = 0; s-mac_reg[index] = val 0x; +qemu_notify_event(); This still looks like the wrong tool: Packets that can't be delivered are queued. Packets that are read from the tap but can't be delivered are queued; packets that are left on the tap need qemu_notify_event to be flushed. So we need to flush the queue and clear the blocked delivery there. qemu_flush_queued_packets appears more appropriate for this. Right, and qemu_flush_queued_packets needs to call qemu_notify_event which makes the call in virtio-net unnecessary. 
Paolo diff --git a/hw/e1000.c b/hw/e1000.c index 4573f13..43d933a 100644 --- a/hw/e1000.c +++ b/hw/e1000.c @@ -295,6 +295,7 @@ set_rx_control(E1000State *s, int index, uint32_t val) s-rxbuf_min_shift = ((val / E1000_RCTL_RDMTS_QUAT) 3) + 1; DBGOUT(RX, RCTL: %d, mac_reg[RCTL] = 0x%x\n, s-mac_reg[RDT], s-mac_reg[RCTL]); +qemu_flush_queued_packets(s-nic-nc); } static void @@ -926,6 +927,9 @@ set_rdt(E1000State *s, int index, uint32_t val) { s-check_rxov = 0; s-mac_reg[index] = val 0x; +if (e1000_has_rxbufs(s, 1)) { +qemu_flush_queued_packets(s-nic-nc); +} } static void diff --git a/hw/virtio-net.c b/hw/virtio-net.c index 3f190d4..0974945 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -447,10 +447,6 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq) VirtIONet *n = to_virtio_net(vdev); qemu_flush_queued_packets(n-nic-nc); - -/* We now have RX buffers, signal to the IO thread to break out of the - * select to re-poll the tap file descriptor */ -qemu_notify_event(); } static int virtio_net_can_receive(VLANClientState *nc) diff --git a/net.c b/net.c index 1922d8a..fa846ae 100644 --- a/net.c +++ b/net.c @@ -491,7 +491,12 @@ void qemu_flush_queued_packets(VLANClientState *vc) queue = vc-send_queue; } -qemu_net_queue_flush(queue); +if (qemu_net_queue_flush(queue)) { +/* We emptied the queue successfully, signal to the IO thread to repoll + * the file descriptor (for tap, for example). 
+ */ +qemu_notify_event(); +} } static ssize_t qemu_send_packet_async_with_flags(VLANClientState *sender, diff --git a/net/queue.c b/net/queue.c index 1ab5247..fd1c7e6 100644 --- a/net/queue.c +++ b/net/queue.c @@ -232,7 +232,7 @@ void qemu_net_queue_purge(NetQueue *queue, VLANClientState *from) } } -void qemu_net_queue_flush(NetQueue *queue) +bool qemu_net_queue_flush(NetQueue *queue) { while (!QTAILQ_EMPTY(queue-packets)) { NetPacket *packet; @@ -248,7 +248,7 @@ void qemu_net_queue_flush(NetQueue *queue) packet-size); if (ret == 0) { QTAILQ_INSERT_HEAD(queue-packets, packet, entry); -break; +return 0; } if (packet-sent_cb) { @@ -257,4 +257,5 @@ void qemu_net_queue_flush(NetQueue *queue) g_free(packet); } +return 1; } diff --git a/net/queue.h b/net/queue.h index a31958e..4bf6d3c 100644 --- a/net/queue.h +++ b/net/queue.h @@ -66,6 +66,6 @@ ssize_t qemu_net_queue_send_iov(NetQueue *queue, NetPacketSent *sent_cb); void qemu_net_queue_purge(NetQueue *queue, VLANClientState *from); -void qemu_net_queue_flush(NetQueue *queue); +bool qemu_net_queue_flush(NetQueue *queue); #endif /* QEMU_NET_QUEUE_H */ Looks good. Luigi, please reply with a Tested-by when possible. Paolo
[Qemu-devel] interrupt mitigation for e1000
I noticed that the various NIC modules in qemu/kvm do not implement interrupt mitigation, which is very beneficial as it dramatically reduces exits from the hypervisor. As a proof of concept i tried to implement it for the e1000 driver (patch below), and it brings tx performance from 9 to 56Kpps on qemu-softmmu, and from ~20 to 140Kpps on qemu-kvm. I am going to measure the rx interrupt mitigation in the next couple of days. Is there any interest in having this code in ? cheers luigi diff -ubwrp --exclude '*.[do]' /tmp/qemu-61dc008/hw/e1000.c ./hw/e1000.c --- /tmp/qemu-61dc008/hw/e1000.c2012-07-20 01:25:52.0 +0200 +++ ./hw/e1000.c2012-07-24 18:21:39.0 +0200 @@ -33,6 +33,8 @@ #include sysemu.h #include dma.h +#define MITIGATION + #include e1000_hw.h #define E1000_DEBUG @@ -127,6 +129,13 @@ typedef struct E1000State_st { } eecd_state; QEMUTimer *autoneg_timer; + +#ifdef MITIGATION +QEMUBH *int_bh;// interrupt mitigation handler +int tx_ics_count; // pending tx int requests +int rx_ics_count; // pending rx int requests +int int_cause; // int cause +#endif // MITIGATION } E1000State; #definedefreg(x) x = (E1000_##x2) @@ -638,6 +648,26 @@ start_xmit(E1000State *s) return; } +#ifdef MITIGATION +/* we transmit the first few packets, or we do if we are + * approaching a full ring. in the latter case, also + * send an ics. + * + */ +{ +int len, pending; +len = s-mac_reg[TDLEN] / sizeof(desc) ; +pending = s-mac_reg[TDT] - s-mac_reg[TDH]; +if (pending 0) + pending += len; +/* ignore requests after the first few ones, as long as + * we are not approaching a full ring. + * Otherwise, deliver packets to the backend. + */ +if (s-tx_ics_count 4 s-tx_ics_count + pending len - 5) + return; +#endif // MITIGATION + while (s-mac_reg[TDH] != s-mac_reg[TDT]) { base = tx_desc_base(s) + sizeof(struct e1000_tx_desc) * s-mac_reg[TDH]; @@ -663,7 +693,21 @@ start_xmit(E1000State *s) break; } } +#ifdef MITIGATION +s-int_cause |= cause; // remember the interrupt cause. 
+s-tx_ics_count += pending; +if (s-tx_ics_count = len - 5) { +// if the ring is about to become full, generate an interrupt + set_ics(s, 0, s-int_cause); + s-tx_ics_count = 0; + s-int_cause = 0; +} else { // otherwise just schedule it for later. +qemu_bh_schedule_idle(s-int_bh); +} +} +#else /* !MITIGATION */ set_ics(s, 0, cause); +#endif } static int @@ -875,7 +919,27 @@ e1000_receive(VLANClientState *nc, const s-rxbuf_min_shift) n |= E1000_ICS_RXDMT0; +#ifdef MITIGATION +#define MIT_RXDMT0_SENT10 // large +s-int_cause |= n; +if (s-rx_ics_count == 0) { + /* deliver the first interrupt */ + set_ics(s, 0, s-int_cause); + s-int_cause = 0; + s-rx_ics_count++; +} else if ( (n E1000_ICS_RXDMT0) s-rx_ics_count MIT_RXDMT0_SENT) { + /* also deliver if we are approaching ring full */ + set_ics(s, 0, s-int_cause); + s-int_cause = 0; + s-rx_ics_count = MIT_RXDMT0_SENT; +} else { + /* otherwise schedule for later */ + s-rx_ics_count++; + qemu_bh_schedule_idle(s-int_bh); +} +#else /* !MITIGATION */ set_ics(s, 0, n); +#endif /* !MITIGATION */ return size; } @@ -1214,6 +1281,20 @@ static NetClientInfo net_e1000_info = { .link_status_changed = e1000_set_link_status, }; +#ifdef MITIGATION +static void e1000_int_bh(void *opaque) +{ +E1000State *s = opaque; +if (s-tx_ics_count 1 s-rx_ics_count 1) + return; +s-tx_ics_count = 0; +s-rx_ics_count = 0; +start_xmit(s); +set_ics(s, 0, s-int_cause); +s-int_cause = 0; +} +#endif /* MITIGATION */ + static int pci_e1000_init(PCIDevice *pci_dev) { E1000State *d = DO_UPCAST(E1000State, dev, pci_dev); @@ -1231,6 +1312,9 @@ static int pci_e1000_init(PCIDevice *pci e1000_mmio_setup(d); +#ifdef MITIGATION +d-int_bh = qemu_bh_new(e1000_int_bh, d); +#endif /* MITIGATION */ pci_register_bar(d-dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, d-mmio); pci_register_bar(d-dev, 1, PCI_BASE_ADDRESS_SPACE_IO, d-io);
Re: [Qemu-devel] qemu 1.1.0 slow as hell booting ?
On Thu, Jul 19, 2012 at 04:43:05PM +0200, Luigi Rizzo wrote: hi, while playing with various qemu versions i noticed that qemu-devel now in FreeBSD ports (1.1.0) is slow as hell when booting FreeBSD (take for instance the netmap image at http://info.iet.unipi.it/~luigi/netmap/20120608-netmap-picobsd-head-amd64.bin while 0.11.x and 1.1.1 and 1.0.1 seem to work well. I should mention that i had the same exact problem a few months ago with a qemu version compiled on my Mac using macports, which was fixed by manually building one myself from the sources (the working Mac version reports version 1.0.91; unfortunately i do not remember which version it was, but judging from the macports history it could have been either 0.14.1 or 1.0) so i do not think this is a specific problem with FreeBSD. Are you able to reproduce this problem ? Any idea on what could it be ? Just for the archives: the problem was that the FreeBSD port, when compiling with CLANG, used --enable-tcg-interpreter which caused the slow code to be generated. I suppose the same problem existed in the macports version cheers luigi cheers luigi
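For anyone hitting the same symptom, a quick way to check whether a given build uses the (much slower) TCG interpreter is to look at the configure invocation or the generated build config. A sketch, assuming a QEMU source/build tree of that era (flag and variable names as i recall them, so verify against your configure --help):

```sh
# When configuring: make sure the interpreter is not enabled
./configure --target-list=x86_64-softmmu --disable-tcg-interpreter

# On an existing build tree: the interpreter build shows up in config-host.mak
grep CONFIG_TCG_INTERPRETER config-host.mak && echo "interpreter build (slow)"
```

The interpreter (TCI) executes TCG ops one by one instead of generating host code, which explains the order-of-magnitude slowdown during boot.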
Re: [Qemu-devel] qemu 1.1.0 slow as hell booting ?
On Fri, Jul 20, 2012 at 09:04:51PM +1000, ronnie sahlberg wrote: Is it only very slow during boot but then becomes normal speed when the system is fully booted and kernel and userspace are all up and running normally? i did not have enough patience to go past the 2 minutes needed to load text+data for the kernel. See my followup, it was due to configure using --enable-tcg-interpreter when using CLANG. cheers luigi regards ronnie sahlberg On Fri, Jul 20, 2012 at 12:43 AM, Luigi Rizzo ri...@iet.unipi.it wrote: hi, while playing with various qemu versions i noticed that qemu-devel now in FreeBSD ports (1.1.0) is slow as hell when booting FreeBSD (take for instance the netmap image at http://info.iet.unipi.it/~luigi/netmap/20120608-netmap-picobsd-head-amd64.bin while 0.11.x and 1.1.1 and 1.0.1 seem to work well. I should mention that i had the same exact problem a few months ago with a qemu version compiled on my Mac using macports, which was fixed by manually building one myself from the sources (the working Mac version reports version 1.0.91; unfortunately i do not remember which version it was, but judging from the macports history it could have been either 0.14.1 or 1.0) so i do not think this is a specific problem with FreeBSD. Are you able to reproduce this problem ? Any idea on what could it be ? cheers luigi
[Qemu-devel] qemu 1.1.0 slow as hell booting ?
hi, while playing with various qemu versions i noticed that qemu-devel now in FreeBSD ports (1.1.0) is slow as hell when booting FreeBSD (take for instance the netmap image at http://info.iet.unipi.it/~luigi/netmap/20120608-netmap-picobsd-head-amd64.bin) while 0.11.x and 1.1.1 and 1.0.1 seem to work well. I should mention that i had the same exact problem a few months ago with a qemu version compiled on my Mac using macports, which was fixed by manually building one myself from the sources (the working Mac version reports version 1.0.91; unfortunately i do not remember which version it was, but judging from the macports history it could have been either 0.14.1 or 1.0), so i do not think this is a specific problem with FreeBSD. Are you able to reproduce this problem ? Any idea what it could be ? cheers luigi
[Qemu-devel] VALE, a Virtual Local Ethernet. http://info.iet.unipi.it/~luigi/vale/
We have just completed a netmap extension that lets you build a local high speed switch called VALE, which i think can be very useful. http://info.iet.unipi.it/~luigi/vale/ VALE is a software Virtual Local Ethernet whose ports are accessible using the netmap API. Designed to be used as the interconnect between virtual machines (or as a fast local bus), it works as a learning bridge and supports speeds of up to 20 Mpps with short frames, and an aggregate 70 Gbit/s with 1514-byte packets. The VALE paper contains more details and performance measurements. VALE is implemented as a small extension of the netmap module, and is available for FreeBSD and Linux. The source code includes a backend for qemu and KVM, so you can use VALE to interconnect virtual machines, launching them with

qemu -net nic -net netmap,ifname=vale0 ...
qemu -net nic -net netmap,ifname=vale1 ...
...

Processes can talk to a VALE switch too, so you can use the pkt-gen or bridge tools that are part of the netmap distribution, or even the pcap.c module that maps libpcap calls into netmap equivalents. This lets you use VALE for all sorts of pcap-based applications. More details, code, and bootable images are on the VALE page, http://info.iet.unipi.it/~luigi/vale/ feedback welcome, as usual. cheers luigi
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
Works for me. I can now receive at 1.15 Mpps, slightly faster than my previous patch, which generated unnecessary writes to the signalling socket. Tested-by: Luigi Rizzo ri...@iet.unipi.it

On Thu, May 31, 2012 at 12:03 PM, Paolo Bonzini pbonz...@redhat.com wrote: On 31/05/2012 10:23, Jan Kiszka wrote:

@@ -922,6 +923,7 @@ set_rdt(E1000State *s, int index, uint32_t val)
 {
     s->check_rxov = 0;
     s->mac_reg[index] = val & 0xffff;
+    qemu_notify_event();

This still looks like the wrong tool: Packets that can't be delivered are queued.

Packets that are read from the tap but can't be delivered are queued; packets that are left on the tap need qemu_notify_event to be flushed. So we need to flush the queue and clear the blocked delivery there. qemu_flush_queued_packets appears more appropriate for this.

Right, and qemu_flush_queued_packets needs to call qemu_notify_event, which makes the call in virtio-net unnecessary.

Paolo

diff --git a/hw/e1000.c b/hw/e1000.c
index 4573f13..43d933a 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -295,6 +295,7 @@ set_rx_control(E1000State *s, int index, uint32_t val)
     s->rxbuf_min_shift = ((val / E1000_RCTL_RDMTS_QUAT) & 3) + 1;
     DBGOUT(RX, "RCTL: %d, mac_reg[RCTL] = 0x%x\n",
            s->mac_reg[RDT], s->mac_reg[RCTL]);
+    qemu_flush_queued_packets(&s->nic->nc);
 }

 static void
@@ -926,6 +927,9 @@ set_rdt(E1000State *s, int index, uint32_t val)
 {
     s->check_rxov = 0;
     s->mac_reg[index] = val & 0xffff;
+    if (e1000_has_rxbufs(s, 1)) {
+        qemu_flush_queued_packets(&s->nic->nc);
+    }
 }

 static void
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 3f190d4..0974945 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -447,10 +447,6 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
     VirtIONet *n = to_virtio_net(vdev);

     qemu_flush_queued_packets(&n->nic->nc);
-
-    /* We now have RX buffers, signal to the IO thread to break out of the
-     * select to re-poll the tap file descriptor */
-    qemu_notify_event();
 }

 static int virtio_net_can_receive(VLANClientState *nc)
diff --git a/net.c b/net.c
index 1922d8a..fa846ae 100644
--- a/net.c
+++ b/net.c
@@ -491,7 +491,12 @@ void qemu_flush_queued_packets(VLANClientState *vc)
         queue = vc->send_queue;
     }

-    qemu_net_queue_flush(queue);
+    if (qemu_net_queue_flush(queue)) {
+        /* We emptied the queue successfully, signal to the IO thread to repoll
+         * the file descriptor (for tap, for example).
+         */
+        qemu_notify_event();
+    }
 }

 static ssize_t qemu_send_packet_async_with_flags(VLANClientState *sender,
diff --git a/net/queue.c b/net/queue.c
index 1ab5247..fd1c7e6 100644
--- a/net/queue.c
+++ b/net/queue.c
@@ -232,7 +232,7 @@ void qemu_net_queue_purge(NetQueue *queue, VLANClientState *from)
     }
 }

-void qemu_net_queue_flush(NetQueue *queue)
+bool qemu_net_queue_flush(NetQueue *queue)
 {
     while (!QTAILQ_EMPTY(&queue->packets)) {
         NetPacket *packet;
@@ -248,7 +248,7 @@ void qemu_net_queue_flush(NetQueue *queue)
                               packet->size);
         if (ret == 0) {
             QTAILQ_INSERT_HEAD(&queue->packets, packet, entry);
-            break;
+            return 0;
         }

         if (packet->sent_cb) {
@@ -257,4 +257,5 @@ void qemu_net_queue_flush(NetQueue *queue)
         g_free(packet);
     }
+    return 1;
 }
diff --git a/net/queue.h b/net/queue.h
index a31958e..4bf6d3c 100644
--- a/net/queue.h
+++ b/net/queue.h
@@ -66,6 +66,6 @@ ssize_t qemu_net_queue_send_iov(NetQueue *queue,
                                 NetPacketSent *sent_cb);

 void qemu_net_queue_purge(NetQueue *queue, VLANClientState *from);
-void qemu_net_queue_flush(NetQueue *queue);
+bool qemu_net_queue_flush(NetQueue *queue);

 #endif /* QEMU_NET_QUEUE_H */
--
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
 TEL +39-050-2211611                    . via Diotisalvi 2
 Mobile +39-338-6809875                 . 56122 PISA (Italy)
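The pattern in the patch above (flush reports whether the queue fully drained, and only then does the caller kick the I/O thread) can be sketched as a standalone toy with no QEMU dependencies. All names here are illustrative, and the qemu_notify_event() call is simulated by a counter:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QLEN 8

typedef struct {
    int pkts[QLEN];
    size_t head, count;
    bool peer_busy;   /* stands in for the frontend refusing packets */
    int notified;     /* counts simulated qemu_notify_event() calls  */
} SketchQueue;

/* Deliver one packet; returns 0 when the peer cannot accept it. */
int sketch_deliver(SketchQueue *q, int pkt)
{
    (void)pkt;
    return q->peer_busy ? 0 : 1;
}

/* Like the patched qemu_net_queue_flush(): true iff the queue emptied. */
bool sketch_flush(SketchQueue *q)
{
    while (q->count > 0) {
        if (sketch_deliver(q, q->pkts[q->head]) == 0) {
            return false;   /* peer blocked: leave the packet queued */
        }
        q->head = (q->head + 1) % QLEN;
        q->count--;
    }
    return true;
}

/* Mirrors qemu_flush_queued_packets(): notify only on a full drain. */
void sketch_flush_and_notify(SketchQueue *q)
{
    if (sketch_flush(q)) {
        q->notified++;      /* stand-in for qemu_notify_event() */
    }
}
```

With this split, a failed drain leaves the packet queued and generates no spurious wakeup, while a successful drain wakes the main loop so it re-polls the tap (or netmap) file descriptor.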
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
On Thu, May 31, 2012 at 9:38 AM, Paolo Bonzini pbonz...@redhat.com wrote: On 31/05/2012 00:53, Luigi Rizzo wrote: The image contains my fast packet generator pkt-gen (a stock traffic generator such as netperf etc. is too slow to show the problem). pkt-gen can send about 1 Mpps in this configuration using -net netmap in the backend. The qemu process in this case takes 100% CPU. On the receive side, i cannot receive more than 50 Kpps, even if i flood the bridge with a huge amount of traffic. The qemu process stays at 5% cpu or less. Then i read the docs in main-loop.h, which say that one case where qemu_notify_event() is needed is when using qemu_set_fd_handler2(), which is exactly what my backend uses (similar to tap.c) The path is a bit involved, but I think Luigi is right. The docs say Remember to call qemu_notify_event whenever the [return value of the fd_read_poll callback] may change from false to true. Now net/tap.c has ... So as a conservative approximation, you need to fire qemu_notify_event whenever you write to RDH, RDT, RDLEN and RCTL, or when check_rxov becomes zero. In practice, only RDT, RCTL and check_rxov matter. Luigi, does this patch work for you? it almost surely works (cannot check today), as my (guest) driver modifies RDT to notify the hardware of read packets. I know/mentioned that the notification can be optimized and sent only in specific cases, as you describe above (i might be missing the check_rxov). But i think it would be more robust to make fewer assumptions about what the guest does, and send the notify on all register writes (those are relatively rare in a driver, and the datapath touches exactly the registers we ought to be interested in), and possibly on those reads that may have side effects, such as interrupt registers, together with a big comment in the code explaining why we need to call qemu_notify_event().
Actually, what would pay off even more is to devise a way to avoid the write() syscall in qemu_notify_event if the other process (or is it a thread ?) has data queued. This is not so important in my netmap driver, but a standard driver is likely to update the RDT on every single packet, which would pretty much kill performance. cheers luigi
--
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
 TEL +39-050-2211611                    . via Diotisalvi 2
 Mobile +39-338-6809875                 . 56122 PISA (Italy)
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
On Thu, May 31, 2012 at 10:23 AM, Jan Kiszka jan.kis...@web.de wrote: On 2012-05-31 09:38, Paolo Bonzini wrote ... This still looks like the wrong tool: Packets that can't be delivered are queued. So we need to flush the queue and clear the blocked delivery there. qemu_flush_queued_packets appears more appropriate for this. Conceptually, the backend should be responsible for kicking the iothread as needed. as i understand the code, the backend _is_ the iothread, and it is sleeping when the frontend becomes able to receive again. cheers luigi -- -+--- Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione http://www.iet.unipi.it/~luigi/. Universita` di Pisa TEL +39-050-2211611 . via Diotisalvi 2 Mobile +39-338-6809875 . 56122 PISA (Italy) -+---
Re: [Qemu-devel] frame reordering in qemu_net_queue_send() ?
On Thu, May 31, 2012 at 11:18 AM, Stefan Hajnoczi stefa...@gmail.com wrote: On Wed, May 30, 2012 at 8:59 AM, Luigi Rizzo ri...@iet.unipi.it wrote: Hi, while investigating rx performance for emulated network devices (i am looking at the userspace version, relying on net=tap or similar approaches) i noticed the code in net/queue.c :: qemu_net_queue_send() which looks strange to me (same goes for the iov version). The whole function is below, just for reference. My impression is that the call to qemu_net_queue_flush() is misplaced, and it should come before qemu_net_queue_deliver(), otherwise the current packet is pushed out before anything that was already in the queue. Does it make sense ? cheers luigi

ssize_t qemu_net_queue_send(NetQueue *queue,
                            VLANClientState *sender,
                            unsigned flags,
                            const uint8_t *data,
                            size_t size,
                            NetPacketSent *sent_cb)
{
    ssize_t ret;

    if (queue->delivering) {
        return qemu_net_queue_append(queue, sender, flags, data, size, NULL);
    }

    ret = qemu_net_queue_deliver(queue, sender, flags, data, size);
    if (ret == 0) {
        qemu_net_queue_append(queue, sender, flags, data, size, sent_cb);
        return 0;
    }

    qemu_net_queue_flush(queue);

Think of the case where delivering a packet causes additional (re-entrant) qemu_net_queue_send() calls to this queue. We'll be in ->delivering state and queue those packets. After we've finished delivering the initial packet we flush the queue. This is a weird case but this is how I read the intention of the code.

i was under the impression that qemu_net_queue_deliver() may also return 0 if the other side (frontend network device) has no room for the packet. This would cause the queue to contain data on entry to the next call, even without reentrancy. Consider this:

t0. qemu_net_queue_send(pkt-A): qemu_net_queue_deliver() returns 0, pkt-A is queued and we return

t1. qemu_net_queue_send(pkt-B): qemu_net_queue_deliver() returns successfully, pkt-B is sent to the frontend; then we call qemu_net_queue_flush() and this sends pkt-A to the frontend, in the wrong order

cheers luigi
--
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
 TEL +39-050-2211611                    . via Diotisalvi 2
 Mobile +39-338-6809875                 . 56122 PISA (Italy)
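The t0/t1 scenario above can be reproduced in a standalone toy model (not QEMU code; all names are illustrative): with the deliver-then-flush order of qemu_net_queue_send(), a packet queued earlier (A) reaches the frontend only after a later packet (B).

```c
#include <assert.h>
#include <string.h>

#define MAXPKT 8

typedef struct {
    char backlog[MAXPKT]; int nqueued;    /* queued packets, oldest first */
    char delivered[MAXPKT]; int ndeliv;   /* order seen by the frontend   */
    int frontend_full;                    /* frontend has no room         */
} Wire;

/* Returns 0 when the frontend cannot accept the packet. */
int wire_deliver(Wire *w, char pkt)
{
    if (w->frontend_full)
        return 0;
    w->delivered[w->ndeliv++] = pkt;
    return 1;
}

/* Drain the backlog in order, stopping at the first refusal. */
void wire_flush(Wire *w)
{
    int done = 0;
    while (done < w->nqueued && wire_deliver(w, w->backlog[done]))
        done++;
    memmove(w->backlog, w->backlog + done, (size_t)(w->nqueued - done));
    w->nqueued -= done;
}

/* Mirrors qemu_net_queue_send(): deliver the new packet, then flush. */
void wire_send(Wire *w, char pkt)
{
    if (wire_deliver(w, pkt) == 0) {
        w->backlog[w->nqueued++] = pkt;
        return;
    }
    wire_flush(w);
}
```

Queueing A while the frontend is full and then sending B once it has room yields the delivery order B, A: exactly the reordering described in the message.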
Re: [Qemu-devel] frame reordering in qemu_net_queue_send() ?
On Thu, May 31, 2012 at 03:23:12PM +0200, Jan Kiszka wrote: On 2012-05-31 15:19, Paolo Bonzini wrote: On 31/05/2012 15:07, Zhi Yong Wu wrote: Yeah, this case actually exists, but the tcp/ip protocol stack in the guest will make sure the ordering is eventually corrected. Nevertheless it's not good, and the latest Windows Logo tests will fail if you reorder frames. Can we really enter qemu_net_queue_send with delivering == 0 and a non-empty queue? i have no idea. it doesn't help that comments are used very sparingly throughout the code... cheers luigi
[Qemu-devel] frame reordering in qemu_net_queue_send() ?
Hi, while investigating rx performance for emulated network devices (i am looking at the userspace version, relying on net=tap or similar approaches) i noticed the code in net/queue.c :: qemu_net_queue_send() which looks strange to me (same goes for the iov version). The whole function is below, just for reference. My impression is that the call to qemu_net_queue_flush() is misplaced, and it should come before qemu_net_queue_deliver(), otherwise the current packet is pushed out before anything that was already in the queue. Does it make sense ? cheers luigi

ssize_t qemu_net_queue_send(NetQueue *queue,
                            VLANClientState *sender,
                            unsigned flags,
                            const uint8_t *data,
                            size_t size,
                            NetPacketSent *sent_cb)
{
    ssize_t ret;

    if (queue->delivering) {
        return qemu_net_queue_append(queue, sender, flags, data, size, NULL);
    }

    ret = qemu_net_queue_deliver(queue, sender, flags, data, size);
    if (ret == 0) {
        qemu_net_queue_append(queue, sender, flags, data, size, sent_cb);
        return 0;
    }

    qemu_net_queue_flush(queue);

    return ret;
}
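The fix suggested in this message (flush the backlog before attempting to deliver the new packet, so FIFO order is preserved) can be sketched in a standalone toy; this is not the real QEMU API, and all names are illustrative:

```c
#include <assert.h>
#include <string.h>

#define MAXPKT 8

typedef struct {
    char backlog[MAXPKT]; int nqueued;    /* queued packets, oldest first */
    char delivered[MAXPKT]; int ndeliv;   /* order seen by the frontend   */
    int frontend_full;                    /* frontend has no room         */
} Wire;

/* Returns 0 when the frontend cannot accept the packet. */
int wire_deliver(Wire *w, char pkt)
{
    if (w->frontend_full)
        return 0;
    w->delivered[w->ndeliv++] = pkt;
    return 1;
}

/* Drain the backlog in order, stopping at the first refusal. */
void wire_flush(Wire *w)
{
    int done = 0;
    while (done < w->nqueued && wire_deliver(w, w->backlog[done]))
        done++;
    memmove(w->backlog, w->backlog + done, (size_t)(w->nqueued - done));
    w->nqueued -= done;
}

/* Flush first; queue the new packet if a backlog remains or delivery
 * fails, so older packets always go out before newer ones. */
void wire_send_in_order(Wire *w, char pkt)
{
    wire_flush(w);
    if (w->nqueued > 0 || wire_deliver(w, pkt) == 0)
        w->backlog[w->nqueued++] = pkt;
}
```

With this order a packet queued while the frontend was full is delivered before any packet sent after the frontend recovers, so the frontend sees A, B rather than B, A.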
[Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
Hi, while testing qemu with netmap (see [Note 1] for details) on e1000 emulation, i noticed that my sender program using a custom backend [Note 2] could reach 1 Mpps (million packets per second) but on the receive side i was limited to 50 Kpps (and CPU always below 5%). The problem was fixed by the following one-line addition to hw/e1000.c :: e1000_mmio_write(), to wake up the qemu mainloop and check that some buffers might be available.

--- hw/e1000.c.orig	2012-02-17 20:45:39.0 +0100
+++ hw/e1000.c	2012-05-30 20:01:52.0 +0200
@@ -919,6 +926,7 @@
         DBGOUT(UNKNOWN, "MMIO unknown write addr=0x%08x,val=0x%08"PRIx64"\n",
                index2, val);
     }
+    qemu_notify_event();
 }

 static uint64_t

With this fix, the read throughput reaches 1 Mpps, matching the write speed. Now the system becomes CPU-bound, but this opens the way to more optimizations in the emulator. The same problem seems to exist in other network drivers, e.g. hw/rtl8139.c and others. The only one that seems to get it right is virtio-net.c. I think it would be good if this change could make it into the tree.

[Note 1] Netmap ( http://info.iet.unipi.it/~luigi/netmap ) is an efficient mechanism for packet I/O that bypasses the network stack and provides protected access to the network adapter from userspace. It works especially well on top of qemu because the kernel needs only to trap a single register access for each batch of packets.

[Note 2] the custom backend is a virtual local ethernet called VALE, implemented as a kernel module on the host, that extends netmap to implement communication between virtual machines. VALE is extremely efficient, currently delivering about 10 Mpps with 60-byte frames, and 5 Mpps with 1500-byte frames. The 1 Mpps rates i mentioned are obtained between qemu instances running in userspace on FreeBSD (no kernel acceleration whatsoever) and using VALE as a communication mechanism.

cheers luigi
--
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
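The wakeup the one-line patch relies on (making a main loop blocked in select()/poll() return so it can re-poll its file descriptors) can be sketched with a plain POSIX pipe. This illustrates the general self-pipe idea behind qemu_notify_event(), not QEMU's actual implementation; names are illustrative:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

int notify_fds[2];   /* [0] = read end watched by the main loop */

int notify_init(void)
{
    return pipe(notify_fds);
}

/* Stand-in for qemu_notify_event(): one byte wakes the poller. */
void notify_event(void)
{
    char c = 0;
    (void)!write(notify_fds[1], &c, 1);
}

/* Returns 1 if a poll()ing main loop would wake up, 0 on timeout. */
int main_loop_would_wake(int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fds[0], .events = POLLIN };
    return poll(&pfd, 1, timeout_ms) > 0;
}
```

Without such a kick, a main loop sleeping in poll() never notices that the guest freed RX buffers via an MMIO write; the qemu_notify_event() added by the patch is exactly this kind of kick.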
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
On Wed, May 30, 2012 at 10:23:11PM +0200, Luigi Rizzo wrote: ... The problem was fixed by the following one-line addition to hw/e1000.c :: e1000_mmio_write(), to wake up the qemu mainloop and check that some buffers might be available.

--- hw/e1000.c.orig	2012-02-17 20:45:39.0 +0100
+++ hw/e1000.c	2012-05-30 20:01:52.0 +0200
@@ -919,6 +926,7 @@
         DBGOUT(UNKNOWN, "MMIO unknown write addr=0x%08x,val=0x%08"PRIx64"\n",
                index2, val);
     }
+    qemu_notify_event();
 }

 static uint64_t

With this fix, the read throughput reaches 1 Mpps, matching the write speed. Now the system becomes CPU-bound, but this opens the way to more optimizations in the emulator. The same problem seems to exist in other network drivers, e.g. hw/rtl8139.c and others. The only one that seems to get it right is virtio-net.c. I think it would be good if this change could make it into the tree. [Note 1] Netmap ( http://info.iet.unipi.it/~luigi/netmap ) is an efficient mechanism for packet I/O that bypasses the network stack and provides protected access to the network adapter from userspace. It works especially well on top of qemu because the kernel needs only to trap a single register access for each batch of packets. [Note 2] the custom backend is a virtual local ethernet called VALE, implemented as a kernel module on the host, that extends netmap to implement communication between virtual machines. VALE is extremely efficient, currently delivering about 10 Mpps with 60-byte frames, and 5 Mpps with 1500-byte frames. The 1 Mpps rates i mentioned are obtained between qemu instances running in userspace on FreeBSD (no kernel acceleration whatsoever) and using VALE as a communication mechanism.

Custom backend == you patched QEMU? Or what backend are you using? This sounds a lot like [1] and suggests that you are either a) using slirp in a version that doesn't contain that fix yet (before 1.1-rcX) or b) wrote a backend that suffers from a similar bug.

Jan

[1] http://thread.gmane.org/gmane.comp.emulators.qemu/144433

my custom backend is the one in [Note 2] above. It replaces the -net pcap/user/tap/socket option which defines how qemu communicates with the host network device. The problem is not in my module, but rather in the emulation device exposed to the guest, and i presume this is the same thing you fixed in the slirp patch. I checked the git version http://git.qemu.org/qemu.git and most guest-side devices have the same problem; only virtio-net does the notification. cheers luigi
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
On Wed, May 30, 2012 at 11:39 PM, Jan Kiszka jan.kis...@web.de wrote: Please keep CCs. ok On 2012-05-30 23:23, Luigi Rizzo wrote: On Wed, May 30, 2012 at 10:23:11PM +0200, Luigi Rizzo wrote: ... The problem was fixed by the following one-line addition to hw/e1000.c :: e1000_mmio_write(), to wake up the qemu mainloop and check that some buffers might be available.

--- hw/e1000.c.orig	2012-02-17 20:45:39.0 +0100
+++ hw/e1000.c	2012-05-30 20:01:52.0 +0200
@@ -919,6 +926,7 @@
         DBGOUT(UNKNOWN, "MMIO unknown write addr=0x%08x,val=0x%08"PRIx64"\n",
                index2, val);
     }
+    qemu_notify_event();
 }

 static uint64_t

With this fix, the read throughput reaches 1 Mpps, matching the write speed. Now the system becomes CPU-bound, but this opens the way to more optimizations in the emulator. The same problem seems to exist in other network drivers, e.g. hw/rtl8139.c and others. The only one that seems to get it right is virtio-net.c. I think it would be good if this change could make it into the tree. [Note 1] Netmap ( http://info.iet.unipi.it/~luigi/netmap ) is an efficient mechanism for packet I/O that bypasses the network stack and provides protected access to the network adapter from userspace. It works especially well on top of qemu because the kernel needs only to trap a single register access for each batch of packets. [Note 2] the custom backend is a virtual local ethernet called VALE, implemented as a kernel module on the host, that extends netmap to implement communication between virtual machines. VALE is extremely efficient, currently delivering about 10 Mpps with 60-byte frames, and 5 Mpps with 1500-byte frames. The 1 Mpps rates i mentioned are obtained between qemu instances running in userspace on FreeBSD (no kernel acceleration whatsoever) and using VALE as a communication mechanism. Custom backend == you patched QEMU? Or what backend are you using? This sounds a lot like [1] and suggests that you are either a) using slirp in a version that doesn't contain that fix yet (before 1.1-rcX) or b) wrote a backend that suffers from a similar bug. Jan [1] http://thread.gmane.org/gmane.comp.emulators.qemu/144433 my custom backend is the one in [Note 2] above. It replaces the -net pcap/user/tap/socket option which defines how qemu communicates with the host network device. Any code to share? It's hard to discuss just concepts. you can take the freebsd image from the netmap page in my link and run it in qemu, and then run the pkt-gen program in the image in either send or receive mode. But this is overkill, as you have described the problem exactly in your post: when the guest reads the packets from the emulated device (e1000 in my case, but most of them have the problem) it fails to wake up the thread blocked in main_loop_wait(). I am unclear on the terminology (what is frontend and what is backend ?) but it is the guest side that has to wake up the qemu process: the file descriptor talking to the host side (tap, socket, bpf ...) has already fired its events, and the only thing it could do is cause a busy wait if it keeps passing a readable file descriptor to select. I thought your slirp.c patch was also on the same side as e1000.c cheers luigi The problem is not in my module, but rather in the emulation device exposed to the guest, and i presume this is the same thing you fixed in the slirp patch. I checked the git version http://git.qemu.org/qemu.git and most guest-side devices have the same problem; only virtio-net does the notification. And that is most likely wrong. The bug I cited was not a front-end issue but clearly one of the backend. It lacked kicking of the io-thread once its queue state changed in a way that was not reported otherwise (via some file descriptor the io-thread is subscribed to). If your backend creates such states as well, it has to fix it similarly.
Again, discussing this abstractly is not very efficient. Jan
Re: [Qemu-devel] Proposed patch: huge RX speedup for hw/e1000.c
On Thu, May 31, 2012 at 12:11 AM, Jan Kiszka jan.kis...@web.de wrote: On 2012-05-30 23:55, Luigi Rizzo wrote: you can take the freebsd image from the netmap page in my link and run it in qemu, and then run the pkt-gen program in the image in either send or receive mode. But this is overkill, as you have described the problem exactly in your post: when the guest reads the packets from the emulated device (e1000 in my case, but most of them have the problem) it fails to wake up the thread blocked in main_loop_wait(). OK, so there are no QEMU code changes involved? Then, what is your command line and what is the QEMU version?

# x86_64-softmmu/qemu-system-x86_64 -version
QEMU emulator version 1.0,1, Copyright (c) 2003-2008 Fabrice Bellard

the command line for my test is (this is on FreeBSD 9, with a pure userland qemu without any kqemu support)

x86_64-softmmu/qemu-system-x86_64 -m 512 -hda picobsd.bin -net nic,model=e1000 -net netmap,ifname=valeXX

-net netmap is my fast bridge; code will be available in a couple of days, in the meantime you can have a look at the netmap link in my original post to see how qemu accesses the host adapter and how fast it is. The problem also exists if you use -net tap, ... just at lower speed, and maybe too low to notice. The image contains my fast packet generator pkt-gen (a stock traffic generator such as netperf etc. is too slow to show the problem). pkt-gen can send about 1 Mpps in this configuration using -net netmap in the backend. The qemu process in this case takes 100% CPU. On the receive side, i cannot receive more than 50 Kpps, even if i flood the bridge with a huge amount of traffic. The qemu process stays at 5% cpu or less.
Then i read the docs in main-loop.h, which say that one case where qemu_notify_event() is needed is when using qemu_set_fd_handler2(), which is exactly what my backend uses (similar to tap.c). Once i add the notify, the receiver can do 1 Mpps and can use 100% CPU (and it is not in a busy wait: if the offered traffic goes down, the qemu process uses less and less cpu). I am unclear on the terminology (what is frontend and what is backend ?) Frontend is the emulated NIC here, backend the host-side interface, i.e. slirp (user), tap, socket. thanks for the clarification. but it is the guest side that has to wake up the qemu process: the file descriptor talking to the host side (tap, socket, bpf ...) has already fired its events and the only thing it could do is cause a busy wait if it keeps passing a readable file descriptor to select. Can't follow. How did you analyze your delays? see above. I thought your slirp.c patch was also on the same side as e1000.c It was only in the backend, not any frontend driver. It's generally no business of the frontend driver to kick (virtio has some exceptional path). BTW, your patch is rather untargeted as it kicks on every MMIO write of the e1000. Hard to assess what actually makes the difference for you. in my case it is easily answered: the process that reads from the interface on the guest is doing only one read and one write access to the RDT register for each batch of packets. The RDT change is the one that frees the rx buffers. Surely the patch could be made more specific, but this requires a lot of investigation on which register accesses may require attention (and there are so many of them, say if the guest decides to reset the ring, the card, etc., that I don't think it is worth the trouble.) cheers luigi
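The narrower alternative discussed in this exchange (kick only on register writes that can actually free RX buffers, i.e. RDT and RCTL per Paolo's analysis, rather than on every MMIO write) could be sketched as a small predicate. The register indices below are made up for illustration and do not match the real e1000 register layout:

```c
#include <assert.h>

enum { REG_RDT = 1, REG_RCTL = 2, REG_TDT = 3, REG_IMS = 4 };

/* Returns nonzero when a write to this register warrants a wakeup. */
int rx_write_needs_kick(int reg_index)
{
    switch (reg_index) {
    case REG_RDT:   /* guest returned RX descriptors to the ring     */
    case REG_RCTL:  /* RX enable/configuration may unblock reception */
        return 1;
    default:        /* TX and interrupt registers do not free buffers */
        return 0;
    }
}
```

Luigi's point stands either way: the blanket kick in e1000_mmio_write() is robust against guest behavior he cannot predict, while a filter like this trades extra register-by-register analysis for fewer wakeups.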