Re: [Qemu-devel] [PATCH 0/7] coroutine: optimizations

2014-11-30 Thread Ming Lei
On Mon, 01 Dec 2014 08:05:17 +0100
Peter Lieven  wrote:

> On 01.12.2014 06:55, Ming Lei wrote:
> > On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini  wrote:
> >> As discussed in the other thread, this brings speedups from
> >> dropping the coroutine mutex (which serializes multiple iothreads,
> >> too) and using ELF thread-local storage.
> >>
> >> The speedup in perf/cost is about 30% (190->145).  Windows port tested
> >> with tests/test-coroutine.exe under Wine.
> > The data is very nice, and in my laptop, 'perf cost' can be decreased
> > from 244ns to 174ns.
> >
> > BTW, the cost by using coroutine to run function isn't only from these
> > helpers(*_yield, *_enter, *_create, and perf-cost just measures
> > this part of cost), but also some implicit/invisible part. I have some
> > test cases which can show the problem. If someone is interested,
> > I can post them in list.
> 
> Of course, maybe the problem can be solved or impaired.

OK, please try the patch below:

From 917d5cc0a273f9825b10abd52152c54e08c81ef8 Mon Sep 17 00:00:00 2001
From: Ming Lei 
Date: Mon, 1 Dec 2014 11:11:23 +0800
Subject: [PATCH] test-coroutine: introduce perf-cost-with-load

The perf/cost test case only covers the explicit cost of
using coroutines.

This patch provides an open/close file test case. From
this case we can see that there is also some implicit
or invisible cost beyond the cost measured by /perf/cost.

In my environment, the test results after applying this
patch and running perf/cost and perf/cost-with-load are as follows:

{*LOG(start):{/perf/cost}:LOG*}
/perf/cost: {*LOG(message):{Run operation 40000000 iterations 7.539413
s, 5305K operations/s, 188ns per coroutine}:LOG*}
OK
{*LOG(stop):(0;0;7.539497):LOG*}

{*LOG(start):{/perf/cost-with-load}:LOG*}
/perf/cost-with-load: {*LOG(message):{Run operation 1000000 iterations
2.648014 s, 377K operations/s, 2648ns per operation without using
coroutine}:LOG*}
{*LOG(message):{Run operation 1000000 iterations 2.919133 s, 342K
operations/s, 2919ns per operation, 271ns(cost introduced by coroutine)
per operation with using coroutine}:LOG*}
OK
{*LOG(stop):(0;0;5.567333):LOG*}

From the above data, we can see that /perf/cost reports 188ns for
running one coroutine, but in /perf/cost-with-load the actual cost
introduced is 271ns (2919ns - 2648ns). The extra 83ns is invisible
and implicit.
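
The arithmetic behind these figures can be sketched as below. This is only
an illustration, not part of the patch: the helper names ns_per_op() and
implicit_cost_ns() are hypothetical, and the iteration count of one million
is an assumption inferred from the reported duration and per-operation cost.

```c
#include <assert.h>

/* Hypothetical helper: average cost of one operation in nanoseconds,
 * given the wall-clock duration in seconds and the iteration count. */
static unsigned long ns_per_op(double duration_s, unsigned long iterations)
{
    assert(iterations > 0);
    return (unsigned long)(1000000000.0 * duration_s / iterations);
}

/* Hypothetical helper: the part of the coroutine overhead that the
 * /perf/cost figure does not account for. */
static long implicit_cost_ns(unsigned long with_co_ns,
                             unsigned long without_co_ns,
                             unsigned long perf_cost_ns)
{
    return (long)with_co_ns - (long)without_co_ns - (long)perf_cost_ns;
}
```

With the durations from the log, ns_per_op(2.648014, 1000000) and
ns_per_op(2.919133, 1000000) give the two per-operation figures, and
implicit_cost_ns() applied to them together with the 188ns from /perf/cost
gives the implicit part.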

Similar results can be seen in the following test cases too:
- read from /dev/nullb0 opened with O_DIRECT
(a sort of aio read simulation; needs a 3.13+ kernel for
/dev/nullbX support via 'modprobe null_blk'; this case
shows +150ns extra cost)
- statvfs() syscall: there is ~30ns extra cost for running
one statvfs() in a coroutine
---
 tests/test-coroutine.c |   67 
 1 file changed, 67 insertions(+)

diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
index 27d1b6f..7323a91 100644
--- a/tests/test-coroutine.c
+++ b/tests/test-coroutine.c
@@ -311,6 +311,72 @@ static void perf_baseline(void)
 maxcycles, duration);
 }
 
+static void perf_cost_load_worker(void *opaque)
+{
+    int fd;
+
+    fd = open("/proc/self/exe", O_RDONLY);
+    assert(fd >= 0);
+    close(fd);
+}
+
+static __attribute__((noinline)) void perf_cost_load_func(void *opaque)
+{
+    perf_cost_load_worker(opaque);
+    qemu_coroutine_yield();
+}
+
+static double perf_cost_load(unsigned long maxcycles, bool use_co)
+{
+    unsigned long i = 0;
+    double duration;
+
+    g_test_timer_start();
+    if (use_co) {
+        Coroutine *co;
+        while (i++ < maxcycles) {
+            co = qemu_coroutine_create(perf_cost_load_func);
+            qemu_coroutine_enter(co, &i);
+            qemu_coroutine_enter(co, NULL);
+        }
+    } else {
+        while (i++ < maxcycles) {
+            perf_cost_load_worker(&i);
+        }
+    }
+    duration = g_test_timer_elapsed();
+
+    return duration;
+}
+
+static void perf_cost_with_load(void)
+{
+    const unsigned long maxcycles = 1000000;
+    double duration;
+    unsigned long ops;
+    unsigned long cost_co, cost;
+
+    duration = perf_cost_load(maxcycles, false);
+    ops = (long)(maxcycles / (duration * 1000));
+    cost = (unsigned long)(1000000000.0 * duration / maxcycles);
+    g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
+                   "%luns per operation without using coroutine",
+                   maxcycles,
+                   duration, ops,
+                   cost);
+
+    duration = perf_cost_load(maxcycles, true);
+    ops = (long)(maxcycles / (duration * 1000));
+    cost_co = (unsigned long)(1000000000.0 * duration / maxcycles);
+    g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
+                   "%luns per operation, "
+                   "%luns(cost introduced by coroutine) per operation "
+                   "wit

Re: [Qemu-devel] [PATCH 0/7] coroutine: optimizations

2014-11-30 Thread Peter Lieven

On 01.12.2014 06:55, Ming Lei wrote:
> On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini  wrote:
>> As discussed in the other thread, this brings speedups from
>> dropping the coroutine mutex (which serializes multiple iothreads,
>> too) and using ELF thread-local storage.
>>
>> The speedup in perf/cost is about 30% (190->145).  Windows port tested
>> with tests/test-coroutine.exe under Wine.
> The data is very nice, and in my laptop, 'perf cost' can be decreased
> from 244ns to 174ns.
>
> BTW, the cost by using coroutine to run function isn't only from these
> helpers(*_yield, *_enter, *_create, and perf-cost just measures
> this part of cost), but also some implicit/invisible part. I have some
> test cases which can show the problem. If someone is interested,
> I can post them in list.


Of course, maybe the problem can be solved or impaired.

Peter



Re: [Qemu-devel] [PATCH 0/7] coroutine: optimizations

2014-11-30 Thread Ming Lei
On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini  wrote:
> As discussed in the other thread, this brings speedups from
> dropping the coroutine mutex (which serializes multiple iothreads,
> too) and using ELF thread-local storage.
>
> The speedup in perf/cost is about 30% (190->145).  Windows port tested
> with tests/test-coroutine.exe under Wine.

The data is very nice; on my laptop, 'perf cost' decreases
from 244ns to 174ns.

BTW, the cost of using a coroutine to run a function doesn't come only
from these helpers (*_yield, *_enter, *_create; perf-cost measures just
this part of the cost), but also from some implicit/invisible part. I have
some test cases which can show the problem. If someone is interested,
I can post them to the list.


Thanks,
Ming Lei



[Qemu-devel] [PATCH 0/7] coroutine: optimizations

2014-11-28 Thread Paolo Bonzini
As discussed in the other thread, this brings speedups from
dropping the coroutine mutex (which serializes multiple iothreads,
too) and using ELF thread-local storage.
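
As a rough illustration of why thread-local storage removes the need for a
mutex, a per-thread free list might look like the sketch below. This is not
the code from the series: Item, POOL_MAX, item_get() and item_put() are
hypothetical names, and the cap of 64 is arbitrary. It assumes a compiler
that supports the __thread keyword (GCC or Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pooled object standing in for a coroutine. */
typedef struct Item {
    struct Item *next;
} Item;

enum { POOL_MAX = 64 };   /* arbitrary cap, chosen for illustration */

/* One free list per thread.  No other thread can see these variables,
 * so they can be updated without taking a lock. */
static __thread Item *pool_head;
static __thread unsigned pool_size;

/* Take an item from this thread's pool, or allocate a fresh one. */
static Item *item_get(void)
{
    Item *it = pool_head;

    if (it) {
        pool_head = it->next;
        pool_size--;
        return it;
    }
    return malloc(sizeof(*it));
}

/* Return an item to this thread's pool, or free it if the pool is full. */
static void item_put(Item *it)
{
    assert(it != NULL);
    if (pool_size < POOL_MAX) {
        it->next = pool_head;
        pool_head = it;
        pool_size++;
    } else {
        free(it);
    }
}
```

The trade-off in a sketch like this is that items released on one thread are
only reused by that same thread; sharing them across threads needs some
additional mechanism.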

The speedup in perf/cost is about 30% (190->145).  Windows port tested
with tests/test-coroutine.exe under Wine.

Paolo

Paolo Bonzini (7):
  coroutine-ucontext: use __thread
  qemu-thread: add per-thread atexit functions
  test-coroutine: avoid overflow on 32-bit systems
  QSLIST: add lock-free operations
  coroutine: rewrite pool to avoid mutex
  coroutine: drop qemu_coroutine_adjust_pool_size
  coroutine: try harder not to delete coroutines
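
For readers unfamiliar with lock-free singly-linked lists, the general
technique can be sketched with C11 atomics as below. This is not the QSLIST
code from the series: Node, Stack, stack_push() and stack_take_all() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node {
    struct Node *next;
    int value;
} Node;

typedef struct {
    _Atomic(Node *) head;
} Stack;

/* Push one node with a compare-and-swap loop.  On failure the CAS
 * reloads the current head into 'old', so the loop just retries. */
static void stack_push(Stack *s, Node *n)
{
    Node *old = atomic_load_explicit(&s->head, memory_order_relaxed);

    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&s->head, &old, n,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Detach the whole list in one atomic exchange.  Taking everything at
 * once sidesteps the ABA problem that a lock-free single pop can hit. */
static Node *stack_take_all(Stack *s)
{
    return atomic_exchange_explicit(&s->head, (Node *)NULL,
                                    memory_order_acquire);
}
```

After stack_take_all() the caller owns the returned chain privately and can
walk it with plain loads.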

 block/block-backend.c |   4 --
 coroutine-ucontext.c  |  64 +++-
 include/block/coroutine.h |  10 -
 include/qemu/queue.h  |  15 ++-
 include/qemu/thread.h |   4 ++
 qemu-coroutine.c  | 104 ++
 tests/test-coroutine.c|   2 +-
 util/qemu-thread-posix.c  |  37 +
 util/qemu-thread-win32.c  |  48 -
 9 files changed, 157 insertions(+), 131 deletions(-)

-- 
2.1.0