Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support

Ming Lei Tue, 12 Aug 2014 00:54:26 -0700

On Mon, Aug 11, 2014 at 10:03 PM, Kevin Wolf <kw...@redhat.com> wrote:
> On 10.08.2014 at 05:46, Ming Lei wrote:
>> Hi Kevin, Paolo, Stefan and all,
>>
>>
>> On Wed, 6 Aug 2014 10:48:55 +0200
>> Kevin Wolf <kw...@redhat.com> wrote:
>>
>> > On 06.08.2014 at 07:33, Ming Lei wrote:
>>
>> >
>> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>> > coroutines instead of exiting them, so it can't make any use of the
>> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> > version that simply removes the yield at the end):
>> >
>> >                 | bypass        | fixed coro    | buggy coro
>> > ----------------+---------------+---------------+--------------
>> > time            | 1.09s         | 1.10s         | 1.62s
>> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> > insns per cycle | 2.39          | 2.39          | 1.90
>> >
>> > Begs the question whether you see a similar effect on a real qemu and
>> > the coroutine pool is still not big enough? With correct use of
>> > coroutines, the difference seems to be barely measurable even without
>> > any I/O involved.
>>
>> I have now fixed the coroutine leak bug. The previous crypt bench puts a
>> fairly heavy load on each iteration, which keeps operations per second
>> very low (~40K/sec), so I wrote a new, simple one that can generate
>> hundreds of thousands of operations per second. That number should match
>> some fast storage devices, and it does show that the effect of coroutines
>> is not small.
>>
>> In the extreme case where only the getppid() syscall is run in each
>> iteration, only 3M operations/sec can be reached with coroutines, while
>> without coroutines the number reaches 16M/sec, a difference of more than
>> 4 times!
>
> I see that you're measuring a lot of things, but the one thing that is
> unclear to me is what question those benchmarks are supposed to answer.
>
> Basically I see two different, useful types of benchmark:
>
> 1. Look at coroutines in isolation and try to get a directly coroutine-
>    related function (like create/destroy or yield/reenter) faster. This
>    is what tests/test-coroutine does.


Actually, tests/test-coroutine does tell us that using coroutines introduces
a cost that is not small, as Paolo's computation in his environment shows [1]:

    - one yield takes 83ns
    - one enter takes 97ns
    - this introduces an 8.3% cost from using coroutines if the block
      device can reach 300K IOPS, as in your case of loop over tmpfs
    - the cost may be 13.8% if the block device can reach 500K IOPS

The cost may show up in IOPS, in CPU utilization, or in both, depending on
how fast the CPU is.

The above computation assumes that every coroutine allocation hits the pool,
and it does not consider the effect of switching stacks. If both are taken
into account, the cost becomes even more evident.

[1] https://lists.nongnu.org/archive/html/qemu-devel/2014-08/msg01544.html
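
The percentages above can be reproduced with a short calculation. This is
only a minimal sketch, not Paolo's actual computation: it assumes each
request pays for one enter, one yield and one re-enter (97 + 83 + 97 =
277ns), an assumption chosen because it matches the quoted 8.3% and 13.8%.
The names PER_REQUEST_NS and coroutine_share are hypothetical.

```python
# Figures quoted above from tests/test-coroutine (Paolo's environment).
YIELD_NS = 83.0
ENTER_NS = 97.0

# ASSUMPTION (not stated in the mail): every request pays for the initial
# enter, one yield while the I/O is in flight, and one re-enter on completion.
PER_REQUEST_NS = ENTER_NS + YIELD_NS + ENTER_NS  # 277 ns


def coroutine_share(iops: float, per_request_ns: float = PER_REQUEST_NS) -> float:
    """Fraction of each second spent in coroutine switches at the given IOPS."""
    return iops * per_request_ns / 1e9


if __name__ == "__main__":
    for iops in (300_000, 500_000):
        print(f"{iops:>7} IOPS: {coroutine_share(iops) * 100:.2f}% coroutine cost")
```

Under this assumption the share grows linearly with IOPS, which is why the
cost matters mostly for fast block devices.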

>    This is quite good at telling you what costs the coroutine functions
>    have and where you need to optimise - without taking the practical
>    benefits into account, so it's not suitable for comparison.
>
> 2. Look at the whole thing in its realistic environment. This should
>    probably involve at least some asynchronous I/O, but ideally use the
>    whole block layer. qemu-img bench tries to do this. For being even
>    closer to the real environment you'd have to use the virtio-blk code
>    as well, which you currently only get with a full VM (perhaps qtest
>    could do something interesting here in theory).
>
>    This is good for telling how big the costs are in relation to the
>    total workload (and code saved elsewhere) in practice. This is the
>    set of tests that can meaningfully be compared to a callback-based
>    solution.
>
> Running arbitrary workloads like getppid() or open/read/close isn't as
> useful as these. It doesn't isolate the coroutines as well as tests that
> run literally nothing else than coroutine functions, and it is too
> removed from the actual use case to get the relation between additional
> costs, saving and total workload figured out for the real case.

If you think getppid() doesn't isolate the coroutine, you can just run a nop
instead, and you will find that the cost may reach 90%. Basically, it has
nothing to do with what the load does; it depends on how fast the load can
run. The quicker the load, the larger the relative cost introduced by using
coroutines. Please see the computation in the link above.

Another reason I use getppid() is:

     Since I/O plug & unplug was introduced, bdrv_aio_readv/bdrv_aio_writev
     have become much quicker, because most of the time they just queue the
     I/O request into the I/O queue, with no I/O submission involved at all.
     Even though coroutine operations take little time (<100ns), they may
     still make a difference compared with the time for only queuing I/O, at
     least for high-speed I/O, like > 300K IOPS in your case.
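
To illustrate why queuing under plug/unplug is so cheap, here is a toy model
of the idea. It is not QEMU's implementation: the PlugQueue class, its
method names and the submit_batch callback are all hypothetical, and the
"submission" is just a Python function call.

```python
from typing import Callable, List


class PlugQueue:
    """Toy model: while plugged, requests are only queued; the outermost
    unplug hands the whole batch to one submission call."""

    def __init__(self, submit_batch: Callable[[List[int]], None]) -> None:
        self._submit_batch = submit_batch  # hypothetical batch submit hook
        self._pending: List[int] = []
        self._plugged = 0                  # nesting depth of plug() calls

    def plug(self) -> None:
        self._plugged += 1

    def enqueue(self, req: int) -> None:
        # Cheap path: just append. Submit right away only when not plugged.
        self._pending.append(req)
        if self._plugged == 0:
            self._flush()

    def unplug(self) -> None:
        if self._plugged == 0:
            raise RuntimeError("unplug() without matching plug()")
        self._plugged -= 1
        if self._plugged == 0:
            self._flush()

    def _flush(self) -> None:
        if self._pending:
            batch, self._pending = self._pending, []
            self._submit_batch(batch)


def demo() -> List[List[int]]:
    """Queue four requests while plugged and return the submitted batches."""
    submitted: List[List[int]] = []
    queue = PlugQueue(submitted.append)
    queue.plug()
    for req in range(4):
        queue.enqueue(req)
    queue.unplug()
    return submitted
```

In this model demo() performs a single batch submission for four requests,
so each enqueue is only a list append, and a fixed per-request cost of a
couple of hundred nanoseconds becomes comparatively large.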

>> From another file read bench, which is the default one:
>>
>>       just doing open(file), read(fd, buf in stack, 512), sum and close()
>>       in each iteration
>>
>> Without coroutines, operations per second increase by ~20% compared with
>> using coroutines. When reading 1024 bytes each time, the number still
>> increases by ~10%. The operations per second are between 200K and 400K,
>> which should match the IOPS in the dataplane test. The tests were done on
>> my Lenovo T410 notebook (CPU: 2.6GHz, dual core, four threads).
>>
>> When reading 8192 or more bytes each time, no clear difference can be
>> observed between using coroutines and not.
>
> All it tells you is that the variation of the workload can make the
> coroutine cost disappear in the noise. It doesn't tell you much about
> the real use case.

When the cost disappears, the IOPS has already become very small. That also
tells us whether coroutines can fit the high-speed I/O case.

> And you're comparing apples and oranges anyway: The real question in
> qemu is whether you use coroutines or pass around heap-allocated state
> between callbacks. Your benchmark doesn't have a single callback because
> it hasn't got any asynchronous operations and doesn't need to allocate
> and pass any state.
>
> It does, however, have an unnecessary yield() for the coroutine case
> because you felt that the real case is more complex and does yield
> (which is true, but it's more complex for both coroutines and
> callbacks).
>
>> Surely, the test result should depend on how fast the machine is, but even
>> for fast machine, I guess the similar result still can be observed by
>> decreasing read bytes each time.
>
> Yes, results looked similar on my laptop. (They just don't tell me
> much.)
>
>
> Let's have a look at some fio results from my laptop:
>
> aggrb KB/s  | master    | coroutine | bypass
> ------------+-----------+-----------+------------
> run 1       | 419934    | 449518    | 445823
> run 2       | 444358    | 456365    | 448332
> run 3       | 444076    | 455209    | 441552
>
>
> And here from my lab test box:
>
> aggrb KB/s  | master    | coroutine | bypass
> ------------+-----------+-----------+------------
> run 1       | 25330     | 56378     | 53541
> run 2       | 26041     | 55709     | 54136
> run 3       | 25811     | 56829     | 49080
>
> The improvement of the bypass patches is barely measurable on my laptop
> (if it even exists), whereas it seems to be a pretty big thing for my
> lab test box. In any case, the optimised coroutine code seems to beat
> the bypass on both machines. (That is for random reads anyway. For
> sequential, I get a much larger variation, and on my lab test box bypass
> is ahead, whereas on my laptop both are roughly on the same level.)
>
> Another thing I tried is creating the coroutine already in virtio-blk to
> avoid the overhead of the bdrv_aio_* emulation. I don't quite understand
> the result of my benchmarks there, maybe you have an idea: For random
> reads, I see a significant improvement, for sequential however a clear
> degradation.
>
> aggrb MB/s  | bypass    | coroutine | virtio-blk-created coroutine
> ------------+-----------+-----------+------------------------------
> seq. read   | 738       | 738       | 694
> random read | 442       | 459       | 475
>
> I would appreciate any ideas about what's going on with sequential reads
> here and how it can be fixed. Anyway, on my machines, coroutines don't
> look like a lost case at all.

Firstly, I hope you can bypass only the coroutine when doing the test, that
is, use the same code path except for the coroutine operations, to observe
the effect of the coroutine.

Secondly, maybe your machine is fast enough that we can't easily observe the
IOPS difference, but there should be a difference in CPU utilization, since
the above computation tells us the coroutine cost does exist. The faster the
block device, the larger the cost.


Thanks,
