On Wed, Jul 13, 2016 at 1:45 PM, Kevin Wolf <kw...@redhat.com> wrote:
> Am 13.07.2016 um 13:33 hat Roman Penyaev geschrieben:
>> Just to be sure that we are on the same page:
>>
>> 1. We have this commit "linux-aio: Cancel BH if not needed" which
>>
>>    a) introduces performance regression on my fio workloads on the
>>       following config: "iothread=1, VCPU=8, MQ=8".  Performance
>>       dropped from 1878MB/s to 1606MB/s with Stefan's fix, that is
>>       ~14%.
>
> Do we already understand why the performance regresses with the patch?
> As long as we don't, everything we do is just guesswork.
Eventually the issue is clear. I test on /dev/nullb0, which completes all
submitted bios almost immediately. That means that after io_submit() is
called it is worth trying to check for completed requests instead of
accumulating them in flight. That is the theory. In practice, the
following happens:

-------------------------------------------------------------------
>>> sys_poll
<<< sys_poll
>>> aio_dispatch
>>> aio_bh_poll
<<< aio_bh_poll
>>> node->io_read
!!! ioq_submit(), submitted=98
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=49
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=47
<<< node->io_read
<<< aio_dispatch
>>> sys_poll
<<< sys_poll
>>> aio_dispatch
>>> aio_bh_poll
<<< aio_bh_poll
>>> node->io_read
!!! ioq_submit(), submitted=50
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=43
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=43
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=8
<<< node->io_read
>>> node->io_read
~~~ qemu_laio_completion_bh completed=338
<<< node->io_read
<<< aio_dispatch
-------------------------------------------------------------------
* this run gave 1461MB/s *

This is a very common hunk of the log which I see running the fio load
with the "linux-aio: Cancel BH if not needed" patch applied. The
important thing worth paying attention to is the submission of 338
requests (almost the whole ring buffer of the AIO context) before any
completions are consumed. A very fast backend device completes
submitted requests almost immediately, but we get a chance to fetch the
completions only some time later.

The following is the common part of the log when "linux-aio: Cancel BH
if not needed" is reverted:

-------------------------------------------------------------------
>>> sys_poll
<<< sys_poll
>>> dispatch
>>> aio_bh_poll
~~~ qemu_laio_completion_bh completed=199
<<< aio_bh_poll
>>> node->io_read
!!! ioq_submit(), submitted=47
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=49
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=50
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=43
<<< node->io_read
>>> node->io_read
<<< node->io_read
<<< dispatch
>>> sys_poll
<<< sys_poll
>>> dispatch
>>> aio_bh_poll
~~~ qemu_laio_completion_bh, completed=189
<<< aio_bh_poll
>>> node->io_read
!!! ioq_submit(), submitted=46
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=46
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=51
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=51
<<< node->io_read
<<< dispatch
-------------------------------------------------------------------
* this run gave 1805MB/s *

According to this part of the log I can say that completions happen
frequently, i.e. we get a chance to fetch completions more often, so
the queue is always "refreshed" by newly incoming requests.

To be more precise, I collected some statistics: each time I enter
qemu_laio_completion_bh() I account the number of collected requests
in a bucket, e.g.:

  "~~~ qemu_laio_completion_bh completed=199"   ->  bucket[199] += 1;
  "~~~ qemu_laio_completion_bh, completed=189"  ->  bucket[189] += 1;
  ....

When fio finishes I have a distribution of the number of completed
requests observed in the ring buffer. Here is the sheet:

https://docs.google.com/spreadsheets/d/12CIt6EKiJLqNx0OHNqiabR-oFBrqkH0LN3mjzZ5jGeo/edit?usp=sharing

(Frankly, I could not think of anything better than to send a link to a
Google doc, sorry if that offends anyone.)

There is a chart which shows the whole picture of the distribution:

o X axis is the number of requests completed at once.
o Y axis is the number of times we observed that number of requests.

To avoid scaling problems I plotted the chart starting from 10 requests,
since low numbers of requests do not have much impact but have huge
values. Those 3 red spikes and the blue hill are what we have to focus
on.
The blue hill at the right corner of the chart means that the ring
buffer was almost always observed as full, i.e. qemu_laio_completion_bh()
rarely got a chance to reap completions, meanwhile completed requests
sat in the ring buffer for quite a long time, which degrades the overall
performance. The results covered by the red line are much better, and
that can be explained by those 3 red spikes, which are almost in the
middle of the whole distribution, i.e. qemu_laio_completion_bh() is
called more often, completed requests do not stall, giving fio a chance
to submit new fresh requests.

A theoretical fix would be to schedule the completion BH just after a
successful io_submit(), i.e.:

---------------------------------------------------------------------
@@ -228,6 +228,8 @@ static void ioq_submit(LinuxAioState *s)
         QSIMPLEQ_SPLIT_AFTER(&s->io_q.pending, aiocb, next, &completed);
     } while (ret == len && !QSIMPLEQ_EMPTY(&s->io_q.pending));
     s->io_q.blocked = (s->io_q.n > 0);
+
+    qemu_bh_schedule(s->completion_bh);
 }
---------------------------------------------------------------------

This theoretical fix works pretty well and the numbers return to the
expected ~1800MB/s. So believe me or not, but the BH which is not
cancelled gives better results on very fast backend devices.

The other interesting observation is the following: the submission
limitation (which I did in the "linux-aio: prevent submitting more than
MAX_EVENTS" patch) also fixes the issue, because before submitting more
than MAX_EVENTS we have to reap something first, which obviously does
not let already completed requests stall in the queue for a long time.

--
Roman