On Wed, Jul 13, 2016 at 1:45 PM, Kevin Wolf <kw...@redhat.com> wrote:
> Am 13.07.2016 um 13:33 hat Roman Penyaev geschrieben:
>> Just to be sure that we are on the same page:
>>
>> 1. We have this commit "linux-aio: Cancel BH if not needed" which
>>
>>    a) introduces performance regression on my fio workloads on the
>>       following config: "iothread=1, VCPU=8, MQ=8".  Performance
>>       dropped from 1878MB/s to 1606MB/s with Stefan's fix, that is
>>       ~14%.
>
> Do we already understand why the performance regresses with the patch?
> As long as we don't, everything we do is just guesswork.
Eventually the issue is clear. I test on /dev/nullb0, which completes all
submitted bios almost immediately. That means that after io_submit() is
called it is worth trying to check for completed requests instead of
accumulating them in flight. That is the theory. In practice, the
following happens:

-------------------------------------------------------------------
>>> sys_poll
<<< sys_poll
>>> aio_dispatch
>>> aio_bh_poll
<<< aio_bh_poll
>>> node->io_read
!!! ioq_submit(), submitted=98
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=49
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=47
<<< node->io_read
<<< aio_dispatch
>>> sys_poll
<<< sys_poll
>>> aio_dispatch
>>> aio_bh_poll
<<< aio_bh_poll
>>> node->io_read
!!! ioq_submit(), submitted=50
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=43
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=43
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=8
<<< node->io_read
>>> node->io_read
~~~ qemu_laio_completion_bh completed=338
<<< node->io_read
<<< aio_dispatch
-------------------------------------------------------------------
* this run gave 1461MB/s *

This is a very common hunk of the log which I see running the fio load
with the "linux-aio: Cancel BH if not needed" patch applied. The
important thing worth paying attention to is the submission of 338
requests (almost the whole ring buffer of the AIO context) before any
completions are consumed. A very fast backend device completes
submitted requests almost immediately, but we get a chance to fetch the
completions only some time later.

The following is the common part of the log when "linux-aio: Cancel BH
if not needed" is reverted:

-------------------------------------------------------------------
>>> sys_poll
<<< sys_poll
>>> dispatch
>>> aio_bh_poll
~~~ qemu_laio_completion_bh completed=199
<<< aio_bh_poll
>>> node->io_read
!!! ioq_submit(), submitted=47
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=49
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=50
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=43
<<< node->io_read
>>> node->io_read
<<< node->io_read
<<< dispatch
>>> sys_poll
<<< sys_poll
>>> dispatch
>>> aio_bh_poll
~~~ qemu_laio_completion_bh, completed=189
<<< aio_bh_poll
>>> node->io_read
!!! ioq_submit(), submitted=46
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=46
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=51
<<< node->io_read
>>> node->io_read
!!! ioq_submit(), submitted=51
<<< node->io_read
<<< dispatch
-------------------------------------------------------------------
* this run gave 1805MB/s *

According to this part of the log I can say that completions happen
frequently, i.e. we get a chance to fetch completions more often, so
the queue is always "refreshed" by newly incoming requests.

To be more precise, I collected some statistics: each time I enter
qemu_laio_completion_bh() I account the number of collected requests
in a bucket, e.g.:

  "~~~ qemu_laio_completion_bh completed=199"   ->  bucket[199] += 1;
  "~~~ qemu_laio_completion_bh, completed=189"  ->  bucket[189] += 1;
  ....

When fio finishes I have a distribution of the number of completed
requests observed in the ring buffer. Here is the sheet:

https://docs.google.com/spreadsheets/d/12CIt6EKiJLqNx0OHNqiabR-oFBrqkH0LN3mjzZ5jGeo/edit?usp=sharing

(Frankly, I could not think of anything better than to send a link to a
Google doc, sorry if that offends anyone.)

There is a chart which shows the whole picture of the distribution:

o X axis is the number of requests completed at once.
o Y axis is the number of times we observed that number of requests.

To avoid scaling problems I plotted the chart starting from 10 requests,
since low numbers of requests do not have much impact but have huge
values. Those 3 red spikes and the blue hill are what we have to focus
on.
The blue hill at the right corner of the chart means that the ring
buffer was almost always observed as full, i.e. qemu_laio_completion_bh()
rarely got a chance to reap completions, meanwhile completed requests
sat in the ring buffer for quite a long time, which degrades the overall
performance. The results covered by the red line are much better, and
that can be explained by those 3 red spikes, which are almost in the
middle of the whole distribution, i.e. qemu_laio_completion_bh() is
called more often, completed requests do not stall, giving fio a chance
to submit new fresh requests.

A theoretical fix would be to schedule the completion BH just after a
successful io_submit(), i.e.:

---------------------------------------------------------------------
@@ -228,6 +228,8 @@ static void ioq_submit(LinuxAioState *s)
         QSIMPLEQ_SPLIT_AFTER(&s->io_q.pending, aiocb, next, &completed);
     } while (ret == len && !QSIMPLEQ_EMPTY(&s->io_q.pending));
     s->io_q.blocked = (s->io_q.n > 0);
+
+    qemu_bh_schedule(s->completion_bh);
 }
---------------------------------------------------------------------

This theoretical fix works pretty well and the numbers return to the
expected ~1800MB/s. So believe me or not, but the BH which is not
cancelled gives better results on very fast backend devices.

The other interesting observation is the following: the submission
limitation (which I did in the "linux-aio: prevent submitting more than
MAX_EVENTS" patch) also fixes the issue, because before submitting more
than MAX_EVENTS we have to reap something first, which obviously does
not let already completed requests stall in the queue for a long time.

--
Roman