Stefan Hajnoczi <stefa...@gmail.com> writes:

> On Tue, Apr 02, 2019 at 02:19:08PM +0200, Sergio Lopez wrote:
>> The polling mode in aio_poll is able to trim down ~20us on the average
>> request latency, but it needs manual fine tuning to adjust it to the
>> characteristics of the storage.
>>
>> Here we add a new knob to the IOThread object, "poll-inflight". When
>> this knob is enabled, aio_poll will always use polling if there are
>> in-flight requests, ignoring the rest of poll-* parameters. If there
>> aren't any in-flight requests, the usual polling rules apply, which is
>> useful given that the default poll-max-ns value of 32us is usually
>> enough to catch a new request in the VQ when the Guest is putting
>> pressure on us.
>>
>> To keep track of the number of in-flight requests, AioContext has a
>> new counter which is increased/decreased by thread-pool.c and
>> linux-aio.c on request submission/completion.
>>
>> With poll-inflight, users willing to spend more Host CPU resources in
>> exchange for a lower latency just need to enable a single knob.
>>
>> This is just an initial version of this feature and I'm just sharing
>> it to get some early feedback. As such, managing this property through
>> QAPI is not yet implemented.
>>
>> Signed-off-by: Sergio Lopez <s...@redhat.com>
>> ---
>>  block/linux-aio.c         |  7 +++++++
>>  include/block/aio.h       |  9 ++++++++-
>>  include/sysemu/iothread.h |  1 +
>>  iothread.c                | 33 +++++++++++++++++++++++++++++++++
>>  util/aio-posix.c          | 32 +++++++++++++++++++++++++++++++-
>>  util/thread-pool.c        |  3 +++
>>  6 files changed, 83 insertions(+), 2 deletions(-)
>
> Hi Sergio,
> More polling modes are useful for benchmarking and performance analysis.
> From this perspective I think poll-inflight is worthwhile.
>
> Like most performance optimizations, the effectiveness of this new
> polling mode depends on the workload. It could waste CPU, especially on
> a queue depth 1 workload with a slow disk.
>
> Do you think better self-tuning is possible? Then users don't need to
> set tunables like this one.
Probably only if we aim for something more complex, which would have its
own inherent costs. We could take inspiration from Linux's io_poll hybrid
mode, which maintains per-device statistics to calculate the average
latency, and then takes a nap for half that time to free the CPU a bit.

Of course, our case is significantly harder. The kernel only deals with
the HW, and only a few devices support io_poll. In our case, the IOThread
may be shared among various devices with radically different backends,
which may also have a wide range of latencies (depending on the
underlying storage, file format, cache mode...). But perhaps we can be
clever and calculate the standard deviation of the collected data to
(in)validate the stats.

There are also some implementation challenges, such as deciding where to
store those stats and designing an interface for aio_poll to access that
information, preferably in a lockless fashion. If we can figure those
out, we should be able to iterate over all the BDSs sharing the
AioContext, using the average latency (if valid), combined with a
timestamp from when the first in-flight request was issued, to calculate
a deadline and, with it, decide whether to take a nap using ppoll() with
a timeout calculated to wake up early enough to catch the completion
while polling, or just enter polling mode for a while.

Perhaps it'd be worth doing a simple PoC outside QEMU, using the
vhost-user-blk example server to avoid the block layer complexity and
evaluate the raw benefits with different kinds of backends and workloads.

Thanks,
Sergio.