I looked through the socket SO_BUSY_POLL and blk_mq poll support in
recent Linux kernels with an eye towards integrating the ongoing QEMU
polling work.  The main missing feature is eventfd polling support which
I describe below.

Background
----------
We're experimenting with polling in QEMU so I wondered if there are
advantages to having the kernel do polling instead of userspace.

One such advantage has been pointed out by Christian Borntraeger and
Paolo Bonzini: a userspace thread spins blindly without knowing when it
is hogging a CPU that other tasks need.  The kernel knows when other
tasks need to run and can skip polling in that case.

Power management might also benefit if the kernel was aware of polling
activity on the system.  That way polling can be controlled by the
system administrator in a single place.  Perhaps smarter power saving
choices can also be made by the kernel.

Another advantage is that the kernel can poll hardware rings (e.g. NIC
rx rings) whereas QEMU can only poll its own virtual memory (including
guest RAM).  That means the kernel can bypass interrupts for devices
that are using kernel drivers.

State of polling in Linux
-------------------------
SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
calls to spin awaiting new receive packets.  From what I can tell epoll
is not supported so that system call will sleep without polling.

blk_mq poll is mainly supported by NVMe.  It is only available with
synchronous direct I/O.  select(2), poll(2), epoll, and Linux AIO are
therefore not integrated.  It would be nice to extend the code so a
process waiting on Linux AIO using io_getevents(2), select(2), poll(2),
or epoll will poll.

QEMU and KVM-specific polling
-----------------------------
There are a few QEMU/KVM-specific items that require polling support:

QEMU's event loop aio_notify() mechanism wakes up the event loop from a
blocking poll(2) or epoll call.  It is used when another thread adds or
changes an event loop resource (such as scheduling a BH).  There is a
userspace memory location (ctx->notified) that is written by
aio_notify() as well as an eventfd that can be signalled.

kvm.ko's ioeventfd is signalled upon guest MMIO/PIO accesses.  Virtio
devices use ioeventfd as a doorbell after new requests have been placed
in a virtqueue, which is a descriptor ring in userspace memory.

Eventfd polling support could look like this:

  struct eventfd_poll_info poll_info = {
      .addr = ...memory location...,
      .size = sizeof(uint32_t),
      .op   = EVENTFD_POLL_OP_NOT_EQUAL, /* check *addr != val */
      .val  = ...last value...,
  };
  ioctl(eventfd, EVENTFD_SET_POLL, &poll_info);

In the kernel, eventfd stashes this information and eventfd_poll()
evaluates the operation (e.g. not equal, bitwise and, etc) to detect
progress.

Note that this eventfd polling mechanism doesn't actually poll the
eventfd counter value.  It's useful for situations where the eventfd is
a doorbell/notification that some object in userspace memory has been
updated.  So it polls that userspace memory location directly.

This new eventfd feature also provides a poor man's Linux AIO polling
support: set the Linux AIO shared ring index as the eventfd polling
memory location.  This is not as good as true Linux AIO polling support
where the kernel polls the NVMe, virtio_blk, etc ring since we'd still
rely on an interrupt to complete I/O requests.

Thoughts?

Stefan

Attachment: signature.asc
Description: PGP signature

Reply via email to