On Tue, Nov 29, 2016 at 09:19:22AM +0100, Christian Borntraeger wrote:
> On 11/24/2016 04:12 PM, Stefan Hajnoczi wrote:
> > I looked through the socket SO_BUSY_POLL and blk_mq poll support in
> > recent Linux kernels with an eye towards integrating the ongoing QEMU
> > polling work. The main missing feature is eventfd polling support which
> > I describe below.
> >
> > Background
> > ----------
> > We're experimenting with polling in QEMU so I wondered if there are
> > advantages to having the kernel do polling instead of userspace.
> >
> > One such advantage has been pointed out by Christian Borntraeger and
> > Paolo Bonzini: a userspace thread spins blindly without knowing when it
> > is hogging a CPU that other tasks need. The kernel knows when other
> > tasks need to run and can skip polling in that case.
> >
> > Power management might also benefit if the kernel was aware of polling
> > activity on the system. That way polling can be controlled by the
> > system administrator in a single place. Perhaps smarter power saving
> > choices can also be made by the kernel.
> >
> > Another advantage is that the kernel can poll hardware rings (e.g. NIC
> > rx rings) whereas QEMU can only poll its own virtual memory (including
> > guest RAM). That means the kernel can bypass interrupts for devices
> > that are using kernel drivers.
> >
> > State of polling in Linux
> > -------------------------
> > SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
> > calls to spin awaiting new receive packets. From what I can tell epoll
> > is not supported, so that system call will sleep without polling.
> >
> > blk_mq poll is mainly supported by NVMe. It is only available with
> > synchronous direct I/O. select(2), poll(2), epoll, and Linux AIO are
> > therefore not integrated. It would be nice to extend the code so a
> > process waiting on Linux AIO using io_getevents(2), select(2), poll(2),
> > or epoll will poll.
> >
> > QEMU and KVM-specific polling
> > -----------------------------
> > There are a few QEMU/KVM-specific items that require polling support:
> >
> > QEMU's event loop aio_notify() mechanism wakes up the event loop from a
> > blocking poll(2) or epoll call. It is used when another thread adds or
> > changes an event loop resource (such as scheduling a BH). There is a
> > userspace memory location (ctx->notified) that is written by
> > aio_notify() as well as an eventfd that can be signalled.
> >
> > kvm.ko's ioeventfd is signalled upon guest MMIO/PIO accesses. Virtio
> > devices use ioeventfd as a doorbell after new requests have been placed
> > in a virtqueue, which is a descriptor ring in userspace memory.
> >
> > Eventfd polling support could look like this:
> >
> >   struct eventfd_poll_info poll_info = {
> >       .addr = ...memory location...,
> >       .size = sizeof(uint32_t),
> >       .op = EVENTFD_POLL_OP_NOT_EQUAL, /* check *addr != val */
> >       .val = ...last value...,
> >   };
> >   ioctl(eventfd, EVENTFD_SET_POLL, &poll_info);
> >
> > In the kernel, eventfd stashes this information and eventfd_poll()
> > evaluates the operation (e.g. not equal, bitwise and, etc.) to detect
> > progress.
> >
> > Note that this eventfd polling mechanism doesn't actually poll the
> > eventfd counter value. It's useful for situations where the eventfd is
> > a doorbell/notification that some object in userspace memory has been
> > updated. So it polls that userspace memory location directly.
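(A minimal sketch of what that kernel-side check might look like. Note that
struct eventfd_poll_info, the op constants, and this helper are all part of
the hypothetical interface sketched above, not existing kernel code.)

  /*
   * Hypothetical sketch: evaluate the condition registered through the
   * proposed EVENTFD_SET_POLL ioctl against the userspace memory location.
   */
  static bool eventfd_poll_progress(const struct eventfd_poll_info *info)
  {
      u32 cur;

      /* read the registered userspace memory location */
      if (get_user(cur, (u32 __user *)info->addr))
          return false;                    /* fault: report no progress */

      switch (info->op) {
      case EVENTFD_POLL_OP_NOT_EQUAL:      /* progress when *addr != val */
          return cur != info->val;
      case EVENTFD_POLL_OP_AND:            /* progress when *addr & val */
          return (cur & info->val) != 0;
      default:
          return false;
      }
  }

A busy-polling epoll/ppoll path could call such a helper instead of sleeping
and only fall back to waiting on the eventfd wait queue when the condition
stays false.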
> >
> > This new eventfd feature also provides a poor man's Linux AIO polling
> > support: set the Linux AIO shared ring index as the eventfd polling
> > memory location. This is not as good as true Linux AIO polling support,
> > where the kernel polls the NVMe, virtio_blk, etc. ring, since we'd still
> > rely on an interrupt to complete I/O requests.
> >
> > Thoughts?
>
> Would be an interesting exercise, but we should really try to avoid making
> the iothreads more costly. When I look at some of our measurements, I/O-wise
> we are slightly behind z/VM, which can be tuned to be in a similar area, but
> we use more host CPUs on s390 for the same throughput.
>
> So I have two concerns, and both are related to overhead.
> a: I am able to get higher bandwidth and lower host CPU utilization
> when running fio for multiple disks when I pin the iothreads to a subset of
> the host CPUs (there is a sweet spot). Is the polling maybe just influencing
> the scheduler to do the same by making the iothread not do sleep/wakeup
> all the time?
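(For reference, pinning an iothread to a subset of host CPUs boils down to a
CPU affinity call like the sketch below; the CPU numbers are arbitrary, and
QEMU iothreads are normally pinned from outside the process, e.g. with
taskset or libvirt, rather than with code like this.)

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  /* Illustration only: pin the calling thread to host CPUs 2 and 3. */
  static int pin_current_thread(void)
  {
      cpu_set_t set;

      CPU_ZERO(&set);
      CPU_SET(2, &set);
      CPU_SET(3, &set);

      /* returns 0 on success, an error number on failure */
      return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
  }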
Interesting theory; look at sched_switch tracing data to find out whether
that is true. Do you get any benefit from combining the sweet spot pinning
with polling?

> b: what about contention with other guests on the host? What
> worries me a bit is the fact that most performance measurements and
> tunings are done for workloads without that. We (including myself) do our
> microbenchmarks (or fio runs) with just one guest and are happy if we see
> an improvement. But does that reflect real usage? For example, have you ever
> measured the aio polling with 10 guests or so?
> My gut feeling (and obviously I have not done proper measurements myself) is
> that we want to stop polling as soon as there is contention.
>
> As you outlined, we already have something in place in the kernel to stop
> polling.
>
> Interestingly enough, for SO_BUSY_POLL the network code seems to consider
> !need_resched() && !signal_pending(current)
> for stopping the poll, which allows you to consume your time slice. KVM
> instead uses single_task_running() for the halt_poll thing. This means that
> KVM yields much more aggressively, which is probably the right thing to do
> for opportunistic spinning.

Another thing I noticed about the busy_poll implementation is that it will
spin if *any* file descriptor supports polling. In QEMU we decided to
implement the opposite: spin only if *all* event sources support polling.
The reason is that we don't want polling to introduce any extra latency on
the event sources that do not support polling.

> Another thing to consider: In the kernel we already have other opportunistic
> spinners and we are in the process of making things less aggressive because
> that caused real issues. For example, search for the vcpu_is_preempted patch
> set.
> Which, by the way, showed another issue: when running nested, you want to
> consider not only your own load but also the load of the hypervisor.

These are good points, and they are why I think polling in the kernel can
make smarter decisions than polling in userspace. There are multiple
components in the system that can do polling; it would be best to have a
single place so that the different polling activities don't interfere with
each other.

Stefan
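(As an aside, the two bail-out policies contrasted above amount to conditions
along these lines; this is a paraphrase for illustration, not the actual
networking or KVM code.)

  /* SO_BUSY_POLL style: keep spinning until the time slice is contested
   * or a signal is pending. */
  static bool keep_polling_net_style(void)
  {
      return !need_resched() && !signal_pending(current);
  }

  /* KVM halt-poll style: back off as soon as any other task is runnable
   * on this CPU. */
  static bool keep_polling_kvm_style(void)
  {
      return single_task_running();
  }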