Stefan Hajnoczi <stefa...@gmail.com> wrote on 25/02/2013 02:50:56 PM:
> > However, I am concerned dataplane may not solve the scalability
> > problem because QEMU will still be running 1 thread per VCPU and
> > 1 per virtual device to handle I/O for each VM. Assuming we run
> > N VMs with 1 VCPU and 1 virtual I/O device, we will have 2N
> > threads competing for CPU cycles. In a cloud-like environment
> > running I/O intensive VMs that could be a problem because the
> > I/O threads and VCPU threads may starve each other. Furthermore,
> > the Linux kernel can't make good scheduling decisions (from an
> > I/O perspective) because it has no information about the content
> > of the I/O queues.
>
> The kernel knows when the dataplane thread is schedulable - when the
> ioeventfd is signalled. In the worst case the scheduler could allow
> the vcpu thread to complete an entire time slice before letting the
> dataplane thread run.

Yep, but depending on the number of queues and the amount of pending
data in the queues, you may prefer to use "fine-grained" time slices
and switch faster (or slower) than the kernel scheduler does... I am
assuming we would like to maximize throughput while we minimize
latency.

> So are you saying that the Linux scheduler wouldn't allow the
> dataplane thread to run on a loaded box? My first thought would be
> to raise the priority of the dataplane thread so it preempts the
> vcpu thread upon becoming schedulable.

It's not just about VCPU threads competing with I/O threads; note
that the I/O threads compete with each other too. So I am concerned
the I/O threads (at least 1 per VM) may starve each other. Once we
start playing with thread priorities, the system becomes very complex
to tune...

> > We did some experiments with a modified vhost-blk back-end that
> > uses a single (or a few) thread/s to process I/O for many VMs, as
> > opposed to 1 thread per VM (I/O device). These thread/s decide
> > for how long and when to process the requests of each VM based on
> > the I/O activity of each queue.
> > We noticed that this model (part of what we call ELVIS)
> > significantly improves the scalability of the system when you run
> > many I/O intensive guests.
>
> When you say "this model (part of what we call ELVIS) significantly
> improves the scalability of the system when you run many I/O
> intensive guests", do you mean exit-less vs exit-based or shared
> thread vs 1 thread per device (without polling)? I'm not sure if
> you're advocating exit-less (polling) or shared thread without
> polling.

I was referring to the shared-thread part, not the exitless part.
ELVIS uses exitless notifications (guest-to-host and host-to-guest)
and also fine-grained I/O scheduling based on a shared thread. We
noticed that the fine-grained I/O scheduling contributed the most to
the scalability and performance of the system; exitless notifications
gave an additional boost.

> > I was wondering if you have considered this type of threading
> > model for dataplane as well. With vhost-blk (or -net) it's
> > relatively easy to use a kernel thread to process I/O for many
> > VMs (user-space processes). However, with a QEMU back-end (like
> > dataplane/virtio-blk) the shared-thread model may be challenging
> > because it requires a shared user-space process (for the I/O
> > thread/s) to handle I/O for many QEMU processes.
> >
> > Any thoughts/opinions on the shared-thread direction?
>
> For low latency polling makes sense and a shared thread is an
> efficient way to implement polling. But it throws away resource
> control and isolation - now you can no longer use cgroups and other
> standard resource control mechanisms to manage guests.

Polling is a mechanism we can use to reduce the number of
notifications or to balance between throughput and latency, assuming
you use a shared thread; but you can still use a shared thread
without polling.
IMHO, it seems like with a shared thread you can implement simpler
and more efficient resource control than cgroups, because you can
optimize the logic/heuristics "for I/O", as opposed to applying
cgroups "cpu" shares to the I/O threads and relying on the "cpu"
scheduler.

> You also create a privileged thread that has access to all guests
> on the host - a security bug here compromises all guests. This can
> be fine for private deployments where guests are trusted. For
> untrusted guests and public clouds it seems risky.

But is this significantly different from any other security bug in
the host, qemu, kvm...? If you perform the I/O virtualization in a
separate (not QEMU) process, you have a significantly smaller,
self-contained and bounded trusted computing base (TCB) from a
source-code perspective, as opposed to a single huge user-space
process where it's very difficult to define boundaries and find
potential security holes.

> Maybe a hybrid approach is possible where exit-less is possible but
> I/O emulation still happens in per-guest userspace threads. Not
> sure how much performance can be retained by doing that - e.g. a
> kernel driver that allows processes to bind an eventfd to a memory
> notification area. The kernel driver does polling in a single
> thread and signals eventfds. Userspace threads do the actual I/O
> emulation.

Sounds interesting... however, once the userspace thread runs, the
driver loses control (assuming you don't have spare cores). I mean, a
userspace I/O thread will probably consume all its time slice, while
the driver may prefer to assign fewer (or more) cycles to a specific
I/O thread based on the ongoing activity of all the VMs. With a
shared thread you effectively get scheduling that is optimized for
virtual/emulated I/O without actually modifying the Linux kernel
scheduler code.