Stefan Hajnoczi <stefa...@gmail.com> wrote on 25/02/2013 02:50:56 PM:
> > However, I am concerned dataplane may not solve the scalability
> > problem because QEMU will still be running 1 thread per VCPU and
> > 1 per virtual device to handle I/O for each VM. Assuming we run
> > N VMs with 1 VCPU and 1 virtual I/O device, we will have 2N
> > threads competing for CPU cycles. In a cloud-like environment
> > running I/O intensive VMs that could be a problem because the
> > I/O threads and VCPU threads may starve each other. Furthermore,
> > the Linux kernel can't make good scheduling decisions (from an
> > I/O perspective) because it has no information about the content
> > of the I/O queues.
>
> The kernel knows when the dataplane thread is schedulable - when the
> ioeventfd is signalled. In the worst case the scheduler could allow
> the vcpu thread to complete an entire time slice before letting the
> dataplane thread run.

Yep, but depending on the number of queues and the amount of pending
data in the queues, you may prefer to use "fine-grained" time slices
and switch faster (or slower) than the kernel scheduler does... I am
assuming we would like to maximize throughput while we minimize
latency.

> So are you saying that the Linux scheduler wouldn't allow the
> dataplane thread to run on a loaded box? My first thought would be
> to raise the priority of the dataplane thread so it preempts the
> vcpu thread upon becoming schedulable.

It's not just about VCPU threads competing with I/O threads; note
that the I/O threads compete with each other too. So I am concerned
the I/O threads (at least 1 per VM) may starve each other. Once we
start playing with thread priorities, the system becomes very complex
to tune...

> > We did some experiments with a modified vhost-blk back-end that
> > uses a single (or a few) thread/s to process I/O for many VMs, as
> > opposed to 1 thread per VM (I/O device). These thread/s decide
> > for how long and when to process the requests of each VM based on
> > the I/O activity of each queue.
> > We noticed that this model (part of what we call ELVIS)
> > significantly improves the scalability of the system when you run
> > many I/O intensive guests.
>
> When you say "this model (part of what we call ELVIS) significantly
> improves the scalability of the system when you run many I/O
> intensive guests", do you mean exit-less vs exit-based or shared
> thread vs 1 thread per device (without polling)? I'm not sure if
> you're advocating exit-less (polling) or shared thread without
> polling.

I was referring to the shared-thread part, not the exitless part.
ELVIS uses exitless notifications (guest-to-host and host-to-guest)
and also fine-grained I/O scheduling based on a shared thread. We
noticed that the fine-grained I/O scheduling contributed the most to
the scalability and performance of the system; exitless notifications
gave an additional boost.

> > I was wondering if you have considered this type of threading
> > model for dataplane as well. With vhost-blk (or -net) it's
> > relatively easy to use a kernel thread to process I/O for many
> > VMs (user-space processes). However, with a QEMU back-end (like
> > dataplane/virtio-blk) the shared-thread model may be challenging
> > because it requires a shared user-space process (for the I/O
> > thread/s) to handle I/O for many QEMU processes.
> >
> > Any thoughts/opinions on the shared-thread direction?
>
> For low latency polling makes sense and a shared thread is an
> efficient way to implement polling. But it throws away resource
> control and isolation - now you can no longer use cgroups and other
> standard resource control mechanisms to manage guests.

Polling is a mechanism we can use to reduce the number of
notifications or to balance between throughput and latency, assuming
you use a shared thread; but you can still use a shared thread
without polling.
IMHO, it seems like with a shared thread you can implement simpler
and more efficient resource control than cgroups, because you can
optimize the logic/heuristics "for I/O", as opposed to applying
cgroups "cpu" shares to the I/O threads and relying on the "cpu"
scheduler.

> You also create a privileged thread that has access to all guests
> on the host - a security bug here compromises all guests. This can
> be fine for private deployments where guests are trusted. For
> untrusted guests and public clouds it seems risky.

But is this significantly different from any other security bug in
the host, qemu, kvm...? If you perform the I/O virtualization in a
separate (not QEMU) process, you have a significantly smaller,
self-contained and bounded trusted computing base (TCB) from a
source-code perspective, as opposed to a single huge user-space
process where it's very difficult to define boundaries and find
potential security holes.

> Maybe a hybrid approach is possible where exit-less is possible but
> I/O emulation still happens in per-guest userspace threads. Not
> sure how much performance can be retained by doing that - e.g. a
> kernel driver that allows processes to bind an eventfd to a memory
> notification area. The kernel driver does polling in a single
> thread and signals eventfds. Userspace threads do the actual I/O
> emulation.

Sounds interesting... however, once the userspace thread runs, the
driver loses control (assuming you don't have spare cores). I mean, a
userspace I/O thread will probably consume all its time slice, while
the driver may prefer to assign fewer (or more) cycles to a specific
I/O thread based on the ongoing activity of all the VMs. With a
shared thread you effectively get scheduling that is optimized for
virtual/emulated I/O without actually modifying the Linux kernel
scheduler code.