Re: Sketch of an idea for handling the "mixed workload" problem

George Dunlap Mon, 22 Jan 2024 03:54:37 -0800

On Mon, Jan 22, 2024 at 12:31 AM Demi Marie Obenour
<d...@invisiblethingslab.com> wrote:
>
> On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > The basic credit2 algorithm goes something like this:
> >
> > 1. All vcpus start with the same number of credits; about 10ms worth
> > if everyone has the same weight
> >
> > 2. vcpus burn credits as they consume cpu, based on the relative
> > weights: higher weights burn slower, lower weights burn faster
> >
> > 3. At any given point in time, the runnable vcpu with the highest
> > credit is allowed to run
> >
> > 4. When the "next runnable vcpu" on a runqueue has negative credit,
> > credit is reset: everyone gets another 10ms, and can carry at most
> > 2ms of credit across the reset.
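
The four steps above can be sketched in a few lines of Python. This is
an illustration only, not Xen code: the names Vcpu, burn, pick_next and
reset are made up, and two details are assumptions the description does
not pin down: that the burn rate scales as max_weight / weight, and that
a negative balance (debt) is carried across the reset while a surplus is
capped at 2ms.

```python
from dataclasses import dataclass

RESET_CREDIT_MS = 10.0  # credit granted at each reset (step 1 and step 4)
CARRY_OVER_MS = 2.0     # most surplus credit that survives a reset (step 4)


@dataclass
class Vcpu:
    name: str
    weight: int
    credit: float = RESET_CREDIT_MS
    runnable: bool = True


def burn(vcpu, ran_ms, max_weight):
    """Step 2: higher weights burn slower, lower weights burn faster.

    Assumption: the burn rate is scaled by max_weight / weight.
    """
    vcpu.credit -= ran_ms * max_weight / vcpu.weight


def reset(runqueue):
    """Step 4: everyone gets another 10ms; surplus is capped at 2ms.

    Assumption: debt (negative credit) is carried over in full.
    """
    for v in runqueue:
        v.credit = min(v.credit + RESET_CREDIT_MS,
                       RESET_CREDIT_MS + CARRY_OVER_MS)


def pick_next(runqueue):
    """Steps 3 and 4: run the runnable vcpu with the most credit,
    resetting credit first if that vcpu's credit is negative."""
    runnable = [v for v in runqueue if v.runnable]
    if not runnable:
        return None
    best = max(runnable, key=lambda v: v.credit)
    if best.credit < 0:
        reset(runqueue)
        best = max(runnable, key=lambda v: v.credit)
    return best
```

Note that a vcpu which sleeps through a whole period wakes up with at
most 12ms here, which is why a mostly idle vcpu sorts ahead of the ones
burning through their slices.
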
> >
> > Generally speaking, vcpus that use less than their quota and have lots
> > of interrupts are scheduled immediately, since when they wake up they
> > always have more credit than the vcpus who are burning through their
> > slices.
> >
> > But what about a situation as described recently on Matrix, where a VM
> > uses a non-negligible amount of cpu doing un-accelerated encryption
> > and decryption, which can be delayed by a few ms, as well as handling
> > audio events?  How can we make sure that:
> >
> > 1. We can run whenever interrupts happen
> > 2. We get no more than our fair share of the cpu?
> >
> > The counter-intuitive key here is that in order to achieve the above,
> > you need to *deschedule or preempt early*, so that when the interrupt
> > comes, you have spare credit to run the interrupt handler.  How do we
> > manage that?
> >
> > The idea I'm working out comes from a phrase I used in the Matrix
> > discussion, about a vcpu that "foolishly burned all its credits".
> > Naturally the thing you want to do to have credits available is to
> > save them up.
> >
> > So the idea would be this.  Each vcpu would have a "boost credit
> > ratio" and a "default boost interval"; there would be sensible
> > defaults based on typical workloads, but these could be tweaked for
> > individual VMs.
> >
> > When credit is assigned, all VMs would get the same amount of credit,
> > but divided into two "buckets", according to the boost credit ratio.
> >
> > Under certain conditions, a vcpu would be considered "boosted"; this
> > state would last either until the default boost interval expires, or
> > until some other event (such as a de-boost yield).
> >
> > The queue would be sorted thus:
> >
> > * Boosted vcpus, by boost credit available
> > * Non-boosted vcpus, by non-boost credit available
> >
> > Getting more boost credit means having lower priority when not
> > boosted; and burning through your boost credit means not being
> > scheduled when you need to be.
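
The two-bucket split and the queue ordering described above can be
sketched as follows. Again this is only an illustration in Python, not
Xen code; BucketedVcpu, assign_credit and sorted_runqueue are
hypothetical names, and the 10ms grant is simply borrowed from the
credit2 description.

```python
from dataclasses import dataclass


@dataclass
class BucketedVcpu:
    name: str
    boost_ratio: float         # fraction of each grant put in the boost bucket
    boost_credit: float = 0.0
    normal_credit: float = 0.0
    boosted: bool = False


def assign_credit(vcpu, grant_ms=10.0):
    """Every vcpu gets the same grant, split by its boost credit ratio."""
    vcpu.boost_credit = grant_ms * vcpu.boost_ratio
    vcpu.normal_credit = grant_ms - vcpu.boost_credit


def sort_key(vcpu):
    """Boosted vcpus first, by boost credit; then the rest, by
    non-boost credit. Larger credit sorts earlier in both groups."""
    credit = vcpu.boost_credit if vcpu.boosted else vcpu.normal_credit
    return (not vcpu.boosted, -credit)


def sorted_runqueue(vcpus):
    return sorted(vcpus, key=sort_key)


audio = BucketedVcpu("audio", boost_ratio=0.5)
batch = BucketedVcpu("batch", boost_ratio=0.25)
for v in (audio, batch):
    assign_credit(v)

# While nobody is boosted, "audio" pays for its larger boost bucket by
# having less non-boost credit than "batch".
print([v.name for v in sorted_runqueue([audio, batch])])

# Once boosted, "audio" sorts ahead regardless of non-boost credit.
audio.boosted = True
print([v.name for v in sorted_runqueue([audio, batch])])
```
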
> >
> > Other ways we could consider putting a vcpu into a boosted state (some
> > discussed on Matrix or emails linked from Matrix):
> > * Xen is about to preempt, but finds that the vcpu's interrupts are
> > blocked (this sort of overlaps with the "when we deliver an interrupt"
> > one)
> > * Xen is about to preempt, but finds that the (currently out-of-tree)
> > "dont_desched" bit has been set in the shared memory area
>
> I think both of these would be good.  Another one would be when Xen is
> about to deliver an interrupt to a guest, provided that there is no
> storm of interrupts.  I’ve seen a USB webcam cause a system-wide latency
> spike through what I presume is an interrupt storm, and I suspect that
> others have observed similar behavior with USB external drives.


How would you determine that a given interrupt was part of a "storm",
and what would you do differently as a result of determining that?

> > Other ways to consider de-boosting:
> > * There's a way to trigger a VMEXIT when interrupts have been
> > re-enabled; we could set this up when the VM is in the boost state
>
> That’s a good idea, but should be conditional on “dont_desched” _not_
> being set.  This handles the case where the guest is running a realtime
> thread.

In which case we need some way for the "enlightened" guest to know how
to de-boost itself; a yield might do.

> Generally, I’d like to see something like this:
>
> - A vCPU with sufficient boost credit is boosted by Xen under the
>   following conditions:
>
>   1. Xen interrupts the guest.

I take it you mean, "delivers an interrupt to the guest"?


>   2. Xen is about to preempt, but detects that “dont_desched” is set.
>   3. Xen is about to preempt, but detects that interrupts are disabled.
>
> - A vCPU is deboosted if:
>
>   1. It runs out of boost credit, even if “dont_desched” is set.
>   2. An interrupt handler returns, but only if “dont_desched” is not set.
>   3. Interrupts are re-enabled, but only if “dont_desched” is not set.
>
>   The first case is an abnormal condition and typically means that
>   either the system is overloaded or a vCPU is running boosted for too
>   long.  To help debug this situation, Xen will log a warning and
>   increment both a system-wide and a per-domain counter.  dom0 can
>   retrieve counters for any domain, and a domain can read its own
>   counter.
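
The boost and deboost rules listed above amount to a small state
machine, sketched here in Python. Nothing in it is Xen code: the Event
names, BoostState and handle are hypothetical, "sufficient boost
credit" is assumed to mean "more than zero", and a single counter
stands in for the system-wide and per-domain counters.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Event(Enum):
    INTERRUPT_DELIVERED = auto()     # Xen delivers an interrupt to the guest
    PREEMPT_ATTEMPT = auto()         # Xen is about to preempt the vcpu
    BOOST_CREDIT_EXHAUSTED = auto()  # the boost bucket ran dry
    HANDLER_RETURNED = auto()        # an interrupt handler returned
    INTERRUPTS_ENABLED = auto()      # the guest re-enabled interrupts


@dataclass
class BoostState:
    boost_credit: float
    boosted: bool = False
    dont_desched: bool = False
    interrupts_disabled: bool = False
    exhausted_count: int = 0         # stand-in for the warning counters


def handle(v, event):
    """Apply one event to v; return whether v is boosted afterwards."""
    has_credit = v.boost_credit > 0.0  # assumed meaning of "sufficient"
    if event is Event.INTERRUPT_DELIVERED:
        v.boosted = v.boosted or has_credit
    elif event is Event.PREEMPT_ATTEMPT:
        if v.dont_desched or v.interrupts_disabled:
            v.boosted = v.boosted or has_credit
    elif event is Event.BOOST_CREDIT_EXHAUSTED:
        # Abnormal case: deboost even if dont_desched is set, and count it.
        if v.boosted:
            v.exhausted_count += 1
        v.boosted = False
    elif event in (Event.HANDLER_RETURNED, Event.INTERRUPTS_ENABLED):
        # Normal deboost paths, suppressed while dont_desched is set.
        if not v.dont_desched:
            v.boosted = False
    return v.boosted
```

Keeping the exhaustion path separate from the other two deboost paths
makes it the only place where the warning counter has to be touched.
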
>
> - When to set “dont_desched” is entirely up to the guest kernel, but
>   there are some general rules guests should follow:
>
>   - Only set “dont_desched” if there is a good reason, and unset it as
>     soon as possible.  Xen gives vCPUs with “dont_desched” set priority
>     over all other vCPUs on the system, but the amount of time a vCPU is
>     allowed to run with an elevated priority is limited.  Xen will log a
>     warning if a guest tries to run with elevated priority for too long.
>
>   - Xen boosts vCPUs before delivering an interrupt, but there should be
>     a way for a vCPU to deboost itself even before returning from the
>     interrupt handler.
>
>   - Guests should always set “dont_desched” when running hard-realtime
>     threads (used for e.g. audio processing), even when the thread is in
>     userspace.  This ensures that Xen gives the underlying vCPU priority
>     over other vCPUs.
>
>   - Guests should always set “dont_desched” when holding a spin lock,
>     but it is even better to use paravirtualized spin locks (which make
>     a hypercall into Xen and therefore allow other vCPUs to run).
>
>   - Xen does not implement priority inheritance, so guests need to do
>     that.
>
> - Max boost credits can be set by dom0 via a hypercall.
>
> The advantage of this approach is that it keeps almost all policy out of
> Xen.  The only exception is the boosting when an interrupt is received,
> but a well-behaved guest will deboost itself very quickly (by enabling
> interrupts) if the boost was not actually needed, so this should have
> very limited impact.  I think this should be enough for realtime audio,
> and it is somewhat related to (but hopefully simpler than) the KVM RFC
> from Google [1].
>
> Any thoughts on this?

Overall sounds good.  I think a good approach would be to start by
implementing it without the "dont_desched" flag, and then add that on
top later.  It sounds like you have a clear vision for what you want,
so it shouldn't be too hard to write it such that adding the
"dont_desched" flag later doesn't require a lot of pointless refactoring.

The other issue I have with this (and essentially where I got stuck
developing credit2 in the first place) is testing: how do you ensure
that it has the properties that you expect?  How do you develop a
"regression test" to make sure that server-based workloads don't have
issues in this sort of case?

 -George
