Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

Avi Kivity Sun, 30 Nov 2008 09:11:55 -0800

Andi Kleen wrote:

On Sun, Nov 30, 2008 at 06:38:14PM +0200, Avi Kivity wrote:
The guest allocates when it touches the page for the first time. This means very little since all of memory may be touched during guest bootup or shortly afterwards. Even if not, it is still a one-time operation, and any choices we make based on it will last the lifetime of the guest.
I was more thinking about some heuristics that checks when a page
is first mapped into user space. The only problem is that it is zeroed
through the direct mapping before, but perhaps there is a way around it. That's one of the rare cases when 32bit highmem actually makes things easier.
It might also be easier on some OSes other than Linux that don't use
direct mapping that aggressively.

In the context of kvm, the mmap() calls happen before the guest ever executes. First access happens somewhat later, but still we cannot count on the majority of accesses to come from the same cpu as the first access.

This is roughly equivalent to getting a fresh new demand fault page,
but doesn't require an unmap/free/remap.

Lost again, sorry.


free/unmap/remap normally gives you local memory. I tend to call
it poor man's NUMA policy API.

The alternative is to keep your own pools and allocate from the
correct pool, but then you either need pinning or getcpu()

This is meaningless in kvm context. Other than small bits of memory needed for I/O and shadow page tables, the bulk of memory is allocated once. Guest processes may repeatedly allocate and free memory, but kvm will never see this.

We need to mimic real hardware.
The underlying allocation is in pages, so the NUMA affinity can be as well handled by this.
Basic algorithm:
- If guest touches virtual node that is the same as the local node
of the current vcpu assume it's a local allocation.

The guest is not making the same assumption; lying to the guest is counterproductive. The big problem is that a local decision takes effect indefinitely.

- On allocation get the underlying page from the correct underlying
node based on a dynamic getcpu relationship.
- Find some way to get rid of unused pages. e.g. keep track of the number of mappings to a page and age or use pv help.


(1) with npt/ept we have no clue as to guest mappings

(2) even without npt/ept, we have no idea how often mappings are used and by which cpu. finding out is expensive.

(3) for many workloads, there are no unused pages. the guest application allocates all memory and manages memory by itself.

The static case is simple. We allocate memory from a few nodes (for small guests, only one) and establish a guest_node -> host_node mapping. vcpus on guest node X are constrained to host node according to this mapping.
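
To make the static case concrete, here is a minimal sketch of the guest_node -> host_node mapping and the vcpu constraint that follows from it. It only models the bookkeeping. The host topology, the round-robin choice, and the names build_node_map and vcpu_affinity are assumptions made for illustration and are not taken from the patch set. Actually applying the resulting cpu sets (for example with sched_setaffinity) and binding the memory are left out.

```python
# Illustrative sketch only: the topology and every name here are assumptions.
from typing import Dict, List, Set


def build_node_map(guest_nodes: List[int], host_nodes: List[int]) -> Dict[int, int]:
    """Map each guest node to one of the chosen host nodes (round-robin)."""
    if not host_nodes:
        raise ValueError("need at least one host node")
    return {g: host_nodes[i % len(host_nodes)] for i, g in enumerate(guest_nodes)}


def vcpu_affinity(
    vcpu_guest_node: Dict[int, int],
    node_map: Dict[int, int],
    host_node_cpus: Dict[int, List[int]],
) -> Dict[int, Set[int]]:
    """Restrict each vcpu to the host cpus of the node backing its guest node."""
    return {
        vcpu: set(host_node_cpus[node_map[gnode]])
        for vcpu, gnode in vcpu_guest_node.items()
    }


# Assumed host: two nodes with four cpus each.
host_node_cpus = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
# Assumed guest: two virtual nodes, two vcpus on each.
vcpu_guest_node = {0: 0, 1: 0, 2: 1, 3: 1}

node_map = build_node_map(guest_nodes=[0, 1], host_nodes=[0, 1])
affinity = vcpu_affinity(vcpu_guest_node, node_map, host_node_cpus)

# A small guest with a single virtual node lands on exactly one host node.
small_guest_map = build_node_map(guest_nodes=[0], host_nodes=[1])
```

The mapping is computed once, before the guest runs, and never revisited. That is what keeps this case simple.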

The dynamic case is really complicated. We can allow vcpus to wander to other cpus on cpu overcommit, but need to pull them back soonish, or alternatively migrate the entire node, taking into account the cost of the migration, cpu availability on the target node, and memory availability on the target node. Since the cost is so huge, this needs to be done on a very coarse scale.
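
The three inputs named above (migration cost, cpu availability on the target, memory availability on the target) can be put into a rough decision function. This is a toy cost model under stated assumptions, not a proposed policy: the HostNode type, the function name, the units, and every number below are invented for illustration.

```python
# Toy cost model; all names, units and numbers are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class HostNode:
    free_cpus: int
    free_pages: int


def should_migrate_node(
    guest_pages: int,
    vcpus: int,
    target: HostNode,
    copy_cost_us_per_page: float,
    remote_penalty_us_per_sec: float,
    horizon_sec: float,
) -> bool:
    """Migrate a whole guest node only if the target can hold it and the
    remote-access penalty saved over the horizon exceeds the copy cost."""
    if target.free_cpus < vcpus or target.free_pages < guest_pages:
        return False
    migration_cost_us = guest_pages * copy_cost_us_per_page
    expected_saving_us = remote_penalty_us_per_sec * horizon_sec
    return expected_saving_us > migration_cost_us


target = HostNode(free_cpus=4, free_pages=1_000_000)

# 1 GB of 4 KB pages; the made-up penalty pays for the copy after about a minute.
long_horizon = should_migrate_node(262_144, 2, target, 2.0, 10_000.0, 60.0)
short_horizon = should_migrate_node(262_144, 2, target, 2.0, 10_000.0, 10.0)

# No free cpus on the target: never migrate, whatever the saving.
busy_target = HostNode(free_cpus=0, free_pages=1_000_000)
no_cpus = should_migrate_node(262_144, 2, busy_target, 2.0, 10_000.0, 60.0)
```

Because the copy cost grows with the size of the node, the decision only flips to "migrate" for long horizons, which is one way to read "a very coarse scale".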
I wrote a scheduler that did that on 2.4 (it was called homenode scheduling),
but it never worked well on small systems. It was moderately successful on
some big NUMA boxes though. The fundamental problem is that not using
a CPU is always worse than using remote memory on the small systems.

Right. The situation I'm trying to avoid is process A with memory on node X running on node Y, and process B with memory on node Y running on node X. The scheduler arrives at a local optimum, caused by some spurious load, and won't move to the global optimum because migrating processes across cpus is considered expensive.

I don't know, perhaps the current scheduler is clever enough to do this already.
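
The crossed placement of A and B can be written down as a small model: count how many processes run away from their memory, then look for a pairwise exchange that lowers the count. This is only a sketch of the situation, not a description of what any Linux scheduler does. The function names and the dictionaries are assumptions for illustration.

```python
# Illustrative model of the A/B example; names are assumptions.
from typing import Dict, Optional, Tuple


def remote_count(placement: Dict[str, str], memory_node: Dict[str, str]) -> int:
    """Number of processes running on a node other than the one holding their memory."""
    return sum(1 for proc, node in placement.items() if memory_node[proc] != node)


def find_improving_swap(
    placement: Dict[str, str], memory_node: Dict[str, str]
) -> Optional[Tuple[str, str]]:
    """Return a pair of processes whose exchange lowers remote_count, or None."""
    procs = sorted(placement)
    base = remote_count(placement, memory_node)
    for i, a in enumerate(procs):
        for b in procs[i + 1:]:
            trial = dict(placement)
            trial[a], trial[b] = placement[b], placement[a]
            if remote_count(trial, memory_node) < base:
                return (a, b)
    return None


memory_node = {"A": "X", "B": "Y"}   # where each process's memory lives
placement = {"A": "Y", "B": "X"}     # where each process currently runs

before = remote_count(placement, memory_node)
swap = find_improving_swap(placement, memory_node)

# Apply the exchange and confirm nothing remains to improve.
a, b = swap
placement[a], placement[b] = placement[b], placement[a]
after = remote_count(placement, memory_node)
```

With one cpu slot per node, moving A or B alone would leave a cpu idle, so a balancer that only considers single moves stays at the crossed placement. Only the exchange of both reaches the global optimum.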

Always migrating memory on CPU migration is also too costly in the general
case, but it might be possible to make it work in the special case of vCPU guests with some tweaks.

Yes, virtual machines are easier since there are a smaller number of mm_structs and tasks compared to more general workloads.


--
error compiling committee.c: too many arguments to function
