Andi Kleen wrote:
The page is allocated at an uninteresting point in time. For example, the boot loader allocates a bunch of pages.

The vast majority of pages are allocated when a process wants them
or the kernel uses them for file cache.

Right. Allocated from the guest kernel's perspective. This may be different from the host kernel's perspective.

Linux will delay touching memory until the last moment; Windows will not (likely it zeros pages on their own nodes, but who knows?).

The bigger problem is lifetime. Inside a guest, 'allocation' happens when a page is used for pagecache, or when a process is created and starts using memory. From the host perspective, it happens just once.

It's very different. The kernel expects an application that touched page X on node Y to continue using page X on node Y. Because applications know this, they are written to this assumption. However,

The vast majority of applications do not actually know where memory is.

In our case, the application is the guest kernel, which does know.

What matters is that you get local accesses most of the time for the memory
that is touched on a specific CPU. Even the applications that
know won't break if it's somewhere else, because it's only
an optimization. As long as you're faster on average (or in the worst
case not significantly worse) than not having it, you're fine.

Also, the Linux first-touch policy is a heuristic that can be wrong
later, and I don't see too much difference in having another
heuristic level on top of it.

The difference is, Linux (as a guest) will try to reuse freed pages from an application or pagecache, knowing which node they belong to.

I agree that if all you do is HPC-style computation (boot a kernel and one app with one process per CPU), then the heuristics work well.
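
To make the first-touch point concrete, here is a small userspace sketch (nothing kvm-specific, purely illustrative): it touches a freshly mapped page and then asks the kernel, via move_pages(2) with a NULL "nodes" argument, which NUMA node the page actually landed on. It assumes libnuma's <numaif.h> is available (compile with -lnuma) and a NUMA-capable kernel.

#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        /* First touch: the page is allocated now, on the node of the
         * touching CPU (default local allocation policy). */
        memset(p, 0, len);

        void *pages[1] = { p };
        int status[1] = { -1 };

        /* With nodes == NULL, move_pages() only queries placement. */
        if (move_pages(0, 1, pages, NULL, status, 0) == 0)
                printf("page resides on node %d\n", status[0]);

        munmap(p, len);
        return 0;
}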

The scheme I described is an approximate heuristic to get local
memory access in many cases without pinning anything to CPUs.
It is certainly not perfect and has holes (like any heuristic),
but it has the advantage of being fully dynamic.

It also has the advantage of being already implemented (apart from fake SRAT tables, which aren't necessary for HPC apps).

in a virtualization context, the guest kernel expects that page X belongs to whatever node the SRAT table points at, without regard to the first access.

Guest kernels behave differently from applications, because real hardware doesn't allocate pages dynamically like the kernel can for applications.

Again the kernel just wants local memory access most of the time
for the allocations where it matters.


It does. But the kernel doesn't allocate memory (except for the first time); it recycles memory.

Also, NUMA is always an optimization; it's not a big issue if you're
wrong occasionally, because that doesn't affect correctness.

Agreed.



Mapped once and allocated once (not at the same time, but fairly close).

That seems dubious to me.


That's how it works.

Qemu will mmap() the guest's memory at initialization time. When the guest touches memory, kvm will call get_user_pages_fast() (here's the allocation) and instantiate a pte in the qemu address space, as well as a shadow pte (using either ept/npt two-level maps or direct shadow maps).

With ept/npt, in the absence of swapping, the story ends here. Without ept/npt, the guest will continue to fault, but now get_user_pages_fast() will return the already allocated page from the pte in the qemu address space.
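
For reference, a minimal userspace sketch of the host-side effect just described -- map everything at initialization, allocate only on first touch. This only mirrors the behaviour from userspace; the actual kvm path through get_user_pages_fast() is in-kernel. The 256 MB size is an arbitrary stand-in for guest RAM.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Print the VmRSS line from /proc/self/status. */
static void print_rss(const char *when)
{
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (!f)
                return;
        while (fgets(line, sizeof(line), f))
                if (strncmp(line, "VmRSS:", 6) == 0)
                        printf("%s %s", when, line);
        fclose(f);
}

int main(void)
{
        size_t size = 256UL << 20;      /* stand-in for "guest RAM" */
        char *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ram == MAP_FAILED)
                return 1;

        print_rss("after mmap:");       /* nothing allocated yet */
        memset(ram, 0, size / 2);       /* the "guest" touches half its RAM */
        print_rss("after touching:");   /* ~128 MB now resident */

        munmap(ram, size);
        return 0;
}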

No. Linux will assume a page belongs to the node the SRAT table says it belongs to. Whether first access will be from the local node depends on the workload. If the first application running accesses all memory from a single cpu, we will allocate all memory from one node, but this is wrong.

Sorry I don't get your point.

Yeah, we're talking a bit past each other.  I'll try to expand.

Wrong doesn't make sense in this context.

You seem to be saying that an allocation that is not local on a native
kernel wouldn't be local under the approximate heuristic either. But that's a triviality that is of course true, and likely not what you meant anyway.


I'm saying, that sometimes the guest kernel allocates memory from virtual node A but uses it on virtual node B, due to its memory policy or perhaps due to resource scarcity. Unlike a normal application, the guest kernel still tracks the page as belonging to node A (even though it is used on node B). Because of this, when the page is recycled, the guest kernel will try to assign it to processes running on node A. But the host has allocated it from node B.

When we export a virtual SRAT, we promise to the guest something about memory access latency. The guest will try to optimize according to this SRAT, and if we don't fulfil the promise, it will make incorrect decisions.

So long as a page has a single use in the lifetime of the guest, it doesn't matter. But general purpose applications don't behave like that.
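
As an aside, what the guest sees from that virtual SRAT/SLIT can be inspected from inside the guest: Linux exposes the firmware-derived node distances in sysfs. A small illustration (standard sysfs paths, nothing kvm-specific; assumes the usual /sys layout):

#include <stdio.h>

int main(void)
{
        char path[64], buf[256];

        for (int node = 0; ; node++) {
                snprintf(path, sizeof(path),
                         "/sys/devices/system/node/node%d/distance", node);
                FILE *f = fopen(path, "r");
                if (!f)
                        break;          /* no more nodes */
                if (fgets(buf, sizeof(buf), f))
                        printf("node%d distances: %s", node, buf);
                fclose(f);
        }
        return 0;
}

Whatever the guest kernel's allocator and scheduler optimize for comes from exactly these numbers, so if the host doesn't keep the promise behind them, the guest optimizes for the wrong thing.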

It's meaningless information. First access means nothing. And again,

At least in Linux, the first access to the majority of memory is either through a process page allocation or through a file cache (page cache) allocation. Yes, there are a few boot loader
and temporary kernel pages for which this is not true, but they are a small, insignificant fraction of the total memory
in a reasonably sized guest. I'm just ignoring them.

This can often be observed: if you have a broken DIMM,
you only get problems after running some program that uses
most of your memory.


Right.  If you have a single allocate-once workload, the heuristic works.

Windows will zero all memory in the background, btw.

Sure, for the simple cases it works. But consider your first example followed by the second (you can even reboot the guest in the middle, but the bad assignment sticks).

If you have a heuristic to detect remapping you'll recover on each remapping.

We do, for non-ept/npt.

We should try to be predictable,

NUMA is unfortunately somewhat unpredictable even on native kernels.
There are always situations where a hot page can end up on the wrong node.
That tends to make benchmarkers unhappy. But so far no good general
way is known to avoid it.

Doesn't the cache insulate workloads against small numbers of mislocated pages?

not depend on behavior the guest has no real reason to follow, if it follows hardware specs.

Sorry, Avi, I suspect you have a somewhat unrealistic mental model of NUMA knowledge in applications and OSes.

That may well be.  I haven't programmed large NUMA apps.

At least Linux's behaviour (and, I assume, that of most NUMA-optimized
OSes) will be handled reasonably well by this scheme, I think.
It was just one proposal.

Well, if it is, we'll find out easily, as that's what we implement right now (less migrating-on-remap and providing a virtual SRAT, which isn't even really needed).

Anyway, it might still not work well in practice -- the only way
to find out would be to implement it and try -- but I think
it should not be dismissed out of hand.

I can't dismiss it even if I want to -- it's how kvm works now (well except when a device is assigned).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
