Andi Kleen wrote:
The page is allocated at an uninteresting point in time. For example, the boot loader allocates a bunch of pages.

The vast majority of pages are allocated when a process wants them
or the kernel uses them for file cache.

Right. Allocated from the guest kernel's perspective. This may be different from the host kernel's perspective.

Linux will delay touching memory until the last moment; Windows will not (likely it zeros pages on their own nodes, but who knows?).

The bigger problem is lifetime. Inside a guest, 'allocation' happens when a page is used for pagecache, or when a process is created and starts using memory. From the host perspective, it happens just once.

It's very different. The kernel expects an application that touched page X on node Y to continue using page X on node Y. Because applications know this, they are written to this assumption. However,

The vast majority of applications do not actually know where memory is.

In our case, the application is the guest kernel, which does know.

What matters is that you get local accesses most of the time for the memory
that is touched on a specific CPU. Even the applications that
know won't break if it's somewhere else, because it's only
an optimization. As long as you're faster on average (or in the worst
case not significantly worse) than not having it, you're fine.

Also, the Linux first-touch policy is a heuristic that can be wrong
later, and I don't see too much difference in having another
heuristic level on top of it.

The difference is, Linux (as a guest) will try to reuse freed pages from an application or pagecache, knowing which node they belong to.

I agree that if all you do is HPC-style computation (boot a kernel and one app with one process per CPU), then the heuristics work well.
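
To make the first-touch point concrete, here is a small userspace sketch (nothing kvm-specific, purely illustrative): it touches a freshly mapped page and then asks the kernel, via move_pages(2) with a NULL "nodes" argument, which NUMA node the page actually landed on. It assumes libnuma's <numaif.h> is available (compile with -lnuma) and a NUMA-capable kernel.

#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        /* First touch: the page is allocated now, on the node of the
         * touching CPU (default local allocation policy). */
        memset(p, 0, len);

        void *pages[1] = { p };
        int status[1] = { -1 };

        /* With nodes == NULL, move_pages() only queries placement. */
        if (move_pages(0, 1, pages, NULL, status, 0) == 0)
                printf("page resides on node %d\n", status[0]);

        munmap(p, len);
        return 0;
}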

The scheme I described is an approximate heuristic to get local
memory access in many cases without pinning anything to CPUs.
It is certainly not perfect and has holes (like any heuristic),
but it has the advantage of being fully dynamic.

It also has the advantage of being already implemented (apart from fake SRAT tables, which aren't necessary for HPC apps).

in a virtualization context, the guest kernel expects that page X belongs to whatever node the SRAT table points at, without regard to the first access.

Guest kernels behave differently from applications, because real hardware doesn't allocate pages dynamically like the kernel can for applications.

Again the kernel just wants local memory access most of the time
for the allocations where it matters.


It does. But the kernel doesn't allocate memory (except for the first time); it recycles memory.

Also, NUMA is always an optimization; it's not a big issue if you're
wrong occasionally, because that doesn't affect correctness.

Agreed.



Mapped once and allocated once (not at the same time, but fairly close).

That seems dubious to me.


That's how it works.

Qemu will mmap() the guest's memory at initialization time. When the guest touches memory, kvm will call get_user_pages_fast() (here's the allocation) and instantiate a pte in the qemu address space, as well as a shadow pte (using either ept/npt two-level maps or direct shadow maps).

With ept/npt, in the absence of swapping, the story ends here. Without ept/npt, the guest will continue to fault, but now get_user_pages_fast() will return the already allocated page from the pte in the qemu address space.
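
For reference, a minimal userspace sketch of the host-side effect just described -- map everything at initialization, allocate only on first touch. This only mirrors the behaviour from userspace; the actual kvm path through get_user_pages_fast() is in-kernel. The 256 MB size is an arbitrary stand-in for guest RAM.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Print the VmRSS line from /proc/self/status. */
static void print_rss(const char *when)
{
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (!f)
                return;
        while (fgets(line, sizeof(line), f))
                if (strncmp(line, "VmRSS:", 6) == 0)
                        printf("%s %s", when, line);
        fclose(f);
}

int main(void)
{
        size_t size = 256UL << 20;      /* stand-in for "guest RAM" */
        char *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ram == MAP_FAILED)
                return 1;

        print_rss("after mmap:");       /* nothing allocated yet */
        memset(ram, 0, size / 2);       /* the "guest" touches half its RAM */
        print_rss("after touching:");   /* ~128 MB now resident */

        munmap(ram, size);
        return 0;
}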

No. Linux will assume a page belongs to the node the SRAT table says it belongs to. Whether first access will be from the local node depends on the workload. If the first application running accesses all memory from a single cpu, we will allocate all memory from one node, but this is wrong.

Sorry I don't get your point.

Yeah, we're talking a bit past each other.  I'll try to expand.

Wrong doesn't make sense in this context.

You seem to be saying that an allocation that is not local on a native
kernel wouldn't be local under the approximate heuristic either. But that's a triviality that is of course true, and likely not what you meant anyway.


I'm saying, that sometimes the guest kernel allocates memory from virtual node A but uses it on virtual node B, due to its memory policy or perhaps due to resource scarcity. Unlike a normal application, the guest kernel still tracks the page as belonging to node A (even though it is used on node B). Because of this, when the page is recycled, the guest kernel will try to assign it to processes running on node A. But the host has allocated it from node B.

When we export a virtual SRAT, we promise to the guest something about memory access latency. The guest will try to optimize according to this SRAT, and if we don't fulfil the promise, it will make incorrect decisions.

So long as a page has a single use in the lifetime of the guest, it doesn't matter. But general purpose applications don't behave like that.
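
As an aside, what the guest sees from that virtual SRAT/SLIT can be inspected from inside the guest: Linux exposes the firmware-derived node distances in sysfs. A small illustration (standard sysfs paths, nothing kvm-specific; assumes the usual /sys layout):

#include <stdio.h>

int main(void)
{
        char path[64], buf[256];

        for (int node = 0; ; node++) {
                snprintf(path, sizeof(path),
                         "/sys/devices/system/node/node%d/distance", node);
                FILE *f = fopen(path, "r");
                if (!f)
                        break;          /* no more nodes */
                if (fgets(buf, sizeof(buf), f))
                        printf("node%d distances: %s", node, buf);
                fclose(f);
        }
        return 0;
}

Whatever the guest kernel's allocator and scheduler optimize for comes from exactly these numbers, so if the host doesn't keep the promise behind them, the guest optimizes for the wrong thing.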

It's meaningless information. First access means nothing. And again,

At least in Linux, the first access to the majority of memory is either through a process page allocation or through a file cache (page cache) allocation. Yes, there are a few boot loader
and temporary kernel pages for which this is not true, but they are a small, insignificant fraction of the total memory
in a reasonably sized guest. I'm just ignoring them.

This can often be observed: if you have a broken DIMM,
you only get problems after running some program that uses
most of your memory.


Right.  If you have a single allocate-once workload, the heuristic works.

Windows will zero all memory in the background, btw.

Sure, for the simple cases it works. But consider your first example followed by the second (you can even reboot the guest in the middle, but the bad assignment sticks).

If you have a heuristic to detect remapping you'll recover on each remapping.

We do, for non-ept/npt.

We should try to be predictable,

NUMA is unfortunately somewhat unpredictable even on native kernels.
There are always situations where a hot page can end up on the wrong node.
That tends to make benchmarkers unhappy. But so far no good general
way is known to avoid it.

Doesn't the cache insulate workloads against small numbers of mislocated pages?

not depend on behavior the guest has no real reason to follow, if it follows hardware specs.

Sorry, Avi, I suspect you have a somewhat unrealistic mental model of NUMA knowledge in applications and OSes.

That may well be.  I haven't programmed large NUMA apps.

At least Linux's behaviour (and, I assume, that of most NUMA-optimized
OSes) will be handled reasonably well by this scheme, I think.
It was just one proposal.

Well, if it is, we'll find out easily, as that's what we implement right now (less migrating-on-remap and providing a virtual SRAT, which isn't even really needed).

Anyway, it might still not work well in practice -- the only way
to find out would be to implement it and try -- but I think
it should not be dismissed out of hand.

I can't dismiss it even if I want to -- it's how kvm works now (well except when a device is assigned).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
