Andi Kleen wrote:
> I was more thinking about some heuristic that checks when a page
> is first mapped into user space. The only problem is that it is zeroed
> through the direct mapping before that, but perhaps there is a way
> around it. That's one of the rare cases where 32-bit highmem actually
> makes things easier.
> It might also be easier on some OSes other than Linux, which don't
> use the direct mapping as aggressively.
>> In the context of kvm, the mmap() calls happen before the guest ever

> The mmap call doesn't matter at all; what matters is when the
> page is allocated.


The page is allocated at an uninteresting point in time. For example, the boot loader allocates a bunch of pages.

>> executes. First access happens somewhat later, but still we cannot count
>> on the majority of accesses coming from the same cpu as the first access.

> It is a reasonable heuristic. It's just like the rather
> successful default local allocation heuristic the native kernel uses.

It's very different. The kernel expects that an application which touched page X on node Y will continue using page X on node Y, and because applications know this, they are written to that assumption. In a virtualization context, however, the guest kernel expects page X to belong to whatever node the SRAT table points at, regardless of which cpu touched it first.

Guest kernels behave differently from applications, because real hardware doesn't allocate pages dynamically the way the kernel can for applications.

(btw, what do you do with cpu-less nodes? I think some sgi hardware has them)
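
To make that first-touch contract concrete, here is a minimal userspace sketch (the choice of CPU 0 and the region size are arbitrary): pin a thread, then touch the memory from it, and the kernel's default local allocation policy is expected to place the pages on that CPU's node.

/* Minimal sketch of the first-touch contract an application relies on. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <sys/mman.h>

#define REGION (16UL << 20)     /* 16 MiB, arbitrary */

int main(void)
{
    cpu_set_t set;
    char *p;

    /* Pin ourselves to CPU 0; its node becomes "local". */
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* First touch from the pinned thread: the application now relies
     * on these pages staying on CPU 0's node. */
    memset(p, 1, REGION);
    return 0;
}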

>>> The alternative is to keep your own pools and allocate from the
>>> correct pool, but then you either need pinning or getcpu()
>>
>> This is meaningless in the kvm context. Other than small bits of memory
>> needed for I/O and shadow page tables, the bulk of memory is allocated
>> once.

> Mapped once. Anyway, that could be changed too if there was a need.


Mapped once and allocated once (not at the same time, but fairly close).

We can't change it without changing the guest.
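
For reference, a sketch of the per-node pool alternative quoted above; the pool structure and refill path are hypothetical, and getcpu() is assumed to be the glibc wrapper. It also shows why you need pinning or getcpu(): without pinning, even the node getcpu() reports can be stale by the time the allocation is used.

/* Hypothetical per-node pools, one free list per node. */
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>

#define MAX_NODES 64

struct page_pool {
    void **free_pages;      /* hypothetical per-node free list */
    size_t nr_free;
};

static struct page_pool pools[MAX_NODES];

static void *pool_alloc_local(void)
{
    unsigned int cpu, node;
    struct page_pool *pool;

    /* Ask which node we are on right now; can be stale immediately. */
    if (getcpu(&cpu, &node) < 0 || node >= MAX_NODES)
        node = 0;

    pool = &pools[node];
    if (pool->nr_free == 0)
        return NULL;        /* refilling from node 'node' elided */
    return pool->free_pages[--pool->nr_free];
}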

>>> Basic algorithm:
>>> - If the guest touches a virtual node that is the same as the local
>>>   node of the current vcpu, assume it's a local allocation.
>>
>> The guest is not making the same assumption; lying to the guest is

> Huh? Pretty much all NUMA-aware OSes should. Linux definitely will.


No. Linux will assume a page belongs to the node the SRAT table says it belongs to. Whether the first access comes from the local node depends on the workload. If the first application that runs accesses all memory from a single cpu, we will allocate all memory from one node, and that is wrong.
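
To make the disagreement concrete, here is a compilable userspace model of that basic algorithm; every type and field is a hypothetical stand-in, not a KVM interface.

#include <stddef.h>

struct vm {
    int *gpa_node;       /* virtual node per guest page, from the SRAT layout */
    int *vnode_to_hnode; /* static virtual-node -> host-node map */
};

struct vcpu {
    struct vm *vm;
    int vnode;           /* virtual node this vcpu belongs to */
    int cur_host_node;   /* host node it happens to run on right now */
};

/* Pick the host node to back guest page 'gfn' with on its first fault. */
int pick_backing_node(struct vcpu *vcpu, size_t gfn)
{
    int vnode = vcpu->vm->gpa_node[gfn];

    /* The guest touched memory of its own virtual node: assume a local
     * allocation and use whatever host node the vcpu is running on. */
    if (vnode == vcpu->vnode)
        return vcpu->cur_host_node;

    /* Otherwise fall back to the static mapping. */
    return vcpu->vm->vnode_to_hnode[vnode];
}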

>> (2) even without npt/ept, we have no idea how often mappings are used and
>> by which cpu. finding out is expensive.

> You see a fault on the first mapping. That fault is on the CPU that
> did the access. Therefore you know which one it was.

It's meaningless information. First access means nothing. And again, the guest doesn't expect the page to move to the node where it touched it.

(we also see first access with ept)
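
For reference, a minimal runnable sketch of the mechanism in question, observing the first-touch CPU from a fault; whether that datum is worth anything is exactly what's in dispute. (Calling mprotect() and sched_getcpu() from a signal handler is fine for a demo, not for production code.)

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;
static volatile int first_touch_cpu = -1;

static void on_fault(int sig, siginfo_t *si, void *uc)
{
    (void)sig; (void)uc;
    first_touch_cpu = sched_getcpu();   /* the CPU that did the access */
    mprotect((void *)((uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1)),
             page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
    struct sigaction sa;
    char *page;

    page_size = sysconf(_SC_PAGESIZE);
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    /* Map a page with no access so the very first touch faults. */
    page = mmap(NULL, page_size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;

    page[0] = 1;    /* faults; the handler records the CPU and unprotects */
    printf("first touch came from CPU %d\n", first_touch_cpu);
    return 0;
}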

>> (3) for many workloads, there are no unused pages. the guest application
>> allocates all memory and manages memory by itself.

> First, a common case of a guest using all memory is the file cache,
> but for NUMA purposes file cache locality typically doesn't matter,
> because it isn't accessed frequently enough for non-locality to be a
> problem. Locality really only matters for mappings that are used often
> by the CPU.
>
> When a single application allocates everything and keeps it, that is
> fine too, because you'll give it approximately local memory on the
> initial setup (assuming the application itself has reasonable NUMA
> behaviour under a first-touch local allocation policy).

Sure, for the simple cases it works. But consider your first example followed by the second (you can even reboot the guest in between; the bad assignment sticks).

And if the vcpu moves for some reason, things get screwed up permanently.

We should try to be predictable, rather than depend on behavior the guest has no real reason to exhibit as long as it follows the hardware specs.


--
error compiling committee.c: too many arguments to function
