Andi Kleen wrote:
The page is allocated at an uninteresting point in time. For example,
the boot loader allocates a bunch of pages.
The vast majority of pages are allocated when a process wants them
or the kernel uses them for file cache.
Right. Allocated from the guest kernel's perspective. This may be
different from the host kernel's perspective.
Linux will delay touching memory until the last moment; Windows will not
(it likely zeros pages on their own nodes, but who knows?).
The bigger problem is lifetime. Inside a guest, 'allocation' happens
when a page is used for pagecache, or when a process is created and
starts using memory. From the host perspective, it happens just once.
It's very different. The kernel expects an application that touched
page X on node Y to continue using page X on node Y. Applications
know this and are written to that assumption.
However, the vast majority of applications do not actually know where memory is.
In our case, the application is the guest kernel, which does know.
What matters is that you get local accesses most of the time for the memory
that is touched on a specific CPU. Even the applications that
know won't break if it's somewhere else, because it's only
an optimization. As long as you're faster on average (or in the worst
case not significantly worse) than not having it, you're fine.
Also the Linux first touch is a heuristic that can be wrong
later, and I don't see too much difference in having another
heuristic level on top of it.
The difference is, Linux (as a guest) will try to reuse freed pages from
an application or pagecache, knowing which node they belong to.
I agree that if all you do is HPC style computation (boot a kernel and
one app with one process per cpu), then the heuristics work well.
The scheme I described is an approximate heuristic to get local
memory access in many cases without pinning anything to CPUs.
It is certainly not perfect and has holes (like any heuristic),
but it has the advantage of being fully dynamic.
It also has the advantage of being already implemented (apart from fake
SRAT tables; and that isn't necessary for HPC apps).
In a virtualization context, the guest kernel expects that page X
belongs to whatever node the SRAT table points at, without regard to the
first access.
Guest kernels behave differently from applications, because real
hardware doesn't allocate pages dynamically like the kernel can for
applications.
Again the kernel just wants local memory access most of the time
for the allocations where it matters.
It does. But the kernel doesn't allocate memory (except for the first
time); it recycles memory.
Also NUMA is always an optimization, it's not a big issue if you're
wrong occasionally because that doesn't affect correctness.
Agreed.
Mapped once and allocated once (not at the same time, but fairly close).
That seems dubious to me.
That's how it works.
Qemu will mmap() the guest's memory at initialization time. When the
guest touches memory, kvm will call get_user_pages_fast() (here's the
allocation) and instantiate a pte in the qemu address space, as well as
a shadow pte (using either ept/npt two-level maps or direct shadow maps).
With ept/npt, in the absence of swapping, the story ends here. Without
ept/npt, the guest will continue to fault, but now get_user_pages_fast()
will return the already allocated page from the pte in the qemu address
space.
No. Linux will assume a page belongs to the node the SRAT table says it
belongs to. Whether first access will be from the local node depends on
the workload. If the first application running accesses all memory from
a single cpu, we will allocate all memory from one node, but this is wrong.
Sorry I don't get your point.
Yeah, we're talking a bit past each other. I'll try to expand.
"Wrong" doesn't make sense in this context.
You seem to be saying that an allocation that is not local on a native
kernel wouldn't be local under the approximate heuristic either.
But that's a triviality that is of course true, and likely not what
you meant anyway.
I'm saying, that sometimes the guest kernel allocates memory from
virtual node A but uses it on virtual node B, due to its memory policy
or perhaps due to resource scarcity. Unlike a normal application, the
guest kernel still tracks the page as belonging to node A (even though
it is used on node B). Because of this, when the page is recycled, the
guest kernel will try to assign it to processes running on node A. But
the host has allocated it from node B.
When we export a virtual SRAT, we promise to the guest something about
memory access latency. The guest will try to optimize according to this
SRAT, and if we don't fulfil the promise, it will make incorrect decisions.
So long as a page has a single use in the lifetime of the guest, it
doesn't matter. But general purpose applications don't behave like that.
It's meaningless information. First access means nothing. And again,
At least in Linux, the first access to the majority of memory is
either a process page allocation or a page cache allocation
for file data. Yes, there are a few boot loader
and temporary kernel pages for which this is not true,
but they are an insignificant fraction of the total memory
in a reasonably sized guest. I'm just ignoring them.
This can often be observed with a broken DIMM: you only
see problems after running a program that uses
most of your memory.
Right. If you have a single allocate-once workload, the heuristic works.
Windows will zero all memory in the background, btw.
Sure, for the simple cases it works. But consider your first example
followed by the second (you can even reboot the guest in the middle, but
the bad assignment sticks).
If you have a heuristic to detect remapping you'll recover on each remapping.
We do, for non-ept/npt.
We should try to be predictable,
NUMA is unfortunately somewhat unpredictable even on native kernels.
There are always situations where a hot page can end up on the wrong node.
That tends to make benchmarkers unhappy. But so far no good general
way is known to avoid it.
Doesn't the cache insulate workloads against small numbers of mislocated
pages?
not depend on behavior the guest has no
real reason to follow, if it follows hardware specs.
Sorry, Avi, I suspect you have a somewhat unrealistic mental model of
NUMA knowledge in applications and OS.
That may well be. I haven't programmed large NUMA apps.
At least Linux's behaviour (and I assume that of most NUMA-optimized
OSes) will be handled reasonably well by this scheme. I think.
It was just one proposal.
Well, if it is, we'll find out easily, as that's what we implement right
now (less migrating-on-remap and providing a virtual SRAT, which isn't
even really needed).
Anyways it might still not work well in practice -- the only way
to find out would be to implement and try -- but I think
it should not be dismissed out of hand.
I can't dismiss it even if I want to -- it's how kvm works now (well
except when a device is assigned).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.