Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Avi Kivity wrote: Anthony Liguori wrote: I see no compelling reason to do cpu placement internally. It can be done quite effectively externally. Memory allocation is tough, but I don't think it's out of reach. Looking at the numactl man page, you can do: numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1. Since we can already create VMs with the -mem-path argument, if you create a 2GB guest and want it to span two numa nodes, you could do: numactl --offset=0G --length=1G --membind=0 --file /dev/shm/A --touch numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch And then create the VM with: qemu-system-x86_64 -mem-path /dev/shm/A -mem 2G ... What's best about this approach is that you get full access to what numactl is capable of. Interleaving, rebalancing, etc. It looks horribly difficult and unintuitive. It forces you to use -mem-path (which is an abomination; the only reason it lives is that we can't allocate large pages without it). As opposed to inventing new options for QEMU that convey all of the same information a slightly different way? We're stuck with -mem-path so we might as well make good use of it. The proposed syntax is: qemu -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3 The new syntax would be: qemu -smp 4 -numa nodes=2,cpus=1:2:3:4,mem=1G:1G -mem-path /dev/hugetlbfs/foo Then you would have to look up the thread ids, run taskset on each of the four vcpu threads, and do: numactl -o 0G -l 1G -m 0 -f /dev/hugetlbfs/foo numactl -o 1G -l 1G -m 1 -f /dev/hugetlbfs/foo This may look like a lot more, but it's not going to be nearly enough to specify a NUMA placement on startup. What if you have a very large NUMA system and want to rebalance virtual machines? You need a mechanism to do this that now has to be exposed through the monitor. In fact, you'll almost certainly introduce a taskset-like monitor command and a numactl-like monitor command. 
Why reinvent the wheel? Plus, taskset and numactl give you a lot of flexibility. All we're going to do by cooking this stuff into QEMU is artificially limit ourselves. Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Anthony Liguori wrote: Avi Kivity wrote: Andre Przywara wrote: Any other useful commands for the monitor? Maybe (temporary) VCPU migration without page migration? Right now vcpu migration is done externally (we export the thread IDs so management can pin them as it wishes). If we add numa support, I think it makes sense to do it internally as well. I suggest using the same syntax for the monitor as for the command line; that's simplest to learn and to implement. I see no compelling reason to do cpu placement internally. It can be done quite effectively externally. Memory allocation is tough, but I don't think it's out of reach. Looking at the numactl man page, you can do: numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1. Since we can already create VMs with the -mem-path argument, if you create a 2GB guest and want it to span two numa nodes, you could do: numactl --offset=0G --length=1G --membind=0 --file /dev/shm/A --touch numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch And then create the VM with: qemu-system-x86_64 -mem-path /dev/shm/A -mem 2G ... What's best about this approach is that you get full access to what numactl is capable of. Interleaving, rebalancing, etc. It looks horribly difficult and unintuitive. It forces you to use -mem-path (which is an abomination; the only reason it lives is that we can't allocate large pages without it). -- error compiling committee.c: too many arguments to function
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Anthony Liguori wrote: Andre Przywara wrote: Hi, this patch series introduces multiple NUMA nodes support within KVM guests. This will improve the performance of guests which are bigger than one node (number of VCPUs and/or amount of memory) and also allows better balancing by making better use of each node's memory. It also improves the one node case by pinning a guest to this node and avoiding access of remote memory from one VCPU. Could you please post this to qemu-devel? There's really nothing KVM specific here. It's almost useless to qemu until it can run vcpus on host threads. I agree it should be posted there though. I think the dependency on libnuma is a bad idea. It's mixing a mechanism (emulating NUMA layout) with a policy (how to do memory/VCPU placement). If you split the NUMA emulation bits into a separate patch series, that has no dependency on the host NUMA topology, I think we should look at the existing mechanisms we have to see if they're sufficient to do static placement on NUMA boundaries. vcpu pinning is easy enough, I think the only place we're lacking is memory layout. Note, that's totally independent of the guest's NUMA characteristics though. You may still want half of memory to be pinned between two nodes even if the guest has no SRAT tables. You can do that easily with numactl. Fine grained control of host numa layout and guest numa emulation are only useful together (one could argue that guest numa emulation is useful by itself, for debugging the guest OS numa algorithms).
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Anthony Liguori wrote: numactl --offset=0G --length=1G --membind=0 --file /dev/shm/A --touch numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch And then create the VM with: qemu-system-x86_64 -mem-path /dev/shm/A -mem 2G ... What's best about this approach is that you get full access to what numactl is capable of. Interleaving, rebalancing, etc. Prefaulting, generating an error when NUMA placement can't be satisfied, hugetlbfs support, yeah, this very much seems like the right thing to do to me. If you care enough about performance to do NUMA placement, you almost certainly are going to be doing hugetlbfs anyway so you get it practically for free. Regards, Anthony Liguori
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Avi Kivity wrote: Andre Przywara wrote: Any other useful commands for the monitor? Maybe (temporary) VCPU migration without page migration? Right now vcpu migration is done externally (we export the thread IDs so management can pin them as it wishes). If we add numa support, I think it makes sense to do it internally as well. I suggest using the same syntax for the monitor as for the command line; that's simplest to learn and to implement. I see no compelling reason to do cpu placement internally. It can be done quite effectively externally. Memory allocation is tough, but I don't think it's out of reach. Looking at the numactl man page, you can do: numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1. Since we can already create VMs with the -mem-path argument, if you create a 2GB guest and want it to span two numa nodes, you could do: numactl --offset=0G --length=1G --membind=0 --file /dev/shm/A --touch numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch And then create the VM with: qemu-system-x86_64 -mem-path /dev/shm/A -mem 2G ... What's best about this approach is that you get full access to what numactl is capable of. Interleaving, rebalancing, etc. Regards, Anthony Liguori
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andre Przywara wrote: Hi, this patch series introduces multiple NUMA nodes support within KVM guests. This will improve the performance of guests which are bigger than one node (number of VCPUs and/or amount of memory) and also allows better balancing by making better use of each node's memory. It also improves the one node case by pinning a guest to this node and avoiding access of remote memory from one VCPU. Could you please post this to qemu-devel? There's really nothing KVM specific here. The user (or better: management application) specifies the host nodes the guest should use: -nodes 2,3 would create a two node guest mapped to node 2 and 3 on the host. These numbers are handed over to libnuma: VCPUs are pinned to the nodes and the allocated guest memory is bound to its respective node. Since libnuma seems not to be installed everywhere, the user has to enable this via configure --enable-numa In the BIOS code an ACPI SRAT table was added, which describes the NUMA topology to the guest. The number of nodes is communicated via the CMOS RAM (offset 0x3E). If someone thinks of this as a bad idea, tell me. I think the dependency on libnuma is a bad idea. It's mixing a mechanism (emulating NUMA layout) with a policy (how to do memory/VCPU placement). If you split the NUMA emulation bits into a separate patch series, that has no dependency on the host NUMA topology, I think we should look at the existing mechanisms we have to see if they're sufficient to do static placement on NUMA boundaries. vcpu pinning is easy enough, I think the only place we're lacking is memory layout. Note, that's totally independent of the guest's NUMA characteristics though. You may still want half of memory to be pinned between two nodes even if the guest has no SRAT tables. Regards, Anthony Liguori To make use of the new BIOS, install the iasl compiler (http://acpica.org/downloads/) and type "make bios" before installing, so the default BIOS will be replaced with the modified one. 
Node over-committing is allowed (-nodes 0,0,0,0); omitting the -nodes parameter reverts to the old behavior. Please apply. Regards, Andre. Patch 1/3: introduce a command line parameter Patch 2/3: allocate guest resources from different host nodes Patch 3/3: generate an appropriate SRAT ACPI table Signed-off-by: Andre Przywara <[EMAIL PROTECTED]>
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Daniel P. Berrange wrote: The only problem is the default option for the host side, as libnuma requires you to explicitly name the nodes. Maybe make the pin: part _not_ optional? I would at least want to pin the memory, one could discuss about the VCPUs... I think keeping it optional makes things more flexible for people invoking KVM. If omitted, then query current CPU pinning to determine which host NUMA nodes to allocate from. Well, -numa itself is optional. But yes, we could use the default cpu affinity mask to derive the default host numa nodes. The topology exposed to a guest will likely be the same every time you launch a particular VM, while the guest <-> host pinning is a point in time decision according to current available resources. Thus some apps / users may find it more convenient to have a fixed set of args they always use to invoke the KVM process, and instead control placement during the fork/exec'ing of KVM by explicitly calling sched_setaffinity or using numactl to launch. It should be easy enough to use sched_getaffinity to query current pinning and from that determine appropriate NUMA nodes, if they leave out the pin= arg. I agree, nice idea.
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
On Mon, Dec 01, 2008 at 03:15:19PM +0100, Andre Przywara wrote: > Avi Kivity wrote: > >>Node over-committing is allowed (-nodes 0,0,0,0), omitting the -nodes > >>parameter reverts to the old behavior. > > > >'-nodes' is too generic a name ('node' could also mean a host). Suggest > >-numanode. > > > >Need more flexibility: specify the range of memory per node, which cpus > >are in the node, relative weights for the SRAT table: > > > > -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3 > > I converted my code to use the new firmware interface. This also makes > it possible to pass more information between qemu and BIOS (which > prevented a more flexible command line in the first version). > So I would opt for the following: > - use numanode (or simply numa?) instead of the misleading -nodes > - allow passing memory sizes, VCPU subsets and host CPU pin info > I would prefer Daniel's version: > -numa <nodes>[,mem:<size>[;<size>...]] > [,cpu:<cpulist>[;<cpulist>...]] > [,pin:<hostnode>[;<hostnode>...]] > > That would allow easy things like -numa 2 (for a two-node guest), not > given options would result in defaults (equally split-up resources). > > The only problem is the default option for the host side, as libnuma > requires you to explicitly name the nodes. Maybe make the pin: part _not_ > optional? I would at least want to pin the memory, one could discuss > about the VCPUs... I think keeping it optional makes things more flexible for people invoking KVM. If omitted, then query current CPU pinning to determine which host NUMA nodes to allocate from. The topology exposed to a guest will likely be the same every time you launch a particular VM, while the guest <-> host pinning is a point in time decision according to current available resources. Thus some apps / users may find it more convenient to have a fixed set of args they always use to invoke the KVM process, and instead control placement during the fork/exec'ing of KVM by explicitly calling sched_setaffinity or using numactl to launch. 
It should be easy enough to use sched_getaffinity to query current pinning and from that determine appropriate NUMA nodes, if they leave out the pin= arg. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andre Przywara wrote: Avi Kivity wrote: Andre Przywara wrote: The user (or better: management application) specifies the host nodes the guest should use: -nodes 2,3 would create a two node guest mapped to node 2 and 3 on the host. These numbers are handed over to libnuma: VCPUs are pinned to the nodes and the allocated guest memory is bound to its respective node. Since libnuma seems not to be installed everywhere, the user has to enable this via configure --enable-numa In the BIOS code an ACPI SRAT table was added, which describes the NUMA topology to the guest. The number of nodes is communicated via the CMOS RAM (offset 0x3E). If someone thinks of this as a bad idea, tell me. There exists now a firmware interface in qemu for this kind of communication. Oh, right you are, I missed that (was well hidden). I was looking at how the BIOS detects memory size and CPU numbers and these methods are quite cumbersome. Why not convert them to the FW_CFG methods (which the qemu side already sets)? To not diverge too much from the original BOCHS BIOS? Mostly. Also, no one felt the urge. Node over-committing is allowed (-nodes 0,0,0,0), omitting the -nodes parameter reverts to the old behavior. '-nodes' is too generic a name ('node' could also mean a host). Suggest -numanode. Need more flexibility: specify the range of memory per node, which cpus are in the node, relative weights for the SRAT table: -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3 I converted my code to use the new firmware interface. This also makes it possible to pass more information between qemu and BIOS (which prevented a more flexible command line in the first version). So I would opt for the following: - use numanode (or simply numa?) 
instead of the misleading -nodes - allow passing memory sizes, VCPU subsets and host CPU pin info I would prefer Daniel's version: -numa <nodes>[,mem:<size>[;<size>...]] [,cpu:<cpulist>[;<cpulist>...]] [,pin:<hostnode>[;<hostnode>...]] That would allow easy things like -numa 2 (for a two-node guest), not given options would result in defaults (equally split-up resources). Yes, that looks good. The only problem is the default option for the host side, as libnuma requires you to explicitly name the nodes. Maybe make the pin: part _not_ optional? I would at least want to pin the memory, one could discuss about the VCPUs... If you can bench it, that would be best. My guess is that we would need to pin the vcpus. Also need a monitor command to change host nodes dynamically: Implementing a monitor interface is a good idea. (qemu) numanode 1 0 Does that include page migration? That would be easily possible with mbind(MPOL_MF_MOVE), but would take some time and resources (which I think is OK if explicitly triggered in the monitor). Yes, that's the main interest. Allow management to load balance numa nodes (as Linux doesn't do so automatically for long running processes). Any other useful commands for the monitor? Maybe (temporary) VCPU migration without page migration? Right now vcpu migration is done externally (we export the thread IDs so management can pin them as it wishes). If we add numa support, I think it makes sense to do it internally as well. I suggest using the same syntax for the monitor as for the command line; that's simplest to learn and to implement.
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Avi Kivity wrote: Andre Przywara wrote: The user (or better: management application) specifies the host nodes the guest should use: -nodes 2,3 would create a two node guest mapped to node 2 and 3 on the host. These numbers are handed over to libnuma: VCPUs are pinned to the nodes and the allocated guest memory is bound to its respective node. Since libnuma seems not to be installed everywhere, the user has to enable this via configure --enable-numa In the BIOS code an ACPI SRAT table was added, which describes the NUMA topology to the guest. The number of nodes is communicated via the CMOS RAM (offset 0x3E). If someone thinks of this as a bad idea, tell me. There exists now a firmware interface in qemu for this kind of communication. Oh, right you are, I missed that (was well hidden). I was looking at how the BIOS detects memory size and CPU numbers and these methods are quite cumbersome. Why not convert them to the FW_CFG methods (which the qemu side already sets)? To not diverge too much from the original BOCHS BIOS? Node over-committing is allowed (-nodes 0,0,0,0), omitting the -nodes parameter reverts to the old behavior. '-nodes' is too generic a name ('node' could also mean a host). Suggest -numanode. Need more flexibility: specify the range of memory per node, which cpus are in the node, relative weights for the SRAT table: -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3 I converted my code to use the new firmware interface. This also makes it possible to pass more information between qemu and BIOS (which prevented a more flexible command line in the first version). So I would opt for the following: - use numanode (or simply numa?) instead of the misleading -nodes - allow passing memory sizes, VCPU subsets and host CPU pin info I would prefer Daniel's version: -numa <nodes>[,mem:<size>[;<size>...]] [,cpu:<cpulist>[;<cpulist>...]] [,pin:<hostnode>[;<hostnode>...]] That would allow easy things like -numa 2 (for a two-node guest), not given options would result in defaults (equally split-up resources). 
The only problem is the default option for the host side, as libnuma requires you to explicitly name the nodes. Maybe make the pin: part _not_ optional? I would at least want to pin the memory, one could discuss about the VCPUs... Also need a monitor command to change host nodes dynamically: Implementing a monitor interface is a good idea. (qemu) numanode 1 0 Does that include page migration? That would be easily possible with mbind(MPOL_MF_MOVE), but would take some time and resources (which I think is OK if explicitly triggered in the monitor). Any other useful commands for the monitor? Maybe (temporary) VCPU migration without page migration? Regards, Andre. -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany Tel: +49 351 277-84917 to satisfy European Law for business letters: AMD Saxony Limited Liability Company & Co. KG, Wilschdorfer Landstr. 101, 01109 Dresden, Germany Register Court Dresden: HRA 4896, General Partner authorized to represent: AMD Saxony LLC (Wilmington, Delaware, US) General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
RE: [PATCH 0/3] KVM-userspace: add NUMA support for guests
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Avi Kivity Sent: Sunday, November 30, 2008 4:50 PM To: Andi Kleen Cc: Andre Przywara; kvm@vger.kernel.org Subject: Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests > Well, testing is the only way to know. I'm particularly interested in > how Windows will perform, since we know so little about its internals. > > From some light googling, it looks like Windows has a home node for a > thread, and will allocate pages from the home node even when the thread > is executing on some other node temporarily. It also does automatic > page migration in some cases. Well, there's a couple of ways this works: - A program can explicitly specify which NUMA node to use for NUMA-aware apps using the new NUMA-enabled APIs. - Otherwise, a default NUMA node is chosen. For older systems (believe prior to Vista), the default NUMA node is the processor upon which the thread was running. For modern systems, the ideal processor for the thread is used. (The ideal processor is which proc the scheduler will try and run the thread on. It is not a hard association as with an affinity mask, in that the scheduler will run the thread on a different processor if it has to, but it will prefer the ideal processor.) I think later versions of SQL Server are probably a good bet if you want to test something on Windows that is fully NUMA-aware. Most smaller stuff is of course using the default policies. - S
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andi Kleen wrote: On Sun, Nov 30, 2008 at 10:07:01PM +0200, Avi Kivity wrote: Right. Allocated from the guest kernel's perspective. This may be different from the host kernel's perspective. Linux will delay touching memory until the last moment, Windows will not (likely it zeros pages on their own nodes, but who knows)? The problem on Linux is that the first touch is clear_page() and that unfortunately happens in the direct mapping before mapping, so the "detect mapping" trick doesn't quite work (unless it's a 32bit highmem page). It should still be on the same cpu. Ok one could migrate it on mapping. When the data is still cache hot that shouldn't be that expensive. Thinking about it again it might be actually a reasonable approach. Could also work for normal apps - move code and data to local node. But again, we don't have any guest mapping information when we're running under npt; only the first access. If we're willing to sacrifice memory, we can get the first access per virtual node. In our case, the application is the guest kernel, which does know. It knows but it doesn't really care all that much. The only thing that counts is the end performance in this case. Well, testing is the only way to know. I'm particularly interested in how Windows will perform, since we know so little about its internals. From some light googling, it looks like Windows has a home node for a thread, and will allocate pages from the home node even when the thread is executing on some other node temporarily. It also does automatic page migration in some cases. The difference is, Linux (as a guest) will try to reuse freed pages from an application or pagecache, knowing which node they belong to. I agree that if all you do is HPC style computation (boot a kernel and one app with one process per cpu), then the heuristics work well. Or if there's a way to detect unmapping/remapping. Sure, if you're willing to drop npt. 
It is certainly not perfect and has holes (like any heuristics), but it has the advantage of being fully dynamic. It also has the advantage of being already implemented (apart from fake SRAT tables; and that isn't necessary for HPC apps). What do you mean? Which part? Being already implemented? Like I said earlier, right now kvm will allocate memory from the process that runs the vcpu that first touched this memory. Given that Linux prefers allocating from the current node, we already implement the first touch heuristic. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
On Sun, Nov 30, 2008 at 10:07:01PM +0200, Avi Kivity wrote: > Right. Allocated from the guest kernel's perspective. This may be > different from the host kernel's perspective. > > Linux will delay touching memory until the last moment, Windows will not > (likely it zeros pages on their own nodes, but who knows)? The problem on Linux is that the first touch is clear_page() and that unfortunately happens in the direct mapping before mapping, so the "detect mapping" trick doesn't quite work (unless it's a 32bit highmem page). Ok one could migrate it on mapping. When the data is still cache hot that shouldn't be that expensive. Thinking about it again it might be actually a reasonable approach. > > The bigger problem is lifetime. Inside a guest, 'allocation' happens > when a page is used for pagecache, or when a process is created and > starts using memory. From the host perspective, it happens just once. Yes, that's a problem. I discussed some ways to get around that earlier. > > >>It's very different. The kernel expects an application that touched > >>page X on node Y to continue using page X on node Y. Because > >>applications know this, they are written to this assumption. However, > >> > > > >The far majority of applications do not actually know where memory is. > > > > In our case, the application is the guest kernel, which does know. It knows but it doesn't really care all that much. The only thing that counts is the end performance in this case. [Some people also use NUMA policy to partition machines, but that's ok in this case because that only needs the same fixed guest physical addresses which is guaranteed of course] > The difference is, Linux (as a guest) will try to reuse freed pages from > an application or pagecache, knowing which node they belong to. > > I agree that if all you do is HPC style computation (boot a kernel and > one app with one process per cpu), then the heuristics work well. Or if there's a way to detect unmapping/remapping. 
> >It is certainly not perfect and has holes (like any heuristics), > >but it has the advantage of being fully dynamic. > > > > It also has the advantage of being already implemented (apart from fake > SRAT tables; and that isn't necessary for HPC apps). What do you mean? -Andi
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Skywing wrote: The far majority of pages are allocated when a process wants them or the kernel uses them for file cache. Is that not going to be fairly guest-specific? For example, Windows has a thread that does background zeroing of unallocated pages that aren't marked as zeroed already. I'd imagine that touching such pages would translate to an "allocation" as far as any hypervisor would be concerned. Yes. Most likely the thread runs on the same node as the memory, so it stays local. Still, we need to keep the vcpu within the memory node somehow, otherwise all memory becomes non-local.
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andi Kleen wrote: The page is allocated at an uninteresting point in time. For example, the boot loader allocates a bunch of pages. The far majority of pages are allocated when a process wants them or the kernel uses them for file cache. Right. Allocated from the guest kernel's perspective. This may be different from the host kernel's perspective. Linux will delay touching memory until the last moment, Windows will not (likely it zeros pages on their own nodes, but who knows)? The bigger problem is lifetime. Inside a guest, 'allocation' happens when a page is used for pagecache, or when a process is created and starts using memory. From the host perspective, it happens just once. It's very different. The kernel expects an application that touched page X on node Y to continue using page X on node Y. Because applications know this, they are written to this assumption. However, The far majority of applications do not actually know where memory is. In our case, the application is the guest kernel, which does know. What matters is that you get local accesses most of the time for the memory that is touched on a specific CPU. Even the applications who know won't break if it's somewhere else, because it's only an optimization. As long as you're faster on average (or in the worst case not significantly worse) than not having it you're fine. Also the Linux first touch is a heuristic that can be wrong later, and I don't see too much difference in having another heuristic level on top of it. The difference is, Linux (as a guest) will try to reuse freed pages from an application or pagecache, knowing which node they belong to. I agree that if all you do is HPC style computation (boot a kernel and one app with one process per cpu), then the heuristics work well. The scheme I described is an approximate heuristic to get local memory access in many cases without pinning anything to CPUs. 
It is certainly not perfect and has holes (like any heuristics), but it has the advantage of being fully dynamic. It also has the advantage of being already implemented (apart from fake SRAT tables; and that isn't necessary for HPC apps). In a virtualization context, the guest kernel expects that page X belongs to whatever node the SRAT table points at, without regard to the first access. Guest kernels behave differently from applications, because real hardware doesn't allocate pages dynamically like the kernel can for applications. Again the kernel just wants local memory access most of the time for the allocations where it matters. It does. But the kernel doesn't allocate memory (except for the first time); it recycles memory. Also NUMA is always an optimization, it's not a big issue if you're wrong occasionally because that doesn't affect correctness. Agreed. Mapped once and allocated once (not at the same time, but fairly close). That seems dubious to me. That's how it works. Qemu will mmap() the guest's memory at initialization time. When the guest touches memory, kvm will call get_user_pages_fast() (here's the allocation) and instantiate a pte in the qemu address space, as well as a shadow pte (using either ept/npt two-level maps or direct shadow maps). With ept/npt, in the absence of swapping, the story ends here. Without ept/npt, the guest will continue to fault, but now get_user_pages_fast() will return the already allocated page from the pte in the qemu address space. No. Linux will assume a page belongs to the node the SRAT table says it belongs to. Whether first access will be from the local node depends on the workload. If the first application running accesses all memory from a single cpu, we will allocate all memory from one node, but this is wrong. Sorry I don't get your point. Yeah, we're talking a bit past each other. I'll try to expand. Wrong doesn't make sense in this context. 
You seem to be saying that an allocation that is not local on a native kernel wouldn't be local in the approximate heuristic either. But that's a triviality that is of course true and likely not what you meant anyways. I'm saying, that sometimes the guest kernel allocates memory from virtual node A but uses it on virtual node B, due to its memory policy or perhaps due to resource scarcity. Unlike a normal application, the guest kernel still tracks the page as belonging to node A (even though it is used on node B). Because of this, when the page is recycled, the guest kernel will try to assign it to processes running on node A. But the host has allocated it from node B. When we export a virtual SRAT, we promise to the guest something about memory access latency. The guest will try to optimize according to this SRAT, and if we don't fulfil the promise, it will make incorrect decisions. So long as a page has a single use in the lifetime of the guest, it doesn't matter. But ge
RE: [PATCH 0/3] KVM-userspace: add NUMA support for guests
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen Sent: Sunday, November 30, 2008 1:56 PM To: Avi Kivity Cc: Andi Kleen; Andre Przywara; kvm@vger.kernel.org Subject: Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests > > The page is allocated at an uninteresting point in time. For example, > > the boot loader allocates a bunch of pages. > > The vast majority of pages are allocated when a process wants them > or the kernel uses them for file cache. Is that not going to be fairly guest-specific? For example, Windows has a thread that does background zeroing of unallocated pages that aren't marked as zeroed already. I'd imagine that touching such pages would translate to an "allocation" as far as any hypervisor would be concerned. - S -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
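The mismatch the thread keeps circling around can be made concrete with a toy simulation (illustrative Python only, not kvm code; all names are invented): whichever guest context touches a page first — a boot loader, or Windows' background zeroing thread — fixes the host node of the backing page once and for all, while the guest keeps trusting its virtual SRAT when it later recycles the page.

```python
# Illustrative simulation: host-side first-touch allocation vs. the
# guest's SRAT-based view of where a page lives.

# Guest view: the (virtual) SRAT says page 0 belongs to virtual node 0,
# and the guest tracks this for the lifetime of the guest.
guest_node_of_page = {0: 0}

# Host view: the backing page is physically allocated on first touch.
host_node_of_page = {}

def host_first_touch(page, touching_host_node):
    # The host allocates a backing page only once; later touches reuse
    # the same physical page regardless of which node touches it.
    host_node_of_page.setdefault(page, touching_host_node)

# First touch happens to come from a vcpu running on host node 1
# (e.g. a zeroing thread, or a scheduler migration at boot time).
host_first_touch(0, touching_host_node=1)
# A later touch from node 0 changes nothing.
host_first_touch(0, touching_host_node=0)

# The guest frees the page and recycles it for a process on virtual
# node 0, as the SRAT promised -- but the backing stays on host node 1.
assert guest_node_of_page[0] == 0
assert host_node_of_page[0] == 1   # permanent mismatch
```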
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
> The page is allocated at an uninteresting point in time. For example, > the boot loader allocates a bunch of pages. The vast majority of pages are allocated when a process wants them or the kernel uses them for file cache. > > >>executes. First access happens somewhat later, but still we cannot > >>count on the majority of accesses to come from the same cpu as the first > >>access. > >> > > > >It is a reasonable heuristic. It's just like the rather > >successful default local allocation heuristic the native kernel uses. > > > > It's very different. The kernel expects an application that touched > page X on node Y to continue using page X on node Y. Because > applications know this, they are written to this assumption. However, The vast majority of applications do not actually know where memory is. What matters is that you get local accesses most of the time for the memory that is touched on a specific CPU. Even the applications that know won't break if it's somewhere else, because it's only an optimization. As long as you're faster on average (or in the worst case not significantly worse) than not having it you're fine. Also the Linux first touch is a heuristic that can be wrong later, and I don't see too much difference in having another heuristic level on top of it. The scheme I described is an approximate heuristic to get local memory access in many cases without pinning anything to CPUs. It is certainly not perfect and has holes (like any heuristics), but it has the advantage of being fully dynamic. > in a virtualization context, the guest kernel expects that page X > belongs to whatever node the SRAT table points at, without regard to the > first access. > > Guest kernels behave differently from applications, because real > hardware doesn't allocate pages dynamically like the kernel can for > applications. Again the kernel just wants local memory access most of the time for the allocations where it matters. 
Also NUMA is always an optimization, it's not a big issue if you're wrong occasionally because that doesn't affect correctness. > > (btw, what do you do with cpu-less nodes? I think some sgi hardware has > them) You assign them to a nearby node, but it's really a totally unimportant corner case. > > >>>The alternative is to keep your own pools and allocate from the > >>>correct pool, but then you either need pinning or getcpu() > >>> > >>> > >>This is meaningless in kvm context. Other than small bits of memory > >>needed for I/O and shadow page tables, the bulk of memory is allocated > >>once. > >> > > > >Mapped once. Anyways that could be changed too if there was need. > > > > > > Mapped once and allocated once (not at the same time, but fairly close). That seems dubious to me. > No. Linux will assume a page belongs to the node the SRAT table says it > belongs to. Whether first access will be from the local node depends on > the workload. If the first application running accesses all memory from > a single cpu, we will allocate all memory from one node, but this is wrong. Sorry I don't get your point. Wrong doesn't make sense in this context. You seem to be saying that an allocation that is not local on a native kernel wouldn't be local in the approximate heuristic either. But that's a triviality that is of course true and likely not what you meant anyways. > > >>(2) even without npt/ept, we have no idea how often mappings are used > >>and by which cpu. finding out is expensive. > >> > > > >You see a fault on the first mapping. That fault is on the CPU that > >did the access. Therefore you know which one it was. > > > > It's meaningless information. First access means nothing. And again, At least in Linux the first access to the majority of memory is either through a process page allocation or through a file cache page cache allocation. 
Yes there are a few boot loader and temporary kernel pages for which this is not true, but they are a small insignificant fraction of the total memory in a reasonably sized guest. I'm just ignoring them. This can often be observed in that if you have a broken DIMM you only get problems after using some program that uses most of your memory. > the guest doesn't expect the page to move to the node where it touched it. The only thing the guest cares about is to get good performance on the memory access. And even there only if the page is actually used often (i.e. mapped to a process). How that is achieved is not its concern. > > (we also see first access with ept) Great. > > >>(3) for many workloads, there are no unused pages. the guest > >>application allocates all memory and manages memory by itself. > >> > > > >First a common case of guest using all memory is file cache, > >but for NUMA purposes file cache locality typically doesn't > >matter because it's not accessed frequently enough that > >non locality is a problem. It really only matters for mapping > >th
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andi Kleen wrote: I was more thinking about some heuristics that checks when a page is first mapped into user space. The only problem is that it is zeroed through the direct mapping before, but perhaps there is a way around it. That's one of the rare cases when 32bit highmem actually makes things easier. It might also be easier on some other OS than Linux that don't use direct mapping that aggressively. In the context of kvm, the mmap() calls happen before the guest ever The mmap call doesn't matter at all, what matters is when the page is allocated. The page is allocated at an uninteresting point in time. For example, the boot loader allocates a bunch of pages. executes. First access happens somewhat later, but still we cannot count on the majority of accesses to come from the same cpu as the first access. It is a reasonable heuristic. It's just like the rather successful default local allocation heuristic the native kernel uses. It's very different. The kernel expects an application that touched page X on node Y to continue using page X on node Y. Because applications know this, they are written to this assumption. However, in a virtualization context, the guest kernel expects that page X belongs to whatever node the SRAT table points at, without regard to the first access. Guest kernels behave differently from applications, because real hardware doesn't allocate pages dynamically like the kernel can for applications. (btw, what do you do with cpu-less nodes? I think some sgi hardware has them) The alternative is to keep your own pools and allocate from the correct pool, but then you either need pinning or getcpu() This is meaningless in kvm context. Other than small bits of memory needed for I/O and shadow page tables, the bulk of memory is allocated once. Mapped once. Anyways that could be changed too if there was need. Mapped once and allocated once (not at the same time, but fairly close). We can't change it without changing the guest. 
Basic algorithm: - If guest touches virtual node that is the same as the local node of the current vcpu assume it's a local allocation. The guest is not making the same assumption; lying to the guest is Huh? Pretty much all NUMA aware OS should. Linux will definitely. No. Linux will assume a page belongs to the node the SRAT table says it belongs to. Whether first access will be from the local node depends on the workload. If the first application running accesses all memory from a single cpu, we will allocate all memory from one node, but this is wrong. (2) even without npt/ept, we have no idea how often mappings are used and by which cpu. finding out is expensive. You see a fault on the first mapping. That fault is on the CPU that did the access. Therefore you know which one it was. It's meaningless information. First access means nothing. And again, the guest doesn't expect the page to move to the node where it touched it. (we also see first access with ept) (3) for many workloads, there are no unused pages. the guest application allocates all memory and manages memory by itself. First a common case of guest using all memory is file cache, but for NUMA purposes file cache locality typically doesn't matter because it's not accessed frequently enough that non locality is a problem. It really only matters for mapping that are used often by the CPU. When a single application allocates everything and keeps it that is fine too because you'll give it approximately local memory on the initial set up (assuming the application has reasonable NUMA behaviour by itself on a first touch local allocation policy) Sure, for the simple cases it works. But consider your first example followed by the second (you can even reboot the guest in the middle, but the bad assignment sticks). And if the vcpu moves for some reason, things get screwed up permanently. We should try to be predictable, not depend on behavior the guest has no real reason to follow, if it follows hardware specs. 
-- error compiling committee.c: too many arguments to function
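Andi's "basic algorithm" quoted above can be sketched in a few lines (a hedged, illustrative Python sketch; the function and parameter names are invented, and real code would live in kvm and key off getcpu() plus NUMA-aware page allocation):

```python
# Sketch of the heuristic: on a guest page fault, if the faulting page's
# virtual node matches the virtual node of the faulting vcpu, treat it
# as a local allocation and back it from the host node the vcpu is
# currently running on; otherwise honour the virtual node's host mapping.

def pick_host_node(page_vnode, vcpu_vnode, vcpu_host_node, vnode_to_hnode):
    if page_vnode == vcpu_vnode:
        # Looks like a node-local allocation: use the host node the
        # vcpu actually runs on (a dynamic, getcpu()-style lookup).
        return vcpu_host_node
    # The guest explicitly touched a remote virtual node (e.g. for its
    # own NUMA policy): map that virtual node to its host node.
    return vnode_to_hnode[page_vnode]

vnode_to_hnode = {0: 0, 1: 1}
# Local-looking touch: follow the vcpu's current host node (2 here).
assert pick_host_node(0, 0, vcpu_host_node=2, vnode_to_hnode=vnode_to_hnode) == 2
# Explicit remote touch: follow the virtual node's mapping instead.
assert pick_host_node(1, 0, vcpu_host_node=2, vnode_to_hnode=vnode_to_hnode) == 1
```

Avi's objections (1)-(3) above are precisely about the inputs this sketch takes for granted: with npt/ept the host never sees the per-page faults that would drive it.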
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
On Sun, Nov 30, 2008 at 07:11:40PM +0200, Avi Kivity wrote: > Andi Kleen wrote: > >On Sun, Nov 30, 2008 at 06:38:14PM +0200, Avi Kivity wrote: > > > >>The guest allocates when it touches the page for the first time. This > >>means very little since all of memory may be touched during guest bootup > >>or shortly afterwards. Even if not, it is still a one-time operation, > >>and any choices we make based on it will last the lifetime of the guest. > >> > > > >I was more thinking about some heuristics that checks when a page > >is first mapped into user space. The only problem is that it is zeroed > >through the direct mapping before, but perhaps there is a way around it. > >That's one of the rare cases when 32bit highmem actually makes things > >easier. > >It might also be easier on some other OS than Linux that don't use > >direct mapping that aggressively. > > > > In the context of kvm, the mmap() calls happen before the guest ever The mmap call doesn't matter at all, what matters is when the page is allocated. > executes. First access happens somewhat later, but still we cannot > count on the majority of accesses to come from the same cpu as the first > access. It is a reasonable heuristic. It's just like the rather successful default local allocation heuristic the native kernel uses. > > > >The alternative is to keep your own pools and allocate from the > >correct pool, but then you either need pinning or getcpu() > > > > This is meaningless in kvm context. Other than small bits of memory > needed for I/O and shadow page tables, the bulk of memory is allocated > once. Mapped once. Anyways that could be changed too if there was need. > > >>We need to mimic real hardware. > >> > > > >The underlying allocation is in pages, so the NUMA affinity can > >be as well handled by this. > > > >Basic algorithm: > >- If guest touches virtual node that is the same as the local node > >of the current vcpu assume it's a local allocation. 
> > > > The guest is not making the same assumption; lying to the guest is Huh? Pretty much all NUMA aware OS should. Linux will definitely. > (1) with npt/ept we have no clue as to guest mappings Yes that is tricky. With A bits in theory it could be made to work with EPT, but there are none, and it would still not work very well. > (2) even without npt/ept, we have no idea how often mappings are used > and by which cpu. finding out is expensive. You see a fault on the first mapping. That fault is on the CPU that did the access. Therefore you know which one it was. > (3) for many workloads, there are no unused pages. the guest > application allocates all memory and manages memory by itself. First a common case of guest using all memory is file cache, but for NUMA purposes file cache locality typically doesn't matter because it's not accessed frequently enough that non locality is a problem. It really only matters for mapping that are used often by the CPU. When a single application allocates everything and keeps it that is fine too because you'll give it approximately local memory on the initial set up (assuming the application has reasonable NUMA behaviour by itself on a first touch local allocation policy) When there's lots of remapping/new processes one would probably need some heuristics to detect reallocations, like the mapping heuristics I described earlier or PV help. > Right. The situation I'm trying to avoid is process A with memory on > node X running on node Y, and process B with memory on node Y running on > node X. The scheduler arrives at a local optimum, caused by some > spurious load, and won't move to the global optimum because migrating > processes across cpus is considered expensive. > > I don't know, perhaps the current scheduler is clever enough to do this > already. It tries to, but there are always extreme cases where it doesn't work. Also once a process is migrated it won't find its way back to its memory. 
Still for an approximate dynamic solution trusting it is not the worst you can do. -Andi
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andi Kleen wrote: On Sun, Nov 30, 2008 at 06:38:14PM +0200, Avi Kivity wrote: The guest allocates when it touches the page for the first time. This means very little since all of memory may be touched during guest bootup or shortly afterwards. Even if not, it is still a one-time operation, and any choices we make based on it will last the lifetime of the guest. I was more thinking about some heuristics that checks when a page is first mapped into user space. The only problem is that it is zeroed through the direct mapping before, but perhaps there is a way around it. That's one of the rare cases when 32bit highmem actually makes things easier. It might also be easier on some other OS than Linux that don't use direct mapping that aggressively. In the context of kvm, the mmap() calls happen before the guest ever executes. First access happens somewhat later, but still we cannot count on the majority of accesses to come from the same cpu as the first access. This is roughly equivalent of getting a fresh new demand fault page, but doesn't require to unmap/free/remap. Lost again, sorry. free/unmap/remap gives you normally local memory. I tend to call it poor man's NUMA policy API. The alternative is to keep your own pools and allocate from the correct pool, but then you either need pinning or getcpu() This is meaningless in kvm context. Other than small bits of memory needed for I/O and shadow page tables, the bulk of memory is allocated once. Guest processes may repeatedly allocate and free memory, but kvm will never see this. We need to mimic real hardware. The underlying allocation is in pages, so the NUMA affinity can be as well handled by this. Basic algorithm: - If guest touches virtual node that is the same as the local node of the current vcpu assume it's a local allocation. The guest is not making the same assumption; lying to the guest is counterproductive. The big problem is that a local decision takes effect indefinitely. 
- On allocation get the underlying page from the correct underlying node based on a dynamic getcpu relationship. - Find some way to get rid of unused pages. e.g. keep track of the number of mappings to a page and age or use pv help. (1) with npt/ept we have no clue as to guest mappings (2) even without npt/ept, we have no idea how often mappings are used and by which cpu. finding out is expensive. (3) for many workloads, there are no unused pages. the guest application allocates all memory and manages memory by itself. The static case is simple. We allocate memory from a few nodes (for small guests, only one) and establish a guest_node -> host_node mapping. vcpus on guest node X are constrained to host node according to this mapping. The dynamic case is really complicated. We can allow vcpus to wander to other cpus on cpu overcommit, but need to pull them back soonish, or alternatively migrate the entire node, taking into account the cost of the migration, cpu availability on the target node, and memory availability on the target node. Since the cost is so huge, this needs to be done on a very coarse scale. I wrote a scheduler that did that on 2.4 (it was called homenode scheduling), but it never worked well on small systems. It was moderately successful on some big NUMA boxes though. The fundamental problem is that not using a CPU is always worse than using remote memory on the small systems. Right. The situation I'm trying to avoid is process A with memory on node X running on node Y, and process B with memory on node Y running on node X. The scheduler arrives at a local optimum, caused by some spurious load, and won't move to the global optimum because migrating processes across cpus is considered expensive. I don't know, perhaps the current scheduler is clever enough to do this already. Always migrating memory on CPU migration is also too costly in the general case, but it might be possible to make it work in the special case of vCPU guests with some tweaks. 
Yes, virtual machines are easier since there is a smaller number of mm_structs and tasks compared to more general workloads. -- error compiling committee.c: too many arguments to function
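The "static case" Avi describes is easy to sketch (hypothetical names; actual enforcement would go through sched_setaffinity(2) for the vcpu threads and mbind(2)/numactl for the memory ranges):

```python
# Minimal sketch of static placement: a fixed guest_node -> host_node
# mapping, with every vcpu constrained to the host node backing its
# guest node. All values here are illustrative.

guest_to_host_node = {0: 2, 1: 3}            # 2-node guest on host nodes 2 and 3
vcpu_guest_node = {0: 0, 1: 0, 2: 1, 3: 1}   # 4 vcpus, two per guest node

def allowed_host_node(vcpu):
    # A vcpu may only run on (and allocate from) the host node that
    # backs its guest node; this is what the mapping promises the SRAT.
    return guest_to_host_node[vcpu_guest_node[vcpu]]

assert allowed_host_node(0) == 2
assert allowed_host_node(3) == 3
```

The dynamic case then amounts to deciding when (if ever) to rewrite `guest_to_host_node` wholesale, since the SRAT promise must hold for the whole guest node at once.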
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
On Sun, Nov 30, 2008 at 06:38:14PM +0200, Avi Kivity wrote: > The guest allocates when it touches the page for the first time. This > means very little since all of memory may be touched during guest bootup > or shortly afterwards. Even if not, it is still a one-time operation, > and any choices we make based on it will last the lifetime of the guest. I was more thinking about some heuristics that checks when a page is first mapped into user space. The only problem is that it is zeroed through the direct mapping before, but perhaps there is a way around it. That's one of the rare cases when 32bit highmem actually makes things easier. It might also be easier on some other OS than Linux that don't use direct mapping that aggressively. > > >This is roughly equivalent of getting a fresh new demand fault page, > >but doesn't require to unmap/free/remap. > > > > Lost again, sorry. free/unmap/remap gives you normally local memory. I tend to call it poor man's NUMA policy API. The alternative is to keep your own pools and allocate from the correct pool, but then you either need pinning or getcpu() > > >The tricky bit is probably figuring out what is a fresh new page for > >the guest. That might need some paravirtual help. > > > > The guest typically recycles its own pages (exception is ballooning). > Also it doesn't make sense to manage this on a per page basis as the > guest won't do that. > We need to mimic real hardware. The underlying allocation is in pages, so the NUMA affinity can be as well handled by this. Basic algorithm: - If guest touches virtual node that is the same as the local node of the current vcpu assume it's a local allocation. - On allocation get the underlying page from the correct underlying node based on a dynamic getcpu relationship. - Find some way to get rid of unused pages. e.g. keep track of the number of mappings to a page and age or use pv help. > The static case is simple. 
We allocate memory from a few nodes (for > small guests, only one) and establish a guest_node -> host_node > mapping. vcpus on guest node X are constrained to host node according > to this mapping. > > The dynamic case is really complicated. We can allow vcpus to wander to > other cpus on cpu overcommit, but need to pull them back soonish, or > alternatively migrate the entire node, taking into account the cost of > the migration, cpu availability on the target node, and memory > availability on the target node. Since the cost is so huge, this needs > to be done on a very coarse scale. I wrote a scheduler that did that on 2.4 (it was called homenode scheduling), but it never worked well on small systems. It was moderately successful on some big NUMA boxes though. The fundamental problem is that not using a CPU is always worse than using remote memory on the small systems. Always migrating memory on CPU migration is also too costly in the general case, but it might be possible to make it work in the special case of vCPU guests with some tweaks. -Andi
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andi Kleen wrote: Please explain. When would you call getcpu() and what would you do at that time? When the guest allocates on the node of its current CPU get memory on the node pool getcpu() tells you it is running on. More tricky is handling guest explicitly accessing other node for NUMA policy purposes, but in this case you can access the cache of the getcpu information of other vcpus. The guest allocates when it touches the page for the first time. This means very little since all of memory may be touched during guest bootup or shortly afterwards. Even if not, it is still a one-time operation, and any choices we make based on it will last the lifetime of the guest. This is roughly equivalent of getting a fresh new demand fault page, but doesn't require to unmap/free/remap. Lost again, sorry. The tricky bit is probably figuring out what is a fresh new page for the guest. That might need some paravirtual help. The guest typically recycles its own pages (exception is ballooning). Also it doesn't make sense to manage this on a per page basis as the guest won't do that. We need to mimic real hardware. The static case is simple. We allocate memory from a few nodes (for small guests, only one) and establish a guest_node -> host_node mapping. vcpus on guest node X are constrained to host node according to this mapping. The dynamic case is really complicated. We can allow vcpus to wander to other cpus on cpu overcommit, but need to pull them back soonish, or alternatively migrate the entire node, taking into account the cost of the migration, cpu availability on the target node, and memory availability on the target node. Since the cost is so huge, this needs to be done on a very coarse scale. I don't see this happening in the kernel anytime soon. 
-- error compiling committee.c: too many arguments to function
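Andi's per-node pool idea, in toy form (purely illustrative; a real implementation would key the pool lookup off getcpu(2) for the running vcpu thread, or off the cached getcpu result of another vcpu for explicit remote-node accesses):

```python
# Sketch of "keep your own pools and allocate from the correct pool":
# one free-page pool per host node, allocations routed by the node the
# allocating vcpu currently runs on. Pool contents are placeholders.

node_pools = {0: ["p0", "p1"], 1: ["p2", "p3"]}

def current_node_of_vcpu(vcpu, placement={0: 0, 1: 1}):
    # Stand-in for a (cached) getcpu() result for this vcpu's thread.
    return placement[vcpu]

def alloc_for(vcpu):
    # Satisfy the allocation from the pool of the vcpu's current node.
    node = current_node_of_vcpu(vcpu)
    return node, node_pools[node].pop()

node, page = alloc_for(1)
assert node == 1 and page == "p3"
```

The catch Avi raises still applies: since qemu's memory is allocated essentially once, the "allocation time" these pools would hook into barely exists from the host's point of view.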
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
On Sun, Nov 30, 2008 at 05:38:34PM +0200, Avi Kivity wrote: > Andi Kleen wrote: > >>I don't think the first one works without the second. Calling getcpu() > >>on startup is meaningless since the initial placement doesn't take the > >> > > > >Who said anything about startup? The idea behind getcpu() is to call > >it every time you allocate something. > > > > > > Qemu only allocates on startup (though of course the kernel actually > allocates the memory lazily). > Please explain. When would you call getcpu() and what would you do at > that time? When the guest allocates on the node of its current CPU get memory on the node pool getcpu() tells you it is running on. More tricky is handling guest explicitly accessing other node for NUMA policy purposes, but in this case you can access the cache of the getcpu information of other vcpus. This is roughly equivalent of getting a fresh new demand fault page, but doesn't require to unmap/free/remap. The tricky bit is probably figuring out what is a fresh new page for the guest. That might need some paravirtual help. It's an approximate scheme, I don't know how well it would really work. > >I think I would prefer to fix that in the kernel. user space will never > >have the full picture. > > > > On the other hand, getting everyone happy so this can get into the > kernel will be very difficult. Many workloads will lose from this; > we're trying to balance both memory affinity and cpu balancing, and each > workload has a different tradeoff. Yes it's unlikely to be a win in general, at least on small systems with moderate NUMA factor. I expect KVM would need to turn it on with some explicit hints. -Andi
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andi Kleen wrote: I don't think the first one works without the second. Calling getcpu() on startup is meaningless since the initial placement doesn't take the Who said anything about startup? The idea behind getcpu() is to call it every time you allocate something. Qemu only allocates on startup (though of course the kernel actually allocates the memory lazily). Please explain. When would you call getcpu() and what would you do at that time? This could happen completely in the kernel (not an easy task), or by There were experimental patches for tying memory migration to cpu migration some time ago from Lee S. having a second-level scheduler in userspace polling for cpu usage and rebalancing processes across numa nodes. Given that with virtualization you have a few long lived processes, this does not seem too difficult. I think I would prefer to fix that in the kernel. user space will never have the full picture. On the other hand, getting everyone happy so this can get into the kernel will be very difficult. Many workloads will lose from this; we're trying to balance both memory affinity and cpu balancing, and each workload has a different tradeoff. -- error compiling committee.c: too many arguments to function
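A userspace "second-level scheduler" along the lines Avi suggests might look roughly like this (a rough sketch with invented names and thresholds; the real thing would read per-node load from /proc and act via migrate_pages(2) plus sched_setaffinity(2) on the vcpu threads):

```python
# Toy rebalancer: poll per-node cpu load and, when the imbalance is
# large enough to justify the huge migration cost, pick one whole guest
# to move (vcpus and memory together) from the hot node to the cold one.

def rebalance(node_load, guests_on_node, threshold=0.25):
    hot = max(node_load, key=node_load.get)
    cold = min(node_load, key=node_load.get)
    if node_load[hot] - node_load[cold] < threshold:
        return None                       # imbalance too small to pay for
    if not guests_on_node[hot]:
        return None                       # nothing movable on the hot node
    # Coarse-grained decision: migrate one entire guest. Real code would
    # call migrate_pages(2) for its memory and re-pin its vcpu threads.
    return guests_on_node[hot][0], hot, cold

move = rebalance({0: 0.9, 1: 0.2}, {0: ["vmA", "vmB"], 1: ["vmC"]})
assert move == ("vmA", 0, 1)
```

The threshold is doing all the work here, which is exactly the tradeoff discussed above: set it too low and you thrash on migration cost, too high and you idle a node.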
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
> I don't think the first one works without the second. Calling getcpu() > on startup is meaningless since the initial placement doesn't take the Who said anything about startup? The idea behind getcpu() is to call it every time you allocate something. > > > >Anyways it's not ideal either, but in my mind would be all preferable > >to default CPU pinning. > > I agree we need something dynamic, and that we need to tie cpu affinity > and memory affinity together. > > This could happen completely in the kernel (not an easy task), or by There were experimental patches for tying memory migration to cpu migration some time ago from Lee S. > having a second-level scheduler in userspace polling for cpu usage and > rebalancing processes across numa nodes. Given that with virtualization > you have a few long lived processes, this does not seem too difficult. I think I would prefer to fix that in the kernel. user space will never have the full picture. -Andi
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andi Kleen wrote: On Sat, Nov 29, 2008 at 08:43:35PM +0200, Avi Kivity wrote: Andi Kleen wrote: It depends -- it's not necessarily an improvement. e.g. if it leads to some CPUs being idle while others are oversubscribed because of the pinning you typically lose more than you win. In general default pinning is a bad idea in my experience. Alternative more flexible strategies: - Do a mapping from CPU to node at runtime by using getcpu() - Migrate to complete nodes using migrate_pages when qemu detects node migration on the host. Wouldn't that cause lots of migrations? Migrating a 1GB guest can take I assume you mean the second one (the two points were orthogonal) The first one is an approximate method, also has advantages and disadvantages. I don't think the first one works without the second. Calling getcpu() on startup is meaningless since the initial placement doesn't take the current workload into account. a huge amount of cpu time (tens or even hundreds of milliseconds?) compared to very high frequency activity like the scheduler. Yes migration is expensive, although you can do it on demand of course, but the scheduler typically has pretty strong cpu affinity so it shouldn't happen too often. Also it's only a temporary cost compared to the endless overhead of running forever non local or running forever with some cores idle. Another strategy would be to tune the load balancer in the scheduler for this case and make it only migrate in extreme situations. Anyways it's not ideal either, but in my mind would be all preferable to default CPU pinning. I agree we need something dynamic, and that we need to tie cpu affinity and memory affinity together. This could happen completely in the kernel (not an easy task), or by having a second-level scheduler in userspace polling for cpu usage and rebalancing processes across numa nodes. Given that with virtualization you have a few long lived processes, this does not seem too difficult. 
-- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
On Sat, Nov 29, 2008 at 08:43:35PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> > It depends -- it's not necessarily an improvement. e.g. if it leads to
> > some CPUs being idle while others are oversubscribed because of the
> > pinning you typically lose more than you win. In general default
> > pinning is a bad idea in my experience.
> >
> > Alternative more flexible strategies:
> > - Do a mapping from CPU to node at runtime by using getcpu()
> > - Migrate to complete nodes using migrate_pages when qemu detects
> >   node migration on the host.
>
> Wouldn't that cause lots of migrations? Migrating a 1GB guest can take
> a huge amount of cpu time (tens or even hundreds of milliseconds?)
> compared to very high frequency activity like the scheduler.

I assume you mean the second one (the two points were orthogonal). The first one is an approximate method, also has advantages and disadvantages.

Yes migration is expensive, although you can do it on demand of course, but the scheduler typically has pretty strong cpu affinity so it shouldn't happen too often. Also it's only a temporary cost compared to the endless overhead of running forever non local or running forever with some cores idle.

Another strategy would be to tune the load balancer in the scheduler for this case and make it only migrate in extreme situations. Anyways it's not ideal either, but in my mind would be preferable to default CPU pinning.

-Andi
--
[EMAIL PROTECTED]
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andi Kleen wrote:
> It depends -- it's not necessarily an improvement. e.g. if it leads to
> some CPUs being idle while others are oversubscribed because of the
> pinning you typically lose more than you win. In general default
> pinning is a bad idea in my experience.
>
> Alternative more flexible strategies:
> - Do a mapping from CPU to node at runtime by using getcpu()
> - Migrate to complete nodes using migrate_pages when qemu detects
>   node migration on the host.

Wouldn't that cause lots of migrations? Migrating a 1GB guest can take a huge amount of cpu time (tens or even hundreds of milliseconds?) compared to very high frequency activity like the scheduler.
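The cost question above can be made concrete: migrate_pages(2) takes a pid plus "from" and "to" nodemasks and moves every resident page in one call, which is exactly why a single invocation on a 1GB guest is expensive. A Python sketch of the call, assuming x86-64 (syscall number 256) and a hypothetical migrate_guest() helper; real code would use libnuma's numa_migrate_pages() instead of a raw syscall:

```python
import ctypes

def nodemask(nodes):
    """Build a nodemask bit array as migrate_pages(2) expects:
    bit n set means node n is in the set."""
    mask = 0
    for n in nodes:
        mask |= 1 << n
    return ctypes.c_ulong(mask)

def migrate_guest(pid, from_node, to_node):
    """Sketch: move all of a qemu process's pages from one node to
    another. 256 is __NR_migrate_pages on x86-64 only (an assumption
    to check against the running kernel's headers). Returns the number
    of pages that could not be moved, or -1 on error."""
    libc = ctypes.CDLL(None, use_errno=True)
    NR_migrate_pages = 256
    old = nodemask({from_node})
    new = nodemask({to_node})
    # maxnode=64: one unsigned long worth of node bits.
    return libc.syscall(NR_migrate_pages, pid, 64,
                        ctypes.byref(old), ctypes.byref(new))
```

Doing this "on demand", as suggested, amortizes the cost: the guest only pays it when its threads have actually settled on a different node.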
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andre Przywara wrote:
> The user (or better: management application) specifies the host nodes
> the guest should use: -nodes 2,3 would create a two node guest mapped
> to node 2 and 3 on the host. These numbers are handed over to libnuma:
> VCPUs are pinned to the nodes and the allocated guest memory is bound
> to its respective node. Since libnuma seems not to be installed
> everywhere, the user has to enable this via configure --enable-numa
>
> In the BIOS code an ACPI SRAT table was added, which describes the
> NUMA topology to the guest. The number of nodes is communicated via
> the CMOS RAM (offset 0x3E). If someone thinks this is a bad idea,
> tell me.

There is now a firmware interface in qemu for this kind of communication.

> To make use of the new BIOS, install the iasl compiler
> (http://acpica.org/downloads/) and type "make bios" before installing,
> so the default BIOS will be replaced with the modified one. Node
> over-committing is allowed (-nodes 0,0,0,0); omitting the -nodes
> parameter reverts to the old behavior.

'-nodes' is too generic a name ('node' could also mean a host). Suggest -numanode.

We need more flexibility: specify the range of memory per node, which cpus are in the node, and relative weights for the SRAT table:

  -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3

Also need a monitor command to change host nodes dynamically:

  (qemu) numanode 1 0
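Since the proposed -numanode packs several key=value fields into one argument, with cpu= repeatable and sizes carrying G/M suffixes, the parsing side is worth pinning down. A sketch of a hypothetical parser for the proposed syntax; none of this exists in qemu yet:

```python
def parse_numanode(arg):
    """Parse the proposed -numanode option string into a dict.
    Repeatable keys (cpu=) accumulate into a list; G/M size suffixes
    are expanded to bytes. Hypothetical helper for a proposed option."""
    suffix = {'G': 1 << 30, 'M': 1 << 20}
    out = {'cpu': []}
    for field in arg.split(','):
        key, _, val = field.partition('=')
        if val and val[-1] in suffix:
            val = int(val[:-1]) * suffix[val[-1]]
        elif val.isdigit():
            val = int(val)
        if key == 'cpu':
            out['cpu'].append(val)
        else:
            out[key] = val
    return out
```

The same parser could back both the command line option and a numanode monitor command, so the startup and runtime paths stay consistent.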
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
On Thu, Nov 27, 2008 at 11:23:21PM +0100, Andre Przywara wrote:
> Hi,
>
> this patch series introduces multiple NUMA nodes support within KVM
> guests. This will improve the performance of guests which are bigger
> than one node (number of VCPUs and/or amount of memory) and also allows
> better balancing by taking better usage of each node's memory. It also
> improves the one node case by pinning a guest to this node and avoiding
> access of remote memory from one VCPU.
>
> The user (or better: management application) specifies the host nodes
> the guest should use: -nodes 2,3 would create a two node guest mapped
> to node 2 and 3 on the host. These numbers are handed over to libnuma:
> VCPUs are pinned to the nodes and the allocated guest memory is bound
> to its respective node.

I'm wondering whether this is the right level of granularity/expressiveness. It is basically encoding 3 pieces of information:

- Number of NUMA nodes to expose to the guest
- Which host nodes to use
- Which host nodes to pin vCPUs to

The latter item can actually already be done by management applications without a command line flag, with a greater level of flexibility than this allows. In libvirt we start up KVM with -S, so it's initially stopped, then run 'info cpus' in the monitor. This gives us the list of thread IDs for each vCPU. We then use sched_setaffinity to control the placement of each vCPU onto pCPUs. KVM could pick which host nodes to use for allocation based on which nodes its vCPUs are pinned to. Since NUMA support is going to be optional, we can't rely on using -nodes for CPU placement, and I'd rather not have to write different codepaths for initial placement for NUMA vs non-NUMA enabled KVM. People not using a mgmt tool may also choose to control host node placement using numactl to launch KVM. They would still need to be able to say how many nodes the guest is given.
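The libvirt workflow described above -- scrape vCPU thread IDs out of 'info cpus', then pin each thread -- can be sketched as follows. The monitor output format matched by the regex is an assumption (it varies across qemu versions), while os.sched_setaffinity() is the real call libvirt-style tools rely on:

```python
import os
import re

def parse_info_cpus(text):
    """Extract vCPU index -> host thread id from 'info cpus' monitor
    output. The sample line format is an assumption; real output
    differs between qemu versions."""
    pairs = {}
    for m in re.finditer(r'CPU #(\d+):.*thread_id=(\d+)', text):
        pairs[int(m.group(1))] = int(m.group(2))
    return pairs

def pin_vcpus(vcpu_threads, placement):
    """Pin each vCPU thread to its pCPU set using the real
    os.sched_setaffinity(); placement maps vCPU index -> set of pCPUs."""
    for vcpu, tid in vcpu_threads.items():
        os.sched_setaffinity(tid, placement[vcpu])
```

Because this operates on thread IDs after startup, the same code path serves both initial placement and later rebalancing, which is exactly the argument against baking pinning into a startup-only flag.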
Finally, this CLI arg does not allow you to say which vCPU is placed in which vNUMA node, or how much of the guest's RAM is allocated to each guest node. Thus I think it might be desirable to have the CLI argument focus on describing the guest NUMA configuration, rather than having it encode host & guest NUMA info in one go. You'd also want a way to describe vCPU <-> vNUMA node placement for vCPUs which are not yet present - eg so you can start with 4 vCPUs and hotplug add another 12 later. You can't assume you want all 4 initial CPUs in the same node, nor assume that you want all 4 spread evenly.

So some examples off the top of my head for alternate syntax for the guest topology:

* Create 4 nodes, split RAM & 8 initial vCPUs equally across nodes, and 8 unplugged vCPUs equally too:

    -m 1024 -smp 8 -nodes 4

* Create 4 nodes, split RAM equally across nodes, 8 initial vCPUs on first 2 nodes, and 8 unplugged vCPUs across other 2 nodes:

    -m 1024 -smp 8 -nodes 4,cpu:0-3;4-7;8-11;12-15

* Create 4 nodes, putting all RAM in first 2 nodes, split 8 initial vCPUs equally across nodes:

    -m 1024 -smp 8 -nodes 4,mem:512;512

* Create 4 nodes, putting all RAM in first 2 nodes, 8 initial vCPUs on first 2 nodes, and 8 unplugged vCPUs across other 2 nodes:

    -m 1024 -smp 8 -nodes 4,mem:512;512,cpu:0-3;4-7;8-11;12-15

We could optionally also include host node pinning for convenience:

* Create 4 nodes, putting all RAM in first 2 nodes, split 8 initial vCPUs equally across nodes, pin to host nodes 5-8:

    -m 1024 -smp 8 -nodes 4,mem:512;512,pin:5;6;7;8

If no 'pin' is given, it would query its current host pCPU pinning to determine what NUMA nodes it had been launched on.

> Since libnuma seems not to be installed
> everywhere, the user has to enable this via configure --enable-numa

It'd be nicer if the configure script just 'did the right thing'. So if neither --enable-numa nor --disable-numa are given, it should probe for availability and automatically enable it if found, disable if missing.
If --enable-numa is given, it should probe and abort if not found. If --disable-numa is given it'd not enable anything.

Regards, Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
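For the even-split defaults Daniel's examples rely on (RAM and the full hotpluggable vCPU range divided equally across nodes, e.g. -m 1024 -smp 8 -nodes 4 with 16 max vCPUs), the arithmetic might look like this sketch; the semantics are assumed from the examples, not specified anywhere:

```python
def default_split(mem_mb, maxcpus, nodes):
    """Even-split default for the proposed -nodes syntax: divide RAM and
    the full vCPU range (initial + unplugged) equally across nodes. Any
    RAM remainder goes to the first nodes; maxcpus is assumed divisible
    by nodes for simplicity. Sketch only."""
    mem = [mem_mb // nodes + (1 if i < mem_mb % nodes else 0)
           for i in range(nodes)]
    per = maxcpus // nodes
    cpus = [list(range(i * per, (i + 1) * per)) for i in range(nodes)]
    return mem, cpus
```

Explicit mem: and cpu: fields would then simply override the corresponding entry of this default, which keeps the common case (no fields at all) trivial to specify.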
Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Andre Przywara <[EMAIL PROTECTED]> writes:
> It also improves the one node case by pinning a guest to this node and
> avoiding access of remote memory from one VCPU.

It depends -- it's not necessarily an improvement. e.g. if it leads to some CPUs being idle while others are oversubscribed because of the pinning you typically lose more than you win. In general default pinning is a bad idea in my experience.

Alternative more flexible strategies:
- Do a mapping from CPU to node at runtime by using getcpu()
- Migrate to complete nodes using migrate_pages when qemu detects node migration on the host.

-Andi
--
[EMAIL PROTECTED]