Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-12-01 Thread Anthony Liguori

Avi Kivity wrote:

Anthony Liguori wrote:


I see no compelling reason to do cpu placement internally.  It can be 
done quite effectively externally.


Memory allocation is tough, but I don't think it's out of reach.  
Looking at the numactl man page, you can do:


numactl  --offset=1G  --length=1G --membind=1 --file /dev/shm/A --touch
  Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.


Since we can already create VMs with the -mem-path argument, if you 
create a 2GB guest and want it to span two NUMA nodes, you could do:


numactl  --offset=0G  --length=1G --membind=0 --file /dev/shm/A --touch
numactl  --offset=1G  --length=1G --membind=1 --file /dev/shm/A --touch

And then create the VM with:

qemu-system-x86_64 -mem-path /dev/shm/A -m 2G ...

What's best about this approach is that you get full access to what 
numactl is capable of: interleaving, rebalancing, etc.


It looks horribly difficult and unintuitive.  It forces you to use 
-mem-path (which is an abomination; the only reason it lives is that 
we can't allocate large pages any other way).


As opposed to inventing new options for QEMU that convey all of the same 
information in a slightly different way?  We're stuck with -mem-path so we 
might as well make good use of it.


The proposed syntax is:

qemu -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3

The new syntax would be:

qemu -smp 4 -numa nodes=2,cpus=1:2:3:4,mem=1G:1G -mem-path 
/dev/hugetlbfs/foo


Then you would have to look up the thread ids, and do

taskset -p -c <cpu-list> <tid-of-vcpu0>
taskset -p -c <cpu-list> <tid-of-vcpu1>
taskset -p -c <cpu-list> <tid-of-vcpu2>
taskset -p -c <cpu-list> <tid-of-vcpu3>
numactl -o 0G -l 1G -m 0 -f /dev/hugetlbfs/foo
numactl -o 1G -l 1G -m 1 -f /dev/hugetlbfs/foo
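
Looking up the thread ids is itself scriptable; they show up under /proc.
A rough sketch (the cpu list and binary name are placeholders, and telling
the vcpu threads apart from the io thread still needs the monitor):

ls /proc/$(pidof qemu-system-x86_64)/task     # one entry per thread id
taskset -p -c 0-1 <tid-of-vcpu0>              # then pin each vcpu thread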

This may look like a lot more, but specifying a NUMA placement at startup 
is not going to be nearly enough anyway.  What if you have a very large NUMA 
system and want to rebalance virtual machines?  You need a mechanism to 
do this that now has to be exposed through the monitor.  In fact, you'll 
almost certainly introduce a taskset-like monitor command and a 
numactl-like monitor command.


Why reinvent the wheel?  Plus, taskset and numactl give you a lot of 
flexibility.  All we're going to do by cooking this stuff into QEMU is 
artificially limit ourselves.


Regards,

Anthony Liguori


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-12-01 Thread Avi Kivity

Anthony Liguori wrote:

Avi Kivity wrote:

Andre Przywara wrote:

Any other useful commands for the monitor? Maybe (temporary) VCPU 
migration without page migration?


Right now vcpu migration is done externally (we export the thread IDs 
so management can pin them as it wishes).  If we add numa support, I 
think it makes sense to do it internally as well.  I suggest using the 
same syntax for the monitor as for the command line; that's simplest 
to learn and to implement.


I see no compelling reason to do cpu placement internally.  It can be 
done quite effectively externally.


Memory allocation is tough, but I don't think it's out of reach.  
Looking at the numactl man page, you can do:


numactl  --offset=1G  --length=1G --membind=1 --file /dev/shm/A --touch
  Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.


Since we can already create VMs with the -mem-path argument, if you 
create a 2GB guest and want it to span two NUMA nodes, you could do:


numactl  --offset=0G  --length=1G --membind=0 --file /dev/shm/A --touch
numactl  --offset=1G  --length=1G --membind=1 --file /dev/shm/A --touch

And then create the VM with:

qemu-system-x86_64 -mem-path /dev/shm/A -m 2G ...

What's best about this approach is that you get full access to what 
numactl is capable of: interleaving, rebalancing, etc.


It looks horribly difficult and unintuitive.  It forces you to use 
-mem-path (which is an abomination; the only reason it lives is that we 
can't allocate large pages any other way).


--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-12-01 Thread Avi Kivity

Anthony Liguori wrote:

Andre Przywara wrote:

Hi,

this patch series introduces support for multiple NUMA nodes within KVM 
guests.
This will improve the performance of guests which are bigger than one 
node (in number of VCPUs and/or amount of memory) and also allows better 
balancing by making better use of each node's memory.

It also improves the one-node case by pinning a guest to that node and
avoiding remote memory accesses from any VCPU.


Could you please post this to qemu-devel?  There's really nothing KVM 
specific here.




It's almost useless to qemu until it can run vcpus on host threads.  I 
agree it should be posted there though.




I think the dependency on libnuma is a bad idea.  It's mixing a 
mechanism (emulating NUMA layout) with a policy (how to do memory/VCPU 
placement).


If you split the NUMA emulation bits into a separate patch series, 
that has no dependency on the host NUMA topology, I think we can look at 
the existing mechanisms we have to see if they're sufficient to do 
static placement on NUMA boundaries.  vcpu pinning is easy enough, I 
think the only place we're lacking is memory layout.  Note, that's 
totally independent of the guest's NUMA characteristics though.  You 
may still want half of memory to be pinned between two nodes even if 
the guest has no SRAT tables.


You can do that easily with numactl.  Fine grained control of host numa 
layout and guest numa emulation are only useful together (one could 
argue that guest numa emulation is useful by itself, for debugging the 
guest OS numa algorithms).


--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-12-01 Thread Anthony Liguori

Anthony Liguori wrote:


numactl  --offset=0G  --length=1G --membind=0 --file /dev/shm/A --touch
numactl  --offset=1G  --length=1G --membind=1 --file /dev/shm/A --touch

And then create the VM with:

qemu-system-x86_64 -mem-path /dev/shm/A -m 2G ...

What's best about this approach is that you get full access to what 
numactl is capable of: interleaving, rebalancing, etc.


Prefaulting, generating an error when NUMA placement can't be 
satisfied, hugetlbfs support, yeah, this very much seems like the right 
thing to do to me.


If you care enough about performance to do NUMA placement, you almost 
certainly are going to be doing hugetlbfs anyway so you get it 
practically for free.
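
For reference, the hugetlbfs side of that setup is just a page reservation
plus a mount (a sketch; the page count and mount point are arbitrary):

echo 1024 > /proc/sys/vm/nr_hugepages     # reserve 1024 huge pages
mkdir -p /dev/hugetlbfs
mount -t hugetlbfs none /dev/hugetlbfs
qemu-system-x86_64 -mem-path /dev/hugetlbfs/foo -m 2G ...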


Regards,

Anthony Liguori


Regards,

Anthony Liguori




Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-12-01 Thread Anthony Liguori

Avi Kivity wrote:

Andre Przywara wrote:

Any other useful commands for the monitor? Maybe (temporary) VCPU 
migration without page migration?


Right now vcpu migration is done externally (we export the thread IDs 
so management can pin them as it wishes).  If we add numa support, I 
think it makes sense to do it internally as well.  I suggest using the 
same syntax for the monitor as for the command line; that's simplest 
to learn and to implement.


I see no compelling reason to do cpu placement internally.  It can be 
done quite effectively externally.


Memory allocation is tough, but I don't think it's out of reach.  
Looking at the numactl man page, you can do:


numactl  --offset=1G  --length=1G --membind=1 --file /dev/shm/A --touch
  Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.


Since we can already create VMs with the -mem-path argument, if you 
create a 2GB guest and want it to span two NUMA nodes, you could do:


numactl  --offset=0G  --length=1G --membind=0 --file /dev/shm/A --touch
numactl  --offset=1G  --length=1G --membind=1 --file /dev/shm/A --touch

And then create the VM with:

qemu-system-x86_64 -mem-path /dev/shm/A -m 2G ...

What's best about this approach is that you get full access to what 
numactl is capable of: interleaving, rebalancing, etc.


Regards,

Anthony Liguori


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-12-01 Thread Anthony Liguori

Andre Przywara wrote:

Hi,

this patch series introduces support for multiple NUMA nodes within KVM 
guests.
This will improve the performance of guests which are bigger than one 
node (in number of VCPUs and/or amount of memory) and also allows better 
balancing by making better use of each node's memory.

It also improves the one-node case by pinning a guest to that node and
avoiding remote memory accesses from any VCPU.


Could you please post this to qemu-devel?  There's really nothing KVM 
specific here.



The user (or better: the management application) specifies the host nodes
the guest should use: -nodes 2,3 would create a two-node guest mapped to
nodes 2 and 3 on the host. These numbers are handed over to libnuma:
VCPUs are pinned to the nodes and the allocated guest memory is bound to
its respective node. Since libnuma seems not to be installed
everywhere, the user has to enable this via configure --enable-numa.
In the BIOS code an ACPI SRAT table was added, which describes the NUMA
topology to the guest. The number of nodes is communicated via the CMOS
RAM (offset 0x3E). If someone thinks this is a bad idea, tell me.


I think the dependency on libnuma is a bad idea.  It's mixing a 
mechanism (emulating NUMA layout) with a policy (how to do memory/VCPU 
placement).


If you split the NUMA emulation bits into a separate patch series, that 
has no dependency on the host NUMA topology, I think we can look at the 
existing mechanisms we have to see if they're sufficient to do static 
placement on NUMA boundaries.  vcpu pinning is easy enough, I think the 
only place we're lacking is memory layout.  Note, that's totally 
independent of the guest's NUMA characteristics though.  You may still 
want half of memory to be pinned between two nodes even if the guest has 
no SRAT tables.


Regards,

Anthony Liguori


To make use of the new BIOS, install the iasl compiler
(http://acpica.org/downloads/) and type "make bios" before installing,
so the default BIOS will be replaced with the modified one.
Node over-committing is allowed (-nodes 0,0,0,0), omitting the -nodes
parameter reverts to the old behavior.

Please apply.

Regards,
Andre.

Patch 1/3: introduce a command line parameter
Patch 2/3: allocate guest resources from different host nodes
Patch 3/3: generate an appropriate SRAT ACPI table

Signed-off-by: Andre Przywara <[EMAIL PROTECTED]>





Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-12-01 Thread Avi Kivity

Daniel P. Berrange wrote:
The only problem is the default option for the host side, as libnuma 
requires the nodes to be named explicitly. Maybe make the pin: part _not_ 
optional? I would at least want to pin the memory; one could discuss 
the VCPUs...



I think keeping it optional makes things more flexible for people
invoking KVM. If omitted, then query current CPU pinning to determine
which host NUMA nodes to allocate from. 
  


Well, -numa itself is optional.  But yes, we could use the default cpu 
affinity mask to derive the default host numa nodes.



The topology exposed to a guest will likely be the same every time
you launch a particular VM, while the guest <-> host pinning is a
point-in-time decision according to currently available resources.

Thus some apps / users may find it more convenient to have a fixed set
of args they always use to invoke the KVM process, and instead control
placement during the fork/exec'ing of KVM by explicitly calling
sched_setaffinity or using numactl to launch.  It should be easy enough
to use sched_getaffinity to query current pinning and from that determine
appropriate NUMA nodes, if they leave out the pin= arg.


I agree, nice idea.

--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-12-01 Thread Daniel P. Berrange
On Mon, Dec 01, 2008 at 03:15:19PM +0100, Andre Przywara wrote:
> Avi Kivity wrote:
> >>Node over-committing is allowed (-nodes 0,0,0,0), omitting the -nodes
> >>parameter reverts to the old behavior.
> >
> >'-nodes' is too generic a name ('node' could also mean a host).  Suggest 
> >-numanode.
> >
> >Need more flexibility: specify the range of memory per node, which cpus 
> >are in the node, relative weights for the SRAT table:
> >
> >  -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3
> 
> I converted my code to use the new firmware interface. This also makes 
> it possible to pass more information between qemu and BIOS (which 
> prevented a more flexible command line in the first version).
> So I would opt for the following:
> - use numanode (or simply numa?) instead of the misleading -nodes
> - allow passing memory sizes, VCPU subsets and host CPU pin info
> I would prefer Daniel's version:
> -numa <nodes>[,mem:<size>[;<size>...]]
>              [,cpu:<vcpus>[;<vcpus>...]]
>              [,pin:<hostnode>[;<hostnode>...]]
> 
> That would allow easy things like -numa 2 (for a two-node guest); options
> not given would result in defaults (equally split-up resources).
> 
> The only problem is the default option for the host side, as libnuma 
> requires the nodes to be named explicitly. Maybe make the pin: part _not_ 
> optional? I would at least want to pin the memory; one could discuss 
> the VCPUs...

I think keeping it optional makes things more flexible for people
invoking KVM. If omitted, then query current CPU pinning to determine
which host NUMA nodes to allocate from. 

The topology exposed to a guest will likely be the same every time
you launch a particular VM, while the guest <-> host pinning is a
point-in-time decision according to currently available resources.
Thus some apps / users may find it more convenient to have a fixed set
of args they always use to invoke the KVM process, and instead control
placement during the fork/exec'ing of KVM by explicitly calling
sched_setaffinity or using numactl to launch.  It should be easy enough
to use sched_getaffinity to query current pinning and from that determine
appropriate NUMA nodes, if they leave out the pin= arg.
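
From a script the same information is available without writing any C; a
rough sketch, with the qemu pid as a placeholder:

taskset -p <qemu-pid>                         # current affinity mask
cat /sys/devices/system/node/node*/cpumap     # which cpus back which node

Intersecting the two gives the host nodes to bind memory to when pin= is
left out.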

Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-12-01 Thread Avi Kivity

Andre Przywara wrote:

Avi Kivity wrote:

Andre Przywara wrote:

The user (or better: the management application) specifies the host nodes
the guest should use: -nodes 2,3 would create a two-node guest mapped to
nodes 2 and 3 on the host. These numbers are handed over to libnuma:
VCPUs are pinned to the nodes and the allocated guest memory is bound to
its respective node. Since libnuma seems not to be installed
everywhere, the user has to enable this via configure --enable-numa.
In the BIOS code an ACPI SRAT table was added, which describes the NUMA
topology to the guest. The number of nodes is communicated via the CMOS
RAM (offset 0x3E). If someone thinks this is a bad idea, tell me.


There exists now a firmware interface in qemu for this kind of 
communications.
Oh, right you are, I missed that (was well hidden). I was looking at 
how the BIOS detects memory size and CPU numbers and these methods are 
quite cumbersome. Why not convert them to the FW_CFG methods (which 
the qemu side already sets)? To not diverge too much from the original 
BOCHS BIOS?




Mostly.  Also, no one felt the urge.


Node over-committing is allowed (-nodes 0,0,0,0), omitting the -nodes
parameter reverts to the old behavior.


'-nodes' is too generic a name ('node' could also mean a host).  
Suggest -numanode.


Need more flexibility: specify the range of memory per node, which 
cpus are in the node, relative weights for the SRAT table:


  -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3


I converted my code to use the new firmware interface. This also makes 
it possible to pass more information between qemu and BIOS (which 
prevented a more flexible command line in the first version).

So I would opt for the following:
- use numanode (or simply numa?) instead of the misleading -nodes
- allow passing memory sizes, VCPU subsets and host CPU pin info
I would prefer Daniel's version:
-numa <nodes>[,mem:<size>[;<size>...]]
             [,cpu:<vcpus>[;<vcpus>...]]
             [,pin:<hostnode>[;<hostnode>...]]

That would allow easy things like -numa 2 (for a two-node guest); options
not given would result in defaults (equally split-up resources).




Yes, that looks good.

The only problem is the default option for the host side, as libnuma 
requires the nodes to be named explicitly. Maybe make the pin: part _not_ 
optional? I would at least want to pin the memory; one could discuss 
the VCPUs...




If you can bench it, that would be best.  My guess is that we would need 
to pin the vcpus.



Also need a monitor command to change host nodes dynamically:

Implementing a monitor interface is a good idea.

(qemu) numanode 1 0
Does that include page migration? That would be easily possible with 
mbind(MPOL_MF_MOVE), but would take some time and resources (which I 
think is OK if explicitly triggered in the monitor).


Yes, that's the main interest.  Allow management to load balance numa 
nodes (as Linux doesn't do so automatically for long running processes).
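
(For what it's worth, the same thing can already be done from the host with
the migratepages tool shipped with numactl -- a sketch; the pid and node
numbers are placeholders:

migratepages <qemu-pid> 0 1    # move the guest's pages from node 0 to node 1)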


Any other useful commands for the monitor? Maybe (temporary) VCPU 
migration without page migration?


Right now vcpu migration is done externally (we export the thread IDs so 
management can pin them as it wishes).  If we add numa support, I think 
it makes sense to do it internally as well.  I suggest using the same 
syntax for the monitor as for the command line; that's simplest to learn 
and to implement.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-12-01 Thread Andre Przywara

Avi Kivity wrote:

Andre Przywara wrote:

The user (or better: the management application) specifies the host nodes
the guest should use: -nodes 2,3 would create a two-node guest mapped to
nodes 2 and 3 on the host. These numbers are handed over to libnuma:
VCPUs are pinned to the nodes and the allocated guest memory is bound to
its respective node. Since libnuma seems not to be installed
everywhere, the user has to enable this via configure --enable-numa.
In the BIOS code an ACPI SRAT table was added, which describes the NUMA
topology to the guest. The number of nodes is communicated via the CMOS
RAM (offset 0x3E). If someone thinks this is a bad idea, tell me.


There exists now a firmware interface in qemu for this kind of 
communications.
Oh, right you are, I missed that (was well hidden). I was looking at how 
the BIOS detects memory size and CPU numbers and these methods are quite 
cumbersome. Why not convert them to the FW_CFG methods (which the qemu 
side already sets)? To not diverge too much from the original BOCHS BIOS?



Node over-committing is allowed (-nodes 0,0,0,0), omitting the -nodes
parameter reverts to the old behavior.


'-nodes' is too generic a name ('node' could also mean a host).  Suggest 
-numanode.


Need more flexibility: specify the range of memory per node, which cpus 
are in the node, relative weights for the SRAT table:


  -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3


I converted my code to use the new firmware interface. This also makes 
it possible to pass more information between qemu and BIOS (which 
prevented a more flexible command line in the first version).

So I would opt for the following:
- use numanode (or simply numa?) instead of the misleading -nodes
- allow passing memory sizes, VCPU subsets and host CPU pin info
I would prefer Daniel's version:
-numa <nodes>[,mem:<size>[;<size>...]]
             [,cpu:<vcpus>[;<vcpus>...]]
             [,pin:<hostnode>[;<hostnode>...]]

That would allow easy things like -numa 2 (for a two-node guest); options
not given would result in defaults (equally split-up resources).
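
A fully specified invocation under that grammar might then look something
like this (purely illustrative; the exact separators are whatever we settle
on, and the argument is quoted because of the semicolons):

qemu-system-x86_64 -smp 4 -m 2G -numa "2,mem:1G;1G,cpu:0-1;2-3,pin:2;3"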


The only problem is the default option for the host side, as libnuma 
requires the nodes to be named explicitly. Maybe make the pin: part _not_ 
optional? I would at least want to pin the memory; one could discuss 
the VCPUs...




Also need a monitor command to change host nodes dynamically:

Implementing a monitor interface is a good idea.

(qemu) numanode 1 0
Does that include page migration? That would be easily possible with 
mbind(MPOL_MF_MOVE), but would take some time and resources (which I 
think is OK if explicitly triggered in the monitor).
Any other useful commands for the monitor? Maybe (temporary) VCPU 
migration without page migration?


Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
to satisfy European Law for business letters:
AMD Saxony Limited Liability Company & Co. KG,
Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register Court Dresden: HRA 4896, General Partner authorized
to represent: AMD Saxony LLC (Wilmington, Delaware, US)
General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy



RE: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Skywing
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Avi Kivity
Sent: Sunday, November 30, 2008 4:50 PM
To: Andi Kleen
Cc: Andre Przywara; kvm@vger.kernel.org
Subject: Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

> Well, testing is the only way to know.  I'm particularly interested in 
> how Windows will perform, since we know so little about its internals.
>
> From some light googling, it looks like Windows has a home node for a 
> thread, and will allocate pages from the home node even when the thread 
> is executing on some other node temporarily.  It also does automatic 
> page migration in some cases.

Well, there are a couple of ways this works:

- A program can explicitly specify which NUMA node to use for NUMA-aware apps 
using the new NUMA-enabled APIs.
- Otherwise, a default NUMA node is chosen.  For older systems (I believe prior 
to Vista), the default NUMA node is that of the processor upon which the thread 
was running.  For modern systems, the ideal processor for the thread is used.  
(The ideal processor is the proc the scheduler will try to run the thread on.  
It is not a hard association as with an affinity mask, in that the scheduler 
will run the thread on a different processor if it has to, but it will prefer 
the ideal processor.)

I think later versions of SQL Server are probably a good bet if you want to 
test something on Windows that is fully NUMA-aware.  Most smaller stuff is of 
course using the default policies.

- S


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Avi Kivity

Andi Kleen wrote:

On Sun, Nov 30, 2008 at 10:07:01PM +0200, Avi Kivity wrote:
  
Right.  Allocated from the guest kernel's perspective.  This may be 
different from the host kernel's perspective.


Linux will delay touching memory until the last moment, Windows will not 
(likely it zeros pages on their own nodes, but who knows)?



The problem on Linux is that the first touch is clear_page() and 
that unfortunately happens in the direct mapping before mapping, 
so the "detect mapping" trick doesn't quite work (unless it's a 32bit highmem 
page). 
  


It should still be on the same cpu.


Ok one could migrate it on mapping. When the data is still cache
hot that shouldn't be that expensive. Thinking about it again 
it might be actually a reasonable approach.


  


Could also work for normal apps - move code and data to local node.

But again, we don't have any guest mapping information when we're 
running under npt/ept; only the first access.  If we're willing to sacrifice 
memory, we can get the first access per virtual node.



In our case, the application is the guest kernel, which does know.



It knows but it doesn't really care all that much.  The only thing
that counts is the end performance in this case.
  


Well, testing is the only way to know.  I'm particularly interested in 
how Windows will perform, since we know so little about its internals.


From some light googling, it looks like Windows has a home node for a 
thread, and will allocate pages from the home node even when the thread 
is executing on some other node temporarily.  It also does automatic 
page migration in some cases.



The difference is, Linux (as a guest) will try to reuse freed pages from 
an application or pagecache, knowing which node they belong to.


I agree that if all you do is HPC style computation (boot a kernel and 
one app with one process per cpu), then the heuristics work well.



Or if there's a way to detect unmapping/remapping.
  


Sure, if you're willing to drop npt/ept.


It is certainly not perfect and has holes (like any heuristics),
but it has the advantage of being fully dynamic. 
 
  
It also has the advantage of being already implemented (apart from fake 
SRAT tables; and that isn't necessary for HPC apps).



What do you mean?
  


Which part?  Being already implemented?  Like I said earlier, right now 
kvm allocates memory in the context of the vcpu thread that first 
touched this memory.  Given that Linux prefers allocating from the 
current node, we already implement the first touch heuristic.
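
This is easy to check on a running guest, by the way: /proc/<pid>/numa_maps
shows per-node page counts for every mapping (the pid is a placeholder):

grep anon /proc/<qemu-pid>/numa_maps    # look at the N0=... N1=... counters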



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Andi Kleen
On Sun, Nov 30, 2008 at 10:07:01PM +0200, Avi Kivity wrote:
> Right.  Allocated from the guest kernel's perspective.  This may be 
> different from the host kernel's perspective.
> 
> Linux will delay touching memory until the last moment, Windows will not 
> (likely it zeros pages on their own nodes, but who knows)?

The problem on Linux is that the first touch is clear_page() and 
that unfortunately happens in the direct mapping before mapping, 
so the "detect mapping" trick doesn't quite work (unless it's a 32bit highmem 
page). 

Ok one could migrate it on mapping. When the data is still cache
hot that shouldn't be that expensive. Thinking about it again 
it might be actually a reasonable approach.

> 
> The bigger problem is lifetime.  Inside a guest, 'allocation' happens 
> when a page is used for pagecache, or when a process is created and 
> starts using memory.  From the host perspective, it happens just once.

Yes, that's a problem. I discussed some ways to get around that
earlier. 

> 
> >>It's very different.  The kernel expects an application that touched 
> >>page X on node Y to continue using page X on node Y.  Because 
> >>applications know this, they are written to this assumption.  However, 
> >>
> >
> >The far majority of applications do not actually know where memory is. 
> >  
> 
> In our case, the application is the guest kernel, which does know.

It knows but it doesn't really care all that much.  The only thing
that counts is the end performance in this case.

[Some people also use NUMA policy to partition machines, but that's
ok in this case because that only needs the same fixed guest physical
addresses, which is guaranteed of course]

> The difference is, Linux (as a guest) will try to reuse freed pages from 
> an application or pagecache, knowing which node they belong to.
> 
> I agree that if all you do is HPC style computation (boot a kernel and 
> one app with one process per cpu), then the heuristics work well.

Or if there's a way to detect unmapping/remapping.

> >It is certainly not perfect and has holes (like any heuristics),
> >but it has the advantage of being fully dynamic. 
> >  
> 
> It also has the advantage of being already implemented (apart from fake 
> SRAT tables; and that isn't necessary for HPC apps).

What do you mean?

-Andi


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Avi Kivity

Skywing wrote:

The far majority of pages are allocated when a process wants them
or the kernel uses them for file cache.



Is that not going to be fairly guest-specific?  For example, Windows has a thread that 
does background zeroing of unallocated pages that aren't marked as zeroed already.  I'd 
imagine that touching such pages would translate to an "allocation" as far as 
any hypervisor would be concerned.
  


Yes.  Most likely the thread runs on the same node as the memory, so it 
stays local.


Still, we need to keep the vcpu within the memory node somehow, 
otherwise all memory becomes non-local.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Avi Kivity

Andi Kleen wrote:
The page is allocated at an uninteresting point in time.  For example, 
the boot loader allocates a bunch of pages.



The far majority of pages are allocated when a process wants them
or the kernel uses them for file cache.
  


Right.  Allocated from the guest kernel's perspective.  This may be 
different from the host kernel's perspective.


Linux will delay touching memory until the last moment, Windows will not 
(likely it zeros pages on their own nodes, but who knows)?


The bigger problem is lifetime.  Inside a guest, 'allocation' happens 
when a page is used for pagecache, or when a process is created and 
starts using memory.  From the host perspective, it happens just once.


It's very different.  The kernel expects an application that touched 
page X on node Y to continue using page X on node Y.  Because 
applications know this, they are written to this assumption.  However, 



The far majority of applications do not actually know where memory is. 
  


In our case, the application is the guest kernel, which does know.


What matters is that you get local accesses most of the time for the memory
that is touched on a specific CPU. Even the applications who
know won't break if it's somewhere else, because it's only
an optimization. As long as you're faster on average (or in the worst
case not significantly worse) than not having it you're fine.

Also the Linux first touch is a heuristic that can be wrong
later, and I don't see too much difference in having another
heuristic level on top of it.
  


The difference is, Linux (as a guest) will try to reuse freed pages from 
an application or pagecache, knowing which node they belong to.


I agree that if all you do is HPC style computation (boot a kernel and 
one app with one process per cpu), then the heuristics work well.



The scheme I described is an approximate heuristic to get local
memory access in many cases without pinning anything to CPUs.
It is certainly not perfect and has holes (like any heuristic),
but it has the advantage of being fully dynamic. 
  


It also has the advantage of being already implemented (apart from fake 
SRAT tables; and that isn't necessary for HPC apps).


in a virtualization context, the guest kernel expects that page X 
belongs to whatever node the SRAT table points at, without regard to the 
first access.


Guest kernels behave differently from applications, because real 
hardware doesn't allocate pages dynamically like the kernel can for 
applications.



Again the kernel just wants local memory access most of the time
for the allocations where it matters.

  


It does.  But the kernel doesn't allocate memory (except for the first 
time); it recycles memory.



Also NUMA is always an optimization, it's not a big issue if you're
wrong occasionally because that doesn't affect correctness.
  


Agreed.




Mapped once and allocated once (not at the same time, but fairly close).



That seems dubious to me.

  


That's how it works.

Qemu will mmap() the guest's memory at initialization time.  When the 
guest touches memory, kvm will call get_user_pages_fast() (here's the 
allocation) and instantiate a pte in the qemu address space, as well as 
a shadow pte (using either ept/npt two-level maps or direct shadow maps).


With ept/npt, in the absence of swapping, the story ends here.  Without 
ept/npt, the guest will continue to fault, but now get_user_pages_fast() 
will return the already allocated page from the pte in the qemu address 
space.


No.  Linux will assume a page belongs to the node the SRAT table says it 
belongs to.  Whether first access will be from the local node depends on 
the workload.  If the first application running accesses all memory from 
a single cpu, we will allocate all memory from one node, but this is wrong.



Sorry I don't get your point. 


Yeah, we're talking a bit past each other.  I'll try to expand.


Wrong doesn't make sense in this context.

You seem to be saying that an allocation that is not local on a native
kernel wouldn't be local in the approximate heuristic either. 
But that's a triviality that is of course true and likely not what 
you meant anyways.


  


I'm saying, that sometimes the guest kernel allocates memory from 
virtual node A but uses it on virtual node B, due to its memory policy 
or perhaps due to resource scarcity.  Unlike a normal application, the 
guest kernel still tracks the page as belonging to node A (even though 
it is used on node B).  Because of this, when the page is recycled, the 
guest kernel will try to assign it to processes running on node A.  But 
the host has allocated it from node B.


When we export a virtual SRAT, we promise to the guest something about 
memory access latency.  The guest will try to optimize according to this 
SRAT, and if we don't fulfil the promise, it will make incorrect decisions.


So long as a page has a single use in the lifetime of the guest, it 
doesn't matter.  But ge

RE: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Skywing

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
Sent: Sunday, November 30, 2008 1:56 PM
To: Avi Kivity
Cc: Andi Kleen; Andre Przywara; kvm@vger.kernel.org
Subject: Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

> > The page is allocated at an uninteresting point in time.  For example, 
> > the boot loader allocates a bunch of pages.
>
> The far majority of pages are allocated when a process wants them
> or the kernel uses them for file cache.

Is that not going to be fairly guest-specific?  For example, Windows has a 
thread that does background zeroing of unallocated pages that aren't marked as 
zeroed already.  I'd imagine that touching such pages would translate to an 
"allocation" as far as any hypervisor would be concerned.

- S


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Andi Kleen
> The page is allocated at an uninteresting point in time.  For example, 
> the boot loader allocates a bunch of pages.

The far majority of pages are allocated when a process wants them
or the kernel uses them for file cache.

> 
> >>executes.  First access happens somewhat later, but still we cannot 
> >>count on the majority of accesses to come from the same cpu as the first 
> >>access.
> >>
> >
> >It is a reasonable heuristic. It's just like the rather
> >successful default local allocation heuristic the native kernel uses.
> >  
> 
> It's very different.  The kernel expects an application that touched 
> page X on node Y to continue using page X on node Y.  Because 
> applications know this, they are written to this assumption.  However, 

The far majority of applications do not actually know where memory is. 
What matters is that you get local accesses most of the time for the memory
that is touched on a specific CPU. Even the applications who
know won't break if it's somewhere else, because it's only
an optimization. As long as you're faster on average (or in the worst
case not significantly worse) than not having it you're fine.

Also the Linux first touch is a heuristic that can be wrong
later, and I don't see too much difference in having another
heuristic level on top of it.

The scheme I described is an approximate heuristic to get local
memory access in many cases without pinning anything to CPUs.
It is certainly not perfect and has holes (like any heuristic),
but it has the advantage of being fully dynamic. 

> in a virtualization context, the guest kernel expects that page X 
> belongs to whatever node the SRAT table points at, without regard to the 
> first access.
> 
> Guest kernels behave differently from applications, because real 
> hardware doesn't allocate pages dynamically like the kernel can for 
> applications.

Again the kernel just wants local memory access most of the time
for the allocations where it matters.

Also NUMA is always an optimization, it's not a big issue if you're
wrong occasionally because that doesn't affect correctness.

> 
> (btw, what do you do with cpu-less nodes? I think some sgi hardware has 
> them)

You assign them to a nearby node, but it's really a totally unimportant
corner case.

> 
> >>>The alternative is to keep your own pools and allocate from the
> >>>correct pool, but then you either need pinning or getcpu()
> >>> 
> >>>  
> >>This is meaningless in kvm context.  Other than small bits of memory 
> >>needed for I/O and shadow page tables, the bulk of memory is allocated 
> >>once. 
> >>
> >
> >Mapped once. Anyways that could be changed too if there was need.
> >
> >  
> 
> Mapped once and allocated once (not at the same time, but fairly close).

That seems dubious to me.

> No.  Linux will assume a page belongs to the node the SRAT table says it 
> belongs to.  Whether first access will be from the local node depends on 
> the workload.  If the first application running accesses all memory from 
> a single cpu, we will allocate all memory from one node, but this is wrong.

Sorry I don't get your point. Wrong doesn't make sense in this context.

You seem to be saying that an allocation that is not local on a native
kernel wouldn't be local in the approximate heuristic either. 
But that's a triviality that is of course true and likely not what 
you meant anyways.

> 
> >>(2) even without npt/ept, we have no idea how often mappings are used 
> >>and by which cpu.  finding out is expensive.
> >>
> >
> >You see a fault on the first mapping. That fault is on the CPU that
> >did the access.  Therefore you know which one it was.
> >  
> 
> It's meaningless information.  First access means nothing.  And again, 

At least in Linux the first access to the majority of memory is 
either through a process page allocation or through a file cache
(page cache) allocation. Yes, there are a few boot loader
and temporary kernel pages for which this is not true, 
but they are a small, insignificant fraction of the total memory
in a reasonably sized guest. I'm just ignoring them.

This can be often observed in that if you have a broken DIMM
you only get problems after using some program that uses
most of your memory.


> the guest doesn't expect the page to move to the node where it touched it.

The only thing the guest cares about is to get good performance on the
memory access. And even there only if the page is actually used
often (i.e. mapped to a process).  How that is achieved is not its concern.

> 
> (we also see first access with ept)

Great.

> 
> >>(3) for many workloads, there are no unused pages.  the guest 
> >>application allocates all memory and manages memory by itself.
> >>
> >
> >First a common case of guest using all memory is file cache,
> >but for NUMA purposes file cache locality typically doesn't
> >matter because it's not accessed frequently enough that
> >non locality is a problem. It really only matters for mapping
> >th

Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Avi Kivity

Andi Kleen wrote:

I was more thinking about some heuristics that checks when a page
is first mapped into user space. The only problem is that it is zeroed
through the direct mapping before, but perhaps there is a way around it. 
That's one of the rare cases when 32bit highmem actually makes things 
easier.

It might be also easier on some other OS than Linux who don't use
direct mapping that aggressively.
 
  
In the context of kvm, the mmap() calls happen before the guest ever 



The mmap call doesn't matter at all, what matters is when the
page is allocated.

  


The page is allocated at an uninteresting point in time.  For example, 
the boot loader allocates a bunch of pages.


executes.  First access happens somewhat later, but still we cannot 
count on the majority of accesses to come from the same cpu as the first 
access.



It is a reasonable heuristic. It's just like the rather
successful default local allocation heuristic the native kernel uses.
  


It's very different.  The kernel expects an application that touched 
page X on node Y to continue using page X on node Y.  Because 
applications know this, they are written to this assumption.  However, 
in a virtualization context, the guest kernel expects that page X 
belongs to whatever node the SRAT table points at, without regard to the 
first access.


Guest kernels behave differently from applications, because real 
hardware doesn't allocate pages dynamically like the kernel can for 
applications.


(btw, what do you do with cpu-less nodes? I think some sgi hardware has 
them)



The alternative is to keep your own pools and allocate from the
correct pool, but then you either need pinning or getcpu()
 
  
This is meaningless in kvm context.  Other than small bits of memory 
needed for I/O and shadow page tables, the bulk of memory is allocated 
once. 



Mapped once. Anyways that could be changed too if there was need.

  


Mapped once and allocated once (not at the same time, but fairly close).

We can't change it without changing the guest.


Basic algorithm:
- If guest touches virtual node that is the same as the local node
of the current vcpu assume it's a local allocation.
 
  
The guest is not making the same assumption; lying to the guest is 



Huh? Pretty much all NUMA aware OS should. Linux will definitely.

  


No.  Linux will assume a page belongs to the node the SRAT table says it 
belongs to.  Whether first access will be from the local node depends on 
the workload.  If the first application running accesses all memory from 
a single cpu, we will allocate all memory from one node, but this is wrong.


(2) even without npt/ept, we have no idea how often mappings are used 
and by which cpu.  finding out is expensive.



You see a fault on the first mapping. That fault is on the CPU that
did the access.  Therefore you know which one it was.
  


It's meaningless information.  First access means nothing.  And again, 
the guest doesn't expect the page to move to the node where it touched it.


(we also see first access with ept)

(3) for many workloads, there are no unused pages.  the guest 
application allocates all memory and manages memory by itself.



First a common case of guest using all memory is file cache,
but for NUMA purposes file cache locality typically doesn't
matter because it's not accessed frequently enough that
non-locality is a problem. It really only matters for mappings
that are used often by the CPU.

When a single application allocates everything and keeps it that is fine
too because you'll give it approximately local memory on the initial
set up (assuming the application has reasonable NUMA behaviour by itself
on a first touch local allocation policy)
  


Sure, for the simple cases it works.  But consider your first example 
followed by the second (you can even reboot the guest in the middle, but 
the bad assignment sticks).


And if the vcpu moves for some reason, things get screwed up permanently.

We should try to be predictable, not depend on behavior the guest has no 
real reason to follow, if it follows hardware specs.



--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Andi Kleen
On Sun, Nov 30, 2008 at 07:11:40PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> >On Sun, Nov 30, 2008 at 06:38:14PM +0200, Avi Kivity wrote:
> >  
> >>The guest allocates when it touches the page for the first time.  This 
> >>means very little since all of memory may be touched during guest bootup 
> >>or shortly afterwards.  Even if not, it is still a one-time operation, 
> >>and any choices we make based on it will last the lifetime of the guest.
> >>
> >
> >I was more thinking about some heuristics that checks when a page
> >is first mapped into user space. The only problem is that it is zeroed
> >through the direct mapping before, but perhaps there is a way around it. 
> >That's one of the rare cases when 32bit highmem actually makes things 
> >easier.
> >It might be also easier on some other OS than Linux who don't use
> >direct mapping that aggressively.
> >  
> 
> In the context of kvm, the mmap() calls happen before the guest ever 

The mmap call doesn't matter at all, what matters is when the
page is allocated.

> executes.  First access happens somewhat later, but still we cannot 
> count on the majority of accesses to come from the same cpu as the first 
> access.

It is a reasonable heuristic. It's just like the rather
successful default local allocation heuristic the native kernel uses.

> >
> >The alternative is to keep your own pools and allocate from the
> >correct pool, but then you either need pinning or getcpu()
> >  
> 
> This is meaningless in kvm context.  Other than small bits of memory 
> needed for I/O and shadow page tables, the bulk of memory is allocated 
> once. 

Mapped once. Anyways that could be changed too if there was need.

> 
> >>We need to mimic real hardware.
> >>
> >
> >The underlying allocation is in pages, so the NUMA affinity can 
> >be as well handled by this. 
> >
> >Basic algorithm:
> >- If guest touches virtual node that is the same as the local node
> >of the current vcpu assume it's a local allocation.
> >  
> 
> The guest is not making the same assumption; lying to the guest is 

Huh? Pretty much all NUMA aware OS should. Linux will definitely.


> (1) with npt/ept we have no clue as to guest mappings

Yes that is tricky. With A bits in theory it could be made 
to work with EPT, but there are none, and it would still
not work very well.

> (2) even without npt/ept, we have no idea how often mappings are used 
> and by which cpu.  finding out is expensive.

You see a fault on the first mapping. That fault is on the CPU that
did the access.  Therefore you know which one it was.

> (3) for many workloads, there are no unused pages.  the guest 
> application allocates all memory and manages memory by itself.

First a common case of guest using all memory is file cache,
but for NUMA purposes file cache locality typically doesn't
matter because it's not accessed frequently enough that
non-locality is a problem. It really only matters for mappings
that are used often by the CPU.

When a single application allocates everything and keeps it that is fine
too because you'll give it approximately local memory on the initial
set up (assuming the application has reasonable NUMA behaviour by itself
on a first touch local allocation policy)

When there's lots of remapping/new processes one would probably need some 
heuristics to detect reallocations, like the mapping heuristics I described 
earlier or PV help.

> Right.  The situation I'm trying to avoid is process A with memory on 
> node X running on node Y, and process B with memory on node Y running on 
> node X.  The scheduler arrives at a local optimum, caused by some 
> spurious load, and won't move to the global optimum because migrating 
> processes across cpus is considered expensive.
> 
> I don't know, perhaps the current scheduler is clever enough to do this 
> already.

It tries to, but there are always extreme cases where it doesn't work.
Also, once a process is migrated it won't find its way back to its memory.
Still, for an approximate dynamic solution, trusting it is not the worst
you can do.

-Andi
-- 
[EMAIL PROTECTED]


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Avi Kivity

Andi Kleen wrote:

On Sun, Nov 30, 2008 at 06:38:14PM +0200, Avi Kivity wrote:
  
The guest allocates when it touches the page for the first time.  This 
means very little since all of memory may be touched during guest bootup 
or shortly afterwards.  Even if not, it is still a one-time operation, 
and any choices we make based on it will last the lifetime of the guest.



I was more thinking about some heuristics that checks when a page
is first mapped into user space. The only problem is that it is zeroed
through the direct mapping before, but perhaps there is a way around it. 
That's one of the rare cases when 32bit highmem actually makes things easier.

It might be also easier on some other OS than Linux who don't use
direct mapping that aggressively.
  


In the context of kvm, the mmap() calls happen before the guest ever 
executes.  First access happens somewhat later, but still we cannot 
count on the majority of accesses to come from the same cpu as the first 
access.



This is roughly equivalent of getting a fresh new demand fault page,
but doesn't require to unmap/free/remap.
 
  

Lost again, sorry.



free/unmap/remap gives you normally local memory. I tend to call
it poor man's NUMA policy API.

The alternative is to keep your own pools and allocate from the
correct pool, but then you either need pinning or getcpu()
  


This is meaningless in kvm context.  Other than small bits of memory 
needed for I/O and shadow page tables, the bulk of memory is allocated 
once.  Guest processes may repeatedly allocate and free memory, but kvm 
will never see this.



We need to mimic real hardware.



The underlying allocation is in pages, so the NUMA affinity can 
be as well handled by this. 


Basic algorithm:
- If guest touches virtual node that is the same as the local node
of the current vcpu assume it's a local allocation.
  


The guest is not making the same assumption; lying to the guest is 
counterproductive.  The big problem is that a local decision takes 
effect indefinitely.



- On allocation get the underlying page from the correct underlying
node based on a dynamic getcpu relationship.
- Find some way to get rid of unused pages. e.g. keep track of 
the number of mappings to a page and age or use pv help.


  


(1) with npt/ept we have no clue as to guest mappings
(2) even without npt/ept, we have no idea how often mappings are used 
and by which cpu.  finding out is expensive.
(3) for many workloads, there are no unused pages.  the guest 
application allocates all memory and manages memory by itself.
The static case is simple.  We allocate memory from a few nodes (for 
small guests, only one) and establish a guest_node -> host_node 
mapping.  vcpus on guest node X are constrained to host node according 
to this mapping.


The dynamic case is really complicated.  We can allow vcpus to wander to 
other cpus on cpu overcommit, but need to pull them back soonish, or 
alternatively migrate the entire node, taking into account the cost of 
the migration, cpu availability on the target node, and memory 
availability on the target node.  Since the cost is so huge, this needs 
to be done on a very coarse scale.



I wrote a scheduler that did that on 2.4 (it was called homenode scheduling),
but it never worked well on small systems. It was moderately successful on 
some big NUMA boxes though. The fundamental problem is that not using
a CPU is always worse than using remote memory on the small systems.

  


Right.  The situation I'm trying to avoid is process A with memory on 
node X running on node Y, and process B with memory on node Y running on 
node X.  The scheduler arrives at a local optimum, caused by some 
spurious load, and won't move to the global optimum because migrating 
processes across cpus is considered expensive.


I don't know, perhaps the current scheduler is clever enough to do this 
already.



Always migrating memory on CPU migration is also too costly in the general
case, but it might be possible to make it work in the special case 
of vCPU guests with some tweaks.
  
Yes, virtual machines are easier since there are a smaller number of 
mm_structs and tasks compared to more general workloads.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Andi Kleen
On Sun, Nov 30, 2008 at 06:38:14PM +0200, Avi Kivity wrote:
> The guest allocates when it touches the page for the first time.  This 
> means very little since all of memory may be touched during guest bootup 
> or shortly afterwards.  Even if not, it is still a one-time operation, 
> and any choices we make based on it will last the lifetime of the guest.

I was more thinking about some heuristics that checks when a page
is first mapped into user space. The only problem is that it is zeroed
through the direct mapping before, but perhaps there is a way around it. 
That's one of the rare cases when 32bit highmem actually makes things easier.
It might be also easier on some other OS than Linux who don't use
direct mapping that aggressively.
> 
> >This is roughly equivalent of getting a fresh new demand fault page,
> >but doesn't require to unmap/free/remap.
> >  
> 
> Lost again, sorry.

free/unmap/remap gives you normally local memory. I tend to call
it poor man's NUMA policy API.

The alternative is to keep your own pools and allocate from the
correct pool, but then you either need pinning or getcpu()

> 
> >The tricky bit is probably figuring out what is a fresh new page for
> >the guest. That might need some paravirtual help.
> >  
> 
> The guest typically recycles its own pages (exception is ballooning).  
> Also it doesn't make sense to manage this on a per page basis as the 
> guest won't do that. 

> We need to mimic real hardware.

The underlying allocation is done in pages, so NUMA affinity can
be handled at that level as well.

Basic algorithm:
- If the guest touches a virtual node that is the same as the local node
of the current vcpu, assume it's a local allocation.
- On allocation, get the underlying page from the correct underlying
node based on a dynamic getcpu relationship.
- Find some way to get rid of unused pages, e.g. keep track of
the number of mappings to a page and age them, or use pv help.
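
A minimal userspace sketch of the allocation step, assuming libnuma is
available; alloc_guest_page_local() is a made-up helper name used only
for illustration:

#define _GNU_SOURCE
#include <numa.h>              /* numa_alloc_onnode(); link with -lnuma */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

/* Back a newly touched guest page with memory from the host node the
 * faulting vcpu thread is currently running on. */
static void *alloc_guest_page_local(size_t page_size)
{
    unsigned cpu = 0, node = 0;

    /* getcpu() tells us where this vcpu thread runs right now. */
    if (syscall(SYS_getcpu, &cpu, &node, NULL) < 0)
        node = 0;                      /* fall back to node 0 */

    return numa_alloc_onnode(page_size, node);
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this host\n");
        return 1;
    }
    void *p = alloc_guest_page_local(4096);
    /* ... map the page into the guest here ... */
    numa_free(p, 4096);
    return 0;
}

This only covers the allocation side; deciding when a page is no longer
used (the last point above) is the hard part.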

> The static case is simple.  We allocate memory from a few nodes (for 
> small guests, only one) and establish a guest_node -> host_node 
> mapping.  vcpus on guest node X are constrained to host node according 
> to this mapping.
> 
> The dynamic case is really complicated.  We can allow vcpus to wander to 
> other cpus on cpu overcommit, but need to pull them back soonish, or 
> alternatively migrate the entire node, taking into account the cost of 
> the migration, cpu availability on the target node, and memory 
> availability on the target node.  Since the cost is so huge, this needs 
> to be done on a very coarse scale.

I wrote a scheduler that did that on 2.4 (it was called homenode scheduling),
but it never worked well on small systems. It was moderately successful on
some big NUMA boxes though. The fundamental problem is that not using
a CPU is always worse than using remote memory on the small systems.

Always migrating memory on CPU migration is also too costly in the general
case, but it might be possible to make it work in the special case 
of vCPU guests with some tweaks.

-Andi

-- 
[EMAIL PROTECTED]


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Avi Kivity

Andi Kleen wrote:
Please explain.  When would you call getcpu() and what would you do at 
that time?



When the guest allocates on the node of its current CPU, get memory from
the node pool getcpu() tells you it is running on. More tricky
is handling the guest explicitly accessing another node for NUMA policy
purposes, but in this case you can consult the cached getcpu
information of the other vcpus.
  


The guest allocates when it touches the page for the first time.  This 
means very little since all of memory may be touched during guest bootup 
or shortly afterwards.  Even if not, it is still a one-time operation, 
and any choices we make based on it will last the lifetime of the guest.



This is roughly equivalent to getting a fresh new demand fault page,
but doesn't require an unmap/free/remap.
  


Lost again, sorry.


The tricky bit is probably figuring out what is a fresh new page for
the guest. That might need some paravirtual help.
  


The guest typically recycles its own pages (exception is ballooning).  
Also it doesn't make sense to manage this on a per page basis as the 
guest won't do that.  We need to mimic real hardware.


The static case is simple.  We allocate memory from a few nodes (for 
small guests, only one) and establish a guest_node -> host_node 
mapping.  vcpus on guest node X are constrained to host node according 
to this mapping.
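
As a rough illustration only (not code from the patches), the static case
could look like this with libnuma; the guest_to_host[] mapping and the
sizes are made-up example values that would really come from the command
line:

#define _GNU_SOURCE
#include <numa.h>              /* link with -lnuma and -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define GUEST_NODES 2
#define NODE_MEM    (256UL << 20)   /* 256 MB per guest node, for the example */

/* Example guest_node -> host_node mapping, as given by e.g. "-nodes 2,3". */
static const int guest_to_host[GUEST_NODES] = { 2, 3 };

static void *vcpu_thread(void *arg)
{
    int gnode = (int)(long)arg;

    /* Constrain this vcpu to the host node its guest node is mapped to. */
    numa_run_on_node(guest_to_host[gnode]);
    /* ... enter the usual vcpu loop here ... */
    return NULL;
}

int main(void)
{
    if (numa_available() < 0)
        return 1;

    /* "Guest RAM": one contiguous area, one slice per guest node. */
    char *ram = malloc(GUEST_NODES * NODE_MEM);

    for (int g = 0; g < GUEST_NODES; g++) {
        /* Bind this guest node's slice to its host node (before the
         * guest touches it, so the pages fault in on the right node). */
        numa_tonode_memory(ram + g * NODE_MEM, NODE_MEM, guest_to_host[g]);

        pthread_t tid;
        pthread_create(&tid, NULL, vcpu_thread, (void *)(long)g);
        pthread_detach(tid);
    }
    /* ... */
    return 0;
}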


The dynamic case is really complicated.  We can allow vcpus to wander to 
other cpus on cpu overcommit, but need to pull them back soonish, or 
alternatively migrate the entire node, taking into account the cost of 
the migration, cpu availability on the target node, and memory 
availability on the target node.  Since the cost is so huge, this needs 
to be done on a very coarse scale.


I don't see this happening in the kernel anytime soon.

--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Andi Kleen
On Sun, Nov 30, 2008 at 05:38:34PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> >>I don't think the first one works without the second.  Calling getcpu() 
> >>on startup is meaningless since the initial placement doesn't take the 
> >>
> >
> >Who said anything about startup? The idea behind getcpu() is to call
> >it every time you allocate something.
> >
> >  
> 
> Qemu only allocates on startup (though of course the kernel actually 
> allocates the memory lazily).

> Please explain.  When would you call getcpu() and what would you do at 
> that time?

When the guest allocates on the node of its current CPU, get memory from
the node pool getcpu() tells you it is running on. More tricky
is handling the guest explicitly accessing another node for NUMA policy
purposes, but in this case you can consult the cached getcpu
information of the other vcpus.

This is roughly equivalent to getting a fresh new demand fault page,
but doesn't require an unmap/free/remap.

The tricky bit is probably figuring out what is a fresh new page for
the guest. That might need some paravirtual help.

It's an approximate scheme; I don't know how well it would really
work.

> >I think I would prefer to fix that in the kernel. user space will never
> >have the full picture.
> >  
> 
> On the other hand, getting everyone happy so this can get into the 
> kernel will be very difficult.  Many workloads will lose from this; 
> we're trying to balance both memory affinity and cpu balancing, and each 
> workload has a different tradeoff.

Yes it's unlikely to be a win in general, at least on small systems
with moderate NUMA factor. I expect KVM would need to turn it on with 
some explicit hints.

-Andi

-- 
[EMAIL PROTECTED]


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Avi Kivity

Andi Kleen wrote:
I don't think the first one works without the second.  Calling getcpu() 
on startup is meaningless since the initial placement doesn't take the 



Who said anything about startup? The idea behind getcpu() is to call
it every time you allocate something.

  


Qemu only allocates on startup (though of course the kernel actually 
allocates the memory lazily).


Please explain.  When would you call getcpu() and what would you do at 
that time?


This could happen completely in the kernel (not an easy task), or by 



There were experimental patches for tying memory migration to cpu migration
some time ago from Lee S.


  
having a second-level scheduler in userspace polling for cpu usage and
rebalancing processes across numa nodes.  Given that with virtualization 
you have a few long lived processes, this does not seem too difficult.



I think I would prefer to fix that in the kernel. user space will never
have the full picture.
  


On the other hand, getting everyone happy so this can get into the 
kernel will be very difficult.  Many workloads will lose from this; 
we're trying to balance both memory affinity and cpu balancing, and each 
workload has a different tradeoff.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-30 Thread Andi Kleen
> I don't think the first one works without the second.  Calling getcpu() 
> on startup is meaningless since the initial placement doesn't take the 

Who said anything about startup? The idea behind getcpu() is to call
it every time you allocate something.

> >
> >Anyways it's not ideal either, but in my mind would be all preferable
> >to default CPU pinning.
> 
> I agree we need something dynamic, and that we need to tie cpu affinity 
> and memory affinity together.
> 
> This could happen completely in the kernel (not an easy task), or by 

There were experimental patches for tying memory migration to cpu migration
some time ago from Lee S.

> having a second-level scheduler in userspace polling for cpu usage and
> rebalancing processes across numa nodes.  Given that with virtualization 
> you have a few long lived processes, this does not seem too difficult.

I think I would prefer to fix that in the kernel. user space will never
have the full picture.

-Andi

-- 
[EMAIL PROTECTED]


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-29 Thread Avi Kivity

Andi Kleen wrote:

On Sat, Nov 29, 2008 at 08:43:35PM +0200, Avi Kivity wrote:
  

Andi Kleen wrote:


It depends -- it's not necessarily an improvement. e.g. if it leads to
some CPUs being idle while others are oversubscribed because of the
pinning you typically lose more than you win. In general default
pinning is a bad idea in my experience.

Alternative more flexible strategies:

- Do a mapping from CPU to node at runtime by using getcpu()
- Migrate to complete nodes using migrate_pages when qemu detects
node migration on the host.
 
  
Wouldn't that cause lots of migrations?  Migrating a 1GB guest can take 



I assume you mean the second one (the two points were orthogonal)
The first one is an approximate method, also has advantages
and disadvantages.

  


I don't think the first one works without the second.  Calling getcpu() 
on startup is meaningless since the initial placement doesn't take the 
current workload into account.


a huge amount of cpu time (tens or even hundreds of milliseconds?) 
compared to very high frequency activity like the scheduler.



Yes migration is expensive, although you can do it on demand of course, 
but the scheduler typically has pretty strong cpu affinity so it shouldn't 
happen too often. Also it's only a temporary cost compared to the 
endless overhead of running forever non local or running forever with 
some cores idle.


Another strategy would be to tune the load balancer in the scheduler
for this case and make it only migrate in extreme situations.

Anyways it's not ideal either, but in my mind would be all preferable
to default CPU pinning.


I agree we need something dynamic, and that we need to tie cpu affinity 
and memory affinity together.


This could happen completely in the kernel (not an easy task), or by 
having a second-level scheduler in userspace polling for cpu usage and
rebalancing processes across numa nodes.  Given that with virtualization 
you have a few long lived processes, this does not seem too difficult.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-29 Thread Andi Kleen
On Sat, Nov 29, 2008 at 08:43:35PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> >It depends -- it's not necessarily an improvement. e.g. if it leads to
> >some CPUs being idle while others are oversubscribed because of the
> >pinning you typically lose more than you win. In general default
> >pinning is a bad idea in my experience.
> >
> >Alternative more flexible strategies:
> >
> >- Do a mapping from CPU to node at runtime by using getcpu()
> >- Migrate to complete nodes using migrate_pages when qemu detects
> >node migration on the host.
> >  
> 
> Wouldn't that cause lots of migrations?  Migrating a 1GB guest can take 

I assume you mean the second one (the two points were orthogonal)
The first one is an approximate method, also has advantages
and disadvantages.

> a huge amount of cpu time (tens or even hundreds of milliseconds?) 
> compared to very high frequency activity like the scheduler.

Yes migration is expensive, although you can do it on demand of course, 
but the scheduler typically has pretty strong cpu affinity so it shouldn't 
happen too often. Also it's only a temporary cost compared to the 
endless overhead of running forever non local or running forever with 
some cores idle.

Another strategy would be to tune the load balancer in the scheduler
for this case and make it only migrate in extreme situations.

Anyways it's not ideal either, but in my mind would be all preferable
to default CPU pinning.

-Andi

-- 
[EMAIL PROTECTED]


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-29 Thread Avi Kivity

Andi Kleen wrote:

It depends -- it's not necessarily an improvement. e.g. if it leads to
some CPUs being idle while others are oversubscribed because of the
pinning you typically lose more than you win. In general default
pinning is a bad idea in my experience.

Alternative more flexible strategies:

- Do a mapping from CPU to node at runtime by using getcpu()
- Migrate to complete nodes using migrate_pages when qemu detects
node migration on the host.
  


Wouldn't that cause lots of migrations?  Migrating a 1GB guest can take 
a huge amount of cpu time (tens or even hundreds of milliseconds?) 
compared to very high frequency activity like the scheduler.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-29 Thread Avi Kivity

Andre Przywara wrote:

The user (or better: management application) specifies the host nodes
the guest should use: -nodes 2,3 would create a two node guest mapped to
node 2 and 3 on the host. These numbers are handed over to libnuma:
VCPUs are pinned to the nodes and the allocated guest memory is bound to
its respective node. Since libnuma seems not to be installed
everywhere, the user has to enable this via configure --enable-numa
In the BIOS code an ACPI SRAT table was added, which describes the NUMA
topology to the guest. The number of nodes is communicated via the CMOS
RAM (offset 0x3E). If someone thinks of this as a bad idea, tell me.


There is now a firmware interface in qemu for this kind of
communication.



To take use of the new BIOS, install the iasl compiler
(http://acpica.org/downloads/) and type "make bios" before installing,
so the default BIOS will be replaced with the modified one.
Node over-committing is allowed (-nodes 0,0,0,0), omitting the -nodes
parameter reverts to the old behavior.


'-nodes' is too generic a name ('node' could also mean a host).  Suggest 
-numanode.


Need more flexibility: specify the range of memory per node, which cpus 
are in the node, relative weights for the SRAT table:


  -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3

Also need a monitor command to change host nodes dynamically:

(qemu) numanode 1 0


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-28 Thread Daniel P. Berrange
On Thu, Nov 27, 2008 at 11:23:21PM +0100, Andre Przywara wrote:
> Hi,
> 
> this patch series introduces multiple NUMA nodes support within KVM guests.
> This will improve the performance of guests which are bigger than one 
> node (number of VCPUs and/or amount of memory) and also allows better 
> balancing by making better use of each node's memory.
> It also improves the one node case by pinning a guest to this node and
> avoiding access of remote memory from one VCPU.
> 
> The user (or better: management application) specifies the host nodes
> the guest should use: -nodes 2,3 would create a two node guest mapped to
> node 2 and 3 on the host. These numbers are handed over to libnuma:
> VCPUs are pinned to the nodes and the allocated guest memory is bound to
> its respective node.

I'm wondering whether this is the right level of granularity/expressiveness.
It is basically encoding 3 pieces of information:

 - Number of NUMA nodes to expose to guest
 - Which host nodes to use
 - Which host nodes to pin vCPUS to.

The latter item can actually already be done by management applications
without a command line flag, with a greater level of flexibility than
this allows. In libvirt we start up KVM with -S, so it's initially
stopped, then run 'info cpus' in the monitor. This gives us the list of
thread IDs for each vCPU. We then use sched_setaffinity to control the
placement of each vCPU onto pCPUs. KVM could pick which host nodes to
use for allocation based on which nodes its vCPUs are pinned to.
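
For illustration, that pinning step could look roughly like the following;
the thread ID and CPU numbers are made-up example values, and in reality
they would come from 'info cpus' and the host topology:

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <stdio.h>

/* Pin one vcpu thread (by its Linux TID) to a set of host CPUs. */
static int pin_vcpu_thread(pid_t tid, const int *cpus, int ncpus)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int i = 0; i < ncpus; i++)
        CPU_SET(cpus[i], &mask);
    return sched_setaffinity(tid, sizeof(mask), &mask);
}

int main(void)
{
    pid_t vcpu0_tid = 12345;            /* example TID from "info cpus" */
    int node1_cpus[] = { 4, 5, 6, 7 };  /* example CPUs of the chosen host node */

    if (pin_vcpu_thread(vcpu0_tid, node1_cpus, 4) < 0)
        perror("sched_setaffinity");
    return 0;
}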

Since NUMA support is going to be optional, we can't rely on using
-nodes for CPU placement, and I'd rather not have to write different
codepaths for initial placement for NUMA vs non-NUMA enabled KVM.
People not using a mgmt tool may also choose to control host node 
placement using numactl to launch KVM. They would still need to be 
able to say how many nodes the guest is given.

Finally this CLI arg does not allow you to say which vCPU is placed 
in which vNUMA node, or how much of the guest's RAM is allocated to 
each guest node.

Thus I think it might be desirable to have the CLI argument focus
on describing the guest NUMA configuration, rather than having it
encode host & guest NUMA info in one go. Finally you'd also want
a way to describe vCPU <-> vNUMA node placement for vCPUs which
are not yet present - e.g. so you can start with 4 vCPUs and
hotplug-add another 12 later. You can't assume you want all 4 initial
CPUs in the same node, nor assume that you want all 4 spread evenly.

So some examples off the top of my head for alternate syntax for the
guest topology

 * Create 4 nodes, split RAM & 8 initial vCPUs equally across
   nodes, and 8 unplugged vCPUs equally too

-m 1024 -smp 8 -nodes 4

 * Create 4 nodes, split RAM equally across nodes, 8 initial vCPUs
   on first 2 nodes, and 8 unplugged vCPUs across other 2 nodes.

-m 1024 -smp 8 -nodes 4,cpu:0-3;4-7;8-11;12-15

 * Create 4 nodes, putting all RAM in first 2 nodes, split 8
   initial vCPUs equally across nodes

-m 1024 -smp 8 -nodes 4,mem:512;512

 * Create 4 nodes, putting all RAM in first 2 nodes, 8 initial vCPUs
   on first 2 nodes, and 8 unplugged vCPUs across other 2 nodes.

-m 1024 -smp 8 -nodes 4,mem:512;512,cpu:0-3;4-7;8-11;12-15

We could optionally also include host node pining for convenience

 * Create 4 nodes, putting all RAM in first 2 nodes, split 8
   initial vCPUs equally across nodes, pin to host nodes 5-8

-m 1024 -smp 8 -nodes 4,mem:512;512,pin:5;6;7;8

If no 'pin' is given, it could query its current host pCPU pinning to
determine which NUMA nodes it had been launched on.
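
A rough sketch of that fallback, assuming a libnuma that provides
numa_node_of_cpu():

#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>              /* numa_node_of_cpu(); link with -lnuma */
#include <stdio.h>

/* Mark every host node that our current CPU affinity mask touches. */
static void nodes_from_affinity(int *nodes, int max_nodes)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) < 0)
        return;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &mask))
            continue;
        int node = numa_node_of_cpu(cpu);
        if (node >= 0 && node < max_nodes)
            nodes[node] = 1;
    }
}

int main(void)
{
    int nodes[128] = { 0 };

    if (numa_available() < 0)
        return 1;
    nodes_from_affinity(nodes, 128);
    for (int n = 0; n < 128; n++)
        if (nodes[n])
            printf("may allocate from host node %d\n", n);
    return 0;
}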


> Since libnuma seems not to be installed
> everywhere, the user has to enable this via configure --enable-numa

It'd be nicer if the configure script just 'did the right thing'. So if
neither --enable-numa nor --disable-numa is given, it should probe for
availability and automatically enable it if found, disable if missing.
If --enable-numa is given, it should probe and abort if not found. If
--disable-numa is given it'd not enable anything.

Regards,
Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|


Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests

2008-11-28 Thread Andi Kleen
Andre Przywara <[EMAIL PROTECTED]> writes:

> It also improves the one node case by pinning a guest to this node and
> avoiding access of remote memory from one VCPU.

It depends -- it's not necessarily an improvement. e.g. if it leads to
some CPUs being idle while others are oversubscribed because of the
pinning you typically lose more than you win. In general default
pinning is a bad idea in my experience.

Alternative more flexible strategies:

- Do a mapping from CPU to node at runtime by using getcpu()
- Migrate to complete nodes using migrate_pages when qemu detects
node migration on the host.
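
A rough sketch of the second strategy, assuming the libnuma 2 API
(numa_migrate_pages() and the bitmask helpers); how qemu detects the node
migration in the first place is left out:

#define _GNU_SOURCE
#include <numa.h>              /* libnuma 2 API; link with -lnuma */
#include <unistd.h>
#include <stdio.h>

/* Move all of this process's pages that sit on old_node over to new_node.
 * Expensive, so it should only run once the vcpus have really settled on
 * the new node. */
static int follow_vcpus_to_node(int old_node, int new_node)
{
    struct bitmask *from = numa_bitmask_alloc(numa_num_possible_nodes());
    struct bitmask *to   = numa_bitmask_alloc(numa_num_possible_nodes());
    int ret;

    numa_bitmask_setbit(from, old_node);
    numa_bitmask_setbit(to, new_node);

    ret = numa_migrate_pages(getpid(), from, to);
    if (ret < 0)
        perror("numa_migrate_pages");

    numa_bitmask_free(from);
    numa_bitmask_free(to);
    return ret;
}

int main(void)
{
    if (numa_available() < 0)
        return 1;
    /* Example: the scheduler moved our vcpus from host node 0 to node 1. */
    return follow_vcpus_to_node(0, 1) < 0;
}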

-Andi

-- 
[EMAIL PROTECTED]