On Thu, 18 Aug 2016 10:42:08 -0400, Tejun Heo <t...@kernel.org> wrote:
> Hello, Michael.
>
> On Thu, Aug 18, 2016 at 11:30:51AM +0200, Michael Holzheu wrote:
> > Well, "no requirement" is not 100% correct. Currently we use
> > the CPU topology information to assign newly coming CPUs to the
> > "best fitting" node.
> >
> > Example:
> >
> > 1) We have two fake NUMA nodes N1 and N2 with the following CPU
> >    assignment:
> >
> >    - N1: cpu 1 on chip 1
> >    - N2: cpu 2 on chip 2
> >
> > 2) A new cpu 3 is configured that lives on chip 2
> > 3) We assign cpu 3 to N2
> >
> > We do this only if the nodes are balanced. If N2 had already one
> > more cpu than N1 we would assign the new cpu to N1.
>
> I see. Out of curiosity, what's the purpose of fakenuma on s390?
> There don't seem to be any actual memory locality concerns. Is it
> just to segment memory of a machine into multiple pieces?

Correct.

> If so, why
> is that necessary, do you hit some scalability issues w/o NUMA nodes?

Yes, we hit a scalability issue. Our performance team found that on
big (> 1 TB), overcommitted (memory / swap ratio > 1 : 2) systems we
see the following problems:

 - Zone locks are highly contended because ZONE_NORMAL is big:
   * zone->lock
   * zone->lru_lock
 - One kswapd is not enough for swapping

We hope that fake NUMA resolves those problems, because for each
node a separate memory subsystem is created with its own zone locks
and kswapd thread.

> As for the solution, if blind RR isn't good enough, although it sounds
> like it could given that the balancing wasn't all that strong to begin
> with, would it be an option to implement an interface which just
> requests a new CPU rather than a specific one and then pick one of the
> vacant possible CPUs considering node balancing?

IMHO this is a promising idea. To put it in my own words:

 - At boot time we already pin all remaining "not configured" logical
   CPUs to nodes. So all possible cpus are pinned to nodes and
   cpu_to_node() will work.
 - If a new physical cpu gets configured, we get the CPU topology
   information from the system and find the best node.
 - We get a logical cpu number from the node pool and assign the new
   physical cpu to that number.

If that works we would be as good as before. We will look into the
code to see whether it is possible.

Michael
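
P.S. To make the three steps above a bit more concrete, here is a rough
userspace model in Python. It is only a sketch of the idea, not kernel
code. The class and method names are made up for illustration, and the
round-robin pinning at boot and the "lowest node id wins a tie" rule are
my assumptions. The balancing rule is the one from the example: topology
decides only among the least-loaded nodes.

```python
from collections import defaultdict


class FakeNumaModel:
    """Toy model: logical CPU numbers are pinned to nodes up front."""

    def __init__(self, nr_nodes, nr_possible_cpus):
        self.nr_nodes = nr_nodes
        # Boot time: pin every possible logical CPU to a node (round-robin
        # here, which is an assumption), so a cpu_to_node()-style lookup
        # works even for CPUs that are not configured yet.
        self.cpu_to_node = {c: c % nr_nodes for c in range(nr_possible_cpus)}
        # Per-node pool of vacant logical CPU numbers.
        self.free = defaultdict(list)
        for cpu, node in self.cpu_to_node.items():
            self.free[node].append(cpu)
        # Per-node list of (logical cpu, chip) for configured CPUs.
        self.online = defaultdict(list)

    def best_node(self, chip):
        """Pick a node for a new physical CPU that lives on `chip`."""
        # Only nodes that still have a vacant logical CPU can be used.
        nodes = [n for n in range(self.nr_nodes) if self.free[n]]
        if not nodes:
            return None
        # Balancing first: consider only the least-loaded nodes.
        least = min(len(self.online[n]) for n in nodes)
        candidates = [n for n in nodes if len(self.online[n]) == least]
        # Topology second: prefer a node that already has a CPU on this chip.
        for n in candidates:
            if any(c == chip for _, c in self.online[n]):
                return n
        # No topology match: fall back to the lowest node id (assumption).
        return candidates[0]

    def configure(self, chip):
        """Configure a new physical CPU; return (logical cpu, node)."""
        node = self.best_node(chip)
        if node is None:
            raise RuntimeError("no vacant logical CPU left")
        cpu = self.free[node].pop(0)  # take a number from the node pool
        self.online[node].append((cpu, chip))
        return cpu, node
```

With two nodes and four possible CPUs, configuring CPUs on chip 1,
chip 2 and chip 2 again puts the third CPU on the same node as the
second one, because the nodes are balanced at that point. A fourth CPU
on chip 2 then goes to the other node, since that one is less loaded.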