[tip:numa/core] sched, numa, mm: Describe the NUMA scheduling problem formally
Commit-ID:  49e0b1784791875787a198a00776ee118cd2233c
Gitweb:     http://git.kernel.org/tip/49e0b1784791875787a198a00776ee118cd2233c
Author:     Peter Zijlstra
AuthorDate: Fri, 12 Oct 2012 12:13:10 +0200
Committer:  Ingo Molnar
CommitDate: Mon, 19 Nov 2012 02:15:39 +0100

sched, numa, mm: Describe the NUMA scheduling problem formally

This is probably a first: formal description of a complex high-level
computing problem, within the kernel source.

Signed-off-by: Peter Zijlstra
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: "H. Peter Anvin"
Cc: Mike Galbraith
Cc: Rik van Riel
Cc: Hugh Dickins
Link: http://lkml.kernel.org/n/tip-mmnlpupoetcatimvjeld1...@git.kernel.org
[ Next step: generate the kernel source from such formal descriptions
  and retire to a tropical island! ]
Signed-off-by: Ingo Molnar
---
 Documentation/scheduler/numa-problem.txt | 230 +++
 1 file changed, 230 insertions(+)

diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
new file mode 100644
index 000..a5d2fee
--- /dev/null
+++ b/Documentation/scheduler/numa-problem.txt
@@ -0,0 +1,230 @@
+
+
+Effective NUMA scheduling problem statement, described formally:
+
+ * minimize interconnect traffic
+
+For each task 't_i' we have memory; this memory can be spread over multiple
+physical nodes. Let us denote this as 'p_i,k', the memory task 't_i' has on
+node 'k' in [pages].
+
+If a task shares memory with another task, let us denote this as
+'s_i,k', the memory shared between tasks including 't_i' residing on node
+'k'.
+
+Let 'M' be the distribution that governs all 'p' and 's', i.e. the page placement.
+
+Similarly, let's define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
+frequency over those memory regions [1/s] such that the product gives an
+(average) bandwidth 'bp' and 'bs' in [pages/s].
+
+(note: multiple tasks sharing memory naturally avoid duplicate accounting
+ because each task will have its own access frequency 'fs')
+
+(pjt: I think this frequency is more numerically consistent if you explicitly
+ restrict p/s above to be the working-set. (It also makes explicit the
+ requirement for <C0,M0> to change about a change in the working set.)
+
+ Doing this does have the nice property that it lets you use your frequency
+ measurement as a weak-ordering for the benefit a task would receive when
+ we can't fit everything.
+
+ e.g. task1 has working set 10mb, f=90%
+      task2 has working set 90mb, f=10%
+
+ Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
+ from task1 being on the right node than task2. )
+
+Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
+
+  C: t_i -> {c_i, n_i}
+
+This gives us the total interconnect traffic between nodes 'k' and 'l',
+'T_k,l', as:
+
+  T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum_j bp_j,k + bs_j,k
+          where n_i == k, n_j == l
+
+And our goal is to obtain C0 and M0 such that:
+
+  T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l
+
+(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
+ on node 'l' from node 'k', this would be useful for bigger NUMA systems
+
+ pjt: I agree it's nice to have, but intuition suggests diminishing returns on
+ more usual systems given factors like Haswell's enormous 35mb L3 cache and
+ QPI being able to do a direct fetch.)
+
+(note: do we need a limit on the total memory per node?)
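To make the traffic term concrete, here is a small illustrative Python sketch
of 'T_k,l' under the map 'C' above. It is not kernel code; the task names,
per-node bandwidths, and node assignments are invented numbers, and the
per-task 'bp'/'bs' dictionaries stand in for the bandwidth products defined
earlier:

```python
# Toy model of T_k,l; names and numbers are made up, not kernel values.
# Each task carries per-node bandwidths bp_i,k and bs_i,k in [pages/s].

def interconnect_traffic(tasks, node_of, k, l):
    r"""T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum_j bp_j,k + bs_j,k
        where n_i == k, n_j == l"""
    t = 0
    for name, bw in tasks.items():
        if node_of[name] == k:      # task on node k pulling from node l
            t += bw["bp"].get(l, 0) + bw["bs"].get(l, 0)
        elif node_of[name] == l:    # task on node l pulling from node k
            t += bw["bp"].get(k, 0) + bw["bs"].get(k, 0)
    return t

tasks = {
    "t1": {"bp": {0: 100, 1: 50}, "bs": {1: 10}},
    "t2": {"bp": {0: 20},         "bs": {0: 5}},
}
node_of = {"t1": 0, "t2": 1}        # the map C: t_i -> n_i

# t1 (on node 0) pulls 50+10 from node 1; t2 (on node 1) pulls 20+5 from node 0
print(interconnect_traffic(tasks, node_of, 0, 1))   # -> 85
```

The local bandwidths (t1's 100 pages/s on its own node 0) never cross the
interconnect, which is exactly what the 'n_i == k' / 'n_j == l' conditions in
the definition express.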
+
+
+ * fairness
+
+For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
+'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a
+load 'L_n':
+
+  L_n = 1/P_n * \Sum_i w_i for all c_i = n
+
+using that we can formulate a load difference between CPUs
+
+  L_n,m = | L_n - L_m |
+
+Which allows us to state the fairness goal like:
+
+  L_n,m(C0) =< L_n,m(C) for all C, n != m
+
+(pjt: It can also be usefully stated that, having converged at C0:
+
+   | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
+
+ Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
+ the "worst" partition we should accept; but having it gives us a useful
+ bound on how much we can reasonably adjust L_n/L_m at a Pareto point to
+ favor T_n,m. )
+
+Together they give us the complete multi-objective optimization problem:
+
+  min_C,M [ L_n,m(C), T_k,l(C,M) ]
+
+
+
+Notes:
+
+ - the memory bandwidth problem is very much an inter-process problem, in
+   particular there is no such concept as a process in the above problem.
+
+ - the naive solution would completely prefer fairness over interconnect
+   traffic, the more complicated solution could pick another Pareto point using
+   an aggregate objective function such that we balance the loss of work
+   efficiency against the gain of running, we'd want to more or less suggest
+   there to be a fixed bound on the error from the Pareto line for any
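The load and fairness terms can likewise be sketched in a few lines of
illustrative (non-kernel) Python; the weights and capacities below are
invented numbers, chosen only to show the arithmetic:

```python
# Toy model of L_n and L_n,m; weights and capacities are invented numbers.

def load(weights, capacity):
    r"""L_n = 1/P_n * \Sum_i w_i  for all tasks with c_i = n"""
    return sum(weights) / capacity

L0 = load([1024, 512], 2.0)     # two tasks on cpu 0, capacity P_0 = 2
L1 = load([2048], 2.0)          # one task on cpu 1,  capacity P_1 = 2
diff = abs(L0 - L1)             # L_n,m = | L_n - L_m |

print(L0, L1, diff)             # -> 768.0 1024.0 256.0
```

The complete problem then minimizes this per-pair difference jointly with the
traffic term, which is what the multi-objective 'min_C,M' formulation above
states.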