[tip:numa/core] sched, numa, mm: Describe the NUMA scheduling problem formally

2012-11-19 Thread tip-bot for Peter Zijlstra
Commit-ID:  49e0b1784791875787a198a00776ee118cd2233c
Gitweb: http://git.kernel.org/tip/49e0b1784791875787a198a00776ee118cd2233c
Author: Peter Zijlstra 
AuthorDate: Fri, 12 Oct 2012 12:13:10 +0200
Committer:  Ingo Molnar 
CommitDate: Mon, 19 Nov 2012 02:15:39 +0100

sched, numa, mm: Describe the NUMA scheduling problem formally

This is probably a first: formal description of a complex high-level
computing problem, within the kernel source.

Signed-off-by: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: Andrew Morton 
Cc: "H. Peter Anvin" 
Cc: Mike Galbraith 
Cc: Rik van Riel 
Cc: Hugh Dickins 
Link: http://lkml.kernel.org/n/tip-mmnlpupoetcatimvjeld1...@git.kernel.org
[ Next step: generate the kernel source from such formal descriptions and 
retire to a tropical island! ]
Signed-off-by: Ingo Molnar 
---
 Documentation/scheduler/numa-problem.txt | 230 +++
 1 file changed, 230 insertions(+)

diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
new file mode 100644
index 000..a5d2fee
--- /dev/null
+++ b/Documentation/scheduler/numa-problem.txt
@@ -0,0 +1,230 @@
+
+
+Effective NUMA scheduling problem statement, described formally:
+
+ * minimize interconnect traffic
+
+For each task 't_i' we have memory; this memory can be spread over multiple
+physical nodes. Let us denote this as 'p_i,k': the memory task 't_i' has on
+node 'k', in [pages].
+
+If a task shares memory with another task, let us denote this as 's_i,k':
+the memory shared between tasks including 't_i' that resides on node 'k'.
+
+Let 'M' be the distribution that governs all 'p' and 's', i.e. the page placement.
+
+Similarly, let's define 'fp_i,k' and 'fs_i,k' respectively as the (average)
+usage frequency over those memory regions [1/s], such that the products
+'bp_i,k = p_i,k * fp_i,k' and 'bs_i,k = s_i,k * fs_i,k' give (average)
+bandwidths in [pages/s].
+
+(note: multiple tasks sharing memory naturally avoid duplicate accounting
+   because each task will have its own access frequency 'fs')
+
+(pjt: I think this frequency is more numerically consistent if you explicitly 
+  restrict p/s above to be the working-set. (It also makes explicit the 
+  requirement for <C0,M0> to change upon a change in the working set.)
+
+  Doing this does have the nice property that it lets you use your frequency
+  measurement as a weak-ordering for the benefit a task would receive when
+  we can't fit everything.
+
+  e.g. task1 has working set 10mb, f=90%
+   task2 has working set 90mb, f=10%
+
+  Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
+  from task1 being on the right node than task2. )
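+
+As a purely illustrative aid (not part of the formal statement), the bandwidth
+terms above could be tabulated along the following lines; the array names,
+sizes and plain userspace C are placeholders for whatever bookkeeping an
+implementation would actually use:
+
+  #define NR_TASKS 64
+  #define NR_NODES 4
+
+  /* per-task, per-node page counts and access frequencies, as defined above */
+  static double p[NR_TASKS][NR_NODES];    /* private pages of t_i on node k  */
+  static double s[NR_TASKS][NR_NODES];    /* shared pages of t_i on node k   */
+  static double fp[NR_TASKS][NR_NODES];   /* access frequency over p, [1/s]  */
+  static double fs[NR_TASKS][NR_NODES];   /* access frequency over s, [1/s]  */
+
+  /* bandwidths bp_i,k and bs_i,k, in [pages/s] */
+  static double bp[NR_TASKS][NR_NODES];
+  static double bs[NR_TASKS][NR_NODES];
+
+  static void compute_bandwidths(void)
+  {
+          int i, k;
+
+          for (i = 0; i < NR_TASKS; i++) {
+                  for (k = 0; k < NR_NODES; k++) {
+                          bp[i][k] = p[i][k] * fp[i][k];
+                          bs[i][k] = s[i][k] * fs[i][k];
+                  }
+          }
+  }
+
+(plugging in pjt's numbers: 10mb * 0.9/s and 90mb * 0.1/s both come out to
+9mb/s, which is exactly why the frequency serves as the weak ordering.)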
+
+Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
+
+  C: t_i -> {c_i, n_i}
+
+This gives us the total interconnect traffic between nodes 'k' and 'l',
+'T_k,l', as:
+
+  T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum_j bp_j,k + bs_j,k  where n_i == k, n_j == l
+
+And our goal is to obtain C0 and M0 such that:
+
+  T_k,l(C0, M0) <= T_k,l(C, M) for all C, M where k != l
+
+(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
+   on node 'l' from node 'k'; this would be useful for bigger NUMA systems
+
+ pjt: I agree nice to have, but intuition suggests diminishing returns on more
+  usual systems given factors like Haswell's enormous 35MB L3
+  cache and QPI being able to do a direct fetch.)
+
+(note: do we need a limit on the total memory per node?)
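+
+To make the traffic term concrete, here is a sketch of how 'T_k,l' could be
+evaluated, reusing the (hypothetical) arrays from the earlier sketch plus a
+'node_of[]' array standing in for the 'n_i' part of the map 'C':
+
+  static int node_of[NR_TASKS];            /* n_i: node task t_i runs on */
+
+  /* T_k,l: bandwidth of tasks on node k touching pages on node l, plus
+   * the mirror term for tasks on node l touching pages on node k */
+  static double interconnect_traffic(int k, int l)
+  {
+          double T = 0.0;
+          int i;
+
+          for (i = 0; i < NR_TASKS; i++) {
+                  if (node_of[i] == k)
+                          T += bp[i][l] + bs[i][l];
+                  else if (node_of[i] == l)
+                          T += bp[i][k] + bs[i][k];
+          }
+          return T;
+  }
+
+A per-hop cost function 'nc(k,l)' as suggested in the note above would simply
+scale each contribution.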
+
+
+ * fairness
+
+For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
+'c_n' has a compute capacity 'P_n'. Again using our map 'C', we can formulate a
+load 'L_n':
+
+  L_n = 1/P_n * \Sum_i w_i for all c_i = n
+
+Using that we can formulate a load difference between CPUs:
+
+  L_n,m = | L_n - L_m |
+
+Which allows us to state the fairness goal as:
+
+  L_n,m(C0) <= L_n,m(C) for all C, n != m
+
+(pjt: It can also be usefully stated that, having converged at C0:
+
+   | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
+
+  Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
+  the "worst" partition we should accept; but having it gives us a useful 
+  bound on how much we can reasonably adjust L_n/L_m at a Pareto point to 
+  favor T_n,m. )
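+
+Again purely as an illustration, the load terms could be evaluated as follows,
+with 'cpu_of[]', 'w[]' and 'P[]' being stand-ins for the 'c_i' part of the map
+'C', the task weights and the per-cpu capacities:
+
+  #define NR_CPUS 8
+
+  static int    cpu_of[NR_TASKS];          /* c_i: cpu task t_i runs on     */
+  static double w[NR_TASKS];               /* w_i: nice-derived weight      */
+  static double P[NR_CPUS];                /* P_n: per-cpu compute capacity */
+
+  static double cpu_load(int n)            /* L_n = 1/P_n * \Sum_i w_i      */
+  {
+          double sum = 0.0;
+          int i;
+
+          for (i = 0; i < NR_TASKS; i++) {
+                  if (cpu_of[i] == n)
+                          sum += w[i];
+          }
+          return sum / P[n];
+  }
+
+  static double load_diff(int n, int m)    /* L_n,m = | L_n - L_m |         */
+  {
+          double d = cpu_load(n) - cpu_load(m);
+
+          return d < 0 ? -d : d;
+  }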
+
+Together they give us the complete multi-objective optimization problem:
+
+  min_C,M [ L_n,m(C), T_k,l(C,M) ]
+
+
+
+Notes:
+
+ - the memory bandwidth problem is very much an inter-process problem; in
+   particular, there is no such concept as a process in the above formulation.
+
+ - the naive solution would completely prefer fairness over interconnect
+   traffic; the more complicated solution could pick another Pareto point using
+   an aggregate objective function, such that we balance the loss of work
+   efficiency against the gain of running, and we'd want to more or less
+   suggest there be a fixed bound on the error from the Pareto line for any
+   such solution (a rough sketch of such an aggregate objective follows below).
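+
+One very rough sketch of such an aggregate objective, building on the earlier
+sketches and scalarizing the two goals with a hypothetical knob 'alpha'
+(nothing above prescribes this particular form; it merely illustrates picking
+a single Pareto point):
+
+  /* J(alpha) = alpha * \Sum_n<m L_n,m  +  (1 - alpha) * \Sum_k!=l T_k,l */
+  static double aggregate_objective(double alpha)
+  {
+          double load_term = 0.0, traffic_term = 0.0;
+          int n, m, k, l;
+
+          for (n = 0; n < NR_CPUS; n++) {
+                  for (m = n + 1; m < NR_CPUS; m++)
+                          load_term += load_diff(n, m);
+          }
+
+          for (k = 0; k < NR_NODES; k++) {
+                  for (l = 0; l < NR_NODES; l++) {
+                          if (k != l)
+                                  traffic_term += interconnect_traffic(k, l);
+                  }
+          }
+
+          return alpha * load_term + (1.0 - alpha) * traffic_term;
+  }
+
+Minimizing J over all placements (C, M) for a fixed 'alpha' selects one point
+on the Pareto front; different 'alpha' values trade fairness against traffic.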
