[PATCH][ 5/5] sched: document sd cpu scheduler
Add comprehensive documentation of the Staircase Deadline cpu scheduler design. Signed-off-by: Con Kolivas <[EMAIL PROTECTED]> --- Documentation/sched-design.txt | 240 +++-- 1 file changed, 234 insertions(+), 6 deletions(-) Index: linux-2.6.21-rc5-sd/Documentation/sched-design.txt === --- linux-2.6.21-rc5-sd.orig/Documentation/sched-design.txt 2006-11-30 11:30:31.0 +1100 +++ linux-2.6.21-rc5-sd/Documentation/sched-design.txt 2007-03-27 11:52:55.0 +1000 @@ -1,11 +1,14 @@ - Goals, Design and Implementation of the - new ultra-scalable O(1) scheduler + Goals, Design and Implementation of the ultra-scalable O(1) scheduler by + Ingo Molnar and theStaircase Deadline cpu scheduler policy designed by + Con Kolivas. - This is an edited version of an email Ingo Molnar sent to - lkml on 4 Jan 2002. It describes the goals, design, and - implementation of Ingo's new ultra-scalable O(1) scheduler. - Last Updated: 18 April 2002. + This was originally an edited version of an email Ingo Molnar sent to + lkml on 4 Jan 2002. It describes the goals, design, and implementation + of Ingo's ultra-scalable O(1) scheduler. It now contains a description + of the Staircase Deadline priority scheduler that was built on this + design. + Last Updated: Tue Mar 27 2007 Goal @@ -163,3 +166,228 @@ certain code paths and data constructs. code is smaller than the old one. Ingo + + +Staircase Deadline cpu scheduler policy + + +Design summary +== + +A novel design which incorporates a foreground-background descending priority +system (the staircase) via a bandwidth allocation matrix according to nice +level. + + +Features + + +A starvation free, strict fairness O(1) scalable design with interactivity +as good as the above restrictions can provide. There is no interactivity +estimator, no sleep/run measurements and only simple fixed accounting. +The design has strict enough a design and accounting that task behaviour +can be modelled and maximum scheduling latencies can be predicted by +the virtual deadline mechanism that manages runqueues. The prime concern +in this design is to maintain fairness at all costs determined by nice level, +yet to maintain as good interactivity as can be allowed within the +constraints of strict fairness. + + +Design description +== + +SD works off the principle of providing each task a quota of runtime that it is +allowed to run at a number of priority levels determined by its static priority +(ie. its nice level). If the task uses up its quota it has its priority +decremented to the next level determined by a priority matrix. Once every +runtime quota has been consumed of every priority level, a task is queued on the +"expired" array. When no other tasks exist with quota, the expired array is +activated and fresh quotas are handed out. This is all done in O(1). + +Design details +== + +Each task keeps a record of its own entitlement of cpu time. Most of the rest of +these details apply to non-realtime tasks as rt task management is straight +forward. + +Each runqueue keeps a record of what major epoch it is up to in the +rq->prio_rotation field which is incremented on each major epoch. It also +keeps a record of the current prio_level for each static priority task. + +Each task keeps a record of what major runqueue epoch it was last running +on in p->rotation. It also keeps a record of what priority levels it has +already been allocated quota from during this epoch in a bitmap p->bitmap. + +The only tunable that determines all other details is the RR_INTERVAL. This +is set to 8ms, and is scaled gently upwards with more cpus. This value is +tunable via a /proc interface. + +All tasks are initially given a quota based on RR_INTERVAL. This is equal to +RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and +progressively larger for nice values from -1 to -20. This is assigned to +p->quota and only changes with changes in nice level. + +As a task is first queued, it checks in recalc_task_prio to see if it has run at +this runqueue's current priority rotation. If it has not, it will have its +p->prio level set according to the first slot in a "priority matrix" and will be +given a p->time_slice equal to the p->quota, and has its allocation bitmap bit +set in p->bitmap for this prio level. It is then queued on the current active +priority array. + +If a task has already been running during this major epoch, and it has +p->time_slice left and the rq->prio_quota for the task's p->prio still +has quota, it will be placed back on the active array, but no more quota +will be added. + +If a task has been running during this major epoch, but does not have +p->time_slice left, it will find the next lowest
[PATCH][ 5/5] sched: document sd cpu scheduler
Add comprehensive documentation of the Staircase Deadline cpu scheduler design. Signed-off-by: Con Kolivas [EMAIL PROTECTED] --- Documentation/sched-design.txt | 240 +++-- 1 file changed, 234 insertions(+), 6 deletions(-) Index: linux-2.6.21-rc5-sd/Documentation/sched-design.txt === --- linux-2.6.21-rc5-sd.orig/Documentation/sched-design.txt 2006-11-30 11:30:31.0 +1100 +++ linux-2.6.21-rc5-sd/Documentation/sched-design.txt 2007-03-27 11:52:55.0 +1000 @@ -1,11 +1,14 @@ - Goals, Design and Implementation of the - new ultra-scalable O(1) scheduler + Goals, Design and Implementation of the ultra-scalable O(1) scheduler by + Ingo Molnar and theStaircase Deadline cpu scheduler policy designed by + Con Kolivas. - This is an edited version of an email Ingo Molnar sent to - lkml on 4 Jan 2002. It describes the goals, design, and - implementation of Ingo's new ultra-scalable O(1) scheduler. - Last Updated: 18 April 2002. + This was originally an edited version of an email Ingo Molnar sent to + lkml on 4 Jan 2002. It describes the goals, design, and implementation + of Ingo's ultra-scalable O(1) scheduler. It now contains a description + of the Staircase Deadline priority scheduler that was built on this + design. + Last Updated: Tue Mar 27 2007 Goal @@ -163,3 +166,228 @@ certain code paths and data constructs. code is smaller than the old one. Ingo + + +Staircase Deadline cpu scheduler policy + + +Design summary +== + +A novel design which incorporates a foreground-background descending priority +system (the staircase) via a bandwidth allocation matrix according to nice +level. + + +Features + + +A starvation free, strict fairness O(1) scalable design with interactivity +as good as the above restrictions can provide. There is no interactivity +estimator, no sleep/run measurements and only simple fixed accounting. +The design has strict enough a design and accounting that task behaviour +can be modelled and maximum scheduling latencies can be predicted by +the virtual deadline mechanism that manages runqueues. The prime concern +in this design is to maintain fairness at all costs determined by nice level, +yet to maintain as good interactivity as can be allowed within the +constraints of strict fairness. + + +Design description +== + +SD works off the principle of providing each task a quota of runtime that it is +allowed to run at a number of priority levels determined by its static priority +(ie. its nice level). If the task uses up its quota it has its priority +decremented to the next level determined by a priority matrix. Once every +runtime quota has been consumed of every priority level, a task is queued on the +expired array. When no other tasks exist with quota, the expired array is +activated and fresh quotas are handed out. This is all done in O(1). + +Design details +== + +Each task keeps a record of its own entitlement of cpu time. Most of the rest of +these details apply to non-realtime tasks as rt task management is straight +forward. + +Each runqueue keeps a record of what major epoch it is up to in the +rq-prio_rotation field which is incremented on each major epoch. It also +keeps a record of the current prio_level for each static priority task. + +Each task keeps a record of what major runqueue epoch it was last running +on in p-rotation. It also keeps a record of what priority levels it has +already been allocated quota from during this epoch in a bitmap p-bitmap. + +The only tunable that determines all other details is the RR_INTERVAL. This +is set to 8ms, and is scaled gently upwards with more cpus. This value is +tunable via a /proc interface. + +All tasks are initially given a quota based on RR_INTERVAL. This is equal to +RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and +progressively larger for nice values from -1 to -20. This is assigned to +p-quota and only changes with changes in nice level. + +As a task is first queued, it checks in recalc_task_prio to see if it has run at +this runqueue's current priority rotation. If it has not, it will have its +p-prio level set according to the first slot in a priority matrix and will be +given a p-time_slice equal to the p-quota, and has its allocation bitmap bit +set in p-bitmap for this prio level. It is then queued on the current active +priority array. + +If a task has already been running during this major epoch, and it has +p-time_slice left and the rq-prio_quota for the task's p-prio still +has quota, it will be placed back on the active array, but no more quota +will be added. + +If a task has been running during this major epoch, but does not have +p-time_slice left, it will find the next lowest priority in its bitmap