As discussed previously IRL, I'm happy with the design, so LGTM; I just have a few nitpicks on the document.
On Thursday, August 25, 2016 at 2:00:32 PM UTC+1, Federico Pareschi wrote:
>
> This design document introduces a new Ganeti job scheduler that
> prioritizes jobs in the queue based on their lock declaration and state
> of the cluster.
>
> Signed-off-by: Federico Morg Pareschi <[email protected]>
> ---
>  Makefile.am                     |   1 +
>  doc/design-predictive-queue.rst | 213 ++++++++++++++++++++++++++++++++++++++++
>  doc/index.rst                   |   1 +
>  3 files changed, 215 insertions(+)
>  create mode 100644 doc/design-predictive-queue.rst
>
> diff --git a/Makefile.am b/Makefile.am
> index 3c67c5c..8cd9f36 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -729,6 +729,7 @@ docinput = \
>   doc/design-partitioned.rst \
>   doc/design-plain-redundancy.rst \
>   doc/design-performance-tests.rst \
> + doc/design-predictive-queue.rst \
>   doc/design-query-splitting.rst \
>   doc/design-query2.rst \
>   doc/design-query-splitting.rst \
> diff --git a/doc/design-predictive-queue.rst b/doc/design-predictive-queue.rst
> new file mode 100644
> index 0000000..460b12e
> --- /dev/null
> +++ b/doc/design-predictive-queue.rst
> @@ -0,0 +1,213 @@
> +==================================
> +Predictive Queue System for Ganeti
> +==================================
> +
> +.. contents:: :depth: 4
> +
> +This design document describes the introduction of a new queue scheduling
> +system for Ganeti jobs based on the amount of contention and resources
> +needed by each newly submitted job.
> +
> +Current State and shortcomings
> +==============================
> +
> +Currently, jobs in the Ganeti system are placed in a queue and
> +subsequently set to run by the job scheduler. A job can be in different
> +mutually exclusive states such as QUEUED, WAITING, or RUNNING depending
> +on its position in the scheduling pipeline. Jobs can also reach the
> +final state of SUCCESS, ERROR or CANCELED, but that is not relevant for
> +the scope of this document.
> +Among the jobs that are QUEUED, WAITING or RUNNING, the latter two
> +states are considered by the system as 'running', while the former as
> +'pending'.

I think "mutually exclusive" is implied; having a job in multiple states
would be quite an unusual design. Maybe a state diagram would be nice
here? (If it's too much hassle in ASCII art, I don't insist, just an
idea).

> +
> +Ganeti has an option to limit the amount of parallel running jobs
> +(either WAITING or RUNNING), and when such a limit is reached, all the
> +newly submitted ones will get stuck in the queue waiting for their turn
> +to be executed. As each job transitions from RUNNING to a final state,
> +a new spot in the running job queue opens and the job scheduler is
> +tasked to pick one of the pending ones to be started next.
> +
> +As a job transitions from the pending to the running queue, it gets
> +started by the job executor, which initializes the logical unit mcpu,
> +which in turn tries to acquire the necessary locks on shared resources.
> +If the lock acquisition fails because the locks are already held by
> +other jobs, the opcode transitions into a WAITING state and gets stuck
> +until such locks are released.
> +
> +At the time of writing, the queue scheduler picks which pending job to
> +execute next based on the following parameters: a priority value, a
> +reason rate limiter, an eligibility status based on dependencies on
> +other jobs, and a series of filters. After all these checks are
> +performed, the first job on the list is picked on a
> +first-come-first-serve basis. In a typical usage scenario, without any
> +special cases, the first job to be scheduled to run is usually the
> +first job that was put in the queue. Each job is executed in order.
> +
> +A flaw in this system is that jobs are not scheduled to run based on the

It isn't a flaw per se, just some inefficiency.

> +resources they need.
> +This means that in many cases, jobs from the pending queue transition
> +to a WAITING state and then are blocked again waiting for locks.
> +Meanwhile, other jobs in the pending queue could be eligible to run
> +without any additional resources, but cannot because the running queue
> +is already full.
> +
> +Proposed changes
> +================
> +
> +This design proposes a new way of choosing which jobs to run from the
> +pending queue based on the resources currently held by all the other
> +running jobs and the locks required by each new yet-to-be-executed job.
> +On top of the already-existing filtering parameters, the scheduler will
> +also sort the queue of pending jobs based on a heuristic value
> +calculated by comparing each job in the queue with the sum of all the
> +other jobs already running in the system.
> +
> +Since this change introduces a more unpredictable queue system, there
> +may be cases where queued jobs fall into starvation as less
> +resource-intensive jobs get scheduled in front of them and take their
> +spot in the queue. To solve this problem, we also introduce a
> +customizable parameter that determines the aging of a pending job,
> +which will then affect the heuristic calculations.
> +
> +Design decisions
> +================
> +
> +The addition of a predictive scheduler is a small change on top of the
> +already-existing queue. It will sit right in between the priority and
> +filtering parameters for the queue, and it is a simple sort operation
> +which will not impact the current codebase by a lot. For all intents
> +and purposes, this change will not impact any of the previously
> +existing filtering parameters such as reason rate limiting, job
> +filtering, priority values or eligibility checks.
> +
> +The algorithm relies on a list of static resource declarations for each
> +opcode. Each job is assigned a weight by comparing its own locks to
> +those currently being held by the system.
> +The higher the value, the higher the chance that the job will be stuck
> +in WAITING over some shared resource. Due to how the locking system is
> +implemented in Ganeti, it is currently impossible to know with
> +certainty the exact amount of resources that each job will request
> +before its transition from the pending to the running queue. As a
> +solution to this problem, we define a level of uncertainty and a
> +heuristic function to guesstimate the likelihood of a job not getting
> +blocked.

Maybe it would be worth mentioning that the more certainty we have about
jobs, the better this algorithm works (and generally the better Ganeti
works as a job workflow management system), so we should keep that in
mind when implementing jobs.

> +
> +Job Promotion Algorithm
> +-----------------------
> +
> +The job locks in Ganeti are separated into 5 different levels (plus the
> +Big Ganeti Lock): NodeGroup, Instance, Node, NodeRes, and Network. Each
> +of these levels can take however many locks they need, independently of
> +the other levels, in either shared or exclusive mode. The following are
> +the possible lock types a job can declare for each level:

The levels don't take locks; jobs take locks at different levels (the
levels are independent and don't affect each other).

> +
> + * None: No locks are required at this level at all;
> +
> + * Shared(k): The system is aware of which locks are being requested in
> +   shared mode. K is the list of all the resource names.
> +
> + * UnknownShared: The system knows that there is at least one lock
> +   requested as shared, but it has no way of knowing which or how many.
> +
> + * AllShared: All locks at this level are required as shared.
> +
> + * Exclusive(k): The system is aware of which locks are being requested
> +   in exclusive mode. K is the list of all resource names.
> +
> + * UnknownExclusive: The system knows that there is at least one lock
> +   requested as exclusive, but it has no way of knowing which or how
> +   many.
> +
> + * AllExclusive: All locks at this level are required as exclusive.
> +
> +Those few jobs that require the BGL are considered the absolute worst
> +weight scenario possible and will most likely be scheduled at the back
> +of the queue, as they require the entire set of resources.
> +
> +We define a comparison operation between two lock types. The first
> +operand is the lock of a job in the pending queue, whereas the second
> +operand is the currently-existing lock already taken by a running job.
> +The operation behaves as follows:
> +
> ++------------------+------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +|                  | None | Shared(j)              | UnknownShared | AllShared | Exclusive(j)           | UnknownExclusive | AllExclusive |
> ++==================+======+========================+===============+===========+========================+==================+==============+
> +| None             | 0    | 0                      | 0             | 0         | 0                      | 0                | 0            |
> ++------------------+------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| Shared(i)        | 0.3  | 0                      | 0             | 0         | if j=i then 3 else 0.3 | 1.5              | 3            |
> ++------------------+------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| UnknownShared    | 0.3  | 0.3                    | 0.3           | 0.3       | 1.5                    | 1.5              | 3            |
> ++------------------+------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| AllShared        | 0.3  | 0.3                    | 0.3           | 0.3       | 3                      | 3                | 3            |
> ++------------------+------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| Exclusive(i)     | 0.5  | if j=i then 3 else 0.5 | 1.5           | 3         | if j=i then 3 else 0.5 | 1.5              | 3            |
> ++------------------+------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| UnknownExclusive | 0.5  | 1.5                    | 1.5           | 3         | 1.5                    | 1.5              | 3            |
> ++------------------+------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| AllExclusive     | 0.5  | 3                      | 3             | 3         | 3                      | 3                | 3            |
> ++------------------+------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +

Do you think we need the entire table? Since we are using the actual
locks taken by running jobs, there should be no uncertainty in at least
one of the dimensions (fewer columns in this case, if I am correct).

> +The weight values are defined as such:
> +
> + * 0: There is no contention, and no contention is added to the state
> +   of the cluster.
> +
> + * 0.3: There should be no contention; however, the likelihood of
> +   future contention over resources in the cluster is increased.
> +
> + * 0.5: There should be no contention; however, the likelihood of
> +   future contention over resources in the cluster is greatly
> +   increased.
> +
> + * 1.5: There is a chance that the job will get stuck, but there is no
> +   certain way of knowing it.
> +
> + * 3: The job will very likely get stuck in WAITING.

Isn't that rather certain? If both jobs take the same exclusive locks,
then they certainly block each other.

> +
> +For each of the pending jobs, their locks are compared using this
> +operation with each of the locks of the running jobs. After this
> +comparison, for each level the largest number is picked as the "worst
> +case" for that specific level and used in the heuristic formula to
> +calculate the final weight of the job.
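For clarity, here is how I read the table and the per-level "worst case"
as runnable Python. All names here (`compare`, `level_weight`, the kind
strings, the `(kind, names)` tuple shape) are mine, not from the patch;
treat it as an executable sketch of the table, not the implementation:

```python
KINDS = ("None", "Shared", "UnknownShared", "AllShared",
         "Exclusive", "UnknownExclusive", "AllExclusive")

# Sentinel for the "if j=i then 3 else base" cells of the table.
MATCH = "match"

# Row: the pending job's declaration; column: the running job's.
TABLE = {
    "None":             dict.fromkeys(KINDS, 0.0),
    "Shared":           {"None": 0.3, "Shared": 0.0, "UnknownShared": 0.0,
                         "AllShared": 0.0, "Exclusive": MATCH,
                         "UnknownExclusive": 1.5, "AllExclusive": 3.0},
    "UnknownShared":    {"None": 0.3, "Shared": 0.3, "UnknownShared": 0.3,
                         "AllShared": 0.3, "Exclusive": 1.5,
                         "UnknownExclusive": 1.5, "AllExclusive": 3.0},
    "AllShared":        {"None": 0.3, "Shared": 0.3, "UnknownShared": 0.3,
                         "AllShared": 0.3, "Exclusive": 3.0,
                         "UnknownExclusive": 3.0, "AllExclusive": 3.0},
    "Exclusive":        {"None": 0.5, "Shared": MATCH, "UnknownShared": 1.5,
                         "AllShared": 3.0, "Exclusive": MATCH,
                         "UnknownExclusive": 1.5, "AllExclusive": 3.0},
    "UnknownExclusive": {"None": 0.5, "Shared": 1.5, "UnknownShared": 1.5,
                         "AllShared": 3.0, "Exclusive": 1.5,
                         "UnknownExclusive": 1.5, "AllExclusive": 3.0},
    "AllExclusive":     {"None": 0.5, "Shared": 3.0, "UnknownShared": 3.0,
                         "AllShared": 3.0, "Exclusive": 3.0,
                         "UnknownExclusive": 3.0, "AllExclusive": 3.0},
}

def compare(pending, running):
    """Weight of a pending job's per-level lock declaration against one
    running job's declaration at the same level. A declaration is
    (kind, names); names only matter for the known kinds."""
    (pkind, pnames), (rkind, rnames) = pending, running
    weight = TABLE[pkind][rkind]
    if weight is MATCH:
        # "if j=i": the two declarations name a common resource.
        base = 0.3 if pkind == "Shared" else 0.5
        return 3.0 if set(pnames) & set(rnames) else base
    return weight

def level_weight(pending_decl, running_decls):
    """Per-level "worst case": the maximum of the comparison against
    every running job's declaration at that level."""
    return max((compare(pending_decl, r) for r in running_decls),
               default=0.0)
```

For example, a pending job wanting instance1 exclusively scores 3
against a running job holding instance1 shared, but only 0.5 against
jobs holding unrelated locks. This is also how I understand the max in
the heuristic below: it ranges over the running jobs, one scalar
comparison each.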
> +
> +The heuristic algorithm is approximately defined as such::
> +
> +  if job.hasBGL():
> +    return base_value + 15
> +  value = base_value
> +  for level in lock_levels:
> +    value += max(compare(job[level], running_jobs[level]))
> +  return value

It is not entirely clear what you are taking the maximum of, since
compare() usually returns a scalar value.

> +
> +The number 15 is obtained by multiplying the worst-case lock weight (3)
> +by the number of levels (5); the BGL should never have more or less
> +than this value in weight. Furthermore, the base_value parameter is a
> +customizable constant that provides a base value (default is 1), which
> +can be useful when used with the anti-starvation measures.
> +
> +As previously specified, to prevent starvation we introduce an aging
> +system for queued jobs that keeps the queue fair. Each job in the queue
> +will have an 'Age' parameter based on how long it has been sitting in
> +the pending queue. The greater the age, the smaller the job's static
> +lock weight, and the likelier the job is to be scheduled for execution
> +next. The age is calculated from the delta between a job's enqueue time
> +and the current time, quantized to a common unit. This quantization
> +will be adjustable as a constant in the Ganeti code to something like,
> +for example, 1 tick every 30 seconds. Using the current formula, we can
> +define the actual job weight in the queue::
> +
> +  weight = spv(job) * (1 - job.Age()/K)
> +
> +Where spv(job) is the value obtained by the heuristic algorithm (Static
> +Predictive Value), Age is the given job's quantized age, and K is an
> +aging coefficient value that roughly means "the amount of ticks of age
> +to wait before getting the smallest weight". In other words, regardless
> +of the measured SPV, the weight of a job will always equal 0 after K
> +ticks of time have passed.
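One small thing the formula leaves implicit: without clamping, a job
older than K ticks would get a negative weight. A minimal sketch of how
I read it (my naming, not the patch's; the clamp at K ticks is my
assumption, based on the statement that the weight equals 0 after K
ticks):

```python
def queue_weight(spv, age_ticks, k):
    """Anti-starvation weight of a pending job.

    spv: the job's Static Predictive Value from the heuristic above.
    age_ticks: quantized time spent in the pending queue.
    k: the aging coefficient, in ticks.
    """
    # Clamp (my assumption): the weight bottoms out at 0 after k ticks
    # instead of going negative.
    age = min(age_ticks, k)
    return spv * (1 - age / k)
```

The scheduler would then sort pending jobs by this value ascending, so
even a highly contended job decays toward weight 0 and eventually
outranks any freshly submitted one.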
> +
> +
> +Static Lock Declaration
> +-----------------------
> +
> +As already explained, unfortunately it is not possible to predict with
> +100% accuracy the exact amount of resources that a job will require
> +before it is executed and enters the running queue. To solve this
> +problem, we decided to implement an ad-hoc static mapping of
> +opcode:locks declarations.
> +
> +Given each job, we try to infer as much information as possible from
> +its opcode parameters, and then deduce which locks are most likely to
> +be requested as the job transitions into running. Because of the many
> +and varied combinations of parameters, this is not a simple process,
> +and more often than not we simply have to declare a lock request as
> +"Unknown" (either shared or exclusive, as defined in the previous
> +section). Luckily, for the currently running opcodes, it is possible
> +to query the system for the state of the locks, so we can build a more
> +accurate map, which helps improve the accuracy of the heuristic
> +function.

^^ yep, I am strongly in favour of reducing uncertainty wherever
possible.

> +
> diff --git a/doc/index.rst b/doc/index.rst
> index e5535eb..08f4678 100644
> --- a/doc/index.rst
> +++ b/doc/index.rst
> @@ -138,6 +138,7 @@ Draft designs
>   design-partitioned
>   design-performance-tests.rst
>   design-plain-redundancy.rst
> + design-predictive-queue.rst
>   design-query2.rst
>   design-query-splitting.rst
>   design-rapi-pam.rst
> --
> 2.8.0.rc3.226.g39d4020
