This design document introduces a new Ganeti job scheduler that prioritizes jobs in the queue based on their lock declaration and state of the cluster.
Signed-off-by: Federico Morg Pareschi <[email protected]> --- Makefile.am | 1 + doc/design-predictive-queue.rst | 213 ++++++++++++++++++++++++++++++++++++++++ doc/index.rst | 1 + 3 files changed, 215 insertions(+) create mode 100644 doc/design-predictive-queue.rst diff --git a/Makefile.am b/Makefile.am index 3c67c5c..8cd9f36 100644 --- a/Makefile.am +++ b/Makefile.am @@ -729,6 +729,7 @@ docinput = \ doc/design-partitioned.rst \ doc/design-plain-redundancy.rst \ doc/design-performance-tests.rst \ + doc/design-predictive-queue.rst \ doc/design-query-splitting.rst \ doc/design-query2.rst \ doc/design-query-splitting.rst \ diff --git a/doc/design-predictive-queue.rst b/doc/design-predictive-queue.rst new file mode 100644 index 0000000..460b12e --- /dev/null +++ b/doc/design-predictive-queue.rst @@ -0,0 +1,213 @@ +================================== +Predictive Queue System for Ganeti +================================== + +.. contents:: :depth: 4 + +This design document describes the introduction of a new queue scheduling +system for Ganeti jobs based on the amount of contention and resources +needed by each newly submitted job. + +Current State and shortcomings +============================== + +Currently, jobs in the Ganeti system are placed in a queue and subsequently +set to run by the job scheduler. A job can be in different mutually exclusive +states such as QUEUED, WAITING, or RUNNING depending on its position in the +scheduling pipeline. Jobs can also reach the final state of SUCCESS, ERROR or +CANCELED, but that is not relevant for the scope of this document. Among the +jobs that are QUEUED, WAITING or RUNNING, the latter two states are considered +by the system as 'running', while the former as 'pending'. + +Ganeti has an option to limit the amount of parallel running jobs (either +WAITING or RUNNING), and when such a limit is reached, all the newly submitted +ones will get stuck in the queue waiting for their turn to be executed. As each +job transitions from RUNNING to a final state, a new spot in the running job +queue opens and the job scheduler is tasked to pick one of the pending ones +to be started next. + +As a job transitions from the pending to the running queue, it gets started by +the job executor which initializes the logical unit mcpu, which in turn tries +to acquire the necessary locks on shared resources. If the lock acquisition +fails because the locks are already held by other jobs, the opcode transitions +into a WAITING state and gets stuck until such locks are released. + +At the time of writing, the queue scheduler picks which pending job to execute +next based on the following parameters: a priority value, a reason rate limiter, +an eligibility status based on dependencies on other jobs, and a series of +filters. After all these checks are performed, the first job on the list is +picked on a first-come-first-serve basis. On a typical usage scenario, without +any special cases, the first job to be scheduled to run is usually the first +job that was put in the queue. Each job is executed in order. + +A flaw in this system is that jobs are not scheduled to run based on the +resources they need. This means that in many cases, jobs from the pending queue +transition to a WAITING state and then are blocked again waiting for locks. +Meanwhile, other jobs in the pending queue could be eligible to run without +any additional resources, but cannot because the running queue is full already. + +Proposed changes +================ + +This design proposes a new way of choosing which jobs to run from the pending +queue based on the resources currently held by all the other running jobs and +the locks required by each new yet-to-be-executed job. On top of the +already-existing filtering parameters, the scheduler will also sort the queue +of pending jobs based on a heuristic value calculated by comparing each job in +the queue with the sum of all the other jobs already running in the system. + +Since this change introduces a more unpredictable queue system, there may be +cases where queued jobs fall into starvation as less resource-intensive jobs +get scheduled in front of them and take their spot in the queue. To solve this +problem, we also introduce a customizable parameter that determines the aging +of a pending job which will then affect the heuristic calculations. + +Design decisions +================ + +The addition of a predictive scheduler is a small change on top of the +already-existing queue. It will be sitting right inbetween the priority and +filtering parameters for the queue and it is a simple sort operation which will +not impact the current codebase by a lot. For all intents and purposes, this +change will not impact any of the previously existing filtering parameters +such as reason rate limiting, job filtering, priority values or eligibility +checks. + +The algorithm relies on a list of static resource declarations for each opcode. +Each job is assigned a weight by comparing its own locks to those currently +being held by the system. The higher the value, the higher the chance that +the job will be stuck in WAITING over some shared resource. Due to how the +locking system is implemented in Ganeti, it is currently impossible to know +with certainty the exact amount of resources that each job will request +before its transition from the pending to the running queue. As a solution to +this problem, we define a level of uncertainty and a heuristic function to +guesstimate the likelihood of a job to not get blocked. + +Job Promotion Algorithm +----------------------- + +The job locks in Ganeti are separated into 5 different levels (plus the Big +Ganeti Lock): NodeGroup, Instance, Node, NodeRes, and Network. Each of these +levels can take however many locks they need, in an independent way from each +other, as either shared or exclusive mode. The following are the possible +lock types a job can declare for each level: + + * None: No locks are required at this level at all; + + * Shared(k): The system is aware of which locks are being requested in shared + mode. K is the list of all the resource names. + + * UnknownShared: The system knows that there is at least one lock requested + as shared, but it has no way of knowing which or how many. + + * AllShared: All locks at this level are required as shared. + + * Exclusive(k) : The system is aware of which locks are being requested in + exclusive mode. K is the list of all resource names. + + * UnknownExclusive: The system knowes that there is at least one lock + requested as excusive, but it has no way of knowing which or how many. + + * AllExclusive: All locks at this level are required as exclusive. + +In the case of those few jobs that require the BGL, they are considered +as the absolute worst weight scenario possible and will most likely be +scheduled at the back of the queue as they require the entire set of resources. + +We define a comparison operation between two lock types. The first operand is +the lock of a job in a pending queue, whereas the second operand is the +currently-existing lock already taken by a running job. The operation behaves +as follows: + ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+ +| | None | Shared(j) | UnknownShared | AllShared | Exclusive(j) | UnknownExclusive | AllExclusive | ++==================+=========+========================+===============+===========+========================+==================+==============+ +| None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+ +| Shared(i) | 0.3 | 0 | 0 | 0 | if j=i then 3 else 0.3 | 1.5 | 3 | ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+ +| UnknownShared | 0.3 | 0.3 | 0.3 | 0.3 | 1.5 | 1.5 | 3 | ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+ +| AllShared | 0.3 | 0.3 | 0.3 | 0.3 | 3 | 3 | 3 | ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+ +| Exclusive(i) | 0.5 | if j=i then 3 else 0.5 | 1.5 | 3 | if j=i then 3 else 0.5 | 1.5 | 3 | ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+ +| UnknownExclusive | 0.5 | 1.5 | 1.5 | 3 | 1.5 | 1.5 | 3 | ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+ +| AllExclusive | 0.5 | 3 | 3 | 3 | 3 | 3 | 3 | ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+ + +The weight values are defined as such: + + * 0: There is no contention, and no contention is added to the state of the + cluster. + + * 0.3: There should be no contention, however the likelihood for future + contention of resources in the cluster is increased. + + * 0.5: There should be no contention, however the likelihood for future + contention of resources in the cluster is greatly increased. + + * 1.5: There is a chance that the job will get stuck but there is no certain + way of knowing it. + + * 3: The job will very likely get stuck in WAITING. + +For each of the pending jobs, their locks are compared using this operation +with each of the locks of the running jobs. After this comparison, for each +level the largest number is picked as the "worst case" for that specific level +and used in the heuristic formula to calculate the final weight of the job. + +The heuristic algorithm is approximately defined as such:: + + if job.hasBGL(): + return base_value+15 + value=base_value + for level in lock_levels: + value+=max(compare(job[level], running_jobs[level])) + return value + +The number 15 is obtained by multiplying the worst case lock weight (3) by +the amount of levels (5), the BGL should never have more or less than this +value, in weight. Furthermore, the base_value parameter is a customizable +constant that provides a base value (default is 1) which can be useful when +used with the anti-starvation measures. + +As previously specified, to prevent starvation we introduce an aging system +for queued jobs that keeps the queue fair. Each job in the queue will have an +'Age' parameter based on how long it has been sitting in the pending queue. +The greater the age, the smaller the job's static lock weight, the likelier the +job will be to be scheduled for execution next. The age is calculated on the +delta between a job enqueue time and the current time, quantized to a common +unit. This quantization will be adjustable as a constant in the Ganeti code to +something similar to, for example, 1 tick every 30 seconds. Using the current +formula, we can define the actual job weight in the queue:: + + weight = spv(job)*(1 - job.Age()/K) + +Where spv(job) is the value obtained by the heuristic algorithm (Static +Predictive Value), Age is the given job's quantized age, and k is an aging +coefficient value that roughly means "the amount of ticks of age to wait +before getting the smallest weight". To explain, K means that regardless of the +measured SPV, the weight of a job will always equal 0 after K ticks of time have +passed. + + +Static Lock Declaration +----------------------- + +As already explained, unfortunately it is not possible to 100% accurately +predict the exact amount of resources that a job will require before it is +executed and enters the running queue. To solve this problem, we decided to +implement an ad-hoc static mapping of opcode:locks declaration. + +Given each job, we try to infer as much information as possible from its +opcode parameters, and then deduce which locks are most likely to be requested +as the job transitions into running. Because of the high and varied combinations +of parameters, it is not a simple process and more often than not we simply +have to declare a lock request as "Unknown" (either shared or exclusive, as +defined in the previous section). Luckily, for the currently running opcodes, +it is possible to query the system for the state of the locks and we can +achieve a more accurate map, which helps improving the accuracy of the +heuristic function. + diff --git a/doc/index.rst b/doc/index.rst index e5535eb..08f4678 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -138,6 +138,7 @@ Draft designs design-partitioned design-performance-tests.rst design-plain-redundancy.rst + design-predictive-queue.rst design-query2.rst design-query-splitting.rst design-rapi-pam.rst -- 2.8.0.rc3.226.g39d4020
