As discussed previously IRL, I'm happy with the design, so LGTM. I just
have a few nitpicks on the document.


On Thursday, August 25, 2016 at 2:00:32 PM UTC+1, Federico Pareschi wrote:
>
> This design document introduces a new Ganeti job scheduler that 
> prioritizes jobs in the queue based on their lock declaration and state 
> of the cluster. 
>
> Signed-off-by: Federico Morg Pareschi <[email protected]> 
> --- 
>  Makefile.am                     |   1 + 
>  doc/design-predictive-queue.rst | 213 ++++++++++++++++++++++++++++++++++++++++ 
>  doc/index.rst                   |   1 + 
>  3 files changed, 215 insertions(+) 
>  create mode 100644 doc/design-predictive-queue.rst 
>
> diff --git a/Makefile.am b/Makefile.am 
> index 3c67c5c..8cd9f36 100644 
> --- a/Makefile.am 
> +++ b/Makefile.am 
> @@ -729,6 +729,7 @@ docinput = \ 
>          doc/design-partitioned.rst \ 
>          doc/design-plain-redundancy.rst \ 
>          doc/design-performance-tests.rst \ 
> +        doc/design-predictive-queue.rst \ 
>          doc/design-query-splitting.rst \ 
>          doc/design-query2.rst \ 
>          doc/design-query-splitting.rst \ 
> diff --git a/doc/design-predictive-queue.rst b/doc/design-predictive-queue.rst 
> new file mode 100644 
> index 0000000..460b12e 
> --- /dev/null 
> +++ b/doc/design-predictive-queue.rst 
> @@ -0,0 +1,213 @@ 
> +================================== 
> +Predictive Queue System for Ganeti 
> +================================== 
> + 
> +.. contents:: :depth: 4 
> + 
> +This design document describes the introduction of a new queue scheduling 
> +system for Ganeti jobs based on the amount of contention and resources 
> +needed by each newly submitted job. 
> + 
> +Current state and shortcomings 
> +============================== 
> + 
> +Currently, jobs in the Ganeti system are placed in a queue and 
> subsequently 
> +set to run by the job scheduler. A job can be in different mutually 
> exclusive 
> +states such as QUEUED, WAITING, or RUNNING depending on its position in 
> the 
> +scheduling pipeline. Jobs can also reach the final state of SUCCESS, 
> ERROR or 
> +CANCELED, but that is not relevant for the scope of this document. Among 
> the 
> +jobs that are QUEUED, WAITING or RUNNING, the latter two states are 
> considered 
> +by the system as 'running', while the former as 'pending'. 
>
 
I think "mutually exclusive" is implied; having a job in multiple states 
would be quite an unusual design. 
Maybe a state diagram would be nice here? (If it's too much hassle in ASCII 
art, I don't insist, it's just an idea.)
 

> + 
> +Ganeti has an option to limit the amount of parallel running jobs (either 
> +WAITING or RUNNING), and when such a limit is reached, all the newly 
> submitted 
> +ones will get stuck in the queue waiting for their turn to be executed. 
> As each 
> +job transitions from RUNNING to a final state, a new spot in the running 
> job 
> +queue opens and the job scheduler is tasked to pick one of the pending 
> ones 
> +to be started next. 
> + 
> +As a job transitions from the pending to the running queue, it gets 
> started by 
> +the job executor which initializes the logical unit mcpu, which in turn 
> tries 
> +to acquire the necessary locks on shared resources. If the lock 
> acquisition 
> +fails because the locks are already held by other jobs, the opcode 
> transitions 
> +into a WAITING state and gets stuck until such locks are released. 
> + 
> +At the time of writing, the queue scheduler picks which pending job to 
> execute 
> +next based on the following parameters: a priority value, a reason rate 
> limiter, 
> +an eligibility status based on dependencies on other jobs, and a series 
> of 
> +filters. After all these checks are performed, the first job on the list 
> is 
> +picked on a first-come-first-served basis. In a typical usage scenario, 
> without 
> +any special cases, the first job to be scheduled to run is usually the 
> first 
> +job that was put in the queue. Each job is executed in order. 
> + 
> +A flaw in this system is that jobs are not scheduled to run based on the 
>
 
It isn't a flaw per se, just some inefficiency.
 

> +resources they need. This means that in many cases, jobs from the pending 
> queue 
> +transition to a WAITING state and then are blocked again waiting for 
> locks. 
> +Meanwhile, other jobs in the pending queue could be eligible to run 
> without 
> +any additional resources, but cannot because the running queue is full 
> already. 
> + 
> +Proposed changes 
> +================ 
> + 
> +This design proposes a new way of choosing which jobs to run from the 
> pending 
> +queue based on the resources currently held by all the other running jobs 
> and 
> +the locks required by each new yet-to-be-executed job. On top of the 
> +already-existing filtering parameters, the scheduler will also sort the 
> queue 
> +of pending jobs based on a heuristic value calculated by comparing each 
> job in 
> +the queue with the sum of all the other jobs already running in the 
> system. 
> + 
> +Since this change introduces a more unpredictable queue system, there may 
> be 
> +cases where queued jobs fall into starvation as less resource-intensive 
> jobs 
> +get scheduled in front of them and take their spot in the queue. To solve 
> this 
> +problem, we also introduce a customizable parameter that determines the 
> aging 
> +of a pending job which will then affect the heuristic calculations. 
> + 
> +Design decisions 
> +================ 
> + 
> +The addition of a predictive scheduler is a small change on top of the 
> +already-existing queue. It will sit right in between the priority 
> and 
> +filtering parameters for the queue, and it is a simple sort operation 
> which will 
> +not significantly impact the current codebase. For all intents and purposes, 
> this 
> +change will not impact any of the previously existing filtering 
> parameters 
> +such as reason rate limiting, job filtering, priority values or 
> eligibility 
> +checks. 
> + 
> +The algorithm relies on a list of static resource declarations for each 
> opcode. 
> +Each job is assigned a weight by comparing its own locks to those 
> currently 
> +being held by the system. The higher the value, the higher the chance 
> that 
> +the job will be stuck in WAITING over some shared resource. Due to how 
> the 
> +locking system is implemented in Ganeti, it is currently impossible to 
> know 
> +with certainty the exact amount of resources that each job will request 
> +before its transition from the pending to the running queue. As a 
> solution to 
> +this problem, we define a level of uncertainty and a heuristic function 
> to 
> +estimate the likelihood that a job will not get blocked. 
>

Maybe it would be worth mentioning that the more certainty we have about 
jobs,
the better this algorithm works (and, more generally, Ganeti as a job 
workflow
management system), so we should keep that in mind when implementing jobs. 
  

> + 
> +Job Promotion Algorithm 
> +----------------------- 
> + 
> +The job locks in Ganeti are separated into 5 different levels (plus the 
> Big 
> +Ganeti Lock): NodeGroup, Instance, Node, NodeRes, and Network. Each of 
> these 
> +levels can take however many locks they need, in an independent way from 
> each 
> +other, in either shared or exclusive mode. The following are the possible 
> +lock types a job can declare for each level: 
>

The levels don't take locks; jobs take locks at different levels (and the 
levels are independent, they don't affect each other).
 

> + 
> + * None: No locks are required at this level at all; 
> + 
> + * Shared(k): The system is aware of which locks are being requested in 
> shared 
> +   mode. K is the list of all the resource names. 
> + 
> + * UnknownShared: The system knows that there is at least one lock 
> requested 
> +   as shared, but it has no way of knowing which or how many. 
> + 
> + * AllShared: All locks at this level are required as shared. 
> + 
> + * Exclusive(k): The system is aware of which locks are being requested 
> in 
> +   exclusive mode. K is the list of all resource names. 
> + 
> + * UnknownExclusive: The system knows that there is at least one lock 
> +   requested as exclusive, but it has no way of knowing which or how many. 
> + 
> + * AllExclusive: All locks at this level are required as exclusive. 
> + 
> +In the case of those few jobs that require the BGL, they are considered 
> +as the absolute worst weight scenario possible and will most likely be 
> +scheduled at the back of the queue as they require the entire set of 
> resources. 
> + 
> +We define a comparison operation between two lock types. The first 
> operand is 
> +the lock of a job in a pending queue, whereas the second operand is the 
> +currently-existing lock already taken by a running job. The operation 
> behaves 
> +as follows: 
> + 
> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +|                  |  None   |  Shared(j)             | UnknownShared | AllShared | Exclusive(j)           | UnknownExclusive | AllExclusive |
> ++==================+=========+========================+===============+===========+========================+==================+==============+
> +| None             |    0    |      0                 |       0       |     0     |      0                 |        0         |       0      |
> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| Shared(i)        |   0.3   |      0                 |       0       |     0     | if j=i then 3 else 0.3 |       1.5        |       3      |
> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| UnknownShared    |   0.3   |     0.3                |      0.3      |    0.3    |     1.5                |       1.5        |       3      |
> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| AllShared        |   0.3   |     0.3                |      0.3      |    0.3    |      3                 |        3         |       3      |
> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| Exclusive(i)     |   0.5   | if j=i then 3 else 0.5 |      1.5      |     3     | if j=i then 3 else 0.5 |       1.5        |       3      |
> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| UnknownExclusive |   0.5   |     1.5                |      1.5      |     3     |     1.5                |       1.5        |       3      |
> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> +| AllExclusive     |   0.5   |      3                 |       3       |     3     |      3                 |        3         |       3      |
> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
> + 
>

Do you think we need the entire table? Since we are using the actual locks 
taken by running jobs, 
there should be no uncertainty in at least one of the dimensions (in this 
case fewer columns, if I am
correct).
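For my own sanity I sketched the table in Python, maybe it helps the 
discussion. Everything here is my invention for illustration (the kind 
constants, the (kind, names) tuple shape, and compare() are not actual 
Ganeti identifiers):

```python
# Hypothetical sketch of the comparison table; all names here are
# illustrative inventions, not actual Ganeti identifiers.
NONE, SHARED, UNK_SHARED, ALL_SHARED, EXCL, UNK_EXCL, ALL_EXCL = range(7)

# Base weights, indexed as TABLE[pending_kind][running_kind]; the
# "if j=i then 3" cells are handled by the overlap check in compare().
TABLE = {
    NONE:       {NONE: 0,   SHARED: 0,   UNK_SHARED: 0,   ALL_SHARED: 0,
                 EXCL: 0,   UNK_EXCL: 0,   ALL_EXCL: 0},
    SHARED:     {NONE: 0.3, SHARED: 0,   UNK_SHARED: 0,   ALL_SHARED: 0,
                 EXCL: 0.3, UNK_EXCL: 1.5, ALL_EXCL: 3},
    UNK_SHARED: {NONE: 0.3, SHARED: 0.3, UNK_SHARED: 0.3, ALL_SHARED: 0.3,
                 EXCL: 1.5, UNK_EXCL: 1.5, ALL_EXCL: 3},
    ALL_SHARED: {NONE: 0.3, SHARED: 0.3, UNK_SHARED: 0.3, ALL_SHARED: 0.3,
                 EXCL: 3,   UNK_EXCL: 3,   ALL_EXCL: 3},
    EXCL:       {NONE: 0.5, SHARED: 0.5, UNK_SHARED: 1.5, ALL_SHARED: 3,
                 EXCL: 0.5, UNK_EXCL: 1.5, ALL_EXCL: 3},
    UNK_EXCL:   {NONE: 0.5, SHARED: 1.5, UNK_SHARED: 1.5, ALL_SHARED: 3,
                 EXCL: 1.5, UNK_EXCL: 1.5, ALL_EXCL: 3},
    ALL_EXCL:   {NONE: 0.5, SHARED: 3,   UNK_SHARED: 3,   ALL_SHARED: 3,
                 EXCL: 3,   UNK_EXCL: 3,   ALL_EXCL: 3},
}

def compare(pending, running):
    """Weight of a pending job's lock declaration at one level against a
    running job's declaration at the same level.

    Each argument is a (kind, names) pair; names is a frozenset of resource
    names, empty unless kind is SHARED or EXCL.
    """
    (pkind, pnames), (rkind, rnames) = pending, running
    # "if j=i then 3": a known overlap on a concrete resource name
    # escalates to 3 whenever at least one side wants it exclusively.
    if pnames & rnames and EXCL in (pkind, rkind):
        return 3
    return TABLE[pkind][rkind]
```

Read this way, a pending Shared({'node1'}) scores 3 against a running 
Exclusive({'node1'}) but 0 against a running Shared({'node1'}), which I 
believe matches the table.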
 

> +The weight values are defined as follows: 
> + 
> +  * 0: There is no contention, and no contention is added to the state of 
> the 
> +    cluster. 
> + 
> +  * 0.3: There should be no contention, however the likelihood for future 
> +    contention of resources in the cluster is increased. 
> + 
> +  * 0.5: There should be no contention, however the likelihood for future 
> +    contention of resources in the cluster is greatly increased. 
> + 
> +  * 1.5: There is a chance that the job will get stuck but there is no 
> certain 
> +    way of knowing it. 
> + 
> +  * 3: The job will very likely get stuck in WAITING. 
>
 
Isn't that rather certain? If both jobs take the same exclusive locks, then
they certainly block each other.

> + 
> +For each of the pending jobs, their locks are compared using this 
> operation 
> +with each of the locks of the running jobs. After this comparison, for 
> each 
> +level the largest number is picked as the "worst case" for that specific 
> level 
> +and used in the heuristic formula to calculate the final weight of the 
> job. 
> + 
> +The heuristic algorithm is approximately defined as follows:: 
> + 
> +  if job.hasBGL(): 
> +    return base_value+15 
> +  value=base_value 
> +  for level in lock_levels: 
> +    value+=max(compare(job[level], running_jobs[level])) 
> +  return value 
>

It is not entirely clear what you are taking the maximum of, since compare()
usually returns a scalar value.
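To make my confusion concrete: I assume the intended reading is "per lock 
level, take the max over all running jobs", roughly like the following 
sketch (JobSketch, the locks mapping and the compare parameter are 
placeholders of mine, not the real data structures):

```python
# Hypothetical sketch of the heuristic; names are mine, not Ganeti's.
LOCK_LEVELS = ("NodeGroup", "Instance", "Node", "NodeRes", "Network")

class JobSketch:
    """Minimal stand-in for a job and its declared locks (hypothetical)."""
    def __init__(self, locks, has_bgl=False):
        self.locks = locks      # level name -> lock declaration at that level
        self.has_bgl = has_bgl

def spv(job, running_jobs, compare, base_value=1.0):
    """Static Predictive Value: per level, the worst (max) comparison
    weight against any currently running job is added to the base value."""
    if job.has_bgl:
        return base_value + 15  # worst-case weight (3) times 5 lock levels
    value = base_value
    for level in LOCK_LEVELS:
        value += max(
            (compare(job.locks[level], run.locks[level])
             for run in running_jobs),
            default=0,  # an empty running queue adds no contention weight
        )
    return value
```

If that is the intended semantics, maybe the pseudocode could make the 
iteration over running jobs explicit.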
 

> + 
> +The number 15 is obtained by multiplying the worst-case lock weight (3) 
> by 
> +the number of levels (5); the BGL should never have more or less than 
> this 
> +value in weight. Furthermore, the base_value parameter is a customizable 
> +constant that provides a base value (default is 1) which can be useful 
> when 
> +used with the anti-starvation measures. 
> + 
> +As previously specified, to prevent starvation we introduce an aging 
> system 
> +for queued jobs that keeps the queue fair. Each job in the queue will 
> have an 
> +'Age' parameter based on how long it has been sitting in the pending 
> queue. 
> +The greater the age, the smaller the job's static lock weight, and the 
> likelier the 
> +job is to be scheduled for execution next. The age is calculated from 
> the 
> +delta between a job's enqueue time and the current time, quantized to a 
> common 
> +unit. This quantization will be adjustable as a constant in the Ganeti 
> code to 
> +something similar to, for example, 1 tick every 30 seconds. Using the 
> current 
> +formula, we can define the actual job weight in the queue:: 
> + 
> +  weight = spv(job)*(1 - job.Age()/K) 
> + 
> +Where spv(job) is the value obtained by the heuristic algorithm (Static 
> +Predictive Value), Age is the given job's quantized age, and K is an 
> aging 
> +coefficient that roughly means "the number of ticks of age to wait 
> +before getting the smallest weight". In other words, K means that regardless 
> of the 
> +measured SPV, the weight of a job will always equal 0 after K ticks of 
> time have 
> +passed. 
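Just to spell out the aging formula as I read it (a sketch; clamping at 
zero for jobs older than K ticks is my assumption, since the text only says 
the weight equals 0 once K ticks have passed):

```python
def queue_weight(spv_value, age_ticks, k):
    """Aging-adjusted queue weight: weight = spv * (1 - age/K).

    Clamped at zero so jobs older than K ticks do not go negative; this
    clamp is an assumption, the design only states the weight is 0 at K.
    """
    return spv_value * max(0.0, 1.0 - age_ticks / k)
```

So with K = 60 ticks, a job with SPV 10 weighs 5 after 30 ticks and 0 from 
60 ticks onward.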
> + 
> + 
> +Static Lock Declaration 
> +----------------------- 
> + 
> +As already explained, unfortunately it is not possible to 100% accurately 
> +predict the exact amount of resources that a job will require before it 
> is 
> +executed and enters the running queue. To solve this problem, we decided 
> to 
> +implement an ad-hoc static mapping of opcode:locks declaration. 
> + 
> +Given each job, we try to infer as much information as possible from its 
> +opcode parameters, and then deduce which locks are most likely to be 
> requested 
> +as the job transitions into running. Because of the high and varied 
> combinations 
> +of parameters, it is not a simple process and more often than not we 
> simply 
> +have to declare a lock request as "Unknown" (either shared or exclusive, 
> as 
> +defined in the previous section). Luckily, for the currently running 
> opcodes, 
> +it is possible to query the system for the state of the locks and we can 
> +achieve a more accurate map, which helps improve the accuracy of the 
> +heuristic function. 
>
 
^^ yep, I am strongly in favour of reducing uncertainty wherever possible.
 

> + 
> diff --git a/doc/index.rst b/doc/index.rst 
> index e5535eb..08f4678 100644 
> --- a/doc/index.rst 
> +++ b/doc/index.rst 
> @@ -138,6 +138,7 @@ Draft designs 
>     design-partitioned 
>     design-performance-tests.rst 
>     design-plain-redundancy.rst 
> +   design-predictive-queue.rst 
>     design-query2.rst 
>     design-query-splitting.rst 
>     design-rapi-pam.rst 
> -- 
> 2.8.0.rc3.226.g39d4020 
>
>
