Responses inline

On Friday, August 26, 2016 at 4:01:45 PM UTC+1, Viktor Bachraty wrote:
>
> As discussed previously IRL, I'm happy with the design so LGTM, I just
> have a few nitpicks on the document.
>
> On Thursday, August 25, 2016 at 2:00:32 PM UTC+1, Federico Pareschi wrote:
>>
>> This design document introduces a new Ganeti job scheduler that
>> prioritizes jobs in the queue based on their lock declaration and the
>> state of the cluster.
>>
>> Signed-off-by: Federico Morg Pareschi <[email protected]>
>> ---
>>  Makefile.am                     |   1 +
>>  doc/design-predictive-queue.rst | 213 ++++++++++++++++++++++++++++++++++++++++
>>  doc/index.rst                   |   1 +
>>  3 files changed, 215 insertions(+)
>>  create mode 100644 doc/design-predictive-queue.rst
>>
>> diff --git a/Makefile.am b/Makefile.am
>> index 3c67c5c..8cd9f36 100644
>> --- a/Makefile.am
>> +++ b/Makefile.am
>> @@ -729,6 +729,7 @@ docinput = \
>>   doc/design-partitioned.rst \
>>   doc/design-plain-redundancy.rst \
>>   doc/design-performance-tests.rst \
>> + doc/design-predictive-queue.rst \
>>   doc/design-query-splitting.rst \
>>   doc/design-query2.rst \
>>   doc/design-query-splitting.rst \
>> diff --git a/doc/design-predictive-queue.rst b/doc/design-predictive-queue.rst
>> new file mode 100644
>> index 0000000..460b12e
>> --- /dev/null
>> +++ b/doc/design-predictive-queue.rst
>> @@ -0,0 +1,213 @@
>> +==================================
>> +Predictive Queue System for Ganeti
>> +==================================
>> +
>> +.. contents:: :depth: 4
>> +
>> +This design document describes the introduction of a new queue scheduling
>> +system for Ganeti jobs, based on the amount of contention and the resources
>> +needed by each newly submitted job.
>> +
>> +Current state and shortcomings
>> +==============================
>> +
>> +Currently, jobs in the Ganeti system are placed in a queue and subsequently
>> +set to run by the job scheduler. A job can be in different mutually exclusive
>> +states such as QUEUED, WAITING, or RUNNING, depending on its position in the
>> +scheduling pipeline. Jobs can also reach the final states of SUCCESS, ERROR
>> +or CANCELED, but those are not relevant for the scope of this document. Among
>> +the jobs that are QUEUED, WAITING or RUNNING, the latter two states are
>> +considered by the system as 'running', while the former is considered
>> +'pending'.
>
> I think "mutually exclusive" is implied, having a job in multiple states
> would be quite an unusual design.
> Maybe a state diagram would be nice here? (If it's too much hassle in
> ascii art, I don't insist, just an idea.)

You're right, dropping "mutually exclusive" sounds better too. A diagram
would probably be a bit overkill for just 3 states, though...
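To make the terminology concrete, the state split the document describes
could be sketched along these lines (a rough illustration with made-up
names, not the actual Ganeti code)::

    # A job is in exactly one of these states while in the pipeline.
    PENDING_STATES = frozenset(["QUEUED"])
    RUNNING_STATES = frozenset(["WAITING", "RUNNING"])
    FINAL_STATES = frozenset(["SUCCESS", "ERROR", "CANCELED"])

    def counts_against_job_limit(status):
        """Both WAITING and RUNNING jobs occupy a 'running' slot."""
        return status in RUNNING_STATES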
>> +
>> +Ganeti has an option to limit the number of parallel running jobs (either
>> +WAITING or RUNNING), and when such a limit is reached, all the newly
>> +submitted ones get stuck in the queue waiting for their turn to be
>> +executed. As each job transitions from RUNNING to a final state, a new
>> +spot in the running job queue opens and the job scheduler is tasked with
>> +picking one of the pending ones to be started next.
>> +
>> +As a job transitions from the pending to the running queue, it gets
>> +started by the job executor, which initializes the logical unit mcpu,
>> +which in turn tries to acquire the necessary locks on shared resources.
>> +If the lock acquisition fails because the locks are already held by other
>> +jobs, the opcode transitions into a WAITING state and stays stuck until
>> +those locks are released.
>> +
>> +At the time of writing, the queue scheduler picks which pending job to
>> +execute next based on the following parameters: a priority value, a
>> +reason rate limiter, an eligibility status based on dependencies on other
>> +jobs, and a series of filters. After all these checks are performed, the
>> +first job on the list is picked on a first-come-first-served basis. In a
>> +typical usage scenario, without any special cases, the first job to be
>> +scheduled to run is usually the first job that was put in the queue. Each
>> +job is executed in order.
>> +
>> +A flaw in this system is that jobs are not scheduled to run based on the
>
> It isn't a flaw per se, just some inefficiency.

Fixed.

>> +resources they need. This means that in many cases, jobs from the pending
>> +queue transition to a WAITING state and then are blocked again waiting
>> +for locks. Meanwhile, other jobs in the pending queue could be eligible
>> +to run without any additional resources, but cannot because the running
>> +queue is already full.
>> +
>> +Proposed changes
>> +================
>> +
>> +This design proposes a new way of choosing which jobs to run from the
>> +pending queue, based on the resources currently held by all the other
>> +running jobs and the locks required by each new yet-to-be-executed job.
>> +On top of the already-existing filtering parameters, the scheduler will
>> +also sort the queue of pending jobs based on a heuristic value,
>> +calculated by comparing each job in the queue against all the jobs
>> +already running in the system.
>> +
>> +Since this change introduces a less predictable queue system, there may
>> +be cases where queued jobs fall into starvation as less resource-intensive
>> +jobs get scheduled in front of them and take their spot in the queue. To
>> +solve this problem, we also introduce a customizable parameter that
>> +determines the aging of a pending job, which then affects the heuristic
>> +calculations.
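For illustration, the proposed sorting step boils down to something like
this (invented names; 'weight' is the heuristic developed below, not an
existing Ganeti function)::

    def pick_next_jobs(pending, running, free_slots, now):
        """Fill the free execution slots with the pending jobs that are
        predicted to cause the least lock contention."""
        ordered = sorted(pending, key=lambda job: weight(job, running, now))
        return ordered[:free_slots]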
>> +
>> +Design decisions
>> +================
>> +
>> +The addition of a predictive scheduler is a small change on top of the
>> +already-existing queue. It will sit right in between the priority and
>> +filtering parameters for the queue, and it is a simple sort operation
>> +that will not have a large impact on the current codebase. For all
>> +intents and purposes, this change will not affect any of the previously
>> +existing filtering parameters such as reason rate limiting, job
>> +filtering, priority values or eligibility checks.
>> +
>> +The algorithm relies on a list of static resource declarations for each
>> +opcode. Each job is assigned a weight by comparing its own locks to those
>> +currently held by the system. The higher the value, the higher the chance
>> +that the job will be stuck in WAITING over some shared resource. Due to
>> +how the locking system is implemented in Ganeti, it is currently
>> +impossible to know with certainty the exact set of resources that each
>> +job will request before its transition from the pending to the running
>> +queue. As a solution to this problem, we define a level of uncertainty
>> +and a heuristic function to guesstimate the likelihood that a job will
>> +not get blocked.
>
> Maybe it would be worth mentioning that the more certainty we have about
> jobs, the better this algorithm works (and Ganeti in general, as a job
> workflow management system), so we should keep that in mind when
> implementing jobs.

Done.

>> +
>> +Job Promotion Algorithm
>> +-----------------------
>> +
>> +The job locks in Ganeti are separated into 5 different levels (plus the
>> +Big Ganeti Lock): NodeGroup, Instance, Node, NodeRes, and Network. Each
>> +of these levels can take however many locks they need, independently from
>> +the other levels, in either shared or exclusive mode. The following are
>> +the possible lock types a job can declare for each level:
>
> The levels don't take locks, jobs take locks on different levels (which
> are independent, levels don't affect each other).

Fixed.

>> +
>> + * None: No locks are required at this level at all;
>> +
>> + * Shared(k): The system is aware of which locks are being requested in
>> +   shared mode. K is the list of all the resource names.
>> +
>> + * UnknownShared: The system knows that there is at least one lock
>> +   requested as shared, but it has no way of knowing which or how many.
>> +
>> + * AllShared: All locks at this level are required as shared.
>> +
>> + * Exclusive(k): The system is aware of which locks are being requested
>> +   in exclusive mode. K is the list of all the resource names.
>> +
>> + * UnknownExclusive: The system knows that there is at least one lock
>> +   requested as exclusive, but it has no way of knowing which or how many.
>> +
>> + * AllExclusive: All locks at this level are required as exclusive.
>> +
>> +In the case of those few jobs that require the BGL, they are considered
>> +the absolute worst weight scenario possible and will most likely be
>> +scheduled at the back of the queue, as they require the entire set of
>> +resources.
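To make the lock types easier to follow, here is one possible way to model
the per-level declarations (purely illustrative; the names are invented
and this is not necessarily how the Ganeti codebase represents them)::

    from dataclasses import dataclass
    from typing import FrozenSet

    # One declaration per lock level (NodeGroup, Instance, Node, NodeRes,
    # Network). "None" is modelled as NoLocks to avoid clashing with
    # Python's None.

    @dataclass(frozen=True)
    class NoLocks:            # no locks required at this level
        pass

    @dataclass(frozen=True)
    class Shared:             # known set of locks requested as shared
        names: FrozenSet[str]

    @dataclass(frozen=True)
    class UnknownShared:      # at least one shared lock, names unknown
        pass

    @dataclass(frozen=True)
    class AllShared:          # every lock at this level, shared
        pass

    @dataclass(frozen=True)
    class Exclusive:          # known set of locks requested as exclusive
        names: FrozenSet[str]

    @dataclass(frozen=True)
    class UnknownExclusive:   # at least one exclusive lock, names unknown
        pass

    @dataclass(frozen=True)
    class AllExclusive:       # every lock at this level, exclusive
        pass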
>> +
>> +We define a comparison operation between two lock types. The first
>> +operand is the lock of a job in the pending queue, whereas the second
>> +operand is a lock already taken by a running job. The operation behaves
>> +as follows:
>> +
>> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
>> +|                  | None    | Shared(j)              | UnknownShared | AllShared | Exclusive(j)           | UnknownExclusive | AllExclusive |
>> ++==================+=========+========================+===============+===========+========================+==================+==============+
>> +| None             | 0       | 0                      | 0             | 0         | 0                      | 0                | 0            |
>> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
>> +| Shared(i)        | 0.3     | 0                      | 0             | 0         | if j=i then 3 else 0.3 | 1.5              | 3            |
>> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
>> +| UnknownShared    | 0.3     | 0.3                    | 0.3           | 0.3       | 1.5                    | 1.5              | 3            |
>> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
>> +| AllShared        | 0.3     | 0.3                    | 0.3           | 0.3       | 3                      | 3                | 3            |
>> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
>> +| Exclusive(i)     | 0.5     | if j=i then 3 else 0.5 | 1.5           | 3         | if j=i then 3 else 0.5 | 1.5              | 3            |
>> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
>> +| UnknownExclusive | 0.5     | 1.5                    | 1.5           | 3         | 1.5                    | 1.5              | 3            |
>> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
>> +| AllExclusive     | 0.5     | 3                      | 3             | 3         | 3                      | 3                | 3            |
>> ++------------------+---------+------------------------+---------------+-----------+------------------------+------------------+--------------+
>> +
>
> Do you think we need the entire table? Since we are using actual locks
> taken by running jobs, there should be no uncertainty at least in one of
> the dimensions (in this case fewer columns, if I am correct).

I think the entire table is good to have for two reasons. First of all,
we're able to provide an operation that covers all cases for the given
type. Second, the first implementation of this design will operate only on
the static lock declaration and will read the runtime data only in a
future patch, so it's good to have cases for uncertain locks. Especially
in the case of jobs stuck in WAITING mode, we might not know exactly which
locks they will get when they transition to RUNNING anyway.

>> +The weight values are defined as follows:
>> +
>> + * 0: There is no contention, and no contention is added to the state of
>> +   the cluster.
>> +
>> + * 0.3: There should be no contention; however, the likelihood of future
>> +   contention over resources in the cluster is increased.
>> +
>> + * 0.5: There should be no contention; however, the likelihood of future
>> +   contention over resources in the cluster is greatly increased.
>> +
>> + * 1.5: There is a chance that the job will get stuck, but there is no
>> +   certain way of knowing it.
>> +
>> + * 3: The job will very likely get stuck in WAITING.
>
> Isn't that rather a certainty? If both jobs take the same exclusive
> locks, then they certainly block each other.

You're correct, a slight oversight on my side, thanks. (Although even in
this case we cannot be 100% sure in theory, because the running job might
finish and release its exclusive locks before the new one starts, but I
get what you mean.)
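For reference, here is the table transcribed into code (a sketch building
on the illustrative lock-type classes earlier in the thread; not part of
the patch)::

    SHARED_KINDS = (Shared, UnknownShared, AllShared)
    ALL_KINDS = (AllShared, AllExclusive)
    KNOWN_KINDS = (Shared, Exclusive)

    def lock_weight(pending, running):
        """One cell of the table: 'pending' is the declaration of a queued
        job at one level, 'running' that of a running job at the same
        level."""
        if isinstance(pending, NoLocks):
            return 0          # "None" row: the job needs nothing here
        base = 0.3 if isinstance(pending, SHARED_KINDS) else 0.5
        if isinstance(running, NoLocks):
            return base       # "None" column: only the base risk remains
        if isinstance(pending, SHARED_KINDS) and isinstance(running, SHARED_KINDS):
            # Shared never blocks shared; a fully known shared request
            # adds no risk at all.
            return 0 if isinstance(pending, Shared) else base
        # From here on, at least one of the two sides is exclusive.
        if isinstance(pending, ALL_KINDS) or isinstance(running, ALL_KINDS):
            return 3          # one side wants/holds everything
        if isinstance(pending, KNOWN_KINDS) and isinstance(running, KNOWN_KINDS):
            # The "if j=i" cells: overlapping known names mean a conflict.
            return 3 if pending.names & running.names else base
        return 1.5            # an unknown set is involved: possible conflict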
>> +
>> +For each pending job, its locks are compared using this operation against
>> +the locks of each of the running jobs. After this comparison, the largest
>> +number at each level is picked as the "worst case" for that specific
>> +level and used in the heuristic formula to calculate the final weight of
>> +the job.
>> +
>> +The heuristic algorithm is approximately defined as follows::
>> +
>> +  if job.hasBGL():
>> +    return base_value + 15
>> +  value = base_value
>> +  for level in lock_levels:
>> +    value += max(compare(job[level], running_jobs[level]))
>> +  return value
>
> It is not entirely clear what you are taking the maximum of, since
> compare() usually returns a scalar value.

This is the comparison operation that I outlined above; the maximum is
taken over the per-running-job results. I changed it to have a more
meaningful name and I'll add a small explanation.

>> +
>> +The number 15 is obtained by multiplying the worst-case lock weight (3)
>> +by the number of levels (5); the BGL should never have more or less than
>> +this value, in weight. Furthermore, the base_value parameter is a
>> +customizable constant that provides a base value (default is 1), which
>> +can be useful when used with the anti-starvation measures.
>> +
>> +As previously specified, to prevent starvation we introduce an aging
>> +system for queued jobs that keeps the queue fair. Each job in the queue
>> +will have an 'Age' parameter based on how long it has been sitting in the
>> +pending queue. The greater the age, the smaller the job's static lock
>> +weight becomes and the likelier the job is to be scheduled for execution
>> +next. The age is calculated from the delta between a job's enqueue time
>> +and the current time, quantized to a common unit. This quantization will
>> +be adjustable as a constant in the Ganeti code, for example 1 tick every
>> +30 seconds. Combining this with the heuristic value, we can define the
>> +actual job weight in the queue::
>> +
>> +  weight = spv(job) * (1 - job.Age()/K)
>> +
>> +Where spv(job) is the value obtained by the heuristic algorithm (Static
>> +Predictive Value), Age is the given job's quantized age, and K is an
>> +aging coefficient that roughly means "the number of ticks of age to wait
>> +before getting the smallest weight". In other words, regardless of the
>> +measured SPV, the weight of a job will always reach 0 after K ticks of
>> +time have passed.
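Putting the pieces together end to end (again a sketch with invented
helper names; the clamp at zero for ages beyond K is an assumption about
the intended behaviour, since the raw formula would go negative there)::

    LOCK_LEVELS = ("NodeGroup", "Instance", "Node", "NodeRes", "Network")
    BASE_VALUE = 1        # customizable base value (default 1)
    BGL_WEIGHT = 3 * 5    # worst-case weight (3) times the 5 levels
    K = 20                # ticks of age after which weight bottoms out
    TICK_SECONDS = 30     # quantization unit for the age

    def spv(job, running_jobs):
        """Static Predictive Value: base value plus, for each level, the
        worst case over all running jobs."""
        if job.has_bgl():
            return BASE_VALUE + BGL_WEIGHT
        value = BASE_VALUE
        for level in LOCK_LEVELS:
            value += max((lock_weight(job.locks[level], r.locks[level])
                          for r in running_jobs), default=0)
        return value

    def weight(job, running_jobs, now):
        """Final sort key for the pending queue: lower runs sooner."""
        age = int((now - job.enqueue_time) // TICK_SECONDS)
        return spv(job, running_jobs) * max(0.0, 1.0 - age / K)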
>> +
>> +Static Lock Declaration
>> +-----------------------
>> +
>> +As already explained, it is unfortunately not possible to predict with
>> +complete accuracy the exact set of resources that a job will require
>> +before it is executed and enters the running queue. To solve this
>> +problem, we decided to implement an ad-hoc static opcode:locks
>> +declaration mapping.
>> +
>> +Given each job, we try to infer as much information as possible from its
>> +opcode parameters, and then deduce which locks are most likely to be
>> +requested as the job transitions into running. Because of the many and
>> +varied combinations of parameters, this is not a simple process, and more
>> +often than not we simply have to declare a lock request as "Unknown"
>> +(either shared or exclusive, as defined in the previous section).
>> +Luckily, for the currently running opcodes, it is possible to query the
>> +system for the state of the locks, and so we can achieve a more accurate
>> +map, which helps improve the accuracy of the heuristic function.
>
> ^^ yep, I am strongly in favour of reducing uncertainty wherever possible.

Thanks for the review, I'm sending a patch with the fixed design doc as a
follow-up :)

>> +
>> diff --git a/doc/index.rst b/doc/index.rst
>> index e5535eb..08f4678 100644
>> --- a/doc/index.rst
>> +++ b/doc/index.rst
>> @@ -138,6 +138,7 @@ Draft designs
>>    design-partitioned
>>    design-performance-tests.rst
>>    design-plain-redundancy.rst
>> +   design-predictive-queue.rst
>>    design-query2.rst
>>    design-query-splitting.rst
>>    design-rapi-pam.rst
>> --
>> 2.8.0.rc3.226.g39d4020
