On 2021-01-12 12:55 p.m., Nick Vatamaniuc wrote:
Hello everyone

Hi (Dr.) Nick! ;)

I wanted to see what we thought about adding a scheduling improvement
to the replicator. Specifically adding round-robin fairness amongst
different _replicator dbs

We've needed some sort of priority system for a while -- critically, in
some cases, our failure to offer something has forced my hand into
recommending use of external replication scheduling techniques. I'd
never thought of using multiple databases as a way of handling this
problem, though.

Currently, the scheduler runs all the jobs in the system fairly. It
does this by using the jobs' "last started" timestamps to select which
jobs to stop and start next. Each scheduling cycle, the jobs which ran
the longest (have the lowest start times) get stopped; then, from the
list of pending jobs, we pick those with the lowest start times (since
they have waited the longest) to run next.
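The cycle described above can be sketched roughly as follows. This is
my own illustrative Python, not the actual Erlang scheduler code; jobs
are modeled as (last_started, job_id) tuples and `churn` stands in for
the number of jobs rotated per cycle:

```python
import heapq

def schedule_cycle(running, pending, churn, now):
    """One scheduling cycle, as a sketch.

    running/pending: lists of (last_started, job_id) tuples.
    Stops the `churn` longest-running jobs (lowest start times) and
    starts the `churn` longest-waiting pending jobs (also lowest
    start times), stamping the newly started jobs with `now`.
    """
    to_stop = set(heapq.nsmallest(churn, running))
    to_start = heapq.nsmallest(churn, pending)
    started_ids = {job_id for _, job_id in to_start}
    new_running = [j for j in running if j not in to_stop] \
        + [(now, job_id) for _, job_id in to_start]
    new_pending = [j for j in pending if j[1] not in started_ids] \
        + sorted(to_stop)
    return new_running, new_pending
```

Because stopped jobs re-enter the pending list with their old start
times, every job keeps cycling through the run slots eventually.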

There is also some difference between continuous and one-shot
replication jobs, isn't there? I seem to recall one-shot jobs get
priority over continuous ones.

However, this algorithm can be unfair among replication jobs from
different _replicator dbs. For example, if max_jobs=10, one
_replicator db has 1000 jobs, and another has only one doc, then that
one job might have to wait for quite a while to get its turn to run.
If we picked fairly amongst _replicator dbs, then the one job would
get at least one of the 10 max_jobs run "slots", and the remaining 9
slots would go to the replicator db with 1000 docs. If there were 11
_replicator dbs, then docs from the dbs with the lowest start times
amongst them would be picked first.
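To put rough numbers on the unfairness (my own back-of-envelope
arithmetic, not measurements from the scheduler):

```python
# With max_jobs=10 and one pooled queue of 1000 + 1 jobs, every job,
# including the lone one, holds a run slot only ~1% of the time.
max_jobs = 10
big_db_jobs = 1000
total_jobs = big_db_jobs + 1

global_share = max_jobs / total_jobs  # share of slot-time today

# Under per-db fairness, the lone db would keep one dedicated slot,
# so its single job would run continuously.
per_db_share = 1.0
```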

That does sound unfair. I wasn't aware of this "gotcha" with multiple
replicator DBs (in the field I rarely see >1 replicator DB) but it
intuitively makes sense - all replicator DBs are unioned in the
scheduler's queue and treated like a single, big DB. (There is basically
no advantage to multiple _replicator DBs today then, other than perhaps
security / obscurity reasons, is there?)

This feature would also allow running some quick replication jobs when
the queue is already full of jobs from the main _replicator db: create
a "quickjobs/_replicator" db and insert the new jobs there. With this
new scheme they would start running right away even though the queue
is full of jobs from the main _replicator db. Another, related idea I
had was to add per-job user-configurable priorities: high/default/low,
or numeric. However, that scheme is not as good, as it could lead to
permanent starvation of jobs, while the round-robin db scheduling
scheme still guarantees all jobs will eventually run.

I think this proposal can be improved. I like the idea of leveraging
multiple replicator DBs to provide some sort of priority tweak, but I am
not sure I can predict what will happen in the case of e.g. 10 different
replicator DBs fighting with each other in the scheduler as you've laid
it out.

Would a fair-share scheduler [1] make sense here? We can't borrow many
of the OS-level scheduling ideas because we don't have the same insight
into the runtime characteristics of our jobs, but fair share might be a
good match.

Here's one way it could be implemented:

* Each replicator DB could represent a logical grouping of jobs, e.g.
  background continuous backup jobs, one-shot migration jobs,
  high-priority user requests, etc.

* An INI-level config could determine how these groups are handled:

  * 100% fair (default) -- total available job slots are divided by
    the number of replicator DBs, then allocated round-robin within
    that DB. The difference from today's setup is that each DB gets
    its own round-robin scheduler. I *think* this might be what Nick
    was suggesting, but I can't be 100% sure.

  * Specific priority -- A list of DBs and their fixed percentages.
    Example: _replicator:30,high/_replicator:50,low/_replicator:20

    If a configured database is not present, or has no active jobs in
    it, its fixed proportion of cycles are "given up" to the other
    replicator DBs.

* Jobs are then allocated based on these breakdowns, and we continue
  to use round-robin within each database to maintain the current
  approach. This means no visible change for upgrading users with a
  single _replicator DB.
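A minimal sketch of the fixed-percentage allocation I have in mind,
in Python rather than Erlang and with a hypothetical `allocate`
helper (the weights and db names follow the example config further
down the mail):

```python
def allocate(max_jobs, weights, pending_counts):
    """Divide max_jobs slots across replicator DBs by fixed weight.

    weights: {db_name: configured weight}
    pending_counts: {db_name: number of jobs in that db}

    DBs that are absent or have no jobs give up their share: the
    remaining weights are renormalized over the active DBs. Shares
    are floored; the 'lost' fractional slots would be handed out by
    some separate tie-break rule.
    """
    active = {db: w for db, w in weights.items()
              if pending_counts.get(db, 0) > 0}
    total = sum(active.values())
    if total == 0:
        return {}
    return {db: int(max_jobs * w / total) for db, w in active.items()}
```

Round-robin then runs as today inside each db's slice, which is what
keeps the single-_replicator-DB case behaviorally unchanged.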


Example, with the numbers picked to make the math turn out nice:

```
[replicator]
db_priorities = _replicator:30,high/_replicator:50,low/_replicator:20
max_jobs = 100
```

The % allocation of jobs across DBs is, by default:

  _replicator:      30 / (30+50+20) = 30%
  high/_replicator: 50 / (30+50+20) = 50%
  low/_replicator:  20 / (30+50+20) = 20%

First, put 100 jobs into _replicator. Because the other two DBs have
no jobs, all 100 job slots are allocated to _replicator:

  _replicator:      30 / (30) = 100% * 100 jobs = 100

Now, we add another 100 jobs into the high DB. Since low is empty, we
recalculate the distribution ratios:

  _replicator:      30 / (30+50) = 37.5% * 100 jobs max = 37 jobs
  high/_replicator: 50 / (30+50) = 62.5% * 100 jobs max = 62 jobs

So 62 _replicator jobs are halted, and 62 high/_replicator jobs are
started. (The 'lost' fractional job could be dynamically given to either
DB as desired.)
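One concrete way to hand out that 'lost' fractional slot (my own
suggestion, not something settled in this thread) is the
largest-remainder method, sketched here:

```python
def allocate_lr(max_jobs, weights):
    """Largest-remainder allocation: floor every share, then give the
    leftover slots to the DBs with the biggest fractional parts."""
    total = sum(weights.values())
    exact = {db: max_jobs * w / total for db, w in weights.items()}
    floors = {db: int(x) for db, x in exact.items()}
    leftover = max_jobs - sum(floors.values())
    by_fraction = sorted(exact, key=lambda db: exact[db] - floors[db],
                         reverse=True)
    for db in by_fraction[:leftover]:
        floors[db] += 1
    return floors
```

This guarantees the slices always sum to exactly max_jobs, so no slot
sits idle because of rounding.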

Finally, we add a further 100 jobs into the low DB:

  _replicator:      30 / (30+20+50) = 30% * 100 jobs max = 30 jobs
  high/_replicator: 50 / (30+20+50) = 50% * 100 jobs max = 50 jobs
  low/_replicator:  20 / (30+20+50) = 20% * 100 jobs max = 20 jobs

Now the allocation exactly matches the chosen percentages.

Have I missed any starvation scenarios?

Would this feature be useful to have? I was thinking of giving it a
try on a branch. I suspect implementing this for 3.x might be a tad
simpler, since scheduling is done in memory independently on each
node, so I was thinking of starting there first. For main (future
4.x) we might require some coordination state to live in FDB, and
would possibly have to patch up couch_jobs to know about priorities.

It does sound worth investigating, and on 3.x to start.


Cheers, -Nick


[1] The original 1988 paper by Judy Kay and Piers Lauder is a good,
short read. I especially like the Design Goals as listed on page 52 (9).
https://proteusmaster.urcf.drexel.edu/urcfwiki/images/KayLauderFairShare.pdf
