On 15/01/2021 17:51, Nick Vatamaniuc wrote:
> You're absolutely right.
There are two places where job priority comes
> into play: one is when jobs are started, and then the other is when
> they are stopped. When starting jobs, both continuous and one-shot
> (normal) jobs are at the same priority. However when jobs are stopped,
> we try to stop only the continuous jobs. The only time we'd stop
> one-shot jobs is when the operator manually lowers the max_jobs config
> limit. This means there is an imbalance between the starting and
> stopping algorithms and eventually a large number of one-shot
> replications could completely crowd out and replace all the continuous
> ones. Continuous ones would periodically stop, and if the new ones
> started are one-shot, those would slowly accumulate over time.

I do think this is a concern, but I also think it can be addressed
separately from this proposal. I don't have a good answer re: fixing it.
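
To make the crowd-out dynamic concrete, here is a toy simulation (in
Python, purely illustrative, not the real scheduler; the constants and
names are invented) of the behaviour described above: only continuous
jobs are ever stopped to make room, both kinds are started with equal
priority, and all new arrivals are one-shot:

    from collections import deque

    MAX_JOBS = 4     # running slots
    CHURN = 1        # continuous jobs stopped per scheduling cycle

    # All slots start out held by continuous jobs; new arrivals are one-shot.
    running = [("continuous", n) for n in range(MAX_JOBS)]
    pending = deque(("one-shot", n) for n in range(50))

    for cycle in range(6):
        # Stop step: only continuous jobs are candidates for stopping.
        for job in [j for j in running if j[0] == "continuous"][:CHURN]:
            running.remove(job)
            pending.append(job)   # the stopped job rejoins the back of the queue
        # Start step: both kinds are started with equal priority.
        while len(running) < MAX_JOBS and pending:
            running.append(pending.popleft())
        print(cycle, sorted(kind for kind, _ in running))

After a few cycles every running slot is held by a one-shot job, and
since only continuous jobs are ever stopped, nothing churns after that.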

> The idea is to still guarantee fairness with hopefully an easy to
> understand algorithm. For the 10 different replicator dbs we still
> round-robin among them, and pick the jobs with the lowest start time
> from each in turn, then, do the same thing next cycle, until we have
> enough jobs picked out. So for example if we had this setup:
> 
> db1: j4
> db2: j3, j5
> db3: j1, j2
> 
> Where the index for the jobs is the start time (j1 was started
> before j2), we'd pick them as j1, j3, j4 during the first round-robin
> selection loop, then we'd be left with
> 
> db1 :
> db2 : j5
> db3 : j2
> 
> Next we'd pick j2 and j5 and so on. So overall we'd get
> j1,j3,j4,j2,j5. If we look at the dbs they came from it would be:
> db3,db2,db1,db3,db2. In case db1 had a 98 jobs and db2 just 2:
> 
> db1: j1, j2, j3 - j41, j43 - j99
> db2: j42, j100
> 
> The algorithm would pick: j1, j42, j2, j100, j3, j4 ... j41, j43 ... j99. The
> dbs they'd come from would be db1,db2,db1,db2,db1,db1,db1,... to make
> sure all the dbs get a fair chance to run their jobs. If there is one
> db only, the algorithm doesn't change from the current one. However,
> one thing worrying me here is this implies either a lot of sorting or
> keeping a scheduler queue per db, so things could end up being quite
> expensive performance-wise.
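
For concreteness, here is how I read that selection loop: a rough
Python sketch, purely illustrative (the function name and the
in-memory per-db job lists are made up), which reproduces the
j1,j3,j4,j2,j5 ordering from your example:

    def pick_jobs(jobs_by_db, slots):
        # Jobs are (start_time, name) tuples, oldest first within each db.
        queues = {db: list(jobs) for db, jobs in jobs_by_db.items()}
        picked = []
        while len(picked) < slots and any(queues.values()):
            # One round-robin pass: visit each db that still has jobs,
            # starting with the db whose oldest job has the lowest start time.
            for _, db in sorted((q[0][0], db) for db, q in queues.items() if q):
                if len(picked) == slots:
                    break
                picked.append(queues[db].pop(0)[1])
        return picked

    jobs = {
        "db1": [(4, "j4")],
        "db2": [(3, "j3"), (5, "j5")],
        "db3": [(1, "j1"), (2, "j2")],
    }
    print(pick_jobs(jobs, 5))   # ['j1', 'j3', 'j4', 'j2', 'j5']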

So this doesn't really let us do any sort of priority queue, which is
unfortunate. You can take your higher priority tasks and move them
outside of the db that has all of your background tasks in it, but if
that separate db holds more than max_jobs/num_replicator_dbs jobs, some
will still get ignored. Fair share would provide a way to ensure
starvation doesn't occur.

(Also consider the situation where num_replicator_dbs > max_jobs...)

>> Would a fair-share scheduler [1] make sense here? We can't borrow many
>> of the OS-level scheduling ideas because we don't have the same insight
>> into the runtime characteristics of our jobs, but fair share might be a
>> good match.
>>
>> Here's one way it could be implemented:
>>
>> * Each replicator DB could represent a logical grouping of jobs, e.g.
>>    background continuous backup jobs, one-shot migration jobs,
>>    high-priority user requests, etc.
>>
>> * An INI-level config could determine how these groups are handled:
>>
>>    * 100% fair (default) -- total available job slots are divided by
>>      the number of replicator DBs, then allocated round-robin within
>>      that DB. The difference from today's setup is that each DB gets
>>      its own round-robin scheduler. I *think* this might be what Nick
>>      was suggesting, but I can't be 100% sure.
>>
> 
> Yeah, I think that's what I had in mind too but just having an equal
> priority between them. I like the idea of being able to tweak
> priorities for replicator dbs. Thanks for the reference to the OS
> scheduler and the paper! I looked at the Multi-Level Feedback Queue
> scheduler http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf which
> was used in Linux previously and is still used by some operating systems
> for CPU scheduling. The reason I looked at it is that initially I
> contemplated static user-configurable priorities, but realized we'd
> need something like the Multi-Level Feedback Queue to dynamically
> tweak job priority to avoid the case where some jobs were permanently
> starved.

Yeah, I did a full review of MLFQ and the Linux schedulers before
suggesting Fair Share... I think the design goal of "is easier to
intuit" is an important one here, and I found it easier to reason about
my variant than yours. But I'm not attached to my idea; let's do
whatever is easiest to explain and meets the other design goals.

>> [Fair share worked example]
> 
> I think this might work, but it could also starve lower priority jobs.
> I was thinking of the case where a high/_replicator:99 db is fed new
> jobs continuously, and a low/_replicator:1 db never gets a chance to
> run because its share gets rounded down to 0. One way to avoid that
> would be to ensure we pick at least one job from each db and then
> apply the proportional shares, or just make 1 the minimum when
> rounding down...

I would consider that user error in configuration, which I guess is
similar to the corner case in your proposal of having >max_jobs
replicator DBs.

But yeah, max(1, <formula previously presented>) solves that problem
nicely too.
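
To sketch what that would look like (Python, purely illustrative; the
share values and the function name are invented), dividing max_jobs
across dbs in proportion to their shares with a floor of one slot:

    def allocate_slots(shares, max_jobs):
        # Split max_jobs across replicator dbs in proportion to their shares,
        # but guarantee each db at least one slot so nothing rounds down to zero.
        total = sum(shares.values())
        return {db: max(1, (max_jobs * share) // total)
                for db, share in shares.items()}

    # The starvation case above: shares of 99 and 1, with 20 total job slots.
    print(allocate_slots({"high/_replicator": 99, "low/_replicator": 1}, 20))
    # {'high/_replicator': 19, 'low/_replicator': 1}

(Of course, with more replicator dbs than slots the floor of one
over-allocates, which is the num_replicator_dbs > max_jobs corner case
again.)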

> The paper and your proposal present an interesting way of
> "materializing" the proportional shares as an explicit quantity; the
> current scheme and mine don't have that part (there is no
> "scheduler_share" variable, we just pick the jobs in order of start
> time). But I like the idea and it's worth giving it a try. It might
> be one of those things where, once we start playing with the code,
> some parts become more obvious and easy to do and others turn out to
> be too complicated.

Sure, let's try it!

> Thank you, Joan, for the great suggestions and for taking a look!

My pleasure!

If you need a use case that is hard to reason about, consider a
resource-constrained cluster (where the # of replications vastly
outstrips the resources available) where you need to ensure fairness
across a very large number of continuous and one-shot replications.
Within these, the one-shot replications are considered higher priority,
but they should not completely drown out the continuous ones.

I can give more detail on the example if it sounds interesting to you -
this was a client system I worked on for a while (after it was too late
to improve the design) and we were never able to get CouchDB's
replication scheduler to do the right thing. Everything got replaced by
an external scheduler that knew what max_jobs was, and did everything as
one-shot replications.

-Joan
