Jonathan Hoser <[email protected]> wrote on 04/28/2009 10:25 AM:

> The events that drive rr_sim:
>
> >>>RPC complete:  We have now committed to some more work.  Is there
> >>>anything that needs to start right now to complete on time?
>
> >>Nope. No need. The request should be reasonable enough and the WUs
> >>committed to should be reasonable enough that there should be no change
> >>to the presently running work.
>
> >There is no guarantee at all what the server is going to hand us.  It
> >may NOT be reasonable.  There is no check at the server to determine if
> >the tasks sent to the client are going to be trouble.
>
> Alright. If reasonable scheduling can't get it done before the deadline
> is reached, we'll just have to return it, won't we?
>
> As a side note:
> I too disagree on RPC complete being a trigger point:
> if we get new work, we WILL have to wait for the download to finish;
> until then everything is grey theory.

The check at this point is to see if something else needs attention now
that more work is scheduled.  I realize that the new task cannot run until
all its files are downloaded.  Often the RPC and file transfers complete
are only seconds apart, but sometimes it can be hours, in particular when
an RPC occurs just prior to a "time of day" network shutoff.  I have seen
this happen several times.

An RPC can also remove running work and yet-to-be-started work, as well as
adding work.

Example:

You have several tasks, including one with an estimated hour of wall time
remaining that is due in 24 hours.
The host is currently running a CPDN task due in a few months with a few
days of run time left.

A project server sends a (completely unreasonable) set of work that will
take 23 hours, with a deadline only 24 hours away.

It is now time to start doing something about the work that is due in 24
hours.

Half an hour later, the last of the files for the new tasks (now due in
23.5 hours) completes downloading.  If that half hour had been spent on the
CPDN task, something would be late that would not have had to be late.

While the work request may be reasonable, the work supplied may not be.
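To make the arithmetic of that example concrete, here is a minimal,
hypothetical check of the kind that would have to fire at RPC complete,
before the downloads finish (illustrative names only, not the actual client
code):

    // Hypothetical illustration of the example above; not actual client code.
    struct Task {
        double time_to_deadline;   // seconds until the report deadline
        double est_runtime_left;   // estimated wall seconds still needed
    };

    // True if the task must start (roughly) now to make its deadline.
    bool must_start_now(const Task& t) {
        return t.est_runtime_left >= t.time_to_deadline * 0.9;  // 10% margin
    }

    // New work: 23 hours of runtime, due in 24 hours.
    // must_start_now({24 * 3600.0, 23 * 3600.0}) returns true, so the
    // running CPDN task should yield as soon as the files arrive.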

>
> >>> File Download Complete:  We now have a task ready to run.  Does it
need to
> >>> get started right now?
>
> >> Again, should be no change for the above reasoning.
>
> > Same reasoning:  The server has no check for reasonableness.
>
> This is one of the points where I'd call a 'doScheduling' function, as
> opposed to what Martin says, and more in line with what jm7 does.
>
>
> >>> Task complete:  What should we run now?  We have a free processor.
>
> >> Yep, but streamline the scheduling to be adding the work to a
> >> maintained fifo linked list with a total duration accumulated for the
> >> list. Use one variable that is added to and subtracted from. DON'T
> >> blindly iterate through the linked list to add it up! DCF is applied
> >> to that one variable. That is, the linked list WU times are raw
> >> uncorrected times.
>
> > I have no idea what you are trying to say here.  Are you trying to say
> > that we should always start the next task in FIFO order?
>
> I think he does - but (in my opinion) that misses the different DCF
> values for different projects.
> But it is certainly a trigger point for scheduling.
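
For what it's worth, a rough sketch of the "one accumulated variable" idea,
with the per-project DCF concern handled by keeping one accumulator per
project (illustrative only, not client code):

    // Sketch of the accumulated-duration idea; illustrative, not client code.
    #include <deque>

    struct QueuedWU {
        double raw_est_seconds;   // uncorrected estimate from the server
    };

    struct ProjectQueue {
        std::deque<QueuedWU> fifo;   // work in arrival (FIFO) order
        double raw_total = 0;        // maintained sum of raw estimates
        double dcf = 1.0;            // this project's duration correction factor

        void push(const QueuedWU& wu) {
            fifo.push_back(wu);
            raw_total += wu.raw_est_seconds;   // add; never re-walk the list
        }
        void pop_front() {
            raw_total -= fifo.front().raw_est_seconds;
            fifo.pop_front();
        }
        // DCF is applied once, to the accumulated raw total.
        double corrected_total() const { return raw_total * dcf; }
    };

std::deque stands in for the hand-maintained linked list; the point is only
that the total is kept incrementally and DCF is applied to the sum of raw
times, per project.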
>
>
> >>> Project and Task Suspend:  Do we still have the same tasks running?
> >>> Do we have an empty resource?
> >> Yep.
> Yep.
>
> >>> Project and Task Resume:  Some tasks have been suspended for an
> >>> unknown amount of time.  Do we need to run one of them?
> >> Yep.
> Yep.
>
>
> In my eyes this amounts to the following:
>
> checkpointing:
> if task.runtime < TSI
>    do nothing
> else
>    halt job
>    do rescheduling (job)
>
> task complete:
> if task.complete
>    do rescheduling
>
> download complete:
>    do scheduling
>
> project/task resume/suspend
>    do rescheduling
>
> maybe (for sake of completeness):
> RPC complete:
> if server asks to drop WU
>    halt job;
>    do rescheduling (job)
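
One hypothetical way to tie the trigger points above together is a single
dispatcher; the sketch below uses made-up names and stubs, and is not the
actual client code:

    // Hypothetical dispatcher for the trigger points above; not client code.
    enum class SchedEvent {
        Checkpoint,         // task just wrote a checkpoint
        TaskComplete,
        DownloadComplete,
        Suspend,            // project or task suspended
        Resume,             // project or task resumed
        RpcDroppedWU        // server asked the client to drop a WU
    };

    // Stubs standing in for the real machinery.
    void halt_job(int /*task_id*/) {}
    void request_reschedule() {}

    void on_event(SchedEvent e, int task_id, double runtime, double tsi) {
        switch (e) {
        case SchedEvent::Checkpoint:
            if (runtime < tsi) return;   // keep running for at least one TSI
            halt_job(task_id);
            request_reschedule();
            break;
        case SchedEvent::RpcDroppedWU:
            halt_job(task_id);
            request_reschedule();
            break;
        default:                         // complete / download / suspend / resume
            request_reschedule();
            break;
        }
    }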
>
> The trickier part, of course, is the scheduling/rescheduling calls, and
> I'm currently leafing through my notepad looking for the sketch...
> For my idea we'd need:
>
> list of jobs run (wct, project)
> -> containing the wall-clock times for every job run during the last
> 24 hours.
>
> per resource (CPU / GPU Type1 / GPU Type2 / Coproc) two (2) linked
> lists of jobs eligible to run on that resource
> -> linked list1 (CPU) (containing all jobs in order of getting them)
> -> linked list2 (CPU) (short, only jobs in 'working' state)
> -> linked list1 (GPU)
> -> linked list2 (GPU)
> -> linked list (...)
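
The 24-hour wall-clock list above is what the "check ... what project shall
get resource now" step further down would consult; a toy version of that
check (hypothetical, ignoring debts and everything else the real client
tracks) might look like this:

    // Toy project-selection check; hypothetical, ignores debts etc.
    #include <map>
    #include <string>

    struct ProjectInfo {
        double resource_share;   // as configured by the user
        double recent_wct;       // wall-clock seconds run in the last 24 hours
    };

    // Pick the project that is furthest below its fair share of recent time.
    std::string pick_project(const std::map<std::string, ProjectInfo>& projects) {
        double total_share = 0, total_wct = 0;
        for (const auto& p : projects) {
            total_share += p.second.resource_share;
            total_wct   += p.second.recent_wct;
        }
        std::string best;
        if (projects.empty() || total_share == 0) return best;
        double best_deficit = -1e300;
        for (const auto& p : projects) {
            double entitled = total_wct * p.second.resource_share / total_share;
            double deficit  = entitled - p.second.recent_wct;   // how far behind
            if (deficit > best_deficit) { best_deficit = deficit; best = p.first; }
        }
        return best;
    }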
>
> reschedule (job)
>    if reason: task/project suspend
>       mark job(s) as suspended in linkedlist1(resourcetype)
>       addlast to linkedlist2(resourcetype)
>
>    if reason: task/project resume
>       mark job(s) as runable in linkedlist1(resourcetype)
>       if job(s) contained in linkedlist2(resourcetype)
>          order runnable jobs in linkedlist2(resourcetype) by order
>          in linkedlist1(resourcetype) (or EDF)
>
>    if reason: drop WU
>       do cleanup
>       do reschedule reason 'task complete'
>
>    if reason: task complete
>       do cleanup
>       check (via wct-job-list and resourceshare/project priority)
>       what project shall get resource now.
>       traverse linkedlist2(resourcetype) until end or eligible job found
>       if job is found, launch it now.
>       else
>       traverse linkedlist1(resourcetype) until eligible job found or end
>       if job is found, launch it now.
>       else
>       redo the above check and choose second/third/...-highest
>       scoring project
>
>    if reason: checkpointing / TSI
>       addfirst (job) to linkedlist2(resourcetype)
>       do reschedule reason 'task complete'
>
> schedule(job)
>    if new job has deadline earlier than
>       now + (sum of estimatedjobruntime of jobs in linkedlist1(resourcetype)
>          divided by resources available (e.g. CPUs))
>       then insert job in linkedlist1(resourcetype) so that
>          deadline will be met.
>
> So far... my idea in pseudocode.
> The second (temporary) list could be omitted by keeping certain flags
> on the jobs in the first list, but it adds some clarity for a first draft.
> The 'do cleanup' step would also need to add the runtime for the
> rescheduled job to the list of wall-clock times;
> other than that... I hope I have outlined my idea in a clear fashion.
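
Before responding below: a rough C++ rendering of the two-list-per-resource
idea, purely to make the proposal concrete (hypothetical names; a sketch of
the proposal, not an implementation):

    // Sketch of the two-list-per-resource proposal above; illustrative only.
    #include <list>
    #include <string>

    struct Job {
        std::string name;
        double deadline;      // absolute time, seconds
        double est_runtime;   // estimated wall seconds remaining
        bool suspended = false;
    };

    struct ResourceQueues {
        int n_instances = 1;          // e.g. number of CPUs of this type
        std::list<Job*> all_jobs;     // "linked list1": every job, arrival order
        std::list<Job*> working;      // "linked list2": short, active jobs only
        double est_total = 0;         // maintained sum of est_runtime in all_jobs

        // schedule(job): insert early enough that the deadline can be met.
        void schedule(Job* j, double now) {
            double horizon = now + est_total / n_instances;
            auto pos = all_jobs.end();
            if (j->deadline < horizon) pos = all_jobs.begin();  // crude: jump the queue
            all_jobs.insert(pos, j);
            est_total += j->est_runtime;
        }
    };

The reschedule cases would then shuffle jobs between working and all_jobs
much as the pseudocode lays out; the crude "jump the queue" insert stands in
for the finer "insert so that the deadline will be met" rule.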
>
> Best
> -Jonathan
>
>
>
Note that scheduling and enforcement are split.  They do not always run at
the same time.  Scheduling consists of finding the preferred set of tasks
that should be running now.  Enforcement consists of comparing the
currently running set of tasks with the preferred set of tasks and possibly
changing the set of running tasks.
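
As a rough illustration of that split (hypothetical names only, not the
client's actual interfaces): scheduling produces a preferred set, and
enforcement diffs it against what is currently running:

    // Illustrative sketch of the scheduling/enforcement split; not client code.
    #include <set>
    #include <string>
    #include <vector>

    using TaskId = std::string;

    // "Enforcement": compare the preferred set (the output of scheduling)
    // with the currently running set and report what to preempt and start.
    void enforce(const std::set<TaskId>& preferred,
                 const std::set<TaskId>& running,
                 std::vector<TaskId>& to_preempt,
                 std::vector<TaskId>& to_start) {
        for (const TaskId& t : running)
            if (!preferred.count(t)) to_preempt.push_back(t);  // no longer wanted
        for (const TaskId& t : preferred)
            if (!running.count(t)) to_start.push_back(t);      // wanted, not running
    }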

I really don't understand the reasoning of schedule(job) or reschedule.
There can never be a perfect understanding of what just changed in the
system because one of the things that changes all of the time is the
estimated remaining runtime of tasks, and this is one of the items that
needs to drive the calculation of what is going to miss deadline.  What is
going to miss deadline depends on all of the other tasks on the host, and a
single task cannot be isolated from the rest for this test.
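
That is why the deadline-miss test has to simulate all of the work on the
host together.  Very roughly, and ignoring resource shares, coprocessors,
and everything else the real rr_sim handles, the shape of the calculation
is:

    // Very rough deadline-miss sketch in the spirit of rr_sim; ignores
    // resource shares, coprocessors, and much else the real code handles.
    #include <algorithm>
    #include <vector>

    struct SimTask {
        double deadline;       // seconds from now
        double runtime_left;   // current estimate -- it changes all the time
        bool misses_deadline = false;
    };

    void simulate(std::vector<SimTask>& tasks, int ncpus) {
        double now = 0;
        std::vector<SimTask*> active;
        for (auto& t : tasks) active.push_back(&t);
        // Time-share the remaining work across the CPUs; each task's finish
        // time therefore depends on every other task still in the mix.
        while (!active.empty()) {
            int n = (int)active.size();
            double share = (n <= ncpus) ? 1.0 : double(ncpus) / n;
            double dt = active[0]->runtime_left / share;
            for (auto* t : active) dt = std::min(dt, t->runtime_left / share);
            now += dt;
            std::vector<SimTask*> still;
            for (auto* t : active) {
                t->runtime_left -= dt * share;
                if (t->runtime_left <= 1e-9) {
                    if (now > t->deadline) t->misses_deadline = true;
                } else {
                    still.push_back(t);
                }
            }
            active.swap(still);
        }
    }

Whether any one task misses its deadline falls out of simulating all of
them together, with estimates that are themselves moving targets, so there
is no per-task shortcut.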

jm7

