The events that drive rr_sim:
>>>RPC complete: We have now committed to some more work. Is there anything
>>>that needs to start right now to complete on time?
>>Nope. No need. The request should be reasonable enough and the WUs
>>committed to should be reasonable enough that there should be no change
>>to the presently running work.
>There is no guarantee at all what the server is going to hand us. It may
>NOT be reasonable. There is no check at the server to determine if the
>tasks sent to the client are going to be trouble.
Alright. If reasonable scheduling doesn't get it run before the deadline is
reached, we'll just have to return it, won't we?
As a side note:
I too disagree with RPC complete being a trigger point:
if we get new work, we WILL have to wait for the download to finish; until then
it's all just theory.
>>> File Download Complete: We now have a task ready to run. Does it need to
>>> get started right now?
>> Again, should be no change for the above reasoning.
> Same reasoning: The server has no check for reasonable.
This is one of the points where I'd call a 'doScheduling' function, as opposed
to what Martin says, and more in line with what Jm7 does.
>>> Task complete: What should we run now? We have a free processor.
>> Yep, but streamline the scheduling to be adding the work to a maintained
>> fifo linked list with a total duration accumulated for the list. Use one
>> variable that is added to and subtracted from. DON'T blindly iterate
>> through the linked list to add it up! DCF is applied to that one
>> variable. That is, the linked list WU times are raw uncorrected times.
> I have no idea what you are trying to say here. Are you trying to say that
> we should always start the next task in FIFO order?
I think he does - but (in my opinion) that misses the different DCF values of
the different projects.
But it's certainly a trigger point for scheduling (see the sketch below).
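To make that concrete (and with a per-project DCF, which I think is the missing
piece), here is a minimal C++ sketch; all names (RawJob, ProjectQueue, ...) are
made up for illustration, nothing from the existing client:

#include <deque>
#include <string>

// Sketch: a per-project FIFO of committed work.  The queue stores raw
// (uncorrected) estimates; one running total is maintained by adding on
// push and subtracting on pop, so we never iterate the list just to sum
// it.  DCF is applied once, to that single total, per project.
struct RawJob {
    std::string name;
    double raw_estimate;          // estimated runtime, not DCF-corrected
};

struct ProjectQueue {
    std::deque<RawJob> fifo;
    double raw_total = 0;         // sum of raw_estimate over the fifo
    double dcf = 1.0;             // this project's duration correction factor

    void push(const RawJob& j) {
        fifo.push_back(j);
        raw_total += j.raw_estimate;   // maintain the total incrementally
    }
    RawJob pop_next() {                // task complete: next job in FIFO order
        RawJob j = fifo.front();
        fifo.pop_front();
        raw_total -= j.raw_estimate;
        return j;
    }
    double corrected_total() const {   // DCF applied to the one variable
        return raw_total * dcf;
    }
};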
>>> Project and Task Suspend: Do we still have the same tasks running? Do we
>>> have an empty resource?
>> Yep.
Yep.
>>> Project and Task Resume: Some tasks have been suspended for an unknown
>>> amount of time. Do we need to run one of them?
>> Yep.
Yep.
In my eyes this amounts to the following:

checkpointing:
    if task.runtime < TSI (task switch interval)
        do nothing
    else
        halt job
        do rescheduling (job)
task complete:
    if task.complete
        do rescheduling
download complete:
    do scheduling
project/task resume/suspend:
    do rescheduling
and maybe (for the sake of completeness):
RPC complete:
    if server asks to drop WU
        halt job
        do rescheduling (job)
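Spelled out as code, the dispatch could look roughly like this; everything
below (SchedulerEvent, Reason, halt_job(), reschedule(), ...) is a made-up
sketch, not existing client API:

// Placeholder types and helpers, just so the sketch hangs together.
struct Job { double runtime; };
enum class Reason { CheckpointTsi, TaskComplete, SuspendResume, DropWu };
constexpr double task_switch_interval = 60 * 60;   // TSI, e.g. 60 minutes

void halt_job(Job*);
void schedule(Job*);                  // sketched further below
void reschedule(Job*, Reason);        // sketched further below
bool server_asked_to_drop(const Job*);

enum class SchedulerEvent {
    Checkpoint, TaskComplete, DownloadComplete,
    ProjectOrTaskSuspend, ProjectOrTaskResume, RpcComplete
};

void on_event(SchedulerEvent ev, Job* job) {
    switch (ev) {
    case SchedulerEvent::Checkpoint:
        // Only act once the task has run for at least one TSI.
        if (job->runtime >= task_switch_interval) {
            halt_job(job);
            reschedule(job, Reason::CheckpointTsi);
        }
        break;
    case SchedulerEvent::TaskComplete:
        reschedule(job, Reason::TaskComplete);
        break;
    case SchedulerEvent::DownloadComplete:
        schedule(job);                // brand-new job: decide where it goes
        break;
    case SchedulerEvent::ProjectOrTaskSuspend:
    case SchedulerEvent::ProjectOrTaskResume:
        reschedule(job, Reason::SuspendResume);
        break;
    case SchedulerEvent::RpcComplete:
        if (server_asked_to_drop(job)) {   // the only case where RPC matters
            halt_job(job);
            reschedule(job, Reason::DropWu);
        }
        break;
    }
}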
The trickier part, of course, is the scheduling/rescheduling calls themselves,
and I'm currently leafing through my notepad looking for the sketch...
For my idea we'd need:

a list of jobs run (wct, project)
    -> containing the wall-clock times for every job run during the last 24 hours

per resource (CPU / GPU type 1 / GPU type 2 / coproc), two (2) linked lists of
jobs eligible to run on that resource
    -> linked list1 (CPU) (containing all jobs, in the order we got them)
    -> linked list2 (CPU) (short, only jobs in 'working' state)
    -> linked list1 (GPU)
    -> linked list2 (GPU)
    -> linked list (...)
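As a rough data-structure sketch (again, invented names, just to pin the idea
down, nothing taken from the real client):

#include <chrono>
#include <list>
#include <map>
#include <string>
#include <vector>

// One record per job run during the last 24 hours; used later to weigh
// each project's recent wall-clock time against its resource share.
struct WctRecord {
    std::string project;
    double wall_clock_seconds;
    std::chrono::system_clock::time_point finished_at;
};

struct JobRef { /* id, project, deadline, estimate, state flags, ... */ };

// Per resource type (CPU, each GPU type, other coprocs):
//   list1: all eligible jobs, in the order we received them
//   list2: the short 'working' list (running / preempted jobs)
struct ResourceQueues {
    std::list<JobRef> list1;
    std::list<JobRef> list2;
};

struct SchedulerState {
    std::vector<WctRecord> recent_runs;            // pruned to the last 24 h
    std::map<std::string, ResourceQueues> queues;  // keyed by resource type
};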
reschedule (job):
    if reason: task/project suspend
        mark job(s) as suspended in linkedlist1(resourcetype)
        addlast (job(s)) to linkedlist2(resourcetype)
    if reason: task/project resume
        mark job(s) as runnable in linkedlist1(resourcetype)
        if job(s) contained in linkedlist2(resourcetype)
            order the runnable jobs in linkedlist2(resourcetype) by their
            order in linkedlist1(resourcetype) (or EDF)
    if reason: drop WU
        do cleanup
        do reschedule, reason 'task complete'
    if reason: task complete
        do cleanup
        check (via the wct job list and resource share / project priority)
        which project shall get the resource now
        traverse linkedlist2(resourcetype) until the end or an eligible
        job is found
        if a job is found, launch it now
        else
            traverse linkedlist1(resourcetype) until an eligible job is
            found or the end
            if a job is found, launch it now
            else
                redo the above check and choose the second/third/...-highest
                scoring project
    if reason: checkpointing / TSI
        addfirst (job) to linkedlist2(resourcetype)
        do reschedule, reason 'task complete'
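The 'task complete' branch is the heart of it; reusing the invented types from
the sketch above, it might look something like this (projects_by_priority(),
eligible_now() and launch() are placeholders for the resource-share check and
the eligibility test):

// Placeholders: rank projects by recent wall-clock time vs. resource share,
// and test whether a given job may run right now (right project, not
// suspended, fits in memory, ...).
std::vector<std::string> projects_by_priority(const SchedulerState& st);
bool eligible_now(const JobRef& j, const std::string& project);
void launch(const JobRef& j);

// Reason 'task complete': find the next job for a freed resource.
void on_task_complete(SchedulerState& st, const std::string& resource) {
    ResourceQueues& q = st.queues[resource];
    // Try the highest-scoring project first, then the second, and so on.
    for (const std::string& proj : projects_by_priority(st)) {
        // First the short 'working' list (resumed / preempted jobs) ...
        for (const JobRef& j : q.list2)
            if (eligible_now(j, proj)) { launch(j); return; }
        // ... then the full list, in arrival (or EDF) order.
        for (const JobRef& j : q.list1)
            if (eligible_now(j, proj)) { launch(j); return; }
    }
}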
schedule (job):
    if the new job has a deadline earlier than
        now + (sum of the estimated runtimes of the jobs in
               linkedlist1(resourcetype), divided by the resources
               available (e.g. the number of CPUs))
    then insert the job into linkedlist1(resourcetype) such that the
    deadline will be met.
So far, that's my idea in pseudocode...
The second (temporary) list could be omitted by keeping certain flags on the
jobs in the first list instead, but it adds some clarity for this first draft.
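And for completeness, the schedule(job) check above as a small sketch, with
invented helpers (deadline_of(), estimate_of(), n_resources()); push_front()
is only the simplest possible placement, an EDF-style insertion would be the
more careful choice:

double deadline_of(const JobRef& j);   // seconds from now until the deadline
double estimate_of(const JobRef& j);   // estimated runtime in seconds
int    n_resources(const std::string& resource);   // e.g. number of CPUs

// On download complete: does the new job have to jump the queue to make
// its deadline, or can it simply be appended FIFO-style?
void schedule(SchedulerState& st, const std::string& resource, const JobRef& job) {
    ResourceQueues& q = st.queues[resource];
    double queued = 0;
    for (const JobRef& j : q.list1) queued += estimate_of(j);
    double finish_if_appended = queued / n_resources(resource);
    if (deadline_of(job) < finish_if_appended) {
        // Deadline is tight: insert early enough that it can still be met.
        q.list1.push_front(job);
    } else {
        q.list1.push_back(job);        // plenty of time: plain FIFO append
    }
}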
The 'do cleanup' step would also need to add the runtime of the rescheduled
job to the list of wall-clock times.
Other than that... I hope I have managed to outline my idea in a clear fashion.
Best
-Jonathan