On Apr 30, 2009, at 5:48 AM, [email protected] wrote:

> jm7
>
>> 1) We do it too often (event driven)
>
> Exactly what we are not listening to. The rate of tests is NOT the
> reason for incorrect switches.
No, but it is the reason we have difficulty finding them. And it is a source of instability. John, you can stick your head in the sand all you want; it will not make the problems go away just because you refuse to see them.

Also, because you schedule "globally," and because you recalculate based on the situation as it is NOW, the universe is different every time you recalculate; the situation evolves if for no other reason than that work has been done in the meantime. If you are doing that 10 times a minute, you are going to get 10 different answers. Those answers MAY be close enough that no change is needed under the rules as they stand, but coupled with the other limitations this is an issue. I did show a very specific example of this effect: task A completes and B starts; A's upload completes, B is suspended and C is started; another task D completes and E is started; D's upload completes, E is suspended and F is started.

On the last point I will remind you once again: I may not be able to walk straight anymore, and I sometimes have trouble talking, but I am a trained and skilled systems engineer. This is what I used to do. I know I cannot put my finger on a line in a log to convince you or anyone else, but this is a problem. It is a problem because it loads up the logs with unneeded entries, and it is also a cause of some of the instability we see. Anyone who works with unstable systems knows that bumping an unstable system causes problems; the more you bump it, the faster those problems arise.

>> 2) All currently running tasks are eligible for preemption
>
> Not completely true, and not the problem. Tasks that are not in the list
> of want to run are preemptable, tasks that are in the list of want to run
> are preemptable. They should only be preempted if either that task is
> past its TSI, or there is a task with deadline trouble (please work with
> me on the definition of deadline trouble).

Which means you have not looked at the code. The first loop in the code marks the next state of ALL running tasks as preempted. Dr. Anderson made a change that was supposed to cure that, but it does not.

>> 3) TSI is not respected as a limiting factor
>
> It cannot be in all cases. There may be more cases where the TSI could
> be honored.

For the reason above, it is not honored at all. I have pointed to the block of code where all tasks are marked for preemption, and that, my friend, means that TSI is not considered at all ... Again, you are thinking in terms of single-stream systems, and on those I agree that this is the case. On multi-core systems it is much less of an issue, to the point where it might never be an issue at all.

Take an 8-core system where all running tasks are 8 hours in length: the average time between task completions is 1 hour. Assuming the system has been running for a while, that is what statistics tells me. With the mix of task lengths I see on my systems the situation is usually much better than that. See the numbers below. In one of my first posts I actually listed the numbers of tasks and the run times ... but the numbers below are illustrative enough.
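To put numbers on that, here is a rough sketch of the arithmetic (my own illustration, nothing from the client code): if a host keeps N tasks of roughly equal length running at once and has been doing so for a while, completions land about task_length / N apart.

    #include <cstdio>

    // Rough estimate of the mean time between task completions on a host
    // that keeps 'cores' tasks of length 'task_hours' running at once.
    // After the host has been running a while, the tasks' finish times are
    // roughly uniformly staggered, so the mean gap is task length / cores.
    static double mean_completion_gap_hours(int cores, double task_hours) {
        return task_hours / cores;
    }

    int main() {
        // The 8-core, 8-hour case above: about one hour between completions.
        std::printf("8 cores, 8 h tasks: %.2f h between completions\n",
                    mean_completion_gap_hours(8, 8.0));
        // Shorter tasks (or more cores) shrink the gap further, which is why
        // a pending task rarely waits long for a free element on a wide box.
        std::printf("8 cores, 1 h tasks: %.2f h between completions\n",
                    mean_completion_gap_hours(8, 1.0));
        return 0;
    }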
>> 4) TSI is used in calculating deadline peril
>
> And it has to be. Since tasks may (or may not) be re-scheduled at all
> during a TSI, and the TSI may line up badly with a connection, the TSI is
> an important part of the calculation.
>
> Example:
> 12 hour TSI.
> 1 hour CPU time left on the task.
> 12 hours and 1 second left before deadline.
> No events for the next 12 hours.
> Without TSI in the calculation, there is the distinct possibility that
> there is no deadline trouble recorded.
> Wait 12 hours.
> You now have 1 second wall time left and 1 hour CPU time left. Your task
> is now late.
>
> With TSI in the calculation:
> Deadline trouble is noted at the point 12 hours and 1 second before
> deadline (if not somewhat earlier depending on other load). The task gets
> started and completes before deadline.

Proving once again that you are thinking of systems that run a single processing stream. I suppose you forgot my last test, where you did not want to read the numbers. Or the test before that. In the first test the average time between completions of the tasks run was 6 minutes (measured over 24 hours). In the other test, the counts of "Request CPU reschedule: handle_finished_apps" events were 3, 11, 14, 22, and 19 over a three-hour period, on 4, 4, 8, 4, and 8 CPU systems respectively. That means the time between one completed task and the next was at worst 60 minutes and at best about 8 minutes (6 minutes in the first test). Your theory falls apart because when the next task completes, the pending task can be picked up and scheduled next.

We are not talking about scheduling problems on single-core systems. It would be nice if you would keep that in mind. We are talking about parameters developed to control scheduling on single-thread systems being inappropriate on multi-core systems.

>> 5) Work mix is not kept "interesting"
>> 6) Resource Share is used in calculating run time allocations
>
> A simulation that tracks what the machine is likely to actually do has to
> track what happens based on resource share. It may not want to be the
> trigger for instant preemption though.

Sadly, it does exactly that right now: it triggers preemption at the slightest breeze. Last night I had 5 uFluids tasks all running in parallel because the scheduler decided that the deadline of 5/13 could not be met. It ran those tasks for several hours before I suspended most of them. Later it suspended the one it was still running, and late last night I unsuspended all of them again. They are STILL waiting to be restarted. Because their deadlines are close together, the mechanisms used to calculate "globally" will always select these tasks in batches and screw up the work mix, which means that my i7 runs in a mode that is significantly less efficient. This is also why I have proposed other metrics and rules for these decisions, to reduce how much Resource Share drives the selection process.

>> 7) Work "batches" (tasks with roughly similar deadlines) are not "bank
>> teller queued"
>
> I really don't understand this one. A bank teller queue means that tasks
> come from one queue and are spread across the available resources as they
> become available. Are they always run in FIFO? No. However, that does
> not mean that they are not coming from the same queue.

Probably because you keep refusing to read what I write carefully. See the example above. If you schedule "globally," as you so love to do, then tasks with close deadlines and relatively low Resource Shares will always cause these panics. I get them for IBERCIVIS, VTU, and just recently uFluids.
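Since "bank teller queue" keeps getting misread, here is a minimal sketch of what I mean; the structure and task names are mine for illustration, not the client's code. One deadline-ordered queue feeds whichever processing element frees up next, and nothing already running is preempted to make room.

    #include <cstdio>
    #include <queue>
    #include <string>
    #include <vector>

    // Hypothetical task record: a name and a deadline in seconds from now.
    struct Task {
        std::string name;
        double deadline;
    };

    // Comparator so the shared queue hands out the earliest deadline first.
    struct LaterDeadline {
        bool operator()(const Task& a, const Task& b) const {
            return a.deadline > b.deadline;
        }
    };

    int main() {
        // One queue feeds all processing elements ("teller windows").
        std::priority_queue<Task, std::vector<Task>, LaterDeadline> waiting;
        waiting.push({"uFluids_1",   3600.0 * 24 * 5});
        waiting.push({"NQueens_1",   3600.0 * 24 * 2});
        waiting.push({"IBERCIVIS_1", 3600.0 * 24 * 9});

        const int freed_elements = 2;   // elements that just finished a task
        for (int i = 0; i < freed_elements && !waiting.empty(); ++i) {
            Task next = waiting.top();
            waiting.pop();
            // A task is dispatched only when an element frees up; the tasks
            // still running on the other elements are left alone.
            std::printf("dispatch %s (deadline in %.0f s)\n",
                        next.name.c_str(), next.deadline);
        }
        return 0;
    }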
>> 8) History of work scheduling is not preserved and all deadlines are
>> calculated fresh each invocation.
>
> Please explain why this is a problem? The history of work scheduling may
> have no bearing on what has to happen in the future.

See above. It also leads to other instabilities that you don't want to recognize. When I re-enabled the uFluids tasks that were such a cause for panic yesterday, it sure would seem that they should be a cause for panic today. I have an NQueens task that was suspended yesterday with 12 minutes left to run, and it still has not restarted. If it was so important to start it yesterday and run it up to that point, why, 24 hours later, has BOINC been running off tasks from projects it has just downloaded work from, tasks with deadlines that are later?

>> 9) True deadline peril is rare, but "false positives" are common
>
> Methods that defer leaving RR for a long time will increase true deadline
> peril. What is needed is something in between.

Again, the systems of which we speak tend to complete tasks fast enough that this argument makes no sense. With resources coming free in minutes, on average, there is no chance that this is going to be as common as you posit. Again and again, you are thinking of the old, slow systems, and when you refuse to consider the evidence that people like Richard and I supply, well ... I know it is harder to see on a 4-core system, though I did notice these issues in 2005 after I had gotten my first 4-CPU system (the first two systems in the test above), but you can see it if you watch the patterns of operation.

>> 10) Some of the sources of work peril may be caused by a defective
>> work fetch allocation
>
> Please give examples from logs.

I don't have to. You have described, over and over again, why every suggested change cannot work because of these very issues. Go back and look at your examples. Virtually all of them involve BOINC downloading work that all of a sudden causes this magical situation where I have to madly start processing the new work because BOINC fetched something that causes the world to change. Ergo, if BOINC had not fetched that work, the problem would not have occurred and the universe would not be ending. Even so, many of those examples of panics are still modeled on only a single stream of work processing.

>> 11) Other factors either obscured by the above, I forgot them, or maybe
>> nothing else ...
>>
>>> work-fetch decisions
>>
>> Seems to be related to:
>>
>> 1) Bad debt calculations
>> 2) Asking for inappropriate work loads
>> 3) Asking for inappropriate amounts
>
> Please give examples.

I have, any number of times. I could send you another long log showing that the CUDA debt is slowly building and in another 24 hours or so will be so out of whack that the client stops asking for work from GPU Grid, the only project from which GPU work can be fetched, while BOINC happily ignores all evidence to the contrary, tries to get CUDA work from every other project in the universe, and pouts because it cannot get it. There is the Rosetta guy who cannot get a queue full of Rosetta work because of the opposite problem (he is only attached to GPU Grid and Rosetta), and there are Richard's logs where he needs one class of work and the work fetch asks for the wrong kind. Others have mentioned the next one before: I ask for 1 second of work and instead of getting one task I get 10 or more. That is a long-standing problem and the issue is on the server end, but it is still a problem.
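To illustrate the debt drift I am describing, here is a deliberately simplified model; the numbers are made up and the code is mine, not the client's debt calculation. The point is only that when CUDA debt is adjusted by resource share for every attached project while only one project can ever supply CUDA work, the debts of the CUDA-less projects climb without bound and work fetch keeps chasing them.

    #include <cstdio>

    int main() {
        // Simplified model: each hour the GPU does one hour of work for the
        // only project that has CUDA tasks, while long-term CUDA debt is
        // adjusted for every attached project according to resource share.
        const int n_projects = 4;            // project 0 is the only CUDA project
        double share[n_projects]     = {0.25, 0.25, 0.25, 0.25};
        double cuda_debt[n_projects] = {0, 0, 0, 0};

        for (int hour = 0; hour < 48; ++hour) {
            double work_done[n_projects] = {1.0, 0.0, 0.0, 0.0};
            for (int p = 0; p < n_projects; ++p) {
                // Debt grows by the share of work "owed" minus work actually done.
                cuda_debt[p] += share[p] * 1.0 - work_done[p];
            }
        }
        for (int p = 0; p < n_projects; ++p) {
            std::printf("project %d CUDA debt after 48 h: %+.1f\n", p, cuda_debt[p]);
        }
        // Project 0 ends deeply negative while the projects that can never
        // supply CUDA work keep climbing, so the client keeps asking them
        // for CUDA work and stops asking the one project that has it.
        return 0;
    }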
>> 4) Design of client / server interactions
>
> There are design constraints that limit the transaction to one round
> trip.

Actually, they are design choices. And they may or may not be the best choices. One of the recent examples and questions was why we feed the list of tasks up to the server each time. Another design choice. The server is supposed to use that information to make a good choice about the work it feeds down. If I understand the other proposal made recently, changes could be made to this exchange that might be beneficial. Changes which you have also rejected out of hand.

>>> bad debt calculation
>>
>> Seems to be related to:
>>
>> 1) Assuming that all projects have CUDA work and asking for it
>> 2) Assuming that a CUDA-only project has CPU work and asking for it.
>> 3) Not necessarily taking into account system width correctly
>
> I don't understand what you mean by system width.

More modern systems are faster, and they are also "wider," with more processing units. My i7 has 12 processing elements: 8 virtual CPUs and 4 GPU engines. I am actively considering a system with 16 CPUs and room for as many as 6 or 8 GPU cores, which could bring that number up to 24 elements. Because I have been struggling to get across that this changes the way work can be processed, I have been using this term a lot. Which tells me yet again that you have not actually been reading carefully what I have been writing. I know it is a PITA to read things carefully, but I am not wordy out of spite; I am wordy to be as clear as possible. Skimming proposals looking only for reasons to reject them is not actually that helpful.

>> 4) Not taking into account CUDA capability correctly
>>
>>> efficiency of the scheduling calculations (if it's an issue)
>>
>> It is, but you and other nay-sayers don't have systems that experience
>> the issues, so you and others denigrate or ignore the reports.
>
> Fix the algorithm FIRST, optimize SECOND.

Reducing the hit rate is not intended to optimize anything. Sadly, this is a point that I know I will never be able to prove to your satisfaction, and it is apparent that I cannot explain it well, though I have tried very hard to do so. But even with a perfect rule set, the system will retain its characteristic instability if we keep calling the scheduler at times when there is no specific need. I understand why some of those calls are made, but the way we proceed from there is the secondary cause.

And when I suggest that there may not be a specific need, you produce examples, time and again, where work is downloaded and you insist that the world is magically better if the schedule is checked instantaneously rather than 30 seconds later, once we can see how it is actually affected by the new work. With no evidence, I might add. Even your defunct project with 5-minute deadlines would only be affected if its tasks took 4 minutes and 59 seconds ... which means they would blow their deadlines anyway because of the latency of uploads and downloads. If the task were a reasonable 1 minute in length, the only effect of waiting 30 seconds to schedule it would be to trim the margin slightly. But the more cogent point is that you are offering a straw-man argument built on a project that essentially collapsed because it had unreasonable requirements. So why are we coding BOINC to handle unreasonable requirements from a project that no longer exists? That is a poser I cannot fathom.

The fact that reducing the call rate has the side effect of increasing efficiency is nice. But it is not the reason I have proposed it, and I wish you would stop pretending that it is.
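And to be clear about what "waiting 30 seconds" means in practice, something like the sketch below (the names are mine, not a patch against the client): note that a reschedule is wanted, hold off for a short window, and run the scheduler once for however many events piled up in that window.

    #include <cstdio>

    // Hypothetical debounce: instead of running the scheduler on every event,
    // record that a pass is wanted and run it at most once per hold-off window.
    struct RescheduleGate {
        double hold_off;    // seconds to wait after the first request
        double due_at;      // when the pending pass should run (-1 = none pending)
        int coalesced;      // how many requests this pass has absorbed

        void request(double now) {
            if (due_at < 0) due_at = now + hold_off;  // first request opens the window
            ++coalesced;                              // later requests just pile on
        }
        void poll(double now) {                       // called from the main loop
            if (due_at >= 0 && now >= due_at) {
                std::printf("run scheduler once for %d request(s)\n", coalesced);
                due_at = -1;
                coalesced = 0;
            }
        }
    };

    int main() {
        RescheduleGate gate{30.0, -1.0, 0};  // 30 second hold-off, as in the example
        // A completion, a finished upload, and a download arriving within a few
        // seconds become one scheduler pass instead of three.
        gate.request(0.0);
        gate.request(2.0);
        gate.request(5.0);
        for (double t = 0.0; t <= 60.0; t += 10.0) gate.poll(t);
        return 0;
    }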
In either case, the two main reasons to reduce the call rate are:

a) to lower the log clutter
b) to reduce the rate of false changes so that they are easier to identify

Your intransigence on this matter is nothing short of amazing. You complain about the large logs that obscure the very problems we are hunting, and yet you denigrate the one way we can start to get a handle on that very issue.

>> The worse point is that identifying some of the problems requires
>> logging, and because we do resource scheduling, for example, so often,
>> the logs get so big that they are not usable, all because we are
>> performing actions that ARE NOT NECESSARY ... because the assumption is
>> that there is no cost. But here is a cost right here. If we do resource
>> scheduling 10 times more often than needed, then there is 10 times more
>> data to sift. Which is the main reason I have harped on SLOWING THIS
>> DOWN.
>>
>> It is also why in my pseudo-code proposal I suggested that we do two
>> things: one, make it switchable so that we can start with a bare-bones
>> "bank teller" style queuing system and only add refinements as we see
>> where it does not work adequately. Let us not add more rules than
>> needed. Start with the simplest rule set possible, run it, find
>> exceptions, figure out why, fix those, move on ...
>
> In other words step back 5 years. We were there, and we had to add
> refinements to get it to work.

See, that is how we fixed it then; why are you so resistant to this approach now? Back then the most common system was single core, with some duals. And, as I point out, that was when I started to notice these issues on my 4-core system. Those issues were not handled back then and they are worse now ... So let's try a new mechanism for the wide systems, with as few rules as possible, and see if it works. If we can create situations where it starts to fail, well, then we add complexity. I suspect that many of the rules we have now will not be needed at all. In fact, I think that much of the complexity can go away, because now we can make choices that are not at all possible on single-thread machines.

> Let us not throw the baby out with the bath water.

If the baby is dead, why not? The problem is fundamentally that we developed elaborate rules to handle scheduling on single-thread machines. Duals made some of those rules passé, but the effects were almost unnoticeable. The effects started to become visible on 4-core systems and are now quite obvious on wider systems. This is one reason why, in my pseudo-code, I suggested that at least for the time being we keep the current scheduler for systems with fewer than 4 cores and try something new on the 4-core and wider systems, along the lines of the sketch below.
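To be concrete about the "switchable" part, this is the sort of gate I have in mind; a sketch with made-up names, not a patch. Narrow hosts keep the current rule set, and only the wide hosts are routed through the simpler path until we find a case it cannot handle.

    #include <cstdio>

    // Hypothetical placeholders for the two scheduling paths.
    static void run_current_scheduler()     { std::printf("current rule set\n"); }
    static void run_bank_teller_scheduler() { std::printf("bank-teller rule set\n"); }

    // Proposed switch: narrow hosts keep today's behavior; wide hosts (4 or
    // more processing elements) try the simpler queue until it is shown to
    // need more rules.
    static void schedule(int processing_elements, bool enable_wide_path) {
        if (enable_wide_path && processing_elements >= 4) {
            run_bank_teller_scheduler();
        } else {
            run_current_scheduler();
        }
    }

    int main() {
        schedule(2, true);    // dual core: unchanged
        schedule(12, true);   // i7 with 8 virtual CPUs + 4 GPU engines: new path
        schedule(12, false);  // switch off: everything falls back to the current code
        return 0;
    }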
