The DCF used for CPU scheduling should be (but isn't) based on the highest
DCF calculated for that application version to date.  That holds even if
the task never finishes, since some other task could behave the same way.
I agree that a "provisional DCF" for CPU scheduling would be a good idea.
It would mean calling a function get_DCF that looks at all tasks for the
application version that is running.  NOTE that there should be fallback
positions to look at the application, and then the project, if there is
insufficient data to determine a DCF for the version.
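
A minimal sketch of such a get_DCF lookup, in Python rather than the
client's own C++, might look like the following. The Task record, the
MIN_SAMPLES threshold and the default of 1.0 are all assumptions made for
illustration; none of them are taken from the BOINC source.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_SAMPLES = 5  # assumed threshold for "sufficient data"; not a BOINC constant


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    project: str
    app: str
    version: str
    estimated_runtime: float  # seconds, the uncorrected estimate
    elapsed: float            # seconds run so far (or in total, if finished)

    def observed_dcf(self) -> float:
        # Ratio of actual to estimated runtime for this one task.
        return self.elapsed / self.estimated_runtime


def highest_dcf(tasks: List[Task], min_samples: int = MIN_SAMPLES) -> Optional[float]:
    """Highest observed DCF in the group, or None if there is too little data."""
    if len(tasks) < min_samples:
        return None
    return max(t.observed_dcf() for t in tasks)


def get_dcf(tasks: List[Task], project: str, app: str, version: str,
            default: float = 1.0) -> float:
    """Try the application version first, then the application, then the project."""
    by_version = [t for t in tasks
                  if (t.project, t.app, t.version) == (project, app, version)]
    by_app = [t for t in tasks if (t.project, t.app) == (project, app)]
    by_project = [t for t in tasks if t.project == project]
    for scope in (by_version, by_app, by_project):
        dcf = highest_dcf(scope)
        if dcf is not None:
            return dcf
    return default  # nothing usable at any level
```

Running and unfinished tasks are included on purpose: their elapsed time
is only a lower bound on the final runtime, so they can raise the result
but never lower it.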

jm7


From:    Richard Haselgrove <[email protected]>
To:      BOINC Developers Mailing List <[email protected]>
Date:    11/07/2011 03:11 PM
Subject: Re: [boinc_dev] APR, DCF and non-deterministic projects
Sent by: <[email protected]>

Part 3: There's a particular problem with the way DCF is calculated by
BOINC only on completion of a task.

All of these screenshots show the same group of NumberFields tasks.

I first noticed that my longest-running task was unusual some 63 hours into
the run:

http://img189.imageshack.us/img189/1185/numberfieldsruntimeesti.png

45 hours later, elapsed time has only moved on by 20 hours, and BOINC has
felt it safe to pre-empt that single task. This screenshot was taken
mid-afternoon 5/11/2011, so maybe 52 hours before deadline. Note that the
estimate for the unstarted tasks has decreased - BOINC must have run a task
on another core in the meantime, and found it was a short-running one.

http://img263.imageshack.us/img263/9779/numberfieldspreempted3.png

About an hour later, the long task completed. Suddenly (but not until now)
the unstarted tasks are in deadline trouble, with an 84 hour estimate
(already reduced - luckily - by four rapid exits) and only 60 hours
remaining to deadline.

http://img823.imageshack.us/img823/3107/numberfieldscompleted.png

I think I've suggested before that BOINC should start increasing its
estimate of DCF as soon as a running task passes the runtime predicted by
the current DCF value. The objection which is raised is that the task may
be faulty (in an infinite loop, or something) and may never complete with a
'success' outcome to generate a genuine recalculated DCF. But in this case,
it did....

BOINC could maintain and apply a 'provisional' DCF for the project while a
long-running task is still active - that would have brought the additional
queued tasks forward in high priority sooner than occurred in this case.
When the individual long task completes, the 'provisional' DCF would be
discarded, and the 'permanent' DCF would be updated as normal, depending on
the task outcome (full DCF correction if success outcome, no change if
error outcome).
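
The 'provisional' DCF idea above could be sketched roughly as follows, in
Python for brevity. The function names and parameters are hypothetical,
and the handling of a successful finish is a simplified stand-in for the
client's normal update (it only raises the value and leaves out the slow
downward adjustment).

```python
def provisional_dcf(permanent_dcf: float, raw_estimate: float, elapsed: float) -> float:
    """DCF to apply while a task is still running.

    raw_estimate is the uncorrected runtime estimate in seconds;
    elapsed is how long the task has run so far.
    """
    predicted = raw_estimate * permanent_dcf
    if elapsed <= predicted:
        # Task is still within the current prediction: nothing to correct.
        return permanent_dcf
    # Task has overrun: it will take at least `elapsed`, so use that ratio.
    return elapsed / raw_estimate


def on_task_finished(permanent_dcf: float, raw_estimate: float,
                     elapsed: float, success: bool) -> float:
    """Permanent DCF after the task ends; the provisional value is discarded."""
    if not success:
        return permanent_dcf  # error outcome: no change
    # Success outcome: simplified correction (assumption), upward only.
    return max(permanent_dcf, elapsed / raw_estimate)
```

With a one-hour raw estimate and a permanent DCF of 2.0, a task that has
already run for ten hours would give a provisional DCF of 10.0, which is
what would push the queued tasks into high priority earlier.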


----- Original Message -----
From: <[email protected]>
To: "Richard Haselgrove" <[email protected]>
Cc: "BOINC Developers Mailing List" <[email protected]>;
<[email protected]>
Sent: Monday, November 07, 2011 4:36 PM
Subject: Re: [boinc_dev] APR, DCF and non-deterministic projects


> Actually, it isn't hard to get close on average at all.  It does involve
> re-writing some code though.
>
> Currently DCF is based on "recent results" and falls slowly and increases
> quickly.
>
> A better method of determining the DCF would be to base it off of the
> Mean and Standard Deviation of the tasks for a particular application
> (or better, application version).
>
> Work fetch would request work based on the mean project DCF for a
> particular resource.  This will do better than the current method that
> always requests too little.
> CPU scheduler would assume that any particular task may be the worst
> case, and therefore use Mean + 3 * standard deviation for the expected
> runtime.  This assumes a curve similar to a bell curve, which is not the
> case for LHC, but LHC would not miss deadlines because, in my experience,
> the mean is closer to the maximum value than expected in a bell curve.
>
> jm7
>
>
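
The mean and standard deviation proposal quoted above can be illustrated
with a short Python sketch. The function names are hypothetical, and the
use of the population standard deviation is an assumption; the message
does not say which form was intended.

```python
from statistics import mean, pstdev
from typing import Sequence


def work_fetch_runtime(runtimes: Sequence[float]) -> float:
    """Expected runtime for work fetch: the plain mean of past runtimes."""
    return mean(runtimes)


def scheduler_runtime(runtimes: Sequence[float], k: float = 3.0) -> float:
    """Pessimistic runtime for CPU scheduling: mean + k standard deviations."""
    return mean(runtimes) + k * pstdev(runtimes)
```

For past runtimes of 2, 4, 4, 4, 5, 5, 7 and 9 hours the mean is 5 and the
standard deviation is 2, so work fetch would plan on 5 hours per task
while the CPU scheduler would allow for 11. Both functions raise
statistics.StatisticsError on an empty history, so a caller would need a
fallback for that case.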
> From:    Richard Haselgrove <[email protected]>
> To:      BOINC Developers Mailing List <[email protected]>
> Date:    11/07/2011 11:22 AM
> Subject: [boinc_dev] APR, DCF and non-deterministic projects
> Sent by: <[email protected]>
>
> The important part of the subject line is "non-deterministic projects" -
> by that, I mean projects where task runtimes can't be predicted in
> advance. A well-known case in point is the original LHC@home, now renamed
> lhcathomeclassic. The problem is clearly stated on their current 'about'
> page:
>
> "Typically SixTrack simulates 60 particles at a time as they travel
> around the ring, and runs the simulation for 100000 loops (or sometimes
> 1 million loops) around the ring. That may sound like a lot, but it is
> less than 10s in the real world. Still, it is enough to test whether the
> beam is going to remain on a stable orbit for a much longer time, or
> risks losing control and flying off course into the walls of the vacuum
> tube. Such a beam instability would be a very serious problem that could
> result in the machine being stopped for repairs if it happened in real
> life."
>
> Sixtrack exits when the simulated beam hits the simulated tunnel wall.
> Within the last week, I've seen runtimes ranging from 4 seconds to over
> 10 hours. It's hard to see how the runtime could be known in advance,
> without already knowing the answer to the problem they're trying to
> research.
>
> A mathematical project which also exhibits non-deterministic runtime is
> NumberFields@home (http://numberfields.asu.edu/NumberFields/index.php).
> NumberFields is now running the very latest server code (software
> version: 24527, according to their status page), and I've been running it
> on the latest available v6.13.10 client. We get a clear view of how well
> CreditNew and the server-based runtime estimation process work for
> non-deterministic projects.
>
> This report will come in several parts, but here's the first problem:
>
> 05-Nov-2011 09:51:49 [NumberFields@home] Requesting new tasks for CPU
> 05-Nov-2011 09:51:49 [NumberFields@home] [sched_op] CPU work request:
> 365384.35 seconds; 0.00 CPUs
> 05-Nov-2011 09:51:49 [NumberFields@home] [sched_op] NVIDIA work request:
> 0.00 seconds; 0.00 CPUs
> 05-Nov-2011 09:51:52 [NumberFields@home] Scheduler request completed: got
> 43 new tasks
> 05-Nov-2011 09:51:52 [NumberFields@home] [sched_op] Server version 613
> 05-Nov-2011 09:51:52 [NumberFields@home] Project requested delay of 21
> seconds
> 05-Nov-2011 09:51:52 [NumberFields@home] [sched_op] estimated total CPU
> task duration: 2773276 seconds
>
> That's a request for three hundred thousand seconds, and an estimated
> allocation of over two million. It's not a fluke:
>
> 06-Nov-2011 17:34:25 [NumberFields@home] [sched_op] CPU work request:
> 318987.60 seconds; 0.00 CPUs
> 06-Nov-2011 17:34:25 [NumberFields@home] [sched_op] NVIDIA work request:
> 0.00 seconds; 0.00 CPUs
> 06-Nov-2011 17:34:28 [NumberFields@home] Scheduler request completed: got
> 85 new tasks
> 06-Nov-2011 17:34:28 [NumberFields@home] [sched_op] Server version 613
> 06-Nov-2011 17:34:28 [NumberFields@home] Project requested delay of 21
> seconds
> 06-Nov-2011 17:34:28 [NumberFields@home] [sched_op] estimated total CPU
> task duration: 1036408 seconds
>
> BOINC v6.13/v7 is deliberately designed to make these large work requests
> because of the max/min hysteresis fetch policy.
>
> The problem here is that non-deterministic runtimes can't be tracked in
> real time by the server APR averaging, and client DCF goes into overdrive
> to try and compensate. I've attached a full log of the DCF changes since
> this host was attached to NumberFields. At the times of the two logs
> above, client DCF was 7.770888 and 4.360754: according to
> http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation, "DCF is no longer
> used", and indeed http://boinc.berkeley.edu/trac/changeset/21153/boinc
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.



