Re: [boinc_dev] updates for trickle_deadline.cpp

Michael Goetz Mon, 25 Nov 2013 06:29:15 -0800

David,

tl;dr version:  I have what may be a better idea; skip to the bottom...

At PrimeGrid, we have more than a passing interest in a "deadline
extension" mechanism, since we have really long tasks that could
theoretically be run on anything from 32-bit CPUs to the latest GPUs.  We
don't currently allow the longest tasks to run on CPUs because that would
mean extending the deadlines to many months, and that would drastically
slow down the progress of the overall projects.

Being able to extend deadlines would allow us to set the deadlines much
shorter, which in turn would make for a quicker turnaround for validation.

It should be noted that, by a large margin, the #1 cause of missed
deadlines is NOT slow machines.  It's primarily caused by users simply
abandoning PrimeGrid (i.e., detaching) or abandoning BOINC altogether.
Therefore, your suggestion about enumerating the jobs wouldn't help
because in most cases the host simply never communicates with PrimeGrid
again.

When this topic surfaced a few months ago, I did some research into how we
could utilize such a function at PrimeGrid.  In the end it's simply not
useful unless the client can recognize that the deadline has changed.  But
if that could happen via a trickle-down message, we'd use it like this:

1) Start jobs with fairly short time limits that are reasonable for fast
24/7 computers.

2) The app (either the native app or the wrapper) would look at the
deadline and the expected run time and if it's not going to finish with at
least 24 hours to spare it would request a deadline extension to
expected_finish_time + 48 hours.  For this calculation the app would assume
it's crunching 24/7 and would use application-specific logic to compute
run-time.  At PrimeGrid we can predict the run time far more accurately
than BOINC can measure and extrapolate, but this might not work for other
projects.  If a host is not computing 24/7, the requested deadline
extension will be too small, but that just means the deadline will get
extended every day (or every other day), and that's fine.

3) The app would send a trickle-up message every day.  By detecting that
a trickle hasn't been received in several days, the server could decide the
task is abandoned long before the deadline and send a new result out to
another host.  This could result in extraneous results being sent out if a
host is offline for several days, but it could also result in much faster
cancellations of abandoned tasks.

4) On the server side, when we get a trickle-up message requesting a
deadline extension, we can decide whether or not to extend the deadline,
and convey that back to the client by trickle-down.
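
As a rough illustration of the check in step 2, here is a minimal Python
sketch.  It is not BOINC or PrimeGrid code: the function name, the constant
names, and the idea of passing in the remaining run time directly are all
assumptions made for this example.  Only the 24-hour and 48-hour figures
come from the description above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Both figures come from step 2; the names are made up for this sketch.
MIN_SPARE = timedelta(hours=24)        # must finish with at least this much to spare
EXTENSION_SLACK = timedelta(hours=48)  # requested deadline = expected finish + this


def requested_deadline(now: datetime, deadline: datetime,
                       remaining_runtime: timedelta) -> Optional[datetime]:
    """Return the deadline the app would ask for, or None if none is needed.

    Assumes the host crunches 24/7, so the expected finish time is simply
    now + remaining_runtime (the app-specific run-time estimate).
    """
    expected_finish = now + remaining_runtime
    if deadline - expected_finish >= MIN_SPARE:
        return None  # enough spare time, no extension request
    return expected_finish + EXTENSION_SLACK
```

A host that is not running 24/7 would simply call this again the next day
and ask for another extension, as described in step 2.  Whether the server
grants the request (step 4) is outside this sketch.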

Now for the better idea:

In theory, we could employ ONLY step 3: use a REALLY long deadline and
rely on daily trickle-ups to report status.  That allows slow computers
while still avoiding huge delays, without needing to extend deadlines.
It's a simple server change:  if you haven't received a trickle-up message
showing progress on the task in N days, mark it as expired in the database
and send out a new task.  That effectively makes the
deadline (as it pertains to sending out a replacement task) fairly short,
whereas the deadline that affects how long a host has to finish could be
very long.  For me, at least, this seems to have the same end result as
building a deadline-extension mechanism, but is much, much simpler.
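
The server-side check could look something like the following Python
sketch.  This is only an illustration under assumptions, not the BOINC
schema or API: the InProgressTask record, its field names, and the choice
to fall back to the send time when no trickle has arrived yet are all
hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class InProgressTask:
    """Hypothetical record of a task that is out on a host."""
    result_id: int
    sent_time: datetime
    last_trickle: Optional[datetime]  # None if no trickle-up has arrived yet


def abandoned_result_ids(tasks: List[InProgressTask], now: datetime,
                         max_silence: timedelta) -> List[int]:
    """Return the ids of tasks with no sign of progress for max_silence.

    Assumption: a task that has never sent a trickle is measured from the
    time it was sent out.
    """
    abandoned = []
    for task in tasks:
        last_contact = task.last_trickle or task.sent_time
        if now - last_contact > max_silence:
            abandoned.append(task.result_id)
    return abandoned
```

The server would mark the returned results as expired and let a
replacement go out, while the real deadline on the host stays long.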

The only drawback of the simplified approach is that users who use app_info
and don't update to the new app that sends status trickles will "time-out"
prematurely and cause the server to send out unneeded tasks.

Mike

On Mon, Nov 25, 2013 at 8:43 AM, McLeod, John <[email protected]> wrote:

> If the user has paused a job, they should probably not get it replaced.
> If it is past deadline, and is still paused, then we might want to abort
> it.  If it is paused and is in deadline trouble, then we might want to warn
> the user of the problem.
>
> -----Original Message-----
> From: boinc_dev [mailto:[email protected]] On Behalf Of
> David Anderson
> Sent: Friday, November 22, 2013 2:33 PM
> To: Christian Beer; BOINC Developers Mailing List
> Subject: Re: [boinc_dev] updates for trickle_deadline.cpp
>
> Christian:
> Each scheduler RPC request includes a list of jobs on the client.
> How about if we add the following optional scheduler feature:
> enumerate the jobs assigned to the host,
> and if any of them is not listed in the request,
> assume it's been lost and create a new instance.
>
> This doesn't handle the case where the user paused a job and forgot about
> it.
> Does this case matter?
>
> -- David
>
> On 22-Nov-2013 11:13 AM, Christian Beer wrote:
> > Not when the task is lost because the user formatted the hard drive or
> > paused the task and forgot about it. In those cases, where the user
> > doesn't cancel the task but it is not processed either, we would
> > generate a new task very late. This is not a desired behavior.
> > We could use the trickle-up logic to abort the task server side if we
> > don't receive a trickle within 14 days, but then we have to use a new
> > table or other structure to store the last trickle contact.
> >
> > On 22.11.2013 20:02, David Anderson wrote:
> >> Wouldn't this be equivalent to having an extremely long deadline to
> >> begin with?
> >>
> >> On 22-Nov-2013 4:50 AM, Christian Beer wrote:
> >>> Hi David,
> >>>
> >>> maybe something else is possible. What if the server can mark the
> >>> deadline of the task as "non compulsive" so the client won't go into
> >>> high priority mode to keep the deadline. This would of course only be
> >>> suitable for projects that either increase the deadline using trickles
> >>> or don't care about the deadline at all.
> >>>
> >>> Regards
> >>> Christian
> >>>
> >>> On 12.11.2013 06:00, David Anderson wrote:
> >>>> Christian:
> >>>> Unfortunately, with the current architecture there's no easy way to
> >>>> communicate to the client that the deadline has changed.
> >>>> -- David
> >>>> On 11-Nov-2013 2:05 PM, Christian Beer wrote:
> >>>>> Some users reported that for our long-running jobs the client
> >>>>> switches to High priority mode for RNA World and will not switch
> >>>>> to other projects as usual.
> >>>>>
> >>>>> I currently have a task on my desktop with an estimation of 340 hours
> >>>>> with a 20 day deadline (that I can not meet with an uptime of 6h per
> >>>>> day). I don't want to increase the deadline for those long runners
> >>>>> because then we have to wait 2 months until a new task is created
> >>>>> because the first task vanished on the host. Sure this is the worst
> >>>>> case scenario but we are more flexible with a shorter deadline.
> >>>>>
> >>>>> My fear is that users are aborting our tasks because they think they
> >>>>> missed the deadline or can't even meet the deadline. I see a lot of
> >>>>> EXIT_ABORTED_VIA_GUI with our new VM application. This may only be
> >>>>> fixed with an increased deadline, but the problem of an
> >>>>> underestimated runtime can still occur, and if the task is still
> >>>>> running on the client we want to know on the server. And the client
> >>>>> should also know that there is more time available to finish the
> >>>>> task and there is no hurry.
> >>>>>
> >>>>> Regards
> >>>>> Christian
> >>>>>
> >>>>> On 11.11.2013 22:28, David Anderson wrote:
> >>>>>> Thanks; I committed these.
> >>>>>>
> >>>>>> Currently the deadline isn't changed on the client.
> >>>>>> I'm not sure this really matters; what do you think?
> >>>>>>
> >>>>>> -- David
> >>>>>>
> >>>>>> On 11-Nov-2013 11:28 AM, Christian Beer wrote:
> >>>>>>> Hi David,
> >>>>>>>
> >>>>>>> now that trickles are working again I updated the trickle_deadline
> >>>>>>> handler. I changed the output to the BOINC format like in
> >>>>>>> scheduler.log and added a hostid check to the result lookup for
> >>>>>>> more security. Now every host can only extend its own results and
> >>>>>>> not others'.
> >>>>>>>
> >>>>>>> The code is tested on RNA World.
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> Christian
> >>>>>
> >>>>
> >>>>
> >>>
> >
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.