Re: [boinc_dev] updates for trickle_deadline.cpp

McLeod, John Mon, 25 Nov 2013 06:47:38 -0800

CPDN does something similar.  They send a trickle up message every couple of % 
toward completion.  If the deadline is past, and there has not been a trickle 
up within a period of time (2 weeks?  a month?), then the task is marked as 
abandoned, and replaced by sending it out to another computer.  The client is 
not updated with a new deadline, but the user base has been educated to know 
that the project has soft deadlines, and not to worry too much about a slightly 
missed deadline.

You could also use the setting that makes the client contact the server at 
least once every X days.  If there is work on the client, the client will 
contact the server at least this frequently.  Since this is a project setting, 
the application would not have to change.

Having the reply for a trickle up message be able to set a new deadline would 
be nice.

From: Michael Goetz [mailto:[email protected]]
Sent: Monday, November 25, 2013 9:29 AM
To: McLeod, John
Cc: David Anderson; Christian Beer; BOINC Developers Mailing List
Subject: Re: [boinc_dev] updates for trickle_deadline.cpp

David,

tl;dr version:  I have what may be a better idea; skip to the bottom...

At PrimeGrid, we have more than passing interest in a "deadline extension" 
mechanism since we have really long tasks that could theoretically be run on 
anything from 32-bit CPUs to the latest GPUs.  We don't currently allow the 
longest tasks to run on CPUs because that would mean extending the deadlines to 
many months, and that would drastically slow down the progress of the overall 
projects.

Being able to extend deadlines would allow us to set the deadlines much 
shorter, which in turn would make for a quicker turn around for validation.

It should be noted that, by a large margin, the #1 cause of missed deadlines is 
NOT slow machines.  It's primarily caused by users simply abandoning PrimeGrid 
(i.e., detaching) or abandoning BOINC altogether.  Therefore, your suggestion 
about enumerating the jobs wouldn't help because in most cases the host simply 
never communicates with PrimeGrid again.

When this topic surfaced a few months ago, I did some research into how we 
could utilize such a function at PrimeGrid.  In the end it's simply not useful 
unless the client can recognize that the deadline has changed.  But if that 
could happen via a trickle-down message, we'd use it like this:

1) Start jobs with fairly short time limits that are reasonable for fast 24/7 
computers.

2) The app (either the native app or the wrapper) would look at the deadline 
and the expected run time and if it's not going to finish with at least 24 
hours to spare it would request a deadline extension to expected_finish_time + 
48 hours.  For this calculation the app would assume it's crunching 24/7 and 
would use application-specific logic to compute run-time.  At PrimeGrid we can 
predict the run time far more accurately than BOINC can measure and 
extrapolate, but this might not work for other projects.  If a host is not 
computing 24/7, the requested deadline extension will be too small, but that 
just means the deadline will get extended every day (or every other day), and 
that's fine.

3) The app will send a trickle-up message every day.  By detecting that a 
trickle hasn't been received in several days, the server could decide the task 
is abandoned long before the deadline and send a new result out to another 
host.  This could result in extraneous results being sent out if a host is 
offline for several days, but it could also result in much faster cancellations 
of abandoned tasks.

4) On the server side, when we get a trickle-up message requesting a deadline 
extension, we can decide whether or not to extend the deadline, and convey that 
back to the client by trickle-down.
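
A minimal sketch of the check in step 2, written in Python for brevity.  The 
function name, its parameters, and the use of timestamps in seconds are 
illustrative assumptions, not part of the BOINC API; the 24-hour margin and 
48-hour padding are the figures from step 2.

```python
from typing import Optional

HOUR = 3600  # seconds


def requested_deadline(now: float, deadline: float, remaining_runtime: float,
                       margin: float = 24 * HOUR,
                       padding: float = 48 * HOUR) -> Optional[float]:
    """Return the new deadline to request, or None if no extension is needed.

    Hypothetical helper: assumes the host crunches 24/7, so the expected
    finish time is simply now + remaining_runtime (all values in seconds).
    """
    expected_finish = now + remaining_runtime
    if expected_finish + margin <= deadline:
        return None  # finishes with at least `margin` to spare
    return expected_finish + padding
```

If the host is not actually running 24/7, the result is too small, and the 
same check simply asks again on a later day.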

Now for the better idea:

In theory, we could employ ONLY step 3 and use a REALLY long deadline plus this 
mechanism to allow slow computers while still avoiding huge delays simply by 
using trickle ups to report status without needing to extend deadlines.  It's a 
simple server change:  if you haven't received a trickle up message showing 
progress on the task in N days, mark it as expired in the database and a new 
task gets sent out.  That effectively makes the deadline (as it pertains to 
sending out a replacement task) fairly short, whereas the deadline that affects 
how long a host has to finish could be very long.  For me, at least, this seems 
to have the same end result as building a deadline-extension mechanism, but is 
much, much simpler.
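
A rough sketch of that server-side check, in Python for brevity.  It assumes 
the server records the time of the last trickle-up for each result; the Task 
fields and the function name are hypothetical and do not reflect the actual 
BOINC database schema.

```python
from dataclasses import dataclass
from typing import List

DAY = 86400  # seconds


@dataclass
class Task:
    result_id: int
    last_trickle: float     # time of the last trickle-up (or send time if none yet)
    report_deadline: float  # the long deadline the host has to finish


def tasks_to_replace(tasks: List[Task], now: float,
                     max_silence_days: float) -> List[int]:
    """Return ids of tasks that should be expired and sent out again.

    A task qualifies if no trickle-up has arrived for more than
    max_silence_days, or if its (long) report deadline has passed.
    """
    cutoff = now - max_silence_days * DAY
    return [t.result_id for t in tasks
            if t.last_trickle < cutoff or t.report_deadline < now]
```

Here max_silence_days plays the role of N: it is the short deadline that 
governs replacement, while report_deadline can stay very long.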

The only drawback of the simplified approach is that users who use app_info and 
don't update to the new app that sends status trickles will "time-out" 
prematurely and cause the server to send out unneeded tasks.

Mike

On Mon, Nov 25, 2013 at 8:43 AM, McLeod, John <[email protected]> wrote:
If the user has paused a job, they should probably not get it replaced.  If it 
is past deadline, and is still paused, then we might want to abort it.  If it 
is paused and is in deadline trouble, then we might want to warn the user of 
the problem.

-----Original Message-----
From: boinc_dev [mailto:[email protected]] On Behalf Of David Anderson
Sent: Friday, November 22, 2013 2:33 PM
To: Christian Beer; BOINC Developers Mailing List
Subject: Re: [boinc_dev] updates for trickle_deadline.cpp

Christian:
Each scheduler RPC request includes a list of jobs on the client.
How about if we add the following optional scheduler feature:
enumerate the jobs assigned to the host,
and if any of them is not listed in the request,
assume it's been lost and create a new instance.

This doesn't handle the case where the user paused a job and forgot about it.
Does this case matter?
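
The proposed check amounts to a set difference.  A minimal sketch in Python, 
where the function name and the use of plain job-name strings are illustrative 
assumptions rather than the scheduler's actual data structures:

```python
from typing import Iterable, List


def lost_jobs(assigned: Iterable[str], reported: Iterable[str]) -> List[str]:
    """Jobs the server has assigned to a host that the client did not list.

    assigned: job names the server believes are in progress on the host.
    reported: job names listed in the scheduler RPC request.
    """
    return sorted(set(assigned) - set(reported))
```

A paused job would presumably still be listed in the request, so it is never 
flagged; that is the gap raised in the question above.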

-- David

On 22-Nov-2013 11:13 AM, Christian Beer wrote:
> Not when the task is lost because the user formatted the hard drive or
> paused the task and forgot about it. In those cases, where the user
> doesn't cancel the task but it is not processed either, we would
> generate a new task very late. This is not a desired behavior.
> We could use the trickle up logic to abort the task server side if we
> don't receive a trickle within 14 days but then we have to use a new
> table or other structure to store the last trickle contact.
>
> Am 22.11.2013 20:02, schrieb David Anderson:
>> Wouldn't this be equivalent to having an extremely long deadline to
>> begin with?
>>
>> On 22-Nov-2013 4:50 AM, Christian Beer wrote:
>>> Hi David,
>>>
>>> maybe something else is possible. What if the server can mark the
>>> deadline of the task as "non-compulsory" so the client won't go into
>>> high priority mode to meet the deadline. This would of course only be
>>> suitable for projects that either increase the deadline using trickles
>>> or don't care about the deadline at all.
>>>
>>> Regards
>>> Christian
>>>
>>> Am 12.11.2013 06:00, schrieb David Anderson:
>>>> Christian:
>>>> Unfortunately, with the current architecture there's no easy way to
>>>> communicate
>>>> to the client that the deadline has changed.
>>>> -- David
>>>> On 11-Nov-2013 2:05 PM, Christian Beer wrote:
>>>>> Some users reported that for our long running jobs the client switches
>>>>> to High priority mode for RNA World and will not switch to other
>>>>> projects as usual.
>>>>>
>>>>> I currently have a task on my desktop with an estimation of 340 hours
>>>>> with a 20 day deadline (that I can not meet with an uptime of 6h per
>>>>> day). I don't want to increase the deadline for those long runners
>>>>> because then we have to wait 2 months until a new task is created
>>>>> because the first task vanished on the host. Sure this is the worst
>>>>> case scenario but we are more flexible with a shorter deadline.
>>>>>
>>>>> My fear is that users are aborting our tasks because they think they
>>>>> missed the deadline or can't even meet the deadline. I see a lot of
>>>>> EXIT_ABORTED_VIA_GUI with our new VM application. This may only be
>>>>> fixed with an increased deadline but the problem of an underestimated
>>>>> runtime can still occur and if the task is still running on the client
>>>>> we want to know on the server. And the client should also know that
>>>>> there is more time available to finish the task and there is no hurry.
>>>>>
>>>>> Regards
>>>>> Christian
>>>>>
>>>>> Am 11.11.2013 22:28, schrieb David Anderson:
>>>>>> Thanks; I committed these.
>>>>>>
>>>>>> Currently the deadline isn't changed on the client.
>>>>>> I'm not sure this really matters; what do you think?
>>>>>>
>>>>>> -- David
>>>>>>
>>>>>> On 11-Nov-2013 11:28 AM, Christian Beer wrote:
>>>>>>> Hi David,
>>>>>>>
>>>>>>> now that Trickles are working again I updated the trickle_deadline
>>>>>>> handler. I changed the output to the BOINC format like in
>>>>>>> scheduler.log
>>>>>>> and added a hostid check to the result lookup for more security. Now
>>>>>>> every host can only extend its own results and not others.
>>>>>>>
>>>>>>> The code is tested on RNA World.
>>>>>>>
>>>>>>> Regards
>>>>>>> Christian
>>>>>
>>>>
>>>>
>>>
>
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.