David, tl;dr version: I have what may be a better idea; skip to the bottom...
At PrimeGrid, we have more than a passing interest in a "deadline extension" mechanism, since we have really long tasks that could theoretically be run on anything from 32-bit CPUs to the latest GPUs. We don't currently allow the longest tasks to run on CPUs because that would mean extending the deadlines to many months, and that would drastically slow down the progress of the overall projects. Being able to extend deadlines would allow us to set the deadlines much shorter, which in turn would make for a quicker turnaround for validation.

It should be noted that, by a large margin, the #1 cause of missed deadlines is NOT slow machines. It's primarily caused by users simply abandoning PrimeGrid (i.e., detaching) or abandoning BOINC altogether. Therefore, your suggestion about enumerating the jobs wouldn't help, because in most cases the host simply never communicates with PrimeGrid again.

When this topic surfaced a few months ago, I did some research into how we could utilize such a function at PrimeGrid. In the end it's simply not useful unless the client can recognize that the deadline has changed. But if that could happen via a trickle-down message, we'd use it like this:

1) Start jobs with fairly short time limits that are reasonable for fast 24/7 computers.

2) The app (either the native app or the wrapper) would look at the deadline and the expected run time, and if it's not going to finish with at least 24 hours to spare, it would request a deadline extension to expected_finish_time + 48 hours. For this calculation the app would assume it's crunching 24/7 and would use application-specific logic to compute run time. At PrimeGrid we can predict the run time far more accurately than BOINC can measure and extrapolate, but this might not work for other projects. If a host is not computing 24/7, the requested deadline extension will be too small, but that just means the deadline will get extended every day (or every other day), and that's fine.

3) The app will send a trickle-up message every day. By detecting that a trickle hasn't been received in several days, the server could decide the task is abandoned long before the deadline and send a new result out to another host. This could result in extraneous results being sent out if a host is offline for several days, but it could also result in much faster cancellation of abandoned tasks.

4) On the server side, when we get a trickle-up message requesting a deadline extension, we can decide whether or not to extend the deadline, and convey that back to the client by trickle-down.

Now for the better idea:

In theory, we could employ ONLY step 3 and use REALLY long deadlines plus this mechanism to allow slow computers while still avoiding huge delays, simply by using trickle-ups to report status without needing to extend deadlines. It's a simple server change: if you haven't received a trickle-up message showing progress on the task in N days, mark it as expired in the database and a new task gets sent out. That effectively makes the deadline (as it pertains to sending out a replacement task) fairly short, whereas the deadline that affects how long a host has to finish could be very long.

For me, at least, this seems to have the same end result as building a deadline-extension mechanism, but is much, much simpler. The only drawback of the simplified approach is that users who use app_info and don't update to the new app that sends status trickles will "time out" prematurely and cause the server to send out unneeded tasks.

Mike

On Mon, Nov 25, 2013 at 8:43 AM, McLeod, John <[email protected]> wrote:
> If the user has paused a job, they should probably not get it replaced.
> If it is past deadline, and is still paused, then we might want to abort
> it. If it is paused and is in deadline trouble, then we might want to warn
> the user of the problem.
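
For what it's worth, here is a rough sketch of the two checks described in Mike's message above: the app-side test from step 2 and the server-side trickle timeout from step 3. It is only an illustration in Python, not BOINC code. The function names, the dict of last-trickle times, and the 3-day value standing in for N are all assumptions made up for the example; the 24-hour margin and the 48-hour pad are the figures from step 2.

```python
from datetime import datetime, timedelta

# Every name here is hypothetical; none of this is a BOINC API.
SPARE_MARGIN = timedelta(hours=24)   # step 2: finish with at least this much to spare
EXTENSION_PAD = timedelta(hours=48)  # step 2: ask for expected finish + this much
TRICKLE_TIMEOUT = timedelta(days=3)  # assumed value for "N days"; the message leaves N open


def extension_to_request(deadline, expected_finish):
    """App side: return the new deadline to request, or None if none is needed.

    expected_finish is the app's own estimate, assuming 24/7 crunching.
    """
    if expected_finish + SPARE_MARGIN <= deadline:
        return None  # already finishing with 24 hours or more to spare
    return expected_finish + EXTENSION_PAD


def abandoned_tasks(last_trickle, now, timeout=TRICKLE_TIMEOUT):
    """Server side: ids of tasks with no progress trickle for longer than timeout.

    last_trickle maps a task id to the time its last trickle-up arrived.
    The caller would mark these as expired so that replacements get sent out.
    """
    return sorted(tid for tid, seen in last_trickle.items() if now - seen > timeout)
```

With only the second function in use, the real deadline can stay very long while the effective replacement delay is whatever timeout the project picks.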
>
> -----Original Message-----
> From: boinc_dev [mailto:[email protected]] On Behalf Of
> David Anderson
> Sent: Friday, November 22, 2013 2:33 PM
> To: Christian Beer; BOINC Developers Mailing List
> Subject: Re: [boinc_dev] updates for trickle_deadline.cpp
>
> Christian:
> Each scheduler RPC request includes a list of jobs on the client.
> How about if we add the following optional scheduler feature:
> enumerate the jobs assigned to the host,
> and if any of them is not listed in the request,
> assume it's been lost and create a new instance.
>
> This doesn't handle the case where the user paused a job and forgot about
> it.
> Does this case matter?
>
> -- David
>
> On 22-Nov-2013 11:13 AM, Christian Beer wrote:
> > Not when the task is lost because the user formatted the hard drive or
> > paused the task and forgot about it. In those cases, where the user
> > doesn't cancel the task but it is not processed either, we would
> > generate a new task very late. This is not a desired behavior.
> > We could use the trickle-up logic to abort the task server side if we
> > don't receive a trickle within 14 days, but then we have to use a new
> > table or other structure to store the last trickle contact.
> >
> > On 22.11.2013 20:02, David Anderson wrote:
> >> Wouldn't this be equivalent to having an extremely long deadline to
> >> begin with?
> >>
> >> On 22-Nov-2013 4:50 AM, Christian Beer wrote:
> >>> Hi David,
> >>>
> >>> maybe something else is possible. What if the server can mark the
> >>> deadline of the task as "non compulsive" so the client won't go into
> >>> high priority mode to keep the deadline. This would of course only be
> >>> suitable for projects that either increase the deadline using trickles
> >>> or don't care about the deadline at all.
> >>>
> >>> Regards
> >>> Christian
> >>>
> >>> On 12.11.2013 06:00, David Anderson wrote:
> >>>> Christian:
> >>>> Unfortunately, with the current architecture there's no easy way to
> >>>> communicate to the client that the deadline has changed.
> >>>> -- David
> >>>> On 11-Nov-2013 2:05 PM, Christian Beer wrote:
> >>>>> Some users reported that for our long running jobs the client switches
> >>>>> to High priority mode for RNA World and will not switch to other
> >>>>> projects as usual.
> >>>>>
> >>>>> I currently have a task on my desktop with an estimation of 340 hours
> >>>>> with a 20 day deadline (that I can not meet with an uptime of 6h per
> >>>>> day). I don't want to increase the deadline for those long runners
> >>>>> because then we have to wait 2 months until a new task is created
> >>>>> because the first task vanished on the host. Sure this is the worst
> >>>>> case scenario but we are more flexible with a shorter deadline.
> >>>>>
> >>>>> My fear is that users are aborting our tasks because they think they
> >>>>> missed the deadline or can't even meet the deadline. I see a lot of
> >>>>> EXIT_ABORTED_VIA_GUI with our new VM application. This may only be
> >>>>> fixed with an increased deadline but the problem of an underestimated
> >>>>> runtime can still occur and if the task is still running on the client
> >>>>> we want to know on the server. And the client should also know that
> >>>>> there is more time available to finish the task and there is no hurry.
> >>>>>
> >>>>> Regards
> >>>>> Christian
> >>>>>
> >>>>> On 11.11.2013 22:28, David Anderson wrote:
> >>>>>> Thanks; I committed these.
> >>>>>>
> >>>>>> Currently the deadline isn't changed on the client.
> >>>>>> I'm not sure this really matters; what do you think?
> >>>>>>
> >>>>>> -- David
> >>>>>>
> >>>>>> On 11-Nov-2013 11:28 AM, Christian Beer wrote:
> >>>>>>> Hi David,
> >>>>>>>
> >>>>>>> now that Trickles are working again I updated the trickle_deadline
> >>>>>>> handler. I changed the output to the BOINC format like in
> >>>>>>> scheduler.log and added a hostid check to the result lookup for
> >>>>>>> more security. Now every host can only extend its own results and
> >>>>>>> not others.
> >>>>>>>
> >>>>>>> The code is tested on RNA World.
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> Christian

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
