Richard Haselgrove wrote on 14/04/2011 11:30:
> I'd love to see something like this built into BOINC apps, and activated by 
> default - one problem is that the app developers who need it most are 
> possibly the ones least likely to enable a non-default option in the API 
> library code.
>
> BUT, I'm worried about basing 'something useful' on fraction_done updates. 
> Some projects issue applications - the AQUA 'ROQS' application is a current 
> case in point - where fraction_done makes huge quantum jumps at infrequent 
> intervals (I'm talking several hours apart). But they are usually running - 
> unless waiting for memory - and they should just be allowed to continue. I 
> suspect they would fail Joe's test.
>
> ROQS apps do checkpoint regularly, at the defined intervals, while apparently 
> making no progress. Would a truly stalled app still checkpoint? If not, could 
> adding a second test for a recent checkpoint help to decide the matter? Only 
> if BOTH 'no progress' AND 'no checkpoint' would we decide to give the app a 
> restart kick.
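
The two-condition test Richard suggests above could be sketched roughly as follows. This is only an illustration: the struct, its field names, and the two limit parameters are hypothetical and are not part of the BOINC API or of Joe's diffs.

```cpp
// Hypothetical bookkeeping for one running task; these names are
// illustrative only and are not BOINC API fields.
struct AppActivity {
    double secs_since_progress;    // running time since fraction_done last changed
    double secs_since_checkpoint;  // running time since the last checkpoint
};

// Restart only if BOTH signs of life are missing: no fraction_done
// update AND no checkpoint within their respective limits.
bool should_restart(const AppActivity& a,
                    double progress_limit_secs,
                    double checkpoint_limit_secs) {
    const bool no_progress = a.secs_since_progress > progress_limit_secs;
    const bool no_checkpoint = a.secs_since_checkpoint > checkpoint_limit_secs;
    return no_progress && no_checkpoint;
}
```

Under this sketch, an app that has shown no progress for hours but checkpointed a few minutes ago is left alone. Whether a truly stalled app would stop checkpointing is still the open question.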

QMC qasino tasks would definitely have a problem if this were to become 
standard behaviour.  It looks like they only update the progress when 
a checkpoint is made, and checkpoints happen at points fixed by the 
application.

Some of these tasks sit at 0% progress for over an hour on my system 
before the first checkpoint.  The last one it ran had an initial 
estimate of 21 hours.  The first checkpoint (and progress update) took 
78 minutes, the second took 16 minutes, with the interval gradually 
reducing to every 6 or 7 minutes after that (the checkpoint interval is 
set to 300 seconds).  It had a total elapsed time of 21:14:37.
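
As a rough check of those figures against the proposed rule, here is a small sketch.  It assumes the 21 hour initial estimate is close to the rsc_fpops_est/host_info.p_fpops approximation, which may not be true, and the function names are made up for the illustration.

```cpp
#include <vector>

// Hypothetical check: does a gap between fraction_done updates exceed
// 1% of the estimated runtime?  The 120 s minimum and 1800 s default
// from the proposal are not modelled here.
bool gap_exceeds_limit(double gap_secs, double est_runtime_secs) {
    return gap_secs > est_runtime_secs / 100.0;
}

// Counts how many of the observed gaps would trip the check.
int count_tripped(const std::vector<double>& gaps_secs,
                  double est_runtime_secs) {
    int tripped = 0;
    for (double gap : gaps_secs) {
        if (gap_exceeds_limit(gap, est_runtime_secs)) {
            ++tripped;
        }
    }
    return tripped;
}
```

With an estimate of 21 hours (75600 seconds) the limit works out to 756 seconds, or 12.6 minutes.  The 78 minute and 16 minute gaps both exceed that, while the later 6 or 7 minute gaps do not.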

Ian

> ----- Original Message -----
> From: "Josef W. Segur"<[email protected]>
> To: "David Anderson"<[email protected]>;<[email protected]>
> Sent: Thursday, April 14, 2011 5:54 AM
> Subject: [boinc_dev] check_progress option
>
>
>> Users find it discouraging to check BOINC and find that an application 
>> hasn't made any progress in hours, and though the eventual cutoff based on 
>> rsc_fpops_bound is needed it is hardly the best we can do. IMO what I'm 
>> suggesting will be an improvement.
>>
>> The proposed change provides an option for the timer thread to check whether 
>> a science application seems to still be doing something useful. It's based 
>> on the assumption that correct operation will update the fraction_done 
>> frequently, and if that doesn't happen within a reasonable time the 
>> application should be shut down. That's done like the no heartbeat case, 
>> since at least some cases can be cured by a restart. Even if it's not a 
>> direct help, having BOINC trying to correct the situation ought to be less 
>> discouraging to users.
>>
>> I've based the "reasonable time" on the rsc_fpops_est/host_info.p_fpops 
>> runtime approximation divided by 100. Although that's not in any sense 
>> accurate it does provide for old slow systems. If the values to calculate 
>> that time are not available the period is defaulted to 1800 seconds, and on 
>> the short end there's a minimum of 120 seconds. The actual count used is 
>> based on the running_interrupt_count value of course, to exclude time when 
>> the application is suspended.
>>
>> I've defaulted the option off so application builds using trunk code won't 
>> have the feature unless a project decides to use it. The changes needed are 
>> in the attached diffs. I've done some testing with builds of the S@H v7 Beta 
>> application including those changes plus code to simulate an unintended 
>> looping condition. That is, the change builds and runs as I intended.
>> --
>>                                                           Joe
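
The attached diffs are not reproduced in this message.  As a minimal sketch of the "reasonable time" rule Joe describes, and not his actual code, the calculation might look like the following.  The function name is hypothetical, and treating non-positive values as "not available" is an assumption.

```cpp
#include <algorithm>

// Hypothetical helper, not the code from the attached diffs.
// Returns the no-progress timeout in seconds of running time.
double progress_timeout_secs(double rsc_fpops_est, double p_fpops) {
    const double kDefaultSecs = 1800.0;  // used when inputs are unavailable
    const double kMinSecs = 120.0;       // floor for very short tasks

    // Assumption: zero or negative values mean "not available".
    if (rsc_fpops_est <= 0.0 || p_fpops <= 0.0) {
        return kDefaultSecs;
    }

    // Approximate runtime in seconds, then take 1% of it.
    const double est_runtime_secs = rsc_fpops_est / p_fpops;
    return std::max(kMinSecs, est_runtime_secs / 100.0);
}
```

The result would be compared against running time only, since the description says time spent suspended is excluded from the count.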
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
