I don't think this is worth doing.
Some apps, because of their structure,
do not change fraction done for unpredictably long periods.
I don't see a good way to distinguish such periods from infinite loops.

We already have a mechanism for detecting infinite loops:
the FPOPS bound, after which jobs are aborted.
If this is happening to a significant number of jobs,
the project should notice it (on the admin web pages)
and fix the bug ASAP;
if they don't notice it, it should be pointed out to them.

-- David

On 13-Apr-2011 9:54 PM, Josef W. Segur wrote:
> Users find it discouraging to check BOINC and find that an application hasn't
> made any progress in hours, and though the eventual cutoff based on
> rsc_fpops_bound is needed it is hardly the best we can do. IMO what I'm
> suggesting will be an improvement.
>
> The proposed change provides an option for the timer thread to check whether a
> science application seems to still be doing something useful. It's based on
> the assumption that correct operation will update the fraction_done
> frequently, and if that doesn't happen within a reasonable time the
> application should be shut down. That's done like the no heartbeat case, since
> at least some cases can be cured by a restart. Even if it's not a direct help,
> having BOINC trying to correct the situation ought to be less discouraging to
> users.
>
> I've based the "reasonable time" on the rsc_fpops_est/host_info.p_fpops
> runtime approximation divided by 100. Although that's not in any sense
> accurate it does provide for old slow systems. If the values to calculate that
> time are not available the period is defaulted to 1800 seconds, and on the
> short end there's a minimum of 120 seconds. The actual count used is based on
> the running_interrupt_count value of course, to exclude time when the
> application is suspended.
>
> I've defaulted the option off so application builds using trunk code won't
> have the feature unless a project decides to use it. The changes needed are in
> the attached diffs. I've done some testing with builds of the S@H v7 Beta
> application including those changes plus code to simulate an unintended
> looping condition. That is, the change builds and runs as I intended.
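
For readers without the attached diffs, the timeout rule described in the
quoted message can be sketched roughly as below. This is only an illustration
under stated assumptions, not the code from the diffs: the function
no_progress_timeout, the ProgressWatch struct, the constant names, and the
one-tick-per-second counting are all hypothetical. Only the figures (divide by
100, 1800 second default, 120 second minimum) and the inputs rsc_fpops_est and
p_fpops come from the message.

```cpp
#include <algorithm>

// Hypothetical names; the values are the ones quoted in the message.
constexpr double kDefaultTimeoutSecs = 1800.0;  // used when inputs are missing
constexpr double kMinTimeoutSecs = 120.0;       // lower bound on the timeout

// Seconds of unsuspended run time allowed without a fraction_done change.
// rsc_fpops_est / p_fpops approximates the total runtime; 1/100 of that is
// taken as the "reasonable time", clamped below at the minimum.
double no_progress_timeout(double rsc_fpops_est, double p_fpops) {
    if (rsc_fpops_est <= 0.0 || p_fpops <= 0.0) {
        return kDefaultTimeoutSecs;  // values not available
    }
    double est_runtime_secs = rsc_fpops_est / p_fpops;
    return std::max(est_runtime_secs / 100.0, kMinTimeoutSecs);
}

// Hypothetical stall tracker, assumed to be called once per second by a timer.
struct ProgressWatch {
    double last_fraction = -1.0;  // last fraction_done value seen
    double stalled_secs = 0.0;    // running seconds since it last changed

    // Returns true once fraction_done has been unchanged for timeout_secs of
    // running (not suspended) time.
    bool tick(double fraction_done, bool suspended, double timeout_secs) {
        if (suspended) return false;  // suspended time is not counted
        if (fraction_done != last_fraction) {
            last_fraction = fraction_done;
            stalled_secs = 0.0;
            return false;
        }
        stalled_secs += 1.0;
        return stalled_secs >= timeout_secs;
    }
};
```

With these assumptions, a job estimated at 3.6e13 FPOPS on a 1e9 FLOPS host
gets a 360 second window, while a very short job is held to the 120 second
minimum.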