I don't think this is worth doing. Some apps, because of their structure, do not change fraction done for unpredictably long periods. I don't see a good way to distinguish such periods from infinite loops.
We already have a mechanism for detecting infinite loops: the FPOPS bound, after which jobs are aborted. If this is happening to a significant number of jobs, the project should notice it (on the admin web pages) and fix the bug ASAP; if they don't notice it, it should be pointed out to them.

-- David

On 13-Apr-2011 9:54 PM, Josef W. Segur wrote:
> Users find it discouraging to check BOINC and find that an application hasn't made any progress in hours, and though the eventual cutoff based on rsc_fpops_bound is needed it is hardly the best we can do. IMO what I'm suggesting will be an improvement.
>
> The proposed change provides an option for the timer thread to check whether a science application seems to still be doing something useful. It's based on the assumption that correct operation will update the fraction_done frequently, and if that doesn't happen within a reasonable time the application should be shut down. That's done like the no heartbeat case, since at least some cases can be cured by a restart. Even if it's not a direct help, having BOINC trying to correct the situation ought to be less discouraging to users.
>
> I've based the "reasonable time" on the rsc_fpops_est/host_info.p_fpops runtime approximation divided by 100. Although that's not in any sense accurate it does provide for old slow systems. If the values to calculate that time are not available the period is defaulted to 1800 seconds, and on the short end there's a minimum of 120 seconds. The actual count used is based on the running_interrupt_count value of course, to exclude time when the application is suspended.
>
> I've defaulted the option off so application builds using trunk code won't have the feature unless a project decides to use it. The changes needed are in the attached diffs. I've done some testing with builds of the S@H v7 Beta application including those changes plus code to simulate an unintended looping condition.
> That is, the change builds and runs as I intended.
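
For reference, the "reasonable time" arithmetic in the quoted proposal can be sketched roughly as follows. This is only an illustration in Python, not the attached diffs (which are not reproduced here and would be C++ in the BOINC API). The names stall_timeout_seconds and StallDetector are hypothetical, and the one-second tick is an assumed stand-in for the running_interrupt_count-based counting the message describes.

```python
DEFAULT_TIMEOUT_S = 1800.0  # fallback when the estimate cannot be computed
MIN_TIMEOUT_S = 120.0       # lower limit on the no-progress period


def stall_timeout_seconds(rsc_fpops_est, p_fpops):
    """Return the allowed time without a fraction_done change.

    Follows the rule in the message: (rsc_fpops_est / p_fpops) / 100,
    defaulting to 1800 s if either value is missing, with a 120 s minimum.
    """
    if not rsc_fpops_est or not p_fpops or rsc_fpops_est <= 0 or p_fpops <= 0:
        return DEFAULT_TIMEOUT_S
    estimated_runtime = rsc_fpops_est / p_fpops  # rough runtime in seconds
    return max(MIN_TIMEOUT_S, estimated_runtime / 100.0)


class StallDetector:
    """Hypothetical helper: counts running time since fraction_done changed."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_fraction = None
        self.running_seconds = 0.0

    def tick(self, fraction_done, suspended, dt=1.0):
        """Call once per timer tick; returns True once the app looks stalled."""
        if suspended:
            # Suspended time is excluded, as the proposal requires.
            return False
        if fraction_done != self.last_fraction:
            # Progress was reported, so restart the count.
            self.last_fraction = fraction_done
            self.running_seconds = 0.0
            return False
        self.running_seconds += dt
        return self.running_seconds >= self.timeout_s
```

With these assumptions, a job estimated at 3.6e13 FLOPs on a 1e9 FLOPS host would get a 360-second window, while a very short job would be held to the 120-second minimum.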
