Information about how often a particular app should update its state could be passed to BOINC along with the app itself, just like the maximum allowed time, maximum memory and so on. BOINC would then know how often the app should update its state when everything is all right, and if the app fails to do so within some predefined margin, BOINC could restart it. This way apps with long times between updates would not suffer: they just need to tell BOINC "I will update my state infrequently, don't panic!" (Or "don't panic" could be the default, but some apps should have a way to tell BOINC "I will update my state at such-and-such a frequency; if I don't, please restart me, because after a GPU driver restart I can't recover by myself. And you, mighty framework, should help me with this!")
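To make the idea concrete, here is a rough sketch of what such a per-app declaration and the client-side check might look like. Everything in it is hypothetical: `UpdatePolicy`, `margin_factor` and `should_restart` are names made up for illustration and are not part of the BOINC API.

```cpp
// Hypothetical per-app declaration, shipped with the app much like the
// time and memory limits. Not an existing BOINC structure.
struct UpdatePolicy {
    // Promised seconds between state updates; 0 means "don't panic"
    // (the app makes no promise and is never restarted by this check).
    double expected_interval_s = 0.0;
    // Predefined margin, as a multiple of the promised interval.
    double margin_factor = 2.0;
};

// Returns true when the app has been silent for longer than it promised.
// running_seconds_since_update is assumed to count only time during which
// the app was actually running, not time it spent suspended.
bool should_restart(const UpdatePolicy& policy,
                    double running_seconds_since_update) {
    if (policy.expected_interval_s <= 0.0) {
        return false;  // default: don't panic
    }
    return running_seconds_since_update >
           policy.expected_interval_s * policy.margin_factor;
}
```

With the default policy the check never fires, so apps with long gaps between updates are unaffected. A GPU app that promises an update every 60 seconds would be restarted after more than 120 seconds of silence.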
----- Original Message -----
From: Richard Haselgrove
To: David Anderson ; [email protected] ; Josef W. Segur
Sent: Thursday, April 14, 2011 2:30 PM
Subject: Re: [boinc_dev] check_progress option

I'd love to see something like this built into BOINC apps, and activated by default - one problem is that the app developers who need it most are possibly the ones least likely to enable a non-default option in the API library code.

BUT, I'm worried about basing 'something useful' on fraction_done updates. Some projects issue applications - the AQUA 'ROQS' application is a current case in point - where fraction_done makes huge quantum jumps at infrequent intervals (I'm talking several hours apart). But they are usually running - unless waiting for memory - and they should just be allowed to continue. I suspect they would fail Joe's test.

ROQS apps do checkpoint regularly, at the defined intervals, while apparently making no progress. Would a truly stalled app still checkpoint? If not, could adding a second test for a recent checkpoint help to decide the matter? Only if BOTH 'no progress' AND 'no checkpoint' would we decide to give the app a restart kick.

----- Original Message -----
From: "Josef W. Segur" <[email protected]>
To: "David Anderson" <[email protected]>; <[email protected]>
Sent: Thursday, April 14, 2011 5:54 AM
Subject: [boinc_dev] check_progress option

> Users find it discouraging to check BOINC and find that an application hasn't
> made any progress in hours, and though the eventual cutoff based on
> rsc_fpops_bound is needed it is hardly the best we can do. IMO what I'm
> suggesting will be an improvement.
>
> The proposed change provides an option for the timer thread to check whether
> a science application seems to still be doing something useful. It's based on
> the assumption that correct operation will update the fraction_done
> frequently, and if that doesn't happen within a reasonable time the
> application should be shut down. That's done like the no heartbeat case,
> since at least some cases can be cured by a restart. Even if it's not a
> direct help, having BOINC trying to correct the situation ought to be less
> discouraging to users.
>
> I've based the "reasonable time" on the rsc_fpops_est/host_info.p_fpops
> runtime approximation divided by 100. Although that's not in any sense
> accurate it does provide for old slow systems. If the values to calculate
> that time are not available the period is defaulted to 1800 seconds, and on
> the short end there's a minimum of 120 seconds. The actual count used is
> based on the running_interrupt_count value of course, to exclude time when
> the application is suspended.
>
> I've defaulted the option off so application builds using trunk code won't
> have the feature unless a project decides to use it. The changes needed are
> in the attached diffs. I've done some testing with builds of the S@H v7 Beta
> application including those changes plus code to simulate an unintended
> looping condition. That is, the change builds and runs as I intended.
> --
> Joe

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
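For reference, the "reasonable time" rule from Joe's message and the two-condition check Richard suggests can be sketched as follows. This is not the code from the attached diffs, which are not reproduced here. The function and constant names are invented for illustration, and applying the same timeout to the checkpoint test is an assumption rather than something either message specifies.

```cpp
#include <algorithm>

// Values taken from Joe's description.
constexpr double kDefaultTimeoutS = 1800.0;  // used when inputs are unavailable
constexpr double kMinTimeoutS = 120.0;       // floor on the short end

// Hypothetical helper: estimated runtime (rsc_fpops_est / p_fpops)
// divided by 100, with the default and the minimum applied.
double progress_timeout_s(double rsc_fpops_est, double p_fpops) {
    if (rsc_fpops_est <= 0.0 || p_fpops <= 0.0) {
        return kDefaultTimeoutS;  // values to calculate the time are missing
    }
    const double estimated_runtime_s = rsc_fpops_est / p_fpops;
    return std::max(kMinTimeoutS, estimated_runtime_s / 100.0);
}

// Hypothetical helper for Richard's variant: restart only if BOTH
// fraction_done and the last checkpoint are stale. Both arguments are
// assumed to count running time only, excluding suspended periods.
// Reusing one timeout for both tests is an assumption of this sketch.
bool looks_stalled(double running_s_since_progress,
                   double running_s_since_checkpoint,
                   double timeout_s) {
    return running_s_since_progress > timeout_s &&
           running_s_since_checkpoint > timeout_s;
}
```

For example, a task with rsc_fpops_est = 3.6e13 on a host with p_fpops = 1e9 has an estimated runtime of 36000 seconds, giving a timeout of 360 seconds. Under the two-condition check, an app like ROQS that keeps checkpointing while fraction_done stays flat would not be restarted.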
