I'd love to see something like this built into BOINC apps, and activated by 
default - one problem is that the app developers who need it most are possibly 
the ones least likely to enable a non-default option in the API library code.

BUT, I'm worried about basing 'something useful' on fraction_done updates. Some 
projects issue applications (the AQUA 'ROQS' application is a current case in 
point) where fraction_done makes huge jumps at infrequent intervals, and I'm 
talking several hours apart. Those apps are usually still running, unless they 
are waiting for memory, and they should just be allowed to continue. I suspect 
they would fail Joe's test.

ROQS apps do checkpoint regularly, at the defined intervals, while apparently 
making no progress. Would a truly stalled app still checkpoint? If not, could 
adding a second test for a recent checkpoint help to decide the matter? We 
would give the app a restart kick only if BOTH 'no progress' AND 'no 
checkpoint' are true.
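
To make the two-part test concrete, here is a rough sketch only. The function 
and struct names are hypothetical, not taken from Joe's diffs or the BOINC API. 
The timeout rule follows the description in Joe's message below (estimated 
runtime / 100, 1800 second default, 120 second minimum). Using the same timeout 
for the checkpoint test is purely my assumption; a project might well want a 
different limit there.

```cpp
#include <algorithm>

// Hypothetical helper: allowed running time (seconds) without progress.
// Rule as described in the quoted proposal: rsc_fpops_est / p_fpops / 100,
// defaulting to 1800 s when the inputs are unavailable, never below 120 s.
double stall_timeout_secs(double rsc_fpops_est, double p_fpops) {
    const double kDefaultSecs = 1800.0;
    const double kMinimumSecs = 120.0;
    if (rsc_fpops_est <= 0.0 || p_fpops <= 0.0) return kDefaultSecs;
    return std::max(kMinimumSecs, rsc_fpops_est / p_fpops / 100.0);
}

// Hypothetical bookkeeping; both counters are assumed to exclude
// time when the application is suspended.
struct ProgressState {
    double secs_since_fraction_done_change;
    double secs_since_checkpoint;
};

// Restart only when BOTH signals have been silent for longer than the
// timeout. Reusing one timeout for both tests is an assumption.
bool should_restart(const ProgressState& s, double timeout_secs) {
    const bool no_progress = s.secs_since_fraction_done_change > timeout_secs;
    const bool no_checkpoint = s.secs_since_checkpoint > timeout_secs;
    return no_progress && no_checkpoint;
}
```

With this shape, a ROQS-like app that has not moved fraction_done for two hours 
but checkpointed a minute ago would be left alone, while an app that is silent 
on both counts would get the restart.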


----- Original Message ----- 
From: "Josef W. Segur" <[email protected]>
To: "David Anderson" <[email protected]>; <[email protected]>
Sent: Thursday, April 14, 2011 5:54 AM
Subject: [boinc_dev] check_progress option


> Users find it discouraging to check BOINC and find that an application hasn't 
> made any progress in hours, and though the eventual cutoff based on 
> rsc_fpops_bound is needed it is hardly the best we can do. IMO what I'm 
> suggesting will be an improvement.
> 
> The proposed change provides an option for the timer thread to check whether 
> a science application seems to still be doing something useful. It's based on 
> the assumption that correct operation will update the fraction_done 
> frequently, and if that doesn't happen within a reasonable time the 
> application should be shut down. That's done like the no heartbeat case, 
> since at least some cases can be cured by a restart. Even if it's not a 
> direct help, having BOINC trying to correct the situation ought to be less 
> discouraging to users.
> 
> I've based the "reasonable time" on the rsc_fpops_est/host_info.p_fpops 
> runtime approximation divided by 100. Although that's not in any sense 
> accurate it does provide for old slow systems. If the values to calculate 
> that time are not available the period is defaulted to 1800 seconds, and on 
> the short end there's a minimum of 120 seconds. The actual count used is 
> based on the running_interrupt_count value of course, to exclude time when 
> the application is suspended.
> 
> I've defaulted the option off so application builds using trunk code won't 
> have the feature unless a project decides to use it. The changes needed are 
> in the attached diffs. I've done some testing with builds of the S@H v7 Beta 
> application including those changes plus code to simulate an unintended 
> looping condition. That is, the change builds and runs as I intended.
> -- 
>                                                          Joe
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
