Re: [boinc_dev] Unrecoverable 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED ...

Josef W. Segur Fri, 21 Dec 2012 09:43:33 -0800

In this case, Raistmer has <flops> in the app_info.xml file, so the BOINC 
client is not estimating the GPU speed. Witness that the cutoff for time 
exceeded remains at just over 4000 seconds.


The real issue is that the et average for "SETI@home v7 (anonymous platform, 
ATI GPU)" on that host is not realistic. Average processing rate is indicating 
over 443 GFLOPS, but Raistmer's current GPU configuration is performing at 
about 1/10 of that. It is completing full runs successfully on a fraction of 
the tasks it gets, with runtimes slightly less than the limit, so the et 
average will eventually adapt but with a lot of waste meanwhile.

I know Eric has considered raising the rsc_fpops_bound to more than 10x 
rsc_fpops_est, and that would relieve Raistmer's situation. But there are 
competing factors: if a CPU host which normally takes half a day to do a task 
gets hung in a loop, it already takes 5 days for BOINC to decide to kill the 
task.
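
To make that arithmetic concrete, here is a rough Python sketch. It assumes the 
client aborts a task once elapsed time passes rsc_fpops_bound divided by the 
speed estimate for the app version, and that rsc_fpops_bound is a fixed 
multiple (10x here) of rsc_fpops_est. The function names and the sample speeds 
are illustrative only and are not taken from BOINC code.

```python
SECONDS_PER_DAY = 86400.0


def time_limit_seconds(rsc_fpops_bound: float, est_flops: float) -> float:
    """Assumed cutoff: a task is killed once it has run bound / est_flops seconds."""
    return rsc_fpops_bound / est_flops


def exceeds_limit(est_flops: float, real_flops: float,
                  bound_factor: float = 10.0) -> bool:
    """True if a healthy task would still hit the cutoff.

    Real runtime is rsc_fpops_est / real_flops, and the cutoff is
    bound_factor * rsc_fpops_est / est_flops, so rsc_fpops_est cancels
    and only the ratio of the two speeds matters.
    """
    return est_flops / real_flops > bound_factor


# CPU host that normally needs half a day per task (speed is hypothetical).
cpu_flops = 1e9
rsc_fpops_est = 43200 * cpu_flops      # half a day of work at that speed
rsc_fpops_bound = 10 * rsc_fpops_est   # the 10x bound discussed above
limit_days = time_limit_seconds(rsc_fpops_bound, cpu_flops) / SECONDS_PER_DAY
print(f"hung CPU task is killed after {limit_days} days")

# GPU host whose average says 443 GFLOPS; the real speeds are hypothetical.
print(exceeds_limit(443e9, 40e9))  # real speed a bit under 1/10 of the estimate
print(exceeds_limit(443e9, 50e9))  # real speed a bit over 1/10 of the estimate
```

Under these assumptions, raising the bound factor rescues hosts whose speed 
estimate is too high, but lengthens by the same factor the time a genuinely 
hung task keeps running.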

@David:
Raistmer is a developer so probably changes hardware/software configuration 
more often than most. However, an ordinary user could get into Raistmer's 
situation if a new GPU has driven the estimated processing rate up but it 
suffers infant mortality and the user reinstalls the old slow GPU. If the speed 
estimate is much more than ten times too high, AFAIK the only way to get out of 
incessant EXIT_TIME_LIMIT_EXCEEDED errors is to force BOINC to assign a new 
hostID and thereby start new averages for all applications on the host.

If there were a way for a user to reset the host averages for an individual 
app_version, that would be much better. Something like a "reset" button by each 
app_version on a host's application details page, with a confirm/cancel dialog 
explaining the action, might be suitable. (The "reset" would of course have to 
be available only to the account owner.)
-- 
                                                                       Joe


On Fri, 21 Dec 2012 06:41:01 -0500, Stephen Maclagan 
<[email protected]> wrote:

> Raistmer has just upgraded his Boinc version on that host from 7.0.28 to
> 7.0.42. The errors are happening because of this change in Boinc 7.0.32:
>
> client: when estimating FLOPS for an anonymous-platform app version for
> which no estimate has been supplied by user, use (CPU speed)*(cpu_usage +
> 10*gpu_usage) (--> add the 10*)
>
> Because the GPU is now estimated to be 10 times faster than before, all
> existing GPU work is on the verge of going Maximum Time Exceeded.
>
> New GPU work will get revised <rsc_fpops_est> and <rsc_fpops_bound> figures,
> so won't be affected by this problem.
>
> Richard reported the problem on the 2nd August in the following post; there
> were no responses:
>
> http://lists.ssl.berkeley.edu/mailman/private/boinc_alpha/2012-August/017036.html
>
> Because of this change, New Boinc versions should revise the <rsc_fpops_est>
> and <rsc_fpops_bound> figures when upgrading from Boinc 7.0.31 and earlier,
> so these errors don't happen.
>
> Claggy
>
>
> -----Original Message-----
> From: David Anderson
> Sent: Thursday, December 20, 2012 8:15 PM
> To: [email protected]
> Subject: Re: [boinc_dev] Unrecoverable 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
> ...
>
> What is the app_info.xml entry?
> What are the estimated CPU and GPU speeds?
> Does this happen also with 7.0.42?
> -- David
>
> On 20-Dec-2012 11:08 AM, Raistmer wrote:
>> Please, look at this host:
>> http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=39394&offset=0&show_names=0&state=6&appid=
>>
>> For some reason (maybe a recent driver change, maybe some project-side
>> change) the server decided that 4k seconds is the max time the app can
>> spend on a task. No matter what the reason was (for now), it happened.
>> And the BOINC client started to kill one task after another: 4k spent,
>> kill, and so on. The app is making progress in those 4k seconds, so such a
>> kill is a pure waste of host resources.
>> But even that would be OK if BOINC could somehow accommodate to the new
>> crunching times... but it seems it can't!
>> A task aborted with a computation error means its elapsed time doesn't
>> mean anything to BOINC; it just discards it.
>> That is, BOINC will kill tasks on the host until all of them have been
>> killed, without any chance to recover from this situation.
>>
>> I consider this behavior a pure design flaw; some way should be provided
>> for BOINC to accommodate to new crunching times. Even better would be for
>> the whole EXIT_TIME_LIMIT_EXCEEDED behavior to be redesigned. Its primary
>> aim was to prevent endless loops, and now it just kills host performance
>> and leads to wasted resources rather than saving them.
>>
>> With any wrong time estimate, especially at a new app release, we see lots
>> of such EXIT_TIME_LIMIT_EXCEEDED results killed for nothing.
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
