In this case, Raistmer has <flops> in the app_info.xml file, so the BOINC
client is not estimating the GPU speed. Witness that the cutoff for time
exceeded remains at just over 4000 seconds.

The real issue is that the et average for "SETI@home v7 (anonymous platform,
ATI GPU)" on that host is not realistic. The average processing rate indicates
over 443 GFLOPS, but Raistmer's current GPU configuration is performing at
about 1/10 of that. It completes full runs successfully on a fraction of
the tasks it gets, with runtimes slightly less than the limit, so the et
average will eventually adapt, but with a lot of waste in the meantime.

I know Eric has considered raising the rsc_fpops_bound to more than 10x
rsc_fpops_est, and that would relieve Raistmer's situation. But there are
competing factors: if a CPU host which normally takes half a day to do a task
gets hung in a loop, it already takes 5 days for BOINC to decide to kill the
task.
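
To make the arithmetic concrete, here is a rough sketch of the tradeoff. It
assumes, as I understand it, that a task is aborted once its elapsed time
exceeds rsc_fpops_bound divided by the speed assumed for the app_version, and
it simplifies by treating rsc_fpops_est as the true amount of work. The
function names and the rsc_fpops_est value are hypothetical, chosen only so
that the numbers line up with the roughly 4000 second cutoff and the 443
GFLOPS figure mentioned above.

```python
# Sketch only: a simplified model of the elapsed-time cutoff discussed above.
# Assumption: a task is aborted once elapsed time exceeds
# rsc_fpops_bound / (speed assumed for the app_version).
# All function names are hypothetical, not BOINC identifiers.


def time_limit_seconds(rsc_fpops_bound, assumed_flops):
    """Elapsed-time cutoff implied by the bound and the assumed speed."""
    return rsc_fpops_bound / assumed_flops


def will_hit_limit(rsc_fpops_est, bound_factor, assumed_flops, actual_flops):
    """True if a healthy task would be killed before it can finish.

    Simplification: the task is taken to need exactly rsc_fpops_est
    operations, so its real runtime is rsc_fpops_est / actual_flops.
    """
    limit = time_limit_seconds(bound_factor * rsc_fpops_est, assumed_flops)
    runtime = rsc_fpops_est / actual_flops
    return runtime > limit


def hang_detection_days(normal_runtime_days, bound_factor=10):
    """Days before a task stuck in a loop is killed on a correctly rated host."""
    return normal_runtime_days * bound_factor


ASSUMED_FLOPS = 443e9   # the unrealistic average processing rate from the thread
EST = 1.772e14          # illustrative value giving a cutoff of about 4000 s
```

With these numbers, a host whose real speed is 1/9 of the assumed speed still
finishes its tasks, while one at 1/12 has every task killed. Raising the
factor from 10 to 20 would rescue the second host, but
hang_detection_days(0.5, 20) shows the cost: the half-day CPU host would then
wait 10 days instead of 5 before a looping task is killed.
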
@David:
Raistmer is a developer, so he probably changes hardware/software
configuration more often than most. However, an ordinary user could get into
Raistmer's situation if a new GPU has driven the estimated processing rate up,
but it then suffers infant mortality and the user reinstalls the old slow GPU.
If the speed estimate is much more than ten times too high, AFAIK the only way
to get out of incessant EXIT_TIME_LIMIT_EXCEEDED errors is to force BOINC to
assign a new hostID and thereby start new averages for all applications on the
host.

If there were a way for a user to reset the host averages for an individual
app_version, that would be much better. Something like a "reset" button by each
app_version on a host's application details page, with a confirm/cancel dialog
explaining the action, might be suitable. (The "reset" would of course have to
be available only to the account owner.)
--
Joe
On Fri, 21 Dec 2012 06:41:01 -0500, Stephen Maclagan
<[email protected]> wrote:
> Raistmer has just upgraded his BOINC version on that host from 7.0.28 to
> 7.0.42. The errors are happening because of this change in BOINC 7.0.32:
>
> client: when estimating FLOPS for an anonymous-platform app version for
> which no estimate has been supplied by user, use (CPU speed)*(cpu_usage +
> 10*gpu_usage) (--> add the 10*)
>
> Because the GPU now has a speed 10 times faster than before, all existing
> GPU work is on the verge of going Maximum Time Exceeded.
>
> New GPU work will get revised <rsc_fpops_est> and <rsc_fpops_bound> figures,
> so it won't be affected by this problem.
>
> Richard reported the problem on the 2nd August in the following post, there
> were No Responses:
>
> http://lists.ssl.berkeley.edu/mailman/private/boinc_alpha/2012-August/017036.html
>
> Because of this change, New Boinc versions should revise the <rsc_fpops_est>
> and <rsc_fpops_bound> figures when upgrading from Boinc 7.0.31 and earlier,
> so these errors don't happen.
>
> Claggy
>
>
> -----Original Message-----
> From: David Anderson
> Sent: Thursday, December 20, 2012 8:15 PM
> To: [email protected]
> Subject: Re: [boinc_dev] Unrecoverable 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
> ...
>
> What is the app_info.xml entry?
> What are the estimated CPU and GPU speeds?
> Does this happen also with 7.0.42?
> -- David
>
> On 20-Dec-2012 11:08 AM, Raistmer wrote:
>> Please, look at this host:
>> http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=39394&offset=0&show_names=0&state=6&appid=
>>
>> For some reason (maybe a recent driver change, maybe some project-side
>> change), the server decided that 4k seconds is the max time the app can
>> spend on a task. No matter what the reason was (for now), it happened.
>> And the BOINC client started to kill one task after another: 4k seconds
>> spent, kill, and so on. The app is making progress in those 4k seconds,
>> so such a kill is a pure waste of host resources.
>> But even that would be OK if BOINC could somehow accommodate to the new
>> crunching times... but it seems it can't!
>> A task aborted with a computation error means its elapsed time doesn't
>> mean anything to BOINC; it just discards it.
>> That is, BOINC will kill tasks on the host until all of them have been
>> killed, without any chance to recover from this situation.
>>
>> I consider this behavior a pure design flaw; some way should be provided
>> for BOINC to accommodate to new crunching times. Even better would be if
>> the whole EXIT_TIME_LIMIT_EXCEEDED behavior were redesigned. Its primary
>> aim was to prevent endless loops, and now it just kills host performance
>> and leads to resource waste, not savings.
>>
>> With any wrong time estimate, especially at a new app release, we see lots
>> of such EXIT_TIME_LIMIT_EXCEEDED results killed for nothing.