Two things would be worth checking on this front.

1) When new hosts get their first work, how far from 'realistic' are the
runtime estimates?
2) At what point in the host's lifecycle do the 'Maximum elapsed time
exceeded' errors happen?

We've been seeing a lot of -177 errors (ERR_RSC_LIMIT_EXCEEDED) at SETI@home
under CreditNew, at or soon after the point where each new host reaches the
transition after 10 validated tasks. For SETI, task runtimes are routinely
overestimated for new joiners with modern hardware - the target DCF (duration
correction factor) was set at 0.4 for Astropulse, and I think it is even
lower for MB tasks running on CPU. For GPUs, the over-estimation is even
more marked.

The errors we see come - I think only with anonymous platform - when the
<rsc_fpops_est> is reduced by the server after the 10th validation, but the
client is still using the DCF correction built up by the earlier
(full-estimate) tasks. If you can persevere through the transition point, the
errors go away again - but if you keep resetting the host application detail
records, they'll keep coming back.
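
To put rough numbers on that transition, here is a minimal C++ sketch. It is
an illustration, not the client's actual code: it assumes the client estimates
runtime as rsc_fpops_est / flops * DCF, sets the elapsed-time limit to
rsc_fpops_bound / flops, and that the server scales rsc_fpops_bound along with
rsc_fpops_est. The struct, the function names and every figure used with them
are made up for the example.

```cpp
#include <cassert>

// Hypothetical, simplified model of the runtime arithmetic. The names and
// formulas are assumptions for illustration, not copied from the BOINC source.
struct Workunit {
    double rsc_fpops_est;    // server's estimate of the work, in FLOPs
    double rsc_fpops_bound;  // upper bound on the work, in FLOPs
};

// Assumed: the client's runtime estimate is the raw estimate scaled by DCF.
double estimated_runtime(const Workunit& wu, double flops, double dcf) {
    assert(flops > 0.0);
    return wu.rsc_fpops_est / flops * dcf;
}

// Assumed: the elapsed-time limit is the bound divided by the app's speed.
double max_elapsed_time(const Workunit& wu, double flops) {
    assert(flops > 0.0);
    return wu.rsc_fpops_bound / flops;
}

// Assumed: after the transition the server multiplies both figures by the
// same correction factor before sending the workunit.
Workunit server_scaled(const Workunit& wu, double factor) {
    return Workunit{wu.rsc_fpops_est * factor, wu.rsc_fpops_bound * factor};
}

// True if a task that really needs actual_runtime seconds would hit the limit.
bool would_abort(const Workunit& wu, double flops, double actual_runtime) {
    return actual_runtime > max_elapsed_time(wu, flops);
}
```

With flops = 1e9, rsc_fpops_est = 1e13 and rsc_fpops_bound = 1e14, a task that
really takes 4000 seconds is estimated at 10000 seconds raw, or 4000 seconds
once DCF has settled at 0.4, and the limit is 100000 seconds. If the server
then scales the workunit by 0.03, the limit drops to 3000 seconds and the same
4000-second task would be aborted, while the stale DCF pushes the estimate
down to 120 seconds.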


----- Original Message ----- 
From: "Tom Ritter" <[email protected]>
To: "Travis Desell" <[email protected]>
Cc: <[email protected]>; <[email protected]>
Sent: Monday, May 02, 2011 2:46 AM
Subject: Re: [boinc_projects] clients not getting rsc_fpops_bound correctly?


> I've run a small network of hosts for testing workunit-creation scripts,
> and I've found that sometimes a host will get its fpops estimation way
> out of whack under some circumstances (usually workunit mistakes or
> failures).
>
> It got to the point where I would add a bunch of debugging statements
> like these:
>
> Index: lib/hostinfo.cpp
> ===================================================================
> --- lib/hostinfo.cpp    (revision 22824)
> +++ lib/hostinfo.cpp    (working copy)
> @@ -77,6 +77,7 @@
>             // fix foolishness that could result in negative value here
>             //
>             if (p_fpops < 0) p_fpops = -p_fpops;
> +           printf("[>] Just set flops to %.2f in spot 5\n", p_fpops);
>             continue;
>         }
>         else if (parse_double(buf, "<p_iops>", p_iops)) {
> Index: client/app_control.cpp
> ===================================================================
> --- client/app_control.cpp      (revision 22824)
> +++ client/app_control.cpp      (working copy)
> @@ -624,11 +624,13 @@
>         if (atp->task_state() != PROCESS_EXECUTING) continue;
>                if (!atp->result->project->non_cpu_intensive && (atp->elapsed_time > atp->max_elapsed_time)) {
>                        msg_printf(atp->result->project, MSG_INFO,
> -                               "Aborting task %s: exceeded elapsed time limit %.2f (%.2fG/%.2fG)",
> -                               atp->result->name, atp->max_elapsed_time,
> -                atp->result->wup->rsc_fpops_bound/1e9,
> -                atp->result->avp->flops/1e9
> -                       );
> +                                  "Aborting task %s: exceeded elapsed time limit %.2f > %.2f (%.2fG/%.2fG)",
> +                                  atp->result->name,
> +                                  atp->elapsed_time,
> +                                  atp->max_elapsed_time,
> +                                  atp->result->wup->rsc_fpops_bound/1e9,
> +                                  atp->result->avp->flops/1e9
> +                                  );
>                        atp->abort_task(ERR_RSC_LIMIT_EXCEEDED, "Maximum elapsed time exceeded");
>                        did_anything = true;
>                        continue;
>
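> And wrote a script to parse it (because I kept forgetting what it meant):
>
> <?php
>
> // $line is assumed to hold the message text, starting at "Aborting task".
> // %s stops at whitespace, so $results[0] is the task name plus its colon.
> $results = sscanf($line, "Aborting task %s exceeded elapsed time limit %f > %f (%fG/%fG)");
>
> $elapsed_time = $results[1];    // seconds the task had run when aborted
> $max_time = $results[2];        // the elapsed-time limit, in seconds
> $resource_bound = $results[3];  // rsc_fpops_bound / 1e9
> $current_flops = $results[4];   // app version flops / 1e9
>
> echo "This workunit was bound to run in $max_time seconds - but died after $elapsed_time seconds.\n";
> echo "The fpops bound was {$resource_bound}G operations, and the client was operating at {$current_flops}G ops/sec\n";
>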
>
> To get a host back 'normal' I'd detach, turn off boinc, delete it from the
> database on the server, and remove all the files from the client (pretty
> much everything in /var/lib/boinc/ like client_state.xml and so on).
>
> I'm sure parts of this are completely overkill.
>
> -tom


_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
