Hello,

I've been tracking down a strange scheduler problem where the scheduler
fails to find a valid app version for a host at random times, only to
succeed on the next round. There clearly is a valid app version for the
host.

I added some additional logging (patch attached, as it helps to
understand all the failure conditions of version selection) and found
that the if at
https://github.com/BOINC/boinc/blob/master/sched/sched_version.cpp#L845
seems to be the one that fails even though there's no BAVP yet. Somehow
r ends up negative here, as shown in the log:
  "[version] Not selected, AV#36 r*45.66 GFLOP <= Best AV 0.00 GFLOP (r=-1.391884, n=1)"
so the app version is never selected, even if it's the only one
otherwise valid for the host. :(
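
To make the failure concrete, here is a minimal standalone sketch of
that comparison using the values from the log line above. Note that
beats_best() is a hypothetical helper written for this mail, not a
function in the BOINC source; it only mirrors the
"r*host_usage.projected_flops > bavp->host_usage.projected_flops" test.

```cpp
#include <cassert>

// Hypothetical helper (not part of BOINC): mirrors the selection test
//   r*host_usage.projected_flops > bavp->host_usage.projected_flops
// candidate_flops is the projected FLOPS of the version being considered,
// best_flops is that of the best version found so far (0 if none yet).
bool beats_best(double r, double candidate_flops, double best_flops) {
    assert(candidate_flops >= 0.0 && best_flops >= 0.0);
    // A negative r makes the left-hand side negative, so the test fails
    // even when best_flops is still 0.
    return r * candidate_flops > best_flops;
}
```

With r = -1.391884, a candidate of 45.66e9 and a best of 0.0, the
product is negative and beats_best() returns false. With r = 1.0 and the
same inputs it returns true.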

This relates to the <version_select_random_factor> option as
explained in
https://boinc.berkeley.edu/trac/wiki/ProjectOptions#Appversionselection.
We had no value set for it before, so the default at
https://github.com/BOINC/boinc/blob/master/sched/sched_config.cpp#L94
gets used (1.0, which already seems to contradict the documented default
of 0.1). I'm not sure what range of values rand_normal() at
https://github.com/BOINC/boinc/blob/master/lib/util.cpp#L571 can
return, but my guess is that they can be negative (and less than -1).
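
As a rough check of how often this could happen, here is a small
simulation sketch. It assumes that rand_normal() returns a standard
normal sample (mean 0, variance 1) and that r starts at 1.0 before the
random term from the patch context
("r += config.version_select_random_factor*rand_normal()/n") is added.
Both are assumptions on my part. std::normal_distribution stands in for
rand_normal(), and fraction_negative_r() is a hypothetical name.

```cpp
#include <cassert>
#include <random>

// Estimates how often r = 1 + factor * N(0,1) / n comes out negative.
// Assumption: BOINC's rand_normal() behaves like a standard normal draw;
// std::normal_distribution is used here as a stand-in for it.
double fraction_negative_r(double factor, long n, int trials, unsigned seed) {
    assert(n >= 1 && trials > 0);
    std::mt19937 gen(seed);
    std::normal_distribution<double> normal(0.0, 1.0);
    int negative = 0;
    for (int i = 0; i < trials; ++i) {
        // Same shape as the scheduler's update, starting from r = 1.
        double r = 1.0 + factor * normal(gen) / static_cast<double>(n);
        if (r < 0.0) ++negative;
    }
    return static_cast<double>(negative) / trials;
}
```

Under those assumptions, a factor of 1.0 with n=1 gives a negative r
whenever the normal sample is below -1, which is roughly 16% of draws.
With a factor of 0.1 the sample would have to be below -10, which
effectively never happens.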

I'm not sure how to fix this as I don't quite understand the logic
behind this code. I've worked around it for now by setting
<version_select_random_factor> explicitly to 0.1, and there have been
no negative r incidents since. Might as well disable it entirely, as the
estimation discrepancies (especially between a new OpenCL app and an old
CUDA/Stream app) can be on the order of 10 or 100, not a mere 1.
-- 
Teemu Mannermaa
System Specialist

"Anything is possible but probabilities vary."

From 678219922b456634217de4ea7440f19de0f2058a Mon Sep 17 00:00:00 2001
From: Teemu Mannermaa <[email protected]>
Date: Mon, 1 Jun 2015 14:31:25 +0300
Subject: [PATCH] Scheduler: Improve version selection logging

---
 sched/sched_version.cpp | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/sched/sched_version.cpp b/sched/sched_version.cpp
index 56cecc7..18c7da5 100644
--- a/sched/sched_version.cpp
+++ b/sched/sched_version.cpp
@@ -825,7 +825,7 @@ BEST_APP_VERSION* get_app_version(
                 if ((havp->pfc.n==0) && (havp->max_jobs_per_day==1) && (havp->consecutive_valid==0)) {
                     if (drand()>0.01) {
                         host_usage.projected_flops*=0.01;
-                        if (config.debug_version_select  && bavp && bavp->avp) {
+                        if (config.debug_version_select) {
                             log_messages.printf(MSG_NORMAL,
                                 "[version] App version AV#%d is failing on HOST#%d\n",
                                 havp->app_version_id,havp->host_id
@@ -837,10 +837,11 @@ BEST_APP_VERSION* get_app_version(
             if (config.version_select_random_factor) {
                 r += config.version_select_random_factor*rand_normal()/n;
             }
-            if (config.debug_version_select  && bavp && bavp->avp) {
+            if (config.debug_version_select && bavp && bavp->avp) {
                 log_messages.printf(MSG_NORMAL,
                     "[version] Comparing AV#%d (%.2f GFLOP) against AV#%d (%.2f GFLOP)\n",
-                    av.id,host_usage.projected_flops/1e+9,bavp->avp->id,bavp->host_usage.projected_flops/1e+9
+                    av.id, host_usage.projected_flops/1e+9,
+                    bavp->avp->id, bavp->host_usage.projected_flops/1e+9
                 );
             }
             if (r*host_usage.projected_flops > bavp->host_usage.projected_flops) {
@@ -861,7 +862,14 @@ BEST_APP_VERSION* get_app_version(
                           bavp->avp->id, bavp->host_usage.projected_flops/1e+9
                     );
                 }
-
+            } else {
+                if (config.debug_version_select) {
+                    log_messages.printf(MSG_NORMAL,
+                            "[version] Not selected, AV#%d r*%.2f GFLOP <= Best AV %.2f GFLOP (r=%f, n=%ld)\n",
+                            av.id, host_usage.projected_flops/1e+9,
+                            bavp->host_usage.projected_flops/1e+9, r, n
+                    );
+                }
             }
         }   // loop over app versions
 
-- 
1.9.5.msysgit.1
