Hello,

I've been tracking down a strange scheduler problem: at random times the scheduler fails to find a valid app version for a host, only to succeed on the next round. A valid app version for the host clearly exists.
I added some additional logging (patch attached, as it helps in understanding all the failure conditions of version selection) and found that the if at https://github.com/BOINC/boinc/blob/master/sched/sched_version.cpp#L845 seems to be the one that fails, even though there's no BAVP yet. Somehow r ends up negative there, as shown in the log:

"[version] Not selected, AV#36 r*45.66 GFLOP <= Best AV 0.00 GFLOP (r=-1.391884, n=1)"

As a result the app version is never selected, even if it's the only one otherwise valid for the host. :(

This relates to the <version_select_random_factor> option, as explained in https://boinc.berkeley.edu/trac/wiki/ProjectOptions#Appversionselection. We had no value set for it, so the default at https://github.com/BOINC/boinc/blob/master/sched/sched_config.cpp#L94 gets used (1.0, which already seems to contradict the documented default of 0.1). I'm not sure what range of values rand_normal() at https://github.com/BOINC/boinc/blob/master/lib/util.cpp#L571 can return, but my guess is that they can be negative (and less than -1).

I'm not sure how to fix this, as I don't quite understand the logic behind this code. I've worked around it for now by setting <version_select_random_factor> explicitly to 0.1, and there have been no negative r incidents since. We might as well disable it, as the estimation discrepancies (especially between a new OpenCL app and an old CUDA/Stream app) can be on the order of 10 or 100, not a mere 1.

--
Teemu Mannermaa
System Specialist
"Anything is possible but probabilities vary."
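To illustrate the suspected failure mode, here is a minimal standalone sketch, not the scheduler code itself. It assumes that r starts at 1.0 before the random term is added (the logged r=-1.391884 with n=1 is consistent with that, but it is an assumption), and it uses std::normal_distribution as a stand-in for rand_normal(). The helper names candidate_wins and rejection_rate are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <random>

// Hypothetical stand-in for the scheduler's comparison: true when the
// candidate app version would replace the current best one.
bool candidate_wins(double r, double candidate_flops, double best_flops) {
    return r * candidate_flops > best_flops;
}

// Fraction of trials in which a lone candidate is rejected, i.e. the
// current best has 0 projected FLOPS and the candidate still loses.
double rejection_rate(double random_factor, long n, int trials, unsigned seed) {
    assert(n > 0 && trials > 0);
    std::mt19937 gen(seed);
    // Standard normal (mean 0, stddev 1); stand-in for rand_normal().
    std::normal_distribution<double> normal(0.0, 1.0);
    int rejected = 0;
    for (int i = 0; i < trials; i++) {
        double r = 1.0;  // assumed starting value, see the note above
        r += random_factor * normal(gen) / n;
        // 45.66e9 is the projected FLOPS value from the log line.
        if (!candidate_wins(r, 45.66e9, 0.0)) {
            rejected++;
        }
    }
    return static_cast<double>(rejected) / trials;
}
```

Under these assumptions, with a factor of 1.0 and n=1 the candidate is rejected whenever the normal sample falls below -1, which in theory happens about 16% of the time; with a factor of 0.1 the sample would have to fall below -10, which is practically never. That would match both the random failures and the effect of the workaround.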
From 678219922b456634217de4ea7440f19de0f2058a Mon Sep 17 00:00:00 2001
From: Teemu Mannermaa <[email protected]>
Date: Mon, 1 Jun 2015 14:31:25 +0300
Subject: [PATCH] Scheduler: Improve version selection logging

---
 sched/sched_version.cpp | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/sched/sched_version.cpp b/sched/sched_version.cpp
index 56cecc7..18c7da5 100644
--- a/sched/sched_version.cpp
+++ b/sched/sched_version.cpp
@@ -825,7 +825,7 @@ BEST_APP_VERSION* get_app_version(
             if ((havp->pfc.n==0) && (havp->max_jobs_per_day==1) && (havp->consecutive_valid==0)) {
                 if (drand()>0.01) {
                     host_usage.projected_flops*=0.01;
-                    if (config.debug_version_select && bavp && bavp->avp) {
+                    if (config.debug_version_select) {
                         log_messages.printf(MSG_NORMAL,
                             "[version] App version AV#%d is failing on HOST#%d\n",
                             havp->app_version_id,havp->host_id
@@ -837,10 +837,11 @@ BEST_APP_VERSION* get_app_version(
             if (config.version_select_random_factor) {
                 r += config.version_select_random_factor*rand_normal()/n;
             }
-            if (config.debug_version_select && bavp && bavp->avp) {
+            if (config.debug_version_select && bavp && bavp->avp) {
                 log_messages.printf(MSG_NORMAL,
                     "[version] Comparing AV#%d (%.2f GFLOP) against AV#%d (%.2f GFLOP)\n",
-                    av.id,host_usage.projected_flops/1e+9,bavp->avp->id,bavp->host_usage.projected_flops/1e+9
+                    av.id, host_usage.projected_flops/1e+9,
+                    bavp->avp->id, bavp->host_usage.projected_flops/1e+9
                 );
             }
             if (r*host_usage.projected_flops > bavp->host_usage.projected_flops) {
@@ -861,7 +862,14 @@ BEST_APP_VERSION* get_app_version(
                         bavp->avp->id, bavp->host_usage.projected_flops/1e+9
                     );
                 }
-
+            } else {
+                if (config.debug_version_select) {
+                    log_messages.printf(MSG_NORMAL,
+                        "[version] Not selected, AV#%d r*%.2f GFLOP <= Best AV %.2f GFLOP (r=%f, n=%ld)\n",
+                        av.id, host_usage.projected_flops/1e+9,
+                        bavp->host_usage.projected_flops/1e+9, r, n
+                    );
+                }
             }
         } // loop over app versions

--
1.9.5.msysgit.1
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
