I checked in the following change to address the problem
Rytis describes below.

    client: work fetch policy tweak

    If a project has active uploads, defer work fetch from it for 5 minutes
    even if there are idle devices (that's the change).
    This addresses a situation (reported by Rytis) where
    - a project P has a jobs-in-progress limit less than NCPUS
    - P's jobs finish and are uploading
    - the client asks P for work and doesn't get any because of the limit
    - the client does exponential backoff from P
    Over the long term, P can get much less than its fair share of work.
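
The policy in the commit message can be sketched roughly as follows. This is an illustration only, not the actual client code: the Project class, its fields, and may_fetch_work() are hypothetical names, and it assumes the 5-minute deferral starts when the client notices the active uploads.

```python
from dataclasses import dataclass

# The 5-minute deferral named in the commit message.
UPLOAD_DEFER_SECONDS = 5 * 60


@dataclass
class Project:
    name: str
    active_uploads: int = 0            # hypothetical: uploads currently in progress
    fetch_deferred_until: float = 0.0  # hypothetical: earliest time of next work fetch


def may_fetch_work(project: Project, now: float) -> bool:
    """Return True if the client may ask `project` for work at time `now`."""
    if now < project.fetch_deferred_until:
        # A deferral set earlier is still in effect.
        return False
    if project.active_uploads > 0:
        # Defer rather than send a request that the server would refuse,
        # since a refused request would start exponential backoff.
        project.fetch_deferred_until = now + UPLOAD_DEFER_SECONDS
        return False
    return True
```

The point of the sketch is that the check ignores idle devices: a project with uploads in flight is skipped for a short, fixed time instead of being asked and then backed off for a growing one.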

-- David

-------- Original Message --------
Subject:        Scheduler troubles in conjunction with rate limiting from server
Date:   Fri, 7 Feb 2014 12:41:04 +0200
From:   Rytis Slatkevičius <[email protected]>
To:     David Anderson <[email protected]>
CC:     Matthew Blumberg <[email protected]>



Hello David,

We observed an interesting problem with task scheduling:

Project A (our project) limits the number of tasks per proc to 2 and has a
resource share of 500;
Project B (Einstein) does not limit the number of tasks and has a resource
share of 25.

B has longer tasks than A, and also longer tasks than the minimum work buffer.

When attaching both, A has priority because of resource share. It fetches 2
tasks (as the server does not send any more). B then fetches tasks to fill the
remaining buffers up to the minimum threshold.

When A finishes work, a scheduler request happens as there is not enough work
available to fill all work slots. However, because the completed tasks have
not been uploaded yet, the scheduler does not send any new work, as A is
limited to 2 tasks on the host (and the host still has them, even though
computation is complete). Backoff happens for A as no work is provided, and
therefore B is asked for work. Now only B is running.
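
The refusal described above comes down to simple arithmetic. The sketch below is not the scheduler's code; tasks_server_will_send() and its parameters are hypothetical, and it assumes, as described above, that tasks whose computation is complete but whose uploads are unfinished still count against the limit.

```python
def tasks_server_will_send(limit: int, running: int,
                           awaiting_upload: int, requested: int) -> int:
    """Number of new tasks sent under a per-host in-progress limit."""
    # Finished-but-not-uploaded tasks are still counted as in progress.
    in_progress = running + awaiting_upload
    return max(0, min(requested, limit - in_progress))


# Project A's case: limit of 2, both tasks computed but still uploading.
refused = tasks_server_will_send(limit=2, running=0, awaiting_upload=2, requested=2)

# Same host once the uploads have completed.
granted = tasks_server_will_send(limit=2, running=0, awaiting_upload=0, requested=2)
```

With the values above, the first request yields 0 tasks and the second yields 2, so the outcome depends only on whether the request arrives before or after the uploads complete.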

When B finishes work, either A is asked again (if the backoff has completed),
two tasks are sent, and the process repeats, or A is not even asked (if the
backoff is still in progress) and B is asked again.

The end result: the system runs work from B almost exclusively, even though A
has work available (BOINC just thinks it does not). We increased the job
limits to a number higher than the minimum threshold and the issue seems to
have disappeared.

--
Pagarbiai / Sincerely
Rytis Slatkevičius
+370 670 77777


_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
