To my mind, one of the first steps might be to revisit the discussion JM VII 
and I had just a week or so ago about the fact that work fetch does not 
consider "packing" the processing "lanes" when assessing whether there is 
enough work on hand.  With the multi-core systems we have now, the current 
mode of linearly adding up the task times on hand is, in almost all cases, 
clearly not adequate to the task.  For example, I have several CPDN tasks on 
hand on my Mac, and this bias leads BOINC to think it has enough work on hand 
when in fact, on an 8-core system, 3 or even 5 CPDN tasks are not sufficient 
to keep all 8 cores busy.  The essentially linear addition of the task times 
gives the system a false sense of security that it has sufficient work on 
hand, time-wise, to keep the system busy ... when in fact it does not ...
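To make the point concrete, here is a small sketch (the function names and numbers are mine, not anything from the BOINC source) contrasting the naive "sum the estimates" view with a packed, per-core view of how long the host actually stays fully busy:

```python
# Hypothetical illustration of why summing task estimates overstates
# how long a multi-core host stays busy.  Names are illustrative only.

def linear_estimate(task_hours, ncpus):
    """The naive view: total estimated work spread evenly over all cores."""
    return sum(task_hours) / ncpus

def packed_estimate(task_hours, ncpus):
    """Greedy longest-first packing: hours until SOME core runs dry."""
    cores = [0.0] * ncpus
    for t in sorted(task_hours, reverse=True):
        cores[cores.index(min(cores))] += t
    # The host is only fully busy until the least-loaded core drains.
    return min(cores)

tasks = [200.0, 200.0, 200.0]     # three long CPDN-style tasks
print(linear_estimate(tasks, 8))  # 75.0 -- looks like days of work on hand
print(packed_estimate(tasks, 8))  # 0.0  -- five cores are idle right now
```

The linear sum says the host has 75 hours of work per core; the packed view shows five of the eight cores have nothing to do at all, which is exactly the false sense of security described above.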

At any rate, the discussion thread was "Queue Size" of April 17 to 19, with 
John's answer of the 19th probably the best summation of the problem and a 
solution.

The case of long downloads I have seen recently myself with BURP, where the 
download was inching along for several days before failing.  I was fortunate 
enough not to run out of work, but if the queue is too small it is easy to get 
into trouble.

However, when you add in GPU processing you have a whole 'nother set of 
issues ... especially in "mixed" systems where, as I have, there can be CPU, 
CUDA, *AND* ATI resources in the same system.  I reported a similar issue some 
while ago, where the system would run my GPU queue dry and only then begin to 
search for new work.  Once the new queue was filled I would be good to go 
until that last batch was drained.  **THIS** version of the problem reported 
below, and the one in my prior report "6.10.32 Idle GPU in dual GPU system" of 
Feb 18, may be one and the same ...

In the face of multi-project GPU usage we also have the collision of the 
strict-FIFO rule with project server-side instability, which defeats effective 
RS balancing and proper queuing of tasks across server-side outages.  I don't 
think Richard is seeing this because he runs with a fairly large queue size 
and against projects that normally supply a generous amount of work (SaH and 
GPU Grid; the problem is more commonly seen with MW vs. Collatz).  And I am 
not sure the system adequately addresses multi-resource tasks (Einstein for 
example, and SaH AP tasks for another) where there is simultaneously high 
usage of a GPU and a CPU core ...
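For the multi-resource case, the accounting would have to debit every resource pool a task touches, not just its "primary" one.  A minimal sketch of that idea (the field names and numbers here are hypothetical, not BOINC's actual structures):

```python
# Hypothetical per-resource shortfall accounting for tasks that hold
# a GPU *and* a full CPU core at the same time.  Illustrative only.

from dataclasses import dataclass

@dataclass
class Task:
    est_hours: float
    cpus: float       # CPU cores occupied while running
    gpus: float       # GPUs occupied while running

def shortfall(tasks, cpu_count, gpu_count, buffer_hours):
    """Each task debits EVERY resource it uses, weighted by usage."""
    cpu_work = sum(t.est_hours * t.cpus for t in tasks)
    gpu_work = sum(t.est_hours * t.gpus for t in tasks)
    return {
        "cpu": max(0.0, buffer_hours * cpu_count - cpu_work),
        "gpu": max(0.0, buffer_hours * gpu_count - gpu_work),
    }

# Two SaH-AP-style tasks, each tying up one GPU plus one CPU core:
queue = [Task(est_hours=3.0, cpus=1.0, gpus=1.0)] * 2
print(shortfall(queue, cpu_count=8, gpu_count=2, buffer_hours=4.0))
# {'cpu': 26.0, 'gpu': 2.0}
```

The point is that the same six task-hours show up in *both* pools; an accounting that charges them only to the GPU leaves the CPU shortfall overstated, and vice versa.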

But the real solution I feel is otherwise indicated...

I know it is not a popular take on the issue(s), but for the past year or so 
quite a few of us (myself, Richard, JM VII, and others) have been hammering on 
the deficiencies of the triad of the RR Sim, Resource Scheduler, and Work 
Fetch modules, attempting to highlight all of the different ways that they are 
failing to operate correctly.  This latest report by Janus is merely one more 
issue on that whole pile.

As is Richard's recent post on STD "leaking" and a probable imbalance in the 
RS calculation ... and the many, many others of the past few months ...

Perhaps the real solution would be to go back and review this history, attempt 
to devise a more comprehensive repair strategy, and then do a more fundamental 
tuning of the triad.  And, in the meantime, try to fix the reporting so that 
the debug dumps are more comprehensible ... I, for one, can hardly understand 
the debug outputs, both because of the basic formatting (for which I submitted 
a suggested change, not yet implemented; see my thread "Changeset 21335" on 
the Alpha list) and because of the data content, which seems to become more 
confusing with each revision adding data outputs to each debug statement ...

With a more general overhaul of these modules, including an update of the 
debug-data reporting, we could then go back to the prior cases and use them to 
prove that we have in fact cured these issues ...  Of course, for this to work 
there has to be general agreement that fundamental change may be the order of 
the day; otherwise there may be little point in starting ...

On May 10, 2010, at 9:17 AM, David Anderson wrote:

> What you describe is how things work currently;
> the work fetch policy treats downloading jobs as if they were downloaded.
> This prevents infinite work fetch,
> but as you point out it can lead to processor starvation.
> If BOINC is to be used for truly data-intensive problems we'll need
> to address this issue.
> 
> I can think of some ways to relax the policy,
> but they're a little tricky.
> If anyone has a good idea let me know.
> 
> -- David
> 
> Janus Kristensen wrote:
>> Hey all
>> 
>> I'm forwarding a report concerning the client and possible issues with 
>> lengthy or failing downloads. I haven't had the time to verify it so 
>> just ignore it if this is not (or no longer is) the case.
>> 
>> Situation:
>> 1) Client attached to multiple projects
>> 2) Cache set to X days of work
>> 3) A project releases workunits with downloads that take around X days 
>> to get (slow, big or failing, the issue is naturally more pressing for 
>> clients with smaller values of X.)
>> 4) Client is registered to some WUs and starts downloading - the 
>> scheduling mechanism is satisfied
>> 5) Client eventually runs out of work and stalls until the download 
>> completes or fails entirely.
>> 
>> Problem:
>> The client isn't doing anything. CPU utilization is 0%.
>> 
>> Expected behaviour:
>> The client would fetch and work on something else while completing the 
>> lengthy download.
>> 
>> Difficulties:
>> Detecting the situation and determining the course of action (finding 
>> alternative work).
>> 
>> 
>> -- Janus
>> 
>> 
>> _______________________________________________
>> boinc_dev mailing list
>> [email protected]
>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>> To unsubscribe, visit the above URL and
>> (near bottom of page) enter your email address.
> 

