Re: review of std.parallelism

Andrei Alexandrescu Sat, 19 Mar 2011 13:41:27 -0700

On 03/19/2011 12:16 PM, dsimcha wrote:

On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:

On 03/19/2011 02:32 AM, dsimcha wrote:

Ok, thanks again for clarifying **how** the docs could be improved. I've
implemented the suggestions and generally given the docs a good reading
over and clean up. The new docs are at:


http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html


* Still no synopsis example that illustrates in a catchy way the most
attractive artifacts.


I don't see what I could put here that isn't totally redundant with the
rest of the documentation. Anything I could think of would basically
just involve concatentating all the examples. Furthermore, none of the
other Phobos modules have this, so I don't know what one should look
like.


I'm thinking along the lines of:

http://www.digitalmars.com/d/2.0/phobos/std_exception.html

A nice synopsis would be the pi computation. Just move that up to thesynopsis. It's simple, clean, and easy to relate to. Generally, you'dput here not all details but the stuff you think would make it easiestfor people to get into your library.

In general I feel like std.parallelism is being held to a
ridiculously high standard that none of the other Phobos modules
currently meet.

I agree, but that goes without saying. This is not a double standard;it's a simple means to improve quality of Phobos overall. Clearlycertain modules that are already in Phobos would not make it if proposedtoday. And that's a good thing! Comparing our current ambitions to thequality of the average Phobos module would not help us.

Generally it seems you have grown already tired of the exchange and itwould take only a few more exchanges for you to say, "you know what,either let's get it over it or forget about it."

Please consider for a minute how this is the wrong tone and attitude tobe having on several levels. This is not a debate and not the place toget defensive. Your role in the discussion is not symmetric with theothers' and at best you'd use the review as an opportunity to improveyour design, not stick heels in the ground to defend its currentincarnation (within reason). Your potential customers are attempting tohelp you by asking questions (some of which no doubt are silly) and bymaking suggestions (some of which, again, are ill-founded).Nevertheless, we _are_ your potential customers and in a way thecustomer is always right. Your art is to steer customers from what theythink they want to what you know they need - because you're the expert!- and to improve design, nomenclature, and implementation to the extentthat would help them.

Furthermore, you should expect that the review process will promptchanges. My perception is that you consider the submission more or lessfinal modulo possibly a few minor nits. You shouldn't. I'm convinced youknow much more about SMP than most or all others in this group, but inno way that means your design has reached perfection and is beyondimprovement even from a non-expert.

I know you'd have no problem finding the right voice in this discussionif you only frame it in the right light. Again, people are trying tohelp (however awkwardly) and in no way is that ridiculous.

* "After creation, Task objects are submitted to a TaskPool for
execution." I understand it's possible to use Task straight as a
promise/future, so s/are/may be/.


No. The only way Task is useful is by submitting it to a pool to be
executed. (Though this may change, see below.)

I very much hope this does change. Otherwise the role of Task in thedesign could be drastically reduced (e.g. nested type inside ofTaskPool) without prejudice. At the minimum I want to be able to createa task, launch it, and check its result later without involving a pool.A pool is when I have many tasks that may exceed the number of CPUs etc.Simplicity would be great.


// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
auto foo = readFoo.yieldWait();
auto bar = readBar.yieldWait();
auto baz = readBaz.yieldWait();

There's no need at this level for a task pool. What would be nice wouldbe to have a join() that joins all tasks spawned by the current thread:


// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
join();
// fetch results
auto foo = readFoo.spinWait();
auto bar = readBar.spinWait();
auto baz = readBaz.spinWait();

The way I see it is, task pools are an advanced means that coordinate mthreads over n CPUs. If I don't care about that (as above) there shouldbe no need for a pool at all. (Of course it's fine if used by theimplementation.)

Also it is my understanding that
TaskPool efficiently adapts the concrete number of CPUs to an arbitrary
number of tasks. If that's the case, it's worth mentioning.


Isn't this kind of obvious from the examples, etc.?

It wasn't to me. I had an intuition that that might be the case, butobvious - far from it. The fact that I need a task pool even when Idon't care about m:n issues made me doubtful.

* "If a Task has been submitted to a TaskPool instance, is being stored
in a stack frame, and has not yet finished, the destructor for this
struct will automatically call yieldWait() so that the task can finish
and the stack frame can be destroyed safely." At this point in the doc
the reader doesn't understand that at all because TaskPool has not been
seen yet. The reader gets worried that she'll be essentially serializing
the entire process by mistake. Either move this explanation down or
provide an example.


This is getting ridiculous. There are too many circular dependencies
between Task and TaskPool that are impossible to remove here that I'm
not even going to try. One or the other has to be introduced first, but
neither can be explained without mentioning the other. This is why I
explain the relationship briefly in the module level summary, so that
the user has at least some idea. I think this is about the best I can do.

Perhaps cross-references could help (see macro XREF). As I mentioned, anexample here could go a long way, too.

* The example that reads two files at the same time should NOT use
taskPool. It's just one task, why would the pool ever be needed? If you
also provided an example that reads n files in memory at the same time
using a pool, that would illustrate nicely why you need it. If a Task
can't be launched without being put in a pool, there should be a
possibility to do so. At my work we have a simple function called
callInNewThread that does what's needed to launch a function in a new
thread.


I guess I could add something like this to Task. Under the hood it would
(for implementation simplicity, to reuse a bunch of code from TaskPool)
fire up a new single-thread pool, submit the task, call
TaskPool.finish(), and then return. Since you're already creating a new
thread, the extra overhead of creating a new TaskPool for the thread
would be negligible and it would massively simplify the implementation.
My only concern is that, when combined with scoped versus non-scoped
tasks (which are definitely here to stay, see below) this small
convenience function would add way more API complexity than it's worth.
Besides, task pools don't need to be created explicitly anyhow. That's
what the default instantiation is for. I don't see how callInNewThread
would really solve much.

If we sit down for a second and think outside of the current design, thesimplest Task use would be as such: "I start a task and get a sort of afuture. I do something else. Then when I nede the result of the future Icall some method force() against it." Nowhere is the notion of a taskpool to be found. To me this looks like an artifact of theimplementation has spilled into the API, which I understand how isnatural to you. But also it's unnatural to me.

Secondly, I think you're reading **WAY** too much into what was meant to
be a simple example to illustrate usage mechanics. This is another case
where I can't think of a small, cute example of where you'd really need
the pool. There are plenty of larger examples, but the smallest/most
self-contained one I can think of is a parallel sort. I decided to use
file reading because it was good enough to illustrate the mechanics of
usage, even if it didn't illustrate a particularly good use case.

It's impossible to not have a good small example. Sorting is great. Youhave the partition primitive already in std.algorithm, then off you gowith tasks. Dot product on dense vectors is another good one. There'sjust plenty of operations that people understand are important to make fast.


So to me it looks like this:

* there are already some good examples of "I want 1-3 of tasks but don'tcare about pooling them": file reading and processing etc.

* there should be 1-2 good example of partitioning a large chunk of workinto many tasks and let taskPool carry them.

With such, people would clearly understand the capabilities of thelibrary and what they can achieve with various levels of involvement.

* The note below that example gets me thinking: it is an artificial
limitation to force users of Task to worry about scope and such. One
should be able to create a Future object (Task I think in your
terminology), pass it around like a normal value, and ultimately force
it. This is the case for all other languages that implement futures. I
suspect the "scope" parameter associated with the delegate a couple of
definitions below plays a role here, but I think we need to work for
providing the smoothest interface possible (possibly prompting
improvements in the language along the way).


This is what TaskPool.task is for. Maybe this should be moved to the top
of the definition of TaskPool and emphasized, and the scoped/stack
allocated versions should be moved below TaskPool and de-emphasized?

At any rate, though, anyone who's playing with parallelism should be an
advanced enough programmer that concepts like scope and destructors are
second nature.

I agree. That's not a reason to not find good names for entities. To methe relationship "global task() <=> scoped delegate; TaskPool methodtask() <=> closure" is inscrutable. I see no way of remembering itexcept rote memorization.

We absolutely **need** scope delegates for calling stuff where a closure
can't be allocated due to objects on the stack frame having scoped
destruction. This is not going anywhere. Also, I **much** prefer to have
everything that does conceptually the same thing have the same name. I
think this is clearer, not less clear.


I agree in principle, but this case is just not that.

* As I mentioned, in the definition:

Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args);

I can't tell what "run" is.


run() is just the adapter function described right above this code. I
cannot for the life of me understand how this could possibly be unclear.

OK.

* "A goto from inside the parallel foreach loop to a label outside the
loop will result in undefined behavior." Would this be a bug in dmd?


No, it's because a goto of this form has no reasonable, useful
semantics. I should probably mention in the docs that the same applies
to labeled break and continue.

I have no idea what semantics these should have, and even if I did,
given the long odds that even one person would actually need them, I
think they'd be more trouble than they're worth to implement. For
example, once you break out of a parallel foreach loop to some arbitrary
address (and different threads can goto different labels, etc.), well,
it's no longer a parallel foreach loop. It's just a bunch of completely
unstructured threading doing god-knows-what.

Therefore, I slapped undefined behavior on it as a big sign that says,
"Just don't do it." This also has the advantage that, if anyone ever
thinks of any good, clearly useful semantics, these will be
implementable without breaking code later.

Yah, I was actually thinking of disabling goto outside a local delegateeverywhere.

* Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
on a usual machine?


I **STRONGLY** believe this does not belong in API documentation because
it's too machine specific, compiler specific, stack alignment specific,
etc. and almost any benchmark worth doing takes up more space than an
example should. Furthermore, anyone who wants to know this can easily
time it themselves. I have absolutely no intention of including this.
While in general I appreciate and have tried to accommodate your
suggestions, this is one I'll be standing firm on.

If scalability information is present in however a non-committal form,then people would be compelled ("ok, so this shape of the loop wouldactually scale linearly with CPUs... neat").

Speaking of efficiency, I assume parallel foreach uses opApply with adelegate and the inherent overhead. So I'm thinking that a practical wayto take advantage of parallel foreach would be to parallelize at someblock level and then do a regular foreach inside the block?


foreach (i, ref elem; taskPool.parallel(logs)) {
    foreach (???)
        elem = log(i + 1.0);
}

How can I arrange things such that I compute e.g. 64 logs seriallyinside each pass?

* The join example is fine, but the repetitive code suggests that loops
should be better:

import std.file;

void main() {
auto pool = new TaskPool();
foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
pool.put(task!read(filename));
}

// Call join() to guarantee that all tasks are done running, the worker
// threads have terminated and that the results of all of the tasks can
// be accessed without any synchronization primitives.
pool.join();

ubyte[][] data; // void[], meh
// Use spinWait() since the results are guaranteed to have been computed
// and spinWait() is the cheapest of the wait functions.
foreach (task; pool.howDoIEnumerateTasks()) {
data ~= task1.spinWait();
}


You can't enumerate the tasks like this, but there's a good reason for
this limitation. The design requires the pool not hold onto any
references to the tasks after they're out of the queue, so that they can
be safely destroyed as soon as the caller wants to destroy them. If you
need to enumerate over them like this, then:

1. You're probably doing something wrong and parallel foreach would be a
better choice.

2. If you insist on doing this, you need to explicitly store them in an
array or something.

As a more general comment, I know the join() example sucks. join() is
actually a seldom-used primitive, not a frequently used one as you
suggest. Most of the time you wait for individual tasks (explicitly or
implicitly through the higher level functions), not the whole pool, to
finish executing. I feel like it's obligatory that join() and finish()
exist, but I never use them myself. They were much more useful in
pre-release versions of this lib, when parallel foreach didn't exist and
it made sense to have explicit arrays of Tasks. The difficulty I've had
in coming up with a decent example for them makes me halfway tempted to
just rip those $#)*% functions out entirely.

I'm confused here. I use join() pretty much all the time. I launch stuff(e.g. run a large matrix-vector multiplication for distinct vectors) andthen I join that. Then again and again. The thread of execution has arepeated hourglass shape - it fans out and then becomes single-threadedat join points. I assume some of those computations are indeed thecharter of parallel foreach, but I'd guess not all. But then what do Iknow - I'm the customer :o).



Andrei

Re: review of std.parallelism

Reply via email to