On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
On 03/19/2011 12:16 PM, dsimcha wrote:
On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
On 03/19/2011 02:32 AM, dsimcha wrote:
Ok, thanks again for clarifying **how** the docs could be improved.
I've
implemented the suggestions and generally given the docs a good reading
over and clean up. The new docs are at:

http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

* Still no synopsis example that illustrates in a catchy way the most
attractive artifacts.

I don't see what I could put here that isn't totally redundant with the
rest of the documentation. Anything I could think of would basically
just involve concatentating all the examples. Furthermore, none of the
other Phobos modules have this, so I don't know what one should look
like.

I'm thinking along the lines of:

http://www.digitalmars.com/d/2.0/phobos/std_exception.html

A nice synopsis would be the pi computation. Just move that up to the
synopsis. It's simple, clean, and easy to relate to. Generally, you'd
put here not all details but the stuff you think would make it easiest
for people to get into your library.

Good example, will do.

* "After creation, Task objects are submitted to a TaskPool for
execution." I understand it's possible to use Task straight as a
promise/future, so s/are/may be/.

No. The only way Task is useful is by submitting it to a pool to be
executed. (Though this may change, see below.)

I very much hope this does change. Otherwise the role of Task in the
design could be drastically reduced (e.g. nested type inside of
TaskPool) without prejudice. At the minimum I want to be able to create
a task, launch it, and check its result later without involving a pool.
A pool is when I have many tasks that may exceed the number of CPUs etc.
Simplicity would be great.

// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
auto foo = readFoo.yieldWait();
auto bar = readBar.yieldWait();
auto baz = readBaz.yieldWait();

This is definitely feasible in principle. I'd like to implement it, but there's a few annoying, hairy details standing in the way. For reasons I detailed previously, we need both scoped and non-scoped tasks. We also have alias vs. callable (i.e. function pointer or delegate) tasks. Now we're adding pool vs. new-thread tasks. This is turning into a combinatorial explosion and needs to be simplified somehow. I propose the following:

1. I've reconsidered and actually like the idea of task() vs. scopedTask(). task() returns a pointer on the heap. scopedTask() returns a struct on the stack. Neither would be a member function of TaskPool.

2. Non-scoped Task pointers would need to be explicitly submitted to the task pool via the put() method. This means getting rid of TaskPool.task().

3. The Task struct would grow a function runInNewThread() or something similar. (If you think this would be a common case, even just execute() might cut it.)

The work flow would now be that you call task() to get a heap-allocated Task*, or scopedTask to get a stack-allocated Task. You then call either TaskPool.put() to execute it on a pool or Task.runInNewThread() to run it in a new thread. The creation of the Task is completely orthogonal to how it's run.


There's no need at this level for a task pool. What would be nice would
be to have a join() that joins all tasks spawned by the current thread:

// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
join();
// fetch results
auto foo = readFoo.spinWait();
auto bar = readBar.spinWait();
auto baz = readBaz.spinWait();


I don't understand how this would be a substantial improvement over the first example, where you just call yieldWait() on all three. Furthermore, implementing join() as shown in this example would require some kind of central registry of all tasks/worker threads/task pools/something similar, which would be a huge PITA to implement efficiently.


Secondly, I think you're reading **WAY** too much into what was meant to
be a simple example to illustrate usage mechanics. This is another case
where I can't think of a small, cute example of where you'd really need
the pool. There are plenty of larger examples, but the smallest/most
self-contained one I can think of is a parallel sort. I decided to use
file reading because it was good enough to illustrate the mechanics of
usage, even if it didn't illustrate a particularly good use case.

It's impossible to not have a good small example. Sorting is great. You
have the partition primitive already in std.algorithm, then off you go
with tasks. Dot product on dense vectors is another good one. There's
just plenty of operations that people understand are important to make
fast.

I forgot about std.algorithm.partition. This makes a parallel quick sort so trivial to implement (ignoring the issue of selecting a good pivot, which I think can be safely ignored in example code) that it might actually make a good example.



* "A goto from inside the parallel foreach loop to a label outside the
loop will result in undefined behavior." Would this be a bug in dmd?

No, it's because a goto of this form has no reasonable, useful
semantics. I should probably mention in the docs that the same applies
to labeled break and continue.

I have no idea what semantics these should have, and even if I did,
given the long odds that even one person would actually need them, I
think they'd be more trouble than they're worth to implement. For
example, once you break out of a parallel foreach loop to some arbitrary
address (and different threads can goto different labels, etc.), well,
it's no longer a parallel foreach loop. It's just a bunch of completely
unstructured threading doing god-knows-what.

Therefore, I slapped undefined behavior on it as a big sign that says,
"Just don't do it." This also has the advantage that, if anyone ever
thinks of any good, clearly useful semantics, these will be
implementable without breaking code later.

Yah, I was actually thinking of disabling goto outside a local delegate
everywhere.

See the discussion I'm having with Michael Fortin. My latest idea is that break, labeled break/continue, return, and goto should all throw exceptions when found inside a parallel foreach loop. They affect the flow of subsequent iterations, and "subsequent iterations" only makes sense when the loop is being executed in serial.


* Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
on a usual machine?

I **STRONGLY** believe this does not belong in API documentation because
it's too machine specific, compiler specific, stack alignment specific,
etc. and almost any benchmark worth doing takes up more space than an
example should. Furthermore, anyone who wants to know this can easily
time it themselves. I have absolutely no intention of including this.
While in general I appreciate and have tried to accommodate your
suggestions, this is one I'll be standing firm on.

If scalability information is present in however a non-committal form,
then people would be compelled ("ok, so this shape of the loop would
actually scale linearly with CPUs... neat").

Ok, I thought you were asking for something much more rigorous than this. I therefore didn't want to provide it because I figured that, no matter what I did, someone would be able to say that the benchmark is flawed some how, yada, yada, yada. Given how inexact a science benchmarking is, I'm still hesitant to put such results in API docs, but I can see where you're coming from here.


Speaking of efficiency, I assume parallel foreach uses opApply with a
delegate and the inherent overhead. So I'm thinking that a practical way
to take advantage of parallel foreach would be to parallelize at some
block level and then do a regular foreach inside the block?

foreach (i, ref elem; taskPool.parallel(logs)) {
foreach (???)
elem = log(i + 1.0);
}

How can I arrange things such that I compute e.g. 64 logs serially
inside each pass?


Three things here:

1. For this kind of nano-parallelism, map() might be better suited. To fill an existing array, just use map's explicit buffer feature.

taskPool.map!log(iota(1, logs.length + 1), logs);

The option of using map() for nano-parallelism is part of my rationale for keeping the pretty but mildly inefficient delegate call in parallel foreach.

2.  You're severely overestimating the overhead of the delegate call.
log() is a pretty cheap function and even so speedups are decent with parallel foreach compared to a regular foreach loop.

import std.stdio, std.parallelism, std.datetime, std.math;


void main() {
    auto logs = new float[10_000_000];
    taskPool();  // Initialize TaskPool before timing.

    auto sw = StopWatch(autoStart);
    foreach(i, ref elem; logs) {
        elem = log(i + 1);
    }
    writeln("Serial:  ", sw.peek.msecs);

    sw.reset();
    foreach(i, ref elem; parallel(logs)) {
        elem = log(i + 1);
    }
    writeln("Parallel Foreach:  ", sw.peek.msecs);
}


Results:

Serial:  619
Parallel Foreach:  388

I'd include parallel map, too, but for some reason (probably stack alignment or something) including it changes the results for parallel foreach.

3. If you really want to do as you suggest, just make a chunks range or something (maybe this should be in std.range) that iterates over all non-overlapping size N slices of a sliceable range. Use this for the parallel loop, then loop over the individual elements of the slice inside.

I'm confused here. I use join() pretty much all the time. I launch stuff
(e.g. run a large matrix-vector multiplication for distinct vectors) and
then I join that. Then again and again. The thread of execution has a
repeated hourglass shape - it fans out and then becomes single-threaded
at join points. I assume some of those computations are indeed the
charter of parallel foreach, but I'd guess not all. But then what do I
know - I'm the customer :o).

Yes, parallel foreach is idiomatic here, or maybe parallel map.

In the pre-release versions of std.parallelism (my experimentation with parallelism libraries in D goes back to mid-2008, over a year before I released the first version as Parallelfuture), only the task primitive existed. I discovered that, in practice, creating Task objects in a loop is almost always a PITA. It can also be inefficient in that, in a naive implementation all of these exist in memory simultaneously when this might be completely unnecessary. I decided to create higher level primitives to handle all the use cases I could think of where you might otherwise want to create Task objects in a loop. If you're explicitly creating Task objects in a loop in std.parallelism, I can just about guarantee that there's an easier and at least equally efficient way to accomplish what you want to accomplish. If there are any important use cases I missed, I'd be much more interested in creating a few more high-level primitives rather than making it easier to work with Task objects created in a loop.

Reply via email to