Re: review of std.parallelism

dsimcha Sat, 19 Mar 2011 19:21:12 -0700

On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:

On 03/19/2011 12:16 PM, dsimcha wrote:

On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:

On 03/19/2011 02:32 AM, dsimcha wrote:

Ok, thanks again for clarifying **how** the docs could be improved. I've
implemented the suggestions and generally given the docs a good reading
over and clean up. The new docs are at:


http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html


* Still no synopsis example that illustrates in a catchy way the most
attractive artifacts.


I don't see what I could put here that isn't totally redundant with the
rest of the documentation. Anything I could think of would basically
just involve concatenating all the examples. Furthermore, none of the
other Phobos modules have this, so I don't know what one should look
like.


I'm thinking along the lines of:

http://www.digitalmars.com/d/2.0/phobos/std_exception.html

A nice synopsis would be the pi computation. Just move that up to the
synopsis. It's simple, clean, and easy to relate to. Generally, you'd
put here not all details but the stuff you think would make it easiest
for people to get into your library.


Good example, will do.

* "After creation, Task objects are submitted to a TaskPool for
execution." I understand it's possible to use Task straight as a
promise/future, so s/are/may be/.


No. The only way Task is useful is by submitting it to a pool to be
executed. (Though this may change, see below.)


I very much hope this does change. Otherwise the role of Task in the
design could be drastically reduced (e.g. nested type inside of
TaskPool) without prejudice. At the minimum I want to be able to create
a task, launch it, and check its result later without involving a pool.
A pool is when I have many tasks that may exceed the number of CPUs etc.
Simplicity would be great.

// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
auto foo = readFoo.yieldWait();
auto bar = readBar.yieldWait();
auto baz = readBaz.yieldWait();

This is definitely feasible in principle. I'd like to implement it, but
there's a few annoying, hairy details standing in the way. For reasons
I detailed previously, we need both scoped and non-scoped tasks. We
also have alias vs. callable (i.e. function pointer or delegate) tasks.
Now we're adding pool vs. new-thread tasks. This is turning into a
combinatorial explosion and needs to be simplified somehow. I propose
the following:

1. I've reconsidered and actually like the idea of task() vs.
scopedTask(). task() returns a pointer on the heap. scopedTask()
returns a struct on the stack. Neither would be a member function of
TaskPool.

2. Non-scoped Task pointers would need to be explicitly submitted to
the task pool via the put() method. This means getting rid of
TaskPool.task().

3. The Task struct would grow a function runInNewThread() or something
similar. (If you think this would be a common case, even just execute()
might cut it.)

The work flow would now be that you call task() to get a heap-allocated
Task*, or scopedTask to get a stack-allocated Task. You then call
either TaskPool.put() to execute it on a pool or Task.runInNewThread()
to run it in a new thread. The creation of the Task is completely
orthogonal to how it's run.
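For illustration, here is a minimal sketch of that work flow written
against std.parallelism as it was later released, not against the
proposal as worded here. Assumptions: the released method is named
executeInNewThread() rather than runInNewThread(), results are fetched
with yieldForce rather than yieldWait(), and sumTo is a hypothetical
workload invented for the example.

```d
import std.parallelism : task, scopedTask, taskPool;

// Hypothetical workload: sum of the integers 1 .. n.
int sumTo(int n)
{
    int total = 0;
    foreach (i; 1 .. n + 1)
        total += i;
    return total;
}

void main()
{
    // Heap-allocated task (a Task*), explicitly submitted to the global pool.
    auto pooled = task!sumTo(100);
    taskPool.put(pooled);

    // Heap-allocated task run in its own thread; no pool is involved.
    auto threaded = task!sumTo(10);
    threaded.executeInNewThread();

    // Stack-allocated task; it must finish before leaving this scope.
    auto scoped = scopedTask!sumTo(4);
    taskPool.put(scoped);

    // Creation is independent of how each task was run; results are
    // collected the same way in all three cases.
    assert(pooled.yieldForce == 5050);
    assert(threaded.yieldForce == 55);
    assert(scoped.yieldForce == 10);
}
```

The point of the sketch is only that creating a task and choosing where
it runs are separate steps.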


There's no need at this level for a task pool. What would be nice would
be to have a join() that joins all tasks spawned by the current thread:

// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
join();
// fetch results
auto foo = readFoo.spinWait();
auto bar = readBar.spinWait();
auto baz = readBaz.spinWait();

I don't understand how this would be a substantial improvement over the
first example, where you just call yieldWait() on all three.
Furthermore, implementing join() as shown in this example would require
some kind of central registry of all tasks/worker threads/task
pools/something similar, which would be a huge PITA to implement
efficiently.

Secondly, I think you're reading **WAY** too much into what was meant to
be a simple example to illustrate usage mechanics. This is another case
where I can't think of a small, cute example of where you'd really need
the pool. There are plenty of larger examples, but the smallest/most
self-contained one I can think of is a parallel sort. I decided to use
file reading because it was good enough to illustrate the mechanics of
usage, even if it didn't illustrate a particularly good use case.


It's impossible to not have a good small example. Sorting is great. You
have the partition primitive already in std.algorithm, then off you go
with tasks. Dot product on dense vectors is another good one. There's
just plenty of operations that people understand are important to make
fast.

I forgot about std.algorithm.partition. This makes a parallel quicksort
so trivial to implement (ignoring the issue of selecting a good
pivot, which I think can be safely ignored in example code) that it
might actually make a good example.
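A rough sketch of what such an example could look like, assuming the
API as std.parallelism was later released (task, taskPool.put,
yieldForce). The serial cutoff of 1024 elements is an arbitrary,
unmeasured choice, and the pivot is simply the middle element.

```d
import std.algorithm : isSorted, partition, sort, swap;
import std.parallelism : task, taskPool;

void parallelSort(int[] data)
{
    // Assumed cutoff: small subarrays are sorted serially.
    enum serialCutoff = 1024;
    if (data.length < serialCutoff)
    {
        sort(data);
        return;
    }

    // Naive pivot selection: park the middle element at the end.
    swap(data[$ / 2], data[$ - 1]);
    auto pivot = data[$ - 1];
    bool lessThanPivot(int elem) { return elem < pivot; }

    // partition returns the right-hand part (elements >= pivot).
    auto greaterEqual = partition!lessThanPivot(data[0 .. $ - 1]);

    // Move the pivot between the two parts.
    swap(data[$ - greaterEqual.length - 1], data[$ - 1]);
    auto less = data[0 .. $ - greaterEqual.length - 1];
    greaterEqual = data[$ - greaterEqual.length .. $];

    // Sort one half as a pool task and the other half in this thread.
    auto recurseTask = task!parallelSort(greaterEqual);
    taskPool.put(recurseTask);
    parallelSort(less);
    recurseTask.yieldForce;
}

void main()
{
    import std.random : uniform;

    auto data = new int[100_000];
    foreach (ref x; data)
        x = uniform(0, 1_000_000);

    parallelSort(data);
    assert(isSorted(data));
}
```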

* "A goto from inside the parallel foreach loop to a label outside the
loop will result in undefined behavior." Would this be a bug in dmd?


No, it's because a goto of this form has no reasonable, useful
semantics. I should probably mention in the docs that the same applies
to labeled break and continue.

I have no idea what semantics these should have, and even if I did,
given the long odds that even one person would actually need them, I
think they'd be more trouble than they're worth to implement. For
example, once you break out of a parallel foreach loop to some arbitrary
address (and different threads can goto different labels, etc.), well,
it's no longer a parallel foreach loop. It's just a bunch of completely
unstructured threading doing god-knows-what.

Therefore, I slapped undefined behavior on it as a big sign that says,
"Just don't do it." This also has the advantage that, if anyone ever
thinks of any good, clearly useful semantics, these will be
implementable without breaking code later.


Yah, I was actually thinking of disabling goto outside a local delegate
everywhere.

See the discussion I'm having with Michael Fortin. My latest idea is
that break, labeled break/continue, return, and goto should all throw
exceptions when found inside a parallel foreach loop. They affect the
flow of subsequent iterations, and "subsequent iterations" only makes
sense when the loop is being executed in serial.

* Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
on a usual machine?


I **STRONGLY** believe this does not belong in API documentation because
it's too machine specific, compiler specific, stack alignment specific,
etc. and almost any benchmark worth doing takes up more space than an
example should. Furthermore, anyone who wants to know this can easily
time it themselves. I have absolutely no intention of including this.
While in general I appreciate and have tried to accommodate your
suggestions, this is one I'll be standing firm on.


If scalability information is present in however a non-committal form,
then people would be compelled ("ok, so this shape of the loop would
actually scale linearly with CPUs... neat").

Ok, I thought you were asking for something much more rigorous than
this. I therefore didn't want to provide it because I figured that, no
matter what I did, someone would be able to say that the benchmark is
flawed somehow, yada, yada, yada. Given how inexact a science
benchmarking is, I'm still hesitant to put such results in API docs, but
I can see where you're coming from here.


Speaking of efficiency, I assume parallel foreach uses opApply with a
delegate and the inherent overhead. So I'm thinking that a practical way
to take advantage of parallel foreach would be to parallelize at some
block level and then do a regular foreach inside the block?

foreach (i, ref elem; taskPool.parallel(logs)) {
foreach (???)
elem = log(i + 1.0);
}

How can I arrange things such that I compute e.g. 64 logs serially
inside each pass?


Three things here:

1. For this kind of nano-parallelism, map() might be better suited. To
fill an existing array, just use map's explicit buffer feature.


taskPool.map!log(iota(1, logs.length + 1), logs);

The option of using map() for nano-parallelism is part of my rationale
for keeping the pretty but mildly inefficient delegate call in parallel
foreach.


2.  You're severely overestimating the overhead of the delegate call.

log() is a pretty cheap function and even so speedups are decent with
parallel foreach compared to a regular foreach loop.


import std.stdio, std.parallelism, std.datetime, std.math;


void main() {
    auto logs = new float[10_000_000];
    taskPool();  // Initialize TaskPool before timing.

    auto sw = StopWatch(autoStart);
    foreach(i, ref elem; logs) {
        elem = log(i + 1);
    }
    writeln("Serial:  ", sw.peek.msecs);

    sw.reset();
    foreach(i, ref elem; parallel(logs)) {
        elem = log(i + 1);
    }
    writeln("Parallel Foreach:  ", sw.peek.msecs);
}


Results:

Serial:  619
Parallel Foreach:  388

I'd include parallel map, too, but for some reason (probably stack
alignment or something) including it changes the results for parallel
foreach.

3. If you really want to do as you suggest, just make a chunks range or
something (maybe this should be in std.range) that iterates over all
non-overlapping size N slices of a sliceable range. Use this for the
parallel loop, then loop over the individual elements of the slice
inside.
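A sketch of the block-level approach from point 3, assuming
std.range.chunks (which exists in current Phobos but was only a
suggestion when this was written) and the free function parallel() from
the released std.parallelism. fillLogs and the default block size of 64
are hypothetical names and values chosen for the example.

```d
import std.math : log;
import std.parallelism : parallel;
import std.range : chunks;

// Fill logs[k] with log(k + 1), handing out one block per work unit.
void fillLogs(double[] logs, size_t blockSize = 64)
{
    // Each chunk is a slice of logs, so writes go to the original array.
    foreach (blockIndex, block; parallel(chunks(logs, blockSize), 1))
    {
        immutable base = blockIndex * blockSize;

        // Ordinary serial loop over the elements of this block.
        foreach (j, ref elem; block)
            elem = log(base + j + 1.0);
    }
}

void main()
{
    auto logs = new double[10_000];
    fillLogs(logs);
    assert(logs[0] == 0.0); // log(1.0)
}
```

The delegate call is then paid once per block rather than once per
element. Whether that is measurably faster than the plain parallel
foreach is not something this sketch demonstrates.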

I'm confused here. I use join() pretty much all the time. I launch stuff
(e.g. run a large matrix-vector multiplication for distinct vectors) and
then I join that. Then again and again. The thread of execution has a
repeated hourglass shape - it fans out and then becomes single-threaded
at join points. I assume some of those computations are indeed the
charter of parallel foreach, but I'd guess not all. But then what do I
know - I'm the customer :o).


Yes, parallel foreach is idiomatic here, or maybe parallel map.

In the pre-release versions of std.parallelism (my experimentation with
parallelism libraries in D goes back to mid-2008, over a year before I
released the first version as Parallelfuture), only the task primitive
existed. I discovered that, in practice, creating Task objects in a
loop is almost always a PITA. It can also be inefficient in that, in a
naive implementation all of these exist in memory simultaneously when
this might be completely unnecessary. I decided to create higher level
primitives to handle all the use cases I could think of where you might
otherwise want to create Task objects in a loop. If you're explicitly
creating Task objects in a loop in std.parallelism, I can just about
guarantee that there's an easier and at least equally efficient way to
accomplish what you want to accomplish. If there are any important use
cases I missed, I'd be much more interested in creating a few more
high-level primitives rather than making it easier to work with Task
objects created in a loop.
