Re: review of std.parallelism

2011-03-21 Thread dsimcha

On 3/21/2011 11:58 AM, dsimcha wrote:

== Quote from dsimcha (dsim...@yahoo.com)'s article

== Quote from Michel Fortin (michel.for...@michelf.com)'s article

On second thought, no, but for practical, not theoretical reasons:
One, you can't introspect whether a foreach loop is using a ref or a
value parameter.  This is an issue with how opApply works.

Indeed a problem. Either we fix the compiler to support that, or we
change the syntax to something like this:
taskPool.apply(range, (ref int value) {
...
});
Or we leave things as they are.

Two, AFAIK there's no way to get the native word size.

Right. That's a problem too... you could probably alleviate this by
doing a runtime check with some fancy instructions to get the native
word size, but I'd expect that to be rather convoluted.
I'd like to check if I understand that well. For instance this code:
int[100] values;
foreach (i, ref value; parallel(values))
value = i;
would normally run fine on a 32-bit processor, but it'd create
low-level a race on a 64-bit processor (even a 64-bit processor running
a 32-bit program in 32-bit compatibility mode). And even that is a
generalization, some 32-bit processors out there *might* have 64-bit
native words. So the code above isn't portable. Is that right?
Which makes me think... we need to document those pitfalls somewhere.
Perhaps std.parallelism's documentation should link to a related page
about what you can and what you can't do safely. People who read that
"all the safeties are off" in std.parallelism aren't going to
understand what you're talking about unless you explain the pitfalls
with actual examples (like the one above).

This problem is **much** less severe than you are suggesting.  x86 can address
single bytes, so it's not a problem even if you're iterating over bytes on a
64-bit machine.  CPUs like Alpha (which no D compiler even exists for) can't
natively address individual bytes.  Therefore, writing to a byte would be
implemented much like writing to a bit is on x86:  You'd read the full word in,
change one byte, write the full word back.  I'm not sure exactly how it would be
implemented at the compiler level, or whether you'd even be allowed to have a
reference to a byte in such an implementation.  This is why I consider this 
more a
theoretical problem than a serious practical issue.


Actually, just remembered that word tearing is also an issue with unaligned 
memory
access.  I guess I could just include a warning that says not to do this with
unaligned memory.


s/is/may be .  I'm trying to read up on word tearing and post questions 
here and to StackOverflow.  Good documentation about it seems 
ridiculously hard to come by.  About the only solid pieces of 
information I've found are that x86 definitely **can** do byte 
granularity (which doesn't mean it always does) and that Java makes 
these guarantees (which is only useful if you're writing in Java).  The 
general gut feeling people here and on StackOverflow seem to have is 
that it's not an issue, but there doesn't seem to be much backing up of 
this.


Maybe I'm just being paranoid and this is a complete non-issue except on 
the most obscure/ancient hardware and/or most stupidly implemented 
compilers, which is why few people even think of it.


Re: review of std.parallelism

2011-03-21 Thread dsimcha
== Quote from dsimcha (dsim...@yahoo.com)'s article
> == Quote from Michel Fortin (michel.for...@michelf.com)'s article
> > > On second thought, no, but for practical, not theoretical reasons:
> > > One, you can't introspect whether a foreach loop is using a ref or a
> > > value parameter.  This is an issue with how opApply works.
> > Indeed a problem. Either we fix the compiler to support that, or we
> > change the syntax to something like this:
> > taskPool.apply(range, (ref int value) {
> > ...
> > });
> > Or we leave things as they are.
> > > Two, AFAIK there's no way to get the native word size.
> > Right. That's a problem too... you could probably alleviate this by
> > doing a runtime check with some fancy instructions to get the native
> > word size, but I'd expect that to be rather convoluted.
> > I'd like to check if I understand that well. For instance this code:
> > int[100] values;
> > foreach (i, ref value; parallel(values))
> > value = i;
> > would normally run fine on a 32-bit processor, but it'd create
> > low-level a race on a 64-bit processor (even a 64-bit processor running
> > a 32-bit program in 32-bit compatibility mode). And even that is a
> > generalization, some 32-bit processors out there *might* have 64-bit
> > native words. So the code above isn't portable. Is that right?
> > Which makes me think... we need to document those pitfalls somewhere.
> > Perhaps std.parallelism's documentation should link to a related page
> > about what you can and what you can't do safely. People who read that
> > "all the safeties are off" in std.parallelism aren't going to
> > understand what you're talking about unless you explain the pitfalls
> > with actual examples (like the one above).
> This problem is **much** less severe than you are suggesting.  x86 can address
> single bytes, so it's not a problem even if you're iterating over bytes on a
> 64-bit machine.  CPUs like Alpha (which no D compiler even exists for) can't
> natively address individual bytes.  Therefore, writing to a byte would be
> implemented much like writing to a bit is on x86:  You'd read the full word 
> in,
> change one byte, write the full word back.  I'm not sure exactly how it would 
> be
> implemented at the compiler level, or whether you'd even be allowed to have a
> reference to a byte in such an implementation.  This is why I consider this 
> more a
> theoretical problem than a serious practical issue.

Actually, just remembered that word tearing is also an issue with unaligned 
memory
access.  I guess I could just include a warning that says not to do this with
unaligned memory.


Re: review of std.parallelism

2011-03-21 Thread dsimcha
== Quote from Michel Fortin (michel.for...@michelf.com)'s article
> > On second thought, no, but for practical, not theoretical reasons:
> > One, you can't introspect whether a foreach loop is using a ref or a
> > value parameter.  This is an issue with how opApply works.
> Indeed a problem. Either we fix the compiler to support that, or we
> change the syntax to something like this:
>   taskPool.apply(range, (ref int value) {
>   ...
>   });
> Or we leave things as they are.
> > Two, AFAIK there's no way to get the native word size.
> Right. That's a problem too... you could probably alleviate this by
> doing a runtime check with some fancy instructions to get the native
> word size, but I'd expect that to be rather convoluted.
> I'd like to check if I understand that well. For instance this code:
>   int[100] values;
>   foreach (i, ref value; parallel(values))
>   value = i;
> would normally run fine on a 32-bit processor, but it'd create
> low-level a race on a 64-bit processor (even a 64-bit processor running
> a 32-bit program in 32-bit compatibility mode). And even that is a
> generalization, some 32-bit processors out there *might* have 64-bit
> native words. So the code above isn't portable. Is that right?
> Which makes me think... we need to document those pitfalls somewhere.
> Perhaps std.parallelism's documentation should link to a related page
> about what you can and what you can't do safely. People who read that
> "all the safeties are off" in std.parallelism aren't going to
> understand what you're talking about unless you explain the pitfalls
> with actual examples (like the one above).

This problem is **much** less severe than you are suggesting.  x86 can address
single bytes, so it's not a problem even if you're iterating over bytes on a
64-bit machine.  CPUs like Alpha (which no D compiler even exists for) can't
natively address individual bytes.  Therefore, writing to a byte would be
implemented much like writing to a bit is on x86:  You'd read the full word in,
change one byte, write the full word back.  I'm not sure exactly how it would be
implemented at the compiler level, or whether you'd even be allowed to have a
reference to a byte in such an implementation.  This is why I consider this 
more a
theoretical problem than a serious practical issue.


> > Unfortunately, I'm going to have to take a hard line on this one.  The
> > issue of integrating std.parallelism into the race safety system had
> > been discussed a bunch in the past and it was basically agreed that
> > std.parallelism is a "here be dragons" module that cannot reasonably be
> > made to conform to such a model.
> About the "cannot reasonably be made to conform to such a model" part:
> that is certainly true today, but might turn out not to be true as the
> model evolves. It certainly is true as long as we don't have shared
> delegates. Beyond that it becomes more fuzzy. The final goal of the
> module shouldn't be to bypass safeties but to provide good parallelism
> primitives. By all means, if safeties can be enabled reasonably they
> should be (but as of today, they can't).

I'll agree that, if the model evolves sufficiently, everything I've said 
deserves
to be reconsidered.  My comments only apply to the model as it exists currently 
or
will exist in the foreseeable future.  It's just that I have **serious** doubts
that it will evolve much and I don't want to delay things to have an academic
discussion about what-ifs with regard to this.

> And I totally agree with you that it's quite silly to require casts
> everywhere to use synchronized. I'll be the first to admit that it's
> hard to see synchronized classes being very practical as they're
> implemented today. There's room for improvements there too.

This is the point I'm trying to make:  Requiring these casts is perfectly
reasonable if you assume that shared state is going to be rare, as it is in the
std.concurrency model.  When the user is looking for a more fine-grained, more
shared state-heavy paradigm, I don't think it's possible to make it safe without
also making it virtually unusable.  This is why we punted and decided to go with
message passing and very limited shared data as the flagship multithreading 
model.
 On the other hand, in a systems language, we can't virtually prohibit shared
state-heavy multithreading by making it virtually unusable.  Thus, the no low
level race guarantee needs to apply to std.concurrency only.


Re: review of std.parallelism

2011-03-21 Thread Michel Fortin

On 2011-03-21 09:50:09 -0400, dsimcha  said:


On 3/21/2011 8:37 AM, Michel Fortin wrote:

Well, it'll work irrespective of whether shared delegates are used or
not. I think you could add a compile-time check that the array element
size is a multiple of the word size when the element is passed by ref in
the loop and leave the clever trick as a possible future improvements.
Would that work?


On second thought, no, but for practical, not theoretical reasons:  
One, you can't introspect whether a foreach loop is using a ref or a 
value parameter.  This is an issue with how opApply works.


Indeed a problem. Either we fix the compiler to support that, or we 
change the syntax to something like this:


taskPool.apply(range, (ref int value) {
...
});

Or we leave things as they are.


Two, AFAIK there's no way to get the native word size.


Right. That's a problem too... you could probably alleviate this by 
doing a runtime check with some fancy instructions to get the native 
word size, but I'd expect that to be rather convoluted.


I'd like to check if I understand that well. For instance this code:

int[100] values;
foreach (i, ref value; parallel(values))
value = i;

would normally run fine on a 32-bit processor, but it'd create 
low-level a race on a 64-bit processor (even a 64-bit processor running 
a 32-bit program in 32-bit compatibility mode). And even that is a 
generalization, some 32-bit processors out there *might* have 64-bit 
native words. So the code above isn't portable. Is that right?


Which makes me think... we need to document those pitfalls somewhere. 
Perhaps std.parallelism's documentation should link to a related page 
about what you can and what you can't do safely. People who read that 
"all the safeties are off" in std.parallelism aren't going to 
understand what you're talking about unless you explain the pitfalls 
with actual examples (like the one above).



Unfortunately, I'm going to have to take a hard line on this one.  The 
issue of integrating std.parallelism into the race safety system had 
been discussed a bunch in the past and it was basically agreed that 
std.parallelism is a "here be dragons" module that cannot reasonably be 
made to conform to such a model.


About the "cannot reasonably be made to conform to such a model" part: 
that is certainly true today, but might turn out not to be true as the 
model evolves. It certainly is true as long as we don't have shared 
delegates. Beyond that it becomes more fuzzy. The final goal of the 
module shouldn't be to bypass safeties but to provide good parallelism 
primitives. By all means, if safeties can be enabled reasonably they 
should be (but as of today, they can't).


And I totally agree with you that it's quite silly to require casts 
everywhere to use synchronized. I'll be the first to admit that it's 
hard to see synchronized classes being very practical as they're 
implemented today. There's room for improvements there too.



Given the choice (and I hope I'm not forced to make this choice) I'd 
rather std.parallelism be a third-party module that's actually usable 
than a Phobos module that is such a PITA to use that noone does in 
practice.


Which is understandable.

--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/



Re: review of std.parallelism

2011-03-21 Thread dsimcha

On 3/21/2011 8:37 AM, Michel Fortin wrote:

Well, it'll work irrespective of whether shared delegates are used or
not. I think you could add a compile-time check that the array element
size is a multiple of the word size when the element is passed by ref in
the loop and leave the clever trick as a possible future improvements.
Would that work?


On second thought, no, but for practical, not theoretical reasons:  One, 
you can't introspect whether a foreach loop is using a ref or a value 
parameter.  This is an issue with how opApply works.  Two, AFAIK there's 
no way to get the native word size.




I'd go a little further. If the guarantees that shared was supposed to
provide are strong, i.e. apply no matter what threading module is
used, then I utterly despise it. It's one of the worst decisions made
in the design of D. Making things pedantically strict, so that the
type system gets in the way more than it helps, encourages the user to
reflexively circumvent the type system without thinking hard about
doing this, thus defeating its purpose. (The alternative of always
complying with what the type system "expects" you to do is too
inflexible to even be worth considering.) Type systems should err on
the side of accepting a superset of what's correct and treating code
as innocent until proven guilty, not the other way around. I still
believe this even if some of the bugs it could be letting pass through
might be very difficult to debug. See the discussion we had a few
weeks ago about implicit integer casting and porting code to 64.


I agree with you that this is a serious problem. I think part of why it
hasn't been talked much yet is that nobody is currently using D2
seriously for multithreaded stuff at this time (apart from you I guess),
so we're missing experience with it. Andrei seems to think it's fine to
required casts as soon as you need to protect something beyond an
indirection inside synchronized classes, with the mitigation measure
that you can make classes share their mutex (not implemented yet I
think) so if the indirection leads to a class it is less of a problem.
Personally, I don't.



My excuse for std.parallelism is that it's pedal-to-metal parallelism,
so it's more acceptable for it to be dangerous than general case
concurrency. IMHO when you use the non-@safe parts of std.parallelism
(i.e. most of the library), that's equivalent to casting away shared
in a whole bunch of places. Typing "import std.parallelism;" in a
non-@safe module is an explicit enough step here.


I still think this "pedal-to-metal" qualification needs to be justified.
Not having shared delegates in the language seems like an appropriate
justification to me. Wanting to bypass casts you normally have to do
around synchronized as the sole reason seems like a bad justification to
me.

It's not that I like how synchronized works, it's just that I think it
should work the same everywhere.


This is where you and I disagree.  I think that the type system's 
guarantees should be weak, i.e. only apply to std.concurrency.  IMHO the 
strictness is reasonable when using message passing as your primary 
method of multithreading and only very little shared state.  However, 
it's completely unreasonable if you want to use a paradigm where shared 
state is more heavily used.  D, being a systems language, needs to allow 
other styles of multithreading without making them a huge PITA that 
requires casts everywhere.



The guarantee is still preserved that, if you only use std.concurrency
(D's flagship "safe" concurrency module) for multithreading and don't
cast away shared, there can be no low level data races. IMHO this is
still a substantial accomplishment in that there exists a way to do
safe, statically checkable concurrency in D, even if it's not the
**only** way concurrency can be done. BTW, core.thread can also be
used to get around D's type system, not just std.parallelism. If you
want to check that only safe concurrency is used, importing
std.parallelism and core.thread can be grepped just as easily as
casting away shared.


Unless I'm mistaken, the only thing that bypasses race-safety in
core.thread is the Thread constructor that takes a delegate. Which means
it could easily be made race-safe by making that delegate parameter
shared (once shared delegates are implemented).


And then you'd only need one cast to break it if you wanted to, not 
casts everywhere.  Just cast an unshared delegate to shared when passing 
it to core.thread.





If, on the other hand, the guarantees of shared are supposed to be
weak in that they only apply to programs where only std.concurrency is
used for multithreading, then I think strictness is the right thing to
do. The whole point of std.concurrency is to give strong guarantees,
but if you prefer more dangerous but more flexible multithreading,
other paradigms should be readily available.


I think the language as a whole is designed to have strong guaranties,
otherwise synchronized classes 

Re: review of std.parallelism

2011-03-21 Thread Michel Fortin

On 2011-03-20 23:21:49 -0400, dsimcha  said:


On 3/20/2011 10:44 PM, Michel Fortin wrote:


I don't see a problem with the above. The array elements you modify are
passed through parallel's opApply which can check easily whether it's
safe or not to pass them by ref to different threads (by checking the
element's size) and allow or disallow the operation accordingly.

It could even do a clever trick to make it safe to pass things such as
elements of array of bytes by ref (by coalescing loop iterations for all
bytes sharing the same word into one task). That might not work for
ranges which are not arrays however.

That said, feel free to suggest more problematic examples.


Ok, I completely agree in principle, though I question whether it's 
worth actually implementing something like this, especially until we 
get some kind of support for shared delegates.


Well, it'll work irrespective of whether shared delegates are used or 
not. I think you could add a compile-time check that the array element 
size is a multiple of the word size when the element is passed by ref 
in the loop and leave the clever trick as a possible future 
improvements. Would that work?




Also, your example can be trivially modified to be safe.

void main() {
int sum = 0;
foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
synchronized sum += value;
}
writeln(sum);
}

In this case that kills all parallelism, but in more realistic cases I
use this pattern often. I find it very common to have an expensive
loop body can be performed in parallel, except for a tiny portion that
must update a shared data structure. I'm aware that it might be
possible, in theory, to write this more formally using reduce() or
something. However:

1. If the portion of the loop that deals with shared data is very
small (and therefore the serialization caused by the synchronized
block is negligible), it's often more efficient to only keep one data
structure in memory and update it concurrently, rather than use
stronger isolation between threads like reduce() does, and have to
maintain one data structure for each thread.

2. In my experience synchronizing on a small portion of the loop body
works very well in practice. My general philosophy is that, in a
library like this, dangerous but useful constructs must be supported
and treated as innocent until proven guilty, not the other way round.


Your second example is not really a good justification of anything. I'll
refer you to how synchronized classes work. It was decided that
synchronized in a class protects everything that is directly stored in
the class. Anything behind an indirection is considered shared by the
compiler. The implication of this is that if you have an array or a
pointer to something that you want semantically to be protected by the
class's mutex, you have to cast things to unshared. It was decided that
things should be safe against low-level races first, and convenience was
relegated as a secondary concern. I don't like it very much, but that's
what was decided and written in TDPL.


I'd go a little further.  If the guarantees that shared was supposed to 
provide are strong, i.e. apply no matter what threading module is used, 
then I utterly despise it.  It's one of the worst decisions made in the 
design of D.  Making things pedantically strict, so that the type 
system gets in the way more than it helps, encourages the user to 
reflexively circumvent the type system without thinking hard about 
doing this, thus defeating its purpose.  (The alternative of always 
complying with what the type system "expects" you to do is too 
inflexible to even be worth considering.)  Type systems should err on 
the side of accepting a superset of what's correct and treating code as 
innocent until proven guilty, not the other way around.  I still 
believe this even if some of the bugs it could be letting pass through 
might be very difficult to debug.  See the discussion we had a few 
weeks ago about implicit integer casting and porting code to 64.


I agree with you that this is a serious problem. I think part of why it 
hasn't been talked much yet is that nobody is currently using D2 
seriously for multithreaded stuff at this time (apart from you I 
guess), so we're missing experience with it. Andrei seems to think it's 
fine to required casts as soon as you need to protect something beyond 
an indirection inside synchronized classes, with the mitigation measure 
that you can make classes share their mutex (not implemented yet I 
think) so if the indirection leads to a class it is less of a problem. 
Personally, I don't.



My excuse for std.parallelism is that it's pedal-to-metal parallelism, 
so it's more acceptable for it to be dangerous than general case 
concurrency.  IMHO when you use the non-@safe parts of std.parallelism 
(i.e. most of the library), that's equivalent to casting away shared in 
a whole bunch of places.  Typing "import std.parallelism;" in a 
non-@safe module is an ex

Re: review of std.parallelism

2011-03-20 Thread dsimcha

On 3/20/2011 10:44 PM, Michel Fortin wrote:


I don't see a problem with the above. The array elements you modify are
passed through parallel's opApply which can check easily whether it's
safe or not to pass them by ref to different threads (by checking the
element's size) and allow or disallow the operation accordingly.

It could even do a clever trick to make it safe to pass things such as
elements of array of bytes by ref (by coalescing loop iterations for all
bytes sharing the same word into one task). That might not work for
ranges which are not arrays however.

That said, feel free to suggest more problematic examples.



Ok, I completely agree in principle, though I question whether it's 
worth actually implementing something like this, especially until we get 
some kind of support for shared delegates.






Also, your example can be trivially modified to be safe.

void main() {
int sum = 0;
foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
synchronized sum += value;
}
writeln(sum);
}

In this case that kills all parallelism, but in more realistic cases I
use this pattern often. I find it very common to have an expensive
loop body can be performed in parallel, except for a tiny portion that
must update a shared data structure. I'm aware that it might be
possible, in theory, to write this more formally using reduce() or
something. However:

1. If the portion of the loop that deals with shared data is very
small (and therefore the serialization caused by the synchronized
block is negligible), it's often more efficient to only keep one data
structure in memory and update it concurrently, rather than use
stronger isolation between threads like reduce() does, and have to
maintain one data structure for each thread.

2. In my experience synchronizing on a small portion of the loop body
works very well in practice. My general philosophy is that, in a
library like this, dangerous but useful constructs must be supported
and treated as innocent until proven guilty, not the other way round.


Your second example is not really a good justification of anything. I'll
refer you to how synchronized classes work. It was decided that
synchronized in a class protects everything that is directly stored in
the class. Anything behind an indirection is considered shared by the
compiler. The implication of this is that if you have an array or a
pointer to something that you want semantically to be protected by the
class's mutex, you have to cast things to unshared. It was decided that
things should be safe against low-level races first, and convenience was
relegated as a secondary concern. I don't like it very much, but that's
what was decided and written in TDPL.


I'd go a little further.  If the guarantees that shared was supposed to 
provide are strong, i.e. apply no matter what threading module is used, 
then I utterly despise it.  It's one of the worst decisions made in the 
design of D.  Making things pedantically strict, so that the type system 
gets in the way more than it helps, encourages the user to reflexively 
circumvent the type system without thinking hard about doing this, thus 
defeating its purpose.  (The alternative of always complying with what 
the type system "expects" you to do is too inflexible to even be worth 
considering.)  Type systems should err on the side of accepting a 
superset of what's correct and treating code as innocent until proven 
guilty, not the other way around.  I still believe this even if some of 
the bugs it could be letting pass through might be very difficult to 
debug.  See the discussion we had a few weeks ago about implicit integer 
casting and porting code to 64.


My excuse for std.parallelism is that it's pedal-to-metal parallelism, 
so it's more acceptable for it to be dangerous than general case 
concurrency.  IMHO when you use the non-@safe parts of std.parallelism 
(i.e. most of the library), that's equivalent to casting away shared in 
a whole bunch of places.  Typing "import std.parallelism;" in a 
non-@safe module is an explicit enough step here.


The guarantee is still preserved that, if you only use std.concurrency 
(D's flagship "safe" concurrency module) for multithreading and don't 
cast away shared, there can be no low level data races.  IMHO this is 
still a substantial accomplishment in that there exists a way to do 
safe, statically checkable concurrency in D, even if it's not the 
**only** way concurrency can be done.  BTW, core.thread can also be used 
to get around D's type system, not just std.parallelism.  If you want to 
check that only safe concurrency is used, importing std.parallelism and 
core.thread can be grepped just as easily as casting away shared.


If, on the other hand, the guarantees of shared are supposed to be weak 
in that they only apply to programs where only std.concurrency is used 
for multithreading, then I think strictness is the right thing to do. 
The whole point of std.concurrency is to give strong guar

Re: review of std.parallelism

2011-03-20 Thread Michel Fortin

On 2011-03-20 11:42:04 -0400, dsimcha  said:


On 3/19/2011 2:14 PM, Michel Fortin wrote:

On 2011-03-19 13:20:00 -0400, dsimcha  said:


On 3/19/2011 1:09 PM, Michel Fortin wrote:

For instance:

void main() {
int sum = 0;
foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
sum += value;
}
writeln(sum);
}

The "+=" would need to be an atomic operation to avoid low-level races.

I think that ParallelForeach's opApply should only accept a shared
delegate. I define shared delegate as a delegate that does not reference
any non-shared variables of its outer scope. The problem is that DMD
currently doesn't know how to determine whether a delegate literal is
shared or not, thus a delegate literal is never shared and if
ParallelForeach's opApply asked a shared delegate as it should it would
just not work. Fix DMD to create shared delegate literals where
appropriate and everything can be guarantied race-free.


If you want pedal-to-metal parallelism without insane levels of
verbosity, you can't have these safety features.


I'm not sure where my proposal asks for more verbosity or less
performance. All I can see is a few less casts in std.parallelism and
that it'd disallow the case in my example above that is totally wrong.
Either you're interpreting it wrong or there are things I haven't
thought about (and I'd be happy to know about them).

But by looking at all the examples in the documentation, I cannot find
one that would need to be changed... well, except the one I'll discuss
below.



Ok, I've had some time to think about this.  The following example is 
safe, but wouldn't work if I understand correctly what you're proposing:


auto logs = new double[1_000_000];

foreach(i, ref elem; taskPool.parallel(logs)) {
 elem = log(i + 1.0);
}

The problem is that you're writing to the same array from multiple 
threads, which the compiler would interpret as writing to the same 
variable.  However, you can guarantee that no two threads ever write to 
the same element, so it's safe.


I don't see a problem with the above. The array elements you modify are 
passed through parallel's opApply which can check easily whether it's 
safe or not to pass them by ref to different threads (by checking the 
element's size) and allow or disallow the operation accordingly.


It could even do a clever trick to make it safe to pass things such as 
elements of array of bytes by ref (by coalescing loop iterations for 
all bytes sharing the same word into one task). That might not work for 
ranges which are not arrays however.


That said, feel free to suggest more problematic examples.


Note:  I'm aware of the false sharing issue when writing to adjacent 
memory addresses.  However, when writing to an array this big it occurs 
to a negligible extent.  If you were using a smaller array, then either 
the loop body would be so expensive that the cost of false sharing 
would be negligible, or the whole loop would be too cheap to be worth 
parallelizing.


I'm also aware that word tearing is a concern on some architectures, 
though not x86.  IMHO std.parallelism and its documentation should not 
be pedantic about portability to architectures that a D compiler 
doesn't even exist for yet.


No question there.



Also, your example can be trivially modified to be safe.

void main() {
 int sum = 0;
 foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
  synchronized sum += value;
  }
  writeln(sum);
}

In this case that kills all parallelism, but in more realistic cases I 
use this pattern often.  I find it very common to have an expensive 
loop body can be performed in parallel, except for a tiny portion that 
must update a shared data structure.  I'm aware that it might be 
possible, in theory, to write this more formally using reduce() or 
something.  However:


1.  If the portion of the loop that deals with shared data is very 
small (and therefore the serialization caused by the synchronized block 
is negligible), it's often more efficient to only keep one data 
structure in memory and update it concurrently, rather than use 
stronger isolation between threads like reduce() does, and have to 
maintain one data structure for each thread.


2.  In my experience synchronizing on a small portion of the loop body 
works very well in practice.  My general philosophy is that, in a 
library like this, dangerous but useful constructs must be supported 
and treated as innocent until proven guilty, not the other way round.


Your second example is not really a good justification of anything. 
I'll refer you to how synchronized classes work. It was decided that 
synchronized in a class protects everything that is directly stored in 
the class. Anything behind an indirection is considered shared by the 
compiler. The implication of this is that if you have an array or a 
pointer to something that you want semantically to be protected by the 
class's mutex, you have to cast things to unshared. It w

Re: review of std.parallelism

2011-03-20 Thread Robert Jacques
On Sat, 19 Mar 2011 13:09:23 -0400, Michel Fortin  
 wrote:


On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu  
 said:


* "Most of this module completely subverts..." Vague characterizations  
("most", "completely", "some") don't belong in a technical  
documentation. (For example there's either subversion going on or there  
isn't.) Also, std.concurrency and std.parallelism address different  
needs so there's little competition between them. Better: "Unless  
explicitly marked as $(D @trusted) or $(D @safe), artifacts in this  
module are not provably memory-safe and cannot be used with SafeD. If  
used as documented, memory safety is guaranteed."


Actually, I think this is a bad description of what it subverts. What it  
subverts isn't the memory-safety that SafeD provides, but the safety  
against low-level races that even unsafe D protects against unless you  
cast shared away. For instance:


void main() {
int sum = 0;
foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) 
{
sum += value;
}
writeln(sum);
}

The "+=" would need to be an atomic operation to avoid low-level races.

I think that ParallelForeach's opApply should only accept a shared  
delegate. I define shared delegate as a delegate that does not reference  
any non-shared variables of its outer scope. The problem is that DMD  
currently doesn't know how to determine whether a delegate literal is  
shared or not, thus a delegate literal is never shared and if  
ParallelForeach's opApply asked a shared delegate as it should it would  
just not work. Fix DMD to create shared delegate literals where  
appropriate and everything can be guarantied race-free.




Technically, you'd only need a shared delegate or a const delegate to  
guarantee race safety for parallel foreach.


Re: review of std.parallelism

2011-03-20 Thread dsimcha

On 3/19/2011 10:16 PM, dsimcha wrote:

* Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
on a usual machine?


I **STRONGLY** believe this does not belong in API documentation because
it's too machine specific, compiler specific, stack alignment specific,
etc. and almost any benchmark worth doing takes up more space than an
example should. Furthermore, anyone who wants to know this can easily
time it themselves. I have absolutely no intention of including this.
While in general I appreciate and have tried to accommodate your
suggestions, this is one I'll be standing firm on.


If scalability information is present in however a non-committal form,
then people would be compelled ("ok, so this shape of the loop would
actually scale linearly with CPUs... neat").


Ok, I thought you were asking for something much more rigorous than
this. I therefore didn't want to provide it because I figured that, no
matter what I did, someone would be able to say that the benchmark is
flawed some how, yada, yada, yada. Given how inexact a science
benchmarking is, I'm still hesitant to put such results in API docs, but
I can see where you're coming from here.


I tried putting a few benchmarks in for the performance of parallel 
foreach, map and reduce.  Basically, I just put the timings in for the 
examples on my dual core machine, and for an equivalent serial version 
that is not shown in the interest of brevity but is easily inferred.  I 
didn't do this for the pipelining or future/promise primitives because 
these aren't as capable of automatically taking full advantage of as 
many cores as you can throw at them as map, reduce and foreach.  This 
means that coming up with a benchmark that shows where a good speedup 
can be obtained is a much more difficult art.


Re: review of std.parallelism

2011-03-20 Thread dsimcha

On 3/19/2011 2:14 PM, Michel Fortin wrote:

On 2011-03-19 13:20:00 -0400, dsimcha  said:


On 3/19/2011 1:09 PM, Michel Fortin wrote:

For instance:

void main() {
int sum = 0;
foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
sum += value;
}
writeln(sum);
}

The "+=" would need to be an atomic operation to avoid low-level races.

I think that ParallelForeach's opApply should only accept a shared
delegate. I define shared delegate as a delegate that does not reference
any non-shared variables of its outer scope. The problem is that DMD
currently doesn't know how to determine whether a delegate literal is
shared or not, thus a delegate literal is never shared and if
ParallelForeach's opApply asked a shared delegate as it should it would
just not work. Fix DMD to create shared delegate literals where
appropriate and everything can be guarantied race-free.


If you want pedal-to-metal parallelism without insane levels of
verbosity, you can't have these safety features.


I'm not sure where my proposal asks for more verbosity or less
performance. All I can see is a few less casts in std.parallelism and
that it'd disallow the case in my example above that is totally wrong.
Either you're interpreting it wrong or there are things I haven't
thought about (and I'd be happy to know about them).

But by looking at all the examples in the documentation, I cannot find
one that would need to be changed... well, except the one I'll discuss
below.



Ok, I've had some time to think about this.  The following example is 
safe, but wouldn't work if I understand correctly what you're proposing:


auto logs = new double[1_000_000];

foreach(i, ref elem; taskPool.parallel(logs)) {
elem = log(i + 1.0);
}

The problem is that you're writing to the same array from multiple 
threads, which the compiler would interpret as writing to the same 
variable.  However, you can guarantee that no two threads ever write to 
the same element, so it's safe.


Note:  I'm aware of the false sharing issue when writing to adjacent 
memory addresses.  However, when writing to an array this big it occurs 
to a negligible extent.  If you were using a smaller array, then either 
the loop body would be so expensive that the cost of false sharing would 
be negligible, or the whole loop would be too cheap to be worth 
parallelizing.


I'm also aware that word tearing is a concern on some architectures, 
though not x86.  IMHO std.parallelism and its documentation should not 
be pedantic about portability to architectures that a D compiler doesn't 
even exist for yet.


Also, your example can be trivially modified to be safe.

void main() {
int sum = 0;
foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
 synchronized sum += value;
 }
 writeln(sum);
}

In this case that kills all parallelism, but in more realistic cases I 
use this pattern often.  I find it very common to have an expensive loop 
body can be performed in parallel, except for a tiny portion that must 
update a shared data structure.  I'm aware that it might be possible, in 
theory, to write this more formally using reduce() or something.  However:


1.  If the portion of the loop that deals with shared data is very small 
(and therefore the serialization caused by the synchronized block is 
negligible), it's often more efficient to only keep one data structure 
in memory and update it concurrently, rather than use stronger isolation 
between threads like reduce() does, and have to maintain one data 
structure for each thread.


2.  In my experience synchronizing on a small portion of the loop body 
works very well in practice.  My general philosophy is that, in a 
library like this, dangerous but useful constructs must be supported and 
treated as innocent until proven guilty, not the other way round.


Re: review of std.parallelism

2011-03-20 Thread Lars T. Kyllingstad
On Sat, 19 Mar 2011 17:58:46 -0500, Andrei Alexandrescu wrote:

> On 03/19/2011 04:25 PM, dsimcha wrote:
>> On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
>>> I know you'd have no problem finding the right voice in this
>>> discussion if you only frame it in the right light. Again, people are
>>> trying to help (however awkwardly) and in no way is that ridiculous.
>>
>> Fair enough. Now that I think of it most of my frustration is that
>> these details are only getting looked at now, when I have a week (and
>> an otherwise very busy week) to fix all this stuff, when this module
>> has been officially in review for the past two weeks and unofficially
>> for several months. I would be much more open to this process if the
>> issues raised could be fixed at my leisure rather than on a hard and
>> tight deadline. This is exacerbated by the fact that I have another
>> important, unrelated deadline, also next Friday.
>>
>> At the same time, though, I'm afraid that if we didn't fix a vote date
>> and put some urgency into things, the pace of the reviews would be
>> glacial at best, like it was for the first two weeks of official review
>> and the months of unofficial review.
> 
> Exactly. I understand. Well here's a suggestion. How about we "git
> stash" this review? I recall another submission was vying for the queue
> so we can proceed with that while you have time to operate the changes
> you want. If not, we can simply wait 2-3 weeks and then have you
> resubmit for a shorter process (1-2 weeks) that would gather more
> feedback (hopefully minor that time) and votes.
> 
> Lars?

Sounds good.  I'll make the announcement.

-Lars


Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:

On 03/19/2011 12:16 PM, dsimcha wrote:

On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:

On 03/19/2011 02:32 AM, dsimcha wrote:

Ok, thanks again for clarifying **how** the docs could be improved.
I've
implemented the suggestions and generally given the docs a good reading
over and clean up. The new docs are at:

http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html


* Still no synopsis example that illustrates in a catchy way the most
attractive artifacts.


I don't see what I could put here that isn't totally redundant with the
rest of the documentation. Anything I could think of would basically
just involve concatentating all the examples. Furthermore, none of the
other Phobos modules have this, so I don't know what one should look
like.


I'm thinking along the lines of:

http://www.digitalmars.com/d/2.0/phobos/std_exception.html

A nice synopsis would be the pi computation. Just move that up to the
synopsis. It's simple, clean, and easy to relate to. Generally, you'd
put here not all details but the stuff you think would make it easiest
for people to get into your library.


Good example, will do.


* "After creation, Task objects are submitted to a TaskPool for
execution." I understand it's possible to use Task straight as a
promise/future, so s/are/may be/.


No. The only way Task is useful is by submitting it to a pool to be
executed. (Though this may change, see below.)


I very much hope this does change. Otherwise the role of Task in the
design could be drastically reduced (e.g. nested type inside of
TaskPool) without prejudice. At the minimum I want to be able to create
a task, launch it, and check its result later without involving a pool.
A pool is when I have many tasks that may exceed the number of CPUs etc.
Simplicity would be great.

// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
auto foo = readFoo.yieldWait();
auto bar = readBar.yieldWait();
auto baz = readBaz.yieldWait();


This is definitely feasible in principle.  I'd like to implement it, but 
there's a few annoying, hairy details standing in the way.  For reasons 
I detailed previously, we need both scoped and non-scoped tasks.  We 
also have alias vs. callable (i.e. function pointer or delegate) tasks. 
 Now we're adding pool vs. new-thread tasks.  This is turning into a 
combinatorial explosion and needs to be simplified somehow.  I propose 
the following:


1.  I've reconsidered and actually like the idea of task() vs. 
scopedTask().  task() returns a pointer on the heap.  scopedTask() 
returns a struct on the stack.  Neither would be a member function of 
TaskPool.


2.  Non-scoped Task pointers would need to be explicitly submitted to 
the task pool via the put() method.  This means getting rid of 
TaskPool.task().


3.  The Task struct would grow a function runInNewThread() or something 
similar.  (If you think this would be a common case, even just execute() 
might cut it.)


The work flow would now be that you call task() to get a heap-allocated 
Task*, or scopedTask to get a stack-allocated Task.  You then call 
either TaskPool.put() to execute it on a pool or Task.runInNewThread() 
to run it in a new thread.  The creation of the Task is completely 
orthogonal to how it's run.




There's no need at this level for a task pool. What would be nice would
be to have a join() that joins all tasks spawned by the current thread:

// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
join();
// fetch results
auto foo = readFoo.spinWait();
auto bar = readBar.spinWait();
auto baz = readBaz.spinWait();



I don't understand how this would be a substantial improvement over the 
first example, where you just call yieldWait() on all three. 
Furthermore, implementing join() as shown in this example would require 
some kind of central registry of all tasks/worker threads/task 
pools/something similar, which would be a huge PITA to implement 
efficiently.





Secondly, I think you're reading **WAY** too much into what was meant to
be a simple example to illustrate usage mechanics. This is another case
where I can't think of a small, cute example of where you'd really need
the pool. There are plenty of larger examples, but the smallest/most
self-contained one I can think of is a parallel sort. I decided to use
file reading because it was good enough to illustrate the mechanics of
usage, even if it didn't illustrate a particularly good use case.


It's impossible to not have a good small example. Sorting is great. You
have the partition primitive already in std.algorithm, then off you go
with tasks. Dot product on dense vectors is another good one. There's
just plenty of operations that people understand are important to make
fast.


I forgot about std.algorithm.partition

Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 8:48 PM, Jonathan M Davis wrote:

On Saturday 19 March 2011 17:31:18 dsimcha wrote:

On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:

Furthermore, you should expect that the review process will prompt
changes. My perception is that you consider the submission more or less
final modulo possibly a few minor nits. You shouldn't. I'm convinced you
know much more about SMP than most or all others in this group, but in
no way that means your design has reached perfection and is beyond
improvement even from a non-expert.


In addition the the deadline issues already mentioned and resolved, I
did misunderstand the review process somewhat.  I didn't participate in
the reviews for std.datetime (because I know nothing about what makes a
good date/time lib) or for std.unittest (because I was fairly ambivalent
about it), so I didn't learn anything from them.  I was under the
impression that the module is **expected** to be very close to its final
form and that, if a lot of issues are found, then that basically means
the proposal is going to be rejected.


Both std.datetime and std.unittests underwent a fair number of changes over the
course the review process. A lot of the stuff stayed the same, but a lot of it
changed too. On the whole, though, the end results were much better for it.

- Jonathan M Davis


Please check your newsreader settings.  You've been double-posting a lot 
lately.


Re: review of std.parallelism

2011-03-19 Thread Jonathan M Davis
On Saturday 19 March 2011 17:31:18 dsimcha wrote:
> On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
> > Furthermore, you should expect that the review process will prompt
> > changes. My perception is that you consider the submission more or less
> > final modulo possibly a few minor nits. You shouldn't. I'm convinced you
> > know much more about SMP than most or all others in this group, but in
> > no way that means your design has reached perfection and is beyond
> > improvement even from a non-expert.
> 
> In addition the the deadline issues already mentioned and resolved, I
> did misunderstand the review process somewhat.  I didn't participate in
> the reviews for std.datetime (because I know nothing about what makes a
> good date/time lib) or for std.unittest (because I was fairly ambivalent
> about it), so I didn't learn anything from them.  I was under the
> impression that the module is **expected** to be very close to its final
> form and that, if a lot of issues are found, then that basically means
> the proposal is going to be rejected.

Both std.datetime and std.unittests underwent a fair number of changes over the 
course the review process. A lot of the stuff stayed the same, but a lot of it 
changed too. On the whole, though, the end results were much better for it.

- Jonathan M Davis


Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:

Furthermore, you should expect that the review process will prompt
changes. My perception is that you consider the submission more or less
final modulo possibly a few minor nits. You shouldn't. I'm convinced you
know much more about SMP than most or all others in this group, but in
no way that means your design has reached perfection and is beyond
improvement even from a non-expert.


In addition the the deadline issues already mentioned and resolved, I 
did misunderstand the review process somewhat.  I didn't participate in 
the reviews for std.datetime (because I know nothing about what makes a 
good date/time lib) or for std.unittest (because I was fairly ambivalent 
about it), so I didn't learn anything from them.  I was under the 
impression that the module is **expected** to be very close to its final 
form and that, if a lot of issues are found, then that basically means 
the proposal is going to be rejected.


Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 6:58 PM, Andrei Alexandrescu wrote:

On 03/19/2011 04:25 PM, dsimcha wrote:

On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:

I know you'd have no problem finding the right voice in this discussion
if you only frame it in the right light. Again, people are trying to
help (however awkwardly) and in no way is that ridiculous.


Fair enough. Now that I think of it most of my frustration is that these
details are only getting looked at now, when I have a week (and an
otherwise very busy week) to fix all this stuff, when this module has
been officially in review for the past two weeks and unofficially for
several months. I would be much more open to this process if the issues
raised could be fixed at my leisure rather than on a hard and tight
deadline. This is exacerbated by the fact that I have another important,
unrelated deadline, also next Friday.

At the same time, though, I'm afraid that if we didn't fix a vote date
and put some urgency into things, the pace of the reviews would be
glacial at best, like it was for the first two weeks of official review
and the months of unofficial review.


Exactly. I understand. Well here's a suggestion. How about we "git
stash" this review? I recall another submission was vying for the queue
so we can proceed with that while you have time to operate the changes
you want. If not, we can simply wait 2-3 weeks and then have you
resubmit for a shorter process (1-2 weeks) that would gather more
feedback (hopefully minor that time) and votes.


Sounds like a plan.


Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 7:33 PM, Daniel Gibson wrote:

Am 19.03.2011 22:45, schrieb dsimcha:

IMHO all early termination that affects subsequent loop iterations as well
(break, goto, labeled break and continue, but not regular continue) should just
throw because they make absolutely no sense in a parallel context.


What about some kind of parallel search? You just want to know if something is
there and on first sight you're done, so you want to break out of this iteration
and don't want any new parallel iterations to be done.
(Not sure what should be done with currently running iterations.. should they be
killed or go on until they're done - anyway, the executing thread shouldn't
start a new iteration after that)


It's an interesting suggestion, and I'll give some thought to it, but on 
first glance it sounds **very** difficult to implement efficiently.  To 
avoid doing a lot of extra work after you've found what you need, you'd 
need to use very small work units.  To avoid excessive overhead you'd 
need to use fairly large work units.  The only case I can think of where 
this would be do-able is when evaluating the predicate is very expensive.


Re: review of std.parallelism

2011-03-19 Thread Daniel Gibson
Am 19.03.2011 22:45, schrieb dsimcha:
> IMHO all early termination that affects subsequent loop iterations as well
> (break, goto, labeled break and continue, but not regular continue) should 
> just
> throw because they make absolutely no sense in a parallel context.

What about some kind of parallel search? You just want to know if something is
there and on first sight you're done, so you want to break out of this iteration
and don't want any new parallel iterations to be done.
(Not sure what should be done with currently running iterations.. should they be
killed or go on until they're done - anyway, the executing thread shouldn't
start a new iteration after that)


Re: review of std.parallelism

2011-03-19 Thread Andrei Alexandrescu

On 03/19/2011 04:25 PM, dsimcha wrote:

On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:

I know you'd have no problem finding the right voice in this discussion
if you only frame it in the right light. Again, people are trying to
help (however awkwardly) and in no way is that ridiculous.


Fair enough. Now that I think of it most of my frustration is that these
details are only getting looked at now, when I have a week (and an
otherwise very busy week) to fix all this stuff, when this module has
been officially in review for the past two weeks and unofficially for
several months. I would be much more open to this process if the issues
raised could be fixed at my leisure rather than on a hard and tight
deadline. This is exacerbated by the fact that I have another important,
unrelated deadline, also next Friday.

At the same time, though, I'm afraid that if we didn't fix a vote date
and put some urgency into things, the pace of the reviews would be
glacial at best, like it was for the first two weeks of official review
and the months of unofficial review.


Exactly. I understand. Well here's a suggestion. How about we "git 
stash" this review? I recall another submission was vying for the queue 
so we can proceed with that while you have time to operate the changes 
you want. If not, we can simply wait 2-3 weeks and then have you 
resubmit for a shorter process (1-2 weeks) that would gather more 
feedback (hopefully minor that time) and votes.


Lars?


Also increasing the deadline pressure issue, Michael Fortin just caused
me to rethink the issue of exception handling in parallel foreach. I had
more-or-less working code for this, but I realized it's severely broken
in subtle ways that I've (knock on wood) never actually run into in real
world code. It's gonna take some time to fix. These kinds of issues with
error handling code can very easily slip under the radar in a library
with only a few users, and unfortunately, one has. I definitely know how
to fix it, but I have to implement the fix and it's somewhat
non-trivial, i.e. I'm debugging it now and it's looking like a marathon
debugging session.


Fixing these sooner rather than later would be great, so  all the better 
for suspending the review.


Andrei


Re: review of std.parallelism

2011-03-19 Thread Jonathan M Davis
On Saturday 19 March 2011 09:03:51 Andrei Alexandrescu wrote:
> On 03/19/2011 02:32 AM, dsimcha wrote:
> > Ok, thanks again for clarifying **how** the docs could be improved. I've
> > implemented the suggestions and generally given the docs a good reading
> > over and clean up. The new docs are at:
> > 
> > http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html
> 
> * When using "parallelism" as a common noun, prefix is with a '_' so
> ddoc doesn't underline it.

Oooh. That's a neat trick. I didn't know that you could do that.

- Jonathan M Davis


Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 5:33 PM, Michel Fortin wrote:

On 2011-03-19 15:36:24 -0400, dsimcha  said:


The only problem is that there's no easy, well-documented way to tell
from the return value of opApply whether it was a break, a goto, a
labeled break/continue, etc. This would be implementable only if I
changed the semantics of break to also throw. This might not be a bad
thing (IMHO any type of breaking out of a parallel foreach loop is
just silly) but others had asked for different semantics for break.


It's not that silly.

Essentially, what you'd express like this with a normal function taking
a delegate:

taskPool.apply([1,2,3], (int i) {
if (i == 1)
return;
// do some things
});

you'd express like this in a parallel foreach:

foreach (int i; parallel([1,2,3])) {
if (i == 1)
break;
// do some things
}

It's not following the semantics of break within a foreach, but it's
still useful to be able to return early from a function (from a loop
iteration in this case), so I see the use case for making 'break' do
what it does.



Using continue is well defined behavior and does exactly what I think 
you're suggesting.  The problem with break is that it's supposed to stop 
all subsequent iterations of the loop from being executed, not just end 
the current one early.  This only makes sense in a serial context, where 
"subsequent iterations" is well-defined.  Currently break breaks from 
the current work unit but continues executing all other work units.  I 
put this semantic in because I couldn't think of anything else to make 
it do and someone (I don't remember who) asked for it.  I have never 
encountered a use case for it.


IMHO all early termination that affects subsequent loop iterations as 
well (break, goto, labeled break and continue, but not regular continue) 
should just throw because they make absolutely no sense in a parallel 
context.


Re: review of std.parallelism

2011-03-19 Thread Michel Fortin

On 2011-03-19 15:36:24 -0400, dsimcha  said:


On 3/19/2011 2:35 PM, Michel Fortin wrote:

On 2011-03-19 13:16:02 -0400, dsimcha  said:


* "A goto from inside the parallel foreach loop to a label outside the
loop will result in undefined behavior." Would this be a bug in dmd?


No, it's because a goto of this form has no reasonable, useful
semantics. I should probably mention in the docs that the same applies
to labeled break and continue.

I have no idea what semantics these should have, and even if I did,
given the long odds that even one person would actually need them, I
think they'd be more trouble than they're worth to implement. For
example, once you break out of a parallel foreach loop to some
arbitrary address (and different threads can goto different labels,
etc.), well, it's no longer a parallel foreach loop. It's just a bunch
of completely unstructured threading doing god-knows-what.

Therefore, I slapped undefined behavior on it as a big sign that says,
"Just don't do it." This also has the advantage that, if anyone ever
thinks of any good, clearly useful semantics, these will be
implementable without breaking code later.


I think an improvement over undefined behaviour would be to throw an
exception.


The only problem is that there's no easy, well-documented way to tell 
from the return value of opApply whether it was a break, a goto, a 
labeled break/continue, etc.  This would be implementable only if I 
changed the semantics of break to also throw.  This might not be a bad 
thing (IMHO any type of breaking out of a parallel foreach loop is just 
silly) but others had asked for different semantics for break.


It's not that silly.

Essentially, what you'd express like this with a normal function taking 
a delegate:


taskPool.apply([1,2,3], (int i) {
if (i == 1)
return;
// do some things
});

you'd express like this in a parallel foreach:

foreach (int i; parallel([1,2,3])) {
if (i == 1)
break;
// do some things
}

It's not following the semantics of break within a foreach, but it's 
still useful to be able to return early from a function (from a loop 
iteration in this case), so I see the use case for making 'break' do 
what it does.


My only gripe is that the semantics are distorted. In fact, just making 
foreach parallel distorts its semantics. I was confused earlier about a 
foreach being parallel when it was not, someone could also be confused 
in the other direction, thinking foreach is the normal foreach when it 
actually is parallel. This makes code harder to review. Don't consider 
only my opinion on this, but in my opinion the first form above 
(taskPool.apply) is preferable because you absolutely can't mistake it 
with a normal foreach. And I think it's especially important to make it 
stand out given that the compiler can't check for low-level races.




Also, what happens if one of the tasks throws an exception?


It gets rethrown when yieldWait()/spinWait()/workWait() is called.  In 
the case of the higher-level primitives, it gets re-thrown to the 
calling thread at some non-deterministic point in the execution of 
these functions.  I didn't see the need to document this explicitly 
because it "just works".


That's indeed what I'd expect. But I think it'd still be worth 
mentioning in a short sentense in yeildWait()/spinWait()/workWait()'s 
documentation. It's comforting when the documentation confirms your 
expectations.


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/



Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:

On 03/19/2011 12:16 PM, dsimcha wrote:

On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:

On 03/19/2011 02:32 AM, dsimcha wrote:

Ok, thanks again for clarifying **how** the docs could be improved.
I've
implemented the suggestions and generally given the docs a good reading
over and clean up. The new docs are at:

http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html


* Still no synopsis example that illustrates in a catchy way the most
attractive artifacts.


I don't see what I could put here that isn't totally redundant with the
rest of the documentation. Anything I could think of would basically
just involve concatentating all the examples. Furthermore, none of the
other Phobos modules have this, so I don't know what one should look
like.


I'm thinking along the lines of:

http://www.digitalmars.com/d/2.0/phobos/std_exception.html

A nice synopsis would be the pi computation. Just move that up to the
synopsis. It's simple, clean, and easy to relate to. Generally, you'd
put here not all details but the stuff you think would make it easiest
for people to get into your library.


In general I feel like std.parallelism is being held to a
ridiculously high standard that none of the other Phobos modules
currently meet.


I agree, but that goes without saying. This is not a double standard;
it's a simple means to improve quality of Phobos overall. Clearly
certain modules that are already in Phobos would not make it if proposed
today. And that's a good thing! Comparing our current ambitions to the
quality of the average Phobos module would not help us.

Generally it seems you have grown already tired of the exchange and it
would take only a few more exchanges for you to say, "you know what,
either let's get it over it or forget about it."

Please consider for a minute how this is the wrong tone and attitude to
be having on several levels. This is not a debate and not the place to
get defensive. Your role in the discussion is not symmetric with the
others' and at best you'd use the review as an opportunity to improve
your design, not stick heels in the ground to defend its current
incarnation (within reason). Your potential customers are attempting to
help you by asking questions (some of which no doubt are silly) and by
making suggestions (some of which, again, are ill-founded).
Nevertheless, we _are_ your potential customers and in a way the
customer is always right. Your art is to steer customers from what they
think they want to what you know they need - because you're the expert!
- and to improve design, nomenclature, and implementation to the extent
that would help them.

Furthermore, you should expect that the review process will prompt
changes. My perception is that you consider the submission more or less
final modulo possibly a few minor nits. You shouldn't. I'm convinced you
know much more about SMP than most or all others in this group, but in
no way that means your design has reached perfection and is beyond
improvement even from a non-expert.

I know you'd have no problem finding the right voice in this discussion
if you only frame it in the right light. Again, people are trying to
help (however awkwardly) and in no way is that ridiculous.


Fair enough.  Now that I think of it most of my frustration is that 
these details are only getting looked at now, when I have a week (and an 
otherwise very busy week) to fix all this stuff, when this module has 
been officially in review for the past two weeks and unofficially for 
several months.  I would be much more open to this process if the issues 
raised could be fixed at my leisure rather than on a hard and tight 
deadline.  This is exacerbated by the fact that I have another 
important, unrelated deadline, also next Friday.


At the same time, though, I'm afraid that if we didn't fix a vote date 
and put some urgency into things, the pace of the reviews would be 
glacial at best, like it was for the first two weeks of official review 
and the months of unofficial review.


How about we make next Friday a soft deadline/start of voting?  It can 
be extended as necessary by mutual agreement of me and Lars (the review 
manager), and likely will be if the review process is still yielding 
good suggestions and/or I haven't had time to implement/clean up some 
key things.  Having a deadline breathing down your neck like this is 
really not conducive to being open to suggestions and thoughtful 
consideration, especially for issues that seem like fairly minor details.


Also increasing the deadline pressure issue, Michael Fortin just caused 
me to rethink the issue of exception handling in parallel foreach.  I 
had more-or-less working code for this, but I realized it's severely 
broken in subtle ways that I've (knock on wood) never actually run into 
in real world code.  It's gonna take some time to fix.  These kinds of 
issues with error handling code can very easily slip under the radar in 
a librar

Re: review of std.parallelism

2011-03-19 Thread Andrei Alexandrescu

On 03/19/2011 12:16 PM, dsimcha wrote:

On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:

On 03/19/2011 02:32 AM, dsimcha wrote:

Ok, thanks again for clarifying **how** the docs could be improved. I've
implemented the suggestions and generally given the docs a good reading
over and clean up. The new docs are at:

http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html


* Still no synopsis example that illustrates in a catchy way the most
attractive artifacts.


I don't see what I could put here that isn't totally redundant with the
rest of the documentation. Anything I could think of would basically
just involve concatentating all the examples. Furthermore, none of the
other Phobos modules have this, so I don't know what one should look
like.


I'm thinking along the lines of:

http://www.digitalmars.com/d/2.0/phobos/std_exception.html

A nice synopsis would be the pi computation. Just move that up to the 
synopsis. It's simple, clean, and easy to relate to. Generally, you'd 
put here not all details but the stuff you think would make it easiest 
for people to get into your library.



In general I feel like std.parallelism is being held to a
ridiculously high standard that none of the other Phobos modules
currently meet.


I agree, but that goes without saying. This is not a double standard; 
it's a simple means to improve quality of Phobos overall. Clearly 
certain modules that are already in Phobos would not make it if proposed 
today. And that's a good thing! Comparing our current ambitions to the 
quality of the average Phobos module would not help us.


Generally it seems you have grown already tired of the exchange and it 
would take only a few more exchanges for you to say, "you know what, 
either let's get it over it or forget about it."


Please consider for a minute how this is the wrong tone and attitude to 
be having on several levels. This is not a debate and not the place to 
get defensive. Your role in the discussion is not symmetric with the 
others' and at best you'd use the review as an opportunity to improve 
your design, not stick heels in the ground to defend its current 
incarnation (within reason). Your potential customers are attempting to 
help you by asking questions (some of which no doubt are silly) and by 
making suggestions (some of which, again, are ill-founded). 
Nevertheless, we _are_ your potential customers and in a way the 
customer is always right. Your art is to steer customers from what they 
think they want to what you know they need - because you're the expert! 
- and to improve design, nomenclature, and implementation to the extent 
that would help them.


Furthermore, you should expect that the review process will prompt 
changes. My perception is that you consider the submission more or less 
final modulo possibly a few minor nits. You shouldn't. I'm convinced you 
know much more about SMP than most or all others in this group, but in 
no way that means your design has reached perfection and is beyond 
improvement even from a non-expert.


I know you'd have no problem finding the right voice in this discussion 
if you only frame it in the right light. Again, people are trying to 
help (however awkwardly) and in no way is that ridiculous.



* "After creation, Task objects are submitted to a TaskPool for
execution." I understand it's possible to use Task straight as a
promise/future, so s/are/may be/.


No. The only way Task is useful is by submitting it to a pool to be
executed. (Though this may change, see below.)


I very much hope this does change. Otherwise the role of Task in the 
design could be drastically reduced (e.g. nested type inside of 
TaskPool) without prejudice. At the minimum I want to be able to create 
a task, launch it, and check its result later without involving a pool. 
A pool is when I have many tasks that may exceed the number of CPUs etc. 
Simplicity would be great.


// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
auto foo = readFoo.yieldWait();
auto bar = readBar.yieldWait();
auto baz = readBaz.yieldWait();

There's no need at this level for a task pool. What would be nice would 
be to have a join() that joins all tasks spawned by the current thread:


// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
join();
// fetch results
auto foo = readFoo.spinWait();
auto bar = readBar.spinWait();
auto baz = readBaz.spinWait();

The way I see it is, task pools are an advanced means that coordinate m 
threads over n CPUs. If I don't care about that (as above) there should 
be no need for a pool at all. (Of course it's fine if used by the 
implementation.)



Also it is my understanding that
TaskPool efficiently adapts the concrete number of CPUs to an arbitrary
number of tasks. If that's the case, it's worth mentioni

Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 3:36 PM, dsimcha wrote:

On 3/19/2011 2:35 PM, Michel Fortin wrote:
It gets rethrown when yieldWait()/spinWait()/workWait() is called. In
the case of the higher-level primitives, it gets re-thrown to the
calling thread at some non-deterministic point in the execution of these
functions. I didn't see the need to document this explicitly because it
"just works".



...Though now that you make me think of it I need to support exception 
chaining for the case of multiple concurrently thrown exceptions instead 
of just arbitrarily and non-deterministically rethrowing one.  The code 
that handles this was written before exception chaining existed.  Will 
get on that.


Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 2:35 PM, Michel Fortin wrote:

On 2011-03-19 13:16:02 -0400, dsimcha  said:


* "A goto from inside the parallel foreach loop to a label outside the
loop will result in undefined behavior." Would this be a bug in dmd?


No, it's because a goto of this form has no reasonable, useful
semantics. I should probably mention in the docs that the same applies
to labeled break and continue.

I have no idea what semantics these should have, and even if I did,
given the long odds that even one person would actually need them, I
think they'd be more trouble than they're worth to implement. For
example, once you break out of a parallel foreach loop to some
arbitrary address (and different threads can goto different labels,
etc.), well, it's no longer a parallel foreach loop. It's just a bunch
of completely unstructured threading doing god-knows-what.

Therefore, I slapped undefined behavior on it as a big sign that says,
"Just don't do it." This also has the advantage that, if anyone ever
thinks of any good, clearly useful semantics, these will be
implementable without breaking code later.


I think an improvement over undefined behaviour would be to throw an
exception.


The only problem is that there's no easy, well-documented way to tell 
from the return value of opApply whether it was a break, a goto, a 
labeled break/continue, etc.  This would be implementable only if I 
changed the semantics of break to also throw.  This might not be a bad 
thing (IMHO any type of breaking out of a parallel foreach loop is just 
silly) but others had asked for different semantics for break.




Also, what happens if one of the tasks throws an exception?



It gets rethrown when yieldWait()/spinWait()/workWait() is called.  In 
the case of the higher-level primitives, it gets re-thrown to the 
calling thread at some non-deterministic point in the execution of these 
functions.  I didn't see the need to document this explicitly because it 
"just works".




Re: review of std.parallelism

2011-03-19 Thread bearophile
dsimcha:

> 1.  I give up.
> 
> 2.  I wish someone had told me earlier.

Please, don't give up. It's a lot of time from the first time I've asked a 
parallel map in Phobos :-)

Bye,
bearophile


Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 2:25 PM, Michel Fortin wrote:

On 2011-03-19 14:14:51 -0400, Michel Fortin 
said:


I'm not too convinced about the "I know what I'm doing" argument when
I look at this example from asyncBuf's documentation:

auto lines = File("foo.txt").byLine();
auto duped = map!"a.idup"(lines); // Necessary b/c byLine() recycles
buffer

// Fetch more lines in the background while we process the lines already
// read into memory into a matrix of doubles.
double[][] matrix;
auto asyncReader = taskPool.asyncBuf(duped);

foreach(line; asyncReader) {
auto ls = line.split("\t");
matrix ~= to!(double[])(ls);
}

Look at the last line of the foreach. You are appending to a
non-shared array from many different threads. How is that not a race
condition?


 or maybe I just totally misunderstood asyncBuf. Rereading the
documentation I'm under the impression I'd have to write this to get
what I expected:

foreach (line; parallel(asyncReader))
...

And that would cause a race condition. If that's the case, the example
is fine. Sorry for the misunderstanding.



Right.  And this is pretty obviously a race.  The other example (without 
the parallel) is completely safe.


Re: review of std.parallelism

2011-03-19 Thread Michel Fortin

On 2011-03-19 13:16:02 -0400, dsimcha  said:


* "A goto from inside the parallel foreach loop to a label outside the
loop will result in undefined behavior." Would this be a bug in dmd?


No, it's because a goto of this form has no reasonable, useful 
semantics.  I should probably mention in the docs that the same applies 
to labeled break and continue.


I have no idea what semantics these should have, and even if I did, 
given the long odds that even one person would actually need them, I 
think they'd be more trouble than they're worth to implement.  For 
example, once you break out of a parallel foreach loop to some 
arbitrary address (and different threads can goto different labels, 
etc.), well, it's no longer a parallel foreach loop.  It's just a bunch 
of completely unstructured threading doing god-knows-what.


Therefore, I slapped undefined behavior on it as a big sign that says, 
"Just don't do it."  This also has the advantage that, if anyone ever 
thinks of any good, clearly useful semantics, these will be 
implementable without breaking code later.


I think an improvement over undefined behaviour would be to throw an exception.

Also, what happens if one of the tasks throws an exception?

--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/



Re: review of std.parallelism

2011-03-19 Thread Michel Fortin

On 2011-03-19 14:14:51 -0400, Michel Fortin  said:

I'm not too convinced about the "I know what I'm doing" argument when I 
look at this example from asyncBuf's documentation:


auto lines = File("foo.txt").byLine();
auto duped = map!"a.idup"(lines);  // Necessary b/c byLine() 
recycles buffer


// Fetch more lines in the background while we process the lines already
// read into memory into a matrix of doubles.
double[][] matrix;
auto asyncReader = taskPool.asyncBuf(duped);

foreach(line; asyncReader) {
auto ls = line.split("\t");
matrix ~= to!(double[])(ls);
}

Look at the last line of the foreach. You are appending to a non-shared 
array from many different threads. How is that not a race condition?


... or maybe I just totally misunderstood asyncBuf. Rereading the 
documentation I'm under the impression I'd have to write this to get 
what I expected:


foreach (line; parallel(asyncReader))
...

And that would cause a race condition. If that's the case, the example 
is fine. Sorry for the misunderstanding.


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/



Re: review of std.parallelism

2011-03-19 Thread Michel Fortin

On 2011-03-19 13:20:00 -0400, dsimcha  said:


On 3/19/2011 1:09 PM, Michel Fortin wrote:

For instance:

void main() {
int sum = 0;
foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
sum += value;
}
writeln(sum);
}

The "+=" would need to be an atomic operation to avoid low-level races.

I think that ParallelForeach's opApply should only accept a shared
delegate. I define shared delegate as a delegate that does not reference
any non-shared variables of its outer scope. The problem is that DMD
currently doesn't know how to determine whether a delegate literal is
shared or not, thus a delegate literal is never shared and if
ParallelForeach's opApply asked a shared delegate as it should it would
just not work. Fix DMD to create shared delegate literals where
appropriate and everything can be guarantied race-free.


If you want pedal-to-metal parallelism without insane levels of 
verbosity, you can't have these safety features.


I'm not sure where my proposal asks for more verbosity or less 
performance. All I can see is a few less casts in std.parallelism and 
that it'd disallow the case in my example above that is totally wrong. 
Either you're interpreting it wrong or there are things I haven't 
thought about (and I'd be happy to know about them).


But by looking at all the examples in the documentation, I cannot find 
one that would need to be changed... well, except the one I'll discuss 
below.



I thought long and hard about this issue before submitting this lib for 
review and concluded that any solution would make std.parallelism so 
slow, so limited and/or such a PITA to use that I'd rather it just punt 
these issues to the programmer.  In practice, parallel foreach is used 
with very small, performance-critical snippets that are fairly easy to 
reason about even without any language-level race safety.


I'm not too convinced about the "I know what I'm doing" argument when I 
look at this example from asyncBuf's documentation:


   auto lines = File("foo.txt").byLine();
   auto duped = map!"a.idup"(lines);  // Necessary b/c byLine() 
recycles buffer


   // Fetch more lines in the background while we process the lines already
   // read into memory into a matrix of doubles.
   double[][] matrix;
   auto asyncReader = taskPool.asyncBuf(duped);

   foreach(line; asyncReader) {
   auto ls = line.split("\t");
   matrix ~= to!(double[])(ls);
   }

Look at the last line of the foreach. You are appending to a non-shared 
array from many different threads. How is that not a race condition?


With my proposal, the compiler would have caught that because opApply 
would want the foreach body to be  a shared delegate, and 
reading/writing to non-shared variable "matrix" in the outer scope from 
a shared delegate literal would be an error.


I'm not too sure how hard it'd be to do that in the compiler, but I 
think it's the right thing to do. Once the compiler can do this we can 
have safety; until that time I'd agree to see std.parallelism stay 
unsafe.


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/



Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:


* "workUnitSize: The number of elements to evaluate in a single Task.
Must be less than or equal to bufSize, and in practice should be a
fraction of bufSize such that all worker threads can be used." Then why
not specify a different parameter such a multiplier or something? The
dependence between bufSize and worUnitSize is a sign that these two
should be improved. If you have good reasons that the user must have the
parameters in this form, give an example substantiating that.


I did this for consistency with all the other functions, where work unit 
size is specified directly.  I don't want lazyMap to be the oddball 
function where it's specified in a completely different way.


Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 1:09 PM, Michel Fortin wrote:

On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu
 said:


* "Most of this module completely subverts..." Vague characterizations
("most", "completely", "some") don't belong in a technical
documentation. (For example there's either subversion going on or
there isn't.) Also, std.concurrency and std.parallelism address
different needs so there's little competition between them. Better:
"Unless explicitly marked as $(D @trusted) or $(D @safe), artifacts in
this module are not provably memory-safe and cannot be used with
SafeD. If used as documented, memory safety is guaranteed."


Actually, I think this is a bad description of what it subverts. What it
subverts isn't the memory-safety that SafeD provides, but the safety
against low-level races that even unsafe D protects against unless you
cast shared away. For instance:

void main() {
int sum = 0;
foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
sum += value;
}
writeln(sum);
}

The "+=" would need to be an atomic operation to avoid low-level races.

I think that ParallelForeach's opApply should only accept a shared
delegate. I define shared delegate as a delegate that does not reference
any non-shared variables of its outer scope. The problem is that DMD
currently doesn't know how to determine whether a delegate literal is
shared or not, thus a delegate literal is never shared and if
ParallelForeach's opApply asked a shared delegate as it should it would
just not work. Fix DMD to create shared delegate literals where
appropriate and everything can be guarantied race-free.



If you want pedal-to-metal parallelism without insane levels of 
verbosity, you can't have these safety features.  I thought long and 
hard about this issue before submitting this lib for review and 
concluded that any solution would make std.parallelism so slow, so 
limited and/or such a PITA to use that I'd rather it just punt these 
issues to the programmer.  In practice, parallel foreach is used with 
very small, performance-critical snippets that are fairly easy to reason 
about even without any language-level race safety.  This is a 
fundamental design decision that will not be changing.  If it's 
unacceptable then:


1.  I give up.

2.  I wish someone had told me earlier.


Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:

On 03/19/2011 02:32 AM, dsimcha wrote:

Ok, thanks again for clarifying **how** the docs could be improved. I've
implemented the suggestions and generally given the docs a good reading
over and clean up. The new docs are at:

http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html


* Still no synopsis example that illustrates in a catchy way the most
attractive artifacts.


I don't see what I could put here that isn't totally redundant with the 
rest of the documentation.  Anything I could think of would basically 
just involve concatentating all the examples.  Furthermore, none of the 
other Phobos modules have this, so I don't know what one should look 
like.  In general I feel like std.parallelism is being held to a 
ridiculously high standard that none of the other Phobos modules 
currently meet.




* "After creation, Task objects are submitted to a TaskPool for
execution." I understand it's possible to use Task straight as a
promise/future, so s/are/may be/.


No.  The only way Task is useful is by submitting it to a pool to be 
executed.  (Though this may change, see below.)



Also it is my understanding that
TaskPool efficiently adapts the concrete number of CPUs to an arbitrary
number of tasks. If that's the case, it's worth mentioning.


Isn't this kind of obvious from the examples, etc.?


* "If a Task has been submitted to a TaskPool instance, is being stored
in a stack frame, and has not yet finished, the destructor for this
struct will automatically call yieldWait() so that the task can finish
and the stack frame can be destroyed safely." At this point in the doc
the reader doesn't understand that at all because TaskPool has not been
seen yet. The reader gets worried that she'll be essentially serializing
the entire process by mistake. Either move this explanation down or
provide an example.


This is getting ridiculous.  There are too many circular dependencies 
between Task and TaskPool that are impossible to remove here that I'm 
not even going to try.  One or the other has to be introduced first, but 
neither can be explained without mentioning the other.  This is why I 
explain the relationship briefly in the module level summary, so that 
the user has at least some idea.  I think this is about the best I can do.




* Is done() a property?


Yes.  DDoc sucks.



* The example that reads two files at the same time should NOT use
taskPool. It's just one task, why would the pool ever be needed? If you
also provided an example that reads n files in memory at the same time
using a pool, that would illustrate nicely why you need it. If a Task
can't be launched without being put in a pool, there should be a
possibility to do so. At my work we have a simple function called
callInNewThread that does what's needed to launch a function in a new
thread.


I guess I could add something like this to Task.  Under the hood it 
would (for implementation simplicity, to reuse a bunch of code from 
TaskPool) fire up a new single-thread pool, submit the task, call 
TaskPool.finish(), and then return.  Since you're already creating a new 
thread, the extra overhead of creating a new TaskPool for the thread 
would be negligible and it would massively simplify the implementation. 
 My only concern is that, when combined with scoped versus non-scoped 
tasks (which are definitely here to stay, see below) this small 
convenience function would add way more API complexity than it's worth. 
 Besides, task pools don't need to be created explicitly anyhow. 
That's what the default instantiation is for.  I don't see how 
callInNewThread would really solve much.


Secondly, I think you're reading **WAY** too much into what was meant to 
be a simple example to illustrate usage mechanics.  This is another case 
where I can't think of a small, cute example of where you'd really need 
the pool.  There are plenty of larger examples, but the smallest/most 
self-contained one I can think of is a parallel sort.  I decided to use 
file reading because it was good enough to illustrate the mechanics of 
usage, even if it didn't illustrate a particularly good use case.




* The note below that example gets me thinking: it is an artificial
limitation to force users of Task to worry about scope and such. One
should be able to create a Future object (Task I think in your
terminology), pass it around like a normal value, and ultimately force
it. This is the case for all other languages that implement futures. I
suspect the "scope" parameter associated with the delegate a couple of
definitions below plays a role here, but I think we need to work for
providing the smoothest interface possible (possibly prompting
improvements in the language along the way).


This is what TaskPool.task is for.  Maybe this should be moved to the 
top of the definition of TaskPool and emphasized, and the scoped/stack 
allocated versions should be moved below TaskPool and de-emphasized?


At any rate,

Re: review of std.parallelism

2011-03-19 Thread Michel Fortin
On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu 
 said:


* "Most of this module completely subverts..." Vague characterizations 
("most", "completely", "some") don't belong in a technical 
documentation. (For example there's either subversion going on or there 
isn't.) Also, std.concurrency and std.parallelism address different 
needs so there's little competition between them. Better: "Unless 
explicitly marked as $(D @trusted) or $(D @safe), artifacts in this 
module are not provably memory-safe and cannot be used with SafeD. If 
used as documented, memory safety is guaranteed."


Actually, I think this is a bad description of what it subverts. What 
it subverts isn't the memory-safety that SafeD provides, but the safety 
against low-level races that even unsafe D protects against unless you 
cast shared away. For instance:


void main() {
int sum = 0;
foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) 
{
sum += value;
}
writeln(sum);
}

The "+=" would need to be an atomic operation to avoid low-level races.

I think that ParallelForeach's opApply should only accept a shared 
delegate. I define shared delegate as a delegate that does not 
reference any non-shared variables of its outer scope. The problem is 
that DMD currently doesn't know how to determine whether a delegate 
literal is shared or not, thus a delegate literal is never shared and 
if ParallelForeach's opApply asked a shared delegate as it should it 
would just not work. Fix DMD to create shared delegate literals where 
appropriate and everything can be guarantied race-free.


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/



Re: review of std.parallelism

2011-03-19 Thread Andrei Alexandrescu

On 03/19/2011 10:28 AM, dsimcha wrote:

On 3/19/2011 10:54 AM, Andrei Alexandrescu wrote:

Towards the bottom of the document there are overloads of task that
don't have examples.


You mean TaskPool.task()? Since these are such slight variations of the
other overloads, I thought an example would be overkill. Since people
less familiar with the library don't think so, though, I've added
examples that are accordingly slight variations of the examples for the
other overloads.


A great way to handle bunches of almost-identical overloads is to group 
them together with /// ditto and explain the slight differences in the 
consolidated documentation.


Andrei


Re: review of std.parallelism

2011-03-19 Thread Andrei Alexandrescu

On 03/19/2011 02:32 AM, dsimcha wrote:

Ok, thanks again for clarifying **how** the docs could be improved. I've
implemented the suggestions and generally given the docs a good reading
over and clean up. The new docs are at:

http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html


* When using "parallelism" as a common noun, prefix is with a '_' so 
ddoc doesn't underline it.


* "Most of this module completely subverts..." Vague characterizations 
("most", "completely", "some") don't belong in a technical 
documentation. (For example there's either subversion going on or there 
isn't.) Also, std.concurrency and std.parallelism address different 
needs so there's little competition between them. Better: "Unless 
explicitly marked as $(D @trusted) or $(D @safe), artifacts in this 
module are not provably memory-safe and cannot be used with SafeD. If 
used as documented, memory safety is guaranteed."


* Speaking of std.concurrency vs. std.parallelism, the first paragraph 
might be something like: "This module implements high-level primitives 
for shared memory SMP parallelism. These include parallel foreach, 
parallel reduce, parallel eager map, pipelining and future/promise 
parallelism primitives. $(D std._parallelism) is best recommended when 
the same operation is to be executed in parallel over different data. 
For communication between arbitrary threads, see $(D std.concurrency)."


* Still no synopsis example that illustrates in a catchy way the most 
attractive artifacts.


* "After creation, Task objects are submitted to a TaskPool for 
execution." I understand it's possible to use Task straight as a 
promise/future, so s/are/may be/. Also it is my understanding that 
TaskPool efficiently adapts the concrete number of CPUs to an arbitrary 
number of tasks. If that's the case, it's worth mentioning.


* "A call to workWait(), yieldWait(), or spinWait() can be used to 
retrive the return value after the function is finished executing." As 
an aside, a spell checking step would be great ("retrieve") - just put 
the text in an editor with spellchecking. I think what this means is: 
"The methods workWait(), yieldWait(), or spinWait() make sure that the 
function finishes execution and then return its result to the initiating 
thread. Each uses a different waiting strategy as detailed below."


* "If a Task has been submitted to a TaskPool instance, is being stored 
in a stack frame, and has not yet finished, the destructor for this 
struct will automatically call yieldWait() so that the task can finish 
and the stack frame can be destroyed safely." At this point in the doc 
the reader doesn't understand that at all because TaskPool has not been 
seen yet. The reader gets worried that she'll be essentially serializing 
the entire process by mistake. Either move this explanation down or 
provide an example.


* "Function results are returned from yieldWait() and friends by ref." 
Someone coming from C++ may be thrown off by this sudden casual use of 
"friends" and think there's a notion of frienship by reference in D. 
Better: "The forcing methods yieldWait(), workWait(), and spinWait() 
return the result by reference."


* Speaking of which, I'd replace "Wait" with "Force". Right now the 
nomenclature is far removed from futures and promises.


* Is done() a property?

* The example that reads two files at the same time should NOT use 
taskPool. It's just one task, why would the pool ever be needed? If you 
also provided an example that reads n files in memory at the same time 
using a pool, that would illustrate nicely why you need it. If a Task 
can't be launched without being put in a pool, there should be a 
possibility to do so. At my work we have a simple function called 
callInNewThread that does what's needed to launch a function in a new 
thread.


* The note below that example gets me thinking: it is an artificial 
limitation to force users of Task to worry about scope and such. One 
should be able to create a Future object (Task I think in your 
terminology), pass it around like a normal value, and ultimately force 
it. This is the case for all other languages that implement futures. I 
suspect the "scope" parameter associated with the delegate a couple of 
definitions below plays a role here, but I think we need to work for 
providing the smoothest interface possible (possibly prompting 
improvements in the language along the way).


* I'm not sure how to interpret the docs for

ReturnType!(F) run(F, Args...)(F fpOrDelegate, ref Args args);

So it's documented but I'm not supposed to care. Why not just remove? 
Surely there must be higher-level examples that clarify that I can use 
delegates etc.


* The examples have code at top-level. That's fine for short snippets 
but not when using import etc. I recommend putting the code inside 
unittests or function bodies for such cases.


* "If you want to escape the Task object from the function in which it 
was created or prefer to heap alloca

Re: review of std.parallelism

2011-03-19 Thread dsimcha

On 3/19/2011 10:54 AM, Andrei Alexandrescu wrote:

On 03/18/2011 11:40 PM, dsimcha wrote:


It should just be private. The fact that it's public is an artifact of
when I was designing worker-local storage and didn't know how it was
going to work yet. I never thought to revisit this until now. It really
isn't useful to client code.


It could be public and undocumented.


I've already made it private.  I can't see what purpose having it public 
would serve.  The fact that it was public before was **purely** an 
oversight.



* defaultPoolThreads - should it be a @property?


Yes. In spirit it's a global variable. It requires some extra
machinations, though, to be threadsafe, which is why it's not
implemented as a simple global variable.


Then it should be a @property. I think ddoc doesn't reflect that, but an
example could.


Right.  It's always been @property but ddoc doesn't reflect this.  I've 
changed the docs slightly to call it a "property" instead of a 
"function".  It seems like overkill to me to give examples for this, 
though, since it's just a getter and a setter.





* No example for task().


 Yes there is, for both flavors, though these could admittedly be
improved. Only the safe version doesn't have an example, and this is
just a more restricted version of the function pointer case, so it seems
silly to make a separate example for it.


Towards the bottom of the document there are overloads of task that
don't have examples.


You mean TaskPool.task()?  Since these are such slight variations of the 
other overloads, I thought an example would be overkill.  Since people 
less familiar with the library don't think so, though, I've added 
examples that are accordingly slight variations of the examples for the 
other overloads.





* What is 'run' in the definition of safe task()?


It's just the run() adapter function. Isn't that obvious?


I'm referring to this:

Task!(run,TypeTuple!(F,Args)) task(F, Args...)(scope F delegateOrFp,
Args args);

What is "run"?


Ok, I ended up moving the docs for run() directly above the stuff that 
uses it.  run() is described as "Calls a delegate or function pointer 
with args. This is an adapter that makes Task work with delegates, 
function pointers and functors instead of just aliases. It is included 
in the documentation to clarify how this case is handled, but is not 
meant to be used directly by client code."


I know the Higher Principles of Encapsulation say this should be private 
and the relevant overloads should return auto.  I strongly believe, 
though, that being anal about encapsulation of this detail is silly 
since it is so unlikely to change and that exposing it helps to clarify 
what's really going on here.  Encapsulation is good up to a point, but 
sometimes it's just easier to think about things when you know how they 
really work at a concrete level, and this tradeoff needs to be weighted.


Re: review of std.parallelism

2011-03-19 Thread Andrei Alexandrescu

On 03/18/2011 11:40 PM, dsimcha wrote:

Thanks for the advice. You mentioned in the past that the documentation
was inadequate but didn't give enough specifics as to how until now. As
the author of the library, things seem obvious to me that don't seem
obvious to anyone else, so I don't feel that I'm in a good position to
judge the quality of the documentation and where it needs improvement. I
plan to fix most of the issues you raised, but I've left comments for
the few that I can't/won't fix or believe are based on misunderstandings
below.


Great, thanks.


On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:

1. Library proper:

* "In the case of non-random access ranges, parallel foreach is still
usable but buffers lazily to an array..." Wouldn't strided processing
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
1, 5, 9, ... and so on.


You can have this if you want, by setting the work unit size to 1.
Setting it to a larger size just causes more elements to be buffered,
which may be more efficient in some cases.


Got it.


* Why not make workerIndex a ulong and be done with it?


I doubt anyone's really going to create anywhere near 4 billion TaskPool
threads over the lifetime of a program. Part of the point of TaskPool is
recycling threads rather than paying the overhead of creating and
destroying them. Using a ulong on a 32-bit architecture would make
worker-local storage substantially slower. workerIndex is how
worker-local storage works under the hood, so it needs to be fast.


If you're confident that overflow won't occur, you may want to eliminate 
that detail from the docs. It throws off the reader.



 > * No example for workerIndex and why it's useful.

It should just be private. The fact that it's public is an artifact of
when I was designing worker-local storage and didn't know how it was
going to work yet. I never thought to revisit this until now. It really
isn't useful to client code.


It could be public and undocumented.


* Is stop() really trusted or just unsafe? If it's forcibly killing
threads then its unsafe.


It's not forcibly killing threads. As the documentation states, it has
no effect on jobs already executing, only ones in the queue.
Furthermore, it's needed unless makeDaemon is called. Speaking of which,
makeDaemon and makeAngel should probably be trusted, too.


Great. The more safe/trusted, the better.


* defaultPoolThreads - should it be a @property?


Yes. In spirit it's a global variable. It requires some extra
machinations, though, to be threadsafe, which is why it's not
implemented as a simple global variable.


Then it should be a @property. I think ddoc doesn't reflect that, but an 
example could.



* No example for task().


 Yes there is, for both flavors, though these could admittedly be
improved. Only the safe version doesn't have an example, and this is
just a more restricted version of the function pointer case, so it seems
silly to make a separate example for it.


Towards the bottom of the document there are overloads of task that 
don't have examples.



* What is 'run' in the definition of safe task()?


It's just the run() adapter function. Isn't that obvious?


I'm referring to this:

Task!(run,TypeTuple!(F,Args)) task(F, Args...)(scope F delegateOrFp, 
Args args);


What is "run"?


Andrei


Re: review of std.parallelism

2011-03-19 Thread Simen kjaeraas

On Sat, 19 Mar 2011 13:51:56 +0100, dsimcha  wrote:


== Quote from Simen kjaeraas (simen.kja...@gmail.com)'s article

On Sat, 19 Mar 2011 05:40:08 +0100, dsimcha  wrote:
> On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
>> 1. Library proper:
>>
>> * "In the case of non-random access ranges, parallel foreach is still
>> usable but buffers lazily to an array..." Wouldn't strided processing
>> help? If e.g. 4 threads the first works on 0, 4, 8, ... second works  
on

>> 1, 5, 9, ... and so on.
>
> You can have this if you want, by setting the work unit size to 1.
> Setting it to a larger size just causes more elements to be buffered,
> which may be more efficient in some cases.
Please add an example showing that, too. Sure, the documentation says
that's what's being done, but an example would show it more clearly.


I don't understand how this can be demonstrated in an example.  It's an
under-the-hood thing.  The only place this appears in the API is in the
workUnitSize parameter.


Yeah, scratch that. I for some reason thought the "array of size
workUnitSize" was global, but it's per thread, innit? Seems so logical now.


--
Simen


Re: review of std.parallelism

2011-03-19 Thread dsimcha
== Quote from Simen kjaeraas (simen.kja...@gmail.com)'s article
> On Sat, 19 Mar 2011 05:40:08 +0100, dsimcha  wrote:
> > On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
> >> 1. Library proper:
> >>
> >> * "In the case of non-random access ranges, parallel foreach is still
> >> usable but buffers lazily to an array..." Wouldn't strided processing
> >> help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
> >> 1, 5, 9, ... and so on.
> >
> > You can have this if you want, by setting the work unit size to 1.
> > Setting it to a larger size just causes more elements to be buffered,
> > which may be more efficient in some cases.
> Please add an example showing that, too. Sure, the documentation says
> that's what's being done, but an example would show it more clearly.

I don't understand how this can be demonstrated in an example.  It's an
under-the-hood thing.  The only place this appears in the API is in the
workUnitSize parameter.


Re: review of std.parallelism

2011-03-19 Thread Simen kjaeraas

On Sat, 19 Mar 2011 05:40:08 +0100, dsimcha  wrote:


On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:

1. Library proper:

* "In the case of non-random access ranges, parallel foreach is still
usable but buffers lazily to an array..." Wouldn't strided processing
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
1, 5, 9, ... and so on.


You can have this if you want, by setting the work unit size to 1.  
Setting it to a larger size just causes more elements to be buffered,  
which may be more efficient in some cases.


Please add an example showing that, too. Sure, the documentation says
that's what's being done, but an example would show it more clearly.


--
Simen


Re: review of std.parallelism

2011-03-19 Thread dsimcha
Ok, thanks again for clarifying **how** the docs could be improved. 
I've implemented the suggestions and generally given the docs a good 
reading over and clean up.  The new docs are at:


http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:

0. Overview and vote

I think the library delivers the high-level goods (parallel foreach,
map, reduce) but is a bit fuzzy in the lower level details.

Documentation hurts understanding of the capabilities of the library and
essentially is of inadequate quality. Entity documentation and examples
do little more than give the impression of dispensing with an unpleasant
chore.

My vote in favor of acceptance is contingent upon a radical improvement
in the quality of documentation and examples. Most if not all artifacts
should be motivated by simple, compelling examples. The introductory
section must contain a brief and attractive synopsis of the flagship
capabilities. All terms must be defined before being used and introduced
in a carefully-chosen order. The relationship between various entities
should be clarified.

I've seen the argument that simple and strong examples are difficult to
find. Though I agree such examples are not easy to come up with, I also
believe the author of library is in the best position to produce them.



1. Library proper:

* "In the case of non-random access ranges, parallel foreach is still
usable but buffers lazily to an array..." Wouldn't strided processing
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
1, 5, 9, ... and so on.

* Example with squares would be better if double replaced uint, and if a
more complicated operation (sqrt, log...) were involved.

* I'm unclear on the tactics used by lazyMap. I'm thinking the obvious
method should be better: just use one circular buffer. The presence of
two dependent parameters makes this abstraction difficult to operate with.

* Same question about asyncBuf. What is wrong with a circular buffer
filled on one side by threads and on the consumed from the other by the
client? I can think of a couple of answers but it would be great if they
were part of the documentation.

* Why not make workerIndex a ulong and be done with it?

* Is stop() really trusted or just unsafe? If it's forcibly killing
threads then its unsafe.

* uint size() should be size_t for conformity.

* defaultPoolThreads - should it be a @property?

* I don't understand what Task is supposed to do. It is circularly
defined: "A struct that encapsulates the information about a task,
including its current status, what pool it was submitted to, and its
arguments." OK, but what's a task? Could a task be used outside a task
pool, and if so to what effect(s)?

* If LazyMap is only necessary as the result of lazyMap, could that
become a local type inside lazyMap?


2. Documentation:

* Documentation unacceptable. It lacks precision and uses terms before
defining them. For example: "This class encapsulates a task queue and a
set of worker threads" comes before the notions of "task queue" and
"worker thread" are defined. Clearly there is an intuition of what those
are supposed to mean, but in practice each library lays down some fairly
detailed definition of such.

* "Note: Initializing a pool with zero threads (as would happen in the
case of a single-core CPU) is well-tested and does work." The absence of
a bug should not be advertised in so many words. Simpler: "Note:
Single-CPU machines will operate transparently with zero-sized pools."

* "Allows for custom pool size." I have no idea what pool size means.

* "// Do something interesting." Worst example ever. You are the creator
of the library. If _you_ can't find a brief compelling example, who could?

* "Immediately after the range argument, an optional block size argument
may be provided. If none is provided, the default block size is used. An
optional buffer for returining the results may be provided as the last
argument. If one is not provided, one will be automatically allocated.
If one is provided, it must be the same length as the range." An example
should be inserted after each of these sentences.

* "// Do something expensive with line." Better you do something
expensive with line.

* "This is not the same thing as commutativity." I think this is obvious
enough to be left out. The example of matrices (associative but not
commutative) is nice though.

* "immutable n = 10;" -> use underscores

* Would be great if the documentation examples included some rough
experimental results, e.g. "3.8 times faster on a 4-core machine than
the equivalent serial loop".

* No example for workerIndex and why it's useful.

* Is LocalWorker better than WorkerLocal? No, because it's not the
worker that's local, it's the storage - which is missing from the name!
WorkerLocalStorage? Also replace "create" with "make" or drop it
entirely. The example doesn't tell me how I can use bufs. I suspect
work

Re: review of std.parallelism

2011-03-18 Thread dsimcha
Thanks for the advice.  You mentioned in the past that the documentation 
was inadequate but didn't give enough specifics as to how until now.  As 
the author of the library, things seem obvious to me that don't seem 
obvious to anyone else, so I don't feel that I'm in a good position to 
judge the quality of the documentation and where it needs improvement. 
I plan to fix most of the issues you raised, but I've left comments for 
the few that I can't/won't fix or believe are based on misunderstandings 
below.


On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:

1. Library proper:

* "In the case of non-random access ranges, parallel foreach is still
usable but buffers lazily to an array..." Wouldn't strided processing
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on
1, 5, 9, ... and so on.


You can have this if you want, by setting the work unit size to 1. 
Setting it to a larger size just causes more elements to be buffered, 
which may be more efficient in some cases.




* I'm unclear on the tactics used by lazyMap. I'm thinking the obvious
method should be better: just use one circular buffer. The presence of
two dependent parameters makes this abstraction difficult to operate with.

* Same question about asyncBuf. What is wrong with a circular buffer
filled on one side by threads and on the consumed from the other by the
client? I can think of a couple of answers but it would be great if they
were part of the documentation.


Are you really suggesting I give detailed rationales for implementation 
decisions in the documentation?  Anyhow, the two reasons for this choice 
are to avoid needing synchronization/atomic ops/etc. on every write to 
the buffer (which we would need since it can be read and written 
concurrently and we need to track whether we have space to write to) and 
because parallel map works best when it operates on relatively large 
buffers, resulting in minimal synchronization overhead per element. 
(Under the hood, the buffer is filled and then eager parallel map is 
called.)



* Why not make workerIndex a ulong and be done with it?


I doubt anyone's really going to create anywhere near 4 billion TaskPool 
threads over the lifetime of a program.  Part of the point of TaskPool 
is recycling threads rather than paying the overhead of creating and 
destroying them.  Using a ulong on a 32-bit architecture would make 
worker-local storage substantially slower.  workerIndex is how 
worker-local storage works under the hood, so it needs to be fast.


> * No example for workerIndex and why it's useful.

It should just be private.  The fact that it's public is an artifact of 
when I was designing worker-local storage and didn't know how it was 
going to work yet.  I never thought to revisit this until now.  It 
really isn't useful to client code.



* Is stop() really trusted or just unsafe? If it's forcibly killing
threads then its unsafe.


It's not forcibly killing threads.  As the documentation states, it has 
no effect on jobs already executing, only ones in the queue. 
Furthermore, it's needed unless makeDaemon is called.  Speaking of 
which, makeDaemon and makeAngel should probably be trusted, too.



* defaultPoolThreads - should it be a @property?


Yes.  In spirit it's a global variable.  It requires some extra 
machinations, though, to be threadsafe, which is why it's not 
implemented as a simple global variable.



* No example for task().


 Yes there is, for both flavors, though these could admittedly be 
improved.  Only the safe version doesn't have an example, and this is 
just a more restricted version of the function pointer case, so it seems 
silly to make a separate example for it.



* What is 'run' in the definition of safe task()?


It's just the run() adapter function.  Isn't that obvious?


review of std.parallelism

2011-03-18 Thread Andrei Alexandrescu

0. Overview and vote

I think the library delivers the high-level goods (parallel foreach, 
map, reduce) but is a bit fuzzy in the lower level details.


Documentation hurts understanding of the capabilities of the library and 
essentially is of inadequate quality. Entity documentation and examples 
do little more than give the impression of dispensing with an unpleasant 
chore.


My vote in favor of acceptance is contingent upon a radical improvement 
in the quality of documentation and examples. Most if not all artifacts 
should be motivated by simple, compelling examples. The introductory 
section must contain a brief and attractive synopsis of the flagship 
capabilities. All terms must be defined before being used and introduced 
in a carefully-chosen order. The relationship between various entities 
should be clarified.


I've seen the argument that simple and strong examples are difficult to 
find. Though I agree such examples are not easy to come up with, I also 
believe the author of library is in the best position to produce them.




1. Library proper:

* "In the case of non-random access ranges, parallel foreach is still 
usable but buffers lazily to an array..." Wouldn't strided processing 
help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on 
1, 5, 9, ... and so on.


* Example with squares would be better if double replaced uint, and if a 
more complicated operation (sqrt, log...) were involved.


* I'm unclear on the tactics used by lazyMap. I'm thinking the obvious 
method should be better: just use one circular buffer. The presence of 
two dependent parameters makes this abstraction difficult to operate with.


* Same question about asyncBuf. What is wrong with a circular buffer 
filled on one side by threads and on the consumed from the other by the 
client? I can think of a couple of answers but it would be great if they 
were part of the documentation.


* Why not make workerIndex a ulong and be done with it?

* Is stop() really trusted or just unsafe? If it's forcibly killing 
threads then its unsafe.


* uint size() should be size_t for conformity.

* defaultPoolThreads - should it be a @property?

* I don't understand what Task is supposed to do. It is circularly 
defined: "A struct that encapsulates the information about a task, 
including its current status, what pool it was submitted to, and its 
arguments." OK, but what's a task? Could a task be used outside a task 
pool, and if so to what effect(s)?


* If LazyMap is only necessary as the result of lazyMap, could that 
become a local type inside lazyMap?



2. Documentation:

* Documentation unacceptable. It lacks precision and uses terms before 
defining them. For example: "This class encapsulates a task queue and a 
set of worker threads" comes before the notions of "task queue" and 
"worker thread" are defined. Clearly there is an intuition of what those 
are supposed to mean, but in practice each library lays down some fairly 
detailed definition of such.


* "Note: Initializing a pool with zero threads (as would happen in the 
case of a single-core CPU) is well-tested and does work." The absence of 
a bug should not be advertised in so many words. Simpler: "Note: 
Single-CPU machines will operate transparently with zero-sized pools."


* "Allows for custom pool size." I have no idea what pool size means.

* "// Do something interesting." Worst example ever. You are the creator 
of the library. If _you_ can't find a brief compelling example, who could?


* "Immediately after the range argument, an optional block size argument 
may be provided. If none is provided, the default block size is used. An 
optional buffer for returining the results may be provided as the last 
argument. If one is not provided, one will be automatically allocated. 
If one is provided, it must be the same length as the range." An example 
should be inserted after each of these sentences.


* "// Do something expensive with line." Better you do something 
expensive with line.


* "This is not the same thing as commutativity." I think this is obvious 
enough to be left out. The example of matrices (associative but not 
commutative) is nice though.


* "immutable n = 10;" -> use underscores

* Would be great if the documentation examples included some rough 
experimental results, e.g. "3.8 times faster on a 4-core machine than 
the equivalent serial loop".


* No example for workerIndex and why it's useful.

* Is LocalWorker better than WorkerLocal? No, because it's not the 
worker that's local, it's the storage - which is missing from the name! 
WorkerLocalStorage? Also replace "create" with "make" or drop it 
entirely. The example doesn't tell me how I can use bufs. I suspect 
workerIndex has something to do with it but the example fails to divulge 
that relationship.


* No example for join(), which is quite a central primitive.

* No example for put(), which would be interesting to see.

* No example for t