I've done some experiments with Intel's Threading Building Blocks, and
based on my initial tests, it seems incredibly lightweight and also easy
to use.
I haven't tested it with actual GNU APL code yet, though (I've written separate
test programs to experiment). My next tests will be on the real
To clarify, the appealing feature of TBB that made me interested (apart
from it being very fast) is that its algorithms implement task stealing.
This should make the dispatch quite effective, even if some subtasks are
slower than others. I.e. it may actually address some of the concerns raised