Hi Damian —
Taking your three versions:

Original:

    forall r in rows do {
      const ref ur = u[r, ..];
      for j in cslice do {
        t[r, j] = vmDot(common, ur, vslab[j, ..]);
      }
    }
Forall expr:

    forall r in rows do {
      const ref ur = u[r, ..];
      const x = [j in cslice] vmDot(common, ur, vslab[j, ..]);
      t[r, cslice] = x;
    }
Succinct:

    forall r in rows do {
      const x = [j in cslice] vmDot(common, u[r, ..], vslab[j, ..]);
      t[r, cslice] = x;
    }

The Succinct version slows down seriously, by about 25% or more.
I believe the difference between the final two is a simple case of Chapel not doing loop-hoisting optimizations for non-trivial expressions. Specifically, you and I can see that 'u[r, ..]' is independent of the value of 'j', so it could be evaluated once and reused for all iterations of the 'j' loop, but the Chapel compiler isn't mature enough to do this yet. So your "Forall expr" version gets an improvement by manually hoisting the evaluation of that expression out of the loop.
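Just to illustrate the pattern in isolation, here's a minimal sketch of the same manual-hoisting idea (the names A, B, n, and D are made up for the example, not taken from your code); both loops compute the same result, but the second evaluates the loop-invariant slice only once per outer iteration:

    config const n = 100;
    const D = {1..n, 1..n};
    var A, B: [D] real;

    // Un-hoisted: the slice expression A[i, ..] is re-evaluated on every 'j' iteration
    forall i in 1..n {
      for j in 1..n do
        B[i, j] = + reduce (A[i, ..] * A[j, ..]);
    }

    // Manually hoisted: the loop-invariant slice is evaluated once per 'i' and reused
    forall i in 1..n {
      const ref Ai = A[i, ..];
      for j in 1..n do
        B[i, j] = + reduce (Ai * A[j, ..]);
    }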
The delta between the Original and Forall-expr versions is less obvious, but I would guess it could be due to the use of nested parallelism (though we'd hope the impact would be smaller than 5%, at least for loops with large trip counts). Specifically, by default '[j in cslice]' will be executed in parallel, but it will first check whether there is already a task per core and, if so, serialize the loop. Maybe this execution-time check is adding the 5% overhead? A way to check would be to write the initialization of 'x' as:

    const x = for j in cslice do vmDot(common, ur, vslab[j, ..]);
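In context, that experiment is just the Forall-expr version with the forall expression replaced by a serial for expression, roughly:

    forall r in rows do {
      const ref ur = u[r, ..];
      // serial loop expression: no run-time check for nested parallelism
      const x = for j in cslice do vmDot(common, ur, vslab[j, ..]);
      t[r, cslice] = x;
    }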
If this recovers the lost 5%, I think that's the answer.

-Brad