First of all, please let me apologize for both bringing to the
attention of the community the issues which I believe to date
back to the days of APL, and for which, to the best of my
knowledge, no satisfactory solution was proposed ever since, and
for being, for the most part, a self-taught programmer, which
may occasionally result in the lack of knowledge whatsoever
about some of the well-known problems, solutions, textbooks and
so on on my part.
The problem that I'll discuss the first is the one of suboptimal
performance of the constructs like $y = F ($x), where there
aren't any variables involved besides of $y and $x, but where
the expression for F may otherwise be of arbitrarily complexity.
Consider, for example:
my $b
= $a * $a * ... * $a;
To my understanding, with PDL, the code above will be computed
roughly as follows:
my $b;
{
my $b2 = $a * $a;
my $b3 = $a * $b2;
...
$b = $a * $bN;
}
with each multiplication involving its own loop. Or, in
pseudo-C:
number b[size];
number b2[size];
for (i = 0; i < size; i++) { b2[i] = a[i] * a[i]; }
number b3[size];
for (i = 0; i < size; i++) { b3[i] = a[i] * b2[i]; }
...
for (i = 0; i < size; i++) { b[i] = a[i] * bN[i]; }
However, the same computation may be organized much more
efficiently by /lifting/ the loop out, like:
number b[size];
for (i = 0; i < size; i++) {
b[i] = a[i] * a[i] * ... a[i];
}
Now, why is the former code inefficient?
• N loops of M iterations each take at least N × M machine code
comparisons and jumps; which is N times more than a single
loop of M iterations for the latter case;
• each time the multiplication is done, a new object is created,
so we have as many memory allocations as there're non-mutating
operations;
• with the exception of the first and the last ones, the loop
body expressions for the first case access three arrays, which
hampers memory caching.
A simple-minded test case for this issue is like:
$ cat < ex-1.pdl
use PDL;
use PDL::GSL::RNG;
my $sz = 1024 * 1024 * 16;
my $rep = 4;
my $rng = PDL::GSL::RNG->new ('taus');
$rng->set_seed (1296745560);
my $a = $rng->get_uniform ($sz);
for (my $i = 0; $i < $rep; $i++) {
my $b = $a * $a * $a * $a * $a * $a * $a * $a;
print ($b->sumover ()->sumover (), "\n");
}
$ /usr/bin/time perl -f ex-1.pdl
1863048.54211578
1863048.54211578
1863048.54211578
1863048.54211578
4.76user 4.01system 0:08.78elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+953426minor)pagefaults 0swaps
$
For the comparison, let us replace all the seven multiplications
with a call to a single 8-ary multiplication function (which is
trivial to create thanks to PDL::PP.) The change and the
resulting times are:
$ tail -n 4 < ex-2.pdl
for (my $i = 0; $i < $rep; $i++) {
my $b = mymult8 ($a, $a, $a, $a, $a, $a, $a, $a);
print ($b->sumover ()->sumover (), "\n");
}
$ /usr/bin/time perl -MMyPPex2 -f ex-2.pdl
1863048.54211578
1863048.54211578
1863048.54211578
1863048.54211578
2.26user 0.75system 0:03.01elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+166973minor)pagefaults 0swaps
$
Note the reduction of both the processing time (2.9 times) and
the page faults count (5.7 times.)
However, it should also be noted that, at the larger scale, this
problem doesn't lead to wasted machine cycles alone. Rather, it
leads to wasted human cycles, as it pressurizes the programmers
/not/ to use functions, as the loops cannot easily be lifted out
of functions into the calling code, with all the negative
consequences of this.
I believe that the most of the major array processors of today
(including the ones in GNU Octave, GNU R, r.mapcalc(1grass),
etc.) have this problem. (I have no experience with post-77
versions of Fortran with regards to this issue, and guess that
the particular implementations may have solutions for the
problem, or they may not.) My question is, however, will this
issue be considered by the PDL developers?
Somehow, I feel that the necessary parts of the solution may
include:
• deferred computation; (only compute the values, and thus
iterate, when the context is known);
• a language (and thus a set of primitives), to record the
computation as part of a deferred value, or /promise/; the
language should /not/ be Turing-complete, so that the
necessary rewriting work wouldn't become unmanageable.
(Note also that having a suitable language, a kind of dynamic
chunking may be implemented just as well, so that the solution
may be extended to utilize both shared- and separate-memory
multiprocessing architectures.)
However, I'm short of ideas currently.
Any suggestions?
TIA.
--
FSF associate member #7257
pgp4jaNXgKheY.pgp
Description: PGP signature
_______________________________________________ Perldl mailing list [email protected] http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
