First of all, please let me apologize for both bringing to the
        attention of the community the issues which I believe to date
        back to the days of APL, and for which, to the best of my
        knowledge, no satisfactory solution was proposed ever since, and
        for being, for the most part, a self-taught programmer, which
        may occasionally result in the lack of knowledge whatsoever
        about some of the well-known problems, solutions, textbooks and
        so on on my part.

        The problem that I'll discuss the first is the one of suboptimal
        performance of the constructs like $y = F ($x), where there
        aren't any variables involved besides of $y and $x, but where
        the expression for F may otherwise be of arbitrarily complexity.
        Consider, for example:

my $b
    = $a * $a * ... * $a;

        To my understanding, with PDL, the code above will be computed
        roughly as follows:

my $b;
{
    my $b2 = $a * $a;
    my $b3 = $a * $b2;
    ...
    $b     = $a * $bN;
}

        with each multiplication involving its own loop.  Or, in
        pseudo-C:

number b[size];
number b2[size];
for (i = 0; i < size; i++) { b2[i] = a[i] *  a[i]; }
number b3[size];
for (i = 0; i < size; i++) { b3[i] = a[i] * b2[i]; }
...
for (i = 0; i < size; i++) { b[i]  = a[i] * bN[i]; }

        However, the same computation may be organized much more
        efficiently by /lifting/ the loop out, like:

number b[size];
for (i = 0; i < size; i++) {
    b[i]  = a[i] * a[i] * ... a[i];
}

        Now, why is the former code inefficient?

        • N loops of M iterations each take at least N × M machine code
          comparisons and jumps; which is N times more than a single
          loop of M iterations for the latter case;

        • each time the multiplication is done, a new object is created,
          so we have as many memory allocations as there're non-mutating
          operations;

        • with the exception of the first and the last ones, the loop
          body expressions for the first case access three arrays, which
          hampers memory caching.

        A simple-minded test case for this issue is like:

$ cat < ex-1.pdl 
use PDL;
use PDL::GSL::RNG;

my $sz = 1024 * 1024 * 16;
my $rep = 4;

my $rng = PDL::GSL::RNG->new ('taus');
$rng->set_seed (1296745560);
my $a   = $rng->get_uniform ($sz);
for (my $i = 0; $i < $rep; $i++) {
    my $b = $a * $a * $a * $a * $a * $a * $a * $a;
    print ($b->sumover ()->sumover (), "\n");
}
$ /usr/bin/time perl -f ex-1.pdl 
1863048.54211578
1863048.54211578
1863048.54211578
1863048.54211578
4.76user 4.01system 0:08.78elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+953426minor)pagefaults 0swaps
$ 

        For the comparison, let us replace all the seven multiplications
        with a call to a single 8-ary multiplication function (which is
        trivial to create thanks to PDL::PP.)  The change and the
        resulting times are:

$ tail -n 4 < ex-2.pdl 
for (my $i = 0; $i < $rep; $i++) {
    my $b = mymult8 ($a, $a, $a, $a, $a, $a, $a, $a);
    print ($b->sumover ()->sumover (), "\n");
}
$ /usr/bin/time perl -MMyPPex2 -f ex-2.pdl 
1863048.54211578
1863048.54211578
1863048.54211578
1863048.54211578
2.26user 0.75system 0:03.01elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+166973minor)pagefaults 0swaps
$ 

        Note the reduction of both the processing time (2.9 times) and
        the page faults count (5.7 times.)

        However, it should also be noted that, at the larger scale, this
        problem doesn't lead to wasted machine cycles alone.  Rather, it
        leads to wasted human cycles, as it pressurizes the programmers
        /not/ to use functions, as the loops cannot easily be lifted out
        of functions into the calling code, with all the negative
        consequences of this.

        I believe that the most of the major array processors of today
        (including the ones in GNU Octave, GNU R, r.mapcalc(1grass),
        etc.) have this problem.  (I have no experience with post-77
        versions of Fortran with regards to this issue, and guess that
        the particular implementations may have solutions for the
        problem, or they may not.)  My question is, however, will this
        issue be considered by the PDL developers?

        Somehow, I feel that the necessary parts of the solution may
        include:

        • deferred computation; (only compute the values, and thus
          iterate, when the context is known);

        • a language (and thus a set of primitives), to record the
          computation as part of a deferred value, or /promise/; the
          language should /not/ be Turing-complete, so that the
          necessary rewriting work wouldn't become unmanageable.

        (Note also that having a suitable language, a kind of dynamic
        chunking may be implemented just as well, so that the solution
        may be extended to utilize both shared- and separate-memory
        multiprocessing architectures.)

        However, I'm short of ideas currently.

        Any suggestions?

        TIA.

-- 
FSF associate member #7257

Attachment: pgp4jaNXgKheY.pgp
Description: PGP signature

_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl

Reply via email to