Denis -

I have not explored PDL's threading capabilities very much, but I've read
about them, I've worked with other threading models (namely CUDA and MPI),
and I'll be happy to help out as best I can. I should add that
understanding, documenting, and extending this is on my list of things to
do "some day," which means not very soon. Apart from writing this email and
responding to any questions you may have, it will probably be quite some
time before I revisit this topic.

A terminological clarification: in the PDL world, "threading" is the term
for auto-looping over higher dimensions, while "pthreading" refers to
splitting a single PDL operation across multiple POSIX threads. I will use
"pthreading" throughout this email. Pthreading is supported by core
functionality if your operating system supports POSIX threads. Normal MS
Windows is, to the best of my knowledge, left out of this, though it may
work under Cygwin.
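
To illustrate the first sense, here is a tiny example of threading, in
which sumover automatically loops over the higher dimension:

__CODE__
use strict;
use warnings;
use PDL;

# A 2-D piddle: dimension 0 has length 3, dimension 1 has length 4.
my $a = sequence(3, 4);

# sumover sums along dimension 0; PDL "threads" the operation over
# the remaining dimension, producing four row sums.
print $a->sumover, "\n";   # prints [3 12 21 30]
__END__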

PDL supports splitting an operation across multiple processors for a large
number of its operations. Some PDL operations must be performed in
sequence, however. At the moment, unless the author explicitly noted it in
the docs, it is impossible to know which operations allow pthreading
without looking at the source code. Fortunately, almost all operations *do*
support pthreading; the notable exceptions are (1) random, (2) randsym, (3)
derivatives of these, especially grandom, (4) fft and ifft, (5)
PDL::IO::Pnm::pnmout, and (6) all PLplot method calls. I suspect that there
may be a few more functions that do not actually support pthreading even
though they have not been marked as such. Still, the vast majority of
operations, including addition, matrix multiplication, etc., support
pthreading.
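
If you want to probe this empirically, PDL's parallel-processing interface
(see PDL::ParallelCPU) provides get_autopthread_actual(), which reports how
many pthreads the most recently processed operation used. A minimal sketch,
assuming a pthread-enabled PDL build:

__CODE__
use strict;
use warnings;
use PDL;

set_autopthread_targ(2);   # ask for two pthreads
set_autopthread_size(0);   # pthread even tiny piddles, for demonstration

my $a = random(1000, 1000);   # random() itself is not pthreadable
my $b = $a * 2;               # but multiplication is
print "pthreads used by last operation: ", get_autopthread_actual(), "\n";
__END__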

So, what does pthreading buy us? Unfortunately, for many calculations, not
as much as we might have hoped. This is in part due to common constraints
in parallel programming, and in part due to constraints in how pthreading
is implemented and executed.

Let's begin with a naive example of something that *should* parallelize
easily---incrementing data stored in a matrix:

__CODE__
use strict;
use warnings;
use PDL;

# Command-line arguments. Note the defined-or operator (//) so that
# an explicit 0 is not clobbered by the default:
my $data_size     = shift(@ARGV) // 2000;  # matrix edge length N
my $reps          = shift(@ARGV) // 500;   # number of increments
my $targ          = shift(@ARGV) // 2;     # target number of pthreads
my $thread_cutoff = shift(@ARGV) // 1;     # min size, in 2**20 elements
set_autopthread_targ($targ);
set_autopthread_size($thread_cutoff);

my $data = zeros($data_size, $data_size);
$data += 1 for (1..$reps);
__END__

This code takes up to four arguments on the command line. It builds an NxN
matrix (N is the first argument) and increments all the values in the
matrix a specified number of times (the second argument). We specify a high
number of repetitions to get good timing statistics. To increment a 200x200
matrix 30 times, you would say:

> perl pthread-test.pl 200 30

To time this on a Unixish machine, you can probably say this:

> time perl pthread-test.pl 200 30

The next two arguments indicate the number of processors to use for
pthreading (if pthreading is used at all, which I'll get to in a second)
and the minimum size cutoff, in units of 2**20 elements, below which
pthreading is skipped. So this tells PDL to increment a 200x200 matrix 30
times, splitting the calculation across two processors if the number of
elements in the operation exceeds roughly 1 million:

> perl pthread-test.pl 200 30 2 1

Note that the data we just created (200x200 = 40,000 elements) isn't large
enough to trigger auto-parallelization at the 1-million-element cutoff, so
we crank up the matrix size to actually trigger it. On my machine I ran
this:

> time perl pthread-test.pl 8400 500 3 1
real    1m40.860s
user    4m8.124s
sys    0m2.200s

You can turn off automatic parallelization by specifying zero processors to
parallelize (this is why the script reads its arguments with // rather than
||; with ||, an explicit 0 would fall back to the default):

> time perl pthread-test.pl 8400 500 0
real    1m37.764s
user    3m7.176s
sys    0m2.204s

Comparing the real times, you'll notice that THE PARALLELIZED VERSION TAKES
LONGER. This is a perfect example of how parallelization may lead to
unexpectedly poor performance. In this example we are probably running into
a memory access clash: all three processors are trying to write to
consecutive addresses, and access by different processors to the same
general region of memory can get serialized, which frustrates the
parallelization. What's the moral of the story? PARALLELIZING TRIVIAL
TASKS RARELY WINS YOU ANYTHING.

"But David," you say, "If it's a memory access pattern issue, can't we just
use something like transpose() to change how PDL accesses the memory?"
That's an excellent question! Let's try:

__CODE__
use strict;
use warnings;
use PDL;

my $data_size     = shift(@ARGV) // 2000;
my $reps          = shift(@ARGV) // 500;
my $targ          = shift(@ARGV) // 2;
my $thread_cutoff = shift(@ARGV) // 1;
set_autopthread_targ($targ);
set_autopthread_size($thread_cutoff);

my $data = zeros($data_size, $data_size)->transpose; # <-- only change
$data += 1 for (1..$reps);
__END__

Running that on my machine gives me the following results. First for the
parallelized version:

> time perl pthread-test.pl 8400 500 3 1
real    3m21.872s
user    8m42.573s
sys    0m2.560s

and now for the sequential version:

> time perl pthread-test.pl 8400 500 0
real    4m0.560s
user    7m45.769s
sys    0m3.492s

Now we see that the real time for the parallelized execution beats the
sequential version by about 40 seconds on an operation that took 3 minutes
and 20 seconds. It looks like we did indeed have a memory access issue. But
notice that the transpose roughly *doubled* the total run time! The
transpose itself is cheap (it is essentially just an index manipulation),
but the strided memory access it induces is expensive relative to an
operation as trivial as an increment. Again, PARALLELIZING TRIVIAL TASKS
RARELY WINS YOU ANYTHING.

(Note that it might be possible to alter the means by which the
parallelization is handled to avoid these sorts of memory conflicts for
trivial calculations like addition. Hopefully John Cerney, who recently
worked on the parallelization stuff, can look into this.)

But not all is lost. If each pthread performs a nontrivial task, and if the
run time is not dominated by RAM access, you can get pretty good
performance gains. The quintessential demonstration is matrix-matrix
multiplication. Here's a decent script to demonstrate how this works:

__CODE__
use strict;
use warnings;
use PDL;

my $targ = shift(@ARGV) // 0;  # number of pthreads; 0 disables pthreading
my $data_size  = 1024;
my $data_depth = 6;
set_autopthread_targ($targ);
set_autopthread_size(1);

# $A is a stack of six 1024x1024 matrices; $B is a single matrix.
# $A x $B threads the matrix multiplication over the third dimension,
# which is what gets split across the pthreads.
my $A = random($data_size, $data_size, $data_depth);
my $B = random($data_size, $data_size);
for (1..10) {
    print "Rep $_\n";
    my $C = $A x $B;
}
print "Done\n";
__END__

Notice that, for this operation, we avoid memory access conflicts because
each pthread works on data separated from the other pthreads' data by
megabytes (each 1024x1024 slice of doubles is 8 MB). Also, as written (and
on my machine, which has four cores), I can run this with one, two, or
three processors. Doing so gives noticeably improved timings. Cutting out
the repetition status printouts, I get for one processor:

> time perl pthread-test.pl
real    2m40.511s
user    2m38.882s
sys    0m0.360s

for two processors:

> time perl pthread-test.pl 2
real    1m21.360s
user    2m40.190s
sys    0m0.664s

and for three processors:

> time perl pthread-test.pl 3
real    1m6.337s
user    2m54.651s
sys    0m1.036s

Comparing the real times shows a vast improvement between one and two
processors (2m40s down to 1m21s, almost exactly a 2x speedup). The
improvement moving to three processors (1m6s, about a 2.4x speedup) is not
so impressive, but is still an improvement.
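
One rough way to reason about such diminishing returns is Amdahl's law: if
a fraction s of the work cannot be parallelized, the best speedup on p
processors is 1/(s + (1-s)/p). The serial fraction in the sketch below is
an assumed free parameter, purely for illustration; in practice, memory
bandwidth limits (as we saw above) can matter just as much:

__CODE__
use strict;
use warnings;

# Amdahl's law: speedup(p) = 1 / (s + (1 - s)/p), where s is the
# serial (non-parallelizable) fraction of the work.
my $s = 0.1;   # assumed serial fraction, for illustration only
for my $p (1 .. 4) {
    printf "processors: %d  predicted speedup: %.2fx\n",
        $p, 1 / ($s + (1 - $s) / $p);
}
__END__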

One aspect of PDL's pthreading that makes it not as good as we might like
is that pthreading occurs at each and every invocation of a (pthreadable)
PDL::PP operation. It's not possible to string together a large collection
of PDL::PP operations under one parallelization; rather, each operation
gets individually chopped into parallel pthreads, then collected, then the
next gets chopped, then collected, and so on. If your code is a series of
trivial operations, parallelization may be of little use because the
overhead of launching and collecting those pthreads consumes any time you
might have saved over simply running them sequentially. You'll get the best
performance gain when you have a handful of large operations performed by
one or two PDL::PP operations, like matrix-matrix multiplication. If you
find yourself needing to perform some collection of trivial operations, you
might consider combining all of them into a single PDL::PP function, as
sketched below. Not only will it be faster, it stands a chance of
benefiting from auto-parallelization.
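
For example, here is a minimal sketch of rolling two trivial operations (a
multiply and an add) into a single PDL::PP function using Inline::Pdlpp,
which ships with PDL (it also requires the Inline module). The function
name scale_and_shift is made up for this illustration:

__CODE__
use strict;
use warnings;
use PDL;
use Inline Pdlpp => <<'EOPP';
pp_def('scale_and_shift',
    Pars => 'a(); scale(); offset(); [o]out()',
    Code => '$out() = $a() * $scale() + $offset();',
);
EOPP

# One pass over memory instead of two, and the generated function is
# itself a candidate for auto-pthreading, like any PDL::PP operation.
my $x = sequence(1_000_000);
my $y = $x->scale_and_shift(2, 5);   # computes 2*x + 5 elementwise
print $y->slice('0:4'), "\n";        # prints [5 7 9 11 13]
__END__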

I hope this gives you some guidance for using pthreads with PDL. If you
have any questions, I'll be happy to answer them, as best I can.

David

On Sun, Aug 5, 2012 at 4:41 PM, Denis Gonzalez <[email protected]> wrote:

> Hi everyone,
>
> I would like to learn how to work with parallel processing. I need to
> fill a global "F" matrix of 100x100 with values of a function f(x,y). How
> can I do that on a 2-core computer?
>
> I am thinking of building code similar to this:
>
> $F = joint fill_F(1,50) with fill_F(51,100)   # how can I do this using
> parallel processing???
>
> sub fill_F {
>    my ($xa,$xb) = @_;
>    my $Faux = zeroes(50,100);
>    ------------------------------------------
>    Here I fill $Faux with values for f(x,y)
>    by using
>    x=[$xa...xb]
>    y=[1..100]
>    -------------------------------------------
>    return $Faux
> }
> Thanks,
> Denis


-- 
 "Debugging is twice as hard as writing the code in the first place.
  Therefore, if you write the code as cleverly as possible, you are,
  by definition, not smart enough to debug it." -- Brian Kernighan
_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
