My experience with very large datasets in PDL comes down to this:
USE THE SMALLEST SUITABLE DATATYPE
I can't stress enough how important that is :-)
I'm dealing with vectors of ~500M values (whole human chromosomes, if you're
interested :-). If I only need a bitmask, I use a byte() piddle; if I have
counts, I use byte or ushort; and I sometimes even convert rational numbers to
integers for performance reasons. I'm pretty sure that most of your problems on
a Mac are due to over-allocating memory.
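To put rough numbers on that, here is a minimal sketch in plain Perl (no PDL
needed) of the raw data size of a 500M-element vector at different element
sizes. The helper name piddle_bytes and the lookup table are made up for
illustration; the sizes are the usual ones for these PDL types, and the figures
ignore per-piddle overhead and any temporaries created during a calculation.

```perl
use strict;
use warnings;

# Bytes per element for a few PDL datatypes (illustrative lookup table).
my %bytes_per_element = (
    byte   => 1,
    ushort => 2,
    long   => 4,
    float  => 4,
    double => 8,
);

# Hypothetical helper: raw data size in bytes of a piddle with $n elements.
sub piddle_bytes {
    my ($type, $n) = @_;
    die "unknown type '$type'\n" unless exists $bytes_per_element{$type};
    return $bytes_per_element{$type} * $n;
}

my $n = 500_000_000;    # roughly one value per position on a large chromosome
for my $type (qw(byte ushort long float double)) {
    printf "%-6s %5.2f GB\n", $type, piddle_bytes($type, $n) / 1e9;
}
```

A byte piddle of that length needs about 0.5 GB, while the same data held as
double needs about 4 GB. In PDL the type is chosen at creation time, for
example zeroes(byte, $n) rather than a plain zeroes($n), which defaults to
double.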
In general, I find that PDL DOES eat Perl's lunch a million times if you do
things cleverly. I was able to do sliding window averaging on those 500M
vectors using PDL::PP in a second or two, compared to hours in pure Perl.
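For anyone who wants to try window averaging without writing PDL::PP, a rough
sketch using conv1d from PDL::Primitive follows. This is not the PDL::PP
routine mentioned above, only a simple stand-in; window_mean is a hypothetical
helper name, and the conversion to float costs four bytes per element, which
matters at 500M values.

```perl
use strict;
use warnings;
use PDL;

# Hypothetical helper: centred moving average of a 1-D piddle.
sub window_mean {
    my ($data, $width) = @_;
    # Equal weights that sum to 1, so the convolution yields a mean.
    my $kernel = ones(float, $width) / $width;
    # Boundary => 'reflect' mirrors the data at the ends instead of
    # wrapping around (conv1d's default boundary is periodic).
    return conv1d($data->float, $kernel, { Boundary => 'reflect' });
}

# Small example: counts stored in the smallest suitable type.
my $counts = byte([0, 3, 6, 9, 12, 9, 6, 3, 0]);
print window_mean($counts, 3), "\n";
```

The cost of conv1d grows with the window width, so for very wide windows a
running-sum approach (or a PDL::PP routine) is likely the better choice.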
Cheers,
Ben
On 9 Jul 2010, at 16:53, P Kishor wrote:
> Craig, David, others,
>
> I find your explanation satisfying, but not the actual results that I
> am getting. I am experiencing a more stable performance from Perl,
> with the performance scaling predictably. PDL shows itself to be more
> moody. From one run to another, the performance can really swing. This
> is on my MacBook with no other user process running (meaning, I am not
> ripping music or watching a movie on Hulu at the same time...).
>
> First, no doubt my simplistic PDL approach was wrong. I figured, I
> have to calculate one "column" based on two other "columns" -- "Hey!
> the PDL docs show how to get a column... use slice." So, that is what
> I went with. However, using Craig's better and more efficient
> calculation approach, I did experience much better results, but not
> completely.
>
> I used Craig's reworked script and ran it three times. The results are
> below (use fixed width font to see the results), but here is some
> discussion --
>
> Both David and Craig implied that making the data (the array for Perl
> and the piddle for PDL) would be more efficient in Perl because it
> would do some up-front memory allocation, so 'push'ing an element on
> to the array would not be costly. That is not the case. In fact, PDL
> converts an array into a piddle faster than Perl builds the array in
> the first place.
>
> Another assertion was that PDL will eat Perl's lunch when it comes to
> calculation. That is also not the case *always*. PDL is much faster at
> smaller data sets. But, at a certain threshold, (for me, that
> threshold is 3 million), PDL gets bogged down. Actually, at 3.5
> million, PDL gets very slow, and at 4 million, it basically locks up
> my computer.
>
> Another interesting issue -- Perl seems to be better at sharing the
> resources. When the Perl calculation is running, my machine is
> responsive. I can switch back to the browser, scroll a page, etc. When
> the PDL calc is running, it is like my machine is frozen.
>
> This kinda worries me. If we write-up the gotchas and the limits
> between which PDL use is optimal, then it is "caveat emptor" and all
> that. However, on a more realistic front, I was hoping to use PDL with
> a 13-million-element piddle. I did some tests, and I found that with a
> 2D piddle where ("first D" * "second D") = 13 million, PDL was
> smokingly fast. I am wondering, though: will its performance change if
> the piddle is a 1D piddle that is 13 million elements long? Does it
> matter to PDL if my dataset is a "long rope" vs. a "carpet", but both
> with the same "thread count" (to use a fabric analogy)?
>
> Test results (reformatted) shown below
>
>
> count: 10000
> ============================
>            Perl   PDL
> ----------------------------
> make data: 0.0097 0.0065
> calculate: 0.0064 0.0014
>
> make data: 0.0106 0.0065
> calculate: 0.0064 0.0014
>
> make data: 0.0104 0.0065
> calculate: 0.0063 0.0014
> ____________________________
>
>
> count: 100000
> ============================
>            Perl   PDL
> ----------------------------
> make data: 0.0962 0.0791
> calculate: 0.0624 0.0108
>
> make data: 0.0966 0.0811
> calculate: 0.0621 0.0109
>
> make data: 0.0966 0.0789
> calculate: 0.0626 0.0109
> ____________________________
>
>
> count: 1000000
> ============================
>            Perl   PDL
> ----------------------------
> make data: 0.9626 0.8014
> calculate: 0.6269 0.1170
>
> make data: 0.9656 0.8064
> calculate: 0.6275 0.1182
>
> make data: 0.9643 0.8203
> calculate: 0.6275 0.1168
> ____________________________
>
>
> count: 2000000
> ============================
>            Perl   PDL
> ----------------------------
> make data: 1.7542 1.5168
> calculate: 1.2462 0.2381
>
> make data: 1.7519 1.5221
> calculate: 1.2500 0.2391
>
> make data: 1.7517 1.5226
> calculate: 1.2699 0.2394
> ____________________________
>
>
> count: 3000000
> ============================
>            Perl   PDL
> ----------------------------
> make data: 2.5263 2.5722
> calculate: 1.9163 3.2107
>
> make data: 2.5411 2.2062
> calculate: 1.8897 6.9557
>
> make data: 2.5305 2.2822
> calculate: 1.9204 7.2502
> ____________________________
> On Fri, Jul 9, 2010 at 2:32 AM, Craig DeForest
> <[email protected]> wrote:
>> Wow, Puneet really stirred us all up (again). Puneet, as David said, your
>> PDL code is slow because you are using a complicated expression, which
>> forces PDL to create and destroy intermediate PDLs (every binary operation
>> has to have a complete temporary PDL allocated and then freed to store its
>> result!). I attach a variant of your test, with the operation carried out
>> as much in-place as possible to eliminate extra allocations. PDL runs
>> almost exactly a factor of 10 faster on my computer than does raw Perl in
>> this case.
>> Note that the original ingestion of the Perl array to PDL is quite slow: it
>> generally takes slightly longer to create the PDL than to generate the
>> random numbers and create the Perl array in the first place! That is
>> because PDL has to make several passes through the Perl array to determine
>> its size, and then has to individually probe and convert each numeric value
>> in the Perl array.
>>
>> On Jul 9, 2010, at 1:09 AM, David Mertens wrote:
>>
>> FYI, for really thorough timing results, check out Devel::NYTProf:
>> http://search.cpan.org/~timb/Devel-NYTProf-4.03/lib/Devel/NYTProf.pm
>>
>> You have a lot of things going on to mix up the results - you have both a
>> memory allocation and a calculation. As I understand it, Perl will likely
>> outperform PDL in the memory allocation portion of this exercise, but PDL
>> should have Perl's lunch for the calculation portion.
>>
>> Perl will outperform PDL in the memory allocation because in all likelihood,
>> it doesn't perform any allocation with the push. It likely already allocated
>> more than three elements for (all of) its arrays, so pushing the new value
>> on the array does not cost anything, except for a higher up-front memory
>> cost. I suspect this is where PDL is losing to Perl - Perl is performing the
>> allocation ahead of where you start the timer.
>>
>> In terms of the calculation itself, PDL should far outperform Perl. The
>> reason is that the actual contents of the calculation loop are very slim, so
>> all of the Perl stack manipulation accounts for a large share of the
>> loop's cost. Perl for loops usually make sense because the code
>> inside them often involves IO operations or other such things, in
>> which case the Perl stack manipulations comprise only a small portion of the
>> total compute time.
>>
>> Try a situation when Perl and PDL allocate their memory as part of the
>> timing and see what that gives.
>>
>> David
>>
>> --
>> Sent via my carrier pigeon.
>> _______________________________________________
>> Perldl mailing list
>> [email protected]
>> http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
>
> --
> Puneet Kishor http://www.punkish.org
> Carbon Model http://carbonmodel.org
> Charter Member, Open Source Geospatial Foundation http://www.osgeo.org
> Science Commons Fellow, http://sciencecommons.org/about/whoweare/kishor
> Nelson Institute, UW-Madison http://www.nelson.wisc.edu
> -----------------------------------------------------------------------
> Assertions are politics; backing up assertions with evidence is science
> =======================================================================