[Perldl] PDL beginner's questions

Emmanuel Tue, 29 Sep 2009 21:23:37 -0700

Hi. I am a beginner to PDL and have been browsing through the docs,
wiki, examples, cookbooks, etc.
I am still confused about some topics.


PDL examples are of two categories: the 2 x 3 matrix examples scattered
all over the doc, and the examples based on some advanced astronomical
image processing.

My first question would be: can PDL be used for generic array
programming for people without astronomy or image processing background?

To seek an answer to this important question, I decide to make a little
experiment on some mundane array problem.

Let's imagine we are an online gaming company that sells time
subscriptions for access to our game site.
Each purchase by a subscriber goes into a payments table inside a DB:

. timestamp (date of payment, UNIX timestamp in seconds)
. providerid (the source of the payment, can be one of 4 values)
. subid (the subscriber who makes the payment)
. time (the amount of time purchased, in days)

For simplicity's sake, all fields are assumed to be of type LONG. We
work on a CSV file extracted from the database.
In this example I am using a sample file which contains 374,540 lines.
The real figure would go in the millions of lines.

* at this point, one question: we can use rasc and rcols to read from
ASCII files, are there any methods to populate piddles directly from a
SQL database?

1. Populating our arrays

Using $PDL::IO::Misc::colsep = "," to handle CSV files.

First attempt:

($ts, $pid, $sub, $time) = rcols ("payments.csv", { perlcols => [4],
DEFTYPE => long }
This create a list of 1D piddles.

Second attempt:

$all = cat(rcols ("payments.csv")
This created a 374540x 4 array of type DOUBLE. Couldn't manage to get
it to create of type LONG:

$all = cat((rcols ("testpay.csv"), { DEFTYPE => long }))
Reading data into piddles of type: [ Double Double Double Double ]
Read in  374540elements.
Hash given as a pdl - but not {PDL} key! at
/usr/lib/perl5/site_perl/5.10/i686-cygwin/PDL/Core.pm line 521.

Any idea what is the proper syntax here?

Third attempt:

$all = zeroes(long, 4,374540)
$all->rasc("payments.csv")
This created a 4 x 374540 array, and was by far the fastest method.
However it seems I need to know the number of rows in advance, to
pre-dimension the array.

2. Doing some simple array operations

The PDL docs explain how matrix operations can be easily written e.g. $a
= $b + $c
In our case we are not going to add or multiply matrices, just trying to
massage our data into something useful.

Example 1: We want to find out how many payments each subscriber has made.

in SQL this would be written as
SELECT subid, count(*) from payments group by subid;

In Perl/PDL here is what I am trying:

#!/usr/bin/perl
use PDL;
$PDL::IO::Misc::colsep = ",";

$rows = 498399;                   # initialize the row size
$all = zeroes(long, 4, $rows); # create array of LONG
$all->rasc("payments.csv");    # read from CSV file
$sub = $all->slice('2,:');         # Subscribers' column
print "Read ", $sub->nelem(), " subs\n";
%count = (); # Hash of number of purchases for each subscriber

for ($i=0; $i<$rows; $i++) { $count{$all->at(3,$i)}++; } # populate hash
foreach (sort keys %count) { print "$_ made $count{$_} purchases\n"; } #
display it

Some issues I am wondering:

- Do I really have to know the number of rows to dimension my LONG array
before doing an rasc()?
- Is "$sub = $all->slice('2,:')" the proper way to get the third column
of my piddle? Can it be written in a nicer way?
- Is the "at" function the proper way to address an element of the
array? Really?

And the real questions:

- Is PDL suited for this kind of general-purpose matrix operations?
- I used a hash to store the number of purchases per subscribers, would
there be a more elegant PDL way?

Example 2: We want to find out how frequently subscribers re-purchase

In our payments table, we have a 'timestamp' indicating the date of
purchase, and a 'time' that indicates for how long is that purchase
valid for.
We want to find out the average amount of time between a subscription
ends and a new purchase, for each subscriber.

For any subscriber, if we look at his purchases:
first purchase: timestamp0, time0
-> the subscription is valid from timestamp0 to timestamp0+time0*86400
second purchase: timestamp1, time1
-> the amount of time between when the first purchase expires and the
second purchase is therefore: timestamp1 - (timestamp0+time0*86400)
third purchase: timestamp2, time2
-> the amount of time between when the second purchase expires and the
third purchase is therefore: timestamp2 - (timestamp1+time1*86400)
etc.

So let me try to do that with PDL, starting where I left off at the
previous example.

#!/usr/bin/perl
use PDL;
$PDL::IO::Misc::colsep = ",";

$daysec = 86400;
$rows = 374540;
$all = zeroes(long, 4, $rows);
$all->rasc("payments.csv");
$sub = $all->slice('2,:');
%count = ();

for ($i=0; $i<$rows; $i++) { $count{$all->at(2,$i)}++; }

$global = $inc = 0;

foreach (keys %count)
{
       next if $count{$_} < 2;
       $idx = which $sub == $_;
       $nbr = $idx->nelem();
       $sum = 0;
       for ($i=1; $i<$nbr; $i++)
       {
               $timestamp1 = $all->at(0,$idx->at($i));
               $timestamp0 = $all->at(0,$idx->at($i-1));
               $time0 = $all->at(3,$idx->at($i-1));

               $delta = $timestamp1 - ($timestamp0 + $time0 * $daysec);
               if ($delta < 0) { $delta = 0 };
               $sum += $delta;

       }
       $avg = $sum / ($nbr - 1);
       $global += $avg; $inc++;

}
$average = $global / $inc;
print "\nGlobal average repurchase time: ";
printf "%d days, %d hours, %d minutes and %d seconds\n\n",(gmtime
$average)[7,2,1,0];

Question:

- This code works, but is really how things should be done?
- Are constructs such as $timestamp1 = $all->at(0,$idx->at($i)) the
right way to access the piddle's data
- Is the for (..) loop the only way to scan a piddle's elements?
There's got to be MTOWTDI
- This code calculates a simple global average. What if we wanteds to
see a statistical distribution?


I hope you will be able to comment on my poor attempts to understand
the PDL arrays. I'm really interested in your opinions.

Regards

_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl

[Perldl] PDL beginner's questions

Reply via email to