Hi. I am a beginner to PDL and have been browsing through the docs,
wiki, examples, cookbooks, etc.
I am still confused about some topics.
PDL examples are of two categories: the 2 x 3 matrix examples scattered
all over the doc, and the examples based on some advanced astronomical
image processing.
My first question would be: can PDL be used for generic array
programming for people without astronomy or image processing background?
To seek an answer to this important question, I decide to make a little
experiment on some mundane array problem.
Let's imagine we are an online gaming company that sells time
subscriptions for access to our game site.
Each purchase by a subscriber goes into a payments table inside a DB:
. timestamp (date of payment, UNIX timestamp in seconds)
. providerid (the source of the payment, can be one of 4 values)
. subid (the subscriber who makes the payment)
. time (the amount of time purchased, in days)
For simplicity's sake, all fields are assumed to be of type LONG. We
work on a CSV file extracted from the database.
In this example I am using a sample file which contains 374,540 lines.
The real figure would go in the millions of lines.
* at this point, one question: we can use rasc and rcols to read from
ASCII files, are there any methods to populate piddles directly from a
SQL database?
1. Populating our arrays
Using $PDL::IO::Misc::colsep = "," to handle CSV files.
First attempt:
($ts, $pid, $sub, $time) = rcols ("payments.csv", { perlcols => [4],
DEFTYPE => long }
This create a list of 1D piddles.
Second attempt:
$all = cat(rcols ("payments.csv")
This created a 374540x 4 array of type DOUBLE. Couldn't manage to get
it to create of type LONG:
$all = cat((rcols ("testpay.csv"), { DEFTYPE => long }))
Reading data into piddles of type: [ Double Double Double Double ]
Read in 374540elements.
Hash given as a pdl - but not {PDL} key! at
/usr/lib/perl5/site_perl/5.10/i686-cygwin/PDL/Core.pm line 521.
Any idea what is the proper syntax here?
Third attempt:
$all = zeroes(long, 4,374540)
$all->rasc("payments.csv")
This created a 4 x 374540 array, and was by far the fastest method.
However it seems I need to know the number of rows in advance, to
pre-dimension the array.
2. Doing some simple array operations
The PDL docs explain how matrix operations can be easily written e.g. $a
= $b + $c
In our case we are not going to add or multiply matrices, just trying to
massage our data into something useful.
Example 1: We want to find out how many payments each subscriber has made.
in SQL this would be written as
SELECT subid, count(*) from payments group by subid;
In Perl/PDL here is what I am trying:
#!/usr/bin/perl
use PDL;
$PDL::IO::Misc::colsep = ",";
$rows = 498399; # initialize the row size
$all = zeroes(long, 4, $rows); # create array of LONG
$all->rasc("payments.csv"); # read from CSV file
$sub = $all->slice('2,:'); # Subscribers' column
print "Read ", $sub->nelem(), " subs\n";
%count = (); # Hash of number of purchases for each subscriber
for ($i=0; $i<$rows; $i++) { $count{$all->at(3,$i)}++; } # populate hash
foreach (sort keys %count) { print "$_ made $count{$_} purchases\n"; } #
display it
Some issues I am wondering:
- Do I really have to know the number of rows to dimension my LONG array
before doing an rasc()?
- Is "$sub = $all->slice('2,:')" the proper way to get the third column
of my piddle? Can it be written in a nicer way?
- Is the "at" function the proper way to address an element of the
array? Really?
And the real questions:
- Is PDL suited for this kind of general-purpose matrix operations?
- I used a hash to store the number of purchases per subscribers, would
there be a more elegant PDL way?
Example 2: We want to find out how frequently subscribers re-purchase
In our payments table, we have a 'timestamp' indicating the date of
purchase, and a 'time' that indicates for how long is that purchase
valid for.
We want to find out the average amount of time between a subscription
ends and a new purchase, for each subscriber.
For any subscriber, if we look at his purchases:
first purchase: timestamp0, time0
-> the subscription is valid from timestamp0 to timestamp0+time0*86400
second purchase: timestamp1, time1
-> the amount of time between when the first purchase expires and the
second purchase is therefore: timestamp1 - (timestamp0+time0*86400)
third purchase: timestamp2, time2
-> the amount of time between when the second purchase expires and the
third purchase is therefore: timestamp2 - (timestamp1+time1*86400)
etc.
So let me try to do that with PDL, starting where I left off at the
previous example.
#!/usr/bin/perl
use PDL;
$PDL::IO::Misc::colsep = ",";
$daysec = 86400;
$rows = 374540;
$all = zeroes(long, 4, $rows);
$all->rasc("payments.csv");
$sub = $all->slice('2,:');
%count = ();
for ($i=0; $i<$rows; $i++) { $count{$all->at(2,$i)}++; }
$global = $inc = 0;
foreach (keys %count)
{
next if $count{$_} < 2;
$idx = which $sub == $_;
$nbr = $idx->nelem();
$sum = 0;
for ($i=1; $i<$nbr; $i++)
{
$timestamp1 = $all->at(0,$idx->at($i));
$timestamp0 = $all->at(0,$idx->at($i-1));
$time0 = $all->at(3,$idx->at($i-1));
$delta = $timestamp1 - ($timestamp0 + $time0 * $daysec);
if ($delta < 0) { $delta = 0 };
$sum += $delta;
}
$avg = $sum / ($nbr - 1);
$global += $avg; $inc++;
}
$average = $global / $inc;
print "\nGlobal average repurchase time: ";
printf "%d days, %d hours, %d minutes and %d seconds\n\n",(gmtime
$average)[7,2,1,0];
Question:
- This code works, but is really how things should be done?
- Are constructs such as $timestamp1 = $all->at(0,$idx->at($i)) the
right way to access the piddle's data
- Is the for (..) loop the only way to scan a piddle's elements?
There's got to be MTOWTDI
- This code calculates a simple global average. What if we wanteds to
see a statistical distribution?
I hope you will be able to comment on my poor attempts to understand
the PDL arrays. I'm really interested in your opinions.
Regards
_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl