Hi Emmanuel. I am including your questions in order and will attempt to answer
them.
________SNIP_____________
Some issues I am wondering:
- Do I really have to know the number of rows to dimension my LONG array before
doing an rasc()?
>> Don't know, but the documentation on this is pretty good.
- Is "$sub = $all->slice('2,:')" the proper way to get the third column of my
piddle? Can it be written in a nicer way?
>> Check the documentation on "NiceSlice".
- Is the "at" function the proper way to address an element of the array?
Really?
>> if you are looking for a single value - I believe that is the case.
>> Alternatively you can create multiple slices to get to the element. I don't
>> know which is preferred or quicker - I use multiple slices or masks -
>> depending on the application.
And the real questions:
- Is PDL suited for this kind of general-purpose matrix operations?
>> definitely - but, I am not sure that your application is a general purpose
>> matrix application. General purpose matrix operations are when you perform
>> mathematical operations on entire matrices. It seems that the application
>> shown is more database query and element wise operations. It doesn't appear
>> that you are actually doing anything to entire matrices at once - like
>> passing them through a filter, rotating them, inverting them. PDL will still
>> do these things, but I am not sure if it is your best option for random
>> element operations.
- I used a hash to store the number of purchases per subscribers, would there
be a more elegant PDL way?
>> Hashes are good. I have used loops for 6 columns x 1E6 rows where the first
>> three columns included hashes for location identification when using the
>> descriptive statistics package (normal perl) to convert my piddles back in
>> to arrays to perform general statistics on the results.
Question:
- This code works, but is really how things should be done?
>> TIMTOWTDT in perl...
- Are constructs such as $timestamp1 = $all->at(0,$idx->at($i)) the right way
to access the piddle's data
>> seems that it works. When I do need a specific area I use multiple slices.
>> You could use that also to get down to the element. Using a mask may be
>> quicker as shown in the
>> http://www.johnlapeyre.com/pdl/pdldoc/newbook/node3.html
- Is the for (..) loop the only way to scan a piddle's elements?
>> Matrix operations work on the entire matrix. Using a mask may provide what
>> you want (above). It may not be appropriate for your application.
There's got to be MTOWTDI
- This code calculates a simple global average. What if we wanteds to see a
statistical distribution?
>> use the descriptive stats package. This requires you to convert between
>> piddles and normal arrays though. There is a stats package that was recently
>> released for pdl as well.
>> performing an average is simply $a/nelem($a).
_________SNIP___________
Hope I answered some of them for you.
Cliff Sobchuk esn 361-8169, 403-262-4010 ext: 361-8169
Fax: 403-262-4010 ext: 361-8170
Nortel Core RF Field Support: All information is Nortel confidential.
-----Original Message-----
From: Emmanuel [mailto:[email protected]]
Sent: September 29, 2009 10:23 PM
To: [email protected]
Subject: [Perldl] PDL beginner's questions
Hi. I am a beginner to PDL and have been browsing through the docs, wiki,
examples, cookbooks, etc.
I am still confused about some topics.
PDL examples are of two categories: the 2 x 3 matrix examples scattered all
over the doc, and the examples based on some advanced astronomical image
processing.
My first question would be: can PDL be used for generic array programming for
people without astronomy or image processing background?
To seek an answer to this important question, I decide to make a little
experiment on some mundane array problem.
Let's imagine we are an online gaming company that sells time subscriptions for
access to our game site.
Each purchase by a subscriber goes into a payments table inside a DB:
. timestamp (date of payment, UNIX timestamp in seconds) . providerid (the
source of the payment, can be one of 4 values) . subid (the subscriber who
makes the payment) . time (the amount of time purchased, in days)
For simplicity's sake, all fields are assumed to be of type LONG. We work on a
CSV file extracted from the database.
In this example I am using a sample file which contains 374,540 lines.
The real figure would go in the millions of lines.
* at this point, one question: we can use rasc and rcols to read from ASCII
files, are there any methods to populate piddles directly from a SQL database?
1. Populating our arrays
Using $PDL::IO::Misc::colsep = "," to handle CSV files.
First attempt:
($ts, $pid, $sub, $time) = rcols ("payments.csv", { perlcols => [4], DEFTYPE =>
long } This create a list of 1D piddles.
Second attempt:
$all = cat(rcols ("payments.csv")
This created a 374540x 4 array of type DOUBLE. Couldn't manage to get it to
create of type LONG:
$all = cat((rcols ("testpay.csv"), { DEFTYPE => long })) Reading data into
piddles of type: [ Double Double Double Double ] Read in 374540elements.
Hash given as a pdl - but not {PDL} key! at
/usr/lib/perl5/site_perl/5.10/i686-cygwin/PDL/Core.pm line 521.
Any idea what is the proper syntax here?
Third attempt:
$all = zeroes(long, 4,374540)
$all->rasc("payments.csv")
This created a 4 x 374540 array, and was by far the fastest method.
However it seems I need to know the number of rows in advance, to pre-dimension
the array.
2. Doing some simple array operations
The PDL docs explain how matrix operations can be easily written e.g. $a = $b +
$c In our case we are not going to add or multiply matrices, just trying to
massage our data into something useful.
Example 1: We want to find out how many payments each subscriber has made.
in SQL this would be written as
SELECT subid, count(*) from payments group by subid;
In Perl/PDL here is what I am trying:
#!/usr/bin/perl
use PDL;
$PDL::IO::Misc::colsep = ",";
$rows = 498399; # initialize the row size
$all = zeroes(long, 4, $rows); # create array of LONG
$all->rasc("payments.csv"); # read from CSV file
$sub = $all->slice('2,:'); # Subscribers' column
print "Read ", $sub->nelem(), " subs\n"; %count = (); # Hash of number of
purchases for each subscriber
for ($i=0; $i<$rows; $i++) { $count{$all->at(3,$i)}++; } # populate hash
foreach (sort keys %count) { print "$_ made $count{$_} purchases\n"; } #
display it
Some issues I am wondering:
- Do I really have to know the number of rows to dimension my LONG array before
doing an rasc()?
- Is "$sub = $all->slice('2,:')" the proper way to get the third column of my
piddle? Can it be written in a nicer way?
- Is the "at" function the proper way to address an element of the array?
Really?
And the real questions:
- Is PDL suited for this kind of general-purpose matrix operations?
- I used a hash to store the number of purchases per subscribers, would there
be a more elegant PDL way?
Example 2: We want to find out how frequently subscribers re-purchase
In our payments table, we have a 'timestamp' indicating the date of purchase,
and a 'time' that indicates for how long is that purchase valid for.
We want to find out the average amount of time between a subscription ends and
a new purchase, for each subscriber.
For any subscriber, if we look at his purchases:
first purchase: timestamp0, time0
-> the subscription is valid from timestamp0 to timestamp0+time0*86400
second purchase: timestamp1, time1
-> the amount of time between when the first purchase expires and the
second purchase is therefore: timestamp1 - (timestamp0+time0*86400) third
purchase: timestamp2, time2
-> the amount of time between when the second purchase expires and the
third purchase is therefore: timestamp2 - (timestamp1+time1*86400) etc.
So let me try to do that with PDL, starting where I left off at the previous
example.
#!/usr/bin/perl
use PDL;
$PDL::IO::Misc::colsep = ",";
$daysec = 86400;
$rows = 374540;
$all = zeroes(long, 4, $rows);
$all->rasc("payments.csv");
$sub = $all->slice('2,:');
%count = ();
for ($i=0; $i<$rows; $i++) { $count{$all->at(2,$i)}++; }
$global = $inc = 0;
foreach (keys %count)
{
next if $count{$_} < 2;
$idx = which $sub == $_;
$nbr = $idx->nelem();
$sum = 0;
for ($i=1; $i<$nbr; $i++)
{
$timestamp1 = $all->at(0,$idx->at($i));
$timestamp0 = $all->at(0,$idx->at($i-1));
$time0 = $all->at(3,$idx->at($i-1));
$delta = $timestamp1 - ($timestamp0 + $time0 * $daysec);
if ($delta < 0) { $delta = 0 };
$sum += $delta;
}
$avg = $sum / ($nbr - 1);
$global += $avg; $inc++;
}
$average = $global / $inc;
print "\nGlobal average repurchase time: "; printf "%d days, %d hours, %d
minutes and %d seconds\n\n",(gmtime $average)[7,2,1,0];
Question:
- This code works, but is really how things should be done?
- Are constructs such as $timestamp1 = $all->at(0,$idx->at($i)) the right way
to access the piddle's data
- Is the for (..) loop the only way to scan a piddle's elements?
There's got to be MTOWTDI
- This code calculates a simple global average. What if we wanteds to see a
statistical distribution?
I hope you will be able to comment on my poor attempts to understand the PDL
arrays. I'm really interested in your opinions.
Regards
_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl