Re: [Perldl] PDL beginner's questions

Cliff Sobchuk Wed, 30 Sep 2009 09:04:10 -0700

Hi Emmanuel. I am including your questions in order and will attempt to answer 
them.


________SNIP_____________

Some issues I am wondering:

- Do I really have to know the number of rows to dimension my LONG array before 
doing an rasc()?
>> Don't know, but the documentation on this is pretty good.

- Is "$sub = $all->slice('2,:')" the proper way to get the third column of my 
piddle? Can it be written in a nicer way?
>> Check the documentation on "NiceSlice".

- Is the "at" function the proper way to address an element of the array? 
Really?
>> If you are looking for a single value, I believe that is the case. 
>> Alternatively, you can chain multiple slices to get down to the element. I 
>> don't know which is preferred or quicker; I use multiple slices or masks, 
>> depending on the application.
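
To make the slicing and element-access answers above concrete, here is a minimal sketch on a toy piddle. The data is made up and the variable names are only illustrative; it assumes PDL::NiceSlice (which ships with PDL) is loaded in the script. It is not meant as the one right way, just a comparison of the options mentioned.

```perl
use strict;
use warnings;
use PDL;
use PDL::NiceSlice;

# Toy stand-in for the 4 x N payments piddle (made-up data):
# the value at column c, row r is c + 4*r.
my $all = sequence(long, 4, 3);

# Classic string slice: index 2 on dim 0, everything on dim 1 -> dims 1 x 3.
my $sub_plain = $all->slice('2,:');

# Same selection with NiceSlice syntax. Parentheses around the index drop
# that dimension, leaving a plain 1D piddle of length 3.
my $sub_nice = $all->(2,:);
my $sub_flat = $all->((2),:);

# at() returns an ordinary Perl scalar for a single element.
my $elem = $all->at(2, 1);

# The slice route gives a 0-dim piddle; sclr() turns it into a Perl scalar.
my $elem_via_slice = $all->slice('(2),(1)')->sclr;

print "column 2: $sub_flat\n";
print "element at column 2, row 1: ", $elem, " and ", $elem_via_slice, "\n";
```

For a single value, at() is the shortest route; slices pay off when the result stays a piddle and feeds further whole-array operations.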

And the real questions:

- Is PDL suited for this kind of general-purpose matrix operations?
>> Definitely, but I am not sure that your application is a general-purpose 
>> matrix application. General-purpose matrix operations are those where you 
>> perform mathematical operations on entire matrices. The application shown 
>> looks more like database queries and element-wise operations. It doesn't 
>> appear that you are doing anything to entire matrices at once, such as 
>> passing them through a filter, rotating them, or inverting them. PDL will 
>> still do what you need, but I am not sure it is your best option for random 
>> element operations.

- I used a hash to store the number of purchases per subscribers, would there 
be a more elegant PDL way?
>> Hashes are good. I have used loops over 6 columns x 1E6 rows, where the 
>> first three columns were hash keys for location identification, and 
>> converted my piddles back into normal Perl arrays so that the descriptive 
>> statistics package could compute general statistics on the results. 
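
For comparison with the hash approach, here is a hedged sketch of one PDL-only way to get the per-subscriber counts (the equivalent of the SQL GROUP BY): sort the ids so that equal values are adjacent, then run-length encode them with rle. The ids below are made up, and the filtering step is there because some PDL versions pad the rle output with zero-length runs.

```perl
use strict;
use warnings;
use PDL;

# Made-up subscriber ids standing in for the subid column.
my $sub = pdl(7, 3, 7, 5, 3, 7);

# Sort so equal ids are adjacent, then run-length encode.
# rle returns (run lengths, run values).
my ($counts, $ids) = rle($sub->qsort);

# Drop any zero-length padding runs at the end of the output.
my $keep = which($counts > 0);
$counts = $counts->index($keep);
$ids    = $ids->index($keep);

# One line per subscriber, like the hash-based loop prints.
for my $i (0 .. $ids->nelem - 1) {
    printf "%d made %d purchases\n", $ids->at($i), $counts->at($i);
}
```

Whether this beats the hash depends on the data; the hash version is arguably easier to read, while this one keeps the counts in piddles for further array operations.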

Question:

- This code works, but is this really how things should be done?
>> TIMTOWTDI in Perl...

- Are constructs such as $timestamp1 = $all->at(0,$idx->at($i)) the right way 
to access the piddle's data?
>> It seems to work. When I need a specific area I use multiple slices, and 
>> you could use those to get down to a single element as well. Using a mask 
>> may be quicker, as shown in 
>> http://www.johnlapeyre.com/pdl/pdldoc/newbook/node3.html

- Is the for (..) loop the only way to scan a piddle's elements? There's got to 
be MTOWTDI
>> Matrix operations work on the entire matrix. Using a mask (see above) may 
>> provide what you want, though it may not be appropriate for your application. 

- This code calculates a simple global average. What if we wanted to see a 
statistical distribution?
>> Use the descriptive statistics package. This requires you to convert between 
>> piddles and normal arrays, though. A stats package was also recently 
>> released for PDL. 
>> Computing an average is simply sum($a)/nelem($a).
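
As a rough illustration of the mask idea and of the average and distribution mentioned above, the following sketch computes the repurchase gaps for one subscriber without an element-by-element loop. The rows are made up, they are assumed to be already sorted by timestamp, and the subscriber is assumed to have at least two purchases; the variable names are illustrative only.

```perl
use strict;
use warnings;
use PDL;

my $daysec = 86400;

# Made-up rows in the order timestamp, providerid, subid, time (days),
# already sorted by timestamp. Each inner list is one row, so dims are 4 x 4.
my $all = pdl([
    [      0, 1, 42, 1 ],
    [  50000, 2,  7, 1 ],
    [ 100000, 1, 42, 2 ],
    [ 300000, 1, 42, 1 ],
]);

my $ts   = $all->slice('(0),:');   # timestamps
my $sub  = $all->slice('(2),:');   # subscriber ids
my $days = $all->slice('(3),:');   # days purchased

# Mask: row indices belonging to subscriber 42.
my $idx = which($sub == 42);
my $t   = $ts->index($idx);
my $d   = $days->index($idx);

# Gap between each purchase and the expiry of the previous one, computed
# for all consecutive pairs at once; negative gaps are raised to 0.
my $gap = $t->slice('1:-1') - ($t->slice('0:-2') + $d->slice('0:-2') * $daysec);
$gap = $gap->lclip(0);

# Mean: the sum divided by the element count (avg() gives the same value).
my $mean = $gap->sum / $gap->nelem;

# Distribution: histogram of the gaps in one-day bins from 0 to 2 days.
my ($centres, $counts) = hist($gap / $daysec, 0, 2, 1);

print "gaps (seconds): $gap\n";
print "mean gap: $mean seconds\n";
print "bin centres (days): $centres\n";
print "counts per bin: $counts\n";
```

The outer loop over subscribers is left out on purpose; this only shows how the inner loop over one subscriber's purchases can be replaced by slices.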

_________SNIP___________

Hope I answered some of them for you.


Cliff Sobchuk esn 361-8169, 403-262-4010 ext: 361-8169
Fax: 403-262-4010 ext: 361-8170
Nortel Core RF Field Support: All information is Nortel confidential.

-----Original Message-----
From: Emmanuel [mailto:[email protected]] 
Sent: September 29, 2009 10:23 PM
To: [email protected]
Subject: [Perldl] PDL beginner's questions

Hi. I am a beginner to PDL and have been browsing through the docs, wiki, 
examples, cookbooks, etc.
I am still confused about some topics.

PDL examples are of two categories: the 2 x 3 matrix examples scattered all 
over the doc, and the examples based on some advanced astronomical image 
processing.

My first question would be: can PDL be used for generic array programming for 
people without astronomy or image processing background?

To seek an answer to this important question, I decided to run a little 
experiment on a mundane array problem.

Let's imagine we are an online gaming company that sells time subscriptions for 
access to our game site.
Each purchase by a subscriber goes into a payments table inside a DB:

. timestamp (date of payment, UNIX timestamp in seconds)
. providerid (the source of the payment, can be one of 4 values)
. subid (the subscriber who makes the payment)
. time (the amount of time purchased, in days)

For simplicity's sake, all fields are assumed to be of type LONG. We work on a 
CSV file extracted from the database.
In this example I am using a sample file which contains 374,540 lines.
The real figure would go in the millions of lines.

* At this point, one question: we can use rasc and rcols to read from ASCII 
files, but are there any methods to populate piddles directly from a SQL database?

1. Populating our arrays

Using $PDL::IO::Misc::colsep = "," to handle CSV files.

First attempt:

($ts, $pid, $sub, $time) = rcols ("payments.csv", { perlcols => [4], DEFTYPE => long })
This creates a list of 1D piddles.

Second attempt:

$all = cat(rcols ("payments.csv"))
This created a 374540 x 4 array of type DOUBLE. I couldn't manage to get it to 
create one of type LONG:

$all = cat((rcols ("testpay.csv"), { DEFTYPE => long }))
Reading data into piddles of type: [ Double Double Double Double ]
Read in  374540 elements.
Hash given as a pdl - but not {PDL} key! at 
/usr/lib/perl5/site_perl/5.10/i686-cygwin/PDL/Core.pm line 521.

Any idea what is the proper syntax here?

Third attempt:

$all = zeroes(long, 4,374540)
$all->rasc("payments.csv")
This created a 4 x 374540 array, and was by far the fastest method.
However it seems I need to know the number of rows in advance, to pre-dimension 
the array.

2. Doing some simple array operations

The PDL docs explain how matrix operations can be easily written, e.g. 
$a = $b + $c
In our case we are not going to add or multiply matrices, just trying to 
massage our data into something useful.

Example 1: We want to find out how many payments each subscriber has made.

in SQL this would be written as
SELECT subid, count(*) from payments group by subid;

In Perl/PDL here is what I am trying:

#!/usr/bin/perl
use PDL;
$PDL::IO::Misc::colsep = ",";

$rows = 374540;                   # initialize the row size (lines in the sample file)
$all = zeroes(long, 4, $rows); # create array of LONG 
$all->rasc("payments.csv");    # read from CSV file
$sub = $all->slice('2,:');         # Subscribers' column
print "Read ", $sub->nelem(), " subs\n";
%count = ();                   # Hash of number of purchases for each subscriber

for ($i=0; $i<$rows; $i++) { $count{$all->at(2,$i)}++; } # populate hash (subid is column 2)
foreach (sort keys %count) { print "$_ made $count{$_} purchases\n"; } # display it

Some issues I am wondering:

- Do I really have to know the number of rows to dimension my LONG array before 
doing an rasc()?
- Is "$sub = $all->slice('2,:')" the proper way to get the third column of my 
piddle? Can it be written in a nicer way?
- Is the "at" function the proper way to address an element of the array? 
Really?

And the real questions:

- Is PDL suited for this kind of general-purpose matrix operations?
- I used a hash to store the number of purchases per subscribers, would there 
be a more elegant PDL way?

Example 2: We want to find out how frequently subscribers re-purchase

In our payments table, we have a 'timestamp' indicating the date of purchase, 
and a 'time' that indicates how long that purchase is valid for.
We want to find out the average amount of time between the end of a 
subscription and a new purchase, for each subscriber.

For any subscriber, if we look at his purchases:
first purchase: timestamp0, time0
-> the subscription is valid from timestamp0 to timestamp0+time0*86400
second purchase: timestamp1, time1
-> the amount of time between when the first purchase expires and the
second purchase is therefore: timestamp1 - (timestamp0+time0*86400)
third purchase: timestamp2, time2
-> the amount of time between when the second purchase expires and the
third purchase is therefore: timestamp2 - (timestamp1+time1*86400)
etc.

So let me try to do that with PDL, starting where I left off at the previous 
example.

#!/usr/bin/perl
use PDL;
$PDL::IO::Misc::colsep = ",";

$daysec = 86400;
$rows = 374540;
$all = zeroes(long, 4, $rows);
$all->rasc("payments.csv");
$sub = $all->slice('2,:');
%count = ();

for ($i=0; $i<$rows; $i++) { $count{$all->at(2,$i)}++; }

$global = $inc = 0;

foreach (keys %count)
{
       next if $count{$_} < 2;
       $idx = which $sub == $_;
       $nbr = $idx->nelem();
       $sum = 0;
       for ($i=1; $i<$nbr; $i++)
       {
               $timestamp1 = $all->at(0,$idx->at($i));
               $timestamp0 = $all->at(0,$idx->at($i-1));
               $time0 = $all->at(3,$idx->at($i-1));

               $delta = $timestamp1 - ($timestamp0 + $time0 * $daysec);
               if ($delta < 0) { $delta = 0 };
               $sum += $delta;

       }
       $avg = $sum / ($nbr - 1);
       $global += $avg; $inc++;

}
$average = $global / $inc;
print "\nGlobal average repurchase time: ";
printf "%d days, %d hours, %d minutes and %d seconds\n\n", (gmtime $average)[7,2,1,0];

Question:

- This code works, but is this really how things should be done?
- Are constructs such as $timestamp1 = $all->at(0,$idx->at($i)) the right way 
to access the piddle's data?
- Is the for (..) loop the only way to scan a piddle's elements? There's got to 
be MTOWTDI
- This code calculates a simple global average. What if we wanted to see a 
statistical distribution?


I hope you will be able to comment on my poor attempts to understand the PDL 
arrays. I'm really interested in your opinions.

Regards

_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl