Hi David,

Thanks for your great answer - I appreciate the time you spent looking
into this.

On Wed, Sep 30, 2009 at 10:43 PM, David Mertens <[email protected]> wrote:
> Welcome to PDL!  As you have surmised, it is a language that is dominated by
> the sciences; it was originally created by an astronomer (or
> astrophysicist?), so that's why anything complicated deals with image
> processing.

I read it is going to be included as a new Perl6 datatype so I was
hoping that its applications would not be limited to a single domain.
Perl "arrays" have always been lacking so it is a welcome addition.

> I am also something of a beginner and I would like to see the
> introductory documentation improved, but as with all things it takes time.
> I can't address all your questions, but I can point you in the direction of
> some useful basics that you seem to have missed (which given the
> organization of the documentation, is not surprising).

There seems to be a lot of documentation, but there are not many
examples of practical code. The wiki has very limited information too.

> PDL is meant to handle these sorts of number-crunching problems, but it
> really works best with vector-oriented calculations, which I discuss near
> the end of my response.  I like that you're posing a business-oriented
> question, and it's a shame that such questions are unusual on this mailing
> list.  Keep them coming!

This is why I posted this question. All my searches led to examples of
image-processing or galaxy-crunching applications. While it is great
that astronomers get proper tools to handle their data, a lot of us
would be happy with better arrays in Perl. It seems indeed that PDL is
focused on vector-oriented calculations. What I liked about it is its
ability to handle large data structures in memory, something that Perl
does quite poorly.

> I know nothing about databases, but thankfully Doug already answered this
> part of your question.

Indeed he did, and it's really simple.

>> ($ts, $pid, $sub, $time) = rcols ("payments.csv", { perlcols => [4],
>> DEFTYPE => long })
>> This creates a list of 1D piddles.
>
> Did this work for you?  It did not work for me, at least not as expected.

It did work as expected, maybe because of the perlcols => [4] that you
did not include?

>> $all = cat(rcols ("payments.csv"))
>> This created a 374540 x 4 array of type DOUBLE. Couldn't manage to get
>> it to create an array of type LONG:
>
> Are you sure about your dimensions?  I got a 477 x 1 array (my file is 477
> lines long) because it's reading only the first column in for me.  I think I
> get what you want if I specify the columns:
>
>> $all = cat(rcols ("payments.csv", 0,1,2,3))
> Reading data into piddles of type: [ Double Double Double Double ]
> Read in 477 elements.
>
> Looks right to me.

Will check the dimensions as soon as I am back at my workstation (am
traveling at the moment).

>> $all = cat(rcols ("payments.csv", {DEFTYPE => long}, 0,1,2,3))
> Reading data into piddles of type: [ Long Long Long Long ]
> Read in 477 elements.
>
> Looks right to me.  However, you should probably stick with the first
> attempt anyway, since the data really are distinct and should be treated as
> such.

Right, this is an important point. My data is really 4 vectors of
different unrelated data.

> So this is how you use rasc... I never could figure it out, but as you say,
> you must know the size ahead of time.

The speed difference is huge - rasc was almost 10 times faster. The
number of lines can be easily obtained with a call to "wc -l" so it
might be worth it.

> In case you haven't come across it yet, there's an important difference
> between the '=' operator and the '.=' operator.  You'll probably appreciate
> it if you ever try to assign values to a slice, but I won't go into it now.
> It's adequately discussed in the docs.

I've noticed the discussion about it and the 'dataflow' concepts, which
might be very useful in some applications.

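A minimal sketch of that '=' versus '.=' difference, assuming PDL is
installed. The piddle and variable names are made up for illustration
and do not come from the thread:

```perl
use strict;
use warnings;
use PDL;

# Illustrative data only: a 6-element piddle [0 1 2 3 4 5].
my $data = sequence(6);

# A slice is a virtual piddle that shares memory with its parent.
my $slice = $data->slice('1:3');

# '.=' copies values INTO the slice, so the parent changes too.
$slice .= 0;
# $data is now [0 0 0 0 4 5]

# '=' merely rebinds the Perl variable to a brand-new piddle;
# the parent is left untouched.
my $other = $data->slice('4:5');
$other = pdl(9, 9);
# $data is still [0 0 0 0 4 5]

print "$data\n";
```

So when the goal is to change values inside an existing piddle, '.=' on
a slice is the form to reach for; plain '=' only changes what the Perl
variable points at.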
>> $sub = $all->slice('2,:'); # Subscribers' column
>
> First of all, this slicing operation is only really necessary when you load
> all the data into a single array.  You don't have to do that, and probably
> shouldn't

Noted.

>> # This shouldn't be two-dimensional...
>> $sub = $sub->reshape();
>> p $sub->dims;
> 477

I like that, was looking for it!

> This is made possible by NiceSlice, a source filter.  Read the documentation
> for it.  If you're impatient, jump to section 5 and read the sections
> "Parentheses following a scalar variable" and then jump to "The argument
> list" and read until you get bored or hit section 6.  Then read about the
> default method invocation, just to be sure you're aware of it.

Ok. What a steep learning curve for someone who just wants to do
generic array operations!

> Finally, read up on the 'where' and 'which' functions, which are alluded to
> in the NiceSlice documentation, and which we will use shortly.  You can find
> a good summary by typing 'help where' and 'help which' at the PDL prompt, or
> reading the PDL::Primitive documentation (which will tell you about a whole
> bunch of other useful functions).

I did see those functions (and you can see I used 'which' in the code I
posted) and they are amazingly useful functions. I'll have to do some
performance tests to see how they compare to other methods, but my
guess would be that they're very efficient.

> Also consider using dims() (or getdims) and dim (or getdim), which are a bit
> more precise.

Noted.

>> %count = (); # Hash of number of purchases for each subscriber
>>
>> for ($i=0; $i<$rows; $i++) { $count{$all->at(3,$i)}++; } # populate hash
>
> TYPO?  I used 'at($i,2)' to get this to work.

No typo, this is how it worked for me. at(3,$i) means (in my
understanding) "at column 3, row $i". The subscriber ID is in column 3
in my data.

> This is definitely the Perl way to go about doing this.  I don't think there
> is a PDL way to do this, however.
> We discussed a related question on the mailing list a couple of months ago
> entitled "Compute a distribution function from irregular data", which
> suggests to me a different approach:
>
>> $uniq_subs = $sub->uniq();
>> for($i = 0; $i < $uniq_subs->dim(0); $i++) {$count{$uniq_subs($i;-)} =
>> $sub->where($sub == $uniq_subs($i))->dim(0); }
>
> but this appears to be MUCH slower than your technique.  Generally
> speaking, however, you should avoid explicit for loops along large
> dimensions unless it's absolutely necessary.  For this problem, I think it
> is actually necessary.

For many array operations that I work with (the mundane type, e.g. data
mining, log analysis, accounting, transaction logs, etc.) it is
necessary to scan the entire dimension.

>> - Do I really have to know the number of rows to dimension my LONG array
>>   before doing an rasc()?
>> - Is "$sub = $all->slice('2,:')" the proper way to get the third column of
>>   my piddle? Can it be written in a nicer way?
>> - Is the "at" function the proper way to address an element of the array?
>>   Really?
>
> First question - I think so.  I'm not sure, but you can dig through the
> source code to check.

Good point.

> Second and Third questions: already answered - use NiceSlice.

Ok.

>> Example 2: We want to find out how frequently subscribers re-purchase
>>
>> - This code works, but is this really how things should be done?
> No.  When working with PDL, if you ever feel tempted to write a for-loop,
> you should be on guard.  With vectorized languages, such as PDL, Matlab,
> IDL, Octave, etc, you want to express as much as you can with vector
> operations rather than element-by-element calculations.

Right. I'm thinking too much like a database programmer I guess.

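To make the loop-versus-vector point concrete, here is a small sketch,
assuming PDL is installed. The timestamps are invented sample values,
and it uses plain slice() strings so that the NiceSlice source filter
is not required:

```perl
use strict;
use warnings;
use PDL;

# Hypothetical purchase timestamps (in seconds) for one subscriber,
# already sorted in ascending order.
my $times = pdl(100, 160, 400, 1000);

# Element-by-element version: one at() call per element, all driven
# from a Perl-level loop.
my @gaps_loop;
for my $i (1 .. $times->nelem - 1) {
    push @gaps_loop, $times->at($i) - $times->at($i - 1);
}

# Vector version: subtract two shifted slices in a single operation.
# slice('1:-1') drops the first element, slice('0:-2') drops the last.
my $gaps = $times->slice('1:-1') - $times->slice('0:-2');

print "loop:        @gaps_loop\n";
print "vector:      $gaps\n";
print "average gap: ", $gaps->avg, "\n";
```

Both versions compute the same gaps (60, 240 and 600 here); the vector
form simply hands the whole subtraction to PDL instead of iterating in
Perl.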
> So for this problem, here's how I would do it:
>
> %avg_lapses = ();
>
> foreach (keys %count) # can't get rid of this for loop, unfortunately
> {
>     next if $count{$_} < 2;
>     ($purchase_times, $purchase_durations) = where ($tx, $time, $sub,
>         $sub == $_);
>     $purchase_count = $purchase_times->nelem();
>
>     $start_times = $purchase_times(0:-2);
>     $start_durations = $purchase_durations(0:-2);
>     $next_times = $purchase_times(1:-1);
>
>     $lapses = $next_times - $start_times + $start_durations * $daysec;
>     $avg_lapses{$_} = $lapses->avg;
> }

This is neat. I will try it asap!

> Now %avg_lapses contains the average lapse time per subscriber.  This is
> both more concise and much faster, since the differencing operation is
> computed using C-code.  Also, though I could be wrong about this, it doesn't
> even require more memory since the $start... and $next... piddles are
> virtual piddles.  If you want the global average, you can insert an
> accumulator in the foreach loop, and divide that by the number of entries in
> the %count hash.

Very nice, and it seems a much better way to use piddles. This is
exactly the kind of comment I was looking for... thanks!

>> - Are constructs such as $timestamp1 = $all->at(0,$idx->at($i)) the right
>>   way to access the piddle's data
>
> Yes.  You can also use NiceSlice.  Note that if you need to change a value,
> you can use slicing together with the '.=' assignment operator, not the '='
> operator, which won't do what you want.

Ok.

> With vectorized languages, you try to construct everything so that you DON'T
> need to scan every element of a piddle.  If you must do that, yes, a for
> loop is pretty much the only way to go.  For speed, you could always write a
> routine in PDL::PP, but I don't know how to do that yet.

I had a quick look in PDL::PP and got dizzy ;)

> Like a histogram?  PDL can do that, try 'help histogram' for starters.
> There are some basic statistics routines available using $piddle->stats and
> there's been a recent contribution of a much more advanced statistics
> library called PDL::Stats.

Histograms! Yes that's what I wanted. It looks promising, I will
experiment and post back my code.

> I hope that helps!  Post back with more questions.  It's got an annoying
> learning curve and it's easy to miss seemingly basic stuff, but keep at it.

It did help tremendously, once again I really appreciate your kind
answer. I will post back some questions once I test more code, in a few
days.

Regards
Emmanuel
_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
