Generating these kinds of summary statistics is very common in SAS. There you would set up a sequence of procedures: first sort by the variables of interest, then calculate the metrics of interest for each combination of values. In numpy/scipy this might be something like:
1. Sort by date and region
2. Determine the first and last index of each block
3. Calculate mean, sum, etc. for each of the blocks

Based on the earlier arguments, however, I am wondering whether this would provide any speedup. I am very interested in this issue, so if you implement some general procedures and perhaps speed tests, please share them with the list. (A rough sketch of these three steps is appended at the end of this message.)

Best,
Vincent

On Jul 31, 10:00 am, "Geoffrey Zhu" <[EMAIL PROTECTED]> wrote:
> Hi Timothy,
>
> On 7/31/07, Timothy Hochberg <[EMAIL PROTECTED]> wrote:
>
> > [SNIP]
>
> > > The 'brute-force' way is basically what you suggested -- looping
> > > through all the records and building a two-way hash-table of the data.
>
> > > The problem of the 'brute-force' approach is that it does not take
> > > advantage of the facilities of numpy and can be slow. If only
> > > there were some built-in mechanism in numpy to handle this.
>
> > I'm curious; have you tried this and found it slow, or is this a hunch based
> > on the reputed slowness of Python? Algorithmically, you can't do much
> > better: the dictionary and set operations are O(1), so the whole operation
> > is O(N), and you won't do any better than that, order-wise. What you're left
> > with is trying to reduce constant factors.
>
> I agree that algorithmically you can't do much better. It is basically
> a C vs. Python thing. One advantage of numpy is that you can do
> vectorized operations at the speed of C. With this algorithm, we have
> to process the data element by element and the speed advantage of
> numpy is lost. Since the data have to be stored in Python sets and maps, I
> imagine the storage advantage is also lost.
>
> I was in fact looking for some implementation of this algorithm in numpy
> (and so C) that does exactly this, or some implementation of this
> algorithm that can leverage the fast numpy routines to do it.
>
> I haven't tried it with the real data load yet. I know the number of
> records will be huge and it is just a hunch that it will be slow.
>
> > There are various ways one might go about reducing constant factors, but
> > they depend on the details of the problem. For example, if the dates are
> > dense and you are going to parse them anyway, you could replace the hash
> > with a table that you index into with the date as an integer. I'm not sure
> > that you are going to do a lot better than the brute-force algorithm in the
> > generic case though.
>
> Unfortunately it has to be something generic.
>
> Thanks a lot for your help.
> Geoffrey
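
For what it's worth, here is a minimal sketch of steps 1-3 in numpy. It is only an illustration under assumptions I am making up for the example: the date and region keys are already integer codes, and the function name grouped_means is hypothetical. A real implementation would need timing against the brute-force dictionary approach before claiming any speedup.

    import numpy as np

    def grouped_means(dates, regions, values):
        # Sketch only: assumes integer-coded keys and non-empty inputs.
        # 1. Sort by date, then region (lexsort treats its last key as primary).
        order = np.lexsort((regions, dates))
        d, r, v = dates[order], regions[order], values[order]
        # 2. Mark the first index of each block of identical (date, region) keys.
        new_block = np.empty(len(d), dtype=bool)
        new_block[0] = True
        new_block[1:] = (d[1:] != d[:-1]) | (r[1:] != r[:-1])
        starts = np.flatnonzero(new_block)
        ends = np.append(starts[1:], len(v))
        # 3. Per-block sums and counts from one cumulative sum, then the means.
        csum = np.concatenate(([0.0], np.cumsum(v)))
        sums = csum[ends] - csum[starts]
        counts = ends - starts
        return d[starts], r[starts], sums / counts

Called on small toy arrays, e.g. grouped_means(np.array([20070101, 20070101, 20070102]), np.array([1, 2, 1]), np.array([10.0, 3.0, 5.0])), it returns the distinct (date, region) pairs together with their means. Whether something along these lines actually beats the pure-Python dictionary loop on Geoffrey's data volumes is exactly the kind of speed test I would like to see posted to the list.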