Re: [Numpy-discussion] 2d binning and linear regression

2010-06-23 Thread Bruce Southey
On 06/22/2010 02:58 PM, josef.p...@gmail.com wrote:
 On Tue, Jun 22, 2010 at 10:09 AM, Tom Durrantthdurr...@gmail.com  wrote:

 the basic idea is in polyfit  on multiple data points on
 numpy-disscusion mailing list April 2009

 In this case, calculations have to be done by groups

 subtract mean (this needs to be replaced by group demeaning)
 modeldm = model - model.mean()
 obsdm = obs - obs.mean()

 xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
   bins=(latedges,lonedges))
 xy, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*obsdm,
   bins=(latedges,lonedges))


 slopes = xy/xx  # slopes by group

 expand slopes to length of original array
 predicted = model - obs * slopes_expanded
 ...

 the main point is to get the group functions, for demeaning, ... for
 the 2d labels (and get the labels out of histogramdd)

 I'm out of time (off to the airport soon), but I can look into it next
 weekend.

 Josef


 Thanks Josef, I will chase up the April list...
 If I understand what you have done above, this returns the slope of best fit
 lines forced through the origin, is that right?
  
 Not if both variables, model and obs, are demeaned first. Demeaning
 removes any effect of a constant and only the slope is left over,
 which can be computed with the ratio xy/xx.

 But to get independent intercept per group, the demeaning has to be by group.

 What's the size of your problem, how many groups or how many separate
 regressions ?

 demeaning by group has a setup cost in this case, so the main speed
 benefit would come if you do the digitize and label generation that
 histogram2d does only once and reuse it in later calculations.

 Using dummy variables as Bruce proposes works very well if there are
 not a very large number of groups, otherwise I think the memory
 requirements and size of array would be very costly in terms of
 performance.

 Josef


There is always a tradeoff of memory vs speed. It is too easy to be too 
clever just to find that a more brute force approach is considerably 
faster. You want to avoid Python code and array indexing as much as 
possible since those can be major speed bumps. Also, some approaches have 
hidden memory costs (histogram2d calls histogramdd([x,y], ...)).

If memory is an issue, then obviously you need to decide how to handle 
it because there are many ways around it. For example, the biggest 
memory usage in my code should be related to the design matrix and 
creating the normal equations. At worst (which requires passing by value 
rather than reference) it would be two 2-d arrays with the number of 
rows being the number of observations and the number of columns being 
two times the number of groups. If you have scipy, then sparse matrices 
can reduce the memory footprint of the design matrix and could be 
faster. There are also ways to construct the design matrix and the 
normal equations either observation by observation (essential for very 
large data sets) or pieces (uses within group information of numbers of 
observations, sum of 'model' and sum of 'model*model'). If your 
boxes have homogeneous variance (which is already being assumed), you 
can first divide the data into groups of boxes and then loop over each 
group reusing the existing arrays as needed.
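For illustration, here is a minimal sketch (not from the original posts) of how
the dummy-plus-interaction design matrix could be assembled as a sparse matrix
with scipy.sparse; the toy arrays box, x and model are assumptions standing in
for the real data, and the layout (intercept columns first, slope columns
second) mirrors the model discussed above.

import numpy as np
import scipy.sparse as sp

# toy data (assumption: integer box labels 0..ngroups-1)
box = np.array([0, 0, 0, 1, 1, 1])
x = np.array([1.67, 1.68, 1.70, 1.79, 2.04, 2.36])
model = np.array([1.42, 1.35, 1.55, 1.50, 1.59, 1.76])

nobs = len(box)
ngroups = box.max() + 1
rows = np.arange(nobs)

# dummy columns (intercept per box) and interaction columns (slope per box)
dummies = sp.csr_matrix((np.ones(nobs), (rows, box)), shape=(nobs, ngroups))
interact = sp.csr_matrix((x, (rows, box)), shape=(nobs, ngroups))
X = sp.hstack([dummies, interact]).tocsr()   # design matrix, 2*ngroups columns

# normal equations: only a (2*ngroups x 2*ngroups) dense solve remains
XtX = X.T.dot(X).toarray()
Xty = X.T.dot(model)
beta = np.linalg.solve(XtX, Xty)
print beta   # first ngroups entries: intercepts, last ngroups entries: slopes

The design matrix never stores the zeros of the dummy blocks, which is where
the memory saving comes from when the number of groups is large.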

Bruce
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-22 Thread Tom Durrant


 the basic idea is in polyfit  on multiple data points on
 numpy-disscusion mailing list April 2009

 In this case, calculations have to be done by groups

 subtract mean (this needs to be replaced by group demeaning)
 modeldm = model - model.mean()
 obsdm = obs - obs.mean()

 xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
  bins=(latedges,lonedges))
 xy, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*obsdm,
  bins=(latedges,lonedges))


 slopes = xy/xx  # slopes by group

 expand slopes to length of original array
 predicted = model - obs * slopes_expanded
 ...

 the main point is to get the group functions, for demeaning, ... for
 the 2d labels (and get the labels out of histogramdd)

 I'm out of time (off to the airport soon), but I can look into it next
 weekend.

 Josef

 Thanks Josef, I will chase up the April list...

If I understand what you have done above, this returns the slope of best fit
lines forced through the origin, is that right?

Have a great trip!

Tom
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-22 Thread Tom Durrant


 
 What exactly are you trying to fit? It is rather bad practice to fit
 a model to summarized data because you lose the uncertainty in the
 original data.
 If you define your boxes, you can loop through directly on each box and
 even fit the equation:

 model=mu +beta1*obs

 The extension is to fit the larger equation:
 model=mu + boxes + beta1*obs + beta2*obs*boxes

 where 'boxes' is an indicator or dummy variable for each box.
 Since you are only interested in the box by model term, you probably can
 use this type of model
 model=mu + boxes + beta2*obs*boxes

 However, these models assume that the residual variance is identical for
 all boxes. (That is solved by a mixed model approach.)

 Bruce


Bruce,

I am trying to determine spatially based linear corrections for surface
winds in order to force a wave model.  The basic idea is to use satellite
observations to determine the errors in the wind, and reduce them by
applying a linear correction prior to forcing the wave model.

I am not sure I understand what you are saying; I am possibly trying to do
what you are describing, i.e. for each box, gather observations, determine
a linear correction, and apply it to the model

model = a*x + b

Does that make sense?

Cheers
Tom





 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-22 Thread Bruce Southey

On 06/22/2010 09:13 AM, Tom Durrant wrote:




What exactly are you trying to fit? It is rather bad practice to fit
a model to summarized data because you lose the uncertainty in the
original data.
If you define your boxes, you can loop through directly on each
box and
even fit the equation:

model=mu +beta1*obs

The extension is to fit the larger equation:
model=mu + boxes + beta1*obs + beta2*obs*boxes

where 'boxes' is an indicator or dummy variable for each box.
Since you are only interested in the box by model term, you
probably can
use this type of model
model=mu + boxes + beta2*obs*boxes

However, these models assume that the residual variance is
identical for
all boxes. (That is solved by a mixed model approach.)

Bruce


Bruce,

I am trying to determine spatially based linear corrections for 
surface winds in order to force a wave model.  The basic idea is to use 
satellite observations to determine the errors in the wind, and reduce 
them by applying a linear correction prior to forcing the wave model.


I am not sure I understand what you are saying; I am possibly trying 
to do what you are describing, i.e. for each box, gather 
observations, determine a linear correction, and apply it to the model


model = a*x + b

Does that make sense?

Cheers
Tom




I used the data you gave and may have swapped 'model' and 'x' here - the 
code should work if you switch these.
First I assume that you can create a variable 'box' based on the lat/lon 
data - I just created one from the first values.


As I understand it, your problem is essentially an analysis of 
covariance with one factor (box) and one regressor (x). So I used some 
technical knowledge about the parameterization of the normal equations 
that may not be true in general for all models as it depends on finding 
'equivalent models'.  Just print out the different arrays for a small 
example as it is rather hard to describe using words alone.


Basically you can create a design matrix that just includes dummy 
variables for each box and the interactions between that dummy variable 
and your 'x' array. This amounts to the equation:


y = box + box*x

where
y is your 'model' array,
box is a dummy variable for box (the number of columns is the number of boxes), and
x is your 'x' array.

Note that there is no general or overall intercept here because you are 
not interested in that. Rather you are interested in the 'intercept' for 
each box which comes from the corresponding solution.


The code provides a function to create the dummy variables for each box 
and a second function to compute the interactions between that dummy 
variable and your 'x' array - these functions originated in 
pystatsmodels. After that I form and solve the normal equations to get 
the standard errors of the solutions and other useful statistics. Note 
that the residual variance (MSE) is probably the pooled residual 
variance across all boxes (I am too lazy to check).



Also, if you are not interested in the standard errors etc. then you 
probably should use a more efficient solver available in numpy.


I (hopefully) reshaped the solutions ('beta') and standard errors 
('pSE') so the rows are for each box, the first column is the 
'intercept' and the second column is the 'slope'.


The output is:

$ python reg_box.py
Residual Sum of Squares  0.0306718851814
Residual Mean Sum of Squares 0.00511198086357
RSquared 0.975657233983
Estimated Intercept and regression for each box
[[ 0.69044944  0.44569288]
 [-1.53272813  1.43358273]]
Estimated standard error of the Intercept and regression for each box
[[ 0.41081864  0.23061509]
 [ 0.21123387  0.095004  ]]

For box=1:
a=0.45 (se=0.23) and b=0.69 (se= 0.41)

For box=2:
a=1.43 (se=0.095) and b=-1.53 (se=0.21)

Bruce
(If you have questions, you can contact me off list if you want.)



import numpy as np
y =np.array(   [1.42, 1.35, 1.55, 1.50, 1.59, 1.76, 2.15, 1.90, 1.55, 0.73  ])
x =np.array( [1.67, 1.68, 1.70, 1.79, 2.04, 2.36, 2.53, 2.38, 2.149, 1.57 ])
box =np.array( [1, 1, 1, 1, 1, 2, 2, 2, 2, 2 ])

def data2dummy(x):
    if not isinstance(x, np.ndarray): #if not an ndarray object attempt to convert it
        x = np.asarray(x)
    if len(x.shape) > 1:
        raise ValueError('Too many columns')
    groups = np.unique(x)
    return (x[:, None] == groups).astype(int), groups

def design_int(A, B):
    if len(B.shape) > 1:
        rowB, colB = B.shape
        ab = A*B[:,0].reshape(rowB,1)
    else:
        rowB = B.shape[0]
        colB = 1
        ab = A*B[:].reshape(rowB,1)
    for col in range(1,colB):
        ncol = B[:,col].reshape(rowB,1)
        ab = np.hstack((ab,A*ncol))
    return ab

dbox, lbox= data2dummy(box) #create dummy variable for box
Xdesign=np.hstack((dbox,design_int(dbox, x))) #create Design matrix
#Create components of the normal equations
YY=np.dot(y.T,y)
XX=np.dot(Xdesign.T,Xdesign)
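The attached script appears to be truncated by the archive at this point. A
plausible completion of the remaining steps Bruce describes above (forming XY,
solving the normal equations, and computing the residual statistics, reshaped
solutions and standard errors) might look like the following sketch; it is a
reconstruction consistent with the printed output, not Bruce's original code.

XY = np.dot(Xdesign.T, y)

# solve the normal equations for the coefficient vector
iXX = np.linalg.pinv(XX)
beta = np.dot(iXX, XY)

# residual statistics (pooled across boxes)
nobs, ncoef = Xdesign.shape
RSS = YY - np.dot(beta, XY)
MSE = RSS / (nobs - ncoef)
TSS = YY - nobs * y.mean()**2
print 'Residual Sum of Squares ', RSS
print 'Residual Mean Sum of Squares', MSE
print 'RSquared', 1 - RSS / TSS

# standard errors of the solutions
pSE = np.sqrt(MSE * np.diag(iXX))

# reshape so each row is a box: column 0 = 'intercept', column 1 = 'slope'
nbox = len(lbox)
print 'Estimated Intercept and regression for each box'
print np.column_stack((beta[:nbox], beta[nbox:]))
print 'Estimated standard error of the Intercept and regression for each box'
print np.column_stack((pSE[:nbox], pSE[nbox:]))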

Re: [Numpy-discussion] 2d binning and linear regression

2010-06-22 Thread josef . pktd
On Tue, Jun 22, 2010 at 10:09 AM, Tom Durrant thdurr...@gmail.com wrote:

 the basic idea is in polyfit  on multiple data points on
 numpy-disscusion mailing list April 2009

 In this case, calculations have to be done by groups

 subtract mean (this needs to be replaced by group demeaning)
 modeldm = model - model.mean()
 obsdm = obs - obs.mean()

 xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
      bins=(latedges,lonedges))
 xy, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*obsdm,
      bins=(latedges,lonedges))


 slopes = xy/xx  # slopes by group

 expand slopes to length of original array
 predicted = model - obs * slopes_expanded
 ...

 the main point is to get the group functions, for demeaning, ... for
 the 2d labels (and get the labels out of histogramdd)

 I'm out of time (off to the airport soon), but I can look into it next
 weekend.

 Josef

 Thanks Josef, I will chase up the April list...
 If I understand what you have done above, this returns the slope of best fit
 lines forced through the origin, is that right?

Not if both variables, model and obs, are demeaned first. Demeaning
removes any effect of a constant and only the slope is left over,
which can be computed with the ratio xy/xx.

But to get independent intercept per group, the demeaning has to be by group.

What's the size of your problem, how many groups or how many separate
regressions ?

demeaning by group has a setup cost in this case, so the main speed
benefit would come if you do the digitize and label generation that
histogram2d does only once and reuse it in later calculations.
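As an illustration of the group-demeaning idea, here is a small sketch (not
from the thread) under the assumption that a 1d integer group label per
observation is available; it computes per-group slopes and intercepts of obs
on model with np.bincount and no Python loop.

import numpy as np

def group_slopes(model, obs, labels):
    # labels is assumed to be an array of consecutive integers 0..ngroups-1
    counts = np.bincount(labels).astype(float)
    # group means, expanded back to the length of the original arrays
    model_mean = np.bincount(labels, weights=model) / counts
    obs_mean = np.bincount(labels, weights=obs) / counts
    modeldm = model - model_mean[labels]
    obsdm = obs - obs_mean[labels]
    # per-group cross and own products, then slope and intercept by group
    xx = np.bincount(labels, weights=modeldm * modeldm)
    xy = np.bincount(labels, weights=modeldm * obsdm)
    slopes = xy / xx                      # empty groups come out as nan
    intercepts = obs_mean - slopes * model_mean
    return slopes, intercepts

# one possible way to build the labels from the 2d bins (assumed edge arrays):
# lat_idx = np.digitize(lat, latedges) - 1
# lon_idx = np.digitize(lon, lonedges) - 1
# labels = lat_idx * (len(lonedges) - 1) + lon_idx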

Using dummy variables as Bruce proposes works very well if there are
not a very large number of groups, otherwise I think the memory
requirements and size of array would be very costly in terms of
performance.

Josef




 Have a great trip!
 Tom


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-21 Thread josef . pktd
On Sun, Jun 20, 2010 at 10:57 PM, Tom Durrant thdurr...@gmail.com wrote:


 are you doing something like np.polyfit(model, obs, 1) ?

 If you are using polyfit with deg=1, i.e. fitting a straight line,
 then this could be also calculated using the weights in histogram2d.

 histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
 surprised if the histogram2d version is much faster. If a quick
 reading of histogramdd is correct, the main improvement would be to
 get the labels xy out of it, so it can be used repeatedly with
 np.bincount.

 Josef

 Thanks Josef,

 From my limited understanding, you are right the histogram is much faster 
 due to
 the fact that it doesn't have to keep reading in the array over and over

 I am using np.polyfit(model, obs, 1).  I couldn't work out a way to do these
 regression using histogram2d and weights, but you think it can be done?  This
 would be great!

the basic idea is in polyfit  on multiple data points on
numpy-disscusion mailing list April 2009

In this case, calculations have to be done by groups

subtract mean (this needs to be replaced by group demeaning)
modeldm = model - model.mean()
obsdm = obs - obs.mean()

xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
  bins=(latedges,lonedges))
xy, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*obsdm,
  bins=(latedges,lonedges))


slopes = xy/xx  # slopes by group

expand slopes to length of original array
predicted = model - obs * slopes_expanded
...

the main point is to get the group functions, for demeaning, ... for
the 2d labels (and get the labels out of histogramdd)

I'm out of time (off to the airport soon), but I can look into it next weekend.

Josef



 Cheers
 Tom





 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-21 Thread Bruce Southey
On 06/20/2010 03:24 AM, Tom Durrant wrote:
 Hi All,

 I have a problem involving lat/lon data.  Basically, I am evaluating 
 numerical weather model data against satellite data, and trying to 
 produce gridded plots of various statistics.  There are various steps 
 involved with this, but basically, I get to the point where I have 
 four arrays of the same length, containing lat, lon, model and 
 observed values respectively along the path of the satellite.

 eg:

 lat   = [  50.32   50.78   51.24   51.82   52.55   53.15   53.75   54.28   54.79   55.16  ... ]
 lon   = [ 192.83  193.38  193.94  194.67  195.65  196.49  197.35  198.15  198.94  199.53  ... ]
 obs   = [   1.42    1.35    1.55    1.50    1.59    1.76    2.15    1.90    1.55    0.73  ... ]
 model = [   1.67    1.68    1.70    1.79    2.04    2.36    2.53    2.38    2.149   1.57  ... ]

 I then want to calculate statistics based on bins of say 2 X 2 degree 
 boxes. These arrays are very large, on the order of 10^6. For ease of 
 explanation, I will focus initially on bias.

 My first approach was to use loops through lat/lon boxes and use 
 np.where statements to extract all the values of the model and 
 observations within the given box, and calculate the bias as the mean 
 of the difference.  This was obviously very slow.

 histogram2d provided a FAR better way to do this.  i.e.


 import numpy as np

 latedges=np.arange(-90,90,2)
 lonedges=np.arange(0,360,2)

 diff = model-obs
 grid_N, xedges, yedges = np.histogram2d(lat, lon,
 bins=(latedges,lonedges))
 grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
 bins=(latedges,lonedges))
 grid_bias = grid_bias_sum/grid_N


 I now want to determine the linear regression coefficients for 
 each box after fitting a least squares linear regression to the 
 data in each bin.  I have been looking at using np.digitize to extract 
 the bin indices, but haven't had much success trying to do this in two 
 dimensions.  I am back to looping through the lat and lon box values, 
 using np.where to extract the observations and model data within that 
 box, and using np.polyfit to obtain the regression coefficients.  This 
 is, of course, very slow.

 Can anyone advise a smarter, vectorized way to do this?  Any thoughts 
 would be greatly appreciated.

 Thanks in advance
 Tom



What exactly are you trying to fit? It is rather bad practice to fit 
a model to summarized data because you lose the uncertainty in the 
original data.
If you define your boxes, you can loop through directly on each box and 
even fit the equation:

model=mu +beta1*obs

The extension is to fit the larger equation:
model=mu + boxes + beta1*obs + beta2*obs*boxes

where 'boxes' is an indicator or dummy variable for each box.
Since you are only interested in the box by model term, you probably can 
use this type of model
model=mu + boxes + beta2*obs*boxes

However, these models assume that the residual variance is identical for 
all boxes. (That is solved by a mixed model approach.)

Bruce






___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] 2d binning and linear regression

2010-06-20 Thread Tom Durrant
Hi All,

I have a problem involving lat/lon data.  Basically, I am evaluating
numerical weather model data against satellite data, and trying to produce
gridded plots of various statistics.  There are various steps involved with
this, but basically, I get to the point where I have four arrays of the same
length, containing lat, lon, model and observed values respectively along
the path of the satellite.

eg:

lat   = [  50.32   50.78   51.24   51.82   52.55   53.15   53.75   54.28   54.79   55.16  ... ]
lon   = [ 192.83  193.38  193.94  194.67  195.65  196.49  197.35  198.15  198.94  199.53  ... ]
obs   = [   1.42    1.35    1.55    1.50    1.59    1.76    2.15    1.90    1.55    0.73  ... ]
model = [   1.67    1.68    1.70    1.79    2.04    2.36    2.53    2.38    2.149   1.57  ... ]

I then want to calculate statistics based on bins of say 2 X 2 degree boxes.
These arrays are very large, on the order of 10^6. For ease of explanation,
I will focus initially on bias.

My first approach was to use loops through lat/lon boxes and use np.where
statements to extract all the values of the model and observations within
the given box, and calculate the bias as the mean of the difference.  This
was obviously very slow.

histogram2d provided a FAR better way to do this.  i.e.


import numpy as np

latedges=np.arange(-90,90,2)
lonedges=np.arange(0,360,2)

diff = model-obs
grid_N, xedges, yedges = np.histogram2d(lat, lon,
bins=(latedges,lonedges))
grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
bins=(latedges,lonedges))
grid_bias = grid_bias_sum/grid_N


I now want to determine the linear regression coefficients for each
box after fitting a least squares linear regression to the data in each bin.
 I have been looking at using np.digitize to extract the bin indices, but
haven't had much success trying to do this in two dimensions.  I am back to
looping through the lat and lon box values, using np.where to extract the
observations and model data within that box, and using np.polyfit to obtain
the regression coefficients.  This is, of course, very slow.

Can anyone advise a smarter, vectorized way to do this?  Any thoughts would
be greatly appreciated.

Thanks in advance
Tom
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-20 Thread josef . pktd
On Sun, Jun 20, 2010 at 4:24 AM, Tom Durrant thdurr...@gmail.com wrote:
 Hi All,
 I have a problem involving lat/lon data.  Basically, I am evaluating
 numerical weather model data against satellite data, and trying to produce
 gridded plots of various statistics.  There are various steps involved with
 this, but basically, I get to the point where I have four arrays of the same
 length, containing lat, lon, model and observed values respectively along
 the path of the satellite.
 eg:
 lat   = [  50.32   50.78   51.24   51.82   52.55   53.15   53.75   54.28   54.79   55.16  ... ]
 lon   = [ 192.83  193.38  193.94  194.67  195.65  196.49  197.35  198.15  198.94  199.53  ... ]
 obs   = [   1.42    1.35    1.55    1.50    1.59    1.76    2.15    1.90    1.55    0.73  ... ]
 model = [   1.67    1.68    1.70    1.79    2.04    2.36    2.53    2.38    2.149   1.57  ... ]
 I then want to calculate statistics based on bins of say 2 X 2 degree boxes.
 These arrays are very large, on the order of 10^6. For ease of explanation,
 I will focus initially on bias.
 My first approach was to use loops through lat/lon boxes and use np.where
 statements to extract all the values of the model and observations within
 the given box, and calculate the bias as the mean of the difference.  This
 was obviously very slow.
 histogram2d provided a FAR better way to do this.  i.e.

 import numpy as np
 latedges=np.arange(-90,90,2)
 lonedges=np.arange(0,360,2)
 diff = model-obs
 grid_N, xedges, yedges = np.histogram2d(lat, lon,
                 bins=(latedges,lonedges))
 grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
                 bins=(latedges,lonedges))
 grid_bias = grid_bias_sum/grid_N

 I now want to determine the linear regression coefficients for each
 box after fitting a least squares linear regression to the data in each bin.
  I have been looking at using np.digitize to extract the bin indices, but
 haven't had much success trying to do this in two dimensions.  I am back to
 looping through the lat and lon box values, using np.where to extract the
 observations and model data within that box, and using np.polyfit to obtain
 the regression coefficients.  This is, of course, very slow.
 Can anyone advise a smarter, vectorized way to do this?  Any thoughts would
 be greatly appreciated.

For a general linear regression problem, there wouldn't be much that I
can see that can be done.

If there are many small regression problem, then sometimes stacking
them into one big sparse least squares problem can be faster, it's
faster to solve but not always faster to create in the first place.

are you doing something like np.polyfit(model, obs, 1) ?

If you are using polyfit with deg=1, i.e. fitting a straight line,
then this could be also calculated using the weights in histogram2d.

histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
surprised if the histogram2d version is much faster. If a quick
reading of histogramdd is correct, the main improvement would be to
get the labels xy out of it, so it can be used repeatedly with
np.bincount.
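
For a straight-line fit, the per-bin slope and intercept can indeed be written
entirely in terms of weighted histogram2d sums. A rough sketch (not from the
thread; the argument names follow the original post and are assumptions about
the data layout):

import numpy as np

def binned_linefit(lat, lon, x, y, latedges, lonedges):
    bins = (latedges, lonedges)
    n,   _, _ = np.histogram2d(lat, lon, bins=bins)
    sx,  _, _ = np.histogram2d(lat, lon, weights=x, bins=bins)
    sy,  _, _ = np.histogram2d(lat, lon, weights=y, bins=bins)
    sxx, _, _ = np.histogram2d(lat, lon, weights=x * x, bins=bins)
    sxy, _, _ = np.histogram2d(lat, lon, weights=x * y, bins=bins)
    # least-squares slope and intercept per bin (empty bins give nan)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# usage corresponding to np.polyfit(model, obs, 1) in each 2x2 degree box:
# slope, intercept = binned_linefit(lat, lon, model, obs, latedges, lonedges)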

Josef


 Thanks in advance
 Tom




 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-20 Thread Tom Durrant

 
 are you doing something like np.polyfit(model, obs, 1) ?
 
 If you are using polyfit with deg=1, i.e. fitting a straight line,
 then this could be also calculated using the weights in histogram2d.
 
 histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
 surprised if the histogram2d version is much faster. If a quick
 reading of histogramdd is correct, the main improvement would be to
 get the labels xy out of it, so it can be used repeatedly with
 np.bincount.
 
 Josef
 
Thanks Josef, 

From my limited understanding, you are right, the histogram approach is much
faster because it doesn't have to keep reading the array over and over.

I am using np.polyfit(model, obs, 1).  I couldn't work out a way to do these 
regressions using histogram2d and weights, but you think it can be done?  This 
would be great!

Cheers
Tom
 




___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning on regular grid

2010-06-03 Thread Friedrich Romstedt
Hello Andreas,

please see this as a side remark.

A colleague of mine made me aware of a very beautiful thing about
covering spheres by evenly spaced points:

http://healpix.jpl.nasa.gov/

Since you want to calculate mean and stddev, to my understanding a
longitude/latitude grid is problematic without proper weighting
factors.  Do you use weighting factors?  If yes, of what kind?

For Healpix, there exists a Python binding, but I never worked with it.

Friedrich
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning on regular grid

2010-06-03 Thread josef . pktd
On Wed, Jun 2, 2010 at 11:40 AM, Andreas Hilboll li...@hilboll.de wrote:
 Hi there,

 I'm interested in the solution to a special case of the parallel thread
 '2D binning', which is going on at the moment. My data is on a fine global
 grid, say .125x.125 degrees. I'm looking for a way to do calculations on
 coarser grids, e.g.

 * calculate means()
 * calculate std()
 * ...

 on a, say, 2.5x3.75 degree grid. One very crude approach would be to
 iterate through latitudes and longitudes, like this:


 latstep_orig = .125
 lonstep_orig = .125
 data_orig = np.arange(int(180./latstep_orig) * int(360./lonstep_orig)).reshape(
     (int(180./latstep_orig), int(360./lonstep_orig)))
 latstep_new = 2.5
 lonstep_new = 3.75
 latstep = int(latstep_new / latstep_orig)
 lonstep = int(lonstep_new / lonstep_orig)

 print 'one new lat equals',latstep,'new lats'
 print 'one new lon equals',lonstep,'new lons'

 result = ma.zeros((int(180./latstep_new), int(360./lonstep_new)))
 latidx = 0
 while latidx*latstep_new < 180.:
     lonidx = 0
     while lonidx*lonstep_new < 360.:
         m = np.mean( \
             data_orig[latidx*latstep:(latidx+1)*latstep, \
             lonidx*lonstep:(lonidx+1)*lonstep])
         result[latidx,lonidx] = m
         lonidx += 1
     latidx += 1

 However, this is very crude, and I was wondering if there's any more
 elegant way to do it ...

 Thanks for your insight!

I thought maybe there is something in ndimage for this, but since
nobody mentions it, maybe not.

If there are no memory problems and my interpretation of the question
is correct, then something like this might work:

>>> x = np.arange(16).reshape(4,-1)
>>> d = np.kron(np.eye(2), np.ones(2))
>>> s = np.dot(np.dot(d, x), d.T)
>>> m = np.dot(np.dot(d, x), d.T)/4
>>> v = np.dot(np.dot(d, x**2), d.T)/4 - m**2
>>> x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>>> s
array([[ 10.,  18.],
       [ 42.,  50.]])
>>> m
array([[  2.5,   4.5],
       [ 10.5,  12.5]])
>>> v
array([[ 4.25,  4.25],
       [ 4.25,  4.25]])
>>> x[:2,:2].sum()
10
>>> x[:2,:2].mean()
2.5
>>> x[:2,:2].var()
4.25

Josef


 A.

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Vincent Schut
On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
 On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu  
 wrote:
 I guess it's as fast as I'm going to get. I don't really see any
 other way. BTW, the lat/lons are integers)

 You could (in c or cython) try a brain-dead hashtable with no
 collision detection:

 for lat, long, data in dataset:
bin = (lat ^ long) % num_bins
hashtable[bin] = update_incremental_mean(hashtable[bin], data)

 you'll of course want to do some experiments to see if your data are
 sufficiently sparse and/or you can afford a large enough hashtable
 array that you won't get spurious hash collisions. Adding error-
 checking to ensure that there are no collisions would be pretty
 trivial (just keep a table of the lat/long for each hash value, which
 you'll need anyway, and check that different lat/long pairs don't get
 assigned the same bin).

 Zach



 -Mathew

 On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincuszachary.pin...@yale.edu
 wrote:
 Hi
  Can anyone think of a clever (non-looping) solution to the
  following?

  I have a list of latitudes, a list of longitudes, and a list of data
  values. All lists are the same length.

  I want to compute an average of data values for each lat/lon pair.
  e.g. if (lat[1001], lon[1001]) == (lat[2001], lon[2001]) then
  data[1001] = (data[1001] + data[2001])/2

  Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in
 grouped.items())

 Is that fast enough?

 If the lat lon can be converted to a 1d label as Wes suggested, then
 in a similar timing exercise ndimage was the fastest.
 http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html

And as you said your lats and lons are integers, you could simply do

ll = lat*1000 + lon

to get unique 'hashes' or '1d labels' for your lat/lon pairs, as a lat or 
lon will never exceed 360 (degrees).

After that, either use the ndimage approach, or you could use 
histogramming with weighting by data values and divide by the histogram 
without weighting, or just loop.
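
As a concrete illustration of the '1d label' route (a sketch, assuming
non-negative integer lat/lon and a data array of the same length, as stated
above), the per-pair mean can be had from two bincounts:

import numpy as np

ll = lat * 1000 + lon                      # unique 1d label per lat/lon pair
sums = np.bincount(ll, weights=data)       # sum of data per label
counts = np.bincount(ll)                   # number of points per label
occupied = counts > 0
means = sums[occupied] / counts[occupied]  # mean per occupied lat/lon pair
labels = np.flatnonzero(occupied)          # the corresponding ll values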

Vincent


 (this was for python 2.4, also later I found np.bincount which
 requires that the labels are consecutive integers, but is as fast as
 ndimage)

 I don't know how it would compare to the new suggestions.

 Josef




 Zach


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Benjamin Root
Why not simply use a set?

uniquePoints = set(zip(lats, lons))

Ben Root

On Wed, Jun 2, 2010 at 2:41 AM, Vincent Schut sc...@sarvision.nl wrote:

 On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
  On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu
  wrote:
  I guess it's as fast as I'm going to get. I don't really see any
  other way. BTW, the lat/lons are integers)
 
  You could (in c or cython) try a brain-dead hashtable with no
  collision detection:
 
  for lat, long, data in dataset:
 bin = (lat ^ long) % num_bins
 hashtable[bin] = update_incremental_mean(hashtable[bin], data)
 
  you'll of course want to do some experiments to see if your data are
  sufficiently sparse and/or you can afford a large enough hashtable
  array that you won't get spurious hash collisions. Adding error-
  checking to ensure that there are no collisions would be pretty
  trivial (just keep a table of the lat/long for each hash value, which
  you'll need anyway, and check that different lat/long pairs don't get
  assigned the same bin).
 
  Zach
 
 
 
  -Mathew
 
  On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincuszachary.pin...@yale.edu
  wrote:
  Hi
  Can anyone think of a clever (non-lopping) solution to the
  following?
 
  A have a list of latitudes, a list of longitudes, and list of data
  values. All lists are the same length.
 
  I want to compute an average  of data values for each lat/lon pair.
  e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take wa to long.
 
  As a start, are the equal lat/lon pairs exactly equal (i.e. either
  not floating-point, or floats that will always compare equal, that is,
  the floating-point bit-patterns will be guaranteed to be identical) or
  approximately equal to float tolerance?
 
  If you're in the approx-equal case, then look at the KD-tree in scipy
  for doing near-neighbors queries.
 
  If you're in the exact-equal case, you could consider hashing the lat/
  lon pairs or something. At least then the looping is O(N) and not
  O(N^2):
 
  import collections
  grouped = collections.defaultdict(list)
  for lt, ln, da in zip(lat, lon, data):
 grouped[(lt, ln)].append(da)
 
  averaged = dict((ltln, numpy.mean(da)) for ltln, da in
  grouped.items())
 
  Is that fast enough?
 
  If the lat lon can be converted to a 1d label as Wes suggested, then
  in a similar timing exercise ndimage was the fastest.
  http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html

 And as you said your lats and lons are integers, you could simply do

 ll = lat*1000 + lon

 to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
 lon will never exceed 360 (degrees).

 After that, either use the ndimage approach, or you could use
 histogramming with weighting by data values and divide by histogram
 withouth weighting, or just loop.

 Vincent

 
  (this was for python 2.4, also later I found np.bincount which
  requires that the labels are consecutive integers, but is as fast as
  ndimage)
 
  I don't know how it would compare to the new suggestions.
 
  Josef
 
 
 
 
  Zach
 

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] 2d binning on regular grid

2010-06-02 Thread Andreas Hilboll
Hi there,

I'm interested in the solution to a special case of the parallel thread
'2D binning', which is going on at the moment. My data is on a fine global
grid, say .125x.125 degrees. I'm looking for a way to do calculations on
coarser grids, e.g.

* calculate means()
* calculate std()
* ...

on a, say, 2.5x3.75 degree grid. One very crude approach would be to
iterate through latitudes and longitudes, like this:


import numpy as np
import numpy.ma as ma

latstep_orig = .125
lonstep_orig = .125
data_orig = np.arange(int(180./latstep_orig) * int(360./lonstep_orig)).reshape(
    (int(180./latstep_orig), int(360./lonstep_orig)))
latstep_new = 2.5
lonstep_new = 3.75
latstep = int(latstep_new / latstep_orig)
lonstep = int(lonstep_new / lonstep_orig)

print 'one new lat equals',latstep,'new lats'
print 'one new lon equals',lonstep,'new lons'

result = ma.zeros((int(180./latstep_new), int(360./lonstep_new)))
latidx = 0
while latidx*latstep_new < 180.:
    lonidx = 0
    while lonidx*lonstep_new < 360.:
        m = np.mean( \
            data_orig[latidx*latstep:(latidx+1)*latstep, \
            lonidx*lonstep:(lonidx+1)*lonstep])
        result[latidx,lonidx] = m
        lonidx += 1
    latidx += 1

However, this is very crude, and I was wondering if there's any more
elegant way to do it ...

Thanks for your insight!

A.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Wes McKinney
On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote:
 On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
 On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu  
 wrote:
 I guess it's as fast as I'm going to get. I don't really see any
 other way. BTW, the lat/lons are integers)

 You could (in c or cython) try a brain-dead hashtable with no
 collision detection:

 for lat, long, data in dataset:
    bin = (lat ^ long) % num_bins
    hashtable[bin] = update_incremental_mean(hashtable[bin], data)

 you'll of course want to do some experiments to see if your data are
 sufficiently sparse and/or you can afford a large enough hashtable
 array that you won't get spurious hash collisions. Adding error-
 checking to ensure that there are no collisions would be pretty
 trivial (just keep a table of the lat/long for each hash value, which
 you'll need anyway, and check that different lat/long pairs don't get
 assigned the same bin).

 Zach



 -Mathew

 On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincuszachary.pin...@yale.edu
 wrote:
 Hi
  Can anyone think of a clever (non-looping) solution to the
  following?

  I have a list of latitudes, a list of longitudes, and a list of data
  values. All lists are the same length.

  I want to compute an average of data values for each lat/lon pair.
  e.g. if (lat[1001], lon[1001]) == (lat[2001], lon[2001]) then
  data[1001] = (data[1001] + data[2001])/2

  Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
    grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in
 grouped.items())

 Is that fast enough?

 If the lat lon can be converted to a 1d label as Wes suggested, then
 in a similar timing exercise ndimage was the fastest.
 http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html

 And as you said your lats and lons are integers, you could simply do

 ll = lat*1000 + lon

 to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
 lon will never exceed 360 (degrees).

 After that, either use the ndimage approach, or you could use
 histogramming with weighting by data values and divide by histogram
 withouth weighting, or just loop.

 Vincent


 (this was for python 2.4, also later I found np.bincount which
 requires that the labels are consecutive integers, but is as fast as
 ndimage)

 I don't know how it would compare to the new suggestions.

 Josef




 Zach


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I was curious about how fast ndimage was for this operation so here's
the complete function.

import collections
import numpy as np
import scipy.ndimage as ndi

N = 1

lat = np.random.randint(0, 360, N)
lon = np.random.randint(0, 360, N)
data = np.random.randn(N)

def group_mean(lat, lon, data):
    indexer = np.lexsort((lon, lat))
    lat = lat.take(indexer)
    lon = lon.take(indexer)
    sorted_data = data.take(indexer)

    keys = 1000 * lat + lon
    unique_keys = np.unique(keys)

    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
    decoder = keys.searchsorted(unique_keys)

    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

Appears to be about 13x faster (and could be made faster still) than
the naive version on my machine:

def group_mean_naive(lat, lon, data):
    grouped = collections.defaultdict(list)
    for lt, ln, da in zip(lat, lon, data):
        grouped[(lt, ln)].append(da)

    averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())

    return averaged

I had to get the latest scipy trunk to not get an error from ndimage.mean
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Mathew Yeates
thanks. I am also getting an error in ndi.mean
Were you getting the error
RuntimeError: data type not supported?

-Mathew


On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote:
  On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
  On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu
  wrote:
  I guess it's as fast as I'm going to get. I don't really see any
  other way. BTW, the lat/lons are integers)
 
  You could (in c or cython) try a brain-dead hashtable with no
  collision detection:
 
  for lat, long, data in dataset:
 bin = (lat ^ long) % num_bins
 hashtable[bin] = update_incremental_mean(hashtable[bin], data)
 
  you'll of course want to do some experiments to see if your data are
  sufficiently sparse and/or you can afford a large enough hashtable
  array that you won't get spurious hash collisions. Adding error-
  checking to ensure that there are no collisions would be pretty
  trivial (just keep a table of the lat/long for each hash value, which
  you'll need anyway, and check that different lat/long pairs don't get
  assigned the same bin).
 
  Zach
 
 
 
  -Mathew
 
  On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus
 zachary.pin...@yale.edu
  wrote:
  Hi
  Can anyone think of a clever (non-lopping) solution to the
  following?
 
  A have a list of latitudes, a list of longitudes, and list of data
  values. All lists are the same length.
 
  I want to compute an average  of data values for each lat/lon pair.
  e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take wa to long.
 
  As a start, are the equal lat/lon pairs exactly equal (i.e. either
  not floating-point, or floats that will always compare equal, that is,
  the floating-point bit-patterns will be guaranteed to be identical) or
  approximately equal to float tolerance?
 
  If you're in the approx-equal case, then look at the KD-tree in scipy
  for doing near-neighbors queries.
 
  If you're in the exact-equal case, you could consider hashing the lat/
  lon pairs or something. At least then the looping is O(N) and not
  O(N^2):
 
  import collections
  grouped = collections.defaultdict(list)
  for lt, ln, da in zip(lat, lon, data):
 grouped[(lt, ln)].append(da)
 
  averaged = dict((ltln, numpy.mean(da)) for ltln, da in
  grouped.items())
 
  Is that fast enough?
 
  If the lat lon can be converted to a 1d label as Wes suggested, then
  in a similar timing exercise ndimage was the fastest.
  http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
 
  And as you said your lats and lons are integers, you could simply do
 
  ll = lat*1000 + lon
 
  to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
  lon will never exceed 360 (degrees).
 
  After that, either use the ndimage approach, or you could use
  histogramming with weighting by data values and divide by histogram
  withouth weighting, or just loop.
 
  Vincent
 
 
  (this was for python 2.4, also later I found np.bincount which
  requires that the labels are consecutive integers, but is as fast as
  ndimage)
 
  I don't know how it would compare to the new suggestions.
 
  Josef
 
 
 
 
  Zach
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

 I was curious about how fast ndimage was for this operation so here's
 the complete function.

 import scipy.ndimage as ndi

 N = 1

 lat = np.random.randint(0, 360, N)
 lon = np.random.randint(0, 360, N)
 data = np.random.randn(N)

 def group_mean(lat, lon, data):
indexer = np.lexsort((lon, lat))
lat = lat.take(indexer)
lon = lon.take(indexer)
sorted_data = data.take(indexer)

keys = 1000 * lat + lon
unique_keys = np.unique(keys)

result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
decoder = keys.searchsorted(unique_keys)

return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

 Appears to be about 13x faster (and could be made faster still) than
 the naive version on my machine:

 def group_mean_naive(lat, lon, data):
 grouped = collections.defaultdict(list)
for lt, ln, da in zip(lat, lon, data):
  grouped[(lt, ln)].append(da)

 averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Wes McKinney
On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates mat.yea...@gmail.com wrote:
 thanks. I am also getting an error in ndi.mean
 Were you getting the error
 RuntimeError: data type not supported?

 -Mathew

 On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote:
  On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
  On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu
   wrote:
  I guess it's as fast as I'm going to get. I don't really see any
  other way. BTW, the lat/lons are integers)
 
  You could (in c or cython) try a brain-dead hashtable with no
  collision detection:
 
  for lat, long, data in dataset:
     bin = (lat ^ long) % num_bins
     hashtable[bin] = update_incremental_mean(hashtable[bin], data)
 
  you'll of course want to do some experiments to see if your data are
  sufficiently sparse and/or you can afford a large enough hashtable
  array that you won't get spurious hash collisions. Adding error-
  checking to ensure that there are no collisions would be pretty
  trivial (just keep a table of the lat/long for each hash value, which
  you'll need anyway, and check that different lat/long pairs don't get
  assigned the same bin).
 
  Zach
 
 
 
  -Mathew
 
  On Tue, Jun 1, 2010 at 1:49 PM, Zachary
  Pincuszachary.pin...@yale.edu
  wrote:
  Hi
  Can anyone think of a clever (non-lopping) solution to the
  following?
 
  A have a list of latitudes, a list of longitudes, and list of data
  values. All lists are the same length.
 
  I want to compute an average  of data values for each lat/lon pair.
  e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take wa to long.
 
  As a start, are the equal lat/lon pairs exactly equal (i.e. either
  not floating-point, or floats that will always compare equal, that
  is,
  the floating-point bit-patterns will be guaranteed to be identical)
  or
  approximately equal to float tolerance?
 
  If you're in the approx-equal case, then look at the KD-tree in scipy
  for doing near-neighbors queries.
 
  If you're in the exact-equal case, you could consider hashing the
  lat/
  lon pairs or something. At least then the looping is O(N) and not
  O(N^2):
 
  import collections
  grouped = collections.defaultdict(list)
  for lt, ln, da in zip(lat, lon, data):
     grouped[(lt, ln)].append(da)
 
  averaged = dict((ltln, numpy.mean(da)) for ltln, da in
  grouped.items())
 
  Is that fast enough?
 
  If the lat lon can be converted to a 1d label as Wes suggested, then
  in a similar timing exercise ndimage was the fastest.
  http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
 
  And as you said your lats and lons are integers, you could simply do
 
  ll = lat*1000 + lon
 
  to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
  lon will never exceed 360 (degrees).
 
  After that, either use the ndimage approach, or you could use
  histogramming with weighting by data values and divide by histogram
  withouth weighting, or just loop.
 
  Vincent
 
 
  (this was for python 2.4, also later I found np.bincount which
  requires that the labels are consecutive integers, but is as fast as
  ndimage)
 
  I don't know how it would compare to the new suggestions.
 
  Josef
 
 
 
 
  Zach
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

 I was curious about how fast ndimage was for this operation so here's
 the complete function.

 import scipy.ndimage as ndi

 N = 1

 lat = np.random.randint(0, 360, N)
 lon = np.random.randint(0, 360, N)
 data = np.random.randn(N)

 def group_mean(lat, lon, data):
    indexer = np.lexsort((lon, lat))
    lat = lat.take(indexer)
    lon = lon.take(indexer)
    sorted_data = data.take(indexer)

    keys = 1000 * lat + lon
    unique_keys = np.unique(keys)

    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
    decoder = keys.searchsorted(unique_keys)

    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

 Appears to be about 13x faster (and could be made faster still) than
 the naive version on my machine:

 def group_mean_naive(lat, lon, data):
    grouped = collections.defaultdict(list)
    for lt, ln, da in zip(lat, lon, data):
      grouped[(lt, ln)].append(da)


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Mathew Yeates
I'm on Windows, using a precompiled binary. I never built numpy/scipy on
Windows.

On Wed, Jun 2, 2010 at 10:45 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates mat.yea...@gmail.com
 wrote:
  thanks. I am also getting an error in ndi.mean
  Were you getting the error
  RuntimeError: data type not supported?
 
  -Mathew
 
  On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney wesmck...@gmail.com
 wrote:
 
  On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl
 wrote:
   On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
   On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus
 zachary.pin...@yale.edu
wrote:
   I guess it's as fast as I'm going to get. I don't really see any
   other way. BTW, the lat/lons are integers)
  
   You could (in c or cython) try a brain-dead hashtable with no
   collision detection:
  
   for lat, long, data in dataset:
  bin = (lat ^ long) % num_bins
  hashtable[bin] = update_incremental_mean(hashtable[bin], data)
  
   you'll of course want to do some experiments to see if your data are
   sufficiently sparse and/or you can afford a large enough hashtable
   array that you won't get spurious hash collisions. Adding error-
   checking to ensure that there are no collisions would be pretty
   trivial (just keep a table of the lat/long for each hash value,
 which
   you'll need anyway, and check that different lat/long pairs don't
 get
   assigned the same bin).
  
   Zach
  
  
  
   -Mathew
  
   On Tue, Jun 1, 2010 at 1:49 PM, Zachary
   Pincuszachary.pin...@yale.edu
   wrote:
   Hi
   Can anyone think of a clever (non-lopping) solution to the
   following?
  
   A have a list of latitudes, a list of longitudes, and list of data
   values. All lists are the same length.
  
   I want to compute an average  of data values for each lat/lon
 pair.
   e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
   data[1001] = (data[1001] + data[2001])/2
  
   Looping is going to take wa to long.
  
   As a start, are the equal lat/lon pairs exactly equal (i.e.
 either
   not floating-point, or floats that will always compare equal, that
   is,
   the floating-point bit-patterns will be guaranteed to be identical)
   or
   approximately equal to float tolerance?
  
   If you're in the approx-equal case, then look at the KD-tree in
 scipy
   for doing near-neighbors queries.
  
   If you're in the exact-equal case, you could consider hashing the
   lat/
   lon pairs or something. At least then the looping is O(N) and not
   O(N^2):
  
   import collections
   grouped = collections.defaultdict(list)
   for lt, ln, da in zip(lat, lon, data):
  grouped[(lt, ln)].append(da)
  
   averaged = dict((ltln, numpy.mean(da)) for ltln, da in
   grouped.items())
  
   Is that fast enough?
  
   If the lat lon can be converted to a 1d label as Wes suggested, then
   in a similar timing exercise ndimage was the fastest.
   http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
  
   And as you said your lats and lons are integers, you could simply do
  
   ll = lat*1000 + lon
  
   to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat
 or
   lon will never exceed 360 (degrees).
  
   After that, either use the ndimage approach, or you could use
   histogramming with weighting by data values and divide by histogram
   withouth weighting, or just loop.
  
   Vincent
  
  
   (this was for python 2.4, also later I found np.bincount which
   requires that the labels are consecutive integers, but is as fast as
   ndimage)
  
   I don't know how it would compare to the new suggestions.
  
   Josef
  
  
  
  
   Zach
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
  
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
 
  I was curious about how fast ndimage was for this operation so here's
  the complete function.
 
  import scipy.ndimage as ndi
 
  N = 1
 
  lat = np.random.randint(0, 360, N)
  lon = np.random.randint(0, 360, N)
  data = np.random.randn(N)
 
  def group_mean(lat, lon, data):
 indexer = np.lexsort((lon, lat))
 lat = lat.take(indexer)
 lon = lon.take(indexer)
 sorted_data = data.take(indexer)
 
 keys = 1000 * lat + lon
 unique_keys = np.unique(keys)
 
 result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
 decoder = keys.searchsorted(unique_keys)
 
 return 

Re: [Numpy-discussion] 2D binning

2010-06-02 Thread josef . pktd
On Wed, Jun 2, 2010 at 2:09 PM, Mathew Yeates mat.yea...@gmail.com wrote:
 I'm on Windows, using a precompiled binary. I never built numpy/scipy on
 Windows.

The ndimage measurements code has recently been rewritten. ndimage is very
fast, but the old version has insufficient type checking and may crash on
wrong inputs.

I managed to work with it in the past on Windows. Maybe you could try
to check the dtypes of the arguments for ndi.mean. (Preferably in an
interpreter session where you don't mind if it crashes)

I don't remember the type restrictions, but there are/were several
tickets for it.

Josef
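
A rough illustration of that dtype check (the names follow Wes's group_mean
snippet above and are only illustrative; casting to plain float/int dtypes
is one thing to try):

print sorted_data.dtype, keys.dtype, unique_keys.dtype
result = ndi.mean(sorted_data.astype(float),
                  labels=keys.astype(int),
                  index=unique_keys.astype(int))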




Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Wes McKinney
On Wed, Jun 2, 2010 at 2:26 PM,  josef.p...@gmail.com wrote:
 On Wed, Jun 2, 2010 at 2:09 PM, Mathew Yeates mat.yea...@gmail.com wrote:
 I'm on Windows, using a precompiled binary. I never built numpy/scipy on
 Windows.

 The ndimage measurements code has recently been rewritten. ndimage is very
 fast, but the old version has insufficient type checking and may crash on
 wrong inputs.

 I managed to work with it in the past on Windows. Maybe you could try
 to check the dtypes of the arguments for ndi.mean. (Preferably in an
 interpreter session where you don't mind if it crashes)

 I don't remember the type restrictions, but there are/were several
 tickets for it.

 Josef




Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Mathew Yeates
Nope. This version didn't work either.



 If you're on Python 2.6 the binary on here might work for you:

 http://www.lfd.uci.edu/~gohlke/pythonlibs/

 It looks recent enough to have the rewritten ndimage
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Christoph Gohlke


On 6/2/2010 2:32 PM, Mathew Yeates wrote:
 Nope. This version didn't work either.



 If you're on Python 2.6 the binary on here might work for you:

 http://www.lfd.uci.edu/~gohlke/pythonlibs/

 It looks recent enough to have the rewritten ndimage





Please note that in order to use the ndimage package from
http://www.lfd.uci.edu/~gohlke/pythonlibs/#ndimage you have to use
"import ndimage as ndi" instead of "import scipy.ndimage as ndi". The
code posted by Wes works for me on Windows after that change.
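
In code form the swap is just this (a sketch; only the import line changes,
the rest of the snippet stays as posted):

import ndimage as ndi        # standalone build from the pythonlibs page
# instead of:
# import scipy.ndimage as ndi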

--
Christoph

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Stephen Simmons

On 1/06/2010 10:51 PM, Wes McKinney wrote:
  snip
  This is a pretty good example of the group-by problem that will
  hopefully work its way into a future edition of NumPy.

Wes (or anyone else), please can you elaborate on any plans for groupby?

I've made my own modification to numpy.bincount for doing groupby-type 
operations but not contributed anything back to the numpy community.

Is there any interest from European-based people to work on groupby etc 
at the Euro SciPy sprints in July? If others are interested, maybe we 
could work out requirements beforehand and then do some coding in Paris.

Cheers
Stephen

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Wes McKinney
On Wed, Jun 2, 2010 at 6:18 PM, Stephen Simmons m...@stevesimmons.com wrote:

 On 1/06/2010 10:51 PM, Wes McKinney wrote:
   snip
   This is a pretty good example of the group-by problem that will
   hopefully work its way into a future edition of NumPy.

 Wes (or anyone else), please can you elaborate on any plans for groupby?

 I've made my own modification to numpy.bincount for doing groupby-type
 operations but not contributed anything back to the numpy community.

 Is there any interest from European-based people to work on groupby etc
 at the Euro SciPy sprints in July? If others are interested, maybe we
 could work out requirements beforehand and then do some coding in Paris.

 Cheers
 Stephen

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I know we're planning to discuss at SciPy in Austin and will hopefully
have a gameplan. We should coordinate the discussions / implementation
somewhere. I've implemented some groupby functionality in pandas but I
guess at a higher level than what's been proposed to add to NumPy.

On the pystatsmodels mailing list we've been starting to have some
discussions about statistically-oriented data structures in general--
I'll be sending an e-mail to the NumPy mailing list soon to start a
broader discussion which will hopefully lead to some good things at
the conferences.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] 2D binning

2010-06-01 Thread Mathew Yeates
Hi
Can anyone think of a clever (non-looping) solution to the following?

I have a list of latitudes, a list of longitudes, and a list of data values.
All lists are the same length.

I want to compute an average of data values for each lat/lon pair, e.g. if
lat[1001],lon[1001] == lat[2001],lon[2001] then
data[1001] = (data[1001] + data[2001])/2

Looping is going to take way too long.

Mathew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Keith Goodman
On Tue, Jun 1, 2010 at 1:07 PM, Mathew Yeates mat.yea...@gmail.com wrote:
 Hi
 Can anyone think of a clever (non-looping) solution to the following?
 I have a list of latitudes, a list of longitudes, and a list of data values.
 All lists are the same length.
 I want to compute an average of data values for each lat/lon pair, e.g. if
 lat[1001],lon[1001] == lat[2001],lon[2001] then
 data[1001] = (data[1001] + data[2001])/2
 Looping is going to take way too long.

Looping over N items and searching within N each time would be O(N^2).
But would it really be too slow just to loop once, i.e. O(N)? If you need
to find all the averages you could try something like this:

from collections import defaultdict

def aggdict(x):
    "Convert [(1, 'a'), (2, 'b'), (1, 'A')] to {1: ['a', 'A'], 2: ['b']}."
    d = defaultdict(list)
    for k, v in x:
        d[k].append(v)
    return dict(d)
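
For instance (a sketch, assuming numpy is imported as np and lat, lon, data
are the arrays from the original post), the per-pair means would then be:

pairs = zip(zip(lat, lon), data)        # [((lat0, lon0), d0), ...]
means = dict((k, np.mean(v)) for k, v in aggdict(pairs).items())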
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Zachary Pincus
 Hi
 Can anyone think of a clever (non-looping) solution to the following?

 I have a list of latitudes, a list of longitudes, and a list of data
 values. All lists are the same length.

 I want to compute an average of data values for each lat/lon pair,
 e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
 data[1001] = (data[1001] + data[2001])/2

 Looping is going to take way too long.

As a start, are the equal lat/lon pairs exactly equal (i.e. either  
not floating-point, or floats that will always compare equal, that is,  
the floating-point bit-patterns will be guaranteed to be identical) or  
approximately equal to float tolerance?

If you're in the approx-equal case, then look at the KD-tree in scipy  
for doing near-neighbors queries.

If you're in the exact-equal case, you could consider hashing the lat/ 
lon pairs or something. At least then the looping is O(N) and not  
O(N^2):

import collections
grouped = collections.defaultdict(list)
for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())

Is that fast enough?

Zach
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Wes McKinney
On Tue, Jun 1, 2010 at 4:49 PM, Zachary Pincus zachary.pin...@yale.edu wrote:
 Hi
 Can anyone think of a clever (non-looping) solution to the following?

 I have a list of latitudes, a list of longitudes, and a list of data
 values. All lists are the same length.

 I want to compute an average of data values for each lat/lon pair,
 e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
 data[1001] = (data[1001] + data[2001])/2

 Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())

 Is that fast enough?

 Zach
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


This is a pretty good example of the group-by problem that will
hopefully work its way into a future edition of NumPy. Given that, a
good approach would be to produce a unique key from the lat and lon
vectors, and pass that off to the groupby routine (when it exists).
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Mathew Yeates
I guess it's as fast as I'm going to get. I don't really see any other way.
(BTW, the lat/lons are integers.)

-Mathew

On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus zachary.pin...@yale.eduwrote:

  Hi
  Can anyone think of a clever (non-looping) solution to the following?
 
  I have a list of latitudes, a list of longitudes, and a list of data
  values. All lists are the same length.
 
  I want to compute an average of data values for each lat/lon pair,
  e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())

 Is that fast enough?

 Zach
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Zachary Pincus
 I guess it's as fast as I'm going to get. I don't really see any  
 other way. (BTW, the lat/lons are integers.)

You could (in c or cython) try a brain-dead hashtable with no  
collision detection:

for lat, long, data in dataset:
   bin = (lat ^ long) % num_bins
   hashtable[bin] = update_incremental_mean(hashtable[bin], data)

you'll of course want to do some experiments to see if your data are  
sufficiently sparse and/or you can afford a large enough hashtable  
array that you won't get spurious hash collisions. Adding error- 
checking to ensure that there are no collisions would be pretty  
trivial (just keep a table of the lat/long for each hash value, which  
you'll need anyway, and check that different lat/long pairs don't get  
assigned the same bin).

Zach
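
A slow pure-Python rendering of the bookkeeping above, just to make the
idea concrete (the real loop would be C or Cython as described; num_bins
and the collision check are illustrative):

num_bins = 4096
counts = [0] * num_bins
means = [0.0] * num_bins
seen = [None] * num_bins                 # which (lat, lon) landed in each bin

for la, lo, d in zip(lat, lon, data):
    b = (la ^ lo) % num_bins
    if seen[b] is None:
        seen[b] = (la, lo)
    elif seen[b] != (la, lo):
        raise ValueError("hash collision -- enlarge num_bins")
    counts[b] += 1
    means[b] += (d - means[b]) / counts[b]   # incremental mean update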



 -Mathew

 On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus zachary.pin...@yale.edu 
  wrote:
  Hi
  Can anyone think of a clever (non-looping) solution to the
  following?
 
  I have a list of latitudes, a list of longitudes, and a list of data
  values. All lists are the same length.
 
  I want to compute an average of data values for each lat/lon pair,
  e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in  
 grouped.items())

 Is that fast enough?

 Zach

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread josef . pktd
On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus zachary.pin...@yale.edu wrote:
 I guess it's as fast as I'm going to get. I don't really see any
 other way. (BTW, the lat/lons are integers.)

 You could (in c or cython) try a brain-dead hashtable with no
 collision detection:

 for lat, long, data in dataset:
   bin = (lat ^ long) % num_bins
   hashtable[bin] = update_incremental_mean(hashtable[bin], data)

 you'll of course want to do some experiments to see if your data are
 sufficiently sparse and/or you can afford a large enough hashtable
 array that you won't get spurious hash collisions. Adding error-
 checking to ensure that there are no collisions would be pretty
 trivial (just keep a table of the lat/long for each hash value, which
 you'll need anyway, and check that different lat/long pairs don't get
 assigned the same bin).

 Zach



 -Mathew

 On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus zachary.pin...@yale.edu
  wrote:
  Hi
  Can anyone think of a clever (non-looping) solution to the
  following?
 
  I have a list of latitudes, a list of longitudes, and a list of data
  values. All lists are the same length.
 
  I want to compute an average of data values for each lat/lon pair,
  e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in
 grouped.items())

 Is that fast enough?

If the lat lon can be converted to a 1d label as Wes suggested, then
in a similar timing exercise ndimage was the fastest.
http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html

(this was for python 2.4, also later I found np.bincount which
requires that the labels are consecutive integers, but is as fast as
ndimage)

I don't know how it would compare to the new suggestions.

Josef
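
A minimal sketch of that bincount variant, assuming non-negative integer
lat/lon arrays (np.unique's return_inverse gives the consecutive labels
that bincount needs):

import numpy as np

ll = lat * 1000 + lon
uniq, labels = np.unique(ll, return_inverse=True)   # labels run 0..K-1
means = np.bincount(labels, weights=data) / np.bincount(labels)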




 Zach

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Brent Pedersen
On Tue, Jun 1, 2010 at 1:51 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Tue, Jun 1, 2010 at 4:49 PM, Zachary Pincus zachary.pin...@yale.edu 
 wrote:
 Hi
 Can anyone think of a clever (non-looping) solution to the following?

 I have a list of latitudes, a list of longitudes, and a list of data
 values. All lists are the same length.

 I want to compute an average of data values for each lat/lon pair,
 e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
 data[1001] = (data[1001] + data[2001])/2

 Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())

 Is that fast enough?

 Zach
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 This is a pretty good example of the group-by problem that will
 hopefully work its way into a future edition of NumPy. Given that, a
 good approach would be to produce a unique key from the lat and lon
 vectors, and pass that off to the groupby routine (when it exists).
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Meanwhile, groupby from itertools will work, but it might be a bit slower
since it has to convert every row to a tuple and collect each group in a list.


import numpy as np
import itertools

# fake data
N = 10000   # the value of N is garbled in the archive; any multiple of 250 works
lats = np.repeat(180 * (np.random.ranf(N / 250) - 0.5), 250)
lons = np.repeat(360 * (np.random.ranf(N / 250) - 0.5), 250)

np.random.shuffle(lats)
np.random.shuffle(lons)

vals = np.arange(N)
#

inds = np.lexsort((lons, lats))

sorted_lats = lats[inds]
sorted_lons = lons[inds]
sorted_vals = vals[inds]

llv = np.array((sorted_lats, sorted_lons, sorted_vals)).T

for (lat, lon), group in itertools.groupby(llv, lambda row: tuple(row[:2])):
    group_vals = [g[-1] for g in group]
    print lat, lon, np.mean(group_vals)

# make sure the mean for the last lat/lon from the loop matches the mean
# for that lat/lon from original data.
tests_idx, = np.where((lats == lat) & (lons == lon))
assert np.mean(vals[tests_idx]) == np.mean(group_vals)
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion