Re: [Numpy-discussion] 2d binning and linear regression

2010-06-23 Thread Bruce Southey
On 06/22/2010 02:58 PM, josef.p...@gmail.com wrote:
 On Tue, Jun 22, 2010 at 10:09 AM, Tom Durrantthdurr...@gmail.com  wrote:

 the basic idea is in polyfit  on multiple data points on
 numpy-disscusion mailing list April 2009

 In this case, calculations have to be done by groups

 subtract mean (this needs to be replaced by group demeaning)
 modeldm = model - model.mean()
 obsdm = obs - obs.mean()

 xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
   bins=(latedges,lonedges))
 xy, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*obsdm,
   bins=(latedges,lonedges))


 slopes = xy/xx  # slopes by group

 expand slopes to length of original array
 predicted = model - obs * slopes_expanded
 ...

 the main point is to get the group functions, for demeaning, ... for
 the 2d labels (and get the labels out of histogramdd)

 I'm out of time (off to the airport soon), but I can look into it next
 weekend.

 Josef


 Thanks Josef, I will chase up the April list...
 If I understand what you have done above, this returns the slope of best fit
 lines forced through the origin, is that right?
  
 Not if both variables, model and obs, are demeaned first. Demeaning
 removes any effect of a constant and only the slope is left over,
 which can be computed with the ratio xy/xx.

 But to get independent intercept per group, the demeaning has to be by group.

 What's the size of your problem, how many groups or how many separate
 regressions ?

 demeaning by group has a setup cost in this case, so the main speed
 benefit would come if you do the digitize and label generation that
 histogram2d does only once and reuse it in later calculations.

 Using dummy variables as Bruce proposes works very well if there are
 not a very large number of groups, otherwise I think the memory
 requirements and size of array would be very costly in terms of
 performance.

 Josef


There is always a tradeoff of memory vs speed. It is too easy to be too 
clever just to find that a more brute force approach is considerably 
faster. You want to avoid Python code and array indexing as much as 
possible since those can be major speed bumps. Also, some approaches have 
hidden memory costs (histogram2d calls histogramdd([x,y], ...)).

If memory is an issue, then obviously you need to decide how to handle 
it because there are many ways around it. For example, the biggest 
memory usage in my code should be related to the design matrix and 
creating the normal equations. At worst (which requires passing by value 
rather than reference) it would be two 2-d arrays with the number of 
rows being the number of observations and the number of columns being 
two times the number of groups. If you have scipy, then sparse matrices 
can reduce the memory footprint of the design matrix and could be 
faster. There are also ways to construct the design matrix and the 
normal equations either observation by observation (essential for very 
large data sets) or pieces (uses within group information of numbers of 
observations, sum of 'model' and sum of 'model*model'). If your 
boxes have homogeneous variance (which is already being assumed), you 
can first divide the data into groups of boxes and then loop over each 
group reusing the existing arrays as needed.
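For illustration, here is a minimal sketch (not from the original posts) of how
the dummy-plus-interaction design matrix could be assembled as a sparse matrix
with scipy.sparse; the toy arrays box, x and model are assumptions standing in
for the real data, and the layout (intercept columns first, slope columns
second) mirrors the model discussed above.

import numpy as np
import scipy.sparse as sp

# toy data (assumption: integer box labels 0..ngroups-1)
box = np.array([0, 0, 0, 1, 1, 1])
x = np.array([1.67, 1.68, 1.70, 1.79, 2.04, 2.36])
model = np.array([1.42, 1.35, 1.55, 1.50, 1.59, 1.76])

nobs = len(box)
ngroups = box.max() + 1
rows = np.arange(nobs)

# dummy columns (intercept per box) and interaction columns (slope per box)
dummies = sp.csr_matrix((np.ones(nobs), (rows, box)), shape=(nobs, ngroups))
interact = sp.csr_matrix((x, (rows, box)), shape=(nobs, ngroups))
X = sp.hstack([dummies, interact]).tocsr()   # design matrix, 2*ngroups columns

# normal equations: only a (2*ngroups x 2*ngroups) dense solve remains
XtX = X.T.dot(X).toarray()
Xty = X.T.dot(model)
beta = np.linalg.solve(XtX, Xty)
print beta   # first ngroups entries: intercepts, last ngroups entries: slopes

The design matrix never stores the zeros of the dummy blocks, which is where
the memory saving comes from when the number of groups is large.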

Bruce
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-22 Thread Tom Durrant


 the basic idea is in polyfit  on multiple data points on
 numpy-disscusion mailing list April 2009

 In this case, calculations have to be done by groups

 subtract mean (this needs to be replaced by group demeaning)
 modeldm = model - model.mean()
 obsdm = obs - obs.mean()

 xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
  bins=(latedges,lonedges))
 xy, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*obsdm,
  bins=(latedges,lonedges))


 slopes = xy/xx  # slopes by group

 expand slopes to length of original array
 predicted = model - obs * slopes_expanded
 ...

 the main point is to get the group functions, for demeaning, ... for
 the 2d labels (and get the labels out of histogramdd)

 I'm out of time (off to the airport soon), but I can look into it next
 weekend.

 Josef

 Thanks Josef, I will chase up the April list...

If I understand what you have done above, this returns the slope of best fit
lines forced through the origin, is that right?

Have a great trip!

Tom
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-22 Thread Tom Durrant


 
 What exactly are you trying to fit? It is rather bad practice to fit
 a model to summarized data because you lose the uncertainty in the
 original data.
 If you define your boxes, you can loop through directly on each box and
 even fit the equation:

 model=mu +beta1*obs

 The extension is to fit the larger equation:
 model=mu + boxes + beta1*obs + beta2*obs*boxes

 where 'boxes' is an indicator or dummy variable for each box.
 Since you are only interested in the box by model term, you probably can
 use this type of model
 model=mu + boxes + beta2*obs*boxes

 However, these models assume that the residual variance is identical for
 all boxes. (That is solved by a mixed model approach.)

 Bruce


Bruce,

I am trying to determine spatially based linear corrections for surface
winds in order to force a wave model.  The basic idea is to use satellite
observations to determine the errors in the wind, and reduce them by
applying a linear correction prior to forcing the wave model.

I am not sure I understand what you are saying; I am possibly trying to do
what you are describing, i.e. for each box, gather observations, determine
a linear correction, and apply it to the model

model = a*x + b

Does that make sense?

Cheers
Tom





 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-22 Thread Bruce Southey

On 06/22/2010 09:13 AM, Tom Durrant wrote:




What exactly are you trying to fit? It is rather bad practice to fit
a model to summarized data because you lose the uncertainty in the
original data.
If you define your boxes, you can loop through directly on each
box and
even fit the equation:

model=mu +beta1*obs

The extension is to fit the larger equation:
model=mu + boxes + beta1*obs + beta2*obs*boxes

where 'boxes' is an indicator or dummy variable for each box.
Since you are only interested in the box by model term, you
probably can
use this type of model
model=mu + boxes + beta2*obs*boxes

However, these models assume that the residual variance is
identical for
all boxes. (That is solved by a mixed model approach.)

Bruce


Bruce,

I am trying to determine spatially based linear corrections for 
surface winds in order to force a wave model.  The basic idea is to use 
satellite observations to determine the errors in the wind, and reduce 
them by applying a linear correction prior to forcing the wave model.


I am not sure I understand what you are saying; I am possibly trying 
to do what you are describing, i.e. for each box, gather 
observations, determine a linear correction, and apply it to the model


model = a*x + b

Does that make sense?

Cheers
Tom




I used the data you gave and may have swapped 'model' and 'x' here - the 
code should work if you switch these.
First I assume that you can create a variable 'box' based on the lat/lon 
data - I just created one from the first values.


As I understand it, your problem is essentially an analysis of 
covariance with one factor (box) and one regressor (x). So I used some 
technical knowledge about the parameterization of the normal equations 
that may not be true in general for all models as it depends on finding 
'equivalent models'.  Just print out the different arrays for a small 
example as it is rather hard to describe using words alone.


Basically you can create a design matrix that just includes dummy 
variables for each box and the interactions between that dummy variable 
and your 'x' array. This amounts to the equation:


y = box + box*x

where
y is your 'model' array,
box is a dummy variable for box (the number of columns is the number of boxes), and
x is your 'x' array.

Note that there is no general or overall intercept here because you are 
not interested in that. Rather you are interested in the 'intercept' for 
each box which comes from the corresponding solution.


The code provides a function to create the dummy variables for each box 
and a second function to compute the interactions between that dummy 
variable and your 'x' array - these functions originated in 
pystatsmodels. After that I form and solve the normal equations to get 
the standard errors of the solutions and other useful statistics. Note 
that the residual variance (MSE) is probably the pooled residual 
variance across all boxes (I am too lazy to check).



Also, if you are not interested in the standard errors etc. then you 
probably should use a more efficient solver available in numpy.


I (hopefully) reshaped the solutions ('beta') and standard errors 
('pSE') so the rows are for each box, the first column is the 
'intercept' and the second column is the 'slope'.


The output is:

$ python reg_box.py
Residual Sum of Squares  0.0306718851814
Residual Mean Sum of Squares 0.00511198086357
RSquared 0.975657233983
Estimated Intercept and regression for each box
[[ 0.69044944  0.44569288]
 [-1.53272813  1.43358273]]
Estimated standard error of the Intercept and regression for each box
[[ 0.41081864  0.23061509]
 [ 0.21123387  0.095004  ]]

For box=1:
a=0.45 (se=0.23) and b=0.69 (se= 0.41)

For box=2:
a=1.43 (se=0.095) and b=-1.53 (se=0.21)

Bruce
(If you have questions, you can contact me off list if you want.)



import numpy as np
y =np.array(   [1.42, 1.35, 1.55, 1.50, 1.59, 1.76, 2.15, 1.90, 1.55, 0.73  ])
x =np.array( [1.67, 1.68, 1.70, 1.79, 2.04, 2.36, 2.53, 2.38, 2.149, 1.57 ])
box =np.array( [1, 1, 1, 1, 1, 2, 2, 2, 2, 2 ])

def data2dummy(x):
    if not isinstance(x, np.ndarray): #if not an ndarray object attempt to convert it
        x = np.asarray(x)
    if len(x.shape) > 1:
        raise ValueError('Too many columns')
    groups = np.unique(x)
    return (x[:, None] == groups).astype(int), groups

def design_int(A, B):
    if len(B.shape) > 1:
        rowB, colB = B.shape
        ab = A*B[:,0].reshape(rowB,1)
    else:
        rowB = B.shape[0]
        colB = 1
        ab = A*B[:].reshape(rowB,1)
    for col in range(1,colB):
        ncol = B[:,col].reshape(rowB,1)
        ab = np.hstack((ab,A*ncol))
    return ab

dbox, lbox= data2dummy(box) #create dummy variable for box
Xdesign=np.hstack((dbox,design_int(dbox, x))) #create Design matrix
#Create components of the normal equations
YY=np.dot(y.T,y)
XX=np.dot(Xdesign.T,Xdesign)
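The attached script appears to be truncated by the archive at this point. A
plausible completion of the remaining steps Bruce describes above (forming XY,
solving the normal equations, and computing the residual statistics, reshaped
solutions and standard errors) might look like the following sketch; it is a
reconstruction consistent with the printed output, not Bruce's original code.

XY = np.dot(Xdesign.T, y)

# solve the normal equations for the coefficient vector
iXX = np.linalg.pinv(XX)
beta = np.dot(iXX, XY)

# residual statistics (pooled across boxes)
nobs, ncoef = Xdesign.shape
RSS = YY - np.dot(beta, XY)
MSE = RSS / (nobs - ncoef)
TSS = YY - nobs * y.mean()**2
print 'Residual Sum of Squares ', RSS
print 'Residual Mean Sum of Squares', MSE
print 'RSquared', 1 - RSS / TSS

# standard errors of the solutions
pSE = np.sqrt(MSE * np.diag(iXX))

# reshape so each row is a box: column 0 = 'intercept', column 1 = 'slope'
nbox = len(lbox)
print 'Estimated Intercept and regression for each box'
print np.column_stack((beta[:nbox], beta[nbox:]))
print 'Estimated standard error of the Intercept and regression for each box'
print np.column_stack((pSE[:nbox], pSE[nbox:]))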

Re: [Numpy-discussion] 2d binning and linear regression

2010-06-22 Thread josef . pktd
On Tue, Jun 22, 2010 at 10:09 AM, Tom Durrant thdurr...@gmail.com wrote:

 the basic idea is in polyfit  on multiple data points on
 numpy-disscusion mailing list April 2009

 In this case, calculations have to be done by groups

 subtract mean (this needs to be replaced by group demeaning)
 modeldm = model - model.mean()
 obsdm = obs - obs.mean()

 xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
      bins=(latedges,lonedges))
 xy, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*obsdm,
      bins=(latedges,lonedges))


 slopes = xy/xx  # slopes by group

 expand slopes to length of original array
 predicted = model - obs * slopes_expanded
 ...

 the main point is to get the group functions, for demeaning, ... for
 the 2d labels (and get the labels out of histogramdd)

 I'm out of time (off to the airport soon), but I can look into it next
 weekend.

 Josef

 Thanks Josef, I will chase up the April list...
 If I understand what you have done above, this returns the slope of best fit
 lines forced through the origin, is that right?

Not if both variables, model and obs, are demeaned first. Demeaning
removes any effect of a constant and only the slope is left over,
which can be computed with the ratio xy/xx.

But to get independent intercept per group, the demeaning has to be by group.

What's the size of your problem, how many groups or how many separate
regressions ?

demeaning by group has a setup cost in this case, so the main speed
benefit would come if you do the digitize and label generation that
histogram2d does only once and reuse it in later calculations.
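As an illustration of the group-demeaning idea, here is a small sketch (not
from the thread) under the assumption that a 1d integer group label per
observation is available; it computes per-group slopes and intercepts of obs
on model with np.bincount and no Python loop.

import numpy as np

def group_slopes(model, obs, labels):
    # labels is assumed to be an array of consecutive integers 0..ngroups-1
    counts = np.bincount(labels).astype(float)
    # group means, expanded back to the length of the original arrays
    model_mean = np.bincount(labels, weights=model) / counts
    obs_mean = np.bincount(labels, weights=obs) / counts
    modeldm = model - model_mean[labels]
    obsdm = obs - obs_mean[labels]
    # per-group cross and own products, then slope and intercept by group
    xx = np.bincount(labels, weights=modeldm * modeldm)
    xy = np.bincount(labels, weights=modeldm * obsdm)
    slopes = xy / xx                      # empty groups come out as nan
    intercepts = obs_mean - slopes * model_mean
    return slopes, intercepts

# one possible way to build the labels from the 2d bins (assumed edge arrays):
# lat_idx = np.digitize(lat, latedges) - 1
# lon_idx = np.digitize(lon, lonedges) - 1
# labels = lat_idx * (len(lonedges) - 1) + lon_idx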

Using dummy variables as Bruce proposes works very well if there are
not a very large number of groups, otherwise I think the memory
requirements and size of array would be very costly in terms of
performance.

Josef




 Have a great trip!
 Tom


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-21 Thread josef . pktd
On Sun, Jun 20, 2010 at 10:57 PM, Tom Durrant thdurr...@gmail.com wrote:


 are you doing something like np.polyfit(model, obs, 1) ?

 If you are using polyfit with deg=1, i.e. fitting a straight line,
 then this could be also calculated using the weights in histogram2d.

 histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
 surprised if the histogram2d version is much faster. If a quick
 reading of histogramdd is correct, the main improvement would be to
 get the labels xy out of it, so it can be used repeatedly with
 np.bincount.

 Josef

 Thanks Josef,

 From my limited understanding, you are right the histogram is much faster 
 due to
 the fact that it doesn't have to keep reading in the array over and over

 I am using np.polyfit(model, obs, 1).  I couldn't work out a way to do these
 regression using histogram2d and weights, but you think it can be done?  This
 would be great!

the basic idea is in polyfit  on multiple data points on
numpy-disscusion mailing list April 2009

In this case, calculations have to be done by groups

subtract mean (this needs to be replaced by group demeaning)
modeldm = model - model.mean()
obsdm = obs - obs.mean()

xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
  bins=(latedges,lonedges))
xy, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*obsdm,
  bins=(latedges,lonedges))


slopes = xy/xx  # slopes by group

expand slopes to length of original array
predicted = model - obs * slopes_expanded
...

the main point is to get the group functions, for demeaning, ... for
the 2d labels (and get the labels out of histogramdd)

I'm out of time (off to the airport soon), but I can look into it next weekend.

Josef



 Cheers
 Tom





 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-21 Thread Bruce Southey
On 06/20/2010 03:24 AM, Tom Durrant wrote:
 Hi All,

 I have a problem involving lat/lon data.  Basically, I am evaluating 
 numerical weather model data against satellite data, and trying to 
 produce gridded plots of various statistics.  There are various steps 
 involved with this, but basically, I get to the point where I have 
 four arrays of the same length, containing lat, lon, model and 
 observed values respectively along the path of the satellite.

 eg:

 lat   = [  50.32   50.78   51.24   51.82   52.55   53.15   53.75   54.28   54.79   55.16  ... ]
 lon   = [ 192.83  193.38  193.94  194.67  195.65  196.49  197.35  198.15  198.94  199.53  ... ]
 obs   = [   1.42    1.35    1.55    1.50    1.59    1.76    2.15    1.90    1.55    0.73  ... ]
 model = [   1.67    1.68    1.70    1.79    2.04    2.36    2.53    2.38    2.149   1.57  ... ]

 I then want to calculate statistics based on bins of say 2 X 2 degree 
 boxes. These arrays are very large, on the order of 10^6. For ease of 
 explanation, I will focus initially on bias.

 My first approach was to use loops through lat/lon boxes and use 
 np.where statements to extract all the values of the model and 
 observations within the given box, and calculate the bias as the mean 
 of the difference.  This was obviously very slow.

 histogram2d provided a FAR better way to do this.  i.e.


 import numpy as np

 latedges=np.arange(-90,90,2)
 lonedges=np.arange(0,360,2)

 diff = model-obs
 grid_N, xedges, yedges = np.histogram2d(lat, lon,
 bins=(latedges,lonedges))
 grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
 bins=(latedges,lonedges))
 grid_bias = grid_bias_sum/grid_N


 I now want to determine the linear regression coefficients for 
 each box after fitting a least squares linear regression to the 
 data in each bin.  I have been looking at using np.digitize to extract 
 the bin indices, but haven't had much success trying to do this in two 
 dimensions.  I am back to looping through the lat and lon box values, 
 using np.where to extract the observations and model data within that 
 box, and using np.polyfit to obtain the regression coefficients.  This 
 is, of course, very slow.

 Can anyone advise a smarter, vectorized way to do this?  Any thoughts 
 would be greatly appreciated.

 Thanks in advance
 Tom



What exactly are you trying to fit? It is rather bad practice to fit 
a model to summarized data because you lose the uncertainty in the 
original data.
If you define your boxes, you can loop through directly on each box and 
even fit the equation:

model=mu +beta1*obs

The extension is to fit the larger equation:
model=mu + boxes + beta1*obs + beta2*obs*boxes

where 'boxes' is an indicator or dummy variable for each box.
Since you are only interested in the box by model term, you probably can 
use this type of model
model=mu + boxes + beta2*obs*boxes

However, these models assume that the residual variance is identical for 
all boxes. (That is solved by a mixed model approach.)

Bruce






___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] 2d binning and linear regression

2010-06-20 Thread Tom Durrant
Hi All,

I have a problem involving lat/lon data.  Basically, I am evaluating
numerical weather model data against satellite data, and trying to produce
gridded plots of various statistics.  There are various steps involved with
this, but basically, I get to the point where I have four arrays of the same
length, containing lat, lon, model and observed values respectively along
the path of the satellite.

eg:

lat   = [  50.32   50.78   51.24   51.82   52.55   53.15   53.75   54.28   54.79   55.16  ... ]
lon   = [ 192.83  193.38  193.94  194.67  195.65  196.49  197.35  198.15  198.94  199.53  ... ]
obs   = [   1.42    1.35    1.55    1.50    1.59    1.76    2.15    1.90    1.55    0.73  ... ]
model = [   1.67    1.68    1.70    1.79    2.04    2.36    2.53    2.38    2.149   1.57  ... ]

I then want to calculate statistics based on bins of say 2 X 2 degree boxes.
These arrays are very large, on the order of 10^6. For ease of explanation,
I will focus initially on bias.

My first approach was to use loops through lat/lon boxes and use np.where
statements to extract all the values of the model and observations within
the given box, and calculate the bias as the mean of the difference.  This
was obviously very slow.

histogram2d provided a FAR better way to do this.  i.e.


import numpy as np

latedges=np.arange(-90,90,2)
lonedges=np.arange(0,360,2)

diff = model-obs
grid_N, xedges, yedges = np.histogram2d(lat, lon,
bins=(latedges,lonedges))
grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
bins=(latedges,lonedges))
grid_bias = grid_bias_sum/grid_N


I now want to determine the linear regression coefficients for each
box after fitting a least squares linear regression to the data in each bin.
 I have been looking at using np.digitize to extract the bin indices, but
haven't had much success trying to do this in two dimensions.  I am back to
looping through the lat and lon box values, using np.where to extract the
observations and model data within that box, and using np.polyfit to obtain
the regression coefficients.  This is, of course, very slow.

Can anyone advise a smarter, vectorized way to do this?  Any thoughts would
be greatly appreciated.

Thanks in advance
Tom
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-20 Thread josef . pktd
On Sun, Jun 20, 2010 at 4:24 AM, Tom Durrant thdurr...@gmail.com wrote:
 Hi All,
 I have a problem involving lat/lon data.  Basically, I am evaluating
 numerical weather model data against satellite data, and trying to produce
 gridded plots of various statistics.  There are various steps involved with
 this, but basically, I get to the point where I have four arrays of the same
 length, containing lat, lon, model and observed values respectively along
 the path of the satellite.
 eg:
 lat   = [  50.32   50.78   51.24   51.82   52.55   53.15   53.75   54.28   54.79   55.16  ... ]
 lon   = [ 192.83  193.38  193.94  194.67  195.65  196.49  197.35  198.15  198.94  199.53  ... ]
 obs   = [   1.42    1.35    1.55    1.50    1.59    1.76    2.15    1.90    1.55    0.73  ... ]
 model = [   1.67    1.68    1.70    1.79    2.04    2.36    2.53    2.38    2.149   1.57  ... ]
 I then want to calculate statistics based on bins of say 2 X 2 degree boxes.
 These arrays are very large, on the order of 10^6. For ease of explanation,
 I will focus initially on bias.
 My first approach was to use loops through lat/lon boxes and use np.where
 statements to extract all the values of the model and observations within
 the given box, and calculate the bias as the mean of the difference.  This
 was obviously very slow.
 histogram2d provided a FAR better way to do this.  i.e.

 import numpy as np
 latedges=np.arange(-90,90,2)
 lonedges=np.arange(0,360,2)
 diff = model-obs
 grid_N, xedges, yedges = np.histogram2d(lat, lon,
                 bins=(latedges,lonedges))
 grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
                 bins=(latedges,lonedges))
 grid_bias = grid_bias_sum/grid_N

 I now want to determine the linear regression coefficients for each
 box after fitting a least squares linear regression to the data in each bin.
  I have been looking at using np.digitize to extract the bin indices, but
 haven't had much success trying to do this in two dimensions.  I am back to
 looping through the lat and lon box values, using np.where to extract the
 observations and model data within that box, and using np.polyfit to obtain
 the regression coefficients.  This is, of course, very slow.
 Can anyone advise a smarter, vectorized way to do this?  Any thoughts would
 be greatly appreciated.

For a general linear regression problem, there wouldn't be much that I
can see that can be done.

If there are many small regression problem, then sometimes stacking
them into one big sparse least squares problem can be faster, it's
faster to solve but not always faster to create in the first place.

are you doing something like np.polyfit(model, obs, 1) ?

If you are using polyfit with deg=1, i.e. fitting a straight line,
then this could be also calculated using the weights in histogram2d.

histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
surprised if the histogram2d version is much faster. If a quick
reading of histogramdd is correct, the main improvement would be to
get the labels xy out of it, so it can be used repeatedly with
np.bincount.
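
For a straight-line fit, the per-bin slope and intercept can indeed be written
entirely in terms of weighted histogram2d sums. A rough sketch (not from the
thread; the argument names follow the original post and are assumptions about
the data layout):

import numpy as np

def binned_linefit(lat, lon, x, y, latedges, lonedges):
    bins = (latedges, lonedges)
    n,   _, _ = np.histogram2d(lat, lon, bins=bins)
    sx,  _, _ = np.histogram2d(lat, lon, weights=x, bins=bins)
    sy,  _, _ = np.histogram2d(lat, lon, weights=y, bins=bins)
    sxx, _, _ = np.histogram2d(lat, lon, weights=x * x, bins=bins)
    sxy, _, _ = np.histogram2d(lat, lon, weights=x * y, bins=bins)
    # least-squares slope and intercept per bin (empty bins give nan)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# usage corresponding to np.polyfit(model, obs, 1) in each 2x2 degree box:
# slope, intercept = binned_linefit(lat, lon, model, obs, latedges, lonedges)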

Josef


 Thanks in advance
 Tom




 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-20 Thread Tom Durrant

 
 are you doing something like np.polyfit(model, obs, 1) ?
 
 If you are using polyfit with deg=1, i.e. fitting a straight line,
 then this could be also calculated using the weights in histogram2d.
 
 histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
 surprised if the histogram2d version is much faster. If a quick
 reading of histogramdd is correct, the main improvement would be to
 get the labels xy out of it, so it can be used repeatedly with
 np.bincount.
 
 Josef
 
Thanks Josef, 

From my limited understanding, you are right, the histogram approach is much
faster because it doesn't have to keep reading the array over and over.

I am using np.polyfit(model, obs, 1).  I couldn't work out a way to do these 
regressions using histogram2d and weights, but you think it can be done?  This 
would be great!

Cheers
Tom
 




___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning on regular grid

2010-06-03 Thread Friedrich Romstedt
Hello Andreas,

please see this as a side remark.

A colleague of mine made me aware of a very beautiful thing about
covering spheres by evenly spaced points:

http://healpix.jpl.nasa.gov/

Since you want to calculate mean and stddev, to my understanding a
longitude/latitude grid is problematic without proper weighting
factors.  Do you use weighting factors?  If yes, of what kind?

For Healpix, there exists a Python binding, but I never worked with it.

Friedrich
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2d binning on regular grid

2010-06-03 Thread josef . pktd
On Wed, Jun 2, 2010 at 11:40 AM, Andreas Hilboll li...@hilboll.de wrote:
 Hi there,

 I'm interested in the solution to a special case of the parallel thread
 '2D binning', which is going on at the moment. My data is on a fine global
 grid, say .125x.125 degrees. I'm looking for a way to do calculations on
 coarser grids, e.g.

 * calculate means()
 * calculate std()
 * ...

 on a, say, 2.5x3.75 degree grid. One very crude approach would be to
 iterate through latitudes and longitudes, like this:


 latstep_orig = .125
 lonstep_orig = .125
 data_orig = np.arange(int(180./latstep_orig) * int(360./lonstep_orig)).reshape(
     (int(180./latstep_orig), int(360./lonstep_orig)))
 latstep_new = 2.5
 lonstep_new = 3.75
 latstep = int(latstep_new / latstep_orig)
 lonstep = int(lonstep_new / lonstep_orig)

 print 'one new lat equals',latstep,'new lats'
 print 'one new lon equals',lonstep,'new lons'

 result = ma.zeros((int(180./latstep_new), int(360./lonstep_new)))
 latidx = 0
 while latidx*latstep_new < 180.:
     lonidx = 0
     while lonidx*lonstep_new < 360.:
         m = np.mean( \
             data_orig[latidx*latstep:(latidx+1)*latstep, \
             lonidx*lonstep:(lonidx+1)*lonstep])
         result[latidx,lonidx] = m
         lonidx += 1
     latidx += 1

 However, this is very crude, and I was wondering if there's any more
 elegant way to do it ...

 Thanks for your insight!

I thought maybe there is something in ndimage for this, but since
nobody mentions it, maybe not.

If there are no memory problems and my interpretation of the question
is correct, then something like this might work:

>>> x = np.arange(16).reshape(4,-1)
>>> d = np.kron(np.eye(2), np.ones(2))
>>> s = np.dot(np.dot(d, x), d.T)
>>> m = np.dot(np.dot(d, x), d.T)/4
>>> v = np.dot(np.dot(d, x**2), d.T)/4 - m**2
>>> x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>>> s
array([[ 10.,  18.],
       [ 42.,  50.]])
>>> m
array([[  2.5,   4.5],
       [ 10.5,  12.5]])
>>> v
array([[ 4.25,  4.25],
       [ 4.25,  4.25]])
>>> x[:2,:2].sum()
10
>>> x[:2,:2].mean()
2.5
>>> x[:2,:2].var()
4.25

Josef


 A.

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Vincent Schut
On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
 On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu  
 wrote:
 I guess it's as fast as I'm going to get. I don't really see any
 other way. BTW, the lat/lons are integers)

 You could (in c or cython) try a brain-dead hashtable with no
 collision detection:

 for lat, long, data in dataset:
bin = (lat ^ long) % num_bins
hashtable[bin] = update_incremental_mean(hashtable[bin], data)

 you'll of course want to do some experiments to see if your data are
 sufficiently sparse and/or you can afford a large enough hashtable
 array that you won't get spurious hash collisions. Adding error-
 checking to ensure that there are no collisions would be pretty
 trivial (just keep a table of the lat/long for each hash value, which
 you'll need anyway, and check that different lat/long pairs don't get
 assigned the same bin).

 Zach



 -Mathew

 On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincuszachary.pin...@yale.edu
 wrote:
 Hi
  Can anyone think of a clever (non-looping) solution to the
  following?

  I have a list of latitudes, a list of longitudes, and a list of data
  values. All lists are the same length.

  I want to compute an average of data values for each lat/lon pair.
  e.g. if (lat[1001], lon[1001]) == (lat[2001], lon[2001]) then
  data[1001] = (data[1001] + data[2001])/2

  Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in
 grouped.items())

 Is that fast enough?

 If the lat lon can be converted to a 1d label as Wes suggested, then
 in a similar timing exercise ndimage was the fastest.
 http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html

And as you said your lats and lons are integers, you could simply do

ll = lat*1000 + lon

to get unique 'hashes' or '1d labels' for your lat/lon pairs, as a lat or 
lon will never exceed 360 (degrees).

After that, either use the ndimage approach, or you could use 
histogramming with weighting by data values and divide by the histogram 
without weighting, or just loop.
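
As a concrete illustration of the '1d label' route (a sketch, assuming
non-negative integer lat/lon and a data array of the same length, as stated
above), the per-pair mean can be had from two bincounts:

import numpy as np

ll = lat * 1000 + lon                      # unique 1d label per lat/lon pair
sums = np.bincount(ll, weights=data)       # sum of data per label
counts = np.bincount(ll)                   # number of points per label
occupied = counts > 0
means = sums[occupied] / counts[occupied]  # mean per occupied lat/lon pair
labels = np.flatnonzero(occupied)          # the corresponding ll values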

Vincent


 (this was for python 2.4, also later I found np.bincount which
 requires that the labels are consecutive integers, but is as fast as
 ndimage)

 I don't know how it would compare to the new suggestions.

 Josef




 Zach


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Benjamin Root
Why not simply use a set?

uniquePoints = set(zip(lats, lons))

Ben Root

On Wed, Jun 2, 2010 at 2:41 AM, Vincent Schut sc...@sarvision.nl wrote:

 On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
  On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu
  wrote:
  I guess it's as fast as I'm going to get. I don't really see any
  other way. BTW, the lat/lons are integers)
 
  You could (in c or cython) try a brain-dead hashtable with no
  collision detection:
 
  for lat, long, data in dataset:
 bin = (lat ^ long) % num_bins
 hashtable[bin] = update_incremental_mean(hashtable[bin], data)
 
  you'll of course want to do some experiments to see if your data are
  sufficiently sparse and/or you can afford a large enough hashtable
  array that you won't get spurious hash collisions. Adding error-
  checking to ensure that there are no collisions would be pretty
  trivial (just keep a table of the lat/long for each hash value, which
  you'll need anyway, and check that different lat/long pairs don't get
  assigned the same bin).
 
  Zach
 
 
 
  -Mathew
 
  On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincuszachary.pin...@yale.edu
  wrote:
  Hi
  Can anyone think of a clever (non-lopping) solution to the
  following?
 
  A have a list of latitudes, a list of longitudes, and list of data
  values. All lists are the same length.
 
  I want to compute an average  of data values for each lat/lon pair.
  e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take wa to long.
 
  As a start, are the equal lat/lon pairs exactly equal (i.e. either
  not floating-point, or floats that will always compare equal, that is,
  the floating-point bit-patterns will be guaranteed to be identical) or
  approximately equal to float tolerance?
 
  If you're in the approx-equal case, then look at the KD-tree in scipy
  for doing near-neighbors queries.
 
  If you're in the exact-equal case, you could consider hashing the lat/
  lon pairs or something. At least then the looping is O(N) and not
  O(N^2):
 
  import collections
  grouped = collections.defaultdict(list)
  for lt, ln, da in zip(lat, lon, data):
 grouped[(lt, ln)].append(da)
 
  averaged = dict((ltln, numpy.mean(da)) for ltln, da in
  grouped.items())
 
  Is that fast enough?
 
  If the lat lon can be converted to a 1d label as Wes suggested, then
  in a similar timing exercise ndimage was the fastest.
  http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html

 And as you said your lats and lons are integers, you could simply do

 ll = lat*1000 + lon

 to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
 lon will never exceed 360 (degrees).

 After that, either use the ndimage approach, or you could use
 histogramming with weighting by data values and divide by histogram
 withouth weighting, or just loop.

 Vincent

 
  (this was for python 2.4, also later I found np.bincount which
  requires that the labels are consecutive integers, but is as fast as
  ndimage)
 
  I don't know how it would compare to the new suggestions.
 
  Josef
 
 
 
 
  Zach
 

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] 2d binning on regular grid

2010-06-02 Thread Andreas Hilboll
Hi there,

I'm interested in the solution to a special case of the parallel thread
'2D binning', which is going on at the moment. My data is on a fine global
grid, say .125x.125 degrees. I'm looking for a way to do calculations on
coarser grids, e.g.

* calculate means()
* calculate std()
* ...

on a, say, 2.5x3.75 degree grid. One very crude approach would be to
iterate through latitudes and longitudes, like this:


import numpy as np
import numpy.ma as ma

latstep_orig = .125
lonstep_orig = .125
data_orig = np.arange(int(180./latstep_orig) * int(360./lonstep_orig)).reshape(
    (int(180./latstep_orig), int(360./lonstep_orig)))
latstep_new = 2.5
lonstep_new = 3.75
latstep = int(latstep_new / latstep_orig)
lonstep = int(lonstep_new / lonstep_orig)

print 'one new lat equals',latstep,'new lats'
print 'one new lon equals',lonstep,'new lons'

result = ma.zeros((int(180./latstep_new), int(360./lonstep_new)))
latidx = 0
while latidx*latstep_new < 180.:
    lonidx = 0
    while lonidx*lonstep_new < 360.:
        m = np.mean( \
            data_orig[latidx*latstep:(latidx+1)*latstep, \
            lonidx*lonstep:(lonidx+1)*lonstep])
        result[latidx,lonidx] = m
        lonidx += 1
    latidx += 1

However, this is very crude, and I was wondering if there's any more
elegant way to do it ...

Thanks for your insight!

A.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Wes McKinney
On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote:
 On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
 On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu  
 wrote:
 I guess it's as fast as I'm going to get. I don't really see any
 other way. BTW, the lat/lons are integers)

 You could (in c or cython) try a brain-dead hashtable with no
 collision detection:

 for lat, long, data in dataset:
    bin = (lat ^ long) % num_bins
    hashtable[bin] = update_incremental_mean(hashtable[bin], data)

 you'll of course want to do some experiments to see if your data are
 sufficiently sparse and/or you can afford a large enough hashtable
 array that you won't get spurious hash collisions. Adding error-
 checking to ensure that there are no collisions would be pretty
 trivial (just keep a table of the lat/long for each hash value, which
 you'll need anyway, and check that different lat/long pairs don't get
 assigned the same bin).

 Zach



 -Mathew

 On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincuszachary.pin...@yale.edu
 wrote:
 Hi
  Can anyone think of a clever (non-looping) solution to the
  following?

  I have a list of latitudes, a list of longitudes, and a list of data
  values. All lists are the same length.

  I want to compute an average of data values for each lat/lon pair.
  e.g. if (lat[1001], lon[1001]) == (lat[2001], lon[2001]) then
  data[1001] = (data[1001] + data[2001])/2

  Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
    grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in
 grouped.items())

 Is that fast enough?

 If the lat lon can be converted to a 1d label as Wes suggested, then
 in a similar timing exercise ndimage was the fastest.
 http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html

 And as you said your lats and lons are integers, you could simply do

 ll = lat*1000 + lon

 to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
 lon will never exceed 360 (degrees).

 After that, either use the ndimage approach, or you could use
 histogramming with weighting by data values and divide by histogram
 withouth weighting, or just loop.

 Vincent


 (this was for python 2.4, also later I found np.bincount which
 requires that the labels are consecutive integers, but is as fast as
 ndimage)

 I don't know how it would compare to the new suggestions.

 Josef




 Zach


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I was curious about how fast ndimage was for this operation so here's
the complete function.

import collections
import numpy as np
import scipy.ndimage as ndi

N = 1

lat = np.random.randint(0, 360, N)
lon = np.random.randint(0, 360, N)
data = np.random.randn(N)

def group_mean(lat, lon, data):
    indexer = np.lexsort((lon, lat))
    lat = lat.take(indexer)
    lon = lon.take(indexer)
    sorted_data = data.take(indexer)

    keys = 1000 * lat + lon
    unique_keys = np.unique(keys)

    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
    decoder = keys.searchsorted(unique_keys)

    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

Appears to be about 13x faster (and could be made faster still) than
the naive version on my machine:

def group_mean_naive(lat, lon, data):
    grouped = collections.defaultdict(list)
    for lt, ln, da in zip(lat, lon, data):
        grouped[(lt, ln)].append(da)

    averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())

    return averaged

I had to get the latest scipy trunk to not get an error from ndimage.mean
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Mathew Yeates
thanks. I am also getting an error in ndi.mean
Were you getting the error
RuntimeError: data type not supported?

-Mathew


On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote:
  On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
  On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu
  wrote:
  I guess it's as fast as I'm going to get. I don't really see any
  other way. BTW, the lat/lons are integers)
 
  You could (in c or cython) try a brain-dead hashtable with no
  collision detection:
 
  for lat, long, data in dataset:
 bin = (lat ^ long) % num_bins
 hashtable[bin] = update_incremental_mean(hashtable[bin], data)
 
  you'll of course want to do some experiments to see if your data are
  sufficiently sparse and/or you can afford a large enough hashtable
  array that you won't get spurious hash collisions. Adding error-
  checking to ensure that there are no collisions would be pretty
  trivial (just keep a table of the lat/long for each hash value, which
  you'll need anyway, and check that different lat/long pairs don't get
  assigned the same bin).
 
  Zach
 
 
 
  -Mathew
 
  On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus
 zachary.pin...@yale.edu
  wrote:
  Hi
  Can anyone think of a clever (non-lopping) solution to the
  following?
 
  A have a list of latitudes, a list of longitudes, and list of data
  values. All lists are the same length.
 
  I want to compute an average  of data values for each lat/lon pair.
  e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take wa to long.
 
  As a start, are the equal lat/lon pairs exactly equal (i.e. either
  not floating-point, or floats that will always compare equal, that is,
  the floating-point bit-patterns will be guaranteed to be identical) or
  approximately equal to float tolerance?
 
  If you're in the approx-equal case, then look at the KD-tree in scipy
  for doing near-neighbors queries.
 
  If you're in the exact-equal case, you could consider hashing the lat/
  lon pairs or something. At least then the looping is O(N) and not
  O(N^2):
 
  import collections
  grouped = collections.defaultdict(list)
  for lt, ln, da in zip(lat, lon, data):
 grouped[(lt, ln)].append(da)
 
  averaged = dict((ltln, numpy.mean(da)) for ltln, da in
  grouped.items())
 
  Is that fast enough?
 
  If the lat lon can be converted to a 1d label as Wes suggested, then
  in a similar timing exercise ndimage was the fastest.
  http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
 
  And as you said your lats and lons are integers, you could simply do
 
  ll = lat*1000 + lon
 
  to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
  lon will never exceed 360 (degrees).
 
  After that, either use the ndimage approach, or you could use
  histogramming with weighting by data values and divide by histogram
  withouth weighting, or just loop.
 
  Vincent
 
 
  (this was for python 2.4, also later I found np.bincount which
  requires that the labels are consecutive integers, but is as fast as
  ndimage)
 
  I don't know how it would compare to the new suggestions.
 
  Josef
 
 
 
 
  Zach
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

 I was curious about how fast ndimage was for this operation so here's
 the complete function.

 import scipy.ndimage as ndi

 N = 1

 lat = np.random.randint(0, 360, N)
 lon = np.random.randint(0, 360, N)
 data = np.random.randn(N)

 def group_mean(lat, lon, data):
indexer = np.lexsort((lon, lat))
lat = lat.take(indexer)
lon = lon.take(indexer)
sorted_data = data.take(indexer)

keys = 1000 * lat + lon
unique_keys = np.unique(keys)

result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
decoder = keys.searchsorted(unique_keys)

return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

 Appears to be about 13x faster (and could be made faster still) than
 the naive version on my machine:

 def group_mean_naive(lat, lon, data):
 grouped = collections.defaultdict(list)
for lt, ln, da in zip(lat, lon, data):
  grouped[(lt, ln)].append(da)

 averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Wes McKinney
On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates mat.yea...@gmail.com wrote:
 thanks. I am also getting an error in ndi.mean
 Were you getting the error
 RuntimeError: data type not supported?

 -Mathew

 On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl wrote:
  On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
  On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincuszachary.pin...@yale.edu
   wrote:
  I guess it's as fast as I'm going to get. I don't really see any
  other way. BTW, the lat/lons are integers)
 
  You could (in c or cython) try a brain-dead hashtable with no
  collision detection:
 
  for lat, long, data in dataset:
     bin = (lat ^ long) % num_bins
     hashtable[bin] = update_incremental_mean(hashtable[bin], data)
 
  you'll of course want to do some experiments to see if your data are
  sufficiently sparse and/or you can afford a large enough hashtable
  array that you won't get spurious hash collisions. Adding error-
  checking to ensure that there are no collisions would be pretty
  trivial (just keep a table of the lat/long for each hash value, which
  you'll need anyway, and check that different lat/long pairs don't get
  assigned the same bin).
 
  Zach
 
 
 
  -Mathew
 
  On Tue, Jun 1, 2010 at 1:49 PM, Zachary
  Pincuszachary.pin...@yale.edu
  wrote:
  Hi
  Can anyone think of a clever (non-lopping) solution to the
  following?
 
  A have a list of latitudes, a list of longitudes, and list of data
  values. All lists are the same length.
 
  I want to compute an average  of data values for each lat/lon pair.
  e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take wa to long.
 
  As a start, are the equal lat/lon pairs exactly equal (i.e. either
  not floating-point, or floats that will always compare equal, that
  is,
  the floating-point bit-patterns will be guaranteed to be identical)
  or
  approximately equal to float tolerance?
 
  If you're in the approx-equal case, then look at the KD-tree in scipy
  for doing near-neighbors queries.
 
  If you're in the exact-equal case, you could consider hashing the
  lat/
  lon pairs or something. At least then the looping is O(N) and not
  O(N^2):
 
  import collections
  grouped = collections.defaultdict(list)
  for lt, ln, da in zip(lat, lon, data):
     grouped[(lt, ln)].append(da)
 
  averaged = dict((ltln, numpy.mean(da)) for ltln, da in
  grouped.items())
 
  Is that fast enough?
 
  If the lat lon can be converted to a 1d label as Wes suggested, then
  in a similar timing exercise ndimage was the fastest.
  http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
 
  And as you said your lats and lons are integers, you could simply do
 
  ll = lat*1000 + lon
 
  to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
  lon will never exceed 360 (degrees).
 
  After that, either use the ndimage approach, or you could use
  histogramming with weighting by data values and divide by histogram
  withouth weighting, or just loop.
 
  Vincent
 
 
  (this was for python 2.4, also later I found np.bincount which
  requires that the labels are consecutive integers, but is as fast as
  ndimage)
 
  I don't know how it would compare to the new suggestions.
 
  Josef
 
 
 
 
  Zach
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

 I was curious about how fast ndimage was for this operation so here's
 the complete function.

 import scipy.ndimage as ndi

 N = 1

 lat = np.random.randint(0, 360, N)
 lon = np.random.randint(0, 360, N)
 data = np.random.randn(N)

 def group_mean(lat, lon, data):
    indexer = np.lexsort((lon, lat))
    lat = lat.take(indexer)
    lon = lon.take(indexer)
    sorted_data = data.take(indexer)

    keys = 1000 * lat + lon
    unique_keys = np.unique(keys)

    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
    decoder = keys.searchsorted(unique_keys)

    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))

 Appears to be about 13x faster (and could be made faster still) than
 the naive version on my machine:

 def group_mean_naive(lat, lon, data):
    grouped = collections.defaultdict(list)
    for lt, ln, da in zip(lat, lon, data):
      grouped[(lt, ln)].append(da)


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Mathew Yeates
I'm on Windows, using a precompiled binary. I never built numpy/scipy on
Windows.

On Wed, Jun 2, 2010 at 10:45 AM, Wes McKinney wesmck...@gmail.com wrote:

 On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates mat.yea...@gmail.com
 wrote:
  thanks. I am also getting an error in ndi.mean
  Were you getting the error
  RuntimeError: data type not supported?
 
  -Mathew
 
  On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney wesmck...@gmail.com
 wrote:
 
  On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut sc...@sarvision.nl
 wrote:
   On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
   On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus
 zachary.pin...@yale.edu
wrote:
   I guess it's as fast as I'm going to get. I don't really see any
   other way. BTW, the lat/lons are integers)
  
   You could (in c or cython) try a brain-dead hashtable with no
   collision detection:
  
   for lat, long, data in dataset:
  bin = (lat ^ long) % num_bins
  hashtable[bin] = update_incremental_mean(hashtable[bin], data)
  
   you'll of course want to do some experiments to see if your data are
   sufficiently sparse and/or you can afford a large enough hashtable
   array that you won't get spurious hash collisions. Adding error-
   checking to ensure that there are no collisions would be pretty
   trivial (just keep a table of the lat/long for each hash value,
 which
   you'll need anyway, and check that different lat/long pairs don't
 get
   assigned the same bin).
  
   Zach
  
  
  
   -Mathew
  
   On Tue, Jun 1, 2010 at 1:49 PM, Zachary
   Pincuszachary.pin...@yale.edu
   wrote:
   Hi
   Can anyone think of a clever (non-lopping) solution to the
   following?
  
   A have a list of latitudes, a list of longitudes, and list of data
   values. All lists are the same length.
  
   I want to compute an average  of data values for each lat/lon
 pair.
   e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
   data[1001] = (data[1001] + data[2001])/2
  
   Looping is going to take wa to long.
  
   As a start, are the equal lat/lon pairs exactly equal (i.e.
 either
   not floating-point, or floats that will always compare equal, that
   is,
   the floating-point bit-patterns will be guaranteed to be identical)
   or
   approximately equal to float tolerance?
  
   If you're in the approx-equal case, then look at the KD-tree in
 scipy
   for doing near-neighbors queries.
  
   If you're in the exact-equal case, you could consider hashing the
   lat/
   lon pairs or something. At least then the looping is O(N) and not
   O(N^2):
  
   import collections
   grouped = collections.defaultdict(list)
   for lt, ln, da in zip(lat, lon, data):
  grouped[(lt, ln)].append(da)
  
   averaged = dict((ltln, numpy.mean(da)) for ltln, da in
   grouped.items())
  
   Is that fast enough?
  
   If the lat lon can be converted to a 1d label as Wes suggested, then
   in a similar timing exercise ndimage was the fastest.
   http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
  
   And as you said your lats and lons are integers, you could simply do
  
   ll = lat*1000 + lon
  
   to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat
 or
   lon will never exceed 360 (degrees).
  
   After that, either use the ndimage approach, or you could use
   histogramming with weighting by data values and divide by histogram
   withouth weighting, or just loop.
  
   Vincent
  
  
   (this was for python 2.4, also later I found np.bincount which
   requires that the labels are consecutive integers, but is as fast as
   ndimage)
  
   I don't know how it would compare to the new suggestions.
  
   Josef
  
  
  
  
   Zach
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
  
   ___
   NumPy-Discussion mailing list
   NumPy-Discussion@scipy.org
   http://mail.scipy.org/mailman/listinfo/numpy-discussion
  
 
  I was curious about how fast ndimage was for this operation so here's
  the complete function.
 
  import scipy.ndimage as ndi
 
  N = 1
 
  lat = np.random.randint(0, 360, N)
  lon = np.random.randint(0, 360, N)
  data = np.random.randn(N)
 
  def group_mean(lat, lon, data):
 indexer = np.lexsort((lon, lat))
 lat = lat.take(indexer)
 lon = lon.take(indexer)
 sorted_data = data.take(indexer)
 
 keys = 1000 * lat + lon
 unique_keys = np.unique(keys)
 
 result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
 decoder = keys.searchsorted(unique_keys)
 
 return 

Re: [Numpy-discussion] 2D binning

2010-06-02 Thread josef . pktd
On Wed, Jun 2, 2010 at 2:09 PM, Mathew Yeates mat.yea...@gmail.com wrote:
 I'm on Windows, using a precompiled binary. I never built numpy/scipy on
 Windows.

The ndimage measurements code has recently been rewritten. ndimage is very
fast, but the old version has insufficient type checking and may crash on
wrong inputs.

I managed to work with it in the past on Windows. Maybe you could try
to check the dtypes of the arguments for ndi.mean. (Preferably in an
interpreter session where you don't mind if it crashes)

I don't remember the type restrictions, but there are/were several
tickets for it.

Josef
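
A rough illustration of that dtype check (the names follow Wes's group_mean
snippet above and are only illustrative; casting to plain float/int dtypes
is one thing to try):

print sorted_data.dtype, keys.dtype, unique_keys.dtype
result = ndi.mean(sorted_data.astype(float),
                  labels=keys.astype(int),
                  index=unique_keys.astype(int))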




Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Wes McKinney
On Wed, Jun 2, 2010 at 2:26 PM,  josef.p...@gmail.com wrote:
 On Wed, Jun 2, 2010 at 2:09 PM, Mathew Yeates mat.yea...@gmail.com wrote:
 I'm on Windows, using a precompiled binary. I never built numpy/scipy on
 Windows.

 The ndimage measurements code has recently been rewritten. ndimage is very
 fast, but the old version has insufficient type checking and may crash on
 wrong inputs.

 I managed to work with it in the past on Windows. Maybe you could try
 to check the dtypes of the arguments for ndi.mean. (Preferably in an
 interpreter session where you don't mind if it crashes)

 I don't remember the type restrictions, but there are/were several
 tickets for it.

 Josef




Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Mathew Yeates
Nope. This version didn't work either.



 If you're on Python 2.6 the binary on here might work for you:

 http://www.lfd.uci.edu/~gohlke/pythonlibs/

 It looks recent enough to have the rewritten ndimage
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Christoph Gohlke


On 6/2/2010 2:32 PM, Mathew Yeates wrote:
 Nope. This version didn't work either.



 If you're on Python 2.6 the binary on here might work for you:

 http://www.lfd.uci.edu/~gohlke/pythonlibs/

 It looks recent enough to have the rewritten ndimage





Please note that in order to use the ndimage package from
http://www.lfd.uci.edu/~gohlke/pythonlibs/#ndimage you have to use
"import ndimage as ndi" instead of "import scipy.ndimage as ndi". The
code posted by Wes works for me on Windows after that change.
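
In code form the swap is just this (a sketch; only the import line changes,
the rest of the snippet stays as posted):

import ndimage as ndi        # standalone build from the pythonlibs page
# instead of:
# import scipy.ndimage as ndi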

--
Christoph

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Stephen Simmons

On 1/06/2010 10:51 PM, Wes McKinney wrote:
  snip
  This is a pretty good example of the group-by problem that will
  hopefully work its way into a future edition of NumPy.

Wes (or anyone else), please can you elaborate on any plans for groupby?

I've made my own modification to numpy.bincount for doing groupby-type 
operations but not contributed anything back to the numpy community.

Is there any interest from European-based people to work on groupby etc 
at the Euro SciPy sprints in July? If others are interested, maybe we 
could work out requirements beforehand and then do some coding in Paris.

Cheers
Stephen

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-02 Thread Wes McKinney
On Wed, Jun 2, 2010 at 6:18 PM, Stephen Simmons m...@stevesimmons.com wrote:

 On 1/06/2010 10:51 PM, Wes McKinney wrote:
   snip
   This is a pretty good example of the group-by problem that will
   hopefully work its way into a future edition of NumPy.

 Wes (or anyone else), please can you elaborate on any plans for groupby?

 I've made my own modification to numpy.bincount for doing groupby-type
 operations but not contributed anything back to the numpy community.

 Is there any interest from European-based people to work on groupby etc
 at the Euro SciPy sprints in July? If others are interested, maybe we
 could work out requirements beforehand and then do some coding in Paris.

 Cheers
 Stephen

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


I know we're planning to discuss at SciPy in Austin and will hopefully
have a gameplan. We should coordinate the discussions / implementation
somewhere. I've implemented some groupby functionality in pandas but I
guess at a higher level than what's been proposed to add to NumPy.

On the pystatsmodels mailing list we've been starting to have some
discussions about statistically-oriented data structures in general--
I'll be sending an e-mail to the NumPy mailing list soon to start a
broader discussion which will hopefully lead to some good things at
the conferences.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] 2D binning

2010-06-01 Thread Mathew Yeates
Hi
Can anyone think of a clever (non-looping) solution to the following?

I have a list of latitudes, a list of longitudes, and a list of data values.
All lists are the same length.

I want to compute an average of data values for each lat/lon pair, e.g. if
lat[1001],lon[1001] == lat[2001],lon[2001] then
data[1001] = (data[1001] + data[2001])/2

Looping is going to take way too long.

Mathew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Keith Goodman
On Tue, Jun 1, 2010 at 1:07 PM, Mathew Yeates mat.yea...@gmail.com wrote:
 Hi
 Can anyone think of a clever (non-looping) solution to the following?
 I have a list of latitudes, a list of longitudes, and a list of data values.
 All lists are the same length.
 I want to compute an average of data values for each lat/lon pair, e.g. if
 lat[1001],lon[1001] == lat[2001],lon[2001] then
 data[1001] = (data[1001] + data[2001])/2
 Looping is going to take way too long.

Looping over N items and searching within N each time would be O(N^2).
But would it really be too slow just to loop once, i.e. O(N)? If you need
to find all the averages you could try something like this:

from collections import defaultdict

def aggdict(x):
    "Convert [(1, 'a'), (2, 'b'), (1, 'A')] to {1: ['a', 'A'], 2: ['b']}."
    d = defaultdict(list)
    for k, v in x:
        d[k].append(v)
    return dict(d)
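
For instance (a sketch, assuming numpy is imported as np and lat, lon, data
are the arrays from the original post), the per-pair means would then be:

pairs = zip(zip(lat, lon), data)        # [((lat0, lon0), d0), ...]
means = dict((k, np.mean(v)) for k, v in aggdict(pairs).items())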
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Zachary Pincus
 Hi
 Can anyone think of a clever (non-looping) solution to the following?

 I have a list of latitudes, a list of longitudes, and a list of data
 values. All lists are the same length.

 I want to compute an average of data values for each lat/lon pair,
 e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
 data[1001] = (data[1001] + data[2001])/2

 Looping is going to take way too long.

As a start, are the equal lat/lon pairs exactly equal (i.e. either  
not floating-point, or floats that will always compare equal, that is,  
the floating-point bit-patterns will be guaranteed to be identical) or  
approximately equal to float tolerance?

If you're in the approx-equal case, then look at the KD-tree in scipy  
for doing near-neighbors queries.

If you're in the exact-equal case, you could consider hashing the lat/ 
lon pairs or something. At least then the looping is O(N) and not  
O(N^2):

import collections
grouped = collections.defaultdict(list)
for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())

Is that fast enough?

Zach
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Wes McKinney
On Tue, Jun 1, 2010 at 4:49 PM, Zachary Pincus zachary.pin...@yale.edu wrote:
 Hi
 Can anyone think of a clever (non-looping) solution to the following?

 I have a list of latitudes, a list of longitudes, and a list of data
 values. All lists are the same length.

 I want to compute an average of data values for each lat/lon pair,
 e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
 data[1001] = (data[1001] + data[2001])/2

 Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())

 Is that fast enough?

 Zach
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


This is a pretty good example of the group-by problem that will
hopefully work its way into a future edition of NumPy. Given that, a
good approach would be to produce a unique key from the lat and lon
vectors, and pass that off to the groupby routine (when it exists).
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Mathew Yeates
I guess it's as fast as I'm going to get. I don't really see any other way.
(BTW, the lat/lons are integers.)

-Mathew

On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus zachary.pin...@yale.eduwrote:

  Hi
  Can anyone think of a clever (non-looping) solution to the following?
 
  I have a list of latitudes, a list of longitudes, and a list of data
  values. All lists are the same length.
 
  I want to compute an average of data values for each lat/lon pair,
  e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())

 Is that fast enough?

 Zach
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Zachary Pincus
 I guess it's as fast as I'm going to get. I don't really see any  
 other way. (BTW, the lat/lons are integers.)

You could (in c or cython) try a brain-dead hashtable with no  
collision detection:

for lat, long, data in dataset:
   bin = (lat ^ long) % num_bins
   hashtable[bin] = update_incremental_mean(hashtable[bin], data)

you'll of course want to do some experiments to see if your data are  
sufficiently sparse and/or you can afford a large enough hashtable  
array that you won't get spurious hash collisions. Adding error- 
checking to ensure that there are no collisions would be pretty  
trivial (just keep a table of the lat/long for each hash value, which  
you'll need anyway, and check that different lat/long pairs don't get  
assigned the same bin).

Zach
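
A slow pure-Python rendering of the bookkeeping above, just to make the
idea concrete (the real loop would be C or Cython as described; num_bins
and the collision check are illustrative):

num_bins = 4096
counts = [0] * num_bins
means = [0.0] * num_bins
seen = [None] * num_bins                 # which (lat, lon) landed in each bin

for la, lo, d in zip(lat, lon, data):
    b = (la ^ lo) % num_bins
    if seen[b] is None:
        seen[b] = (la, lo)
    elif seen[b] != (la, lo):
        raise ValueError("hash collision -- enlarge num_bins")
    counts[b] += 1
    means[b] += (d - means[b]) / counts[b]   # incremental mean update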



 -Mathew

 On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus zachary.pin...@yale.edu 
  wrote:
  Hi
  Can anyone think of a clever (non-looping) solution to the
  following?
 
  I have a list of latitudes, a list of longitudes, and a list of data
  values. All lists are the same length.
 
  I want to compute an average of data values for each lat/lon pair,
  e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in  
 grouped.items())

 Is that fast enough?

 Zach

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread josef . pktd
On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus zachary.pin...@yale.edu wrote:
 I guess it's as fast as I'm going to get. I don't really see any
 other way. (BTW, the lat/lons are integers.)

 You could (in c or cython) try a brain-dead hashtable with no
 collision detection:

 for lat, long, data in dataset:
   bin = (lat ^ long) % num_bins
   hashtable[bin] = update_incremental_mean(hashtable[bin], data)

 you'll of course want to do some experiments to see if your data are
 sufficiently sparse and/or you can afford a large enough hashtable
 array that you won't get spurious hash collisions. Adding error-
 checking to ensure that there are no collisions would be pretty
 trivial (just keep a table of the lat/long for each hash value, which
 you'll need anyway, and check that different lat/long pairs don't get
 assigned the same bin).

 Zach



 -Mathew

 On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus zachary.pin...@yale.edu
  wrote:
  Hi
  Can anyone think of a clever (non-looping) solution to the
  following?
 
  I have a list of latitudes, a list of longitudes, and a list of data
  values. All lists are the same length.
 
  I want to compute an average of data values for each lat/lon pair,
  e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
  data[1001] = (data[1001] + data[2001])/2
 
  Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in
 grouped.items())

 Is that fast enough?

If the lat lon can be converted to a 1d label as Wes suggested, then
in a similar timing exercise ndimage was the fastest.
http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html

(this was for python 2.4, also later I found np.bincount which
requires that the labels are consecutive integers, but is as fast as
ndimage)

I don't know how it would compare to the new suggestions.

Josef
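
A minimal sketch of that bincount variant, assuming non-negative integer
lat/lon arrays (np.unique's return_inverse gives the consecutive labels
that bincount needs):

import numpy as np

ll = lat * 1000 + lon
uniq, labels = np.unique(ll, return_inverse=True)   # labels run 0..K-1
means = np.bincount(labels, weights=data) / np.bincount(labels)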




 Zach

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 2D binning

2010-06-01 Thread Brent Pedersen
On Tue, Jun 1, 2010 at 1:51 PM, Wes McKinney wesmck...@gmail.com wrote:
 On Tue, Jun 1, 2010 at 4:49 PM, Zachary Pincus zachary.pin...@yale.edu 
 wrote:
 Hi
 Can anyone think of a clever (non-looping) solution to the following?

 I have a list of latitudes, a list of longitudes, and a list of data
 values. All lists are the same length.

 I want to compute an average of data values for each lat/lon pair,
 e.g. if lat[1001],lon[1001] == lat[2001],lon[2001] then
 data[1001] = (data[1001] + data[2001])/2

 Looping is going to take way too long.

 As a start, are the equal lat/lon pairs exactly equal (i.e. either
 not floating-point, or floats that will always compare equal, that is,
 the floating-point bit-patterns will be guaranteed to be identical) or
 approximately equal to float tolerance?

 If you're in the approx-equal case, then look at the KD-tree in scipy
 for doing near-neighbors queries.

 If you're in the exact-equal case, you could consider hashing the lat/
 lon pairs or something. At least then the looping is O(N) and not
 O(N^2):

 import collections
 grouped = collections.defaultdict(list)
 for lt, ln, da in zip(lat, lon, data):
   grouped[(lt, ln)].append(da)

 averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())

 Is that fast enough?

 Zach
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


 This is a pretty good example of the group-by problem that will
 hopefully work its way into a future edition of NumPy. Given that, a
 good approach would be to produce a unique key from the lat and lon
 vectors, and pass that off to the groupby routine (when it exists).
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Meanwhile, groupby from itertools will work, but it might be a bit slower
since it has to convert every row to a tuple and collect each group in a list.


import numpy as np
import itertools

# fake data
N = 10000   # the value of N is garbled in the archive; any multiple of 250 works
lats = np.repeat(180 * (np.random.ranf(N / 250) - 0.5), 250)
lons = np.repeat(360 * (np.random.ranf(N / 250) - 0.5), 250)

np.random.shuffle(lats)
np.random.shuffle(lons)

vals = np.arange(N)
#

inds = np.lexsort((lons, lats))

sorted_lats = lats[inds]
sorted_lons = lons[inds]
sorted_vals = vals[inds]

llv = np.array((sorted_lats, sorted_lons, sorted_vals)).T

for (lat, lon), group in itertools.groupby(llv, lambda row: tuple(row[:2])):
    group_vals = [g[-1] for g in group]
    print lat, lon, np.mean(group_vals)

# make sure the mean for the last lat/lon from the loop matches the mean
# for that lat/lon from original data.
tests_idx, = np.where((lats == lat) & (lons == lon))
assert np.mean(vals[tests_idx]) == np.mean(group_vals)
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion