Re: [Numpy-discussion] 2d binning and linear regression

2010-06-22 Thread Tom Durrant
>
>
> >
> What exactly are you trying to fit? It is rather bad practice to fit
> a model to summarized data, as you lose the uncertainty in the
> original data.
> If you define your boxes, you can loop directly over each box and
> even fit the equation:
>
> model = mu + beta1*obs
>
> The extension is to fit the larger equation:
> model = mu + boxes + beta1*obs + beta2*obs*boxes
>
> where boxes is an indicator (dummy) variable for each box.
> Since you are only interested in the box-by-model term, you can probably
> use this type of model:
> model = mu + boxes + beta2*obs*boxes
>
> However, these models assume that the residual variance is identical for
> all boxes. (That is solved by a mixed model approach.)
>
> Bruce


Bruce,

I am trying to determine spatially varying linear corrections for surface
winds in order to force a wave model.  The basic idea is to use satellite
observations to determine the errors in the model winds, and to reduce them
by applying a linear correction prior to forcing the wave model.

I am not sure I understand what you are saying; I may well be trying to do
what you are describing, i.e. for each box, gather observations, determine
a linear correction, and apply it to the model:

model = a*x + b

Does that make sense?
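In code, the "apply it to the model" step I have in mind is roughly the
following.  This is only an untested sketch: grid_slope and grid_intercept
are hypothetical arrays of per-box coefficients (from fitting
obs ~ slope*model + intercept in each box), and lat, lon and model are the
along-track arrays from my original post.

import numpy as np

latedges = np.arange(-90, 92, 2)         # 2-degree box edges
lonedges = np.arange(0, 362, 2)

irow = np.digitize(lat, latedges) - 1    # lat box index of each point
icol = np.digitize(lon, lonedges) - 1    # lon box index of each point

# look up the coefficients of the box each point falls in and correct the winds
model_corrected = grid_slope[irow, icol] * model + grid_intercept[irow, icol]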

Cheers
Tom




Re: [Numpy-discussion] 2d binning and linear regression

2010-06-22 Thread Tom Durrant
>
>
> The basic idea is in the "polyfit on multiple data points" thread on the
> numpy-discussion mailing list, April 2009.
>
> In this case, calculations have to be done by groups
>
> subtract mean (this needs to be replaced by group demeaning)
> modeldm = model - model.mean()
> obsdm = obs - obs.mean()
>
> xx, xedges, yedges = np.histogram2d(lat, lon, weights=obsdm*obsdm,
>  bins=(latedges,lonedges))
> xy, xedges, yedges = np.histogram2d(lat, lon, weights=obsdm*modeldm,
>  bins=(latedges,lonedges))
>
>
> slopes = xy/xx  # slopes by group (regression of model on obs)
>
> expand slopes to length of original array
> predicted = model - obs * slopes_expanded
> ...
>
> the main point is to get the group functions, for demeaning, ... for
> the 2d labels (and get the labels out of histogramdd)
>
> I'm out of time (off to the airport soon), but I can look into it next
> weekend.
>
> Josef
>
Thanks Josef, I will chase up the April thread...

If I understand what you have done above, this returns the slope of best-fit
lines forced through the origin (since the demeaning there is global rather
than per box), is that right?
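To check my own understanding, here is a rough, untested sketch of what I
think the fully group-demeaned version would look like, written for the
orientation I have been using (obs fitted as a function of model); lat, lon,
obs, model and the bin edges are as in my original post:

import numpy as np

latedges = np.arange(-90, 92, 2)
lonedges = np.arange(0, 362, 2)
bins = (latedges, lonedges)

# per-box counts and sums -> per-box means (empty boxes come out as nan)
N,  xedges, yedges = np.histogram2d(lat, lon, bins=bins)
Sx, xedges, yedges = np.histogram2d(lat, lon, bins=bins, weights=model)
Sy, xedges, yedges = np.histogram2d(lat, lon, bins=bins, weights=obs)
xbar = Sx / N
ybar = Sy / N

# expand the per-box means back onto the original points
irow = np.digitize(lat, latedges) - 1
icol = np.digitize(lon, lonedges) - 1
modeldm = model - xbar[irow, icol]       # group-demeaned model
obsdm = obs - ybar[irow, icol]           # group-demeaned obs

# per-box least-squares slope and intercept of obs on model
xx, xedges, yedges = np.histogram2d(lat, lon, bins=bins, weights=modeldm*modeldm)
xy, xedges, yedges = np.histogram2d(lat, lon, bins=bins, weights=modeldm*obsdm)
slopes = xy / xx
intercepts = ybar - slopes * xbar

If I have that right, the digitize lookups are doing the "expand to the
length of the original array" step you mentioned.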

Have a great trip!

Tom


Re: [Numpy-discussion] 2d binning and linear regression

2010-06-20 Thread Tom Durrant

> 
> are you doing something like np.polyfit(model, obs, 1) ?
> 
> If you are using polyfit with deg=1, i.e. fitting a straight line,
> then this could also be calculated using the weights in histogram2d.
> 
> histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
> surprised if the histogram2d version is much faster. If a quick
> reading of histogramdd is correct, the main improvement would be to
> get the labels "xy" out of it, so it can be used repeatedly with
> np.bincount.
> 
> Josef
> 
Thanks Josef, 

From my limited understanding, you are right: histogram2d is much faster
because it doesn't have to keep reading through the arrays over and over.

I am using np.polyfit(model, obs, 1).  I couldn't work out a way to do these 
regressions using histogram2d and weights, but you think it can be done?  This 
would be great!
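Just so I am clear on the idea: is it something like the sketch below,
accumulating the per-box sums needed for the least-squares formulas with
weighted histogram2d calls?  (Untested; lat, lon, obs, model and the bin
edges are as in my original post, and empty or single-point boxes would need
masking.)

import numpy as np

bins = (latedges, lonedges)          # the 2-degree bin edges from my first post

N,   xedges, yedges = np.histogram2d(lat, lon, bins=bins)
Sx,  xedges, yedges = np.histogram2d(lat, lon, bins=bins, weights=model)
Sy,  xedges, yedges = np.histogram2d(lat, lon, bins=bins, weights=obs)
Sxx, xedges, yedges = np.histogram2d(lat, lon, bins=bins, weights=model*model)
Sxy, xedges, yedges = np.histogram2d(lat, lon, bins=bins, weights=model*obs)

# least-squares fit of obs = slope*model + intercept in each box
slope = (N*Sxy - Sx*Sy) / (N*Sxx - Sx*Sx)
intercept = (Sy - slope*Sx) / N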

Cheers
Tom
 






[Numpy-discussion] 2d binning and linear regression

2010-06-20 Thread Tom Durrant
Hi All,

I have a problem involving lat/lon data.  Basically, I am evaluating
numerical weather model data against satellite data, and trying to produce
gridded plots of various statistics.  There are various steps involved, but
basically I get to the point where I have four arrays of the same length,
containing lat, lon, model and observed values respectively along the path
of the satellite.

eg:

lat   = [ 50.32,  50.78,  51.24,  51.82,  52.55,  53.15,  53.75,  54.28,  54.79,  55.16, ... ]
lon   = [192.83, 193.38, 193.94, 194.67, 195.65, 196.49, 197.35, 198.15, 198.94, 199.53, ... ]
obs   = [  1.42,   1.35,   1.55,   1.50,   1.59,   1.76,   2.15,   1.90,   1.55,   0.73, ... ]
model = [  1.67,   1.68,   1.70,   1.79,   2.04,   2.36,   2.53,   2.38,   2.149,  1.57, ... ]

I then want to calculate statistics based on bins of, say, 2 x 2 degree boxes.
These arrays are very large, on the order of 10^6 points.  For ease of
explanation, I will focus initially on bias.

My first approach was to use loops through lat/lon boxes and use np.where
statements to extract all the values of the model and observations within
the given box, and calculate the bias as the mean of the difference.  This
was obviously very slow.

histogram2d provided a FAR better way to do this.  i.e.


import numpy as np

latedges = np.arange(-90, 92, 2)   # 2-degree edges spanning -90..90 (arange excludes the stop)
lonedges = np.arange(0, 362, 2)    # 2-degree edges spanning 0..360

diff = model - obs
grid_N, xedges, yedges = np.histogram2d(lat, lon,
                                        bins=(latedges, lonedges))
grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
                                               bins=(latedges, lonedges))
grid_bias = grid_bias_sum/grid_N   # mean (model - obs) in each box


I now want to determine the linear regression coefficients for each box by
fitting a least-squares linear regression to the data in each bin.  I have
been looking at using np.digitize to extract the bin indices, but haven't
had much success trying to do this in two dimensions.  So I am back to
looping through the lat and lon box values, using np.where to extract the
observations and model data within each box, and using np.polyfit to obtain
the regression coefficients.  This is, of course, very slow.
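Schematically, the loop I currently have looks something like this
(simplified and untested here, but it shows the structure; variable names
match the example above):

import numpy as np

nlat = len(latedges) - 1
nlon = len(lonedges) - 1
grid_slope = np.nan * np.ones((nlat, nlon))
grid_intercept = np.nan * np.ones((nlat, nlon))

for i in range(nlat):
    for j in range(nlon):
        # pick out the points falling in this 2 x 2 degree box
        inbox = np.where((lat >= latedges[i]) & (lat < latedges[i+1]) &
                         (lon >= lonedges[j]) & (lon < lonedges[j+1]))[0]
        if inbox.size < 2:
            continue                     # need at least two points to fit a line
        # least-squares fit of obs = slope*model + intercept for this box
        grid_slope[i, j], grid_intercept[i, j] = np.polyfit(model[inbox],
                                                            obs[inbox], 1)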

Can anyone advise a smarter, vectorized way to do this?  Any thoughts would
be greatly appreciated.

Thanks in advance
Tom
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion